Hangs with get_zcopy and cuda_ipc on dual-GPU workstations #888

Open
pentschev opened this issue Oct 12, 2022 · 10 comments

@pentschev
Member

In UCX-Py, we used to add the UCX_RNDV_SCHEME=get_zcopy override due to potential performance issues; this was recently removed in #836 to allow bounce buffers to function.

In some cases, using UCX_RNDV_SCHEME=get_zcopy (which may still be triggered with the new UCX_RNDV_SCHEME=auto default) causes UCX-Py to hang non-deterministically. This seems to be reproducible only on workstations with at least two GPUs connected via NVLink; I'm unable to reproduce it on a DGX-1, for example.

The MRE involves three files: listener.py, which acts as the server; sender.py, which reproduces the hang described above; and sender2.py, which does NOT reproduce the issue. The only difference between the original hang-reproducing sender.py and sender2.py is how the message to be sent is provided: passed into the Python async function (hangs) versus created directly within the Python async function (does NOT hang). The files to reproduce are below.

listener.py
import asyncio

import ucp
import cupy as cp

async def run(device):
    # Set CUDA device and create context before initializing UCX
    cp.cuda.runtime.setDevice(device)
    cp.cuda.runtime.free(0)
    # Initialize UCX before any CUDA memory allocation
    ucp.init()

    async def receiver(endpoint):
        print("  CLIENT: receiving sources...", end="", flush=True)
        await endpoint.recv_obj(allocator=lambda n: cp.empty(n, dtype="uint8"))
        print("CLIENT done receiving sources", flush=True)

        await endpoint.close()
        listener.close()

    print("CREATING listener...", flush=True)
    listener = ucp.create_listener(receiver, 9092)
    print("DONE creating listener...", flush=True)

    while not listener.closed():
        await asyncio.sleep(0.05)


if __name__ == "__main__":
    r = asyncio.run(run(1))
sender.py
import asyncio

import ucp
import cupy as cp


async def send(msg):
    ep = await ucp.create_endpoint("localhost", 9092)

    print(f"   SERVER: sending {type(msg)}, len: {len(msg)}, dtype: {msg.dtype}", end="", flush=True)
    await ep.send_obj(msg)
    print("    SERVER: done", flush=True)
    await ep.close()


if __name__ == '__main__':
    # Create CUDA context before initializing UCX
    cp.cuda.runtime.free(0)
    # Initialize UCX before any CUDA memory allocation
    ucp.init()

    asyncio.run(send(cp.ones(222221, dtype='int32')))
    print("DONE SENDING", flush=True)
sender2.py
import asyncio

import ucp
import cupy as cp


async def send():
    ep = await ucp.create_endpoint("localhost", 9092)

    msg = cp.ones(222221, dtype='int32')
    print(f"   SERVER: sending {type(msg)}, len: {len(msg)}, dtype: {msg.dtype}", end="", flush=True)
    await ep.send_obj(msg)
    print("    SERVER: done", flush=True)
    await ep.close()


if __name__ == '__main__':
    # Create CUDA context before initializing UCX
    cp.cuda.runtime.free(0)
    # Initialize UCX before any CUDA memory allocation
    ucp.init()

    asyncio.run(send())
    print("DONE SENDING", flush=True)

Since the hang does not reproduce deterministically, it may be necessary to run listener/client in a loop, e.g.:

# Listener loop
for i in {0..100}; do echo $i; UCX_RNDV_SCHEME=get_zcopy UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_TCP_CM_REUSEADDR=y python listener.py; done

# Client loop
for i in {0..100}; do echo $i; UCX_RNDV_SCHEME=get_zcopy UCX_TLS=tcp,cuda_copy,cuda_ipc python sender.py; done

By replacing sender.py in the client loop with sender2.py, all 100 iterations should complete without any hangs, whereas the hang with sender.py normally occurs within 10 iterations. Running with UCX_RNDV_SCHEME=put_zcopy instead does not reproduce the hang either.
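For example, the put_zcopy variant of the client loop (same environment as the loops above) would be:

# Client loop with put_zcopy (does not hang)
for i in {0..100}; do echo $i; UCX_RNDV_SCHEME=put_zcopy UCX_TLS=tcp,cuda_copy,cuda_ipc python sender.py; done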

Given that the error depends on how the message is passed via the async Python interface, I suspect it has something to do with how the event loop is executing, but I do not have solid evidence for that at the moment.

cc @rlratzel

@wence-
Contributor

wence- commented Oct 12, 2022

Can you get a backtrace of the hung process to give us some hints? A Python one can be obtained with py-spy dump --pid ...

@pentschev
Member Author

Yeah, unfortunately there's nothing useful there, as usual with Python async; see below:

$ py-spy dump -p 7886
Process 7886: python listener.py
Python v3.9.13 (/home/pentschev/miniconda3/envs/rn-221011/bin/python3.9)

Thread 7886 (idle): "MainThread"
    select (selectors.py:469)
    _run_once (asyncio/base_events.py:1869)
    run_forever (asyncio/base_events.py:601)
    run_until_complete (asyncio/base_events.py:634)
    run (asyncio/runners.py:44)
    <module> (listener.py:33)

$ py-spy dump -p 7885
Process 7885: python sender.py
Python v3.9.13 (/home/pentschev/miniconda3/envs/rn-221011/bin/python3.9)

Thread 7885 (idle): "MainThread"
    select (selectors.py:469)
    _run_once (asyncio/base_events.py:1869)
    run_forever (asyncio/base_events.py:601)
    run_until_complete (asyncio/base_events.py:634)
    run (asyncio/runners.py:44)
    <module> (sender.py:24)

@wence-
Contributor

wence- commented Oct 12, 2022

We can get a little more information with py-spy dump -ll -p 7886 and maybe also with a native backtrace (adding -n). A gdb backtrace would also be useful (sorry, I only have DGX-1 access so I can't repro locally).
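Something along these lines should do it (using the PIDs from your dumps above):

# More detailed py-spy dump: locals (-ll) plus native frames (-n)
py-spy dump -ll -n -p 7886

# Full gdb backtrace of all threads, without stopping at a prompt
gdb -p 7886 -batch -ex "thread apply all bt"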

@pentschev
Member Author

Unfortunately there's nothing useful there either. I took a look at those before as well, but we only have the regular Python stack, AFAICT. See the logs attached.

One more thing that occurred to me now was to enable non-blocking UCX-Py mode with UCXPY_NON_BLOCKING_MODE=1, and it works in that case. This could mean either an issue with UCX-Py's blocking mode, or that UCX isn't notifying completion for some reason.
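For reference, that is just the same reproducer loops with the extra variable set (I am assuming it is set for both the listener and the sender processes), e.g.:

# Client loop in non-blocking progress mode (completes all iterations)
for i in {0..100}; do echo $i; UCXPY_NON_BLOCKING_MODE=1 UCX_RNDV_SCHEME=get_zcopy UCX_TLS=tcp,cuda_copy,cuda_ipc python sender.py; done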

In any case, I won't spend much more time on this now; I need to test this with the new C++ implementation of UCX-Py, and perhaps the problem is already resolved there, in which case we won't need to worry about this for much longer.

@pentschev
Member Author

Forgot to attach the files to the previous comment; they are below.

listener-gdb.log
listener-py-spy.log
sender-gdb.log
sender-py-spy.log

@wence-
Contributor

wence- commented Oct 12, 2022

One more thing that occurred to me now was to enable non-blocking UCX-Py mode with UCXPY_NON_BLOCKING_MODE=1, and it works in that case. This could mean either an issue with UCX-Py's blocking mode, or that UCX isn't notifying completion for some reason.

Thanks, this is a useful hint. I suspect indeed that UCX-Py blocking mode has an issue. In particular, the code used by @madsbk (referenced from Stack Overflow) to check for an idle event loop does not work! This was not obvious at the time it was used, but a comment from September 2020 (#415 was merged in April 2020) notes that the proposed solution is insufficient. A possible solution might be here: https://stackoverflow.com/questions/63732618/

I can try it out...
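To illustrate the kind of check at issue (this is only a sketch, not the actual UCX-Py code, and it relies on private CPython attributes):

import asyncio

def loop_appears_idle(loop: asyncio.AbstractEventLoop) -> bool:
    # Hypothetical idle check: peeks at private BaseEventLoop state.
    # `_ready` holds callbacks queued for the current iteration and
    # `_scheduled` holds timer handles. Even if both are empty at the
    # moment of the check, a completed I/O operation or a
    # call_soon_threadsafe() from another thread can enqueue work right
    # afterwards, so this is not a safe condition for deciding when it is
    # OK to block waiting on the UCX worker's epoll fd.
    return not loop._ready and not loop._scheduled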

@pentschev
Member Author

Thanks @wence-, that is a good finding! If you have a bit of time and want to give it a go, please go ahead. I would just suggest not spending too much time on it right away, as we may get better performance with the separate progress thread implemented in C++ UCX-Py, and then this may become obsolete anyway.

@pentschev
Member Author

I've found another case where the blocking progress mode seems to be the cause of failure. This one involves managed memory: using buffers large enough to force device memory to be evicted eventually leads to a UCXCanceled error while receiving the message; see the sample below.

$ UCX_RNDV_FRAG_MEM_TYPE=cuda python -m ucp.benchmarks.send_recv -l ucp-async -d 0 -e 3 -o rmm --rmm-managed-memory -n 10GiB --n-iter 3 --n-warmup-iter 0
Server Running at 10.33.225.163:60512
Client connecting to server at 10.33.225.163:60512
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/datasets/pentschev/miniconda3/envs/rn-221007-bounce/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/datasets/pentschev/miniconda3/envs/rn-221007-bounce/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/datasets/pentschev/src/ucx-py/ucp/benchmarks/send_recv.py", line 95, in client
    loop.run_until_complete(client.run())
  File "/datasets/pentschev/miniconda3/envs/rn-221007-bounce/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/datasets/pentschev/src/ucx-py/ucp/benchmarks/backends/ucp_async.py", line 142, in run
    await ep.recv(recv_msg)
  File "/datasets/pentschev/src/ucx-py/ucp/core.py", line 728, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
ucp._libs.exceptions.UCXCanceled: <[Recv #001] ep: 0x7f3af9198108, tag: 0x8d190517b568099f, nbytes: 10737418240, type: <class 'cupy.ndarray'>>:
Traceback (most recent call last):
  File "/datasets/pentschev/miniconda3/envs/rn-221007-bounce/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/datasets/pentschev/miniconda3/envs/rn-221007-bounce/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/datasets/pentschev/src/ucx-py/ucp/benchmarks/send_recv.py", line 376, in <module>
    main()
  File "/datasets/pentschev/src/ucx-py/ucp/benchmarks/send_recv.py", line 368, in main
    assert not p2.exitcode
AssertionError

If we set UCXPY_NON_BLOCKING_MODE=1, the command above completes successfully. This example requires #836 for correct CUDA device setting and for removal of the UCX_RNDV_SCHEME=get_zcopy pin. Given that we need to rely on UCX_RNDV_SCHEME=auto to transfer managed memory with the staging buffers, I can't confirm whether UCX_RNDV_SCHEME=put_zcopy would work here, as in the original problem description. In any case, it seems that both cases converge on the blocking progress mode.
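For completeness, the run that completes successfully is just the same command with the variable prepended:

UCXPY_NON_BLOCKING_MODE=1 UCX_RNDV_FRAG_MEM_TYPE=cuda python -m ucp.benchmarks.send_recv -l ucp-async -d 0 -e 3 -o rmm --rmm-managed-memory -n 10GiB --n-iter 3 --n-warmup-iter 0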

@pentschev
Member Author

Forgot to mention that the sample above can be reproduced on a DGX-1.

@pentschev
Member Author

It seems that this issue actually stems from a deadlock in UCX. What appears to happen is that the cuda_copy callback attempts to acquire a spinlock that is already held by another thread. This code path is required by the wakeup mechanism (i.e., the epoll_fd that is notified upon new events, and that we use in blocking progress mode), but not by continuous progressing (i.e., non-blocking mode). The spinlock is also only present in UCX multi-thread mode, which is hardcoded in UCX-Py, and I was able to confirm that replacing it with single-thread mode does indeed allow the reproducer from #888 (comment) to complete without problems, as well as the Dask-CUDA benchmarks that use CUDA managed memory buffers too large to fit in the device's memory and thus cause eviction, which otherwise also deadlock.

Theoretically speaking, UCX-Py is not thread-safe, so switching to single-thread mode should not be a problem. However, RAFT also uses UCX-Py directly, and it's still not clear to me whether its transfers share the same thread as the Dask communication thread or occur on a separate thread. If communication only occurs on the Dask comms thread, then we should be OK to just switch to single-thread mode; otherwise we will need to wait for a fix in UCX.
