Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use GCSClient instead of PythonGCSClient #211

Merged
merged 27 commits into from
Oct 26, 2023
Merged

Use GCSClient instead of PythonGCSClient #211

merged 27 commits into from
Oct 26, 2023

Conversation

glennmoy
Copy link
Contributor

@glennmoy glennmoy commented Oct 20, 2023

Break off of #202

Closes #76

Two things worth calling out:

  • Unlike the PythonGcsClient the GcsClient has no direct interface for specifying a timeout. This is instead set through RayConfig parameters which can be overriden via environment variables.
  • Attempting to Connect to unintiated GCS Servers issues a warning message before killing the driver process after a 1-min timeout.

@glennmoy glennmoy marked this pull request as draft October 20, 2023 12:48
src/runtime.jl Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Oct 20, 2023

Codecov Report

Merging #211 (9bb6921) into main (caddc0e) will increase coverage by 0.46%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #211      +/-   ##
==========================================
+ Coverage   97.65%   98.12%   +0.46%     
==========================================
  Files          12       12              
  Lines         640      639       -1     
==========================================
+ Hits          625      627       +2     
+ Misses         15       12       -3     
Flag Coverage Δ
Ray.jl 98.12% <100.00%> (+0.46%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/function_manager.jl 97.77% <100.00%> (+5.47%) ⬆️
src/ray_julia_jll/common.jl 95.30% <100.00%> (+0.19%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

src/function_manager.jl Show resolved Hide resolved
Comment on lines -38 to -43
# ideally we'd throw on connect but it returns OK......
badclient = JuliaGcsClient("127.0.0.1:6378")
status = Connect(badclient)

# ...but then throws when we try to do anything so at least there's that
@test_throws ErrorException Put(badclient, ns, "computer", "mistaek", false, -1)
Copy link
Contributor Author

@glennmoy glennmoy Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is that the Connect call returns Status::OK irrespective of whether the GCS Server exists.

It first reports after 5 seconds that it can't connect, then after a minute kills the session with an EXIT_FAILURE.
Again these are set by RayConfig params.

If the client does not exist then then the thread executing the server (I think) throws the error which only gets reported but not caught in the Julia REPL

https://github.com/ray-project/ray/blob/cde6e887cbb21a9cae2632e3e4b883d913d38a05/src/ray/rpc/gcs_server/gcs_rpc_client.h#L212-L216

Unfortunately the gcs_is_down_ field is private, however there is a way to check if the server is alive that uses a callback

However, I don't think it's worth directly implementing this. The timeout should take care of things it's just that the error won't be nicely caught/reported in Julia but we can add that as a follow up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MWE

julia> using Ray.ray_julia_jll

julia> badclient = ray_julia_jll.JuliaGcsClient("127.0.0.1:6378")
Ray.ray_julia_jll.JuliaGcsClientAllocated(Ptr{Nothing} @0x00000001376270e0)

julia> ray_julia_jll.Connect(badclient)
[2023-10-20 18:04:32,822 E 28309 8652849] gcs_rpc_client.h:207: Failed to connect to GCS at address 127.0.0.1:6378 within 5 seconds.
OK

julia> [2023-10-20 18:05:27,877 E 28309 8653025] gcs_rpc_client.h:537: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.
(venv) MacBook-Air~/.j/d/Ray (gm/gcsclient|✔)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scenario in which this would occur is if you run Ray.init() without a local raylet present? If so, I'm fine with making this into an issue to tackle later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah and even then it should error earlier if it tried to get the GCS address during Ray.init

test/function_manager.jl Outdated Show resolved Hide resolved
build/wrapper.cc Show resolved Hide resolved
@glennmoy glennmoy marked this pull request as ready for review October 20, 2023 17:08
@glennmoy glennmoy requested review from omus and kleinschmidt October 20, 2023 17:08
Copy link
Member

@omus omus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. The C++ code was a bit hard to follow having not dived into the GCS myself. I think adding more comments and links in the C++ code would be helpful. I agree supporting timeouts isn't worth the effort at this time as we haven't been using it anyway.

build/wrapper.cc Show resolved Hide resolved
src/function_manager.jl Outdated Show resolved Hide resolved
src/function_manager.jl Show resolved Hide resolved
test/function_manager.jl Outdated Show resolved Hide resolved
Comment on lines -38 to -43
# ideally we'd throw on connect but it returns OK......
badclient = JuliaGcsClient("127.0.0.1:6378")
status = Connect(badclient)

# ...but then throws when we try to do anything so at least there's that
@test_throws ErrorException Put(badclient, ns, "computer", "mistaek", false, -1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scenario in which this would occur is if you run Ray.init() without a local raylet present? If so, I'm fine with making this into an issue to tackle later.

test/ray_julia_jll/gcs_client.jl Outdated Show resolved Hide resolved
src/function_manager.jl Outdated Show resolved Hide resolved
test/ray_julia_jll/gcs_client.jl Outdated Show resolved Hide resolved
test/ray_julia_jll/gcs_client.jl Outdated Show resolved Hide resolved
test/ray_julia_jll/gcs_client.jl Outdated Show resolved Hide resolved
src/function_manager.jl Outdated Show resolved Hide resolved
@glennmoy glennmoy requested a review from omus October 23, 2023 20:10
Copy link
Member

@omus omus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the FunctionManager changes are blocking approval

test/ray_julia_jll/gcs_client.jl Show resolved Hide resolved
src/ray_julia_jll/common.jl Outdated Show resolved Hide resolved
src/ray_julia_jll/common.jl Outdated Show resolved Hide resolved
src/function_manager.jl Show resolved Hide resolved
src/function_manager.jl Outdated Show resolved Hide resolved
@glennmoy glennmoy requested a review from omus October 24, 2023 21:48
glennmoy and others added 2 commits October 25, 2023 11:56
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
src/function_manager.jl Outdated Show resolved Hide resolved
src/function_manager.jl Outdated Show resolved Hide resolved
@omus
Copy link
Member

omus commented Oct 25, 2023

Failures definitely look related:

 [ Info: Connecting function manager to GCS at 10.0.0.105:6379...
terminate called after throwing an instance of 'std::runtime_error'
  what():  NotFound: Failed to find the key

signal (6): Aborted
in expression starting at /home/runner/work/Ray.jl/Ray.jl/test/ray_serializer.jl:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)

https://github.com/beacon-biosignals/Ray.jl/actions/runs/6645852901/job/18058052127?pr=211#step:12:215

 [ Info: Connecting function manager to GCS at 127.0.0.1:6379...
libc++abi: terminating due to uncaught exception of type std::runtime_error: NotFound: Failed to find the key

signal (6): Abort trap: 6
in expression starting at /Users/runner/work/Ray.jl/Ray.jl/test/ray_serializer.jl:81
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 44751822 (Pool: 44725417; Big: 26405); GC: 51

https://github.com/beacon-biosignals/Ray.jl/actions/runs/6645852901/job/18058052370?pr=211#step:12:211

@omus
Copy link
Member

omus commented Oct 25, 2023

I can manage to reproduce this on my local machine:

┌ Debug: Skipping registering ownership for ObjectRef("00ffffffffffffffffffffffffffffffffffffff4300000013000000") as object has known owner
└ @ Ray ~/.julia/dev/Ray/src/object_ref.jl:128
libc++abi: terminating due to uncaught exception of type std::runtime_error: NotFound: Failed to find the key

[1307] signal (6): Abort trap: 6
in expression starting at /Users/cvogt/.julia/dev/Ray/test/runtests.jl:19
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 30846806 (Pool: 30821062; Big: 25744); GC: 44
ERROR: Package Ray errored during testing (received signal: 6)

omus added 6 commits October 25, 2023 21:01
Avoids this failure when `JuliaGcsClient` is garbage collected:

```
libc++abi: terminating due to uncaught exception of type std::runtime_error: GCS client not initialized; did you forget to Connect?

[13612] signal (6): Abort trap: 6
in expression starting at /Users/cvogt/.julia/dev/Ray/test/runtests.jl:19
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 30955062 (Pool: 30929299; Big: 25763); GC: 43
ERROR: Package Ray errored during testing (received signal: 6)
```
@omus
Copy link
Member

omus commented Oct 26, 2023

The issue noticed above was just due to us failing to Disconnect the JuliaGcsClient used in the function_manager.jl tests. I've added a destructor to the JuliaGcsClient to automatically disconnect the client to avoid the SIGABRT failure. However, since users of JuliaGcsClient should really be handling disconnecting themselves I added in a STDERR warning from C++ indicating that a connected JuliaGcsClient failed to disconnect on their own. As Ray.jl users aren't exposed to this class mostly this is just a reminder for Ray.jl developers to be explicit about disconnecting.

@omus omus merged commit 9cd5715 into main Oct 26, 2023
8 checks passed
@omus omus deleted the gm/gcsclient branch October 26, 2023 04:48
@glennmoy
Copy link
Contributor Author

The issue noticed above was just due to us failing to Disconnect the JuliaGcsClient used in the function_manager.jl tests. I've added a destructor to the JuliaGcsClient to automatically disconnect the client to avoid the SIGABRT failure. However, since users of JuliaGcsClient should really be handling disconnecting themselves I added in a STDERR warning from C++ indicating that a connected JuliaGcsClient failed to disconnect on their own. As Ray.jl users aren't exposed to this class mostly this is just a reminder for Ray.jl developers to be explicit about disconnecting.

yeah when I removed the finalizer from the inner constructor as per your suggestion above I should have done so by reverting 6b98717 as I missed adding back the Disconnect

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stop using the PythonGcsClient
2 participants