Use GCSClient instead of PythonGCSClient #211
Codecov Report
@@ Coverage Diff @@
## main #211 +/- ##
==========================================
+ Coverage 97.65% 98.12% +0.46%
==========================================
Files 12 12
Lines 640 639 -1
==========================================
+ Hits 625 627 +2
+ Misses 15 12 -3
# ideally we'd throw on connect but it returns OK......
badclient = JuliaGcsClient("127.0.0.1:6378")
status = Connect(badclient)

# ...but then throws when we try to do anything so at least there's that
@test_throws ErrorException Put(badclient, ns, "computer", "mistaek", false, -1)
The issue here is that the `Connect` call returns `Status::OK` irrespective of whether the GCS server exists.
It first reports after 5 seconds that it can't connect, then after a minute kills the session with an `EXIT_FAILURE`.
Again, these are set by `RayConfig` params.
If the client does not exist, then the thread executing the server (I think) throws the error, which only gets reported but not caught in the Julia REPL.
Unfortunately the `gcs_is_down_` field is private; however, there is a way to check if the server is alive that uses a callback.
However, I don't think it's worth directly implementing this. The timeout should take care of things; it's just that the error won't be nicely caught or reported in Julia, but we can add that as a follow-up.
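Because `Connect` returns `OK` even when nothing is listening, one cheap stopgap is to probe the address before connecting. The sketch below is not the callback-based liveness check mentioned above. It is a plain TCP probe using Julia's `Sockets` standard library, and `is_listening` is a hypothetical helper that does not exist in Ray.jl. It only shows that some process accepts connections on the port, not that the process is a GCS server.

```julia
using Sockets

# Hypothetical helper, not part of Ray.jl: true if some process accepts TCP
# connections at host:port within `timeout` seconds.
function is_listening(host::AbstractString, port::Integer; timeout::Real=2.0)
    probe = @async try
        close(Sockets.connect(host, port))
        true
    catch
        false  # e.g. connection refused
    end
    # Poll until the probe finishes or the timeout expires.
    status = timedwait(() -> istaskdone(probe), Float64(timeout))
    return status === :ok && fetch(probe)
end
```

With the address from the test snippet, a guard such as `is_listening("127.0.0.1", 6378) || error("nothing listening at 127.0.0.1:6378")` could run before `Connect`, so the failure surfaces as an ordinary Julia exception instead of a process exit a minute later.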
MWE
julia> using Ray.ray_julia_jll
julia> badclient = ray_julia_jll.JuliaGcsClient("127.0.0.1:6378")
Ray.ray_julia_jll.JuliaGcsClientAllocated(Ptr{Nothing} @0x00000001376270e0)
julia> ray_julia_jll.Connect(badclient)
[2023-10-20 18:04:32,822 E 28309 8652849] gcs_rpc_client.h:207: Failed to connect to GCS at address 127.0.0.1:6378 within 5 seconds.
OK
julia> [2023-10-20 18:05:27,877 E 28309 8653025] gcs_rpc_client.h:537: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.
(venv) MacBook-Air~/.j/d/Ray (gm/gcsclient|✔)
The scenario in which this would occur is if you run `Ray.init()` without a local raylet present? If so, I'm fine with making this into an issue to tackle later.
yeah and even then it should error earlier if it tried to get the GCS address during Ray.init
Overall looks good. The C++ code was a bit hard to follow having not dived into the GCS myself. I think adding more comments and links in the C++ code would be helpful. I agree supporting timeouts isn't worth the effort at this time as we haven't been using it anyway.
7625fef to 7e495d2
Only the `FunctionManager` changes are blocking approval.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This reverts commit 7769d88.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
30c48ad to 9e15271
Co-authored-by: Curtis Vogt <[email protected]>
Co-authored-by: Curtis Vogt <[email protected]>
Failures definitely look related:
I can manage to reproduce this on my local machine:
This reverts commit cc8f43a.
Avoids this failure when `JuliaGcsClient` is garbage collected:
```
libc++abi: terminating due to uncaught exception of type std::runtime_error: GCS client not initialized; did you forget to Connect?
[13612] signal (6): Abort trap: 6
in expression starting at /Users/cvogt/.julia/dev/Ray/test/runtests.jl:19
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 30955062 (Pool: 30929299; Big: 25763); GC: 43
ERROR: Package Ray errored during testing (received signal: 6)
```
The issue noticed above was just due to us failing to |
yeah when I removed the |
Break off of #202
Closes #76
Two things worth calling out:
- Unlike `PythonGcsClient`, the `GcsClient` has no direct interface for specifying a `timeout`. This is instead set through `RayConfig` parameters, which can be overridden via environment variables.
- `Connect` to uninitiated GCS servers issues a warning message before killing the driver process after a 1-min timeout.
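
As a rough illustration of the override mechanism, the sketch below sets two `RayConfig` parameters through environment variables from Julia. It assumes Ray's `RAY_<parameter>` naming for overrides, and that the two timeouts seen in the logs correspond to `gcs_rpc_server_connect_timeout_s` (5 seconds) and `gcs_rpc_server_reconnect_timeout_s` (60 seconds). Neither name appears in this PR, so check them against `ray_config_def.h` for the Ray version in use. The values are illustrative only.

```julia
# Assumed RayConfig parameter names and illustrative values (seconds).
overrides = Dict(
    "gcs_rpc_server_connect_timeout_s" => 2,
    "gcs_rpc_server_reconnect_timeout_s" => 10,
)

# RayConfig is assumed to read each parameter from `RAY_<name>`.
for (name, value) in overrides
    ENV["RAY_" * name] = string(value)
end
```

The variables presumably need to be in place before the Ray library first reads its configuration, so set them before connecting to the GCS.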