bytestream write can leak goroutines if disk.Put doesn't drain the io.Reader #473
Comments
Thanks for the bug report, I will take a look.
io.PipeWriter Write calls block until all the data written in the call is read from the corresponding io.PipeReader. If we don't read all that data, then the writing goroutine will block forever. The PipeWriter used in bytestream Write calls is intended to be consumed by disk.Put(), but if that returns early then there will be blocked writes. To un-block them, we should ensure that any remaining data is drained before disk.Put returns. This might fix buchgr#473
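To make the blocking behaviour concrete, here is a minimal, self-contained sketch (hypothetical code, not taken from bazel-remote) showing that a write to an io.Pipe only returns once a reader consumes the bytes or the read end is closed:

```go
package main

import (
	"fmt"
	"io"
	"time"
)

func main() {
	pr, pw := io.Pipe()

	done := make(chan struct{})
	go func() {
		// This Write blocks until the data is read from pr (or pr is closed).
		_, err := pw.Write([]byte("some chunk of bytestream data"))
		fmt.Println("write returned:", err)
		close(done)
	}()

	// Simulate a consumer that returns early without draining the reader:
	// nobody reads from pr, so the writer goroutine above stays blocked.
	select {
	case <-done:
		fmt.Println("writer finished")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("writer still blocked -> leaked goroutine")
	}

	// Closing the read end with an error is one way to unblock the writer.
	pr.CloseWithError(io.ErrClosedPipe)
	<-done
}
```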
I think what's happening is that disk.Put() returns early in some cases without reading all of the data from the pipe, leaving the writing goroutine blocked forever.
To fix this, we can ensure that disk.Put always drains the reader before returning. Could you test a build from this PR, and let me know if it worked or not? #474
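A rough sketch of the drain-before-return idea, under assumptions: the function name, signature, and size-limit handling below are made up for illustration, and the real change lives in #474.

```go
package cache

import "io"

// putLikeConsumer is a hypothetical stand-in for a disk.Put-style consumer
// that may stop reading early (for example when a size limit is exceeded).
func putLikeConsumer(r io.Reader, maxSize int64) (err error) {
	// Whatever happens below, discard any unread bytes before returning so
	// that an io.PipeWriter blocked on the other end can complete its writes.
	defer func() {
		if _, drainErr := io.Copy(io.Discard, r); drainErr != nil && err == nil {
			err = drainErr
		}
	}()

	// Consume at most maxSize bytes; returning early here without the drain
	// above is the kind of path that leaves the writer goroutine blocked.
	if _, err = io.CopyN(io.Discard, r, maxSize); err == io.EOF {
		err = nil // fewer than maxSize bytes is fine for this sketch
	}
	return err
}
```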
Thanks for the quick fix @mostynb! I verified that the number of goroutines did decrease, so it seems to work.
The memory usage still seems too big, though: I checked the disk folder and it only has 206.2 MB of data.
With only around 200 MB of disk in use, memory usage shouldn't be anywhere near that much.
I will need to look at your numbers and try to understand them some more, but my initial guess is that you're using zstd-compressed storage (the default). If you clear the cache directory and try uncompressed storage, the disk and memory numbers should be easier to compare. On the other hand, if you are seeing a lot of errors when uploading to the cache, then maybe it's worth finding a way to stop writing this data that needs to be drained. Do you have any put/upload errors in your logs? (Another thing we could try would be to use low-memory options for the zstd library bazel-remote uses.)
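For context, here is a hedged sketch of what zstd "low-memory options" could look like, assuming the klauspost/compress zstd package is the library in question; the option values are illustrative guesses, and the actual choices belong to the "lowmem" PR mentioned below.

```go
package cache

import "github.com/klauspost/compress/zstd"

// newLowMemCodecs builds a zstd encoder/decoder pair with options that trade
// some speed and compression ratio for a smaller memory footprint. The exact
// values here are illustrative assumptions, not bazel-remote's settings.
func newLowMemCodecs() (*zstd.Encoder, *zstd.Decoder, error) {
	enc, err := zstd.NewWriter(nil,
		zstd.WithEncoderConcurrency(1),           // fewer parallel encoder workers
		zstd.WithWindowSize(1<<20),               // 1 MiB window instead of the default
		zstd.WithEncoderLevel(zstd.SpeedFastest), // lighter-weight compression level
	)
	if err != nil {
		return nil, nil, err
	}

	dec, err := zstd.NewReader(nil,
		zstd.WithDecoderConcurrency(1), // fewer parallel decoder workers
		zstd.WithDecoderLowmem(true),   // prefer smaller allocations over speed
	)
	if err != nil {
		enc.Close()
		return nil, nil, err
	}
	return enc, dec, nil
}
```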
Even if the data is compressed, shouldn't memory usage drop after the data has been saved? I monitored for a while after my bazel tests finished and memory usage stayed that high.
I set the upload file limit, so I guess that is where this error comes from; there are no other errors. One thing is that after I set it, when I check the disk folder this time it only has around 1 MB in use after running the same test packages.
Did you clear the cache dir before restarting bazel-remote with that setting? Which version of top are you using? When I try locally (on ubuntu) I don't get the same fields as in your logs. I will land #474 (since that seems to fix the goroutine leak), and then try making a build that uses the low-memory options for the zstd library for you to try.
Let's see how this affects memory usage and performance. Related to buchgr#473.
Here's a version with the zstd library's low-memory options enabled to try out (not in all places, but most): #475
Cool, thanks. When will you publish a new version with your fix? We will deploy it. The version I used for testing was just the branch with your fix, so I think whichever version your fix is based on is the one I used.
The goroutine fix has landed on the master branch, and been pushed to the "latest" tag on docker hub, if that helps. I don't mind making a versioned release tomorrow with just the goroutine fix if it would be useful, but I can wait a few days if you want to test the "lowmem" changes first. (The "lowmem" PR is based on the tip of master, with the goroutine fix.)
We don't run the docker image; we run it directly, managed by systemd. I could build it myself, but a versioned release would be better to track for future maintenance :)
We think having a version tag is better for management. In this case, could you please create a version tag for this fix? 🙏
v2.1.4 and v1.3.3 have been released with this fix. Let's move discussion of the "lowmem" experiment over to #475.
We run the bazel-remote cache as a gRPC server and have observed that it leaks memory. The go_goroutines metric appears to increase in an unbounded fashion, in proportion to the number of requests the cache serves. At our current request scale, our cache crashes with an OOM every 2 days.
Configuration
Bazel Remote Version: 2.1.1
Disk: 1 TB
Memory: 300 GB
Observed OOM Crash
[Aug18 20:41] bazel-remote invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[ +0.003323] oom_kill_process+0x223/0x410
[ +0.006224] Out of memory: Kill process 48578 (bazel-remote) score 963 or sacrifice child
[ +0.012113] Killed process 48578 (bazel-remote) total-vm:391175744kB, anon-rss:388574312kB, file-rss:0kB, shmem-rss:0kB
Reproduction Steps
curl localhost:8080/metrics | grep gorou
This issue does not reproduce when artifacts are fetched from the remote cache server over HTTP.
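For repeated checks, a small helper like this hypothetical one (using the metrics endpoint from the curl command above) can poll the server and print the go_goroutines value over time to watch for unbounded growth:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// Polls the Prometheus metrics endpoint and prints the go_goroutines sample
// once per minute.
func main() {
	for {
		resp, err := http.Get("http://localhost:8080/metrics")
		if err != nil {
			fmt.Println("fetch error:", err)
		} else {
			scanner := bufio.NewScanner(resp.Body)
			for scanner.Scan() {
				line := scanner.Text()
				if strings.HasPrefix(line, "go_goroutines ") {
					fmt.Println(time.Now().Format(time.RFC3339), line)
				}
			}
			resp.Body.Close()
		}
		time.Sleep(time.Minute)
	}
}
```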
Goroutine Profile
Writes appear to be blocked for individual goroutines.
One Potential Culprit
https://github.com/buchgr/bazel-remote/blob/master/server/grpc_bytestream.go#L448
putResult is read in a nested channel operation that does not return. On subsequent iterations this code tries to push to recvResult, but no one is reading from it, which may lead to blocked writes.
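To make the suspected failure mode concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual grpc_bytestream.go code) of how a send on a channel that nobody reads any more leaks the sending goroutine:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	recvResult := make(chan error) // unbuffered result channel

	// The "receiver" goroutine keeps trying to report results.
	go func() {
		for i := 0; i < 2; i++ {
			recvResult <- nil // the second send blocks forever once the reader has moved on
		}
	}()

	// The other side reads exactly once and then stops caring.
	<-recvResult

	time.Sleep(100 * time.Millisecond)
	// The goroutine stuck on the second send still counts here, which is the
	// same pattern of blocked writes seen in the goroutine profile above.
	fmt.Println("goroutines:", runtime.NumGoroutine())
}
```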