fix: helia-docker container dies when running in tiros #18
After some research last week, it looks like Helia is adding too many event listeners to user-provided signals and overall just mismanaging the AbortSignals used (shutdown controller, per-dial timeout signal, user shutdown controller, a combined signal that adds more event listeners), and this is causing the Docker container to fall over. js-libp2p (in Node) currently sets max event listeners to Infinity as a workaround for this, but I don't imagine that's a great way to handle things and need to investigate this further.
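For context, here is a minimal sketch of the listener-accumulation pattern being described, assuming Node's `events.setMaxListeners` and the `any-signal` helper libp2p uses to combine signals; the `dialWithTimeout` function and the timeout value are hypothetical, not the actual gateway code:

```ts
import { setMaxListeners } from 'node:events'
import { anySignal } from 'any-signal'

// One long-lived shutdown signal shared across many dials (hypothetical names)
const shutdownController = new AbortController()

async function dialWithTimeout (dial: (signal: AbortSignal) => Promise<void>): Promise<void> {
  // Combining the shared shutdown signal with a per-dial timeout adds an 'abort'
  // listener to both inputs. If the combined signal is never cleared, listeners
  // accumulate on the shared signal until Node emits MaxListenersExceededWarning.
  const signal = anySignal([shutdownController.signal, AbortSignal.timeout(10_000)])
  try {
    await dial(signal)
  } finally {
    signal.clear() // removes the listeners this combination added
  }
}

// The js-libp2p workaround mentioned above: raise the listener cap instead
setMaxListeners(Infinity, shutdownController.signal)
```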
Error from my container running for 12 hours locally:
starting up helia-http-gateway outside of docker and requesting
The Docker container seems more stable now, but we might want to recommend running it in Tiros with USE_LIBP2P=false USE_BITSWAP=false for now to keep memory and CPU usage low.
I was able to get an error just in the Node runtime by running the following script:
#!/usr/bin/env bash
# Query all endpoints until failure
# This script is intended to be run from the root of the helia-http-gateway repository
mkdir -p test-output
wget -T 180 -O test-output/blog.ipfs.tech http://localhost:8080/ipns/blog.ipfs.tech
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/blog.libp2p.io http://localhost:8080/ipns/blog.libp2p.io
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/consensuslab.world http://localhost:8080/ipns/consensuslab.world
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/docs.ipfs.tech http://localhost:8080/ipns/docs.ipfs.tech
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/docs.libp2p.io http://localhost:8080/ipns/docs.libp2p.io
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/drand.love http://localhost:8080/ipns/drand.love
curl -X POST http://localhost:8080/api/v0/repo/gc
# wget -O test-output/fil.org http://localhost:8080/ipns/fil.org
# curl -X POST http://localhost:8080/api/v0/repo/gc
#
wget -T 180 -O test-output/filecoin.io http://localhost:8080/ipns/filecoin.io
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/green.filecoin.io http://localhost:8080/ipns/green.filecoin.io
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/ipfs.tech http://localhost:8080/ipns/ipfs.tech
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/ipld.io http://localhost:8080/ipns/ipld.io
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/libp2p.io http://localhost:8080/ipns/libp2p.io
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/n0.computer http://localhost:8080/ipns/n0.computer
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/probelab.io http://localhost:8080/ipns/probelab.io
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/protocol.ai http://localhost:8080/ipns/protocol.ai
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/research.protocol.ai http://localhost:8080/ipns/research.protocol.ai
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/singularity.storage http://localhost:8080/ipns/singularity.storage
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/specs.ipfs.tech http://localhost:8080/ipns/specs.ipfs.tech
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/strn.network http://localhost:8080/ipns/strn.network
curl -X POST http://localhost:8080/api/v0/repo/gc
wget -T 180 -O test-output/web3.storage http://localhost:8080/ipns/web3.storage
curl -X POST http://localhost:8080/api/v0/repo/gc
rm -rf test-output
The error output was:
[10:22:36.275] INFO (57554): helia-http-gateway:fastify incoming request
reqId: "req-f"
req: {
"method": "GET",
"url": "/ipns/green.filecoin.io",
"hostname": "localhost:8080",
"remoteAddress": "127.0.0.1",
"remotePort": 58478
}
(node:57554) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
file:///Users/sgtpooki/code/work/protocol.ai/ipfs/helia-http-gateway/node_modules/@libp2p/webrtc/dist/src/private-to-private/handler.js:21
connectedPromise.reject(new CodeError('Timed out while trying to connect', 'ERR_TIMEOUT'));
^
CodeError: Timed out while trying to connect
at signal.onabort (file:///Users/sgtpooki/code/work/protocol.ai/ipfs/helia-http-gateway/node_modules/@libp2p/webrtc/dist/src/private-to-private/handler.js:21:37)
at EventTarget.eventHandler (node:internal/event_target:1093:12)
at [nodejs.internal.kHybridDispatch] (node:internal/event_target:807:20)
at EventTarget.dispatchEvent (node:internal/event_target:742:26)
at abortSignal (node:internal/abort_controller:369:10)
at Timeout._onTimeout (node:internal/abort_controller:126:7)
at listOnTimeout (node:internal/timers:573:17)
at process.processTimers (node:internal/timers:514:7) {
code: 'ERR_TIMEOUT',
props: {}
}
I'm running the above script, with a minor edit, using https://www.npmjs.com/package/until-death, to try to kill helia-http-gateway. The script is as follows:
I was able to kill it with the above method, getting a JS heap out of memory error:
I've got those scripts in a
Seems like the container and Node.js process have been significantly improved on my local branch. Changes coming after an event at my son's school. Here's some output:
USE_LIBP2P=false & USE_BITSWAP=false significantly improve TTFB metrics:
We need to move away from using an in-memory datastore because there is no way to limit the amount of memory used. @wemeetagain mentioned that Lodestar uses an in-memory datastore that flushes to the file system. If Lodestar has a datastore we can use, we should use that; otherwise we should migrate to use the file system directly and then update to use the Lodestar strategy.
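As a rough sketch of the file-system-backed direction, assuming the `datastore-fs` and `blockstore-fs` packages from ipfs/js-stores (the paths are illustrative):

```ts
import { createHelia } from 'helia'
import { FsDatastore } from 'datastore-fs'
import { FsBlockstore } from 'blockstore-fs'

// Persist the datastore and blockstore to disk instead of holding everything
// in memory, so memory use is bounded by what is in flight rather than by
// everything the gateway has ever fetched.
const datastore = new FsDatastore('./data/datastore')
const blockstore = new FsBlockstore('./data/blockstore')

const helia = await createHelia({ datastore, blockstore })
```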
the crashing is reproducing in some CI tests: https://github.com/ipfs/helia-http-gateway/actions/runs/7104233205/job/19338737826?pr=59#step:6:27
I believe this is fixed: ipfs/helia#275 (comment)
not quite fixed. See https://github.com/plprobelab/probelab-infra/issues/87. We have hanging promises, probably because signals are not being passed through properly.
there are additional errors reported in slack: https://ipshipyard.slack.com/archives/C06C9LWQZC3/p1711097813106169?thread_ts=1710996462.233149&cid=C06C9LWQZC3 (private access)
Some info from @dennis-tra:
Solution ideas
I believe the update to the latest verified-fetch that handles aborted signals should help, along with pulling in the stores from ipfs/js-stores#287 to prevent excess memory use.
FYI probe-lab/tiros#12 is out to help make testing helia-http-gateway against tiros easier
currently running a battery of tests against a locally built helia-http-gateway with hyperfine:
FYI, looking at the heliadr-logs.txt logfile, this seems key to me:
we are aborting the request in helia-http-gateway, but we were using an older version of @helia/verified-fetch. A lot of changes happened between 1.1.2 and 1.3.2 of @helia/verified-fetch, such as handling signals properly. From the changelog:
@helia/verified-fetch 1.3.2 (2024-03-25): Bug Fixes
@helia/verified-fetch 1.3.1 (2024-03-22): Bug Fixes
@helia/verified-fetch 1.3.0 (2024-03-21): Features, Bug Fixes
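A minimal sketch of the abort path in question, assuming the public `createVerifiedFetch` API (the resource and timeout value are illustrative):

```ts
import { createVerifiedFetch } from '@helia/verified-fetch'

const verifiedFetch = await createVerifiedFetch()

// helia-http-gateway aborts the request when the client goes away; on the older
// release that signal was not fully honoured, so work could continue in the background.
const controller = new AbortController()
const timer = setTimeout(() => controller.abort(), 30_000)

try {
  const response = await verifiedFetch('ipns://blog.ipfs.tech', { signal: controller.signal })
  console.log(response.status)
} finally {
  clearTimeout(timer)
}
```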
For helia-logs.txt, it appears to be a legitimate heap allocation failure. I think the js-stores changes in ipfs/js-stores#287 will fix this.
@achingbrain do you have any other ideas about what could be happening here, or why we would be running into issues when run in a Linux container environment? @dennis-tra I wonder if we could more effectively reproduce this by running the Tiros job locally in a container. Any ideas? Also, what are the machine specs being used by the IPFS container?
Bah. Apparently my helia-dr and helia tests on local weren't being hit; that's why memory never seemed to increase. I'm modifying the Tiros run command to not test HTTP, and to throw if ERR_NAME_NOT_RESOLVED happens.
Fixed with #81 (comment); will be running tests again.
FYI, subdomains of Docker hosts are not resolving properly, even for the Kubo container. The below is run from the Chrome container obtained from the Tiros repo when running
Edit: so we should probably be setting USE_SUBDOMAINS=false for Tiros.
This is probably an error with my Mac's local network, because subdomains were working for Tiros in the deployed environment. But checking responses from Kubo, it looks like setting USE_SUBDOMAINS=false gives us the same result:
Kubo:
helia-http-gateway:
So I'll just continue my testing with this flag disabled.
I was able to get the
This is on the latest @helia/verified-fetch, which does pass and handle signals. I will investigate further to see if I can reproduce outside of Tiros. FYI: this error also caused Tiros to continue running in the background, still listening on port 6666, which was causing issues when trying to start up again:
I've made some edits to helia-http-gateway locally and using the
Before changes
After changes
Changes made:
After changes:
However, I still ran into the ERR_STREAM_PREMATURE_CLOSE error:
helia-http-gateway:server fetching url "http://ipfs-tech.ipns.localhost:8080/" with @helia/verified-fetch +0ms
helia-http-gateway:server:trace request destroyed for url "http://ipfs-tech.ipns.localhost:8080/" +358ms
http://localhost:8080/ipns/ipfs.tech: HTTP_200 in 0.384512 seconds (TTFB: 0.384370, rediect: 0.027293)
running GC
node:events:496
throw er; // Unhandled 'error' event
^
Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
at Socket.onclose (node:internal/streams/end-of-stream:154:30)
at Socket.emit (node:events:530:35)
at TCP.<anonymous> (node:net:337:12)
Emitted 'error' event on JSStreamSocket instance at:
at Duplex.<anonymous> (node:internal/js_stream_socket:64:38)
at Duplex.emit (node:events:518:28)
at emitErrorNT (node:internal/streams/destroy:169:8)
at emitErrorCloseNT (node:internal/streams/destroy:128:3)
at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
code: 'ERR_STREAM_PREMATURE_CLOSE'
}
Also, I am explicitly making this problem worse by setting
doing some processing of the log...
grep -a -o 'ERR_[A-Z_]\+' until-death.log | sort | uniq -c | sort
3 ERR_SOCKET_READ_TIMEOUT
6 ERR_UNREQUESTED_PING
7 ERR_SOCKET_CLOSE_TIMEOUT
20 ERR_UNSUPPORTED_PROTOCOL
27 ERR_TOO_MANY_ADDRESSES
29 ERR_MAX_RECURSIVE_DEPTH_REACHED
32 ERR_RELAYED_DIAL
59 ERR_CONNECTION_FAILED
80 ERR_TRANSIENT_CONNECTION
109 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS
284 ERR_HOP_REQUEST_FAILED
308 ERR_CONNECTION_BEING_CLOSED
936 ERR_TIMEOUT
1474 ERR_MUXER_UNAVAILABLE
3038 ERR_TRANSPORT_DIAL_FAILED
3731 ERR_DIALED_SELF
16747 ERR_STREAM_RESET
18733 ERR_ENCRYPTION_FAILED
49152 ERR_MUXER_LOCAL_CLOSED
105748 ERR_STREAM_PREMATURE_CLOSE
120758 ERR_NO_VALID_ADDRESSES
280800 ERR_UNEXPECTED_EOF
helia-wg notes 2024-03-28
FYI: with my "helia-all" configuration, I can usually repro this error in a few minutes with the
OK, I've been working on this quite a bit the past few days. I've got a script running now that executes until-death.sh for a given number of iterations for each permutation of $USE_SUBDOMAINS, $USE_BITSWAP, $USE_TRUSTLESS_GATEWAYS, $USE_LIBP2P, $USE_DELEGATED_ROUTING, filtering out use cases where fetching content would be impossible (see the sketch after this comment):
if [ "$USE_BITSWAP" = false ]; then
echo "Skipping test for configuration: $config_id"
return
fi
if [ "$USE_LIBP2P" = false ] && [ "$USE_DELEGATED_ROUTING" = false ]; then
echo "Skipping test for configuration: $config_id"
return
fi
I'm enabling all debug logs for helia & libp2p with
I'm running my script with
until-death.sh is given 5 minutes to run, or it times out and closes the run as successful, deleting the run's log file. If unsuccessful (until-death.sh fails before the timeout occurs), it keeps the log file and continues on to the next permutation or iteration. The
I've also got a local Grafana + Prometheus setup, and some code for folks to reuse for standing up their own, with provisioned dashboards, datasource, and docker-compose for easy running. I've also updated the
I've discovered a few things while repeatedly bashing things (sorry to all peers getting bombarded with requests for the same content).
Problems that need to be fixed:
Actions to take:
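A sketch of the permutation loop and skip logic described above, in TypeScript (the flag names come from the comment; the loop body is hypothetical):

```ts
// Enumerate every combination of the feature flags and skip configurations
// that could never fetch content, mirroring the bash skip logic above.
type Config = {
  USE_SUBDOMAINS: boolean
  USE_BITSWAP: boolean
  USE_TRUSTLESS_GATEWAYS: boolean
  USE_LIBP2P: boolean
  USE_DELEGATED_ROUTING: boolean
}

const flags = ['USE_SUBDOMAINS', 'USE_BITSWAP', 'USE_TRUSTLESS_GATEWAYS', 'USE_LIBP2P', 'USE_DELEGATED_ROUTING'] as const

function * permutations (): Generator<Config> {
  for (let i = 0; i < 2 ** flags.length; i++) {
    const config = {} as Config
    flags.forEach((flag, bit) => { config[flag] = Boolean(i & (1 << bit)) })
    yield config
  }
}

function isTestable (c: Config): boolean {
  if (!c.USE_BITSWAP) return false
  if (!c.USE_LIBP2P && !c.USE_DELEGATED_ROUTING) return false
  return true
}

for (const config of permutations()) {
  if (!isTestable(config)) {
    console.log('skipping configuration', config)
    continue
  }
  // spawn until-death.sh here with `config` exported as environment variables
}
```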
**New hypothesis**: attempting to dial IPv6 on a network that doesn't support it isn't handled properly. IPv6 doesn't work on my network. (Update: apparently it does...?)
I'm going to test this more thoroughly once my
Nope... blocking all IPv4 addrs works fine.
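For reference, a sketch of how dials to one address family can be blocked for this kind of test, assuming libp2p's `connectionGater` option as exposed through `createHelia` (not the actual test code):

```ts
import { createHelia } from 'helia'
import type { Multiaddr } from '@multiformats/multiaddr'

// Deny dials to IPv4 multiaddrs so only IPv6 dials are attempted
const helia = await createHelia({
  libp2p: {
    connectionGater: {
      // returning true blocks the dial
      denyDialMultiaddr: async (ma: Multiaddr) => ma.protoNames().includes('ip4')
    }
  }
})
```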
We need to move away from using an in-memory data-store because there is no way to limit the amount of memory used. @wemeetagain mentioned that Lodestar uses an in-mem datastore that flushes to file-system. The code for that is https://github.com/ChainSafe/lodestar/blob/unstable/packages/beacon-node/src/network/peers/datastore.ts#L71
If lodestar has a datastore we can use, we should use that, otherwise we should migrate to use file-system directly and then update to use lodestar strategy.
from #18 (comment)
log-events-viewer-result.csv
Action items before close