-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: catch uncaught exceptions & gc handles request aborts #102
Conversation
const uncaughtHandler = (error: any): void => { | ||
log.error('Uncaught Exception:', error) | ||
if (ALLOW_UNHANDLED_ERROR_RECOVERY && (RECOVERABLE_ERRORS === 'all' || RECOVERABLE_ERRORS.includes(error?.code) || RECOVERABLE_ERRORS.includes(error?.name))) { | ||
log.trace('Ignoring error') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't considered best practice - https://nodejs.org/api/process.html#warning-using-uncaughtexception-correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand, but we need some kind of error recovery instead of allowing things to just die like they were in Tiros. We could probably default this to FALSE and add a warning in the readme. Tiros needs this and probably some restarting of the server (to follow best practices).
Unhandled exceptions inherently mean that an application is in an undefined state.
I started to just listen for ERR_STREAM_PREMATURE_CLOSE
which we know is a recoverable state. "in an undefined state" is not true in this instance.
this change allows us to recover from anything which could cause problems in the future:
There is a case where some error in libp2p/helia/fastify/helia-server.ts could happen that is unrecoverable, and this does an infinite loop of "On error resume next" and we don't find out until money has been eaten up running a dead service.
However, we still need to block this server from dying in instances where we know we can safely recover, and unblocking Tiros was foremost on my mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of the linked warning is that you can't tell if you can safely recover so the only safe thing you can do is exit the process and restart.
If we have unhandled exceptions being thrown, these are bugs that should be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree they're bugs that should be fixed, but should we allow a server to die given underlying libraries bugs if they're recoverable? Given the expectation of helia-http-server, I don't think so. It's supposed to fetch content using helia/libp2p, and return a result. If they die when fetching, we should certainly return a 500 error instead, and recover, right?
BTW, The only other place I saw ERR_STREAM_PREMATURE_CLOSE
was in https://github.com/ChainSafe/js-libp2p-yamux, which is listening for those errors, so i don't think the uncaught exception is coming from there. it's likely coming from somewhere else in the libp2p stack.
edit: or stream_premature_close error is coming from fastify req/resp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized we should add a listener on the req and resp stream for ERR_STREAM_PREMATURE_CLOSE...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated #112
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we allow a server to die given underlying libraries bugs if they're recoverable?
Again, the point of the linked warning is that you can't tell if you can safely recover.
Consider something like this:
process.on('uncaughtException', (err) => {
if (err.message === 'recoverable') {
// it's ok, it's recoverable
return
}
console.error(err)
process.exit(1)
})
const fd = await fs.open('/path/to/file.txt')
// something that causes an error
throw new Error('recoverable')
fd.close()
It looks recoverable, but the file descriptor is never closed so it leaks memory.
Basically if you're in an uncaught exception handler all bets are off.
@@ -272,8 +266,26 @@ export class HeliaServer { | |||
*/ | |||
request.raw.on('close', cleanupFn) | |||
|
|||
if (timeout != null) { | |||
setTimeout(() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timeouts like this use resources and can keep the process running. Better to use AbortSignal.timeout(ms)
and combine signals with any-signal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have cleared the timeout here, good catch, thanks. any-signal does not prevent duplicate handlers from being added to the same signal, so I prefer not to use it.
I will open an issue to update this to abortSignal.timeout(50) and addEventListener
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any-signal does not prevent duplicate handlers from being added to the same signal, so I prefer not to use it.
Are you passing the same signal to anySignal
multiple times? Sounds like a bug if so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not, but I know libraries that do :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are not, then it should be okay to use it.
If you know of libraries that do, can you please open issues or better yet, PRs?
Title
fix: catch uncaught exceptions & gc handles request aborts
Description
A few changes here from #18 (comment)
Notes & open questions
a lot of fixes here from discoveries made investigating #18
Change checklist