fix: catch uncaught exceptions & gc handles request aborts #102

Merged · 3 commits · Mar 29, 2024
25 changes: 25 additions & 0 deletions src/constants.ts
@@ -55,3 +55,28 @@ export const USE_DELEGATED_ROUTING = process.env.USE_DELEGATED_ROUTING !== 'false'
* If not set, we will default delegated routing to `https://delegated-ipfs.dev`
*/
export const DELEGATED_ROUTING_V1_HOST = process.env.DELEGATED_ROUTING_V1_HOST ?? 'https://delegated-ipfs.dev'

/**
* You can set `RECOVERABLE_ERRORS` to a comma delimited list of errors to recover from.
* If you want to recover from all errors, set `RECOVERABLE_ERRORS` to 'all'.
* If you want to recover from no errors, set `RECOVERABLE_ERRORS` to ''.
*/
export const RECOVERABLE_ERRORS = (() => {
if (process.env.RECOVERABLE_ERRORS === 'all') {
return 'all'
}
if (process.env.RECOVERABLE_ERRORS === '') {
return ''
}
return process.env.RECOVERABLE_ERRORS?.split(',') ?? 'all'
})()

export const ALLOW_UNHANDLED_ERROR_RECOVERY = (() => {
if (RECOVERABLE_ERRORS === 'all') {
return true
}
if (RECOVERABLE_ERRORS === '') {
return false
}
return RECOVERABLE_ERRORS.length > 0
})()
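
For example (not part of the diff), the parsing above maps a few representative values of the environment variable as follows; the error names used here are illustrative, not a list the gateway requires:

// Same rules as the exported constant, extracted so the behaviour is easy to see.
const parseRecoverableErrors = (value?: string): 'all' | '' | string[] => {
  if (value === 'all') return 'all'
  if (value === '') return ''
  return value?.split(',') ?? 'all'
}

console.log(parseRecoverableErrors(undefined))                               // 'all' -> recovery enabled
console.log(parseRecoverableErrors(''))                                      // ''    -> recovery disabled
console.log(parseRecoverableErrors('ERR_STREAM_PREMATURE_CLOSE,AbortError')) // ['ERR_STREAM_PREMATURE_CLOSE', 'AbortError'] -> recovery enabled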
4 changes: 2 additions & 2 deletions src/get-custom-helia.ts
@@ -24,12 +24,12 @@ export async function getCustomHelia (): Promise<Helia> {
}

let blockstore: HeliaInit['blockstore'] | undefined
if (FILE_BLOCKSTORE_PATH != null) {
if (FILE_BLOCKSTORE_PATH != null && FILE_BLOCKSTORE_PATH !== '') {
blockstore = new LevelBlockstore(FILE_BLOCKSTORE_PATH)
}

let datastore: HeliaInit['datastore'] | undefined
if (FILE_DATASTORE_PATH != null) {
if (FILE_DATASTORE_PATH != null && FILE_DATASTORE_PATH !== '') {
datastore = new LevelDatastore(FILE_DATASTORE_PATH)
}

33 changes: 23 additions & 10 deletions src/helia-server.ts
@@ -242,13 +242,7 @@ export class HeliaServer {
}
}

/**
* Fetches a content for a subdomain, which basically queries delegated routing API and then fetches the path from helia.
*/
async fetch ({ request, reply }: RouteHandler): Promise<void> {
const url = this.#getFullUrlFromFastifyRequest(request)
this.log('fetching url "%s" with @helia/verified-fetch', url)

#getRequestAwareSignal (request: FastifyRequest, url = this.#getFullUrlFromFastifyRequest(request), timeout?: number): AbortSignal {
const opController = new AbortController()
setMaxListeners(Infinity, opController.signal)
const cleanupFn = (): void => {
@@ -272,8 +266,26 @@
*/
request.raw.on('close', cleanupFn)

if (timeout != null) {
setTimeout(() => {
this.log.trace('request timed out for url "%s"', url)
opController.abort()
}, timeout)
}
return opController.signal
}

achingbrain (Member) commented on Apr 4, 2024:

Timeouts like this use resources and can keep the process running. Better to use AbortSignal.timeout(ms) and combine signals with any-signal.

SgtPooki (Member, Author) replied:

I should have cleared the timeout here, good catch, thanks. any-signal does not prevent duplicate handlers from being added to the same signal, so I prefer not to use it.

I will open an issue to update this to abortSignal.timeout(50) and addEventListener.

achingbrain (Member) replied:

> any-signal does not prevent duplicate handlers from being added to the same signal, so I prefer not to use it.

Are you passing the same signal to anySignal multiple times? Sounds like a bug if so.

SgtPooki (Member, Author) replied:

I'm not, but I know libraries that do :)

achingbrain (Member) replied:

If you are not, then it should be okay to use it.

If you know of libraries that do, can you please open issues or, better yet, PRs?
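
For reference, a rough sketch of the alternative suggested above, assuming Node's built-in AbortSignal.timeout() and the anySignal() helper from the any-signal package (the function name and wiring are illustrative, not what this PR ships):

import { setMaxListeners } from 'node:events'
import { anySignal } from 'any-signal'
import type { FastifyRequest } from 'fastify'

// Combine a request-lifecycle signal with an optional deadline signal, letting
// AbortSignal.timeout() own the timer instead of a manual setTimeout that has
// to be cleared.
function getRequestAwareSignal (request: FastifyRequest, timeout?: number): AbortSignal {
  const controller = new AbortController()
  request.raw.on('close', () => { controller.abort() })

  const signals = [controller.signal]
  if (timeout != null) {
    signals.push(AbortSignal.timeout(timeout))
  }

  const signal = anySignal(signals)
  setMaxListeners(Infinity, signal)
  return signal
}

Recent versions of any-signal also return a signal with a clear() method, which can be called once the request settles to remove the listeners it added to the source signals — relevant to the duplicate-handler concern raised above.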

/**
* Fetches a content for a subdomain, which basically queries delegated routing API and then fetches the path from helia.
*/
async fetch ({ request, reply }: RouteHandler): Promise<void> {
const url = this.#getFullUrlFromFastifyRequest(request)
this.log('fetching url "%s" with @helia/verified-fetch', url)

const signal = this.#getRequestAwareSignal(request, url)

await this.isReady
const resp = await this.heliaFetch(url, { signal: opController.signal, redirect: 'manual' })
const resp = await this.heliaFetch(url, { signal, redirect: 'manual' })
await this.#convertVerifiedFetchResponseToFastifyReply(resp, reply)
}

@@ -319,10 +331,11 @@
/**
* GC the node
*/
async gc ({ reply }: RouteHandler): Promise<void> {
async gc ({ reply, request }: RouteHandler): Promise<void> {
await this.isReady
this.log('running `gc` on Helia node')
await this.heliaNode?.gc({ signal: AbortSignal.timeout(20000) })
const signal = this.#getRequestAwareSignal(request, undefined, 20000)
await this.heliaNode?.gc({ signal })
await reply.code(200).send('OK')
}

17 changes: 15 additions & 2 deletions src/index.ts
@@ -68,6 +68,7 @@
* | `ECHO_HEADERS` | A debug flag to indicate whether you want to output request and response headers | `false` |
* | `USE_DELEGATED_ROUTING` | Whether to use the delegated routing v1 API | `true` |
* | `DELEGATED_ROUTING_V1_HOST` | Hostname to use for delegated routing v1 | `https://delegated-ipfs.dev` |
* | `RECOVERABLE_ERRORS` | A comma delimited list of errors to recover from. These errors are checked in `uncaughtException` and `unhandledRejection` callbacks | `all` |
*
* <!--
* TODO: currently broken when used in docker, but they work when running locally (you can cache datastore and blockstore locally to speed things up if you want)
@@ -159,7 +160,7 @@ import compress from '@fastify/compress'
import cors from '@fastify/cors'
import Fastify from 'fastify'
import metricsPlugin from 'fastify-metrics'
import { HOST, PORT, METRICS, ECHO_HEADERS, FASTIFY_DEBUG } from './constants.js'
import { HOST, PORT, METRICS, ECHO_HEADERS, FASTIFY_DEBUG, RECOVERABLE_ERRORS, ALLOW_UNHANDLED_ERROR_RECOVERY } from './constants.js'
import { HeliaServer, type RouteEntry } from './helia-server.js'
import { logger } from './logger.js'

@@ -249,7 +250,7 @@ const stopWebServer = async (): Promise<void> => {
}

let shutdownRequested = false
async function closeGracefully (signal: number): Promise<void> {
async function closeGracefully (signal: number | string): Promise<void> {
log(`Received signal to terminate: ${signal}`)
if (shutdownRequested) {
log('closeGracefully: shutdown already requested, exiting callback.')
@@ -268,3 +269,15 @@ async function closeGracefully (signal: number): Promise<void> {
// eslint-disable-next-line @typescript-eslint/no-misused-promises
process.once(signal, closeGracefully)
})

const uncaughtHandler = (error: any): void => {
log.error('Uncaught Exception:', error)
if (ALLOW_UNHANDLED_ERROR_RECOVERY && (RECOVERABLE_ERRORS === 'all' || RECOVERABLE_ERRORS.includes(error?.code) || RECOVERABLE_ERRORS.includes(error?.name))) {
log.trace('Ignoring error')
return
}
void closeGracefully('SIGTERM')
}

process.on('uncaughtException', uncaughtHandler)
process.on('unhandledRejection', uncaughtHandler)

SgtPooki (Member, Author) commented:

I understand, but we need some kind of error recovery instead of allowing things to just die like they were in Tiros. We could probably default this to FALSE and add a warning in the readme. Tiros needs this and probably some restarting of the server (to follow best practices).

> Unhandled exceptions inherently mean that an application is in an undefined state.

I started to just listen for ERR_STREAM_PREMATURE_CLOSE, which we know is a recoverable state. "In an undefined state" is not true in this instance.

This change allows us to recover from anything, which could cause problems in the future: there is a case where some error in libp2p/helia/fastify/helia-server.ts could happen that is unrecoverable, and this does an infinite loop of "on error resume next" and we don't find out until money has been eaten up running a dead service.

However, we still need to block this server from dying in instances where we know we can safely recover, and unblocking Tiros was foremost on my mind.

achingbrain (Member) replied:

The point of the linked warning is that you can't tell if you can safely recover, so the only safe thing you can do is exit the process and restart.

If we have unhandled exceptions being thrown, these are bugs that should be fixed.

SgtPooki (Member, Author) replied on Apr 4, 2024:

I agree they're bugs that should be fixed, but should we allow a server to die given underlying libraries' bugs if they're recoverable? Given the expectation of helia-http-server, I don't think so. It's supposed to fetch content using helia/libp2p, and return a result. If they die when fetching, we should certainly return a 500 error instead, and recover, right?

BTW, the only other place I saw ERR_STREAM_PREMATURE_CLOSE was in https://github.com/ChainSafe/js-libp2p-yamux, which is listening for those errors, so I don't think the uncaught exception is coming from there. It's likely coming from somewhere else in the libp2p stack.

edit: or the stream_premature_close error is coming from fastify req/resp

SgtPooki (Member, Author) replied:

I just realized we should add a listener on the req and resp stream for ERR_STREAM_PREMATURE_CLOSE...

SgtPooki (Member, Author) replied:

updated #112

achingbrain (Member) replied:

> should we allow a server to die given underlying libraries bugs if they're recoverable?

Again, the point of the linked warning is that you can't tell if you can safely recover.

Consider something like this:

process.on('uncaughtException', (err) => {
  if (err.message === 'recoverable') {
    // it's ok, it's recoverable
    return
  }

  console.error(err)
  process.exit(1)
})

const fd = await fs.open('/path/to/file.txt')

// something that causes an error
throw new Error('recoverable')

fd.close()

It looks recoverable, but the file descriptor is never closed, so it leaks memory.

Basically, if you're in an uncaught exception handler, all bets are off.
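
As a rough illustration of the req/resp stream listener SgtPooki mentions above (a hypothetical sketch, not the code from the follow-up PR): attaching an 'error' listener to the raw Node streams keeps a client disconnect from surfacing as an uncaught exception.

import type { IncomingMessage, ServerResponse } from 'node:http'

// Hypothetical helper: log premature-close errors from the raw request/response
// streams instead of letting them bubble up as uncaught exceptions.
function watchForPrematureClose (req: IncomingMessage, res: ServerResponse, log: (msg: string) => void): void {
  const onStreamError = (err: NodeJS.ErrnoException): void => {
    if (err.code === 'ERR_STREAM_PREMATURE_CLOSE') {
      log('client disconnected before the response completed')
    } else {
      log(`unexpected stream error: ${err.message}`)
    }
  }

  req.on('error', onStreamError)
  res.on('error', onStreamError)
}

In a Fastify app this could be wired up from an onRequest hook using request.raw and reply.raw.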