
Duplicate Circular emails distributed #2659

Open
jracusin opened this issue Nov 1, 2024 · 4 comments
Labels: bug Something isn't working

jracusin (Contributor) commented Nov 1, 2024

Description

Some circulars are being distributed via email multiple times. Some recent examples are Circs: 38023, 38021, 38016, 38009, 37989, 37987, 37980, 37979, 37963, 37962.

Similar issue in #924, but the fix there seemed to resolve the problem, at least temporarily.

Acceptance Criteria

Distribute only one copy of each circular to subscribed users.

jracusin added the bug Something isn't working label Nov 1, 2024
jracusin added this to GCN Nov 1, 2024
lpsinger (Member) commented Nov 4, 2024

> Similar issue in #924, but the fix there seemed to resolve the problem, at least temporarily.

The root cause is probably not related to #924.

jracusin moved this to Open in GCN Nov 16, 2024
lpsinger (Member) commented Dec 6, 2024

The pattern that I see is that duplicate GCN Circulars emails correspond to invocations of the circulars table stream lambda that log the following errors:

2024-12-04T13:47:18.740Z	630c00d0-3149-4ca2-8fda-ac8b0ea15924	ERROR	{
    "level": "ERROR",
    "timestamp": "2024-12-04T13:47:18.740Z",
    "logger": "kafkajs",
    "message": "[BrokerPool] outgoing request timed out after 3500ms",
    "retryCount": 0,
    "retryTime": 287
}
2024-12-04T13:47:18.741Z	630c00d0-3149-4ca2-8fda-ac8b0ea15924	ERROR	Invoke Error 	{
    "errorType": "Error",
    "errorMessage": "oy: outgoing request timed out after 3500ms",
    "stack": [
        "Error: oy: outgoing request timed out after 3500ms",
        "    at Object.intoError (file:///var/runtime/index.mjs:46:16)",
        "    at Object.textErrorLogger [as logError] (file:///var/runtime/index.mjs:684:56)",
        "    at postError (file:///var/runtime/index.mjs:801:27)",
        "    at done (file:///var/runtime/index.mjs:833:11)",
        "    at fail (file:///var/runtime/index.mjs:845:11)",
        "    at file:///var/runtime/index.mjs:872:20"
    ]
}

Probably Lambda is retrying the failed invocations, hence the duplicate emails.
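
If retried invocations are the cause, one defense is to make the email fan-out idempotent so that a retry cannot distribute the same circular twice. Here is a minimal sketch (not the project's actual code; the table and function names are hypothetical) using a DynamoDB conditional write as a send-once guard:

```ts
// Hypothetical sketch: record each circular with a conditional write before sending,
// so a retried Lambda invocation detects that the circular was already handled.
import {
  ConditionalCheckFailedException,
  DynamoDBClient,
  PutItemCommand,
} from '@aws-sdk/client-dynamodb'

const dynamodb = new DynamoDBClient({})

/** Returns true if this invocation claimed the circular, false if it was already claimed. */
async function claimCircular(circularId: number): Promise<boolean> {
  try {
    await dynamodb.send(
      new PutItemCommand({
        TableName: 'circulars_email_log', // hypothetical table
        Item: {
          circularId: { N: String(circularId) },
          sentAt: { S: new Date().toISOString() },
        },
        // Fails if another (retried) invocation already wrote this circular id.
        ConditionExpression: 'attribute_not_exists(circularId)',
      })
    )
    return true
  } catch (e) {
    if (e instanceof ConditionalCheckFailedException) return false
    throw e
  }
}
```

The conditional write is attempted before sending, so a retried invocation finds the claim already present and skips the email, trading a rare dropped message for no duplicates.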

lpsinger (Member) commented Dec 6, 2024

The "outgoing request timed out after 3500ms" error is reported by KafkaJS, but it originates inside node-openid-client, here: https://github.com/panva/openid-client/blob/v5.6.1/lib/helpers/request.js#L137

So the authentication is timing out.
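
If the token request is merely slow rather than broken, one mitigation would be to raise openid-client's HTTP timeout; v5 exposes custom.setHttpOptionsDefaults for this, and the 3500 ms in the error above appears to be its default. A sketch only, with the issuer URL and credentials as placeholders:

```ts
// Sketch: raise node-openid-client's HTTP timeout for all of its outgoing
// requests (discovery, token endpoint, etc.). Values and names are illustrative.
import { custom, Issuer } from 'openid-client'

custom.setHttpOptionsDefaults({ timeout: 10000 }) // v5 defaults to 3500 ms

// Hypothetical client-credentials grant of the kind used to obtain a Kafka bearer token.
const issuer = await Issuer.discover(process.env.OIDC_ISSUER!) // assumed env var
const client = new issuer.Client({
  client_id: process.env.KAFKA_CLIENT_ID!, // assumed env var
  client_secret: process.env.KAFKA_CLIENT_SECRET!, // assumed env var
})
const { access_token } = await client.grant({ grant_type: 'client_credentials' })
```

Raising the timeout only papers over slow authentication; the retry-driven duplicates would still need an idempotency guard like the one sketched above.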

lpsinger added a commit to lpsinger/gcn.nasa.gov that referenced this issue Dec 6, 2024
The authentication was timing out, leading to the Lambda
invocations failing and being retried, which caused duplicate
distributions of GCN Circulars.

See nasa-gcn#2659.
lpsinger added a commit that referenced this issue Dec 6, 2024
The authentication was timing out, leading to the Lambda
invocations failing and being retried, which caused duplicate
distributions of GCN Circulars.

See #2659.
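
Another way to keep authentication off the critical path (a sketch only, not the contents of the referenced commit; the issuer, environment variables, and client id are assumptions) is to cache the SASL/OAUTHBEARER token across connection attempts, so most invocations skip the slow token request entirely:

```ts
// Sketch: cache the bearer token used by the Kafka producer so that a fresh
// (and possibly slow) token request is not needed on every Lambda invocation.
import { Kafka } from 'kafkajs'
import { Issuer } from 'openid-client'

let cached: { value: string; expiresAt: number } | undefined

async function fetchToken() {
  // Hypothetical client-credentials grant; issuer URL and credentials are assumptions.
  const issuer = await Issuer.discover(process.env.OIDC_ISSUER!)
  const client = new issuer.Client({
    client_id: process.env.KAFKA_CLIENT_ID!,
    client_secret: process.env.KAFKA_CLIENT_SECRET!,
  })
  const { access_token, expires_in } = await client.grant({
    grant_type: 'client_credentials',
  })
  return { value: access_token!, expiresAt: Date.now() + (expires_in ?? 3600) * 1000 }
}

const kafka = new Kafka({
  clientId: 'gcn-circulars-email', // hypothetical client id
  brokers: [process.env.KAFKA_BROKER!], // assumed env var
  ssl: true,
  sasl: {
    mechanism: 'oauthbearer',
    oauthBearerProvider: async () => {
      // Refresh only when the cached token is missing or about to expire.
      if (!cached || Date.now() > cached.expiresAt - 60_000) cached = await fetchToken()
      return { value: cached.value }
    },
  },
})
```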
jracusin moved this from Open to In Progress in GCN Dec 9, 2024
lpsinger (Member) commented Dec 9, 2024

I think that #2754 helped, but I recently got two copies of https://gcn.nasa.gov/circulars/38476. Here's the server log:

2024-12-09T05:56:29.521Z	6cb876d0-b08b-4341-9fcd-8a9a69c783f8	ERROR	Invoke Error 	{
    "errorType": "Error",
    "errorMessage": "ThrottlingException: Rate exceeded",
    "stack": [
        "Error: ThrottlingException: Rate exceeded",
        "    at Object.intoError (file:///var/runtime/index.mjs:46:16)",
        "    at Object.textErrorLogger [as logError] (file:///var/runtime/index.mjs:684:56)",
        "    at postError (file:///var/runtime/index.mjs:801:27)",
        "    at done (file:///var/runtime/index.mjs:833:11)",
        "    at fail (file:///var/runtime/index.mjs:845:11)",
        "    at file:///var/runtime/index.mjs:872:20",
        "    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"
    ]
}

It's hard to tell where the failure originates because the stack trace refers to minified source.
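
One way to get readable stack traces out of a bundled Lambda is to emit source maps at build time and enable them in the Node.js runtime. A sketch, assuming an esbuild-based bundle step (the entry point and output paths are made up):

```ts
// Sketch: bundle the Lambda with source maps so runtime stack traces map back to
// the original TypeScript instead of minified output. Paths are hypothetical.
import { build } from 'esbuild'

await build({
  entryPoints: ['app/table-streams/circulars/index.ts'], // hypothetical entry point
  bundle: true,
  minify: true,
  sourcemap: true, // emit .map files next to the bundle
  platform: 'node',
  format: 'esm',
  outdir: 'build/lambda', // hypothetical output directory
})
```

At runtime, the function would also need source maps enabled, e.g. by setting NODE_OPTIONS=--enable-source-maps in the Lambda environment.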
