Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uploader: Improve upload retry policy #61

Merged
merged 7 commits into from
Aug 22, 2024
Merged

uploader: Improve upload retry policy #61

merged 7 commits into from
Aug 22, 2024

Conversation

victorges
Copy link
Member

@victorges victorges commented Jun 21, 2024

Implement 2 tiers of retries:

  • one shorter cycle, when trying to upload to primary or backup, where
    we retry a couple times up to 30s
    • This solves for transient errors when saving to primary or backup storage
  • a longer cycle wrapping those 2, to sustain potentially long running
    crisis. We wait longer between each retry and keep trying for up to 1h
    • This solves for longer running incidents, for which it's better to keep trying
      for a long time instead of dropping the process and lose the segment.

The inner retry loop also avoids waiting too long before falling back to the backup storage.
We need to avoid waiting too long otherwise the recordings could start to get processed
while segments were still trying to upload to the storage.

In order to be able to hold the upload attempt for so long I also changed the logic to
save the file in a temporary local file instead of keeping it in memory. This will keep the
memory usage low in case we have a long-running incident.

@victorges victorges changed the title core: Improve upload retry policy uploader: Improve upload retry policy Jun 21, 2024
@victorges victorges force-pushed the vg/feat/better-retries branch 3 times, most recently from 4c56dc9 to 4a43662 Compare June 24, 2024 19:10
@victorges victorges force-pushed the vg/feat/better-retries branch 3 times, most recently from c37cc5a to b4c376a Compare June 25, 2024 17:32
Base automatically changed from vg/feat/backup-storage to main June 25, 2024 19:06
Have 2 tiers of retries:
 - one shorter cycle, when trying to upload to primary or backup, where
   we retry a couple times up to 1m
 - a longer cycle wrapping those 2, to sustain potentially long running
   crisis. We wait longer between each retry and keep trying for up to 1h

The first loop solves for transient errors when saving to primary or backup
storage, while the second loop solves for longer running incidents, for which
it's probably better to keep trying for a long time instead of dropping the
process and lose the recording saving.
core/uploader.go Outdated
}

func SingleRequestRetryBackoff() backoff.BackOff {
return newExponentialBackOffExecutor(5*time.Second, 10*time.Second, 30*time.Second)
}

const segmentWriteTimeout = 5 * time.Minute
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder that we need to lower this value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a timetaken field to the "succeeded" log line so that we can get an idea of how long these uploads are taking. So then i'll lower it in a further PR does that sound ok?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT of making this an env var config, like the backup storage? So at least we can tune it better without a new code change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah good idea 👍

@mjh1
Copy link
Contributor

mjh1 commented Aug 12, 2024

@victorges I've just tweaked the outer retries down a bit, does that look ok to you?

Copy link
Member Author

@victorges victorges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mjh1 mjh1 merged commit 3517b44 into main Aug 22, 2024
8 checks passed
@mjh1 mjh1 deleted the vg/feat/better-retries branch August 22, 2024 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants