Description
When creating our `httpFetcher` for each layer, we loop through all hosts that contain the layer and attempt to fetch the "real" layer/blob URL (for registries that have multiple storage backends, e.g. ECR -> S3). In the case of ECR, we get a pre-signed S3 URL with some expiration time. We use that URL to fetch span content, and if it expires we are required to go back to ECR to get a new pre-signed URL. If there are many concurrent span requests during this time, we will be forced to refresh the URL multiple times. This is generally not a common scenario, as we will most likely have fetched all the layer content before the URL expires. I still think it's worth exploring ways to either reduce these unneeded round-trips or remove them entirely, as they open the door for network failures that can result in I/O failures for the running container.
Steps to reproduce the bug
No response
Describe the results you expected
Re: #977 (comment)
We could potentially lock the URL over the entirety of `refreshURL`. This wouldn't solve the problem entirely, since other threads could still race to grab a copy of the old URL before the lock is taken, but it would make the race harder to hit. In the worst case, if many threads slip through, we will have a queue of span requests waiting, each trying to refresh the URL.

We could also use a RWMutex instead, taking a read lock every time we read the URL and a write lock when we attempt to refresh it. This is functionally equivalent to the solution above, except it doesn't require an exclusive lock just to read the URL.
Host information
OS:
Snapshotter Version:
Containerd Version:
Any additional context or information about the bug
No response