How much caching does Openverse API really need? #2453
Replies: 3 comments 4 replies
-
I just looked at image searches (hits to `/v1/images/`) over the last 24 hours and it appears that only 3% of requests were Cloudflare cache HITs. I think option 3, lowering the cache TTL to something really short, like a minute, would be fine. This would also help us deal with content moderation issues and ease removing images from the search results. It might also be a reality worth accepting and embracing, as we may be able to find further performance optimizations in the API itself. Finally, this would also better support us if we switch to more instantaneous data refresh approaches in the future, rather than scheduled bulk refreshes.
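The arithmetic behind that 3% figure is worth making explicit: with such a low hit rate, almost all requests already reach the API, so shortening or even dropping the cache barely increases origin load. A quick sketch (the 3% hit rate is the only input, taken from above):

```python
# With only a 3% Cloudflare cache hit rate, 97% of search requests already
# hit the API origin. If every request hit origin instead, the load increase
# is 1 / 0.97 - 1, i.e. about 3.1% -- small enough to be in the noise.
hit_rate = 0.03
origin_fraction_now = 1 - hit_rate       # share of requests already reaching the API
load_multiplier = 1 / origin_fraction_now  # origin load if the cache disappeared entirely

print(f"{origin_fraction_now:.0%} of requests already reach the API")
print(f"worst-case extra origin load without the cache: {load_multiplier - 1:.1%}")
```

This is an upper bound: a short TTL still absorbs some repeat traffic, so the real increase would be smaller.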
-
Given the figures that Zack shares, it looks like search requests are sufficiently performant to handle the third solution! I like its simplicity and would also support going for it, perhaps starting with a TTL of 1 hour or 30 minutes. One minute is too short to catch much from someone trying slightly different search terms in one session.
-
Thanks for the write-up and summary of things as they stand! I was initially swayed towards solution 2, as it seemed the most thorough with the lowest impact on our performance. However, based on some of the stats being shared by Zack and yourself, it seems that solution 3 might actually be completely viable for us. If that's the case (especially if it simplifies some cache management for us as you suspect it will!) then I fully support it! You're right to point out that we have lots of other optimizations built in besides caching the whole result itself. Hopefully we can leverage those and the power of Elasticsearch to make the actual searches themselves quick 😄 A few notes on the other approaches, just as I was reading the discussion:
Both of those thoughts are less meaningful if we go with solution 3 anyway 🙂
-
I started working on #1969 today and ran across an interesting question that I hadn't asked during project planning: how much caching do we really need for the API to be performant? The main impetus for this question stems from a rather difficult problem related to data refreshes. Data refreshes cause our data to update at a semiregular cadence. Ideally this cadence would be extremely regular, but we know too well that disruptions to it will occur, sometimes for a long while for a given media type.

I began to consider the idea that we could use a relatively long TTL on endpoints affected by the data refresh (like search) and manually invalidate all cached results for those endpoints when a data refresh finished and new results were available. There's no way to anticipate which cached search queries will be affected by a data refresh. Thus, the only way to ensure new or updated results appear in search "right away" (or in an acceptable amount of time) is to invalidate the entire search response cache for the media type.

Granted, this might not be a concern; we may be okay with a 1 week or 1 month expiration for a cached search response, and indeed we currently accept a 1 month TTL for this. However, consider that a data refresh can produce wildly different results for a query, and that each page of a query is cached independently. For example, if the first page for the query "blackbirds" was cached but not the second, and a data refresh added many new results that scored better for "blackbirds" than previously existing results, the results on that cached first page could all get shifted to the second. Then, if the second page is requested, search would return duplicate results between the two pages (so long as you are requesting against the cache)!
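The stale-page problem above can be shown with a tiny sketch. Nothing here is real Openverse code; the result lists and page size are made up purely to demonstrate how a cached page 1 and a freshly computed page 2 can overlap after a refresh:

```python
# Hypothetical illustration: a data refresh inserts three new top-ranked
# results, shifting the old page 1 down into page 2 of the fresh ranking.
old_results = ["a", "b", "c", "d", "e", "f"]                # ranking before the refresh
new_results = ["x", "y", "z", "a", "b", "c", "d", "e", "f"]  # three new hits rank highest

PAGE_SIZE = 3

def page(results, number):
    start = (number - 1) * PAGE_SIZE
    return results[start:start + PAGE_SIZE]

cached_page_1 = page(old_results, 1)  # served from the stale cache
fresh_page_2 = page(new_results, 2)   # computed against the refreshed index

duplicates = sorted(set(cached_page_1) & set(fresh_page_2))
print(duplicates)  # ['a', 'b', 'c'] -- every result on page 1 repeats on page 2
```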
So, granted that a too-long cache TTL is a problem: even if the cache TTL for search were 1 week (the regular data refresh latency), requests newly cached immediately before a data refresh finishes would still return invalid or potentially wonky results for a full week. There are three approaches I can think of that would solve this, each with distinct trade-offs:
1. Manually invalidate all cached responses for endpoints affected by the data refresh (like search and the `/related` endpoint) when a data refresh finishes. This would be technically possible after the fine-grained cache control IP is implemented, because we could use a stable key to iterate through all cached responses for a particular media type. However, we'd need to prevent that operation from running forever as new cached responses are added, and the operation is O(n) where n is the number of keys due to Redis's `DEL` command's implementation.
2. Include a component reflecting the current data refresh in the search cache key, so that a finished refresh rotates every key at once and previously cached responses are simply never read again.
3. Lower the cache TTL for these endpoints to something short enough that stale responses expire on their own shortly after a data refresh completes.

The first two would unavoidably create a period of time during which no requests would be cached on first access, but this would amortise over the period until the next data refresh (or the configured TTL for the keys). The last option would only work if search requests are sufficiently performant. However, only the second option would truly eliminate all possibility of overlap, because it is the only case where no paginated queries would ever hit a mix of both the cache and the refreshed index (the cache key would be tied to the current index).
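A minimal, hypothetical sketch of what the first approach's invalidation pass might look like. A plain dict stands in for Redis here so the batching logic is visible; real code would use Redis's `SCAN` with a `MATCH` pattern (and `UNLINK` rather than `DEL` for non-blocking deletion), and the key prefix `search:<media_type>:` is an invented naming scheme, not anything Openverse actually uses:

```python
# Sketch of a bounded invalidation pass for one media type. The max_keys cap
# is what keeps a single pass from "running forever" as new responses keep
# being cached while the pass executes.
def invalidate_media_type(cache: dict, media_type: str, max_keys: int = 10_000) -> int:
    prefix = f"search:{media_type}:"
    # Snapshot the matching keys first; deleting while iterating a live
    # keyspace is exactly the unbounded-iteration hazard described above.
    to_delete = [k for k in list(cache) if k.startswith(prefix)][:max_keys]
    for key in to_delete:
        del cache[key]
    return len(to_delete)

cache = {"search:image:abc": "...", "search:image:def": "...", "search:audio:xyz": "..."}
removed = invalidate_media_type(cache, "image")
print(removed)  # 2 -- the audio entry is untouched
```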
From my perspective, the first is a non-starter because of the issues with iterating through keys in Redis and `DEL`'s O(n) performance. The other two depend highly on the ability of our API to respond quickly without a cache. Let's consider what happens during a search request to our API.
Right now our average search response time (for requests with `q=` in the URL) is usually less than 200 milliseconds, rarely going above a third of a second, and often staying below 150 milliseconds. Our search endpoint is generally very fast. I suspect we could handle a much higher volume of search requests than we do at the moment. However, we'd need to evaluate that more closely while reindexing is happening.

For the third option to be viable, we'd need to be able to handle a regular, sustained load of increased search requests actually hitting Django. It could result in lower overall Redis memory consumption because rarer queries would quickly expire rather than sticking around for 1 week or longer. For the second option to be viable, we'd need to be able to handle a burst of additional search requests hitting our service, but cache hits would increase over the course of the week until the next data refresh. It would result in higher overall Redis memory consumption because queries could be cached twice, once for the previous data refresh key and once for the new data refresh key. However, it would stabilise with traffic rather than growing infinitely.
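The "cached twice" behaviour of the second option falls directly out of how its cache keys would be built. A hypothetical sketch (the key layout, `index_suffix` parameter, and hashing scheme are all invented for illustration, not real Openverse code):

```python
import hashlib

# Option 2 sketch: the cache key embeds an identifier for the current data
# refresh (here an invented date-like index suffix). When a refresh finishes
# and the suffix changes, every key rotates at once: old entries are never
# read again and simply expire via their TTL -- no explicit deletion needed.
def search_cache_key(media_type: str, query_params: dict, index_suffix: str) -> str:
    # Sort params so logically identical requests map to one key.
    canonical = "&".join(f"{k}={v}" for k, v in sorted(query_params.items()))
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    return f"search:{media_type}:{index_suffix}:{digest}"

params = {"q": "blackbirds", "page": "1"}
old_key = search_cache_key("image", params, "20230601")
new_key = search_cache_key("image", params, "20230608")
assert old_key != new_key  # same query, different refresh => different key
```

During the TTL window after a refresh, both keys can coexist in Redis, which is exactly the doubled memory consumption described above.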
On the one hand, I prefer the "correctness" of the second approach: it entirely eliminates the overlapping queries issue. If Redis memory consumption is going to increase such that we are forced to use the higher-memory instance anyway (because we'll be caching results in it one way or another after the IP is implemented), then higher Redis memory consumption may not be an issue. On the other hand, I like the consistency and simplicity of the third approach. Even though it does not completely eliminate overlapping queries, it significantly reduces the chances of them happening and the longevity of the effect when one inevitably does. It does so without requiring new Redis clean-up DAGs or further complicating cache key generation for search requests by including a component to reflect the data refresh.
I'm pinging @zackkrida and @krysal for opinions here, as between the two of you I think there's the most experience with our current caching configuration and with our Redis usage (especially after looking into the recent incident), but anyone is welcome to share input. Between the second and third approach, which do y'all think is most reasonable? Is this issue even worth worrying about, or should we just stick with our current TTL of 1 month and accept that new and updated data does not appear in results very quickly for common queries? If I wanted to see how much load our search endpoints can handle, would y'all support exploring ways of temporarily directing more search traffic to our API?