How much caching does Openverse API really need? #2453
Replies: 3 comments 4 replies
-
I just looked at image searches (hits to `/v1/images/`) over the last 24 hours and it appears that only 3% of requests were Cloudflare cache HITs. I think option 3, lowering the cache TTL to something really short, like a minute, would be fine. This would also help us deal with content moderation issues and ease removing images from the search results. It might also be a reality worth accepting and embracing, as we may be able to find further performance optimizations in the API itself. Finally, this would also better support us if we switch to more instantaneous data refresh approaches in the future, rather than scheduled bulk refreshes.
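The arithmetic behind that 3% figure is worth making explicit: with such a low hit rate, almost all requests already reach the API, so shortening or even dropping the cache barely increases origin load. A quick sketch (the 3% hit rate is the only input, taken from above):

```python
# With only a 3% Cloudflare cache hit rate, 97% of search requests already
# hit the API origin. If every request hit origin instead, the load increase
# is 1 / 0.97 - 1, i.e. about 3.1% -- small enough to be in the noise.
hit_rate = 0.03
origin_fraction_now = 1 - hit_rate       # share of requests already reaching the API
load_multiplier = 1 / origin_fraction_now  # origin load if the cache disappeared entirely

print(f"{origin_fraction_now:.0%} of requests already reach the API")
print(f"worst-case extra origin load without the cache: {load_multiplier - 1:.1%}")
```

This is an upper bound: a short TTL still absorbs some repeat traffic, so the real increase would be smaller.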
-
Given the figures that Zack shares, it looks like search requests are sufficiently performant to handle the third solution! I like its simplicity and would also support going for it, perhaps starting with a TTL of 1 hour or 30 minutes. One minute is too short to catch much from someone trying slightly different search terms in one session.
-
Thanks for the write-up and summary of things as they stand! I was initially swayed towards solution 2, as it seemed the most thorough with the lowest impact on our performance. However, based on some of the stats being shared by Zack and yourself, it seems that solution 3 might actually be completely viable for us. If that's the case (especially if it simplifies some cache management for us as you suspect it will!) then I fully support it! You're right to point out that we have lots of other optimizations built in besides caching the whole result itself. Hopefully we can leverage those and the power of Elasticsearch to make the actual searches themselves quick 😄 A few notes on the other approaches, just as I was reading the discussion:
Both of those thoughts are less meaningful if we go with solution 3 anyway 🙂
-
I started working on #1969 today and ran across an interesting question that I hadn't asked during project planning: how much caching do we really need for the API to be performant? The main impetus for this question stems from a rather difficult problem related to data refreshes. Data refreshes cause our data to update at a semiregular cadence. Ideally this cadence would be extremely regular, but we know too well that disruptions to it will occur, sometimes for a long while for a given media type.

I began to consider the idea that we could use a relatively long TTL on endpoints affected by the data refresh (like search) and manually invalidate all cached results for those endpoints when a data refresh finished and new results were available. There's no way to anticipate which cached search queries will be affected by a data refresh. Thus, the only way to ensure new or updated results appear in search "right away" (or in an acceptable amount of time) is to invalidate the entire search response cache for the media type.

Granted, this might not be a concern; we may be okay with a 1 week or 1 month expiration for a cached search response, and indeed we currently accept a 1 month TTL for this. However, consider that a data refresh can produce wildly different results for a query, and that each page of a query is cached independently. For example, if the first page for the query "blackbirds" was cached but not the second, and a data refresh added many new results that scored better for "blackbirds" than previously existing results, the results on that cached first page could all get shifted to the second. Then, if the second page is requested, search would return duplicate results between the two pages (so long as you are requesting against the cache)!
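The stale-page problem above can be shown with a tiny sketch. Nothing here is real Openverse code; the result lists and page size are made up purely to demonstrate how a cached page 1 and a freshly computed page 2 can overlap after a refresh:

```python
# Hypothetical illustration: a data refresh inserts three new top-ranked
# results, shifting the old page 1 down into page 2 of the fresh ranking.
old_results = ["a", "b", "c", "d", "e", "f"]                # ranking before the refresh
new_results = ["x", "y", "z", "a", "b", "c", "d", "e", "f"]  # three new hits rank highest

PAGE_SIZE = 3

def page(results, number):
    start = (number - 1) * PAGE_SIZE
    return results[start:start + PAGE_SIZE]

cached_page_1 = page(old_results, 1)  # served from the stale cache
fresh_page_2 = page(new_results, 2)   # computed against the refreshed index

duplicates = sorted(set(cached_page_1) & set(fresh_page_2))
print(duplicates)  # ['a', 'b', 'c'] -- every result on page 1 repeats on page 2
```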
So, granted that a too-long cache TTL is a problem: even if the cache TTL for search were 1 week (the regular data refresh latency), requests newly cached immediately before a data refresh finishes would still return invalid or potentially wonky results for a full week. There are three approaches I can think of that would solve this, each with distinct trade-offs:
1. Manually invalidate all cached responses for endpoints affected by the data refresh (like search and the `/related` endpoint) when a data refresh finishes. This would be technically possible after the fine-grained cache control IP is implemented, because we could use a stable key to iterate through all cached responses for a particular media type. However, we'd need to prevent that operation from running forever as new cached responses are added, and the operation is O(n) where n is the number of keys due to Redis's `DEL` command's implementation.
2. Include a component reflecting the current data refresh in the search cache key, so that a finished refresh rotates every key at once and previously cached responses are simply never read again.
3. Lower the cache TTL for these endpoints to something short enough that stale responses expire on their own shortly after a data refresh completes.

The first two would unavoidably create a period of time during which no requests would be cached on first access, but this would amortise over the period until the next data refresh (or the configured TTL for the keys). The last option would only work if search requests are sufficiently performant. However, only the second option would truly eliminate all possibility of overlap, because it is the only case where no paginated queries would ever hit a mix of both the cache and the refreshed index (the cache key would be tied to the current index).
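A minimal, hypothetical sketch of what the first approach's invalidation pass might look like. A plain dict stands in for Redis here so the batching logic is visible; real code would use Redis's `SCAN` with a `MATCH` pattern (and `UNLINK` rather than `DEL` for non-blocking deletion), and the key prefix `search:<media_type>:` is an invented naming scheme, not anything Openverse actually uses:

```python
# Sketch of a bounded invalidation pass for one media type. The max_keys cap
# is what keeps a single pass from "running forever" as new responses keep
# being cached while the pass executes.
def invalidate_media_type(cache: dict, media_type: str, max_keys: int = 10_000) -> int:
    prefix = f"search:{media_type}:"
    # Snapshot the matching keys first; deleting while iterating a live
    # keyspace is exactly the unbounded-iteration hazard described above.
    to_delete = [k for k in list(cache) if k.startswith(prefix)][:max_keys]
    for key in to_delete:
        del cache[key]
    return len(to_delete)

cache = {"search:image:abc": "...", "search:image:def": "...", "search:audio:xyz": "..."}
removed = invalidate_media_type(cache, "image")
print(removed)  # 2 -- the audio entry is untouched
```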
From my perspective, the first is a non-starter because of the issues with iterating through keys in Redis and `DEL`'s O(n) performance. The other two depend highly on the ability of our API to respond quickly without a cache. Let's consider what happens during a search request to our API.
Right now our average search response time (for requests with `q=` in the URL) is usually less than 200 milliseconds, rarely going above a third of a second, and often staying below 150 milliseconds. Our search endpoint is generally very fast. I suspect we could handle a much higher volume of search requests than we do at the moment. However, we'd need to evaluate that more closely while reindexing is happening.

For the third option to be viable, we'd need to be able to handle a regular, sustained load of increased search requests actually hitting Django. It could result in lower overall Redis memory consumption because rarer queries would quickly expire rather than sticking around for 1 week or longer. For the second option to be viable, we'd need to be able to handle a burst of additional search requests hitting our service, but cache hits would increase over the course of the week until the next data refresh. It would result in higher overall Redis memory consumption because queries could be cached twice, once for the previous data refresh key and once for the new data refresh key. However, it would stabilise with traffic rather than growing infinitely.
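The "cached twice" behaviour of the second option falls directly out of how its cache keys would be built. A hypothetical sketch (the key layout, `index_suffix` parameter, and hashing scheme are all invented for illustration, not real Openverse code):

```python
import hashlib

# Option 2 sketch: the cache key embeds an identifier for the current data
# refresh (here an invented date-like index suffix). When a refresh finishes
# and the suffix changes, every key rotates at once: old entries are never
# read again and simply expire via their TTL -- no explicit deletion needed.
def search_cache_key(media_type: str, query_params: dict, index_suffix: str) -> str:
    # Sort params so logically identical requests map to one key.
    canonical = "&".join(f"{k}={v}" for k, v in sorted(query_params.items()))
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    return f"search:{media_type}:{index_suffix}:{digest}"

params = {"q": "blackbirds", "page": "1"}
old_key = search_cache_key("image", params, "20230601")
new_key = search_cache_key("image", params, "20230608")
assert old_key != new_key  # same query, different refresh => different key
```

During the TTL window after a refresh, both keys can coexist in Redis, which is exactly the doubled memory consumption described above.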
On the one hand, I prefer the "correctness" of the second approach: it entirely eliminates the overlapping queries issue. If Redis memory consumption is going to increase such that we are forced to use the higher-memory instance anyway (because we'll be caching results in it one way or another after the IP is implemented), then higher Redis memory consumption may not be an issue. On the other hand, I like the consistency and simplicity of the third approach. Even though it does not completely eliminate overlapping queries, it significantly reduces the chances of them happening and the longevity of the effect when one inevitably does. It does so without requiring new Redis clean-up DAGs or further complicating cache key generation for search requests by including a component to reflect the data refresh.
I'm pinging @zackkrida and @krysal for opinions here, as between the two of you I think there's the most experience with our current caching configuration and with our Redis usage (especially after looking into the recent incident), but anyone is welcome to share input. Between the second and third approach, which do y'all think is most reasonable? Is this issue even worth worrying about, or should we just stick with our current TTL of 1 month and accept that new and updated data does not appear in results very quickly for common queries? If I wanted to see how much load our search endpoints can handle, would y'all support exploring ways of temporarily directing more search traffic to our API?