
Add tiered caching blog #3376

Merged
merged 10 commits into from
Oct 24, 2024

Conversation

kolchfa-aws
Collaborator

Closes #3374

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the BSD-3-Clause License.

Signed-off-by: Fanit Kolchina <[email protected]>
Collaborator

@natebower natebower left a comment


@kolchfa-aws @pajuric Editorial review complete. Please see my changes and let me know if you have any questions. Thanks!


As discussed, on-heap caches have limitations when handling larger datasets. A more effective caching mechanism is *tiered caching*, which uses multiple cache layers, starting with on-heap caching and extending to a disk-based tier. This approach balances performance and capacity, allowing you to store larger datasets without consuming valuable heap memory.

In the past, using a disk for caching raised concerns because traditional spinning hard drives were slower. However, advancements in storage technology, like modern SSD and NVMe drives, now deliver much faster performance. Although disk access is still slower than memory, the speed gap has narrowed enough that the performance trade-off is minimal and often outweighed by the advantage of increased storage capacity.

"advantage" => "benefit"?

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
@kolchfa-aws
Collaborator Author

@pajuric Editorial comments addressed. This blog is ready to publish.

@kolchfa-aws kolchfa-aws removed their assignment Oct 14, 2024
@sgup432

sgup432 commented Oct 14, 2024

@kolchfa-aws

Editorial comments addressed. This blog is ready to publish.

We still need to add Peter (@peteralfonsi) before we publish this blog, right?

@kolchfa-aws
Collaborator Author

Yes, I am waiting for his info and will add it to this PR when it's available.

Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws
Collaborator Author

@sgup432 @peteralfonsi Peter's bio added.


@jainankitk jainankitk left a comment


Nice blog, @sgup432! Minor comments to help improve readability.


## On-heap caching: A good start, but is it enough?

On-heap caching in OpenSearch provides a quick, simple, and efficient way to cache data locally on a node. It offers low-latency data retrieval and thereby provides significant performance gains. However, these advantages come with trade-offs, especially as the cache grows, which may lead to performance challenges.


However, these advantages come with trade-offs, especially as the cache grows, which may lead to performance challenges.

Maybe it's just me, but this line reads as slightly incomplete or confusing.

Collaborator Author


Reworded.


@kolchfa-aws Can we reword this to something like the following, as it's more specific?

"However, these advantages come with trade-offs, especially as the cache grows in size and reaches its capacity, which may lead to performance challenges due to high evictions and misses."

Collaborator Author


Reworded version added.


## When to use tiered caching

Because tiered caching currently only applies to the request cache, it's useful when the existing on-heap request cache isn't large enough to store your datasets and you encounter frequent evictions. You can check request cache statistics using the `GET /_nodes/stats/indices/request_cache` endpoint to monitor evictions, hits, and misses. If you notice frequent evictions along with some hits, enabling tiered caching could provide a significant performance boost.


Can we roughly quantify "along with some hits"?

Collaborator Author


@sgup432 Could you address this comment?


@kolchfa-aws @jainankitk I think we can leave it as is. I deliberately didn't quantify it, because it's hard to say whether tiered caching will only benefit at a >50% or >30% cache hit ratio.
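The statistics check from the excerpt above can be sketched in a few lines. This is a minimal illustration, not part of the blog itself: it assumes a cluster reachable at `http://localhost:9200`, and the `summarize` helper is a hypothetical convenience; the `evictions`, `hit_count`, and `miss_count` field names are those returned by the nodes stats API.

```python
import json
import urllib.request

def request_cache_stats(host="http://localhost:9200"):
    """Fetch per-node request cache statistics from the cluster."""
    url = f"{host}/_nodes/stats/indices/request_cache"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def summarize(stats):
    """Aggregate evictions, hits, and misses across all nodes."""
    totals = {"evictions": 0, "hit_count": 0, "miss_count": 0}
    for node in stats["nodes"].values():
        cache = node["indices"]["request_cache"]
        for key in totals:
            totals[key] += cache[key]
    lookups = totals["hit_count"] + totals["miss_count"]
    totals["hit_ratio"] = totals["hit_count"] / lookups if lookups else 0.0
    return totals
```

Frequent evictions combined with a nonzero hit ratio suggest the on-heap request cache is too small for the working set, which is the case tiered caching targets.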


Tiered caching is especially beneficial in these situations:

- Your domain experiences many cache evictions and has repeatable queries. You can confirm this by using request cache statistics.


nit: repeatable -> repeating/repeated ?

Tiered caching is especially beneficial in these situations:

- Your domain experiences many cache evictions and has repeatable queries. You can confirm this by using request cache statistics.
- You're working with log analytics or read-only indexes, in which data doesn't change often, and you're encountering frequent evictions.


nit: indexes -> indices

"in which data doesn't change often" looks redundant, I guess; "read-only indices" is self-explanatory.

Collaborator Author


"Indexes" is used per our style guide.

- Your domain experiences many cache evictions and has repeatable queries. You can confirm this by using request cache statistics.
- You're working with log analytics or read-only indexes, in which data doesn't change often, and you're encountering frequent evictions.

By default, the request cache only stores aggregation queries. You can enable caching for specific requests by using the `?request_cache=true` query parameter.


stores aggregation queries

stores aggregation query results
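As a hedged sketch of the `?request_cache=true` parameter described in the excerpt above, the following builds a cacheable `size=0` aggregation request. The index name, aggregation, and `build_cached_search` helper are hypothetical, for illustration only:

```python
import json
import urllib.request

def build_cached_search(index, aggs, host="http://localhost:9200"):
    """Build a size=0 aggregation search with request caching enabled."""
    url = f"{host}/{index}/_search?request_cache=true"
    body = json.dumps({"size": 0, "aggs": aggs}).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

# Hypothetical index and aggregation; urllib.request.urlopen(req)
# would execute the search against a running cluster.
req = build_cached_search(
    "logs-2024", {"by_status": {"terms": {"field": "status.keyword"}}}
)
```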


## What's next?

While tiered caching is a promising feature, we're actively working on further improvements. We're currently exploring ways to make tiered caching more performant. Future enhancements may include promoting frequently accessed items from the disk cache to the on-heap cache, persisting disk cache data between restarts, or integrating tiered caching with other OpenSearch cache types, such as the query cache. You can follow our progress in [this issue](https://github.com/opensearch-project/OpenSearch/issues/10024). We encourage you to try tiered caching in a non-production environment and to share your feedback to help make this feature more robust.


share your feedback to help make this feature more robust

share your feedback to help improve this feature




restarts, or integrating

restarts, and integrating

Signed-off-by: Fanit Kolchina <[email protected]>
Collaborator

@natebower natebower left a comment


@kolchfa-aws Just one minor change.

Signed-off-by: Fanit Kolchina <[email protected]>
has_science_table: true
meta_keywords: tiered caching, disk-based caching, on-heap caching, OpenSearch caching performance, how tiered caching works
meta_description: Explore how OpenSearch combines on-heap and disk-based caching to handle larger datasets and improve performance. Learn about the trade-offs of tiered caching, how it works, and future developments.
---

Please update the meta with the following:

meta_keywords: tiered caching, on-heap cache, disk-based caching, how tiered caching works, OpenSearch cache optimization

meta_description: Explore the benefits of combining on-heap and disk-based caching in OpenSearch to manage large datasets. Learn how tiered caching works, when to use it, and the performance results of our testing.

- peteral
- kkhatua
- kolchfa
date: 2024-10-11

Update blog date to 2024-10-24

@pajuric

pajuric commented Oct 24, 2024

@kkhatua - You are approved to push this live.

Member

@peterzhuamazon peterzhuamazon left a comment


Push to staging.

@peterzhuamazon peterzhuamazon merged commit c472604 into opensearch-project:main Oct 24, 2024
5 checks passed
Successfully merging this pull request may close these issues.

[BLOG] Tiered caching in OpenSearch