ORC-262: [C++] Support async io prefetch for orc c++ lib #2048
It is entirely up to the user whether to prefetch the whole ORC file, one or more columns in a single stripe, or a single column across one or more stripes. It is better to let the user invoke
@wgtmac That's great work. We could make more improvements to IO latency hiding after it is merged.
I have just finished the initial review. Thanks @taiyang-li! Please see my inline comments. My main concern is usability: the user is required to call preBuffer explicitly, instead of the reader automatically prefetching the required data.
What changes were proposed in this pull request?
Support async IO prefetch for the ORC C++ library. Closes https://issues.apache.org/jira/browse/ORC-262
Changes:
- InputStream::readAsync (default unimplemented). It reads data asynchronously within the specified range.
- ReadRangeCache to cache async IO results. This borrows from a similar design in the Parquet reader in https://github.com/apache/arrow
- Reader::preBuffer to trigger IO prefetch. In the specific implementation, ReaderImpl::preBuffer, the IO ranges are calculated from the selected stripes and columns, then merged and sorted, and ReadRangeCache::cache is called to trigger the asynchronous IO in the background, where the results wait to be used by the upper layer.
- Reader::releaseBuffer, which is used to release all cached IO ranges before an offset.
Why are the changes needed?
Async IO prefetch can hide IO latency while reading ORC files, which improves the performance of scan operators in ClickHouse.
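To make the flow in the change list easier to picture, here is a minimal, self-contained sketch of the general technique: sort and merge the requested ranges, start one background read per merged range, serve later reads from the cached results, and drop entries that lie before an offset. It is not the code in this PR. The names ReadRange, coalesce, holeSizeLimit, RangeCache, and releaseBefore are hypothetical. An in-memory string stands in for the file, and std::async stands in for whatever an InputStream::readAsync implementation would do.

```cpp
#include <algorithm>
#include <cstdint>
#include <future>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type for illustration only; not the ORC API.
struct ReadRange {
  uint64_t offset;
  uint64_t length;
};

// Sort ranges by offset, then merge ranges that overlap or are separated by
// a gap of at most holeSizeLimit bytes (an assumed tuning knob).
std::vector<ReadRange> coalesce(std::vector<ReadRange> ranges, uint64_t holeSizeLimit) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) { return a.offset < b.offset; });
  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (!merged.empty() &&
        r.offset <= merged.back().offset + merged.back().length + holeSizeLimit) {
      uint64_t end = std::max(merged.back().offset + merged.back().length,
                              r.offset + r.length);
      merged.back().length = end - merged.back().offset;
    } else {
      merged.push_back(r);
    }
  }
  return merged;
}

// Toy cache: one background read per range, results handed out on request.
class RangeCache {
 public:
  explicit RangeCache(std::shared_ptr<const std::string> file) : file_(std::move(file)) {}

  // Start an asynchronous read for every range and remember its future.
  void cache(const std::vector<ReadRange>& ranges) {
    for (const ReadRange& r : ranges) {
      std::shared_ptr<const std::string> file = file_;
      std::shared_future<std::string> data =
          std::async(std::launch::async,
                     [file, r] { return file->substr(r.offset, r.length); })
              .share();
      entries_.push_back(Entry{r, data});
    }
  }

  // Copy the requested bytes into out if one cached range fully covers them.
  // Blocks until that range's background read has finished.
  bool read(uint64_t offset, uint64_t length, std::string& out) const {
    for (const Entry& e : entries_) {
      if (offset >= e.range.offset &&
          offset + length <= e.range.offset + e.range.length) {
        out = e.data.get().substr(offset - e.range.offset, length);
        return true;
      }
    }
    return false;
  }

  // Drop every cached range that ends at or before boundary (an assumed
  // reading of "release all cached IO ranges before an offset").
  void releaseBefore(uint64_t boundary) {
    entries_.erase(std::remove_if(entries_.begin(), entries_.end(),
                                  [boundary](const Entry& e) {
                                    return e.range.offset + e.range.length <= boundary;
                                  }),
                   entries_.end());
  }

 private:
  struct Entry {
    ReadRange range;
    std::shared_future<std::string> data;
  };
  std::shared_ptr<const std::string> file_;
  std::vector<Entry> entries_;
};
```

In this sketch a caller would pass the output of coalesce to RangeCache::cache before decoding, call read while decoding, and call releaseBefore once decoding has moved past an offset. Merging nearby ranges trades a few wasted bytes for fewer, larger reads, which usually helps on high-latency storage.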
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?