Improve cudf::io::datasource::create(). #17115
base: branch-24.12
44c3e36 to a060422
Introduce new `datasource_kind` and `datasource_params` data types, and update the `cudf::io::datasource::create()` signature to allow parameterized datasource creation. Additionally, implement new datasources:
- host_source: base class that does simple host-based pread() calls
- odirect_source: derivation of above that uses O_DIRECT
- kvikio_source: simple Kvikio-based class (that does not fall back to mmap)
a060422 to fe981b1
  // parameter in the `kvikio_datasource_params`.
  new_params.use_compat_mode = true;
} else if (kind == datasource_kind::KVIKIO_GDS) {
  // GDS is unique in that we are expected to throw a cudf::runtime_error
Nit: s/runtime_error/logic_error/
/**
 * @brief The threshold at which the data source will switch from using
 * host-based reads to device-based (i.e. GPUDirect) reads, if GPUDirect is
Nit: I've used GDS everywhere else--change these GPUDirect occurrences to GDS too, for consistency.
@@ -92,15 +299,23 @@ class datasource {
 * this case, `max_size_estimate` can include padding after the byte range, to include additional
 * data that may be needed for processing.
 *
 * @throws cudf::logic_error if the minimum size estimate is greater than the maximum size estimate
Ah, dodgy rebase! This should be removed.
  // Copy the user-provided parameters into our local variable.
  new_params = *odirect_params;
} else {
  throw cudf::logic_error("Invalid parameters for O_DIRECT-based datasource.");
Use CUDF_FAIL() here.
  // Copy the user-provided parameters into our local variable.
  new_params = *kvikio_params;
} else {
  throw cudf::logic_error("Invalid parameters for KVIKIO-based datasource.");
Use CUDF_FAIL() here.
This PR introduces new functionality to `cudf::io::datasource` that allows for greater control over backend datasource creation. Specifically, the static factory `create()` method has been expanded to take `datasource_kind` and `datasource_params` that can be used to parameterize the datasource creation.
Also introduced are three new datasources:
- host_source: base class that does simple host-based pread() calls
- odirect_source: derivation of above that uses O_DIRECT
- kvikio_source: simple Kvikio-based class (that does not fall back to mmap)
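For readers unfamiliar with the pattern, here is a minimal sketch of what a host-based `pread()` source could look like. The class name `host_file_reader` and its `host_read()` method are hypothetical stand-ins for illustration and are not the `host_source` implementation from this PR; only the POSIX calls (`open`, `pread`, `close`) are real.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <string>
#include <system_error>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical stand-in for a host-based source; not the cudf class.
class host_file_reader {
 public:
  explicit host_file_reader(std::string const& path) : fd_{::open(path.c_str(), O_RDONLY)}
  {
    if (fd_ < 0) {
      throw std::system_error(errno, std::generic_category(), "open failed: " + path);
    }
  }

  ~host_file_reader()
  {
    if (fd_ >= 0) { ::close(fd_); }
  }

  host_file_reader(host_file_reader const&)            = delete;
  host_file_reader& operator=(host_file_reader const&) = delete;

  // Reads up to `size` bytes starting at `offset` into `dst`.
  // Returns the number of bytes actually read, which is smaller than
  // `size` only when the end of the file is reached.
  std::size_t host_read(std::size_t offset, std::size_t size, std::uint8_t* dst) const
  {
    assert(dst != nullptr || size == 0);
    std::size_t total = 0;
    while (total < size) {
      ssize_t const n =
        ::pread(fd_, dst + total, size - total, static_cast<off_t>(offset + total));
      if (n < 0) {
        if (errno == EINTR) { continue; }  // interrupted by a signal: retry
        throw std::system_error(errno, std::generic_category(), "pread failed");
      }
      if (n == 0) { break; }  // end of file
      total += static_cast<std::size_t>(n);
    }
    return total;
  }

 private:
  int fd_;
};
```

The loop matters because `pread()` may return fewer bytes than requested even when more data is available, so a single call is not enough for a reliable read.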
To NVIDIA cudf folks: this undoubtedly warrants some design discussion as I'm introducing new ways of doing things that might not align with in-flight or planned tasks for the datasource component. Happy to jump on a call and discuss. One thing that stands out to me is that the `KVIKIO` vs `KVIKIO_COMPAT` vs `KVIKIO_GDS` kinds feel a bit hacky and/or like a leaky abstraction, especially when you factor in alternate configuration like cufile or env vars.
The idea behind the PR overall is that callers can have much more control over the exact kind of datasource they want created, including fast-fail if they're expecting to create a GDS-accelerated source but can't at runtime, for whatever reason.
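To make the fast-fail idea concrete, here is a rough, self-contained sketch of how a kind plus optional parameters might be resolved before a source is constructed. Every name in it (`source_kind`, `source_params`, `resolve_source`, the `gds_available` flag) is a hypothetical stand-in and not the cudf API; it mirrors the branches visible in the review snippets (compat mode forced on for the compat kind, an error for mismatched parameter types, an error when GDS is requested but unavailable) under those assumptions.

```cpp
#include <stdexcept>
#include <variant>

// All names below are hypothetical stand-ins, not the cudf API.
enum class source_kind { HOST, ODIRECT, KVIKIO, KVIKIO_COMPAT, KVIKIO_GDS };

struct odirect_params {};  // placeholder; real parameters are not shown here
struct kvikio_params {
  bool use_compat_mode = false;
};

// std::monostate means "caller supplied no parameters, use defaults".
using source_params = std::variant<std::monostate, odirect_params, kvikio_params>;

struct resolved_source {
  source_kind kind;
  source_params params;
};

// Validates the (kind, params) pair and fills in defaults.
// `gds_available` stands in for whatever runtime check a real
// implementation would perform.
inline resolved_source resolve_source(source_kind kind,
                                      source_params const& params,
                                      bool gds_available)
{
  bool const no_params = std::holds_alternative<std::monostate>(params);
  switch (kind) {
    case source_kind::HOST: return {kind, std::monostate{}};
    case source_kind::ODIRECT: {
      odirect_params resolved{};
      if (auto const* p = std::get_if<odirect_params>(&params)) {
        resolved = *p;  // copy the user-provided parameters
      } else if (!no_params) {
        throw std::logic_error("Invalid parameters for O_DIRECT-based datasource.");
      }
      return {kind, resolved};
    }
    case source_kind::KVIKIO:
    case source_kind::KVIKIO_COMPAT:
    case source_kind::KVIKIO_GDS: {
      kvikio_params resolved{};
      if (auto const* p = std::get_if<kvikio_params>(&params)) {
        resolved = *p;  // copy the user-provided parameters
      } else if (!no_params) {
        throw std::logic_error("Invalid parameters for KVIKIO-based datasource.");
      }
      if (kind == source_kind::KVIKIO_COMPAT) { resolved.use_compat_mode = true; }
      if (kind == source_kind::KVIKIO_GDS) {
        // Fast-fail: the caller asked for GDS specifically, so do not
        // silently fall back to a slower path.
        if (!gds_available) { throw std::logic_error("GDS requested but not available."); }
        resolved.use_compat_mode = false;
      }
      return {kind, resolved};
    }
  }
  throw std::logic_error("Unknown datasource kind.");
}
```

A caller that needs a guaranteed GDS path would request `source_kind::KVIKIO_GDS` and treat the exception as a configuration error, rather than discovering a silent fallback later from poor throughput.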
The `datasource_kind::ODIRECT` kind is also extremely useful for eliminating the variance associated with the page cache when doing back-to-back runs on large data sets (where the presence or absence of data in the cache has a huge impact on runtime).
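One practical detail of O_DIRECT reads is alignment: on Linux, the file offset, the transfer length, and the buffer address generally need to be aligned to the device's logical block size. The sketch below shows the offset arithmetic only. The helper name `align_for_odirect` and the 4096-byte default are assumptions for illustration and are not taken from this PR; a real implementation would query the device and allocate the buffer with something like `posix_memalign()`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: the aligned window that covers a requested byte range.
struct aligned_range {
  std::size_t offset;  // aligned-down file offset to start reading from
  std::size_t size;    // number of bytes to read, a multiple of the alignment
  std::size_t skip;    // bytes to skip in the buffer to reach the requested data
};

// `alignment` must be a power of two; 4096 is an assumed default.
inline aligned_range align_for_odirect(std::size_t offset,
                                       std::size_t size,
                                       std::size_t alignment = 4096)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  std::size_t const start = offset & ~(alignment - 1);                         // round down
  std::size_t const end   = (offset + size + alignment - 1) & ~(alignment - 1);  // round up
  return {start, end - start, offset - start};
}
```

For example, a request for 100 bytes at offset 5000 becomes a 4096-byte read at offset 4096, with the requested data starting 904 bytes into the buffer.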