Improve cudf::io::datasource::create(). #17115

Open: tpn wants to merge 1 commit into branch-24.12 from 17110-fea-improve-cudfiodatasourcecreate


@tpn tpn commented Oct 17, 2024

This PR gives callers greater control over how cudf::io::datasource creates its backend datasource. Specifically, the static create() factory method now accepts a datasource_kind and datasource_params, which together parameterize datasource creation.

Also introduced are three new datasources:
- host_source: base class that performs simple host-based pread() calls
- odirect_source: derived from host_source; uses O_DIRECT
- kvikio_source: simple KvikIO-based class (that does not fall back to mmap)
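
For readers unfamiliar with the pattern, here is a minimal sketch of what a pread()-based host read can look like. It is not the code in this PR: the class name `simple_host_source`, its `host_read` method, and the `read_back_demo` helper are hypothetical and exist only for illustration. Only the POSIX calls (`open`, `pread`, `close`, `mkstemp`, `write`, `unlink`) are real.

```cpp
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a pread()-based host source (not the PR's class).
class simple_host_source {
 public:
  explicit simple_host_source(std::string const& path) : fd_{::open(path.c_str(), O_RDONLY)}
  {
    if (fd_ < 0) { throw std::runtime_error("cannot open " + path); }
  }
  ~simple_host_source() { ::close(fd_); }
  simple_host_source(simple_host_source const&)            = delete;
  simple_host_source& operator=(simple_host_source const&) = delete;

  // Reads up to `size` bytes starting at `offset` into `dst`.
  // Returns the number of bytes actually read (short only at end of file).
  std::size_t host_read(std::size_t offset, std::size_t size, char* dst) const
  {
    std::size_t total = 0;
    while (total < size) {
      ssize_t const n =
        ::pread(fd_, dst + total, size - total, static_cast<off_t>(offset + total));
      if (n < 0) {
        if (errno == EINTR) { continue; }  // interrupted by a signal: retry
        throw std::runtime_error("pread failed");
      }
      if (n == 0) { break; }  // end of file
      total += static_cast<std::size_t>(n);
    }
    return total;
  }

 private:
  int fd_;
};

// Usage: writes a small temporary file, then reads bytes [6, 11) back.
std::string read_back_demo()
{
  char path[]  = "/tmp/host_source_demo_XXXXXX";
  int const fd = ::mkstemp(path);
  if (fd < 0) { throw std::runtime_error("mkstemp failed"); }
  std::string const payload = "hello world";
  ssize_t const written     = ::write(fd, payload.data(), payload.size());
  ::close(fd);
  if (written != static_cast<ssize_t>(payload.size())) {
    ::unlink(path);
    throw std::runtime_error("write failed");
  }
  std::string out(5, '\0');
  {
    simple_host_source src{path};
    out.resize(src.host_read(6, out.size(), out.data()));
  }
  ::unlink(path);
  return out;
}
```

Because pread() takes an explicit offset and does not move the file position, a source built this way can serve independent reads without seeking.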

To NVIDIA cudf folks: this undoubtedly warrants some design discussion, as I'm introducing new ways of doing things that might not align with in-flight or planned tasks for the datasource component. Happy to jump on a call and discuss. One thing that stands out to me is that the KVIKIO vs KVIKIO_COMPAT vs KVIKIO_GDS kinds feel a bit hacky and/or like a leaky abstraction, especially when you factor in alternate configuration such as cuFile or env vars.

The idea behind the PR overall is that callers can have much more control over the exact kind of datasource they want created, including fast-fail if they're expecting to create a GDS-accelerated source but can't at runtime, for whatever reason.
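
As a rough illustration of the idea, the sketch below shows how a kind plus optional parameters could be resolved, including the fast-fail path for GDS. It is an assumption-laden stand-in, not the PR's implementation. The identifiers `datasource_kind`, `datasource_params`, `kvikio_datasource_params`, and `use_compat_mode` echo names that appear in this PR, but their definitions here are guesses. `odirect_datasource_params`, its `sector_size` field, `resolved_source`, `resolve`, and the `gds_available` flag are entirely hypothetical, and `std::logic_error` stands in for `cudf::logic_error`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <variant>

// Illustrative definitions only; the real types in the PR may differ.
enum class datasource_kind { KVIKIO, KVIKIO_COMPAT, KVIKIO_GDS, ODIRECT };

struct kvikio_datasource_params {
  bool use_compat_mode = false;
};

struct odirect_datasource_params {
  std::size_t sector_size = 512;  // hypothetical field, for illustration
};

using datasource_params = std::variant<kvikio_datasource_params, odirect_datasource_params>;

// What the factory decided to build. A real factory would return a datasource.
struct resolved_source {
  datasource_kind kind;
  datasource_params params;
};

// `gds_available` stands in for whatever runtime capability probe is used.
resolved_source resolve(datasource_kind kind,
                        std::optional<datasource_params> params,
                        bool gds_available)
{
  if (kind == datasource_kind::ODIRECT) {
    odirect_datasource_params new_params{};
    if (params) {
      auto const* user = std::get_if<odirect_datasource_params>(&*params);
      if (user == nullptr) {
        throw std::logic_error("Invalid parameters for O_DIRECT-based datasource.");
      }
      new_params = *user;  // copy the user-provided parameters
    }
    return resolved_source{kind, datasource_params{new_params}};
  }

  // All remaining kinds are KvikIO-based.
  kvikio_datasource_params new_params{};
  if (params) {
    auto const* user = std::get_if<kvikio_datasource_params>(&*params);
    if (user == nullptr) {
      throw std::logic_error("Invalid parameters for KVIKIO-based datasource.");
    }
    new_params = *user;
  }
  if (kind == datasource_kind::KVIKIO_COMPAT) {
    new_params.use_compat_mode = true;  // the kind overrides the parameter
  } else if (kind == datasource_kind::KVIKIO_GDS) {
    // Fast-fail: the caller asked for GDS explicitly, so do not fall back.
    if (!gds_available) { throw std::logic_error("GDS requested but not available."); }
    new_params.use_compat_mode = false;
  }
  return resolved_source{kind, datasource_params{new_params}};
}
```

In this sketch, passing no parameters yields defaults for the requested kind, while passing parameters of the wrong type is treated as a caller error.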

The datasource_kind::ODIRECT kind is also extremely useful for eliminating the variance associated with the page cache when doing back-to-back runs over large data sets (where the presence or absence of data in the cache has a huge impact on runtime).
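
O_DIRECT reads bypass the page cache, but on Linux they generally require the file offset, the length, and the buffer address to be aligned; the exact requirement depends on the filesystem and device. The sketch below shows one way to widen an arbitrary byte range to aligned boundaries and then trim the result. It is not taken from this PR: `aligned_range`, `align_for_odirect`, and `odirect_read` are hypothetical names, and the caller is assumed to pass a suitable power-of-two `alignment`.

```cpp
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct aligned_range {
  std::size_t offset;  // requested offset rounded down to the alignment
  std::size_t size;    // aligned length that covers the whole request
  std::size_t skip;    // bytes to discard at the front of the aligned buffer
};

// `alignment` must be a power of two (assumed, not checked).
constexpr aligned_range align_for_odirect(std::size_t offset,
                                          std::size_t size,
                                          std::size_t alignment)
{
  std::size_t const mask  = alignment - 1;
  std::size_t const start = offset & ~mask;
  std::size_t const end   = (offset + size + mask) & ~mask;
  return {start, end - start, offset - start};
}

#ifdef O_DIRECT  // Linux-specific flag; skipped on platforms that lack it
// Reads [offset, offset + size) with O_DIRECT. Returns an empty vector on any
// failure. Issues a single pread(), so a short read is returned as-is.
std::vector<unsigned char> odirect_read(char const* path,
                                        std::size_t offset,
                                        std::size_t size,
                                        std::size_t alignment)
{
  std::vector<unsigned char> out;
  int const fd = ::open(path, O_RDONLY | O_DIRECT);
  if (fd < 0) { return out; }  // some filesystems reject O_DIRECT
  auto const r = align_for_odirect(offset, size, alignment);
  void* buf    = nullptr;
  if (::posix_memalign(&buf, alignment, r.size) == 0) {
    ssize_t const n = ::pread(fd, buf, r.size, static_cast<off_t>(r.offset));
    if (n > static_cast<ssize_t>(r.skip)) {
      std::size_t const avail = static_cast<std::size_t>(n) - r.skip;
      auto const* p           = static_cast<unsigned char const*>(buf) + r.skip;
      out.assign(p, p + std::min(avail, size));  // trim to the requested range
    }
    ::free(buf);
  }
  ::close(fd);
  return out;
}
#endif
```

For example, a request for 100 bytes at offset 1000 with 512-byte alignment becomes a 1024-byte read at offset 512, of which the first 488 bytes are discarded.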

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Oct 17, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot added the libcudf label (Affects libcudf (C++/CUDA) code) on Oct 17, 2024
tpn force-pushed the 17110-fea-improve-cudfiodatasourcecreate branch from 44c3e36 to a060422 on October 21, 2024 03:14
Introduce new `datasource_kind` and `datasource_params` data types, and
update the cudf::io::datasource::create() signature to allow
parameterized datasource creation.

Additionally, implement new datasources:

    - host_source: base class that does simple host-based pread() calls
    - odirect_source: derivation of above that uses O_DIRECT
    - kvikio_source: simple Kvikio-based class (that does not fall back
      to mmap)
tpn force-pushed the 17110-fea-improve-cudfiodatasourcecreate branch from a060422 to fe981b1 on October 21, 2024 03:16
tpn marked this pull request as ready for review on October 21, 2024 03:26
tpn requested a review from a team as a code owner on October 21, 2024 03:26
tpn changed the title from "WIP: Improve cudf::io::datasource::create()." to "Improve cudf::io::datasource::create()." on Oct 21, 2024

// parameter in the `kvikio_datasource_params`.
new_params.use_compat_mode = true;
} else if (kind == datasource_kind::KVIKIO_GDS) {
// GDS is unique in that we are expected to throw a cudf::runtime_error

Nit: s/runtime_error/logic_error/


/**
* @brief The threshold at which the data source will switch from using
* host-based reads to device-based (i.e. GPUDirect) reads, if GPUDirect is

Nit: I've used GDS everywhere else--change these GPUDirect occurrences to GDS too, for consistency.

@@ -92,15 +299,23 @@ class datasource {
* this case, `max_size_estimate` can include padding after the byte range, to include additional
* data that may be needed for processing.
*
* @throws cudf::logic_error if the minimum size estimate is greater than the maximum size estimate
@tpn tpn Oct 21, 2024

Ah, dodgy rebase! This should be removed.

// Copy the user-provided parameters into our local variable.
new_params = *odirect_params;
} else {
throw cudf::logic_error("Invalid parameters for O_DIRECT-based datasource.");

Use CUDF_FAIL() here.

// Copy the user-provided parameters into our local variable.
new_params = *kvikio_params;
} else {
throw cudf::logic_error("Invalid parameters for KVIKIO-based datasource.");

Use CUDF_FAIL() here.
