Kedro dataset CLI commands #3714

Open
datajoely opened this issue Mar 14, 2024 · 0 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Description

Related to the overall plug-in epic in #583, I've been thinking about both the Kedro team's own maintenance burden and the friction I see users hit when working with dataset contributions today.

Context

At a high level, the following points contribute to this status quo:

  • Datasets are hard to maintain
  • Dataset contributions are welcome but the barrier is high, often prohibitively so
  • Datasets that should be contributed never are
  • Dataset PRs take ages to be merged/released
  • Lots of copying and pasting is happening
  • fsspec boilerplate overhead in every single file-based class
  • Poor metrics on popularity through docs/CLI telemetry

Possible Implementation

I suggest Kedro introduce a set of CLI commands focused on this dataset workflow. We have a history of these ideas in the micropackaging journey as well.

They would all follow the kedro dataset <command> pattern:

| Command | Priority | Description |
| --- | --- | --- |
| pull | P0 | Accepts a kedro-datasets name as written in the catalog, e.g. polars.GenericDataSet. It would pull the source code, add the dependencies and provide an example catalog entry. Longer term we could think about how 3rd party polyrepos could work, e.g. (1) (2). |
| create | P0 | Create a class in the user's environment with the correct structure; this may need separate workflows for file-based (fsspec) and other datasets. Gets the user's contribution ready on day 1, and can even include test and lint rules. |
| install | P2 | Provide an easy wrapper over the correct pip command, adding the dependency to your project and providing an example catalog entry. |
| contribute | P2 | Provide a workflow for pushing the results of pulls/creates back into the open source project. |
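
To make the shape of the proposal concrete, here is a rough sketch of how the four subcommands could hang off a single `kedro dataset` entry point. None of this exists in Kedro today: the function names, the argument name and the `my_dataset` placeholder are all hypothetical. Kedro's real CLI is built on Click; the sketch uses the standard library's argparse only so that it runs without extra dependencies. The catalog entry shows just the `type` key, since the remaining keys depend on the dataset.

```python
import argparse

# Priorities and one-line summaries taken from the table above.
COMMANDS = {
    "pull": ("P0", "Copy a dataset's source into the project"),
    "create": ("P0", "Scaffold a new dataset class"),
    "install": ("P2", "Wrap the pip install for a dataset"),
    "contribute": ("P2", "Push a pulled/created dataset upstream"),
}


def example_catalog_entry(dataset_type: str, entry_name: str = "my_dataset") -> str:
    """Return an illustrative catalog entry (hypothetical helper).

    Only the `type` key is emitted; other keys vary per dataset.
    """
    return f"{entry_name}:\n  type: {dataset_type}\n"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the `kedro dataset <command>` pattern."""
    parser = argparse.ArgumentParser(prog="kedro dataset")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, (priority, summary) in COMMANDS.items():
        cmd = sub.add_parser(name, help=f"[{priority}] {summary}")
        cmd.add_argument("dataset", help="dataset name, e.g. polars.GenericDataSet")
    return parser


def main(argv=None) -> str:
    """Dispatch a subcommand and return the text it would print."""
    args = build_parser().parse_args(argv)
    if args.command in ("pull", "install"):
        # Both commands are described as ending with an example catalog entry.
        return example_catalog_entry(args.dataset)
    # `create` and `contribute` need real project context, so they stay stubs.
    return f"'{args.command}' is not sketched here (dataset: {args.dataset})"
```

Calling `main(["pull", "polars.GenericDataSet"])` returns a two-line catalog snippet with `type: polars.GenericDataSet`. In a real implementation these would more likely be Click commands registered through Kedro's plugin mechanism, with `pull` and `install` also handling the source code and dependencies.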