Related to the overall plug-in epic of #583, I've been thinking about both the Kedro team's own maintenance burden and the user friction I see around dataset contributions today.
Context
At a high level the following points contribute to this status quo:
Datasets are hard to maintain
Dataset contributions are welcome but the barrier is high, often prohibitively so
Datasets that should be contributed never are
Dataset PRs take ages to be merged/released
Lots of copying and pasting is happening
fsspec boilerplate overhead in every single file-based class.
Poor metrics on popularity through docs/cli telemetry.
Possible Implementation
I suggest Kedro introduce a set of CLI commands focused on this dataset workflow. There is precedent for these ideas in the micropackaging journey as well.
They would all follow the kedro dataset <command> pattern:
| command | priority | description |
| --- | --- | --- |
| pull | P0 | Accepts a kedro-datasets name as per the catalog, e.g. polars.GenericDataSet. It would pull the source code, add the dependencies and provide an example catalog entry. Longer term we could think about how 3rd party polyrepos could work e.g. (1) (2) |
| create | P0 | Creates a class in the user's environment with the correct structure; may need separate workflows for file-based (fsspec) datasets and everything else. Gets the user's contribution ready on day 1, and can even include test and lint rules. |
| install | P2 | Provides an easy wrapper over the correct pip command, adding the dependency to your project and providing an example catalog entry. |
| contribute | P2 | Provides a workflow for pushing the results of pulls/creates back into the open source project. |
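To make the proposed command surface a bit more concrete, here is a rough sketch of how `kedro dataset <command>` could be wired up. It is only an illustration: none of this exists in Kedro today, it uses the standard library's `argparse` rather than Kedro's real CLI machinery, and the helper names (`parse_dataset_name`, `example_catalog_entry`, `plan`), the `--file-based` flag and the printed steps are all hypothetical.

```python
import argparse


def parse_dataset_name(name):
    """Split a catalog-style name like 'polars.GenericDataSet' into (module, class)."""
    module, sep, class_name = name.rpartition(".")
    if not sep or not module or not class_name:
        raise ValueError(f"expected '<module>.<ClassName>', got {name!r}")
    return module, class_name


def example_catalog_entry(name, entry="my_dataset"):
    """Render a placeholder catalog entry; real arguments depend on the dataset."""
    parse_dataset_name(name)  # validate the name before rendering
    return f"{entry}:\n  type: {name}\n  # dataset-specific arguments go here\n"


def build_parser():
    """Hypothetical parser for `kedro dataset <command>` (not a real Kedro CLI)."""
    parser = argparse.ArgumentParser(prog="kedro dataset")
    sub = parser.add_subparsers(dest="command", required=True)
    for command in ("pull", "install"):
        sub.add_parser(command).add_argument("name")
    create = sub.add_parser("create")
    create.add_argument("class_name")
    # Assumed flag: choose the file-based (fsspec) template or a plain one.
    create.add_argument("--file-based", action="store_true")
    sub.add_parser("contribute")
    return parser


def plan(argv):
    """Return the steps each command would perform, without doing any of them."""
    args = build_parser().parse_args(argv)
    if args.command == "pull":
        module, class_name = parse_dataset_name(args.name)
        return [
            f"copy source of {class_name} from kedro-datasets module '{module}'",
            "add the dataset's dependencies to the project",
            "print an example catalog entry",
        ]
    if args.command == "install":
        parse_dataset_name(args.name)
        return [
            "run the matching pip install command",
            "add the dependency to the project",
            "print an example catalog entry",
        ]
    if args.command == "create":
        template = "file-based (fsspec)" if args.file_based else "generic"
        return [f"scaffold {args.class_name} from the {template} template",
                "add test and lint scaffolding"]
    return ["push pulled/created datasets back to the open source project"]


if __name__ == "__main__":
    for step in plan(["pull", "polars.GenericDataSet"]):
        print(step)
    print(example_catalog_entry("polars.GenericDataSet"))
```

Returning a plan instead of acting keeps the sketch side-effect free; a real implementation would replace each step with the actual copy, dependency and scaffolding logic.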