Decide on definitions of regular and experimental contributions #583
My thinking around this is as follows:

**Regular dataset contributions**

**Experimental dataset contributions**

Datasets that don't meet the above criteria.
Do we need to run the dataset manually at least? And should the contribution include some guide on how to install the dependencies? I'd add that if similar datasets already exist, we should try to merge them instead of having 10 different contributions for the same thing.
When we started discussing this I had a simpler workflow in mind. I'd like to give some holistic feedback on #581, #582, #583 (this issue) and propose an alternative, simpler way forward.

**First, some remarks**
I think this can open us up to a lot of frustration. The experimental path was meant to lower the barrier to entry and lift some burden from us, but making someone commit to "own" the dataset seems to raise it. If we do so, they might as well keep their dataset closed source, or publish it in their own repo (which is fine, keep reading). And even if they commit to "owning" it, they can leave whenever they want, and this will happen. So going forward we'll need to keep track of which datasets are orphaned (hence do a lot of #581 and #582). I think this is a lot of toil.
(@noklam on #581 (comment)) At the moment, contributing datasets is really hard, from a technical perspective. #165 was open for half a year, and in the end we had to skip some tests. #435 took ages to merge. #580 is having CI failures just when it was about to be merged. We are looking at a systemic issue here which has nothing to do with the experimental process. Even if these authors wanted to make these datasets "official"/tier-1/regular, the process would have been quite painful. The reality is that running the CI for the amount of datasets we have and the current design of datasets, which requires tons of mocking (kedro-org/kedro#1936, kedro-org/kedro#1778), is just hard. Installing all the dependencies of all datasets (essentially needed to run the tests and also to build the docs) is getting more and more difficult.
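To illustrate the mocking burden described above, here is a minimal, hypothetical sketch of what testing a remote-storage dataset tends to look like. The `S3CSVDataset` class and its methods are invented purely for illustration; they are not actual kedro-datasets code:

```python
from unittest import mock

# Hypothetical dataset class, invented to illustrate the testing burden;
# real kedro-datasets classes are considerably more involved.
class S3CSVDataset:
    def __init__(self, path, client):
        self._path = path
        self._client = client  # stands in for a boto3/fsspec-style client

    def load(self):
        body = self._client.get_object(self._path)
        return body.decode("utf-8").splitlines()

def test_load_with_mocked_client():
    # The remote dependency has to be mocked out entirely -- multiply this
    # by every storage backend and every dataset, and CI gets heavy fast.
    client = mock.Mock()
    client.get_object.return_value = b"a,b\n1,2"
    ds = S3CSVDataset("s3://bucket/file.csv", client)
    assert ds.load() == ["a,b", "1,2"]
    client.get_object.assert_called_once_with("s3://bucket/file.csv")
```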
(myself in #535 (comment))

**The problem with "visibility"**

At the same time, we seem to be stuck in very long discussions, making it seem like the only valid way to accept a dataset is merging it to kedro-datasets.

All in all, I think this set of requirements (have a reasonable CI to maintain, have a lightweight governance on this repo, and have everything on kedro-datasets) is impossible to satisfy.

**Alternative proposal**
By doing this, we will likely reduce the surface area of kedro-datasets, make it clearer that it's maintained by the TSC, and tie the governance to the existing process we have, without adding more.
Additional criteria I would like to provide for core datasets vs. experimental:
I would still like to see a higher standard than "random dataset implementation found on the internet." People will still use a dataset in Kedro-Datasets even if it's experimental, and I don't think we want to say, "I have no idea if this even runs, or how to use it." The nightly build process seems fine (to make CI lighter), but I think it could largely contain a similar test suite?
Sounds good.
I don't think anybody should become a maintainer because of work on a single dataset; however, it can be evidence combined with other things that they could be considered to become a maintainer.
Will there be primary (TSC) ownership listed for core datasets, then? Or at least for the more "niche" ones? Otherwise, how do you know who owned it to begin with? I think this is fine, just checking. We can easily do this with CODEOWNERS. :)
Ehh... I still disagree with this, unless we take a different structure (i.e. other datasets in
The results of the survey are in and they are as follows:

**Characteristics of regular datasets**

1. Datasets that the Kedro TSC is willing to maintain: 100% agreement (13/13)

There's strong agreement on the first 5 points, and slight majority agreement on the last 2 points. With that, I propose to have the first 5 as "must-haves" for regular datasets and the last 2 as "should-haves". This will be described clearly in the new contribution docs #579.

**Characteristics of experimental datasets**

1. Datasets that the Kedro TSC doesn't want to maintain: 100% agreement (13/13)

There's strong agreement on all these points, and so they will be included as is in the new contribution docs #579.

**Statements on process**

*Contribution*

1. Any new contribution will be considered as an experimental contribution: 8 agree (62%), 5 disagree (38%)

Slight majority agreement, but this is quite a fundamental point of the new process, so I suggest that every new contribution is considered independently and the TSC decides whether it's a regular or experimental contribution, instead of making it experimental by default.

*Graduation*

1. Anyone (TSC members + users) can trigger a graduation process: 100% agreement (13/13)

*Demotion*

1. The demotion process will be triggered by someone from the TSC when a dataset isn't deemed fit as a regular contribution anymore: 92% agreement (12/13)

Majority agreement on all points, but there were several comments saying that 2/3 approval from the TSC is too rigid and hard to get, increasing the difficulty of contributing. So instead, I'll change this to be a 1/2 approval from the TSC. All of that will be included in the new contribution docs.

**Still to discuss:**
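For concreteness, the difference between the two approval thresholds with the 13-member TSC from the survey can be sketched as follows. The exact rounding rule (here, at least the ceiling of the fraction) is an assumption, not something the thread specifies:

```python
import math

def approvals_needed(tsc_size: int, fraction: float) -> int:
    # Assumes "X approval" means at least ceil(fraction * tsc_size) votes.
    return math.ceil(fraction * tsc_size)

# With a 13-member TSC:
assert approvals_needed(13, 2 / 3) == 9  # old 2/3 threshold
assert approvals_needed(13, 1 / 2) == 7  # proposed 1/2 threshold
```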
Thanks a lot for the summary @merelcht !
Are we setting any minimum requirements here, like "it must work on Linux"? Or would we consider datasets that, say, only work on Windows? I think @deepyaman has voiced his opinion somewhere that Windows specifically shouldn't be a hard requirement, which I agree with. Maybe more clarity on this specific point would be helpful.
The switch to 1/2 TSC approval instead of 2/3 would be for demotion and graduation? (Including adding new datasets?)
Indeed, all examples from https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners#codeowners-syntax are for wildcards or directories, but maybe it works for individual files?
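Since CODEOWNERS uses gitignore-style patterns, an exact path should match a single file as well as a directory. A hypothetical sketch — the paths, team, and username here are illustrative, not real assignments:

```
# Hypothetical CODEOWNERS entries -- usernames and paths are examples only.

# Directory-level ownership, as in the GitHub docs examples:
/kedro-datasets/kedro_datasets/experimental/  @kedro-org/some-team

# An exact file path should also match, giving per-dataset ownership:
/kedro-datasets/kedro_datasets/pandas/csv_dataset.py  @some-maintainer
```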
Well yes, because we are saying a dataset should be able to run as part of CI, so it should at least work on one of the OS setups we have in CI. We can phrase it as: it "must" work on at least one of them, and ideally "should" work on all of them.
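As a sketch of how that "must work on at least one OS, should work on all" rule could look in a GitHub Actions matrix — the job name, install command, and non-blocking rule are all assumptions, not the repo's actual workflow:

```yaml
# Hypothetical CI matrix: Linux is required, other OSes are advisory.
jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    # Only the Linux leg blocks the merge; Windows/macOS report but don't gate.
    continue-on-error: ${{ matrix.os != 'ubuntu-latest' }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install ".[test]" && pytest
```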
Yes, let me clarify that in my summary above.
Yeah, I think it might be possible, but after reading https://www.arnica.io/blog/what-every-developer-should-know-about-github-codeowners I wasn't so sure CODEOWNERS is really the way we want to go.
I'm closing this as completed and will proceed with #579. We will discuss dependency resolution requirements for regular and experimental datasets separately.
Description
Establish clear criteria and guidelines for determining when a contribution is considered a regular (non-experimental) contribution versus an experimental one. This will help contributors and maintainers understand the expectations and classification of datasets.
Context
#517