ENH: combine metadata + "splits json" #4

Open
NickleDave opened this issue Sep 15, 2024 · 0 comments
Currently the dataset class built into vak that loads splits from CMACBench uses the filename of the splits path to determine metadata about the split. We use this metadata directly when we need the duration of a frame, and indirectly when we need to determine which labelmap to load, based on the biosound group, unit, and ID (all three things we currently consider metadata). I am already in the process of refactoring so that we can specify a different labelmap, e.g. through a class method: vocalpy/vak#776

I am realizing that a better way to handle this might be to replace the dataset parameter "splits_path" with "metadata_path".

Here's my logic: we determine the metadata by using an (undeclared) naming scheme. This is fragile; if the naming scheme changes, the function breaks. This makes it harder to run any experiment that is slightly different from what is captured by the naming scheme. E.g., if we train models on multi-species datasets, then we are no longer thinking about IDs within one species / biosound group, and so it's not meaningful to put the ID in the filename for the splits.

So in theory we have the flexibility to specify different splits through splits_path, but in practice as soon as we do anything besides the exact experiments prescribed by CMACBench, we break the naming scheme. To get around this we have to use a hack where we put some placeholder in the field of the naming scheme that is not relevant (e.g. a fake ID like "all-species").

The same change in experiments (species instead of ID) also breaks the logic in the vak dataset class that relies on group name + unit name + ID to determine which labelmap to use. But we don't actually use any of the rest of the metadata (group, unit, ID, etc.) to train the model.
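To make the fragility concrete, here is a minimal sketch of filename-based parsing. The naming scheme, the function name `parse_splits_filename`, and the example filenames are all hypothetical, made up for illustration; this is not the actual CMACBench scheme or the vak code.

```python
from pathlib import Path


# Hypothetical scheme, NOT the real CMACBench one:
#   <group>.<unit>.<id>.frame-dur-<value>.splits.json
def parse_splits_filename(splits_path):
    """Recover metadata from the filename alone (the fragile approach)."""
    name = Path(splits_path).name
    # Every field is positional, so a missing field shifts all the others.
    group, unit, id_, frame_dur, *_ = name.split(".")
    return {
        "group": group,
        "unit": unit,
        "id": id_,
        "frame_dur": float(frame_dur.removeprefix("frame-dur-")),
    }


# Works only for files that follow the scheme exactly.
ok = parse_splits_filename("groupA.unitX.id01.frame-dur-1.splits.json")

# A multi-species split has no meaningful ID, so we need a placeholder
# just to keep the parser happy.
hack = parse_splits_filename("multi.unitX.all-species.frame-dur-1.splits.json")
```

With this sketch, leaving out the ID field (e.g. `multi.unitX.frame-dur-1.splits.json`) raises a `ValueError`, because the frame duration is read from the wrong position. That is the kind of breakage the placeholder hack works around.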

So instead of relying on a naming scheme, or having a dataclass that represents all the metadata as we do in this repo, I think we should just put the metadata directly in the json file when we prep the dataset here, and then the only thing vak needs to know is that it can get the exact metadata that it needs out of that file.

We already save metadata for each split when we make the splits, but it's in one big separate json file. Also, the built-in vak datapipe classes already do something similar: they use metadata that is saved in a json file during the prep step.

So the change here is just to save a "metadata.json" file for each split. We can use a naming scheme for these, but just to keep the files from overwriting each other. And then a user can provide their own metadata as long as it provides a splits path, frame duration, labelmap json path, and the bookkeeping vector paths (I guess I just declared a schema).
We can fix the split functions to do this later -- for now I will use the existing files to fix splits.
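A rough sketch of what that schema could look like in code. The class name `SplitMetadata`, the key names, and the assumption that the bookkeeping vector paths are stored as a name-to-path mapping are mine, for illustration only; the real field names and types would be decided when this is implemented.

```python
import json
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class SplitMetadata:
    """Hypothetical per-split metadata; key names are assumptions."""

    splits_path: str
    frame_dur: float  # units are not specified in this issue
    labelmap_json_path: str
    bookkeeping_vector_paths: dict  # assumed: vector name -> path

    @classmethod
    def from_json(cls, metadata_path):
        """Load metadata, failing loudly if a required key is missing."""
        with Path(metadata_path).open() as fp:
            data = json.load(fp)
        required = [f.name for f in fields(cls)]
        missing = sorted(set(required) - set(data))
        if missing:
            raise ValueError(f"metadata file is missing keys: {missing}")
        # Extra keys (group, unit, ID, ...) are allowed and ignored here.
        return cls(**{name: data[name] for name in required})

    def to_json(self, metadata_path):
        """Write the metadata as a json file next to the split."""
        with Path(metadata_path).open("w") as fp:
            json.dump(asdict(self), fp, indent=2)
```

The point of the sketch is that the loader only checks for the keys it actually needs, so a user-supplied metadata file for a multi-species experiment would work without any placeholder IDs.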
