ENH: combine metadata + "splits json" #4

NickleDave · 2024-09-15T16:23:06Z

Currently the dataset class built into vak that loads splits from CMACBench uses the filename of the splits path to determine metadata about the split. We use this metadata directly when we need the duration of a frame, and indirectly when we need to determine which labelmap to load, based on the biosound group, unit, and ID (all three things we currently consider metadata). I am already in the process of refactoring so that we can specify a different labelmap, e.g. through a class method: vocalpy/vak#776

I am realizing that a better way to handle this might be to replace the dataset parameter "splits_path" with "metadata_path".

Here's my logic: we determine the metadata by using an (undeclared) naming scheme. This is fragile; if the naming scheme changes, the function that parses it breaks. It also makes it harder to run any experiment that differs even slightly from what the naming scheme captures. E.g., if we train models on multi-species datasets, then we are no longer thinking about IDs within one species / biosound group, and so it's not meaningful to put the ID in the filename for the splits.

So in theory we have the flexibility to specify different splits through splits_path, but in practice, as soon as we do anything besides the exact experiments prescribed by CMACBench, we break the naming scheme. To get around this we have to use a hack where we put a placeholder in whichever field of the naming scheme is not relevant (e.g. a fake ID like "all-species"). The same change in experiments (species instead of ID) also breaks the logic in the vak dataset class that relies on group name + unit name + ID to determine which labelmap to use. But we don't actually use any of the rest of the metadata (group, unit, ID, etc.) to train the model.
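To make the fragility concrete, here is a minimal sketch of filename-based metadata parsing. The naming scheme, the function name `parse_split_filename`, and the field names are all hypothetical, made up for illustration; they are not the actual CMACBench scheme or vak code.

```python
from pathlib import Path


def parse_split_filename(path):
    """Parse metadata out of a splits filename.

    Assumes a HYPOTHETICAL naming scheme (not the real CMACBench one):
    <group>.<unit>.id-<ID>.frame-dur-<ms>-ms.splits.json
    """
    parts = Path(path).name.split(".")
    if len(parts) != 6 or parts[4:] != ["splits", "json"]:
        raise ValueError(f"filename does not follow naming scheme: {path}")
    group, unit, id_field, frame_dur_field = parts[:4]
    if not id_field.startswith("id-"):
        raise ValueError(f"missing 'id-' field in filename: {path}")
    frame_dur_parts = frame_dur_field.split("-")
    if (
        len(frame_dur_parts) != 4
        or frame_dur_parts[:2] != ["frame", "dur"]
        or frame_dur_parts[3] != "ms"
    ):
        raise ValueError(f"missing 'frame-dur-' field in filename: {path}")
    return {
        "group": group,
        "unit": unit,
        "id": id_field[len("id-"):],
        "frame_dur_ms": float(frame_dur_parts[2]),
    }
```

With a parser like this, a multi-species splits file only loads if it carries a placeholder such as `id-all-species`, and a file with any other name (say `my-custom-splits.json`) raises a `ValueError` even though its contents might be perfectly usable.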

So instead of relying on a naming scheme, or having a dataclass that represents all the metadata as we do in this repo, I think we should just put the metadata directly in the json file when we prep the dataset here. Then the only thing vak needs to know is that it can get the exact metadata it needs out of that file.

We already save metadata for each split when we make the splits, but it's in one big separate json file. Also, the built-in vak datapipe classes already do something similar: they use metadata that is saved in a json file during the prep step.

So the change here is just to save a "metadata.json" file for each split. We can use a naming scheme for these, but only to keep the files from overwriting each other. And then a user can provide their own metadata as long as it provides a splits path, frame duration, labelmap json path, and the bookkeeping vector paths (I guess I just declared a schema).
We can fix the split functions to do this later -- for now I will use the existing files to fix the splits.
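The schema declared above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `SplitMetadata`, the helper names, the key names, and the choice to store the bookkeeping vector paths as a name-to-path mapping are all hypothetical, and nothing here is existing vak or CMACBench code.

```python
import json
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class SplitMetadata:
    """Per-split metadata; all field names are hypothetical."""

    splits_path: str
    frame_dur: float  # units are an assumption; the issue does not state them
    labelmap_json_path: str
    # assumed layout: maps a bookkeeping vector's name to its path
    bookkeeping_vector_paths: dict


def save_metadata(metadata, path):
    """Write one metadata json file for one split."""
    Path(path).write_text(json.dumps(asdict(metadata), indent=2))


def load_metadata(path):
    """Load a metadata json file, checking that the required keys exist."""
    data = json.loads(Path(path).read_text())
    names = [f.name for f in fields(SplitMetadata)]
    missing = [name for name in names if name not in data]
    if missing:
        raise ValueError(f"metadata file {path} is missing keys: {missing}")
    # extra keys in a user-provided file are ignored rather than rejected
    return SplitMetadata(**{name: data[name] for name in names})
```

The point of the design is that the loader depends only on the keys in the file, never on the file's name, so a user-provided metadata file for a multi-species experiment works without any placeholder fields.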

NickleDave mentioned this issue Sep 15, 2024

CLN: Rename/refactor BioSoundSegBench dataset -> CMACBench vocalpy/vak#776

Open

5 tasks
