
Splits for HSCDataSet #105

Merged: mtauraso merged 3 commits into main from issue/74/splits on Oct 25, 2024
Conversation

mtauraso (Collaborator):

- HSCDataSet has been significantly modified to implement splits
- Constructors for fibad_data_sets now take a split argument (see the sketch after this list)
- Tests pass, but the new functionality is not yet tested
- New configs added in a new "prepare" section to define splits
- Some configs in "model" have been reorganized into a "train" section that is similar to the "prepare", "predict", and "download" sections in that it configures the action more than anything else
- Added a "split" config to the train and predict sections to select the split that will be used
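As a rough sketch of the constructor contract described above (the class name and config handling here are hypothetical, not the merged fibad code):

```python
from typing import Optional

from torch.utils.data import Dataset


class MyDataSet(Dataset):
    """Hypothetical fibad-style data set: constructors now take a split."""

    def __init__(self, config: dict, split: Optional[str] = None):
        # `split` names one of the splits defined in the new "prepare"
        # config section, e.g. "train", "test", or "validate".
        self.config = config
        self.split = split
```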
@mtauraso mtauraso self-assigned this Oct 23, 2024
@mtauraso mtauraso linked an issue Oct 23, 2024 that may be closed by this pull request

github-actions bot commented Oct 23, 2024

| Before [b504b62] | After [a0000dc] | Ratio | Benchmark (Parameter) |
| --- | --- | --- | --- |
| 1.78±1s | 1.25±0.8s | ~0.70 | benchmarks.time_computation |
| 3.95k | 4.06k | 1.03 | benchmarks.mem_list |



codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 84.25926% with 17 lines in your changes missing coverage. Please review.

Project coverage is 41.10%. Comparing base (dad7062) to head (f173f9b).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/fibad/data_sets/hsc_data_set.py | 93.75% | 6 Missing ⚠️ |
| src/fibad/data_sets/example_cifar_data_set.py | 20.00% | 4 Missing ⚠️ |
| src/fibad/train.py | 0.00% | 3 Missing ⚠️ |
| src/fibad/predict.py | 0.00% | 2 Missing ⚠️ |
| src/fibad/pytorch_ignite.py | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main     #105      +/-   ##
==========================================
+ Coverage   34.83%   41.10%   +6.26%
==========================================
  Files          18       18
  Lines         887     1017     +130
==========================================
+ Hits          309      418     +109
- Misses        578      599      +21
```



"""

def __init__(self, config, split: Union[str, None] = None):
mtauraso (Collaborator, Author) commented:

The default value of split=None here is to get some tests to pass short-term. The intent is for the __init__() signature of any @fibad_data_set to require either that a split be provided, or that None be passed explicitly because it was in a config.
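A sketch of the intended end state (the strict signature below is an assumption drawn from this comment, not the merged code):

```python
from typing import Union


class SomeDataSet:
    # No default: every caller must choose a split explicitly.
    def __init__(self, config: dict, split: Union[str, None]):
        self.config = config
        self.split = split


# Both call forms are then deliberate choices:
# SomeDataSet(config, split="train")  # use a named split
# SomeDataSet(config, split=None)     # the config requested no split
```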

```python
    return len(self.current_split)


class HSCDataSetSplit(Dataset):
```
mtauraso (Collaborator, Author) commented Oct 23, 2024:

This class had more functionality in the first draft, but it now looks a lot like a glorified numpy masked array.

I ended up writing it this way to ensure there's only ever one HSCDataSetContainer. I expect that object to hold bookkeeping for ~10M files at runtime, and I essentially never want a copy made of it.

If anyone has suggestions to preserve object lifetimes while rewriting this to be shorter with numpy.ma, I'm interested.
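A minimal sketch of the masked-array idea described above, with assumed names (the real HSCDataSetSplit differs):

```python
import numpy as np
from torch.utils.data import Dataset


class MaskedSplit(Dataset):
    """Wraps one shared container data set, holding only a mask so the
    container's per-file bookkeeping is never copied."""

    def __init__(self, container: Dataset, mask: np.ndarray):
        self.container = container           # shared reference, never copied
        self.indices = np.flatnonzero(mask)  # items this split exposes

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, idx: int):
        # Translate a split-relative index into a container index.
        return self.container[int(self.indices[idx])]
```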

drewoldag (Collaborator) left a comment:

Overall I think this looks good for now. The amount of extra code in the hsc_data_set module makes me a little nervous about future users needing to implement similar code for splits, but I also think that is something we can address down the line.

And I can work with Max to figure out how to adjust the kbmod-ml dataset module to work within this new paradigm.

```toml
# If `int`, represents the absolute number of test samples.
# If `false`, the value is set to the complement of the train size.
# If `train_size` is also `false`, it will be set to `0.25`.
test_size = 0.6
```
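The comments above describe scikit-learn-style complement behavior; a sketch of that resolution rule (an illustration, not the merged implementation, and assuming ratio-valued sizes for brevity):

```python
def resolve_test_size(test_size, train_size):
    """Resolve test_size per the config comments above."""
    if test_size is False:
        # Complement of the train size, or 0.25 if that is also false.
        return 0.25 if train_size is False else 1.0 - train_size
    return test_size  # a ratio (float) or an absolute count (int)
```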
A collaborator commented:

This is a trivial comment, but I would expect test to be last in the list. In my mind at least, I always think in the order train, validate, test.

Also, perhaps this is ignorance talking, but wouldn't typical values for these be the following (see the config sketch after this list)?

  • test_size = 0.2
  • train_size = 0.6
  • validation_size = 0.2
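Those suggested values as a config sketch (the "prepare" section name comes from this PR's description; the exact keys are assumptions):

```toml
[prepare]
train_size = 0.6
validation_size = 0.2
test_size = 0.2
```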

mtauraso (Collaborator, Author) replied:

I'm going to fix this because, like you say, it's trivial to fix.

I'm also going to adopt your values, because I'm not sure where I got those values from 😄

A collaborator commented:

This section looks fine for now. I just searched a bit for CIFAR10 validation sets and found an inspirational gist here: https://gist.github.com/kevinzakka/d33bf8d6c7f06a9d8c76d97a7879f5cb#file-data_loader-py-L90-L109

Also, I was unaware of the sampler parameter that can be passed to DataLoader. I do wonder whether it would be helpful, or just get in the way of using HuggingFace datasets.
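For reference, the sampler parameter mentioned above works like this (a minimal PyTorch sketch, independent of fibad):

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy data set of 100 items; CIFAR10 would slot in the same way.
data = TensorDataset(torch.arange(100))

# An 80/20 index split, each subset served by its own loader.
indices = torch.randperm(100).tolist()
train_loader = DataLoader(data, batch_size=16,
                          sampler=SubsetRandomSampler(indices[:80]))
valid_loader = DataLoader(data, batch_size=16,
                          sampler=SubsetRandomSampler(indices[80:]))
```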

@mtauraso mtauraso changed the title [Draft] Initial version of splits Splits for HSCDataSet Oct 25, 2024
@mtauraso mtauraso marked this pull request as ready for review October 25, 2024 03:43
mtauraso (Collaborator, Author) commented:

> The amount of extra code in the hsc_data_set module makes me a little nervous about future users needing to implement similar code for splits, but I also think that is something we can address down the line.

This also worried me while writing it. I'm going to think about what an external dataset writer can do as we implement HuggingFace. Hopefully there is an easy way to essentially explain one's data to HuggingFace and then ask it to perform the split. I tried to fashion this after their (and scikit-learn's) split code in the hopes that when we get there, we're interface compatible. That choice did make the implementation of this PR somewhat more complex than just "add splits".
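For comparison, the scikit-learn interface being mirrored (standard sklearn API, not fibad code):

```python
from sklearn.model_selection import train_test_split

indices = list(range(1000))
# Sizes may be ratios or absolute counts; omitting one of them
# defaults it to the complement of the other.
train_idx, test_idx = train_test_split(
    indices, train_size=0.6, test_size=0.2, random_state=42
)
print(len(train_idx), len(test_idx))  # 600 200
```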

@mtauraso mtauraso merged commit ea42b17 into main Oct 25, 2024
8 checks passed
@mtauraso mtauraso deleted the issue/74/splits branch October 25, 2024 04:14
Successfully merging this pull request may close these issues.

Implement Data Splits in HSCDataSet