
Splits for HSCDataSet #105

Merged: mtauraso merged 3 commits into main from issue/74/splits on Oct 25, 2024
Conversation

mtauraso (Collaborator):

- HSCDataSet has been significantly modified to implement splits
- Constructors for fibad_data_sets now take a split argument (see the sketch after this list)
- Tests pass, but the new functionality is not yet tested
- New configs added in a new "prepare" section to define splits
- Some configs in "model" have been reorganized into a "train" section that is similar to the "prepare", "predict", and "download" sections in that it configures the action more than anything else
- Added a "split" config to the train and predict sections to select the split that will be used
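As a rough sketch of the constructor contract described above (the class name and config handling here are hypothetical, not the merged fibad code):

```python
from typing import Optional

from torch.utils.data import Dataset


class MyDataSet(Dataset):
    """Hypothetical fibad-style data set: constructors now take a split."""

    def __init__(self, config: dict, split: Optional[str] = None):
        # `split` names one of the splits defined in the new "prepare"
        # config section, e.g. "train", "test", or "validate".
        self.config = config
        self.split = split
```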
@mtauraso mtauraso self-assigned this Oct 23, 2024
@mtauraso mtauraso linked an issue Oct 23, 2024 that may be closed by this pull request

github-actions bot commented Oct 23, 2024

| Before [b504b62] | After [a0000dc] | Ratio | Benchmark (Parameter) |
| --- | --- | --- | --- |
| 1.78±1s | 1.25±0.8s | ~0.70 | benchmarks.time_computation |
| 3.95k | 4.06k | 1.03 | benchmarks.mem_list |



codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 84.25926% with 17 lines in your changes missing coverage. Please review.

Project coverage is 41.10%. Comparing base (dad7062) to head (f173f9b).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/fibad/data_sets/hsc_data_set.py | 93.75% | 6 Missing ⚠️ |
| src/fibad/data_sets/example_cifar_data_set.py | 20.00% | 4 Missing ⚠️ |
| src/fibad/train.py | 0.00% | 3 Missing ⚠️ |
| src/fibad/predict.py | 0.00% | 2 Missing ⚠️ |
| src/fibad/pytorch_ignite.py | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main     #105      +/-   ##
==========================================
+ Coverage   34.83%   41.10%   +6.26%
==========================================
  Files          18       18
  Lines         887     1017     +130
==========================================
+ Hits          309      418     +109
- Misses        578      599      +21
```



"""

def __init__(self, config, split: Union[str, None] = None):
mtauraso (Collaborator, Author) commented:

The default value of split=None here is to get some tests to pass short-term. The intent is for the __init__() signature of any @fibad_data_set to require either that a split be provided, or that None be passed explicitly because it was in a config.
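A sketch of the intended end state (the strict signature below is an assumption drawn from this comment, not the merged code):

```python
from typing import Union


class SomeDataSet:
    # No default: every caller must choose a split explicitly.
    def __init__(self, config: dict, split: Union[str, None]):
        self.config = config
        self.split = split


# Both call forms are then deliberate choices:
# SomeDataSet(config, split="train")  # use a named split
# SomeDataSet(config, split=None)     # the config requested no split
```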

```python
    return len(self.current_split)


class HSCDataSetSplit(Dataset):
```
mtauraso (Collaborator, Author) commented Oct 23, 2024:

This class had more functionality in the first draft, but it now looks a lot like a glorified numpy masked array.

I ended up writing it this way to ensure there's only ever one HSCDataSetContainer. I expect that object to hold bookkeeping for ~10M files at runtime, and I essentially never want a copy made of it.

If anyone has suggestions to preserve object lifetimes while rewriting this to be shorter with numpy.ma, I'm interested.
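A minimal sketch of the masked-array idea described above, with assumed names (the real HSCDataSetSplit differs):

```python
import numpy as np
from torch.utils.data import Dataset


class MaskedSplit(Dataset):
    """Wraps one shared container data set, holding only a mask so the
    container's per-file bookkeeping is never copied."""

    def __init__(self, container: Dataset, mask: np.ndarray):
        self.container = container           # shared reference, never copied
        self.indices = np.flatnonzero(mask)  # items this split exposes

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, idx: int):
        # Translate a split-relative index into a container index.
        return self.container[int(self.indices[idx])]
```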

drewoldag (Collaborator) left a comment:

Overall I think this looks good for now. The amount of extra code in the hsc_data_set module makes me a little nervous about future users needing to implement similar code for splits, but I also think that is something we can address down the line.

And I can work with Max to figure out how to adjust the kbmod-ml dataset module to work within this new paradigm.

```toml
# If `int`, represents the absolute number of test samples.
# If `false`, the value is set to the complement of the train size.
# If `train_size` is also `false`, it will be set to `0.25`.
test_size = 0.6
```
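The comments above describe scikit-learn-style complement behavior; a sketch of that resolution rule (an illustration, not the merged implementation, and assuming ratio-valued sizes for brevity):

```python
def resolve_test_size(test_size, train_size):
    """Resolve test_size per the config comments above."""
    if test_size is False:
        # Complement of the train size, or 0.25 if that is also false.
        return 0.25 if train_size is False else 1.0 - train_size
    return test_size  # a ratio (float) or an absolute count (int)
```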
A collaborator commented:

This is a trivial comment, but I would expect test to be last in the list. In my mind at least, I always think in the order train, validate, test.

Also, perhaps this is ignorance talking, but wouldn't typical values for these be the following (see the config sketch after this list)?

  • test_size = 0.2
  • train_size = 0.6
  • validation_size = 0.2
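Those suggested values as a config sketch (the "prepare" section name comes from this PR's description; the exact keys are assumptions):

```toml
[prepare]
train_size = 0.6
validation_size = 0.2
test_size = 0.2
```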

mtauraso (Collaborator, Author) replied:

I'm going to fix this because, like you say, it's trivial to fix.

I'm also going to adopt your values, because I'm not sure where I got those values from 😄

A collaborator commented:

This section looks fine for now. I just searched a bit for CIFAR10 validation sets and found an inspirational gist here: https://gist.github.com/kevinzakka/d33bf8d6c7f06a9d8c76d97a7879f5cb#file-data_loader-py-L90-L109

Also, I was unaware of the sampler parameter that can be passed to DataLoader. I do wonder whether it would be helpful, or just get in the way of using HuggingFace datasets.
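For reference, the sampler parameter mentioned above works like this (a minimal PyTorch sketch, independent of fibad):

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy data set of 100 items; CIFAR10 would slot in the same way.
data = TensorDataset(torch.arange(100))

# An 80/20 index split, each subset served by its own loader.
indices = torch.randperm(100).tolist()
train_loader = DataLoader(data, batch_size=16,
                          sampler=SubsetRandomSampler(indices[:80]))
valid_loader = DataLoader(data, batch_size=16,
                          sampler=SubsetRandomSampler(indices[80:]))
```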

@mtauraso mtauraso changed the title [Draft] Initial version of splits Splits for HSCDataSet Oct 25, 2024
@mtauraso mtauraso marked this pull request as ready for review October 25, 2024 03:43
mtauraso (Collaborator, Author) commented:

> The amount of extra code in the hsc_data_set module makes me a little nervous about future users needing to implement similar code for splits, but I also think that is something we can address down the line.

This also worried me while writing it. I'm going to think about what an external dataset writer can do as we implement HuggingFace. Hopefully there is an easy way to essentially explain one's data to HuggingFace and then ask it to perform the split. I tried to fashion this after their (and scikit-learn's) split code in the hopes that when we get there, we're interface compatible. That choice did make the implementation of this PR somewhat more complex than just "add splits".
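For comparison, the scikit-learn interface being mirrored (standard sklearn API, not fibad code):

```python
from sklearn.model_selection import train_test_split

indices = list(range(1000))
# Sizes may be ratios or absolute counts; omitting one of them
# defaults it to the complement of the other.
train_idx, test_idx = train_test_split(
    indices, train_size=0.6, test_size=0.2, random_state=42
)
print(len(train_idx), len(test_idx))  # 600 200
```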

@mtauraso mtauraso merged commit ea42b17 into main Oct 25, 2024
8 checks passed
@mtauraso mtauraso deleted the issue/74/splits branch October 25, 2024 04:14
Successfully merging this pull request may close these issues.

Implement Data Splits in HSCDataSet