
Big Dataset Examples #163

Open · wants to merge 9 commits into master from big_datasets
Conversation

satyaog (Member) commented Mar 14, 2023

No description provided.

satyaog (Member, Author) commented Mar 14, 2023

I haven't yet figured out how HF allows using a custom download of The Pile dataset, but I plan to add another example with it.

lebrice (Contributor) commented Mar 15, 2023

Wouldn't we want to extract the archives into SLURM_TMPDIR in the sbatch script?
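A minimal sketch of what this suggests, assuming the archives live in some network directory (the function name and paths here are illustrative, not the PR's actual job.sh):

```shell
#!/bin/bash
# Illustrative helper for an sbatch script: extract dataset archives
# into the node-local $SLURM_TMPDIR before training starts.
extract_to_tmpdir() {
    local src=$1                          # directory holding the .tar archives
    local dest="${SLURM_TMPDIR:-/tmp}/data"
    mkdir -p "$dest"
    for archive in "$src"/*.tar; do
        [ -e "$archive" ] || continue     # skip if the glob matched nothing
        tar -xf "$archive" -C "$dest"     # extract onto the fast local disk
    done
    echo "$dest"                          # dataset root for the training script
}
```

In a real job.sh this would run once per node before the training `srun` step.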

satyaog (Member, Author) commented Mar 15, 2023

Yes, I was also thinking about that. The current strategy in job.sh should work with most of the torchvision datasets (particularly ImageNet, which needs reorganization steps done by torchvision), but for datasets that are not in torchvision's list it's indeed much simpler to do the extraction in job.sh, so I'll find another dataset for that.
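For datasets outside torchvision's list, the extraction step described here could look roughly like this in a data.py-style script (the function and paths are hypothetical, not the PR's code):

```python
import os
import tarfile
from pathlib import Path

def prepare_dataset(archive_dir, dest=None):
    """Extract every .tar archive in `archive_dir` into `dest`
    (defaults to $SLURM_TMPDIR/data) and return the dataset root."""
    dest = Path(dest or os.path.join(os.environ.get("SLURM_TMPDIR", "/tmp"), "data"))
    dest.mkdir(parents=True, exist_ok=True)
    for archive in sorted(Path(archive_dir).glob("*.tar")):
        with tarfile.open(archive) as tf:
            tf.extractall(dest)  # plain extraction, no torchvision-style reorganization
    return dest
```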

@satyaog satyaog force-pushed the big_datasets branch 3 times, most recently from ab3e057 to f861529 Compare March 15, 2023 21:32
@satyaog satyaog changed the title ImageNet Dataset Example Big Dataset Example Mar 17, 2023
@satyaog satyaog force-pushed the big_datasets branch 5 times, most recently from dd09537 to 41094f9 Compare March 17, 2023 22:09
@satyaog satyaog changed the title Big Dataset Example Big Dataset Examples Mar 17, 2023
@satyaog satyaog marked this pull request as ready for review March 21, 2023 20:48
btravouillon (Collaborator) commented
Waiting for merge of #161.

@satyaog satyaog force-pushed the big_datasets branch 9 times, most recently from 94be372 to 1572fd8 Compare April 8, 2023 06:35
@@ -5,4 +5,5 @@

.. include:: examples/frameworks/index.rst
.. include:: examples/distributed/index.rst
.. include:: examples/data/index.rst
Contributor:

This might fit nicely in good_practices, what do you think?

docs/examples/data/index.rst (outdated, resolved)

**job.sh**

.. literalinclude:: examples/data/torchvision/job.sh.diff
Contributor suggested change:

-  .. literalinclude:: examples/data/torchvision/job.sh.diff
+  .. literalinclude:: job.sh.diff


**main.py**

.. literalinclude:: examples/data/torchvision/main.py.diff
Contributor suggested change:

-  .. literalinclude:: examples/data/torchvision/main.py.diff
+  .. literalinclude:: main.py.diff


**data.py**

.. literalinclude:: examples/data/torchvision/data.py
Contributor suggested change:

-  .. literalinclude:: examples/data/torchvision/data.py
+  .. literalinclude:: data.py

docs/examples/data/torchvision/data.py (outdated, resolved)
docs/examples/data/torchvision/data.py (outdated, resolved)
docs/examples/data/torchvision/job.sh (outdated, resolved)
docs/examples/data/torchvision/main.py (outdated, resolved)
docs/examples/generate_diffs.sh (outdated, resolved)
@satyaog satyaog force-pushed the big_datasets branch 7 times, most recently from 3792d9d to c159806 Compare August 16, 2023 16:12
@satyaog satyaog requested a review from lebrice September 6, 2023 17:36
satyaog (Member, Author) commented Sep 21, 2023

@lebrice did you have time to check the recent updates to this PR?

lebrice (Contributor) commented Sep 21, 2023

> @lebrice did you have time to check the recent updates to this PR?

Not fully, but at a glance my comment here doesn't seem to have been addressed: #163 (comment)

Edit: Okay, I've looked at it now; my previous comments about the content are still relevant (for the most part).

lebrice (Contributor) left a review comment

Sorry, same comment (this is the third time I'm making it): #163 (comment)
Let me know what you think.

satyaog (Member, Author) commented Sep 21, 2023

So I think the only remaining issues were the main.py diff and the general good_practices section. Since I couldn't get Sphinx to land directly on the data example when it is alone in its category, I've moved it to good_practices.

satyaog (Member, Author) commented Sep 21, 2023

lebrice (Contributor) commented Sep 21, 2023

Let me clarify the comment #163 (comment):

What I'm saying is that I don't really see the value in having the main.py file included in this example, or showing a diff with respect to the single-gpu job's main.py (you did address this part by removing the diff, thanks!). In my opinion, the main "body" of the example is data.py, and showing how to use srun to launch the data pre-processing commands (with all the resources of each node, instead of the usual srun with one task per gpu) before training.

What do you think?
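The job.sh structure described above could be sketched like this (flag values are illustrative, and `$SRUN` defaults to `srun`, existing only so the sketch can run outside a SLURM cluster):

```shell
#!/bin/bash
# Illustrative job.sh shape: a data pre-processing step that gets all
# the CPUs of each node, followed by the usual training launch.
SRUN=${SRUN:-srun}

data_prep_then_train() {
    # pre-processing: one task per node, every CPU of the node
    "$SRUN" --ntasks-per-node=1 --cpus-per-task="${SLURM_CPUS_ON_NODE:-16}" \
        python data.py || return 1
    # training: the usual launch, one task per GPU
    "$SRUN" python main.py
}
```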

lebrice (Contributor) commented Sep 21, 2023

To be clear, if you feel like you want to merge this, then sure, it's fine as-is. I was just hoping that perhaps we could re-focus the example a bit so it doesn't dilute or mix up the important part of the content with what's already in the GPU job example.

One other thing: why do we allow customizing the number of workers for data preparation? Is there a context where we wouldn't want the number of data-preparation workers to equal the number of CPUs per node?

satyaog (Member, Author) commented Sep 22, 2023

Nah, not on the cluster; people will use all available CPUs. This is mostly a leftover from the scripts I personally use to preprocess datasets (at least the bash version). We're also showing a very good practice, which is not to override environment variables if they already exist, but I'm OK with removing it.
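The "don't override if already set" practice mentioned here is bash's default-assignment expansion; the variable name below is just illustrative:

```shell
#!/bin/bash
# Only assign DATA_PREP_WORKERS if it is unset or empty, so a value
# exported by the caller wins over the script's default.
: "${DATA_PREP_WORKERS:=${SLURM_CPUS_ON_NODE:-$(nproc)}}"
export DATA_PREP_WORKERS
```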

I agree about main.py, but then I think we could have a main that benchmarks the data-loading time for a couple of epochs, so people can use it to check the difference between multiple filesystems. At the same time, data-loading time is quite small compared to the time it takes to train on GPU, so we might as well keep the training part and just remove the accuracy logs. Later on, this example could also serve as a base for an example that shows (or not) the performance loss from overzealous logging when many CPU-GPU syncs are involved.
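A data-loading benchmark of the kind proposed here could be as small as this (hypothetical helper, not code from the PR):

```python
import time

def bench_dataloading(dataloader, epochs=2):
    """Return the seconds spent pulling every batch for `epochs`
    epochs, with no training step, to compare filesystems."""
    start = time.perf_counter()
    for _ in range(epochs):
        for _batch in dataloader:
            pass  # consume the batch and discard it
    return time.perf_counter() - start
```

Running it against the same dataset stored on two different filesystems would give the comparison described above.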
