
Big Dataset Examples #163

Open · wants to merge 9 commits into master from big_datasets
Conversation

satyaog (Member) commented Mar 14, 2023

No description provided.

satyaog (Member, Author) commented Mar 14, 2023

I haven't yet figured out how HF allows using a custom download of The Pile dataset, but I plan to add another example with it.

lebrice (Contributor) commented Mar 15, 2023

Wouldn't we want to extract the archives into SLURM_TMPDIR in the sbatch script?
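A minimal sketch of what this suggests, assuming the archives live in some network directory (the function name and paths here are illustrative, not the PR's actual job.sh):

```shell
#!/bin/bash
# Illustrative helper for an sbatch script: extract dataset archives
# into the node-local $SLURM_TMPDIR before training starts.
extract_to_tmpdir() {
    local src=$1                          # directory holding the .tar archives
    local dest="${SLURM_TMPDIR:-/tmp}/data"
    mkdir -p "$dest"
    for archive in "$src"/*.tar; do
        [ -e "$archive" ] || continue     # skip if the glob matched nothing
        tar -xf "$archive" -C "$dest"     # extract onto the fast local disk
    done
    echo "$dest"                          # dataset root for the training script
}
```

In a real job.sh this would run once per node before the training `srun` step.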

satyaog (Member, Author) commented Mar 15, 2023

Yes, I was also thinking about that. The current strategy in job.sh should work with most of the torchvision datasets (particularly ImageNet, which needs reorganization steps done by torchvision), but for datasets that are not in torchvision's list it's indeed much simpler to do the extraction in job.sh, so I'll find another dataset for that.
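For datasets outside torchvision's list, the extraction step described here could look roughly like this in a data.py-style script (the function and paths are hypothetical, not the PR's code):

```python
import os
import tarfile
from pathlib import Path

def prepare_dataset(archive_dir, dest=None):
    """Extract every .tar archive in `archive_dir` into `dest`
    (defaults to $SLURM_TMPDIR/data) and return the dataset root."""
    dest = Path(dest or os.path.join(os.environ.get("SLURM_TMPDIR", "/tmp"), "data"))
    dest.mkdir(parents=True, exist_ok=True)
    for archive in sorted(Path(archive_dir).glob("*.tar")):
        with tarfile.open(archive) as tf:
            tf.extractall(dest)  # plain extraction, no torchvision-style reorganization
    return dest
```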

@satyaog satyaog force-pushed the big_datasets branch 3 times, most recently from ab3e057 to f861529 Compare March 15, 2023 21:32
@satyaog satyaog changed the title ImageNet Dataset Example Big Dataset Example Mar 17, 2023
@satyaog satyaog force-pushed the big_datasets branch 5 times, most recently from dd09537 to 41094f9 Compare March 17, 2023 22:09
@satyaog satyaog changed the title Big Dataset Example Big Dataset Examples Mar 17, 2023
@satyaog satyaog marked this pull request as ready for review March 21, 2023 20:48
btravouillon (Collaborator) commented
Waiting for merge of #161.

@satyaog satyaog force-pushed the big_datasets branch 9 times, most recently from 94be372 to 1572fd8 Compare April 8, 2023 06:35
@@ -5,4 +5,5 @@

.. include:: examples/frameworks/index.rst
.. include:: examples/distributed/index.rst
.. include:: examples/data/index.rst
Contributor:

This might fit nicely in good_practices, what do you think?

docs/examples/data/index.rst (outdated, resolved)

**job.sh**

.. literalinclude:: examples/data/torchvision/job.sh.diff
Contributor suggested change:

-  .. literalinclude:: examples/data/torchvision/job.sh.diff
+  .. literalinclude:: job.sh.diff


**main.py**

.. literalinclude:: examples/data/torchvision/main.py.diff
Contributor suggested change:

-  .. literalinclude:: examples/data/torchvision/main.py.diff
+  .. literalinclude:: main.py.diff


**data.py**

.. literalinclude:: examples/data/torchvision/data.py
Contributor suggested change:

-  .. literalinclude:: examples/data/torchvision/data.py
+  .. literalinclude:: data.py

docs/examples/data/torchvision/data.py (outdated, resolved)
docs/examples/data/torchvision/data.py (outdated, resolved)
docs/examples/data/torchvision/job.sh (outdated, resolved)
docs/examples/data/torchvision/main.py (outdated, resolved)
docs/examples/generate_diffs.sh (outdated, resolved)
@satyaog satyaog force-pushed the big_datasets branch 7 times, most recently from 3792d9d to c159806 Compare August 16, 2023 16:12
@satyaog satyaog requested a review from lebrice September 6, 2023 17:36
satyaog (Member, Author) commented Sep 21, 2023

@lebrice did you have time to check the recent updates to this PR?

lebrice (Contributor) commented Sep 21, 2023

> @lebrice did you have time to check the recent updates to this PR?

Not fully, but at a glance my comment here doesn't seem to have been addressed: #163 (comment)

Edit: Okay, I've looked at it now; my previous comments about the content are still relevant (for the most part).

lebrice (Contributor) left a review comment

Sorry, same comment (this is the third time I'm making it): #163 (comment)
Let me know what you think.

satyaog (Member, Author) commented Sep 21, 2023

So I think the only remaining issues were the main.py diff and the general good_practices section. Since I couldn't get Sphinx to land directly on the data example when it is alone in its category, I've moved it to good_practices.

satyaog (Member, Author) commented Sep 21, 2023

lebrice (Contributor) commented Sep 21, 2023

Let me clarify the comment #163 (comment):

What I'm saying is that I don't really see the value in having the main.py file included in this example, or showing a diff with respect to the single-gpu job's main.py (you did address this part by removing the diff, thanks!). In my opinion, the main "body" of the example is data.py, and showing how to use srun to launch the data pre-processing commands (with all the resources of each node, instead of the usual srun with one task per gpu) before training.

What do you think?
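The job.sh structure described above could be sketched like this (flag values are illustrative, and `$SRUN` defaults to `srun`, existing only so the sketch can run outside a SLURM cluster):

```shell
#!/bin/bash
# Illustrative job.sh shape: a data pre-processing step that gets all
# the CPUs of each node, followed by the usual training launch.
SRUN=${SRUN:-srun}

data_prep_then_train() {
    # pre-processing: one task per node, every CPU of the node
    "$SRUN" --ntasks-per-node=1 --cpus-per-task="${SLURM_CPUS_ON_NODE:-16}" \
        python data.py || return 1
    # training: the usual launch, one task per GPU
    "$SRUN" python main.py
}
```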

lebrice (Contributor) commented Sep 21, 2023

To be clear, if you feel like you want to merge this, then sure, it's fine as-is. I was just hoping that perhaps we could re-focus the example a bit so it doesn't dilute or mix up the important part of the content with what's already in the GPU job example.

One other thing: why do we allow customizing the number of workers for data preparation? Is there a context where we wouldn't want the number of data-preparation workers to equal the number of CPUs per node?

satyaog (Member, Author) commented Sep 22, 2023

Nah, not on the cluster; people will use all available CPUs. This is mostly a leftover from the scripts I personally use to preprocess datasets (at least the bash version). We're also showing a very good practice, which is not to override environment variables if they already exist, but I'm OK with removing it.
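The "don't override if already set" practice mentioned here is bash's default-assignment expansion; the variable name below is just illustrative:

```shell
#!/bin/bash
# Only assign DATA_PREP_WORKERS if it is unset or empty, so a value
# exported by the caller wins over the script's default.
: "${DATA_PREP_WORKERS:=${SLURM_CPUS_ON_NODE:-$(nproc)}}"
export DATA_PREP_WORKERS
```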

I agree about main.py, but then I think we could have a main that benchmarks the data-loading time for a couple of epochs, so people can use it to check the difference between multiple filesystems. At the same time, data-loading time is quite small compared to the time it takes to train on GPU, so we might as well keep the training part and just remove the accuracy logs. Later on, this example could also serve as a base for an example that shows (or not) the performance loss from overzealous logging when many CPU-GPU syncs are involved.
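A data-loading benchmark of the kind proposed here could be as small as this (hypothetical helper, not code from the PR):

```python
import time

def bench_dataloading(dataloader, epochs=2):
    """Return the seconds spent pulling every batch for `epochs`
    epochs, with no training step, to compare filesystems."""
    start = time.perf_counter()
    for _ in range(epochs):
        for _batch in dataloader:
            pass  # consume the batch and discard it
    return time.perf_counter() - start
```

Running it against the same dataset stored on two different filesystems would give the comparison described above.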
