
data/benchmarks/ #416

Closed
7 tasks
msaroufim opened this issue May 17, 2022 · 3 comments · May be fixed by #422
Comments

msaroufim (Member) commented May 17, 2022

🚀 The feature

We're proposing a folder to hold all benchmark scripts, which would be easily reproducible by anyone on the core PyTorch Data team, the PyTorch domain teams, and the broader community. Original author: @vitaly-fedyunin.

Motivation, pitch

As pytorch/data gains wider adoption, there will be more questions about its performance, so it's important to have reusable, reproducible benchmark scripts. The dev team also needs to be able to monitor for regressions between releases and to use benchmarks to inform additional performance optimizations.

Each script should be runnable with clear instructions and dependencies documented in a README.md, it should be possible to run the same script in CI with no changes, and the script should output metrics to a human-readable markdown file.
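
A minimal sketch of what such a script could look like (the function name, metric set, and output format here are illustrative assumptions, not a settled design):

```python
import time

def benchmark_dataloader(loader, num_epochs=1, out_path="metrics.md"):
    """Iterate over a DataLoader and write per-epoch timings to a markdown table."""
    rows = []
    for epoch in range(num_epochs):
        start = time.perf_counter()
        num_batches = sum(1 for _ in loader)  # consume the loader without training
        elapsed = time.perf_counter() - start
        rows.append((epoch, num_batches, elapsed))
    with open(out_path, "w") as f:
        f.write("| epoch | batches | seconds |\n")
        f.write("|-------|---------|---------|\n")
        for epoch, n_batches, seconds in rows:
            f.write(f"| {epoch} | {n_batches} | {seconds:.1f} |\n")
```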

The main metric we're going to look at is time to convergence in training, compared against the traditional Dataset baseline, using both DataLoader v1 and the experimental DataLoader v2. The second most important metric is model accuracy, to make sure we don't degrade training performance too much (see shuffling issues).

The final outcome should support the cross product of all of the configurations below (a sketch that enumerates this cross product follows the metrics list).

Datasets

Models

  • Resnet50
  • Resnet128
  • BERT-B

Storage configuration

  • SSD
  • HDD
  • NFS
  • Cloud (S3)
  • Web (HTTP)

Other Metrics

Things to track:

  • Time per batch
  • Time per epoch
  • Precision over time
  • CPU load
  • GPU load (starvation)
  • Memory usage
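
A sketch of how this cross product could be enumerated (the configuration axes and the run_benchmark helper are placeholders, not a final design):

```python
from itertools import product

# Placeholder configuration axes; the real lists are the ones above.
MODELS = ["resnet50", "resnet128", "bert-b"]
STORAGE = ["ssd", "hdd", "nfs", "s3", "http"]
DATALOADERS = ["dataloader_v1", "dataloader_v2"]

def run_benchmark(model, storage, dataloader):
    """Hypothetical runner: launch one benchmark configuration and record its metrics."""
    print(f"benchmarking {model} on {storage} with {dataloader}")

for model, storage, dataloader in product(MODELS, STORAGE, DATALOADERS):
    run_benchmark(model, storage, dataloader)
```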

Alternatives

No response

Additional context

For each of the datasets, continuously track the following implementations:

  • Baseline implementation (before conversion to DataPipes).
  • Migrated DataSet (to DataPipes), version as-is from the Vision GitHub repo.
  • Option with tar (or other simple archiving).
  • Option with data preprocessing (rearrange/repack/reformat) using DataPipes & serialization (consider the speed of repacking too).
  • For Vision: use the NVIDIA WebDataset DataLoader and the NVIDIA DALI DataLoader for comparison.
  • For Text: use the HuggingFace dataset for comparison.

Ideally we would put one of these large datasets in an S3 bucket, but S3 will throttle it, so instead it's best to set up an EC2 instance with a simple HTTP server that serves the dataset from an attached SSD, which lets us do single-node 8-GPU experiments. For multi-node experiments we need to come up with a story for distributed storage.
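
For the single-node setup, something as simple as Python's built-in HTTP server could work (the port and dataset path below are assumptions):

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Serve the dataset directory on the attached SSD; the path is an assumption.
handler = partial(SimpleHTTPRequestHandler, directory="/mnt/ssd/datasets")
ThreadingHTTPServer(("0.0.0.0", 8000), handler).serve_forever()
```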

The main metrics we need to look at:

  • GPU utilization - higher is better; we want preprocessing and training to happen concurrently
  • Data throughput per datapipe, to measure any bottlenecks - more useful for datapipe authors. We can add some simple telemetry by default to any datapipe (see the sketch after this list)
  • End users are going to be looking at time per epoch and any accuracy loss
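
As a rough illustration of that per-datapipe telemetry, a wrapper like the one below could be chained around any datapipe (the class name and reporting format are assumptions, not an existing torchdata feature):

```python
import time

from torchdata.datapipes.iter import IterDataPipe

class TimedDataPipe(IterDataPipe):
    """Wrap any IterDataPipe and report throughput after a full pass."""

    def __init__(self, source_dp: IterDataPipe, name: str = "datapipe"):
        self.source_dp = source_dp
        self.name = name

    def __iter__(self):
        start = time.perf_counter()
        count = 0
        for item in self.source_dp:
            count += 1
            yield item
        elapsed = time.perf_counter() - start
        print(f"[{self.name}] {count} items in {elapsed:.1f}s ({count / elapsed:.1f} items/s)")
```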

NicolasHug (Member) commented May 18, 2022

Thanks for opening this issue @msaroufim!

On top of model training time and accuracy, I think we'll also want to monitor the time for the DataLoader to yield an entire epoch (or 5), without a training loop. Ultimately we do care about training time, but it depends a lot on the GPU (and the number of GPUs).
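
Something along these lines could measure that, assuming a plain DataLoader and no model step (the function name and defaults are illustrative):

```python
import time

from torch.utils.data import DataLoader

def time_epochs(dataset, num_epochs=5, batch_size=128, num_workers=12):
    """Time how long the DataLoader takes to yield full epochs, with no training loop."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    for epoch in range(num_epochs):
        start = time.perf_counter()
        for _batch in loader:
            pass  # no model step: pure data-loading time
        print(f"epoch {epoch}: {time.perf_counter() - start:.1f}s")
```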

Regarding the vision models to benchmark, I would suggest the following instead of Resnet50 and Resnet128:

  • Large batch + small model (IO bound): mobilenet_v3_large with batch size 128
  • Small batch + large model (compute heavy): resnext50_32x4d with batch size 32

(This is taken from past investigations by @datumbox, unrelated to datapipes.)
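
For reference, the two suggested configurations map directly onto existing torchvision constructors (the dict layout is just an illustration):

```python
import torchvision.models as models

benchmark_configs = [
    # large batch + small model: IO bound
    {"model_fn": models.mobilenet_v3_large, "batch_size": 128},
    # small batch + large model: compute heavy
    {"model_fn": models.resnext50_32x4d, "batch_size": 32},
]

for cfg in benchmark_configs:
    model = cfg["model_fn"]()  # instantiate with default (random) weights
```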

I spent a lot of time porting the torchvision training references to use datapipes. I don't think they're suitable for the kind of benchmark we want to do here (because they support tons of other training features, so they're too complex to be public as-is), but they could be a good start. Happy to get you started if you need.

NicolasHug (Member) commented:

> I spent a lot of time porting the torchvision training references to use datapipes. I don't think they're suitable for the kind of benchmark we want to do here (because they support tons of other training features, so they're too complex to be public as-is), but they could be a good start. Happy to get you started if you need.

FYI, I just published PR pytorch/vision#6196, which adds datapipe support to torchvision's classification training reference (without all the complex async-io stuff).

DataLoaderV2 doesn't support the DistributedReadingService right now, so I'm sticking to DataLoader v1, but I'll start running more intensive benchmarks on my side as well.

NicolasHug (Member) commented Jun 24, 2022

Some basic results, which are consistent with what I had a few months ago:

Benchmarking mobilenet_v3_large (IO bound) from the torchvision training references (pytorch/vision#6196) on the AWS cluster, distributed over 8 A100 GPUs with 12 workers each. This is a very typical setup that we use constantly.

  • On fsx (a pretty slow file system):
    • training with datapipes is ~30% faster than with map-style datasets. But strangely, datapipe epochs seem to take increasingly longer (see details below).
  • On ontap (a fast file system):
    • training with datapipes is ~10% slower than with map-style datasets.

The ontap reports are more relevant, because in general there is no reason to use the slow fsx file system.

I will start running more in-depth experiments, e.g. completely removing the model-training part, to see if we can identify what could cause such stark differences.


Details

python -u ~/slurm/run_with_submitit.py --ngpus 8 --nodes 1 --model mobilenet_v3_large --epochs 5 --batch-size 128 --workers 12 --ds-type $ds_type --fs $fs

For reference: running just the model training with a pre-loaded dataset (no IO, no transforms) takes ~13 minutes for both datapipes and map-style datasets. This is the "best" possible training time, assuming data-loading time is zero.

Note: we should ignore the first epoch because these file systems are sensitive to warm-up / caching.

file-system = fsx
ds-type=dp
Epoch: [0] Total time: 0:15:07
Epoch: [1] Total time: 0:15:36
Epoch: [2] Total time: 0:16:42
Epoch: [3] Total time: 0:19:26
Epoch: [4] Total time: 0:20:41
Training time 1:32:03

file-system = fsx
ds-type=mapstyle
Epoch: [0] Total time: 0:22:09
Epoch: [1] Total time: 0:24:51
Epoch: [2] Total time: 0:26:21
Epoch: [3] Total time: 0:25:50
Epoch: [4] Total time: 0:25:40
Training time 2:10:07

file-system = ontap
ds-type=dp
Epoch: [0] Total time: 0:10:02
Epoch: [1] Total time: 0:04:12
Epoch: [2] Total time: 0:04:10
Epoch: [3] Total time: 0:04:10
Epoch: [4] Total time: 0:04:10
Training time 0:28:01

file-system = ontap
ds-type=mapstyle
Epoch: [0] Total time: 0:07:32
Epoch: [1] Total time: 0:03:46
Epoch: [2] Total time: 0:03:47
Epoch: [3] Total time: 0:03:46
Epoch: [4] Total time: 0:03:45
Training time 0:23:40

facebook-github-bot pushed a commit that referenced this issue Aug 12, 2022
Summary:
Towards #416

This is a modified and simplified version of the torchvision classification training reference that provides:

- Distributed learning (DDP) vs 1-GPU training
- Datapipes (with DataLoader or torchdata.dataloader2) vs iterable datasets (non-DP) vs map-style datasets
- Full training procedure, or data-loading only (with or without transforms), or model training only (generating fake datasets)
- Timing of data-loading vs model training
- Any classification model from torchvision

I removed a lot of non-essential features from the original reference, but I can simplify further. Typically I would expect the `MetricLogger` to disappear, or be trimmed down to its most essential bits.

Pull Request resolved: #714

Reviewed By: NivekT

Differential Revision: D38569273

Pulled By: NicolasHug

fbshipit-source-id: 1bc4442ab826256123f8360c14dc8b3eccd73256