This repository has been archived by the owner on Nov 16, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 22
/
Pytorch-Apex-cifar10-DDP-nccl.yaml
56 lines (51 loc) · 1.93 KB
/
Pytorch-Apex-cifar10-DDP-nccl.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
protocolVersion: 2
name: Pytorch-Apex-cifar10-DDP-nccl
type: job
jobRetryCount: 0
description: |
Pytorch DDP Example With Apex (NCCL backend)
This example shows how to train a custom neural network on cifar10 with Pytorch on OpenPAI.
We installed Apex before running `python <script.py>`,we recommend using the right network interface, so we used environment variables, and the sample program will be trained on two machines, each with two gpus.
And mixed precision training (training in a combination of float (FP32) and half (FP16) precision) allows us to use larger batch sizes and take advantage of NVIDIA Tensor Cores for faster computation.
The NCCL backend provides an optimized implementation of collective operations against CUDA tensors.If you only use CUDA tensors for your collective operations, consider using this backend for the best in class performance. The NCCL backend is included in the pre-built binaries with CUDA support.
prerequisites:
- type: dockerimage
uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
name: docker_image_0
taskRoles:
worker:
instances: 2
completion:
minFailedInstances: 1
taskRetryCount: 0
dockerImage: docker_image_0
resourcePerInstance:
gpu: 2
cpu: 8
memoryMB: 16384
ports:
SynPort: 1
commands:
- 'git clone https://github.com/NVIDIA/apex'
- cd apex
- python setup.py install
- cd ..
- >-
wget
https://raw.githubusercontent.com/microsoft/pai/master/examples/Distributed-example/cifar10-single-mul-DDP-nccl-gloo-Apex-mixed.py
- >-
python cifar10-single-mul-DDP-nccl-gloo-Apex-mixed.py -n 2 -g 2
--epochs 2 --dist-backend nccl
defaults:
virtualCluster: default
extras:
com.microsoft.pai.runtimeplugin:
- plugin: ssh
parameters:
jobssh: true
userssh: {}
hivedScheduler:
taskRoles:
worker:
skuNum: 2
skuType: null