This repository has been archived by the owner on Nov 16, 2023. It is now read-only.
Pytorch-Apex-cifar10-DDP-gloo.yaml
protocolVersion: 2
name: Pytorch-Apex-cifar10-DDP-gloo
type: job
jobRetryCount: 0
description: |
  PyTorch DDP example with Apex (Gloo backend)
  This example shows how to train a custom neural network on CIFAR-10 with PyTorch on OpenPAI.
  Apex is installed before running `python <script.py>`. We recommend pinning the network interface, so the job sets it through the GLOO_SOCKET_IFNAME environment variable. The sample program trains on two machines, each with two GPUs.
  With the Gloo backend, you can specify multiple interfaces by separating them with commas in GLOO_SOCKET_IFNAME. The backend will dispatch operations in a round-robin fashion across these interfaces. Mixed precision training (training in a combination of float (FP32) and half (FP16) precision) allows larger batch sizes and takes advantage of NVIDIA Tensor Cores for faster computation.
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  worker:
    instances: 2
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 2
      cpu: 8
      memoryMB: 16384
      ports:
        SynPort: 1
    commands:
      - export GLOO_SOCKET_IFNAME=eth0
      - 'git clone https://github.com/NVIDIA/apex'
      - cd apex
      - python setup.py install
      - cd ..
      - >-
        wget
        https://raw.githubusercontent.com/microsoft/pai/master/examples/Distributed-example/cifar10-single-mul-DDP-nccl-gloo-Apex-mixed.py
      - >-
        python cifar10-single-mul-DDP-nccl-gloo-Apex-mixed.py -n 2 -g 2
        --epochs 2 --dist-backend gloo
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
        userssh: {}
  hivedScheduler:
    taskRoles:
      worker:
        skuNum: 2
        skuType: null
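
The job above runs 2 worker instances with 2 GPUs each (`-n 2 -g 2`), so 4 training processes take part in total. The sketch below shows one common way multi-node DDP scripts number those processes: global rank = node rank × GPUs per node + local GPU index. This convention is an assumption, not taken from the downloaded training script, and `rank_layout` is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, Tuple


def rank_layout(nodes: int, gpus_per_node: int) -> Tuple[int, Dict[Tuple[int, int], int]]:
    """Map (node_rank, local_gpu) to a global rank.

    Hypothetical helper: assumes the common convention
    global_rank = node_rank * gpus_per_node + local_gpu.
    """
    if nodes < 1 or gpus_per_node < 1:
        raise ValueError("nodes and gpus_per_node must be positive")

    # Total number of processes taking part in the job.
    world_size = nodes * gpus_per_node

    layout = {
        (node, gpu): node * gpus_per_node + gpu
        for node in range(nodes)
        for gpu in range(gpus_per_node)
    }
    return world_size, layout


# Values matching the job above: 2 instances, 2 GPUs per instance.
world_size, layout = rank_layout(nodes=2, gpus_per_node=2)
for (node, gpu), rank in sorted(layout.items(), key=lambda item: item[1]):
    print(f"node {node}, gpu {gpu} -> global rank {rank} of {world_size}")
```

Under this assumed numbering, the first GPU on the second machine gets global rank 2, and every process agrees on a world size of 4.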