Multi-node communication problem using Slurm and NeMo Megatron official GPT docs example #5819
Seems possibly related to this Lightning issue: Lightning-AI/pytorch-lightning#10098
Have you tried the following config: ...
#SBATCH --nodes=<n>
#SBATCH --tasks-per-node=<m>
#SBATCH --gpus-per-node=<m>
...
trainer.devices=-1 \
trainer.num_nodes=$SLURM_JOB_NUM_NODES \
This will start up <n>×<m> processes in total, one per GPU. As for other potential causes, they may depend on the specifics of your setup; for those you may wish to enable some debug flags. They helped me track down an issue I had.
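Put together as a complete minimal job script, the suggestion above might look like this; the node/GPU counts, script path, and the choice of debug flags are illustrative assumptions (the exact flags were stripped from the comment), not values confirmed in this thread:

#!/bin/bash
#SBATCH --nodes=2                # <n> nodes (placeholder)
#SBATCH --tasks-per-node=4       # <m> tasks, one per GPU (placeholder)
#SBATCH --gpus-per-node=4        # <m> GPUs per node (placeholder)

# Assumed debug flags, mirroring those used later in this thread
export NCCL_DEBUG=INFO           # verbose NCCL communicator/transport logging
export PYTHONFAULTHANDLER=1      # Python-level tracebacks on hard crashes
export HYDRA_FULL_ERROR=1        # full Hydra stack traces instead of summaries

# One srun task per GPU; Lightning derives ranks from the SLURM environment
srun python megatron_gpt_pretraining.py \
    trainer.devices=-1 \
    trainer.num_nodes=$SLURM_JOB_NUM_NODES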
Thanks for sharing and suggesting fixes. When I try your suggested config, it still has problems setting the global ranks correctly: all the processes now come up with incorrect ranks. In all the different configurations of sbatch params I've tried, PyTorch Lightning seems to have issues setting the global ranks, and it results in all processes except one crashing when they connect to the same address. Changing the sbatch parameters alters the symptoms, but the rank problem remains.
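Before involving Lightning at all, one quick diagnostic (an editorial suggestion, not from the original thread) is to print the rank-related variables Slurm hands each task; the single quotes make each task, rather than the submitting shell, expand them:

# Run inside an sbatch allocation: prints one line per task
srun bash -c 'echo "host=$(hostname -s) PROCID=$SLURM_PROCID LOCALID=$SLURM_LOCALID NODEID=$SLURM_NODEID NTASKS=$SLURM_NTASKS CVD=$CUDA_VISIBLE_DEVICES"'

If several tasks report the same PROCID, or CUDA_VISIBLE_DEVICES is not what you expect, the problem lies in the Slurm launch rather than in NeMo/Lightning.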
We haven't been using this pattern of launching ourselves.
Seems like the likely culprit in our case is that only 1 GPU looks to be available per process when launching through srun. However, all 4 GPU devices show up in each of the individual processes when running the same check outside srun. *edit: Although the other launch configurations we tried still give only single visible devices for each task 😒.
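Whether each task sees one GPU or all four comes down to how the job requests GPU binding. A sketch of the contrast (exact behaviour varies with Slurm version and site configuration, so treat this as an assumption to verify on your cluster):

# Per-task binding: Slurm restricts each of the 4 tasks to its own device,
# so every task reports exactly one GPU
srun --ntasks-per-node=4 --gpus-per-task=1 nvidia-smi -L

# Node-level allocation without per-task binding: all 4 devices are
# typically visible to every task on the node
srun --ntasks-per-node=1 --gres=gpu:4 nvidia-smi -L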
We found a rather hacky solution to make training work. To anyone reading this in the future who runs into the same issue: our problem was that all GPUs in a node were not visible to a process whenever we started Slurm jobs with the recommended --ntasks-per-node matching the number of devices.

Solution
In short: run a single task per node, let Lightning spawn one process per GPU, and set the rank-related environment variables ourselves (see the scripts below).
For reference, here's our sbatch script:

#!/bin/bash -l
#SBATCH --partition=gpu
#SBATCH --qos=test
#SBATCH --account=p200097
#SBATCH --job-name=gpt_nemo
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --time=0-00:30:00
#SBATCH --output=logs/gpt_nemo.log
# Modules
pwd
module purge
module load Singularity-CE
## Create needed distributed env variables
addr=$(/bin/hostname -s)
export MASTER_ADDR=$addr
export MASTER_PORT=16783 # Meluxina overwrites this variable after srun
export GPUS_PER_NODE=4
export NCCL_CROSS_NIC=1
# debugging flags (optional)
export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=ALL
export PYTHONFAULTHANDLER=1
export HYDRA_FULL_ERROR=1
# Logfile
DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
PROJECT=/project/home/p200097/faton/nemo_test # Use abs path, not symbolic link
CONTAINER_PATH=/project/home/p200097/faton/nemo_test/nemo2209b.sif
LOGGING=$PROJECT/logs
LOGFILE="${LOGGING}/%x_${DATETIME}.log"
echo $LOGFILE
ls -lh
cmd="srun -l --output=$LOGGING/gpt_nemo_$DATETIME.log \
singularity exec --nv --bind $PROJECT:$PROJECT --bind /project/scratch/p200097/data/nemo_test:/mnt $CONTAINER_PATH \
bash $PROJECT/training_args.sh"
$cmd

And here's our training_args.sh:

/bin/hostname -s
export MASTER_PORT=16783
export NODE_RANK=$SLURM_NODEID
# export LOCAL_RANK=$SLURM_LOCALID # Local rank needs to be uninitialized for Lightning to work properly with DDP and 1 process per node
# export GLOBAL_RANK=$SLURM_PROCID # if --ntasks-per-node == devices, then PROCID is the global_rank. But training with --ntasks-per-node doesn't work.
export GLOBAL_RANK=$((SLURM_NODEID * GPUS_PER_NODE + LOCAL_RANK)) # When only 1 process per node, this calculates global_rank
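# NB (editorial inference, not part of the original script): LOCAL_RANK is
# deliberately left unset above, so it evaluates to 0 in this arithmetic
# expansion; with one task per node, GLOBAL_RANK therefore becomes the rank of
# the node's first process, and Lightning numbers the processes it spawns for
# the remaining GPUs itself.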
echo "----------"
echo "NODE_RANK" $NODE_RANK
echo "LOCAL_RANK" $LOCAL_RANK
echo "GLOBAL_RANK" $GLOBAL_RANK
echo "WORLD_SIZE" $WORLD_SIZE
echo "MASTER_PORT" $MASTER_PORT
echo "---------------------"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
nvidia-smi -L
python /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=/workspace/nemo/examples/nlp/language_modeling/conf \
--config-name=megatron_gpt_config \
trainer.devices=$GPUS_PER_NODE \
...bunch-of-args \
...

Hope this helps someone in the future trying to train multi-node with NeMo and Slurm.
We use Slurm for all our clusters; none of the above is needed. We follow the PTL guidelines, and the only thing we normally do is add a CUDA_VISIBLE_DEVICES flag with all the GPUs in the list. That seems to work fine without resorting to these steps. So if there are 8 GPUs per node, we do:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python nemo_script.py ... trainer.num_nodes=x trainer.devices=-1
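A sketch of that advice as a job script (an editorial reconstruction, not an official example; the node count, GPU count, and script path are placeholders, and note that GPU indices are 0-based):

#!/bin/bash
#SBATCH --nodes=2                # placeholder node count
#SBATCH --ntasks-per-node=8      # one task per GPU, per the PTL guidelines
#SBATCH --gres=gpu:8

# Expose every GPU on the node to each task (0-based device indices)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

srun python nemo_script.py \
    trainer.num_nodes=$SLURM_JOB_NUM_NODES \
    trainer.devices=-1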
Thank you very much @titu1994. It had never occurred to me that CUDA_VISIBLE_DEVICES needed to be set explicitly. It would probably be helpful if you guys posted an example sbatch script in the documentation, to save others from future headaches. Thanks again for the tip about setting CUDA_VISIBLE_DEVICES.
@SeanNaren we should note this in your AWS tutorial (though I don't know if that uses Slurm directly or AWS SageMaker). Maybe also let's comment on the PTL Slack to add this info to the end of https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_advanced.html
Describe the bug
We are trying to get multi-node training to work with NeMo Megatron by following the quick start steps in your GPT model training docs. We're using Slurm on an HPC and are able to successfully train using Megatron-LM, but not with NeMo.
NeMo keeps insisting we are running multi-node training without SLURM handling the processes:
...

and the global ranks of our GPUs seem to be incorrectly initialised as a result:
...
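Lightning decides whether it is running under Slurm purely from the environment, so a quick check (an editorial suggestion; the variable list reflects a common understanding of Lightning's SLURM detection, not NeMo documentation) is to dump the relevant variables from within a task:

# If SLURM_NTASKS is missing, or the job looks interactive
# (e.g. SLURM_JOB_NAME=bash), Lightning falls back to non-SLURM behaviour
srun bash -c 'env | grep -E "^SLURM_(NTASKS|PROCID|LOCALID|NODEID|JOB_NAME)=" | sort'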
Steps/Code to reproduce bug
...

And here are the settings and launch script in training_args.sh:
...

Expected behavior
NeMo/PyTorch Lightning recognising that the job is run through Slurm and starting training successfully.
Environment overview (please complete the following information)
Additional context
1 node in our case consists of 4 A100 GPUs.
We saw that you referred to the PyTorch Lightning documentation when asked about multi-node training in this previous issue. However, the PyTorch Lightning docs' example sbatch script has a setting that makes no sense to us:
#SBATCH --ntasks-per-node=8 # This needs to match Trainer(devices=...)
If we set --ntasks-per-node=4, this creates 4 separate processes in a node consisting of 4 GPUs, with each GPU placed in a separate process and only a single GPU available per process. We tried the above method, and it only resulted in training crashing because NeMo/Lightning expected 4 devices (0, 1, 2, 3) but only saw one device (0) per process. In the GitHub issue thread we referenced, you write that you guys use Slurm internally. Could you provide a working example of launching a multi-node job with NeMo Megatron using sbatch and the example in your docs?
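For concreteness, a reconstructed sketch of the failing layout described above (an editorial reconstruction; counts and paths are illustrative, not the reporter's exact script):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4   # PTL docs: must match Trainer(devices=...)
#SBATCH --gres=gpu:4

# With this layout Lightning expects devices 0-3 in every task, but each
# task only saw device 0, so startup crashed
srun python megatron_gpt_pretraining.py \
    trainer.devices=4 \
    trainer.num_nodes=$SLURM_JOB_NUM_NODES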
Log outputs: