NeMo Megatron dataset helper makefile compiles output to write protected container folder (Singularity) #5820
Comments
We solve this on our end by adding … Ideally, however, this file should be compiled when you build the Docker container and install NeMo in it, @ericharper. That would make it easier to use NeMo containers on HPC.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
@Lauler, I am running into the same issue. Did you build helpers.cpp in the container? Can you share the container, along with the post and make commands?
Identify where Python is installed in the container, and where the nemo package is located.
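A minimal sketch of that lookup, to be run with the container's own Python interpreter. The `package_dir` helper is a hypothetical name introduced here for illustration; it relies only on the standard library and does not assume any particular install path.

```python
import importlib.util
import os
import sys
from typing import Optional


def package_dir(name: str) -> Optional[str]:
    """Return the directory a top-level package is imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        # Package is not installed (or is a namespace package without a single origin).
        return None
    return os.path.dirname(spec.origin)


if __name__ == "__main__":
    # Path of the interpreter that is actually running inside the container.
    print("Python interpreter:", sys.executable)
    # Prints None when nemo is not importable from this interpreter.
    print("nemo package directory:", package_dir("nemo"))
```

The helpers.cpp file can then be searched for below the printed nemo directory; its exact subpath is not stated in this thread, so it is left out here.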
@Lauler, how long did it take for your image to be built? I am building it remotely, and it takes a lot of time. Would it be possible for you to share your .sif file?
Describe the bug
The C++ dataset helper makefile of Megatron in NeMo attempts to write its output to /usr/bin/ld, causing training to crash when using Singularity containers built from your NVIDIA NGC NeMo containers on HPC clusters.
Steps/Code to reproduce bug
Build a Singularity container for use in HPC. This is our definition file nemo.def, and we build it locally (outside of the HPC environment) via sudo singularity build nemo2209.sif nemo.def:
We transfer the image nemo2209.sif to HPC, and follow the NeMo GPT training docs.
See this other issue for sbatch config and launch script (changing --nodes=2 to --nodes=1).
Expected behavior
Most users will probably use NeMo Megatron on HPC, where they don't have sudo rights and need to use Singularity instead of Docker. It would be nice if you would test that your documentation examples are launchable with Singularity containers on systems where you do not have root/sudo. A container that is already built should be able to launch training without errors, and without building/compiling extra stuff that needs to be written to write-protected folders.
Regular NVIDIA PyTorch containers work out of the box when converted to Singularity containers and used with Megatron-LM.
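The expectation above can be expressed as a small preflight check to run before launching training. This is only a sketch: `helper_status` is a hypothetical function, and the `helpers*.so` file pattern is an assumption about how the compiled helper is named, not something confirmed in this issue.

```python
import glob
import os


def helper_status(directory: str, pattern: str = "helpers*.so") -> str:
    """Classify a helper directory before training starts.

    Returns "compiled" if a file matching the (assumed) pattern exists,
    "missing-writable" if it is absent but could be built in place, and
    "missing-readonly" if it is absent and the directory cannot be written to.
    """
    if glob.glob(os.path.join(directory, pattern)):
        return "compiled"
    if os.access(directory, os.W_OK):
        return "missing-writable"
    # This is the failing case described in this issue: the helper has not been
    # built, and the read-only container image does not allow building it.
    return "missing-readonly"
```

A result of "missing-readonly" inside the .sif image would indicate that the helper has to be compiled at image build time rather than at launch.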
Environment overview (please complete the following information)
HPC cluster, Slurm.
Environment details
NGC NeMo containers 22.08 and 22.09.
Additional context
A100 GPUs.