I have two machines, each equipped with one GPU. I want to run multiple workers on each machine. Is this possible in BytePS? I tried to run 4 worker processes (2 processes on each machine) and 2 servers (1 server process on each machine), but the last 3 worker processes fail with the following error and the first worker hangs. I launched them with the same commands I use for the normal setup of 1 worker per GPU machine, which works.
BytePS launching worker
enable NUMA finetune...
Command: numactl --physcpubind 0-4,20-24 python /users/halmas3/byteps/example/pytorch/benchmark_byteps.py --model=vgg19 --batch-size=64
[19:28:20] src/postoffice.cc:63: Creating Van: zmq. group_size=1
[19:28:20] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[19:28:58] src/./zmq_van.h:351: Start ZMQ recv thread
[2022-02-17 19:29:02.800368: F byteps/common/operations.cc:290] Check failed: (size) > (0) init tensor size not larger than 0
Aborted (core dumped)
Traceback (most recent call last):
File "/usr/local/bin/bpslaunch", line 4, in <module>
__import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 254, in <module>
launch_bps()
File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 240, in launch_bps
t[i].join()
File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 34, in join
raise self.exc
File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 27, in run
self.ret = self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 192, in worker
subprocess.check_call(command, env=my_env,
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'numactl --physcpubind 0-4,20-24 python /users/halmas3/byteps/example/pytorch/benchmark_byteps.py --model=vgg19 --batch-size=64' returned non-zero exit status 134.
I have two questions here:
1. Is it possible to run BytePS with multiple workers on a single-GPU machine?
2. Is it possible to run BytePS with CPU-only machines as the workers?
Thank you!
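A note on reading the traceback above: the exit status 134 at the end matches the usual shell convention of 128 plus the signal number, and on Linux signal 6 is SIGABRT. That agrees with the "Aborted (core dumped)" line printed after the failed check in operations.cc. The sketch below decodes such a status. It assumes the command was run through a shell, which the traceback suggests but does not confirm, and `describe_exit_status` is a hypothetical helper, not part of BytePS or bpslaunch.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status.

    Assumption: the process was started through a shell, which reports
    death by signal N as exit status 128 + N.
    """
    if status > 128:
        try:
            # Map the signal number back to its symbolic name.
            return signal.Signals(status - 128).name
        except ValueError:
            return "unknown signal %d" % (status - 128)
    return "exit code %d" % status


# Status taken from the CalledProcessError in the traceback.
print(describe_exit_status(134))
```

This only confirms that the worker aborted itself on the failed size check. It does not explain why the tensor size was 0.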