Problem Running MPI on Metworx #41
Any speedup with `n=2` and `chains=1`?
Nope. Takes about the same time either way.
Actually, I just realized the worker nodes I was using only had 4 GB of RAM each. Could that be a problem? If I recall correctly my master node had 4 cores, so maybe that's why the error is happening when I go from 4 to 8. Let me know if you need more information about the setup and environment.
Allow me to first test the example on my local build. There were a few occasions where the same error was seen but went away when I switched to a different MPI build.
Here's what I found. After using a freshly installed MPI library (MPICH), I was able to build and run the model with number of processes = 1, 2, 4, and get speedups of 1.6 (n=2) and 2.7 (n=4). I was running it using cmdstan:

```
mpiexec -n 4 ./twocpt_population sample data file=twocpt_population.data.R init=twocpt_population.data.R random seed=3289
```

I have no access to metworx so wasn't able to test n=8. Since you didn't see any speedup I suspect the running script was not working properly. How was
Not sure. What do you mean? I also got the same error using cmdstan in the terminal. I'll try running on a master node and report back. Is it also possible Metworx has an old, incompatible version of MPICH? I noticed the instructions here mention to set
@billgillespie maybe you and someone on the metworx team can help @aryaamgen? We have some scripts to generate a hostfile, and metrum's aws specialist may be able to automate that even more.
Yes, metworx can provide previous snapshots. But at this point I'm not sure the library is the cause here (well, except that Open MPI's error message is not very helpful). We can take this path if the issue is reproduced on the master node and submission to worker nodes is fixed.
Ok, I tried running again on a 4-core master node and did get about a 2x speedup going from 1 to 2 processes, but then no speedup at 4. I was also able to try 7 processes, and that worked but was actually slower (I assume because communication to the worker nodes is slow). But for some reason 8 processes still gives that same error.
Ah ok, I see. Yeah, I did not know about host files at all. So I guess Metworx is somehow generating hostfiles automatically? I wonder if there's a way to see and edit those.
Is that 4 physical cores or 4 vCPUs? You can check that in the metworx UI (master node information). If it's vCPUs then what's provisioned is actually 4 threads, which would not scale at all in the current MPI implementation.
Ah ok, interesting. Yes, it's 4 vCPUs. Does that mean that for a single multicore machine I'd be better off using

So far I've been using a 96 vCPU machine on Metworx and just wrapping my calls to
Then the performance will almost surely degrade when n=4.
Probably, but it also depends on the specific model.
This is the exact motivation for using MPI.
I confirmed that n >= 8 fails on Metworx using the default MPICH setup. It happened with 16 and 32 vCPU instances, so it's not due to limits on physical cores. I need to learn a bit more about MPICH to get around it, or switch to OpenMPI.
I'm almost sure this is the same problem we've run into previously: metworx's vanilla installation fails but a manually built MPICH doesn't. I just tested on my local machine with only 4 cores that n=8, 16, 32 all run well. I'll see if this behavior can be reproduced on metworx later today.
Interesting. Should I try to manually build MPICH myself on the Metworx cluster, or would that be too difficult given I don't have much experience with cluster submission? Did you follow these directions?
I was also able to confirm what @billgillespie saw. The following is probably the simplest workaround (master node of the metworx instance only; we can work on worker node details after clearing this).

Install openmpi:

```
sudo apt update
sudo apt -y install openmpi-bin
```

which installs openmpi's version of `mpiexec`. Check it with `mpiexec --version`; the output should be `mpiexec (OpenRTE) 2.1.1`.

Use the new mpiexec to execute the job. Now

```
mpiexec -n 8 ./twocpt_population sample data file=twocpt_population.data.R init=twocpt_population.init.R
```

should not crash (neither should n > 8). Let me know if it works on your end.
Thanks Yi! I tried the above directions, installing openmpi on an instance with vCPU=8 and no worker nodes, just the master. I no longer get an error on n=8. However, I don't get any speedup going from 1 to 2, 2 to 4, or 4 to 8. Actually, at 8 it takes twice as long. By the way, should I still have the following in the
This is expected, because your binary

There are two paths we can follow from this point: we can help you install a fresh MPI library, or we can wait for the metworx team's fix. If you are comfortable following the MPICH installation guide I'd say go for it. If your team runs multiple instances of metworx it's probably a good idea to wait for the patch.
I installed following the install guide here, although I skipped steps 9 and 10 since I'm just running on a single machine. I'm now able to see a speedup going from n=1 to n=2 and to n=4. But I get the error again when I go up to n=8.
You need to ensure you are using the newly built mpich instead of the system version. I just did the following: configure and make mpich in my instance in a local folder, e.g. change
Thanks @yizhang-yiz . That worked. I'm now able to run with n>8 and I see the speedup on the master node. Not sure how to set up the worker nodes to be able to communicate with the master. Do you know how to do that on Metworx? |
Let me write out the procedure and test it on metworx before posting here.
@yizhang-yiz any update on this? I was starting to put together some material for ACoP and it would be nice to get a cluster working on Metworx.
Sorry for the delay. I've only been able to look into this on and off. Let me take another look this weekend.
You can follow the steps below:
In general, as the number of cluster nodes increases, communication latency and other cluster overhead catch up and the speedup tapers off. Also, a small number of big nodes (nodes with more cores each) is likely to be faster than a large number of small nodes. It may take several iterations if one is targeting optimal performance.
Here's another approach to using Torsten with MPI on a Metworx cluster. The attached example uses

In this case you don't need to pre-specify a fixed-size cluster. This approach takes advantage of Metworx's auto-scaling capabilities, i.e., it will automatically launch the required number of compute nodes and then shut them down when they're no longer needed.

The attached zip file unzips into a directory named

The attached plot

One klugy bit about the example is that it requires at least one compute node to be available before it will work. If no compute node is running, I launch one using the
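Since the attachment isn't reproduced here, the following is only a rough sketch of what a per-chain grid-engine submission could look like. The parallel-environment name, slot count, seeds, and file names are all assumptions, not Bill's actual script:

```shell
# Hypothetical SGE-style job script for one chain; the "-pe mpi" environment
# name, the slot count, and the CHAIN/SEED variables are assumptions.
cat > run_chain.sh <<'EOF'
#!/bin/bash
#$ -cwd
#$ -pe mpi 16    # request 16 slots so autoscaling brings up a compute node
mpiexec -n 16 ./twocpt_population sample num_chains=1 \
  data file=twocpt_population.data.R init=twocpt_population.init.R \
  random seed=${SEED} output file=output_${CHAIN}.csv
EOF
# On the cluster, one job per chain would be submitted with something like:
#   for chain in 1 2 3 4; do
#     qsub -N chain_$chain -v CHAIN=$chain,SEED=$((1000 + chain)) run_chain.sh
#   done
```

Because each chain is its own grid job, the scheduler (rather than a hand-written hostfile) decides where it runs, which is what lets autoscaling add and remove compute nodes.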
Thanks @yizhang-yiz! I was able to create the hostfile and run, but it basically pauses once warmup is done and I have to ctrl-C out. I'm running the following command (note that I limited the tree depth to remove that confounder and make testing faster)
Then once I ctrl-C when it hangs at 50% after warmup, I get the following errors for each worker
The main node has 2 vCPUs and there are 4 nodes with 2 vCPUs each. Any idea what could be going on?
What happens if you use `num_chains=4` instead of 1? Edit: I meant to ask: if `mpiexec -n 1` is used with `num_chains=1`, does it still hang?
Thanks @billgillespie! I'll try to work through this. So this method essentially sends each chain to a worker node? Since the maximum number of vCPUs per node is 96, does that limit the method to 96/2 cores per chain? The auto-scaling is attractive though. Does that mean I could have a small master node with, say, only 8 vCPUs, and the 96 vCPU workers will be spun up as needed, or does the master also need to have that many vCPUs?
That depends on the cores requested per chain. For example if I specify 16 cores per chain and use 16 core (32 vCPU) compute nodes, then each chain gets its own compute node. On the other hand if I use 32 core nodes, then 2 chains run on each compute node.
Yes.
The former. In fact the master node just needs enough RAM to handle what the compute nodes return. A 2 vCPU master node is fine as long as it has enough RAM, which it probably doesn't with the available Metworx instances.
Not exactly the same run, but

```
/data/mpich-install/bin/mpiexec -n 4 -f hostfile -bind-to core -l ./twocpt_population sample num_samples=10 num_warmup=500 num_chains=1 data file=./twocpt_population.data.R init=./twocpt_population.init.R random seed=1234 output refresh=10
```

did not hang. Could you try the above run? Also please check the hostfile:

```
$ cat hostfile
ip-10-31-18-160.ec2.internal
ip-10-31-19-18.ec2.internal
ip-10-31-28-213.ec2.internal
ip-10-31-28-85.ec2.internal
```
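One hedged way to generate such a hostfile automatically, assuming the cluster runs a grid engine whose `qhost` command lists one host per line after a header (the exact output format is an assumption, not something confirmed in this thread):

```shell
# Hypothetical helper: extract worker hostnames from qhost-style output.
# Assumes the first three lines are a header row, a separator, and the
# "global" summary row; everything after that is one host per line.
hostfile_from_qhost() {
  awk 'NR > 3 && $1 != "global" { print $1 }'
}
# On a cluster you would run:  qhost | hostfile_from_qhost > hostfile
```

This just keeps the first column (the hostname) of each host row, which matches the hostfile format shown above.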
Thanks @yizhang-yiz! I didn't change anything, but I booted up a new cluster and it's no longer hanging. My main node has 2 vCPUs and 16 GB RAM, and I have 3 worker nodes with 2 vCPUs/16 GB RAM each. I get about a 2x speedup going from

By the way, should the main node be in the hostfile? For me it is not; I only have the 3 worker nodes in there.
The hostfile should only contain the nodes you'd like to use for the Stan run; in most cases, the slave nodes. If you have 3 slave nodes,

In general, the following numbers affect population model scaling:

Assuming subjects have similar computing cost, we usually want
Thanks again @yizhang-yiz for bearing with me. This has been really informative. A couple of last questions:
Oh, so I can have my master node with fewer vCPUs since I'm only using the slave nodes for the Stan run? Also, can I manually scale the number of slave nodes up and down as long as I recreate the hostfile with

So just to make sure I have the rule right regarding
yes
yes
yes
Setting -n 8 is not necessarily more efficient than -n 4, as it also depends on the model. But setting n > 8 will almost surely degrade performance. In addition, 4 worker nodes with 2 cores each are possibly less efficient than a single worker node with 8 cores. For PKPD models, I'd say "fewer large nodes" is likely better than "more small nodes".
Thanks again @yizhang-yiz!
@yizhang-yiz I've been playing with this more and I had a question. I'm now able to run with cmdstanr using the following command:
I notice that with
@yizhang-yiz I thought about this, and one possible workaround is to just take @billgillespie's approach and do a separate

Although I'm not sure how to also tie that in with autoscaling on Metworx. I still have to take a closer look at @billgillespie's approach to see how that works.
Unfortunately Bill's approach is probably the best here. Note that

When you use
I'm currently running Torsten v0.90.0 on Metrum's Metworx platform on a cluster, but I'm not sure if MPI is working. I've added the following to `Torsten/cmdstan/make/local`. I'm able to compile and run the `twocpt_population.stan` example using `mod$sample_mpi()` in cmdstanr, but I don't get any speedup from increasing `n` in the `mpi_args = list("n" = 1)` argument. Furthermore, when I set `n=8` I get the following error:

Any ideas what could be going wrong? @yizhang-yiz