default PE layout for compset NOINYOC at T62_tn14 hangs on betzy #258

mvertens · 2023-06-15T11:31:21Z

As part of introducing new regression testing for NorESM, I have found that when I create a case
for NOINYOC at T62_tn14 on betzy, the following default PE layout results in a hang.

Comp  NTASKS  NTHRDS  ROOTPE PSTRIDE
CPL :    126/     1;      0      1
ATM :      1/     1;     96      1
LND :      1/     1;     97      1
ICE :     96/     1;      0      1
OCN :    354/     1;    126      1
ROF :      1/     1;     98      1
GLC :      1/     1;      0      1
WAV :      1/     1;      0      1
ESP :      1/     1;      0      1
ESMF_AWARE_THREADING is False
ROOTPE is with respect to 128.0 tasks per node
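
As a rough aid for reading the layout above, the following sketch parses the rows and works out how many PEs and nodes the layout spans. It assumes one task per core, PSTRIDE of 1 for every component, and 128 tasks per node as reported above; the `parse_layout` helper is illustrative and not part of CIME.

```python
import math
import re

# The hanging layout, copied from the pelayout output above.
LAYOUT = """\
CPL :    126/     1;      0      1
ATM :      1/     1;     96      1
LND :      1/     1;     97      1
ICE :     96/     1;      0      1
OCN :    354/     1;    126      1
ROF :      1/     1;     98      1
GLC :      1/     1;      0      1
WAV :      1/     1;      0      1
ESP :      1/     1;      0      1
"""

ROW = re.compile(r"^(\w+)\s*:\s*(\d+)/\s*(\d+);\s*(\d+)\s+(\d+)\s*$")

def parse_layout(text):
    """Return {component: (ntasks, nthrds, rootpe, pstride)}."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows[match.group(1)] = tuple(int(g) for g in match.groups()[1:])
    return rows

layout = parse_layout(LAYOUT)

# With PSTRIDE = 1, a component occupies PEs rootpe .. rootpe + ntasks - 1.
total_pes = max(rootpe + ntasks for ntasks, _, rootpe, _ in layout.values())
nodes = math.ceil(total_pes / 128)

print(total_pes, nodes)  # 480 4
```

So the default layout spans 480 PEs, which rounds up to 4 nodes on betzy.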

Changing the pe-layout to the following results in a successful model run.

Comp  NTASKS  NTHRDS  ROOTPE PSTRIDE
CPL :     96/     1;      0      1
ATM :     96/     1;      0      1
LND :     96/     1;      0      1
ICE :     96/     1;      0      1
OCN :    256/     1;     96      1
ROF :     96/     1;      0      1
GLC :     96/     1;      0      1
WAV :     96/     1;      0      1
ESP :     96/     1;      0      1
ESMF_AWARE_THREADING is False
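
One way to apply the working layout to an existing case is through CIME's `xmlchange` tool, using the `NTASKS_<COMP>` and `ROOTPE_<COMP>` variables. The sketch below only builds the command strings; the helper name is hypothetical, and it is an assumption here that the commands are run from the case directory and followed by a case setup reset.

```python
# Working layout from above, as {component: (NTASKS, ROOTPE)}.
WORKING = {
    "CPL": (96, 0),
    "ATM": (96, 0),
    "LND": (96, 0),
    "ICE": (96, 0),
    "OCN": (256, 96),
    "ROF": (96, 0),
    "GLC": (96, 0),
    "WAV": (96, 0),
    "ESP": (96, 0),
}

def xmlchange_commands(layout):
    """Build one xmlchange command per component (illustrative helper)."""
    return [
        f"./xmlchange NTASKS_{comp}={ntasks},ROOTPE_{comp}={rootpe}"
        for comp, (ntasks, rootpe) in layout.items()
    ]

commands = xmlchange_commands(WORKING)
for cmd in commands:
    print(cmd)
```

The OCN entry, for example, becomes `./xmlchange NTASKS_OCN=256,ROOTPE_OCN=96`.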

Note - I am not suggesting that the above is an optimal layout for running BLOM%ECO - simply that it fixes the hang.

I am using the noresm_develop_v6 tag in NorESMhub. To reproduce this problem you can do the following:

> git clone https://github.com/NorESMhub/NorESM.git
> cd NorESM
> git checkout noresm_develop_v6
> ./manage_externals/checkout_externals -v
> cd cime/scripts
> ./create_test SMS_Ld1.T62_tn14.NOINYOC.betzy_intel --project <project_number>

I will be changing the default tasks in an upcoming PR.


monsieuralok · 2023-06-15T11:51:14Z

@mvertens Thanks, I would suggest keeping OCN NTASKS at 354, as it would have better throughput. Also, we would be occupying at least 4 nodes on Betzy as a minimum requirement. So we should change ICE NTASKS to 158 (the CPU cores still remaining) if that provides better throughput.
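
The 158 figure follows from simple node arithmetic, sketched below under the assumption of 128 cores per Betzy node (the "tasks per node" value in the pelayout output) and one MPI task per core. The function name is illustrative only.

```python
TASKS_PER_NODE = 128  # from the pelayout output: "128.0 tasks per node"

def remaining_cores(nodes, ocn_tasks, tasks_per_node=TASKS_PER_NODE):
    """Cores left for the other components once OCN has its own PEs."""
    return nodes * tasks_per_node - ocn_tasks

left_on_3_nodes = remaining_cores(3, 354)  # 384 - 354
left_on_4_nodes = remaining_cores(4, 354)  # 512 - 354

print(left_on_3_nodes, left_on_4_nodes)  # 30 158
```

With OCN at 354 tasks, three nodes would leave only 30 cores for the ice model, so four nodes are needed, and those leave 158 cores free.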

mvertens · 2023-06-15T12:09:38Z

@monsieuralok - I agree with your suggestions. I simply backed up to a layout that I knew would work as a first step.
The key point is that the out-of-the-box layout causes the model to hang. I'm going to look at increasing the ocean and ice tasks next, as you suggested.

mvertens · 2023-06-15T12:14:54Z

@monsieuralok - also, I'm running a compset with DATM, not CAM, here, and the coupling frequency for all components is 1 hour in this case. So the ocean is not running concurrently in time, and yet the PE layout has the ocean on separate PEs. So I'm thinking that a layout with every component running on 354 tasks might be more efficient for this configuration. This would not be the case when running a fully prognostic configuration (i.e. with CAM and CTSM), since in that case the atm/ice/lnd coupling frequency is 1/2 hour and the ocean's is 1 hour, so the ocean does run on its own PEs. I'll explore several PE layouts and see which is optimal.

mvertens · 2023-06-15T15:02:46Z

@monsieuralok - in looking at the run sequence for this configuration in more detail, it turns out that ICE and OCN do in fact run concurrently, so it makes sense to put them on different processor sets. Your suggestion of setting
OCN NTASKS to 354, with ICE NTASKS (along with all other components) at 128, is working and no longer results in a hang.
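
Whether two components can run concurrently depends on their PE ranges being disjoint. The sketch below checks this for the original default layout and for a revised one; the helper names are illustrative, and ROOTPE_OCN = 128 in the revised layout is an assumption, since the thread does not state it.

```python
def pe_range(ntasks, rootpe, pstride=1):
    """PEs occupied by a component with the given NTASKS, ROOTPE, PSTRIDE."""
    return range(rootpe, rootpe + ntasks * pstride, pstride)

def share_pes(a, b):
    """True if two components overlap on at least one PE."""
    return not set(a).isdisjoint(b)

# Original default layout (values from the first pelayout listing).
cpl = pe_range(126, 0)
ice = pe_range(96, 0)
ocn = pe_range(354, 126)

print(share_pes(ice, ocn))  # False: ICE and OCN can run side by side
print(share_pes(cpl, ice))  # True: CPL and ICE share PEs 0..95

# Revised layout: ICE on 128 tasks, OCN on 354.
# ROOTPE_OCN = 128 is assumed here, not taken from the thread.
ice_new = pe_range(128, 0)
ocn_new = pe_range(354, 128)
revised_disjoint = not share_pes(ice_new, ocn_new)
```

Under that assumption the revised layout also keeps ICE and OCN on separate PEs.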

TomasTorsvik · 2023-08-07T20:21:40Z

It seems the new PE layout introduced in #262 breaks bit-identical backward compatibility on the master branch when running on betzy. Some options for how to deal with this:

- Revert to the previous PE table, and make a new PE table entry for new NorESM configurations.
- Keep the current PE table, and introduce a CMIP6 table for backward compatibility.
- Keep the current PE table, and allow master to diverge from its previous state.

mvertens · 2023-08-08T08:59:27Z

@TomasTorsvik - thanks for tracking this down. The problem is that NOINYOC at T62_tn14 hangs on betzy with the previous PE configuration, so reverting would make the hang reappear. I'm confused as to how you were able to verify that master ran with the previous PE table. Did you not run into a hang?

That said - when we have regression testing in place we will immediately see answer divergences as part of PRs.

JorgSchwinger · 2023-08-08T10:49:16Z

Is this an MCT vs NUOPC problem, maybe? I have been running NOINYOC_T62_tn14 with the old PE layout with NorESM2 (not exactly sure which release) on betzy without hanging.

mvertens · 2023-08-08T10:54:51Z

That's a really interesting observation. The current development NorESM snapshots no longer have MCT support (as of cesm2_3_beta11 CESM dropped support for MCT). So since the development code hangs with the older PE layout - I'm not sure how we can move away from that. This would be another great topic to discuss tomorrow.

monsieuralok · 2023-08-08T10:58:09Z

Yes, that could be some other issue, because I have also tried with NorESM2, and the old PE layout worked without hanging. But in the latest CESM, it was hanging even without NUOPC.

Finally, since we check out NorESM2 using tags, this should not matter; otherwise we should have a separate branch for NUOPC development.

mvertens self-assigned this Jun 15, 2023

TomasTorsvik added this to NorESM Development Jun 29, 2023

mvertens added a commit to mvertens/BLOM that referenced this issue Jul 26, 2023

update to PE layout to resolve issue NorESMhub#258

de96479

mvertens mentioned this issue Jul 26, 2023

add BLOM regression test functionality #262

Merged

TomasTorsvik mentioned this issue Aug 7, 2023

refactor buildnml to use the CIME-CCS python based namelist generation capabilities #263

Merged

gold2718 moved this to Todo in NorESM Development Dec 12, 2023

gold2718 removed the status in NorESM Development Jan 18, 2024

gold2718 moved this to Todo in NorESM Development Jan 18, 2024
