default PE layout for compset NOINYOC at T62_tn14 hangs on betzy #258

Open
mvertens opened this issue Jun 15, 2023 · 9 comments
@mvertens
Contributor

mvertens commented Jun 15, 2023

As part of introducing new regression testing for NorESM, I have found that when I create a case
for NOINYOC at T62_tn14 on betzy, the following default PE layout results in a hang.

Comp  NTASKS  NTHRDS  ROOTPE PSTRIDE
CPL :    126/     1;      0      1
ATM :      1/     1;     96      1
LND :      1/     1;     97      1
ICE :     96/     1;      0      1
OCN :    354/     1;    126      1
ROF :      1/     1;     98      1
GLC :      1/     1;      0      1
WAV :      1/     1;      0      1
ESP :      1/     1;      0      1
ESMF_AWARE_THREADING is False
ROOTPE is with respect to 128.0 tasks per node

Changing the pe-layout to the following results in a successful model run.

Comp  NTASKS  NTHRDS  ROOTPE PSTRIDE
CPL :     96/     1;      0      1
ATM :     96/     1;      0      1
LND :     96/     1;      0      1
ICE :     96/     1;      0      1
OCN :    256/     1;     96      1
ROF :     96/     1;      0      1
GLC :     96/     1;      0      1
WAV :     96/     1;      0      1
ESP :     96/     1;      0      1
ESMF_AWARE_THREADING is False

Note - I am not suggesting that the above is an optimal layout for running BLOM%ECO - simply that it fixes the hang.
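
For reference, the two layouts above can be compared in terms of the MPI task ranges each component occupies. The following is only a sketch: it assumes the table columns mean what their headers say, that a component occupies tasks ROOTPE, ROOTPE + PSTRIDE, and so on, and that a node holds 128 tasks as stated in the first table. The helper names (`parse_layout`, `pe_range`, `summarize`) are made up for illustration and are not part of CIME.

```python
import math
import re

TASKS_PER_NODE = 128  # from "ROOTPE is with respect to 128.0 tasks per node"

# Layout that hangs (copied from the first table above).
HANGING = """
CPL :    126/     1;      0      1
ATM :      1/     1;     96      1
LND :      1/     1;     97      1
ICE :     96/     1;      0      1
OCN :    354/     1;    126      1
ROF :      1/     1;     98      1
GLC :      1/     1;      0      1
WAV :      1/     1;      0      1
ESP :      1/     1;      0      1
"""

# Layout that runs (copied from the second table above).
WORKING = """
CPL :     96/     1;      0      1
ATM :     96/     1;      0      1
LND :     96/     1;      0      1
ICE :     96/     1;      0      1
OCN :    256/     1;     96      1
ROF :     96/     1;      0      1
GLC :     96/     1;      0      1
WAV :     96/     1;      0      1
ESP :     96/     1;      0      1
"""

# One row: "COMP : NTASKS/ NTHRDS; ROOTPE PSTRIDE"
ROW = re.compile(r"(\w+)\s*:\s*(\d+)/\s*(\d+);\s*(\d+)\s+(\d+)")


def parse_layout(text):
    """Return {component: (ntasks, nthrds, rootpe, pstride)}."""
    return {m.group(1): tuple(int(x) for x in m.groups()[1:])
            for m in ROW.finditer(text)}


def pe_range(ntasks, nthrds, rootpe, pstride):
    """First and last MPI task used by a component (NTHRDS is 1 everywhere here)."""
    return rootpe, rootpe + (ntasks - 1) * pstride


def summarize(text):
    """Return (per-component task ranges, total MPI tasks, nodes needed)."""
    ranges = {comp: pe_range(*row) for comp, row in parse_layout(text).items()}
    total = max(last for _, last in ranges.values()) + 1
    return ranges, total, math.ceil(total / TASKS_PER_NODE)


for name, text in (("hanging", HANGING), ("working", WORKING)):
    ranges, total, nodes = summarize(text)
    print(name, "OCN tasks:", ranges["OCN"], "total tasks:", total, "nodes:", nodes)
```

Under these assumptions the hanging layout spans 480 tasks (4 nodes) with the ocean on tasks 126 to 479, while the working layout spans 352 tasks (3 nodes). This only describes the layouts; it does not explain the hang.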

I am using the noresm_develop_v6 tag in NorESM hub. To reproduce this problem, you can do the following:

> git clone https://github.com/NorESMhub/NorESM.git
> cd NorESM
> git checkout noresm_develop_v6
> ./manage_externals/checkout_externals -v
> cd cime/scripts
> ./create_test SMS_Ld1.T62_tn14.NOINYOC.betzy_intel --project <project_number> 

I will be changing the default tasks in an upcoming PR.

@mvertens mvertens self-assigned this Jun 15, 2023
@monsieuralok
Collaborator

@mvertens Thanks. I would suggest keeping OCN NTASKS at 354, as it would have better throughput. Also, we would be occupying at least 4 nodes on Betzy as a minimum requirement. So we should change ICE NTASKS to 158 (the remaining CPU cores) if that provides better throughput.
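
The figure of 158 follows from simple arithmetic, sketched here under the assumption of 128 cores per Betzy node (the value reported in the PE layout output earlier in the thread). The function name `remaining_cores` is illustrative only.

```python
TASKS_PER_NODE = 128  # assumed from the pelayout output quoted earlier
OCN_NTASKS = 354


def remaining_cores(nodes, ocn_ntasks=OCN_NTASKS, tasks_per_node=TASKS_PER_NODE):
    """Cores left for the other components once the ocean has its own tasks."""
    return nodes * tasks_per_node - ocn_ntasks


print(remaining_cores(4))  # 4 * 128 - 354 = 158
print(remaining_cores(3))  # 3 * 128 - 354 = 30
```

With 3 nodes only 30 cores would be left beside the ocean, which is why 4 nodes is the practical minimum once OCN uses 354 tasks.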

@mvertens
Contributor Author

@monsieuralok - I agree with your suggestions. I simply backed up to a layout that I knew would work as a first step.
The key point is that the out-of-the-box layout causes the model to hang. I'm going to look at increasing the ocean and ice tasks next, as you suggested.
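
For anyone experimenting with task counts in an existing case, one possible approach is to generate the corresponding `./xmlchange` commands from a small table. This is a sketch, not the change that will go into the PR: `xmlchange_commands` is a hypothetical helper, and the values shown are simply the ICE and OCN entries of the working layout posted above. `NTASKS_<COMP>` and `ROOTPE_<COMP>` are the usual CIME variable names.

```python
def xmlchange_commands(layout):
    """Turn {component: {variable: value}} into ./xmlchange command strings."""
    commands = []
    for comp, settings in layout.items():
        for var, value in settings.items():
            commands.append(f"./xmlchange {var}_{comp}={value}")
    return commands


# ICE and OCN entries of the working layout from the first comment.
layout = {
    "ICE": {"NTASKS": 96, "ROOTPE": 0},
    "OCN": {"NTASKS": 256, "ROOTPE": 96},
}

for cmd in xmlchange_commands(layout):
    print(cmd)
```

The printed commands are meant to be run from the case directory; the case normally has to be set up again (for example with `./case.setup --reset`) before a changed PE layout takes effect.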

@mvertens
Contributor Author

mvertens commented Jun 15, 2023

@monsieuralok - also, I'm running a compset with DATM rather than CAM here, and the coupling frequency for all components is 1 hour in this case. So the ocean is not running concurrently in time, yet the PE layout has the ocean on separate PEs. So I'm thinking that a layout with everyone running on 354 tasks might be more efficient for this configuration. This would not be the case when running a fully prognostic configuration (i.e. with CAM and CTSM), since in that case the atm/ice/lnd coupling frequency is 1/2 hour and the ocean is 1 hour, so the ocean does run on its own PEs. I'll explore several PE layouts and see what the optimal set is.

@mvertens
Contributor Author

mvertens commented Jun 15, 2023

@monsieuralok - in looking at the run sequence for this configuration in more detail, it turns out that ICE and OCN do in fact run concurrently, so it makes sense to put them on different processor sets. Your suggestion of setting OCN NTASKS to 354, with ICE NTASKS (along with all other components) set to 128, is working and no longer results in a hang.
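
Whether two components sit on different processor sets can be checked directly from NTASKS, ROOTPE and PSTRIDE. The sketch below uses the two layouts posted in the first comment, since the ROOTPE values of the 354/128 layout are not given in this thread. The helper names `pes` and `disjoint` are made up for illustration.

```python
def pes(ntasks, rootpe, pstride=1):
    """Set of MPI task indices a component occupies."""
    return set(range(rootpe, rootpe + ntasks * pstride, pstride))


def disjoint(a, b):
    """True if two (ntasks, rootpe) components share no MPI task."""
    return pes(*a).isdisjoint(pes(*b))


# (NTASKS, ROOTPE) taken from the layouts in the first comment.
hanging = {"CPL": (126, 0), "ICE": (96, 0), "OCN": (354, 126)}
working = {"CPL": (96, 0), "ICE": (96, 0), "OCN": (256, 96)}

for name, layout in (("hanging", hanging), ("working", working)):
    print(name,
          "ICE/OCN disjoint:", disjoint(layout["ICE"], layout["OCN"]),
          "CPL/ICE disjoint:", disjoint(layout["CPL"], layout["ICE"]))
```

In both posted layouts ICE and OCN occupy disjoint task sets, so this property alone does not distinguish the hanging layout from the working one.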

@TomasTorsvik
Contributor

It seems the new PE layout introduced in #262 breaks bit-identical backwards compatibility on the master branch when running on betzy. Some options for dealing with this:

  • revert back to previous PE table. Make a new PE table entry for new NorESM configurations.
  • keep the current PE table. Introduce a CMIP6 table for backward compatibility.
  • keep the current PE table, and allow master to diverge from previous state.

@mvertens
Contributor Author

mvertens commented Aug 8, 2023

@TomasTorsvik - thanks for tracking this down. The problem is that NOINYOC at T62_tn14 hangs on betzy with the previous PE configuration. So backing things up will result in the hang reappearing. I'm confused as to how you were able to verify that master ran with the previous PE table. Did you not run into a hang?

That said - when we have regression testing in place we will immediately see answer divergences as part of PRs.

@JorgSchwinger
Contributor

Is this an MCT vs NUOPC problem, maybe? I have been running NOINYOC_T62_tn14 with the old PE layout with NorESM2 (not exactly sure which release) on betzy without hanging.

@mvertens
Contributor Author

mvertens commented Aug 8, 2023

That's a really interesting observation. The current development NorESM snapshots no longer have MCT support (as of cesm2_3_beta11 CESM dropped support for MCT). So since the development code hangs with the older PE layout - I'm not sure how we can move away from that. This would be another great topic to discuss tomorrow.

@monsieuralok
Collaborator

Yes, that could be some other issue, because I have also tried with noresm2 and the old PE layout worked without hanging. But in the latest CESM, even without NUOPC, it was hanging.

Finally, since we check out noresm2 using tags, it should not matter; otherwise we should have a separate branch for NUOPC development.

@gold2718 gold2718 moved this to Todo in NorESM Development Dec 12, 2023
@gold2718 gold2718 removed the status in NorESM Development Jan 18, 2024
@gold2718 gold2718 moved this to Todo in NorESM Development Jan 18, 2024