default PE layout for compset NOINYOC at T62_tn14 hangs on betzy #258
Comments
@mvertens Thanks. I would suggest keeping OCN NTASKS at 354, as it would give better throughput. Also, we would be occupying at least 4 nodes on Betzy as a minimum requirement, so we should change ICE NTASKS to 158 (the remaining CPU cores) if that provides better throughput.
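For reference, the arithmetic behind the 158 figure can be sketched as follows. This is only an illustration: it assumes 128 CPU cores per Betzy node, which is not stated in this thread, and the function name `remaining_tasks` is hypothetical rather than part of CIME or NorESM.

```python
# Assumption (not stated in the thread): each Betzy compute node has 128 CPU cores.
CORES_PER_NODE = 128
# Minimum node allocation mentioned in the comment above.
MIN_NODES = 4


def remaining_tasks(ocn_ntasks: int,
                    nodes: int = MIN_NODES,
                    cores_per_node: int = CORES_PER_NODE) -> int:
    """Return the cores left over once OCN has its own processor set."""
    total_cores = nodes * cores_per_node
    if ocn_ntasks > total_cores:
        raise ValueError("OCN tasks exceed the cores available on the allocation")
    return total_cores - ocn_ntasks


# With OCN on 354 tasks, the rest of a 4-node allocation is available for ICE.
ice_ntasks = remaining_tasks(354)
print(ice_ntasks)
```

Under that assumption, 4 nodes give 512 cores, so 354 OCN tasks leave 158 cores, which matches the ICE NTASKS value suggested above.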
@monsieuralok - I agree with your suggestions. I simply backed up to a layout that I knew would work as a first step.
@monsieuralok - also, I'm running a compset with DATM, not CAM, here, and the coupling frequency for all components is 1 hour in this case. So the ocean is not running concurrently in time, yet the PE layout has the ocean on separate PEs. So I'm thinking that a layout with every component running on 354 tasks might be more efficient for this configuration. This would not be the case when running a fully prognostic configuration (i.e. with CAM and CTSM), since in that case the atm/ice/lnd coupling frequency is 1/2 hour and the ocean's is 1 hour, so the ocean does run on its own PEs. I'll explore several PE layouts and see what the optimal set is.
@monsieuralok - in looking at the run sequence for this configuration in more detail it turns out that ICE and OCN do in fact run concurrently and so it makes sense to put them on different processor sets. Your suggestion of using |
It seems the new PE layout introduced in #262 breaks bit-identical backwards compatibility on the
|
@TomasTorsvik - thanks for tracking this down. The problem is that NOINYOC at T62_tn14 hangs on betzy with the previous PE configuration, so backing things up will result in the hang reappearing. I'm confused as to how you were able to verify that master ran with the previous PE table. Did you not run into a hang? That said, when we have regression testing in place we will immediately see answer divergences as part of PRs.
Is this an MCT vs NUOPC problem, maybe? I have been running NOINYOC_T62_tn14 with the old PE layout with NorESM2 (not exactly sure which release) on betzy without hanging.
That's a really interesting observation. The current development NorESM snapshots no longer have MCT support (as of cesm2_3_beta11, CESM dropped support for MCT). So since the development code hangs with the older PE layout, I'm not sure how we can move away from that. This would be another great topic to discuss tomorrow.
Yes, that could be some other issue, because I have also tried NorESM2 with the old PE layout and it was working without hanging. But in the latest CESM, even without NUOPC, it was hanging. Finally, for NorESM2 we check out using tags, so it should not matter; otherwise we should have a separate branch for NUOPC development.
As part of introducing new regression testing for NorESM, I have found that when I create a case for NOINYOC at T62_tn14 on betzy, I get the following PE layout, which results in a hang.
Changing the pe-layout to the following results in a successful model run.
Note - I am not suggesting that the above is an optimal layout for running BLOM%ECO - simply that it fixes the hang.
I am using the noresm_develop_v6 tag in NorESM hub. To duplicate this problem you can do the following:
I will be changing the default tasks in an upcoming PR.