Parcels runs slowly with large velocity fields/weird lack of scaling with different time-steps #1404
Replies: 6 comments
-
Tested again, this time with the inbuilt RK4 and mixing kernels, releasing all particles at once: similar result, so it's unlikely to be a kernel issue. I also tried pointing the lat, lon and depth grid data at separate files instead of reading them from the first u, v, wt and T files, in case attempting to read many 35GB files was the issue; again, no difference. Runs with monthly data do scale with dt, so it's not something wrong with the interpolation and advection routines. I still need to try splitting the daily data from the 30-day files into 1-day files, but that might not be practical for the full-scale run given the available storage space, or the inevitable overhead if I process files "just in time". Otherwise it might be because the netcdf files use compression (deflate level 5 pops out when I bash the files with ncdump -sh), or maybe I need to match the chunking provided to the fieldset with the chunking used by the netcdf files? @erikvansebille Any advice on working with large datasets? Any tricks to chunking?
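For reference, the same chunking and compression information that `ncdump -sh` prints can also be read from Python with netCDF4, which makes it easy to compare against the fieldset chunking. A minimal sketch, assuming hypothetical file and variable names (not the actual ACCESS-OM2 ones):

```python
# Sketch only: inspect the on-disk chunking and compression of a netCDF variable.
# 'ocean_daily_3d_u.nc' and 'u' are illustrative placeholders.
import netCDF4

with netCDF4.Dataset('ocean_daily_3d_u.nc') as ds:
    var = ds.variables['u']
    print(var.chunking())   # list of chunk lengths per dimension, or 'contiguous'
    print(var.filters())    # shows the deflate (zlib) level, shuffle filter, etc.
```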
-
Thanks for reporting this, @croachutas. I don't have too much time to delve into this right now (busy working on #1402), but could it be that you are simply running with too few particles for the particle-dependent processes (e.g. advection) to make a difference, so that your job runtime is completely dominated by reading the hydrodynamic input data? If you up the particle count by a factor of 100 (to 5 million particles), do you then still see the same scaling as in the graph at the top? By the way, have you looked at @CKehl's paper on the efficiency and scaling of Parcels? See here for the article Efficiently simulating Lagrangian particles in large-scale ocean flows — Data structures and their impact on geophysical applications. That might provide some background/guidance too.
-
I would certainly play with the chunking of the fieldset to match that of the netCDF file (or be an even subdivision or multiple of it -- for example, if the netCDF is chunked 1,512,512, try having Parcels use 1,256,256 or 2,1024,1024). I ended up rechunking my netCDF files for optimal performance. This is straightforward, and I can send you a script to do so if you wish.
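Not the script offered above, but a minimal sketch of one way to rechunk (and lighten the compression of) a netCDF variable with xarray; the file name, variable name, chunk sizes and deflate level are illustrative assumptions:

```python
# Sketch only: rewrite a netCDF file with new on-disk chunking via the netCDF4 backend.
import xarray as xr

ds = xr.open_dataset('ocean_daily_3d_u.nc')            # assumed input filename
encoding = {'u': {'chunksizes': (1, 7, 512, 512),      # (time, depth, lat, lon) chunks
                  'zlib': True, 'complevel': 1}}       # lighter deflate than level 5
ds.to_netcdf('ocean_daily_3d_u_rechunked.nc', encoding=encoding)
```

The `nccopy` utility (with its chunking and deflate options) can do much the same job from the command line.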
-
Hi all, I've run further tests with a larger number of particles, and yep, the scaling for the daily fields is correct once input I/O is no longer the dominant constraint. I've also played around with chunking and found that chunking in line with the input netcdf files is more efficient than my initial arbitrary chunking (roughly half the runtime). I'll continue playing with the exact chunking dimensions to see if I can get things more efficient.
-
Hey @erikvansebille, I see the paper mentions autochunking, with an apparently fairly significant impact on runtime. How do you turn autochunking on? Just wondering if it might improve performance. Also, I've noticed that runtime per day increases later in the runs (understandably, as the particles are more dispersed and so more chunks need to be read in at a time). Are there any recommendations on how to handle this? For instance, pause, change the fieldset chunking, then resume?
-
Yes, you can also control the chunking yourself, by passing a chunksize specification when you build the fieldset (a sketch is below); examples for various hydrodynamic datasets are in docs/examples/example_dask_chunk_OCMs.py. I wouldn't change the chunking halfway. If you also use MPI you could rebalance the particle sets halfway (creating a new ParticleSet from the old locations?) to make sure the particles are somewhat better aligned with the chunks; but this is fiddly and I'm not sure how much it will improve performance. See also https://docs.oceanparcels.org/en/latest/examples/documentation_MPI.html#Future-developments:-load-balancing for a short discussion on this.
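A minimal sketch of what such a chunksize specification can look like, loosely following the pattern in docs/examples/example_dask_chunk_OCMs.py; the file names, variable names, dimension names and chunk lengths below are placeholders for MOM5/ACCESS-OM2-style output, not taken from this discussion:

```python
# Sketch only: all names and chunk lengths are illustrative placeholders.
from parcels import FieldSet

filenames = {'U': 'ocean_daily_3d_u.nc', 'V': 'ocean_daily_3d_v.nc'}
variables = {'U': 'u', 'V': 'v'}
dimensions = {'time': 'time', 'depth': 'st_ocean', 'lat': 'yu_ocean', 'lon': 'xu_ocean'}

# Map each Parcels dimension to (netCDF dimension name, dask chunk length); ideally
# these lengths match, or evenly divide, the chunking of the files on disk.
chunksize = {'time': ('time', 1),
             'depth': ('st_ocean', 7),
             'lat': ('yu_ocean', 512),
             'lon': ('xu_ocean', 512)}

fieldset = FieldSet.from_netcdf(filenames, variables, dimensions,
                                chunksize=chunksize)  # chunksize='auto' enables dask auto-chunking
```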
-
G'day,
It's been a few years since I've used Parcels (relocating from LOCEAN to MetOcean Solutions saw me shift to using OpenDrift), but I've moved on from MetOcean (back) to UTAS and I'm back using Parcels now (due mostly to OpenDrift's pre-built reader routines not handling non-CF-compliant netcdf files, and the utter pain of either developing a custom reader or reprocessing a large number of files into CF-compliant netcdfs...).
I'm using daily model output from the ACCESS-OM2-01 0.1-degree global model (which uses MOM5 code for the ocean component) as part of a study of trans-Tasman larval connectivity of a lobster species. I've got it set up with 189 spawning sites spitting out 10 particles per day for a month (giving a total of about 50,000 particles) and then tracking the particles for quite some time (about a year in the "final" version, but just 30 days for an initial test case). I've cobbled together a custom kernel to try to work around the known repeatdt zarr chunking issue #1387, which I had initially thought was why things were running slowly: spawn all particles on day 0 but only allow each particle to move once we're at the right day, plus some crude debeaching behaviour (if the next step takes the particle onto land, we instead kick it backwards a quarter of a step). Needless to say, the velocity fields (u, v, w) I'm using are stored in some rather large files (~35GB for each of u, v and w per 30 days of data, plus ~15GB for the temperature data, so about 1.6TB for a one-year run, of course not all loaded at once).
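A rough, hypothetical sketch of that delayed-release gate (not the actual kernel attached below): the `release_time` variable name is made up, and plain Euler stepping stands in for RK4 to keep it short.

```python
# Sketch only: "spawn everything on day 0, but only move after the release day".
import numpy as np
from parcels import JITParticle, Variable

class SpawningParticle(JITParticle):
    # Assumed custom variable: time (seconds, on the fieldset's time axis) at which
    # this particle is allowed to start moving.
    release_time = Variable('release_time', dtype=np.float32, initial=0.)

def DelayedAdvection(particle, fieldset, time):
    # Particles exist from day 0 but stay put until their release time is reached.
    if time >= particle.release_time:
        (u, v) = fieldset.UV[time, particle.depth, particle.lat, particle.lon]
        particle.lon += u * particle.dt
        particle.lat += v * particle.dt
        # A debeaching check (e.g. kick the particle a quarter-step backwards if the
        # new position lands on land) would slot in here.
```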
But I'm still finding that things are running rather slowly. In itself that's an annoyance rather than a serious problem: I can accept a longer runtime, or I can use longer timesteps (not ideal, but given year-long tracking you've gotta accept some compromises). But I noticed that when I change the timestep dt things aren't behaving as you'd expect:
You'd expect that in an idealised case (no I/O cost) doubling dt would roughly halve the runtime, but nope! Of course, realistically there are I/O costs, but even then doubling the length of the timestep should reduce the total runtime.
I've found this when running both on NCI's Gadi HPC and on my personal laptop (the source for the plot above). I don't think it's a memory issue (if it were, I'd expect Gadi to kill the process when it exceeded the 8GB I've requested for tests, rather than just running slowly, and I know from looking at memory usage on my laptop that the test case only uses about 1-2GB of RAM).
I've tried running my code without writing data to an output file and that makes little difference, so it doesn't look like writing output is the bottleneck. That leaves either something to do with reading the input files or my custom kernel.
I've set up indexing and chunk size to try to keep the amount of data loaded manageable, but that doesn't seem to have helped:
indices = {'lon': range(500, 1100), 'lat': range(500, 1100)}
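For context, a minimal sketch of how an indices dict like the one above is typically handed to the fieldset constructor alongside a chunksize; everything except the indices line is a placeholder, not taken from the attached script:

```python
# Sketch only: filenames, variables, dimensions and chunk lengths are placeholders.
from parcels import FieldSet

filenames = {'U': 'ocean_daily_3d_u.nc', 'V': 'ocean_daily_3d_v.nc'}
variables = {'U': 'u', 'V': 'v'}
dimensions = {'time': 'time', 'depth': 'st_ocean', 'lat': 'yu_ocean', 'lon': 'xu_ocean'}
indices = {'lon': range(500, 1100), 'lat': range(500, 1100)}  # same subdomain as above

fieldset = FieldSet.from_netcdf(filenames, variables, dimensions,
                                indices=indices,
                                chunksize={'time': ('time', 1),
                                           'lat': ('yu_ocean', 600),
                                           'lon': ('xu_ocean', 600)})
```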
It could be something with the custom kernel, but tests with monthly rather than daily fields seemed to run much quicker, suggesting that probably isn't it.
Have any of you experienced similar issues when running experiments with large input data files? Is there anything obviously wrong with the kernel?
Full code provided in zip file:
ACCESS_01_tracking_Au_release_delayed_particle_start_benchmarking_no_output.py.zip