Fix file-read performance with parallel call #713

Open
jameshalgren opened this issue Dec 14, 2023 · 2 comments

Comments

@jameshalgren (Contributor) commented Dec 14, 2023

Running T-Route from Ngen inputs takes far too long due to the back-and-forth reading and writing of qlateral inputs.

There are reasonably sane explanations for how things got that way.

This line seems like a ripe option for improvement with a relatively focused fix:
https://github.com/NOAA-OWP/t-route/blame/07d511bb4f93aed23b9031d0b464b17bdf8f2060/src/troute-network/troute/AbstractNetwork.py#L870C9-L870C9

PR forthcoming.

@jameshalgren (Contributor, Author) commented

@JoshCu

@JoshCu (Contributor) commented Dec 18, 2023

I started looking into this and have a possibly too-invasive fix for it. I'm looking into some less complex modifications that maintain most of the speedup without the additional complexity.

https://github.com/JoshCu/t-route/tree/optimise_parquet

The bottleneck

Currently there are number_of_catchments * number_of_timesteps file I/O operations, which comes to ~51k file writes for the ngiab demo data of ~220 catchments over 240 timesteps.

The required functionality

When building the qlat array before the routing calculations are run, the nexus output files from ngen are one file per catchment containing timestamped data. They need to be pivoted to one file per timestep containing the data for every catchment at that timestep.
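For illustration, here is a minimal sketch of that pivot using pandas, assuming per-catchment CSV outputs named like cat-*.csv with a timestamp index and a q_lateral column (the real ngen nexus file naming and schema may differ):

```python
# Hypothetical sketch of the pivot: one-file-per-catchment time series
# -> one small table per timestep. File naming and column names are assumptions.
from pathlib import Path
import pandas as pd

def pivot_nexus_output(nexus_dir: Path) -> dict:
    """Return {timestamp: DataFrame of q_lateral indexed by catchment id}."""
    per_catchment = {}
    for f in sorted(nexus_dir.glob("cat-*.csv")):
        df = pd.read_csv(f, index_col=0, parse_dates=True)
        per_catchment[f.stem] = df["q_lateral"]  # assumed column name

    # Wide frame: rows = timestamps, columns = catchments
    wide = pd.DataFrame(per_catchment)

    # One small frame per timestep, ready to be written out per timestep
    return {ts: row.to_frame(name="q_lateral") for ts, row in wide.iterrows()}
```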

Considerations for the fix

  • Can't use Python modules that aren't allowed on the supercomputer
  • Have to consider memory limitations when reading files
  • Needs to be backwards compatible

Room for optimization

  • Perform the file reading and writing in parallel
  • Write output to fewer files

The fix I have addresses both of those issues, but the former uses Dask and the latter adds a lot of extra code to handle grouped-up Parquet files. I'm going to try a lighter version of the change to get some performance improvements quickly without getting into the Parquet-file weeds.
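As a point of comparison, a lighter-weight parallel read could lean on the standard library's concurrent.futures on top of pandas (no Dask). This is just a sketch under the same assumed file layout as the pivot sketch above, not code from the linked branch:

```python
# Sketch: parallel per-catchment reads with a thread pool (standard library),
# assuming the same hypothetical cat-*.csv layout as above.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import pandas as pd

def _read_one(path: Path) -> tuple[str, pd.Series]:
    df = pd.read_csv(path, index_col=0, parse_dates=True)
    return path.stem, df["q_lateral"]  # assumed column name

def read_all_parallel(nexus_dir: Path, workers: int = 8) -> pd.DataFrame:
    files = sorted(nexus_dir.glob("cat-*.csv"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(_read_one, files))
    return pd.DataFrame(results)  # rows = timestamps, columns = catchments
```

Threads keep everything in one process, which also makes the memory footprint easier to reason about than a multi-process approach.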

Parquet files

From some cursory research, it looks like optimal Parquet files are anywhere from ~200 MB to 1 GB.
Currently, with one file per timestep, the largest a file would ever be (worst case) is 2.6 MB, even with every single catchment in the United States. It could be a good idea to make each Parquet file cover a range of timesteps instead; that would allow a massive reduction in file I/O. The current code looks like it has fragments designed to handle reading multiple Parquet files to get the timesteps needed, but nothing end to end yet.
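As a rough illustration of the timestep-range idea (not code from the branch), the wide timestamps-by-catchments table could be chunked into blocks of timesteps, with one Parquet file per block; the block size and file naming below are placeholders:

```python
# Sketch: one Parquet file per block of timesteps instead of one per timestep.
# `wide` is the timestamps-by-catchments frame from the pivot sketch above;
# block size and file naming are illustrative placeholders.
from pathlib import Path
import pandas as pd

def write_blocked_parquet(wide: pd.DataFrame, out_dir: Path, block_size: int = 100) -> None:
    """Write one Parquet file per `block_size` timesteps."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(wide), block_size):
        block = wide.iloc[start:start + block_size]
        end = start + len(block) - 1
        block.to_parquet(out_dir / f"qlat_{start:06d}_{end:06d}.parquet")  # needs pyarrow

def read_blocked_parquet(out_dir: Path) -> pd.DataFrame:
    """Read all blocks back into a single timestamps-by-catchments frame."""
    parts = [pd.read_parquet(p) for p in sorted(out_dir.glob("qlat_*.parquet"))]
    return pd.concat(parts)
```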
