Reading "wide" `t-route` flow velocity depth `csv`'s has high performance penalty #204

aaraney · 2024-10-02T15:22:34Z

ngen-cal/python/ngen_cal/src/ngen/cal/ngen_hooks/ngen_output.py

Line 181 in 2823b2c

df = pd.read_csv(filepath, index_col=0)

ngen.cal supports reading t-route output in a variety of formats (see #153). One supported format is csv_output. This format contains simulated flow, velocity, and depth values for each waterbody for each t-route timestep. For example:

,"(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"
2420800,0.0,0.0,0.0,0.0,0.0,0.0

t-route csv_output configuration

output_parameters:
  csv_output:
    csv_output_folder: output/

Crucially, this means the longer the simulation time the wider each row will be.
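To make the column layout concrete, here is a minimal sketch that reads the two-timestep sample above and turns the header labels into a (timestep, variable) column index. It assumes each header is the string form of a Python tuple, as in the sample, and that `q`, `v`, and `d` stand for flow, velocity, and depth. The names `read_wide` and `WIDE_CSV` are made up for illustration and are not part of ngen.cal.

```python
import ast
import io

import pandas as pd

# Sample copied from the example above: one waterbody, two timesteps.
WIDE_CSV = ""","(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"
2420800,0.0,0.0,0.0,0.0,0.0,0.0
"""


def read_wide(buffer) -> pd.DataFrame:
    """Hypothetical helper: read a wide csv_output file into a DataFrame."""
    df = pd.read_csv(buffer, index_col=0)
    # Each header label looks like "(0, 'q')"; parse it back into a tuple.
    tuples = [ast.literal_eval(label) for label in df.columns]
    df.columns = pd.MultiIndex.from_tuples(tuples, names=["timestep", "variable"])
    return df


wide = read_wide(io.StringIO(WIDE_CSV))
# Three columns (q, v, d) per timestep, so the width is 3 * number of timesteps.
n_timesteps = wide.columns.get_level_values("timestep").nunique()
# Select only the assumed flow ("q") columns, one per timestep.
flow = wide.xs("q", axis=1, level="variable")
print(wide.shape, n_timesteps)
```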

CSV parsers such as pandas' C parser or Arrow's CSV parser are optimized for reading long CSV files rather than wide ones. Both of these parsers use a "chunking" approach: they allocate a buffer, read rows from the CSV file into the buffer until it is full, and then process the data. However, when a row is sufficiently long, it cannot fit fully into the buffer. Because of this and other implementation-specific details, parsing and deserializing these wide CSV files into a pandas.DataFrame can take on the order of minutes. In a local test I found that a CSV file with 3 years of 5-minute timestep data (315360 timesteps) took roughly 3.5 minutes to deserialize into a pandas DataFrame on an M2 Pro MacBook.
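For a sense of scale, the row width in that test can be worked out directly. This is a small sketch that assumes 365-day years (which reproduces the 315360 figure) and three values (flow, velocity, depth) per timestep; `wide_row_width` is a hypothetical helper name.

```python
def wide_row_width(years: int, timestep_minutes: int, variables: int = 3):
    """Return (timesteps, values per row) for a wide csv_output file."""
    # Assumes 365-day years, matching the timestep count quoted above.
    timesteps = years * 365 * 24 * 60 // timestep_minutes
    return timesteps, timesteps * variables


timesteps, values_per_row = wide_row_width(years=3, timestep_minutes=5)
print(timesteps, values_per_row)  # 315360 946080
```

So each waterbody's row carries close to a million values, not counting the index column.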

One potential solution to this is to disable pd.read_csv's low_memory flag:

df = pd.read_csv(filepath, index_col=0, engine="c", low_memory=False)

In local testing it took ~9 seconds to read and deserialize the same file.
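To try this comparison on your own machine, something like the following sketch could work. It writes a small synthetic wide file (all zeros, far narrower than the 3-year case) and times `pd.read_csv` with and without `low_memory=False`. The file name, sizes, and helper names are assumptions for illustration; timings will vary and are not expected to match the numbers above.

```python
import csv
import tempfile
import time
from pathlib import Path

import pandas as pd


def write_wide_csv(path, n_waterbodies: int, n_timesteps: int) -> None:
    """Write a synthetic wide file shaped like the csv_output sample."""
    header = [""] + [str((t, var)) for t in range(n_timesteps) for var in ("q", "v", "d")]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for i in range(n_waterbodies):
            # Made-up waterbody ids and all-zero values; only the shape matters here.
            writer.writerow([2420800 + i] + [0.0] * (3 * n_timesteps))


def timed_read(path, **kwargs):
    """Read the file with the C engine and return (DataFrame, elapsed seconds)."""
    start = time.perf_counter()
    df = pd.read_csv(path, index_col=0, engine="c", **kwargs)
    return df, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "flowveldepth.csv"  # hypothetical file name
    write_wide_csv(path, n_waterbodies=5, n_timesteps=2_000)
    default_df, default_s = timed_read(path)
    fast_df, fast_s = timed_read(path, low_memory=False)

print(f"low_memory=True:  {default_s:.3f}s")
print(f"low_memory=False: {fast_s:.3f}s")
```

Increase `n_timesteps` gradually to see how the gap changes as rows get wider.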

For now, my general recommendation is to use t-route's stream_output instead of csv_output if possible. stream_output still supports csv, but it uses a long format rather than a wide format, so it does not suffer the same performance penalty. See the most up-to-date examples of this in the t-route repo or in #153.
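To illustrate what "long" means here, the sketch below streams the wide sample from this issue into one record per waterbody per timestep using only the standard library. The output layout (waterbody id, timestep, q, v, d) is an assumption chosen for illustration and is not necessarily the schema t-route's stream_output writes; `wide_to_long_rows` is a hypothetical helper.

```python
import ast
import csv
import io

# Sample copied from the csv_output example earlier in this issue.
WIDE_CSV = ""","(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"
2420800,0.0,0.0,0.0,0.0,0.0,0.0
"""


def wide_to_long_rows(wide_file):
    """Yield (waterbody_id, timestep, q, v, d) tuples from a wide csv file."""
    reader = csv.reader(wide_file)
    header = next(reader)
    # Skip the empty index header, then parse labels such as "(0, 'q')".
    keys = [ast.literal_eval(label) for label in header[1:]]
    for row in reader:
        per_timestep = {}
        for (timestep, variable), value in zip(keys, row[1:]):
            per_timestep.setdefault(timestep, {})[variable] = float(value)
        for timestep, values in per_timestep.items():
            yield (int(row[0]), timestep, values["q"], values["v"], values["d"])


rows = list(wide_to_long_rows(io.StringIO(WIDE_CSV)))
for record in rows:
    print(record)
```

In this layout the number of columns stays fixed and only the number of rows grows with simulation length, which is the shape chunking parsers handle well.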

aaraney · 2024-10-02T15:35:01Z

Credit @ajkhattak for reporting this! Thanks 🎉

hellkite500 · 2024-10-02T16:03:38Z

I think we can add this to the t-route config settings as a user flag.

aaraney · 2024-10-02T17:19:45Z

@hellkite500, the more I work with the csv flow, depth, velocity output, the more inclined I am to drop support for it and move it to an example plugin.

aaraney added the ngen.cal Related to ngen.cal package label Oct 2, 2024

aaraney self-assigned this Oct 2, 2024

aaraney added the performance Something is slow label Oct 2, 2024
