From 09109fb6da4556add8b8c4634ffab5e806041459 Mon Sep 17 00:00:00 2001
From: Hagen Wierstorf
Date: Fri, 9 Feb 2024 16:53:59 +0100
Subject: [PATCH] Add memory consumption to deps save/load benchmark (#366)

---
 benchmarks/README.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/benchmarks/README.md b/benchmarks/README.md
index cc7aada5..313b464e 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -145,3 +145,55 @@ with `pandas.read_csv()`.
 | \-\-\-\-> pa.Table -> pd.DataFrame[pyarrow] | 0.109 | | 0.069 |
 | \-\-\-\-> pa.Table -> pd.DataFrame[pyarrow] -> pd.DataFrame[object] | 0.335 | | |
 | \-\-\-\-> pa.Table | 0.049 | | 0.051 |
+
+
+### File sizes
+
+Storing a dependency table with 1,000,000 entries resulted in:
+
+* 102 MB for csv
+* 131 MB for pickle
+* 20 MB for parquet
+
+When zipped, all files can be reduced by a further 50%.
+
+
+### Memory consumption
+
+Besides the execution time,
+memory consumption might also be considered.
+We use [memray](https://github.com/bloomberg/memray) v1.11.0
+to measure it.
+As the evaluation of the results cannot be easily automated,
+the investigation was done manually
+by creating separate Python scripts
+containing code for the desired operations,
+running `memray` on them,
+and inspecting the results.
+
+**Writing**
+
+When writing to files,
+there is no memory overhead
+when converting a `pandas.DataFrame`
+first to a `pyarrow.Table`.
+Hence, we don't have to compare results.
+
+**Reading**
+
+Peak memory consumption when reading a dependency table containing 1,000,000 files:
+
+| method                                      | csv     | pickle | parquet |
+| ------------------------------------------- | ------- | ------ | ------- |
+| \-\-\-\-> pd.DataFrame[object]              | 391 MB  | 275 MB | 754 MB  |
+| \-\-\-\-> pd.DataFrame[string]              | 356 MB  | 275 MB | 874 MB  |
+| \-\-\-\-> pd.DataFrame[pyarrow]             | 696 MB  | 161 MB | 903 MB  |
+| \-c--> pd.DataFrame[object]                 | 390 MB  |        |         |
+| \-c--> pd.DataFrame[string]                 | 356 MB  |        |         |
+| \-c--> pd.DataFrame[pyarrow]                | 696 MB  |        |         |
+| \-pa-> pd.DataFrame[object]                 | 1295 MB |        |         |
+| \-pa-> pd.DataFrame[string]                 | 1333 MB |        |         |
+| \-pa-> pd.DataFrame[pyarrow]                | 1420 MB |        |         |
+| \-\-\-\-> pa.Table                          | 530 MB  |        | 381 MB  |
+| \-\-\-\-> pa.Table -> pd.DataFrame[object]  | 994 MB  |        | 897 MB  |
+| \-\-\-\-> pa.Table -> pd.DataFrame[pyarrow] | 541 MB  |        | 437 MB  |
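
The "one script per operation" approach described under *Memory consumption* could look roughly like the sketch below. It is not the script used for the benchmark. To stay self-contained it swaps `memray` for the standard-library `tracemalloc` module and `pandas`/`pyarrow` for the `csv` module, so it only counts memory allocated through Python's allocator and will not reproduce the numbers in the table. The helper names (`peak_memory_mb`, `write_table`, `read_table`), the column names, and the row count are assumptions made for illustration.

```python
import csv
import os
import tempfile
import tracemalloc


def peak_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MB).

    Only allocations made through Python's allocator are counted,
    which is narrower than what memray reports.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1e6


def write_table(path, num_rows):
    """Write a small stand-in table (hypothetical columns)."""
    with open(path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["file", "checksum", "duration"])
        for n in range(num_rows):
            writer.writerow([f"file-{n}.wav", f"{n:032x}", n / 10])


def read_table(path):
    """Read the whole table into a list of rows."""
    with open(path, newline="") as fp:
        return list(csv.reader(fp))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "deps.csv")
    write_table(path, num_rows=10_000)
    # Measure only the read operation, mirroring one table cell
    rows, peak = peak_memory_mb(read_table, path)
    print(f"read {len(rows) - 1} rows, peak {peak:.1f} MB")
```

To follow the memray workflow instead, the single operation under test (here the `read_table(path)` call) would go into its own script, which is then passed to `memray run`.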