
Polars benchmarks methods #424

Merged · 32 commits merged into main from polars-benchmarks-methods on Jul 26, 2024

Conversation

@ChristianGeng (Member) commented May 29, 2024

closes #385

Benchmarking Polars (methods)

This merge request monkeypatches the dependencies module and replaces it with a polars version. At the current stage I am unsure whether it makes sense to include it in the code base, or whether a gist with the files doing such an evaluation would be the better change. This is due to the currently rather mixed results in the benchmarking table, and due to the fact that we possibly do not want to maintain a polars dependencies module, given that there will be future changes to the module.
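
For illustration, a minimal sketch of what such a monkeypatch could look like in a benchmark script; the module path and the `dependencies_polars` name are assumptions for illustration, not the actual files of this PR:

```python
# hypothetical: swap in the polars-backed implementation before benchmarking
import audb.core.dependencies
import dependencies_polars  # assumed polars drop-in module

audb.core.dependencies.Dependencies = dependencies_polars.Dependencies
```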

Polars uses arrow memory automatically. Furthermore, a cast to type "Object" is impossible. Therefore the comparison is limited to "Pandas/string" versus polars. Possibly one could compare against pyarrow if that is more meaningful; the polars results would stay identical.

It probably makes sense to alter the scope of the issue: while it is useful to do the method benchmarking using polars, for benchmarking file I/O it would probably make more sense to test whether lance is faster than other formats. That is a different question, though.

| method | pandas | polars | comment | winner |
| --- | ---: | ---: | --- | --- |
| `Dependencies.__call__()` | 0.000 | 0.000 | | polars |
| `Dependencies.__contains__(10000 files)` | 0.007 | 0.674 | both unvectorized | pandas |
| `Dependencies.__get_item__(10000 files)` | 0.242 | nan | approx 14.14 | pandas |
| `Dependencies.__len__()` | 0.000 | 0.000 | | pandas |
| `Dependencies.__str__()` | 0.004 | 0.003 | | polars |
| `Dependencies._add_attachment()` | 0.060 | 0.946 | pandas casting | pandas |
| `Dependencies._add_media(10000 files)` | 0.066 | 0.040 | | polars |
| `Dependencies._add_meta()` | 0.183 | 1.245 | pandas casting | pandas |
| `Dependencies._drop()` | 0.076 | 0.063 | | polars |
| `Dependencies._remove()` | 0.067 | 0.058 | | polars |
| `Dependencies._update_media()` | 0.082 | 0.055 | | polars |
| `Dependencies._update_media_version(10000 files)` | 0.010 | 0.100 | concat (faster than in place) | pandas |
| `Dependencies.archive(10000 files)` | 0.029 | nan | | pandas |
| `Dependencies.archive(10000 files)` / vectorized | 0.009 | nan | | pandas |
| `Dependencies.archives` | 0.143 | 0.213 | | pandas |
| `Dependencies.attachment_ids` | 0.031 | 0.009 | | polars |
| `Dependencies.attachments` | 0.027 | 0.011 | | polars |
| `Dependencies.bit_depth(10000 files)` | 2.152 | nan | | pandas |
| `Dependencies.bit_depth(10000 files)` / vectorized | 0.003 | 0.010 | | pandas |
| `Dependencies.channels(10000 files)` | 2.128 | nan | | pandas |
| `Dependencies.channels(10000 files)` / vectorized | 0.003 | 0.007 | | pandas |
| `Dependencies.checksum(10000 files)` | 1.933 | nan | | pandas |
| `Dependencies.checksum(10000 files)` / vectorized | 0.003 | 0.007 | | pandas |
| `Dependencies.duration(10000 files)` | 2.153 | nan | | pandas |
| `Dependencies.duration(10000 files)` / vectorized | 0.003 | 0.007 | | pandas |
| `Dependencies.files` | 0.013 | 0.038 | | pandas |
| `Dependencies.format(10000 files)` | 2.030 | nan | | pandas |
| `Dependencies.format(10000 files)` / vectorized | 0.003 | 0.012 | | pandas |
| `Dependencies.media` | 0.117 | 0.042 | | polars |
| `Dependencies.removed(10000 files)` | 2.200 | nan | | pandas |
| `Dependencies.removed(10000 files)` / vectorized | 0.003 | 0.010 | | pandas |
| `Dependencies.removed_media` | 0.107 | 0.066 | | polars |
| `Dependencies.sampling_rate(10000 files)` | 2.233 | nan | | pandas |
| `Dependencies.sampling_rate(10000 files)` / vectorized | 0.003 | 0.008 | | pandas |
| `Dependencies.table_ids` | 0.035 | 0.013 | | polars |
| `Dependencies.tables` | 0.024 | 0.008 | | polars |
| `Dependencies.type(10000 files)` | 2.252 | nan | | pandas |
| `Dependencies.type(10000 files)` / vectorized | 0.004 | 0.008 | | pandas |
| `Dependencies.version(10000 files)` | 2.024 | nan | | pandas |
| `Dependencies.version(10000 files)` / vectorized | 0.003 | 0.008 | | pandas |

Comments:

  • "pandas casting" means that I have not been bothered with improving the implementation a lot. So polars convers df from pandas and back on return

  • concat (faster than in place):

    An in-place version of this particular function would look like this:

    ```python
    # with_columns() returns a new frame, so assign the result back
    self._df = self._df.with_columns(
        pl.when(pl.col(self.index_col).is_in(files))
        .then(pl.col(field).str.replace(".*", version))
        .otherwise(pl.col(field))
        .alias(field)
    )
    ```

    Interestingly, it was slower than the concat version, which allocates new memory for the returned table.

  • There are a few methods that use concat. Note that for polars this alters the row order of the dataframe (see the sketch after this list). As access is always per (pandas) index, this should not matter, should it?

  • approx 14.14: this benchmark is not run, as the polars version is extremely slow for now (approximately 14.14 s).

  • both unvectorized: slow for polars, but an improvement is probably possible.
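
For illustration, a minimal sketch of the filter-and-concat update pattern referred to above; the attribute names follow the snippet in the bullet list, and the exact PR code may differ:

```python
import polars as pl

# rows to update are filtered out, modified, and appended again;
# pl.concat() puts them at the end, so the row order changes
mask = pl.col(self.index_col).is_in(files)
updated = self._df.filter(mask).with_columns(pl.lit(version).alias(field))
self._df = pl.concat([self._df.filter(~mask), updated])
```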

Results and Interpretation:

In the current scenario I can see no benefit in replacing pandas with the polars dataframe engine. What would make sense, however, is to streamline the code to include #407, such that all methods using Dependencies._column_loc use vectorized code. I found a few instances where the current implementation iterates over files and calls Dependencies._column_loc on each single file.

| method | pandas |
| --- | ---: |
| `Dependencies.archive(10000 files)` | 0.029 |
| `Dependencies.bit_depth(10000 files)` | 2.152 |
| `Dependencies.channels(10000 files)` | 2.128 |
| `Dependencies.checksum(10000 files)` | 1.933 |
| `Dependencies.duration(10000 files)` | 2.153 |
| `Dependencies.format(10000 files)` | 2.030 |
| `Dependencies.removed(10000 files)` | 2.200 |
| `Dependencies.sampling_rate(10000 files)` | 2.233 |
| `Dependencies.type(10000 files)` | 2.252 |
| `Dependencies.version(10000 files)` | 2.024 |
| `Dependencies.archive(10000 files)` / vectorized | 0.009 |
| `Dependencies.bit_depth(10000 files)` / vectorized | 0.003 |
| `Dependencies.channels(10000 files)` / vectorized | 0.003 |
| `Dependencies.checksum(10000 files)` / vectorized | 0.003 |
| `Dependencies.duration(10000 files)` / vectorized | 0.003 |
| `Dependencies.format(10000 files)` / vectorized | 0.003 |
| `Dependencies.removed(10000 files)` / vectorized | 0.003 |
| `Dependencies.sampling_rate(10000 files)` / vectorized | 0.003 |
| `Dependencies.type(10000 files)` / vectorized | 0.004 |
| `Dependencies.version(10000 files)` / vectorized | 0.003 |
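
To make the difference concrete, here is a sketch of the two access patterns behind the numbers above; the function names and the column are illustrative, not the actual implementation:

```python
import pandas as pd

def column_loc_loop(df: pd.DataFrame, files: list, column: str) -> list:
    # per-file access: one index lookup per call, ~2 s for 10000 files
    return [df.at[file, column] for file in files]

def column_loc_vectorized(df: pd.DataFrame, files: list, column: str) -> list:
    # vectorized access: a single lookup for all files, ~0.003 s
    return df.loc[files, column].tolist()
```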

Further direction: to be on the safe side, one should probably extend the polars method benchmarking to include different data sizes: it might be that polars shines more on larger datasets (given that it is advertised as having better threading). So the question would be: how would a fortunate setting of num_rows and n_files look, without making it unrealistically big?

I am tentatively requesting a review, despite knowing that this might need to be changed and/or extended.

@hagenw (Member) commented May 30, 2024

The implementation with polars seems to face the same problem I encountered when trying to use pyarrow.Table instead of pandas.DataFrame, see #356. In general, performance is as good as or better than with pandas.DataFrame, but not when we need to address single rows.

@hagenw (Member) commented May 30, 2024

Regarding lance, I created #425 as a first try at benchmarks. But to me it looks like it's not worth continuing in this direction for now. Reading from a Lance file is faster only when we stay with the lance.LanceDataset object. But when trying to work with it, I'm sure we will face similar problems with addressing single rows, as we have seen in #356 and as we see here.

@ChristianGeng force-pushed the polars-benchmarks-methods branch 2 times, most recently from b26f17a to 9e12393 on July 8, 2024 11:50
@ChristianGeng (Member, Author) commented:

These two of your comments belong together. I will comment on all the changes in a separate thread summarizing everything I changed yesterday. The gist of it is that the lack of speed for single elements has to do with the fact that only pandas has indices.

> The implementation with polars seems to face the same problem I encountered when trying to use pyarrow.Table instead of pandas.DataFrame, see #356. In general, performance is as good as or better than with pandas.DataFrame, but not when we need to address single rows.

> This implementation is very slow at the moment when requesting a single file. Is there maybe something similar to df.at with polars to speed this up?

@ChristianGeng (Member, Author) commented:

Treatment of the index variable

The previous version of this MR assumed that Dependencies._column_loc would operate in a vectorized fashion. However, instead of implementing that, we decided to also roll back the type hints, meaning that essentially we work with single-element access.

The migration guide, and this blog post in more detail, discuss the fact that polars does not implement indices. In essence this means that random access of a single element cannot be fast per se, as the whole data has to be searched: from my basic understanding, taking the value at a given index is O(1), but finding the index of a given value is O(N). This Stack Overflow post recommended maintaining a dict. I do not know much about how dicts are implemented; I would have thought that they use red-black trees or B-trees, but they are in fact hash tables (at least in CPython). So I am unsure whether this is the best implementation, but I have used a normal Python dict for now. Sorry for being lengthy, but this is also the reason why pyarrow random access fails.

I am currently maintaining the index as a variable Dependencies._idx, updated by a private method Dependencies._update_idx. It contains something like {'file0.wav': 0, 'file1.wav': 1}, so one can use something like df.row(self._idx[file]) to locate elements. As many of the benchmarked methods operate through Dependencies._column_loc, all of them are affected.
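
A minimal sketch of this index mechanism, with a simplified class skeleton (not the exact PR code):

```python
import polars as pl

class Dependencies:
    def __init__(self, df: pl.DataFrame, index_col: str = "file"):
        self._df = df
        self._index_col = index_col
        self._update_idx()

    def _update_idx(self):
        # file name -> row position; dict lookups are O(1) on average,
        # so the O(N) search happens once here instead of per access
        self._idx = {
            file: n for n, file in enumerate(self._df[self._index_col])
        }

    def __getitem__(self, file: str) -> tuple:
        # df.row() takes a row position, so no column scan is needed
        return self._df.row(self._idx[file])
```

Note that any method that adds, removes, or reorders rows has to rebuild the dict, which is part of the bookkeeping cost of this approach.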

Frequent use of pl.update with "outer join"

When actually adding or changing data, I have refactored the slow methods to use pl.update. This API is marked unstable, though, and one would expect it to break at some later time.

For the methods where polars was fast in the first place I have not done so. So this is a little inconsistent.
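
A minimal sketch of that pattern with made-up data; note that newer polars versions rename the "outer" strategy to "full", and DataFrame.update() is marked unstable in the polars docs:

```python
import polars as pl

df = pl.DataFrame({"file": ["a.wav", "b.wav"], "version": ["1.0", "1.0"]})
new = pl.DataFrame({"file": ["b.wav", "c.wav"], "version": ["2.0", "1.0"]})

# update matching rows and insert the non-matching ones in one call
df = df.update(new, on="file", how="outer")
```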

__str__

__str__ had been slow with polars' default settings. I have tweaked these to make it fast, while trying to stay with the 15 lines of output that pandas uses.
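
Something along these lines, assuming the polars Config options; the exact settings used in the PR may differ:

```python
import polars as pl

df = pl.DataFrame({"file": [f"file{n}.wav" for n in range(100)]})

# limit rendering to roughly the 15 rows that pandas prints by default
with pl.Config(tbl_rows=15, tbl_hide_dataframe_shape=True):
    print(df)
```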

Further comments

  • Dependencies.load is currently only implemented for parquet files (see the sketch below)
  • test data are not created
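
A rough sketch of the parquet-only load path mentioned in the first bullet; the body is an assumption, not the PR code:

```python
import polars as pl

def load(self, path: str):
    # only parquet is implemented for now
    if not path.endswith(".parquet"):
        raise NotImplementedError(f"cannot load {path}")
    self._df = pl.read_parquet(path)
    self._update_idx()  # rebuild the file -> row-position mapping
```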
| method | pandas | polars | winner | factor |
| --- | ---: | ---: | --- | ---: |
| `Dependencies.__call__()` | 0.000 | 0.000 | polars | 2.667 |
| `Dependencies.__contains__(10000 files)` | 0.003 | 0.002 | polars | 2.005 |
| `Dependencies.__get_item__(10000 files)` | 0.648 | 0.013 | polars | 50.382 |
| `Dependencies.__len__()` | 0.000 | 0.000 | pandas | 1.300 |
| `Dependencies.__str__()` | 0.004 | 0.000 | polars | 24.677 |
| `Dependencies._add_attachment()` | 0.171 | 0.104 | polars | 1.645 |
| `Dependencies._add_media(10000 files)` | 0.073 | 0.008 | polars | 9.589 |
| `Dependencies._add_meta()` | 0.127 | 0.100 | polars | 1.260 |
| `Dependencies._drop()` | 0.118 | 0.021 | polars | 5.628 |
| `Dependencies._remove()` | 0.067 | 0.002 | polars | 39.324 |
| `Dependencies._update_media()` | 0.142 | 0.066 | polars | 2.148 |
| `Dependencies._update_media_version(10000 files)` | 0.021 | 0.016 | polars | 1.341 |
| `Dependencies.archive(10000 files)` | 0.045 | 0.014 | polars | 3.250 |
| `Dependencies.archives` | 0.145 | 0.151 | pandas | 1.045 |
| `Dependencies.attachment_ids` | 0.018 | 0.008 | polars | 2.375 |
| `Dependencies.attachments` | 0.017 | 0.008 | polars | 2.194 |
| `Dependencies.bit_depth(10000 files)` | 0.029 | 0.014 | polars | 2.031 |
| `Dependencies.channels(10000 files)` | 0.030 | 0.013 | polars | 2.224 |
| `Dependencies.checksum(10000 files)` | 0.030 | 0.014 | polars | 2.201 |
| `Dependencies.duration(10000 files)` | 0.028 | 0.014 | polars | 2.066 |
| `Dependencies.files` | 0.012 | 0.011 | polars | 1.040 |
| `Dependencies.format(10000 files)` | 0.033 | 0.014 | polars | 2.345 |
| `Dependencies.media` | 0.068 | 0.040 | polars | 1.702 |
| `Dependencies.removed(10000 files)` | 0.029 | 0.014 | polars | 2.118 |
| `Dependencies.removed_media` | 0.068 | 0.038 | polars | 1.809 |
| `Dependencies.sampling_rate(10000 files)` | 0.029 | 0.014 | polars | 2.102 |
| `Dependencies.table_ids` | 0.025 | 0.013 | polars | 1.927 |
| `Dependencies.tables` | 0.017 | 0.008 | polars | 2.166 |
| `Dependencies.type(10000 files)` | 0.028 | 0.014 | polars | 2.063 |
| `Dependencies.version(10000 files)` | 0.032 | 0.013 | polars | 2.372 |

@ChristianGeng marked this pull request as draft on July 10, 2024 08:26
@ChristianGeng marked this pull request as ready for review on July 10, 2024 08:26
@hagenw (Member) commented Jul 11, 2024

Great, thanks for your effort; now we can directly compare polars to our current solution.
And it turns out that polars is indeed slightly faster (or much faster for Dependencies.__get_item__()).

There are a few points that need to be considered when switching to polars for handling dependencies:

  • it would add another dependency
  • we also need to see how the performance of loading and saving parquet files compares

I would propose to not consider switching to polars for now, and first focus on a few other features. But it might indeed be a nice option to tackle at some point.

I think it would make sense to merge this into the main branch for documentation purposes.
Before doing so, could you also please update the requirements.txt file in the benchmarks/ folder, adding everything we need to run your scripts, and add the results to benchmarks/README.md.

@ChristianGeng
Copy link
Member Author

> Great, thanks for your effort; now we can directly compare polars to our current solution. And it turns out that polars is indeed slightly faster (or much faster for Dependencies.__get_item__()).
>
> There are a few points that need to be considered when switching to polars for handling dependencies:
>
>   • it would add another dependency
>   • we also need to see how the performance of loading and saving parquet files compares
>
> I would propose to not consider switching to polars for now, and first focus on a few other features. But it might indeed be a nice option to tackle at some point.

I also think that this is quite ambitious for now: it would necessitate refactoring all tests, so this is a larger decision.
I have not tackled loading and saving here: my understanding was that pyarrow is used under the hood anyway, so I perceived the more interesting comparisons to be within this module. Should a follow-up issue be created to cover this?

> I think it would make sense to merge this into the main branch for documentation purposes. Before doing so, could you also please update the requirements.txt file in the benchmarks/ folder, adding everything we need to run your scripts, and add the results to benchmarks/README.md.

I have updated the requirements and the README. I also committed the script that I used to run the comparison.
In turn, this makes the local utils.py obsolete.

@hagenw (Member) left a comment:

This is ready to merge.

@ChristianGeng (Member, Author) commented:

> This is ready to merge.

After rebasing onto main (with no conflicts) I ran into a failing test:

pytest -v -s tests/test_publish.py::test_publish_text_media_files

which results in the test failure:

FAILED tests/test_publish.py::test_publish_text_media_files - AssertionError: assert ['db.files.parquet'] == ['db.files.csv']

Will I have to pull in some unmerged changes from one of these?

| * a7c3062	 (origin/fix-parquet) Use storage format variable in asserts (Hagen Wierstorf)
| * 7acf38c	 TST: fix tests for audformat>=1.3.0 (Hagen Wierstorf)
|/  
| * 9ca90db	 (origin/skip-pickle) Add pickle_cache argument to load() + load_table() (Hagen Wierstorf)
|/ 

Or is there a different reason that I am not seeing?

@hagenw (Member) commented Jul 26, 2024

The test was fixed with #445, which is merged now.

@ChristianGeng merged commit a821c8c into main on Jul 26, 2024
8 checks passed
@ChristianGeng deleted the polars-benchmarks-methods branch on July 26, 2024 08:15