
Polars benchmarks methods #424

Merged · 32 commits merged into main from polars-benchmarks-methods on Jul 26, 2024

Conversation

@ChristianGeng (Member) commented May 29, 2024

closes #385

Benchmarking Polars (methods)

This merge request monkeypatches the dependencies module and replaces it with a polars version. At the current stage I am unsure whether it makes sense to include it in the code base, or whether a gist with the files doing such an evaluation would be the better change. This is due to the currently rather mixed results in the benchmarking table, and due to the fact that we possibly do not want to maintain a polars dependencies module, given that there will be future changes to the module.
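
For illustration, a minimal sketch of what such a monkeypatch could look like in a benchmark script; the module path and the `dependencies_polars` name are assumptions for illustration, not the actual files of this PR:

```python
# hypothetical: swap in the polars-backed implementation before benchmarking
import audb.core.dependencies
import dependencies_polars  # assumed polars drop-in module

audb.core.dependencies.Dependencies = dependencies_polars.Dependencies
```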

Polars uses arrow memory automatically. Furthermore, a cast to type "Object" is impossible. Therefore the comparison is limited to "Pandas/string" versus polars. Possibly one could compare against pyarrow if that is more meaningful; the polars results would stay identical.

It probably makes sense to alter the scope of the issue: while it is useful to do the method benchmarking using polars, for benchmarking file I/O it would probably make more sense to test whether lance is faster than other formats. That is a different question, though.

| method | pandas | polars | comment | winner |
| --- | ---: | ---: | --- | --- |
| `Dependencies.__call__()` | 0.000 | 0.000 | | polars |
| `Dependencies.__contains__(10000 files)` | 0.007 | 0.674 | both unvectorized | pandas |
| `Dependencies.__get_item__(10000 files)` | 0.242 | nan | approx 14.14 | pandas |
| `Dependencies.__len__()` | 0.000 | 0.000 | | pandas |
| `Dependencies.__str__()` | 0.004 | 0.003 | | polars |
| `Dependencies._add_attachment()` | 0.060 | 0.946 | pandas casting | pandas |
| `Dependencies._add_media(10000 files)` | 0.066 | 0.040 | | polars |
| `Dependencies._add_meta()` | 0.183 | 1.245 | pandas casting | pandas |
| `Dependencies._drop()` | 0.076 | 0.063 | | polars |
| `Dependencies._remove()` | 0.067 | 0.058 | | polars |
| `Dependencies._update_media()` | 0.082 | 0.055 | | polars |
| `Dependencies._update_media_version(10000 files)` | 0.010 | 0.100 | concat (faster than in place) | pandas |
| `Dependencies.archive(10000 files)` | 0.029 | nan | | pandas |
| `Dependencies.archive(10000 files)` / vectorized | 0.009 | nan | | pandas |
| `Dependencies.archives` | 0.143 | 0.213 | | pandas |
| `Dependencies.attachment_ids` | 0.031 | 0.009 | | polars |
| `Dependencies.attachments` | 0.027 | 0.011 | | polars |
| `Dependencies.bit_depth(10000 files)` | 2.152 | nan | | pandas |
| `Dependencies.bit_depth(10000 files)` / vectorized | 0.003 | 0.010 | | pandas |
| `Dependencies.channels(10000 files)` | 2.128 | nan | | pandas |
| `Dependencies.channels(10000 files)` / vectorized | 0.003 | 0.007 | | pandas |
| `Dependencies.checksum(10000 files)` | 1.933 | nan | | pandas |
| `Dependencies.checksum(10000 files)` / vectorized | 0.003 | 0.007 | | pandas |
| `Dependencies.duration(10000 files)` | 2.153 | nan | | pandas |
| `Dependencies.duration(10000 files)` / vectorized | 0.003 | 0.007 | | pandas |
| `Dependencies.files` | 0.013 | 0.038 | | pandas |
| `Dependencies.format(10000 files)` | 2.030 | nan | | pandas |
| `Dependencies.format(10000 files)` / vectorized | 0.003 | 0.012 | | pandas |
| `Dependencies.media` | 0.117 | 0.042 | | polars |
| `Dependencies.removed(10000 files)` | 2.200 | nan | | pandas |
| `Dependencies.removed(10000 files)` / vectorized | 0.003 | 0.010 | | pandas |
| `Dependencies.removed_media` | 0.107 | 0.066 | | polars |
| `Dependencies.sampling_rate(10000 files)` | 2.233 | nan | | pandas |
| `Dependencies.sampling_rate(10000 files)` / vectorized | 0.003 | 0.008 | | pandas |
| `Dependencies.table_ids` | 0.035 | 0.013 | | polars |
| `Dependencies.tables` | 0.024 | 0.008 | | polars |
| `Dependencies.type(10000 files)` | 2.252 | nan | | pandas |
| `Dependencies.type(10000 files)` / vectorized | 0.004 | 0.008 | | pandas |
| `Dependencies.version(10000 files)` | 2.024 | nan | | pandas |
| `Dependencies.version(10000 files)` / vectorized | 0.003 | 0.008 | | pandas |

Comments:

  • "pandas casting" means that I have not been bothered with improving the implementation a lot. So polars convers df from pandas and back on return

  • concat (faster than in place):

    An in-place version of this particular function would look like this:

    ```python
    # with_columns() returns a new frame, so assign the result back
    self._df = self._df.with_columns(
        pl.when(pl.col(self.index_col).is_in(files))
        .then(pl.col(field).str.replace(".*", version))
        .otherwise(pl.col(field))
        .alias(field)
    )
    ```

    Interestingly, it was slower than the concat version, which allocates new memory for the returned table.

  • There are a few methods that use concat. Note that for polars this alters the row order of the dataframe (see the sketch after this list). As access is always per (pandas) index, this should not matter, should it?

  • approx 14.14: this benchmark is not run, as the polars version is extremely slow for now (approximately 14.14 s).

  • both unvectorized: slow for polars, but an improvement is probably possible.
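
For illustration, a minimal sketch of the filter-and-concat update pattern referred to above; the attribute names follow the snippet in the bullet list, and the exact PR code may differ:

```python
import polars as pl

# rows to update are filtered out, modified, and appended again;
# pl.concat() puts them at the end, so the row order changes
mask = pl.col(self.index_col).is_in(files)
updated = self._df.filter(mask).with_columns(pl.lit(version).alias(field))
self._df = pl.concat([self._df.filter(~mask), updated])
```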

Results and Interpretation:

In the current scenario I can see no benefit in replacing pandas with the polars dataframe engine. What would make sense, however, is to streamline the code to include #407, such that all methods using Dependencies._column_loc use vectorized code. I found a few instances where the current implementation iterates over files and calls Dependencies._column_loc on each single file.

| method | pandas |
| --- | ---: |
| `Dependencies.archive(10000 files)` | 0.029 |
| `Dependencies.bit_depth(10000 files)` | 2.152 |
| `Dependencies.channels(10000 files)` | 2.128 |
| `Dependencies.checksum(10000 files)` | 1.933 |
| `Dependencies.duration(10000 files)` | 2.153 |
| `Dependencies.format(10000 files)` | 2.030 |
| `Dependencies.removed(10000 files)` | 2.200 |
| `Dependencies.sampling_rate(10000 files)` | 2.233 |
| `Dependencies.type(10000 files)` | 2.252 |
| `Dependencies.version(10000 files)` | 2.024 |
| `Dependencies.archive(10000 files)` / vectorized | 0.009 |
| `Dependencies.bit_depth(10000 files)` / vectorized | 0.003 |
| `Dependencies.channels(10000 files)` / vectorized | 0.003 |
| `Dependencies.checksum(10000 files)` / vectorized | 0.003 |
| `Dependencies.duration(10000 files)` / vectorized | 0.003 |
| `Dependencies.format(10000 files)` / vectorized | 0.003 |
| `Dependencies.removed(10000 files)` / vectorized | 0.003 |
| `Dependencies.sampling_rate(10000 files)` / vectorized | 0.003 |
| `Dependencies.type(10000 files)` / vectorized | 0.004 |
| `Dependencies.version(10000 files)` / vectorized | 0.003 |
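
To make the difference concrete, here is a sketch of the two access patterns behind the numbers above; the function names and the column are illustrative, not the actual implementation:

```python
import pandas as pd

def column_loc_loop(df: pd.DataFrame, files: list, column: str) -> list:
    # per-file access: one index lookup per call, ~2 s for 10000 files
    return [df.at[file, column] for file in files]

def column_loc_vectorized(df: pd.DataFrame, files: list, column: str) -> list:
    # vectorized access: a single lookup for all files, ~0.003 s
    return df.loc[files, column].tolist()
```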

Further direction: to be on the safe side, one should probably extend the polars method benchmarking to include different data sizes: it might be that polars shines more on larger datasets (given that it is advertised as having better threading). So the question would be: how would a fortunate setting of num_rows and n_files look, without making it unrealistically big?

I am tentatively requesting a review, despite knowing that this might need to be changed and/or extended.

@hagenw (Member) commented May 30, 2024

The implementation with polars seems to face the same problem I encountered when trying to use pyarrow.Table instead of pandas.DataFrame, see #356. In general, performance is as good as or better than with pandas.DataFrame, but not when we need to address single rows.

@hagenw (Member) commented May 30, 2024

Regarding lance, I created #425 as a first try at benchmarks. But to me it looks like it's not worth continuing in this direction for now. Reading from a Lance file is faster only when we stay with the lance.LanceDataset object. But when trying to work with it, I'm sure we will face similar problems with addressing single rows, as we have seen in #356 and as we see here.

@ChristianGeng force-pushed the polars-benchmarks-methods branch 2 times, most recently from b26f17a to 9e12393 on July 8, 2024 11:50
@ChristianGeng (Member, Author) commented:

These two of your comments belong together. I will comment on all the changes in a separate thread summarizing everything I changed yesterday. The gist of it is that the lack of speed for single elements has to do with the fact that only pandas has indices.

> The implementation with polars seems to face the same problem I encountered when trying to use pyarrow.Table instead of pandas.DataFrame, see #356. In general, performance is as good as or better than with pandas.DataFrame, but not when we need to address single rows.

> This implementation is very slow at the moment when requesting a single file. Is there maybe something similar to df.at with polars to speed this up?

@ChristianGeng (Member, Author) commented:

Treatment of the index variable

The previous version of this MR assumed that Dependencies._column_loc would operate in a vectorized fashion. However, instead of implementing that, we decided to also roll back the type hints, meaning that essentially we work with single-element access.

The migration guide, and this blog post in more detail, discuss the fact that polars does not implement indices. In essence this means that random access of a single element cannot be fast per se, as the whole data has to be searched: from my basic understanding, taking the value at a given index is O(1), but finding the index of a given value is O(N). This Stack Overflow post recommended maintaining a dict. I do not know much about how dicts are implemented; I would have thought that they use red-black trees or B-trees, but they are in fact hash tables (at least in CPython). So I am unsure whether this is the best implementation, but I have used a normal Python dict for now. Sorry for being lengthy, but this is also the reason why pyarrow random access fails.

I am currently maintaining the index as a variable Dependencies._idx, updated by a private method Dependencies._update_idx. It contains something like {'file0.wav': 0, 'file1.wav': 1}, so one can use something like df.row(self._idx[file]) to locate elements. As many of the benchmarked methods operate through Dependencies._column_loc, all of them are affected.
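
A minimal sketch of this index mechanism, with a simplified class skeleton (not the exact PR code):

```python
import polars as pl

class Dependencies:
    def __init__(self, df: pl.DataFrame, index_col: str = "file"):
        self._df = df
        self._index_col = index_col
        self._update_idx()

    def _update_idx(self):
        # file name -> row position; dict lookups are O(1) on average,
        # so the O(N) search happens once here instead of per access
        self._idx = {
            file: n for n, file in enumerate(self._df[self._index_col])
        }

    def __getitem__(self, file: str) -> tuple:
        # df.row() takes a row position, so no column scan is needed
        return self._df.row(self._idx[file])
```

Note that any method that adds, removes, or reorders rows has to rebuild the dict, which is part of the bookkeeping cost of this approach.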

Frequent use of pl.update with "outer join"

When actually adding or changing data, I have refactored the slow methods to use pl.update. This API is marked unstable, though, and one would expect it to break at some later time.

For the methods where polars was fast in the first place I have not done so. So this is a little inconsistent.
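
A minimal sketch of that pattern with made-up data; note that newer polars versions rename the "outer" strategy to "full", and DataFrame.update() is marked unstable in the polars docs:

```python
import polars as pl

df = pl.DataFrame({"file": ["a.wav", "b.wav"], "version": ["1.0", "1.0"]})
new = pl.DataFrame({"file": ["b.wav", "c.wav"], "version": ["2.0", "1.0"]})

# update matching rows and insert the non-matching ones in one call
df = df.update(new, on="file", how="outer")
```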

__str__

__str__ had been slow with polars' default settings. I have tweaked these to make it fast, while trying to stay with the 15 lines of output that pandas uses.
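
Something along these lines, assuming the polars Config options; the exact settings used in the PR may differ:

```python
import polars as pl

df = pl.DataFrame({"file": [f"file{n}.wav" for n in range(100)]})

# limit rendering to roughly the 15 rows that pandas prints by default
with pl.Config(tbl_rows=15, tbl_hide_dataframe_shape=True):
    print(df)
```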

Further comments

  • Dependencies.load is currently only implemented for parquet files (see the sketch below)
  • test data are not created
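
A rough sketch of the parquet-only load path mentioned in the first bullet; the body is an assumption, not the PR code:

```python
import polars as pl

def load(self, path: str):
    # only parquet is implemented for now
    if not path.endswith(".parquet"):
        raise NotImplementedError(f"cannot load {path}")
    self._df = pl.read_parquet(path)
    self._update_idx()  # rebuild the file -> row-position mapping
```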
| method | pandas | polars | winner | factor |
| --- | ---: | ---: | --- | ---: |
| `Dependencies.__call__()` | 0.000 | 0.000 | polars | 2.667 |
| `Dependencies.__contains__(10000 files)` | 0.003 | 0.002 | polars | 2.005 |
| `Dependencies.__get_item__(10000 files)` | 0.648 | 0.013 | polars | 50.382 |
| `Dependencies.__len__()` | 0.000 | 0.000 | pandas | 1.300 |
| `Dependencies.__str__()` | 0.004 | 0.000 | polars | 24.677 |
| `Dependencies._add_attachment()` | 0.171 | 0.104 | polars | 1.645 |
| `Dependencies._add_media(10000 files)` | 0.073 | 0.008 | polars | 9.589 |
| `Dependencies._add_meta()` | 0.127 | 0.100 | polars | 1.260 |
| `Dependencies._drop()` | 0.118 | 0.021 | polars | 5.628 |
| `Dependencies._remove()` | 0.067 | 0.002 | polars | 39.324 |
| `Dependencies._update_media()` | 0.142 | 0.066 | polars | 2.148 |
| `Dependencies._update_media_version(10000 files)` | 0.021 | 0.016 | polars | 1.341 |
| `Dependencies.archive(10000 files)` | 0.045 | 0.014 | polars | 3.250 |
| `Dependencies.archives` | 0.145 | 0.151 | pandas | 1.045 |
| `Dependencies.attachment_ids` | 0.018 | 0.008 | polars | 2.375 |
| `Dependencies.attachments` | 0.017 | 0.008 | polars | 2.194 |
| `Dependencies.bit_depth(10000 files)` | 0.029 | 0.014 | polars | 2.031 |
| `Dependencies.channels(10000 files)` | 0.030 | 0.013 | polars | 2.224 |
| `Dependencies.checksum(10000 files)` | 0.030 | 0.014 | polars | 2.201 |
| `Dependencies.duration(10000 files)` | 0.028 | 0.014 | polars | 2.066 |
| `Dependencies.files` | 0.012 | 0.011 | polars | 1.040 |
| `Dependencies.format(10000 files)` | 0.033 | 0.014 | polars | 2.345 |
| `Dependencies.media` | 0.068 | 0.040 | polars | 1.702 |
| `Dependencies.removed(10000 files)` | 0.029 | 0.014 | polars | 2.118 |
| `Dependencies.removed_media` | 0.068 | 0.038 | polars | 1.809 |
| `Dependencies.sampling_rate(10000 files)` | 0.029 | 0.014 | polars | 2.102 |
| `Dependencies.table_ids` | 0.025 | 0.013 | polars | 1.927 |
| `Dependencies.tables` | 0.017 | 0.008 | polars | 2.166 |
| `Dependencies.type(10000 files)` | 0.028 | 0.014 | polars | 2.063 |
| `Dependencies.version(10000 files)` | 0.032 | 0.013 | polars | 2.372 |

@ChristianGeng marked this pull request as draft on July 10, 2024 08:26
@ChristianGeng marked this pull request as ready for review on July 10, 2024 08:26
@hagenw (Member) commented Jul 11, 2024

Great, thanks for your effort; now we can directly compare polars to our current solution.
And it turns out that polars is indeed slightly faster (or much faster for Dependencies.__get_item__()).

There are a few points that need to be considered when switching to polars for handling dependencies:

  • it would add another dependency
  • we also need to see how the performance of loading and saving parquet files compares

I would propose to not consider switching to polars for now, and first focus on a few other features. But it might indeed be a nice option to tackle at some point.

I think it would make sense to merge this into the main branch for documentation purposes.
Before doing so, could you also please update the requirements.txt file in the benchmarks/ folder, adding everything we need to run your scripts, and add the results to benchmarks/README.md.

@ChristianGeng
Copy link
Member Author

> Great, thanks for your effort; now we can directly compare polars to our current solution. And it turns out that polars is indeed slightly faster (or much faster for Dependencies.__get_item__()).
>
> There are a few points that need to be considered when switching to polars for handling dependencies:
>
>   • it would add another dependency
>   • we also need to see how the performance of loading and saving parquet files compares
>
> I would propose to not consider switching to polars for now, and first focus on a few other features. But it might indeed be a nice option to tackle at some point.

I also think that this is quite ambitious for now: it would necessitate refactoring all tests, so this is a larger decision.
I have not tackled loading and saving here: my understanding was that pyarrow is used under the hood anyway, so I perceived the more interesting comparisons to be within this module. Should a follow-up issue be created to cover this?

> I think it would make sense to merge this into the main branch for documentation purposes. Before doing so, could you also please update the requirements.txt file in the benchmarks/ folder, adding everything we need to run your scripts, and add the results to benchmarks/README.md.

I have updated the requirements and the README. I also committed the script that I used to run the comparison.
In turn, this makes the local utils.py obsolete.

@hagenw (Member) left a comment:

This is ready to merge.

@ChristianGeng (Member, Author) commented:

> This is ready to merge.

After rebasing onto main (with no conflicts) I ran into a failing test:

pytest -v -s tests/test_publish.py::test_publish_text_media_files

which results in the test failure:

FAILED tests/test_publish.py::test_publish_text_media_files - AssertionError: assert ['db.files.parquet'] == ['db.files.csv']

Will I have to pull in some unmerged changes from one of these?

| * a7c3062	 (origin/fix-parquet) Use storage format variable in asserts (Hagen Wierstorf)
| * 7acf38c	 TST: fix tests for audformat>=1.3.0 (Hagen Wierstorf)
|/  
| * 9ca90db	 (origin/skip-pickle) Add pickle_cache argument to load() + load_table() (Hagen Wierstorf)
|/ 

Or is there a different reason that I am not seeing?

@hagenw (Member) commented Jul 26, 2024

The test was fixed with #445, which is merged now.

@ChristianGeng merged commit a821c8c into main on Jul 26, 2024
8 checks passed
@ChristianGeng deleted the polars-benchmarks-methods branch on July 26, 2024 08:15