Store index of Dependencies._df as object dtype #371

Merged
merged 2 commits into from
Feb 13, 2024

@hagenw hagenw commented Feb 13, 2024

This stores the index of the dependency table (audb.Dependencies._df) as object dtype.
This speeds up access to multiple files in methods like audb.Dependencies.archive(),
so that storing the column values as string (as currently in main) or pyarrow (which we propose for the future) reaches the same performance as storing them as object.


Example result now: [image]

Example result before: [image]


I also inspected how this affects writing and reading the dependency table to files by updating the existing benchmark.
In the following we list only results that show a significant change.

Writing

| method | format | pull request | before |
|---|---|---|---|
| pd.DataFrame[pyarrow] | parquet | 0.273 | 0.928 |

Reading

| method | format | pull request | before |
|---|---|---|---|
| ----> pd.DataFrame[pyarrow] | pickle | 0.092 | 0.027 |
| ----> pd.DataFrame[pyarrow] | parquet | 0.408 | 0.288 |

Conclusion

It only affects results negatively when reading directly into a pandas.DataFrame with pyarrow dtypes,
but not when starting with a pyarrow.Table and then converting it to a pandas.DataFrame with pyarrow dtypes,
which is faster anyway.

It also seems to reduce memory consumption when switching to pyarrow dtypes. This is slightly harder to judge, as the results can vary between runs and we have updated the code that calculates it in this pull request. But at least it does not get worse in any case.


It seems that object is still the fastest dtype for indexing; compare

import pandas as pd
import timeit

points = 1000000
data = [f"data-{n}" for n in range(points)]
for dtype in ["object", "string", "string[pyarrow]"]:
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    df = pd.DataFrame(data, index=index, dtype=dtype)
    print(dtype)
    %timeit df.loc['index-2000']

which returns

object
9.78 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string
15.7 µs ± 36.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string[pyarrow]
17.6 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


codecov bot commented Feb 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (dd7642e) 100.0% compared to head (5b0fa8b) 100.0%.

Additional details and impacted files
| Files | Coverage Δ |
|---|---|
| audb/core/define.py | 100.0% <100.0%> (ø) |
| audb/core/dependencies.py | 100.0% <100.0%> (ø) |

@hagenw hagenw marked this pull request as ready for review February 13, 2024 13:04
@hagenw hagenw merged commit 9f3b0ad into dev Feb 13, 2024
9 checks passed
@hagenw hagenw deleted the use-object-index-in-dependencies branch February 13, 2024 13:25
hagenw added a commit that referenced this pull request Feb 23, 2024
* Store index of Dependencies._df as object dtype

* Add memray to requirements for benchmark
hagenw added a commit that referenced this pull request May 3, 2024
* Store index of Dependencies._df as object dtype

* Add memray to requirements for benchmark
hagenw added a commit that referenced this pull request May 8, 2024
* Store index of Dependencies._df as object dtype

* Add memray to requirements for benchmark