Store index of Dependencies._df as object dtype #371

hagenw · 2024-02-13T12:54:36Z

This stores the index of the dependency table (audb.Dependencies._df) as object dtype.
This speeds up accessing multiple files in methods like audb.Dependencies.archive(),
so that storing the column values as string dtype (as currently in main) or as pyarrow dtype (which we propose for the future) reaches the same performance as storing them as object dtype.
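As a rough illustration of the idea (not the actual audb implementation), the following sketch builds a small stand-in table whose column uses string dtype and converts only its index to object dtype. The column name, file names, and values are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the dependency table;
# file names, column name, and values are invented for illustration.
files = [f"audio/{n:03d}.wav" for n in range(5)]
df = pd.DataFrame(
    {"archive": [f"archive-{n}" for n in range(5)]},
    index=pd.Index(files, dtype="string"),
    dtype="string",
)

# Convert only the index to object dtype; the column keeps its string dtype.
df.index = df.index.astype("object")

# Access several files at once via the object index.
selected = df.loc[["audio/001.wav", "audio/003.wav"], "archive"].tolist()
print(df.index.dtype, df["archive"].dtype, selected)
```

The column dtype stays untouched, so the storage format of the values is independent of how the index is stored.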

Example result now:

Example result before:

I also inspected how this affects writing and reading the dependency table to files by updating the existing benchmark.
In the following we list only results that show a significant change.

Writing

method	format	pull request	before
pd.DataFrame[pyarrow]	parquet	0.273	0.928

Reading

method	format	pull request	before
----> pd.DataFrame[pyarrow]	pickle	0.092	0.027
----> pd.DataFrame[pyarrow]	parquet	0.408	0.288

Conclusion

The change only affects results negatively when reading directly into a pandas.DataFrame with pyarrow dtypes,
but not when starting with a pyarrow.Table and then converting it to a pandas.DataFrame with pyarrow dtypes,
which is faster anyway.

It also seems to reduce memory consumption when switching to pyarrow dtypes. This is slightly harder to judge, as the results can vary between runs and this pull request also changes how memory consumption is calculated. But at least it does not get worse in any case.

object still seems to be the fastest dtype for indexing, compare

import pandas as pd
import timeit

points = 1000000
data = [f"data-{n}" for n in range(points)]
for dtype in ["object", "string", "string[pyarrow]"]:
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    df = pd.DataFrame(data, index=index, dtype=dtype)
    print(dtype)
    %timeit df.loc['index-2000']

which returns

object
9.78 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string
15.7 µs ± 36.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string[pyarrow]
17.6 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
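The %timeit magic used above is only available in IPython. A minimal sketch of the same comparison as a plain Python script, using the standard library timeit module, could look as follows. The helper name time_loc and its parameters are assumptions made for this example, and it uses a smaller table, so the absolute numbers will differ from the ones reported above.

```python
import timeit

import pandas as pd


def time_loc(dtype, points=50_000, number=1_000):
    """Return the mean time in seconds of one df.loc[key] lookup.

    Hypothetical helper for illustration; not part of audb or pandas.
    """
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    data = [f"data-{n}" for n in range(points)]
    df = pd.DataFrame(data, index=index, dtype=dtype)
    key = f"index-{points // 2}"
    df.loc[key]  # warm-up, so building the index lookup table is not timed
    return timeit.timeit(lambda: df.loc[key], number=number) / number


# "string[pyarrow]" can be added to the list if pyarrow is installed.
results = {dtype: time_loc(dtype) for dtype in ["object", "string"]}
for dtype, seconds in results.items():
    print(f"{dtype}: {seconds * 1e6:.1f} µs per lookup")
```

Timings depend on the machine and the pandas version, so only the relative order of the dtypes is meaningful.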

codecov · 2024-02-13T12:57:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (dd7642e) 100.0% compared to head (5b0fa8b) 100.0%.

Additional details and impacted files

Files	Coverage Δ
audb/core/define.py	`100.0% <100.0%> (ø)`
audb/core/dependencies.py	`100.0% <100.0%> (ø)`

* Store index of Dependencies._df as object dtype * Add memray to requirements for benchmark

Store index of Dependencies._df as object dtype

511a107

hagenw marked this pull request as ready for review February 13, 2024 13:04

Add memray to requirements for benchmark

5b0fa8b

hagenw merged commit 9f3b0ad into dev Feb 13, 2024
9 checks passed

hagenw deleted the use-object-index-in-dependencies branch February 13, 2024 13:25

hagenw mentioned this pull request Feb 14, 2024

Consider using object instead of string in index audeering/audformat#418

Open

hagenw added a commit that referenced this pull request Feb 23, 2024

Store index of Dependencies._df as object dtype (#371)

0c4300b

* Store index of Dependencies._df as object dtype * Add memray to requirements for benchmark

hagenw added a commit that referenced this pull request May 3, 2024

Store index of Dependencies._df as object dtype (#371)

7c21733

* Store index of Dependencies._df as object dtype * Add memray to requirements for benchmark

hagenw added a commit that referenced this pull request May 3, 2024

Store index of Dependencies._df as object dtype (#371)

b6cfce5

* Store index of Dependencies._df as object dtype * Add memray to requirements for benchmark

hagenw added a commit that referenced this pull request May 3, 2024

Store index of Dependencies._df as object dtype (#371)

c0d0979

* Store index of Dependencies._df as object dtype * Add memray to requirements for benchmark

hagenw added a commit that referenced this pull request May 8, 2024

Store index of Dependencies._df as object dtype (#371)

6119db8

* Store index of Dependencies._df as object dtype * Add memray to requirements for benchmark
