Store index of Dependencies._df as object dtype #371
Merged
This stores the index of the dependency table (`audb.Dependencies._df`) as `object` dtype.

This has the advantage that it speeds up access to multiple files in methods like `audb.Dependencies.archive()`, so that using `string` (as currently in `main`) or `pyarrow` (which we propose for the future) as dtype for the column values reaches the same performance as using `object` as dtype for the column values.

Example result now:
Example result before:
I also inspected how this affects writing and reading the dependency table to files by updating the existing benchmark.
In the following we list only results that show a significant change.
Writing
Reading
Conclusion
It only affects results negatively when reading directly to `pandas.DataFrame` with `pyarrow` dtypes, but not when starting with `pyarrow.Table` and then converting to `pandas.DataFrame` with `pyarrow` dtypes, which is faster anyway.
It also seems to reduce memory consumption when going to `pyarrow` dtype. This is slightly harder to judge, as the results can vary between runs and this pull request also updates how memory consumption is calculated. At least it does not get worse in any case.

`object` still seems to be the fastest dtype for indexing, compare

which returns