`to_lists()` with column subset much slower than all columns when chained to `apply(..., axis=1)` #33

dougbrn · 2024-04-22T21:53:13Z

Bug report

The following code is not a reproducible example because the data creation is not included, but applying this to any nestedframe should yield similar timing results.

nf.ztf_sources.nest.to_lists(["mjd","flux"]).apply(lambda x: x, axis=1) # 461 ms
nf.ztf_sources.nest.to_lists().apply(lambda x: x, axis=1) # 34.4 ms

Before submitting
Please check the following:

I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.


hombit · 2024-05-29T16:23:08Z

I believe this may be unrelated to our code and is (again) a pandas dtypes issue.

First, let me rewrite the code using the built-in data generator:

from nested_pandas.datasets import generate_data

ndf = generate_data(n_base=1000, n_layer=100)
%timeit ndf.nested.nest.to_lists().apply(lambda x: x, axis=1)  # 8.2 ms
%timeit ndf.nested.nest.to_lists(["t", "flux"]).apply(lambda x: x, axis=1)  # 76 ms
%timeit ndf.nested.nest.to_lists(["t", "band"]).apply(lambda x: x, axis=1)  # 8.1 ms

The difference between the second line and the other two is in the dtypes of the columns: "t" and "flux" are both float64 columns, while "band" is a string column. In both cases pandas creates a pd.Series from each row. When the column dtypes differ, the row pd.Series has object dtype; when they are the same, pandas casts the row pd.Series to that common dtype. Surprisingly, this cast turns out to be quite an expensive operation:

import numpy as np
import pandas as pd
import pyarrow as pa

zeros = pa.scalar(np.zeros(100, dtype=float))
%timeit pd.Series([zeros, zeros])  # 13.8 μs
%timeit pd.Series([zeros, zeros], dtype=pd.ArrowDtype(zeros.type))  # 160 μs
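To see the row dtype behaviour in isolation, here is a minimal sketch using plain pandas columns. Ordinary float64 and object columns stand in for the Arrow list columns that to_lists() produces, so it only illustrates which dtype each row Series gets, not the timings. The helper name row_dtypes and the toy frames are hypothetical, for illustration only.

```python
import pandas as pd


def row_dtypes(df):
    """Record the dtype of each row object that apply(axis=1) passes to the function."""
    seen = []

    def record(row):
        seen.append(str(row.dtype))
        return 0  # the return value is irrelevant here

    df.apply(record, axis=1)
    return seen


# Both columns share a dtype, so each row Series is built with that dtype.
same = pd.DataFrame({"t": [0.0, 1.0], "flux": [10.0, 11.0]})

# The column dtypes differ, so each row Series falls back to object dtype.
mixed = pd.DataFrame({"t": [0.0, 1.0], "band": pd.Series(["g", "r"], dtype=object)})

print(row_dtypes(same))
print(row_dtypes(mixed))
```

With a recent pandas this should report float64 for every row of `same` and object for every row of `mixed`.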

What we can do is call .apply with raw=True, which is a close equivalent of our reduce():

from nested_pandas.datasets import generate_data

ndf = generate_data(n_base=1000, n_layer=100)
%timeit ndf.nested.nest.to_lists().apply(lambda x: x, axis=1, raw=True)  # 1.43 ms
%timeit ndf.nested.nest.to_lists(["t", "flux"]).apply(lambda x: x, axis=1, raw=True)  # 0.89 ms
%timeit ndf.nested.nest.to_lists(["t", "band"]).apply(lambda x: x, axis=1, raw=True)  # 1.29 ms
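As a rough illustration of why raw=True sidesteps the per-row pd.Series construction, the sketch below records what kind of object the applied function receives in each mode. It uses a small plain-float DataFrame rather than a nested frame; the names total, seen_default and seen_raw are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"t": [0.0, 1.0, 2.0], "flux": [10.0, 11.0, 12.0]})

seen_default, seen_raw = [], []


def total(row, seen):
    """Sum one row and remember the type of the object pandas passed in."""
    seen.append(type(row).__name__)
    return row.sum()


# Default mode: every row is wrapped in a pd.Series before the call.
default_result = df.apply(lambda row: total(row, seen_default), axis=1)

# raw=True: the function receives a plain NumPy array for each row instead.
raw_result = df.apply(lambda row: total(row, seen_raw), axis=1, raw=True)

print(set(seen_default), set(seen_raw))
```

The trade-off is that the function loses the column labels and has to index the row positionally.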

hombit · 2024-05-30T15:40:06Z

#96 changed reduce() so that it no longer uses to_lists().assign, so I am closing this issue.

dougbrn added the bug Something isn't working label Apr 22, 2024

This was referenced Apr 22, 2024

speedup with to_lists() #34

Merged

reduce fails with duplicate column names #35

Closed

dougbrn added this to the v0.1 Release milestone Apr 26, 2024

hombit removed this from the v0.1 Release milestone May 2, 2024

hombit closed this as completed May 30, 2024
