to_lists() with column subset much slower than all columns when chained to apply(..., axis=1) #33

Closed
2 of 3 tasks
dougbrn opened this issue Apr 22, 2024 · 2 comments
Labels
bug Something isn't working

@dougbrn
Collaborator

dougbrn commented Apr 22, 2024

Bug report

The following code is not a reproducible example because the data creation is not included, but applying it to any NestedFrame should yield similar timing results.

nf.ztf_sources.nest.to_lists(["mjd","flux"]).apply(lambda x: x, axis=1) # 461ms
nf.ztf_sources.nest.to_lists().apply(lambda x: x, axis=1) # 34.4 ms

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
@dougbrn dougbrn added the bug Something isn't working label Apr 22, 2024
@dougbrn dougbrn added this to the v0.1 Release milestone Apr 26, 2024
@hombit hombit removed this from the v0.1 Release milestone May 2, 2024
@hombit
Collaborator

hombit commented May 29, 2024

I believe this may be unrelated to our code and is (again) a pandas dtype issue.

First, let me rewrite the code using the built-in data generator:

from nested_pandas.datasets import generate_data

ndf = generate_data(n_base=1000, n_layer=100)
%timeit ndf.nested.nest.to_lists().apply(lambda x: x, axis=1)  # 8.2ms
%timeit ndf.nested.nest.to_lists(["t", "flux"]).apply(lambda x: x, axis=1)  # 76ms
%timeit ndf.nested.nest.to_lists(["t", "band"]).apply(lambda x: x, axis=1)  # 8.1ms

The difference between the second line and the other two is in the dtypes of the columns: "t" and "flux" are float64 columns, while "band" is a string column. In both cases pandas creates a pd.Series from each row. When the column dtypes differ, the row pd.Series has object dtype; when they are the same, pandas casts the row pd.Series to that common dtype. Surprisingly, this cast turns out to be quite an expensive operation:

import numpy as np
import pandas as pd
import pyarrow as pa

zeros = pa.scalar(np.zeros(100, dtype=float))
%timeit pd.Series([zeros, zeros])  # 13.8 μs
%timeit pd.Series([zeros, zeros], dtype=pd.ArrowDtype(zeros.type))  # 160 μs
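
The row-dtype rule described above can be checked with plain pandas. This is only a minimal sketch: it uses ordinary float and string columns as stand-ins for the nested list columns, and the frames and the `row_dtypes` helper are illustrative names, not part of nested-pandas.

```python
import pandas as pd


def row_dtypes(df):
    """Return the set of dtypes of the row Series that apply(axis=1) builds."""
    seen = set()

    def record(row):
        seen.add(str(row.dtype))
        return 0  # the return value is irrelevant here

    df.apply(record, axis=1)
    return seen


# Stand-in data: two float columns vs. one float and one string column.
same = pd.DataFrame({"t": [0.0, 1.0], "flux": [2.0, 3.0]})
mixed = pd.DataFrame({"t": [0.0, 1.0], "band": ["g", "r"]})

same_dtypes = row_dtypes(same)    # rows take the shared column dtype
mixed_dtypes = row_dtypes(mixed)  # rows fall back to object dtype
print(same_dtypes, mixed_dtypes)
```

For cheap dtypes like float64 the cast is harmless; the slowdown reported here appears when the shared dtype is an Arrow list type, as in the timing above.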

What we can do is call .apply with raw=True, which is a close equivalent of our reduce():

from nested_pandas.datasets import generate_data

ndf = generate_data(n_base=1000, n_layer=100)
%timeit ndf.nested.nest.to_lists().apply(lambda x: x, axis=1, raw=True)  # 1.43 ms
%timeit ndf.nested.nest.to_lists(["t", "flux"]).apply(lambda x: x, axis=1, raw=True)  # 0.89 ms
%timeit ndf.nested.nest.to_lists(["t", "band"]).apply(lambda x: x, axis=1, raw=True)  # 1.29 ms
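
To illustrate why raw=True sidesteps the per-row cast, here is a small sketch with plain pandas (again using simple float columns as stand-ins; `argument_types` is a hypothetical helper). With raw=True the applied function receives a NumPy array for each row, so no row pd.Series is constructed at all.

```python
import pandas as pd

df = pd.DataFrame({"t": [0.0, 1.0], "flux": [2.0, 3.0]})


def argument_types(frame, raw):
    """Collect the type names of the objects passed to the applied function."""
    seen = set()

    def record(row):
        seen.add(type(row).__name__)
        return 0

    frame.apply(record, axis=1, raw=raw)
    return seen


default_types = argument_types(df, raw=False)  # a pd.Series per row
raw_types = argument_types(df, raw=True)       # a NumPy array per row

# With raw=True there are no column labels, so values are accessed by position.
sums = df.apply(lambda row: row[0] + row[1], axis=1, raw=True)
print(default_types, raw_types, sums.tolist())
```

The trade-off is that the function can no longer look up values by column name, so the column order passed to to_lists() matters.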

@hombit
Collaborator

hombit commented May 30, 2024

#96 changed reduce() to no longer use to_lists().apply, so I am closing this issue.

@hombit hombit closed this as completed May 30, 2024