Support for DataFrame Periods definition #3708

Open
mattijn opened this issue Dec 10, 2024 · 2 comments

mattijn commented Dec 10, 2024

What is your suggestion?

Include support for the Period type. A period represents neither the start nor the end of the period, but rather the entire period itself.

Given the following:

import altair as alt
import pandas as pd

df = pd.DataFrame(data={'year':pd.period_range(start=2020, end=2024, freq='Y'), 'values': range(0,5)})
df

If I do:

alt.Chart(df).mark_bar().encode(x='year', y='values')

I receive the following error:

TypeError: Object of type Period is not JSON serializable
alt.Chart(...)

Have you considered any alternative solutions?

Potential solution (?) for #1365 and #3701 (comment)
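
As a stopgap for the TypeError above, one option is to convert the Period column to timestamps before handing the frame to Altair. This is only a sketch of a workaround, not the support requested in this issue. It assumes that standing in for each period with its start timestamp is acceptable, and the name df_plot is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": pd.period_range("2020", "2024", freq="Y"), "values": range(5)}
)

# Assumption: the start of each period is a good enough stand-in for the period.
# Series.dt.to_timestamp() defaults to how="start" and yields datetime64 values,
# which serialize to JSON without the Period TypeError.
df_plot = df.assign(year=df["year"].dt.to_timestamp())

print(df_plot.dtypes)
```

The converted frame could then be passed to the original call, for example alt.Chart(df_plot).mark_bar().encode(x='year:T', y='values'). The conversion does discard the idea that each value covers a whole period.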

@dangotbanned
Member

pandas

Note

Had to dig quite deep into pandas to figure out why pd.period_range was accepting an int for (start|end), since the type checker rejects it:

Argument of type "Literal[2020]" cannot be assigned to parameter "start" of type "str | datetime | date | Timestamp | Period | None" in function "period_range"
  Type "Literal[2020]" is not assignable to type "str | datetime | date | Timestamp | Period | None"
    "Literal[2020]" is not assignable to "str"
    "Literal[2020]" is not assignable to "datetime"
    "Literal[2020]" is not assignable to "date"
    "Literal[2020]" is not assignable to "Timestamp"
    "Literal[2020]" is not assignable to "Period"
    "Literal[2020]" is not assignable to "None".

import pyarrow as pa
import pandas as pd
import polars as pl

df = pd.DataFrame(
    {"year": pd.period_range("2020", "2024", freq="Y"), "values": range(5)}
)
df
   year  values
0  2020       0
1  2021       1
2  2022       2
3  2023       3
4  2024       4

Related docs

pyarrow

pa.table(df)
pyarrow.Table
year: extension<pandas.period<ArrowPeriodType>>
values: int64
----
year: [[50,51,52,53,54]]
values: [[0,1,2,3,4]]
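
The integers 50 to 54 in the year column above look like the period ordinals pandas stores internally, which for an annual frequency count years since 1970. A minimal sketch to check that reading, using only pandas (the names periods, ordinals and restored are illustrative):

```python
import pandas as pd

periods = pd.period_range("2020", "2024", freq="Y")

# Each Period exposes its integer ordinal; for annual periods this is year - 1970.
ordinals = [p.ordinal for p in periods]
print(ordinals)

# An ordinal only regains its meaning once paired with the frequency again.
restored = pd.Period(ordinal=ordinals[0], freq="Y")
print(restored)
```

This suggests the Arrow extension type carries the frequency as metadata, and that the raw int64 storage is not usable as a date without it.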

polars

pl.DataFrame(df)
ComputeError: cannot create series from Extension("pandas.period", ...)

ComputeError                              Traceback (most recent call last)
Cell In[22], line 1
----> 1 pl.DataFrame(df)

File ../site-packages/polars/dataframe/frame.py:405, in DataFrame.__init__(self, data, schema, schema_overrides, strict, orient, infer_schema_length, nan_to_null)
    400     self._df = arrow_to_pydf(
    401         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    402     )
    404 elif _check_for_pandas(data) and isinstance(data, pd.DataFrame):
--> 405     self._df = pandas_to_pydf(
    406         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    407     )
    409 elif not isinstance(data, Sized) and isinstance(data, (Generator, Iterable)):
    410     self._df = iterable_to_pydf(
    411         data,
    412         schema=schema,
   (...)
    416         infer_schema_length=infer_schema_length,
    417     )

File ../site-packages/polars/_utils/construction/dataframe.py:1125, in pandas_to_pydf(data, schema, schema_overrides, strict, rechunk, nan_to_null, include_index)
   1120     arrow_dict[str(col)] = plc.pandas_series_to_arrow(
   1121         data[col], nan_to_null=nan_to_null, length=length
   1122     )
   1124 arrow_table = pa.table(arrow_dict)
-> 1125 return arrow_to_pydf(
   1126     arrow_table,
   1127     schema=schema,
   1128     schema_overrides=schema_overrides,
   1129     strict=strict,
   1130     rechunk=rechunk,
   1131 )

File ../site-packages/polars/_utils/construction/dataframe.py:1213, in arrow_to_pydf(data, schema, schema_overrides, strict, rechunk)
   1209         pydf = pl.DataFrame(
   1210             [pl.Series(name, c) for (name, c) in zip(tbl.column_names, tbl.columns)]
   1211         )._df
   1212     else:
-> 1213         pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
   1214 else:
   1215     pydf = pl.DataFrame([])._df

ComputeError: cannot create series from Extension("pandas.period", Int64, Some("{\"freq\": \"Y-DEC\"}"))


@mattijn the overlapping problems I see here are:

  1. Converting a pd.PeriodIndex into a representation understood by narwhals (polars); see pola-rs/polars#5982 (Support for Pandas Period).
  2. Representing a pd.PeriodIndex as a Vega-Lite Time Unit.
  3. Serializing a pd.PeriodIndex such that it is understood by Vega.
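
For the serialization problem, one possible shape is to emit each period as an explicit interval, so the "entire period" meaning survives the trip to JSON. This is a sketch under assumptions, not a proposal for what Altair should do: the field names start, end and label are hypothetical, and the half-open convention (end equals the start of the next period) is a choice made here for illustration.

```python
import json

import pandas as pd

periods = pd.period_range("2020", "2024", freq="Y")

# Assumption: a period is serialized as a half-open interval [start, end),
# where end is the start of the following period (p + 1 shifts by one frequency step).
records = [
    {
        "start": p.start_time.isoformat(),
        "end": (p + 1).start_time.isoformat(),
        "label": str(p),
    }
    for p in periods
]

payload = json.dumps(records)
print(payload)
```

Records like these might map onto x and x2 encodings on the Vega-Lite side, but whether that is preferable to a Time Unit is exactly the open question in the list above.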

@dangotbanned
Member

On the topic of broadening temporal data type support, I would like to throw duration/timedelta into the mix.
Mentioned in vega/vega-datasets#641

python library support
...
Admittedly, we'd still need to add (if possible) support for duration/timedelta on the altair-side:

These would be easier on the python-side, since the three major dataframe packages all support it:

I'm not sure, but maybe this maps to vega.timeInterval?
