Altair graphs in a random fashion #3714

Open
ale-dg opened this issue Dec 17, 2024 · 3 comments

ale-dg commented Dec 17, 2024

What happened?

Hi all,

I have created the following graph several times with Altair, using this code:

alt.Chart(
    df.group_by("season")
    .agg(pl.count("season").alias("count"))
).mark_bar().encode(
    alt.X(
        "season:N",
        title="Season",
        sort=['Spring', 'Summer', 'Autumn', 'Winter'],
    ),
    alt.Y("count:Q", title="Frequency"),
    alt.Color("season:N", title="Season", legend=None),
    [
        alt.Tooltip("count:Q", title="Frequency", format=".3s"),
        alt.Tooltip("season:N", title="Season"),
    ],
).properties(
    width=600, height=300, title="Number of trips per season"
)

Every time I run it I get different results: the amounts are the same, but the order in which the bars are rendered changes, even though I have set a specific order in the alt.X options. See the examples below:

[Two screenshots of the rendered chart]

I have found a workaround, which adds an explicit ordering column on the Polars side, although I am curious about Altair's behaviour, as I am not sure why this is happening. Here is the code I am using to render the graph correctly:

alt.Chart(
    df.group_by("season")
    .agg(pl.count("season").alias("count"))
    .with_columns(
        pl.when(pl.col("season") == "Spring")
        .then(1)
        .when(pl.col("season") == "Summer")
        .then(2)
        .when(pl.col("season") == "Autumn")
        .then(3)
        .otherwise(4)
        .alias("season_order")
    )
).mark_bar().encode(
    alt.X(
        "season:N",
        title="Season",
        sort=alt.EncodingSortField(field="season_order", order="ascending"),
    ),
    alt.Y("count:Q", title="Frequency"),
    alt.Color("season:N", title="Season", legend=None),
    [
        alt.Tooltip("count:Q", title="Frequency", format=".3s"),
        alt.Tooltip("season:N", title="Season"),
    ],
).properties(
    width=600, height=300, title="Number of trips per season"
)

Thank you for your help!

Best

What would you like to happen instead?

The chart should render in the specified sort order, not in a random one.

Which version of Altair are you using?

5.5.0

@ale-dg ale-dg added the bug label Dec 17, 2024
dangotbanned (Member) commented

@ale-dg can you provide the source df for this repro please?

If you omitted it because you are unable to share it, could you instead rewrite your examples with dummy data that also demonstrates the issue?

ale-dg (Author) commented Dec 18, 2024

@dangotbanned thank you for your reply. I didn't attach the source because it is a 1.7 GB parquet file, so it is not feasible to attach.

I wrote the following code with the Seattle Weather dataset, also using Polars, although this version works fine:

import altair as alt
import polars as pl
from vega_datasets import data

pl.Config.set_tbl_cols(100)
alt.theme.enable("googlecharts")

source = data.seattle_weather()

df = pl.DataFrame(source)

df = df.with_columns(pl.col("date").dt.month().alias("month"))

df = df.with_columns(
    pl.when(pl.col("month").is_between(3, 5))
    .then(pl.lit("Spring"))
    .when(pl.col("month").is_between(6, 8))
    .then(pl.lit("Summer"))
    .when(pl.col("month").is_between(9, 11))
    .then(pl.lit("Autumn"))
    .otherwise(pl.lit("Winter"))
    .alias("season")
)

alt.Chart(
    df.group_by("season").agg(pl.count("season").alias("count"))
).mark_bar().encode(
    alt.X("season:N", title="Season", sort=["Spring", "Summer", "Autumn", "Winter"]),
    alt.Y("count", title="Frequency"),
    alt.Color("season:N", legend=None),
    [
        alt.Tooltip("season:N", title="Season"),
        alt.Tooltip("count", title="Frequency", format=".3s"),
    ],
).properties(
    title="Count of records per season", width=900, height=300
)

See below for some screenshots of the original issue. I copied and pasted the code for generating the graph into different Jupyter cells, and the issue happens every time. I should also mention that it does not only happen with the "season" column; it also happens with any other column where I try to specify the order in which the graph should appear.

Hope this helps! Otherwise, please let me know.

Best

[Four screenshots of the original issue]

dangotbanned (Member) commented

@ale-dg does your original preserve the order if you change the encoding type?

Original code block

#3714 (comment)

alt.Chart(
    df.group_by("season")
    .agg(pl.count("season").alias("count"))
).mark_bar().encode(
    alt.X(
        "season:N",
        title="Season",
        sort=['Spring', 'Summer', 'Autumn', 'Winter'],
    ),
    alt.Y("count:Q", title="Frequency"),
    alt.Color("season:N", title="Season", legend=None),
    [
        alt.Tooltip("count:Q", title="Frequency", format=".3s"),
        alt.Tooltip("season:N", title="Season"),
    ],
).properties(
    width=600, height=300, title="Number of trips per season"
)

alt.X(
    # "season:N",
    "season:O",
    title="Season",
    sort=['Spring', 'Summer', 'Autumn', 'Winter'],
)

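Per the encoding data types documentation (encoding-data-types):

| Data Type    | Shorthand Code | Description                       |
|--------------|----------------|-----------------------------------|
| quantitative | Q              | a continuous real-valued quantity |
| ordinal      | O              | a discrete ordered quantity       |
| nominal      | N              | a discrete unordered category     |
| temporal     | T              | a time or date value              |
| geojson      | G              | a geographic shape                |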

To me, it sounds like you know that "season" has an order.
But that is at odds with telling altair that it is an unordered category.


Alternatives

My other suggestions on the polars-side if the above doesn't fix it:

Tell polars to preserve order (pl.DataFrame.group_by)

# df.group_by("season").agg(pl.count("season").alias("count"))
df.group_by("season", maintain_order=True).agg(pl.count("season").alias("count"))

Using GroupBy.len instead of pl.count

# df.group_by("season").agg(pl.count("season").alias("count"))
df.group_by("season").len("count")

Hope something here can help!
