xarray support #891

Open
bocklund opened this issue May 23, 2018 · 19 comments

Comments

@bocklund

Is it possible or planned that xarray be supported natively?

You can't pass Datasets or DataArrays directly to Altair. Using .to_dataframe() doesn't quite work either, because xarray creates a hierarchical index, which is not supported.

However, you can reset the index to flatten out the DataFrame.

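import altair as alt

# calc_res is the poster's own xarray.Dataset (not defined here); select one component.
my_xr_dataset = calc_res.isel(component=1)

# Flatten the hierarchical index into ordinary columns.
df = my_xr_dataset.to_dataframe().reset_index()

alt.Chart(df).mark_circle().encode(
    x='X',
    y='GM',
    color='Phase',
).interactive()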

You still have to slice up the Dataset yourself, because the point limit is easy to hit with a full Dataset.
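To make the flattening step concrete, here is a minimal pure-Python sketch of what `.to_dataframe().reset_index()` amounts to conceptually: one record per combination of coordinate values. The `tidy_records` helper and the toy `Phase`/`X`/`GM` numbers are illustrative assumptions, not part of xarray or Altair.

```python
from itertools import product

def tidy_records(coords, values, name):
    """Flatten a labeled N-D array into one record per coordinate combination.

    coords: dict mapping dimension name -> list of labels (dict order = axis order)
    values: nested lists whose nesting follows the order of `coords`
    name:   column name for the data variable
    """
    dims = list(coords)
    records = []
    for index in product(*(range(len(coords[d])) for d in dims)):
        cell = values
        for i in index:  # walk down the nested lists, one axis at a time
            cell = cell[i]
        row = {d: coords[d][i] for d, i in zip(dims, index)}
        row[name] = cell
        records.append(row)
    return records

# Toy data (made up): 2 phases x 3 compositions.
coords = {"Phase": ["LIQUID", "FCC_A1"], "X": [0.0, 0.5, 1.0]}
gm = [[-10.0, -12.5, -11.0],
      [-9.0, -13.0, -10.5]]

rows = tidy_records(coords, gm, "GM")
print(len(rows))
print(rows[0])
```

Every dimension becomes an ordinary column, which is the flat shape a tabular charting layer expects; the row count is the product of the dimension sizes, which is why slicing matters.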

Thoughts?

@jakevdp
Collaborator

jakevdp commented May 23, 2018

Yes, I think we should support this. It will involve some reworking of the data_transformer architecture that's currently being done in #887, and then adding an xarray transformer to the pipeline.

@ellisonbg
Collaborator

ellisonbg commented May 23, 2018 via email

@shoyer

shoyer commented May 29, 2018

Calling .to_dataframe().reset_index() makes sense for most xarray.Dataset objects to transform them into tidy data. This is what we recommend with Seaborn, for example.

I don't think there is an unambiguous way to use an xarray.DataArray as input. This object is more similar to a pandas.Series in some ways (it usually represents a single variable) and a pandas.DataFrame in others. I would be inclined to raise here instead of guessing.

Visualizing xarray objects in altair could be awesome, but there are a couple of other challenges here, too:

  • File formats for representing the data. CSV isn't terribly efficient for ND-array data, and it's quite easy to run up against Altair's default 5000 row limit, which is only a 50x100 matrix. Saving netCDF in the browser could be interesting -- maybe with netcdf-js?
  • Xarray is usually used for gridded datasets, but Altair only has limited support for these (only heatmaps). Contour and labeled image plots would be nice to have, too.
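The arithmetic behind the "50x100 matrix" remark can be checked with a short sketch. `MAX_ROWS` simply restates the default 5000-row limit mentioned above, and the grid shapes are made-up examples.

```python
import math

MAX_ROWS = 5000  # the default row limit mentioned above

def tidy_row_count(shape):
    """Number of rows produced when a grid of this shape is flattened to tidy form."""
    return math.prod(shape)

# Hypothetical grid shapes: small matrix, 1-degree global grid, 12 monthly global grids.
for shape in [(50, 100), (100, 100), (360, 180), (12, 360, 180)]:
    rows = tidy_row_count(shape)
    status = "within limit" if rows <= MAX_ROWS else "over limit"
    print(shape, rows, status)
```

Because every array element becomes one row, even modest gridded datasets exceed the limit by orders of magnitude.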

@benbovy

benbovy commented Mar 7, 2023

(Re-activating this discussion after seeing @mattijn's nice geopython 2023 talk)

File formats for representing the data. CSV isn't terribly efficient for ND-array data, and it's quite easy to run up against Altair's default 5000 row limit, which is only a 50x100 matrix. Saving netCDF in the browser could be interesting -- maybe with netcdf-js?

Maybe also worth looking at Zarr Javascript implementations (https://github.com/freeman-lab/zarr-js or https://github.com/gzuidhof/zarr.js/)?

Xarray is usually used for gridded datasets, but Altair only has limited support for these (only heatmaps). Contour and labeled image plots would be nice to have, too.

Contour and image plots would be nice indeed. Altair seems to now have many other features that would potentially be interesting to use with Xarray datasets (gridded data or vector data cubes), e.g., facet plots, maps, parallel coordinates or parameters in the forthcoming release.

@mattijn
Contributor

mattijn commented Mar 7, 2023

Thanks @benbovy! It would be great if we can push this forward. There are a few things at play here. Let me try to mention them briefly.

By design, Altair currently works with tabular data, so the only route available today is to translate your gridded data into a dataframe so that you can use the x, y, and color encoding channels to create a heatmap. See e.g. https://altair-viz.github.io/gallery/simple_heatmap.html. This route is feasible for small raster tiles only (but you get tooltips and can connect it to another chart that displays e.g. a time series for each x/y pixel, similar to this: https://altair-viz.github.io/gallery/select_detail.html). The number of unique data points will eventually bottleneck performance.

Having said that, it would be great if we could push the isocontours transform forward. This transform works on native raster-like data. The good news is that it is already supported in Vega. For example, see this very nice example: https://vega.github.io/vega/examples/annual-precipitation/.

If you look at the input data, https://github.com/vega/vega/blob/main/docs/data/annual-precip.json, you can see it is actually a flattened list that includes info on the shape. Perfect for rasters that are not super large.
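A minimal sketch of that "flattened list plus shape" layout, in plain Python. The key names `width`, `height`, and `values` are an assumption about the file's layout and should be checked against it; `to_grid` and `cell` are hypothetical helpers.

```python
import json

def to_grid(matrix):
    """Flatten a 2-D nested list into a row-major grid object that carries its shape."""
    height = len(matrix)
    width = len(matrix[0]) if height else 0
    if any(len(row) != width for row in matrix):
        raise ValueError("all rows must have the same length")
    values = [v for row in matrix for v in row]
    return {"width": width, "height": height, "values": values}

def cell(grid, row, col):
    """Look a value back up from the flattened list."""
    return grid["values"][row * grid["width"] + col]

matrix = [[1, 2, 3],
          [4, 5, 6]]
grid = to_grid(matrix)
print(json.dumps(grid))
```

Storing the shape next to the flat values is what lets a consumer recover the 2-D structure without one record per cell.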

So to make this isocontours transform available in Altair, it first needs to be integrated into Vega-Lite. Luckily, this has already been raised in vega/vega-lite#6043, and based on the number of emojis it is a much-requested feature; help/PRs would surely be appreciated there.

But once there is support for isocontour transforms in Altair, the (flattened) raster data will still be in the JSON specification, since the isocontours are computed in JavaScript. For many occasions this will be fine, but for very large arrays this approach stops being useful (very large JSON files); at that point we have to precompute the isocontours so that the raw raster data is not within the JSON specification.
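To get a feel for how quickly the embedded data grows, here is a rough sketch that measures the JSON text size of a flattened grid. The grid contents are synthetic and the `width`/`height`/`values` layout is an assumption, so the numbers are only indicative.

```python
import json

def spec_size_bytes(n_rows, n_cols):
    """Size in bytes of the JSON text for an n_rows x n_cols grid embedded in a spec."""
    grid = {
        "width": n_cols,
        "height": n_rows,
        # Synthetic values with one decimal; real data would often be longer.
        "values": [round(0.1 * (i % 97), 3) for i in range(n_rows * n_cols)],
    }
    return len(json.dumps(grid).encode("utf-8"))

for n in (50, 200, 1000):
    print(f"{n}x{n}: {spec_size_bytes(n, n) / 1e6:.2f} MB")
```

The size scales with the number of cells, so doubling both sides of the raster roughly quadruples the specification.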

At this point https://github.com/hex-inc/vegafusion can come into play. VegaFusion is meant to work for very large datasets where there is an aggregation defined within the Altair chart specification. The core of VegaFusion is in Rust using Arrow and Arrow DataFusion.

If we can introduce support for the array interface protocol in Altair (I assume zarr supports this protocol?) using pyarrow (I think it supports arrays?) we could offer support for:

  • an approach that can serialize the raster data into a flattened list and let JavaScript compute the isocontours
  • an entry point for VegaFusion to calculate the isocontours on-the-fly using UDFs using Arrow DataFusion (no idea if this is possible, cc: @jonmmease)

I noticed there is also a zarr protocol. Is this very different than the array protocol? When would you use it over the array protocol? Does it integrate with arrow?

I may be missing other potential routes, so I am open to those as well.

Again, thanks for bringing this back on the agenda!

@jonmmease
Contributor

Thanks for the ping @mattijn,

I'd love to see 2D density in Vega-Lite/Altair. It would take some thought, but I'm pretty confident we could support this in VegaFusion as well.

@benbovy

benbovy commented Mar 10, 2023

Thanks @mattijn for the detailed and helpful explanations (I have to admit that I'm not familiar with Vega, Vega-Lite, or Altair internals).

I noticed there is also a zarr protocol. Is this very different than the array protocol? When would you use it over the array protocol? Does it integrate with arrow?

I guess my suggestion of using zarr.js (or @shoyer's suggestion of using netcdf.js) was aimed more at the case where n-d array data needs to be transferred efficiently (in chunks) to the browser. That could work by writing the xarray dataset to a temporary zarr/netcdf dataset (similar to the feather data transformer that the vegafusion widget renderer uses for dataframes) and then running a custom data loader or transformer in the browser to convert it into one or more Vega-Lite compatible (tabular) datasets. However, I don't know if this makes sense at all. Perhaps it is easier to simply define and run custom transformers on the server side?

@joelostblom joelostblom moved this to Gridded data support in Roadmap Apr 12, 2024
@jonmmease
Contributor

Performance aside, here's an example of displaying regular rasters in Vega-Lite:

[Figure: the example raster rendered in Vega-Lite; the original comment links to the chart in the Vega Editor.]

The idea is that the raster element values would be flattened into row-major ordering and inserted into the spec as "data". Then params are used to define the width and height of the raster. A window function is used to add a column with the row number, and the row number and width/height are used to compute the position of each rect (the x, x2, y, y2 values).
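The positioning arithmetic described above can be sketched as follows. This is not the spec linked in the comment: the `rect_positions` helper, the `raster_width` param name, and the field names `v` and `idx` are assumptions made for illustration, and the Vega-Lite dictionary is an untested sketch of the same steps.

```python
import json

def rect_positions(values, raster_width):
    """Compute x/x2/y/y2 for each element of a row-major flattened raster."""
    rects = []
    for idx, v in enumerate(values):  # idx is the 0-based position in the flat list
        col = idx % raster_width
        row = idx // raster_width
        rects.append({"v": v, "x": col, "x2": col + 1, "y": row, "y2": row + 1})
    return rects

values = [0, 1, 2, 3, 4, 5]  # a 2-row by 3-column raster, row-major
rects = rect_positions(values, raster_width=3)

# The same arithmetic expressed as Vega-Lite transforms (assumed, unverified sketch).
# The row_number window op counts from 1, hence the "- 1".
spec = {
    "params": [{"name": "raster_width", "value": 3}],
    "data": {"values": [{"v": v} for v in values]},
    "transform": [
        {"window": [{"op": "row_number", "as": "idx"}]},
        {"calculate": "(datum.idx - 1) % raster_width", "as": "x"},
        {"calculate": "floor((datum.idx - 1) / raster_width)", "as": "y"},
        {"calculate": "datum.x + 1", "as": "x2"},
        {"calculate": "datum.y + 1", "as": "y2"},
    ],
    "mark": "rect",
    "encoding": {
        "x": {"field": "x", "type": "quantitative"},
        "x2": {"field": "x2"},
        "y": {"field": "y", "type": "quantitative"},
        "y2": {"field": "y2"},
        "color": {"field": "v", "type": "quantitative"},
    },
}
print(rects[4])
print(json.dumps(spec)[:80])
```

Only the raster width is needed to place each rect; the height follows from the number of values.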

Performance of this for large rasters won't be great, even combined with VegaFusion, since the dataset with 1 row per raster element will be sent to the client, and the client has to render each raster element individually. But I've been wondering if it would make sense for VegaFusion to support rendering rect marks like this to images on the server, so that the base64-encoded PNG would be sent to the browser instead of the underlying data. This would be much faster to render in the browser. It would remove any click/hover/tooltip interactivity, but that might be ok, since I'm not used to seeing tooltips on large rasters. Let me know if anyone has thoughts on this idea!
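As a rough illustration of the "send an image instead of the data" idea, the sketch below encodes a small raster as a grayscale PNG and wraps it in a base64 data URL, using only the Python standard library. It does not reflect how VegaFusion works or would work; `raster_to_png_data_url` is a hypothetical helper and the raster is synthetic.

```python
import base64
import struct
import zlib

def _chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)

def raster_to_png_data_url(rows):
    """Encode a rectangular 2-D list of 0-255 integers as an 8-bit grayscale PNG data URL."""
    height, width = len(rows), len(rows[0])
    # IHDR: width, height, bit depth 8, color type 0 (grayscale), then defaults.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    png = (b"\x89PNG\r\n\x1a\n"
           + _chunk(b"IHDR", ihdr)
           + _chunk(b"IDAT", zlib.compress(raw))
           + _chunk(b"IEND", b""))
    return "data:image/png;base64," + base64.b64encode(png).decode("ascii")

raster = [[(x * y) % 256 for x in range(64)] for y in range(64)]
url = raster_to_png_data_url(raster)
print(len(url), "characters for", 64 * 64, "raster elements")
```

The browser then draws a single image rather than one rect per element, which is where the rendering speedup and the loss of per-element interactivity both come from.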

@joelostblom
Contributor

That's a neat approach to supporting images! If a raster mark is eventually added in Vega-Lite, do you think that the VegaFusion solution would still be the higher performance option for images? Then it seems like it would be valuable to implement both to bring the functionality to altair sooner and to provide a high performance option long term.

@jonmmease
Contributor

do you think that the VegaFusion solution would still be the higher performance option for images?

I think it would be comparable. I'm not certain yet how the implementation of raster marks in Vega-Lite would work, but I expect the end result would be a Vega image mark that gets displayed. This is how the Vega heatmap transform works (see https://vega.github.io/vega/examples/density-heatmaps/). So what I was thinking about is whether we could go directly from the rect representation to the image mark using VegaFusion. I'm not totally convinced it's a good idea, but something I'm thinking about.

Another angle, that makes this somewhat independent of the raster mark discussion, is that VegaFusion could integrate with Avenger to make it possible to replace any mark with an image rendered on the server. So you could do something like alt.Chart().mark_line(image=True).encode(...) and the line mark would be rendered to an image in Python and only the image would be sent to the browser.

@mattijn
Contributor

mattijn commented Apr 13, 2024

Within Python-land we could use https://github.com/cogeotiff/rio-tiler to read arrays or images as tiles, and in combination with the positioning logic of https://github.com/vega/altair_tiles these tiles could be rendered using mark_image instead of rects, i.e. aligning ourselves with the TileMatrixSet standard.

While these references originate from the geo-world, I think they also cover Cartesian unprojected array data.

@joelostblom
Contributor

VegaFusion could integrate with Avenger to make it possible to replace any mark with an image rendered on the server. So you could do something like alt.Chart().mark_line(image=True).encode(...) and the line mark would be rendered to an image in Python and only the image would be sent to the browser.

This sounds like a really useful step to integrate with Avenger as you said and be able to provide Datashader-like functionality in Altair, which definitely is a direction that's exciting for me personally! That would also provide a unique value-add of this approach even if an image mark is added eventually in VL.

@melonora

melonora commented May 16, 2024

Hello there,

I am one of the developers of the SpatialData framework https://spatialdata.scverse.org/en/latest/. We are investigating the use of vega (or at least vega like) to store view configurations in the spatialdata zarr store that would allow as much as possible reproducing views across our visualization ecosystem (matplotlib, napari and soon vitessce). Is there any working group currently on xarray support that I could get involved in?

@mattijn
Contributor

mattijn commented May 23, 2024

Hi @melonora! Thank you for chiming in, there is currently not a working group on this topic.

For now, if you have any ideas or feel uncertain about some of these topics, please ask or share!

I recently added another comment on this related issue, which might be of interest to you as well: #3077 (comment).

Again, thanks for joining this discussion! If there is anything we can do to assist in pushing this forward, please let me know!

@joelostblom
Contributor

@jonmmease and I were part of a brief discussion at a hackathon a couple of months ago with some other people from the scverse regarding using Vega-Lite/Altair in some of their subprojects. I'm guessing you are already aware of this @melonora (and maybe you were even there at the hackathon), but if not I can send a ping to the people we were in contact with to chime in here and see if any progress or plans were made.

@melonora

Hi @mattijn (nice to e-meet you) and @joelostblom, I am indeed one of the people from scverse and was also in the initial calls. One thing that was noticed for the implementation is that we first required a refactor in the spatialdata-plot library.

For the short term, it seemed more approachable to take a subset of the Vega grammar and extend it with what we need for our image plotting / visualization using matplotlib / napari / vitessce, and then see if / how we could feed that back into Vega. This was more a decision about what we can do in the short term :) Long term, however, it would be nice to see whether we can have SpatialData visualization / plotting fully supported across our visualization ecosystem using Vega / Vega-Lite / Altair.

Do you have developer meetings in which we could come to a plan on how to approach this?

@mattijn
Contributor

mattijn commented May 24, 2024

Let's plan one! Can you reach out to me at mattijn[at]gmail.com with your email?

@melonora

just sent you an email:)

@mattijn
Contributor

mattijn commented Jul 13, 2024

Cross-referencing raised issues as outcome of the next steps from the discussion below:


LLM summary of feature request:

Based on the Slack discussion, the idea of introducing a new mark type called mark_array in Altair (and consequently in Vega-Lite) to support labeled array data like xarray has a solid foundation. Here’s a proposal to develop this feature, leveraging insights and concerns from the thread:

Proposal for mark_array in Altair/Vega-Lite

Motivation

The current visualization options in Altair/Vega-Lite lack direct support for multidimensional array data such as those provided by xarray. This limitation necessitates cumbersome data transformations that can obscure the structure and meaning of the data. The mark_array aims to streamline the visualization process for labeled array data, providing a more intuitive and efficient approach.

Features and Capabilities

  1. Direct Input of Labeled Arrays:

    • Allow xarray datasets or data arrays to be directly passed to the mark_array function.
    • Ensure compatibility with array structures without needing conversion to DataFrame.
  2. Encoding Dimensions:

    • Utilize array dimensions directly in encoding, e.g., alt.Chart(data).mark_array().encode(x='longitude', y='latitude', color='temperature').
    • Support for multidimensional axes, facilitating complex data representations.
  3. Handling of Unlabeled Arrays:

    • For unlabeled arrays (like numpy arrays), provide a mechanism to name dimensions during chart creation, ensuring seamless integration.
  4. Efficient Rendering:

    • Internally optimize the rendering of large datasets, possibly by leveraging a raster argument or implementing efficient grid rendering techniques.
  5. Versatile Visualization Options:

    • Support various visual representations such as heatmaps, contour plots, and other raster-based visualizations.
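No `mark_array` exists today, so the following is only a sketch of what such a feature might do internally with current building blocks: name the dimensions of an unlabeled 2-D array and lower it to a rect heatmap specification. The `array_to_heatmap_spec` helper, its arguments, and the sample values are all hypothetical.

```python
def array_to_heatmap_spec(values, dims, name):
    """Lower a 2-D array to a Vega-Lite rect heatmap spec (hypothetical helper).

    values: 2-D nested list
    dims:   dict naming the two axes and their labels; the first key labels
            the rows, the second key labels the columns
    name:   field name for the cell values
    """
    (row_dim, row_labels), (col_dim, col_labels) = dims.items()
    records = [
        {row_dim: r, col_dim: c, name: values[i][j]}
        for i, r in enumerate(row_labels)
        for j, c in enumerate(col_labels)
    ]
    return {
        "data": {"values": records},
        "mark": "rect",
        "encoding": {
            "x": {"field": col_dim, "type": "ordinal"},
            "y": {"field": row_dim, "type": "ordinal"},
            "color": {"field": name, "type": "quantitative"},
        },
    }

# Made-up 3 x 2 temperature grid.
temperature = [[10.0, 11.5], [12.0, 13.5], [14.0, 15.5]]
spec = array_to_heatmap_spec(
    temperature,
    dims={"latitude": [40, 41, 42], "longitude": [-75, -74]},
    name="temperature",
)
print(len(spec["data"]["values"]))
```

This lowering still produces one record per cell, so it inherits the size limits discussed earlier in the thread; an efficient implementation would need a denser representation.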

Technical Considerations

  • Integration with Existing Marks:

    • Consider if mark_array should be an entirely new mark type or an extension of existing marks like rect.
    • Implement efficient handling paths for dense grid data to optimize performance.
  • API Design:

    • Ensure the API remains intuitive and leverages existing encoding paradigms in Altair/Vega-Lite.
    • Maintain consistency with the overall design philosophy of Altair and Vega-Lite.
  • Community Feedback and Collaboration:

    • Engage with the community to gather feedback and iterate on the design.
    • Collaborate with key stakeholders and contributors to align the new feature with broader project goals.

Related Issues and Discussions

  • Support for Image Data in Vega-Lite:

    • Tracking issues and feature requests for pixel/raster/heatmap marks to display images.
    • Explore the viability of integrating mark_array with these ongoing developments.
  • xarray Support:

    • Address the existing limitations in passing xarray datasets directly into Altair.
    • Develop solutions to handle hierarchical indexing and other structural nuances of xarray data.
  • Array Interchange Protocols:

    • Evaluate protocols for array interchange to ensure compatibility with multidimensional data.

Summary of Slack Discussion

The Slack conversation highlighted various challenges and potential solutions for integrating multidimensional array data visualization in Altair. Key points include:

  • Interest in supporting image data visualization.
  • Discussions on whether a new mark type (raster, heatmap) is needed or if existing marks (rect) can be optimized for array data.
  • Consideration of gridded vs. ungridded data inputs and their respective handling.
  • Suggestions for naming conventions and semantic distinctions to avoid confusion.
  • Proposal for a raster argument in marks to enable efficient rendering.

Next Steps

  1. Draft a detailed proposal and share it with the Altair and Vega-Lite communities for feedback.
  2. Collaborate with developers to prototype the mark_array feature.
  3. Conduct user testing to refine the implementation.
  4. Document the feature comprehensively, including usage examples and performance considerations.

By introducing mark_array, we can significantly enhance the capability of Altair/Vega-Lite to handle complex, multidimensional data natively, thereby broadening the scope of visualizations possible with these powerful libraries.

List of references of the URLs mentioned in the discussion:

  1. Blur-based heatmaps issue in Vega - GitHub
  2. xarray support issue in Altair - GitHub
  3. Support for array interchange protocols in Altair - GitHub
  4. Support for higher dimensional data in vl-contour - GitHub
  5. MNIST image example in Altair - GitHub
  6. Observable Plot: Raster mark - Observable
  7. Heat map - Wikipedia
  8. Heatmaps in Plotly - Plotly

Projects
Status: Gridded data support
Development

No branches or pull requests

9 participants