
Load collection-level assets into xarray #90

Merged
merged 19 commits into intake:main on Oct 20, 2021

Conversation

TomAugspurger
Collaborator

@TomAugspurger TomAugspurger commented Jun 3, 2021

This is a prototype for loading collection-level assets from a STAC collection. If you want a full example, install the main branch of pystac:

In [2]: import intake

In [3]: collection = intake.open_stac_collection("https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-annual-hi")

In [4]: source = collection.get_asset("zarr-https")

In [6]: source.kwargs
Out[6]: {'consolidated': True}

In [7]: source.to_dask()
Out[7]:
<xarray.Dataset>
Dimensions:                  (nv: 2, time: 41, x: 284, y: 584)
Coordinates:
    lat                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
    lon                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
  * time                     (time) datetime64[ns] 1980-07-01T12:00:00 ... 20...
  * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
  * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
Dimensions without coordinates: nv
Data variables:
    lambert_conformal_conic  int16 ...
    prcp                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    swe                      (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
    tmax                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    tmin                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    vp                       (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
Attributes:
    Conventions:       CF-1.6
    Version_data:      Daymet Data Version 4.0
    Version_software:  Daymet Software Version 4.0
    citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
    references:        Please see http://daymet.ornl.gov/ for current informa...
    source:            Daymet Software Version 4.0
    start_year:        1980

It's not quite ready, but I have a few points of discussion:

  1. This PR assigns some meaning to the asset keys (zarr-https in the example above). The STAC spec doesn't really give those any meaning, but they're used in other places (e.g. stackstac.stack(..., assets=[])), so I think we're OK.
  2. This invents a Zarr media type. We should do that somewhere upstream (likely Zarr or fsspec). We also aren't really using it. In theory we should be able to dispatch the loader based on the media type of the asset. So we could load a NetCDF dataset with xarray.open_dataset if it had that media type. Right now we're only supporting Zarr.
  3. I feel like the STAC Collection should be specifying the storage_options like consolidated=True, rather than the user. Does intake-stac do anything like that today?

Closes #59
Closes #70
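Point 2 above, dispatching the loader based on the asset's media type, could look something like this sketch. The driver mapping and the `driver_for_asset` helper are illustrative, not part of this PR; "application/vnd+zarr" is the Zarr media type the PR invents, and the other media-type strings are common STAC conventions.

```python
# Hypothetical dispatch table from a STAC asset's media type to an intake driver name.
MEDIA_TYPE_DRIVERS = {
    "application/vnd+zarr": "zarr",
    "application/x-netcdf": "netcdf",
    "image/tiff; application=geotiff; profile=cloud-optimized": "rasterio",
}

def driver_for_asset(asset: dict) -> str:
    """Pick a driver based on the asset's 'type' field (the STAC media type)."""
    media_type = asset.get("type")
    try:
        return MEDIA_TYPE_DRIVERS[media_type]
    except KeyError:
        raise ValueError(f"No driver registered for media type {media_type!r}")

driver_for_asset({"href": "s3://bucket/store.zarr", "type": "application/vnd+zarr"})
# → "zarr"
```

With a table like this, a NetCDF asset would route to an xarray/NetCDF driver instead of failing, which is the behavior the PR currently restricts to Zarr.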

@scottyhq
Collaborator

scottyhq commented Jun 7, 2021

Thanks @TomAugspurger !

This PR assigns some meaning to the asset keys (zarr-https in the example above). The STAC spec doesn't really give those any meaning, but they're used in other places (e.g. stackstac.stack(..., assets=[])), so I think we're OK.

Seems reasonable to me!

This invents a Zarr media type. We should do that somewhere upstream (likely Zarr or fsspec). We also aren't really using it. In theory we should be able to dispatch the loader based on the media type of the asset...

Makes sense. Maybe work here could resolve #59

I feel like the STAC Collection should be specifying the storage_options like consolidated=True, rather than the user. Does intake-stac do anything like that today?

Not really, but definitely have a look at #75 and in particular proposed changes for potentially deviating from the intake-xarray dependency to have drivers defined with intake-stac (https://github.com/intake/intake-stac/pull/75/files#diff-b45fa0c9c70f45ce9661f18946a5a2aed632ac4c1d3b1c09333291f77bbdfda6).

For the specific case of consolidated=True see pydata/xarray#5251. Definitely seems like a reasonable default and in the case that the metadata isn't actually consolidated, presumably the library reading the file can figure that out and deal with it?...

Also, just want to note this PR addresses #59

@TomAugspurger
Collaborator Author

I pushed an update so this is a bit simpler. I see that intake_stac.catalog.StacAsset essentially does what we wanted already.

My main question now is around what to call this method. It's really doing two things:

  1. Extracting a collection-level asset
  2. Loading that asset with the right driver

I've called this to_dask(). Should we also have .read() that returns the data using an in-memory container?

@TomAugspurger
Collaborator Author

TomAugspurger commented Jun 25, 2021

Having some second thoughts about the API design around selecting an asset, and I wonder if anyone else has thoughts.

We can't use __getitem__ to select an asset, since a Collection can contain both Assets and Items, and there's no requirement that an asset's key be distinct from the Item ids, leading to a potential ambiguity. So my_collection[asset_key].to_dask() is out.

Then the question is: do we have a separate method to get an asset, followed by a to_dask()?

my_asset = my_collection.get_asset(asset_key)  # type: StacAsset
ds = my_asset.to_dask()  # type: xarray.DataArray, dask.dataframe.DataFrame, etc.

or do we put the .to_dask directly on the StacCollection class, but require the user to pass a key?

ds = my_collection.to_dask(asset_key)  # type: xarray.DataArray, dask.dataframe.DataFrame, etc.

I suppose that the first option, a .get_asset(asset_key), would be a bit more consistent with the rest of intake?

@TomAugspurger changed the title from "[WIP]: Load collection-level assets into xarray" to "Load collection-level assets into xarray" on Jun 25, 2021
@TomAugspurger
Collaborator Author

TomAugspurger commented Jun 30, 2021

I feel like the STAC Collection should be specifying the storage_options like consolidated=True, rather than the user. Does intake-stac do anything like that today?

Thinking about this more, I think at least one more STAC extension is appropriate to capture this information. These would be extensions of the STAC Collection and Item Asset objects.

I want to capture everything necessary to go from STAC Asset to xarray Dataset within the STAC catalog itself. Essentially,

import fsspec
import xarray as xr

asset = stac_catalog.assets[key]
store = fsspec.get_mapper(asset.href, **storage_options)  # storage_options from the asset's metadata
ds = xr.open_zarr(store, **xarray_open_kwargs)            # open kwargs from the asset's metadata

So there are two pieces of information to capture:

  1. kwargs for fsspec.get_mapper
  2. kwargs for xarray.open_zarr

We could have two new extensions: fsspec and xarray. I don't know if that should be a single STAC extension (e.g. "xarray-assets" or "pangeo-assets") or two ("xarray-assets" and "fsspec-assets"). It feels a bit silly to have two STAC extensions with one field each. Anyway, here's a proposal for an xarray-assets extension.

    "zarr-abfs": {
      "href": "abfs://daymet-zarr/daily/hi.zarr",
      "type": "application/vnd+zarr",
      "title": "Daily Hawaii Daymet Azure Blob File System Zarr root",
      "description": "Azure Blob File System of the daily Hawaii Daymet Zarr Group on Azure Blob Storage for use with adlfs.",
      "roles": [
        "data",
        "zarr",
        "abfs"
      ],
      "xarray:storage_options": {
        "account_name": "daymeteuwest"
      },
      "xarray:open_kwargs": {
        "consolidated": true
      }
    },
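A sketch of how a client could consume the two proposed fields from an asset dict. The field names follow the proposal above; the helper name is mine, and the actual fsspec/xarray calls are commented out since they need network access.

```python
def xarray_kwargs_from_asset(asset: dict) -> tuple:
    """Extract the proposed xarray-assets extension fields, defaulting to empty dicts."""
    storage_options = asset.get("xarray:storage_options", {})
    open_kwargs = asset.get("xarray:open_kwargs", {})
    return storage_options, open_kwargs

asset = {
    "href": "abfs://daymet-zarr/daily/hi.zarr",
    "type": "application/vnd+zarr",
    "xarray:storage_options": {"account_name": "daymeteuwest"},
    "xarray:open_kwargs": {"consolidated": True},
}
storage_options, open_kwargs = xarray_kwargs_from_asset(asset)
# store = fsspec.get_mapper(asset["href"], **storage_options)
# ds = xr.open_zarr(store, **open_kwargs)
```

This is exactly the "STAC Asset -> xarray.Dataset without any arguments from the user" flow: everything xarray needs rides along in the catalog.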

@TomAugspurger
Collaborator Author

See https://github.com/tomAugspurger/xarray-assets for a proposal. I don't really know how valuable that is, but I think it's worth exploring a bit.

If that extension is present, then I think intake-stac could use it like https://github.com/tomAugspurger/xarray-assets#python-example to safely go from a STAC Asset -> xarray.Dataset without any arguments from the user.

Comment on lines +293 to +296
if isinstance(result, DataSource):
    kwargs = result._captured_init_kwargs
    kwargs = {**kwargs, **dict(storage_options=storage_options), **open_kwargs}
    result = result(*result._captured_init_args, **kwargs)
Collaborator Author

@martindurant currently, StacItem.__getitem__ will return a (subclass of) DataSource. Does this seem like the right way to control the parameters passed to that DataSource? If so, are _captured_init_args and _captured_init_kwargs considered "public"?

Member

This looks essentially the same as DataSourceBase.configure_new (aliased with get for compatibility, and __call__), but yes, seems fine to me.

are _captured_init_args and _captured_init_kwargs considered "public"?

They were means for internal storage and to be able to recreate things after serialisation, possibly to YAML. They are more "automatic" than "private", I think.

Does this seem like the right way

Unless configure_new already does the right thing.
I do wonder what result can be if not a DataSource.

Collaborator Author

Unless configure_new already does the right thing.

Gotcha. I think configure_new doesn't quite work, since we want to merge these keywords with the "existing" ones that are in ._captured_init_kwargs (we had a test relying on that anyway).

I don't see an easy way for configure_new to add a keyword to control whether or not to merge the new kwargs; since it's passing all the keywords through, there's the potential for a conflict.
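The merge in question gives newly passed options precedence over the captured ones, which plain dict unpacking handles since later entries win. A minimal illustration with made-up values, not intake's actual code:

```python
# Kwargs captured when the DataSource was first constructed.
captured_init_kwargs = {"urlpath": "s3://bucket/store.zarr", "consolidated": False}

# Options supplied later, e.g. from the xarray-assets extension.
storage_options = {"anon": True}
open_kwargs = {"consolidated": True}

# Later entries win, so open_kwargs overrides the captured 'consolidated': False.
merged = {**captured_init_kwargs, "storage_options": storage_options, **open_kwargs}
```

By contrast, a method that simply replaces the kwargs wholesale would drop `urlpath`, which is why the merge matters here.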

I do wonder what result can be if not a DataSource.

In this case, perhaps a StacAsset, but I might be misunderstanding intake-stac's design.

Collaborator Author

Just noting for posterity, intake-xarray's datasources define .kwargs and .storage_options properties. We can't use those because they apparently aren't implemented by RasterIOSource.

Collaborator

Unfortunately I don't really follow this... I've always been a little confused about what should be handled by intake-xarray, or whether intake-stac should just be stand-alone and define all the datasources under this repo. I sort of started down that road in https://github.com/intake/intake-stac/pull/75/files#diff-b45fa0c9c70f45ce9661f18946a5a2aed632ac4c1d3b1c09333291f77bbdfda6 but abandoned it...

Tom Augspurger added 2 commits October 19, 2021 13:49
* Tests for item-level, collection-level
* Implemented `get_asset`
@TomAugspurger
Collaborator Author

The latest commit implements the API described in #90 (comment). So now users call StacCollection.get_asset(key) to get a collection-level asset, and call .to_dask() on the result to load it into a Dask-backed object. This data source is configured to use xarray:storage_options and xarray:open_kwargs from the xarray-assets extension.

In [2]: import intake

In [3]: collection = intake.open_stac_collection("https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-annual-hi")

In [4]: source = collection.get_asset("zarr-https")

In [6]: source.kwargs
Out[6]: {'consolidated': True}

In [7]: source.to_dask()
Out[7]:
<xarray.Dataset>
Dimensions:                  (nv: 2, time: 41, x: 284, y: 584)
Coordinates:
    lat                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
    lon                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
  * time                     (time) datetime64[ns] 1980-07-01T12:00:00 ... 20...
  * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
  * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
Dimensions without coordinates: nv
Data variables:
    lambert_conformal_conic  int16 ...
    prcp                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    swe                      (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
    tmax                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    tmin                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    vp                       (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
Attributes:
    Conventions:       CF-1.6
    Version_data:      Daymet Data Version 4.0
    Version_software:  Daymet Software Version 4.0
    citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
    references:        Please see http://daymet.ornl.gov/ for current informa...
    source:            Daymet Software Version 4.0
    start_year:        1980

@TomAugspurger
Collaborator Author

Looks like the narrative docs are a bit out of date, but f1dc6ff added a small section on xarray-assets to the docs.

@kthyng did you already have STAC items / collections I could test this against? Or were you waiting for intake-stac to be updated before generating those?

@scottyhq do you have a chance to take a look at this?

@kthyng

kthyng commented Oct 19, 2021

@TomAugspurger You mean a catalog already set up to use xarray-assets? I do not at the moment but can work on it now. I haven't used a STAC extension before... is there a good reference for how to "install" it?

@TomAugspurger
Collaborator Author

is there a good reference for how to "install" it?

That should just require adding the extension's URL to the Catalog / Item's stac_extensions list, e.g. https://github.com/TomAugspurger/xstac/blob/main/examples/terraclimate/terraclimate.json#L14-L17 (I just noticed the test cases I added here were missing that), and then adding the relevant keys to the assets.

If you're generating STAC metadata for Zarr datasets, https://github.com/TomAugspurger/xstac might be helpful, or you can generate it "by hand".
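In JSON terms, "installing" the extension is just listing its schema URL in stac_extensions and adding the prefixed keys to an asset. A sketch using plain dicts; the schema URL, item id, and file path are illustrative placeholders, so check the extension repo for the canonical URL.

```python
import json

# Minimal STAC Item skeleton showing where the extension URL and the
# prefixed asset keys go. All identifiers here are placeholders.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "stac_extensions": [
        # Illustrative schema URL; see the xarray-assets repo for the real one.
        "https://stac-extensions.github.io/xarray-assets/v1.0.0/schema.json",
    ],
    "properties": {"datetime": "2021-10-19T00:00:00Z"},
    "assets": {
        "data": {
            "href": "file:///tmp/example.nc",  # placeholder path
            "type": "application/netcdf",
            "xarray:open_kwargs": {"drop_variables": "obs"},
        }
    },
}

print(json.dumps(item["assets"]["data"], indent=2))
```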

docs/source/tutorial.rst (outdated review thread, resolved)
@@ -230,6 +283,20 @@ class StacItem(AbstractStacCatalog):
    name = 'stac_item'
    _stac_cls = pystac.Item

    def __getitem__(self, key):
        result = super().__getitem__(key)
        # TODO: handle non-string assets?
Collaborator

I haven't come across this in the wild. Are they always strings? Here, for example, I see asset["0"]: https://cmr.earthdata.nasa.gov/stac/NSIDC_ECS/collections/NSIDC-0723.v4/items

Collaborator Author

Apparently it's possible to look up multiple items by passing a tuple to __getitem__. https://github.com/intake/intake/blob/d9faa048bbc09d74bb6972f672155e3814c3ca62/intake/catalog/base.py#L403

I haven't used it either.

intake_stac/catalog.py (outdated review thread, resolved)
intake_stac/catalog.py (outdated review thread, resolved)
Collaborator

@scottyhq scottyhq left a comment

Thanks for pushing this forward @TomAugspurger! I think this will be a great addition, left some comments for some minor suggested changes, then we should merge it!

@kthyng

kthyng commented Oct 19, 2021

@TomAugspurger Thanks for the help, that was really clear. I'm running into an issue, I think due to using pystac to help generate the STAC items, collections, and catalogs.

For code here:

    asset = {'href': f'file://{file}', 'media_type': "application/netcdf", 'title': title2,
             'extra_fields': properties, 
             'xarray:open_kwargs': {'xarray_kwargs': {'drop_variables': 'obs'}}}
    item.add_asset(
        key=item.id,
        asset=pystac.Asset(**asset)
    )

/tmp/ipykernel_46667/3264833412.py in make_item(file, name, title)
    227 item.add_asset(
    228     key=item.id,
--> 229     asset=pystac.Asset(**asset)
    230 )
    231 ds.close()

TypeError: __init__() got an unexpected keyword argument 'xarray:open_kwargs'
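The error arises because pystac.Asset only accepts its own constructor parameters; extension keys like xarray:open_kwargs need to go inside extra_fields, which pystac serializes verbatim onto the asset. A hedged sketch of the corrected dict (the path and title are placeholders, and the pystac call is left commented as in the original snippet):

```python
properties = {}  # any other per-asset properties from the original code

# Corrected asset dict: the extension key lives under 'extra_fields'
# rather than as a top-level constructor argument.
asset = {
    "href": "file:///tmp/example.nc",  # placeholder path
    "media_type": "application/netcdf",
    "title": "Example NetCDF asset",
    "extra_fields": {
        **properties,
        "xarray:open_kwargs": {"xarray_kwargs": {"drop_variables": "obs"}},
    },
}
# item.add_asset(key=item.id, asset=pystac.Asset(**asset))
```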

@kthyng

kthyng commented Oct 19, 2021

If you're generating STAC metadata for Zarr datasets, https://github.com/TomAugspurger/xstac might be helpful, or you can generate it "by hand".

I'm currently working with netcdf files and couldn't tell if I should be using xstac. Currently I am using STAC/intake to catalog mostly time series datasets but later will also have gridded-type datasets (ocean model output).

@TomAugspurger
Collaborator Author

Thanks @scottyhq, updated to address your comments.

@kthyng I'll take a closer look later, but I think you can update properties with your xarray-assets:

properties['xarray_kwargs'] = {'drop_variables': 'obs'}
item.add_asset(
    key=item.id,
    asset=pystac.Asset(**asset)
)

Hopefully that does the trick.

I haven't tried xstac on a NetCDF file yet. I'll give that a shot tonight or tomorrow and will add it as an example!

intake_stac/catalog.py (outdated review thread, resolved)
@kthyng

kthyng commented Oct 19, 2021

@kthyng I'll take a closer look later, but I think you can update properties with your xarray-assets:

properties['xarray_kwargs'] = {'drop_variables': 'obs'}
item.add_asset(
    key=item.id,
    asset=pystac.Asset(**asset)
)

Hopefully that does the trick.

@TomAugspurger thanks for the suggestion but unfortunately that hasn't worked for me. Specifically, it has to go into extra_fields in the asset part of the item (properties isn't defined) and it comes through to intake but not at a level that it seems to be recognized as needing to be used upon read in of the dataset with xarray.

Here's what I mean at the point it gets to the intake GUI. The xarray_kwargs are available under "metadata":

[screenshot: intake GUI showing xarray_kwargs under the "metadata" section]

but I think they need to be available under "args" to be used in xr.open_dataset.

[screenshot: intake GUI "args" section]

@scottyhq scottyhq merged commit fe1aaa1 into intake:main Oct 20, 2021
@TomAugspurger TomAugspurger deleted the feature/zarr-collection branch October 20, 2021 18:56
Successfully merging this pull request may close these issues:

  * Adding support for Zarr datasets
  * How to deal with STAC assets that don't declare 'type'