
Load collection-level assets into xarray #90

Merged
merged 19 commits into intake:main on Oct 20, 2021

Conversation

TomAugspurger
Collaborator

@TomAugspurger TomAugspurger commented Jun 3, 2021

This is a prototype for loading collection-level assets from a STAC collection. If you want a full example, install the main branch of pystac:

In [2]: import intake

In [3]: collection = intake.open_stac_collection("https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-annual-hi")

In [4]: source = collection.get_asset("zarr-https")

In [6]: source.kwargs
Out[6]: {'consolidated': True}

In [7]: source.to_dask()
Out[7]:
<xarray.Dataset>
Dimensions:                  (nv: 2, time: 41, x: 284, y: 584)
Coordinates:
    lat                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
    lon                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
  * time                     (time) datetime64[ns] 1980-07-01T12:00:00 ... 20...
  * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
  * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
Dimensions without coordinates: nv
Data variables:
    lambert_conformal_conic  int16 ...
    prcp                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    swe                      (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
    tmax                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    tmin                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    vp                       (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
Attributes:
    Conventions:       CF-1.6
    Version_data:      Daymet Data Version 4.0
    Version_software:  Daymet Software Version 4.0
    citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
    references:        Please see http://daymet.ornl.gov/ for current informa...
    source:            Daymet Software Version 4.0
    start_year:        1980

It's not quite ready, but I have a few points of discussion:

  1. This PR assigns some meaning to the asset keys (zarr-https in the example above). The STAC spec doesn't really give those any meaning, but they're used in other places (e.g. stackstac.stack(..., assets=[])), so I think we're OK.
  2. This invents a Zarr media type. We should do that somewhere upstream (likely Zarr or fsspec). We also aren't really using it. In theory we should be able to dispatch the loader based on the media type of the asset. So we could load a NetCDF dataset with xarray.open_dataset if it had that media type. Right now we're only supporting Zarr.
  3. I feel like the STAC Collection should be specifying the storage_options like consolidated=True, rather than the user. Does intake-stac do anything like that today?

Closes #59
Closes #70
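Point 2 above, dispatching the loader based on the asset's media type, could look something like this sketch. The driver mapping and the `driver_for_asset` helper are illustrative, not part of this PR; "application/vnd+zarr" is the Zarr media type the PR invents, and the other media-type strings are common STAC conventions.

```python
# Hypothetical dispatch table from a STAC asset's media type to an intake driver name.
MEDIA_TYPE_DRIVERS = {
    "application/vnd+zarr": "zarr",
    "application/x-netcdf": "netcdf",
    "image/tiff; application=geotiff; profile=cloud-optimized": "rasterio",
}

def driver_for_asset(asset: dict) -> str:
    """Pick a driver based on the asset's 'type' field (the STAC media type)."""
    media_type = asset.get("type")
    try:
        return MEDIA_TYPE_DRIVERS[media_type]
    except KeyError:
        raise ValueError(f"No driver registered for media type {media_type!r}")

driver_for_asset({"href": "s3://bucket/store.zarr", "type": "application/vnd+zarr"})
# → "zarr"
```

With a table like this, a NetCDF asset would route to an xarray/NetCDF driver instead of failing, which is the behavior the PR currently restricts to Zarr.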

@scottyhq
Collaborator

scottyhq commented Jun 7, 2021

Thanks @TomAugspurger !

This PR assigns some meaning to the asset keys (zarr-https in the example above). The STAC spec doesn't really give those any meaning, but they're used in other places (e.g. stackstac.stack(..., assets=[])), so I think we're OK.

Seems reasonable to me!

This invents a Zarr media type. We should do that somewhere upstream (likely Zarr or fsspec). We also aren't really using it. In theory we should be able to dispatch the loader based on the media type of the asset...

Makes sense. Maybe work here could resolve #59

I feel like the STAC Collection should be specifying the storage_options like consolidated=True, rather than the user. Does intake-stac do anything like that today?

Not really, but definitely have a look at #75 and in particular proposed changes for potentially deviating from the intake-xarray dependency to have drivers defined with intake-stac (https://github.com/intake/intake-stac/pull/75/files#diff-b45fa0c9c70f45ce9661f18946a5a2aed632ac4c1d3b1c09333291f77bbdfda6).

For the specific case of consolidated=True see pydata/xarray#5251. Definitely seems like a reasonable default and in the case that the metadata isn't actually consolidated, presumably the library reading the file can figure that out and deal with it?...

Also, just want to note this PR addresses #59

@TomAugspurger
Collaborator Author

I pushed an update so this is a bit simpler. I see that intake_stac.catalog.StacAsset essentially does what we wanted already.

My main question now is around what to call this method. It's really doing two things:

  1. Extracting a collection-level asset
  2. Loading that asset with the right driver

I've called this to_dask(). Should we also have .read() that returns the data using an in-memory container?

@TomAugspurger
Collaborator Author

TomAugspurger commented Jun 25, 2021

Having some second thoughts about the API design around selecting an asset, and I wonder if anyone else has thoughts.

We can't use __getitem__ to select an asset, since a Collection can contain both Assets and Items, and there's no requirement that an asset's key be distinct from the Item ids, leading to a potential ambiguity. So my_collection[asset_key].to_dask() is out.

Then the question is: do we have a separate method to get an asset, followed by a to_dask()?

my_asset = my_collection.get_asset(asset_key)  # type: StacAsset
ds = my_asset.to_dask()  # type: xarray.DataArray, dask.dataframe.DataFrame, etc.

or do we put the .to_dask directly on the StacCollection class, but require the user to pass a key?

ds = my_collection.to_dask(asset_key)  # type: xarray.DataArray, dask.dataframe.DataFrame, etc.

I suppose that the first option, a .get_asset(asset_key), would be a bit more consistent with the rest of intake?

@TomAugspurger changed the title from "[WIP]: Load collection-level assets into xarray" to "Load collection-level assets into xarray" on Jun 25, 2021
@TomAugspurger
Collaborator Author

TomAugspurger commented Jun 30, 2021

I feel like the STAC Collection should be specifying the storage_options like consolidated=True, rather than the user. Does intake-stac do anything like that today?

Thinking about this more, I think at least one more STAC extension is appropriate to capture this information. These would be extensions of the STAC Collection and Item Asset objects.

I want to capture everything necessary to go from STAC Asset to xarray Dataset within the STAC catalog itself. Essentially,

import fsspec
import xarray as xr

asset = stac_catalog.assets[key]
store = fsspec.get_mapper(asset.href, **storage_options)  # storage_options from the asset's metadata
ds = xr.open_zarr(store, **xarray_open_kwargs)            # open kwargs from the asset's metadata

So there are two pieces of information to capture:

  1. kwargs for fsspec.get_mapper
  2. kwargs for xarray.open_zarr

We could have two new extensions: fsspec and xarray. I don't know if that should be a single STAC extension (e.g. "xarray-assets" or "pangeo-assets") or two ("xarray-assets" and "fsspec-assets"). It feels a bit silly to have two STAC extensions with one field each. Anyway, here's a proposal for an xarray-assets extension.

    "zarr-abfs": {
      "href": "abfs://daymet-zarr/daily/hi.zarr",
      "type": "application/vnd+zarr",
      "title": "Daily Hawaii Daymet Azure Blob File System Zarr root",
      "description": "Azure Blob File System of the daily Hawaii Daymet Zarr Group on Azure Blob Storage for use with adlfs.",
      "roles": [
        "data",
        "zarr",
        "abfs"
      ],
      "xarray:storage_options": {
        "account_name": "daymeteuwest"
      },
      "xarray:open_kwargs": {
        "consolidated": true
      }
    },
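A sketch of how a client could consume the two proposed fields from an asset dict. The field names follow the proposal above; the helper name is mine, and the actual fsspec/xarray calls are commented out since they need network access.

```python
def xarray_kwargs_from_asset(asset: dict) -> tuple:
    """Extract the proposed xarray-assets extension fields, defaulting to empty dicts."""
    storage_options = asset.get("xarray:storage_options", {})
    open_kwargs = asset.get("xarray:open_kwargs", {})
    return storage_options, open_kwargs

asset = {
    "href": "abfs://daymet-zarr/daily/hi.zarr",
    "type": "application/vnd+zarr",
    "xarray:storage_options": {"account_name": "daymeteuwest"},
    "xarray:open_kwargs": {"consolidated": True},
}
storage_options, open_kwargs = xarray_kwargs_from_asset(asset)
# store = fsspec.get_mapper(asset["href"], **storage_options)
# ds = xr.open_zarr(store, **open_kwargs)
```

This is exactly the "STAC Asset -> xarray.Dataset without any arguments from the user" flow: everything xarray needs rides along in the catalog.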

@TomAugspurger
Collaborator Author

See https://github.com/tomAugspurger/xarray-assets for a proposal. I don't really know how valuable that is, but I think it's worth exploring a bit.

If that extension is present, then I think intake-stac could use it like https://github.com/tomAugspurger/xarray-assets#python-example to safely go from a STAC Asset -> xarray.Dataset without any arguments from the user.

Comment on lines +293 to +296
if isinstance(result, DataSource):
    kwargs = result._captured_init_kwargs
    kwargs = {**kwargs, **dict(storage_options=storage_options), **open_kwargs}
    result = result(*result._captured_init_args, **kwargs)
Collaborator Author

@martindurant currently, StacItem.__getitem__ will return a (subclass of) DataSource. Does this seem like the right way to control the parameters passed to that DataSource? If so, are _captured_init_args and _captured_init_kwargs considered "public"?

Member

This looks essentially the same as DataSourceBase.configure_new (aliased with get for compatibility, and __call__), but yes, seems fine to me.

are _captured_init_args and _captured_init_kwargs considered "public"?

They were means for internal storage and to be able to recreate things after serialisation, possibly to YAML. They are more "automatic" than "private", I think.

Does this seem like the right way

Unless configure_new already does the right thing.
I do wonder what result can be if not a DataSource.

Collaborator Author

Unless configure_new already does the right thing.

Gotcha. I think configure_new doesn't quite work, since we want to merge these keywords with the "existing" ones that are in ._captured_init_kwargs (we had a test relying on that anyway).

I don't see an easy way for configure_new to add a keyword to control whether or not to merge the new kwargs; since it's passing all the keywords through, there's the potential for a conflict.
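The merge in question gives newly passed options precedence over the captured ones, which plain dict unpacking handles since later entries win. A minimal illustration with made-up values, not intake's actual code:

```python
# Kwargs captured when the DataSource was first constructed.
captured_init_kwargs = {"urlpath": "s3://bucket/store.zarr", "consolidated": False}

# Options supplied later, e.g. from the xarray-assets extension.
storage_options = {"anon": True}
open_kwargs = {"consolidated": True}

# Later entries win, so open_kwargs overrides the captured 'consolidated': False.
merged = {**captured_init_kwargs, "storage_options": storage_options, **open_kwargs}
```

By contrast, a method that simply replaces the kwargs wholesale would drop `urlpath`, which is why the merge matters here.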

I do wonder what result can be if not a DataSource.

In this case, perhaps a StacAsset, but I might be misunderstanding intake-stac's design.

Collaborator Author

Just noting for posterity, intake-xarray's datasources define .kwargs and .storage_options properties. We can't use those because they apparently aren't implemented by RasterIOSource.

Collaborator

Unfortunately I don't really follow this... I've always been a little confused about what should be handled by intake-xarray, or whether intake-stac should just be stand-alone and define all the datasources under this repo. I sort of started down that road in https://github.com/intake/intake-stac/pull/75/files#diff-b45fa0c9c70f45ce9661f18946a5a2aed632ac4c1d3b1c09333291f77bbdfda6 but abandoned it...

Tom Augspurger added 2 commits October 19, 2021 13:49
* Tests for item-level, collection-level
* Implemented `get_asset`
@TomAugspurger
Collaborator Author

The latest commit implements the API described in #90 (comment). So now users call StacCollection.get_asset(key) to get a collection-level asset, and call .to_dask() on the result to load it into a Dask-backed object. This data source is configured to use xarray:storage_options and xarray:open_kwargs from the xarray-assets extension.

In [2]: import intake

In [3]: collection = intake.open_stac_collection("https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-annual-hi")

In [4]: source = collection.get_asset("zarr-https")

In [6]: source.kwargs
Out[6]: {'consolidated': True}

In [7]: source.to_dask()
Out[7]:
<xarray.Dataset>
Dimensions:                  (nv: 2, time: 41, x: 284, y: 584)
Coordinates:
    lat                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
    lon                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
  * time                     (time) datetime64[ns] 1980-07-01T12:00:00 ... 20...
  * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
  * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
Dimensions without coordinates: nv
Data variables:
    lambert_conformal_conic  int16 ...
    prcp                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    swe                      (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
    tmax                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    tmin                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    vp                       (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
Attributes:
    Conventions:       CF-1.6
    Version_data:      Daymet Data Version 4.0
    Version_software:  Daymet Software Version 4.0
    citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
    references:        Please see http://daymet.ornl.gov/ for current informa...
    source:            Daymet Software Version 4.0
    start_year:        1980

@TomAugspurger
Collaborator Author

Looks like the narrative docs are a bit out of date, but f1dc6ff added a small section on xarray-assets to the docs.

@kthyng did you already have STAC items / collections I could test this against? Or were you waiting for intake-stac to be updated before generating those?

@scottyhq do you have a chance to take a look at this?

@kthyng

kthyng commented Oct 19, 2021

@TomAugspurger You mean a catalog already set up to use xarray-assets? I do not at the moment but can work on it now. I haven't used a STAC extension before... is there a good reference for how to "install" it?

@TomAugspurger
Collaborator Author

is there a good reference for how to "install" it?

That should just require adding the extension's URL to the Catalog / Item's stac_extensions list, e.g. https://github.com/TomAugspurger/xstac/blob/main/examples/terraclimate/terraclimate.json#L14-L17 (I just noticed the test cases I added here were missing that), and then adding the relevant keys to the assets.

If you're generating STAC metadata for Zarr datasets, https://github.com/TomAugspurger/xstac might be helpful, or you can generate it "by hand".
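In JSON terms, "installing" the extension is just listing its schema URL in stac_extensions and adding the prefixed keys to an asset. A sketch using plain dicts; the schema URL, item id, and file path are illustrative placeholders, so check the extension repo for the canonical URL.

```python
import json

# Minimal STAC Item skeleton showing where the extension URL and the
# prefixed asset keys go. All identifiers here are placeholders.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "stac_extensions": [
        # Illustrative schema URL; see the xarray-assets repo for the real one.
        "https://stac-extensions.github.io/xarray-assets/v1.0.0/schema.json",
    ],
    "properties": {"datetime": "2021-10-19T00:00:00Z"},
    "assets": {
        "data": {
            "href": "file:///tmp/example.nc",  # placeholder path
            "type": "application/netcdf",
            "xarray:open_kwargs": {"drop_variables": "obs"},
        }
    },
}

print(json.dumps(item["assets"]["data"], indent=2))
```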

docs/source/tutorial.rst (outdated review thread, resolved)
@@ -230,6 +283,20 @@ class StacItem(AbstractStacCatalog):
    name = 'stac_item'
    _stac_cls = pystac.Item

    def __getitem__(self, key):
        result = super().__getitem__(key)
        # TODO: handle non-string assets?
Collaborator

I haven't come across this in the wild. Are they always strings? Here, for example, I see asset["0"]: https://cmr.earthdata.nasa.gov/stac/NSIDC_ECS/collections/NSIDC-0723.v4/items

Collaborator Author

Apparently it's possible to look up multiple items by passing a tuple to __getitem__. https://github.com/intake/intake/blob/d9faa048bbc09d74bb6972f672155e3814c3ca62/intake/catalog/base.py#L403

I haven't used it either.

intake_stac/catalog.py (outdated review thread, resolved)
intake_stac/catalog.py (outdated review thread, resolved)
Collaborator

@scottyhq scottyhq left a comment

Thanks for pushing this forward @TomAugspurger! I think this will be a great addition, left some comments for some minor suggested changes, then we should merge it!

@kthyng

kthyng commented Oct 19, 2021

@TomAugspurger Thanks for the help, that was really clear. I'm running into an issue, I think due to using pystac to help generate the STAC items, collections, and catalogs.

For code here:

    asset = {'href': f'file://{file}', 'media_type': "application/netcdf", 'title': title2,
             'extra_fields': properties, 
             'xarray:open_kwargs': {'xarray_kwargs': {'drop_variables': 'obs'}}}
    item.add_asset(
        key=item.id,
        asset=pystac.Asset(**asset)
    )

/tmp/ipykernel_46667/3264833412.py in make_item(file, name, title)
    227 item.add_asset(
    228     key=item.id,
--> 229     asset=pystac.Asset(**asset)
    230 )
    231 ds.close()

TypeError: __init__() got an unexpected keyword argument 'xarray:open_kwargs'
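The error arises because pystac.Asset only accepts its own constructor parameters; extension keys like xarray:open_kwargs need to go inside extra_fields, which pystac serializes verbatim onto the asset. A hedged sketch of the corrected dict (the path and title are placeholders, and the pystac call is left commented as in the original snippet):

```python
properties = {}  # any other per-asset properties from the original code

# Corrected asset dict: the extension key lives under 'extra_fields'
# rather than as a top-level constructor argument.
asset = {
    "href": "file:///tmp/example.nc",  # placeholder path
    "media_type": "application/netcdf",
    "title": "Example NetCDF asset",
    "extra_fields": {
        **properties,
        "xarray:open_kwargs": {"xarray_kwargs": {"drop_variables": "obs"}},
    },
}
# item.add_asset(key=item.id, asset=pystac.Asset(**asset))
```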

@kthyng

kthyng commented Oct 19, 2021

If you're generating STAC metadata for Zarr datasets, https://github.com/TomAugspurger/xstac might be helpful, or you can generate it "by hand".

I'm currently working with netcdf files and couldn't tell if I should be using xstac. Currently I am using STAC/intake to catalog mostly time series datasets but later will also have gridded-type datasets (ocean model output).

@TomAugspurger
Collaborator Author

Thanks @scottyhq, updated to address your comments.

@kthyng I'll take a closer look later, but I think you can update properties with your xarray-assets:

properties['xarray_kwargs'] = {'drop_variables': 'obs'}
item.add_asset(
    key=item.id,
    asset=pystac.Asset(**asset)
)

Hopefully that does the trick.

I haven't tried xstac on a NetCDF file yet. I'll give that a shot tonight or tomorrow and will add it as an example!

intake_stac/catalog.py (outdated review thread, resolved)
@kthyng

kthyng commented Oct 19, 2021

@kthyng I'll take a closer look later, but I think you can update properties with your xarray-assets:

properties['xarray_kwargs'] = {'drop_variables': 'obs'}
item.add_asset(
    key=item.id,
    asset=pystac.Asset(**asset)
)

Hopefully that does the trick.

@TomAugspurger thanks for the suggestion but unfortunately that hasn't worked for me. Specifically, it has to go into extra_fields in the asset part of the item (properties isn't defined) and it comes through to intake but not at a level that it seems to be recognized as needing to be used upon read in of the dataset with xarray.

Here's what I mean at the point it gets to the intake GUI. The xarray_kwargs are available under "metadata":

[screenshot: intake GUI showing xarray_kwargs under the "metadata" section]

but I think they need to be available under "args" to be used in xr.open_dataset.

[screenshot: intake GUI "args" section]

@scottyhq scottyhq merged commit fe1aaa1 into intake:main Oct 20, 2021
@TomAugspurger TomAugspurger deleted the feature/zarr-collection branch October 20, 2021 18:56
Successfully merging this pull request may close these issues:

  * Adding support for Zarr datasets
  * How to deal with STAC assets that don't declare 'type'