Load collection-level assets into xarray #90

Merged
merged 19 commits into from
Oct 20, 2021
47 changes: 47 additions & 0 deletions docs/source/tutorial.rst
@@ -135,3 +135,50 @@ Intake-stac can turn this object into an Intake catalog:

    catalog = intake.open_stac_item_collection('single-file-stac.json')
    list(catalog)

Using xarray-assets
-------------------

Intake-stac uses the `xarray-assets`_ STAC extension to automatically use the appropriate keywords to load a STAC asset into a data container.

Intake-stac will automatically use the keywords from the `xarray-assets`_ STAC extension, if present, when loading data into a container.
For example, the STAC collection at <https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-annual-hi> defines an
asset ``zarr-https`` with the metadata ``"xarray:open_kwargs": {"consolidated": true}`` to indicate that this dataset should be
opened with the ``consolidated=True`` keyword argument. ``.to_dask()`` applies this keyword automatically.


.. code-block:: python

    >>> collection = intake.open_stac_collection(
    ...     "https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-annual-hi"
    ... )

    >>> source = collection.get_asset("zarr-https")
    >>> source.to_dask()
    <xarray.Dataset>
    Dimensions:                  (nv: 2, time: 41, x: 284, y: 584)
    Coordinates:
        lat                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
        lon                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
      * time                     (time) datetime64[ns] 1980-07-01T12:00:00 ... 20...
      * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
      * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
    Dimensions without coordinates: nv
    Data variables:
        lambert_conformal_conic  int16 ...
        prcp                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        swe                      (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
        tmax                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        tmin                     (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
        vp                       (time, y, x) float32 dask.array<chunksize=(1, 584, 284), meta=np.ndarray>
    Attributes:
        Conventions:       CF-1.6
        Version_data:      Daymet Data Version 4.0
        Version_software:  Daymet Software Version 4.0
        citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
        references:        Please see http://daymet.ornl.gov/ for current informa...
        source:            Daymet Software Version 4.0
        start_year:        1980

.. _xarray-assets: https://github.com/stac-extensions/xarray-assets
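
For readers unfamiliar with the extension, here is a minimal sketch of how an ``xarray:open_kwargs`` field in asset metadata becomes Python keyword arguments. It uses only the standard library. The asset record, its ``href``, and the helper ``open_kwargs_from_asset`` are hypothetical illustrations, not part of intake-stac; only the ``xarray:open_kwargs`` entry is taken from the tutorial text.

```python
import json

# Hypothetical, trimmed-down asset record. Only the "xarray:open_kwargs"
# entry mirrors the tutorial; the href is a placeholder.
ASSET_JSON = """
{
  "href": "https://example.invalid/daymet.zarr",
  "type": "application/vnd+zarr",
  "xarray:open_kwargs": {"consolidated": true}
}
"""


def open_kwargs_from_asset(asset):
    """Return a copy of the keywords stored under ``xarray:open_kwargs``.

    An asset without the extension field yields an empty dict.
    """
    return dict(asset.get("xarray:open_kwargs", {}))


asset = json.loads(ASSET_JSON)  # JSON ``true`` becomes Python ``True``
open_kwargs = open_kwargs_from_asset(asset)
no_extension = open_kwargs_from_asset({"href": "https://example.invalid/x.tif"})
```

Under these assumptions, ``open_kwargs`` could be splatted into a loader call (for example ``loader(href, **open_kwargs)``), which is roughly what the tutorial describes ``.to_dask()`` doing on the user's behalf.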
71 changes: 71 additions & 0 deletions intake_stac/catalog.py
@@ -4,6 +4,7 @@
import pystac
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry
from intake.source import DataSource
from pkg_resources import get_distribution
from pystac.extensions.eo import EOExtension

@@ -38,6 +39,7 @@
    'application/json': 'textfiles',
    'application/geo+json': 'geopandas',
    'application/geopackage+sqlite3': 'geopandas',
    'application/vnd+zarr': 'zarr',
    'application/xml': 'textfiles',
}
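
The entries above map an asset's media type to an intake driver name, and the new ``application/vnd+zarr`` entry is what lets a Zarr asset resolve to the ``zarr`` driver. A minimal sketch of such a lookup follows. The names ``DRIVERS`` and ``pick_driver`` and the fallback behaviour are assumptions made for illustration; only the media-type/driver pairs come from the diff.

```python
# Hypothetical names; only the key/value pairs are taken from the diff.
DRIVERS = {
    'application/json': 'textfiles',
    'application/geo+json': 'geopandas',
    'application/geopackage+sqlite3': 'geopandas',
    'application/vnd+zarr': 'zarr',
    'application/xml': 'textfiles',
}


def pick_driver(media_type, default='textfiles'):
    """Return the driver for ``media_type``.

    Falling back to ``default`` for unknown types is an assumption of
    this sketch, not necessarily what intake-stac does.
    """
    return DRIVERS.get(media_type, default)


zarr_driver = pick_driver('application/vnd+zarr')
unknown_driver = pick_driver('application/x-unknown')
```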

@@ -165,6 +167,60 @@ class StacCollection(StacCatalog):
    name = 'stac_catalog'
    _stac_cls = pystac.Collection

    def get_asset(
        self,
        key,
        storage_options=None,
        merge_asset_storage_options=True,
        merge_asset_open_kwargs=True,
        **kwargs,
    ):
        r"""
        Get a datasource for a collection-level asset.

        Parameters
        ----------
        key : str
            The key of the collection-level asset to load.
        storage_options : dict, optional
            Additional arguments for the backend fsspec filesystem.
        merge_asset_storage_options : bool, default True
            Whether to merge the storage options provided by the asset under the
            ``xarray:storage_options`` key with `storage_options`.
        merge_asset_open_kwargs : bool, default True
            Whether to merge the keywords provided by the asset under the
            ``xarray:open_kwargs`` key with ``**kwargs``.
        **kwargs
            Additional keyword options are provided to the loader, for example ``consolidated=True``
            to pass to :meth:`xarray.open_zarr`.

        Notes
        -----
        The Media Type of the asset will be used to determine how to load the data.

        Returns
        -------
        DataSource
            An intake data source for the asset. Call ``.to_dask()`` on it to load
            the data into a dask-backed object.
        """
        try:
            asset = self._stac_obj.assets[key]
        except KeyError:
            raise KeyError(
                f'No asset named {key}. Should be one of {list(self._stac_obj.assets)}'
            ) from None

        storage_options = storage_options or {}
        if merge_asset_storage_options:
            asset_storage_options = asset.extra_fields.get('xarray:storage_options', {})
            storage_options.update(asset_storage_options)

        if merge_asset_open_kwargs:
            asset_open_kwargs = asset.extra_fields.get('xarray:open_kwargs', {})
            kwargs.update(asset_open_kwargs)

        # StacAsset.__init__ takes (key, asset), so pass the asset key first.
        return StacAsset(key, asset)(storage_options=storage_options, **kwargs)
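
To make the merge order in ``get_asset`` concrete, here is a standalone sketch that needs neither intake nor pystac. ``merge_asset_options`` is a hypothetical helper, and the option values are made up. It applies the asset's ``xarray:*`` fields with ``dict.update``, so asset-provided values win over caller-provided ones on conflicting keys, and it works on copies so the caller's dictionaries are left untouched.

```python
def merge_asset_options(asset_fields, storage_options=None, merge_storage=True,
                        merge_open_kwargs=True, **kwargs):
    """Hypothetical helper: merge caller options with an asset's xarray-assets fields."""
    storage_options = dict(storage_options or {})  # copy the caller's dict
    kwargs = dict(kwargs)
    if merge_storage:
        # Asset values are applied last, so they override the caller's.
        storage_options.update(asset_fields.get('xarray:storage_options', {}))
    if merge_open_kwargs:
        kwargs.update(asset_fields.get('xarray:open_kwargs', {}))
    return storage_options, kwargs


# Made-up extra fields for illustration.
extra_fields = {
    'xarray:storage_options': {'account_name': 'example'},
    'xarray:open_kwargs': {'consolidated': True},
}
user_storage = {'account_name': 'mine', 'anon': True}

merged_storage, merged_kwargs = merge_asset_options(
    extra_fields, user_storage, consolidated=False, chunks={}
)
plain_storage, plain_kwargs = merge_asset_options(
    extra_fields, user_storage, merge_storage=False, merge_open_kwargs=False,
    consolidated=False,
)
```

With both merge flags disabled, the caller's values pass through unchanged, which is the escape hatch for overriding what the asset metadata recommends.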


class StacItemCollection(AbstractStacCatalog):
"""
@@ -230,6 +286,20 @@ class StacItem(AbstractStacCatalog):
    name = 'stac_item'
    _stac_cls = pystac.Item

    def __getitem__(self, key):
        result = super().__getitem__(key)
        # TODO: handle non-string assets?
Collaborator

i haven't come across this in the wild. are they always strings? here for example I see asset["0"] https://cmr.earthdata.nasa.gov/stac/NSIDC_ECS/collections/NSIDC-0723.v4/items

Collaborator Author

Apparently it's possible to look up multiple items by passing a tuple to __getitem__. https://github.com/intake/intake/blob/d9faa048bbc09d74bb6972f672155e3814c3ca62/intake/catalog/base.py#L403

I haven't used it either.

        asset = self._entries[key]
        storage_options = asset._stac_obj.extra_fields.get('xarray:storage_options', {})
        open_kwargs = asset._stac_obj.extra_fields.get('xarray:open_kwargs', {})

        if isinstance(result, DataSource):
            kwargs = result._captured_init_kwargs
            kwargs = {**kwargs, **dict(storage_options=storage_options), **open_kwargs}
            result = result(*result._captured_init_args, **kwargs)
Comment on lines +296 to +299
Collaborator Author

@martindurant currently, StacItem.__getitem__ will return a (subclass of) DataSource. Does this seem like the right way to control the parameters passed to that DataSource? If so, are _captured_init_args and _captured_init_kwargs considered "public"?

Member

This looks essentially the same as DataSourceBase.configure_new (aliased with get for compatibility, and __call__), but yes, seems fine to me.

> are _captured_init_args and _captured_init_kwargs considered "public"

They were means for internal storage and to be able to recreate things after serialisation, possibly to YAML. They are more "automatic" than "private", I think.

> Does this seem like the right way

Unless configure_new already does the right thing.
I do wonder what result can be if not a DataSource.

Collaborator Author

> Unless configure_new already does the right thing.

Gotcha. I think configure_new doesn't quite work, since we want to merge these keywords with the "existing" ones that are in ._captured_init_args (we had a test relying on that anyway).

I don't see an easy way for configure_new to add a keyword controlling whether to merge the new kwargs: since it passes all the keywords through, there's the potential for a conflict.

> I do wonder what result can be if not a DataSource.

In this case, perhaps a StacAsset, but I might be misunderstanding intake-stac's design.
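
For anyone following along, here is a rough sketch of the merge being discussed. ``ToySource`` and ``with_asset_options`` are hypothetical stand-ins, not intake classes. The sketch re-instantiates via ``type(source)(...)`` rather than calling the source as the diff does, and it only illustrates the dict-unpacking precedence: existing keywords are kept, ``storage_options`` is replaced wholesale, and the asset's open kwargs win on conflicts.

```python
class ToySource:
    """Hypothetical stand-in for a DataSource that records its init arguments."""

    def __init__(self, urlpath, **kwargs):
        self.urlpath = urlpath
        self._captured_init_args = (urlpath,)
        self._captured_init_kwargs = kwargs


def with_asset_options(source, storage_options, open_kwargs):
    """Rebuild ``source`` with asset options layered over its captured kwargs.

    Later entries in the unpacking win, matching the order used in the diff.
    A new dict is built, so the original source's kwargs are not modified.
    """
    kwargs = {
        **source._captured_init_kwargs,
        **dict(storage_options=storage_options),
        **open_kwargs,
    }
    return type(source)(*source._captured_init_args, **kwargs)


original = ToySource('data.zarr', chunks={}, consolidated=False)
rebuilt = with_asset_options(original, {'anon': True}, {'consolidated': True})
```

Here ``chunks`` survives from the original source while ``consolidated`` is overridden by the asset metadata, which is the "merge with the existing ones" behaviour described above.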

Collaborator Author

Just noting for posterity, intake-xarray's datasources define .kwargs and .storage_options properties. We can't use those because they apparently aren't implemented by RasterIOSource.

Collaborator

unfortunately i don't really follow this... i've always been a little confused about what should be handled by intake-xarray or whether intake-stac should just be stand-alone and define all the datasources under this repo. I sort of started down that road in https://github.com/intake/intake-stac/pull/75/files#diff-b45fa0c9c70f45ce9661f18946a5a2aed632ac4c1d3b1c09333291f77bbdfda6 but abandoned it...


        return result

    def _load(self):
        """
        Load the STAC Item.
@@ -357,6 +427,7 @@ def __init__(self, key, asset):
        Construct an Intake catalog 'Source' from a STAC Item Asset.
        asset = pystac.item.Asset
        """
        self._stac_obj = asset
        driver = self._get_driver(asset)

        super().__init__(