Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible approach for stacking items #75

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

scottyhq
Copy link
Collaborator

Been working on this PR for a couple days to try and understand nuances of intake by splitting off from the intake-xarray plugin and thinking about how stacking items might work in practice.

I'm also trying to sort out the advantages of intake-stac versus working directly with STAC catalogs and stac-utils/stac-vrt#11. So I'm going to add some comments in the code and try to get feedback from @jhamman @martindurant @TomAugspurger and @rmg55 .

Currently this is a functional implementation that opens files in sequence, which is fine for small catalogs, but scaling up it would probably be smart to add an optional dependency on stac-vrt and inject the build_vrt() function as a catalog method.

addresses #65

@scottyhq scottyhq marked this pull request as draft March 10, 2021 01:05
@scottyhq
Copy link
Collaborator Author

scottyhq commented Mar 10, 2021

Screenshot illustrating how this currently works:

image

Binder 👈 Launch a binder notebook on this branch

@scottyhq scottyhq requested a review from jhamman March 10, 2021 01:13
Comment on lines +59 to +62
# if isinstance(self.urlpath, list):
# self._can_be_local = fsspec.utils.can_be_local(self.urlpath[0])
# else:
# self._can_be_local = fsspec.utils.can_be_local(self.urlpath)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant following up on your comment on the last PR about intake-xarray and it's design and purpose. It seems to me that some of the complication with intake-xarray is the integration with fsspec and ultimately whether the library dealing with I/O can handle fsspec filelike objects.

In the case of rioxarray / xr.open_rasterio. rasterio.open() does not know what to do with fsspec objects so we just pass urls around. and unless i'm missing something there is no need for fsspec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fsspec might be useful for expanding glob-strings; also, in some cases the best route is to have fsspec make local copies and pass the paths of those. It's the case where the xarray backend has it's own internal URL handling that's thorny. fsspec might still be useful for expanding globs, but neither file-likes or local copies are appropriate.

Copy link
Collaborator Author

@scottyhq scottyhq Mar 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the feedback @martindurant !

It's the case where the xarray backend has it's own internal URL handling that's thorny.

yes, and with the backend refactor ongoing it's especially confusing for me to understand. there is a proposal in rioxarray to handle fsspec objects corteva/rioxarray#246 , so maybe eventually more xarray backends will support this...

fsspec might still be useful for expanding globs, but neither file-likes or local copies are appropriate.

i think this is another case where intake-stac is a bit weird compared to other intake plugins. since STAC is already a catalog and we're basically translating to a new catalog type, i don't think there is a case where someone would manually write or save an intake yml version. And consequently there won't be any glob patterns in the URLs...

in the absence of fsspec integrations the key goals of intake-stac in my mind are:

  1. provide a common interface to explore catalogs and load data (.to_dask()) regardless of file type (.jp2, .tif, .nc)...
  2. provide convenience functions for complicated loading of many assets (arguably this could also be implemented as an open_mfrasterio() in rioxarray, but having the STAC metadata allows us to take shortcuts or do things otherwise impossible with just filenames
  3. tap into the holoviews system for visualization

Comment on lines +136 to +138
# NOTE: don't know what's going on here...
# https://github.com/intake/intake-xarray/issues/20#issuecomment-432782846
def _get_schema(self):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant I had a hard time following what is going on here (copied from https://github.com/intake/intake-xarray/blob/master/intake_xarray/raster.py). It seems like this could be greatly simplified or deleted if we assume we are just referencing URLs or local filepaths and not using fsspec ?

Comment on lines +122 to +127
# if self._can_be_local:
# files = fsspec.open_local(self.urlpath, **self.storage_options)
# else:
# pass URLs to delegate remote opening to rasterio library
# files = self.urlpath
# files = fsspec.open(self.urlpath, **self.storage_options).open()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again i feel like none of this is needed unless the backend library knows what to do with fsspec things?:

with fsspec.open(url) as f:
     rioxarray.open_rasterio(f)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, if we know for certain that we want original URLs passed through to the xarray backend, then fsspec has much less/nothing to do. It means there needs to be an alternate way to pass IO parameters (e.g., credentials) to that backend if it needs them.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means there needs to be an alternate way to pass IO parameters (e.g., credentials) to that backend

I assume this means setting GDAL configs @scottyhq

metadatas = {'items': {}}
hrefs = []
for item in items:
metadatas['items'][item] = {'STAC': self[item].metadata, 'assets': {}}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way i ended up structuring this it ended up being basically a loop around StacItem.stack_bands() so could probably just call that method instead of a bunch of duplicate code?

Comment on lines -270 to +338
def stack_bands(self, bands, path_as_pattern=None, concat_dim='band'):
def stack_assets(self, assets, path_as_pattern=None, concat_dim='band', override_coords=None):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed again to be less EO-specific and more in line with the STAC assets. i don't think too many people are using this library yet, so i'm not concerned with disruptive changes.

Comment on lines -280 to -282
Coordinate alignment is performed automatically upon calling the
`to_dask()` method to load into an Xarray DataArray if bands have diffent
ground sample distance (gsd) or array shapes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i learned the hard way that this isn't exactly true. it works for landsat bands that were aligned based on GSD, but here are two with the same CRS and transform, different GSD that xarray fails to do what we expect... i'll open an issue to illustrate this (https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/M/YB/2020/8/S2A_36MYB_20200814_0_L2A/B04.tif, https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/M/YB/2020/8/S2A_36MYB_20200814_0_L2A/B05.tif)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it might be best to raise a warning if CRS, Affine, and GSD do not match for requested stacked assets

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @scottyhq,
i stumbled upon this, do you have a suggestion how to go about stacking (sen2) bands with different GSD to be later used for lazy-loading in a dask array?
Thanks,
Vitus

Copy link
Collaborator Author

@scottyhq scottyhq Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vitusbenson i'd recommend explicitly interpolating, or using a gdal vrt

url4 = 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/M/YB/2020/8/S2A_36MYB_20200814_0_L2A/B04.tif'
url5 = 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/M/YB/2020/8/S2A_36MYB_20200814_0_L2A/B05.tif'
da4 = rioxarray.open_rasterio(url4, chunks='auto').assign_coords(band=['B4']) #gsd=10
da5 = rioxarray.open_rasterio(url5, chunks='auto').assign_coords(band=['B5']) #gsd=20
da5_interp = da5.interp(y=da4['y'], x=da4['x'],  method='nearest')
da = xr.concat([da5_interp, da4], dim='band') #gsd=10

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I'll have to give that a shot to compare.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I wasn't aware of rioxarray's interpolation capabilities - really useful!

__version__ = get_distribution('intake_stac').version


class RioxarraySource(DataSource, PatternMixin):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intended to be a simplified version of https://github.com/intake/intake-xarray/blob/master/intake_xarray/raster.py seeing what is the minimal driver if intake-xarray were to be separated out per xarray backends (netcdf, zarr, raster, etc)

@martindurant
Copy link
Member

Your issues are similar to the general conversation in intake-xarray and my fsspec/zarr PR to xarray: that the many backends need different inputs, so it's difficult to know when to invoke fsspec. I don't know how to make a comprehensive solution. Maybe keeping things specific, like here, is the way forward.

@rmg55
Copy link

rmg55 commented Mar 10, 2021

Hi @scottyhq - thanks for putting this together. Still working through the code and getting familiar with how STAC collections get mapped into intake-stac - so apologies if I am missing some key points in the suggestions below! I think adding these types of features (stacking/mosaicing) into intake-stac would really improve a lot of workflows.

Currently this is a functional implementation that opens files in sequence, which is fine for small catalogs, but scaling up it would probably be smart to add an optional dependency on stac-vrt

If dropping fsspec use and going straight to urls, we could also leverage the STAC metadata to create xarray objects with dask.delayed and dask.array.from_delayed that would be fast. Here is an example opening 500 assets in ~1 sec: https://gist.github.com/rmg55/ae5bc19b94345cc2033ef9261a229e24

trying to sort out the advantages of intake-stac versus working directly with STAC catalogs and stac-utils/stac-vrt#11

I am also having trouble wrapping my head around the best approach for this. I am conceptualizing this into two categories spatial mosaic and temporal stacking. Here are a few thoughts about what might be great to try and support long-term (realizing we probably want to start small and build features in).

Mosaic: Images collected at the same time and need to be stitched together. This is probably better accomplished with the stac_vrt approach. However, I think an interesting intake-stac + stac_vrt workflow could accomplish something like:

  • catalog.mosaic(by='time'), which would use a "group_by" approach to group assets by date, and pass to stac_vrt, and produce a new catalog with the mosaiced vrts. Perhaps even be able to do something like catalog.mosaic(by='time.weekofyear',how='max') - similar to pandas/xarray approach and leverage vrt pixel functions for the aggregations.

Stacking: Temporally stacking assets to create a time series (x,y,band(s),time). I think this could be accomplished in either intake-stac (perhaps with the approach mentioned above) or in stac_vrt. Thoughts below on using both approaches:

  • stac_vrt:
    • Advantage is that you are left with an catalog, that could be written to disk and shared with others. In this approach I think we would need to write to disk the in-mem vrt files (perhaps to a .vrt folder co-located with the catalog or use the intake caching mechanism).
    • Disadvantage - limited to 3dims, therefore we would need to have a vrt per band (although this is perhaps something we want to do anyways?)
  • intake-stac:
    • Advantage: supports 3+ dimensions (e.g. x,y,time,band).
    • Disadvantage: End up with an xarray object, not an intake-stac catalog.

I'd be happy to contribute to this effort once we can settle on a best approach.

@scottyhq
Copy link
Collaborator Author

scottyhq commented Mar 10, 2021

If dropping fsspec use and going straight to urls, we could also leverage the STAC metadata to create xarray objects with dask.delayed and dask.array.from_delayed that would be fast. Here is an example opening 500 assets in ~1 sec: https://gist.github.com/rmg55/ae5bc19b94345cc2033ef9261a229e24

Thanks @rmg55 this is awesome! This approach of getting the dimension and coordinate info from STAC JSON to delay file I/O is really slick! The disadvantage I see is the possibility for discrepancies when data is actually read. For example, note that you have an error with coordinate parsing in your function, which I came across when I tried to plot the resulting dataarray:

def _delayed_open(stac_item, band):
    transform = stac_item['assets'][band]['proj:transform'][0:6]
    nx,ny = stac_item['assets'][band]['proj:shape']
    # NOTE: xarray and rioxarray docs are wrong here, no need for from_gdal...
    # see https://github.com/pydata/xarray/issues/5005
    #x, y = Affine.from_gdal(*transform) * (np.arange(nx)+0.5, np.arange(ny)+0.5)
    x, y = Affine(*transform) * (np.arange(nx)+0.5, np.arange(ny)+0.5)
    href = stac_item['assets'][band]['href']
    date = datetime.datetime.fromisoformat(stac_item['properties']['datetime'][0:10]).date()
    # changed to darray because i always do `da=xr.open_rasterio()` ;)
    data = darray.from_delayed(dask.delayed(xr.open_rasterio)(href).data, (1,nx,ny), dtype=float)[np.newaxis,:,:,:]
    # swap order of x and y to be consistent with ordering returned from xr.open_rasterio
    da = xr.DataArray(data=data, coords=([date],[band],y,x),dims=('time','band','y','x'))
    return da

you bring up good points and great ideas for stacking and mosaicing options, continuing to prototype some examples here or via gists is definitely helping to figure out an approach.

@rmg55
Copy link

rmg55 commented Mar 12, 2021

@scottyhq - thanks for pointing out those bugs in the gist (updated recently to incorporate your edits). Agreed on your assessment that errors are not surfaced until the data is read - which could make tracing errors more confusing for users.

Also, I spent some time exploring the stackstac project by @gjoseph92 - which is really awesome. Here is a gist of using it with NASA HLS STAC: https://gist.github.com/rmg55/b144cb273d9ccfdf979e9843fdf5e651

One thing to note is that stackstac is able to work with STACs that are missing the projection extension by using the bbox, user input on resolution+crs:epsg, and warped_vrt. Note this can lead to slightly different shape/transform than the original file (see the above gist). Perhaps something to consider for stac_vrt as well?

@gjoseph92
Copy link

Hey @scottyhq! It does seem like we're working on similar things. Do you think stackstac could be helpful here? I could imagine intake-stac using it under the hood to generate the xarray, but I'm not sure if there are any intake-specific details that it can't handle right now.

@scottyhq
Copy link
Collaborator Author

scottyhq commented Jun 7, 2021

Do you think stackstac could be helpful here? I could imagine intake-stac using it under the hood to generate the xarray

Definitely! I abandoned this PR for a bit but would like to come back to it soon with STAC 1.0 coming out in the near future. I think I'll probably delay integrating stackstac and do that in a follow up.

I'm also watching corteva/rioxarray#246 and the linked PR in rasterio rasterio/rasterio#2141 , which will have some bearing on how fsspec and intake-xarray are currently used (as discussed in above comments)

@TomAugspurger
Copy link
Collaborator

@scottyhq is there anything that this would handle that stackstac doesn't? At a glance it seems like it already handles everything, though I might be missing things.

So I could imagine loading a catalog, maybe a .search() to filter it, and a .stack() to pass the ItemCollection to stackstac.

@scottyhq
Copy link
Collaborator Author

is there anything that this would handle that stackstac doesn't? At a glance it seems like it already handles everything, though I might be missing things.

This approach is much less-sophisticated than stackstac, so I think I might abandon this PR, or need to re-do the .stack() method adding an optional dependency on stackstac. That said, I think it would be nice to have consistency in what to_dask() returns whether it's a single asset or stack of assets. Specifically, whether we automatically set up the rioxarray .rio accessor with CRS stored in the coordinates... gjoseph92/stackstac#50

@TomAugspurger
Copy link
Collaborator

👍 great point about the consistent CRS handling. I'm planning to turn to that after getting pystac 1.0 support and collection-level assets handled in intake-stac.

FWIW, my hope is that intake-stac can be a pretty simple wrapper around pystac, pystac-client, and stackstac. Ideally we limit the complexity here.

@scottyhq
Copy link
Collaborator Author

FWIW, my hope is that intake-stac can be a pretty simple wrapper around pystac, pystac-client, and stackstac. Ideally we limit the complexity here.

In full agreement :)

I just added you as an admin for this repo @TomAugspurger, so that you have the ability to merge things

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants