RioXarrayDataset for in-memory geographical xarray.DataArray objects #509

Closed
wants to merge 1 commit into from

weiji14
Contributor

@weiji14 weiji14 commented Apr 18, 2022

A torchgeo.dataset based on an in-memory xarray.DataArray! Allows users to directly use a dataset they've already loaded from a GeoTIFF/NetCDF or any other processing pipeline without having to save it to a file first. Requires rioxarray as a dependency.

Usage example:

import numpy as np
import xarray as xr
import rioxarray

xr_dataarray = xr.DataArray(
    data=np.random.randn(5, 3),
    coords=dict(y=[5.6, 4.5, 3.4, 2.3, 1.2], x=[6.7, 7.8, 8.9]),
    dims=["y", "x"],
)
xr_dataarray.rio.set_crs(input_crs="EPSG:3857")

dataset = RioXarrayDataset(xr_dataarray=xr_dataarray)

sample = dataset[dataset.bounds]
print(sample)

produces:

{'image': tensor([[ 1.0317, -0.4566, -0.2855],
        [ 0.6037, -1.5887, -1.4465],
        [-0.8714, -0.1645,  0.7559],
        [ 1.8187, -0.6460, -0.2239],
        [ 0.1100, -0.1918,  0.0911]], dtype=torch.float64), 'crs': CRS.from_epsg(3857), 'bbox': BoundingBox(minx=6.15, maxx=9.450000000000001, miny=0.65, maxy=6.1499999999999995, mint=0.0, maxt=9.223372036854776e+18)}
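
The bbox values printed above are consistent with treating the x/y coordinates as pixel centers and padding the outermost centers by half a pixel. The snippet below is only a minimal sketch of that arithmetic in plain Python, assuming evenly spaced coordinates; `bounds_from_centers` is a hypothetical helper written for illustration and is not part of this PR, torchgeo, or rioxarray.

```python
def bounds_from_centers(xs, ys):
    """Return (minx, maxx, miny, maxy) for a regular grid of pixel centers.

    Hypothetical helper for illustration only; assumes evenly spaced
    coordinates along each axis.
    """
    # Pixel size along each axis, derived from the first and last centers.
    xres = abs(xs[-1] - xs[0]) / (len(xs) - 1)
    yres = abs(ys[-1] - ys[0]) / (len(ys) - 1)
    # Pad the outermost centers by half a pixel to reach the pixel edges.
    return (
        min(xs) - xres / 2,
        max(xs) + xres / 2,
        min(ys) - yres / 2,
        max(ys) + yres / 2,
    )


# Same coordinates as the usage example above.
xs = [6.7, 7.8, 8.9]
ys = [5.6, 4.5, 3.4, 2.3, 1.2]
minx, maxx, miny, maxy = bounds_from_centers(xs, ys)
print(minx, maxx, miny, maxy)
```

Up to floating-point rounding, the result matches the minx/maxx/miny/maxy fields shown in the sample output; the mint/maxt fields are unrelated to this calculation.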

@github-actions github-actions bot added datasets Geospatial or benchmark datasets testing Continuous integration testing labels Apr 18, 2022
@adamjstewart adamjstewart added this to the 0.3.0 milestone Apr 19, 2022
@calebrob6
Member

Ah cool -- I really like this, it will play really nicely with the Planetary Computer environment. Do you mind if I rebase / work on the test coverage?

@weiji14
Contributor Author

weiji14 commented Jun 6, 2022

Ah cool -- I really like this, it will play really nicely with the Planetary Computer environment.

Yes, I've actually used it on Planetary Computer 😁 Though I'm hoping that eventually people can just read STAC assets directly using #412.

Do you mind if I rebase / work on the test coverage?

You're welcome to create a branch and work on it independently. I'm not a big fan of rebase/force-push as it changes the commit history (and the commit signatures become invalid). Probably won't be spending any time on this PR in the near future as I'm occupied with different projects and travelling soon ✈️ Happy to help you review/test stuff though if you get things to work better!

@calebrob6 calebrob6 marked this pull request as ready for review June 15, 2022 05:10
@calebrob6
Member

@weiji14 it seems I don't have permissions to push to your branch, can you allow that so I can rebase?

@adamjstewart
Collaborator

@weiji14 it seems I don't have permissions to push to your branch, can you allow that so I can rebase?

Sounds like @weiji14 doesn't like to rebase, a merge commit may work better. I think you can create a new branch using this branch as a starting point, then open a PR on @weiji14's repo to integrate your changes (including the merge commit).

I'm guessing this won't be ready in time for a 0.3.0 release about a week from now?

@adamjstewart adamjstewart marked this pull request as draft June 27, 2022 01:03
@adamjstewart adamjstewart modified the milestones: 0.3.0, 0.4.0 Jul 9, 2022
@adamjstewart
Collaborator

There is some interest in this feature for on-disk datasets as well. Some collaborators at Schlumberger are trying to write a GeoDataset for OCO-2. OCO-2 stores all data in NetCDF files (*.nc4). Unfortunately, these files don't store any geospatial metadata, so rasterio can't load them. I think they also encountered some issues with rioxarray, but xarray seemed to be able to load them. The other issue was that OCO-2 pixels aren't square, so a pure xarray solution wouldn't work for them.

@weiji14
Contributor Author

weiji14 commented Jul 25, 2022

Closing this, as it's probably better to start from scratch using torch DataPipes (Composition over Inheritance), as mentioned in #576 (comment).

@weiji14 weiji14 closed this Jul 25, 2022
@weiji14 weiji14 deleted the datasets/rioxarray branch July 25, 2022 17:44
@adamjstewart adamjstewart removed this from the 0.4.0 milestone Dec 10, 2022
@julien-blanchon
Contributor

I would really like to see this feature land.

This will fit really well in a STAC to virtually load an entire STAC collection and then sample from it using the torchgeo random sampler with an ROI.

@weiji14
Contributor Author

weiji14 commented May 10, 2023

This will fit really well in a STAC to virtually load an entire STAC collection

See weiji14/zen3geo#48 for example of STAC datapipes. What dataset are you working with, NetCDFs?

This was referenced Jul 19, 2023