-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
integration with xarray? #79
Comments
Hey @swamidass, Thanks for opening the issue! I think it would useful to add native support for xarray to tiffslide. I wonder what approach would be best here: When comparing the datasets produced by tiffslide-xarray and a simple approach via xarray's zarr engine, I see there are mainly differences in the added metadata and dimensions: import tiffslide
import xarray
import tiffslide_xarray
fn = "/path/to/file/CMU-1.svs"
dataset1 = xarray.open_zarr(
tiffslide.open_slide(fn).zarr_group.store,
consolidated=False,
)
dataset2 = tiffslide_xarray.open_all_levels(fn) >>> dataset1
<xarray.Dataset>
Dimensions: (Y: 32914, X: 46000, S: 3, Y1: 8228, X1: 11500, Y2: 2057, X2: 2875)
Dimensions without coordinates: Y, X, S, Y1, X1, Y2, X2
Data variables:
0 (Y, X, S) float32 ...
1 (Y1, X1, S) float32 ...
2 (Y2, X2, S) float32 ...
Attributes:
multiscales: [{'datasets': [{'path': '0'}, {'path': '1'}, {'path': '2'}]...
>>> dataset2
DataTree('None', parent=None)
│ Dimensions: (y: 32914, x: 46000, c: 3)
│ Coordinates:
│ * y (y) int64 0 1 2 3 4 5 6 ... 32908 32909 32910 32911 32912 32913
│ * x (x) int64 0 1 2 3 4 5 6 ... 45994 45995 45996 45997 45998 45999
│ * c (c) <U1 'r' 'g' 'b'
│ Data variables:
│ image (y, x, c) <class 'numpy.uint8'> ...
│ Attributes: (12/30)
│ tiffslide.comment: Aperio Image Library v10.0.51
│ \n46920x33014 [...
│ tiffslide.vendor: aperio
│ tiffslide.objective-power: 20
│ tiffslide.mpp-x: 0.499
│ tiffslide.mpp-y: 0.499
│ aperio.AppMag: 20
│ ... ...
│ aperio.Top: 23.449873
│ aperio.User: b414003d-95c6-48b0-9369-8010ed517ba7
│ tiff.ImageDescription: Aperio Image Library v10.0.51
│ \n46920x33014 [...
│ tiffslide.series-index: 0
│ tiffslide.series-axes: YXS
│ source_file: /path/to/file...
├── DataTree('level1')
│ Dimensions: (y: 8228, x: 11500, c: 3)
│ Coordinates:
│ * y (y) float64 1.5 5.5 9.5 13.5 ... 3.29e+04 3.291e+04 3.291e+04
│ * x (x) float64 1.5 5.5 9.5 13.5 ... 4.599e+04 4.599e+04 4.6e+04
│ * c (c) <U1 'r' 'g' 'b'
│ Data variables:
│ image (y, x, c) <class 'numpy.uint8'> ...
│ Attributes: (12/30)
│ tiffslide.comment: Aperio Image Library v10.0.51
│ \n46920x33014 [...
│ tiffslide.vendor: aperio
│ tiffslide.objective-power: 20
│ tiffslide.mpp-x: 0.499
│ tiffslide.mpp-y: 0.499
│ aperio.AppMag: 20
│ ... ...
│ aperio.Top: 23.449873
│ aperio.User: b414003d-95c6-48b0-9369-8010ed517ba7
│ tiff.ImageDescription: Aperio Image Library v10.0.51
│ \n46920x33014 [...
│ tiffslide.series-index: 0
│ tiffslide.series-axes: YXS
│ source_file: /path/to/file...
└── DataTree('level2')
Dimensions: (y: 2057, x: 2875, c: 3)
Coordinates:
* y (y) float64 7.5 23.5 39.5 55.5 ... 3.287e+04 3.289e+04 3.29e+04
* x (x) float64 7.5 23.5 39.5 55.5 ... 4.596e+04 4.598e+04 4.599e+04
* c (c) <U1 'r' 'g' 'b'
Data variables:
image (y, x, c) <class 'numpy.uint8'> ...
Attributes: (12/30)
tiffslide.comment: Aperio Image Library v10.0.51
\n46920x33014 [...
tiffslide.vendor: aperio
tiffslide.objective-power: 20
tiffslide.mpp-x: 0.499
tiffslide.mpp-y: 0.499
aperio.AppMag: 20
... ...
aperio.Top: 23.449873
aperio.User: b414003d-95c6-48b0-9369-8010ed517ba7
tiff.ImageDescription: Aperio Image Library v10.0.51
\n46920x33014 [...
tiffslide.series-index: 0
tiffslide.series-axes: YXS
source_file: /path/to/file... I wonder if we can just add all of this metadata by default to the zarrstore we get from These docs are relevant for defining dimensions in the zarr spec https://docs.xarray.dev/en/latest/internals/zarr-encoding-spec.html
Cheers, |
That's a neat trick with the Zarr store! That aspect is certainly a better way to do this than I have set up currently, and I'll move over to that (unless there are some reasons to keep using the openslide API?). Other than this, let me enumerate the differences: tiffslide-xarray:
In contrast, the approach you used,
Let's address each these differences in turn, and perhaps you can give me some advice. 1. MetadataI think that the metadata is critical to pass through, for a whole host of reasons. We also add "source_file" to automatically track provenance. Ideally I'd like to think through some metadata conventions that might provide a breadcrumb trail of several things we might do with the data. My question though is whether or not there is a performance penalty for reading it from the file, or if it is just read in no matter what. If there is a performance penalty (is there?) perhaps we should make adding the metadata optional. 2 and 4. Coordinates, and Same or Different Group/Dimension NamesThere are major advantages to including coordinates and placing each level into a different group in a single DataTree. We can do things then like aligned slicing of multiple arrays at once:
Critically (and this is the major benefit of xarray), this sort of slicing pulls exact same region of interest (aligned with one other) of all the levels. That's only possible because (1) the dimensions are all named the same and (2) coordinates specify precisely how each image aligns with level 0. Additional datasets can be added too. For example, a target map and heat map at different sampling level can be added in, and they would be sliced in aligned ways too. The one concern I have is that inferring coordinates for slides after 0 does make the assumption that the left and bottom most parts of the slide are trimmed when the slide is downsampled. That is usually how it is done, but not always. If that assumption does not hold in the tiff image, there can be a few pixel error in the levels. In most applications that probably isn't too important, but its worth knowing and fixing if possible using meta data, if it exists (does it?). 3. Naming dimensionsI'm inclined to move over to the approach you are using, which seems like it might be more robust. Do all tiff files include an axes argument? If not, what do the names you are using default to? |
Okay, there is some more information that it worth noting, that turns out to be quite important. I did a test reading the data into memory. Using your code, that would be:
Versus
Turns out that the zarr based approach is more than 10x slower, at least on my computer, than the read_region based code that I used. That is substantial difference in performance. |
Setting import contextlib
import time
import tiffslide
import xarray
import tiffslide_xarray
@contextlib.contextmanager
def measure(label):
t0 = time.time()
yield
print(label, time.time() - t0)
base_path = "/Users/andreaspoehlmann/Development/zarr-benchmark/data/TifffileSVSDataset"
fn = f"{base_path}/CMU-1.svs"
# via zarr_engine
dataset1 = xarray.open_zarr(
tiffslide.open_slide(fn).zarr_group.store,
consolidated=False,
mask_and_scale=False,
)
# via tiffslide_engine
dataset2 = tiffslide_xarray.open_all_levels(fn)
with measure("zarr_engine"):
dataset1["0"].load()
with measure("tiffslide_engine"):
dataset2["image"].load() (1) MetadataTiffslide can just add the metadata to the zarr store attrs. That should be easy, and the source filename (if available) can be added to those too. There shouldn't be a large penalty since it's read anyways when you instantiate a TiffSlide object and access properties or read a region. (2) Coordinates and Dimension NamesI wonder if additional info in the zarr store can be provided to tell xarray to load the data as a DataTree with the relevant info. We could ask that in the datatree repo. (3) Dimension namestifffile provides these dimension names. In the general case those can be different dependent on the type of file you load: https://github.com/cgohlke/tifffile/blob/35c19b7c6c8b85aa09e490520677b5a2a31c14e4/tifffile/tifffile.py#L18009C12-L18056C9 tiffslide was written with a focus on tiff-like images used in a digital pathology context. So most of them are 2d images with 3 channels... |
This is really helpful. And the solution is fairly obscure, so I'm glad we looked into it. I wouldn't have stumbled into that easily. I think I have a way to do this now. I'll be back in touch soon/later. |
So I just released a beta version: 0.1b0. You can see the key code here: https://github.com/swamidasslab/tiffslide-xarray/blob/main/tiffslide_xarray/data.py Turns out the zarr backed approach is the right way to go. By tweaking the chunking, it can be much faster than read_region. Moreover, it is far easier to implement and doesn't rely on any new code. So we just have to trust that tifffile is making a valid Zarr store, and that xarray is using it correctly. Both those dependencies are well tested. So, that is how the library works. |
As for attributes and coords, you can see in that file how I add them. It is fairly straightforward. They don't have to be in the zarr store. Instead, I use the metadata from tiffslide to construct coordinates (with units and downsampling), and add attributes to the resulting xarray object. That works just fine. I do wonder though:
|
As for a comparison with approach you initially offered:
This is what we gain from the tiffslide wrapping around this (which is only a fairly short amount of code):
I'm noticing too that it is working pretty well as an independent library, and I'm working on some additional extensions to xarray in this package, such as micron from/to pixel management of coordinates. For that reason I don't think it makes most sense to directly fold it into this repo. I do wonder if a lighter and more helpful solution would be to add a method to the TIffSlide class:
This way, it doesn't change the require dependencies of tiffslide at all, and is a fairly minimal amount of work to add. Though, obviously, testing and documentation always take time. What do you think? |
Hi @swamidass, Thanks for implementing the changes in your library already! I'd be happy to integrate xarray support directly into tiffslide, and if you have time, it would be great to discuss the specifics in more detail in a PR. I'll provide a few more details, on how I would imagine that integration to look like, and what issues and edge cases we might face further down the line.
Awesome. Tifffile is one of the best maintained libraries out there. And xarray's zarr support is well tested too, so I have zero concerns regarding both.
I still wonder if the code could be simplified a lot if we'd manually add them to the zarr store
This depends a bit on the specific use case. For the pathology file formats that tiffslide currently supports, the metadata is usually parsed out of a tag stored in the tiff image and then available in the properties. Some formats store extra images with barcodes, that usually contain information that stores information about the physical sample on the slide. (But that would be something to consider for the future)
It's less about the file extension, and more about what formats are supported (and detected) via
I think one reason to integrate it would be the following: There are some tiff based (and also non-tiffbased) digital pathology formats that can be represented as zarr arrays, but they require composition of individual arrays into a larger array to represent the final scan. (one supported example is Leica SCN, where individual arrays in the tiff file come from different physical regions of the slide.) Xarray could then be used to build this larger virtual array when all zarr arrays have x and y coordinates corresponding to their real location. Another format example (currently unsupported) would be Mirax (see: #44) The integration I'd imagine would be a relatively slim def to_xarray(slide: TiffSlide, **other_options_that_might_be_needed) -> DataTree: This submodule could then intially check that only slides that provide "standard" (non-composed) zarrstores can be converted, and gradually add more support for more formats. At some stage it would then become the default backend to support formats like LeicaSCN and Mirax. (assuming it's fast enough) Let me know if this sounds interesting, |
Sure, I will make a PR when I have some time.
This is a really good use case, one that xarray would excel on. In fact, it's the sort of thing that could both make supporting Mirax far easier. Regarding LeicaSCN do you have some test files that exhibit the key features I could look at? Along with explanation of things that wouldn't be obvious from inspecting the TiffSlide attributes and standard functions? If I had those exmamples, I'd have a better idea of what could/should be done.
Another key reason to integrate is to add testing. Right now Zarr is not a tested format, and you haven't made any public "contracts" about how it's interface/attributes will work in the future. So it could accidentally shift at any time. If you add some key test cases there, the whole integration and utility would be far stronger and reliable.
That could be a good idea. It would simplify the code. More importantly, it would split concerns. You wouldn't have to worry about keeping something stable in the openslide API to better serve users of the zarr store. That decoupling is good practice, and will make this more fun. I think we can be fairly certain that:
For those two reasons, I think we can iterate pretty easily on this if you are willing to release some new versions of tiffslide (or create a new branch). I'd suggest that in this case a "kitchen sink" approach makes a lot of sense. I'd just add all the metadata you can put in that would not be misleading. I'd suggest these rules:
If you can do these things (particularly #1 and #3, #4 and #5) there wouldn't be any reason to use the openslide mimicing interface. We'd be able to rely exclusively on the zarr interface, which would indeed greatly simplify the code. Once we start exposing attributes this way, we really should document them in the tiffslide documentation, at least with a sentence description. IMHO, I think this would be a good direction to move forward on. I think we could really increase the utility of the library beyond primarily being a migration library from openslide. And it might even make supporting Mirax easier to do too. What do you think? For Down the Road...
|
you can get test images here: https://openslide.cs.cmu.edu/download/openslide-testdata/Leica/ TiffSlide parses the metadata information in those files and extracts a SeriesCompositionInfo dictionary, which contains the shapes of the composited levels as well as the locations of the seriess in each of those composited levels (basically the offsets). It's stored under the "tiffslide.series-composition" key: https://github.com/ap--/tiffslide/blob/3b6b1fc490c3acf3bd136eaeba77b67954c1dc01/tiffslide/tiffslide.py#L869-L1002 This information is then used here: https://github.com/ap--/tiffslide/blob/3b6b1fc490c3acf3bd136eaeba77b67954c1dc01/tiffslide/_zarr.py#L195-L236 to retrieve offset array data from the composited image. The current implementation checks every tiff series for potential overlap, and is only fast-enough, because Leica SCN images have (at least the ones I've seen) just a few series. For formats like Mirax this will be on the order of ~100s or ~1000s and will need a spatial index to be fast.
Tiffslide was built to be used as a cloud-native drop in replacement for openslide, so the openslide interface is important for its main usecase. Users could always use tifffile directly if they need more direct access. The That being said:
That would be a big win. |
Would an official integration with xarray be useful? I published an independent package that is fairly small, but should also add most the key functionality. It could stay as an independent package, but I am open to adding it to this library.
https://github.com/swamidasslab/tiffslide-xarray
The text was updated successfully, but these errors were encountered: