Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Mirax images #44

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from
Draft

Support Mirax images #44

wants to merge 7 commits into from

Conversation

ap--
Copy link
Collaborator

@ap-- ap-- commented Aug 17, 2022

Draft while work in progress.

Recommended way for initial support is going the kerchunk route. If we provide the tiffslide properties in the zarrstore this format should already be supported.

Todo:

  • build minimal tiffslide properties and add to zattrs (see: tiffslide._kerchunk.to_kerchunk)
  • support per level positioning in composition
  • add missing position_record index types

If anyone has time to take this over, I'm happy to provide guidance 😃

@ap-- ap-- self-assigned this Aug 17, 2022
@ap-- ap-- linked an issue Aug 18, 2022 that may be closed by this pull request
@ap--
Copy link
Collaborator Author

ap-- commented Dec 21, 2022

Some extra pieces of background information regarding Mirax support, that might be useful:

As far as I know, the file format specification is not available and has been reverse engineered. The current best place to understand the Mirax format is this mailing list message (https://lists.andrew.cmu.edu/pipermail/openslide-users/2012-July/000373.html).

An important piece of information is that there are multiple ways Mirax can store tile overlaps. I recommend getting support for the non-overlapping format to work first, and then work on supporting the other formats. My code currently handles only one of them and should crash with a NotImplementedError in the other cases.

To understand the basic idea of the implementation in tiffslide, it's good to know how tiffslide currently handles access to the image data via zarr. In tiffslide Tiffslide.zarr_group provides a zarr Group of the underlying image data. For simpler tiff based image formats (like svs) the zarr group provided by tifffile maps very intuitively to the underlying imagedata. For most of the supported tifflike formats this hits exactly the code path that you mention in your email. For these cases get_zarr_selection uses the composition is None code path. But, there are other formats that require explicit composition of the image data, because the actual image data stored in the file is sparse and on a not uniformly offset grid, or because tiles are stored with small overlaps. Now, because tiffslide was originally designed to work together with another tool I wrote (https://github.com/Bayer-Group/pado) it was supposed to work with serialized image specifications to scale to petabyte scale pathology datasets. To be able to support non-tiff-based images, I opted for kerchunk (which is a specification for accessing archival data independent of storage locations). So when you look at tiffslide._kerchunk.from_kerchunk you can see that there's basically a path to create Tiffslide.zarr_group from fsspec.ReferenceFileSystems that are backed by kerchunk definitions. For the sake of backwards compatibility this is currently reusing the _tifffile attribute but all logic for creating the zarr groups, and for accessing the sliced image data has been extracted to tiffslide._zarr.

Now to support Mirax, there are a few things that need to happen:

  1. all image data stored in a Mirax file/directory needs to be representable as a zarr Group containing zarr Arrays. This is (I think) working for now in the _mirax.py code. It basically generates a kerchunk representation of all image data.
  2. then get_zarr_selection will need to implement how to composite this image data into a simple hierarchical representation.
  3. Since the mirax zarr representation is using one zarr Group per tile, this will require an optimized implementation using an RTree to be performant.
  4. all of this will need to be checked for levels other than 0.

Right now (1) and (2) should work for CMU-1.mrxs from the openslide testdata repository. 
Try running python -m tiffslide._mirax path/to/CMU-1.mrxs it should save a composited image.
(3) will be necessary to have reasonable performance for large Mirax images. (4) is currently untested.

@GuillaumeDesforges
Copy link

GuillaumeDesforges commented Jan 3, 2023

I am trying to fix the tests on the format-mirax branch

@ap-- do you remember the rational for adding this s character to the groupname?

https://github.com/bayer-science-for-a-better-life/tiffslide/blob/785e37533fe36e3c2e0f4fe00273bbd658552e46/tiffslide/_kerchunk.py#L81

This causes the keys of the zarr Group to also be strings prefixed by an s, which makes this line fail

https://github.com/bayer-science-for-a-better-life/tiffslide/blob/785e37533fe36e3c2e0f4fe00273bbd658552e46/tiffslide/_zarr.py#L197

Failure in tests: https://github.com/bayer-science-for-a-better-life/tiffslide/actions/runs/3749005182/jobs/6366987967#step:6:303

Now, the following diff makes the test test_from_kerchunk pass for wsi_file = CMU-1-Small-Region.svs

kaiko-ai@a87ec6c#diff-5ec7cd13afaf59c6b44cf66f33198d7dea84ec95ad4a5ae29cff9f9a00ed33c7R81

The version of packages I use:

pip list
Package           Version               Editable project location
----------------- --------------------- --------------------------------------
asciitree         0.3.3
attrs             22.2.0
black             22.12.0
cfgv              3.3.1
click             8.1.3
coverage          7.0.2
distlib           0.3.6
entrypoints       0.4
exceptiongroup    1.1.0
fasteners         0.18
filelock          3.9.0
fsspec            2022.11.0
identify          2.5.11
imagecodecs       2022.12.24
iniconfig         1.1.1
mypy              0.991
mypy-extensions   0.4.3
nodeenv           1.7.0
numcodecs         0.11.0
numpy             1.24.1
packaging         22.0
pathspec          0.10.3
Pillow            9.4.0
pip               22.0.4
platformdirs      2.6.2
pluggy            1.0.0
pre-commit        2.21.0
pytest            7.2.0
pytest-cov        4.0.0
PyYAML            6.0
setuptools        58.1.0
tifffile          2022.10.10
tiffslide         1.10.1.post6+g785e375 /home/gdforj/pro/tweag/kaiko/tiffslide
tomli             2.0.1
typing_extensions 4.4.0
virtualenv        20.17.1
zarr              2.13.3

@GuillaumeDesforges
Copy link

Should test_from_kerchunk pass for TestImageType.GENERATE_PYRAMIDAL_IMG?

The expression grp[str(level)] at this line

https://github.com/bayer-science-for-a-better-life/tiffslide/blob/e649cb50a84d1a67a488a29b82a470bdfee9de25/tiffslide/_zarr.py#L194

is obviously not an array in those cases, so the selection with a tuple of slices fails.

Should we disable this test for those inputs?

@ap--
Copy link
Collaborator Author

ap-- commented Jan 4, 2023

Hi @GuillaumeDesforges

do you remember the rational for adding this s character to the groupname

I introduced the "s" back then to make clear that the zarr groups can be arbitrary strings. The case in which the series integer indices map directly should basically be the exception.

I pushed a fix to make the tests pass and ensure that the mirax code is still working.
It's a bit convoluted since I needed to add the composition information not only to the zarr group attrs but also to the properties, because some other code relies on that.

This will all need some cleanup once the basic functionality is sufficiently implemented.

I would suggest to continue by adding a test_mirax.py file with tests for locally stored mirax files. They should all be available publicly, so that everyone can reproduce behaviour.

Let me know if there's further questions or if anything is unclear!

Cheers,
Andreas 😃

@GuillaumeDesforges
Copy link

I think I have a basic understanding of the MIRAX format thanks to your pointers.

However I'm not quite sure I understand the way you compute the key of the refs for the zarr file.
More precisely this line:
https://github.com/bayer-science-for-a-better-life/tiffslide/blob/5f267c441fc0c39550d0cd4192751c5f178e6fc5/tiffslide/_mirax.py#L448

if I understand correctly:

  • c_idx is the index of the "capture" (which is an actual captured image, the one that is split onto idiv tiles), computed based on the coordinate of the capture in the grid of captures
  • level is the level in the pyramid
  • dx and dy are the coordinates of the tile inside the capture (so 0.0 means top left of the capture)

Now I don't understand why the key wouldn't be something like f"s{level}/{y}.{x}.0", just like it seems to be for regular TIFF-like files. How can then a TiffSlide navigate in the zarr file when it's built the way you have in this PR?

@ap--
Copy link
Collaborator Author

ap-- commented Jan 4, 2023

So one reason is that zarr currently does not support overlapping chunks and requires stores to have uniform tiling (chunking). Mirax files have overlap between captures unless they are postprocessed with some mirax software.

To support overlaps, and arbitrarily offset chunks, I handle composition separately. See: https://github.com/bayer-science-for-a-better-life/tiffslide/blob/5f267c441fc0c39550d0cd4192751c5f178e6fc5/tiffslide/_zarr.py#L199-L240 Basically in this (currently naive) implementation, if composition information is available, tiffslide checks for overlaps with ALL of the captures. This works fine for ndpi files for example, where you usually have a few separate tiff-series. Mirax has a lot of them though, so this will have to be speed up by an R-Tree at some point.

@GuillaumeDesforges
Copy link

Okay, I didn't know about the composition. This makes a lot of sense.

I've pushed a PR #54 that adds testing with a MIRAX file from the openslide test file set, like the one used for SVS.

Do you think the PR could be merged and a new version published after that, or do you foresee some other changes needed?

@ap--
Copy link
Collaborator Author

ap-- commented Apr 2, 2023

Hey @GuillaumeDesforges,

I just wanted to ask, if you had made progress on full Mirax support for tiffslide, and/or if you are still working on it.

All the best,
Andreas 😃

@GuillaumeDesforges
Copy link

Hi @ap-- ! In the end I did not use tiffslide to process MIRAX images.

Unfortunately I am not working on medical images anymore for the time being, so I won't be able to continue this effort.

All the best, thank you so much for your help, support and kindness!
I hope someone else will take this up and help you move it forward. 💯

@ap-- ap-- mentioned this pull request Nov 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for Mirax
2 participants