
Tiffslide errors when used in pytorch dataloader with num_workers>1 #18

Open
ap-- opened this issue Feb 22, 2022 · 4 comments
Labels
bug 🐛 Something isn't working workaround-exists Issue contains a working fix

Comments

ap-- (Collaborator) commented Feb 22, 2022

Unfortunately, tiffslide fails again in parallel mode, this time when used with pytorch dataloaders. This is a very common technique in WSI processing with pytorch; the only difference from the original bug report is that it uses process-based parallelisation rather than threads.

The symptoms are exactly the same:

  • with tiffslide and a single dataloader worker (num_workers=1), everything works fine
  • with tiffslide and more dataloader workers (e.g. num_workers=4), the processing fails
  • with openslide, everything works fine regardless of the num_workers value

Tested with tiffslide version 1.0.0 and tifffile version 2022.2.9. Please see the attached minimal example.

tiffslide-bug2.zip

Originally posted by @lukasii in #14 (comment)

ap-- (Collaborator, Author) commented Feb 22, 2022

Thanks for the report @lukasii !

I created a new issue, because this one is multiprocessing-related.

Please try moving the slide instantiation into a method that you pass as worker_init_fn to the DataLoader, and report back whether it solves your problem.

Cheers,
Andreas 😃

lukasii commented Feb 22, 2022

Thanks Andreas, that did the trick! Would you know why openslide does not need this kind of special treatment?


For the record, in the dataset class I added:

def worker_init(self, *args):
    self.slide = tiffslide.TiffSlide(self.wsi_file)

and then created the dataloader as:

dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=32, shuffle=False, num_workers=4, pin_memory=True,
    worker_init_fn=dataset.worker_init
)

@ap-- ap-- added the workaround-exists Issue contains a working fix label Feb 22, 2022
ap-- (Collaborator, Author) commented Feb 22, 2022

Hi @lukasii

Great! I'm happy that it works for you now. ❤️

I'd have to investigate why exactly it fails in the example you provided, but it may be that pytorch uses fork rather than spawn to create its worker processes, and fsspec does not play nicely with fork under some circumstances. Or it may be something else related to multiprocessing and tifffile, or to the fact that tiffslide does not try to lock access to the zarr array.
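The fork-inheritance problem described above can be illustrated with a stdlib-only sketch (no tiffslide or pytorch involved; `read_parallel`, `_init_worker`, and the 4-byte chunks are made up for this example): each worker process opens its own file handle in an initializer, just as the worker_init_fn workaround re-opens the slide per worker, so no handle inherited from the parent via fork is ever shared.

```python
import multiprocessing as mp

# Per-worker global, set by the pool initializer. This mirrors what
# DataLoader's worker_init_fn does: each worker process opens its own
# file handle instead of inheriting one from the parent via fork.
_handle = None

def _init_worker(path):
    global _handle
    _handle = open(path, "rb")

def _read_chunk(offset):
    # Each worker seeks and reads on its private handle, so there is no
    # shared file position (or shared connection state) to race on.
    _handle.seek(offset)
    return _handle.read(4)

def read_parallel(path, offsets=(0, 4), workers=2):
    with mp.Pool(workers, initializer=_init_worker, initargs=(path,)) as pool:
        return pool.map(_read_chunk, offsets)

if __name__ == "__main__":
    import os, tempfile
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"abcdefgh")
        path = f.name
    print(read_parallel(path))  # [b'abcd', b'efgh']
    os.unlink(path)
```

Without the initializer, a handle opened in the parent before forking would be shared by all workers, and concurrent seek/read calls could interleave and return wrong bytes.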

I'll keep this issue open until I've worked out the specific details, and then either make it work or crash with a verbose error message suggesting the fix above.

Have a great day, and happy training 🎉
Andreas

lukasii commented Feb 22, 2022

Thanks for explaining. In case more work is planned on this issue in the future, I am attaching updated files. One is my original bug report file, which was incorrectly using a global variable ("self" was missing). Not a big deal, since self.slide was just a reference to that global slide object anyway, so the results are exactly the same. The other file in the archive is the full workaround code.

Cheers!
tiffslide-bug2-updated.zip
