
PyTorch with num_workers=0 and MultiProcDataset has low computation time #1435

Closed · albertz opened this issue Oct 17, 2023 · 1 comment
albertz commented Oct 17, 2023

Now `torch_dataloader_opts = dict(num_workers=1)` provides another way for PyTorch to use multiprocessing for the dataset (#1383). This leads to 98% computation time, i.e. the data loading is not a bottleneck anymore.
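
For reference, a minimal sketch of how the two setups could look in a RETURNN config. The inner dataset and all option values here are placeholders for illustration, not a recommended configuration:

```python
# Hypothetical RETURNN config excerpt; HDFDataset files and the numeric
# option values are made up for illustration.

# Option A: let PyTorch's DataLoader do the multiprocessing (#1383).
# The whole Torch data pipeline then runs in the worker subprocess.
torch_dataloader_opts = dict(num_workers=1)
train = {"class": "HDFDataset", "files": ["train.hdf"]}

# Option B: wrap the dataset in MultiProcDataset (DataLoader stays at
# num_workers=0). Only the wrapped dataset runs in subprocesses; the
# Torch data pipeline (batching etc.) stays in the main process.
# train = {
#     "class": "MultiProcDataset",
#     "dataset": {"class": "HDFDataset", "files": ["train.hdf"]},
#     "num_workers": 2,
#     "buffer_size": 10,
# }
```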

This was not the case with `num_workers=0` and `MultiProcDataset`, which gave me only about 75% computation time. So it seems there is still some overhead in `MultiProcDataset`? Or maybe the overhead is not in `MultiProcDataset` itself but in our Torch data pipeline (`ReturnnDatasetIterDataPipe`, ..., `BatchingIterDataPipe`): with `num_workers>0`, I think even that part runs in the subprocess, while with `MultiProcDataset`, it still happens in the main process.
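
To illustrate the suspected difference, here is a small generic PyTorch sketch (not RETURNN code; `PidReportingDataset` is a made-up toy dataset): with `num_workers=1`, the dataset iteration, including anything chained on top of it, runs in the DataLoader worker subprocess, while with `num_workers=0` it runs in the main process.

```python
import os
import torch
from torch.utils.data import DataLoader, IterableDataset


class PidReportingDataset(IterableDataset):
    """Yields the PID of the process that executes the iteration."""

    def __iter__(self):
        for i in range(3):
            yield {"idx": i, "pid": os.getpid()}


def main():
    print("main proc pid:", os.getpid())
    for num_workers in (0, 1):
        # batch_size=None disables automatic batching, so items pass through as-is.
        loader = DataLoader(PidReportingDataset(), batch_size=None, num_workers=num_workers)
        pids = {item["pid"] for item in loader}
        print(f"num_workers={num_workers}: dataset iterated in pid(s) {pids}")


if __name__ == "__main__":
    main()
```

With `num_workers=0`, the reported PID matches the main process; with `num_workers=1`, it is a different (worker) PID, i.e. everything downstream of the raw data source is moved out of the main process as well.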

I'm not sure if we can do much about it, but I just wanted to document this.


albertz commented Oct 17, 2023

I don't plan to do anything about this now, and as said, I'm not sure we can even do much about it. Using `num_workers=1` is a good solution anyway, so I'm closing this now.
