
PyTorch with num_workers=0 and MultiProcDataset has low computation time #1435

Closed · albertz opened this issue Oct 17, 2023 · 1 comment
albertz commented Oct 17, 2023

Now `torch_dataloader_opts = dict(num_workers=1)` provides another way for PyTorch to use multiprocessing for the dataset (#1383). This leads to 98% computation time, i.e. the data loading is not a bottleneck anymore.
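
For reference, a minimal sketch of how the two setups could look in a RETURNN config. The inner dataset and all option values here are placeholders for illustration, not a recommended configuration:

```python
# Hypothetical RETURNN config excerpt; HDFDataset files and the numeric
# option values are made up for illustration.

# Option A: let PyTorch's DataLoader do the multiprocessing (#1383).
# The whole Torch data pipeline then runs in the worker subprocess.
torch_dataloader_opts = dict(num_workers=1)
train = {"class": "HDFDataset", "files": ["train.hdf"]}

# Option B: wrap the dataset in MultiProcDataset (DataLoader stays at
# num_workers=0). Only the wrapped dataset runs in subprocesses; the
# Torch data pipeline (batching etc.) stays in the main process.
# train = {
#     "class": "MultiProcDataset",
#     "dataset": {"class": "HDFDataset", "files": ["train.hdf"]},
#     "num_workers": 2,
#     "buffer_size": 10,
# }
```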

This was not the case with `num_workers=0` and `MultiProcDataset`, which gave me only about 75% computation time. So it seems there is still some overhead in `MultiProcDataset`? Or maybe the overhead is not in `MultiProcDataset` itself but in our Torch data pipeline (`ReturnnDatasetIterDataPipe`, ..., `BatchingIterDataPipe`): with `num_workers>0`, I think even that part runs in the subprocess, while with `MultiProcDataset`, it still happens in the main process.
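
To illustrate the suspected difference, here is a small generic PyTorch sketch (not RETURNN code; `PidReportingDataset` is a made-up toy dataset): with `num_workers=1`, the dataset iteration, including anything chained on top of it, runs in the DataLoader worker subprocess, while with `num_workers=0` it runs in the main process.

```python
import os
import torch
from torch.utils.data import DataLoader, IterableDataset


class PidReportingDataset(IterableDataset):
    """Yields the PID of the process that executes the iteration."""

    def __iter__(self):
        for i in range(3):
            yield {"idx": i, "pid": os.getpid()}


def main():
    print("main proc pid:", os.getpid())
    for num_workers in (0, 1):
        # batch_size=None disables automatic batching, so items pass through as-is.
        loader = DataLoader(PidReportingDataset(), batch_size=None, num_workers=num_workers)
        pids = {item["pid"] for item in loader}
        print(f"num_workers={num_workers}: dataset iterated in pid(s) {pids}")


if __name__ == "__main__":
    main()
```

With `num_workers=0`, the reported PID matches the main process; with `num_workers=1`, it is a different (worker) PID, i.e. everything downstream of the raw data source is moved out of the main process as well.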

I'm not sure if we can do much about it, but I just wanted to document this.


albertz commented Oct 17, 2023

I don't plan to do anything about this now, and as said, I'm not sure we can even do much about it. Using `num_workers=1` is a good solution anyway, so I'm closing this now.
