torch_xla.distributed.parallel_loader doesn't shard data #7904

Closed
davidaknowles opened this issue Aug 23, 2024 · 5 comments · May be fixed by #7914

Comments

@davidaknowles

❓ Questions and Help

Maybe this is a misunderstanding on my part, but I assumed part of MpDeviceLoader's job was to split/shard data across devices. However, the test below shows it doesn't do this: all 4 devices on my v4 receive all 12 datapoints. What am I missing here? Thanks.

import torch
import torch_xla.core.xla_model as xm
from torch_xla import runtime as xr
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def test_parallel_loader(rank):

    data = torch.arange(12).reshape(-1, 1)
    dataset = torch.utils.data.TensorDataset(data)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=2)
    
    parallel_loader = pl.MpDeviceLoader(dataloader, xm.xla_device())
    results = sum([batch[0].tolist() for batch in parallel_loader], [])

    print(f"Device {rank} received data: {results}")
    
    expected_data_size = len(data) // xr.world_size()
    print(f"Device {rank} received {len(results)} datapoints, expected {expected_data_size}")

if __name__ == "__main__":
    xmp.spawn(test_parallel_loader, args=()) 
@JackCaoG
Collaborator

Let me take a look

@bhavya01
Collaborator

In this case, it is working as expected. The MpDeviceLoader in each spawned process asynchronously fetches batches from the torch DataLoader and puts them on the TPU. If you use SPMD, spawn just one process, and specify input_sharding so that the input is sharded along the batch dimension, then you should see each batch spread out over multiple devices.
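
For reference, a minimal SPMD sketch of that setup might look like the following; the mesh shape, axis name, and batch size here are illustrative assumptions, not values taken from this issue:

import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.spmd as xs
from torch_xla import runtime as xr

xr.use_spmd()  # one process drives all local devices under SPMD

# Illustrative 1-D "data" mesh over all addressable devices.
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ('data',))

data = torch.arange(12).reshape(-1, 1)
dataset = torch.utils.data.TensorDataset(data)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)

# Shard each incoming batch along dim 0 across the 'data' mesh axis.
sharding = xs.ShardingSpec(mesh, ('data', None))
loader = pl.MpDeviceLoader(dataloader, xm.xla_device(), input_sharding=sharding)

for (batch,) in loader:
    pass  # each global batch is split across the devices along dim 0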

@JackCaoG
Collaborator

To add on to what @bhavya01 said: ParallelLoader itself does not handle distributing the correct data to each worker; the dataloader it wraps needs to do that.

In our MP example it is

train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset,
    num_replicas=xr.world_size(),
    rank=xr.global_ordinal(),
    shuffle=True)

doing this work. In the GSPMD case, it is the sharding we pass to the ParallelLoader that does the distribution. In my example I only used a fake data loader, so this is not an issue; I can fix that to make it clearer.
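
Applied to the repro above, a per-process sketch with such a sampler might look like this; it is a sketch only, _mp_fn is a placeholder name, and batch_size=1 is arbitrary:

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp
from torch_xla import runtime as xr

def _mp_fn(rank):
    data = torch.arange(12).reshape(-1, 1)
    dataset = torch.utils.data.TensorDataset(data)

    # The sampler is what restricts each process to its own shard of the data.
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset,
        num_replicas=xr.world_size(),
        rank=xr.global_ordinal(),
        shuffle=False)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=1, sampler=sampler)

    # MpDeviceLoader only overlaps host-to-device transfer with compute;
    # it does not shard the data itself.
    loader = pl.MpDeviceLoader(dataloader, xm.xla_device())
    results = [batch[0].tolist() for batch in loader]
    print(f"Device {rank} received {len(results)} batches: {results}")

if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=())

With 4 devices, each process should now see 3 of the 12 datapoints instead of all of them.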

@davidaknowles
Author

I see, this is just a documentation issue then, e.g. https://pytorch.org/xla/release/r2.4/index.html#running-on-multiple-xla-devices-with-multi-processing gave me the impression that wrapping my existing (single-process) dataloader was sufficient. And yes, it would probably be helpful to have your toy example use DistributedSampler, so that if people just put their own real data in there it will do something sensible.

@JackCaoG
Collaborator

yea let me update
