Profiler unable to record CUDA activities in Tensorboard. #3028

Open
GKG1312 opened this issue Sep 4, 2024 · 2 comments
GKG1312 commented Sep 4, 2024

I am trying to run the PyTorch Profiler with TensorBoard tutorial from pytorch/tutorials on Windows 11 in a conda environment with the following versions:

python=3.12.4
pytorch=2.4.0
torch-tb-profiler=0.4.3
cuda-version=12.5

The code executes with only a single warning message: [W904 11:50:36.000000000 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event. However, TensorBoard shows only CPU as the device and the DataLoader time as 0.
[Screenshot: TensorBoard Overview, 2024-09-04 120822]

I am not able to figure out whether this is a bug or a version mismatch.
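As a quick sanity check (not part of the tutorial, just a hedged sketch to confirm the installed build actually sees the GPU before blaming the profiler), something like this should print a CUDA version and True:

import torch

# Confirm the installed PyTorch binary was built with CUDA support
# and can see the GPU; a CPU-only build would print None / False here.
print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version the binary was built against
print(torch.cuda.is_available())  # should be True if the GPU is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))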
Simplified code to replicate the error:

import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T

######################################################################
# Then prepare the input data. For this tutorial, we use the CIFAR10 dataset.
# Transform it to the desired format and use ``DataLoader`` to load each batch.

transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

######################################################################
# Next, create Resnet model, loss function, and optimizer objects.
# To run on GPU, move model and loss to GPU device.

device = torch.device("cuda:0")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()


######################################################################
# Define the training step for each batch of input data.

def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


with torch.profiler.profile(
        activities=[
                torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA,
            ],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        prof.step()  # Need to call this at each step to notify profiler of steps' boundary.
        if step >= 1 + 1 + 3:
            break
        train(batch_data)
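To check whether the profiler itself records any CUDA kernels independently of the TensorBoard plugin, a minimal follow-up (a sketch only, with no schedule and no trace handler, reusing the model and loader defined above) could be:

# Profile a single batch without a schedule or trace handler; if CUDA
# activities are captured, the table should show non-zero CUDA time.
with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
) as check_prof:
    train(next(iter(train_loader)))

print(check_prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))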

cc @aaronenyeshi @chaekit @jcarreiro

atalman (Contributor) commented Sep 4, 2024

Hi @GKG1312, for releases 2.4.0-2.4.1 please use one of the following CUDA versions: 11.8, 12.1, or 12.4.

svekars added the CUDA (Issues relating to CUDA) label Sep 4, 2024
GKG1312 (Author) commented Sep 5, 2024

> hi @GKG1312 For release 2.4.0-2.4.1 please use one of the following CUDA versions: 11.8, 12.1, 12.4

Hi @atalman, I just tried with the versions listed below, but it did not solve anything; the output is the same as before. I am running this on an NVIDIA GeForce RTX 4070 Laptop GPU (8 GB).

python=3.12.4
pytorch=2.4.0
pytorch_cuda=12.1
torch-tb-profiler=0.4.3
cuda-version=12.1
tensorboard=2.17.1

One thing I noticed now is that in the Memory view I can see GPU0 as a device, but the Overview section does not show it.
[Screenshot: TensorBoard Memory view, 2024-09-05 105642]
Sorry if I am being naive here.
I tried running the same code in Google Colab, and there I can see the GPU summary in the Overview section.
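One more thing that might help narrow it down, as a rough sketch only: assuming tensorboard_trace_handler wrote a *.pt.trace.json file under ./log/resnet18 and that GPU kernel events carry a "kernel"-like category in the Chrome trace, counting kernel events in the raw trace would tell whether the GPU data is missing entirely or whether only the Overview is not rendering it:

import glob
import json

# Assumption: the file-name pattern and event category follow the usual
# Chrome/Kineto trace layout; adjust the paths if the files look different.
for path in glob.glob("./log/resnet18/*.pt.trace.json"):
    with open(path) as f:
        events = json.load(f).get("traceEvents", [])
    kernels = [e for e in events if "kernel" in str(e.get("cat", "")).lower()]
    print(path, "->", len(kernels), "kernel events")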
