
update is_comm_kernel check to work with newer nccl versions (and thus generate expected comm/compute overlap numbers) #109

Open · wants to merge 2 commits into main

Conversation

lessw2020

What does this PR do?

This PR fixes a common error hit with newer versions of NCCL.
Most traces currently break when computing the comm/compute overlap numbers because the check does not account for the kernel name being prefaced with 'ncclDev' instead of only 'ncclKernel'.

The error is:

RuntimeWarning: invalid value encountered in scalar divide
return (shifted_overlap["time_y"] - shifted_overlap["time_x"].sum())

The fix is simple: check for both name prefixes in the HTA utility function is_comm_kernel to get the proper comm/compute overlap numbers.
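
For illustration, a minimal sketch of the shape of the fix, assuming the utility receives the kernel name as a string (the actual signature and surrounding code in HTA may differ):

```python
def is_comm_kernel(name: str) -> bool:
    # Older NCCL versions name kernels "ncclKernel_AllReduce_...", while
    # newer versions use "ncclDevKernel_AllReduce_...", so match both prefixes.
    return name.startswith("ncclKernel") or name.startswith("ncclDev")
```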

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you update the changelog?
    • N/A

@facebook-github-bot added the CLA Signed label Mar 5, 2024
@anupambhatnagar
Contributor

@lessw2020 Can you please add a test case for this? Also, please resolve the pre-commit errors.

@anupambhatnagar anupambhatnagar self-requested a review March 5, 2024 18:55
@fengxizhou fengxizhou self-assigned this Mar 8, 2024
@fengxizhou fengxizhou requested review from fengxizhou and removed request for anupambhatnagar March 23, 2024 18:39
@amazloumi commented Jul 22, 2024

I came across this issue while working on distributed torch profiling: get_gpu_kernel_breakdown() does not show any communication breakdown. Do you have any plan to merge the fix from this PR soon?

I have two different types of NCCL-related items in my torch trace file (collected using torch.profiler.tensorboard_trace_handler), and neither of them is captured under the COMM category in TraceAnalysis.get_gpu_kernel_breakdown(). The items are as follows:

  1. The items with "cat": "user_annotation" and "name" starting with "nccl:", such as nccl:_all_gather_base, nccl:_reduce_scatter_base, nccl:reduce, and nccl:all_reduce
  2. The items with "cat": "kernel" and "name" starting with "ncclDevKernel_", such as ncclDevKernel_AllReduce_Sum_f32_RING_LL, ncclDevKernel_Reduce_Sum_f32_RING_LL, ncclDevKernel_ReduceScatter_Sum_f32_RING_LL, and ncclDevKernel_AllGather_RING_LL

I am not sure whether the first type also needs to be counted as kernel communication cost; checking TensorBoard, I believe the torch.profiler plug-in there does count the items in type 1.
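
For reference, a minimal sketch of how I am querying the breakdown, assuming a trace directory written by torch.profiler.tensorboard_trace_handler (the path below is a placeholder, and the two-DataFrame return shape is taken from the HTA docs):

```python
from hta.trace_analysis import TraceAnalysis

# Placeholder path; point this at the directory produced by
# torch.profiler.tensorboard_trace_handler.
analyzer = TraceAnalysis(trace_dir="./traces")

# Returns per-kernel-type and per-kernel DataFrames; with the newer
# "ncclDevKernel_" names, the COMM category comes back empty before this fix.
kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown()
print(kernel_type_df)
```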
