
update is_comm_kernel check to work with newer nccl versions (and thus generate expected comm/compute overlap numbers) #109

Open · wants to merge 2 commits into main

Conversation

lessw2020

What does this PR do?

This PR fixes a common error hit with newer versions of NCCL.
Most traces currently break when computing the comm/compute overlap numbers because the check does not account for the kernel name being prefaced with 'ncclDev' instead of only 'ncclKernel'.

The error is:

RuntimeWarning: invalid value encountered in scalar divide
return (shifted_overlap["time_y"] - shifted_overlap["time_x"].sum())

The fix is simple: check for both name prefixes in the HTA utility function is_comm_kernel to get the proper comm/compute overlap numbers.
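
For illustration, a minimal sketch of the shape of the fix, assuming the utility receives the kernel name as a string (the actual signature and surrounding code in HTA may differ):

```python
def is_comm_kernel(name: str) -> bool:
    # Older NCCL versions name kernels "ncclKernel_AllReduce_...", while
    # newer versions use "ncclDevKernel_AllReduce_...", so match both prefixes.
    return name.startswith("ncclKernel") or name.startswith("ncclDev")
```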

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you update the changelog?
    • N/A

@facebook-github-bot added the CLA Signed label Mar 5, 2024
@anupambhatnagar
Contributor

@lessw2020 Can you please add a test case for this? Also, please resolve the pre-commit errors.

@anupambhatnagar anupambhatnagar self-requested a review March 5, 2024 18:55
@fengxizhou fengxizhou self-assigned this Mar 8, 2024
@fengxizhou fengxizhou requested review from fengxizhou and removed request for anupambhatnagar March 23, 2024 18:39
@amazloumi commented Jul 22, 2024

I came across this issue while working on distributed torch profiling: get_gpu_kernel_breakdown() does not show any communication breakdown. Do you have any plan to merge the fix from this PR soon?

I have two different types of NCCL-related items in my torch trace file (collected using torch.profiler.tensorboard_trace_handler), and neither of them is captured under the COMM category in TraceAnalysis.get_gpu_kernel_breakdown(). The items are as follows:

  1. The items with "cat": "user_annotation" and "name" starting with "nccl:", such as nccl:_all_gather_base, nccl:_reduce_scatter_base, nccl:reduce, and nccl:all_reduce
  2. The items with "cat": "kernel" and "name" starting with "ncclDevKernel_", such as ncclDevKernel_AllReduce_Sum_f32_RING_LL, ncclDevKernel_Reduce_Sum_f32_RING_LL, ncclDevKernel_ReduceScatter_Sum_f32_RING_LL, and ncclDevKernel_AllGather_RING_LL

I am not sure whether the first type also needs to be counted as kernel communication cost; checking TensorBoard, I believe the torch.profiler plug-in there does count the items in type 1.
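
For reference, a minimal sketch of how I am querying the breakdown, assuming a trace directory written by torch.profiler.tensorboard_trace_handler (the path below is a placeholder, and the two-DataFrame return shape is taken from the HTA docs):

```python
from hta.trace_analysis import TraceAnalysis

# Placeholder path; point this at the directory produced by
# torch.profiler.tensorboard_trace_handler.
analyzer = TraceAnalysis(trace_dir="./traces")

# Returns per-kernel-type and per-kernel DataFrames; with the newer
# "ncclDevKernel_" names, the COMM category comes back empty before this fix.
kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown()
print(kernel_type_df)
```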
