
Update context attribute query #1020

Merged
1 commit merged into openucx:master on Oct 3, 2024

Conversation

nsarka
Collaborator

@nsarka nsarka commented Sep 10, 2024

From the UCC Backlog:

Check all TL and CL lib and context attributes

The attribute query function should follow this scheme:
If attr.mask is not zero, check each attribute defined in the mask
Set the attribute according to TL/CL capabilities
Set attr.flags always

This logic is generally broken in UCC. For example:

ucc_status_t
ucc_tl_self_get_context_attr(const ucc_base_context_t *context, /* NOLINT */
                             ucc_base_ctx_attr_t      *attr)
{
    return UCC_OK;
}

attr.topo_required is not set (returns something random)
attr.global_work_buffer_size is not set
etc.

This PR updates UCC so that every context attribute query reads the mask for each field and then writes a value, usually 0; several of these writes were previously missing.
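
As a rough sketch of the pattern (not the exact code merged in this PR; the constant and field names are assumptions based on the public UCC headers and the fields mentioned above):

/* Sketch only: a hypothetical TL context attribute query following the
 * scheme above. Names are illustrative, not the merged diff. */
ucc_status_t
ucc_tl_example_get_context_attr(const ucc_base_context_t *context, /* NOLINT */
                                ucc_base_ctx_attr_t      *attr)
{
    if (attr->attr.mask & UCC_CONTEXT_ATTR_FIELD_CTX_ADDR_LEN) {
        /* this TL exchanges no per-context address */
        attr->attr.ctx_addr_len = 0;
    }
    if (attr->attr.mask & UCC_CONTEXT_ATTR_FIELD_WORK_BUFFER_SIZE) {
        /* this TL needs no global work buffer */
        attr->attr.global_work_buffer_size = 0;
    }
    /* fields outside the public mask, such as topo_required, are written
     * unconditionally */
    attr->topo_required = 0;
    return UCC_OK;
}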

The backlog task mentions updating lib attributes as well, but it seems to me that both CL and TL lib attributes are already filled correctly. Two unused fields are not filled (reduction types and sync type), but if the mask has those bits set, UCC errors out before it reaches the CL or TL (see ucc_lib_get_attr).
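
For illustration, that guard might look roughly like the following; this is a sketch under the assumption that ucc_lib_get_attr rejects unsupported mask bits, with constant names taken from the public UCC lib attribute fields:

/* Sketch only: reject mask bits for fields that no CL/TL currently
 * reports, so the per-component lib attribute queries never see them. */
static ucc_status_t check_lib_attr_mask(const ucc_lib_attr_t *attr)
{
    if (attr->mask & (UCC_LIB_ATTR_FIELD_REDUCTION_TYPES |
                      UCC_LIB_ATTR_FIELD_SYNC_TYPE)) {
        return UCC_ERR_NOT_SUPPORTED;
    }
    return UCC_OK;
}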

@nsarka nsarka self-assigned this Sep 10, 2024
@nsarka nsarka requested a review from janjust September 10, 2024 20:14
@nsarka nsarka changed the title from "CODESTYLE: Update ctx attrs" to "Update context attribute query" on Sep 10, 2024
@janjust
Collaborator

janjust commented Sep 10, 2024

@nsarka IMO, an easier-to-read approach is to simply define a zero_attr() function that zeroes out the attributes before going over the mask bits and setting everything as needed, rather than going through all the bitmask fields and individually zeroing everything out.
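
A minimal sketch of such a helper, assuming the ucc_base_ctx_attr_t layout referenced above (names are illustrative, not the merged code); each TL/CL query would call it first and then fill in only the masked fields it actually supports:

/* Sketch only: zero every queryable context attribute up front; the
 * per-TL/CL query then overwrites only what its capabilities support. */
static void ucc_base_ctx_attr_clear(ucc_base_ctx_attr_t *attr)
{
    attr->attr.ctx_addr_len            = 0;
    attr->attr.global_work_buffer_size = 0;
    attr->topo_required                = 0;
}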

@nsarka
Collaborator Author

nsarka commented Sep 11, 2024

@nsarka IMO, an easier-to-read approach is to simply define a zero_attr() function that zeroes out the attributes before going over the mask bits and setting everything as needed, rather than going through all the bitmask fields and individually zeroing everything out.

Updated

@janjust
Collaborator

janjust commented Sep 16, 2024

@nsarka please squash commits and rebase

Collaborator

@janjust janjust left a comment


Looks good to me, but I do have to caution that the user must know, either through documentation or otherwise, that the attr is in fact cleared and only the masked values are set.

Collaborator

@samnordmann samnordmann left a comment


LGTM! Please update copyrights

@nsarka
Collaborator Author

nsarka commented Sep 18, 2024

Copyright updated and commits squashed!

@Sergei-Lebedev
Contributor

@nsarka, not related to this PR, but it seems like there is an issue with asymmetric memory when used together with PyTorch:

03:09:14  + python /opt/nvidia/workloads/dlrm/dlrm_s_pytorch.py --mini-batch-size=2048 --test-mini-batch-size=16384 --test-num-workers=0 --num-batches=10 --data-generation=random --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --arch-sparse-feature-size=64 --arch-embedding-size=1000-1000-1000-1000-1000-1000-1000-1000 --num-indices-per-lookup=100 --num-indices-per-lookup-fixed=0 --arch-interaction-op=dot --numpy-rand-seed=727 --print-freq=1 --loss-function=mse --round-targets=False --learning-rate=0.01 --print-time --dist-backend=ucc --use-gpu
03:09:15  Unable to import mlperf_logging,  No module named 'mlperf_logging'
03:09:15  Unable to import mlperf_logging,  No module named 'mlperf_logging'
03:09:16  Running on 2 ranks using ucc backend
03:09:17  world size: 2, current rank: 0, local rank: 0
03:09:17  Using 1 GPU(s)...
03:09:17  world size: 2, current rank: 1, local rank: 0
03:09:18  time/loss/accuracy (if enabled):
03:09:18  [1726708157.732010] [swx-clx01:629  :0]         mc_cuda.c:248  UCC  ERROR cudaMemcpyAsync(dst, src, len, cudaMemcpyDefault, resources->stream)() failed: 1(invalid argument)
03:09:18  [1726708157.732034] [swx-clx01:629  :0]         mc_cuda.c:251  cuda mc ERROR failed to launch cudaMemcpyAsync, dst (nil), src 0x7f06fc084f50, len 94294989353336
03:09:18  [1726708157.732044] [swx-clx01:629  :0]  ucc_coll_utils.c:265  UCC  ERROR error copying back to old asymmetric buffer: Invalid parameter
03:09:18  [1726708157.732067] [swx-clx01:629  :0]    ucc_schedule.h:195  UCC  ERROR failure copying out asymmetric buffer: Invalid parameter
03:09:18  [swx-clx02:653  :0:653] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

@nsarka
Collaborator Author

nsarka commented Sep 27, 2024

not related to this PR, but it seems like there is an issue with asymmetric memory when used together with PyTorch

I opened a PR to fix it here: #1026

@Sergei-Lebedev Sergei-Lebedev merged commit 16586e1 into openucx:master Oct 3, 2024
10 of 11 checks passed