
Update context attribute query #1020

Merged
1 commit merged into openucx:master on Oct 3, 2024

Conversation

nsarka
Collaborator

@nsarka nsarka commented Sep 10, 2024

From the UCC Backlog:

Check all TL and CL lib and context attributes

The attribute query function should follow this scheme:
If attr.mask is not zero, check each attribute defined in the mask
Set the attribute according to TL/CL capabilities
Set attr.flags always

This logic is generally broken in UCC. For example:

ucc_status_t
ucc_tl_self_get_context_attr(const ucc_base_context_t *context, /* NOLINT */
                             ucc_base_ctx_attr_t      *attr)
{
    return UCC_OK;
}

attr.topo_required is not set (returns something random)
attr.global_work_buffer_size is not set
etc.

This PR updates UCC so that every context attribute query reads the mask for each field and then writes a value, usually 0; several of these writes were previously missing.
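
As a rough sketch of the pattern (not the exact code merged in this PR; the constant and field names are assumptions based on the public UCC headers and the fields mentioned above):

/* Sketch only: a hypothetical TL context attribute query following the
 * scheme above. Names are illustrative, not the merged diff. */
ucc_status_t
ucc_tl_example_get_context_attr(const ucc_base_context_t *context, /* NOLINT */
                                ucc_base_ctx_attr_t      *attr)
{
    if (attr->attr.mask & UCC_CONTEXT_ATTR_FIELD_CTX_ADDR_LEN) {
        /* this TL exchanges no per-context address */
        attr->attr.ctx_addr_len = 0;
    }
    if (attr->attr.mask & UCC_CONTEXT_ATTR_FIELD_WORK_BUFFER_SIZE) {
        /* this TL needs no global work buffer */
        attr->attr.global_work_buffer_size = 0;
    }
    /* fields outside the public mask, such as topo_required, are written
     * unconditionally */
    attr->topo_required = 0;
    return UCC_OK;
}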

The backlog task mentions updating lib attributes as well, but it seems to me that both CL and TL lib attributes are already filled correctly. Two unused fields are not filled (reduction types and sync type), but if the mask has those bits set, UCC errors out before it reaches the CL or TL (see ucc_lib_get_attr).
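
For illustration, that guard might look roughly like the following; this is a sketch under the assumption that ucc_lib_get_attr rejects unsupported mask bits, with constant names taken from the public UCC lib attribute fields:

/* Sketch only: reject mask bits for fields that no CL/TL currently
 * reports, so the per-component lib attribute queries never see them. */
static ucc_status_t check_lib_attr_mask(const ucc_lib_attr_t *attr)
{
    if (attr->mask & (UCC_LIB_ATTR_FIELD_REDUCTION_TYPES |
                      UCC_LIB_ATTR_FIELD_SYNC_TYPE)) {
        return UCC_ERR_NOT_SUPPORTED;
    }
    return UCC_OK;
}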

@nsarka nsarka self-assigned this Sep 10, 2024
@nsarka nsarka requested a review from janjust September 10, 2024 20:14
@nsarka nsarka changed the title from "CODESTYLE: Update ctx attrs" to "Update context attribute query" on Sep 10, 2024
@janjust
Collaborator

janjust commented Sep 10, 2024

@nsarka IMO, an easier-to-read approach is to simply define a zero_attr() function that zeroes out the attributes before going over the mask bits and setting everything as needed, rather than going through all the bitmask fields and individually zeroing everything out.
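
A minimal sketch of such a helper, assuming the ucc_base_ctx_attr_t layout referenced above (names are illustrative, not the merged code); each TL/CL query would call it first and then fill in only the masked fields it actually supports:

/* Sketch only: zero every queryable context attribute up front; the
 * per-TL/CL query then overwrites only what its capabilities support. */
static void ucc_base_ctx_attr_clear(ucc_base_ctx_attr_t *attr)
{
    attr->attr.ctx_addr_len            = 0;
    attr->attr.global_work_buffer_size = 0;
    attr->topo_required                = 0;
}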

@nsarka
Collaborator Author

nsarka commented Sep 11, 2024

@nsarka IMO, an easier-to-read approach is to simply define a zero_attr() function that zeroes out the attributes before going over the mask bits and setting everything as needed, rather than going through all the bitmask fields and individually zeroing everything out.

Updated

@janjust
Collaborator

janjust commented Sep 16, 2024

@nsarka please squash commits and rebase

Collaborator

@janjust janjust left a comment


Looks good to me, but I do have to caution that the user must know, either through documentation or otherwise, that the attr is in fact cleared and only the masked values are set.

Collaborator

@samnordmann samnordmann left a comment


LGTM! Please update copyrights

@nsarka
Collaborator Author

nsarka commented Sep 18, 2024

Copyright updated and commits squashed!

@Sergei-Lebedev
Contributor

@nsarka, not related to this PR, but it seems like there is an issue with asymmetric memory when used together with PyTorch:

03:09:14  + python /opt/nvidia/workloads/dlrm/dlrm_s_pytorch.py --mini-batch-size=2048 --test-mini-batch-size=16384 --test-num-workers=0 --num-batches=10 --data-generation=random --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --arch-sparse-feature-size=64 --arch-embedding-size=1000-1000-1000-1000-1000-1000-1000-1000 --num-indices-per-lookup=100 --num-indices-per-lookup-fixed=0 --arch-interaction-op=dot --numpy-rand-seed=727 --print-freq=1 --loss-function=mse --round-targets=False --learning-rate=0.01 --print-time --dist-backend=ucc --use-gpu
03:09:15  Unable to import mlperf_logging,  No module named 'mlperf_logging'
03:09:15  Unable to import mlperf_logging,  No module named 'mlperf_logging'
03:09:16  Running on 2 ranks using ucc backend
03:09:17  world size: 2, current rank: 0, local rank: 0
03:09:17  Using 1 GPU(s)...
03:09:17  world size: 2, current rank: 1, local rank: 0
03:09:18  time/loss/accuracy (if enabled):
03:09:18  [1726708157.732010] [swx-clx01:629  :0]         mc_cuda.c:248  UCC  ERROR cudaMemcpyAsync(dst, src, len, cudaMemcpyDefault, resources->stream)() failed: 1(invalid argument)
03:09:18  [1726708157.732034] [swx-clx01:629  :0]         mc_cuda.c:251  cuda mc ERROR failed to launch cudaMemcpyAsync, dst (nil), src 0x7f06fc084f50, len 94294989353336
03:09:18  [1726708157.732044] [swx-clx01:629  :0]  ucc_coll_utils.c:265  UCC  ERROR error copying back to old asymmetric buffer: Invalid parameter
03:09:18  [1726708157.732067] [swx-clx01:629  :0]    ucc_schedule.h:195  UCC  ERROR failure copying out asymmetric buffer: Invalid parameter
03:09:18  [swx-clx02:653  :0:653] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

@nsarka
Collaborator Author

nsarka commented Sep 27, 2024

not related to this PR, but it seems like there is an issue with asymmetric memory when used together with PyTorch

I opened a PR to fix it here: #1026

@Sergei-Lebedev Sergei-Lebedev merged commit 16586e1 into openucx:master Oct 3, 2024
10 of 11 checks passed