Build OpenCL kernels requiring CL2.0 (needed for __generic args) #135

Open · wants to merge 2 commits into main
Conversation

szymonlopaciuk
Contributor

Description

It seems that OpenCL on CUDA is quite permissive, whereas AMD machines, which follow the OpenCL spec more closely, take CL1.2 as the default version even if the actual device supports higher ones.

The current version of Xtrack contains code that is incompatible with CL1.2, because in some places values in local (private) scope are passed to parameters declared __global. CL2.0 introduces the generic (__generic) address space, which is the default for unqualified pointer parameters and accepts arguments from either memory space, in a manner similar to CUDA. In particular, the function multipole_compute_dpx_dpy_single_particle receives arguments from both kinds of memory. Requiring CL2.0 is the easiest fix here; the alternatives are to explicitly write two versions of the function, or to copy manually from global memory to private memory.
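For illustration, here is a minimal pyopencl sketch of the pattern (the helper and kernel names are made up, this is not the actual Xtrack code). Under -cl-std=CL2.0 the unqualified pointer parameter of helper() lives in the generic address space and accepts both calls below; a strict CL1.2 build rejects one of them with an address-space mismatch.

import pyopencl as cl

# Illustrative only, not the Xtrack kernel: helper() takes unqualified
# pointers, which under CL2.0 are generic-address-space pointers.
SRC = r"""
void helper(const float* x, float* out)
{
    *out = 2.0f * x[0];
}

__kernel void demo(__global const float* data, __global float* result)
{
    int i = get_global_id(0);
    float private_copy = data[i];
    float a, b;
    helper(data + i, &a);        // pointer to __global memory
    helper(&private_copy, &b);   // pointer to __private memory
    result[i] = a + b;
}
"""

ctx = cl.create_some_context()
prg = cl.Program(ctx, SRC).build(options=["-cl-std=CL2.0"])

Building the same source on a driver that defaults to CL1.2 fails, because one of the two helper() calls then mismatches the (implicitly __private) parameter address space.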

I am preparing a PR for Xtrack that fixes issues encountered on AMD; this change is a prerequisite for those fixes.

Checklist

Mandatory:

  • I have added tests to cover my changes
  • All the tests are passing, including my new ones
  • I described my changes in this PR description

Optional:

  • The code I wrote follows good style practices (see PEP 8 and PEP 20).
  • I have updated the docs in relation to my changes, if applicable
  • I have tested also GPU contexts

@rdemaria
Collaborator

rdemaria commented Jun 13, 2024

I would be careful with OpenCL 2.0: it is not well supported across drivers. 3.0 is better supported, but it does not include many 2.0 features. I could test on the Intel driver and PoCL in case...

@szymonlopaciuk
Contributor Author

Sorry for the late response, I had already noted it yesterday. I can experiment with 3.0 too if you think relying on 2.0 is a problem; I did not know about this. I had a quick look at the spec: CL3.0 also supports the __generic address space, and it seems that is all we need.

We could leave the version configurable by the user, though I'm not sure of the benefit of that if our code does not compile on other versions (as currently, under 2.1)...

Do you support the idea of bumping the version in principle? The alternative is to work around it, which would be a bit annoying, though not impossible.

@szymonlopaciuk
Contributor Author

szymonlopaciuk commented Jun 14, 2024

So, here is what I found (see the version-check sketch after the list):

  • The Radeon VII supports CL2.0 but not CL3.0.
  • Neither the T4s nor the Titan Vs support CL2.0, but strangely they happily compile with the -cl-std=CL2.0 flag.
  • PoCL is unhappy with 2.0 on the OpenStack Xeon machine. (Interestingly, it compiles on a Threadripper, though I suspect that pyopencl is simply picking the wrong device on that heterogeneous machine; it looks suspicious.)
  • The Intel OCL Runtime seems happy with 2.0 on both the Threadripper and the Xeon.
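For reference, a quick way to see what each platform/device reports (a plain pyopencl sketch, independent of Xobjects), mirroring what clinfo shows:

import pyopencl as cl

# Print the driver version and the supported OpenCL C version per device.
for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(f"{platform.name} | {dev.name} | {dev.version} | {dev.opencl_c_version}")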

@rdemaria
Collaborator

Does PoCL choke on the C code or on the flag? CPUs have a single memory space, so I would expect the feature to be supported.

@szymonlopaciuk
Contributor Author

It chokes on the flag. I will have to check without it; I wouldn't be surprised if it still has problems, though: I can see in clinfo that it supports CL1.2 exclusively.

@szymonlopaciuk
Contributor Author

Okay, so to complicate matters further, I have confirmed a bug in Xobjects (or more likely in pyopencl): no matter which platform is selected, the first one is actually used to run the kernel. It is very sneaky, because the compiler messages look "okay" (see the screenshot), but the kernel does not actually run where we need it. I'm not sure why at this point; I briefly looked at the pyopencl code and nothing particularly bad jumps out.

I first observed strange behaviour with gpustat, where the GPUs would light up even when the code was allegedly running on the CPU, and I confirmed it by using vendor-specific macros in the kernel code (__NV_CL_C_VERSION in this case for NVIDIA, which is the platform producing the message in all cases). See the screenshot; it's a bit ugly because I just put a printf in one of the tests, but still:

[screenshot: printf output from the test]
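As a cross-check independent of the compiler messages, something along these lines shows which device actually backs the queue (a sketch with plain pyopencl, not the Xobjects context API; the platform index is only an example):

import pyopencl as cl

# Pick a device explicitly, then ask the queue which device it is bound to.
platforms = cl.get_platforms()
dev = platforms[1].get_devices()[0]   # index 1 is illustrative: the device we intend to use
ctx = cl.Context(devices=[dev])
queue = cl.CommandQueue(ctx)
print("requested device:", dev.name)
print("queue device:    ", queue.device.name)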

@rdemaria
Collaborator

The benchmarks at https://github.com/rdemaria/simpletrack work normally for me. I have upgraded everything to the latest versions.
Interestingly, I get:

POCL:  Device: cpu-skylake-avx512-AMD Ryzen 9 7950X3D 16-Core Processor
turns    npart t/turn[ms] t/turn/part[us]
   10    18000     154.68       8.59
   10    25000     254.29      10.17
NVIDIA: Device: NVIDIA TITAN V
turns    npart t/turn[ms] t/turn/part[us]
   10    18000      17.14       0.95
   10    25000      17.74       0.71
INTEL: Device: AMD Ryzen 9 7950X3D 16-Core Processor          
turns    npart t/turn[ms] t/turn/part[us]
   10    18000     415.34      23.07
   10    25000     543.04      21.72
