Extending dpctl to support CUDA #1124

diptorupd · 2023-03-14T21:49:24Z

diptorupd
Mar 14, 2023
Maintainer

oneAPI 2023.0 supports CUDA devices using the "oneAPI for NVIDIA GPUs 2023.0" plugin. I am starting this exploratory discussion to evaluate the requirements and scope of work to support CUDA in dpctl via the oneAPI plugin.

Here are the findings from my initial exploration:

System information:

OS: Ubuntu 22.04 Jammy
CUDA GPU: NVIDIA GeForce GTX 1660 Ti card
CUDA 11.4.
CUDA Driver Version: 470.161.03

Initial setup steps:

a) Installed oneAPI following the installation guide

NOTE: Watch out for installation issues on Ubuntu 22.04 (cstddef.h not found etc.) to work around do sudo apt install libstdc++-12-dev.

b) I already had CUDA set up and I had followed the CUDA guide to install on my OS

c) Downloaded the oneAPI for NVIDIA GPUs plugin and followed the installation guide

NOTE: If you have multiple type of devices on the system (I have openCL GPU driver and L0 GPU driver for a gen9 integrated GPU, openCL CPU driver for a gen9 CPU, and CUDA), you can compile the simple-sycl-app.cpp from the get-started-guide with multiple -fsycl-targets, e.g., -fsycl-targets=nvptx64-nvidia-cuda, spir64-unknown-unknown. Once that is done, you can execute the simple-sycl-app for both CUDA and other devices simply by changing SYCL_DEVICE_FILTER.

Building dpctl with CUDA

a) Build dpctl with the customized oneAPI. The process for me was just to run python scripts/build_locally.py.

NOTE: be sure to remove the dpcpp-cpp-rt and dpcpp-linux_64 conda packages if you are building inside a conda environment.

Testing the install

a) After building and installing dpctl using the build_locally.py script. I tried the following:

>>> import dpctl
>>> dpctl.lsplatform()
Intel(R) FPGA Emulation Platform for OpenCL(TM) OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
Intel(R) OpenCL OpenCL 3.0 LINUX
Intel(R) OpenCL HD Graphics OpenCL 3.0 
Intel(R) Level-Zero 1.3
NVIDIA CUDA BACKEND CUDA 11.4

So far so good, the CUDA GPU is detected as expected.

b) Creating a CUDA stream:

>>> q = dpctl.SyclQueue("cuda")
>>> q.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu,  NVIDIA GeForce GTX 1660 Ti] at 0x7f9e9287bf70>
>>> q.sycl_device.print_device_info()
    Name            NVIDIA GeForce GTX 1660 Ti
    Driver version  CUDA 11.4
    Vendor          NVIDIA Corporation
    Filter string   cuda:gpu:0

c) Try a basic tensor creation:

>>> import dpctl.tensor as dpt
>>> a = dpt.empty(10, device="cuda")
>>> a.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu,  NVIDIA GeForce GTX 1660 Ti] at 0x7f9e92fc5e70>
>>> print(a)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> a
usm_ndarray([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> a.usm_type
'device'

Initial thoughts

I went much farther than I had hoped to get. The plugin seamlessly exposed the CUDA device, queue creation works and even memory allocation seems to have succeeded.

The next steps will be to test some basic operations on the tensor. @oleksandr-pavlyk can you suggest something? Although, I doubt that will work out of the box. I think we will need to build dpctl with -fsycl-targets=nvptx64-nvidia-cuda.

Answered by diptorupd

Mar 15, 2023

@oleksandr-pavlyk As expected, running a kernel with a default compiled dpctl as-is will not work:

>>> a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/diptorupd/Desktop/devel/dpctl/dpctl/tensor/_ctors.py", line 642, in arange
    hev, _ = ti._linspace_step(_start, _step, res, sycl_queue)
RuntimeError: Native API failed. Native API returns: -42 (PI_ERROR_INVALID_BINARY) -42 (PI_ERROR_INVALID_BINARY)

However, after the following small patch

Author: Diptorup Deb <[email protected]>  2023-03-14 …

View full answer

oleksandr-pavlyk · 2023-03-15T02:56:14Z

oleksandr-pavlyk
Mar 15, 2023
Maintainer

@diptorupd Try dev = "cuda"; a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b];

For dev="gpu" I get:


In [4]: dev = "gpu"

In [5]: a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]

In [6]: c
Out[6]: usm_ndarray([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22])

0 replies

diptorupd · 2023-03-15T05:00:35Z

diptorupd
Mar 15, 2023
Maintainer Author

@oleksandr-pavlyk As expected, running a kernel with a default compiled dpctl as-is will not work:

>>> a = dpt.arange(30, device=dev); b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/diptorupd/Desktop/devel/dpctl/dpctl/tensor/_ctors.py", line 642, in arange
    hev, _ = ti._linspace_step(_start, _step, res, sycl_queue)
RuntimeError: Native API failed. Native API returns: -42 (PI_ERROR_INVALID_BINARY) -42 (PI_ERROR_INVALID_BINARY)

However, after the following small patch

Author: Diptorup Deb <[email protected]>  2023-03-14 23:47:18
Committer: Diptorup Deb <[email protected]>  2023-03-14 23:47:18
Parent: 8f828f24ada9829ed4d9d5dc56e6d7f39dd9ac3c (Merge pull request #1118 from IntelPython/fix-build-break)
Branch: demo/cuda-support
Follows: 0.14.2
Precedes: 

    Compile with cuda support

----------------------------- dpctl/CMakeLists.txt -----------------------------
index 6ccca33dd..f8c08f105 100644
@@ -58,6 +58,7 @@ elseif(UNIX)
         "${WARNING_FLAGS}"
         "${SDL_FLAGS}"
         "-fsycl "
+        "-fsycl-targets=nvptx64-nvidia-cuda,spir64-unknown-unknown "
     )
     set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3 ${CFLAGS}")
     set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 ${CXXFLAGS}")

--------------------------- scripts/build_locally.py ---------------------------
index ff34c9d18..9c689ead9 100644
@@ -145,7 +145,7 @@ if __name__ == "__main__":
         and args.compiler_root is None
     ):
         args.c_compiler = "icx"
-        args.cxx_compiler = "icpx" if "linux" in sys.platform else "icx"
+        args.cxx_compiler = "clang++" if "linux" in sys.platform else "icx"
         args.compiler_root = None
     else:
         cr = args.compiler_root
@@ -153,7 +153,9 @@ if __name__ == "__main__":
             if args.c_compiler is None:
                 args.c_compiler = "icx"
             if args.cxx_compiler is None:
-                args.cxx_compiler = "icpx" if "linux" in sys.platform else "icx"
+                args.cxx_compiler = (
+                    "clang++" if "linux" in sys.platform else "icx"
+                )
         else:
             raise RuntimeError(
                 "Option 'compiler-root' must be provided when "

There were a few warnings of the kind: clang++: warning: linked binaries do not contain expected 'nvptx64-nvidia-cuda' target; found targets: 'nvptx64-nvidia-cuda-sm_50, spir64-unknown-unknown' [-Wsycl-target], but we have CUDA support 😄

>>> import dpctl
>>> import dpctl.tensor as dpt
>>> dev = "cuda"
>>> a = dpt.arange(30, device=dev)
>>> a
usm_ndarray([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,
             15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
>>> print(a)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]
>>> a.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu,  NVIDIA GeForce GTX 1660 Ti] at 0x7f2e89f698b0>
>>> b = dpt.roll(dpt.concat((dpt.ones(15, dtype=dpt.bool, device=dev), dpt.zeros(15, dtype=dpt.bool, device=dev))), 8); c = a[b]
>>> c
usm_ndarray([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22])
>>> c.sycl_device
<dpctl.SyclDevice [backend_type.cuda, device_type.gpu,  NVIDIA GeForce GTX 1660 Ti] at 0x7f2e89f698b0>

2 replies

diptorupd Mar 15, 2023
Maintainer Author

We should now spin up a Docker image with dpctl+CUDA and put it on the repo.

(But for now let me go back on vacation 😉 )

ogrisel Mar 15, 2023

Great news! Enjoy your time-off!

oleksandr-pavlyk · 2023-10-27T02:28:42Z

oleksandr-pavlyk
Oct 27, 2023
Maintainer

With #1411, one can build dpctl using

$ DPCTL_TARGET_CUDA=1 python scripts/build_locally.py --verbose

This creates fat binary with SPV and PTX offload sections. Test suite passes using CUDA backend. Since the GPU at my disposal is weak (GT 1030) I must run each test file individually:

$ ONEAPI_DEVICE_SELECTOR=cuda:gpu find dpctl/tests/ -name "test_*.py" | xargs -n 1 bash -c 'python -m pytest $0 --durations=3 || exit 255'

With beefier GPU, running the test suite works out of the box:

$ ONEAPI_DEVICE_SELECTOR=cuda:gpu pytest --pyargs dpctl

@ogrisel @fcharras

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extending dpctl to support CUDA #1124

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 2 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Extending dpctl to support CUDA #1124

diptorupd Mar 14, 2023 Maintainer

Replies: 3 comments · 2 replies

oleksandr-pavlyk Mar 15, 2023 Maintainer

diptorupd Mar 15, 2023 Maintainer Author

diptorupd Mar 15, 2023 Maintainer Author

ogrisel Mar 15, 2023

oleksandr-pavlyk Oct 27, 2023 Maintainer

diptorupd
Mar 14, 2023
Maintainer

Replies: 3 comments 2 replies

oleksandr-pavlyk
Mar 15, 2023
Maintainer

diptorupd
Mar 15, 2023
Maintainer Author

diptorupd Mar 15, 2023
Maintainer Author

oleksandr-pavlyk
Oct 27, 2023
Maintainer