
Redefine transform_reduce's scratch & result mem #1354

Merged
merged 15 commits into from
Apr 12, 2024

Conversation

AidanBeltonS
Contributor

Rationale

Algorithms using transform_reduce (std::reduce, std::max_element, etc.) return a single-value result. Except for the smallest cases, they also require intermediate scratch memory on the device to hold partial results.

For L0 backend, it's faster to have a 1-element USM host allocation and to write the final reduction result directly to that.

For Nvidia, host USM is expensive, and it's faster instead to have a single USM device allocation for both the result and the intermediate scratch.

Approach

This PR combines the two approaches into a struct __usm_host_or_unified_storage, based on the previous __usm_host_or_buffer_storage. When host USM is supported and the backend is L0, this struct holds two memory allocations (device USM for intermediate scratch, and host USM for final result). In all other cases, this struct holds a single device USM allocation, large enough for both intermediate scratch and final result. In this latter case, a memcpy from device to host is needed to return the final result.
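The two layouts described above can be sketched in plain C++. This is a hypothetical illustration, not the real oneDPL class: `std::vector` and `std::unique_ptr` stand in for `sycl::malloc_device` and `sycl::malloc_host` allocations so the sketch is self-contained, and all names are invented for the example.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch of the two allocation strategies: either a separate
// 1-element host result (the L0-style path), or a single unified allocation
// holding scratch plus a trailing result slot (the Nvidia-style path).
template <typename T>
struct scratch_and_result
{
    bool __use_separate_host_result;  // true on the L0-style path
    std::vector<T> __scratch;         // stands in for the device USM scratch
    std::unique_ptr<T> __host_result; // stands in for the 1-element host USM

    scratch_and_result(std::size_t __scratch_n, bool __separate_host)
        : __use_separate_host_result(__separate_host)
    {
        if (__separate_host)
        {
            __scratch.resize(__scratch_n);         // scratch only, on device
            __host_result = std::make_unique<T>(); // result written to host
        }
        else
        {
            // Single allocation: scratch plus one trailing result slot.
            __scratch.resize(__scratch_n + 1);
        }
    }

    T* __result_ptr() // where the final reduction value is written
    {
        return __use_separate_host_result
                   ? __host_result.get()
                   : __scratch.data() + __scratch.size() - 1;
    }

    T __get_value() // on the unified path, a device-to-host memcpy happens here
    {
        return *__result_ptr();
    }
};
```

On the separate-host path the caller reads the result directly; on the unified path the read models the extra device-to-host copy the PR description mentions.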

@mmichel11 mmichel11 added this to the 2022.6.0 milestone Jan 29, 2024
Contributor

@julianmi julianmi left a comment


I copied my feedback from the original PR (AidanBeltonS#30 (comment)) since it is still relevant:

I think this is a good addition. We had a similar approach in an older PR #1106 that was superseded by the unified memory design currently in mainline. Tests on this PR showed reduced host overheads especially for small input arrays.

An issue I see with this approach is compatibility for devices without USM memory support. This could be solved by adding a third case that uses the existing buffer-based approach as a fallback.

@AidanBeltonS
Contributor Author

> I copied my feedback from the original PR (AidanBeltonS#30 (comment)) since it is still relevant:
>
> I think this is a good addition. We had a similar approach in an older PR #1106 that was superseded by the unified memory design currently in mainline. Tests on this PR showed reduced host overheads, especially for small input arrays.
>
> An issue I see with this approach is compatibility for devices without USM memory support. This could be solved by adding a third case that uses the existing buffer-based approach as a fallback.

This is an interesting problem. The current approach passes memory as pointers; a buffer option would complicate this, since you would have to query whether the device supports USM allocations (a runtime check) and then execute either a pointer or a buffer version of the kernel. This would require two instantiations of each kernel to support both memory types at runtime.

Do you have some existing infrastructure in oneDPL to handle these kinds of cases or propose another solution?
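The two-instantiation problem described above can be illustrated with a small standard-C++ mock. The device query is faked with a bool; in real SYCL it would be something like `device.has(sycl::aspect::usm_device_allocations)`. All function names here are invented for the example.

```cpp
// Hypothetical illustration of the dispatch problem: because the memory
// kind is only known at run time, both kernel variants must be compiled
// into the binary, and a runtime branch selects between them.
template <typename T>
T reduce_with_usm_path(const T* data, int n)
{
    T sum{};
    for (int i = 0; i < n; ++i) sum += data[i]; // stands in for the pointer-based kernel
    return sum;
}

template <typename T>
T reduce_with_buffer_path(const T* data, int n)
{
    T sum{};
    for (int i = 0; i < n; ++i) sum += data[i]; // stands in for the buffer-based kernel
    return sum;
}

template <typename T>
T reduce_dispatch(const T* data, int n, bool device_supports_usm)
{
    // Both template instantiations exist even though only one runs,
    // which is the compile-time cost the comment above points out.
    return device_supports_usm ? reduce_with_usm_path(data, n)
                               : reduce_with_buffer_path(data, n);
}
```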

@julianmi
Contributor

julianmi commented Feb 6, 2024

> > I copied my feedback from the original PR (AidanBeltonS#30 (comment)) since it is still relevant:
> > I think this is a good addition. We had a similar approach in an older PR #1106 that was superseded by the unified memory design currently in mainline. Tests on this PR showed reduced host overheads, especially for small input arrays.
> > An issue I see with this approach is compatibility for devices without USM memory support. This could be solved by adding a third case that uses the existing buffer-based approach as a fallback.
>
> This is an interesting problem. The current approach passes memory as pointers; a buffer option would complicate this, since you would have to query whether the device supports USM allocations (a runtime check) and then execute either a pointer or a buffer version of the kernel. This would require two instantiations of each kernel to support both memory types at runtime.
>
> Do you have some existing infrastructure in oneDPL to handle these kinds of cases or propose another solution?

Could we use __usm_host_or_buffer_accessor instead of the current pointer solution?

@julianmi
Contributor

@AidanBeltonS Please rebase this since #1410 has been merged. I think we can get this merged after the rebase and testing.

joeatodd and others added 13 commits April 12, 2024 09:54
For L0 backend, it's faster to have a USM host allocation and to write
the reduction result directly to that.

For Nvidia, host USM is expensive, and it's faster to have a single
USM device allocation for both the result and the intermediate scratch
when required.

This commit combines the two approaches into a struct
__usm_host_or_unified_storage, based on the previous
__usm_host_or_buffer_storage.
Co-authored-by: Julian Miller <[email protected]>
{
private:
using __sycl_buffer_t = sycl::buffer<_T, 1>;

_ExecutionPolicy __exec;
::std::shared_ptr<_T> __scratch_buf;
Contributor


Are we really going to share these fields, or simply manage the memory this way?
I think it is the second case.
But that means `std::unique_ptr` should be enough for us.

Contributor Author


I think this should stay as a shared_ptr. There is a situation where a variable of this class is copied and therefore two objects have ownership of the pointer. This happens when a __future is constructed to return the result value.

So, given that the class may be passed by value and copied, I think this should stay as is.

Error when using unique_ptr:

/home/aidanbelton/oneDPL/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h:147:41: error: call to implicitly-deleted copy constructor of 'oneapi::dpl::__par_backend_hetero::__result_and_scratch_storage<oneapi::dpl::execution::device_policy<> &, TestUtils::Sum>'
  147 |         return __future(__reduce_event, __scratch_container);
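The error above comes down to basic ownership semantics, which can be shown with minimal stand-ins for the storage and future types (the names here are hypothetical, not the real oneDPL classes): copying an object that holds a `std::shared_ptr` compiles and shares ownership, whereas a `std::unique_ptr` member would delete the copy constructor.

```cpp
#include <memory>

// Minimal stand-ins for the scratch storage and the future-like return
// object. With std::unique_ptr instead of std::shared_ptr, the copy in
// future_like's constructor would fail to compile, matching the error above.
struct storage
{
    std::shared_ptr<int> __buf; // shared ownership of the scratch memory
};

struct future_like
{
    storage __s;
    future_like(storage s) : __s(s) {} // copies the storage, sharing ownership
};
```

After constructing the future, both the caller's storage and the future's copy own the allocation, so the scratch memory stays alive for whichever is destroyed last.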

Contributor

@SergeyKopienko SergeyKopienko Apr 12, 2024


The fix is very simple:

return __future(__reduce_event, std::move(__scratch_container));
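The suggested fix works because a move-only storage can still be handed into the future by value. A minimal standard-C++ sketch (names invented for the example, not the real oneDPL types):

```cpp
#include <memory>
#include <utility>

// If the storage held a std::unique_ptr it would be move-only; std::move
// lets it be transferred into the future without a copy.
struct uniq_storage
{
    std::unique_ptr<int> __buf;
};

struct move_future
{
    uniq_storage __s;
    move_future(uniq_storage s) : __s(std::move(s)) {}
};

move_future make_future(uniq_storage __scratch_container)
{
    // return move_future(__scratch_container);         // would not compile: copy
    return move_future(std::move(__scratch_container)); // compiles: move
}
```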

Contributor


Wouldn't unique_ptr limit what the user can do with the future-like object we return? I think shared_ptr is the more general approach here.

Contributor

@SergeyKopienko SergeyKopienko Apr 12, 2024


@julianmi exactly, you are right.
But let's take a look at https://en.cppreference.com/w/cpp/thread/future/operator%3D
We have no requirement that the result must be copyable.

UPD: discussed with @MikeDvorskiy, let's still use std::shared_ptr so this class has the same requirements as sycl::buffer.

Contributor


Thanks, @SergeyKopienko. I think we can switch to unique_ptr then. We should do so throughout the unified USM or buffer storage though.

Contributor


See my update: #1354 (comment)

Contributor Author


I think there is a discussion to be had regarding the exact usage of this class. However, I believe it is well beyond the scope of this PR, which is fundamentally a performance optimization. The usage of the shared_ptr existed prior to this change. I propose that this sort of architectural discussion and change should be placed into an issue or some other method of communication as it is not relevant to the proposed changes.

Contributor


One additional point: we shouldn't break the existing behavior of this class. Our users already have code that relies on it being copyable and movable.

@SergeyKopienko SergeyKopienko self-requested a review April 12, 2024 13:21
__use_USM_allocations(sycl::queue __queue)
{
#if _ONEDPL_SYCL_USM_HOST_PRESENT
auto __device = __queue.get_device();
Contributor


What about

return __queue.get_device().has(sycl::aspect::usm_device_allocations);

Contributor Author


Done

Contributor

@SergeyKopienko SergeyKopienko left a comment


LGTM.
But please wait for CI and additional approvals.

@julianmi julianmi merged commit 691f2f5 into oneapi-src:main Apr 12, 2024
20 checks passed
julianmi added a commit that referenced this pull request Apr 16, 2024
* Check for empty scratch storage

* Fix outdated macro

* Add guards to fall back to buffer