-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redefine transform_reduce's scratch & result mem #1354
Redefine transform_reduce's scratch & result mem #1354
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied my feedback from the original PR (AidanBeltonS#30 (comment)) since it is still relevant:
I think this is a good addition. We had a similar approach in an older PR #1106 that was superseded by the unified memory design currently in mainline. Tests on this PR showed reduced host overheads especially for small input arrays.
An issue I see with this approach is compatibility for devices without USM memory support. This could be solved by adding a third case that uses the existing buffer-based approach as a fallback.
d1cbb76
to
d341306
Compare
This is an interesting problem. The current approach passes memory as pointers, this would be made more complicated with a buffer option. As you would have to query if the device supported USM allocations (a runtime check) then execute either a ptr or buffer version of the kernel. This would require two instantiations for each kernel to support both memory types at runtime. Do you have some existing infrastructure in oneDPL to handle these kinds of cases or propose another solution? |
Could we use |
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
@AidanBeltonS Please rebase this since #1410 has been merged. I think we can get this merged after the rebase and testing. |
ac30243
to
89f9e54
Compare
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
daf68dd
to
0121a5a
Compare
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
For L0 backend, it's faster to have a USM host allocation and to write the reduction result directly to that. For Nvidia, host USM is expensive, and it's faster to have a single USM device allocation for both the result and the intermediate scratch when required. This commit combines the two approaches into a struct __usm_host_or_unified_storage, based on the previous __usm_host_or_buffer_storage.
Co-authored-by: Julian Miller <[email protected]>
fff6fd6
to
62f5d5b
Compare
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_utils.h
Outdated
Show resolved
Hide resolved
{ | ||
private: | ||
using __sycl_buffer_t = sycl::buffer<_T, 1>; | ||
|
||
_ExecutionPolicy __exec; | ||
::std::shared_ptr<_T> __scratch_buf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going really to share this fields? Or simple manage memory by this way?
I think - the second case.
But this mean `std::unique_ptr' should be enough for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should stay as a shared_ptr
. There is a situation where a variable of this class is copied and therefore two objects have ownership of the pointer. This happens when a __future
is constructed to return the result value.
So due to the possibility of non-reference passing and copying of the class I think this should stay as is.
Error when using unique_ptr
.
/home/aidanbelton/oneDPL/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h:147:41: error: call to implicitly-deleted copy constructor of 'oneapi::dpl::__par_backend_hetero::__result_and_scratch_storage<oneapi::dpl::execution::device_policy<> &, TestUtils::Sum>'
147 | return __future(__reduce_event, __scratch_container);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix is very simple:
return __future(__reduce_event, std::move(__scratch_container));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't unique_ptr
limit what the user can do with the future-like object we return? I think shared_ptr
is the more general approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@julianmi exactly, you are right.
But let's take a look at https://en.cppreference.com/w/cpp/thread/future/operator%3D
We haven't requirement that should be able to have copyable result.
UPD: discussed with @MikeDvorskiy, let's still use std::shared_ptr
to have the same requirements for this class like for sycl::buffer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @SergeyKopienko. I think we can switch to unique_ptr
then. We should do so throughout the unified USM or buffer storage though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my update: #1354 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a discussion to be had regarding the exact usage of this class. However, I believe it is well beyond the scope of this PR, which is fundamentally a performance optimization. The usage of the shared_ptr existed prior to this change. I propose that this sort of architectural discussion and change should be placed into an issue or some other method of communication as it is not relevant to the proposed changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional moment - we shouldn't broke existing behavior of this class : our users already have the code where it's copyable and moveable.
__use_USM_allocations(sycl::queue __queue) | ||
{ | ||
#if _ONEDPL_SYCL_USM_HOST_PRESENT | ||
auto __device = __queue.get_device(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about
return __queue.get_device().__device.has(sycl::aspect::usm_device_allocations);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
But please wait CI and receive additional approves.
Rationale
Algorithms using
transform_reduce
(std::reduce
,std::max_element
, etc) return a single value result. They also, except for the smallest cases, require intermediate scratch memory on the device to hold partial results.For L0 backend, it's faster to have a 1-element USM host allocation and to write the final reduction result directly to that.
For Nvidia, host USM is expensive, and it's faster instead to have a single USM device allocation for both the result and the intermediate scratch.
Approach
This PR combines the two approaches into a struct
__usm_host_or_unified_storage
, based on the previous__usm_host_or_buffer_storage
. When host USM is supported and the backend is L0, this struct holds two memory allocations (device USM for intermediate scratch, and host USM for final result). In all other cases, this struct holds a single device USM allocation, large enough for both intermediate scratch and final result. In this latter case, amemcpy
from device to host is needed to return the final result.