
Redefine transform_reduce's scratch & result mem #1354

Merged
merged 15 commits into from
Apr 12, 2024

Conversation

AidanBeltonS
Contributor

Rationale

Algorithms using transform_reduce (std::reduce, std::max_element, etc.) return a single-value result. Except for the smallest cases, they also require intermediate scratch memory on the device to hold partial results.

For L0 backend, it's faster to have a 1-element USM host allocation and to write the final reduction result directly to that.

For Nvidia, host USM is expensive, and it's faster instead to have a single USM device allocation for both the result and the intermediate scratch.

Approach

This PR combines the two approaches into a struct __usm_host_or_unified_storage, based on the previous __usm_host_or_buffer_storage. When host USM is supported and the backend is L0, this struct holds two memory allocations (device USM for intermediate scratch, and host USM for final result). In all other cases, this struct holds a single device USM allocation, large enough for both intermediate scratch and final result. In this latter case, a memcpy from device to host is needed to return the final result.
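The two layouts described above can be sketched in plain C++. This is a hypothetical illustration, not the real oneDPL class: `std::vector` and `std::unique_ptr` stand in for `sycl::malloc_device` and `sycl::malloc_host` allocations so the sketch is self-contained, and all names are invented for the example.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch of the two allocation strategies: either a separate
// 1-element host result (the L0-style path), or a single unified allocation
// holding scratch plus a trailing result slot (the Nvidia-style path).
template <typename T>
struct scratch_and_result
{
    bool __use_separate_host_result;  // true on the L0-style path
    std::vector<T> __scratch;         // stands in for the device USM scratch
    std::unique_ptr<T> __host_result; // stands in for the 1-element host USM

    scratch_and_result(std::size_t __scratch_n, bool __separate_host)
        : __use_separate_host_result(__separate_host)
    {
        if (__separate_host)
        {
            __scratch.resize(__scratch_n);         // scratch only, on device
            __host_result = std::make_unique<T>(); // result written to host
        }
        else
        {
            // Single allocation: scratch plus one trailing result slot.
            __scratch.resize(__scratch_n + 1);
        }
    }

    T* __result_ptr() // where the final reduction value is written
    {
        return __use_separate_host_result
                   ? __host_result.get()
                   : __scratch.data() + __scratch.size() - 1;
    }

    T __get_value() // on the unified path, a device-to-host memcpy happens here
    {
        return *__result_ptr();
    }
};
```

On the separate-host path the caller reads the result directly; on the unified path the read models the extra device-to-host copy the PR description mentions.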

@mmichel11 mmichel11 added this to the 2022.6.0 milestone Jan 29, 2024
Contributor

@julianmi julianmi left a comment


I copied my feedback from the original PR (AidanBeltonS#30 (comment)) since it is still relevant:

I think this is a good addition. We had a similar approach in an older PR #1106 that was superseded by the unified memory design currently in mainline. Tests on this PR showed reduced host overheads especially for small input arrays.

An issue I see with this approach is compatibility for devices without USM memory support. This could be solved by adding a third case that uses the existing buffer-based approach as a fallback.

@AidanBeltonS
Contributor Author

> I copied my feedback from the original PR (AidanBeltonS#30 (comment)) since it is still relevant:
>
> I think this is a good addition. We had a similar approach in an older PR #1106 that was superseded by the unified memory design currently in mainline. Tests on this PR showed reduced host overheads, especially for small input arrays.
>
> An issue I see with this approach is compatibility for devices without USM memory support. This could be solved by adding a third case that uses the existing buffer-based approach as a fallback.

This is an interesting problem. The current approach passes memory as pointers; a buffer option would complicate this, since you would have to query whether the device supports USM allocations (a runtime check) and then execute either a pointer or a buffer version of the kernel. This would require two instantiations of each kernel to support both memory types at runtime.

Do you have some existing infrastructure in oneDPL to handle these kinds of cases or propose another solution?
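The two-instantiation problem described above can be illustrated with a small standard-C++ mock. The device query is faked with a bool; in real SYCL it would be something like `device.has(sycl::aspect::usm_device_allocations)`. All function names here are invented for the example.

```cpp
// Hypothetical illustration of the dispatch problem: because the memory
// kind is only known at run time, both kernel variants must be compiled
// into the binary, and a runtime branch selects between them.
template <typename T>
T reduce_with_usm_path(const T* data, int n)
{
    T sum{};
    for (int i = 0; i < n; ++i) sum += data[i]; // stands in for the pointer-based kernel
    return sum;
}

template <typename T>
T reduce_with_buffer_path(const T* data, int n)
{
    T sum{};
    for (int i = 0; i < n; ++i) sum += data[i]; // stands in for the buffer-based kernel
    return sum;
}

template <typename T>
T reduce_dispatch(const T* data, int n, bool device_supports_usm)
{
    // Both template instantiations exist even though only one runs,
    // which is the compile-time cost the comment above points out.
    return device_supports_usm ? reduce_with_usm_path(data, n)
                               : reduce_with_buffer_path(data, n);
}
```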

@julianmi
Contributor

julianmi commented Feb 6, 2024

> > I copied my feedback from the original PR (AidanBeltonS#30 (comment)) since it is still relevant:
> > I think this is a good addition. We had a similar approach in an older PR #1106 that was superseded by the unified memory design currently in mainline. Tests on this PR showed reduced host overheads, especially for small input arrays.
> > An issue I see with this approach is compatibility for devices without USM memory support. This could be solved by adding a third case that uses the existing buffer-based approach as a fallback.
>
> This is an interesting problem. The current approach passes memory as pointers; a buffer option would complicate this, since you would have to query whether the device supports USM allocations (a runtime check) and then execute either a pointer or a buffer version of the kernel. This would require two instantiations of each kernel to support both memory types at runtime.
>
> Do you have some existing infrastructure in oneDPL to handle these kinds of cases or propose another solution?

Could we use __usm_host_or_buffer_accessor instead of the current pointer solution?

@julianmi
Contributor

@AidanBeltonS Please rebase this since #1410 has been merged. I think we can get this merged after the rebase and testing.

joeatodd and others added 13 commits April 12, 2024 09:54
For L0 backend, it's faster to have a USM host allocation and to write
the reduction result directly to that.

For Nvidia, host USM is expensive, and it's faster to have a single
USM device allocation for both the result and the intermediate scratch
when required.

This commit combines the two approaches into a struct
__usm_host_or_unified_storage, based on the previous
__usm_host_or_buffer_storage.
Co-authored-by: Julian Miller <[email protected]>
{
private:
using __sycl_buffer_t = sycl::buffer<_T, 1>;

_ExecutionPolicy __exec;
::std::shared_ptr<_T> __scratch_buf;
Contributor


Are we really going to share these fields, or simply manage the memory this way?
I think it is the second case.
But that means `std::unique_ptr` should be enough for us.

Contributor Author


I think this should stay as a shared_ptr. There is a situation where a variable of this class is copied and therefore two objects have ownership of the pointer. This happens when a __future is constructed to return the result value.

So, given that the class may be passed by value and copied, I think this should stay as is.

Error when using unique_ptr:

/home/aidanbelton/oneDPL/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce.h:147:41: error: call to implicitly-deleted copy constructor of 'oneapi::dpl::__par_backend_hetero::__result_and_scratch_storage<oneapi::dpl::execution::device_policy<> &, TestUtils::Sum>'
  147 |         return __future(__reduce_event, __scratch_container);
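The error above comes down to basic ownership semantics, which can be shown with minimal stand-ins for the storage and future types (the names here are hypothetical, not the real oneDPL classes): copying an object that holds a `std::shared_ptr` compiles and shares ownership, whereas a `std::unique_ptr` member would delete the copy constructor.

```cpp
#include <memory>

// Minimal stand-ins for the scratch storage and the future-like return
// object. With std::unique_ptr instead of std::shared_ptr, the copy in
// future_like's constructor would fail to compile, matching the error above.
struct storage
{
    std::shared_ptr<int> __buf; // shared ownership of the scratch memory
};

struct future_like
{
    storage __s;
    future_like(storage s) : __s(s) {} // copies the storage, sharing ownership
};
```

After constructing the future, both the caller's storage and the future's copy own the allocation, so the scratch memory stays alive for whichever is destroyed last.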

Contributor

@SergeyKopienko SergeyKopienko Apr 12, 2024


The fix is very simple:

return __future(__reduce_event, std::move(__scratch_container));
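The suggested fix works because a move-only storage can still be handed into the future by value. A minimal standard-C++ sketch (names invented for the example, not the real oneDPL types):

```cpp
#include <memory>
#include <utility>

// If the storage held a std::unique_ptr it would be move-only; std::move
// lets it be transferred into the future without a copy.
struct uniq_storage
{
    std::unique_ptr<int> __buf;
};

struct move_future
{
    uniq_storage __s;
    move_future(uniq_storage s) : __s(std::move(s)) {}
};

move_future make_future(uniq_storage __scratch_container)
{
    // return move_future(__scratch_container);         // would not compile: copy
    return move_future(std::move(__scratch_container)); // compiles: move
}
```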

Contributor


Wouldn't unique_ptr limit what the user can do with the future-like object we return? I think shared_ptr is the more general approach here.

Contributor

@SergeyKopienko SergeyKopienko Apr 12, 2024


@julianmi exactly, you are right.
But let's take a look at https://en.cppreference.com/w/cpp/thread/future/operator%3D
We have no requirement that the result must be copyable.

UPD: discussed with @MikeDvorskiy, let's still use std::shared_ptr so this class has the same requirements as sycl::buffer.

Contributor


Thanks, @SergeyKopienko. I think we can switch to unique_ptr then. We should do so throughout the unified USM or buffer storage though.

Contributor


See my update: #1354 (comment)

Contributor Author


I think there is a discussion to be had regarding the exact usage of this class. However, I believe it is well beyond the scope of this PR, which is fundamentally a performance optimization. The usage of the shared_ptr existed prior to this change. I propose that this sort of architectural discussion and change should be placed into an issue or some other method of communication as it is not relevant to the proposed changes.

Contributor


One additional point: we shouldn't break the existing behavior of this class. Our users already have code that relies on it being copyable and movable.

@SergeyKopienko SergeyKopienko self-requested a review April 12, 2024 13:21
__use_USM_allocations(sycl::queue __queue)
{
#if _ONEDPL_SYCL_USM_HOST_PRESENT
auto __device = __queue.get_device();
Contributor


What about

return __queue.get_device().has(sycl::aspect::usm_device_allocations);

Contributor Author


Done

Contributor

@SergeyKopienko SergeyKopienko left a comment


LGTM.
But please wait for CI and additional approvals.

@julianmi julianmi merged commit 691f2f5 into oneapi-src:main Apr 12, 2024
20 checks passed
julianmi added a commit that referenced this pull request Apr 16, 2024
* Check for empty scratch storage

* Fix outdated macro

* Add guards to fall back to buffer