RFC for histogram CPU implementation #1930
base: main
Conversation
Signed-off-by: Dan Hoeflinger <[email protected]>
1) Determine the number of threads that we will use, perhaps adding some method to do this generically based on the backend.
2) Create temporary data: one copy of the histogram output sequence for each thread beyond the first. Thread zero can use the user-provided output data.
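The quoted steps could be sketched roughly as follows. This is a hypothetical illustration only, using plain `std::thread` in place of a real parallel backend, with even-width bins and all names invented for the example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Sketch of the two steps above: each thread accumulates into a private
// histogram copy (thread zero reuses the user's output buffer), then the
// temporaries are summed into the output. Even-width bins only.
void histogram_by_thread_copies(const std::vector<float>& input,
                                float lo, float hi,
                                std::vector<std::uint64_t>& output /*pre-zeroed*/)
{
    const std::size_t n_bins = output.size();
    const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());

    // (n_threads - 1) temporary histograms; thread zero uses `output` directly.
    std::vector<std::vector<std::uint64_t>> temps(
        n_threads > 1 ? n_threads - 1 : 0, std::vector<std::uint64_t>(n_bins, 0));

    auto accumulate = [&](unsigned tid, std::vector<std::uint64_t>& bins) {
        const std::size_t chunk = (input.size() + n_threads - 1) / n_threads;
        const std::size_t begin = tid * chunk;
        const std::size_t end = std::min(begin + chunk, input.size());
        const float scale = n_bins / (hi - lo);
        for (std::size_t i = begin; i < end; ++i)
            if (input[i] >= lo && input[i] < hi)
                ++bins[static_cast<std::size_t>((input[i] - lo) * scale)];
    };

    std::vector<std::thread> workers;
    for (unsigned t = 1; t < n_threads; ++t)
        workers.emplace_back(accumulate, t, std::ref(temps[t - 1]));
    accumulate(0, output); // thread zero writes straight into the user's buffer
    for (auto& w : workers)
        w.join();

    // Final reduction of the temporaries into the output.
    for (const auto& t : temps)
        for (std::size_t b = 0; b < n_bins; ++b)
            output[b] += t[b];
}
```

As the discussion below notes, fixing the thread count up front like this is exactly the design point under debate.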
Determining/assuming a certain number of threads is not necessary, and should be avoided for the sake of parallel composability. We need a sort of thread-local storage, maybe the one that is accessible by other threads for data aggregation. It certainly can be done without knowing the number of threads.
This is a good point for parallel composability.
However, I am curious about the details of making this happen in practice.
What do you think is the best way to be agnostic of the number of threads, while still having some reasonable control over the following:
- limiting memory footprint
- aggregating the data after accumulation
I think these may have different answers for TBB and OpenMP.
There is some precedent for utilizing the maximum number of threads for accumulation in the OpenMP backend with transform_reduce
(link). This is of course not to say it is correct or best.
On the OpenMP side, perhaps we shouldn't be using `parallel_for` but rather extending some of its building blocks for use by both histogram and `parallel_for`. With more control over `__chunk_partitioner` and the parallel tasks / loop, we should have what we need, and we can understand our number of local temporary copies / threads at that level.
For TBB, I'm more in the dark since it recursively decomposes the work into chunks, but I'm looking further...
For TBB, does it make sense to do something similar to this code?
The plan would be to isolate, then get the max concurrency of the task arena and allocate temporaries based on that, then use `current_thread_index()` within a function brick which is decomposed directly or via a `parallel_for` pattern, followed by a launch of tasks to accumulate those into the output.
It seems like this is not thinking very "TBB-like", because we would be interacting directly with thread indices and the number of threads, which is discouraged. However, in this case I think it's important to control the allocation while not sacrificing the ability to use a small grain size for work stealing and load balancing.
Is there a more TBB-like way of doing this which would have similar control over the memory footprint?
Aha, this is what I think I was missing...
https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneTBB/source/thread_local_storage/enumerable_thread_specific_cls.rst
It may make it more difficult to determine the appropriate heuristic for choosing between algorithms, but I think this is the "TBB way" of doing this which I was missing.
Yes, in TBB there is `enumerable_thread_specific`. But even without TBB, something like this could be done with e.g. a hash map protected by a lock. A thread ID / native handle would be the key to the map; and if it is guaranteed that insertions do not invalidate iterators/references to the map nodes, the lock would only need to be taken for getting such a reference, which could then be used to update the content without data races.
I am not sure if it's better to have a generic implementation like I outlined above, or if a special "interface" with backend-specific implementations is necessary for performance reasons.
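The generic variant described above might look roughly like the following sketch (all names hypothetical). It relies on the standard guarantee that `std::unordered_map` insertions do not invalidate references to existing elements, so the lock is held only while locating or creating a thread's slot:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Lock-protected map from thread id to that thread's private histogram copy.
// The returned reference stays valid across later insertions, so bins can be
// updated outside the lock without data races.
class thread_local_histograms
{
public:
    explicit thread_local_histograms(std::size_t n_bins) : n_bins_(n_bins) {}

    // Callable concurrently from any worker thread.
    std::vector<std::uint64_t>& local()
    {
        std::lock_guard<std::mutex> guard(mutex_);
        return map_.try_emplace(std::this_thread::get_id(),
                                std::vector<std::uint64_t>(n_bins_, 0))
            .first->second; // reference is not invalidated by later insertions
    }

    // Called after all workers have joined: fold every private copy into out.
    void reduce_into(std::vector<std::uint64_t>& out)
    {
        for (const auto& kv : map_)
            for (std::size_t b = 0; b < n_bins_; ++b)
                out[b] += kv.second[b];
    }

private:
    std::size_t n_bins_;
    std::mutex mutex_;
    std::unordered_map<std::thread::id, std::vector<std::uint64_t>> map_;
};
```

A real implementation would presumably cache the `local()` reference once per task rather than re-locking per element, which is where backend-specific facilities like `enumerable_thread_specific` may still win on performance.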
For TBB, does it make sense to do something similar to this code?
This is acceptable, as it uses the upper limit on the number of threads, not an exact number of threads. Similarly, for OpenMP it is OK to query and use the number of threads in the "current" parallel region - it is not going to change during the algorithm execution.
### Implementation Two (Atomics)
This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the output histogram data, we can merely run a `parallel_for` pattern.
To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic<T>`, however, this can only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then copying it to the output data. `C++20` provides `std::atomic_ref<T>`, which would allow us to wrap user-provided output data in an atomic wrapper, but we cannot assume `C++20` for all users. We could look to implement our own `atomic_ref<T>` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic operations, but that is only available for the OpenMP backend.
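A minimal C++17 sketch of the `std::atomic<T>` variant described above (names are illustrative; a real version would use the backend's `parallel_for` rather than raw threads): since the user's output buffer is not atomic, we accumulate into a temporary array of atomic counters with relaxed increments and copy the result out at the end.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

// Atomic-increment histogram with even-width bins over [lo, hi).
void histogram_atomic(const std::vector<float>& input, float lo, float hi,
                      std::vector<std::uint64_t>& output /*pre-zeroed*/)
{
    const std::size_t n_bins = output.size();
    // std::atomic is not copyable, so a std::vector of atomics cannot be
    // resized; allocate a raw array and zero it explicitly.
    std::unique_ptr<std::atomic<std::uint64_t>[]> bins(
        new std::atomic<std::uint64_t>[n_bins]);
    for (std::size_t b = 0; b < n_bins; ++b)
        bins[b].store(0, std::memory_order_relaxed);

    auto worker = [&](std::size_t begin, std::size_t end) {
        const float scale = n_bins / (hi - lo);
        for (std::size_t i = begin; i < end; ++i)
            if (input[i] >= lo && input[i] < hi)
                bins[static_cast<std::size_t>((input[i] - lo) * scale)]
                    .fetch_add(1, std::memory_order_relaxed);
    };

    // Two workers stand in for a parallel_for over chunks of the input.
    const std::size_t mid = input.size() / 2;
    std::thread t(worker, static_cast<std::size_t>(0), mid);
    worker(mid, input.size());
    t.join();

    // The extra copy the text mentions: temporary atomics -> user output.
    for (std::size_t b = 0; b < n_bins; ++b)
        output[b] += bins[b].load(std::memory_order_relaxed);
}
```

The final copy loop is exactly the overhead that `std::atomic_ref<T>` (or an "atomic_cast") would avoid by operating on the user's buffer in place.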
> `C++17` provides `std::atomic<T>`, however, this can only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then copying it to the output data.

I guess you mean that if the output data type is not `std::atomic`, in C++17 there is no standard way to do atomic operations with that data.
> We could look to implement our own `atomic_ref<T>` for C++17, but that would require specialization for individual compilers.

I think we would not need a full-fledged `atomic_ref`; we would need something like an "atomic_cast", and, given that the output data are counters of integral type, that should not be a huge deal.
Yes, you are correct. I haven't taken a deep look into all the compiler / standard library implementations in our matrix at this point. It may be as simple as a reinterpret cast in most cases, but that isn't guaranteed by the standard to work. I agree this is worth looking at, and it may not be so bad.
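The "atomic_cast" idea being discussed might be sketched as below. To be clear, this is formally outside what the standard guarantees (it reinterprets a plain counter as `std::atomic`); the `static_assert`s merely check the layout and lock-freedom preconditions under which mainstream implementations are known to behave:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <type_traits>

// Hypothetical atomic_cast: view a plain integral counter as std::atomic<T>.
// Not sanctioned by the standard; the asserts document the assumptions that
// make it plausible on common compilers (matching size/alignment, lock-free).
template <typename T>
std::atomic<T>& atomic_cast(T& value)
{
    static_assert(std::is_integral<T>::value, "counters are integral");
    static_assert(sizeof(std::atomic<T>) == sizeof(T) &&
                      alignof(std::atomic<T>) == alignof(T),
                  "layout must match for the cast to be plausible");
    static_assert(std::atomic<T>::is_always_lock_free,
                  "a locking atomic could not alias a plain T");
    return *reinterpret_cast<std::atomic<T>*>(&value);
}
```

With something like this, `parallel_for` bricks could do `atomic_cast(output[bin]).fetch_add(1, std::memory_order_relaxed)` directly on the user's buffer, avoiding the temporary-plus-copy of the pure C++17 approach, at the cost of per-compiler validation.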
* Is it worthwhile to have separate implementations for TBB and OpenMP because they may differ in the best-performing implementation? What is the best heuristic for selecting between algorithms (if one is not the clear winner)?
* How will vectorized bricks perform, and in what situations will it be advantageous to use or not use vector instructions?
Thanks, these are interesting, and mainly follow similar lines to the approaches outlined here. I will look more into the pcwar SIMD instruction to understand if it is viable to use here.
While it's interesting to explore, I believe we basically depend on OpenMP SIMD to provide our SIMD operations. We won't be able to serve more complex intrinsic operations if we want to stick to that.
Further, I'm not actually finding anything like the pcwar instruction(s) they refer to in https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html. I do find some instructions for conflict detection across lanes which could be useful, but again those won't be available through the OpenMP interface.
I think our options are basically to decline SIMD or to keep duplicates of the output for each SIMD lane. Even that may be more "hands on" with SIMD details than we have been thus far in oneDPL.
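The lane-duplication option mentioned above can be illustrated with a scalar stand-in (names hypothetical). Each input position updates the bin copy belonging to its "lane", so lanes never collide on a bin; in a real SIMD brick the inner loop would be vectorized and `i % LANES` replaced by the actual vector-lane index:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Keep LANES copies of the bins so concurrent lane updates of the same bin
// cannot conflict, then fold the copies into the user's output.
template <int LANES>
void histogram_lane_replicated(const std::vector<std::uint32_t>& bin_indices,
                               std::vector<std::uint64_t>& output /*pre-zeroed*/)
{
    const std::size_t n_bins = output.size();
    std::vector<std::uint64_t> replicated(n_bins * LANES, 0);

    // Position i updates the copy for its lane (i % LANES).
    for (std::size_t i = 0; i < bin_indices.size(); ++i)
        ++replicated[(i % LANES) * n_bins + bin_indices[i]];

    // Reduce the per-lane copies into the output.
    for (int lane = 0; lane < LANES; ++lane)
        for (std::size_t b = 0; b < n_bins; ++b)
            output[b] += replicated[lane * n_bins + b];
}
```

The obvious cost is the `n_bins * LANES` temporary, which is why this only looks attractive for small numbers of bins.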
`histogram` for these host-side backends. The serial implementation is straightforward; we will add it, but there is not much to discuss within the RFC.
Well, in fact there might be something to discuss, and the implementation might not be as straightforward as you think. It likely depends on whether we implement vectorization for unseq. If we do, then it makes sense to use the pattern-brick approach as with other algorithms, and we'd need to define the serial and SIMD bricks with a common API and a serial pattern implementation that uses these bricks.
Also, there is the serial implementation of a parallel backend :) It does not create parallelism, but it has to handle all the policies, implementing the backend interfaces used by the top-level patterns. We do not have to support histogram there because we do not plan to "upstream" it to LLVM PSTL. But that would mean some extra dispatch in the top-level histogram pattern to detect which backend is used, so adding the mentioned support might be the lesser of two evils.
For the first point, it's true: our approach to SIMD is the key here. I'll put some more thought into this. Right now, the only option I see for making use of SIMD is to make more replicates of the temporary storage available. I think it is somewhat unique within oneDPL to have such a temporary allocation requirement in order to use a vector brick. Otherwise, histogram maps poorly to these instructions.
On the second point, my plan is to just support the serial backend.
I can expand upon this topic a bit in the RFC after some thought, and maybe some experimentation. I'm working through a first draft implementation, which I think is necessary for me to understand all the infrastructure details in full here.
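The pattern/brick split discussed above could look roughly like this sketch (all names are hypothetical, not the actual oneDPL internals): the serial pattern walks the input once and delegates per-element work to a brick, so an unseq dispatch could later substitute a SIMD brick with the same signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Serial histogram brick: bin_of maps an element to its bin index, returning
// bins.size() to mean "no bin" (out of range). A SIMD brick would share this
// signature and be selected by the pattern based on the execution policy.
template <typename InputIt, typename BinOf>
void serial_histogram_brick(InputIt first, InputIt last, BinOf bin_of,
                            std::vector<std::uint64_t>& bins)
{
    for (; first != last; ++first)
    {
        const std::size_t b = bin_of(*first);
        if (b < bins.size())
            ++bins[b];
    }
}
```

Parameterizing on `bin_of` also keeps the even-bin and custom-boundary variants of `histogram` behind one brick.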
Overall, this all sounds good enough for the "proposed" stage, where it's expected that some details are unknown and need to be determined. I am happy to approve it but will wait for a few days in case @danhoeflinger wants to update the document with some follow-up thoughts on the discussion.
Adds an RFC for histogram CPU implementation.