does SORTPAIRS need to use reserve+emplace_back? #262
Comments
You are right, emplace_back is not parallel safe. Unfortunately resize will value-initialize items in memory sequentially, which zeroes the storage for trivially constructible types. We really want to do something like uninitialized_fill with std::execution::par, except that we need to generate the elements based on their index. I don't see a way to fill a vector in that way with a parallel policy. We may have to allocate uninitialized storage and then use for_each with std::execution::par to initialize the memory with placement new/std::construct_at.
Hi @jeffhammond! Here is how I would rewrite that above godbolt example.

```cpp
#include <algorithm>
#include <array>
#include <concepts>
#include <execution>
#include <iostream>
#include <memory>
#include <ranges>
#include <tuple>
#include <type_traits>
#include <utility>

using double_and_int = std::tuple<double, int>;

template<class ElementType, std::integral Size, class ElementFunction>
requires(std::is_invocable_r_v<ElementType, ElementFunction, Size>)
void uninitialized_generate_n(ElementType raw_array[], const Size num_elements, ElementFunction&& f)
{
    auto range = std::views::iota(Size(0), num_elements);
    std::for_each(std::execution::par, std::ranges::begin(range), std::ranges::end(range),
        [raw_array, &f](const Size k) { std::construct_at(raw_array + k, f(k)); });
}

template<std::integral Size, class ElementFunction>
requires(std::is_invocable_v<ElementFunction, Size>)
std::unique_ptr< decltype(std::declval<ElementFunction>()(Size{}))[] >
create_array_and_generate_elements(const Size num_elements, ElementFunction&& f)
{
    using element_type = decltype(std::declval<ElementFunction>()(Size{}));
    auto arr = std::make_unique<element_type[]>(num_elements);
    uninitialized_generate_n(arr.get(), num_elements, std::forward<ElementFunction>(f));
    return arr;
}

int main()
{
    constexpr std::size_t N = 5;
    std::array<double, N> x{5.0, 1.0, 4.0, 2.0, 3.0};
    std::array<int, N> i{5, 1, 4, 2, 3};

    auto aos = create_array_and_generate_elements(N, [=](std::size_t k) { return std::tuple{x[k], i[k]}; });
    for (std::size_t k = 0; k < N; ++k) {
        std::cout << "k: {" << std::get<0>(aos[k]) << ", " << std::get<1>(aos[k]) << "}\n";
    }
    return 0;
}
```

The parallel ranges experience is not very good, though: https://godbolt.org/z/1Gadz7M5r. MSVC claims that
Sorry, but why is this bad? Because initialization faults pages and causes problems with NUMA (including GPU migratable memory)? Or because sequential initialization is slow? Given that

And finally: while this pattern might be a bottleneck in RAJAPerf, how representative is it of LLNL applications to have the allocation and initialization of a vector of pairs on the critical path?
Those are both things that I would think of, but I don't have the most experience with CPU threading, so I'll take your word for it that they're not that bad. We would probably run at least one MPI process per NUMA domain, so that may not be an issue anyway. I don't know a lot about how hardware acceleration of zeroing works; does that get accelerated if there is padding that isn't zeroed? How much does the factor of P affect the calculation of whether an extra O(N) matters or not?

I'm not aware of any places where sort pairs is a significant performance bottleneck for LLNL applications, so effort would probably be better spent elsewhere.
@mhoemmen Do you know if uninitialized_generate was proposed for the STL? I also planned to use
Actually, isn't it undefined behavior to
@MrBurmark wrote:
That's a good point, thank you! I had forgotten that

The fix is to use custom allocation and deallocation (reusing `uninitialized_generate_n` from the earlier snippet):

```cpp
#include <concepts>
#include <cstdlib>
#include <memory>
#include <type_traits>
#include <utility>

template<class ElementType>
requires(std::is_trivially_destructible_v<ElementType>)
struct deleter {
    void operator()(ElementType* p) { std::free(p); }
};

// Returns storage that is allocated but NOT initialized, unlike
// std::make_unique<ElementType[]>, which value-initializes.
template<class ElementType>
std::unique_ptr<ElementType[], deleter<ElementType>>
allocate_uninitialized_array(const std::size_t size)
{
    return {
        reinterpret_cast<ElementType*>(std::malloc(size * sizeof(ElementType))),
        deleter<ElementType>{}
    };
}

template<std::integral Size, class ElementFunction>
requires(std::is_invocable_v<ElementFunction, Size>)
auto create_array_and_generate_elements(const Size num_elements, ElementFunction&& f)
{
    using element_type = decltype(std::declval<ElementFunction>()(Size{}));
    auto arr = allocate_uninitialized_array<element_type>(num_elements);
    uninitialized_generate_n(arr.get(), num_elements, std::forward<ElementFunction>(f));
    return arr;
}
```
I am working on the C++17 parallel version of SORTPAIRS, and I cannot use `emplace_back` there. I find that I can switch from `reserve` to `resize` and use simple assignment instead of `emplace_back`, but I don't know if this is consistent with the intent.

Unfortunately, I don't see any non-RAJA parallel references to guide me here. The OMP and CUDA parallel versions just call the `RAJA::sort_pairs` method.

If it is allowed to use `resize`+(assign) instead, I'll make that change in the sequential implementation, independent of what I'm doing with C++ parallelism. Thanks!