
Porting some optimization cases to run on GPU without UVM #1086

Open · wants to merge 10 commits into base: master

Conversation

mcarlson801 (Collaborator):

This PR does the following:

  1. Adds Kokkos implementations for a number of scatter and gather specializations that are necessary for optimization cases to work on GPU when there is no unified virtual memory.
  2. Adds a Kokkos implementation of ResponseSquaredL2DifferenceSide, which is used in a number of optimization cases, so that it works in UVM-free GPU builds.
  3. Updates the scatter unit tests to use Thyra multivectors instead of vectors. My changes exposed downstream issues when casting Thyra vectors to multivectors, so these tests had to be updated.
  4. Makes some of the DOFManager's internal data available as Kokkos Views so that it is device accessible.

I'm going to do some performance profiling to get an idea of what impact these changes have on performance, but the code is ready for review in the meantime.
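For context on item 4, here is a minimal sketch of the general pattern of exposing host-side DOF data to device kernels without UVM via a Kokkos::DualView. The type and function names are hypothetical, not the actual DOFManager interface.

#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Hypothetical helper, not the real DOFManager API: (elem LID, node) -> dof LID.
using ElemDofLidsView = Kokkos::DualView<int**>;

ElemDofLidsView buildElemDofLids (const int num_elems, const int num_nodes) {
  ElemDofLidsView lids("elem_dof_lids", num_elems, num_nodes);
  // ... fill lids.h_view from the host-side DOF numbering ...
  lids.modify_host();
  lids.sync_device();   // after this, lids.d_view can be read inside device kernels
  return lids;
}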

@mcarlson801 self-assigned this Oct 31, 2024
@jewatkins (Collaborator) left a comment:

lgtm, thanks!

@bartgol (Collaborator) left a comment:

Looks good! But I have a few doubts and questions.

 const auto elem_LID = elem_lids(cell);
 const auto p_dof_lids = Kokkos::subview(p_elem_dof_lids,elem_LID,ALL);
 for (int node=0; node<num_deriv; ++node) {
   const LO lid = p_dof_lids(node);

   // Initialize Fad type for parameter value
-  const auto p_val = lid>=0 ? p_data[lid] : 0;
+  const auto p_val = lid>=0 ? p_data(lid) : 0;
   ParamScalarT v(num_deriv, node, p_val);
Collaborator:

Does this work on device, and is it performant? With certain fads, doesn't it allocate memory on device at every call?

@@ -330,6 +350,18 @@ evaluateFields(typename Traits::EvalData workset)
const int neq = sol_dof_mgr->getNumFields();
const int num_deriv = this->numNodes;
const bool trans = workset.transpose_dist_param_deriv;

if (Vp != Teuchos::null) {
Collaborator:

Indentation seems off here.

-const int cell = sideSet.ws_elem_idx.h_view(sideSet_idx);
+const int cell = sideSet.ws_elem_idx.d_view(sideSet_idx);

 ScalarT diff_1[8] = {0};
Collaborator:

Another case of an automatic tmp fad inside a kernel: are these ok? For SFad and SLFad, I think the answer is yes, but for generic Fad, I'm not sure. Someone with more Fad knowledge than me may know.

Collaborator:

That's a good point. With DFad it will create temporaries, and it could run out of memory.

Collaborator:

it's an issue with dfad on gpu but we don't use dfad.

Collaborator:

We do allow DFad during config though. If we don't want to allow DFad, we should make it clear and throw during config time.

Collaborator:

I'm not concerned about running out of memory. I'm more concerned with doing lots of small Cuda allocations at run time.

Collaborator:

yeah that makes sense. we should just set default slfad values (that work with testing) when running with cuda/hip/sycl instead of having those lines in every albany config. And error out if they are explicitly set to dfad.

when using sfad/slfad, they should be static allocations in local memory/registers. worst case, if the memory spills, it should be as performant as a read/write to global memory, but not as bad as a device malloc, which is what's attempted with dfad (unless we set up the memory pool thing).

my concern would actually be, how large is the derivative dimension in this scalar? I forgot we're talking about optimization, which could have a lot of derivative components... in which case, we may run out of memory...
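For reference, a minimal sketch (not Albany code) of the three Sacado scalar types being discussed. SFad and SLFad keep their derivative storage inline, so a kernel-local temporary like the ParamScalarT above stays in registers/local memory; DFad allocates its derivative array at run time, which on device means a per-thread allocation. The derivative size of 8 is an arbitrary placeholder.

#include <Kokkos_Core.hpp>
#include <Sacado.hpp>

using SFadType  = Sacado::Fad::SFad<double,8>;   // exactly 8 derivative components, inline storage
using SLFadType = Sacado::Fad::SLFad<double,8>;  // up to 8 components, length chosen at run time
using DFadType  = Sacado::Fad::DFad<double>;     // dynamically allocated derivative array

KOKKOS_INLINE_FUNCTION
double seed_and_eval (const int node, const double p_val) {
  // Static storage: cheap to construct inside a device kernel.
  SFadType v(8, node, p_val);   // value p_val, unit derivative w.r.t. component 'node'
  return v.val();
}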

this->global_response_eval(0) += sum*scaling;
}
this->local_response_eval(cell,0) = sum*scaling;
KU::atomic_add<ExecutionSpace>(&(this->global_response_eval(0)), sum*scaling);
Collaborator:

Wouldn't it be better to use parallel_reduce, rather than a parallel_for with atomic access? All threads access the same value, so contention is very high here...
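For a plain scalar accumulator, the reduction suggested here would look roughly like the sketch below (hypothetical names, not the evaluator's actual code; the real sum is over Fad types, which is the complication discussed next).

#include <Kokkos_Core.hpp>

double sum_response (const int num_cells, const double scaling) {
  double total = 0.0;
  Kokkos::parallel_reduce("sum_local_response", num_cells,
    KOKKOS_LAMBDA (const int cell, double& partial) {
      partial += 1.0 * scaling;   // stand-in for the per-cell contribution
    }, total);
  return total;   // added to the global response once, no per-thread atomics
}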

mcarlson801 (Collaborator, Author):

Yeah, this should definitely be a reduce. I ported this early on when I was just trying to get something to work. I'll fix it.

mcarlson801 (Collaborator, Author):

@bartgol Is there a trick to doing a parallel_reduce with Sacado FAD types? I created a very simple example program based on https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/Custom-Reductions-Built-In-Reducers-with-Custom-Scalar-Types.html and I'm getting compiler errors:

/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/sacado/example/my_custom_reduce_example.cpp(35): error: no instance of constructor "Kokkos::View<DataType, Properties...>::View [with DataType=ScalarT *, Properties=<Kokkos::CudaSpace, Kokkos::MemoryUnmanaged>]" matches the argument list
            argument types are: (ScalarT *, int)
          detected during:
            instantiation of "SumScalarT<ScalarT, Space>::result_view_type SumScalarT<ScalarT, Space>::view() const [with ScalarT=ScalarT, Space=Kokkos::CudaSpace]"
/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/kokkos/core/src/Kokkos_Parallel_Reduce.hpp(1525): here
            instantiation of "void Kokkos::Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute_impl(const std::string &, const PolicyType &, const FunctorType &, ReturnType &) [with PolicyType=Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>, FunctorType=lambda [](int, ValueType &)->void, ReturnType=SumScalarT<ScalarT, Kokkos::CudaSpace>]"
/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/kokkos/core/src/Kokkos_Parallel_Reduce.hpp(1542): here
            instantiation of "std::enable_if_t<<expression>, void> Kokkos::Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute(const std::string &, const PolicyType &, const FunctorType &, ReturnType &) [with PolicyType=Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>, FunctorType=lambda [](int, ValueType &)->void, ReturnType=SumScalarT<ScalarT, Kokkos::CudaSpace>, Dummy=SumScalarT<ScalarT, Kokkos::CudaSpace>]"
/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/kokkos/core/src/Kokkos_Parallel_Reduce.hpp(1798): here
            instantiation of "std::enable_if_t<<expression>, void> Kokkos::parallel_reduce(const size_t &, const FunctorType &, const ReturnType &) [with FunctorType=lambda [](int, ValueType &)->void, ReturnType=SumScalarT<ScalarT, Kokkos::CudaSpace>]"
(51): here

Collaborator:

Uhm, I am not sure. Maybe @etphipp has some advice.

tests/unit/evaluators/ScatterResidual.cpp (resolved conversation)