
Porting some optimization cases to run on GPU without UVM #1086

Open · wants to merge 10 commits into base: master

Conversation

mcarlson801 (Collaborator):

This PR does the following:

  1. Adds Kokkos implementations for a number of scatter and gather specializations that are necessary for optimization cases to work on GPU when there is no unified virtual memory.
  2. Adds a Kokkos implementation of ResponseSquaredL2DifferenceSide, which is used in a number of optimization cases, so that it works in UVM-free GPU builds.
  3. Updates the scatter unit tests to use Thyra multivectors instead of vectors. My changes exposed downstream issues when casting Thyra vectors to multivectors, so these tests had to be updated.
  4. Makes some of the DOFManager's internal data available as Kokkos Views so that it is device accessible.

I'm going to do some performance profiling to get an idea of what impact these changes have on performance, but the code is ready for review in the meantime.
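For context on item 4, here is a minimal sketch of the general pattern of exposing host-side DOF data to device kernels without UVM via a Kokkos::DualView. The type and function names are hypothetical, not the actual DOFManager interface.

#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Hypothetical helper, not the real DOFManager API: (elem LID, node) -> dof LID.
using ElemDofLidsView = Kokkos::DualView<int**>;

ElemDofLidsView buildElemDofLids (const int num_elems, const int num_nodes) {
  ElemDofLidsView lids("elem_dof_lids", num_elems, num_nodes);
  // ... fill lids.h_view from the host-side DOF numbering ...
  lids.modify_host();
  lids.sync_device();   // after this, lids.d_view can be read inside device kernels
  return lids;
}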

@mcarlson801 self-assigned this Oct 31, 2024
@jewatkins (Collaborator) left a comment:

lgtm, thanks!

@bartgol (Collaborator) left a comment:

Looks good! But I have a few doubts and questions.

 const auto elem_LID = elem_lids(cell);
 const auto p_dof_lids = Kokkos::subview(p_elem_dof_lids,elem_LID,ALL);
 for (int node=0; node<num_deriv; ++node) {
   const LO lid = p_dof_lids(node);

   // Initialize Fad type for parameter value
-  const auto p_val = lid>=0 ? p_data[lid] : 0;
+  const auto p_val = lid>=0 ? p_data(lid) : 0;
   ParamScalarT v(num_deriv, node, p_val);
Collaborator:

Does this work on device, and is it performant? With certain fads, doesn't it allocate memory on device at every call?

@@ -330,6 +350,18 @@ evaluateFields(typename Traits::EvalData workset)
const int neq = sol_dof_mgr->getNumFields();
const int num_deriv = this->numNodes;
const bool trans = workset.transpose_dist_param_deriv;

if (Vp != Teuchos::null) {
Collaborator:

Indentation seems off here.

-const int cell = sideSet.ws_elem_idx.h_view(sideSet_idx);
+const int cell = sideSet.ws_elem_idx.d_view(sideSet_idx);

 ScalarT diff_1[8] = {0};
Collaborator:

Another case of an automatic tmp fad inside a kernel: are these ok? For SFad and SLFad, I think the answer is yes, but for generic Fad, I'm not sure. Someone with more Fad knowledge than me may know.

Collaborator:

That's a good point. With DFad it will create temporaries, and it could run out of memory.

Collaborator:

it's an issue with dfad on gpu but we don't use dfad.

Collaborator:

We do allow DFad during config though. If we don't want to allow DFad, we should make it clear and throw during config time.

Collaborator:

I'm not concerned about running out of memory. I'm more concerned with doing lots of small Cuda allocations at run time.

Collaborator:

yeah that makes sense. we should just set default slfad values (that work with testing) when running with cuda/hip/sycl instead of having those lines in every albany config. And error out if they are explicitly set to dfad.

when using sfad/slfad, they should be static allocations in local memory/registers. worst case, if the memory spills, it should be as performant as a read/write to global memory, but not as bad as a device malloc, which is what's attempted with dfad (unless we set up the memory pool thing).

my concern would actually be, how large is the derivative dimension in this scalar? I forgot we're talking about optimization, which could have a lot of derivative components... in which case, we may run out of memory...
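For reference, a minimal sketch (not Albany code) of the three Sacado scalar types being discussed. SFad and SLFad keep their derivative storage inline, so a kernel-local temporary like the ParamScalarT above stays in registers/local memory; DFad allocates its derivative array at run time, which on device means a per-thread allocation. The derivative size of 8 is an arbitrary placeholder.

#include <Kokkos_Core.hpp>
#include <Sacado.hpp>

using SFadType  = Sacado::Fad::SFad<double,8>;   // exactly 8 derivative components, inline storage
using SLFadType = Sacado::Fad::SLFad<double,8>;  // up to 8 components, length chosen at run time
using DFadType  = Sacado::Fad::DFad<double>;     // dynamically allocated derivative array

KOKKOS_INLINE_FUNCTION
double seed_and_eval (const int node, const double p_val) {
  // Static storage: cheap to construct inside a device kernel.
  SFadType v(8, node, p_val);   // value p_val, unit derivative w.r.t. component 'node'
  return v.val();
}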

this->global_response_eval(0) += sum*scaling;
}
this->local_response_eval(cell,0) = sum*scaling;
KU::atomic_add<ExecutionSpace>(&(this->global_response_eval(0)), sum*scaling);
Collaborator:

Wouldn't it be better to use parallel_reduce, rather than a parallel_for with atomic access? All threads access the same value, so contention is very high here...
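For a plain scalar accumulator, the reduction suggested here would look roughly like the sketch below (hypothetical names, not the evaluator's actual code; the real sum is over Fad types, which is the complication discussed next).

#include <Kokkos_Core.hpp>

double sum_response (const int num_cells, const double scaling) {
  double total = 0.0;
  Kokkos::parallel_reduce("sum_local_response", num_cells,
    KOKKOS_LAMBDA (const int cell, double& partial) {
      partial += 1.0 * scaling;   // stand-in for the per-cell contribution
    }, total);
  return total;   // added to the global response once, no per-thread atomics
}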

mcarlson801 (Collaborator, Author):

Yeah, this should definitely be a reduce. I ported this early on when I was just trying to get something to work. I'll fix it.

mcarlson801 (Collaborator, Author):

@bartgol Is there a trick to doing a parallel_reduce with Sacado FAD types? I created a very simple example program based on https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/Custom-Reductions-Built-In-Reducers-with-Custom-Scalar-Types.html and I'm getting compiler errors:

/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/sacado/example/my_custom_reduce_example.cpp(35): error: no instance of constructor "Kokkos::View<DataType, Properties...>::View [with DataType=ScalarT *, Properties=<Kokkos::CudaSpace, Kokkos::MemoryUnmanaged>]" matches the argument list
            argument types are: (ScalarT *, int)
          detected during:
            instantiation of "SumScalarT<ScalarT, Space>::result_view_type SumScalarT<ScalarT, Space>::view() const [with ScalarT=ScalarT, Space=Kokkos::CudaSpace]"
/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/kokkos/core/src/Kokkos_Parallel_Reduce.hpp(1525): here
            instantiation of "void Kokkos::Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute_impl(const std::string &, const PolicyType &, const FunctorType &, ReturnType &) [with PolicyType=Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>, FunctorType=lambda [](int, ValueType &)->void, ReturnType=SumScalarT<ScalarT, Kokkos::CudaSpace>]"
/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/kokkos/core/src/Kokkos_Parallel_Reduce.hpp(1542): here
            instantiation of "std::enable_if_t<<expression>, void> Kokkos::Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute(const std::string &, const PolicyType &, const FunctorType &, ReturnType &) [with PolicyType=Kokkos::RangePolicy<Kokkos::DefaultExecutionSpace>, FunctorType=lambda [](int, ValueType &)->void, ReturnType=SumScalarT<ScalarT, Kokkos::CudaSpace>, Dummy=SumScalarT<ScalarT, Kokkos::CudaSpace>]"
/pscratch/sd/m/mcarlson/IntroToHPC/repos/trilinos-lite/packages/kokkos/core/src/Kokkos_Parallel_Reduce.hpp(1798): here
            instantiation of "std::enable_if_t<<expression>, void> Kokkos::parallel_reduce(const size_t &, const FunctorType &, const ReturnType &) [with FunctorType=lambda [](int, ValueType &)->void, ReturnType=SumScalarT<ScalarT, Kokkos::CudaSpace>]"
(51): here

Collaborator:

Uhm, I am not sure. Maybe @etphipp has some advice.

tests/unit/evaluators/ScatterResidual.cpp (resolved conversation)