- When no placeholder of `_2` is passed, degrade to `transform_reduce`. An existing feature? This is for performance's sake, just like partial specializations in the C++ standard library.
- It's said that `make_tuple` and `get` can support only 10 components. Then we can write a more general version using a "tuple of tuples".
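The "tuple of tuples" idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: it uses `std::tuple` from standard C++17 as a stand-in for Thrust's tuple, and `flat_get` and `make_twelve` are hypothetical names introduced here. The sketch also assumes all inner tuples have the same size.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical helper (not part of Thrust or the standard library):
// flat access into a tuple whose elements are equally sized inner tuples.
// Flat index I maps to outer slot I / N and inner slot I % N, where N is
// the size of the first inner tuple.
template <std::size_t I, typename Outer>
constexpr decltype(auto) flat_get(Outer&& outer) {
    using First = std::tuple_element_t<0, std::decay_t<Outer>>;
    constexpr std::size_t N = std::tuple_size_v<First>;
    return std::get<I % N>(std::get<I / N>(std::forward<Outer>(outer)));
}

// Twelve components packed as two inner tuples of six, so neither level
// holds more than 10 components.
inline auto make_twelve() {
    return std::make_tuple(std::make_tuple(0, 1, 2, 3, 4, 5),
                           std::make_tuple(6, 7, 8, 9, 10, 11));
}
```

With an lvalue `auto t = make_twelve();`, `flat_get<7>(t)` refers to the second component of the second inner tuple, and the returned reference can be assigned through. Two levels of 10 components each would allow up to 100 flat components under the same scheme.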
The following is from the original README.
All samples make use of the runtime API.
This sample adds two vectors of floats.
This sample maps device pointers to pinned host memory so that kernels can directly read from and write to pinned host memory.
This sample measures host-to-device and device-to-host bandwidth over PCIe for pageable and pinned memory at four transfer sizes (3 KB, 15 KB, 15 MB and 100 MB), and outputs the results in CSV format.
This sample checks the return value of every runtime API call.
This sample uses shared memory to accelerate matrix multiplication.
This sample uses atomic functions, assertions and printf.
This sample uses asynchronous engines to overlap data transfer and kernel execution.
This sample uses multiple streams to overlap multiple kernel executions, a capability known as Hyper-Q.
This sample uses cudaSetDevice within a single thread to utilize multiple GPUs.
This sample uses OpenMP to create multiple CPU threads to utilize multiple GPUs.
This sample uses MPI to create multiple CPU processes to utilize multiple GPUs.
This sample uses cuBLAS, a CUDA implementation of BLAS (Basic Linear Algebra Subprograms), for matrix multiplication.
This sample uses Thrust, a CUDA library modeled on the C++ STL (Standard Template Library), for vector reduction.