
update CUDA_CUBLAS #10

Open · wants to merge 8 commits into master
Conversation

@TeslaCoder (Contributor)

- Call the fast cuBLAS dgemm NT kernel (see the sketch below)
- Copy C before the copy+transpose of A and B (fix for the pipeline being blocked by the transpose kernel)
- Call transpose only when needed, otherwise copy the data directly to dest_image
- Destroy the cuBLAS handles and the CUDA context in ExitRuntime
- Remove the #ifdef from the header file to fix a segfault
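For reference, a minimal sketch of what the fast NT call amounts to; the function and variable names here are illustrative, not caldgemm's actual code:

```cpp
#include <cublas_v2.h>

// Hypothetical illustration of the fast "NT" path: C = alpha*A*B^T + beta*C,
// with A, B, C already resident on the device. cuBLAS is column-major, so
// CUBLAS_OP_N/CUBLAS_OP_T below is the NT kernel in cuBLAS terms
// (TN in row-major cblas terms).
void dgemm_nt_sketch(cublasHandle_t handle, int m, int n, int k,
                     const double* dA, int lda,   // A is m x k
                     const double* dB, int ldb,   // B is n x k, used as B^T
                     double* dC, int ldc)         // C is m x n
{
    const double alpha = 1.0, beta = 1.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,
                &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
}
```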

update to latest revision
- call the fast cuBLAS dgemm kernel NT (TN in cblas terms)
- move the copy of C before the copy+transpose of A and B
- only call transpose when needed, otherwise copy directly to dest_image
remove #ifdef CALDGEMM_CUDA_CUBLAS from caldgemm_cuda.h header file

With this #ifdef in place, if I use multiple GPUs per node, I get a segmentation fault after HPL finishes (during MPI_Finalize() in testing/ptest/HPL_pddriver.c).

It does not make sense to me; I tried many revisions of the files and searched for other possible causes, but this is the only change I found that solves the problem.

The segfault also creates problems for the NVIDIA profiler (nvprof), because its output is not flushed to disk if the program terminates abnormally. I will add "cudaThreadExit()" inside ExitRuntime() to destroy the CUDA context and force the profiler to flush to disk.
destroy cublas handles and cuda context in caldgemm_cuda::ExitRuntime()
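A minimal sketch of what that cleanup might look like; the handle array and count are hypothetical stand-ins for caldgemm_cuda's actual members, and note that cudaThreadExit() is deprecated in newer CUDA releases in favor of the equivalent cudaDeviceReset():

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical cleanup mirroring the commit above: destroy the cuBLAS
// handles first, then tear down the CUDA context so that profilers such
// as nvprof can flush their output to disk even if the host program
// later dies in MPI_Finalize().
void exit_runtime_sketch(cublasHandle_t* handles, int num_handles)
{
    for (int i = 0; i < num_handles; i++)
        cublasDestroy(handles[i]);

    // cudaThreadExit() is the call mentioned above; in current CUDA
    // versions it is deprecated and equivalent to cudaDeviceReset().
    cudaDeviceReset();
}
```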
@davidrohr (Owner)

Hi,

This looks better to me than the last approach. I am wondering about a few things:

1:
Why is the transpose kernel blocking the data transfer at all? There are multiple CUDA streams running in parallel; the data transfer of one stream and the kernels of another should be able to overlap. At least this works in the OpenCL version on non-NVIDIA GPUs.
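For context, the overlap expected here relies on the standard CUDA pattern sketched below; the kernel and buffer names are illustrative, and the host buffer must be pinned for the copy to actually run asynchronously:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the DGEMM / transpose work.
__global__ void some_kernel(double* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

// Hypothetical illustration of copy/kernel overlap: an async H2D copy on
// stream0 can run concurrently with a kernel on stream1, because the copy
// engine and the SMs are independent hardware units. The host buffer must
// be pinned (cudaMallocHost / cudaHostRegister) or the copy degrades to a
// synchronous transfer.
void overlap_sketch(const double* host_pinned, double* dev_a, double* dev_b,
                    size_t bytes, int n,
                    cudaStream_t stream0, cudaStream_t stream1)
{
    cudaMemcpyAsync(dev_a, host_pinned, bytes,
                    cudaMemcpyHostToDevice, stream0);            // transfer on stream 0
    some_kernel<<<(n + 255) / 256, 256, 0, stream1>>>(dev_b, n); // kernel on stream 1
}
```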

2:
Isn't there a similar problem for the transfer of the A and B matrices and the transpose kernel? That is, instead of copying C first, wouldn't it make more sense to just run the transpose kernels after copying all of A, B, and C?

3:
I don't see how the #ifdef in the header could cause a segfault.

@TeslaCoder (Contributor, Author)

  1. I could send you a picture from the visual profiler so you can see the problem. For example, the DGEMM kernel in one stream must overlap with small copy A + transpose kernel A + big copy C. The first part of the DGEMM kernel overlaps fine with small copy A, but then the transpose kernel waits for the DGEMM kernel to finish, so big copy C does not start until the DGEMM kernel is complete. On NVIDIA GPUs, ALL thread blocks of an active kernel are scheduled before blocks from another kernel, unless the other kernel has a higher priority. So two kernels with the same priority only overlap at the end of one and the start of the next, as the final thread blocks of the first drain and resources become available for the second. It's possible AMD GPUs have more complex kernel scheduling. (See the stream-priority sketch after this list.)
  2. Yes, it would make sense to do all copies first, but this requires more changes to the source code; moving the C copy first is simple and enough to prevent the problem in my tests. I can, however, make more extensive changes if you prefer all copies first.
  3. I also don't see any reason for this. If I use a different #define and define it in the header, it works, but a #define set in the makefiles gives me a segfault. I can ask someone to test on a different system to see whether it is reproducible.
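One possible way around the block-scheduling behaviour described in point 1, sketched under the assumption that the device supports stream priorities (compute capability 3.5 and later); the stream names are illustrative:

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch: give the short transpose kernel's stream a higher
// priority than the DGEMM stream, so that its pending thread blocks are
// scheduled ahead of queued DGEMM blocks and the big copy of C is not
// stalled behind the long-running DGEMM kernel.
void make_priority_streams(cudaStream_t* dgemm_stream,
                           cudaStream_t* transpose_stream)
{
    // Numerically lower values mean higher priority.
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStreamCreateWithPriority(dgemm_stream, cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(transpose_stream, cudaStreamNonBlocking,
                                 greatest);
}
```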

@davidrohr (Owner)

Hi,

Thanks for the update.

Indeed, such a screenshot would be interesting. Can you also send the CUDA profile file? I have Nsight installed, but no NVIDIA GPU at the moment.

  1. I see; that would be because CUDA won't run the transpose of stream 1 and the DGEMM of stream 2 in parallel. I will think about it when I find some time.
  2. I think I have an idea what the problem is. I will report back with a possible fix in some time.

By the way, out of curiosity: what system are you working on, and what are you using our HPL for?
Kind regards
David Rohr

(Sent from my mobile, excuse the typos)


@davidrohr (Owner)

Hi,

I have just committed a patch to the test branch that should fix the cuBLAS segfault in HPL.
I would appreciate it if you could test it.

Regards
David


add streams and events to expose parallelism between copies and transpose kernels
H2D copies now only wait for previous H2D copies or the previous DGEMM kernel for this obuffer
transpose kernels wait for their corresponding H2D copies
added streams and events to remove copy/transpose false dependencies
@TeslaCoder (Contributor, Author)

Thanks, David! The patch in the test branch fixes the segfault problem.

As for the transpose kernel blocking the copy of C, I committed changes that expose parallelism between the copy and transpose kernels, using additional streams for the H2D copies and CUDA events with cudaStreamWaitEvent to impose the dependencies.

(I create the CUDA events with cudaEventDisableTiming to reduce the overhead of storing timestamps, since these events will not be used to collect timing information.)
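A minimal sketch of that pattern; the stream, buffer, and kernel names are illustrative, not the actual caldgemm code:

```cpp
#include <cuda_runtime.h>

// Naive transpose kernel, standing in for caldgemm's actual one.
__global__ void transpose_kernel(double* dst, const double* src,
                                 int rows, int cols)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < cols)
        dst[c * rows + r] = src[r * cols + c];
}

// Hypothetical illustration of the dependency pattern in this commit:
// the transpose kernel waits only on its own H2D copy instead of being
// serialized with all other work on a single stream.
void copy_then_transpose(const double* host_src, double* dev_src,
                         double* dev_dst, size_t bytes, int rows, int cols,
                         cudaStream_t copy_stream, cudaStream_t kernel_stream)
{
    cudaEvent_t copy_done;
    // cudaEventDisableTiming skips the timestamp, reducing the overhead
    // of events that are used purely for synchronization.
    cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

    cudaMemcpyAsync(dev_src, host_src, bytes,
                    cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(copy_done, copy_stream);

    // The kernel stream waits only for this copy, not for unrelated work.
    cudaStreamWaitEvent(kernel_stream, copy_done, 0);
    dim3 block(32, 32);
    dim3 grid((cols + 31) / 32, (rows + 31) / 32);
    transpose_kernel<<<grid, block, 0, kernel_stream>>>(dev_dst, dev_src,
                                                        rows, cols);

    cudaEventDestroy(copy_done);  // destruction is deferred until completion
}
```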

I also noticed that, since we use the DGEMM NT kernel, no transpose is required for the update DGEMM if P>1 and Q>1. The fix still gives an improvement that I can see in the profiler, because the H2D copies of A/B can now overlap the D2H copy of C, where before all operations were forced to be serialized on the same stream.

I also tried the AlternateLookahead again. I can see there is less of a gap between updates, especially towards the end of the run, which results in improved performance, but the result is incorrect. Do you have any idea about this? I guess some synchronization is missing?

@TeslaCoder (Contributor, Author)

One last note: although the fix in the hpl-gpu test branch resolved the segfault at the end of HPL, other recent changes in hpl-gpu cause a segfault when starting HPL, at MPI_Barrier. When I replace all files in hpl-gpu/testing/ptest with my older versions, it works.

@davidrohr (Owner)

Hi,

I have pushed a new stable version to the repository.
It has several fixes and improvements that might be relevant for you:

  • The segfault you reported is fixed.
  • Transposition kernels are moved after the transfers, for better performance in CUDA.
  • AlternateLookahead is fixed with CUDA.
  • I have ported several new GPU queue scheduling schemes from the OpenCL version to the CUDA version, which you might want to try. Some of them also use different streams for transfers and kernels.

I have set up a system with one Titan GPU here for some CUDA tests; it is much less elaborate than your system.
On my system, the different scheduling schemes do not improve performance, and one is even slower.
You can enable them with the caldgemm command line options "-Oq", "-Oq -OQ", and "-Oq -OQ -OM".

As I have no large test setup here, I am very interested in your feedback, in particular on the DGEMM kernel performance, single- and multi-GPU DGEMM system performance, as well as a log of an HPL run with the HPL_VERBOSE=3 setting.

Regards
David

