
update CUDA_CUBLAS #10

Open · wants to merge 8 commits into master
Conversation

@TeslaCoder (Contributor)

- Call the fast cuBLAS dgemm NT kernel (see the sketch below)
- Copy C before the copy+transpose of A and B (fix for the pipeline being blocked by the transpose kernel)
- Call transpose only when needed, otherwise copy the data directly to dest_image
- Destroy the cuBLAS handles and the CUDA context in ExitRuntime
- Remove the #ifdef from the header file to fix a segfault
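For reference, a minimal sketch of what the fast NT call amounts to; the function and variable names here are illustrative, not caldgemm's actual code:

```cpp
#include <cublas_v2.h>

// Hypothetical illustration of the fast "NT" path: C = alpha*A*B^T + beta*C,
// with A, B, C already resident on the device. cuBLAS is column-major, so
// CUBLAS_OP_N/CUBLAS_OP_T below is the NT kernel in cuBLAS terms
// (TN in row-major cblas terms).
void dgemm_nt_sketch(cublasHandle_t handle, int m, int n, int k,
                     const double* dA, int lda,   // A is m x k
                     const double* dB, int ldb,   // B is n x k, used as B^T
                     double* dC, int ldc)         // C is m x n
{
    const double alpha = 1.0, beta = 1.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,
                &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
}
```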

update to latest revision
- call the fast cuBLAS dgemm kernel NT (TN in cblas terms)
- move the copy of C before the copy+transpose of A and B
- only call transpose when needed, otherwise copy directly to dest_image
remove #ifdef CALDGEMM_CUDA_CUBLAS from caldgemm_cuda.h header file

With this #ifdef in place, if I use multiple GPUs per node, I get a segmentation fault after HPL finishes (during MPI_Finalize() in testing/ptest/HPL_pddriver.c).

It does not make sense to me; I tried many revisions of the files and searched for other possible causes, but this is the only change I found that solves the problem.

The segfault also creates problems for the NVIDIA profiler (nvprof), because its output is not flushed to disk if the program terminates abnormally. I will add "cudaThreadExit()" inside ExitRuntime() to destroy the CUDA context and force the profiler to flush to disk.
destroy cublas handles and cuda context in caldgemm_cuda::ExitRuntime()
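A minimal sketch of what that cleanup might look like; the handle array and count are hypothetical stand-ins for caldgemm_cuda's actual members, and note that cudaThreadExit() is deprecated in newer CUDA releases in favor of the equivalent cudaDeviceReset():

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical cleanup mirroring the commit above: destroy the cuBLAS
// handles first, then tear down the CUDA context so that profilers such
// as nvprof can flush their output to disk even if the host program
// later dies in MPI_Finalize().
void exit_runtime_sketch(cublasHandle_t* handles, int num_handles)
{
    for (int i = 0; i < num_handles; i++)
        cublasDestroy(handles[i]);

    // cudaThreadExit() is the call mentioned above; in current CUDA
    // versions it is deprecated and equivalent to cudaDeviceReset().
    cudaDeviceReset();
}
```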
@davidrohr (Owner)

Hi,

This looks better to me than the last approach. I am wondering about a few things:

1:
Why is the transpose kernel blocking the data transfer at all? There are multiple CUDA streams running in parallel; the data transfer of one stream and the kernels of another should be able to overlap. At least this works in the OpenCL version on non-NVIDIA GPUs.
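For context, the overlap expected here relies on the standard CUDA pattern sketched below; the kernel and buffer names are illustrative, and the host buffer must be pinned for the copy to actually run asynchronously:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the DGEMM / transpose work.
__global__ void some_kernel(double* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

// Hypothetical illustration of copy/kernel overlap: an async H2D copy on
// stream0 can run concurrently with a kernel on stream1, because the copy
// engine and the SMs are independent hardware units. The host buffer must
// be pinned (cudaMallocHost / cudaHostRegister) or the copy degrades to a
// synchronous transfer.
void overlap_sketch(const double* host_pinned, double* dev_a, double* dev_b,
                    size_t bytes, int n,
                    cudaStream_t stream0, cudaStream_t stream1)
{
    cudaMemcpyAsync(dev_a, host_pinned, bytes,
                    cudaMemcpyHostToDevice, stream0);            // transfer on stream 0
    some_kernel<<<(n + 255) / 256, 256, 0, stream1>>>(dev_b, n); // kernel on stream 1
}
```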

2:
Isn't there a similar problem for the transfer of the A and B matrices and the transpose kernel? That is, instead of copying C first, wouldn't it make more sense to just run the transpose kernels after copying all of A, B, and C?

3:
I don't see how the #ifdef in the header could cause a segfault.

@TeslaCoder (Contributor, Author)

  1. I could send you a picture from the visual profiler so you can see the problem. For example, the DGEMM kernel in one stream must overlap with small copy A + transpose kernel A + big copy C. The first part of the DGEMM kernel overlaps fine with small copy A, but then the transpose kernel waits for the DGEMM kernel to finish, so big copy C does not start until the DGEMM kernel is complete. On NVIDIA GPUs, ALL thread blocks of an active kernel are scheduled before blocks from another kernel, unless the other kernel has a higher priority. So two kernels with the same priority only overlap at the end of one and the start of the next, as the final thread blocks of the first drain and resources become available for the second. It's possible AMD GPUs have more complex kernel scheduling. (See the stream-priority sketch after this list.)
  2. Yes, it would make sense to do all copies first, but this requires more changes to the source code; moving the C copy first is simple and enough to prevent the problem in my tests. I can, however, make more extensive changes if you prefer all copies first.
  3. I also don't see any reason for this. If I use a different #define and define it in the header, it works, but a #define set in the makefiles gives me a segfault. I can ask someone to test on a different system to see whether it is reproducible.
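One possible way around the block-scheduling behaviour described in point 1, sketched under the assumption that the device supports stream priorities (compute capability 3.5 and later); the stream names are illustrative:

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch: give the short transpose kernel's stream a higher
// priority than the DGEMM stream, so that its pending thread blocks are
// scheduled ahead of queued DGEMM blocks and the big copy of C is not
// stalled behind the long-running DGEMM kernel.
void make_priority_streams(cudaStream_t* dgemm_stream,
                           cudaStream_t* transpose_stream)
{
    // Numerically lower values mean higher priority.
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStreamCreateWithPriority(dgemm_stream, cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(transpose_stream, cudaStreamNonBlocking,
                                 greatest);
}
```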

@davidrohr (Owner)

Hi,

Thanks for the update.

Indeed, such a screenshot would be interesting. Can you also send the CUDA profile file? I have Nsight installed, but no NVIDIA GPU at the moment.

  1. I see; that would be because CUDA won't run the transpose of stream 1 and the DGEMM of stream 2 in parallel. I will think about it when I find some time.
  2. I think I have an idea what the problem is. I will report back with a possible fix in some time.

By the way, out of curiosity: what system are you working on, and what are you using our HPL for?
Kind regards
David Rohr

(Sent from my mobile, excuse the typos)


@davidrohr (Owner)

Hi,

I have just committed a patch to the test branch that should fix the cuBLAS segfault in HPL.
I would appreciate it if you could test it.

Regards
David


add streams and events to expose parallelism between copies and transpose kernels
H2D copies now only wait for previous H2D copies or the previous DGEMM kernel for this obuffer
transpose kernels wait for their corresponding H2D copies
added streams and events to remove copy/transpose false dependencies
@TeslaCoder (Contributor, Author)

Thanks, David! The patch in the test branch fixes the segfault problem.

As for the transpose kernel blocking the copy of C, I committed changes that expose parallelism between the copy and transpose kernels, using additional streams for the H2D copies and CUDA events with cudaStreamWaitEvent to impose the dependencies.

(I create the CUDA events with cudaEventDisableTiming to reduce the overhead of storing timestamps, since these events will not be used to collect timing information.)
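A minimal sketch of that pattern; the stream, buffer, and kernel names are illustrative, not the actual caldgemm code:

```cpp
#include <cuda_runtime.h>

// Naive transpose kernel, standing in for caldgemm's actual one.
__global__ void transpose_kernel(double* dst, const double* src,
                                 int rows, int cols)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < cols)
        dst[c * rows + r] = src[r * cols + c];
}

// Hypothetical illustration of the dependency pattern in this commit:
// the transpose kernel waits only on its own H2D copy instead of being
// serialized with all other work on a single stream.
void copy_then_transpose(const double* host_src, double* dev_src,
                         double* dev_dst, size_t bytes, int rows, int cols,
                         cudaStream_t copy_stream, cudaStream_t kernel_stream)
{
    cudaEvent_t copy_done;
    // cudaEventDisableTiming skips the timestamp, reducing the overhead
    // of events that are used purely for synchronization.
    cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

    cudaMemcpyAsync(dev_src, host_src, bytes,
                    cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(copy_done, copy_stream);

    // The kernel stream waits only for this copy, not for unrelated work.
    cudaStreamWaitEvent(kernel_stream, copy_done, 0);
    dim3 block(32, 32);
    dim3 grid((cols + 31) / 32, (rows + 31) / 32);
    transpose_kernel<<<grid, block, 0, kernel_stream>>>(dev_dst, dev_src,
                                                        rows, cols);

    cudaEventDestroy(copy_done);  // destruction is deferred until completion
}
```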

I also noticed that, since we use the DGEMM NT kernel, no transpose is required for the update DGEMM if P>1 and Q>1. The fix still gives an improvement that I can see in the profiler, because the H2D copies of A/B can now overlap the D2H copy of C, where before all operations were forced to be serialized on the same stream.

I also tried the AlternateLookahead again. I can see there is less of a gap between updates, especially towards the end of the run, which results in improved performance, but the result is incorrect. Do you have any idea about this? I guess some synchronization is missing?

@TeslaCoder (Contributor, Author)

One last note: although the fix in the hpl-gpu test branch resolved the segfault at the end of HPL, other recent changes in hpl-gpu cause a segfault when starting HPL, at MPI_Barrier. When I replace all files in hpl-gpu/testing/ptest with my older versions, it works.

@davidrohr (Owner)

Hi,

I have pushed a new stable version to the repository.
It has several fixes and improvements that might be relevant for you:

  • The segfault you reported is fixed.
  • Transposition kernels are moved after the transfers, for better performance in CUDA.
  • AlternateLookahead is fixed with CUDA.
  • I have ported several new GPU queue scheduling schemes from the OpenCL version to the CUDA version, which you might want to try. Some of them also use different streams for transfers and kernels.

I have set up a system with one Titan GPU here for some CUDA tests; it is much less elaborate than your system.
On my system, the different scheduling schemes do not improve performance, and one is even slower.
You can enable them with the caldgemm command line options "-Oq", "-Oq -OQ", and "-Oq -OQ -OM".

As I have no large test setup here, I am very interested in your feedback, in particular on the DGEMM kernel performance, single- and multi-GPU DGEMM system performance, as well as a log of an HPL run with the HPL_VERBOSE=3 setting.

Regards
David

