update CUDA_CUBLAS #10
base: master
Conversation
update to latest revision
- call fast cublas dgemm kernel NT (TN in cblas terms)
- move copy C before copy+trans A+B
- only call transpose when needed, otherwise copy directly to dest_image
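As a minimal sketch of such an NT call (opA = N, opB = T): the wrapper name, dimensions, and scaling factors below are placeholders, not caldgemm's actual code. cuBLAS assumes column-major storage, which is why this call corresponds to a TN GEMM in (row-major) cblas terms:

```c
#include <cublas_v2.h>

// Hedged sketch: DGEMM C = alpha * A * B^T + beta * C via cuBLAS.
// Matrices are column-major; names and constants are illustrative only.
void dgemm_nt(cublasHandle_t handle, int m, int n, int k,
              const double *A, int lda,   // m x k
              const double *B, int ldb,   // n x k, accessed transposed
              double *C, int ldc)         // m x n
{
    const double alpha = 1.0, beta = 1.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc);
}
```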
remove #ifdef CALDGEMM_CUDA_CUBLAS from caldgemm_cuda.h header file

With this #ifdef in place, if I use multiple GPUs per node, HPL gives a segmentation fault after it has finished (during MPI_Finalize() in testing/ptest/HPL_pddriver.c). It does not make sense to me, but I tried many revisions of the files and searched for other possible causes, and this is the only place I found that solves the problem. The seg fault also creates problems for the NVIDIA profiler (nvprof), because its output is not flushed to disk if the program terminates abnormally. I will add cudaThreadExit() inside ExitRuntime() to destroy the CUDA context and force the profiler to flush to disk.
destroy cublas handles and cuda context in caldgemm_cuda::ExitRuntime()
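A sketch of the teardown these two changes describe; the handle array and its size are illustrative names, not caldgemm's actual members, and cudaThreadExit() is the call named above (it is deprecated in later CUDA releases in favor of cudaDeviceReset()):

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hedged sketch of runtime teardown; cublas_handles/num_handles are
// placeholder names for per-stream cuBLAS handles.
void exit_runtime(cublasHandle_t *cublas_handles, int num_handles)
{
    for (int i = 0; i < num_handles; i++)
        cublasDestroy(cublas_handles[i]);   // destroy the cuBLAS handles

    // Destroy the CUDA context so profilers like nvprof flush their output
    // to disk; cudaDeviceReset() is the non-deprecated equivalent.
    cudaThreadExit();
}
```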
Hi, this looks better to me than the last approach. I am wondering about a couple of things:
1: Isn't there a similar problem for the transfer of the A and B matrices and the transpose kernel? I.e., instead of copying C first, wouldn't it make more sense to just run the transpose kernels after copying all of A, B, and C?
I don't see any reason how the #ifdef in the header could cause a segfault.
Hi, thanks for the update: indeed, such a screenshot would be interesting. Can you also send the CUDA profile file? I have Nsight installed but no NVIDIA GPU at the moment.
Btw, out of curiosity: what system are you working on, and what are you using our HPL for? (Sent from my mobile, excuse the typos) On September 28, 2015 8:40:33 AM GMT+09:00, TeslaCoder [email protected] wrote:
Hi, I have just committed a patch to the test branch that should fix the cublas segfault in HPL. Regards On 28.09.2015 01:40, TeslaCoder wrote:
add streams and events to expose parallelism between copies and transpose kernels

H2D copies now only wait for previous H2D copies, or for the previous DGEMM kernel of this obuffer; transpose kernels wait for their corresponding H2D copies
added streams and events to remove copy/transpose false dependencies
Thanks David! The patch in the test branch fixes the seg fault problem.

As for the transpose kernel blocking the copy of C: I committed changes to expose parallelism between the copies and the transpose kernels, using additional streams for the H2D copies, and CUDA events with cudaStreamWaitEvent to impose the dependencies. (I used CUDA events created with "disableTiming" to reduce the overhead of storing timestamps, since these events will not be used to collect timing information.)

I also noticed that, since we use the DGEMM NT kernel, no transpose is required for the update DGEMM when P>1 and Q>1, but the fix still gives an improvement I can see in the profiler, because now the H2D copies of A/B can overlap the D2H copy of C, whereas before all operations were forced to be serialized (all were on the same stream).

I tried again to use the alternative lookahead. I can see there is less gap between updates, especially towards the end of the run, which results in improved performance, but the result is incorrect. Do you have any idea about this? I guess there is some synchronization missing?
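A sketch of the pattern described here, with placeholder stream, buffer, and kernel names (the transpose kernel below is a naive stand-in, not caldgemm's): an event created with cudaEventDisableTiming is recorded on the copy stream and waited on by the transpose stream, so the transpose depends only on its own H2D copy instead of on everything in one serial stream:

```c
#include <cuda_runtime.h>

// Naive placeholder transpose kernel (rows x cols, row-major in, transposed out).
__global__ void transpose_kernel(double *dst, const double *src, int rows, int cols)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < cols && y < rows) dst[x * rows + y] = src[y * cols + x];
}

// Hedged sketch: let the H2D copy overlap other streams' work, and make the
// transpose wait only on its corresponding copy via an event.
void copy_then_transpose(double *dA, double *dA_T, const double *hA,
                         int rows, int cols,
                         cudaStream_t copyStream, cudaStream_t transposeStream)
{
    // disableTiming avoids the overhead of storing timestamps, since the
    // event is used only for ordering, not for timing.
    cudaEvent_t copyDone;
    cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming);

    // H2D copy of A on its own stream, so it can overlap e.g. the D2H copy of C.
    cudaMemcpyAsync(dA, hA, sizeof(double) * rows * cols,
                    cudaMemcpyHostToDevice, copyStream);
    cudaEventRecord(copyDone, copyStream);

    // The transpose kernel waits only on its corresponding H2D copy.
    cudaStreamWaitEvent(transposeStream, copyDone, 0);
    dim3 block(16, 16), grid((cols + 15) / 16, (rows + 15) / 16);
    transpose_kernel<<<grid, block, 0, transposeStream>>>(dA_T, dA, rows, cols);

    cudaEventDestroy(copyDone);  // safe: pending waits keep the event alive
}
```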
One last note: although the fix in the hpl-gpu test branch resolved the segfault at the end of HPL, the other recent changes in hpl-gpu cause a seg fault when starting HPL, at MPI_Barrier. When I replace all files in hpl-gpu/testing/ptest with my older versions, it works.
Hi, I have pushed a new stable version to the repository.
I have set up a system with one Titan GPU here for some CUDA tests, which is much less elaborate than your system. As I have no full test setup here, I am very interested in your feedback, in particular in the DGEMM kernel performance, in single- and multi-GPU DGEMM system performance, as well as in a log of an HPL run with the HPL_VERBOSE=3 setting. Regards On 02.10.2015 05:06, TeslaCoder wrote:
call fast cublas dgemm NT kernel
Copy C before Copy+transpose A+B (fix for pipeline blocked by transpose kernel)
call transpose only when needed, otherwise copy data directly to dest_image
destroy cublas handles and cuda context in ExitRuntime
remove #ifdef from header file to fix seg fault problem