
Investigate OpenCL acceleration using pyopencl/pyvkfft #855

Open
happycube opened this issue May 14, 2023 · 4 comments

happycube (Owner) commented May 14, 2023

I finally decided to look into GPU acceleration after playing with whisper.cpp and realizing that OpenCL was still Actually Useful(tm). (Seriously, someone should've nudged me a while ago. Maybe I'm more stubborn than I think I am... ;) )

This would involve a bit of refactoring, but if it gets a 2x performance boost it'd be worth it:

  • The multithreaded demodcache should be torn out (this would generally be a win)
  • Data needs to be kept GPU-side as much as possible
  • There would have to be a wrapper so that non-OpenCL environments still run; see the sketch after this list. (Apple has deprecated OpenCL, so it will probably go away on macOS sooner or later - there should be something else to run there by the time it does...)
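A minimal sketch of the wrapper idea, assuming pyopencl plus pyvkfft's simple fft interface; the names here (GpuFFT, CpuFFT, get_fft_backend) are hypothetical, not anything in ld-decode:

```python
# Hypothetical sketch: use an OpenCL FFT backend when pyopencl/pyvkfft import
# and a device exists, otherwise fall back to numpy. Illustrative names only.
import numpy as np

try:
    import pyopencl as cl
    import pyopencl.array as cla
    from pyvkfft.fft import fftn as vk_fftn
    HAVE_OPENCL = True
except ImportError:
    HAVE_OPENCL = False


class CpuFFT:
    def fft(self, data):
        return np.fft.fft(data)


class GpuFFT:
    def __init__(self):
        self.ctx = cl.create_some_context(interactive=False)
        self.queue = cl.CommandQueue(self.ctx)

    def fft(self, data):
        # Keep the result GPU-side; callers fetch with .get() only when the
        # data actually has to come back to the host.
        d = cla.to_device(self.queue, np.ascontiguousarray(data, np.complex64))
        return vk_fftn(d)


def get_fft_backend():
    if HAVE_OPENCL:
        try:
            return GpuFFT()
        except cl.Error:
            pass  # no usable OpenCL device; fall through to the CPU path
    return CpuFFT()
```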

I'm still in the testing phase. On my main test platform (Dell T3600 w/ 6-core Sandy Bridge and a GeForce 3060 12GB), pyvkfft is 150% faster at the standard block size (64K samples) and ~15x faster at 1MB. So this will probably shift the bottleneck even further toward the TBC unless things can be kept GPU-side most of the time.
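For context, this is roughly how such a comparison can be run (a sketch, assuming pyvkfft's simple fft interface; block sizes and iteration counts are arbitrary):

```python
# Rough benchmark sketch: numpy FFT vs. pyvkfft-on-OpenCL at two block sizes.
import time
import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from pyvkfft.fft import fftn as vk_fftn

ctx = cl.create_some_context(interactive=False)
queue = cl.CommandQueue(ctx)

for n in (64 * 1024, 1024 * 1024):  # 64K and 1M samples
    data = (np.random.rand(n) + 1j * np.random.rand(n)).astype(np.complex64)

    t0 = time.perf_counter()
    for _ in range(100):
        np.fft.fft(data)
    cpu = time.perf_counter() - t0

    d = cla.to_device(queue, data)
    vk_fftn(d)          # first call builds the VkFFT plan
    queue.finish()
    t0 = time.perf_counter()
    for _ in range(100):
        vk_fftn(d)
    queue.finish()      # make sure the queued transforms actually ran
    gpu = time.perf_counter() - t0

    print(f"{n:>8} samples: numpy {cpu:.3f}s, pyvkfft {gpu:.3f}s ({cpu / gpu:.1f}x)")
```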

I'm also going to look at a secondary test potato^Wplatform, a Mele Quieter3C, which has a Celeron N5105 and its integrated GPU. The latter does pyvkfft benchmarks at about 4-5% the speed of the 3060, but since the CPU doesn't support AVX(2), the GPU path might still be faster. (By the way, the new Nxxx series does have AVX2 and would only lag behind a Haswell i5 because it has just one memory channel. Not bad.)

At a later point I'm planning on getting my hands on an RK3588 board - if OpenCL runs there with the free drivers I'll try that too, but the A76 has enough SIMD that it might not help.

happycube (Owner) commented May 14, 2023

N5105 notes: Not nearly as slow as I expected. Looks like ~1fps on ld-decode, and most of the OpenCL slowness is data transfer, which is why the GPU results are even close.

An Alder Lake-N PC would probably do quite well for ld-decode if you put a nice NVMe drive in it. These are not your father's Atoms.

happycube (Owner) commented May 14, 2023

I played around with doing int16->complex64 conversion on the GPU side, and it's now 50x faster with 1MB buffers and ~2x with 32K buffers on my main system, if I'm running things right.

(The N3050 is 7.2x/1.87x respectively; apparently I finally got the 3060 properly in play.)

So the overall speedup will be limited by how much I can use the GPU-side buffers to help with TBC/scaling.
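For reference, the general shape of a GPU-side int16->complex64 conversion using pyopencl's ElementwiseKernel (a sketch of the technique, not the code from the branch):

```python
# Sketch: convert int16 samples to complex64 on the device, so only the small
# int16 buffer crosses the bus. Illustrative, not the actual branch code.
import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from pyopencl.elementwise import ElementwiseKernel

ctx = cl.create_some_context(interactive=False)
queue = cl.CommandQueue(ctx)

# pyopencl ships pyopencl-complex.h, which defines cfloat_t/cfloat_new.
int16_to_c64 = ElementwiseKernel(
    ctx,
    "short *src, cfloat_t *dst",
    "dst[i] = cfloat_new((float) src[i], 0.0f)",
    "int16_to_c64",
    preamble="#include <pyopencl-complex.h>",
)

samples = np.random.randint(-32768, 32767, 1 << 20, dtype=np.int16)
src = cla.to_device(queue, samples)            # cheap: 2 bytes/sample
dst = cla.empty(queue, samples.shape, np.complex64)
int16_to_c64(src, dst)                         # expands to 8 bytes/sample on-GPU
# dst can now be fed straight to pyvkfft without another host round-trip.
```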

happycube self-assigned this May 14, 2023
happycube added the enhancement and ld-decode labels May 14, 2023
happycube added this to the Revision 8 milestone May 14, 2023
happycube (Owner) commented Jun 20, 2023

I OpenCL'ified the RF stage, but the performance gains are slight for now because pyopencl doesn't release the GIL much, on top of the switch to the threading model.

https://github.com/happycube/ld-decode/tree/chad-2023.06.11-opencl2

I hear PyCUDA isn't as bad, but since that's locked to NVIDIA I'd have to make sure the fallback always works.

typedrat commented
Obviously I'm not telling you to rewrite your whole project in another language, but I think this is really edging into territory that Python is bad at. I don't know if it's ready yet, but in the long term this seems like exactly the sort of thing that Mojo is going to be great for.
