I finally decided to look into GPU acceleration after playing with whisper.cpp and realizing that OpenCL was still Actually Useful(tm). (Seriously, someone should've nudged me a while ago. Maybe I'm stubborner than I think I am... ;) )
This would involve a bit of refactoring, but if it gets a 2x performance boost it'd be worth it:
The multithreaded demodcache should be torn out (this would generally be a win)
Data needs to be kept GPU-side as much as possible
There would probably have to be a wrapper so that non-OpenCL environments still run (see the sketch after this list). Apple has deprecated OpenCL, so it will probably go away on macOS sooner or later - but there will probably be something else to run there by the time it does...
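A minimal sketch of what I mean by the wrapper - the names here (`HAVE_OPENCL`, `fft_forward`) are made up for illustration, and the pyvkfft simple-FFT call is just one possible way to do it:

```python
# Hypothetical fallback wrapper: use pyopencl/pyvkfft when a device is
# available, otherwise fall back to plain numpy so the code still runs.
import numpy as np

try:
    import pyopencl as cl
    import pyopencl.array as cla
    from pyvkfft.fft import fftn as vk_fftn

    _ctx = cl.create_some_context(interactive=False)
    _queue = cl.CommandQueue(_ctx)
    HAVE_OPENCL = True
except Exception:
    HAVE_OPENCL = False

def fft_forward(block: np.ndarray):
    """Forward FFT of one demod block; hides where the transform runs."""
    if HAVE_OPENCL:
        gpu = cla.to_device(_queue, np.ascontiguousarray(block))
        return vk_fftn(gpu)   # result stays GPU-side for later stages
    return np.fft.fft(block)  # plain CPU path for non-OpenCL environments
```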
I'm still in the testing phase. On my main test platform (a Dell T3600 with a 6-core Sandy Bridge CPU and a GeForce RTX 3060 12GB) pyvkfft is 150% faster at the standard blocksize (64K samples), and ~15x faster at 1MB. So this will probably shift the bottleneck even further toward the TBC unless things can be kept on the GPU side most of the time.
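For reference, the kind of micro-benchmark I'm running looks roughly like this (a sketch, not the exact script - the iteration count is arbitrary):

```python
# Rough pyvkfft-vs-numpy FFT timing at the two block sizes mentioned above.
import time
import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from pyvkfft.opencl import VkFFTApp

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

for n in (64 * 1024, 1024 * 1024):
    host = (np.random.rand(n) + 1j * np.random.rand(n)).astype(np.complex64)

    t0 = time.perf_counter()
    for _ in range(100):
        np.fft.fft(host)
    cpu = time.perf_counter() - t0

    gpu_buf = cla.to_device(queue, host)
    app = VkFFTApp(gpu_buf.shape, gpu_buf.dtype, queue=queue, ndim=1, inplace=True)
    app.fft(gpu_buf)           # warm-up / plan compilation
    queue.finish()

    t0 = time.perf_counter()
    for _ in range(100):
        app.fft(gpu_buf)       # in-place, no host<->device copies in the loop
    queue.finish()
    gpu = time.perf_counter() - t0

    print(f"{n:>8} samples: numpy {cpu:.3f}s  pyvkfft {gpu:.3f}s  ({cpu / gpu:.1f}x)")
```

Note the timed loop deliberately excludes transfers - that's the "keep data GPU-side" caveat in action.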
I'm also going to look at a secondary test potato^Wplatform, a Mele Quieter3C, which has a Celeron N5105 and its integrated GPU. The latter does pyvkfft benchmarks at about 4-5% of the speed of the 3060, but since the CPU doesn't support AVX(2), the GPU might still be faster. (By the way, the new Alder Lake Nxxx series does have AVX2 and would only lag behind a Haswell i5 because it has just one memory channel. Not bad.)
At a later point I'm planning on getting my hands on an RK3588 board - if OpenCL runs there with the free drivers I'll try that too, but the Cortex-A76 has enough SIMD that the GPU might not help.
N5105 notes: not nearly as slow as I expected. It looks like ~1fps on ld-decode, and most of the OpenCL slowness is data transfer, so the GPU results still come out close.
An Alder Lake-N PC would probably do quite well for ld-decode if you put a nice NVMe drive in it. These are not your father's Atoms.
I played around with doing int16->complex64 conversion on the GPU side, and it's now 50x faster with 1MB buffers and ~2x with 32K buffers on my main system, if I'm running things right.
(The N3050 is 7.2x/1.87x respectively; I apparently finally got the 3060 properly in play.)
So the overall speedup will be limited by how much I can use the GPU-side buffers to help with the TBC/scaling.
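The conversion itself is just an elementwise kernel; mine is basically this shape (kernel and variable names are made up here, not anything in the tree):

```python
# int16 -> complex64 on the device: only the compact int16 buffer crosses
# the bus, and the complex64 result stays GPU-side for the FFT stage.
import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from pyopencl.elementwise import ElementwiseKernel

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# pyopencl maps complex64 to cfloat_t and pulls in its complex header
# automatically when it sees a complex argument.
int16_to_c64 = ElementwiseKernel(
    ctx,
    "const short *raw, cfloat_t *out",
    "out[i] = cfloat_new((float)raw[i], 0.0f)",
)

raw_host = np.random.randint(-32768, 32767, 1 << 20, dtype=np.int16)
raw_gpu = cla.to_device(queue, raw_host)
out_gpu = cla.empty(queue, raw_gpu.shape, np.complex64)
int16_to_c64(raw_gpu, out_gpu)  # feed out_gpu straight into the FFT
```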
I OpenCL'ified the RF stage, but the performance gains are slight for now, because pyopencl doesn't release the GIL much, on top of the switch to the threading model.
Obviously I'm not telling you to rewrite your whole project in another language, but I think this is really edging into territory that Python is bad at. I don't know if it's ready yet, but in the long term this seems like exactly the sort of thing that Mojo is going to be great for.