Moving a cufinufft plan between gpu and cpu #423
Unanswered
RoeyYadgar
asked this question in FINUFFT in applications, mathematical definitions
Replies: 4 comments 2 replies
-
Dear Roey,
There is no obvious way to move plans or setpts info between CPU and GPU.
At least, not without some hacking of our source code, which you should try.
You mention both a) the plan (which is dominated by FFT plan stage, for
types 1, 2), and b) setpts
(which is dominated by sorting, unless you're in 1d).
Consider a):
clearly there's no way to interchange cuFFT and FFTW plans - they are
different libraries.
Do you really need to change regular grid sizes (hence do new cuFFT plans)?
Do cuFFT plans really occupy that much RAM relative to the work arrays?
I don't know much about cuFFT, but maybe there is a way to move a cuFFT
plan to the CPU and back for storage's sake? ***@***.*** may know, since she
works on this at NVIDIA (I can give her your email if you email me).
This would require some basic CUDA hacking of the cufinufft plan.
But host-to-device (H2D) movement is slow, so this sounds like a bad idea.
Consider b):
Try no sorting at all. Change the opts to
opts.gpu_method=1
opts.gpu_sort=0
It could be your ordering is already decent enough not to need sorting.
Example: some MRI k-point orderings are already fast enough, or maybe a
simple transpose of ordering (spokes vs radius) is enough.
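In the Python interface these options can be passed as keyword arguments when constructing the plan. Here is a minimal sketch (hedged: the keyword passthrough of opts is how I read the cufinufft Python docs, and the import is guarded since running it needs CuPy and a CUDA GPU):

```python
import numpy as np

# Sketch: choose the points-driven spreading method (gpu_method=1) and
# turn the bin-sort of the NU points off (gpu_sort=0).
ran = False
try:
    import cupy as cp
    import cufinufft

    N, M = 64, 10000
    x, y, z = (cp.random.uniform(-np.pi, np.pi, M, dtype=cp.float32)
               for _ in range(3))
    c = (cp.random.standard_normal(M) + 1j * cp.random.standard_normal(M)
         ).astype(cp.complex64)

    plan = cufinufft.Plan(1, (N, N, N), eps=1e-6, dtype="complex64",
                          gpu_method=1, gpu_sort=0)  # the opts in question
    plan.setpts(x, y, z)
    f = plan.execute(c)        # (N, N, N) complex64 array on the GPU
    ran = True
except ImportError:
    pass  # cufinufft / CuPy not installed in this environment
```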
If not, you could write out the sort order, and reorder your data to match
that (requires hacking to make that output).
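Reordering your data to a spatially sorted order can be prototyped on the CPU with plain NumPy; a sketch (hedged: the binning below is illustrative, not cufinufft's actual internal scheme):

```python
import numpy as np

def bin_sort_order(x, y, z, nbins=16):
    """Permutation that groups NU points in [-pi, pi)^3 by spatial bin,
    so nearby points become contiguous in memory (illustrative only)."""
    ix = np.minimum(((x + np.pi) / (2 * np.pi) * nbins).astype(int), nbins - 1)
    iy = np.minimum(((y + np.pi) / (2 * np.pi) * nbins).astype(int), nbins - 1)
    iz = np.minimum(((z + np.pi) / (2 * np.pi) * nbins).astype(int), nbins - 1)
    keys = (ix * nbins + iy) * nbins + iz    # flattened 3-D bin index
    return np.argsort(keys, kind="stable")   # permutation of the points

rng = np.random.default_rng(0)
M = 1000
x, y, z = (rng.uniform(-np.pi, np.pi, M) for _ in range(3))
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)

perm = bin_sort_order(x, y, z)
x_s, y_s, z_s, c_s = x[perm], y[perm], z[perm], c[perm]  # reordered data
```

The same permutation would then be applied once, up front, so the library can run with sorting disabled on already-ordered points.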
Examine debug=2 output to investigate what the bottleneck is (on the CPU side).
On the GPU side, benchmark the test executables with similar sizes to see the
breakdown, as in
https://finufft.readthedocs.io/en/latest/trouble.html#gpu-library-speed
Finally, change your design so you do all transforms of same size, and
same NU pts, at once. Maybe you can then vectorize the call and benefit in
speed.
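The benefit of batching all same-size, same-points transforms shows up even in a toy direct evaluation: with the NU points fixed, a type-2-style evaluation is one matrix, so a batch of transforms is a single matrix-matrix product instead of many matrix-vector products (NumPy sketch of the idea, not the library API):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, n_trans = 32, 200, 8            # modes, NU points, batch size
k = np.arange(-N // 2, N // 2)        # 1-D mode indices
x = rng.uniform(-np.pi, np.pi, M)     # fixed NU points, shared by the batch

A = np.exp(1j * np.outer(x, k))      # direct type-2 matrix: modes -> NU pts
F = rng.standard_normal((N, n_trans)) + 1j * rng.standard_normal((N, n_trans))

batched = A @ F                                       # one call, all vectors
looped = np.stack([A @ F[:, t] for t in range(n_trans)], axis=1)
```

In the library the analogous knob is the number-of-transforms plan parameter, which amortizes the plan and setpts cost over the whole batch.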
Hope this helps. Alex
…On Sun, Mar 24, 2024 at 1:13 PM Roey Yadgar ***@***.***> wrote:
Hi,
I hope I'm asking my question in the right place.
I'm using the Python interface of cufinufft, and I wondered if it is
possible to move an already-initialized plan to the CPU and back to the
GPU, or alternatively "convert" a cufinufft plan to a finufft plan and
vice versa without performing the heavy computations again.
I noticed that initializing a plan and calling the setpts method takes
most of the execution time in my application, and since I need to use each
plan multiple times (but not all at once) I wondered if I could cache them
on the CPU, because I need to create a lot of different plans which wouldn't
fit in the GPU's memory.
Thanks a lot :)
--
*-------------------------------------------------------------------~^`^~._.~'
|\ Alex Barnett Center for Computational Mathematics, Flatiron Institute
| \ http://users.flatironinstitute.org/~ahb 646-876-5942
|
-
On Mon, Mar 25, 2024 at 4:03 PM Roey Yadgar ***@***.***> wrote:
Hi Alex,
Thank you for the very detailed response!
All my plans are indeed of the same size (it's just the points that are
different), and I have already started initializing the plans only once
and re-setting the points each time. This saves me a big part of the
execution time without paying anything extra in memory.
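The reuse pattern being described looks roughly like this (a hedged sketch of the cufinufft Python interface; the import is guarded since it needs CuPy and a CUDA GPU):

```python
import numpy as np

# One plan per transform size; only setpts + execute are redone per point set.
ran = False
try:
    import cupy as cp
    import cufinufft

    N, M = 64, 10000
    plan = cufinufft.Plan(1, (N, N, N), eps=1e-6, dtype="complex64")
    for _ in range(3):                    # e.g. one iteration per point set
        x, y, z = (cp.random.uniform(-np.pi, np.pi, M, dtype=cp.float32)
                   for _ in range(3))
        c = (cp.random.standard_normal(M)
             + 1j * cp.random.standard_normal(M)).astype(cp.complex64)
        plan.setpts(x, y, z)              # redone per point set
        f = plan.execute(c)               # plan creation is amortized away
    ran = True
except ImportError:
    pass  # cufinufft / CuPy not installed in this environment
```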
What I was not aware of is the issue of point sorting; not sorting does seem
to reduce the execution time by quite a bit. In my case I'm using 3-D type 1
& 2 NUFFTs with points that lie on a 2-D plane (I'm working on something
with a similar concept to MRI, so I'd assume it is the same case in MRI as
well).
If your NU points really are on a single plane, surely the outputs
(uniform) of the type 1 are invariant in the z (3rd) direction? (That is
for the case where the plane is z=0; if the plane is some other z, there
is a simple phasing in the 3rd dimension.) Similarly for type 2 (you could
collapse the 3rd dim before applying a 2d2). Not sure why you're not doing
2-D NUFFTs, in that case?
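The invariance claim is easy to check with a tiny brute-force type-1 sum in NumPy (a toy stand-in for the library, small sizes only): with all z_j = 0, the output f(k1,k2,k3) = sum_j c_j exp(i(k1 x_j + k2 y_j + k3 z_j)) cannot depend on k3.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 50, 8                      # NU points, modes per dimension
x = rng.uniform(-np.pi, np.pi, M)
y = rng.uniform(-np.pi, np.pi, M)
z = np.zeros(M)                   # all points on the plane z = 0
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)

k = np.arange(-N // 2, N // 2)
# Direct type-1 sum: f[k1,k2,k3] = sum_j c_j exp(i(k1 x_j + k2 y_j + k3 z_j))
phase = (k[:, None, None, None] * x + k[None, :, None, None] * y
         + k[None, None, :, None] * z)
f = (c * np.exp(1j * phase)).sum(axis=-1)
# Since z_j = 0 kills the k3 term, every k3-slice of f is identical;
# for a plane at z = z0 each slice would just pick up a factor exp(i*k3*z0).
```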
I must admit I'm not aware of the actual computation that is performed in a
NUFFT; is the purpose of sorting the points faster memory access, or does
it affect the actual result of the NUFFT?
I should do some reading to figure out how the points in my case should
be sorted; is the paper "cuFINUFFT: a load-balanced GPU library for
general-purpose nonuniform FFTs" at https://arxiv.org/abs/2102.08463 the
right place to start?
Wouldn't bother reading about the details of the sorting. Just try
different things.
You could do simple tests with rand vs a regular grid to see the effect.
… If it turns out well, I guess I won't even need to cache the plans on the
CPU, but if I end up still needing to try it I'll email you.
Thank you so much!
Roey
-
Ok, I understand the arbitrary plane orientation now - it's a good use case.
Sorting only affects performance, not accuracy (the results should match to
eps_mach).
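That claim can be sanity-checked with a toy direct sum in NumPy (a stand-in for the real spreader, not the library itself): summing the same terms in a permuted order changes the result only at the level of floating-point round-off.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 500
x = rng.uniform(-np.pi, np.pi, M)     # NU points in one processing order
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)
k = 7                                 # a single output mode, for brevity

perm = rng.permutation(M)             # some other (e.g. sorted) order
f_orig = np.sum(c * np.exp(1j * k * x))
f_perm = np.sum(c[perm] * np.exp(1j * k * x[perm]))

# Differs only by round-off: ~ machine epsilon times the sum's magnitude
err = abs(f_orig - f_perm)
```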
…On Tue, Mar 26, 2024 at 2:38 PM Roey Yadgar ***@***.***> wrote:
I hope I understood what you meant. The 2-D plane has an arbitrary
orientation (it is a 2-D uniform grid on the xy plane that is just
rotated), so in type 1 the 3-D output is constant along some axis (the one
that is orthogonal to the plane), and type 2 can be done by collapsing along
that axis and performing a regular 2-D FFT. So it is actually possible to use
regular 2-D FFTs instead of both types of NUFFTs, but it requires rotating
the 3-D volume, and NUFFTs are used instead since they're more efficient (I
have never personally looked into the implementation details of each method,
but NUFFTs are pretty much the standard from what I've seen).
I'll give it a try. What should I be expecting to see? Is it supposed to
just affect the speed, or also the accuracy of the result?
Roey
-
I see, thank you very much!