Moving a cufinufft plan between gpu and cpu #423
Unanswered
RoeyYadgar
asked this question in FINUFFT in applications, mathematical definitions
Replies: 4 comments 2 replies
-
Dear Roey,
There is no obvious way to move plans or setpts info between CPU and GPU.
At least, not without some hacking of our source code, which you should try.
You mention both a) the plan (which is dominated by FFT plan stage, for
types 1, 2), and b) setpts
(which is dominated by sorting, unless you're in 1d).
Consider a):
clearly there's no way to interchange cuFFT and FFTW plans - they are
different libraries.
Do you really need to change regular grid sizes (hence do new cuFFT plans)?
Do cuFFT plans really occupy that much RAM relative to the work arrays?
I don't know much about cuFFT, but maybe there is a way to move a cuFFT
plan to the CPU and back for storage's sake? ***@***.*** may know, since she
works on this at NVIDIA (I can give her your email if you email me).
This would require some basic CUDA hacking of the cufinufft plan.
But host-to-device (H2D) movement is slow, so this sounds like a bad idea.
Consider b):
Try no sorting at all. Change the opts to
opts.gpu_method=1
opts.gpu_sort=0
It could be your ordering is already decent enough not to need sorting.
Example: some MRI k-point orderings are already fast enough, or maybe a
simple transpose of ordering (spokes vs radius) is enough.
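In the Python interface these options can be passed as keyword arguments when constructing the plan. Here is a minimal sketch (hedged: the keyword passthrough of opts is how I read the cufinufft Python docs, and the import is guarded since running it needs CuPy and a CUDA GPU):

```python
import numpy as np

# Sketch: choose the points-driven spreading method (gpu_method=1) and
# turn the bin-sort of the NU points off (gpu_sort=0).
ran = False
try:
    import cupy as cp
    import cufinufft

    N, M = 64, 10000
    x, y, z = (cp.random.uniform(-np.pi, np.pi, M, dtype=cp.float32)
               for _ in range(3))
    c = (cp.random.standard_normal(M) + 1j * cp.random.standard_normal(M)
         ).astype(cp.complex64)

    plan = cufinufft.Plan(1, (N, N, N), eps=1e-6, dtype="complex64",
                          gpu_method=1, gpu_sort=0)  # the opts in question
    plan.setpts(x, y, z)
    f = plan.execute(c)        # (N, N, N) complex64 array on the GPU
    ran = True
except ImportError:
    pass  # cufinufft / CuPy not installed in this environment
```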
If not, you could write out the sort order, and reorder your data to match
that (requires hacking to make that output).
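Reordering your data to a spatially sorted order can be prototyped on the CPU with plain NumPy; a sketch (hedged: the binning below is illustrative, not cufinufft's actual internal scheme):

```python
import numpy as np

def bin_sort_order(x, y, z, nbins=16):
    """Permutation that groups NU points in [-pi, pi)^3 by spatial bin,
    so nearby points become contiguous in memory (illustrative only)."""
    ix = np.minimum(((x + np.pi) / (2 * np.pi) * nbins).astype(int), nbins - 1)
    iy = np.minimum(((y + np.pi) / (2 * np.pi) * nbins).astype(int), nbins - 1)
    iz = np.minimum(((z + np.pi) / (2 * np.pi) * nbins).astype(int), nbins - 1)
    keys = (ix * nbins + iy) * nbins + iz    # flattened 3-D bin index
    return np.argsort(keys, kind="stable")   # permutation of the points

rng = np.random.default_rng(0)
M = 1000
x, y, z = (rng.uniform(-np.pi, np.pi, M) for _ in range(3))
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)

perm = bin_sort_order(x, y, z)
x_s, y_s, z_s, c_s = x[perm], y[perm], z[perm], c[perm]  # reordered data
```

The same permutation would then be applied once, up front, so the library can run with sorting disabled on already-ordered points.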
Examine debug=2 output to investigate what the bottleneck is (on the CPU side).
On the GPU side, benchmark the test executables with similar sizes to see the
breakdown, as in
https://finufft.readthedocs.io/en/latest/trouble.html#gpu-library-speed
Finally, change your design so you do all transforms of same size, and
same NU pts, at once. Maybe you can then vectorize the call and benefit in
speed.
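The benefit of batching all same-size, same-points transforms shows up even in a toy direct evaluation: with the NU points fixed, a type-2-style evaluation is one matrix, so a batch of transforms is a single matrix-matrix product instead of many matrix-vector products (NumPy sketch of the idea, not the library API):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, n_trans = 32, 200, 8            # modes, NU points, batch size
k = np.arange(-N // 2, N // 2)        # 1-D mode indices
x = rng.uniform(-np.pi, np.pi, M)     # fixed NU points, shared by the batch

A = np.exp(1j * np.outer(x, k))      # direct type-2 matrix: modes -> NU pts
F = rng.standard_normal((N, n_trans)) + 1j * rng.standard_normal((N, n_trans))

batched = A @ F                                       # one call, all vectors
looped = np.stack([A @ F[:, t] for t in range(n_trans)], axis=1)
```

In the library the analogous knob is the number-of-transforms plan parameter, which amortizes the plan and setpts cost over the whole batch.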
Hope this helps. Alex
…On Sun, Mar 24, 2024 at 1:13 PM Roey Yadgar ***@***.***> wrote:
Hi,
I hope I'm asking my question in the right place.
I'm using the Python interface of cufinufft, and I wondered if it is
possible to move an already-initialized plan to the CPU and back to the
GPU, or alternatively "convert" a cufinufft plan to a finufft plan and
vice versa without performing the heavy computations again.
I noticed that initializing a plan and calling the setpts method takes
most of the execution time in my application, and since I need to use each
plan multiple times (but not all at once) I wondered if I could cache them
on the CPU, because I need to create a lot of different plans which wouldn't
fit in the GPU's memory.
Thanks a lot :)
--
*-------------------------------------------------------------------~^`^~._.~'
|\ Alex Barnett Center for Computational Mathematics, Flatiron Institute
| \ http://users.flatironinstitute.org/~ahb 646-876-5942
|
-
On Mon, Mar 25, 2024 at 4:03 PM Roey Yadgar ***@***.***> wrote:
Hi Alex,
Thank you for the very detailed response!
All my plans are indeed of the same size (it's just the points that are
different), and I have already started initializing the plans only once
and re-setting the points each time. This saves me a big part of the
execution time without paying anything extra in memory.
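The reuse pattern being described looks roughly like this (a hedged sketch of the cufinufft Python interface; the import is guarded since it needs CuPy and a CUDA GPU):

```python
import numpy as np

# One plan per transform size; only setpts + execute are redone per point set.
ran = False
try:
    import cupy as cp
    import cufinufft

    N, M = 64, 10000
    plan = cufinufft.Plan(1, (N, N, N), eps=1e-6, dtype="complex64")
    for _ in range(3):                    # e.g. one iteration per point set
        x, y, z = (cp.random.uniform(-np.pi, np.pi, M, dtype=cp.float32)
                   for _ in range(3))
        c = (cp.random.standard_normal(M)
             + 1j * cp.random.standard_normal(M)).astype(cp.complex64)
        plan.setpts(x, y, z)              # redone per point set
        f = plan.execute(c)               # plan creation is amortized away
    ran = True
except ImportError:
    pass  # cufinufft / CuPy not installed in this environment
```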
What I was not aware of is the issue of point sorting; not sorting does seem
to reduce the execution time by quite a bit. In my case I'm using 3-D type 1
& 2 NUFFTs with points that lie on a 2-D plane (I'm working on something
with a similar concept to MRI, so I'd assume it is the same case in MRI as
well).
If your NU points really are on a single plane, surely the outputs
(uniform) of the type 1 are invariant in the z (3rd) direction? (That is
for the case where the plane is z=0; if the plane is some other z, there
is a simple phasing in the 3rd dimension.) Similarly for type 2 (you could
collapse the 3rd dim before applying a 2d2). Not sure why you're not doing
2-D NUFFTs, in that case?
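The invariance claim is easy to check with a tiny brute-force type-1 sum in NumPy (a toy stand-in for the library, small sizes only): with all z_j = 0, the output f(k1,k2,k3) = sum_j c_j exp(i(k1 x_j + k2 y_j + k3 z_j)) cannot depend on k3.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 50, 8                      # NU points, modes per dimension
x = rng.uniform(-np.pi, np.pi, M)
y = rng.uniform(-np.pi, np.pi, M)
z = np.zeros(M)                   # all points on the plane z = 0
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)

k = np.arange(-N // 2, N // 2)
# Direct type-1 sum: f[k1,k2,k3] = sum_j c_j exp(i(k1 x_j + k2 y_j + k3 z_j))
phase = (k[:, None, None, None] * x + k[None, :, None, None] * y
         + k[None, None, :, None] * z)
f = (c * np.exp(1j * phase)).sum(axis=-1)
# Since z_j = 0 kills the k3 term, every k3-slice of f is identical;
# for a plane at z = z0 each slice would just pick up a factor exp(i*k3*z0).
```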
I must admit I'm not aware of the actual computation that is performed in a
NUFFT; is the purpose of sorting the points faster memory access, or does
it affect the actual result of the NUFFT?
I should do some reading to figure out how the points in my case should
be sorted; is the paper "cuFINUFFT: a load-balanced GPU library for
general-purpose nonuniform FFTs" at https://arxiv.org/abs/2102.08463 the
right place to start?
Wouldn't bother reading about the details of the sorting. Just try
different things.
You could do simple tests with rand vs a regular grid to see the effect.
… If it turns out well, I guess I won't even need to cache the plans on the
CPU, but if I end up still needing to try it I'll email you.
Thank you so much!
Roey
-
Ok, I understand the arbitrary plane orientation now - it's a good use case.
Sorting only affects performance, not accuracy (the results should match to
eps_mach).
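That claim can be sanity-checked with a toy direct sum in NumPy (a stand-in for the real spreader, not the library itself): summing the same terms in a permuted order changes the result only at the level of floating-point round-off.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 500
x = rng.uniform(-np.pi, np.pi, M)     # NU points in one processing order
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)
k = 7                                 # a single output mode, for brevity

perm = rng.permutation(M)             # some other (e.g. sorted) order
f_orig = np.sum(c * np.exp(1j * k * x))
f_perm = np.sum(c[perm] * np.exp(1j * k * x[perm]))

# Differs only by round-off: ~ machine epsilon times the sum's magnitude
err = abs(f_orig - f_perm)
```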
…On Tue, Mar 26, 2024 at 2:38 PM Roey Yadgar ***@***.***> wrote:
I hope I understood what you meant. The 2-D plane has an arbitrary
orientation (it is a 2-D uniform grid on the xy plane that is just
rotated), so in type 1 the 3-D output is constant along some axis (the one
that is orthogonal to the plane), and type 2 can be done by collapsing along
that axis and performing a regular 2-D FFT. So it is actually possible to use
regular 2-D FFTs instead of both types of NUFFTs, but it requires rotating
the 3-D volume, and NUFFTs are used instead since they're more efficient (I
have never personally looked into the implementation details of each method,
but NUFFTs are pretty much the standard from what I've seen).
I'll give it a try. What should I be expecting to see? Is it supposed to
just affect the speed, or also the accuracy of the result?
Roey
-
I see, thank you very much!