Please try out v2.3.0-rc1, with new features and performance gains #522
4 comments · 14 replies
-
Works great for nifty-ls! The Lomb-Scargle computation runs about 10% faster for mid-size 1D type-1 transforms (N_f > 10^4), and all the tests still pass. The timings are a bit noisy, but here's a plot of the speedups. While there are some apparent regressions, I'm not too concerned about them: most of the evaluations are very fast to begin with, so the denominators are small and sensitive to noise and overheads.
[speedup plot: https://github.com/user-attachments/assets/b4e21f81-1bdd-40d6-9be3-49f9fcbc9b77]
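For anyone who wants to run this kind of comparison themselves, here is a minimal timing sketch (not the actual nifty-ls benchmark; the sizes, tolerance, and repeat count are illustrative). Run it once per installed finufft version and compare the printed times:

```python
import time
import numpy as np
import finufft  # pip install finufft

rng = np.random.default_rng(42)

for N in (10**4, 10**5, 10**6):           # number of output modes, N_f
    M = 2 * N                             # number of nonuniform points
    x = rng.uniform(-np.pi, np.pi, M)     # nonuniform sample locations
    c = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    finufft.nufft1d1(x, c, N, eps=1e-9)   # warm-up run
    reps = 5                              # average over repeats; timings are noisy
    t0 = time.perf_counter()
    for _ in range(reps):
        finufft.nufft1d1(x, c, N, eps=1e-9)
    dt = (time.perf_counter() - t0) / reps
    print(f"N_f = {N:.0e}: {dt * 1e3:.2f} ms per type-1 transform")
```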
-
That's weird that the GPU code isn't faster. Did you happen to try gpu_meth = 1 (the SM rather than GM method)? @DiamonDinoia may have advice, since he sped up the GPU code.
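For reference, a rough sketch of how one might try both spreader methods via plan options in the cufinufft Python interface; the option and parameter names here are from my reading of the docs, so treat this as a sketch and check the cufinufft documentation for the exact API:

```python
import numpy as np
import cupy as cp
import cufinufft

M, N = 10**6, 10**5
rng = np.random.default_rng(0)
x = cp.asarray(rng.uniform(-np.pi, np.pi, M))
c = cp.asarray(rng.standard_normal(M) + 1j * rng.standard_normal(M))

# Try both spreader methods (GM- vs SM-based; see the docs for which
# integer selects which) to see whether either is faster for this problem.
for method in (1, 2):
    plan = cufinufft.Plan(1, (N,), eps=1e-9, dtype="complex128",
                          gpu_method=method)
    plan.setpts(x)
    f = plan.execute(c)   # type-1: nonuniform points -> N modes
```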
-
Hmm, well all these differences are only a few % (it sounds like even changing gpu_meth from 1 to 2 did not change the speed significantly). Marco, can you look back on your 1D GPU speedups and check whether the difference should be more (it could depend on the device)? Thanks for the testing, BTW!
…On Fri, Aug 16, 2024 at 10:43 AM Lehman Garrison wrote:
(Might have replied to this in the wrong thread; yes, these are source builds.)
-
Could be that Marco focused on fp32 (most common for MRI, etc.), but Lehman needs fp64.
(And yes, use 2.3.0, you're right.)
…On Fri, Aug 16, 2024 at 12:25 PM Marco Barbone wrote:
I would say that these results are expected. Even then, a 30% improvement over 0.05 s is in the ms range, which is below the noise threshold. With n_f = 2e8 I would have expected to see some improvement: in my tests with 1e8 points in 1D, going from 10 to 15 digits of accuracy, the new code is 7% to 30% faster.
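If the precision hypothesis matters for anyone else's timings, a quick way to probe it is to run the same transform in both precisions; the Python interface dispatches precision on the input dtype. A sketch (sizes scaled down from Marco's 1e8 for memory; the same tolerance is used for both so the comparison is apples to apples):

```python
import time
import numpy as np
import finufft

M = N = 10**7          # scale toward 1e8 as memory allows
rng = np.random.default_rng(1)
x = rng.uniform(-np.pi, np.pi, M)
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)

# eps=1e-6 is achievable in both precisions; fp32 cannot reach 9+ digits.
for label, xp, cp_ in [("fp32", x.astype(np.float32), c.astype(np.complex64)),
                       ("fp64", x, c)]:
    t0 = time.perf_counter()
    finufft.nufft1d1(xp, cp_, N, eps=1e-6)
    print(f"{label}: {time.perf_counter() - t0:.3f} s")
```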
-
We have packed a lot of new features into an upcoming 2.3.0 release!
In short, this includes big performance increases via SIMD vectorization, a switchable FFT (allowing DUCC0 instead of FFTW), a GPU modeord option, and a new Python library build. We would love users to try out 2.3.0-rc1 to make sure it is stable in their applications. Please report any problems as GitHub Issues. Thank you!
Python users, please test the following commands to install the pre-release packages from source:
pip install --pre --no-binary finufft finufft -f https://andenpantera.com/joakim/finufft-py
pip install --pre --no-binary cufinufft cufinufft -f https://andenpantera.com/joakim/finufft-py
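After either command, a quick sanity check (just standard-library Python, nothing official) that pip actually picked up the release candidate rather than a stale 2.2.x install:

```python
# Expect versions like "2.3.0rc1" if the pre-release was installed.
from importlib.metadata import version
print(version("finufft"))
print(version("cufinufft"))   # only if you installed the GPU package
```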
(The --pre argument makes sure we don't skip pre-release versions, --no-binary forces compilation from source instead of downloading a binary, and -f tells pip to look for the package on Joakim Anden's server instead of on PyPI.)
Here is a full list from the CHANGELOG:
V 2.3.0-rc1 (8/6/24)
improvements (Barbone).
PR 507 (Anden, Lu, Barbone). Compiles from source for the local build.
used to exploit sparsity pattern to achieve FFT speedups 1-3x in 2D and 3D).
PR463, Martin Reinecke. Both CMake and makefile include this DUCC0 option
(makefile PR511 by Barnett; CMake by Barbone).
cleaner Horner coefficient generation PR499 (fixes fp32 overflow issue #454).
kernel evaluation, templating by ns with AVX-width-dependent decisions.
Up to 80% faster, depending on compiler. (Marco Barbone with help from Libin Lu).
A large chunk of work: PRs 459, 471, 502.
NOTE: introduces new dependency (XSIMD), added to CMake and makefile.
(Libin Lu based on idea of Martin Reinecke; PR477,492,493).
PR 473 (M Barbone).
order, 1 FFT-style mode order. PR447,446 (Libin Lu, Joakim Anden).
speedup at large tols. Deprecates both opts.chkbnds and error code
FINUFFT_ERR_SPREAD_PTS_OUT_RANGE. Also inlined kernel eval code (increases
compile of spreadinterp.cpp to 10s). PR440 Marco Barbone + Martin Reinecke.
if single-threaded fixes nthr=1 and warns opts.nthreads>1 attempt.
Sort now respects spread_opts.sort_threads not nthreads. Supersedes PR 431.
NUFFT problem (Barnett).
pycuda. Docs for all GPU options. PyPI pkg still at 2.2.0beta.
Created a .clang-format file to define a style similar to the existing style.
Applied clang-format to all cmake, C, C++, and CUDA code. Ignored the blame
using .git-blame-ignore-revs. contributing.md for devs. PR450,455, Barbone.
as opposed to 32-bit. While this does modify the ABI, most code will just
need to recompile against the new library as compilers will silently upcast
any 32-bit integers to 64-bit when calling cufinufft(f)_setpts. Note that
internally, 32-bit integers are still used, so calling cufinufft with more
than 2e9 points will fail. This restriction may be lifted in the future.
It now auto-selects compiler flags based on those supported on all OSes, and
has support for Windows (llvm, msvc), Linux (llvm, gcc) and MacOS (llvm, gcc).
upsampfac=1.25 on GPU, for the first time.
precision denormals to 0 and using fma where possible.
performance
test/finufft?d_test.cpp to reduce CI fails due to random numbers on some
platforms in single-prec (with DUCC, etc). (Barnett PR516)
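As one illustration of the modeord option mentioned in the changelog above (a sketch using the CPU Python interface, which has had this option; the changelog item adds the same choice on the GPU): modeord=1 returns modes in FFT-style order, matching numpy.fft conventions without an fftshift.

```python
import numpy as np
import finufft

rng = np.random.default_rng(2)
x = rng.uniform(-np.pi, np.pi, 1000)
c = rng.standard_normal(1000) + 0j
N = 64

f_cmcl = finufft.nufft1d1(x, c, N)             # default: CMCL order, -N/2..N/2-1
f_fft = finufft.nufft1d1(x, c, N, modeord=1)   # FFT-style order: 0..N/2-1, -N/2..-1

# fftshift maps FFT-style ordering onto the centered (CMCL) ordering.
assert np.allclose(np.fft.fftshift(f_fft), f_cmcl)
```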