Please try out v2.3.0-rc1, with new features and performance gains #522
4 comments · 14 replies
-
Works great for nifty-ls! The Lomb-Scargle computation runs about 10% faster for mid-size 1D type-1 transforms (N_f > 10^4), and all the tests still pass. The timings are a bit noisy, but here's a plot of the speedups. While there are some apparent regressions, I'm not too concerned about them: most of the evaluations are very fast to begin with, so the denominators are small and sensitive to noise and overheads.
[speedup plot: https://github.com/user-attachments/assets/b4e21f81-1bdd-40d6-9be3-49f9fcbc9b77]
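For anyone who wants to run this kind of comparison themselves, here is a minimal timing sketch (not the actual nifty-ls benchmark; the sizes, tolerance, and repeat count are illustrative). Run it once per installed finufft version and compare the printed times:

```python
import time
import numpy as np
import finufft  # pip install finufft

rng = np.random.default_rng(42)

for N in (10**4, 10**5, 10**6):           # number of output modes, N_f
    M = 2 * N                             # number of nonuniform points
    x = rng.uniform(-np.pi, np.pi, M)     # nonuniform sample locations
    c = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    finufft.nufft1d1(x, c, N, eps=1e-9)   # warm-up run
    reps = 5                              # average over repeats; timings are noisy
    t0 = time.perf_counter()
    for _ in range(reps):
        finufft.nufft1d1(x, c, N, eps=1e-9)
    dt = (time.perf_counter() - t0) / reps
    print(f"N_f = {N:.0e}: {dt * 1e3:.2f} ms per type-1 transform")
```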
-
That's weird that the GPU code isn't faster. Did you happen to try gpu_meth = 1 (the SM rather than GM method)? @DiamonDinoia may have advice, since he sped up the GPU code.
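For reference, a rough sketch of how one might try both spreader methods via plan options in the cufinufft Python interface; the option and parameter names here are from my reading of the docs, so treat this as a sketch and check the cufinufft documentation for the exact API:

```python
import numpy as np
import cupy as cp
import cufinufft

M, N = 10**6, 10**5
rng = np.random.default_rng(0)
x = cp.asarray(rng.uniform(-np.pi, np.pi, M))
c = cp.asarray(rng.standard_normal(M) + 1j * rng.standard_normal(M))

# Try both spreader methods (GM- vs SM-based; see the docs for which
# integer selects which) to see whether either is faster for this problem.
for method in (1, 2):
    plan = cufinufft.Plan(1, (N,), eps=1e-9, dtype="complex128",
                          gpu_method=method)
    plan.setpts(x)
    f = plan.execute(c)   # type-1: nonuniform points -> N modes
```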
-
Hmm, well all these differences are only a few % (it sounds like even changing gpu_meth from 1 to 2 did not change the speed significantly). Marco, can you look back on your 1D GPU speedups and check whether the difference should be more (it could depend on the device)? Thanks for the testing, BTW!
…On Fri, Aug 16, 2024 at 10:43 AM Lehman Garrison wrote:
(Might have replied to this in the wrong thread; yes, these are source builds.)
-
Could be that Marco focused on fp32 (most common for MRI, etc.), but Lehman needs fp64.
(And yes, use 2.3.0, you're right.)
…On Fri, Aug 16, 2024 at 12:25 PM Marco Barbone wrote:
I would say that these results are expected. Even then, a 30% improvement over 0.05 s is in the ms range, which is below the noise threshold. With n_f = 2e8 I would have expected to see some improvement: in my tests with 1e8 points in 1D, going from 10 to 15 digits of accuracy, the new code is 7% to 30% faster.
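If the precision hypothesis matters for anyone else's timings, a quick way to probe it is to run the same transform in both precisions; the Python interface dispatches precision on the input dtype. A sketch (sizes scaled down from Marco's 1e8 for memory; the same tolerance is used for both so the comparison is apples to apples):

```python
import time
import numpy as np
import finufft

M = N = 10**7          # scale toward 1e8 as memory allows
rng = np.random.default_rng(1)
x = rng.uniform(-np.pi, np.pi, M)
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)

# eps=1e-6 is achievable in both precisions; fp32 cannot reach 9+ digits.
for label, xp, cp_ in [("fp32", x.astype(np.float32), c.astype(np.complex64)),
                       ("fp64", x, c)]:
    t0 = time.perf_counter()
    finufft.nufft1d1(xp, cp_, N, eps=1e-6)
    print(f"{label}: {time.perf_counter() - t0:.3f} s")
```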
-
We have packed a lot of new features into an upcoming 2.3.0 release!
In short, this includes big performance increases via SIMD vectorization, a switchable FFT (allowing DUCC0 instead of FFTW), a GPU modeord option, and a new Python library build. We would love users to try out 2.3.0-rc1 to make sure it is stable in their applications. Please report any problems as GitHub Issues. Thank you!
Python users, please test the following commands to install the pre-release packages from source:
pip install --pre --no-binary finufft finufft -f https://andenpantera.com/joakim/finufft-py
pip install --pre --no-binary cufinufft cufinufft -f https://andenpantera.com/joakim/finufft-py
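After either command, a quick sanity check (just standard-library Python, nothing official) that pip actually picked up the release candidate rather than a stale 2.2.x install:

```python
# Expect versions like "2.3.0rc1" if the pre-release was installed.
from importlib.metadata import version
print(version("finufft"))
print(version("cufinufft"))   # only if you installed the GPU package
```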
(The --pre argument makes sure we don't skip pre-release versions, --no-binary forces compilation from source instead of downloading a binary, and -f tells pip to look for the package on Joakim Anden's server instead of on PyPI.)
Here is a full list from the CHANGELOG:
V 2.3.0-rc1 (8/6/24)
improvements (Barbone).
PR 507 (Anden, Lu, Barbone). Compiles from source for the local build.
used to exploit sparsity pattern to achieve FFT speedups 1-3x in 2D and 3D).
PR463, Martin Reinecke. Both CMake and makefile include this DUCC0 option
(makefile PR511 by Barnett; CMake by Barbone).
cleaner Horner coefficient generation PR499 (fixes fp32 overflow issue #454).
kernel evaluation, templating by ns with AVX-width-dependent decisions.
Up to 80% faster, depending on compiler. (Marco Barbone with help from Libin Lu).
A large chunk of work: PRs 459, 471, 502.
NOTE: introduces new dependency (XSIMD), added to CMake and makefile.
(Libin Lu based on idea of Martin Reinecke; PR477,492,493).
PR 473 (M Barbone).
order, 1 FFT-style mode order. PR447,446 (Libin Lu, Joakim Anden).
speedup at large tols. Deprecates both opts.chkbnds and error code
FINUFFT_ERR_SPREAD_PTS_OUT_RANGE. Also inlined kernel eval code (increases
compile of spreadinterp.cpp to 10s). PR440 Marco Barbone + Martin Reinecke.
if single-threaded fixes nthr=1 and warns opts.nthreads>1 attempt.
Sort now respects spread_opts.sort_threads not nthreads. Supersedes PR 431.
NUFFT problem (Barnett).
pycuda. Docs for all GPU options. PyPI pkg still at 2.2.0beta.
Created a .clang-format file to define a style similar to the existing style.
Applied clang-format to all cmake, C, C++, and CUDA code. Ignored the blame
using .git-blame-ignore-revs. contributing.md for devs. PR450,455, Barbone.
as opposed to 32-bit. While this does modify the ABI, most code will just
need to recompile against the new library as compilers will silently upcast
any 32-bit integers to 64-bit when calling cufinufft(f)_setpts. Note that
internally, 32-bit integers are still used, so calling cufinufft with more
than 2e9 points will fail. This restriction may be lifted in the future.
It now auto-selects compiler flags based on those supported on all OSes, and
has support for Windows (llvm, msvc), Linux (llvm, gcc) and MacOS (llvm, gcc).
upsampfac=1.25 on GPU, for the first time.
precision denormals to 0 and using fma where possible.
performance
test/finufft?d_test.cpp to reduce CI fails due to random numbers on some
platforms in single-prec (with DUCC, etc). (Barnett PR516)
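As one illustration of the modeord option mentioned in the changelog above (a sketch using the CPU Python interface, which has had this option; the changelog item adds the same choice on the GPU): modeord=1 returns modes in FFT-style order, matching numpy.fft conventions without an fftshift.

```python
import numpy as np
import finufft

rng = np.random.default_rng(2)
x = rng.uniform(-np.pi, np.pi, 1000)
c = rng.standard_normal(1000) + 0j
N = 64

f_cmcl = finufft.nufft1d1(x, c, N)             # default: CMCL order, -N/2..N/2-1
f_fft = finufft.nufft1d1(x, c, N, modeord=1)   # FFT-style order: 0..N/2-1, -N/2..-1

# fftshift maps FFT-style ordering onto the centered (CMCL) ordering.
assert np.allclose(np.fft.fftshift(f_fft), f_cmcl)
```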