Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a user level locking mechanism for FFTW #548

Merged
merged 11 commits into from
Sep 18, 2024
Merged
6 changes: 6 additions & 0 deletions CHANGELOG
Original file line number Diff line number Diff line change
@@ -1,6 +1,12 @@
List of features / changes made / release notes, in reverse chronological order.
If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately).

V 2.3.1

* Add options for user-provided FFTW locker (PR548, Blackwell). These options can be be
used to prevent crashes when a user is creating/destroying FFTW plans and
FINUFFT plans in threads simultaneously.

V 2.3.0 (9/5/24)

* Switched C++ standards from C++14 to C++17, allowing various templating
Expand Down
100 changes: 87 additions & 13 deletions docs/opts.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ to the simple, vectorized, or guru makeplan routines.
Recall how to do this from C++:

.. code-block:: C++

// (... set up M,x,c,tol,N, and allocate F here...)
finufft_opts* opts;
finufft_default_opts(opts);
Expand All @@ -30,7 +30,7 @@ Recall how to do this from C++:
This setting produces more timing output to ``stdout``.

.. warning::

In C/C++ and Fortran, don't forget to call the command which sets default options
(``finufft_default_opts`` or ``finufftf_default_opts``)
before you start changing them and passing them to FINUFFT.
Expand All @@ -51,9 +51,9 @@ Here are their default settings (from ``src/finufft.cpp:finufft_default_opts``):
.. literalinclude:: ../src/finufft.cpp
:start-after: @defopts_start
:end-before: @defopts_end

As for quick advice, the main options you'll want to play with are:

- ``modeord`` to flip ("fftshift") the Fourier mode ordering
- ``debug`` to look at timing output (to determine if your problem is spread/interpolation dominated, vs FFT dominated)
- ``nthreads`` to run with a different number of threads than the current maximum available through OpenMP (a large number can sometimes be detrimental, and very small problems can sometimes run faster on 1 thread)
Expand Down Expand Up @@ -92,15 +92,15 @@ Data handling options
.. note:: The index *sets* are the same in the two ``modeord`` choices; their ordering differs only by a cyclic shift. The FFT ordering cyclically shifts the CMCL indices $\mbox{floor}(N/2)$ to the left (often called an "fftshift").

**chkbnds**: [DEPRECATED] has no effect.


Diagnostic options
~~~~~~~~~~~~~~~~~~~~~~~

**debug**: Controls the amount of overall debug/timing output to stdout.

* ``debug=0`` : silent

* ``debug=1`` : print some information

* ``debug=2`` : prints more information
Expand All @@ -113,11 +113,11 @@ Diagnostic options

* ``spread_debug=2`` : prints lots. This can print thousands of lines since it includes one line per *subproblem*.


**showwarn**: Whether to print warnings (these go to stderr).

* ``showwarn=0`` : suppresses such warnings

* ``showwarn=1`` : prints warnings


Expand Down Expand Up @@ -173,19 +173,93 @@ for only two settings, as follows. Otherwise, setting it to zero chooses a good
**spread_thread**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), controls how multithreading is used to spread/interpolate each batch of data.

* ``spread_thread=0`` : makes an automatic choice between the below. Recommended.

* ``spread_thread=1`` : acts on each vector in the batch in sequence, using multithreaded spread/interpolate on that vector. It can be slightly better than ``2`` for large problems.

* ``spread_thread=2`` : acts on all vectors in a batch (of size chosen typically to be the number of threads) simultaneously, assigning each a thread which performs a single-threaded spread/interpolate. It is much better than ``1`` for all but large problems. (Historical note: this was used by Melody Shih for the original "2dmany" interface in 2018.)

.. note::

Historical note: A former option ``3`` has been removed. This was like ``2`` except allowing nested OMP parallelism, so multi-threaded spread-interpolate was used for each of the vectors in a batch in parallel. This was used by Andrea Malleo in 2019. We have not yet found a case where this beats both ``1`` and ``2``, hence removed it due to complications with changing the OMP nesting state in both old and new OMP versions.


**maxbatchsize**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), set the largest batch size of data vectors.
Here ``0`` makes an automatic choice. If you are unhappy with this, then for small problems it should equal the number of threads, while for large problems it appears that ``1`` often better (since otherwise too much simultaneous RAM movement occurs). Some further work is needed to optimize this parameter.

**spread_nthr_atomic**: if non-negative: for numbers of threads up to this value, an OMP critical block for ``add_wrapped_subgrid`` is used in spreading (type 1 transforms). Above this value, instead OMP atomic writes are used, which scale better for large thread numbers. If negative, the heuristic default in the spreader is used, set in ``src/spreadinterp.cpp:setup_spreader()``.

**spread_max_sp_size**: if positive, overrides the maximum subproblem (chunking) size for multithreaded spreading (type 1 transforms). Otherwise the default in the spreader is used, set in ``src/spreadinterp.cpp:setup_spreader()``, which we believe is a decent heuristic for Intel i7 and xeon machines.


Thread safety options (advanced)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, with FFTW as the FFT library, FINUFFT is thread safe so long as no other threads are calling FFTW plan creation/destruction routines independently of FINUFFT. If these FFTW routines are called outside of FINUFFT, then the program is liable to crash. In most cases, the calling program can simply call the FFTW routine ``fftw_make_planner_thread_safe()`` before threading out and thread safety will be maintained. However, in instances where this is less desirable, we provide a means to provide your own FFTW locking mechanism. The following example code should exercise FFTW thread safety, and can be built with ``c++ thread_test.cpp -o thread_test -lfinufft -lfftw3_threads -lfftw3 -fopenmp -std=c++11``, assuming the finufft include and library paths are set.

.. code-block:: C++


// thread_test.cpp
blackwer marked this conversation as resolved.
Show resolved Hide resolved
#include <vector>
#include <mutex>
#include <complex>

#include <fftw3.h>
#include <finufft.h>
#include <omp.h>

using namespace std;

constexpr int N = 65384;

void locker(void *lck) { reinterpret_cast<recursive_mutex *>(lck)->lock(); }
void unlocker(void *lck) { reinterpret_cast<recursive_mutex *>(lck)->unlock(); }

int main() {
int64_t Ns[3]; // guru describes mode array by vector [N1,N2..]
Ns[0] = N;
recursive_mutex lck;

finufft_opts opts;
finufft_default_opts(&opts);
opts.nthreads = 1;
opts.debug = 0;
opts.fftw_lock_fun = locker;
opts.fftw_unlock_fun = unlocker;
opts.fftw_lock_data = reinterpret_cast<void *>(&lck);

// random nonuniform points (x) and complex strengths (c)
vector<complex<double>> c(N);

// init FFTW threads
fftw_init_threads();

// FFTW and FINUFFT execution using OpenMP parallelization
#pragma omp parallel for
for (int j = 0; j < 100; ++j) {
// allocate output array for FFTW...
vector<complex<double>> F1(N);

// FFTW plan
lck.lock();
fftw_plan_with_nthreads(1);
fftw_plan plan = fftw_plan_dft_1d(N, reinterpret_cast<fftw_complex*>(c.data()),
reinterpret_cast<fftw_complex*>(F1.data()),
FFTW_FORWARD, FFTW_ESTIMATE);
fftw_destroy_plan(plan);
lck.unlock();

// FINUFFT plan
finufft_plan nufftplan;
finufft_makeplan(1, 1, Ns, 1, 1, 1e-6, &nufftplan, &opts);
finufft_destroy(nufftplan);
}

return 0;
}

**fftw_lock_fun**: ``void (fun*)(void *)`` C-style callback function to lock calls to FFTW plan manipulation routines. A ``nullptr`` or ``0`` value will be ignored. If non-null, ``fftw_unlock_fun`` must also be set.

**fftw_unlock_fun**: ``void (fun*)(void *)`` C-style callback function to unlock calls to FFTW plan manipulation routines. A ``nullptr`` or ``0`` value will be ignored. If non-null, ``fftw_lock_fun`` must also be set.

**fftw_lock_data**: ``void *data`` pointer, typically to the lock object itself. Pointer will be passed to ``fftw_lock_fun`` and ``fftw_unlock_fun`` if they are set.
1 change: 1 addition & 0 deletions include/finufft_errors.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,4 +23,5 @@
#define FINUFFT_ERR_BINSIZE_NOTVALID 18
#define FINUFFT_ERR_INSUFFICIENT_SHMEM 19
#define FINUFFT_ERR_NUM_NU_PTS_INVALID 20
#define FINUFFT_ERR_LOCK_FUNS_INVALID 21
#endif
6 changes: 6 additions & 0 deletions include/finufft_opts.h
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,12 @@ typedef struct finufft_opts { // defaults see finufft.cpp:finufft_default_opts()
// atomic
int spread_max_sp_size; // if >0, overrides spreader (dir=1) max subproblem size
// sphinx tag (don't remove): @opts_end

// User can provide their own FFTW planner lock functions for thread safety
// Null values ignored and use a default lock function (both or neither must be set)
void (*fftw_lock_fun)(void *); // Function ptr that locks the FFTW planner
void (*fftw_unlock_fun)(void *); // Function ptr that unlocks the FFTW planner
void *fftw_lock_data; // Data to pass to the lock functions (e.g. a mutex)
} finufft_opts;

// Those of the above of the form spread_* indicate pass through to finufft_spread_opts
Expand Down
5 changes: 4 additions & 1 deletion python/finufft/finufft/_finufft.py
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,10 @@ class FinufftOpts(ctypes.Structure):
('spread_thread', c_int),
('maxbatchsize', c_int),
('spread_nthr_atomic', c_int),
('spread_max_sp_size', c_int)]
('spread_max_sp_size', c_int),
('fftw_lock_fun', c_void_p),
('fftw_unlock_fun', c_void_p),
('fftw_lock_data', c_void_p)]


FinufftPlan = c_void_p
Expand Down
48 changes: 41 additions & 7 deletions src/finufft.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -86,9 +86,32 @@ namespace finufft {
namespace common {

#ifndef FINUFFT_USE_DUCC0
// Technically global state...
// Needs to be static to avoid name collision with SINGLE/DOUBLE
static std::mutex fftw_lock;
// The only global state in finufft. Inline ensures variables shared between float/double
inline std::mutex fftw_lock;
inline bool did_fftw_init = false;

class FFTWLockGuard {
public:
FFTWLockGuard(void (*lock_fun)(void *), void (*unlock_fun)(void *), void *lock_data)
blackwer marked this conversation as resolved.
Show resolved Hide resolved
: unlock_fun_(unlock_fun), lock_data_(lock_data), fftw_lock_(fftw_lock) {
if (lock_fun)
lock_fun(lock_data_);
else
fftw_lock_.lock();
}
~FFTWLockGuard() {
if (unlock_fun_)
unlock_fun_(lock_data_);
else
fftw_lock_.unlock();
}

private:
void (*unlock_fun_)(void *);
void *lock_data_;
std::mutex &fftw_lock_;
};

#endif

// We macro because it has no FLT args but gets compiled for both prec's...
Expand Down Expand Up @@ -531,6 +554,9 @@ void FINUFFT_DEFAULT_OPTS(finufft_opts *o)
o->maxbatchsize = 0;
o->spread_nthr_atomic = -1;
o->spread_max_sp_size = 0;
o->fftw_lock_fun = nullptr;
o->fftw_unlock_fun = nullptr;
o->fftw_lock_data = nullptr;
// sphinx tag (don't remove): @defopts_end
}

Expand Down Expand Up @@ -568,6 +594,12 @@ int FINUFFT_MAKEPLAN(int type, int dim, BIGINT *n_modes, int iflag, int ntrans,
fprintf(stderr, "[%s] ntrans (%d) should be at least 1.\n", __func__, ntrans);
return FINUFFT_ERR_NTRANS_NOTVALID;
}
if (!p->opts.fftw_lock_fun != !p->opts.fftw_unlock_fun) {
fprintf(stderr, "[%s] fftw_(un)lock functions should be both null or both set\n",
__func__);
return FINUFFT_ERR_LOCK_FUNS_INVALID;
;
}

// get stuff from args...
p->type = type;
Expand Down Expand Up @@ -660,8 +692,8 @@ int FINUFFT_MAKEPLAN(int type, int dim, BIGINT *n_modes, int iflag, int ntrans,
// Now place FFTW initialization in a lock, courtesy of OMP. Makes FINUFFT
// thread-safe (can be called inside OMP)
{
static bool did_fftw_init = false; // the only global state of FINUFFT
std::lock_guard<std::mutex> lock(fftw_lock);
FFTWLockGuard lock(p->opts.fftw_lock_fun, p->opts.fftw_unlock_fun,
p->opts.fftw_lock_data);
if (!did_fftw_init) {
FFTW_INIT(); // setup FFTW global state; should only do once
did_fftw_init = true; // ensure other FINUFFT threads don't clash
Expand Down Expand Up @@ -753,7 +785,8 @@ int FINUFFT_MAKEPLAN(int type, int dim, BIGINT *n_modes, int iflag, int ntrans,
// fftw_plan_many_dft args: rank, gridsize/dim, howmany, in, inembed, istride,
// idist, ot, onembed, ostride, odist, sign, flags
{
std::lock_guard<std::mutex> lock(fftw_lock);
FFTWLockGuard lock(p->opts.fftw_lock_fun, p->opts.fftw_unlock_fun,
p->opts.fftw_lock_data);

// FFTW_PLAN_TH sets all future fftw_plan calls to use nthr_fft threads.
// FIXME: Since this might override what the user wants for fftw, we'd like to
Expand Down Expand Up @@ -1227,7 +1260,8 @@ int FINUFFT_DESTROY(FINUFFT_PLAN p)
if (p->type == 1 || p->type == 2) {
#ifndef FINUFFT_USE_DUCC0
{
std::lock_guard<std::mutex> lock(fftw_lock);
FFTWLockGuard lock(p->opts.fftw_lock_fun, p->opts.fftw_unlock_fun,
p->opts.fftw_lock_data);
FFTW_DE(p->fftwPlan);
}
#endif
Expand Down
11 changes: 11 additions & 0 deletions test/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,17 @@ add_test(
COMMAND testutils
WORKING_DIRECTORY ${CMAKE_BINARY_DIR})

if(NOT FINUFFT_USE_DUCC0)
add_executable(fftw_lock_test fftw_lock_test.cpp)
target_compile_features(fftw_lock_test PRIVATE cxx_std_17)
finufft_link_test(fftw_lock_test)

add_test(
NAME run_fftw_lock_test
COMMAND fftw_lock_test
WORKING_DIRECTORY ${CMAKE_BINARY_DIR})
endif()

# Add ctest definitions that run at both precisions...
function(add_tests_with_prec PREC REQ_TOL CHECK_TOL SUFFIX)
# All of the following should be run at OMP_NUM_THREADS=4 or something small,
Expand Down
56 changes: 56 additions & 0 deletions test/fftw_lock_test.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
#include <complex>
#include <mutex>
#include <vector>

#include <fftw3.h>
#include <finufft.h>
#include <omp.h>

constexpr int N = 65384;

void locker(void *lck) { reinterpret_cast<std::mutex *>(lck)->lock(); }
void unlocker(void *lck) { reinterpret_cast<std::mutex *>(lck)->unlock(); }

int main() {
int64_t Ns[3]; // guru describes mode array by vector [N1,N2..]
Ns[0] = N;
std::mutex lck;

finufft_opts opts;
finufft_default_opts(&opts);
opts.nthreads = 1;
opts.debug = 0;
opts.fftw_lock_fun = locker;
opts.fftw_unlock_fun = unlocker;
opts.fftw_lock_data = reinterpret_cast<void *>(&lck);

// random nonuniform points (x) and complex strengths (c)...
std::vector<std::complex<double>> c(N);

omp_set_num_threads(8);

// init FFTW threads
fftw_init_threads();

// FFTW and FINUFFT execution using OpenMP parallelization
#pragma omp parallel for
for (int j = 0; j < 100; ++j) {
// allocate output array for FFTW...
std::vector<std::complex<double>> F1(N);

// FFTW plan
lck.lock();
fftw_plan_with_nthreads(1);
fftw_plan plan = fftw_plan_dft_1d(N, reinterpret_cast<fftw_complex *>(c.data()),
reinterpret_cast<fftw_complex *>(F1.data()),
FFTW_FORWARD, FFTW_ESTIMATE);
fftw_destroy_plan(plan);
lck.unlock();

// FINUFFT plan
finufft_plan nufftplan;
finufft_makeplan(1, 1, Ns, 1, 1, 1e-6, &nufftplan, &opts);
finufft_destroy(nufftplan);
}
return 0;
}
Loading