faster zeroing of internal spreadinterp array. #466

ahbarnett · 2024-05-07T21:13:26Z

ahbarnett
May 7, 2024
Maintainer

What is the best way to zero the output array? It currently takes 1.5 s for a 200M array on an AMD laptop, which is too slow: about 10% of a 1e8 1e8 1d1 transform @1e-6.

Switch to calloc in makeplan, e.g. in finufft.cpp:

Line 716 in cc8629f

p->fwBatch = FFTW_ALLOC_CPX(p->nf * p->batchSize); // the big workspace

But we use fftw_alloc_complex() there, so a calloc-style replacement would need to keep the alignment, e.g. via alignas with 64 bytes = AVX-512 width.
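To make the alignment point concrete, here is a minimal sketch of a 64-byte-aligned, zeroed allocation using only the C++17 standard library. The helper names `alloc_zeroed_aligned64` and `is_aligned64` are hypothetical and not part of FINUFFT or FFTW, and `std::aligned_alloc` is assumed to be available (it is missing on some toolchains). Note that this zeroes eagerly with `memset`, so it addresses only the alignment concern, not the cost of zeroing.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

using cpx = std::complex<double>;

// Hypothetical helper: allocate n complex doubles on a 64-byte boundary
// and set every byte to zero (all-zero bytes are 0.0 for IEEE doubles).
inline cpx *alloc_zeroed_aligned64(std::size_t n) {
  const std::size_t align = 64;
  std::size_t bytes = n * sizeof(cpx);
  // std::aligned_alloc expects the size to be a multiple of the alignment.
  bytes = (bytes + align - 1) / align * align;
  if (bytes == 0) bytes = align;
  void *p = std::aligned_alloc(align, bytes);
  if (p) std::memset(p, 0, bytes);
  return static_cast<cpx *>(p);
}

// Hypothetical helper: check that a pointer sits on a 64-byte boundary.
inline bool is_aligned64(const void *p) {
  return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

Memory obtained this way is released with `std::free`.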

We would need to add a flag "fwBatch_is_zeroed" to the plan, zero fwBatch whenever this flag is false (at the start of each batch inside the execute), and set the flag to false after each batch in the execute.
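The flag logic could look roughly like the sketch below. `ToyPlan`, `toy_makeplan` and `toy_execute_batch` are hypothetical stand-ins, not the real FINUFFT plan or functions; only `fwBatch` and `fwBatch_is_zeroed` are names taken from the proposal, and a `std::vector` is assumed in place of the FFTW allocation to keep the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the plan; only the fields needed here.
struct ToyPlan {
  std::vector<std::complex<double>> fwBatch;  // the big workspace
  bool fwBatch_is_zeroed = false;             // proposed flag
};

// Stand-in for makeplan: the workspace starts out zeroed, so the flag is true.
inline ToyPlan toy_makeplan(std::size_t nf, std::size_t batchSize) {
  ToyPlan p;
  p.fwBatch.assign(nf * batchSize, std::complex<double>(0.0, 0.0));
  p.fwBatch_is_zeroed = true;
  return p;
}

// Stand-in for one batch inside execute.
// Returns true if this call had to zero the workspace itself.
inline bool toy_execute_batch(ToyPlan &p) {
  bool zeroed_now = false;
  if (!p.fwBatch_is_zeroed) {
    std::fill(p.fwBatch.begin(), p.fwBatch.end(),
              std::complex<double>(0.0, 0.0));
    zeroed_now = true;
  }
  // Spreading would accumulate into fwBatch here; mimic it with one write.
  if (!p.fwBatch.empty()) p.fwBatch[0] += std::complex<double>(1.0, 0.0);
  p.fwBatch_is_zeroed = false;  // workspace is dirty after the batch
  return zeroed_now;
}
```

With this arrangement only the first batch after makeplan skips the zeroing pass; every later batch still pays for it.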

Remove the zeroing loop from spreadinterp.cpp.

First, benchmark calloc vs naive zeroing, etc...
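A minimal sketch of the building blocks for such a benchmark follows. All names (`time_seconds`, `alloc_zeroed_calloc`, `zero_naive`, `abs_sum`) are hypothetical helpers, not FINUFFT code. It assumes IEEE doubles, so that the all-zero bytes returned by `calloc` read as 0.0, and it ignores alignment, which plain `calloc` does not guarantee beyond the platform default.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdlib>

using cpx = std::complex<double>;

// Time a single call of f, in seconds.
template <class F> double time_seconds(F &&f) {
  const auto t0 = std::chrono::steady_clock::now();
  f();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

// Option A: let calloc hand back zeroed memory.
inline cpx *alloc_zeroed_calloc(std::size_t n) {
  return static_cast<cpx *>(std::calloc(n, sizeof(cpx)));
}

// Option B: naive zeroing loop over an existing buffer.
// Caveat: if this directly follows a malloc, an optimizing compiler may
// merge the two into a calloc call, so inspect what is actually timed.
inline void zero_naive(cpx *a, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) a[i] = cpx(0.0, 0.0);
}

// Touch every element; the result is 0 exactly when the buffer is all zero.
inline double abs_sum(const cpx *a, std::size_t n) {
  double s = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    s += std::abs(a[i].real()) + std::abs(a[i].imag());
  return s;
}
```

A comparison would then time `alloc_zeroed_calloc(n)` followed by `abs_sum` against an allocation followed by `zero_naive`, for example `time_seconds([&] { zero_naive(buf, n); })`, at the 200M size quoted above.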

mreineck · 2024-05-09T15:35:59Z

mreineck
May 9, 2024
Maintainer

One big problem with time measurement in this particular case is that you may not only be measuring the zeroing, but also the preparation of the memory by the kernel.

Assuming that you just obtained a large area of memory via, say, malloc (fftw_malloc will most likely behave the same), the returned pointer may not actually point to allocated memory. It may rather point to an as yet unloaded memory page, and the first access to it will trigger a page fault, causing the kernel to actually allocate the page and fill it with zeros by itself. Only after that will your own zeroing code take over. This will happen for every memory page touched during the zeroing loop.

You can convince yourself of this behaviour by mallocing a few terabytes of memory. This is possible even on a machine that has much less RAM+swap. Only when you start accessing a sufficiently large number of pages in that memory region will the machine run out of memory.

So if you want to measure the performance of zeroing, it's probably best to do one zeroing pass as a warm-up and then zero the array again, measuring only the second pass.
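The warm-up idea can be sketched as follows: time two consecutive zeroing passes over a freshly malloc'ed buffer and report them separately. `PassTimes`, `timed_memset` and `time_two_zeroing_passes` are hypothetical names, and how much the first pass differs from the second depends on the allocator and the operating system, so no particular ratio is assumed here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <cstring>

struct PassTimes {
  double first;   // may include page faults on freshly obtained memory
  double second;  // pages already touched by the first pass
  bool all_zero;  // sanity check on the buffer contents
};

// Zero the buffer once and return the elapsed time in seconds.
inline double timed_memset(unsigned char *a, std::size_t bytes) {
  const auto t0 = std::chrono::steady_clock::now();
  std::memset(a, 0, bytes);
  // Read one byte back through a volatile so the store is not trivially dead.
  volatile unsigned char sink = a[bytes / 2];
  (void)sink;
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

// Allocate, zero twice, verify, free. Negative times signal failure.
inline PassTimes time_two_zeroing_passes(std::size_t bytes) {
  PassTimes r{-1.0, -1.0, false};
  if (bytes == 0) return r;
  auto *a = static_cast<unsigned char *>(std::malloc(bytes));
  if (!a) return r;
  r.first = timed_memset(a, bytes);   // warm-up pass
  r.second = timed_memset(a, bytes);  // the pass worth reporting
  r.all_zero = true;
  for (std::size_t i = 0; i < bytes; ++i)
    if (a[i] != 0) { r.all_zero = false; break; }
  std::free(a);
  return r;
}
```

Comparing `first` and `second` for a buffer of a few gigabytes gives a rough idea of how much of the measured zeroing time is really page preparation by the kernel.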

It is possible to influence when memory is taken from the kernel and when it is given back via functions like mallopt (which is unfortunately not portable), but you need to be careful not to end up with a fast code which ends up eating all system memory during a potentially long run.
