Replies: 5 comments 2 replies
-
Dear Martin,
This is (another!) a nice idea that I had not noticed. We (mostly Marco)
have sped up kernel evaluation in a different way, by explicit vectorization
using the xsimd.hpp library. You can see it in a PR (which also applies this
to the spreader loops). E.g., here are my spreader benchmarks vs.
master (on 1 thread of a 5700U AMD laptop):
[image: master-vs-svec2l_gcc114_5700U_nthr1_spread_M1.0e7_N1.0e6.png]
A significant fraction of this is due to kernel eval. We should try to apply
your even/odd idea to this xsimd kernel eval - that code is currently not
complicated. @lu1and10 @DiamonDinoia
However, at this point the cost is dominated by moving registers around
rather than by flops...
Best wishes, Alex
On Thu, Jun 20, 2024 at 3:05 AM mreineck <***@***.***> wrote:
I realized that it is possible to (roughly) halve the necessary work (and
coefficient storage) when evaluating the polynomial kernel approximation.
Disclaimer: this sounds nice, but is not terribly significant except
perhaps in 1D; in 2D and 3D, applying the kernel is typically slower than
computing it. Nevertheless I think it's a neat little micro-optimization.
The idea is the following: the Horner coefficients for the first and last
kernel interval are (due to the kernel being symmetric) identical for even
powers and identical with opposite signs for odd powers. Same for the
second and next-to-last kernel interval etc. So (roughly) half the
information in the coefficient tables is redundant.
Computationally this can be used by computing even and odd powers
separately in Horner's method.
I.e. instead of
double res = coeff[0];
for (int i = 1; i <= degree; ++i)
    res = res * x + coeff[i];
one can do something like this (also known as Estrin's scheme):
double res_odd = coeff[0];
double res_even = coeff[1];
double xsq = x * x;
for (int i = 2; i <= degree; i += 2) {
    res_odd = res_odd * xsq + coeff[i];
    res_even = res_even * xsq + coeff[i + 1];
}
res = x * res_odd + res_even;
res_opposite = res_even - x * res_odd;
So one basically gets roughly twice the results for approximately the same
number of operations.
(Of course one has to be careful with many things: in this simple example
degree must be odd, and vectorized evaluation introduces additional
complications.)
I plan to use this in ducc0.nufft in the future, although it doesn't show
noticeable performance improvements. For large kernel supports, it saves
roughly 1KB of L1 cache usage, which can't be a bad thing.
--
Alex Barnett, Center for Computational Mathematics, Flatiron Institute
http://users.flatironinstitute.org/~ahb | 646-876-5942
-
Dear Alex, yes, vectorization of the kernel evaluation gives a really big boost. As far as I understand, this is the part of the code (in contrast to spreading/interpolating and FFT) where the CPU can be pushed closest to its theoretical peak performance. I am already benchmarking Marco's branch and see the really nice improvement! Of course that only helps with 2D and 3D transforms, and it means more code replication, which is not nice ... but I think the performance gains are worth it. Cheers,
-
By interleaving do you mean: don't precompute ker1,2,3 (in the x,y,z directions), rather only precompute ker1 (for the inner x), but evaluate ker2,3 on the fly just outside the innermost loop over x? PS: I don't think my pic came through via the email response to GH, so here it is again: [image: master-vs-svec2l_gcc114_5700U_nthr1_spread_M1.0e7_N1.0e6.png]
-
Hi Martin, I agree with both suggestions; that is something we should do.
-
@lu1and10 please play with the even/odd symmetry idea for the xsimd kernel eval when you have a chance - should be fun.
-