Replies: 5 comments 2 replies
-
Dear Martin,
This is (another!) a nice idea that I had not noticed. We (mostly Marco)
have sped up kernel evaluation in a different way, by explicit vectorization
using the xsimd.hpp library. You can see it in a PR (which also applies this
to the spreader loops). E.g., here are my spreader benchmarks vs.
master (on 1 thread of a 5700U AMD laptop):
[image: master-vs-svec2l_gcc114_5700U_nthr1_spread_M1.0e7_N1.0e6.png]
A significant fraction of this is due to kernel eval. We should try to apply
your even/odd idea to this xsimd kernel eval - that code is currently not
complicated. @lu1and10 @DiamonDinoia
However, at this point the cost is dominated by moving registers around
rather than by flops...
Best wishes, Alex
On Thu, Jun 20, 2024 at 3:05 AM mreineck <***@***.***> wrote:
I realized that it is possible to (roughly) halve the necessary work (and
coefficient storage) when evaluating the polynomial kernel approximation.
Disclaimer: this sounds nice, but is not terribly significant except
perhaps in 1D; in 2D and 3D, applying the kernel is typically slower than
computing it. Nevertheless I think it's a neat little micro-optimization.
The idea is the following: the Horner coefficients for the first and last
kernel interval are (due to the kernel being symmetric) identical for even
powers and identical with opposite signs for odd powers. Same for the
second and next-to-last kernel interval etc. So (roughly) half the
information in the coefficient tables is redundant.
Computationally this can be used by computing even and odd powers
separately in Horner's method.
I.e. instead of
double res = coeff[0];
for (int i = 1; i <= degree; ++i)
    res = res * x + coeff[i];
one can do something like this (also known as Estrin's scheme):
double res_odd = coeff[0];
double res_even = coeff[1];
double xsq = x * x;
for (int i = 2; i <= degree; i += 2) {
    res_odd = res_odd * xsq + coeff[i];
    res_even = res_even * xsq + coeff[i + 1];
}
res = x * res_odd + res_even;
res_opposite = res_even - x * res_odd;
So one basically gets roughly twice the results for approximately the same
number of operations.
(Of course one has to be careful with many things: in this simple example
degree must be odd, and vectorized evaluation introduces additional
complications.)
I plan to use this in ducc0.nufft in the future, although it doesn't show
noticeable performance improvements. For large kernel supports, it saves
roughly 1KB of L1 cache usage, which can't be a bad thing.
--
Alex Barnett, Center for Computational Mathematics, Flatiron Institute
http://users.flatironinstitute.org/~ahb | 646-876-5942
-
Dear Alex, yes, vectorization of the kernel evaluation gives a really big boost. As far as I understand, this is the part of the code (in contrast to spreading/interpolating and FFT) where the CPU can be pushed closest to its theoretical peak performance. I am already benchmarking Marco's branch and see the really nice improvement! Of course that only helps with 2D and 3D transforms, and it means more code replication, which is not nice ... but I think the performance gains are worth it. Cheers,
-
By interleaving do you mean: don't precompute ker1,2,3 (in the x,y,z directions), rather only precompute ker1 (for the inner x), but evaluate ker2,3 on the fly just outside the innermost loop over x? PS: I don't think my pic came through via the email response to GH, so here it is again: [image: master-vs-svec2l_gcc114_5700U_nthr1_spread_M1.0e7_N1.0e6.png]
-
Hi Martin, I agree with both suggestions; that is something we should do.
-
@lu1and10 please play with the even/odd symmetry idea for the xsimd kernel eval when you have a chance - should be fun.
-