Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow-up to the big reorg PR #584

Merged
merged 13 commits into from
Nov 5, 2024

Conversation

mreineck
Copy link
Collaborator

New attempt because of unexpected breakage

@ahbarnett ahbarnett self-requested a review October 22, 2024 18:35
Copy link
Collaborator Author

@mreineck mreineck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some explanatory comments

CMakeLists.txt Show resolved Hide resolved
include/finufft/utils.h Show resolved Hide resolved
src/spreadinterp.cpp Show resolved Hide resolved
src/spreadinterp.cpp Show resolved Hide resolved
src/spreadinterp.cpp Show resolved Hide resolved
src/spreadinterp.cpp Show resolved Hide resolved
src/spreadinterp.cpp Show resolved Hide resolved
@mreineck
Copy link
Collaborator Author

mreineck commented Oct 26, 2024

One more thing: I found a way of bypassing the FFTW memory handling functions entirely, so that we can work exclusively with vectors inside FINUFFT. It's pretty nice, but it requires the addition of an aligned allocator class, which looks like this

template<typename ElementType, std::size_t ALIGNMENT_IN_BYTES=64>
class AlignedAllocator
{
private:
  static_assert(
    ALIGNMENT_IN_BYTES >= alignof(ElementType),
    "Beware that types like int have minimum alignment requirements "
    "or access will result in crashes."
  );

public:
  using value_type = ElementType;
  static std::align_val_t constexpr ALIGNMENT{ ALIGNMENT_IN_BYTES };

    /**
     * This is only necessary because AlignedAllocator has a second template
     * argument for the alignment that will make the default
     * std::allocator_traits implementation fail during compilation.
     * @see https://stackoverflow.com/a/48062758/2191065
     */
    template<class OtherElementType>
    struct rebind
    {
      using other = AlignedAllocator<OtherElementType, ALIGNMENT_IN_BYTES>;
    };

public:
  constexpr AlignedAllocator() noexcept = default;
  constexpr AlignedAllocator( const AlignedAllocator& ) noexcept = default;

  template<typename U>
  constexpr AlignedAllocator( AlignedAllocator<U, ALIGNMENT_IN_BYTES> const& ) noexcept
    {}

  [[nodiscard]] ElementType*
  allocate( std::size_t nElementsToAllocate )
    {
    if ( nElementsToAllocate
        > std::numeric_limits<std::size_t>::max() / sizeof( ElementType ) ) {
       throw std::bad_array_new_length();
    }

  auto const nBytesToAllocate = nElementsToAllocate * sizeof( ElementType );
  return reinterpret_cast<ElementType*>(
    ::operator new[]( nBytesToAllocate, ALIGNMENT ) );
  }

  void deallocate(ElementType* allocatedPointer,
                [[maybe_unused]] std::size_t  nBytesAllocated )
  {
    /* According to the C++20 draft n4868 § 17.6.3.3, the delete operator
     * must be called with the same alignment argument as the new expression.
     * The size argument can be omitted but if present must also be equal to
     * the one used in new. */
    ::operator delete[]( allocatedPointer, ALIGNMENT );
  }
};

The code comes from https://stackoverflow.com/questions/60169819/modern-approach-to-making-stdvector-allocate-aligned-memory.

Do you think this is worth introducing? It's more code, but it removes every malloc/free/new/delete from finufft, as well as the need to check for null pointers returned from allocation functions.

@DiamonDinoia
Copy link
Collaborator

DiamonDinoia commented Oct 27, 2024

ALIGNMENT_IN_BYTES

why not ALIGNMENT_IN_BYTES=alignof(ElementType) as a member instead of a template?
Also, I is it possible to do alignas(alignment) std::vector<T> and pass the vector.data()?

@mreineck
Copy link
Collaborator Author

why not ALIGNMENT_IN_BYTES=alignof(ElementType) as a member instead of a template?

That would be a no-op, because every data type is automatically aligned to at least alignof(ElementType).

The point is that FFTW wants "overaligned" pointers. They point to, say, double, but should be aligned to the required alignment of an AVX datatype on an AVX-capable CPU, which is 32 bytes. If we rquire an alignment of 64 bytes, we automatically support everything up to AVX512.

Also, I is it possible to do alignas(alignment) std::vector and pass the vector.data()?

That would align the vector object itself, but not its internal data pointer.

@DiamonDinoia
Copy link
Collaborator

I see,

then I suggest using xsimd aligned allocator since it is already a dependency: xsimd::aligned_allocator<T>

@mreineck
Copy link
Collaborator Author

mreineck commented Oct 27, 2024

Thanks a lot, I wasn't aware of this class! This makes everything even easier.

@mreineck
Copy link
Collaborator Author

OK, it seems that we need to add the include path for xsimd to the test codes.

@mreineck
Copy link
Collaborator Author

Sorry, I don't think I know enough cmake to do this on my own.

@DiamonDinoia
Copy link
Collaborator

Can I push to this branch with cmake changes? I might have time on Tue/Wed.

@mreineck
Copy link
Collaborator Author

Absolutely, please do!

@mreineck
Copy link
Collaborator Author

Ah, perhaps I managed to fix it. To add the include directory, you have to tell cmake to link the library ... not the most intuitive thing.

makefile Show resolved Hide resolved
@DiamonDinoia
Copy link
Collaborator

Ah, perhaps I managed to fix it. To add the include directory, you have to tell cmake to link the library ... not the most intuitive thing.

That works, Linking xsimd to the target is the trick.

@DiamonDinoia
Copy link
Collaborator

Not sure if this PR is the right place but we could take the opportunity to align all the data that can benefit from it now. We could used aligned allocator in the std containers introduced. In a future PR, I will also sweep on the GPU side to align data to be transferred to the GPU as aligned PCIe sends/receive are also faster.

@mreineck
Copy link
Collaborator Author

It is certainly possible to use this for aligning more vectors where needed, but I recommend to check whether it really makes a measurable difference. If you add a non-default allocator to a vector, this vector can no longer be passed to functions that expect, say, a simple const vector<T> &, and this can be quite a nuisance.
To be honest, I'm not sure how much the extra alignment helps FFTW in this context, especially since you typically use FFTW_ESTIMATE for planning. But I had to be extra cautious here, because I don't want to be accused for putting FFTW at a disadvantage in comparisons to the ducc FFT :-)

@DiamonDinoia
Copy link
Collaborator

DiamonDinoia commented Oct 30, 2024

It is certainly possible to use this for aligning more vectors where needed, but I recommend to check whether it really makes a measurable difference. If you add a non-default allocator to a vector, this vector can no longer be passed to functions that expect, say, a simple const vector<T> &, and this can be quite a nuisance. To be honest, I'm not sure how much the extra alignment helps FFTW in this context, especially since you typically use FFTW_ESTIMATE for planning. But I had to be extra cautious here, because I don't want to be accused for putting FFTW at a disadvantage in comparisons to the ducc FFT :-)

Yes, It is possible to define an AlignedVector<T> and pass that instead.
Alignment, impacts only certain architectures mainly older AMD in my experience. Newer should be okay. AFAIK FFTW crashes when the alignment is not followed.

@DiamonDinoia
Copy link
Collaborator

It is certainly possible to use this for aligning more vectors where needed, but I recommend to check whether it really makes a measurable difference. If you add a non-default allocator to a vector, this vector can no longer be passed to functions that expect, say, a simple const vector<T> &, and this can be quite a nuisance. To be honest, I'm not sure how much the extra alignment helps FFTW in this context, especially since you typically use FFTW_ESTIMATE for planning. But I had to be extra cautious here, because I don't want to be accused for putting FFTW at a disadvantage in comparisons to the ducc FFT :-)

Yes, It is possible to define an AlignedVector<T> and pass that instead. Alignment, impacts only certain architectures mainly older AMD in my experience. Newer should be okay. AFAIK FFTW crashes when the alignment is not followed.

It is not a bad idea to have an allocator inside the plan. One can experiment with caching allocators, aligned allocators and so on to tweak performance. In c++ would not be wild to support the allocator to be passed to finufft, for example a bigger project or exotic architecture might have his own as can pass it to finufft.

@mreineck
Copy link
Collaborator Author

AFAIK FFTW crashes when the alignment is not followed.

Just to clarify: this only happens if you create a plan with aligned buffers, and then execute it (using the guru interface) on less aligned data. Otherwise FFTW will deal with "standard" aligned data, just perhaps not quite as efficiently.

src/spreadinterp.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@DiamonDinoia DiamonDinoia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Copy link
Collaborator

@DiamonDinoia DiamonDinoia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready to merge from my side.

perftest/CMakeLists.txt Outdated Show resolved Hide resolved
Copy link
Collaborator

@ahbarnett ahbarnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Martin - thanks! Took a while to go through it all (partly to understand various Class features).
My summary of your changes is in the google-doc.
There are a couple of questions and 1-2 very minor changes - should take you 15 mins. Then it can definitely come in.
Thanks so much, Alex

CPX a = (iflag > 0) ? exp(IMA * x[j]) : exp(-IMA * x[j]);
CPX p = pow(a, (FLT)kmin); // starting phase for most neg freq
CPX cc = c[j]; // no 1/nj prefac
std::complex<T> a = (iflag > 0) ? exp(std::complex<T>(0, 1) * x[j])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can some definition be done so that I is available, as std::complex(0,1) of the right T ? We don't want to have to type this each time I is needed :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but actually I suggest to move to std::polar(a,b) instead at some point (which computes a*exp(i*b)).
Having a templated constant I doesn't look very nice either ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I happen to like "I" or "IMA" or some template for 0+1i. It does not appear enough to fight about it :)

src/spreadinterp.cpp Show resolved Hide resolved
include/finufft/finufft_core.h Outdated Show resolved Hide resolved
include/finufft/finufft_core.h Show resolved Hide resolved
include/finufft/finufft_core.h Show resolved Hide resolved
include/finufft/utils.h Show resolved Hide resolved
include/finufft_eitherprec.h Outdated Show resolved Hide resolved
makefile Outdated Show resolved Hide resolved
makefile Show resolved Hide resolved
makefile Show resolved Hide resolved
@mreineck
Copy link
Collaborator Author

mreineck commented Nov 5, 2024

For some reason my comment about std::norm being the field norm ended up somewhere near the absolute top of the PR's discussion, so I'll repeat it here.

std::norm claims to be the field norm, which justifies the square.
From a technical standpoint I totally understand this choice, since it avoids the (comparatively expensive) square root and produces a quantity that is needed very often (also std::abs already exists if you want the "normal" norm).
I'm using the function often, since it it has just 3 arithmetic operations compared to the (theoretical) 7 operations (+ one copy) of (a*conj(a)).real(), but the optimizer will most likely produce the same code for both.

Copy link
Collaborator

@ahbarnett ahbarnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - happy to merge.

@ahbarnett ahbarnett merged commit c2b3215 into flatironinstitute:master Nov 5, 2024
167 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants