Changelog

Changelog for next release


- not yet decided

Changelog for ELPA 2023.05.001
- enable gpu-streams per default for NVIDIA and AMD GPUs
- Updated / improved documentation and man pages
- Fixed compilation error on AMD GPUs
- Fixed SVE 256 compute kernels
- Allow (currently in parts of ELPA) to use NVIDIA NCCL for device to device
commpunication
- Speed up of GPU version of hermitian_multiply by up to an factor of 4
- significantly faster full-to-tridiagonal step in ELPA 1stage GPU
- significatnly faster ELPA 2stage solver on Intel GPUs
- Consistent enabling/disabling of SKEW_SYMMETRIC in header files
- new setup_gpu API function


Changelog for ELPA 2023.05.001
- added CITATION.cff file
- allow test programs to be run with 1 MPI task
- correct a memory leak in the gpu stream setup
- better handling of GPU BLAS handles
- implement the execution of the AMD HIP code path on NVIDIA GPUs
- implement the execution of the SYCL GPU code path on CPUs (debugging)
- port generalized routines to SYCL GPU
- PoC to use NVIDIA NCCL instead of MPI (not production ready)
- somewhat cleanup of documentation

Changelog for ELPA 2022.11.001
- store GPU setup per ELPA object
- clarify documentation a bit
- add a C++ interface, including an example test program
- fix a few bugs in the C interface for the ELPA solvers
- complete the C API
- make sure that OMP_NUM_THREADS is honoured even if omp_threads is not set
- fix MPI_COMMUNICATORS per ELPA object
- significantly improve the performance of the ELPA band-reduction step of 
  the 2step solver
- fix a few minor bugs in AMD GPU port: is now production ready
- allow to use NVIDIA's CUB implementation; experimental feature
- allow to use AMD's rocsolver library
- implement "HIP to ROCm" layer, in order to be able to run AMD GPU code paths
  on NVIDIA devices
- remove the neccessity to provide the CPP variable

Changelog for ELPA 2022.05.001
- implement OpenMP offloading to GPU for Intel GPU for ELPA 1 and 2 stage (
  except for "step tridi_to_band")
- implement SYCL offloading to Intel GPUs for ELPA 1 and 2 stage
- AMD GPU offload has been tested on Mi200 (also with MPI)
- can use ELPA with one individual "gpu stream" per MPI task (Nvidia and AMD
  only)
- allow steps "cholesky", "invert_trm", and "multiply_ab" to be called
  directly with GPU device pointers
- on error ELPA returns rather than aborting to give controll to calling
  application and to allow for error recovery and/or graceful abort
- allow ELPA to build with OpenMP and GPU
- fix an FPE with the Intel compiler and AVX-512 instructions and optimization
  level > -O2
- better checking of user defined options in configure
 

Changelog for ELPA 2021.11.002
- fix an error when choosing the Nvidia GPU kernel (fallback to CPU might have
  been selected)

Changelog for ELPA 2021.11.001

- support of Nvidia cusolver library to accelerate some routines (needs CUDA >= 11.4)
- experimental Nvidia GPU versions for "elpa_invert_trm" and "elpa_cholesky"
  can be tested by setting elpa_set("gpu_invert_trm",1) and
  elpa_set("gpu_cholesky",1). Is not used otherwise
- BUGFIX: error in resort_ev (also backported to 2021.05.002 and 2020.11.001)
- allow to call ELPA eigenvectors and eigenvalues also with GPU device
  pointers for the input matrix, the vectors of eigenvalues and the output 
  matrix for the eigenvectors
- EXPERIMENTAL feature:g new real GPU kernel for Nvidia A100 (provided by Nvidia): can show a
  performance boost if number of vectors per MPI task is > 20000. Most likely
  most benifit in non-MPI version
- as anounced, droping the legacy interface
- more autotuning features, for example using non blocking MPI collectives
- new version of autotunig avoiding a combinatorial grow of possibilities
  (the old autotune version can be still used if
  elpa%autotune_set_api_version(API_VERSION, error) is set to API_VERSION <
  20211125)

Changelog for ELPA 2021.05.002
- no feature changes
- correct the SO version which was wrong in ELPA 2021.05.001

Changelog for ELPA 2021.05.001

- allow the user to set the mapping of MPI tasks to GPU id per set/get
- experimental feature: port to AMD GPUS, works correctly, performance yet
  unclear; only tested --with-mpi=0
- On request, ELPA can print the pinning of MPI tasks and OpenMP thread
- support for FUGAKU: some minor fix still have to be fixed due to compiler
issues
- BUG FIX: if matrix is already banded, check whether bandwidth >= 2. DO NOT
  ALLOW a bandwidth = 1, since this would imply that the input matrix is
  already diagonal which the ELPA algorithms do not support
- BUG FIX in internal test programs: do not consider a residual of 0.0 to be
  an error
- support for skew-symmetric matrices now enabled by default
- BUG FIX in generalized case: in setups like "mpiexec -np 4 ./validate_real_double_generalized_1stage_random 90 90 45`
- ELPA_SETUPS does now (in case of MPI-runs) check whether the user-provided BLACSGRID is reasonable (i.e. ELPA does 
  _not_ rely anymore that the user does check prior to calling ELPA whether the BLACSGRID is ok) if this check fails 
  then ELPA returns with an error
- limit number of OpenMP threads to one, if MPI thread level is not at least MPI_THREAD_SERIALIZED
- allow checking of the supported threading level of the MPI library at build time

Changelog for ELPA 2020.11.001

- this release containts mostly bugfixes:
- fix determination whether a _ is needed to link Fortran to C
- fix an error in the real block4 kernel for arch64 NEON
- add missing test_scalapack_template.F90 to EXTRA_DIST list
- fix error in the GPU kernel
- switch form python2 to python3
- experimental feature: complex kernels for arch64 NEON
- experimental feature: kernels for ARM SVE

Changelog for ELPA 2020.05.001

- Enable compilation with gcc v10
- Fix a bug in elpa_multiply_a_b (GPU)
- improved documentation, including fixing of typos and errors in markdown
- Fix a bug in the calling of Cannons algorithm which might lead to crashes
for a squared process grid
- improvements and bugfixes of the ELPA2 stage GPU version, see
   https://arxiv.org/abs/2002.10991
- bugfix for the build of AVX-512 KNL kernels
- clean seperation of SIMD instructions for AVX and AVX2 kernels
- better error checking for allocations / deallocations of CPU and GPU memory
- experimental feature of matrix redistribution
- bugfix in the cpuid tests
- bugfix in elpa2_print_kernels
- bugfix when configuring --with-gpu-support-only

Changelog for ELPA 2019.11.001

- solve a bug when using parallel make builds
- check the cpuid set during build time
- add experimental feature "heterogenous-cluster-support"
- add experimental feature for 64bit integer LAS/LAPACK/SCALAPACK support
- add experimental feature for 64bit integer MPI support
- support of ELPA for real valued skew-symmetric matrices, please cite:
  https://arxiv.org/abs/1912.04062 
- cleanup of the GPU version
- bugfix in the OpenMP version
- bugfix on the Power8/9 kernels
- bugfix on ARM aarch64 FMA kernels


Changelog for ELPA 2019.05.002

- repacking of the src since the legacy interface has been forgotten in the
  2019.05.001 release

Changelog for ELPA 2019.05.001

- elpa_print_kernels supports GPU usage
- fix an error if PAPI measurements are activated
- new simple real kernels: block4 and block6
- c functions can be build with optional arguments if compiler supports it
(configure option)
- allow measurements with the likwid tool
- users can define the default-kernel at build time
- ELPA versioning number is provided in the C header files
- as announced a year ago, the following deprecated routines have been finally
removed; see DEPRECATED_FEATURES for the replacement routines , which have
been introduced a year ago. Removed routines:
  -> mult_at_b_real
  -> mult_ah_b_complex
  -> invert_trm_real
  -> invert_trm_complex
  -> cholesky_real
  -> cholesky_complex
  -> solve_tridi
- new kernels for ARM arch64 added
- fix an out-of-bound-error in elpa2


Changelog for ELPA 2018.11.001

- improved autotuning
- improved performance of generalized problem via Cannon's algorithm
- check pointing functionality of elpa objects
- store/read/resume of autotuning
- Python interface for ELPA
- more ELPA functions have an optional error argument (Fortran) or required
error argument (C) => ABI and API change


Changelog for ELPA 2018.05.001

- significant improved performance on K-computer
- added interface for the generalized eigenvalue problem
- extended autotuning functionality

Changelog for ELPA 2017.11.001

- significant improvement of performance of GPU version
- added new compute kernels for IBM Power8 and Fujistu Sparc64
  processors
- a first implementation of autotuning capability
- correct some type statements in Fortran
- correct detection of PAPI in configure step

Changelog for ELPA 2017.05.003

- remove bug in invert_triangular, which had been introduced
  in ELPA 2017.05.002

Changelog for ELPA 2017.05.002

Mainly bugfixes for ELPA 2017.05.001:
- fix memory leak of MPI communicators
- tests for hermitian_multiply, cholesky decomposition and
- deal with a problem on Debian (mawk)

Changelog for ELPA 2017.05.001

Final release of ELPA 2017.05.001
Since rc2 the following changes have been made
- more extensive tests during "make check"
- distribute missing C headers
- introduce analytic tests
- Fix stack overflow in some kernels

Changelog for ELPA 2017.05.001.rc2

This is the release candidate 2 for the ELPA 2017.05.001 version.
Additionaly to the changes from rc1, it fixes some smaller issues
- add missing script "manual_cpp"
- cleanup of code

Changelog for ELPA 2017.05.001.rc1

This is the release candidate 1 for the ELPA 2017.05.001 version.
It provides a first version of the new, more generic API of the ELPA library.
Smaller changes to the API might be possible in the upcoming release
candidates. For users, who would like to use the older API of the ELPA
library, the API as defined with release 2016.11.001.pre is frozen in and
also supported.

Apart of the API change to be more flexible for the future, this release
offers the following changes:

- faster GPU implementation, especially for ELPA 1stage
- the restriction of the block-cyclic distribution blocksize = 128 in the GPU
  case is relaxed
- Faster CPU implementation due to better blocking
- support of already banded matrices (new API only!)
- improved KNL support

Changelog for pre-release ELPA 2016.11.001.pre

This pre-release contains an experimental API which will most likely
change in the next stable release

- also suport of single-precision (real and complex case) eigenvalule problems
- GPU support in ELPA 1stage and 2stage (real and complex case)
- change of API (w.r.t. ELPA 2016.05.004) to support runtime-choice of GPU usage

Changelog for release ELPA 2016.05.004

- fix a problem with the private state of module precision
- distribute test_project with dist tarball
- generic driver routine for ELPA 1stage and 2stage
- test case for elpa_mult_at_b_real
- test case for elpa_mult_ah_b_complex
- test case for elpa_cholesky_real
- test case for elpa_cholesky_complex
- test case for elpa_invert_trm_real
- test case for elpa_invert_trm_complex
- fix building of static library
- better choice of AVX, AVX2, AVX512 kernels
- make assumed size Fortran arrays default

Changelog for release ELPA 2016.05.003

- fix a problem with the build of SSE kernels
- make some (internal) functions public, such that they
  can be used outside of ELPA
- add documentation and interfaces for new public functions
- shorten file namses and directory names for test programs
  in under to by pass "make agrument list too long" error

Changelog for release ELPA 2016.05.002

- fix problem with generated *.sh- check scripts
- name library differently if build without MPI support
- install only public modules


Changelog for release ELPA 2016.05.001

- support building without MPI for one node usage
- doxygen and man pages documentation for ELPA
- cleanup of documentation
- introduction of SSE gcc intrinsic kernels
- Remove errors due to unaligned memory
- removal of Fortran "contains functions"
- Fortran interfaces for assembly and C kernels