forked from marekandreas/elpa
Changelog
Changelog for next release
- not yet decided
Changelog for ELPA 2023.05.001
- enable gpu-streams per default for NVIDIA and AMD GPUs
- Updated / improved documentation and man pages
- Fixed compilation error on AMD GPUs
- Fixed SVE 256 compute kernels
- Allow the use of NVIDIA NCCL (currently in parts of ELPA) for device-to-device
communication
- Speed up of GPU version of hermitian_multiply by up to a factor of 4
- significantly faster full-to-tridiagonal step in ELPA 1stage GPU
- significantly faster ELPA 2stage solver on Intel GPUs
- Consistent enabling/disabling of SKEW_SYMMETRIC in header files
- new setup_gpu API function
Changelog for ELPA 2023.05.001
- added CITATION.cff file
- allow test programs to be run with 1 MPI task
- correct a memory leak in the gpu stream setup
- better handling of GPU BLAS handles
- implement the execution of the AMD HIP code path on NVIDIA GPUs
- implement the execution of the SYCL GPU code path on CPUs (debugging)
- port generalized routines to SYCL GPU
- PoC to use NVIDIA NCCL instead of MPI (not production ready)
- some cleanup of documentation
Changelog for ELPA 2022.11.001
- store GPU setup per ELPA object
- clarify documentation a bit
- add a C++ interface, including an example test program
- fix a few bugs in the C interface for the ELPA solvers
- complete the C API
- make sure that OMP_NUM_THREADS is honoured even if omp_threads is not set
- fix MPI_COMMUNICATORS per ELPA object
- significantly improve the performance of the ELPA band-reduction step of
the 2stage solver
- fix a few minor bugs in the AMD GPU port, which is now production ready
- allow to use NVIDIA's CUB implementation; experimental feature
- allow to use AMD's rocsolver library
- implement "HIP to ROCm" layer, in order to be able to run AMD GPU code paths
on NVIDIA devices
- remove the necessity to provide the CPP variable
Changelog for ELPA 2022.05.001
- implement OpenMP offloading to GPU for Intel GPU for ELPA 1 and 2 stage
(except for "step tridi_to_band")
- implement SYCL offloading to Intel GPUs for ELPA 1 and 2 stage
- AMD GPU offload has been tested on Mi200 (also with MPI)
- can use ELPA with one individual "gpu stream" per MPI task (Nvidia and AMD
only)
- allow steps "cholesky", "invert_trm", and "multiply_ab" to be called
directly with GPU device pointers
- on error ELPA returns rather than aborting, to give control to the calling
application and to allow for error recovery and/or graceful abort
- allow ELPA to build with OpenMP and GPU
- fix an FPE with the Intel compiler and AVX-512 instructions and optimization
level > -O2
- better checking of user defined options in configure
Changelog for ELPA 2021.11.002
- fix an error when choosing the Nvidia GPU kernel (fallback to CPU might have
been selected)
Changelog for ELPA 2021.11.001
- support of Nvidia cusolver library to accelerate some routines (needs CUDA >= 11.4)
- experimental Nvidia GPU versions for "elpa_invert_trm" and "elpa_cholesky"
can be tested by setting elpa_set("gpu_invert_trm",1) and
elpa_set("gpu_cholesky",1). They are not used otherwise
- BUGFIX: error in resort_ev (also backported to 2021.05.002 and 2020.11.001)
- allow to call ELPA eigenvectors and eigenvalues also with GPU device
pointers for the input matrix, the vectors of eigenvalues and the output
matrix for the eigenvectors
- EXPERIMENTAL feature: new real GPU kernel for Nvidia A100 (provided by Nvidia): can show a
performance boost if number of vectors per MPI task is > 20000. The benefit is most likely
largest in the non-MPI version
- as announced, dropping the legacy interface
- more autotuning features, for example using non blocking MPI collectives
- new version of autotuning avoiding a combinatorial growth of possibilities
(the old autotune version can still be used if
elpa%autotune_set_api_version(API_VERSION, error) is set to API_VERSION <
20211125)
Changelog for ELPA 2021.05.002
- no feature changes
- correct the SO version which was wrong in ELPA 2021.05.001
Changelog for ELPA 2021.05.001
- allow the user to set the mapping of MPI tasks to GPU id per set/get
- experimental feature: port to AMD GPUs, works correctly, performance yet
unclear; only tested with --with-mpi=0
- On request, ELPA can print the pinning of MPI tasks and OpenMP threads
- support for FUGAKU: some minor problems still have to be fixed due to compiler
issues
- BUG FIX: if matrix is already banded, check whether bandwidth >= 2. DO NOT
ALLOW a bandwidth = 1, since this would imply that the input matrix is
already diagonal which the ELPA algorithms do not support
- BUG FIX in internal test programs: do not consider a residual of 0.0 to be
an error
- support for skew-symmetric matrices now enabled by default
- BUG FIX in generalized case: in setups like "mpiexec -np 4 ./validate_real_double_generalized_1stage_random 90 90 45"
- ELPA_SETUPS does now (in case of MPI-runs) check whether the user-provided BLACSGRID is reasonable (i.e. ELPA does
_not_ rely anymore on the user checking, prior to calling ELPA, whether the BLACSGRID is ok); if this check fails,
ELPA returns with an error
- limit number of OpenMP threads to one, if MPI thread level is not at least MPI_THREAD_SERIALIZED
- allow checking of the supported threading level of the MPI library at build time
Changelog for ELPA 2020.11.001
- this release contains mostly bugfixes:
- fix determination whether a _ is needed to link Fortran to C
- fix an error in the real block4 kernel for aarch64 NEON
- add missing test_scalapack_template.F90 to EXTRA_DIST list
- fix error in the GPU kernel
- switch from python2 to python3
- experimental feature: complex kernels for aarch64 NEON
- experimental feature: kernels for ARM SVE
Changelog for ELPA 2020.05.001
- Enable compilation with gcc v10
- Fix a bug in elpa_multiply_a_b (GPU)
- improved documentation, including fixing of typos and errors in markdown
- Fix a bug in the calling of Cannon's algorithm which might lead to crashes
for a square process grid
- improvements and bugfixes of the ELPA 2stage GPU version, see
https://arxiv.org/abs/2002.10991
- bugfix for the build of AVX-512 KNL kernels
- clean separation of SIMD instructions for AVX and AVX2 kernels
- better error checking for allocations / deallocations of CPU and GPU memory
- experimental feature of matrix redistribution
- bugfix in the cpuid tests
- bugfix in elpa2_print_kernels
- bugfix when configuring --with-gpu-support-only
Changelog for ELPA 2019.11.001
- solve a bug when using parallel make builds
- check the cpuid set during build time
- add experimental feature "heterogenous-cluster-support"
- add experimental feature for 64bit integer BLAS/LAPACK/SCALAPACK support
- add experimental feature for 64bit integer MPI support
- support of ELPA for real valued skew-symmetric matrices, please cite:
https://arxiv.org/abs/1912.04062
- cleanup of the GPU version
- bugfix in the OpenMP version
- bugfix on the Power8/9 kernels
- bugfix on ARM aarch64 FMA kernels
Changelog for ELPA 2019.05.002
- repackaging of the sources, since the legacy interface had been forgotten in the
2019.05.001 release
Changelog for ELPA 2019.05.001
- elpa_print_kernels supports GPU usage
- fix an error if PAPI measurements are activated
- new simple real kernels: block4 and block6
- C functions can be built with optional arguments if the compiler supports it
(configure option)
- allow measurements with the likwid tool
- users can define the default-kernel at build time
- ELPA versioning number is provided in the C header files
- as announced a year ago, the following deprecated routines have finally been
removed; see DEPRECATED_FEATURES for the replacement routines, which were
introduced a year ago. Removed routines:
-> mult_at_b_real
-> mult_ah_b_complex
-> invert_trm_real
-> invert_trm_complex
-> cholesky_real
-> cholesky_complex
-> solve_tridi
- new kernels for ARM aarch64 added
- fix an out-of-bound-error in elpa2
Changelog for ELPA 2018.11.001
- improved autotuning
- improved performance of generalized problem via Cannon's algorithm
- check pointing functionality of elpa objects
- store/read/resume of autotuning
- Python interface for ELPA
- more ELPA functions have an optional error argument (Fortran) or required
error argument (C) => ABI and API change
Changelog for ELPA 2018.05.001
- significantly improved performance on K-computer
- added interface for the generalized eigenvalue problem
- extended autotuning functionality
Changelog for ELPA 2017.11.001
- significant improvement of performance of GPU version
- added new compute kernels for IBM Power8 and Fujitsu Sparc64
processors
- a first implementation of autotuning capability
- correct some type statements in Fortran
- correct detection of PAPI in configure step
Changelog for ELPA 2017.05.003
- remove bug in invert_triangular, which had been introduced
in ELPA 2017.05.002
Changelog for ELPA 2017.05.002
Mainly bugfixes for ELPA 2017.05.001:
- fix memory leak of MPI communicators
- tests for hermitian_multiply, cholesky decomposition and
- deal with a problem on Debian (mawk)
Changelog for ELPA 2017.05.001
Final release of ELPA 2017.05.001
Since rc2 the following changes have been made
- more extensive tests during "make check"
- distribute missing C headers
- introduce analytic tests
- Fix stack overflow in some kernels
Changelog for ELPA 2017.05.001.rc2
This is the release candidate 2 for the ELPA 2017.05.001 version.
In addition to the changes from rc1, it fixes some smaller issues
- add missing script "manual_cpp"
- cleanup of code
Changelog for ELPA 2017.05.001.rc1
This is the release candidate 1 for the ELPA 2017.05.001 version.
It provides a first version of the new, more generic API of the ELPA library.
Smaller changes to the API might be possible in the upcoming release
candidates. For users who would like to use the older API of the ELPA
library, the API as defined with release 2016.11.001.pre is frozen and
also supported.
Apart from the API change to be more flexible for the future, this release
offers the following changes:
- faster GPU implementation, especially for ELPA 1stage
- the restriction of the block-cyclic distribution blocksize = 128 in the GPU
case is relaxed
- Faster CPU implementation due to better blocking
- support of already banded matrices (new API only!)
- improved KNL support
Changelog for pre-release ELPA 2016.11.001.pre
This pre-release contains an experimental API which will most likely
change in the next stable release
- also support of single-precision (real and complex case) eigenvalue problems
- GPU support in ELPA 1stage and 2stage (real and complex case)
- change of API (w.r.t. ELPA 2016.05.004) to support runtime-choice of GPU usage
Changelog for release ELPA 2016.05.004
- fix a problem with the private state of module precision
- distribute test_project with dist tarball
- generic driver routine for ELPA 1stage and 2stage
- test case for elpa_mult_at_b_real
- test case for elpa_mult_ah_b_complex
- test case for elpa_cholesky_real
- test case for elpa_cholesky_complex
- test case for elpa_invert_trm_real
- test case for elpa_invert_trm_complex
- fix building of static library
- better choice of AVX, AVX2, AVX512 kernels
- make assumed size Fortran arrays default
Changelog for release ELPA 2016.05.003
- fix a problem with the build of SSE kernels
- make some (internal) functions public, such that they
can be used outside of ELPA
- add documentation and interfaces for new public functions
- shorten file names and directory names for test programs
in order to bypass the "make argument list too long" error
Changelog for release ELPA 2016.05.002
- fix problem with generated *.sh- check scripts
- name library differently if built without MPI support
- install only public modules
Changelog for release ELPA 2016.05.001
- support building without MPI for one node usage
- doxygen and man pages documentation for ELPA
- cleanup of documentation
- introduction of SSE gcc intrinsic kernels
- Remove errors due to unaligned memory
- removal of Fortran "contains functions"
- Fortran interfaces for assembly and C kernels