From 1575ff6c2252f390a3d9a7dcc3fd3cf449f6415f Mon Sep 17 00:00:00 2001 From: Thomas VINCENT Date: Mon, 6 Nov 2023 16:37:23 +0100 Subject: [PATCH 1/3] Squashed 'src/c-blosc2/' changes from 8fbdb4310..72c5cc1e8 5b38ffeb6 Getting ready for release 2.11.1 ff87d56d4 Fix the header for ALTIVEC functions. Overrides #569. ffc750b70 Post 2.11.0 release actions done 4ec95bf74 Getting ready for release 2.11.0 79e576472 Fix typo newly found by codespell 60d23181e Update CONTRIBUTING.rst accf9ffc3 Adapt the ALTIVEC module to the new bitshuffle API 9728e2a00 Re-enabling __builtin_cpu_supports() in GCC and Clang 558d982c0 Activate AVX512 in MSVC and Intel compilers af9608eee Fixes in signature of functions 68b3bf28b bitshuffle generic synced with bitshuffle upstream c2e7b84f5 Use the NEON bitshuffle code in the bitshuffle project. bfbd07db4 Support for disabling AVX512 a03bbc1f0 Proper AVX512 flags for MSVC fdacf9f3b Fix loosing ends for AVX512 support da8cd94be Preliminary support for AVX512 for bitshuffle a5c6c8944 Initial version where SSE2 and AVX2 paths have been ported af2c53418 Add c-blosc2 package variant for Guix with AVX-512 enabled e11231bd4 Add c-blosc2 package variant for Guix with AVX2 enabled f79964a97 Add c-blosc2 package definition for Guix 595f69abb Small name fix c94787a0c Properly check calls to strtol 38a61f1c2 Merge branch 'public-b2nd_copy_buffer' 83a6d99f0 Move declaration of b2nd_copy_buffer to b2nd.h a6e5f6552 Add simple unit test for b2nd_copy_buffer 8b81ae78d Use constant qualifiers in b2nd_copy_buffer where appropriate 06d1465f2 Include b2nd_utils and b2nd_copy_buffer in API documentation 40bc0d48d Add docstrings for b2nd_utils header and b2nd_copy_buffer function e94de3d7c Export the b2nd_copy_buffer function 12e2a765e Move b2nd utilities header to public headers directory 8e1cfd9b6 Merge branch 'main' of github.com:Blosc/c-blosc2 2be8b26a3 Better check that nthreads must be >= 1 and <= INT16_MAX. Fixes #559. 
35e5ace48 fix compile arguments for armv7l 61c00e186 Post 2.10.5 release actions done f8417b103 Getting ready for release 2.10.5 3ecb9dd57 Check ctx has been created correctly 95e0fd427 Change tuner's functions signature to return always an error code 24b703d94 Fix variable name for decompression context 3e88f8576 Post 2.10.4 release actions done 61377baf0 Getting ready for release 2.10.4 f4d00cc4d Remove duplicated tune inicialization since it is already done in blosc2_create_cctx bba56388f Re-add ninja again 60e6679a3 Fix typo newly found by codespell aefc9b5cc Post 2.10.3 release actions done 34b273ee6 Getting ready for release 2.10.3 ea1c222c8 Globally register openhtj2k codec 219f2d6db Bump actions/checkout from 3 to 4 6f8721808 Fix typo found by codespell eacc7b35e Provide a smoother increase of blocksize over clevels 2329d42a8 Add a BLOSC_INFO macro for details on compression params 20884590b data_dest should be bytes, not float c57645644 Automatic blocksize also depends on splitmode 4c7f52f42 Add an example of guessing automatic blocksizes for arbitrary chunksizes 6bbade9e1 Suppress an use-after-free warning 96746c028 Redo PR #551: Disable visibility attribute for mingw 8ce0e9684 Revert "First attempt at FetchContent for zlib-ng" 1c6b5906d Revert "zlib-ng static: needs -fPIC" 8185e18f9 Revert "Simplify ZLIB target" 481ca7ba3 Revert "Superbuild: Export & Install Zlib-NG" 3cece4776 Revert "Cleanup: internal-complibs/zlib-ng*" 22770bf7d Revert "Clone zlib-ng to build, not source directory" e5c7f8263 Revert latest PR #544 4f886c3f9 Disable visibility attribute for mingw 365424276 Clone zlib-ng to build, not source directory 9cc30a4af Cleanup: internal-complibs/zlib-ng* c6196c9c6 Superbuild: Export & Install Zlib-NG c73b89418 Simplify ZLIB target 36d78effb zlib-ng static: needs -fPIC 02cc3d096 First attempt at FetchContent for zlib-ng 37e083e68 CMake: Cleanup Threads Search 5f66aff62 Fix unused parameter warning with dlopen function 79a6e9614 Post 2.10.2 release actions done git-subtree-dir: src/c-blosc2 git-subtree-split: 72c5cc1e8516b1af39202c810e77fd1f790e3139 --- .github/workflows/cmake.yml | 14 +- .guix-channel | 3 + .guix/modules/c-blosc2-package.scm | 99 ++ ANNOUNCE.md | 9 +- Blosc2Config.cmake.in | 15 +- CMakeLists.txt | 48 +- CONTRIBUTING.rst | 2 +- RELEASE_NOTES.md | 63 +- blosc/CMakeLists.txt | 62 +- blosc/b2nd.c | 1 - blosc/b2nd_utils.c | 7 +- blosc/b2nd_utils.h | 23 - blosc/bitshuffle-altivec.c | 46 +- blosc/bitshuffle-altivec.h | 16 +- blosc/bitshuffle-avx2.c | 124 +- blosc/bitshuffle-avx2.h | 21 +- blosc/bitshuffle-avx512.c | 161 +++ blosc/bitshuffle-avx512.h | 29 + blosc/bitshuffle-generic.c | 291 +++-- blosc/bitshuffle-generic.h | 12 +- blosc/bitshuffle-neon.c | 1363 ++++++-------------- blosc/bitshuffle-neon.h | 10 +- blosc/bitshuffle-sse2.c | 315 ++--- blosc/bitshuffle-sse2.h | 20 +- blosc/blosc-private.h | 1 + blosc/blosc2.c | 114 +- blosc/frame.c | 63 +- blosc/schunk.c | 45 +- blosc/shuffle.c | 94 +- blosc/shuffle.h | 10 +- blosc/stune.c | 37 +- blosc/stune.h | 10 +- doc/reference/b2nd.rst | 7 + examples/CMakeLists.txt | 2 +- examples/README.rst | 4 +- examples/get_blocksize.c | 72 ++ guix.scm | 1 + include/b2nd.h | 32 + include/blosc2.h | 28 +- include/blosc2/blosc2-common.h | 2 +- include/blosc2/blosc2-export.h | 4 +- include/blosc2/codecs-registry.h | 1 + plugins/codecs/codecs-registry.c | 9 + plugins/codecs/ndlz/test_ndlz.c | 2 +- plugins/codecs/zfp/test_zfp_acc_float.c | 2 +- plugins/codecs/zfp/test_zfp_prec_float.c | 2 +- 
plugins/codecs/zfp/test_zfp_rate_float.c | 2 +- plugins/codecs/zfp/test_zfp_rate_getitem.c | 2 +- tests/b2nd/test_b2nd_copy_buffer.c | 76 ++ tests/test_contexts.c | 2 +- tests/test_nthreads.c | 47 + 51 files changed, 1885 insertions(+), 1540 deletions(-) create mode 100644 .guix-channel create mode 100644 .guix/modules/c-blosc2-package.scm delete mode 100644 blosc/b2nd_utils.h create mode 100644 blosc/bitshuffle-avx512.c create mode 100644 blosc/bitshuffle-avx512.h create mode 100644 examples/get_blocksize.c create mode 120000 guix.scm create mode 100644 tests/b2nd/test_b2nd_copy_buffer.c diff --git a/.github/workflows/cmake.yml b/.github/workflows/cmake.yml index 59c538c4..52ebc8a1 100644 --- a/.github/workflows/cmake.yml +++ b/.github/workflows/cmake.yml @@ -71,6 +71,11 @@ jobs: compiler: clang cmake-args: -D DEACTIVATE_AVX2=ON + - name: Ubuntu Clang No AVX512 + os: ubuntu-latest + compiler: clang + cmake-args: -D DEACTIVATE_AVX512=ON + - name: Ubuntu Clang No ZLIB os: ubuntu-latest compiler: clang @@ -105,7 +110,7 @@ jobs: compiler: gcc steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v4 - name: Install packages (Ubuntu) if: runner.os == 'Linux' && matrix.packages @@ -113,10 +118,9 @@ jobs: sudo apt-get update sudo apt-get install -y ${{ matrix.packages }} - # Ninja should be not necessary anymore (see note on Win / GCC above) - # - name: Install packages (Windows) - # if: runner.os == 'Windows' - # run: choco install ninja ${{ matrix.packages }} + - name: Install packages (Windows) + if: runner.os == 'Windows' + run: choco install ninja ${{ matrix.packages }} - name: Install packages (macOS) if: runner.os == 'macOS' diff --git a/.guix-channel b/.guix-channel new file mode 100644 index 00000000..4ce12885 --- /dev/null +++ b/.guix-channel @@ -0,0 +1,3 @@ +(channel + (version 0) + (directory ".guix/modules")) diff --git a/.guix/modules/c-blosc2-package.scm b/.guix/modules/c-blosc2-package.scm new file mode 100644 index 00000000..54bc4451 --- /dev/null +++ b/.guix/modules/c-blosc2-package.scm @@ -0,0 +1,99 @@ +;;; This file follows the suggestions in the article "From development +;;; environments to continuous integration—the ultimate guide to software +;;; development with Guix" by Ludovic Courtès at the Guix blog: +;;; . + +(define-module (c-blosc2-package) + #:use-module (guix) + #:use-module (guix build-system cmake) + #:use-module (guix git-download) + #:use-module ((guix licenses) + #:prefix license:) + #:use-module (gnu packages compression) + #:use-module (ice-9 regex) + #:use-module (ice-9 textual-ports)) + +(define (current-source-root) + (dirname (dirname (current-source-directory)))) + +(define (get-c-blosc2-version) + (let ((version-path (string-append (current-source-root) "/include/blosc2.h")) + (version-rx (make-regexp + "^\\s*#\\s*define\\s*BLOSC2_VERSION_STRING\\s*\"([^\"]*)\".*" + regexp/newline))) + (call-with-input-file version-path + (lambda (port) + (let* ((version-body (get-string-all port)) + (version-match (regexp-exec version-rx version-body))) + (and version-match + (match:substring version-match 1))))))) + +(define vcs-file? + ;; Return true if the given file is under version control. + (or (git-predicate (current-source-root)) + (const #t))) + +(define-public c-blosc2 + (package + (name "c-blosc2") + (version (get-c-blosc2-version)) + (source (local-file "../.." + "c-blosc2-checkout" + #:recursive? #t + #:select? (lambda (path stat) + (and (vcs-file? 
path stat) (not (string-contains path "/internal-complibs")))))) (build-system cmake-build-system) (arguments ;; Disable AVX2 by default as in Guix's c-blosc package. `(#:configure-flags '("-DBUILD_STATIC=OFF" "-DDEACTIVATE_AVX2=ON" "-DDEACTIVATE_AVX512=ON" "-DPREFER_EXTERNAL_LZ4=ON" "-DPREFER_EXTERNAL_ZLIB=ON" "-DPREFER_EXTERNAL_ZSTD=ON"))) (inputs (list lz4 zlib ;; The only input with a separate libs-only output. `(,zstd "lib"))) (home-page "https://blosc.org") (synopsis "Blocking, shuffling and lossless compression library") (description "Blosc is a high performance compressor optimized for binary data (i.e. floating point numbers, integers and booleans, although it can handle string data too). It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a @code{memcpy()} call. Blosc's main goal is not just to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations. C-Blosc2 is the new major version of C-Blosc, and is backward compatible with both the C-Blosc1 API and its in-memory format. However, the reverse is generally not true for the format; buffers generated with C-Blosc2 are not format-compatible with C-Blosc1 (i.e. forward compatibility is not supported).") (license license:bsd-3))) (define (package-with-configure-flags p flags) "Return P with FLAGS as additional 'configure' flags." (package/inherit p (arguments (substitute-keyword-arguments (package-arguments p) ((#:configure-flags original-flags #~(list)) #~(append #$original-flags #$flags)))))) (define-public c-blosc2-with-avx2 (package (inherit (package-with-configure-flags c-blosc2 #~(list "-DDEACTIVATE_AVX2=OFF"))) (name "c-blosc2-with-avx2"))) (define-public c-blosc2-with-avx512 (package (inherit (package-with-configure-flags c-blosc2 #~(list "-DDEACTIVATE_AVX2=OFF" "-DDEACTIVATE_AVX512=OFF"))) (name "c-blosc2-with-avx512"))) c-blosc2 diff --git a/ANNOUNCE.md b/ANNOUNCE.md index 3947a853..69458298 100644 --- a/ANNOUNCE.md +++ b/ANNOUNCE.md @@ -1,11 +1,12 @@ -# Announcing C-Blosc2 2.10.2 +# Announcing C-Blosc2 2.11.1 A fast, compressed and persistent binary data store library for C. ## What is new? -This is a maintenance release with also several improvements for helping -integration of C-Blosc2 in other projects (thanks to Alex Huebl). Also, -some fixes for MinGW platform are in (thanks to Biswapriyo Nath). +This is a maintenance release for fixing the ALTIVEC header. +It only affects IBM POWER builds. + +Also, some other fixes and improvements are included. For more info, please see the release notes in: diff --git a/Blosc2Config.cmake.in b/Blosc2Config.cmake.in index 59274855..68d8b5c3 100644 --- a/Blosc2Config.cmake.in +++ b/Blosc2Config.cmake.in @@ -12,7 +12,7 @@ endif() list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_LIST_DIR}/Modules") # this section stores which configuration options were set -set(HAVE_THREADS @Threads_FOUND@) +set(HAVE_THREADS @HAVE_THREADS@) set(HAVE_IPP @HAVE_IPP@) set(HAVE_ZLIB_NG @HAVE_ZLIB_NG@) set(DEACTIVATE_IPP @DEACTIVATE_IPP@) @@ -26,16 +26,13 @@ set(PREFER_EXTERNAL_ZSTD @PREFER_EXTERNAL_ZSTD@) # additionally, the Blosc2_..._FOUND variables are used to support # find_package(Blosc2 ... COMPONENTS ... ...) # this enables downstream projects to express the need for specific features.
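A note on the Threads rework in this hunk and in the top-level CMakeLists.txt below: the config template previously substituted @Threads_FOUND@, which was only set once blosc/CMakeLists.txt had run find_package(Threads), so the exported value could be stale. Hoisting the probe into the top-level CMakeLists.txt and exporting @HAVE_THREADS@ keeps find_dependency(Threads) in sync with how the installed library was actually built, on every platform rather than only inside a WIN32 branch. The effect is visible from any downstream consumer that relies on multithreaded compression; here is a minimal sketch against the public blosc2.h API (thread count, compression level and buffer sizes are arbitrary illustration values, and error handling is trimmed):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <blosc2.h>

int main(void) {
  const int32_t nfloats = 1000 * 1000;
  const int32_t srcsize = nfloats * (int32_t)sizeof(float);
  float *src = malloc(srcsize);
  uint8_t *dst = malloc(srcsize + BLOSC2_MAX_OVERHEAD);
  for (int32_t i = 0; i < nfloats; i++)
    src[i] = (float)i;  /* compressible synthetic data */

  blosc2_init();
  /* The worker threads spawned here are the reason Threads must be a
     transitive dependency of the exported Blosc2 targets. */
  blosc2_set_nthreads(4);

  int csize = blosc2_compress(5, BLOSC_SHUFFLE, (int32_t)sizeof(float),
                              src, srcsize,
                              dst, srcsize + BLOSC2_MAX_OVERHEAD);
  printf("compressed %d -> %d bytes\n", srcsize, csize);

  blosc2_destroy();
  free(src);
  free(dst);
  return 0;
}

When such a program is linked through find_package(Blosc2) against a static libblosc2, the thread library symbols resolve only if the dependency is propagated, which is what exporting HAVE_THREADS restores.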
-if(WIN32) - if(HAVE_THREADS) - find_dependency(Threads) - set(Blosc2_THREADS_FOUND TRUE) - else() - set(Blosc2_THREADS_FOUND FALSE) - endif() -else() +set(CMAKE_THREAD_PREFER_PTHREAD TRUE) # pre 3.1 +set(THREADS_PREFER_PTHREAD_FLAG TRUE) # CMake 3.1+ +if(HAVE_THREADS) find_dependency(Threads) set(Blosc2_THREADS_FOUND TRUE) +else() + set(Blosc2_THREADS_FOUND FALSE) endif() if(NOT DEACTIVATE_IPP AND HAVE_IPP) diff --git a/CMakeLists.txt b/CMakeLists.txt index 4b4fc6a5..bff2f36c 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -29,6 +29,8 @@ # build a lite version (only with BloscLZ and LZ4/LZ4HC) of the blosc library # DEACTIVATE_AVX2: default OFF # do not attempt to build with AVX2 instructions +# DEACTIVATE_AVX512: default OFF +# do not attempt to build with AVX512 instructions # DEACTIVATE_ZLIB: default OFF # do not include support for the Zlib library # DEACTIVATE_ZSTD: default OFF @@ -115,6 +117,8 @@ option(BUILD_LITE "Build a lite version (only with BloscLZ and LZ4/LZ4HC) of the blosc library." OFF) option(DEACTIVATE_AVX2 "Do not attempt to build with AVX2 instructions" OFF) +option(DEACTIVATE_AVX512 + "Do not attempt to build with AVX512 instructions" OFF) option(DEACTIVATE_ZLIB "Do not include support for the Zlib library." OFF) option(DEACTIVATE_ZSTD @@ -151,6 +155,21 @@ if(BUILD_LITE) set(DEACTIVATE_ZSTD ON) endif() +# Threads +set(CMAKE_THREAD_PREFER_PTHREAD TRUE) # pre 3.1 +set(THREADS_PREFER_PTHREAD_FLAG TRUE) # CMake 3.1+ +if(WIN32) + # try to use the system library + find_package(Threads) +else() + find_package(Threads REQUIRED) +endif() +if(Threads_FOUND) + set(HAVE_THREADS ON) +else() + set(HAVE_THREADS OFF) +endif() + if(PREFER_EXTERNAL_LZ4) find_package(LZ4) else() @@ -266,6 +285,11 @@ if(CMAKE_SYSTEM_PROCESSOR STREQUAL i386 OR else() set(COMPILER_SUPPORT_AVX2 FALSE) endif() + if(CMAKE_C_COMPILER_VERSION VERSION_GREATER 4.9 OR CMAKE_C_COMPILER_VERSION VERSION_EQUAL 4.9) + set(COMPILER_SUPPORT_AVX512 TRUE) + else() + set(COMPILER_SUPPORT_AVX512 FALSE) + endif() elseif(CMAKE_C_COMPILER_ID STREQUAL Clang OR CMAKE_C_COMPILER_ID STREQUAL AppleClang) set(COMPILER_SUPPORT_SSE2 TRUE) if(CMAKE_C_COMPILER_VERSION VERSION_GREATER 3.2 OR CMAKE_C_COMPILER_VERSION VERSION_EQUAL 3.2) @@ -273,23 +297,30 @@ if(CMAKE_SYSTEM_PROCESSOR STREQUAL i386 OR else() set(COMPILER_SUPPORT_AVX2 FALSE) endif() - elseif(CMAKE_C_COMPILER_ID STREQUAL Intel) - set(COMPILER_SUPPORT_SSE2 TRUE) - if(CMAKE_C_COMPILER_VERSION VERSION_GREATER 14.0 OR CMAKE_C_COMPILER_VERSION VERSION_EQUAL 14.0) - set(COMPILER_SUPPORT_AVX2 TRUE) + if(CMAKE_C_COMPILER_VERSION VERSION_GREATER 10.0 OR CMAKE_C_COMPILER_VERSION VERSION_EQUAL 10.0) + set(COMPILER_SUPPORT_AVX512 TRUE) else() - set(COMPILER_SUPPORT_AVX2 FALSE) + set(COMPILER_SUPPORT_AVX512 FALSE) endif() + elseif(CMAKE_C_COMPILER_ID STREQUAL Intel) + # All Intel compilers since the introduction of AVX512 in 2016 should support it, so activate all SIMD flavors + set(COMPILER_SUPPORT_SSE2 TRUE) + set(COMPILER_SUPPORT_AVX2 TRUE) + set(COMPILER_SUPPORT_AVX512 TRUE) elseif(MSVC) set(COMPILER_SUPPORT_SSE2 TRUE) if(CMAKE_C_COMPILER_VERSION VERSION_GREATER 18.00.30501 OR CMAKE_C_COMPILER_VERSION VERSION_EQUAL 18.00.30501) set(COMPILER_SUPPORT_AVX2 TRUE) + # AVX512 starts to be supported since Visual Studio 17 15.0 + elseif(CMAKE_C_COMPILER_VERSION VERSION_GREATER 19.10.25017 OR CMAKE_C_COMPILER_VERSION VERSION_EQUAL 19.10.25017) + set(COMPILER_SUPPORT_AVX512 TRUE) else() set(COMPILER_SUPPORT_AVX2 FALSE) endif() else() set(COMPILER_SUPPORT_SSE2 FALSE) set(COMPILER_SUPPORT_AVX2 
FALSE) + set(COMPILER_SUPPORT_AVX512 FALSE) # Unrecognized compiler. Emit a warning message to let the user know hardware-acceleration won't be available. message(WARNING "Unable to determine which ${CMAKE_SYSTEM_PROCESSOR} hardware features are supported by the C compiler (${CMAKE_C_COMPILER_ID} ${CMAKE_C_COMPILER_VERSION}).") endif() @@ -330,6 +361,13 @@ endif() # disable AVX2 if specified if(DEACTIVATE_AVX2) set(COMPILER_SUPPORT_AVX2 FALSE) + # AVX512 functions in bitshuffle depend on AVX2 too + set(COMPILER_SUPPORT_AVX512 FALSE) +endif() + +# disable AVX512 if specified +if(DEACTIVATE_AVX512) + set(COMPILER_SUPPORT_AVX512 FALSE) endif() # flags diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst index 9112c941..890ddf0c 100644 --- a/CONTRIBUTING.rst +++ b/CONTRIBUTING.rst @@ -36,5 +36,5 @@ Coding Style License ------- By contributing to C-Blosc2, you agree that your contributions will be licensed -under the `LICENSE `_ +under the `LICENSE `_ file of the project. diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md index dc6754ad..c8cfa685 100644 --- a/RELEASE_NOTES.md +++ b/RELEASE_NOTES.md @@ -1,12 +1,73 @@ Release notes for C-Blosc2 ========================== +Changes from 2.11.0 to 2.11.1 +============================= + +* Fix ALTIVEC header. It only affects IBM POWER builds. Thanks to + Michael Kuhn for providing a patch. + + +Changes from 2.10.5 to 2.11.0 +============================= + +* New AVX512 support for the bitshuffle filter. This is a backport of the upstream + bitshuffle project (https://github.com/kiyo-masui/bitshuffle). Expect up to [20% + better compression speed](https://github.com/Blosc/c-blosc2/pull/567#issuecomment-1789239842) + on the AMD Zen4 architecture (7950X3D CPU). + +* Add c-blosc2 package definition for Guix. Thanks to Ivan Vilata. + +* Properly check calls to `strtol`. Fixes #558. + +* Export the `b2nd_copy_buffer` function. This may be useful for other projects + dealing with multidimensional arrays in memory. Thanks to Ivan Vilata. + +* Better check that nthreads is >= 1 and <= INT16_MAX. Fixes #559. + +* Fix compile arguments for armv7l. Thanks to Ben Greiner. + + +Changes from 2.10.4 to 2.10.5 +============================= + +* Fix a variable name in a test that was causing a segfault on some platforms. + +* Change the tuner functions' signature to always return an error code. This allows + for better error checking when using pluggable tuners in Blosc2. + +* Perform sanity checks when creating contexts. + + +Changes from 2.10.3 to 2.10.4 +============================= + +* Remove duplicated tune initialization since it is already done in blosc2_create_cctx. + Thanks to Marta Iborra. + +* Typos fixed. Thanks to Dimitri Papadopoulos. + + +Changes from 2.10.2 to 2.10.3 +============================= + +* Globally registered the new codec `openhtj2k`. This will be loaded dynamically. See PR #557. + +* Added a `BLOSC_INFO` macro for details on compression params. + +* Added a `get_blocksize.c` example on automatic blocksizes. + +* Warning fixes. + +* Fixes for mingw. + + Changes from 2.10.1 to 2.10.2 ============================= * Several fixes for the CMake system. Thanks to Axel Huebl. See PR #541 and #542. -* Several fixes for mingw plaform. Thanks to Biswapriyo Nath. See PR #540 and #543. +* Several fixes for mingw platform. Thanks to Biswapriyo Nath. See PR #540 and #543.
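A usage note on the AVX512 bitshuffle entry above: nothing changes in the API. The kernel (AVX512, AVX2, SSE2, NEON, ALTIVEC or generic) is selected at run time by the shuffle dispatcher, and the AVX512 path additionally needs both the build-time support detected in CMakeLists.txt and a CPU that reports the avx512f/avx512bw features. A minimal sketch that requests the bitshuffle filter through the public context API (assuming the 2.11.x blosc2.h; values are illustrative and error handling is trimmed):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <blosc2.h>

int main(void) {
  const int32_t nitems = 1000 * 1000;
  const int32_t srcsize = nitems * (int32_t)sizeof(int64_t);
  int64_t *src = malloc(srcsize);
  uint8_t *dst = malloc(srcsize + BLOSC2_MAX_OVERHEAD);
  for (int32_t i = 0; i < nitems; i++)
    src[i] = i;  /* low-entropy data, where bitshuffle helps most */

  blosc2_init();

  blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
  cparams.typesize = (int32_t)sizeof(int64_t);
  /* Replace the default byte shuffle in the last pipeline slot with
     bitshuffle; the best available SIMD kernel is picked at run time. */
  cparams.filters[BLOSC2_MAX_FILTERS - 1] = BLOSC_BITSHUFFLE;

  blosc2_context *cctx = blosc2_create_cctx(cparams);
  int csize = blosc2_compress_ctx(cctx, src, srcsize,
                                  dst, srcsize + BLOSC2_MAX_OVERHEAD);
  printf("bitshuffle: %d -> %d bytes\n", srcsize, csize);

  blosc2_free_ctx(cctx);
  blosc2_destroy();
  free(src);
  free(dst);
  return 0;
}

The last filter slot is used because that is where BLOSC2_CPARAMS_DEFAULTS places the regular shuffle, i.e. the filter that runs immediately before the codec.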
Changes from 2.10.0 to 2.10.1 diff --git a/blosc/CMakeLists.txt b/blosc/CMakeLists.txt index b44b7107..bf8527da 100644 --- a/blosc/CMakeLists.txt +++ b/blosc/CMakeLists.txt @@ -165,28 +165,22 @@ if(NOT DEACTIVATE_ZSTD) endif() endif() -set(CMAKE_THREAD_PREFER_PTHREAD TRUE) # pre 3.1 -set(THREADS_PREFER_PTHREAD_FLAG TRUE) # CMake 3.1+ -if(WIN32) - # try to use the system library - find_package(Threads) - if(NOT Threads_FOUND) - message(STATUS "using the internal pthread library for win32 systems.") - list(APPEND SOURCES blosc/win32/pthread.c) - else() - if(CMAKE_VERSION VERSION_LESS 3.1) - set(LIBS ${LIBS} ${CMAKE_THREAD_LIBS_INIT}) - else() - set(LIBS ${LIBS} Threads::Threads) - endif() - endif() -else() - find_package(Threads REQUIRED) +# Threads +if(HAVE_THREADS) if(CMAKE_VERSION VERSION_LESS 3.1) set(LIBS ${LIBS} ${CMAKE_THREAD_LIBS_INIT}) else() set(LIBS ${LIBS} Threads::Threads) endif() +elseif(WIN32) + message(STATUS "using the internal pthread library for win32 systems.") + list(APPEND SOURCES blosc/win32/pthread.c) +else() + message(FATAL_ERROR "Threads required but not found.") +endif() + +# dlopen/dlclose +if(NOT WIN32) set(LIBS ${LIBS} ${CMAKE_DL_LIBS}) endif() @@ -268,7 +262,7 @@ list(APPEND SOURCES blosc/directories.c blosc/blosc2-stdio.c blosc/b2nd.c - blosc/b2nd_utils.c + blosc/b2nd_utils.c ) if(NOT CMAKE_SYSTEM_PROCESSOR STREQUAL arm64) if(COMPILER_SUPPORT_SSE2) @@ -279,6 +273,10 @@ if(NOT CMAKE_SYSTEM_PROCESSOR STREQUAL arm64) message(STATUS "Adding run-time support for AVX2") list(APPEND SOURCES blosc/shuffle-avx2.c blosc/bitshuffle-avx2.c) endif() + if(COMPILER_SUPPORT_AVX512) + message(STATUS "Adding run-time support for AVX512") + list(APPEND SOURCES blosc/bitshuffle-avx512.c) + endif() endif() if(COMPILER_SUPPORT_NEON) message(STATUS "Adding run-time support for NEON") @@ -349,6 +347,30 @@ if(COMPILER_SUPPORT_AVX2) SOURCE shuffle.c APPEND PROPERTY COMPILE_DEFINITIONS SHUFFLE_AVX2_ENABLED) endif() +if(COMPILER_SUPPORT_AVX512) + if(MSVC) + set_source_files_properties( + bitshuffle-avx512.c + PROPERTIES COMPILE_OPTIONS "/arch:AVX512") + set_property( + SOURCE shuffle.c + APPEND PROPERTY COMPILE_OPTIONS "/arch:AVX512") + else() + set_source_files_properties( + bitshuffle-avx512.c + PROPERTIES COMPILE_OPTIONS "-mavx512f;-mavx512bw") + set_property( + SOURCE shuffle.c + APPEND PROPERTY COMPILE_OPTIONS "-mavx512f;-mavx512bw") + endif() + + # Define a symbol for the shuffle-dispatch implementation + # so it knows AVX512 is supported even though that file is + # compiled without AVX512 support (for portability). + set_property( + SOURCE shuffle.c + APPEND PROPERTY COMPILE_DEFINITIONS SHUFFLE_AVX512_ENABLED) +endif() if(COMPILER_SUPPORT_NEON) set_source_files_properties( shuffle-neon.c bitshuffle-neon.c @@ -360,10 +382,10 @@ if(COMPILER_SUPPORT_NEON) # Only armv7l needs special -mfpu=neon flag; aarch64 doesn't. 
set_source_files_properties( shuffle-neon.c bitshuffle-neon.c - PROPERTIES COMPILE_OPTIONS "-mfpu=neon -flax-vector-conversions") + PROPERTIES COMPILE_OPTIONS "-mfpu=neon;-flax-vector-conversions") set_property( SOURCE shuffle.c - APPEND PROPERTY COMPILE_OPTIONS "-mfpu=neon -flax-vector-conversions") + APPEND PROPERTY COMPILE_OPTIONS "-mfpu=neon;-flax-vector-conversions") endif() # Define a symbol for the shuffle-dispatch implementation # so it knows NEON is supported even though that file is diff --git a/blosc/b2nd.c b/blosc/b2nd.c index 4c0ff70c..ed367cc8 100644 --- a/blosc/b2nd.c +++ b/blosc/b2nd.c @@ -9,7 +9,6 @@ **********************************************************************/ #include "b2nd.h" -#include "b2nd_utils.h" #include "context.h" #include "blosc2/blosc2-common.h" #include "blosc2.h" diff --git a/blosc/b2nd_utils.c b/blosc/b2nd_utils.c index bcd5e141..d945ad74 100644 --- a/blosc/b2nd_utils.c +++ b/blosc/b2nd_utils.c @@ -8,7 +8,6 @@ See LICENSE.txt for details about copyright and rights to use. **********************************************************************/ -#include "b2nd_utils.h" #include "b2nd.h" #include @@ -256,10 +255,10 @@ void copy_ndim_fallback(const int8_t ndim, int b2nd_copy_buffer(int8_t ndim, uint8_t itemsize, - void *src, const int64_t *src_pad_shape, - int64_t *src_start, const int64_t *src_stop, + const void *src, const int64_t *src_pad_shape, + const int64_t *src_start, const int64_t *src_stop, void *dst, const int64_t *dst_pad_shape, - int64_t *dst_start) { + const int64_t *dst_start) { // Compute the shape of the copy int64_t copy_shape[B2ND_MAX_DIM] = {0}; for (int i = 0; i < ndim; ++i) { diff --git a/blosc/b2nd_utils.h b/blosc/b2nd_utils.h deleted file mode 100644 index 38fba67a..00000000 --- a/blosc/b2nd_utils.h +++ /dev/null @@ -1,23 +0,0 @@ -/********************************************************************* - Blosc - Blocked Shuffling and Compression Library - - Copyright (c) 2021 The Blosc Development Team - https://blosc.org - License: BSD 3-Clause (see LICENSE.txt) - - See LICENSE.txt for details about copyright and rights to use. -**********************************************************************/ - -#ifndef BLOSC_B2ND_UTILS_H -#define BLOSC_B2ND_UTILS_H - -#include - -int b2nd_copy_buffer(int8_t ndim, - uint8_t itemsize, - void *src, const int64_t *src_pad_shape, - int64_t *src_start, const int64_t *src_stop, - void *dst, const int64_t *dst_pad_shape, - int64_t *dst_start); - -#endif /* BLOSC_B2ND_UTILS_H */ diff --git a/blosc/bitshuffle-altivec.c b/blosc/bitshuffle-altivec.c index 884bf90c..0dd33b8b 100644 --- a/blosc/bitshuffle-altivec.c +++ b/blosc/bitshuffle-altivec.c @@ -179,7 +179,7 @@ bitunshuffle1_altivec(void* _src, void* dest, const size_t size, const size_t el /* Transpose bytes within elements for 16 bit elements. */ -int64_t bshuf_trans_byte_elem_16(void* in, void* out, const size_t size) { +int64_t bshuf_trans_byte_elem_16(const void* in, void* out, const size_t size) { static const uint8_t bytesoftype = 2; __vector uint8_t xmm0[2]; @@ -199,7 +199,7 @@ int64_t bshuf_trans_byte_elem_16(void* in, void* out, const size_t size) { /* Transpose bytes within elements for 32 bit elements. 
*/ -int64_t bshuf_trans_byte_elem_32(void* in, void* out, const size_t size) { +int64_t bshuf_trans_byte_elem_32(const void* in, void* out, const size_t size) { static const uint8_t bytesoftype = 4; __vector uint8_t xmm0[4]; @@ -219,7 +219,7 @@ int64_t bshuf_trans_byte_elem_32(void* in, void* out, const size_t size) { /* Transpose bytes within elements for 64 bit elements. */ -int64_t bshuf_trans_byte_elem_64(void* in, void* out, const size_t size) { +int64_t bshuf_trans_byte_elem_64(const void* in, void* out, const size_t size) { static const uint8_t bytesoftype = 8; __vector uint8_t xmm0[8]; @@ -239,7 +239,7 @@ int64_t bshuf_trans_byte_elem_64(void* in, void* out, const size_t size) { /* Transpose bytes within elements for 128 bit elements. */ -int64_t bshuf_trans_byte_elem_128(void* in, void* out, const size_t size) { +int64_t bshuf_trans_byte_elem_128(const void* in, void* out, const size_t size) { static const uint8_t bytesoftype = 16; __vector uint8_t xmm0[16]; @@ -258,20 +258,8 @@ int64_t bshuf_trans_byte_elem_128(void* in, void* out, const size_t size) { } -/* Memory copy with bshuf call signature. */ -int64_t bshuf_copy(void* in, void* out, const size_t size, - const size_t elem_size) { - - char* in_b = (char*)in; - char* out_b = (char*)out; - - memcpy(out_b, in_b, size * elem_size); - return size * elem_size; -} - - /* Transpose bytes within elements using best SSE algorithm available. */ -int64_t bshuf_trans_byte_elem_altivec(void* in, void* out, const size_t size, +int64_t bshuf_trans_byte_elem_altivec(const void* in, void* out, const size_t size, const size_t elem_size, void* tmp_buf) { int64_t count; @@ -338,7 +326,7 @@ int64_t bshuf_trans_byte_elem_altivec(void* in, void* out, const size_t size, /* Transpose bits within bytes. */ -int64_t bshuf_trans_bit_byte_altivec(void* in, void* out, const size_t size, +int64_t bshuf_trans_bit_byte_altivec(const void* in, void* out, const size_t size, const size_t elem_size) { const uint8_t* in_b = (const uint8_t*)in; @@ -372,25 +360,31 @@ int64_t bshuf_trans_bit_byte_altivec(void* in, void* out, const size_t size, /* Transpose bits within elements. */ -int64_t bshuf_trans_bit_elem_altivec(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf) { +int64_t bshuf_trans_bit_elem_altivec(const void* in, void* out, const size_t size, + const size_t elem_size) { int64_t count; CHECK_MULT_EIGHT(size); + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + count = bshuf_trans_byte_elem_altivec(in, out, size, elem_size, tmp_buf); CHECK_ERR(count); // bshuf_trans_bit_byte_altivec / bitshuffle1_altivec count = bshuf_trans_bit_byte_altivec(out, tmp_buf, size, elem_size); CHECK_ERR(count); count = bshuf_trans_bitrow_eight(tmp_buf, out, size, elem_size); + + free(tmp_buf); + return count; } /* For data organized into a row for each bit (8 * elem_size rows), transpose * the bytes. */ -int64_t bshuf_trans_byte_bitrow_altivec(void* in, void* out, const size_t size, +int64_t bshuf_trans_byte_bitrow_altivec(const void* in, void* out, const size_t size, const size_t elem_size) { static const __vector uint8_t epi8_low = (const __vector uint8_t) { 0x00, 0x10, 0x01, 0x11, 0x02, 0x12, 0x03, 0x13, @@ -541,7 +535,7 @@ int64_t bshuf_trans_byte_bitrow_altivec(void* in, void* out, const size_t size, /* Shuffle bits within the bytes of eight element blocks. 
*/ -int64_t bshuf_shuffle_bit_eightelem_altivec(void* in, void* out, const size_t size, +int64_t bshuf_shuffle_bit_eightelem_altivec(const void* in, void* out, const size_t size, const size_t elem_size) { /* With a bit of care, this could be written such that such that it is */ /* in_buf = out_buf safe. */ @@ -579,17 +573,21 @@ int64_t bshuf_shuffle_bit_eightelem_altivec(void* in, void* out, const size_t si /* Untranspose bits within elements. */ -int64_t bshuf_untrans_bit_elem_altivec(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf) { +int64_t bshuf_untrans_bit_elem_altivec(const void* in, void* out, const size_t size, + const size_t elem_size) { int64_t count; CHECK_MULT_EIGHT(size); + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + count = bshuf_trans_byte_bitrow_altivec(in, tmp_buf, size, elem_size); CHECK_ERR(count); count = bshuf_shuffle_bit_eightelem_altivec(tmp_buf, out, size, elem_size); + free(tmp_buf); return count; } diff --git a/blosc/bitshuffle-altivec.h b/blosc/bitshuffle-altivec.h index 5995aa74..320818a3 100644 --- a/blosc/bitshuffle-altivec.h +++ b/blosc/bitshuffle-altivec.h @@ -19,29 +19,29 @@ #include BLOSC_NO_EXPORT int64_t - bshuf_trans_byte_elem_altivec(void* in, void* out, const size_t size, + bshuf_trans_byte_elem_altivec(const void* in, void* out, const size_t size, const size_t elem_size, void* tmp_buf); BLOSC_NO_EXPORT int64_t - bshuf_trans_byte_bitrow_altivec(void* in, void* out, const size_t size, + bshuf_trans_byte_bitrow_altivec(const void* in, void* out, const size_t size, const size_t elem_size); BLOSC_NO_EXPORT int64_t - bshuf_shuffle_bit_eightelem_altivec(void* in, void* out, const size_t size, - const size_t elem_size); + bshuf_shuffle_bit_eightelem_altivec(const void* in, void* out, const size_t size, + const size_t elem_size); /** ALTIVEC-accelerated bitshuffle routine. */ BLOSC_NO_EXPORT int64_t - bshuf_trans_bit_elem_altivec(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf); + bshuf_trans_bit_elem_altivec(const void* in, void* out, const size_t size, + const size_t elem_size); /** ALTIVEC-accelerated bitunshuffle routine. */ BLOSC_NO_EXPORT int64_t - bshuf_untrans_bit_elem_altivec(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf); + bshuf_untrans_bit_elem_altivec(const void* in, void* out, const size_t size, + const size_t elem_size); #endif /* BLOSC_BITSHUFFLE_ALTIVEC_H */ diff --git a/blosc/bitshuffle-avx2.c b/blosc/bitshuffle-avx2.c index a855bf83..b0f5ac3c 100644 --- a/blosc/bitshuffle-avx2.c +++ b/blosc/bitshuffle-avx2.c @@ -29,8 +29,6 @@ #include -#include - /* The next is useful for debugging purposes */ #if 0 #include @@ -57,12 +55,14 @@ static void printymm(__m256i ymm0) /* ---- Code that requires AVX2. Intel Haswell (2013) and later. ---- */ + /* Transpose bits within bytes. 
*/ -int64_t bshuf_trans_bit_byte_avx2(void* in, void* out, const size_t size, - const size_t elem_size) { +int64_t bshuf_trans_bit_byte_AVX(const void* in, void* out, const size_t size, + const size_t elem_size) { - char* in_b = (char*)in; - char* out_b = (char*)out; + size_t ii, kk; + const char* in_b = (const char*) in; + char* out_b = (char*) out; int32_t* out_i32; size_t nbyte = elem_size * size; @@ -71,14 +71,13 @@ int64_t bshuf_trans_bit_byte_avx2(void* in, void* out, const size_t size, __m256i ymm; int32_t bt; - size_t ii, kk; for (ii = 0; ii + 31 < nbyte; ii += 32) { - ymm = _mm256_loadu_si256((__m256i*)&in_b[ii]); + ymm = _mm256_loadu_si256((__m256i *) &in_b[ii]); for (kk = 0; kk < 8; kk++) { bt = _mm256_movemask_epi8(ymm); ymm = _mm256_slli_epi16(ymm, 1); - out_i32 = (int32_t*)&out_b[((7 - kk) * nbyte + ii) / 8]; + out_i32 = (int32_t*) &out_b[((7 - kk) * nbyte + ii) / 8]; *out_i32 = bt; } } @@ -89,39 +88,44 @@ int64_t bshuf_trans_bit_byte_avx2(void* in, void* out, const size_t size, /* Transpose bits within elements. */ -int64_t bshuf_trans_bit_elem_avx2(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf) { +int64_t bshuf_trans_bit_elem_AVX(const void* in, void* out, const size_t size, + const size_t elem_size) { int64_t count; CHECK_MULT_EIGHT(size); - count = bshuf_trans_byte_elem_sse2(in, out, size, elem_size, tmp_buf); - CHECK_ERR(count); - count = bshuf_trans_bit_byte_avx2(out, tmp_buf, size, elem_size); - CHECK_ERR(count); + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + + count = bshuf_trans_byte_elem_SSE(in, out, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_trans_bit_byte_AVX(out, tmp_buf, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); count = bshuf_trans_bitrow_eight(tmp_buf, out, size, elem_size); + free(tmp_buf); + return count; } /* For data organized into a row for each bit (8 * elem_size rows), transpose * the bytes. 
*/ -int64_t bshuf_trans_byte_bitrow_avx2(void* in, void* out, const size_t size, - const size_t elem_size) { +int64_t bshuf_trans_byte_bitrow_AVX(const void* in, void* out, const size_t size, + const size_t elem_size) { - char* in_b = (char*)in; - char* out_b = (char*)out; + size_t hh, ii, jj, kk, mm; + const char* in_b = (const char*) in; + char* out_b = (char*) out; + + CHECK_MULT_EIGHT(size); size_t nrows = 8 * elem_size; size_t nbyte_row = size / 8; - size_t ii, jj, kk, hh, mm; - - CHECK_MULT_EIGHT(size); - if (elem_size % 4) - return bshuf_trans_byte_bitrow_sse2(in, out, size, elem_size); + if (elem_size % 4) return bshuf_trans_byte_bitrow_SSE(in, out, size, + elem_size); __m256i ymm_0[8]; __m256i ymm_1[8]; @@ -129,22 +133,22 @@ int64_t bshuf_trans_byte_bitrow_avx2(void* in, void* out, const size_t size, for (jj = 0; jj + 31 < nbyte_row; jj += 32) { for (ii = 0; ii + 3 < elem_size; ii += 4) { - for (hh = 0; hh < 4; hh++) { + for (hh = 0; hh < 4; hh ++) { - for (kk = 0; kk < 8; kk++) { - ymm_0[kk] = _mm256_loadu_si256((__m256i*)&in_b[ + for (kk = 0; kk < 8; kk ++){ + ymm_0[kk] = _mm256_loadu_si256((__m256i *) &in_b[ (ii * 8 + hh * 8 + kk) * nbyte_row + jj]); } - for (kk = 0; kk < 4; kk++) { + for (kk = 0; kk < 4; kk ++){ ymm_1[kk] = _mm256_unpacklo_epi8(ymm_0[kk * 2], ymm_0[kk * 2 + 1]); ymm_1[kk + 4] = _mm256_unpackhi_epi8(ymm_0[kk * 2], ymm_0[kk * 2 + 1]); } - for (kk = 0; kk < 2; kk++) { - for (mm = 0; mm < 2; mm++) { + for (kk = 0; kk < 2; kk ++){ + for (mm = 0; mm < 2; mm ++){ ymm_0[kk * 4 + mm] = _mm256_unpacklo_epi16( ymm_1[kk * 4 + mm * 2], ymm_1[kk * 4 + mm * 2 + 1]); @@ -154,21 +158,21 @@ int64_t bshuf_trans_byte_bitrow_avx2(void* in, void* out, const size_t size, } } - for (kk = 0; kk < 4; kk++) { + for (kk = 0; kk < 4; kk ++){ ymm_1[kk * 2] = _mm256_unpacklo_epi32(ymm_0[kk * 2], ymm_0[kk * 2 + 1]); ymm_1[kk * 2 + 1] = _mm256_unpackhi_epi32(ymm_0[kk * 2], ymm_0[kk * 2 + 1]); } - for (kk = 0; kk < 8; kk++) { + for (kk = 0; kk < 8; kk ++){ ymm_storeage[kk][hh] = ymm_1[kk]; } } - for (mm = 0; mm < 8; mm++) { + for (mm = 0; mm < 8; mm ++) { - for (kk = 0; kk < 4; kk++) { + for (kk = 0; kk < 4; kk ++){ ymm_0[kk] = ymm_storeage[mm][kk]; } @@ -182,75 +186,79 @@ int64_t bshuf_trans_byte_bitrow_avx2(void* in, void* out, const size_t size, ymm_0[2] = _mm256_permute2x128_si256(ymm_1[0], ymm_1[1], 49); ymm_0[3] = _mm256_permute2x128_si256(ymm_1[2], ymm_1[3], 49); - _mm256_storeu_si256((__m256i*)&out_b[ + _mm256_storeu_si256((__m256i *) &out_b[ (jj + mm * 2 + 0 * 16) * nrows + ii * 8], ymm_0[0]); - _mm256_storeu_si256((__m256i*)&out_b[ + _mm256_storeu_si256((__m256i *) &out_b[ (jj + mm * 2 + 0 * 16 + 1) * nrows + ii * 8], ymm_0[1]); - _mm256_storeu_si256((__m256i*)&out_b[ + _mm256_storeu_si256((__m256i *) &out_b[ (jj + mm * 2 + 1 * 16) * nrows + ii * 8], ymm_0[2]); - _mm256_storeu_si256((__m256i*)&out_b[ + _mm256_storeu_si256((__m256i *) &out_b[ (jj + mm * 2 + 1 * 16 + 1) * nrows + ii * 8], ymm_0[3]); } } } - for (ii = 0; ii < nrows; ii++) { - for (jj = nbyte_row - nbyte_row % 32; jj < nbyte_row; jj++) { + for (ii = 0; ii < nrows; ii ++ ) { + for (jj = nbyte_row - nbyte_row % 32; jj < nbyte_row; jj ++) { out_b[jj * nrows + ii] = in_b[ii * nbyte_row + jj]; } } - return (int64_t)size * (int64_t)elem_size; + return size * elem_size; } /* Shuffle bits within the bytes of eight element blocks. 
*/ -int64_t bshuf_shuffle_bit_eightelem_avx2(void* in, void* out, const size_t size, - const size_t elem_size) { +int64_t bshuf_shuffle_bit_eightelem_AVX(const void* in, void* out, const size_t size, + const size_t elem_size) { CHECK_MULT_EIGHT(size); - /* With a bit of care, this could be written such that such that it is */ - /* in_buf = out_buf safe. */ - char* in_b = (char*)in; - char* out_b = (char*)out; + // With a bit of care, this could be written such that such that it is + // in_buf = out_buf safe. + const char* in_b = (const char*) in; + char* out_b = (char*) out; + size_t ii, jj, kk; size_t nbyte = elem_size * size; - size_t ii, jj, kk, ind; __m256i ymm; int32_t bt; if (elem_size % 4) { - return bshuf_shuffle_bit_eightelem_sse2(in, out, size, elem_size); + return bshuf_shuffle_bit_eightelem_SSE(in, out, size, elem_size); } else { for (jj = 0; jj + 31 < 8 * elem_size; jj += 32) { for (ii = 0; ii + 8 * elem_size - 1 < nbyte; ii += 8 * elem_size) { - ymm = _mm256_loadu_si256((__m256i*)&in_b[ii + jj]); + ymm = _mm256_loadu_si256((__m256i *) &in_b[ii + jj]); for (kk = 0; kk < 8; kk++) { bt = _mm256_movemask_epi8(ymm); ymm = _mm256_slli_epi16(ymm, 1); - ind = (ii + jj / 8 + (7 - kk) * elem_size); - *(int32_t*)&out_b[ind] = bt; + size_t ind = (ii + jj / 8 + (7 - kk) * elem_size); + * (int32_t *) &out_b[ind] = bt; } } } } - return (int64_t)size * (int64_t)elem_size; + return size * elem_size; } /* Untranspose bits within elements. */ -int64_t bshuf_untrans_bit_elem_avx2(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf) { +int64_t bshuf_untrans_bit_elem_AVX(const void* in, void* out, const size_t size, + const size_t elem_size) { int64_t count; CHECK_MULT_EIGHT(size); - count = bshuf_trans_byte_bitrow_avx2(in, tmp_buf, size, elem_size); - CHECK_ERR(count); - count = bshuf_shuffle_bit_eightelem_avx2(tmp_buf, out, size, elem_size); + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + + count = bshuf_trans_byte_bitrow_AVX(in, tmp_buf, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_shuffle_bit_eightelem_AVX(tmp_buf, out, size, elem_size); + free(tmp_buf); return count; } diff --git a/blosc/bitshuffle-avx2.h b/blosc/bitshuffle-avx2.h index 2b40bd29..ffbb4c8a 100644 --- a/blosc/bitshuffle-avx2.h +++ b/blosc/bitshuffle-avx2.h @@ -19,17 +19,26 @@ #include /** - AVX2-accelerated bitshuffle routine. + * AVX2-accelerated bitshuffle routine. */ BLOSC_NO_EXPORT int64_t - bshuf_trans_bit_elem_avx2(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf); + bshuf_trans_bit_elem_AVX(const void* in, void* out, const size_t size, + const size_t elem_size); /** - AVX2-accelerated bitunshuffle routine. + * AVX2-accelerated bitunshuffle routine. 
*/ BLOSC_NO_EXPORT int64_t - bshuf_untrans_bit_elem_avx2(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf); + bshuf_untrans_bit_elem_AVX(const void* in, void* out, const size_t size, + const size_t elem_size); + +/** + * AVX2 utils used by AVX512 functions + */ +int64_t bshuf_shuffle_bit_eightelem_AVX(const void* in, void* out, const size_t size, + const size_t elem_size); + +int64_t bshuf_trans_byte_bitrow_AVX(const void* in, void* out, const size_t size, + const size_t elem_size); #endif /* BLOSC_BITSHUFFLE_AVX2_H */ diff --git a/blosc/bitshuffle-avx512.c b/blosc/bitshuffle-avx512.c new file mode 100644 index 00000000..15fd1ea8 --- /dev/null +++ b/blosc/bitshuffle-avx512.c @@ -0,0 +1,161 @@ +/********************************************************************* + Blosc - Blocked Shuffling and Compression Library + + Copyright (c) 2021 The Blosc Development Team + https://blosc.org + License: BSD 3-Clause (see LICENSE.txt) + + See LICENSE.txt for details about copyright and rights to use. +**********************************************************************/ + +/********************************************************************* + Bitshuffle - Filter for improving compression of typed binary data. + + Author: Kiyoshi Masui + Website: https://github.com/kiyo-masui/bitshuffle + + Note: Adapted for c-blosc2 by Francesc Alted. + + See LICENSES/BITSHUFFLE.txt file for details about copyright and + rights to use. +**********************************************************************/ + +/* Make sure AVX512 is available for the compilation target and compiler. */ +#if defined(__AVX512F__) && defined (__AVX512BW__) +#include +#include "bitshuffle-avx512.h" +#include "bitshuffle-avx2.h" +#include "bitshuffle-sse2.h" +#include "bitshuffle-generic.h" + + +/* Transpose bits within bytes. */ +int64_t bshuf_trans_bit_byte_AVX512(const void* in, void* out, const size_t size, + const size_t elem_size) { + + size_t ii, kk; + const char* in_b = (const char*) in; + char* out_b = (char*) out; + size_t nbyte = elem_size * size; + int64_t count; + + int64_t* out_i64; + __m512i zmm; + __mmask64 bt; + if (nbyte >= 64) { + const __m512i mask = _mm512_set1_epi8(0); + + for (ii = 0; ii + 63 < nbyte; ii += 64) { + zmm = _mm512_loadu_si512((__m512i *) &in_b[ii]); + for (kk = 0; kk < 8; kk++) { + bt = _mm512_cmp_epi8_mask(zmm, mask, 1); + zmm = _mm512_slli_epi16(zmm, 1); + out_i64 = (int64_t*) &out_b[((7 - kk) * nbyte + ii) / 8]; + *out_i64 = (int64_t)bt; + } + } + } + + __m256i ymm; + int32_t bt32; + int32_t* out_i32; + size_t start = nbyte - nbyte % 64; + for (ii = start; ii + 31 < nbyte; ii += 32) { + ymm = _mm256_loadu_si256((__m256i *) &in_b[ii]); + for (kk = 0; kk < 8; kk++) { + bt32 = _mm256_movemask_epi8(ymm); + ymm = _mm256_slli_epi16(ymm, 1); + out_i32 = (int32_t*) &out_b[((7 - kk) * nbyte + ii) / 8]; + *out_i32 = bt32; + } + } + + + count = bshuf_trans_bit_byte_remainder(in, out, size, elem_size, + nbyte - nbyte % 64 % 32); + + return count; +} + + +/* Transpose bits within elements. 
*/ +int64_t bshuf_trans_bit_elem_AVX512(const void* in, void* out, const size_t size, + const size_t elem_size) { + + int64_t count; + + CHECK_MULT_EIGHT(size); + + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + + count = bshuf_trans_byte_elem_SSE(in, out, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_trans_bit_byte_AVX512(out, tmp_buf, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_trans_bitrow_eight(tmp_buf, out, size, elem_size); + + free(tmp_buf); + + return count; + +} + +/* Shuffle bits within the bytes of eight element blocks. */ +int64_t bshuf_shuffle_bit_eightelem_AVX512(const void* in, void* out, const size_t size, + const size_t elem_size) { + + CHECK_MULT_EIGHT(size); + + // With a bit of care, this could be written such that such that it is + // in_buf = out_buf safe. + const char* in_b = (const char*) in; + char* out_b = (char*) out; + + size_t ii, jj, kk; + size_t nbyte = elem_size * size; + + __m512i zmm; + __mmask64 bt; + + if (elem_size % 8) { + return bshuf_shuffle_bit_eightelem_AVX(in, out, size, elem_size); + } else { + const __m512i mask = _mm512_set1_epi8(0); + for (jj = 0; jj + 63 < 8 * elem_size; jj += 64) { + for (ii = 0; ii + 8 * elem_size - 1 < nbyte; + ii += 8 * elem_size) { + zmm = _mm512_loadu_si512((__m512i *) &in_b[ii + jj]); + for (kk = 0; kk < 8; kk++) { + bt = _mm512_cmp_epi8_mask(zmm, mask, 1); + zmm = _mm512_slli_epi16(zmm, 1); + size_t ind = (ii + jj / 8 + (7 - kk) * elem_size); + * (int64_t *) &out_b[ind] = bt; + } + } + } + + } + return size * elem_size; +} + +/* Untranspose bits within elements. */ +int64_t bshuf_untrans_bit_elem_AVX512(const void* in, void* out, const size_t size, + const size_t elem_size) { + + int64_t count; + + CHECK_MULT_EIGHT(size); + + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + + count = bshuf_trans_byte_bitrow_AVX(in, tmp_buf, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_shuffle_bit_eightelem_AVX512(tmp_buf, out, size, elem_size); + + free(tmp_buf); + return count; +} + +#endif diff --git a/blosc/bitshuffle-avx512.h b/blosc/bitshuffle-avx512.h new file mode 100644 index 00000000..28c7a81c --- /dev/null +++ b/blosc/bitshuffle-avx512.h @@ -0,0 +1,29 @@ +/********************************************************************* + Blosc - Blocked Shuffling and Compression Library + + Copyright (c) 2021 The Blosc Development Team + https://blosc.org + License: BSD 3-Clause (see LICENSE.txt) + + See LICENSE.txt for details about copyright and rights to use. +**********************************************************************/ + +/* AVX512-accelerated shuffle/unshuffle routines. */ + +#ifndef BLOSC_BITSHUFFLE_AVX512_H +#define BLOSC_BITSHUFFLE_AVX512_H + +#include "blosc2/blosc2-common.h" + +#include +#include + +BLOSC_NO_EXPORT int64_t + bshuf_trans_bit_elem_AVX512(const void* in, void* out, const size_t size, + const size_t elem_size); + +BLOSC_NO_EXPORT int64_t + bshuf_untrans_bit_elem_AVX512(const void* in, void* out, const size_t size, + const size_t elem_size); + +#endif /* BLOSC_BITSHUFFLE_AVX512_H */ diff --git a/blosc/bitshuffle-generic.c b/blosc/bitshuffle-generic.c index 13748efd..b8bf001e 100644 --- a/blosc/bitshuffle-generic.c +++ b/blosc/bitshuffle-generic.c @@ -10,165 +10,188 @@ #include "bitshuffle-generic.h" -#include +#include #ifdef _MSC_VER #pragma warning (push) #pragma warning (disable: 4146) #endif + +/* Memory copy with bshuf call signature. For testing and profiling. 
*/ +int64_t bshuf_copy(const void* in, void* out, const size_t size, + const size_t elem_size) { + + const char* in_b = (const char*) in; + char* out_b = (char*) out; + + memcpy(out_b, in_b, size * elem_size); + return size * elem_size; +} + + /* Transpose bytes within elements, starting partway through input. */ int64_t bshuf_trans_byte_elem_remainder(const void* in, void* out, const size_t size, - const size_t elem_size, const size_t start) { - - char* in_b = (char*) in; - char* out_b = (char*) out; - size_t ii, jj, kk; - - CHECK_MULT_EIGHT(start); - - if (size > start) { - /* ii loop separated into 2 loops so the compiler can unroll */ - /* the inner one. */ - for (ii = start; ii + 7 < size; ii += 8) { - for (jj = 0; jj < elem_size; jj++) { - for (kk = 0; kk < 8; kk++) { - out_b[jj * size + ii + kk] - = in_b[ii * elem_size + kk * elem_size + jj]; - } - } - } - for (ii = size - size % 8; ii < size; ii ++) { - for (jj = 0; jj < elem_size; jj++) { - out_b[jj * size + ii] = in_b[ii * elem_size + jj]; - } + const size_t elem_size, const size_t start) { + + size_t ii, jj, kk; + const char* in_b = (const char*) in; + char* out_b = (char*) out; + + CHECK_MULT_EIGHT(start); + + if (size > start) { + // ii loop separated into 2 loops so the compiler can unroll + // the inner one. + for (ii = start; ii + 7 < size; ii += 8) { + for (jj = 0; jj < elem_size; jj++) { + for (kk = 0; kk < 8; kk++) { + out_b[jj * size + ii + kk] + = in_b[ii * elem_size + kk * elem_size + jj]; } + } + } + for (ii = size - size % 8; ii < size; ii ++) { + for (jj = 0; jj < elem_size; jj++) { + out_b[jj * size + ii] = in_b[ii * elem_size + jj]; + } } - return (int64_t)size * (int64_t)elem_size; + } + return size * elem_size; } /* Transpose bytes within elements. */ int64_t bshuf_trans_byte_elem_scal(const void* in, void* out, const size_t size, - const size_t elem_size) { + const size_t elem_size) { - return bshuf_trans_byte_elem_remainder(in, out, size, elem_size, 0); + return bshuf_trans_byte_elem_remainder(in, out, size, elem_size, 0); } /* Transpose bits within bytes. */ int64_t bshuf_trans_bit_byte_remainder(const void* in, void* out, const size_t size, - const size_t elem_size, const size_t start_byte) { + const size_t elem_size, const size_t start_byte) { - const uint64_t* in_b = (const uint64_t*) in; - uint8_t* out_b = (uint8_t*) out; + const uint64_t* in_b = (const uint64_t*) in; + uint8_t* out_b = (uint8_t*) out; - uint64_t x, t; + uint64_t x, t; - size_t ii, kk; - size_t nbyte = elem_size * size; - size_t nbyte_bitrow = nbyte / 8; + size_t ii, kk; + size_t nbyte = elem_size * size; + size_t nbyte_bitrow = nbyte / 8; - uint64_t e=1; - const int little_endian = *(uint8_t *) &e == 1; - const size_t bit_row_skip = little_endian ? nbyte_bitrow : -nbyte_bitrow; - const size_t bit_row_offset = little_endian ? 0 : 7 * nbyte_bitrow; + uint64_t e=1; + const int little_endian = *(uint8_t *) &e == 1; + const size_t bit_row_skip = little_endian ? nbyte_bitrow : -nbyte_bitrow; + const int64_t bit_row_offset = little_endian ? 
0 : 7 * nbyte_bitrow; - CHECK_MULT_EIGHT(nbyte); - CHECK_MULT_EIGHT(start_byte); + CHECK_MULT_EIGHT(nbyte); + CHECK_MULT_EIGHT(start_byte); - for (ii = start_byte / 8; ii < nbyte_bitrow; ii ++) { - x = in_b[ii]; - if (little_endian) { - TRANS_BIT_8X8(x, t); - } else { - TRANS_BIT_8X8_BE(x, t); - } - for (kk = 0; kk < 8; kk ++) { - out_b[bit_row_offset + kk * bit_row_skip + ii] = (uint8_t)x; - x = x >> 8; - } + for (ii = start_byte / 8; ii < nbyte_bitrow; ii ++) { + x = in_b[ii]; + if (little_endian) { + TRANS_BIT_8X8(x, t); + } else { + TRANS_BIT_8X8_BE(x, t); } - return (int64_t)size * (int64_t)elem_size; + for (kk = 0; kk < 8; kk ++) { + out_b[bit_row_offset + kk * bit_row_skip + ii] = x; + x = x >> 8; + } + } + return size * elem_size; } /* Transpose bits within bytes. */ int64_t bshuf_trans_bit_byte_scal(const void* in, void* out, const size_t size, - const size_t elem_size) { + const size_t elem_size) { - return bshuf_trans_bit_byte_remainder(in, out, size, elem_size, 0); + return bshuf_trans_bit_byte_remainder(in, out, size, elem_size, 0); } /* General transpose of an array, optimized for large element sizes. */ int64_t bshuf_trans_elem(const void* in, void* out, const size_t lda, - const size_t ldb, const size_t elem_size) { - - char* in_b = (char*) in; - char* out_b = (char*) out; - size_t ii, jj; - for (ii = 0; ii < lda; ii++) { - for (jj = 0; jj < ldb; jj++) { - memcpy(&out_b[(jj*lda + ii) * elem_size], - &in_b[(ii*ldb + jj) * elem_size], elem_size); - } + const size_t ldb, const size_t elem_size) { + + size_t ii, jj; + const char* in_b = (const char*) in; + char* out_b = (char*) out; + for(ii = 0; ii < lda; ii++) { + for(jj = 0; jj < ldb; jj++) { + memcpy(&out_b[(jj*lda + ii) * elem_size], + &in_b[(ii*ldb + jj) * elem_size], elem_size); } - return (int64_t)lda * (int64_t)ldb * (int64_t)elem_size; + } + return lda * ldb * elem_size; } /* Transpose rows of shuffled bits (size / 8 bytes) within groups of 8. */ int64_t bshuf_trans_bitrow_eight(const void* in, void* out, const size_t size, - const size_t elem_size) { + const size_t elem_size) { - size_t nbyte_bitrow = size / 8; + size_t nbyte_bitrow = size / 8; - CHECK_MULT_EIGHT(size); + CHECK_MULT_EIGHT(size); - return bshuf_trans_elem(in, out, 8, elem_size, nbyte_bitrow); + return bshuf_trans_elem(in, out, 8, elem_size, nbyte_bitrow); } /* Transpose bits within elements. */ int64_t bshuf_trans_bit_elem_scal(const void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf) { + const size_t elem_size) { + + int64_t count; + void *tmp_buf; - int64_t count; + CHECK_MULT_EIGHT(size); - CHECK_MULT_EIGHT(size); + tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; - count = bshuf_trans_byte_elem_scal(in, out, size, elem_size); - CHECK_ERR(count); - count = bshuf_trans_bit_byte_scal(out, tmp_buf, size, elem_size); - CHECK_ERR(count); - count = bshuf_trans_bitrow_eight(tmp_buf, out, size, elem_size); + count = bshuf_trans_byte_elem_scal(in, out, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_trans_bit_byte_scal(out, tmp_buf, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_trans_bitrow_eight(tmp_buf, out, size, elem_size); - return count; + free(tmp_buf); + + return count; } /* For data organized into a row for each bit (8 * elem_size rows), transpose * the bytes. 
*/ int64_t bshuf_trans_byte_bitrow_scal(const void* in, void* out, const size_t size, - const size_t elem_size) { - char* in_b = (char*) in; - char* out_b = (char*) out; + const size_t elem_size) { + size_t ii, jj, kk, nbyte_row; + const char *in_b; + char *out_b; + - size_t nbyte_row = size / 8; - size_t ii, jj, kk; + in_b = (const char*) in; + out_b = (char*) out; - CHECK_MULT_EIGHT(size); + nbyte_row = size / 8; - for (jj = 0; jj < elem_size; jj++) { - for (ii = 0; ii < nbyte_row; ii++) { - for (kk = 0; kk < 8; kk++) { - out_b[ii * 8 * elem_size + jj * 8 + kk] = \ + CHECK_MULT_EIGHT(size); + + for (jj = 0; jj < elem_size; jj++) { + for (ii = 0; ii < nbyte_row; ii++) { + for (kk = 0; kk < 8; kk++) { + out_b[ii * 8 * elem_size + jj * 8 + kk] = \ in_b[(jj * 8 + kk) * nbyte_row + ii]; - } - } + } } - return (int64_t)size * (int64_t)elem_size; + } + return size * elem_size; } @@ -176,56 +199,62 @@ int64_t bshuf_trans_byte_bitrow_scal(const void* in, void* out, const size_t siz int64_t bshuf_shuffle_bit_eightelem_scal(const void* in, void* out, \ const size_t size, const size_t elem_size) { - const char *in_b; - char *out_b; - uint64_t x, t; - size_t ii, jj, kk; - size_t nbyte, out_index; - - uint64_t e=1; - const int little_endian = *(uint8_t *) &e == 1; - const size_t elem_skip = little_endian ? elem_size : -elem_size; - const size_t elem_offset = little_endian ? 0 : 7 * elem_size; - - CHECK_MULT_EIGHT(size); - - in_b = (const char*) in; - out_b = (char*) out; - - nbyte = elem_size * size; - - for (jj = 0; jj < 8 * elem_size; jj += 8) { - for (ii = 0; ii + 8 * elem_size - 1 < nbyte; ii += 8 * elem_size) { - x = *((uint64_t*) &in_b[ii + jj]); - if (little_endian) { - TRANS_BIT_8X8(x, t); - } else { - TRANS_BIT_8X8_BE(x, t); - } - for (kk = 0; kk < 8; kk++) { - out_index = ii + jj / 8 + elem_offset + kk * elem_skip; - *((uint8_t*) &out_b[out_index]) = (uint8_t)x; - x = x >> 8; - } - } + const char *in_b; + char *out_b; + uint64_t x, t; + size_t ii, jj, kk; + size_t nbyte, out_index; + + uint64_t e=1; + const int little_endian = *(uint8_t *) &e == 1; + const size_t elem_skip = little_endian ? elem_size : -elem_size; + const uint64_t elem_offset = little_endian ? 0 : 7 * elem_size; + + CHECK_MULT_EIGHT(size); + + in_b = (const char*) in; + out_b = (char*) out; + + nbyte = elem_size * size; + + for (jj = 0; jj < 8 * elem_size; jj += 8) { + for (ii = 0; ii + 8 * elem_size - 1 < nbyte; ii += 8 * elem_size) { + x = *((uint64_t*) &in_b[ii + jj]); + if (little_endian) { + TRANS_BIT_8X8(x, t); + } else { + TRANS_BIT_8X8_BE(x, t); + } + for (kk = 0; kk < 8; kk++) { + out_index = ii + jj / 8 + elem_offset + kk * elem_skip; + *((uint8_t*) &out_b[out_index]) = x; + x = x >> 8; + } } - return (int64_t)size * (int64_t)elem_size; + } + return size * elem_size; } /* Untranspose bits within elements. 
*/ int64_t bshuf_untrans_bit_elem_scal(const void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf) { + const size_t elem_size) { + + int64_t count; + void *tmp_buf; + + CHECK_MULT_EIGHT(size); - int64_t count; + tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; - CHECK_MULT_EIGHT(size); + count = bshuf_trans_byte_bitrow_scal(in, tmp_buf, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_shuffle_bit_eightelem_scal(tmp_buf, out, size, elem_size); - count = bshuf_trans_byte_bitrow_scal(in, tmp_buf, size, elem_size); - CHECK_ERR(count); - count = bshuf_shuffle_bit_eightelem_scal(tmp_buf, out, size, elem_size); + free(tmp_buf); - return count; + return count; } #ifdef _MSC_VER diff --git a/blosc/bitshuffle-generic.h b/blosc/bitshuffle-generic.h index 843b66a5..e68d5d1b 100644 --- a/blosc/bitshuffle-generic.h +++ b/blosc/bitshuffle-generic.h @@ -40,8 +40,9 @@ do { \ if ((count) < 0) \ return count; \ - } while (0) + } while (0) +#define CHECK_ERR_FREE(count, buf) if (count < 0) { free(buf); return count; } /* ---- Worker code not requiring special instruction sets. ---- * @@ -93,6 +94,11 @@ } + +/* Memory copy with bshuf call signature. For testing and profiling. */ +BLOSC_NO_EXPORT int64_t +bshuf_copy(const void* in, void* out, const size_t size, const size_t elem_size); + /* Private functions */ BLOSC_NO_EXPORT int64_t bshuf_trans_byte_elem_remainder(const void* in, void* out, const size_t size, @@ -142,7 +148,7 @@ bshuf_trans_byte_bitrow_scal(const void* in, void* out, const size_t size, BLOSC_NO_EXPORT int64_t bshuf_trans_bit_elem_scal(const void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf); + const size_t elem_size); /* Unshuffle bitshuffled data. * @@ -167,6 +173,6 @@ bshuf_trans_bit_elem_scal(const void* in, void* out, const size_t size, BLOSC_NO_EXPORT int64_t bshuf_untrans_bit_elem_scal(const void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf); + const size_t elem_size); #endif /* BLOSC_BITSHUFFLE_GENERIC_H */ diff --git a/blosc/bitshuffle-neon.c b/blosc/bitshuffle-neon.c index 8b1b5948..acc4a1df 100644 --- a/blosc/bitshuffle-neon.c +++ b/blosc/bitshuffle-neon.c @@ -8,6 +8,18 @@ See LICENSE.txt for details about copyright and rights to use. **********************************************************************/ +/********************************************************************* + Bitshuffle - Filter for improving compression of typed binary data. + + Author: Kiyoshi Masui + Website: https://github.com/kiyo-masui/bitshuffle + + Note: Adapted for c-blosc2 by Francesc Alted. + + See LICENSES/BITSHUFFLE.txt file for details about copyright and + rights to use. +**********************************************************************/ + #include "bitshuffle-neon.h" #include "bitshuffle-generic.h" @@ -16,7 +28,7 @@ #include -#include +#include /* The next is useful for debugging purposes */ #if 0 @@ -33,975 +45,450 @@ static void printmem(uint8_t* buf) } #endif -/* Routine optimized for bit-shuffling a buffer for a type size of 1 byte. 
*/ -static void -bitshuffle1_neon(void* src, void* dest, const size_t size, const size_t elem_size) { - - uint16x8_t x0; - size_t i, j, k; - uint8x8_t lo_x, hi_x, lo, hi; - - const int8_t __attribute__ ((aligned (16))) xr[8] = {0, 1, 2, 3, 4, 5, 6, 7}; - uint8x8_t mask_and = vdup_n_u8(0x01); - int8x8_t mask_shift = vld1_s8(xr); - - for (i = 0, k = 0; i < size * elem_size; i += 16, k++) { - /* Load 16-byte groups */ - x0 = vld1q_u8(src + k * 16); - /* Split in 8-bytes grops */ - lo_x = vget_low_u8(x0); - hi_x = vget_high_u8(x0); - for (j = 0; j < 8; j++) { - /* Create mask from the most significant bit of each 8-bit element */ - lo = vand_u8(lo_x, mask_and); - lo = vshl_u8(lo, mask_shift); - hi = vand_u8(hi_x, mask_and); - hi = vshl_u8(hi, mask_shift); - lo = vpadd_u8(lo, lo); - lo = vpadd_u8(lo, lo); - lo = vpadd_u8(lo, lo); - hi = vpadd_u8(hi, hi); - hi = vpadd_u8(hi, hi); - hi = vpadd_u8(hi, hi); - /* Shift packed 8-bit */ - lo_x = vshr_n_u8(lo_x, 1); - hi_x = vshr_n_u8(hi_x, 1); - /* Store the created mask to the destination vector */ - vst1_lane_u8(dest + 2 * k + j * size * elem_size / (8 * elem_size), lo, 0); - vst1_lane_u8(dest + 2 * k + 1 + j * size * elem_size / (8 * elem_size), hi, 0); + +/* ---- Worker code that uses Arm NEON ---- + * + * The following code makes use of the Arm NEON instruction set. + * NEON technology is the implementation of the ARM Advanced Single + * Instruction Multiple Data (SIMD) extension. + * The NEON unit is the component of the processor that executes SIMD instructions. + * It is also called the NEON Media Processing Engine (MPE). + * + */ + +/* Transpose bytes within elements for 16 bit elements. */ +int64_t bshuf_trans_byte_elem_NEON_16(const void* in, void* out, const size_t size) { + + size_t ii; + const char *in_b = (const char*) in; + char *out_b = (char*) out; + int8x16_t a0, b0, a1, b1; + + for (ii=0; ii + 15 < size; ii += 16) { + a0 = vld1q_s8(in_b + 2*ii + 0*16); + b0 = vld1q_s8(in_b + 2*ii + 1*16); + + a1 = vzip1q_s8(a0, b0); + b1 = vzip2q_s8(a0, b0); + + a0 = vzip1q_s8(a1, b1); + b0 = vzip2q_s8(a1, b1); + + a1 = vzip1q_s8(a0, b0); + b1 = vzip2q_s8(a0, b0); + + a0 = vzip1q_s8(a1, b1); + b0 = vzip2q_s8(a1, b1); + + vst1q_s8(out_b + 0*size + ii, a0); + vst1q_s8(out_b + 1*size + ii, b0); } - } + + return bshuf_trans_byte_elem_remainder(in, out, size, 2, + size - size % 16); } -/* Routine optimized for bit-shuffling a buffer for a type size of 2 bytes. 
*/ -static void -bitshuffle2_neon(void* src, void* dest, const size_t size, const size_t elem_size) { - - uint8x16x2_t x0; - size_t i, j, k; - uint8x8_t lo_x[2], hi_x[2], lo[2], hi[2]; - - const int8_t __attribute__ ((aligned (16))) xr[8] = {0, 1, 2, 3, 4, 5, 6, 7}; - uint8x8_t mask_and = vdup_n_u8(0x01); - int8x8_t mask_shift = vld1_s8(xr); - - for (i = 0, k = 0; i < size * elem_size; i += 32, k++) { - /* Load 32-byte groups */ - x0 = vld2q_u8(src + i); - /* Split in 8-bytes grops */ - lo_x[0] = vget_low_u8(x0.val[0]); - hi_x[0] = vget_high_u8(x0.val[0]); - lo_x[1] = vget_low_u8(x0.val[1]); - hi_x[1] = vget_high_u8(x0.val[1]); - for (j = 0; j < 8; j++) { - /* Create mask from the most significant bit of each 8-bit element */ - lo[0] = vand_u8(lo_x[0], mask_and); - lo[0] = vshl_u8(lo[0], mask_shift); - lo[1] = vand_u8(lo_x[1], mask_and); - lo[1] = vshl_u8(lo[1], mask_shift); - - hi[0] = vand_u8(hi_x[0], mask_and); - hi[0] = vshl_u8(hi[0], mask_shift); - hi[1] = vand_u8(hi_x[1], mask_and); - hi[1] = vshl_u8(hi[1], mask_shift); - - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[1] = vpadd_u8(lo[1], lo[1]); - lo[1] = vpadd_u8(lo[1], lo[1]); - lo[1] = vpadd_u8(lo[1], lo[1]); - - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[1] = vpadd_u8(hi[1], hi[1]); - hi[1] = vpadd_u8(hi[1], hi[1]); - hi[1] = vpadd_u8(hi[1], hi[1]); - /* Shift packed 8-bit */ - lo_x[0] = vshr_n_u8(lo_x[0], 1); - hi_x[0] = vshr_n_u8(hi_x[0], 1); - lo_x[1] = vshr_n_u8(lo_x[1], 1); - hi_x[1] = vshr_n_u8(hi_x[1], 1); - /* Store the created mask to the destination vector */ - vst1_lane_u8(dest + 2 * k + j * size * elem_size / (8 * elem_size), lo[0], 0); - vst1_lane_u8(dest + 2 * k + j * size * elem_size / (8 * elem_size) + size * elem_size / 2, lo[1], 0); - vst1_lane_u8(dest + 2 * k + 1 + j * size * elem_size / (8 * elem_size), hi[0], 0); - vst1_lane_u8(dest + 2 * k + 1 + j * size * elem_size / (8 * elem_size) + size * elem_size / 2, hi[1], 0); + +/* Transpose bytes within elements for 32 bit elements. */ +int64_t bshuf_trans_byte_elem_NEON_32(const void* in, void* out, const size_t size) { + + size_t ii; + const char *in_b; + char *out_b; + in_b = (const char*) in; + out_b = (char*) out; + int8x16_t a0, b0, c0, d0, a1, b1, c1, d1; + int64x2_t a2, b2, c2, d2; + + for (ii=0; ii + 15 < size; ii += 16) { + a0 = vld1q_s8(in_b + 4*ii + 0*16); + b0 = vld1q_s8(in_b + 4*ii + 1*16); + c0 = vld1q_s8(in_b + 4*ii + 2*16); + d0 = vld1q_s8(in_b + 4*ii + 3*16); + + a1 = vzip1q_s8(a0, b0); + b1 = vzip2q_s8(a0, b0); + c1 = vzip1q_s8(c0, d0); + d1 = vzip2q_s8(c0, d0); + + a0 = vzip1q_s8(a1, b1); + b0 = vzip2q_s8(a1, b1); + c0 = vzip1q_s8(c1, d1); + d0 = vzip2q_s8(c1, d1); + + a1 = vzip1q_s8(a0, b0); + b1 = vzip2q_s8(a0, b0); + c1 = vzip1q_s8(c0, d0); + d1 = vzip2q_s8(c0, d0); + + a2 = vzip1q_s64(vreinterpretq_s64_s8(a1), vreinterpretq_s64_s8(c1)); + b2 = vzip2q_s64(vreinterpretq_s64_s8(a1), vreinterpretq_s64_s8(c1)); + c2 = vzip1q_s64(vreinterpretq_s64_s8(b1), vreinterpretq_s64_s8(d1)); + d2 = vzip2q_s64(vreinterpretq_s64_s8(b1), vreinterpretq_s64_s8(d1)); + + vst1q_s64((int64_t *) (out_b + 0*size + ii), a2); + vst1q_s64((int64_t *) (out_b + 1*size + ii), b2); + vst1q_s64((int64_t *) (out_b + 2*size + ii), c2); + vst1q_s64((int64_t *) (out_b + 3*size + ii), d2); } - } + + return bshuf_trans_byte_elem_remainder(in, out, size, 4, + size - size % 16); } -/* Routine optimized for bit-shuffling a buffer for a type size of 4 bytes. 
*/ -static void -bitshuffle4_neon(void* src, void* dest, const size_t size, const size_t elem_size) { - uint8x16x4_t x0; - size_t i, j, k; - uint8x8_t lo_x[4], hi_x[4], lo[4], hi[4]; - - const int8_t __attribute__ ((aligned (16))) xr[8] = {0, 1, 2, 3, 4, 5, 6, 7}; - uint8x8_t mask_and = vdup_n_u8(0x01); - int8x8_t mask_shift = vld1_s8(xr); - - for (i = 0, k = 0; i < size * elem_size; i += 64, k++) { - /* Load 64-byte groups */ - x0 = vld4q_u8(src + i); - /* Split in 8-bytes grops */ - lo_x[0] = vget_low_u8(x0.val[0]); - hi_x[0] = vget_high_u8(x0.val[0]); - lo_x[1] = vget_low_u8(x0.val[1]); - hi_x[1] = vget_high_u8(x0.val[1]); - lo_x[2] = vget_low_u8(x0.val[2]); - hi_x[2] = vget_high_u8(x0.val[2]); - lo_x[3] = vget_low_u8(x0.val[3]); - hi_x[3] = vget_high_u8(x0.val[3]); - for (j = 0; j < 8; j++) { - /* Create mask from the most significant bit of each 8-bit element */ - lo[0] = vand_u8(lo_x[0], mask_and); - lo[0] = vshl_u8(lo[0], mask_shift); - lo[1] = vand_u8(lo_x[1], mask_and); - lo[1] = vshl_u8(lo[1], mask_shift); - lo[2] = vand_u8(lo_x[2], mask_and); - lo[2] = vshl_u8(lo[2], mask_shift); - lo[3] = vand_u8(lo_x[3], mask_and); - lo[3] = vshl_u8(lo[3], mask_shift); - - hi[0] = vand_u8(hi_x[0], mask_and); - hi[0] = vshl_u8(hi[0], mask_shift); - hi[1] = vand_u8(hi_x[1], mask_and); - hi[1] = vshl_u8(hi[1], mask_shift); - hi[2] = vand_u8(hi_x[2], mask_and); - hi[2] = vshl_u8(hi[2], mask_shift); - hi[3] = vand_u8(hi_x[3], mask_and); - hi[3] = vshl_u8(hi[3], mask_shift); - - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[1] = vpadd_u8(lo[1], lo[1]); - lo[1] = vpadd_u8(lo[1], lo[1]); - lo[1] = vpadd_u8(lo[1], lo[1]); - lo[2] = vpadd_u8(lo[2], lo[2]); - lo[2] = vpadd_u8(lo[2], lo[2]); - lo[2] = vpadd_u8(lo[2], lo[2]); - lo[3] = vpadd_u8(lo[3], lo[3]); - lo[3] = vpadd_u8(lo[3], lo[3]); - lo[3] = vpadd_u8(lo[3], lo[3]); - - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[1] = vpadd_u8(hi[1], hi[1]); - hi[1] = vpadd_u8(hi[1], hi[1]); - hi[1] = vpadd_u8(hi[1], hi[1]); - hi[2] = vpadd_u8(hi[2], hi[2]); - hi[2] = vpadd_u8(hi[2], hi[2]); - hi[2] = vpadd_u8(hi[2], hi[2]); - hi[3] = vpadd_u8(hi[3], hi[3]); - hi[3] = vpadd_u8(hi[3], hi[3]); - hi[3] = vpadd_u8(hi[3], hi[3]); - /* Shift packed 8-bit */ - lo_x[0] = vshr_n_u8(lo_x[0], 1); - hi_x[0] = vshr_n_u8(hi_x[0], 1); - lo_x[1] = vshr_n_u8(lo_x[1], 1); - hi_x[1] = vshr_n_u8(hi_x[1], 1); - lo_x[2] = vshr_n_u8(lo_x[2], 1); - hi_x[2] = vshr_n_u8(hi_x[2], 1); - lo_x[3] = vshr_n_u8(lo_x[3], 1); - hi_x[3] = vshr_n_u8(hi_x[3], 1); - /* Store the created mask to the destination vector */ - vst1_lane_u8(dest + 2 * k + j * size * elem_size / (8 * elem_size) + 0 * size * elem_size / 4, lo[0], 0); - vst1_lane_u8(dest + 2 * k + j * size * elem_size / (8 * elem_size) + 1 * size * elem_size / 4, lo[1], 0); - vst1_lane_u8(dest + 2 * k + j * size * elem_size / (8 * elem_size) + 2 * size * elem_size / 4, lo[2], 0); - vst1_lane_u8(dest + 2 * k + j * size * elem_size / (8 * elem_size) + 3 * size * elem_size / 4, lo[3], 0); - vst1_lane_u8(dest + 2 * k + 1 + j * size * elem_size / (8 * elem_size) + 0 * size * elem_size / 4, hi[0], 0); - vst1_lane_u8(dest + 2 * k + 1 + j * size * elem_size / (8 * elem_size) + 1 * size * elem_size / 4, hi[1], 0); - vst1_lane_u8(dest + 2 * k + 1 + j * size * elem_size / (8 * elem_size) + 2 * size * elem_size / 4, hi[2], 0); - vst1_lane_u8(dest + 2 * k + 1 + j * size * elem_size / (8 * elem_size) + 3 * size * elem_size / 4, hi[3], 0); 
+ +/* Transpose bytes within elements for 64 bit elements. */ +int64_t bshuf_trans_byte_elem_NEON_64(const void* in, void* out, const size_t size) { + + size_t ii; + const char* in_b = (const char*) in; + char* out_b = (char*) out; + int8x16_t a0, b0, c0, d0, e0, f0, g0, h0; + int8x16_t a1, b1, c1, d1, e1, f1, g1, h1; + + for (ii=0; ii + 15 < size; ii += 16) { + a0 = vld1q_s8(in_b + 8*ii + 0*16); + b0 = vld1q_s8(in_b + 8*ii + 1*16); + c0 = vld1q_s8(in_b + 8*ii + 2*16); + d0 = vld1q_s8(in_b + 8*ii + 3*16); + e0 = vld1q_s8(in_b + 8*ii + 4*16); + f0 = vld1q_s8(in_b + 8*ii + 5*16); + g0 = vld1q_s8(in_b + 8*ii + 6*16); + h0 = vld1q_s8(in_b + 8*ii + 7*16); + + a1 = vzip1q_s8 (a0, b0); + b1 = vzip2q_s8 (a0, b0); + c1 = vzip1q_s8 (c0, d0); + d1 = vzip2q_s8 (c0, d0); + e1 = vzip1q_s8 (e0, f0); + f1 = vzip2q_s8 (e0, f0); + g1 = vzip1q_s8 (g0, h0); + h1 = vzip2q_s8 (g0, h0); + + a0 = vzip1q_s8 (a1, b1); + b0 = vzip2q_s8 (a1, b1); + c0 = vzip1q_s8 (c1, d1); + d0 = vzip2q_s8 (c1, d1); + e0 = vzip1q_s8 (e1, f1); + f0 = vzip2q_s8 (e1, f1); + g0 = vzip1q_s8 (g1, h1); + h0 = vzip2q_s8 (g1, h1); + + a1 = (int8x16_t) vzip1q_s32 (vreinterpretq_s32_s8 (a0), vreinterpretq_s32_s8 (c0)); + b1 = (int8x16_t) vzip2q_s32 (vreinterpretq_s32_s8 (a0), vreinterpretq_s32_s8 (c0)); + c1 = (int8x16_t) vzip1q_s32 (vreinterpretq_s32_s8 (b0), vreinterpretq_s32_s8 (d0)); + d1 = (int8x16_t) vzip2q_s32 (vreinterpretq_s32_s8 (b0), vreinterpretq_s32_s8 (d0)); + e1 = (int8x16_t) vzip1q_s32 (vreinterpretq_s32_s8 (e0), vreinterpretq_s32_s8 (g0)); + f1 = (int8x16_t) vzip2q_s32 (vreinterpretq_s32_s8 (e0), vreinterpretq_s32_s8 (g0)); + g1 = (int8x16_t) vzip1q_s32 (vreinterpretq_s32_s8 (f0), vreinterpretq_s32_s8 (h0)); + h1 = (int8x16_t) vzip2q_s32 (vreinterpretq_s32_s8 (f0), vreinterpretq_s32_s8 (h0)); + + a0 = (int8x16_t) vzip1q_s64 (vreinterpretq_s64_s8 (a1), vreinterpretq_s64_s8 (e1)); + b0 = (int8x16_t) vzip2q_s64 (vreinterpretq_s64_s8 (a1), vreinterpretq_s64_s8 (e1)); + c0 = (int8x16_t) vzip1q_s64 (vreinterpretq_s64_s8 (b1), vreinterpretq_s64_s8 (f1)); + d0 = (int8x16_t) vzip2q_s64 (vreinterpretq_s64_s8 (b1), vreinterpretq_s64_s8 (f1)); + e0 = (int8x16_t) vzip1q_s64 (vreinterpretq_s64_s8 (c1), vreinterpretq_s64_s8 (g1)); + f0 = (int8x16_t) vzip2q_s64 (vreinterpretq_s64_s8 (c1), vreinterpretq_s64_s8 (g1)); + g0 = (int8x16_t) vzip1q_s64 (vreinterpretq_s64_s8 (d1), vreinterpretq_s64_s8 (h1)); + h0 = (int8x16_t) vzip2q_s64 (vreinterpretq_s64_s8 (d1), vreinterpretq_s64_s8 (h1)); + + vst1q_s8(out_b + 0*size + ii, a0); + vst1q_s8(out_b + 1*size + ii, b0); + vst1q_s8(out_b + 2*size + ii, c0); + vst1q_s8(out_b + 3*size + ii, d0); + vst1q_s8(out_b + 4*size + ii, e0); + vst1q_s8(out_b + 5*size + ii, f0); + vst1q_s8(out_b + 6*size + ii, g0); + vst1q_s8(out_b + 7*size + ii, h0); } - } + + return bshuf_trans_byte_elem_remainder(in, out, size, 8, + size - size % 16); } -/* Routine optimized for bit-shuffling a buffer for a type size of 8 bytes. 
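The bshuf_trans_byte_elem_NEON_16/_32/_64 kernels added above all compute one and the same permutation through ladders of vzip1q/vzip2q interleaves; only the number and width of the stages change with the element size. A scalar reference of the layout they produce, usable as a correctness oracle (the ref_ helper is hypothetical):

    #include <stdint.h>
    #include <stddef.h>

    /* byte j of element i moves to out[j * size + i]
       (array-of-structs to struct-of-arrays) */
    static void ref_trans_byte_elem(const uint8_t* in, uint8_t* out,
                                    size_t size, size_t elem_size) {
      for (size_t i = 0; i < size; i++)
        for (size_t j = 0; j < elem_size; j++)
          out[j * size + i] = in[i * elem_size + j];
    }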
*/ -static void -bitshuffle8_neon(void* src, void* dest, const size_t size, const size_t elem_size) { - - size_t i, j, k; - uint8x8x2_t r0[4]; - uint16x4x2_t r1[4]; - uint32x2x2_t r2[4]; - - const int8_t __attribute__ ((aligned (16))) xr[8] = {0, 1, 2, 3, 4, 5, 6, 7}; - uint8x8_t mask_and = vdup_n_u8(0x01); - int8x8_t mask_shift = vld1_s8(xr); - - for (i = 0, k = 0; i < size * elem_size; i += 64, k++) { - /* Load and interleave groups of 8 bytes (64 bytes) to the structure r0 */ - r0[0] = vzip_u8(vld1_u8(src + i + 0 * 8), vld1_u8(src + i + 1 * 8)); - r0[1] = vzip_u8(vld1_u8(src + i + 2 * 8), vld1_u8(src + i + 3 * 8)); - r0[2] = vzip_u8(vld1_u8(src + i + 4 * 8), vld1_u8(src + i + 5 * 8)); - r0[3] = vzip_u8(vld1_u8(src + i + 6 * 8), vld1_u8(src + i + 7 * 8)); - /* Interleave 16 bytes */ - r1[0] = vzip_u16(vreinterpret_u16_u8(r0[0].val[0]), vreinterpret_u16_u8(r0[1].val[0])); - r1[1] = vzip_u16(vreinterpret_u16_u8(r0[0].val[1]), vreinterpret_u16_u8(r0[1].val[1])); - r1[2] = vzip_u16(vreinterpret_u16_u8(r0[2].val[0]), vreinterpret_u16_u8(r0[3].val[0])); - r1[3] = vzip_u16(vreinterpret_u16_u8(r0[2].val[1]), vreinterpret_u16_u8(r0[3].val[1])); - /* Interleave 32 bytes */ - r2[0] = vzip_u32(vreinterpret_u32_u16(r1[0].val[0]), vreinterpret_u32_u16(r1[2].val[0])); - r2[1] = vzip_u32(vreinterpret_u32_u16(r1[0].val[1]), vreinterpret_u32_u16(r1[2].val[1])); - r2[2] = vzip_u32(vreinterpret_u32_u16(r1[1].val[0]), vreinterpret_u32_u16(r1[3].val[0])); - r2[3] = vzip_u32(vreinterpret_u32_u16(r1[1].val[1]), vreinterpret_u32_u16(r1[3].val[1])); - for (j = 0; j < 8; j++) { - /* Create mask from the most significant bit of each 8-bit element */ - r0[0].val[0] = vand_u8(vreinterpret_u8_u32(r2[0].val[0]), mask_and); - r0[0].val[0] = vshl_u8(r0[0].val[0], mask_shift); - r0[0].val[1] = vand_u8(vreinterpret_u8_u32(r2[0].val[1]), mask_and); - r0[0].val[1] = vshl_u8(r0[0].val[1], mask_shift); - r0[1].val[0] = vand_u8(vreinterpret_u8_u32(r2[1].val[0]), mask_and); - r0[1].val[0] = vshl_u8(r0[1].val[0], mask_shift); - r0[1].val[1] = vand_u8(vreinterpret_u8_u32(r2[1].val[1]), mask_and); - r0[1].val[1] = vshl_u8(r0[1].val[1], mask_shift); - r0[2].val[0] = vand_u8(vreinterpret_u8_u32(r2[2].val[0]), mask_and); - r0[2].val[0] = vshl_u8(r0[2].val[0], mask_shift); - r0[2].val[1] = vand_u8(vreinterpret_u8_u32(r2[2].val[1]), mask_and); - r0[2].val[1] = vshl_u8(r0[2].val[1], mask_shift); - r0[3].val[0] = vand_u8(vreinterpret_u8_u32(r2[3].val[0]), mask_and); - r0[3].val[0] = vshl_u8(r0[3].val[0], mask_shift); - r0[3].val[1] = vand_u8(vreinterpret_u8_u32(r2[3].val[1]), mask_and); - r0[3].val[1] = vshl_u8(r0[3].val[1], mask_shift); - - r0[0].val[0] = vpadd_u8(r0[0].val[0], r0[0].val[0]); - r0[0].val[0] = vpadd_u8(r0[0].val[0], r0[0].val[0]); - r0[0].val[0] = vpadd_u8(r0[0].val[0], r0[0].val[0]); - r0[0].val[1] = vpadd_u8(r0[0].val[1], r0[0].val[1]); - r0[0].val[1] = vpadd_u8(r0[0].val[1], r0[0].val[1]); - r0[0].val[1] = vpadd_u8(r0[0].val[1], r0[0].val[1]); - r0[1].val[0] = vpadd_u8(r0[1].val[0], r0[1].val[0]); - r0[1].val[0] = vpadd_u8(r0[1].val[0], r0[1].val[0]); - r0[1].val[0] = vpadd_u8(r0[1].val[0], r0[1].val[0]); - r0[1].val[1] = vpadd_u8(r0[1].val[1], r0[1].val[1]); - r0[1].val[1] = vpadd_u8(r0[1].val[1], r0[1].val[1]); - r0[1].val[1] = vpadd_u8(r0[1].val[1], r0[1].val[1]); - r0[2].val[0] = vpadd_u8(r0[2].val[0], r0[2].val[0]); - r0[2].val[0] = vpadd_u8(r0[2].val[0], r0[2].val[0]); - r0[2].val[0] = vpadd_u8(r0[2].val[0], r0[2].val[0]); - r0[2].val[1] = vpadd_u8(r0[2].val[1], r0[2].val[1]); - r0[2].val[1] = 
vpadd_u8(r0[2].val[1], r0[2].val[1]); - r0[2].val[1] = vpadd_u8(r0[2].val[1], r0[2].val[1]); - r0[3].val[0] = vpadd_u8(r0[3].val[0], r0[3].val[0]); - r0[3].val[0] = vpadd_u8(r0[3].val[0], r0[3].val[0]); - r0[3].val[0] = vpadd_u8(r0[3].val[0], r0[3].val[0]); - r0[3].val[1] = vpadd_u8(r0[3].val[1], r0[3].val[1]); - r0[3].val[1] = vpadd_u8(r0[3].val[1], r0[3].val[1]); - r0[3].val[1] = vpadd_u8(r0[3].val[1], r0[3].val[1]); - /* Shift packed 8-bit */ - r2[0].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[0].val[0]), 1)); - r2[0].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[0].val[1]), 1)); - r2[1].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[1].val[0]), 1)); - r2[1].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[1].val[1]), 1)); - r2[2].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[2].val[0]), 1)); - r2[2].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[2].val[1]), 1)); - r2[3].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[3].val[0]), 1)); - r2[3].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[3].val[1]), 1)); - /* Store the created mask to the destination vector */ - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 0 * size * elem_size / 8, r0[0].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 1 * size * elem_size / 8, r0[0].val[1], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 2 * size * elem_size / 8, r0[1].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 3 * size * elem_size / 8, r0[1].val[1], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 4 * size * elem_size / 8, r0[2].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 5 * size * elem_size / 8, r0[2].val[1], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 6 * size * elem_size / 8, r0[3].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 7 * size * elem_size / 8, r0[3].val[1], 0); + +/* Transpose bytes within elements using best NEON algorithm available. */ +int64_t bshuf_trans_byte_elem_NEON(const void* in, void* out, const size_t size, + const size_t elem_size) { + + int64_t count; + + // Trivial cases: power of 2 bytes. + switch (elem_size) { + case 1: + count = bshuf_copy(in, out, size, elem_size); + return count; + case 2: + count = bshuf_trans_byte_elem_NEON_16(in, out, size); + return count; + case 4: + count = bshuf_trans_byte_elem_NEON_32(in, out, size); + return count; + case 8: + count = bshuf_trans_byte_elem_NEON_64(in, out, size); + return count; } - } -} -/* Routine optimized for bit-shuffling a buffer for a type size of 16 bytes. 
*/ -static void -bitshuffle16_neon(void* src, void* dest, const size_t size, const size_t elem_size) { - - size_t i, j, k; - uint8x8x2_t r0[8]; - uint16x4x2_t r1[8]; - uint32x2x2_t r2[8]; - - const int8_t __attribute__ ((aligned (16))) xr[8] = {0, 1, 2, 3, 4, 5, 6, 7}; - uint8x8_t mask_and = vdup_n_u8(0x01); - int8x8_t mask_shift = vld1_s8(xr); - - for (i = 0, k = 0; i < size * elem_size; i += 128, k++) { - /* Load and interleave groups of 16 bytes (128 bytes) to the structure r0 */ - r0[0] = vzip_u8(vld1_u8(src + i + 0 * 8), vld1_u8(src + i + 2 * 8)); - r0[1] = vzip_u8(vld1_u8(src + i + 1 * 8), vld1_u8(src + i + 3 * 8)); - r0[2] = vzip_u8(vld1_u8(src + i + 4 * 8), vld1_u8(src + i + 6 * 8)); - r0[3] = vzip_u8(vld1_u8(src + i + 5 * 8), vld1_u8(src + i + 7 * 8)); - r0[4] = vzip_u8(vld1_u8(src + i + 8 * 8), vld1_u8(src + i + 10 * 8)); - r0[5] = vzip_u8(vld1_u8(src + i + 9 * 8), vld1_u8(src + i + 11 * 8)); - r0[6] = vzip_u8(vld1_u8(src + i + 12 * 8), vld1_u8(src + i + 14 * 8)); - r0[7] = vzip_u8(vld1_u8(src + i + 13 * 8), vld1_u8(src + i + 15 * 8)); - /* Interleave 16 bytes */ - r1[0] = vzip_u16(vreinterpret_u16_u8(r0[0].val[0]), vreinterpret_u16_u8(r0[2].val[0])); - r1[1] = vzip_u16(vreinterpret_u16_u8(r0[0].val[1]), vreinterpret_u16_u8(r0[2].val[1])); - r1[2] = vzip_u16(vreinterpret_u16_u8(r0[1].val[0]), vreinterpret_u16_u8(r0[3].val[0])); - r1[3] = vzip_u16(vreinterpret_u16_u8(r0[1].val[1]), vreinterpret_u16_u8(r0[3].val[1])); - r1[4] = vzip_u16(vreinterpret_u16_u8(r0[4].val[0]), vreinterpret_u16_u8(r0[6].val[0])); - r1[5] = vzip_u16(vreinterpret_u16_u8(r0[4].val[1]), vreinterpret_u16_u8(r0[6].val[1])); - r1[6] = vzip_u16(vreinterpret_u16_u8(r0[5].val[0]), vreinterpret_u16_u8(r0[7].val[0])); - r1[7] = vzip_u16(vreinterpret_u16_u8(r0[5].val[1]), vreinterpret_u16_u8(r0[7].val[1])); - /* Interleave 32 bytes */ - r2[0] = vzip_u32(vreinterpret_u32_u16(r1[0].val[0]), vreinterpret_u32_u16(r1[4].val[0])); - r2[1] = vzip_u32(vreinterpret_u32_u16(r1[0].val[1]), vreinterpret_u32_u16(r1[4].val[1])); - r2[2] = vzip_u32(vreinterpret_u32_u16(r1[1].val[0]), vreinterpret_u32_u16(r1[5].val[0])); - r2[3] = vzip_u32(vreinterpret_u32_u16(r1[1].val[1]), vreinterpret_u32_u16(r1[5].val[1])); - r2[4] = vzip_u32(vreinterpret_u32_u16(r1[2].val[0]), vreinterpret_u32_u16(r1[6].val[0])); - r2[5] = vzip_u32(vreinterpret_u32_u16(r1[2].val[1]), vreinterpret_u32_u16(r1[6].val[1])); - r2[6] = vzip_u32(vreinterpret_u32_u16(r1[3].val[0]), vreinterpret_u32_u16(r1[7].val[0])); - r2[7] = vzip_u32(vreinterpret_u32_u16(r1[3].val[1]), vreinterpret_u32_u16(r1[7].val[1])); - for (j = 0; j < 8; j++) { - /* Create mask from the most significant bit of each 8-bit element */ - r0[0].val[0] = vand_u8(vreinterpret_u8_u32(r2[0].val[0]), mask_and); - r0[0].val[0] = vshl_u8(r0[0].val[0], mask_shift); - r0[0].val[1] = vand_u8(vreinterpret_u8_u32(r2[0].val[1]), mask_and); - r0[0].val[1] = vshl_u8(r0[0].val[1], mask_shift); - r0[1].val[0] = vand_u8(vreinterpret_u8_u32(r2[1].val[0]), mask_and); - r0[1].val[0] = vshl_u8(r0[1].val[0], mask_shift); - r0[1].val[1] = vand_u8(vreinterpret_u8_u32(r2[1].val[1]), mask_and); - r0[1].val[1] = vshl_u8(r0[1].val[1], mask_shift); - r0[2].val[0] = vand_u8(vreinterpret_u8_u32(r2[2].val[0]), mask_and); - r0[2].val[0] = vshl_u8(r0[2].val[0], mask_shift); - r0[2].val[1] = vand_u8(vreinterpret_u8_u32(r2[2].val[1]), mask_and); - r0[2].val[1] = vshl_u8(r0[2].val[1], mask_shift); - r0[3].val[0] = vand_u8(vreinterpret_u8_u32(r2[3].val[0]), mask_and); - r0[3].val[0] = vshl_u8(r0[3].val[0], mask_shift); - r0[3].val[1] = 
vand_u8(vreinterpret_u8_u32(r2[3].val[1]), mask_and); - r0[3].val[1] = vshl_u8(r0[3].val[1], mask_shift); - r0[4].val[0] = vand_u8(vreinterpret_u8_u32(r2[4].val[0]), mask_and); - r0[4].val[0] = vshl_u8(r0[4].val[0], mask_shift); - r0[4].val[1] = vand_u8(vreinterpret_u8_u32(r2[4].val[1]), mask_and); - r0[4].val[1] = vshl_u8(r0[4].val[1], mask_shift); - r0[5].val[0] = vand_u8(vreinterpret_u8_u32(r2[5].val[0]), mask_and); - r0[5].val[0] = vshl_u8(r0[5].val[0], mask_shift); - r0[5].val[1] = vand_u8(vreinterpret_u8_u32(r2[5].val[1]), mask_and); - r0[5].val[1] = vshl_u8(r0[5].val[1], mask_shift); - r0[6].val[0] = vand_u8(vreinterpret_u8_u32(r2[6].val[0]), mask_and); - r0[6].val[0] = vshl_u8(r0[6].val[0], mask_shift); - r0[6].val[1] = vand_u8(vreinterpret_u8_u32(r2[6].val[1]), mask_and); - r0[6].val[1] = vshl_u8(r0[6].val[1], mask_shift); - r0[7].val[0] = vand_u8(vreinterpret_u8_u32(r2[7].val[0]), mask_and); - r0[7].val[0] = vshl_u8(r0[7].val[0], mask_shift); - r0[7].val[1] = vand_u8(vreinterpret_u8_u32(r2[7].val[1]), mask_and); - r0[7].val[1] = vshl_u8(r0[7].val[1], mask_shift); - - r0[0].val[0] = vpadd_u8(r0[0].val[0], r0[0].val[0]); - r0[0].val[0] = vpadd_u8(r0[0].val[0], r0[0].val[0]); - r0[0].val[0] = vpadd_u8(r0[0].val[0], r0[0].val[0]); - r0[0].val[1] = vpadd_u8(r0[0].val[1], r0[0].val[1]); - r0[0].val[1] = vpadd_u8(r0[0].val[1], r0[0].val[1]); - r0[0].val[1] = vpadd_u8(r0[0].val[1], r0[0].val[1]); - r0[1].val[0] = vpadd_u8(r0[1].val[0], r0[1].val[0]); - r0[1].val[0] = vpadd_u8(r0[1].val[0], r0[1].val[0]); - r0[1].val[0] = vpadd_u8(r0[1].val[0], r0[1].val[0]); - r0[1].val[1] = vpadd_u8(r0[1].val[1], r0[1].val[1]); - r0[1].val[1] = vpadd_u8(r0[1].val[1], r0[1].val[1]); - r0[1].val[1] = vpadd_u8(r0[1].val[1], r0[1].val[1]); - r0[2].val[0] = vpadd_u8(r0[2].val[0], r0[2].val[0]); - r0[2].val[0] = vpadd_u8(r0[2].val[0], r0[2].val[0]); - r0[2].val[0] = vpadd_u8(r0[2].val[0], r0[2].val[0]); - r0[2].val[1] = vpadd_u8(r0[2].val[1], r0[2].val[1]); - r0[2].val[1] = vpadd_u8(r0[2].val[1], r0[2].val[1]); - r0[2].val[1] = vpadd_u8(r0[2].val[1], r0[2].val[1]); - r0[3].val[0] = vpadd_u8(r0[3].val[0], r0[3].val[0]); - r0[3].val[0] = vpadd_u8(r0[3].val[0], r0[3].val[0]); - r0[3].val[0] = vpadd_u8(r0[3].val[0], r0[3].val[0]); - r0[3].val[1] = vpadd_u8(r0[3].val[1], r0[3].val[1]); - r0[3].val[1] = vpadd_u8(r0[3].val[1], r0[3].val[1]); - r0[3].val[1] = vpadd_u8(r0[3].val[1], r0[3].val[1]); - r0[4].val[0] = vpadd_u8(r0[4].val[0], r0[4].val[0]); - r0[4].val[0] = vpadd_u8(r0[4].val[0], r0[4].val[0]); - r0[4].val[0] = vpadd_u8(r0[4].val[0], r0[4].val[0]); - r0[4].val[1] = vpadd_u8(r0[4].val[1], r0[4].val[1]); - r0[4].val[1] = vpadd_u8(r0[4].val[1], r0[4].val[1]); - r0[4].val[1] = vpadd_u8(r0[4].val[1], r0[4].val[1]); - r0[5].val[0] = vpadd_u8(r0[5].val[0], r0[5].val[0]); - r0[5].val[0] = vpadd_u8(r0[5].val[0], r0[5].val[0]); - r0[5].val[0] = vpadd_u8(r0[5].val[0], r0[5].val[0]); - r0[5].val[1] = vpadd_u8(r0[5].val[1], r0[5].val[1]); - r0[5].val[1] = vpadd_u8(r0[5].val[1], r0[5].val[1]); - r0[5].val[1] = vpadd_u8(r0[5].val[1], r0[5].val[1]); - r0[6].val[0] = vpadd_u8(r0[6].val[0], r0[6].val[0]); - r0[6].val[0] = vpadd_u8(r0[6].val[0], r0[6].val[0]); - r0[6].val[0] = vpadd_u8(r0[6].val[0], r0[6].val[0]); - r0[6].val[1] = vpadd_u8(r0[6].val[1], r0[6].val[1]); - r0[6].val[1] = vpadd_u8(r0[6].val[1], r0[6].val[1]); - r0[6].val[1] = vpadd_u8(r0[6].val[1], r0[6].val[1]); - r0[7].val[0] = vpadd_u8(r0[7].val[0], r0[7].val[0]); - r0[7].val[0] = vpadd_u8(r0[7].val[0], r0[7].val[0]); - r0[7].val[0] = vpadd_u8(r0[7].val[0], 
r0[7].val[0]); - r0[7].val[1] = vpadd_u8(r0[7].val[1], r0[7].val[1]); - r0[7].val[1] = vpadd_u8(r0[7].val[1], r0[7].val[1]); - r0[7].val[1] = vpadd_u8(r0[7].val[1], r0[7].val[1]); - /* Shift packed 8-bit */ - r2[0].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[0].val[0]), 1)); - r2[0].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[0].val[1]), 1)); - r2[1].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[1].val[0]), 1)); - r2[1].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[1].val[1]), 1)); - r2[2].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[2].val[0]), 1)); - r2[2].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[2].val[1]), 1)); - r2[3].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[3].val[0]), 1)); - r2[3].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[3].val[1]), 1)); - r2[4].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[4].val[0]), 1)); - r2[4].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[4].val[1]), 1)); - r2[5].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[5].val[0]), 1)); - r2[5].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[5].val[1]), 1)); - r2[6].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[6].val[0]), 1)); - r2[6].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[6].val[1]), 1)); - r2[7].val[0] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[7].val[0]), 1)); - r2[7].val[1] = vreinterpret_u8_u32(vshr_n_u8(vreinterpret_u8_u32(r2[7].val[1]), 1)); - /* Store the created mask to the destination vector */ - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 0 * size * elem_size / 16, r0[0].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 1 * size * elem_size / 16, r0[0].val[1], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 2 * size * elem_size / 16, r0[1].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 3 * size * elem_size / 16, r0[1].val[1], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 4 * size * elem_size / 16, r0[2].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 5 * size * elem_size / 16, r0[2].val[1], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 6 * size * elem_size / 16, r0[3].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 7 * size * elem_size / 16, r0[3].val[1], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 8 * size * elem_size / 16, r0[4].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 9 * size * elem_size / 16, r0[4].val[1], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 10 * size * elem_size / 16, r0[5].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 11 * size * elem_size / 16, r0[5].val[1], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 12 * size * elem_size / 16, r0[6].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 13 * size * elem_size / 16, r0[6].val[1], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 14 * size * elem_size / 16, r0[7].val[0], 0); - vst1_lane_u8(dest + k + j * size * elem_size / (8 * elem_size) + 15 * size * elem_size / 16, r0[7].val[1], 0); + // Worst case: odd number of bytes. 
Turns out that this is faster for + // (odd * 2) byte elements as well (hence % 4). + if (elem_size % 4) { + count = bshuf_trans_byte_elem_scal(in, out, size, elem_size); + return count; } - } -} -/* Routine optimized for bit-unshuffling a buffer for a type size of 1 byte. */ -static void -bitunshuffle1_neon(void* _src, void* dest, const size_t size, const size_t elem_size) { + // Multiple of power of 2: transpose hierarchically. + { + size_t nchunk_elem; + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + + if ((elem_size % 8) == 0) { + nchunk_elem = elem_size / 8; + TRANS_ELEM_TYPE(in, out, size, nchunk_elem, int64_t); + count = bshuf_trans_byte_elem_NEON_64(out, tmp_buf, + size * nchunk_elem); + bshuf_trans_elem(tmp_buf, out, 8, nchunk_elem, size); + } else if ((elem_size % 4) == 0) { + nchunk_elem = elem_size / 4; + TRANS_ELEM_TYPE(in, out, size, nchunk_elem, int32_t); + count = bshuf_trans_byte_elem_NEON_32(out, tmp_buf, + size * nchunk_elem); + bshuf_trans_elem(tmp_buf, out, 4, nchunk_elem, size); + } else { + // Not used since scalar algorithm is faster. + nchunk_elem = elem_size / 2; + TRANS_ELEM_TYPE(in, out, size, nchunk_elem, int16_t); + count = bshuf_trans_byte_elem_NEON_16(out, tmp_buf, + size * nchunk_elem); + bshuf_trans_elem(tmp_buf, out, 2, nchunk_elem, size); + } + + free(tmp_buf); + return count; + } +} - uint8x8_t lo_x, hi_x, lo, hi; - size_t i, j, k; - uint8_t* src = _src; - const int8_t __attribute__ ((aligned (16))) xr[8] = {0, 1, 2, 3, 4, 5, 6, 7}; - uint8x8_t mask_and = vdup_n_u8(0x01); - int8x8_t mask_shift = vld1_s8(xr); +/* Creates a mask made up of the most significant + * bit of each byte of 'input' + */ +int32_t move_byte_mask_neon(uint8x16_t input) { - for (i = 0, k = 0; i < size * elem_size; i += 16, k++) { - for (j = 0; j < 8; j++) { - /* Load lanes */ - lo_x[j] = src[2 * k + 0 + j * size * elem_size / (8 * elem_size)]; - hi_x[j] = src[2 * k + 1 + j * size * elem_size / (8 * elem_size)]; - } - for (j = 0; j < 8; j++) { - /* Create mask from the most significant bit of each 8-bit element */ - lo = vand_u8(lo_x, mask_and); - lo = vshl_u8(lo, mask_shift); - hi = vand_u8(hi_x, mask_and); - hi = vshl_u8(hi, mask_shift); - lo = vpadd_u8(lo, lo); - lo = vpadd_u8(lo, lo); - lo = vpadd_u8(lo, lo); - hi = vpadd_u8(hi, hi); - hi = vpadd_u8(hi, hi); - hi = vpadd_u8(hi, hi); - /* Shift packed 8-bit */ - lo_x = vshr_n_u8(lo_x, 1); - hi_x = vshr_n_u8(hi_x, 1); - /* Store the created mask to the destination vector */ - vst1_lane_u8(dest + j + i, lo, 0); - vst1_lane_u8(dest + j + i + 8, hi, 0); - } - } + return ( ((input[0] & 0x80) >> 7) | (((input[1] & 0x80) >> 7) << 1) | (((input[2] & 0x80) >> 7) << 2) | (((input[3] & 0x80) >> 7) << 3) + | (((input[4] & 0x80) >> 7) << 4) | (((input[5] & 0x80) >> 7) << 5) | (((input[6] & 0x80) >> 7) << 6) | (((input[7] & 0x80) >> 7) << 7) + | (((input[8] & 0x80) >> 7) << 8) | (((input[9] & 0x80) >> 7) << 9) | (((input[10] & 0x80) >> 7) << 10) | (((input[11] & 0x80) >> 7) << 11) + | (((input[12] & 0x80) >> 7) << 12) | (((input[13] & 0x80) >> 7) << 13) | (((input[14] & 0x80) >> 7) << 14) | (((input[15] & 0x80) >> 7) << 15) + ); } -/* Routine optimized for bit-unshuffling a buffer for a type size of 2 byte. 
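move_byte_mask_neon above is a NEON stand-in for SSE2's _mm_movemask_epi8: it collects the most significant bit of each of the 16 input bytes into bits 0..15 of the result (the input[i] lane subscripts rely on the GCC/Clang vector-extension syntax). A small sanity check, assuming an AArch64 toolchain with arm_neon.h:

    #include <arm_neon.h>
    #include <assert.h>

    static void movemask_smoke_test(void) {
      assert(move_byte_mask_neon(vdupq_n_u8(0x80)) == 0xFFFF);  /* every MSB set */
      assert(move_byte_mask_neon(vdupq_n_u8(0x7F)) == 0x0000);  /* every MSB clear */
    }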
*/ -static void -bitunshuffle2_neon(void* _src, void* dest, const size_t size, const size_t elem_size) { - - size_t i, j, k; - uint8x8_t lo_x[2], hi_x[2], lo[2], hi[2]; - uint8_t* src = _src; - - const int8_t __attribute__ ((aligned (16))) xr[8] = {0, 1, 2, 3, 4, 5, 6, 7}; - uint8x8_t mask_and = vdup_n_u8(0x01); - int8x8_t mask_shift = vld1_s8(xr); - - for (i = 0, k = 0; i < size * elem_size; i += 32, k++) { - for (j = 0; j < 8; j++) { - /* Load lanes */ - lo_x[0][j] = src[2 * k + j * size * elem_size / (8 * elem_size)]; - lo_x[1][j] = src[2 * k + j * size * elem_size / (8 * elem_size) + size * elem_size / 2]; - hi_x[0][j] = src[2 * k + 1 + j * size * elem_size / (8 * elem_size)]; - hi_x[1][j] = src[2 * k + 1 + j * size * elem_size / (8 * elem_size) + size * elem_size / 2]; - } - for (j = 0; j < 8; j++) { - /* Create mask from the most significant bit of each 8-bit element */ - lo[0] = vand_u8(lo_x[0], mask_and); - lo[0] = vshl_u8(lo[0], mask_shift); - lo[1] = vand_u8(lo_x[1], mask_and); - lo[1] = vshl_u8(lo[1], mask_shift); - - hi[0] = vand_u8(hi_x[0], mask_and); - hi[0] = vshl_u8(hi[0], mask_shift); - hi[1] = vand_u8(hi_x[1], mask_and); - hi[1] = vshl_u8(hi[1], mask_shift); - - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[1] = vpadd_u8(lo[1], lo[1]); - lo[1] = vpadd_u8(lo[1], lo[1]); - lo[1] = vpadd_u8(lo[1], lo[1]); - - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[1] = vpadd_u8(hi[1], hi[1]); - hi[1] = vpadd_u8(hi[1], hi[1]); - hi[1] = vpadd_u8(hi[1], hi[1]); - /* Shift packed 8-bit */ - lo_x[0] = vshr_n_u8(lo_x[0], 1); - hi_x[0] = vshr_n_u8(hi_x[0], 1); - lo_x[1] = vshr_n_u8(lo_x[1], 1); - hi_x[1] = vshr_n_u8(hi_x[1], 1); - /* Store the created mask to the destination vector */ - vst1_lane_u8(dest + 2 * j + i, lo[0], 0); - vst1_lane_u8(dest + 2 * j + 1 + i, lo[1], 0); - vst1_lane_u8(dest + 2 * j + i + 16, hi[0], 0); - vst1_lane_u8(dest + 2 * j + 1 + i + 16, hi[1], 0); +/* Transpose bits within bytes. */ +int64_t bshuf_trans_bit_byte_NEON(const void* in, void* out, const size_t size, + const size_t elem_size) { + + size_t ii, kk; + const char* in_b = (const char*) in; + char* out_b = (char*) out; + uint16_t* out_ui16; + + int64_t count; + + size_t nbyte = elem_size * size; + + CHECK_MULT_EIGHT(nbyte); + + int16x8_t xmm; + int32_t bt; + + for (ii = 0; ii + 15 < nbyte; ii += 16) { + xmm = vld1q_s16((int16_t *) (in_b + ii)); + for (kk = 0; kk < 8; kk++) { + bt = move_byte_mask_neon((uint8x16_t) xmm); + xmm = vshlq_n_s16(xmm, 1); + out_ui16 = (uint16_t*) &out_b[((7 - kk) * nbyte + ii) / 8]; + *out_ui16 = bt; + } } - } + count = bshuf_trans_bit_byte_remainder(in, out, size, elem_size, + nbyte - nbyte % 16); + return count; } -/* Routine optimized for bit-unshuffling a buffer for a type size of 4 byte. 
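Each of the eight kk iterations in bshuf_trans_bit_byte_NEON above captures one bit plane: after kk left shifts the byte MSBs hold bit (7 - kk) of the original data, and the resulting 16-lane mask is stored into output row (7 - kk). A scalar reference of that layout, assuming a zero-initialized output, a little-endian target (matching the 16-bit stores), and nbyte a multiple of 8 as CHECK_MULT_EIGHT enforces:

    #include <stdint.h>
    #include <stddef.h>

    /* bit b of input byte i lands in row b at bit position i */
    static void ref_trans_bit_byte(const uint8_t* in, uint8_t* out, size_t nbyte) {
      for (size_t i = 0; i < nbyte; i++)
        for (int b = 0; b < 8; b++)
          if ((in[i] >> b) & 1)
            out[b * (nbyte / 8) + i / 8] |= (uint8_t)(1u << (i % 8));
    }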
*/ -static void -bitunshuffle4_neon(void* _src, void* dest, const size_t size, const size_t elem_size) { - size_t i, j, k; - uint8x8_t lo_x[4], hi_x[4], lo[4], hi[4]; - uint8_t* src = _src; - - const int8_t __attribute__ ((aligned (16))) xr[8] = {0, 1, 2, 3, 4, 5, 6, 7}; - uint8x8_t mask_and = vdup_n_u8(0x01); - int8x8_t mask_shift = vld1_s8(xr); - - for (i = 0, k = 0; i < size * elem_size; i += 64, k++) { - for (j = 0; j < 8; j++) { - /* Load lanes */ - lo_x[0][j] = src[2 * k + j * size * elem_size / (8 * elem_size) + 0 * size * elem_size / 4]; - hi_x[0][j] = src[2 * k + 1 + j * size * elem_size / (8 * elem_size) + 0 * size * elem_size / 4]; - lo_x[1][j] = src[2 * k + j * size * elem_size / (8 * elem_size) + 1 * size * elem_size / 4]; - hi_x[1][j] = src[2 * k + 1 + j * size * elem_size / (8 * elem_size) + 1 * size * elem_size / 4]; - lo_x[2][j] = src[2 * k + j * size * elem_size / (8 * elem_size) + 2 * size * elem_size / 4]; - hi_x[2][j] = src[2 * k + 1 + j * size * elem_size / (8 * elem_size) + 2 * size * elem_size / 4]; - lo_x[3][j] = src[2 * k + j * size * elem_size / (8 * elem_size) + 3 * size * elem_size / 4]; - hi_x[3][j] = src[2 * k + 1 + j * size * elem_size / (8 * elem_size) + 3 * size * elem_size / 4]; - } - for (j = 0; j < 8; j++) { - /* Create mask from the most significant bit of each 8-bit element */ - lo[0] = vand_u8(lo_x[0], mask_and); - lo[0] = vshl_u8(lo[0], mask_shift); - lo[1] = vand_u8(lo_x[1], mask_and); - lo[1] = vshl_u8(lo[1], mask_shift); - lo[2] = vand_u8(lo_x[2], mask_and); - lo[2] = vshl_u8(lo[2], mask_shift); - lo[3] = vand_u8(lo_x[3], mask_and); - lo[3] = vshl_u8(lo[3], mask_shift); - - hi[0] = vand_u8(hi_x[0], mask_and); - hi[0] = vshl_u8(hi[0], mask_shift); - hi[1] = vand_u8(hi_x[1], mask_and); - hi[1] = vshl_u8(hi[1], mask_shift); - hi[2] = vand_u8(hi_x[2], mask_and); - hi[2] = vshl_u8(hi[2], mask_shift); - hi[3] = vand_u8(hi_x[3], mask_and); - hi[3] = vshl_u8(hi[3], mask_shift); - - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[0] = vpadd_u8(lo[0], lo[0]); - lo[1] = vpadd_u8(lo[1], lo[1]); - lo[1] = vpadd_u8(lo[1], lo[1]); - lo[1] = vpadd_u8(lo[1], lo[1]); - lo[2] = vpadd_u8(lo[2], lo[2]); - lo[2] = vpadd_u8(lo[2], lo[2]); - lo[2] = vpadd_u8(lo[2], lo[2]); - lo[3] = vpadd_u8(lo[3], lo[3]); - lo[3] = vpadd_u8(lo[3], lo[3]); - lo[3] = vpadd_u8(lo[3], lo[3]); - - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[0] = vpadd_u8(hi[0], hi[0]); - hi[1] = vpadd_u8(hi[1], hi[1]); - hi[1] = vpadd_u8(hi[1], hi[1]); - hi[1] = vpadd_u8(hi[1], hi[1]); - hi[2] = vpadd_u8(hi[2], hi[2]); - hi[2] = vpadd_u8(hi[2], hi[2]); - hi[2] = vpadd_u8(hi[2], hi[2]); - hi[3] = vpadd_u8(hi[3], hi[3]); - hi[3] = vpadd_u8(hi[3], hi[3]); - hi[3] = vpadd_u8(hi[3], hi[3]); - /* Shift packed 8-bit */ - lo_x[0] = vshr_n_u8(lo_x[0], 1); - hi_x[0] = vshr_n_u8(hi_x[0], 1); - lo_x[1] = vshr_n_u8(lo_x[1], 1); - hi_x[1] = vshr_n_u8(hi_x[1], 1); - lo_x[2] = vshr_n_u8(lo_x[2], 1); - hi_x[2] = vshr_n_u8(hi_x[2], 1); - lo_x[3] = vshr_n_u8(lo_x[3], 1); - hi_x[3] = vshr_n_u8(hi_x[3], 1); - /* Store the created mask to the destination vector */ - vst1_lane_u8(dest + 4 * j + i, lo[0], 0); - vst1_lane_u8(dest + 4 * j + 1 + i, lo[1], 0); - vst1_lane_u8(dest + 4 * j + 2 + i, lo[2], 0); - vst1_lane_u8(dest + 4 * j + 3 + i, lo[3], 0); - vst1_lane_u8(dest + 4 * j + i + 32, hi[0], 0); - vst1_lane_u8(dest + 4 * j + 1 + i + 32, hi[1], 0); - vst1_lane_u8(dest + 4 * j + 2 + i + 32, hi[2], 0); - vst1_lane_u8(dest + 4 * j + 3 + i + 32, hi[3], 0); - } - } +/* Transpose bits 
within elements. */ +int64_t bshuf_trans_bit_elem_NEON(const void* in, void* out, const size_t size, + const size_t elem_size) { + + int64_t count; + + CHECK_MULT_EIGHT(size); + + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + + count = bshuf_trans_byte_elem_NEON(in, out, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_trans_bit_byte_NEON(out, tmp_buf, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_trans_bitrow_eight(tmp_buf, out, size, elem_size); + + free(tmp_buf); + + return count; } -/* Routine optimized for bit-unshuffling a buffer for a type size of 8 byte. */ -static void -bitunshuffle8_neon(void* _src, void* dest, const size_t size, const size_t elem_size) { - - size_t i, j, k; - uint8x8x2_t r0[4], r1[4]; - uint8_t* src = _src; - - const int8_t __attribute__ ((aligned (16))) xr[8] = {0, 1, 2, 3, 4, 5, 6, 7}; - uint8x8_t mask_and = vdup_n_u8(0x01); - int8x8_t mask_shift = vld1_s8(xr); - - for (i = 0, k = 0; i < size * elem_size; i += 64, k++) { - for (j = 0; j < 8; j++) { - /* Load lanes */ - r0[0].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 0 * size * elem_size / 8]; - r0[0].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 1 * size * elem_size / 8]; - r0[1].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 2 * size * elem_size / 8]; - r0[1].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 3 * size * elem_size / 8]; - r0[2].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 4 * size * elem_size / 8]; - r0[2].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 5 * size * elem_size / 8]; - r0[3].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 6 * size * elem_size / 8]; - r0[3].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 7 * size * elem_size / 8]; - } - for (j = 0; j < 8; j++) { - /* Create mask from the most significant bit of each 8-bit element */ - r1[0].val[0] = vand_u8(r0[0].val[0], mask_and); - r1[0].val[0] = vshl_u8(r1[0].val[0], mask_shift); - r1[0].val[1] = vand_u8(r0[0].val[1], mask_and); - r1[0].val[1] = vshl_u8(r1[0].val[1], mask_shift); - r1[1].val[0] = vand_u8(r0[1].val[0], mask_and); - r1[1].val[0] = vshl_u8(r1[1].val[0], mask_shift); - r1[1].val[1] = vand_u8(r0[1].val[1], mask_and); - r1[1].val[1] = vshl_u8(r1[1].val[1], mask_shift); - r1[2].val[0] = vand_u8(r0[2].val[0], mask_and); - r1[2].val[0] = vshl_u8(r1[2].val[0], mask_shift); - r1[2].val[1] = vand_u8(r0[2].val[1], mask_and); - r1[2].val[1] = vshl_u8(r1[2].val[1], mask_shift); - r1[3].val[0] = vand_u8(r0[3].val[0], mask_and); - r1[3].val[0] = vshl_u8(r1[3].val[0], mask_shift); - r1[3].val[1] = vand_u8(r0[3].val[1], mask_and); - r1[3].val[1] = vshl_u8(r1[3].val[1], mask_shift); - - r1[0].val[0] = vpadd_u8(r1[0].val[0], r1[0].val[0]); - r1[0].val[0] = vpadd_u8(r1[0].val[0], r1[0].val[0]); - r1[0].val[0] = vpadd_u8(r1[0].val[0], r1[0].val[0]); - r1[0].val[1] = vpadd_u8(r1[0].val[1], r1[0].val[1]); - r1[0].val[1] = vpadd_u8(r1[0].val[1], r1[0].val[1]); - r1[0].val[1] = vpadd_u8(r1[0].val[1], r1[0].val[1]); - r1[1].val[0] = vpadd_u8(r1[1].val[0], r1[1].val[0]); - r1[1].val[0] = vpadd_u8(r1[1].val[0], r1[1].val[0]); - r1[1].val[0] = vpadd_u8(r1[1].val[0], r1[1].val[0]); - r1[1].val[1] = vpadd_u8(r1[1].val[1], r1[1].val[1]); - r1[1].val[1] = vpadd_u8(r1[1].val[1], r1[1].val[1]); - r1[1].val[1] = vpadd_u8(r1[1].val[1], r1[1].val[1]); - r1[2].val[0] = vpadd_u8(r1[2].val[0], r1[2].val[0]); - r1[2].val[0] = vpadd_u8(r1[2].val[0], r1[2].val[0]); - 
r1[2].val[0] = vpadd_u8(r1[2].val[0], r1[2].val[0]); - r1[2].val[1] = vpadd_u8(r1[2].val[1], r1[2].val[1]); - r1[2].val[1] = vpadd_u8(r1[2].val[1], r1[2].val[1]); - r1[2].val[1] = vpadd_u8(r1[2].val[1], r1[2].val[1]); - r1[3].val[0] = vpadd_u8(r1[3].val[0], r1[3].val[0]); - r1[3].val[0] = vpadd_u8(r1[3].val[0], r1[3].val[0]); - r1[3].val[0] = vpadd_u8(r1[3].val[0], r1[3].val[0]); - r1[3].val[1] = vpadd_u8(r1[3].val[1], r1[3].val[1]); - r1[3].val[1] = vpadd_u8(r1[3].val[1], r1[3].val[1]); - r1[3].val[1] = vpadd_u8(r1[3].val[1], r1[3].val[1]); - /* Shift packed 8-bit */ - r0[0].val[0] = vshr_n_u8(r0[0].val[0], 1); - r0[0].val[1] = vshr_n_u8(r0[0].val[1], 1); - r0[1].val[0] = vshr_n_u8(r0[1].val[0], 1); - r0[1].val[1] = vshr_n_u8(r0[1].val[1], 1); - r0[2].val[0] = vshr_n_u8(r0[2].val[0], 1); - r0[2].val[1] = vshr_n_u8(r0[2].val[1], 1); - r0[3].val[0] = vshr_n_u8(r0[3].val[0], 1); - r0[3].val[1] = vshr_n_u8(r0[3].val[1], 1); - /* Store the created mask to the destination vector */ - vst1_lane_u8(dest + 8 * j + 0 + i, r1[0].val[0], 0); - vst1_lane_u8(dest + 8 * j + 1 + i, r1[0].val[1], 0); - vst1_lane_u8(dest + 8 * j + 2 + i, r1[1].val[0], 0); - vst1_lane_u8(dest + 8 * j + 3 + i, r1[1].val[1], 0); - vst1_lane_u8(dest + 8 * j + 4 + i, r1[2].val[0], 0); - vst1_lane_u8(dest + 8 * j + 5 + i, r1[2].val[1], 0); - vst1_lane_u8(dest + 8 * j + 6 + i, r1[3].val[0], 0); - vst1_lane_u8(dest + 8 * j + 7 + i, r1[3].val[1], 0); + +/* For data organized into a row for each bit (8 * elem_size rows), transpose + * the bytes. */ +int64_t bshuf_trans_byte_bitrow_NEON(const void* in, void* out, const size_t size, + const size_t elem_size) { + + size_t ii, jj; + const char* in_b = (const char*) in; + char* out_b = (char*) out; + + CHECK_MULT_EIGHT(size); + + size_t nrows = 8 * elem_size; + size_t nbyte_row = size / 8; + + int8x16_t a0, b0, c0, d0, e0, f0, g0, h0; + int8x16_t a1, b1, c1, d1, e1, f1, g1, h1; + int64x1_t *as, *bs, *cs, *ds, *es, *fs, *gs, *hs; + + for (ii = 0; ii + 7 < nrows; ii += 8) { + for (jj = 0; jj + 15 < nbyte_row; jj += 16) { + a0 = vld1q_s8(in_b + (ii + 0)*nbyte_row + jj); + b0 = vld1q_s8(in_b + (ii + 1)*nbyte_row + jj); + c0 = vld1q_s8(in_b + (ii + 2)*nbyte_row + jj); + d0 = vld1q_s8(in_b + (ii + 3)*nbyte_row + jj); + e0 = vld1q_s8(in_b + (ii + 4)*nbyte_row + jj); + f0 = vld1q_s8(in_b + (ii + 5)*nbyte_row + jj); + g0 = vld1q_s8(in_b + (ii + 6)*nbyte_row + jj); + h0 = vld1q_s8(in_b + (ii + 7)*nbyte_row + jj); + + a1 = vzip1q_s8(a0, b0); + b1 = vzip1q_s8(c0, d0); + c1 = vzip1q_s8(e0, f0); + d1 = vzip1q_s8(g0, h0); + e1 = vzip2q_s8(a0, b0); + f1 = vzip2q_s8(c0, d0); + g1 = vzip2q_s8(e0, f0); + h1 = vzip2q_s8(g0, h0); + + a0 = (int8x16_t) vzip1q_s16 (vreinterpretq_s16_s8 (a1), vreinterpretq_s16_s8 (b1)); + b0= (int8x16_t) vzip1q_s16 (vreinterpretq_s16_s8 (c1), vreinterpretq_s16_s8 (d1)); + c0 = (int8x16_t) vzip2q_s16 (vreinterpretq_s16_s8 (a1), vreinterpretq_s16_s8 (b1)); + d0 = (int8x16_t) vzip2q_s16 (vreinterpretq_s16_s8 (c1), vreinterpretq_s16_s8 (d1)); + e0 = (int8x16_t) vzip1q_s16 (vreinterpretq_s16_s8 (e1), vreinterpretq_s16_s8 (f1)); + f0 = (int8x16_t) vzip1q_s16 (vreinterpretq_s16_s8 (g1), vreinterpretq_s16_s8 (h1)); + g0 = (int8x16_t) vzip2q_s16 (vreinterpretq_s16_s8 (e1), vreinterpretq_s16_s8 (f1)); + h0 = (int8x16_t) vzip2q_s16 (vreinterpretq_s16_s8 (g1), vreinterpretq_s16_s8 (h1)); + + a1 = (int8x16_t) vzip1q_s32 (vreinterpretq_s32_s8 (a0), vreinterpretq_s32_s8 (b0)); + b1 = (int8x16_t) vzip2q_s32 (vreinterpretq_s32_s8 (a0), vreinterpretq_s32_s8 (b0)); + c1 = (int8x16_t) vzip1q_s32 
(vreinterpretq_s32_s8 (c0), vreinterpretq_s32_s8 (d0)); + d1 = (int8x16_t) vzip2q_s32 (vreinterpretq_s32_s8 (c0), vreinterpretq_s32_s8 (d0)); + e1 = (int8x16_t) vzip1q_s32 (vreinterpretq_s32_s8 (e0), vreinterpretq_s32_s8 (f0)); + f1 = (int8x16_t) vzip2q_s32 (vreinterpretq_s32_s8 (e0), vreinterpretq_s32_s8 (f0)); + g1 = (int8x16_t) vzip1q_s32 (vreinterpretq_s32_s8 (g0), vreinterpretq_s32_s8 (h0)); + h1 = (int8x16_t) vzip2q_s32 (vreinterpretq_s32_s8 (g0), vreinterpretq_s32_s8 (h0)); + + as = (int64x1_t *) &a1; + bs = (int64x1_t *) &b1; + cs = (int64x1_t *) &c1; + ds = (int64x1_t *) &d1; + es = (int64x1_t *) &e1; + fs = (int64x1_t *) &f1; + gs = (int64x1_t *) &g1; + hs = (int64x1_t *) &h1; + + vst1_s64((int64_t *)(out_b + (jj + 0) * nrows + ii), *as); + vst1_s64((int64_t *)(out_b + (jj + 1) * nrows + ii), *(as + 1)); + vst1_s64((int64_t *)(out_b + (jj + 2) * nrows + ii), *bs); + vst1_s64((int64_t *)(out_b + (jj + 3) * nrows + ii), *(bs + 1)); + vst1_s64((int64_t *)(out_b + (jj + 4) * nrows + ii), *cs); + vst1_s64((int64_t *)(out_b + (jj + 5) * nrows + ii), *(cs + 1)); + vst1_s64((int64_t *)(out_b + (jj + 6) * nrows + ii), *ds); + vst1_s64((int64_t *)(out_b + (jj + 7) * nrows + ii), *(ds + 1)); + vst1_s64((int64_t *)(out_b + (jj + 8) * nrows + ii), *es); + vst1_s64((int64_t *)(out_b + (jj + 9) * nrows + ii), *(es + 1)); + vst1_s64((int64_t *)(out_b + (jj + 10) * nrows + ii), *fs); + vst1_s64((int64_t *)(out_b + (jj + 11) * nrows + ii), *(fs + 1)); + vst1_s64((int64_t *)(out_b + (jj + 12) * nrows + ii), *gs); + vst1_s64((int64_t *)(out_b + (jj + 13) * nrows + ii), *(gs + 1)); + vst1_s64((int64_t *)(out_b + (jj + 14) * nrows + ii), *hs); + vst1_s64((int64_t *)(out_b + (jj + 15) * nrows + ii), *(hs + 1)); + } + for (jj = nbyte_row - nbyte_row % 16; jj < nbyte_row; jj ++) { + out_b[jj * nrows + ii + 0] = in_b[(ii + 0)*nbyte_row + jj]; + out_b[jj * nrows + ii + 1] = in_b[(ii + 1)*nbyte_row + jj]; + out_b[jj * nrows + ii + 2] = in_b[(ii + 2)*nbyte_row + jj]; + out_b[jj * nrows + ii + 3] = in_b[(ii + 3)*nbyte_row + jj]; + out_b[jj * nrows + ii + 4] = in_b[(ii + 4)*nbyte_row + jj]; + out_b[jj * nrows + ii + 5] = in_b[(ii + 5)*nbyte_row + jj]; + out_b[jj * nrows + ii + 6] = in_b[(ii + 6)*nbyte_row + jj]; + out_b[jj * nrows + ii + 7] = in_b[(ii + 7)*nbyte_row + jj]; + } } - } + return size * elem_size; } -/* Routine optimized for bit-unshuffling a buffer for a type size of 16 byte. 
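bshuf_trans_byte_bitrow_NEON above walks the (8 * elem_size) x (size / 8) byte matrix in 8 x 16 tiles and lets the trailing scalar loop cover the last nbyte_row % 16 columns of each row band. Stripped of the SIMD, the whole kernel is a plain matrix transpose; a sketch of the invariant it maintains, reusing the function's own in_b, out_b, nrows and nbyte_row:

    /* element (r, c) of the nrows x nbyte_row matrix moves to (c, r) */
    for (size_t r = 0; r < nrows; r++)
      for (size_t c = 0; c < nbyte_row; c++)
        out_b[c * nrows + r] = in_b[r * nbyte_row + c];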
*/ -static void -bitunshuffle16_neon(void* _src, void* dest, const size_t size, const size_t elem_size) { - - size_t i, j, k; - uint8x8x2_t r0[8], r1[8]; - uint8_t* src = _src; - - const int8_t __attribute__ ((aligned (16))) xr[8] = {0, 1, 2, 3, 4, 5, 6, 7}; - uint8x8_t mask_and = vdup_n_u8(0x01); - int8x8_t mask_shift = vld1_s8(xr); - - for (i = 0, k = 0; i < size * elem_size; i += 128, k++) { - for (j = 0; j < 8; j++) { - /* Load lanes */ - r0[0].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 0 * size * elem_size / 16]; - r0[0].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 1 * size * elem_size / 16]; - r0[1].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 2 * size * elem_size / 16]; - r0[1].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 3 * size * elem_size / 16]; - r0[2].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 4 * size * elem_size / 16]; - r0[2].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 5 * size * elem_size / 16]; - r0[3].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 6 * size * elem_size / 16]; - r0[3].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 7 * size * elem_size / 16]; - r0[4].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 8 * size * elem_size / 16]; - r0[4].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 9 * size * elem_size / 16]; - r0[5].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 10 * size * elem_size / 16]; - r0[5].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 11 * size * elem_size / 16]; - r0[6].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 12 * size * elem_size / 16]; - r0[6].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 13 * size * elem_size / 16]; - r0[7].val[0][j] = src[k + j * size * elem_size / (8 * elem_size) + 14 * size * elem_size / 16]; - r0[7].val[1][j] = src[k + j * size * elem_size / (8 * elem_size) + 15 * size * elem_size / 16]; - } - for (j = 0; j < 8; j++) { - /* Create mask from the most significant bit of each 8-bit element */ - r1[0].val[0] = vand_u8(r0[0].val[0], mask_and); - r1[0].val[0] = vshl_u8(r1[0].val[0], mask_shift); - r1[0].val[1] = vand_u8(r0[0].val[1], mask_and); - r1[0].val[1] = vshl_u8(r1[0].val[1], mask_shift); - r1[1].val[0] = vand_u8(r0[1].val[0], mask_and); - r1[1].val[0] = vshl_u8(r1[1].val[0], mask_shift); - r1[1].val[1] = vand_u8(r0[1].val[1], mask_and); - r1[1].val[1] = vshl_u8(r1[1].val[1], mask_shift); - r1[2].val[0] = vand_u8(r0[2].val[0], mask_and); - r1[2].val[0] = vshl_u8(r1[2].val[0], mask_shift); - r1[2].val[1] = vand_u8(r0[2].val[1], mask_and); - r1[2].val[1] = vshl_u8(r1[2].val[1], mask_shift); - r1[3].val[0] = vand_u8(r0[3].val[0], mask_and); - r1[3].val[0] = vshl_u8(r1[3].val[0], mask_shift); - r1[3].val[1] = vand_u8(r0[3].val[1], mask_and); - r1[3].val[1] = vshl_u8(r1[3].val[1], mask_shift); - r1[4].val[0] = vand_u8(r0[4].val[0], mask_and); - r1[4].val[0] = vshl_u8(r1[4].val[0], mask_shift); - r1[4].val[1] = vand_u8(r0[4].val[1], mask_and); - r1[4].val[1] = vshl_u8(r1[4].val[1], mask_shift); - r1[5].val[0] = vand_u8(r0[5].val[0], mask_and); - r1[5].val[0] = vshl_u8(r1[5].val[0], mask_shift); - r1[5].val[1] = vand_u8(r0[5].val[1], mask_and); - r1[5].val[1] = vshl_u8(r1[5].val[1], mask_shift); - r1[6].val[0] = vand_u8(r0[6].val[0], mask_and); - r1[6].val[0] = vshl_u8(r1[6].val[0], mask_shift); - r1[6].val[1] = vand_u8(r0[6].val[1], mask_and); - r1[6].val[1] = vshl_u8(r1[6].val[1], mask_shift); - r1[7].val[0] = 
vand_u8(r0[7].val[0], mask_and); - r1[7].val[0] = vshl_u8(r1[7].val[0], mask_shift); - r1[7].val[1] = vand_u8(r0[7].val[1], mask_and); - r1[7].val[1] = vshl_u8(r1[7].val[1], mask_shift); - - r1[0].val[0] = vpadd_u8(r1[0].val[0], r1[0].val[0]); - r1[0].val[0] = vpadd_u8(r1[0].val[0], r1[0].val[0]); - r1[0].val[0] = vpadd_u8(r1[0].val[0], r1[0].val[0]); - r1[0].val[1] = vpadd_u8(r1[0].val[1], r1[0].val[1]); - r1[0].val[1] = vpadd_u8(r1[0].val[1], r1[0].val[1]); - r1[0].val[1] = vpadd_u8(r1[0].val[1], r1[0].val[1]); - r1[1].val[0] = vpadd_u8(r1[1].val[0], r1[1].val[0]); - r1[1].val[0] = vpadd_u8(r1[1].val[0], r1[1].val[0]); - r1[1].val[0] = vpadd_u8(r1[1].val[0], r1[1].val[0]); - r1[1].val[1] = vpadd_u8(r1[1].val[1], r1[1].val[1]); - r1[1].val[1] = vpadd_u8(r1[1].val[1], r1[1].val[1]); - r1[1].val[1] = vpadd_u8(r1[1].val[1], r1[1].val[1]); - r1[2].val[0] = vpadd_u8(r1[2].val[0], r1[2].val[0]); - r1[2].val[0] = vpadd_u8(r1[2].val[0], r1[2].val[0]); - r1[2].val[0] = vpadd_u8(r1[2].val[0], r1[2].val[0]); - r1[2].val[1] = vpadd_u8(r1[2].val[1], r1[2].val[1]); - r1[2].val[1] = vpadd_u8(r1[2].val[1], r1[2].val[1]); - r1[2].val[1] = vpadd_u8(r1[2].val[1], r1[2].val[1]); - r1[3].val[0] = vpadd_u8(r1[3].val[0], r1[3].val[0]); - r1[3].val[0] = vpadd_u8(r1[3].val[0], r1[3].val[0]); - r1[3].val[0] = vpadd_u8(r1[3].val[0], r1[3].val[0]); - r1[3].val[1] = vpadd_u8(r1[3].val[1], r1[3].val[1]); - r1[3].val[1] = vpadd_u8(r1[3].val[1], r1[3].val[1]); - r1[3].val[1] = vpadd_u8(r1[3].val[1], r1[3].val[1]); - r1[4].val[0] = vpadd_u8(r1[4].val[0], r1[4].val[0]); - r1[4].val[0] = vpadd_u8(r1[4].val[0], r1[4].val[0]); - r1[4].val[0] = vpadd_u8(r1[4].val[0], r1[4].val[0]); - r1[4].val[1] = vpadd_u8(r1[4].val[1], r1[4].val[1]); - r1[4].val[1] = vpadd_u8(r1[4].val[1], r1[4].val[1]); - r1[4].val[1] = vpadd_u8(r1[4].val[1], r1[4].val[1]); - r1[5].val[0] = vpadd_u8(r1[5].val[0], r1[5].val[0]); - r1[5].val[0] = vpadd_u8(r1[5].val[0], r1[5].val[0]); - r1[5].val[0] = vpadd_u8(r1[5].val[0], r1[5].val[0]); - r1[5].val[1] = vpadd_u8(r1[5].val[1], r1[5].val[1]); - r1[5].val[1] = vpadd_u8(r1[5].val[1], r1[5].val[1]); - r1[5].val[1] = vpadd_u8(r1[5].val[1], r1[5].val[1]); - r1[6].val[0] = vpadd_u8(r1[6].val[0], r1[6].val[0]); - r1[6].val[0] = vpadd_u8(r1[6].val[0], r1[6].val[0]); - r1[6].val[0] = vpadd_u8(r1[6].val[0], r1[6].val[0]); - r1[6].val[1] = vpadd_u8(r1[6].val[1], r1[6].val[1]); - r1[6].val[1] = vpadd_u8(r1[6].val[1], r1[6].val[1]); - r1[6].val[1] = vpadd_u8(r1[6].val[1], r1[6].val[1]); - r1[7].val[0] = vpadd_u8(r1[7].val[0], r1[7].val[0]); - r1[7].val[0] = vpadd_u8(r1[7].val[0], r1[7].val[0]); - r1[7].val[0] = vpadd_u8(r1[7].val[0], r1[7].val[0]); - r1[7].val[1] = vpadd_u8(r1[7].val[1], r1[7].val[1]); - r1[7].val[1] = vpadd_u8(r1[7].val[1], r1[7].val[1]); - r1[7].val[1] = vpadd_u8(r1[7].val[1], r1[7].val[1]); - /* Shift packed 8-bit */ - r0[0].val[0] = vshr_n_u8(r0[0].val[0], 1); - r0[0].val[1] = vshr_n_u8(r0[0].val[1], 1); - r0[1].val[0] = vshr_n_u8(r0[1].val[0], 1); - r0[1].val[1] = vshr_n_u8(r0[1].val[1], 1); - r0[2].val[0] = vshr_n_u8(r0[2].val[0], 1); - r0[2].val[1] = vshr_n_u8(r0[2].val[1], 1); - r0[3].val[0] = vshr_n_u8(r0[3].val[0], 1); - r0[3].val[1] = vshr_n_u8(r0[3].val[1], 1); - r0[4].val[0] = vshr_n_u8(r0[4].val[0], 1); - r0[4].val[1] = vshr_n_u8(r0[4].val[1], 1); - r0[5].val[0] = vshr_n_u8(r0[5].val[0], 1); - r0[5].val[1] = vshr_n_u8(r0[5].val[1], 1); - r0[6].val[0] = vshr_n_u8(r0[6].val[0], 1); - r0[6].val[1] = vshr_n_u8(r0[6].val[1], 1); - r0[7].val[0] = vshr_n_u8(r0[7].val[0], 1); - r0[7].val[1] = 
vshr_n_u8(r0[7].val[1], 1); - /* Store the created mask to the destination vector */ - vst1_lane_u8(dest + 16 * j + 0 + i, r1[0].val[0], 0); - vst1_lane_u8(dest + 16 * j + 1 + i, r1[0].val[1], 0); - vst1_lane_u8(dest + 16 * j + 2 + i, r1[1].val[0], 0); - vst1_lane_u8(dest + 16 * j + 3 + i, r1[1].val[1], 0); - vst1_lane_u8(dest + 16 * j + 4 + i, r1[2].val[0], 0); - vst1_lane_u8(dest + 16 * j + 5 + i, r1[2].val[1], 0); - vst1_lane_u8(dest + 16 * j + 6 + i, r1[3].val[0], 0); - vst1_lane_u8(dest + 16 * j + 7 + i, r1[3].val[1], 0); - vst1_lane_u8(dest + 16 * j + 8 + i, r1[4].val[0], 0); - vst1_lane_u8(dest + 16 * j + 9 + i, r1[4].val[1], 0); - vst1_lane_u8(dest + 16 * j + 10 + i, r1[5].val[0], 0); - vst1_lane_u8(dest + 16 * j + 11 + i, r1[5].val[1], 0); - vst1_lane_u8(dest + 16 * j + 12 + i, r1[6].val[0], 0); - vst1_lane_u8(dest + 16 * j + 13 + i, r1[6].val[1], 0); - vst1_lane_u8(dest + 16 * j + 14 + i, r1[7].val[0], 0); - vst1_lane_u8(dest + 16 * j + 15 + i, r1[7].val[1], 0); + +/* Shuffle bits within the bytes of eight element blocks. */ +int64_t bshuf_shuffle_bit_eightelem_NEON(const void* in, void* out, const size_t size, + const size_t elem_size) { + + CHECK_MULT_EIGHT(size); + + // With a bit of care, this could be written such that it is + // in_buf = out_buf safe. + const char* in_b = (const char*) in; + uint16_t* out_ui16 = (uint16_t*) out; + + size_t ii, jj, kk; + size_t nbyte = elem_size * size; + + int16x8_t xmm; + int32_t bt; + + if (elem_size % 2) { + bshuf_shuffle_bit_eightelem_scal(in, out, size, elem_size); + } else { + for (ii = 0; ii + 8 * elem_size - 1 < nbyte; + ii += 8 * elem_size) { + for (jj = 0; jj + 15 < 8 * elem_size; jj += 16) { + xmm = vld1q_s16((int16_t *) &in_b[ii + jj]); + for (kk = 0; kk < 8; kk++) { + bt = move_byte_mask_neon((uint8x16_t) xmm); + xmm = vshlq_n_s16(xmm, 1); + size_t ind = (ii + jj / 8 + (7 - kk) * elem_size); + out_ui16[ind / 2] = bt; + } + } + } } + return size * elem_size; } -/* Shuffle a block. This can never fail. */ -int64_t -bitshuffle_neon(void* _src, void* _dest, const size_t size, - const size_t elem_size, void* tmp_buf) { - size_t vectorized_chunk_size = 0; - int64_t count; - if (elem_size == 1 || elem_size == 2 || elem_size == 4) { - vectorized_chunk_size = elem_size * 16; - } else if (elem_size == 8 || elem_size == 16) { - vectorized_chunk_size = elem_size * 8; - } - - /* If the block size is too small to be vectorized, use the generic implementation. */ - if (size * elem_size < vectorized_chunk_size) { - count = bshuf_trans_bit_elem_scal((void*)_src, (void*)_dest, size, elem_size, tmp_buf); - return count; - } - - /* Optimized bitshuffle implementations */ - switch (elem_size) { - case 1: - bitshuffle1_neon(_src, _dest, size, elem_size); - break; - case 2: - bitshuffle2_neon(_src, _dest, size, elem_size); - break; - case 4: - bitshuffle4_neon(_src, _dest, size, elem_size); - break; - case 8: - bitshuffle8_neon(_src, _dest, size, elem_size); - break; - case 16: - bitshuffle16_neon(_src, _dest, size, elem_size); - break; - default: - /* Non-optimized bitshuffle */ - count = bshuf_trans_bit_elem_scal((void*)_src, (void*)_dest, size, elem_size, tmp_buf); - /* The non-optimized function covers the whole buffer, so we're done processing here. */ - return count; - } - - return (int64_t)size * (int64_t)elem_size; -} -/* Bitunshuffle a block. This can never fail.
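The removed dispatch wrappers in this hunk genuinely could not fail because the caller supplied tmp_buf; their replacements allocate the scratch buffer internally, so a negative return now has to be handled. A round-trip sketch of the new NEON entry points as declared in bitshuffle-neon.h (the src, shuffled and restored buffers are hypothetical):

    int64_t rc = bshuf_trans_bit_elem_NEON(src, shuffled, size, elem_size);
    if (rc >= 0)
      rc = bshuf_untrans_bit_elem_NEON(shuffled, restored, size, elem_size);
    /* rc < 0: scratch allocation failed or size is not a multiple of 8 */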
*/ -int64_t -bitunshuffle_neon(void* _src, void* _dest, const size_t size, - const size_t elem_size, void* tmp_buf) { - size_t vectorized_chunk_size = 0; - int64_t count; - if (size * elem_size == 1 || size * elem_size == 2 || size * elem_size == 4) { - vectorized_chunk_size = size * elem_size * 16; - } else if (size * elem_size == 8 || size * elem_size == 16) { - vectorized_chunk_size = size * elem_size * 8; - } - - /* If the block size is too small to be vectorized, use the generic implementation. */ - if (size * elem_size < vectorized_chunk_size) { - count = bshuf_untrans_bit_elem_scal((void*)_src, (void*)_dest, size, elem_size, tmp_buf); +/* Untranspose bits within elements. */ +int64_t bshuf_untrans_bit_elem_NEON(const void* in, void* out, const size_t size, + const size_t elem_size) { + + int64_t count; + + CHECK_MULT_EIGHT(size); + + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + + count = bshuf_trans_byte_bitrow_NEON(in, tmp_buf, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_shuffle_bit_eightelem_NEON(tmp_buf, out, size, elem_size); + + free(tmp_buf); + return count; - } - - /* Optimized bitunshuffle implementations */ - switch (elem_size) { - case 1: - bitunshuffle1_neon(_src, _dest, size, elem_size); - break; - case 2: - bitunshuffle2_neon(_src, _dest, size, elem_size); - break; - case 4: - bitunshuffle4_neon(_src, _dest, size, elem_size); - break; - case 8: - bitunshuffle8_neon(_src, _dest, size, elem_size); - break; - case 16: - bitunshuffle16_neon(_src, _dest, size, elem_size); - break; - default: - /* Non-optimized bitunshuffle */ - count = bshuf_untrans_bit_elem_scal((void*)_src, (void*)_dest, size, elem_size, tmp_buf); - /* The non-optimized function covers the whole buffer, - so we're done processing here. */ - return count; - } - - return (int64_t)size * (int64_t)elem_size; } #endif /* defined(__ARM_NEON) */ diff --git a/blosc/bitshuffle-neon.h b/blosc/bitshuffle-neon.h index 5d9f04af..1370f84a 100644 --- a/blosc/bitshuffle-neon.h +++ b/blosc/bitshuffle-neon.h @@ -5,8 +5,6 @@ https://blosc.org License: BSD 3-Clause (see LICENSE.txt) - Note: Adapted for NEON by Lucian Marc. - See LICENSE.txt for details about copyright and rights to use. **********************************************************************/ @@ -23,13 +21,13 @@ /** NEON-accelerated bitshuffle routine. */ -BLOSC_NO_EXPORT int64_t bitshuffle_neon(void* _src, void* _dest, const size_t blocksize, - const size_t bytesoftype, void* tmp_buf); +BLOSC_NO_EXPORT int64_t bshuf_trans_bit_elem_NEON(const void* in, void* out, const size_t size, + const size_t elem_size); /** NEON-accelerated bitunshuffle routine. */ -BLOSC_NO_EXPORT int64_t bitunshuffle_neon(void* _src, void* _dest, const size_t blocksize, - const size_t bytesoftype, void* tmp_buf); +BLOSC_NO_EXPORT int64_t bshuf_untrans_bit_elem_NEON(const void* in, void* out, const size_t size, + const size_t elem_size); #endif /* BLOSC_BITSHUFFLE_NEON_H */ diff --git a/blosc/bitshuffle-sse2.c b/blosc/bitshuffle-sse2.c index ed5769de..e4784c1c 100644 --- a/blosc/bitshuffle-sse2.c +++ b/blosc/bitshuffle-sse2.c @@ -29,8 +29,6 @@ #include -#include - /* The next is useful for debugging purposes */ #if 0 #include @@ -53,17 +51,18 @@ static void printxmm(__m128i xmm0) /* ---- Worker code that requires SSE2. Intel Pentium 4 (2000) and later. ---- */ + /* Transpose bytes within elements for 16 bit elements.
*/ -int64_t bshuf_trans_byte_elem_SSE_16(void* in, void* out, const size_t size) { +int64_t bshuf_trans_byte_elem_SSE_16(const void* in, void* out, const size_t size) { - char* in_b = (char*)in; - char* out_b = (char*)out; - __m128i a0, b0, a1, b1; size_t ii; + const char *in_b = (const char*) in; + char *out_b = (char*) out; + __m128i a0, b0, a1, b1; - for (ii = 0; ii + 15 < size; ii += 16) { - a0 = _mm_loadu_si128((__m128i*)&in_b[2 * ii + 0 * 16]); - b0 = _mm_loadu_si128((__m128i*)&in_b[2 * ii + 1 * 16]); + for (ii=0; ii + 15 < size; ii += 16) { + a0 = _mm_loadu_si128((__m128i *) &in_b[2*ii + 0*16]); + b0 = _mm_loadu_si128((__m128i *) &in_b[2*ii + 1*16]); a1 = _mm_unpacklo_epi8(a0, b0); b1 = _mm_unpackhi_epi8(a0, b0); @@ -77,8 +76,8 @@ int64_t bshuf_trans_byte_elem_SSE_16(void* in, void* out, const size_t size) { a0 = _mm_unpacklo_epi8(a1, b1); b0 = _mm_unpackhi_epi8(a1, b1); - _mm_storeu_si128((__m128i*)&out_b[0 * size + ii], a0); - _mm_storeu_si128((__m128i*)&out_b[1 * size + ii], b0); + _mm_storeu_si128((__m128i *) &out_b[0*size + ii], a0); + _mm_storeu_si128((__m128i *) &out_b[1*size + ii], b0); } return bshuf_trans_byte_elem_remainder(in, out, size, 2, size - size % 16); @@ -86,18 +85,20 @@ int64_t bshuf_trans_byte_elem_SSE_16(void* in, void* out, const size_t size) { /* Transpose bytes within elements for 32 bit elements. */ -int64_t bshuf_trans_byte_elem_SSE_32(void* in, void* out, const size_t size) { +int64_t bshuf_trans_byte_elem_SSE_32(const void* in, void* out, const size_t size) { - char* in_b = (char*)in; - char* out_b = (char*)out; - __m128i a0, b0, c0, d0, a1, b1, c1, d1; size_t ii; + const char *in_b; + char *out_b; + in_b = (const char*) in; + out_b = (char*) out; + __m128i a0, b0, c0, d0, a1, b1, c1, d1; - for (ii = 0; ii + 15 < size; ii += 16) { - a0 = _mm_loadu_si128((__m128i*)&in_b[4 * ii + 0 * 16]); - b0 = _mm_loadu_si128((__m128i*)&in_b[4 * ii + 1 * 16]); - c0 = _mm_loadu_si128((__m128i*)&in_b[4 * ii + 2 * 16]); - d0 = _mm_loadu_si128((__m128i*)&in_b[4 * ii + 3 * 16]); + for (ii=0; ii + 15 < size; ii += 16) { + a0 = _mm_loadu_si128((__m128i *) &in_b[4*ii + 0*16]); + b0 = _mm_loadu_si128((__m128i *) &in_b[4*ii + 1*16]); + c0 = _mm_loadu_si128((__m128i *) &in_b[4*ii + 2*16]); + d0 = _mm_loadu_si128((__m128i *) &in_b[4*ii + 3*16]); a1 = _mm_unpacklo_epi8(a0, b0); b1 = _mm_unpackhi_epi8(a0, b0); @@ -119,10 +120,10 @@ int64_t bshuf_trans_byte_elem_SSE_32(void* in, void* out, const size_t size) { c0 = _mm_unpacklo_epi64(b1, d1); d0 = _mm_unpackhi_epi64(b1, d1); - _mm_storeu_si128((__m128i*)&out_b[0 * size + ii], a0); - _mm_storeu_si128((__m128i*)&out_b[1 * size + ii], b0); - _mm_storeu_si128((__m128i*)&out_b[2 * size + ii], c0); - _mm_storeu_si128((__m128i*)&out_b[3 * size + ii], d0); + _mm_storeu_si128((__m128i *) &out_b[0*size + ii], a0); + _mm_storeu_si128((__m128i *) &out_b[1*size + ii], b0); + _mm_storeu_si128((__m128i *) &out_b[2*size + ii], c0); + _mm_storeu_si128((__m128i *) &out_b[3*size + ii], d0); } return bshuf_trans_byte_elem_remainder(in, out, size, 4, size - size % 16); @@ -130,23 +131,23 @@ int64_t bshuf_trans_byte_elem_SSE_32(void* in, void* out, const size_t size) { /* Transpose bytes within elements for 64 bit elements. 
*/ -int64_t bshuf_trans_byte_elem_SSE_64(void* in, void* out, const size_t size) { +int64_t bshuf_trans_byte_elem_SSE_64(const void* in, void* out, const size_t size) { - char* in_b = (char*)in; - char* out_b = (char*)out; + size_t ii; + const char* in_b = (const char*) in; + char* out_b = (char*) out; __m128i a0, b0, c0, d0, e0, f0, g0, h0; __m128i a1, b1, c1, d1, e1, f1, g1, h1; - size_t ii; - for (ii = 0; ii + 15 < size; ii += 16) { - a0 = _mm_loadu_si128((__m128i*)&in_b[8 * ii + 0 * 16]); - b0 = _mm_loadu_si128((__m128i*)&in_b[8 * ii + 1 * 16]); - c0 = _mm_loadu_si128((__m128i*)&in_b[8 * ii + 2 * 16]); - d0 = _mm_loadu_si128((__m128i*)&in_b[8 * ii + 3 * 16]); - e0 = _mm_loadu_si128((__m128i*)&in_b[8 * ii + 4 * 16]); - f0 = _mm_loadu_si128((__m128i*)&in_b[8 * ii + 5 * 16]); - g0 = _mm_loadu_si128((__m128i*)&in_b[8 * ii + 6 * 16]); - h0 = _mm_loadu_si128((__m128i*)&in_b[8 * ii + 7 * 16]); + for (ii=0; ii + 15 < size; ii += 16) { + a0 = _mm_loadu_si128((__m128i *) &in_b[8*ii + 0*16]); + b0 = _mm_loadu_si128((__m128i *) &in_b[8*ii + 1*16]); + c0 = _mm_loadu_si128((__m128i *) &in_b[8*ii + 2*16]); + d0 = _mm_loadu_si128((__m128i *) &in_b[8*ii + 3*16]); + e0 = _mm_loadu_si128((__m128i *) &in_b[8*ii + 4*16]); + f0 = _mm_loadu_si128((__m128i *) &in_b[8*ii + 5*16]); + g0 = _mm_loadu_si128((__m128i *) &in_b[8*ii + 6*16]); + h0 = _mm_loadu_si128((__m128i *) &in_b[8*ii + 7*16]); a1 = _mm_unpacklo_epi8(a0, b0); b1 = _mm_unpackhi_epi8(a0, b0); @@ -184,39 +185,27 @@ int64_t bshuf_trans_byte_elem_SSE_64(void* in, void* out, const size_t size) { g0 = _mm_unpacklo_epi64(d1, h1); h0 = _mm_unpackhi_epi64(d1, h1); - _mm_storeu_si128((__m128i*)&out_b[0 * size + ii], a0); - _mm_storeu_si128((__m128i*)&out_b[1 * size + ii], b0); - _mm_storeu_si128((__m128i*)&out_b[2 * size + ii], c0); - _mm_storeu_si128((__m128i*)&out_b[3 * size + ii], d0); - _mm_storeu_si128((__m128i*)&out_b[4 * size + ii], e0); - _mm_storeu_si128((__m128i*)&out_b[5 * size + ii], f0); - _mm_storeu_si128((__m128i*)&out_b[6 * size + ii], g0); - _mm_storeu_si128((__m128i*)&out_b[7 * size + ii], h0); + _mm_storeu_si128((__m128i *) &out_b[0*size + ii], a0); + _mm_storeu_si128((__m128i *) &out_b[1*size + ii], b0); + _mm_storeu_si128((__m128i *) &out_b[2*size + ii], c0); + _mm_storeu_si128((__m128i *) &out_b[3*size + ii], d0); + _mm_storeu_si128((__m128i *) &out_b[4*size + ii], e0); + _mm_storeu_si128((__m128i *) &out_b[5*size + ii], f0); + _mm_storeu_si128((__m128i *) &out_b[6*size + ii], g0); + _mm_storeu_si128((__m128i *) &out_b[7*size + ii], h0); } return bshuf_trans_byte_elem_remainder(in, out, size, 8, size - size % 16); } -/* Memory copy with bshuf call signature. */ -int64_t bshuf_copy(void* in, void* out, const size_t size, - const size_t elem_size) { - - char* in_b = (char*)in; - char* out_b = (char*)out; - - memcpy(out_b, in_b, size * elem_size); - return (int64_t)size * (int64_t)elem_size; -} - - -/* Transpose bytes within elements using best SSE algorithm available. */ -int64_t bshuf_trans_byte_elem_sse2(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf) { +/* Transpose bytes within elements using the best SSE algorithm available. */ +int64_t bshuf_trans_byte_elem_SSE(const void* in, void* out, const size_t size, + const size_t elem_size) { int64_t count; - /* Trivial cases: power of 2 bytes. */ + // Trivial cases: power of 2 bytes. 
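Before the dispatch switch that follows, it helps to state what this family of kernels computes: "transpose bytes within elements" gathers byte k of every element into contiguous row k, so the later bit stage can work on uniform byte planes. A scalar reference, shown only to fix the indexing convention (not part of the patch):

    #include <stddef.h>

    static void trans_byte_elem_ref(const char* in, char* out,
                                    size_t size, size_t elem_size) {
      for (size_t e = 0; e < size; e++)          /* element index */
        for (size_t k = 0; k < elem_size; k++)   /* byte within the element */
          out[k * size + e] = in[e * elem_size + k];
    }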
switch (elem_size) { case 1: count = bshuf_copy(in, out, size, elem_size); @@ -232,16 +221,18 @@ int64_t bshuf_trans_byte_elem_sse2(void* in, void* out, const size_t size, return count; } - /* Worst case: odd number of bytes. Turns out that this is faster for */ - /* (odd * 2) byte elements as well (hence % 4). */ + // Worst case: odd number of bytes. Turns out that this is faster for + // (odd * 2) byte elements as well (hence % 4). if (elem_size % 4) { count = bshuf_trans_byte_elem_scal(in, out, size, elem_size); return count; } - /* Multiple of power of 2: transpose hierarchically. */ + // Multiple of power of 2: transpose hierarchically. { size_t nchunk_elem; + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; if ((elem_size % 8) == 0) { nchunk_elem = elem_size / 8; @@ -256,7 +247,7 @@ int64_t bshuf_trans_byte_elem_sse2(void* in, void* out, const size_t size, size * nchunk_elem); bshuf_trans_elem(tmp_buf, out, 4, nchunk_elem, size); } else { - /* Not used since scalar algorithm is faster. */ + // Not used since scalar algorithm is faster. nchunk_elem = elem_size / 2; TRANS_ELEM_TYPE(in, out, size, nchunk_elem, int16_t); count = bshuf_trans_byte_elem_SSE_16(out, tmp_buf, @@ -264,33 +255,37 @@ int64_t bshuf_trans_byte_elem_sse2(void* in, void* out, const size_t size, bshuf_trans_elem(tmp_buf, out, 2, nchunk_elem, size); } + free(tmp_buf); return count; } } /* Transpose bits within bytes. */ -int64_t bshuf_trans_bit_byte_sse2(void* in, void* out, const size_t size, - const size_t elem_size) { +int64_t bshuf_trans_bit_byte_SSE(const void* in, void* out, const size_t size, + const size_t elem_size) { - char* in_b = (char*)in; - char* out_b = (char*)out; + size_t ii, kk; + const char* in_b = (const char*) in; + char* out_b = (char*) out; uint16_t* out_ui16; + int64_t count; + size_t nbyte = elem_size * size; - __m128i xmm; - int32_t bt; - size_t ii, kk; CHECK_MULT_EIGHT(nbyte); + __m128i xmm; + int32_t bt; + for (ii = 0; ii + 15 < nbyte; ii += 16) { - xmm = _mm_loadu_si128((__m128i*)&in_b[ii]); + xmm = _mm_loadu_si128((__m128i *) &in_b[ii]); for (kk = 0; kk < 8; kk++) { bt = _mm_movemask_epi8(xmm); xmm = _mm_slli_epi16(xmm, 1); - out_ui16 = (uint16_t*)&out_b[((7 - kk) * nbyte + ii) / 8]; - *out_ui16 = (uint16_t)bt; + out_ui16 = (uint16_t*) &out_b[((7 - kk) * nbyte + ii) / 8]; + *out_ui16 = bt; } } count = bshuf_trans_bit_byte_remainder(in, out, size, elem_size, @@ -300,50 +295,56 @@ int64_t bshuf_trans_bit_byte_sse2(void* in, void* out, const size_t size, /* Transpose bits within elements. */ -int64_t bshuf_trans_bit_elem_sse2(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf) { +int64_t bshuf_trans_bit_elem_SSE(const void* in, void* out, const size_t size, + const size_t elem_size) { int64_t count; CHECK_MULT_EIGHT(size); - count = bshuf_trans_byte_elem_sse2(in, out, size, elem_size, tmp_buf); - CHECK_ERR(count); - count = bshuf_trans_bit_byte_sse2(out, tmp_buf, size, elem_size); - CHECK_ERR(count); + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + + count = bshuf_trans_byte_elem_SSE(in, out, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_trans_bit_byte_SSE(out, tmp_buf, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); count = bshuf_trans_bitrow_eight(tmp_buf, out, size, elem_size); + free(tmp_buf); + return count; } /* For data organized into a row for each bit (8 * elem_size rows), transpose * the bytes. 
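The new bshuf_trans_bit_elem_SSE above makes the overall algorithm explicit: byte transposition, then bit/byte transposition via movemask, then regrouping of the eight bit-rows. A compact restatement with caller-provided scratch, assuming the internal declarations shown in this diff are in scope (bshuf_trans_bitrow_eight comes from the generic bitshuffle module):

    #include <stddef.h>
    #include <stdint.h>

    static int64_t bitshuffle_stages(const void* in, void* out, void* tmp,
                                     size_t size, size_t elem_size) {
      int64_t c = bshuf_trans_byte_elem_SSE(in, out, size, elem_size);  /* byte k -> row k */
      if (c < 0) return c;
      c = bshuf_trans_bit_byte_SSE(out, tmp, size, elem_size);          /* peel one bit-plane per movemask pass */
      if (c < 0) return c;
      return bshuf_trans_bitrow_eight(tmp, out, size, elem_size);       /* regroup the 8 bit-rows per element */
    }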
*/ -int64_t bshuf_trans_byte_bitrow_sse2(void* in, void* out, const size_t size, - const size_t elem_size) { +int64_t bshuf_trans_byte_bitrow_SSE(const void* in, void* out, const size_t size, + const size_t elem_size) { + + size_t ii, jj; + const char* in_b = (const char*) in; + char* out_b = (char*) out; + + CHECK_MULT_EIGHT(size); - char* in_b = (char*)in; - char* out_b = (char*)out; size_t nrows = 8 * elem_size; size_t nbyte_row = size / 8; - size_t ii, jj; __m128i a0, b0, c0, d0, e0, f0, g0, h0; __m128i a1, b1, c1, d1, e1, f1, g1, h1; - __m128* as, * bs, * cs, * ds, * es, * fs, * gs, * hs; - - CHECK_MULT_EIGHT(size); + __m128 *as, *bs, *cs, *ds, *es, *fs, *gs, *hs; for (ii = 0; ii + 7 < nrows; ii += 8) { for (jj = 0; jj + 15 < nbyte_row; jj += 16) { - a0 = _mm_loadu_si128((__m128i*)&in_b[(ii + 0) * nbyte_row + jj]); - b0 = _mm_loadu_si128((__m128i*)&in_b[(ii + 1) * nbyte_row + jj]); - c0 = _mm_loadu_si128((__m128i*)&in_b[(ii + 2) * nbyte_row + jj]); - d0 = _mm_loadu_si128((__m128i*)&in_b[(ii + 3) * nbyte_row + jj]); - e0 = _mm_loadu_si128((__m128i*)&in_b[(ii + 4) * nbyte_row + jj]); - f0 = _mm_loadu_si128((__m128i*)&in_b[(ii + 5) * nbyte_row + jj]); - g0 = _mm_loadu_si128((__m128i*)&in_b[(ii + 6) * nbyte_row + jj]); - h0 = _mm_loadu_si128((__m128i*)&in_b[(ii + 7) * nbyte_row + jj]); + a0 = _mm_loadu_si128((__m128i *) &in_b[(ii + 0)*nbyte_row + jj]); + b0 = _mm_loadu_si128((__m128i *) &in_b[(ii + 1)*nbyte_row + jj]); + c0 = _mm_loadu_si128((__m128i *) &in_b[(ii + 2)*nbyte_row + jj]); + d0 = _mm_loadu_si128((__m128i *) &in_b[(ii + 3)*nbyte_row + jj]); + e0 = _mm_loadu_si128((__m128i *) &in_b[(ii + 4)*nbyte_row + jj]); + f0 = _mm_loadu_si128((__m128i *) &in_b[(ii + 5)*nbyte_row + jj]); + g0 = _mm_loadu_si128((__m128i *) &in_b[(ii + 6)*nbyte_row + jj]); + h0 = _mm_loadu_si128((__m128i *) &in_b[(ii + 7)*nbyte_row + jj]); a1 = _mm_unpacklo_epi8(a0, b0); @@ -379,66 +380,66 @@ int64_t bshuf_trans_byte_bitrow_sse2(void* in, void* out, const size_t size, g1 = _mm_unpacklo_epi32(g0, h0); h1 = _mm_unpackhi_epi32(g0, h0); - /* We don't have a storeh instruction for integers, so interpret */ - /* as a float. Have a storel (_mm_storel_epi64). */ - as = (__m128*)&a1; - bs = (__m128*)&b1; - cs = (__m128*)&c1; - ds = (__m128*)&d1; - es = (__m128*)&e1; - fs = (__m128*)&f1; - gs = (__m128*)&g1; - hs = (__m128*)&h1; - - _mm_storel_pi((__m64*)&out_b[(jj + 0) * nrows + ii], *as); - _mm_storel_pi((__m64*)&out_b[(jj + 2) * nrows + ii], *bs); - _mm_storel_pi((__m64*)&out_b[(jj + 4) * nrows + ii], *cs); - _mm_storel_pi((__m64*)&out_b[(jj + 6) * nrows + ii], *ds); - _mm_storel_pi((__m64*)&out_b[(jj + 8) * nrows + ii], *es); - _mm_storel_pi((__m64*)&out_b[(jj + 10) * nrows + ii], *fs); - _mm_storel_pi((__m64*)&out_b[(jj + 12) * nrows + ii], *gs); - _mm_storel_pi((__m64*)&out_b[(jj + 14) * nrows + ii], *hs); - - _mm_storeh_pi((__m64*)&out_b[(jj + 1) * nrows + ii], *as); - _mm_storeh_pi((__m64*)&out_b[(jj + 3) * nrows + ii], *bs); - _mm_storeh_pi((__m64*)&out_b[(jj + 5) * nrows + ii], *cs); - _mm_storeh_pi((__m64*)&out_b[(jj + 7) * nrows + ii], *ds); - _mm_storeh_pi((__m64*)&out_b[(jj + 9) * nrows + ii], *es); - _mm_storeh_pi((__m64*)&out_b[(jj + 11) * nrows + ii], *fs); - _mm_storeh_pi((__m64*)&out_b[(jj + 13) * nrows + ii], *gs); - _mm_storeh_pi((__m64*)&out_b[(jj + 15) * nrows + ii], *hs); + // We don't have a storeh instruction for integers, so interpret + // as a float. Have a storel (_mm_storel_epi64). 
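The comment above names a real SSE2 gap: there is no integer store for the upper 64 bits of an XMM register, so the code views __m128i data as __m128 and uses the float stores. A self-contained sketch of the same trick using the _mm_castsi128_ps idiom instead of pointer casts (a bit-level reinterpret that compiles to no instructions):

    #include <emmintrin.h>

    static void store_halves(__m128i v, void* lo, void* hi) {
      __m128 f = _mm_castsi128_ps(v);
      _mm_storel_pi((__m64*)lo, f);  /* bytes 0..7  */
      _mm_storeh_pi((__m64*)hi, f);  /* bytes 8..15 */
    }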
+ as = (__m128 *) &a1; + bs = (__m128 *) &b1; + cs = (__m128 *) &c1; + ds = (__m128 *) &d1; + es = (__m128 *) &e1; + fs = (__m128 *) &f1; + gs = (__m128 *) &g1; + hs = (__m128 *) &h1; + + _mm_storel_pi((__m64 *) &out_b[(jj + 0) * nrows + ii], *as); + _mm_storel_pi((__m64 *) &out_b[(jj + 2) * nrows + ii], *bs); + _mm_storel_pi((__m64 *) &out_b[(jj + 4) * nrows + ii], *cs); + _mm_storel_pi((__m64 *) &out_b[(jj + 6) * nrows + ii], *ds); + _mm_storel_pi((__m64 *) &out_b[(jj + 8) * nrows + ii], *es); + _mm_storel_pi((__m64 *) &out_b[(jj + 10) * nrows + ii], *fs); + _mm_storel_pi((__m64 *) &out_b[(jj + 12) * nrows + ii], *gs); + _mm_storel_pi((__m64 *) &out_b[(jj + 14) * nrows + ii], *hs); + + _mm_storeh_pi((__m64 *) &out_b[(jj + 1) * nrows + ii], *as); + _mm_storeh_pi((__m64 *) &out_b[(jj + 3) * nrows + ii], *bs); + _mm_storeh_pi((__m64 *) &out_b[(jj + 5) * nrows + ii], *cs); + _mm_storeh_pi((__m64 *) &out_b[(jj + 7) * nrows + ii], *ds); + _mm_storeh_pi((__m64 *) &out_b[(jj + 9) * nrows + ii], *es); + _mm_storeh_pi((__m64 *) &out_b[(jj + 11) * nrows + ii], *fs); + _mm_storeh_pi((__m64 *) &out_b[(jj + 13) * nrows + ii], *gs); + _mm_storeh_pi((__m64 *) &out_b[(jj + 15) * nrows + ii], *hs); } - for (jj = nbyte_row - nbyte_row % 16; jj < nbyte_row; jj++) { - out_b[jj * nrows + ii + 0] = in_b[(ii + 0) * nbyte_row + jj]; - out_b[jj * nrows + ii + 1] = in_b[(ii + 1) * nbyte_row + jj]; - out_b[jj * nrows + ii + 2] = in_b[(ii + 2) * nbyte_row + jj]; - out_b[jj * nrows + ii + 3] = in_b[(ii + 3) * nbyte_row + jj]; - out_b[jj * nrows + ii + 4] = in_b[(ii + 4) * nbyte_row + jj]; - out_b[jj * nrows + ii + 5] = in_b[(ii + 5) * nbyte_row + jj]; - out_b[jj * nrows + ii + 6] = in_b[(ii + 6) * nbyte_row + jj]; - out_b[jj * nrows + ii + 7] = in_b[(ii + 7) * nbyte_row + jj]; + for (jj = nbyte_row - nbyte_row % 16; jj < nbyte_row; jj ++) { + out_b[jj * nrows + ii + 0] = in_b[(ii + 0)*nbyte_row + jj]; + out_b[jj * nrows + ii + 1] = in_b[(ii + 1)*nbyte_row + jj]; + out_b[jj * nrows + ii + 2] = in_b[(ii + 2)*nbyte_row + jj]; + out_b[jj * nrows + ii + 3] = in_b[(ii + 3)*nbyte_row + jj]; + out_b[jj * nrows + ii + 4] = in_b[(ii + 4)*nbyte_row + jj]; + out_b[jj * nrows + ii + 5] = in_b[(ii + 5)*nbyte_row + jj]; + out_b[jj * nrows + ii + 6] = in_b[(ii + 6)*nbyte_row + jj]; + out_b[jj * nrows + ii + 7] = in_b[(ii + 7)*nbyte_row + jj]; } } - return (int64_t)size * (int64_t)elem_size; + return size * elem_size; } /* Shuffle bits within the bytes of eight element blocks. */ -int64_t bshuf_shuffle_bit_eightelem_sse2(void* in, void* out, const size_t size, - const size_t elem_size) { - /* With a bit of care, this could be written such that it is */ - /* in_buf = out_buf safe. */ - char* in_b = (char*)in; - uint16_t* out_ui16 = (uint16_t*)out; +int64_t bshuf_shuffle_bit_eightelem_SSE(const void* in, void* out, const size_t size, + const size_t elem_size) { + CHECK_MULT_EIGHT(size); + + // With a bit of care, this could be written such that it is + // in_buf = out_buf safe.
+ const char* in_b = (const char*) in; + uint16_t* out_ui16 = (uint16_t*) out; + + size_t ii, jj, kk; size_t nbyte = elem_size * size; __m128i xmm; int32_t bt; - size_t ii, jj, kk; - size_t ind; - - CHECK_MULT_EIGHT(size); if (elem_size % 2) { bshuf_shuffle_bit_eightelem_scal(in, out, size, elem_size); @@ -446,33 +447,39 @@ int64_t bshuf_shuffle_bit_eightelem_sse2(void* in, void* out, const size_t size, for (ii = 0; ii + 8 * elem_size - 1 < nbyte; ii += 8 * elem_size) { for (jj = 0; jj + 15 < 8 * elem_size; jj += 16) { - xmm = _mm_loadu_si128((__m128i*)&in_b[ii + jj]); + xmm = _mm_loadu_si128((__m128i *) &in_b[ii + jj]); for (kk = 0; kk < 8; kk++) { bt = _mm_movemask_epi8(xmm); xmm = _mm_slli_epi16(xmm, 1); - ind = (ii + jj / 8 + (7 - kk) * elem_size); - out_ui16[ind / 2] = (uint16_t)bt; + size_t ind = (ii + jj / 8 + (7 - kk) * elem_size); + out_ui16[ind / 2] = bt; } } } } - return (int64_t)size * (int64_t)elem_size; + return size * elem_size; } /* Untranspose bits within elements. */ -int64_t bshuf_untrans_bit_elem_sse2(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf) { +int64_t bshuf_untrans_bit_elem_SSE(const void* in, void* out, const size_t size, + const size_t elem_size) { int64_t count; CHECK_MULT_EIGHT(size); - count = bshuf_trans_byte_bitrow_sse2(in, tmp_buf, size, elem_size); - CHECK_ERR(count); - count = bshuf_shuffle_bit_eightelem_sse2(tmp_buf, out, size, elem_size); + void* tmp_buf = malloc(size * elem_size); + if (tmp_buf == NULL) return -1; + + count = bshuf_trans_byte_bitrow_SSE(in, tmp_buf, size, elem_size); + CHECK_ERR_FREE(count, tmp_buf); + count = bshuf_shuffle_bit_eightelem_SSE(tmp_buf, out, size, elem_size); + + free(tmp_buf); return count; } + #endif /* defined(__SSE2__) */ diff --git a/blosc/bitshuffle-sse2.h b/blosc/bitshuffle-sse2.h index 2d31789d..f6a822eb 100644 --- a/blosc/bitshuffle-sse2.h +++ b/blosc/bitshuffle-sse2.h @@ -19,29 +19,29 @@ #include BLOSC_NO_EXPORT int64_t - bshuf_trans_byte_elem_sse2(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf); + bshuf_trans_byte_elem_SSE(const void* in, void* out, const size_t size, + const size_t elem_size); BLOSC_NO_EXPORT int64_t - bshuf_trans_byte_bitrow_sse2(void* in, void* out, const size_t size, - const size_t elem_size); + bshuf_trans_byte_bitrow_SSE(const void* in, void* out, const size_t size, + const size_t elem_size); BLOSC_NO_EXPORT int64_t - bshuf_shuffle_bit_eightelem_sse2(void* in, void* out, const size_t size, - const size_t elem_size); + bshuf_shuffle_bit_eightelem_SSE(const void* in, void* out, const size_t size, + const size_t elem_size); /** SSE2-accelerated bitshuffle routine. */ BLOSC_NO_EXPORT int64_t - bshuf_trans_bit_elem_sse2(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf); + bshuf_trans_bit_elem_SSE(const void* in, void* out, const size_t size, + const size_t elem_size); /** SSE2-accelerated bitunshuffle routine. 
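A hypothetical caller of the revised SSE2 entry points, to make the new contract concrete: the tmp_buf parameter is gone from the signatures, scratch is malloc'ed internally, and the return value is the number of bytes processed, negative on error (-1 when the allocation fails). The element count must be a multiple of 8 (CHECK_MULT_EIGHT):

    #include <stddef.h>
    #include <stdint.h>
    #include "bitshuffle-sse2.h"

    /* Round-trip n 4-byte elements; n must be a multiple of 8. */
    static int roundtrip_sse2(const void* src, void* mid, void* dst, size_t n) {
      int64_t rc = bshuf_trans_bit_elem_SSE(src, mid, n, 4);
      if (rc < 0) return (int)rc;
      rc = bshuf_untrans_bit_elem_SSE(mid, dst, n, 4);
      return rc < 0 ? (int)rc : 0;
    }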
*/ BLOSC_NO_EXPORT int64_t - bshuf_untrans_bit_elem_sse2(void* in, void* out, const size_t size, - const size_t elem_size, void* tmp_buf); + bshuf_untrans_bit_elem_SSE(const void* in, void* out, const size_t size, + const size_t elem_size); #endif /* BLOSC_BITSHUFFLE_SSE2_H */ diff --git a/blosc/blosc-private.h b/blosc/blosc-private.h index 64097f07..306825b2 100644 --- a/blosc/blosc-private.h +++ b/blosc/blosc-private.h @@ -201,6 +201,7 @@ var = { }; static inline void *dlopen (const char *filename, int flags) { + BLOSC_UNUSED_PARAM(flags); HINSTANCE hInst; hInst = LoadLibrary(filename); if (hInst==NULL) { diff --git a/blosc/blosc2.c b/blosc/blosc2.c index 1b691f0e..5dea6bc7 100644 --- a/blosc/blosc2.c +++ b/blosc/blosc2.c @@ -921,7 +921,7 @@ void _cycle_buffers(uint8_t **src, uint8_t **dest, uint8_t **tmp) { uint8_t* pipeline_forward(struct thread_context* thread_context, const int32_t bsize, const uint8_t* src, const int32_t offset, - uint8_t* dest, uint8_t* tmp, uint8_t* tmp2) { + uint8_t* dest, uint8_t* tmp) { blosc2_context* context = thread_context->parent_context; uint8_t* _src = (uint8_t*)src + offset; uint8_t* _tmp = tmp; @@ -977,7 +977,7 @@ uint8_t* pipeline_forward(struct thread_context* thread_context, const int32_t b } break; case BLOSC_BITSHUFFLE: - if (bitshuffle(typesize, bsize, _src, _dest, tmp2) < 0) { + if (bitshuffle(typesize, bsize, _src, _dest) < 0) { return NULL; } break; @@ -1081,7 +1081,6 @@ static int blosc_c(struct thread_context* thread_context, int32_t bsize, int accel; const uint8_t* _src; uint8_t *_tmp = tmp, *_tmp2 = tmp2; - uint8_t *_tmp3 = thread_context->tmp4; int last_filter_index = last_filter(context->filters, 'c'); bool memcpyed = context->header_flags & (uint8_t)BLOSC_MEMCPYED; bool instr_codec = context->blosc2_flags & BLOSC2_INSTR_CODEC; @@ -1097,14 +1096,14 @@ static int blosc_c(struct thread_context* thread_context, int32_t bsize, /* Apply the filter pipeline just for the prefilter */ if (memcpyed && context->prefilter != NULL) { // We only need the prefilter output - _src = pipeline_forward(thread_context, bsize, src, offset, dest, _tmp2, _tmp3); + _src = pipeline_forward(thread_context, bsize, src, offset, dest, _tmp2); if (_src == NULL) { return BLOSC2_ERROR_FILTER_PIPELINE; } return bsize; } /* Apply regular filter pipeline */ - _src = pipeline_forward(thread_context, bsize, src, offset, _tmp, _tmp2, _tmp3); + _src = pipeline_forward(thread_context, bsize, src, offset, _tmp, _tmp2); if (_src == NULL) { return BLOSC2_ERROR_FILTER_PIPELINE; } @@ -1357,7 +1356,7 @@ int pipeline_backward(struct thread_context* thread_context, const int32_t bsize } break; case BLOSC_BITSHUFFLE: - if (bitunshuffle(typesize, bsize, _src, _dest, _tmp, context->src[BLOSC2_CHUNK_VERSION]) < 0) { + if (bitunshuffle(typesize, bsize, _src, _dest, context->src[BLOSC2_CHUNK_VERSION]) < 0) { return BLOSC2_ERROR_FILTER_PIPELINE; } break; @@ -2095,7 +2094,7 @@ void free_thread_context(struct thread_context* thread_context) { int check_nthreads(blosc2_context* context) { if (context->nthreads <= 0) { - BLOSC_TRACE_ERROR("nthreads must be a positive integer."); + BLOSC_TRACE_ERROR("nthreads must be >= 1 and <= %d", INT16_MAX); return BLOSC2_ERROR_INVALID_PARAM; } @@ -2179,9 +2178,13 @@ static int initialize_context_compression( context->splitmode = splitmode; /* tuner some compression parameters */ context->blocksize = (int32_t)blocksize; + int rc = 0; if (context->tuner_params != NULL) { if (context->tuner_id < BLOSC_LAST_TUNER && context->tuner_id == BLOSC_STUNE) { - 
blosc_stune_next_cparams(context); + if (blosc_stune_next_cparams(context) < 0) { + BLOSC_TRACE_ERROR("Error in stune next_cparams func\n"); + return BLOSC2_ERROR_TUNER; + } } else { for (int i = 0; i < g_ntuners; ++i) { if (g_tuners[i].id == context->tuner_id) { @@ -2191,10 +2194,16 @@ static int initialize_context_compression( return BLOSC2_ERROR_FAILURE; } } - g_tuners[i].next_cparams(context); + if (g_tuners[i].next_cparams(context) < 0) { + BLOSC_TRACE_ERROR("Error in tuner %d next_cparams func\n", context->tuner_id); + return BLOSC2_ERROR_TUNER; + } if (g_tuners[i].id == BLOSC_BTUNE && context->blocksize == 0) { // Call stune for initializing blocksize - blosc_stune_next_blocksize(context); + if (blosc_stune_next_blocksize(context) < 0) { + BLOSC_TRACE_ERROR("Error in stune next_blocksize func\n"); + return BLOSC2_ERROR_TUNER; + } } goto urtunersuccess; } @@ -2204,7 +2213,7 @@ static int initialize_context_compression( } } else { if (context->tuner_id < BLOSC_LAST_TUNER && context->tuner_id == BLOSC_STUNE) { - blosc_stune_next_blocksize(context); + rc = blosc_stune_next_blocksize(context); } else { for (int i = 0; i < g_ntuners; ++i) { if (g_tuners[i].id == context->tuner_id) { @@ -2214,7 +2223,7 @@ static int initialize_context_compression( return BLOSC2_ERROR_FAILURE; } } - g_tuners[i].next_blocksize(context); + rc = g_tuners[i].next_blocksize(context); goto urtunersuccess; } } @@ -2223,6 +2232,11 @@ static int initialize_context_compression( } } urtunersuccess:; + if (rc < 0) { + BLOSC_TRACE_ERROR("Error in tuner next_blocksize func\n"); + return BLOSC2_ERROR_TUNER; + } + /* Check buffer size limits */ if (srcsize > BLOSC2_MAX_BUFFERSIZE) { @@ -2504,8 +2518,9 @@ static int blosc_compress_context(blosc2_context* context) { if (context->tuner_params != NULL) { blosc_set_timestamp(¤t); double ctime = blosc_elapsed_secs(last, current); + int rc; if (context->tuner_id < BLOSC_LAST_TUNER && context->tuner_id == BLOSC_STUNE) { - blosc_stune_update(context, ctime); + rc = blosc_stune_update(context, ctime); } else { for (int i = 0; i < g_ntuners; ++i) { if (g_tuners[i].id == context->tuner_id) { @@ -2515,7 +2530,7 @@ static int blosc_compress_context(blosc2_context* context) { return BLOSC2_ERROR_FAILURE; } } - g_tuners[i].update(context, ctime); + rc = g_tuners[i].update(context, ctime); goto urtunersuccess; } } @@ -2523,6 +2538,10 @@ static int blosc_compress_context(blosc2_context* context) { return BLOSC2_ERROR_INVALID_PARAM; urtunersuccess:; } + if (rc < 0) { + BLOSC_TRACE_ERROR("Error in tuner update func\n"); + return BLOSC2_ERROR_TUNER; + } } return ntbytes; @@ -2668,7 +2687,7 @@ int blosc2_compress(int clevel, int doshuffle, int32_t typesize, if (envvar != NULL) { long value; value = strtol(envvar, NULL, 10); - if ((value != EINVAL) && (value >= 0)) { + if ((errno != EINVAL) && (value >= 0)) { clevel = (int)value; } else { @@ -2711,7 +2730,7 @@ int blosc2_compress(int clevel, int doshuffle, int32_t typesize, if (envvar != NULL) { long value; value = strtol(envvar, NULL, 10); - if ((value != EINVAL) && (value > 0)) { + if ((errno != EINVAL) && (value > 0)) { typesize = (int32_t)value; } else { @@ -2733,7 +2752,7 @@ int blosc2_compress(int clevel, int doshuffle, int32_t typesize, if (envvar != NULL) { long blocksize; blocksize = strtol(envvar, NULL, 10); - if ((blocksize != EINVAL) && (blocksize > 0)) { + if ((errno != EINVAL) && (blocksize > 0)) { blosc1_set_blocksize((size_t) blocksize); } else { @@ -2746,7 +2765,7 @@ int blosc2_compress(int clevel, int doshuffle, int32_t 
typesize, if (envvar != NULL) { long nthreads; nthreads = strtol(envvar, NULL, 10); - if ((nthreads != EINVAL) && (nthreads > 0)) { + if ((errno != EINVAL) && (nthreads > 0)) { result = blosc2_set_nthreads((int16_t) nthreads); if (result < 0) { BLOSC_TRACE_WARNING("BLOSC_NTHREADS environment variable '%s' not recognized\n", envvar); @@ -2800,6 +2819,10 @@ int blosc2_compress(int clevel, int doshuffle, int32_t typesize, cparams.nthreads = g_nthreads; cparams.splitmode = g_splitmode; cctx = blosc2_create_cctx(cparams); + if (cctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return BLOSC2_ERROR_NULL_POINTER; + } /* Do the actual compression */ result = blosc2_compress_ctx(cctx, src, srcsize, dest, destsize); /* Release context resources */ @@ -2925,7 +2948,11 @@ int blosc2_decompress(const void* src, int32_t srcsize, void* dest, int32_t dest envvar = getenv("BLOSC_NTHREADS"); if (envvar != NULL) { nthreads = strtol(envvar, NULL, 10); - if ((nthreads != EINVAL) && (nthreads > 0)) { + if ((errno != EINVAL)) { + if ((nthreads <= 0) || (nthreads > INT16_MAX)) { + BLOSC_TRACE_ERROR("nthreads must be >= 1 and <= %d", INT16_MAX); + return BLOSC2_ERROR_INVALID_PARAM; + } result = blosc2_set_nthreads((int16_t) nthreads); if (result < 0) { return result; @@ -2940,6 +2967,10 @@ int blosc2_decompress(const void* src, int32_t srcsize, void* dest, int32_t dest if (envvar != NULL) { dparams.nthreads = g_nthreads; dctx = blosc2_create_dctx(dparams); + if (dctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the decompression context"); + return BLOSC2_ERROR_NULL_POINTER; + } result = blosc2_decompress_ctx(dctx, src, srcsize, dest, destsize); blosc2_free_ctx(dctx); return result; @@ -3495,7 +3526,10 @@ int16_t blosc2_set_nthreads(int16_t nthreads) { if (nthreads != ret) { g_nthreads = nthreads; g_global_context->new_nthreads = nthreads; - check_nthreads(g_global_context); + int16_t ret2 = check_nthreads(g_global_context); + if (ret2 < 0) { + return ret2; + } } return ret; @@ -3940,7 +3974,7 @@ blosc2_context* blosc2_create_cctx(blosc2_cparams cparams) { if (envvar != NULL) { int32_t value; value = (int32_t) strtol(envvar, NULL, 10); - if ((value != EINVAL) && (value > 0)) { + if ((errno != EINVAL) && (value > 0)) { context->typesize = value; } else { @@ -3955,7 +3989,7 @@ blosc2_context* blosc2_create_cctx(blosc2_cparams cparams) { if (envvar != NULL) { int value; value = (int)strtol(envvar, NULL, 10); - if ((value != EINVAL) && (value >= 0)) { + if ((errno != EINVAL) && (value >= 0)) { context->clevel = value; } else { @@ -3982,7 +4016,7 @@ blosc2_context* blosc2_create_cctx(blosc2_cparams cparams) { if (envvar != NULL) { int32_t blocksize; blocksize = (int32_t) strtol(envvar, NULL, 10); - if ((blocksize != EINVAL) && (blocksize > 0)) { + if ((errno != EINVAL) && (blocksize > 0)) { context->blocksize = blocksize; } else { @@ -3995,7 +4029,7 @@ blosc2_context* blosc2_create_cctx(blosc2_cparams cparams) { envvar = getenv("BLOSC_NTHREADS"); if (envvar != NULL) { int16_t nthreads = (int16_t) strtol(envvar, NULL, 10); - if ((nthreads != EINVAL) && (nthreads > 0)) { + if ((errno != EINVAL) && (nthreads > 0)) { context->nthreads = nthreads; } else { @@ -4050,7 +4084,10 @@ blosc2_context* blosc2_create_cctx(blosc2_cparams cparams) { return NULL; } } - g_tuners[i].init(cparams.tuner_params, context, NULL); + if (g_tuners[i].init(cparams.tuner_params, context, NULL) < 0) { + BLOSC_TRACE_ERROR("Error in user-defined tuner %d init function\n", cparams.tuner_id); + return NULL; + 
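An aside on the strtol fixes in these hunks: the old code compared strtol's return value against EINVAL, which confused a parsed value with an error code; the patch switches to inspecting errno instead. A stricter pattern that is common practice (an illustration, not what the patch implements) also clears errno first and rejects strings with no digits via endptr:

    #include <errno.h>
    #include <stdlib.h>

    static int parse_env_long(const char* s, long* out) {
      char* end;
      errno = 0;                     /* strtol sets errno but never clears it */
      long v = strtol(s, &end, 10);
      if (errno != 0 || end == s)
        return -1;                   /* out of range, or no digits consumed */
      *out = v;
      return 0;
    }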
} goto urtunersuccess; } } @@ -4080,7 +4117,7 @@ blosc2_context* blosc2_create_dctx(blosc2_dparams dparams) { char* envvar = getenv("BLOSC_NTHREADS"); if (envvar != NULL) { long nthreads = strtol(envvar, NULL, 10); - if ((nthreads != EINVAL) && (nthreads > 0)) { + if ((errno != EINVAL) && (nthreads > 0)) { context->nthreads = (int16_t) nthreads; } } @@ -4118,8 +4155,9 @@ void blosc2_free_ctx(blosc2_context* context) { #endif } if (context->tuner_params != NULL) { + int rc; if (context->tuner_id < BLOSC_LAST_TUNER && context->tuner_id == BLOSC_STUNE) { - blosc_stune_free(context); + rc = blosc_stune_free(context); } else { for (int i = 0; i < g_ntuners; ++i) { if (g_tuners[i].id == context->tuner_id) { @@ -4129,7 +4167,7 @@ void blosc2_free_ctx(blosc2_context* context) { return; } } - g_tuners[i].free(context); + rc = g_tuners[i].free(context); goto urtunersuccess; } } @@ -4137,6 +4175,10 @@ void blosc2_free_ctx(blosc2_context* context) { return; urtunersuccess:; } + if (rc < 0) { + BLOSC_TRACE_ERROR("Error in user-defined tuner free function\n"); + return; + } } if (context->prefilter != NULL) { my_free(context->preparams); @@ -4218,6 +4260,10 @@ int blosc2_chunk_zeros(blosc2_cparams cparams, const int32_t nbytes, void* dest, blosc_header header; blosc2_context* context = blosc2_create_cctx(cparams); + if (context == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return BLOSC2_ERROR_NULL_POINTER; + } int error = initialize_context_compression( context, NULL, nbytes, dest, destsize, @@ -4261,6 +4307,10 @@ int blosc2_chunk_uninit(blosc2_cparams cparams, const int32_t nbytes, void* dest blosc_header header; blosc2_context* context = blosc2_create_cctx(cparams); + if (context == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return BLOSC2_ERROR_NULL_POINTER; + } int error = initialize_context_compression( context, NULL, nbytes, dest, destsize, context->clevel, context->filters, context->filters_meta, @@ -4303,6 +4353,10 @@ int blosc2_chunk_nans(blosc2_cparams cparams, const int32_t nbytes, void* dest, blosc_header header; blosc2_context* context = blosc2_create_cctx(cparams); + if (context == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return BLOSC2_ERROR_NULL_POINTER; + } int error = initialize_context_compression( context, NULL, nbytes, dest, destsize, @@ -4348,6 +4402,10 @@ int blosc2_chunk_repeatval(blosc2_cparams cparams, const int32_t nbytes, blosc_header header; blosc2_context* context = blosc2_create_cctx(cparams); + if (context == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return BLOSC2_ERROR_NULL_POINTER; + } int error = initialize_context_compression( context, NULL, nbytes, dest, destsize, diff --git a/blosc/frame.c b/blosc/frame.c index e5e0a30a..4808a6cc 100644 --- a/blosc/frame.c +++ b/blosc/frame.c @@ -961,6 +961,10 @@ int64_t frame_from_schunk(blosc2_schunk *schunk, blosc2_frame_s *frame) { // Compress the chunk of offsets off_chunk = malloc(off_nbytes + BLOSC2_MAX_OVERHEAD); blosc2_context *cctx = blosc2_create_cctx(BLOSC2_CPARAMS_DEFAULTS); + if (cctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return BLOSC2_ERROR_NULL_POINTER; + } cctx->typesize = sizeof(int64_t); off_cbytes = blosc2_compress_ctx(cctx, data_tmp, off_nbytes, off_chunk, off_nbytes + BLOSC2_MAX_OVERHEAD); @@ -1191,6 +1195,10 @@ int64_t* blosc2_frame_get_offsets(blosc2_schunk *schunk) { // Decompress offsets blosc2_dparams off_dparams = 
BLOSC2_DPARAMS_DEFAULTS; blosc2_context *dctx = blosc2_create_dctx(off_dparams); + if (dctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the decompression context"); + return NULL; + } int32_t prev_nbytes = blosc2_decompress_ctx(dctx, coffsets, coffsets_cbytes, offsets, off_nbytes); blosc2_free_ctx(dctx); @@ -1635,11 +1643,12 @@ int frame_get_vlmetalayers(blosc2_frame_s* frame, blosc2_schunk* schunk) { char* eframe_name = malloc(strlen(frame->urlpath) + strlen("/chunks.b2frame") + 1); sprintf(eframe_name, "%s/chunks.b2frame", frame->urlpath); fp = io_cb->open(eframe_name, "rb", frame->schunk->storage->io->params); - free(eframe_name); if (fp == NULL) { BLOSC_TRACE_ERROR("Error opening file in: %s", eframe_name); + free(eframe_name); return BLOSC2_ERROR_FILE_OPEN; } + free(eframe_name); io_cb->seek(fp, trailer_offset, SEEK_SET); } else { @@ -1745,9 +1754,17 @@ blosc2_schunk* frame_to_schunk(blosc2_frame_s* frame, bool copy, const blosc2_io blosc2_cparams *cparams; blosc2_schunk_get_cparams(schunk, &cparams); schunk->cctx = blosc2_create_cctx(*cparams); + if (schunk->cctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return NULL; + } blosc2_dparams *dparams; blosc2_schunk_get_dparams(schunk, &dparams); schunk->dctx = blosc2_create_dctx(*dparams); + if (schunk->dctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the decompression context"); + return NULL; + } blosc2_storage storage = {.contiguous = copy ? false : true}; schunk->storage = get_new_storage(&storage, cparams, dparams, udio); free(cparams); @@ -1776,6 +1793,10 @@ blosc2_schunk* frame_to_schunk(blosc2_frame_s* frame, bool copy, const blosc2_io // Decompress offsets blosc2_dparams off_dparams = BLOSC2_DPARAMS_DEFAULTS; blosc2_context *dctx = blosc2_create_dctx(off_dparams); + if (dctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the decompression context"); + return NULL; + } int64_t* offsets = (int64_t *) malloc((size_t)nchunks * sizeof(int64_t)); int32_t off_nbytes = blosc2_decompress_ctx(dctx, coffsets, coffsets_cbytes, offsets, (int32_t)(nchunks * sizeof(int64_t))); @@ -2623,6 +2644,10 @@ void* frame_append_chunk(blosc2_frame_s* frame, void* chunk, blosc2_schunk* schu // Decompress offsets blosc2_dparams off_dparams = BLOSC2_DPARAMS_DEFAULTS; blosc2_context *dctx = blosc2_create_dctx(off_dparams); + if (dctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the decompression context"); + return NULL; + } int32_t prev_nbytes = blosc2_decompress_ctx(dctx, coffsets, coffsets_cbytes, offsets, off_nbytes); blosc2_free_ctx(dctx); @@ -2679,6 +2704,10 @@ void* frame_append_chunk(blosc2_frame_s* frame, void* chunk, blosc2_schunk* schu cparams.nthreads = 4; // 4 threads seems a decent default for nowadays CPUs cparams.compcode = BLOSC_BLOSCLZ; blosc2_context* cctx = blosc2_create_cctx(cparams); + if (cctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return NULL; + } cctx->typesize = sizeof(int64_t); // override a possible BLOSC_TYPESIZE env variable (or chaos may appear) void* off_chunk = malloc((size_t)off_nbytes + BLOSC2_MAX_OVERHEAD); int32_t new_off_cbytes = blosc2_compress_ctx(cctx, offsets, off_nbytes, @@ -2827,6 +2856,10 @@ void* frame_insert_chunk(blosc2_frame_s* frame, int64_t nchunk, void* chunk, blo // Decompress offsets blosc2_dparams off_dparams = BLOSC2_DPARAMS_DEFAULTS; blosc2_context *dctx = blosc2_create_dctx(off_dparams); + if (dctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the decompression context"); + return 
NULL; + } int32_t prev_nbytes = blosc2_decompress_ctx(dctx, coffsets, coffsets_cbytes, offsets, off_nbytes); blosc2_free_ctx(dctx); if (prev_nbytes < 0) { @@ -2888,6 +2921,10 @@ void* frame_insert_chunk(blosc2_frame_s* frame, int64_t nchunk, void* chunk, blo cparams.nthreads = 4; // 4 threads seems a decent default for nowadays CPUs cparams.compcode = BLOSC_BLOSCLZ; blosc2_context* cctx = blosc2_create_cctx(cparams); + if (cctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return NULL; + } void* off_chunk = malloc((size_t)off_nbytes + BLOSC2_MAX_OVERHEAD); int32_t new_off_cbytes = blosc2_compress_ctx(cctx, offsets, off_nbytes, off_chunk, off_nbytes + BLOSC2_MAX_OVERHEAD); @@ -3042,6 +3079,10 @@ void* frame_update_chunk(blosc2_frame_s* frame, int64_t nchunk, void* chunk, blo // Decompress offsets blosc2_dparams off_dparams = BLOSC2_DPARAMS_DEFAULTS; blosc2_context *dctx = blosc2_create_dctx(off_dparams); + if (dctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the decompression context"); + return NULL; + } int32_t prev_nbytes = blosc2_decompress_ctx(dctx, coffsets, coffsets_cbytes, offsets, off_nbytes); blosc2_free_ctx(dctx); if (prev_nbytes < 0) { @@ -3138,6 +3179,10 @@ void* frame_update_chunk(blosc2_frame_s* frame, int64_t nchunk, void* chunk, blo cparams.nthreads = 4; // 4 threads seems a decent default for nowadays CPUs cparams.compcode = BLOSC_BLOSCLZ; blosc2_context* cctx = blosc2_create_cctx(cparams); + if (cctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return NULL; + } void* off_chunk = malloc((size_t)off_nbytes + BLOSC2_MAX_OVERHEAD); int32_t new_off_cbytes = blosc2_compress_ctx(cctx, offsets, off_nbytes, off_chunk, off_nbytes + BLOSC2_MAX_OVERHEAD); @@ -3277,6 +3322,10 @@ void* frame_delete_chunk(blosc2_frame_s* frame, int64_t nchunk, blosc2_schunk* s // Decompress offsets blosc2_dparams off_dparams = BLOSC2_DPARAMS_DEFAULTS; blosc2_context *dctx = blosc2_create_dctx(off_dparams); + if (dctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the decompression context"); + return NULL; + } int32_t prev_nbytes = blosc2_decompress_ctx(dctx, coffsets, coffsets_cbytes, offsets, off_nbytes); blosc2_free_ctx(dctx); if (prev_nbytes < 0) { @@ -3300,6 +3349,10 @@ void* frame_delete_chunk(blosc2_frame_s* frame, int64_t nchunk, blosc2_schunk* s cparams.nthreads = 4; // 4 threads seems a decent default for nowadays CPUs cparams.compcode = BLOSC_BLOSCLZ; blosc2_context* cctx = blosc2_create_cctx(cparams); + if (cctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return NULL; + } void* off_chunk = malloc((size_t)off_nbytes + BLOSC2_MAX_OVERHEAD); int32_t new_off_cbytes = blosc2_compress_ctx(cctx, offsets, off_nbytes - (int32_t)sizeof(int64_t), off_chunk, off_nbytes + BLOSC2_MAX_OVERHEAD); @@ -3435,6 +3488,10 @@ int frame_reorder_offsets(blosc2_frame_s* frame, const int64_t* offsets_order, b // Decompress offsets blosc2_dparams off_dparams = BLOSC2_DPARAMS_DEFAULTS; blosc2_context *dctx = blosc2_create_dctx(off_dparams); + if (dctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the decompression context"); + return BLOSC2_ERROR_NULL_POINTER; + } int32_t prev_nbytes = blosc2_decompress_ctx(dctx, coffsets, coffsets_cbytes, offsets, off_nbytes); blosc2_free_ctx(dctx); @@ -3461,6 +3518,10 @@ int frame_reorder_offsets(blosc2_frame_s* frame, const int64_t* offsets_order, b cparams.nthreads = 4; // 4 threads seems a decent default for nowadays CPUs cparams.compcode = 
BLOSC_BLOSCLZ; blosc2_context* cctx = blosc2_create_cctx(cparams); + if (cctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return BLOSC2_ERROR_NULL_POINTER; + } void* off_chunk = malloc((size_t)off_nbytes + BLOSC2_MAX_OVERHEAD); int32_t new_off_cbytes = blosc2_compress_ctx(cctx, offsets, off_nbytes, off_chunk, off_nbytes + BLOSC2_MAX_OVERHEAD); diff --git a/blosc/schunk.c b/blosc/schunk.c index 4bafddfd..20f0bf33 100644 --- a/blosc/schunk.c +++ b/blosc/schunk.c @@ -73,7 +73,7 @@ int blosc2_schunk_get_dparams(blosc2_schunk *schunk, blosc2_dparams **dparams) { } -void update_schunk_properties(struct blosc2_schunk* schunk) { +int update_schunk_properties(struct blosc2_schunk* schunk) { blosc2_cparams* cparams = schunk->storage->cparams; blosc2_dparams* dparams = schunk->storage->dparams; @@ -99,6 +99,10 @@ void update_schunk_properties(struct blosc2_schunk* schunk) { } cparams->schunk = schunk; schunk->cctx = blosc2_create_cctx(*cparams); + if (schunk->cctx == NULL) { + BLOSC_TRACE_ERROR("Could not create compression ctx"); + return BLOSC2_ERROR_NULL_POINTER; + } /* The decompression context */ if (schunk->dctx != NULL) { @@ -106,6 +110,12 @@ void update_schunk_properties(struct blosc2_schunk* schunk) { } dparams->schunk = schunk; schunk->dctx = blosc2_create_dctx(*dparams); + if (schunk->dctx == NULL) { + BLOSC_TRACE_ERROR("Could not create decompression ctx"); + return BLOSC2_ERROR_NULL_POINTER; + } + + return BLOSC2_ERROR_SUCCESS; } @@ -132,27 +142,10 @@ blosc2_schunk* blosc2_schunk_new(blosc2_storage *storage) { } // ...and update internal properties - update_schunk_properties(schunk); - - if (schunk->cctx->tuner_id < BLOSC_LAST_TUNER && schunk->cctx->tuner_id == BLOSC_STUNE) { - blosc_stune_init(schunk->storage->cparams->tuner_params, schunk->cctx, schunk->dctx); - } else { - for (int i = 0; i < g_ntuners; ++i) { - if (g_tuners[i].id == schunk->cctx->tuner_id) { - if (g_tuners[i].init == NULL) { - if (fill_tuner(&g_tuners[i]) < 0) { - BLOSC_TRACE_ERROR("Could not load tuner %d.", g_tuners[i].id); - return NULL; - } - } - g_tuners[i].init(schunk->storage->cparams->tuner_params, schunk->cctx, schunk->dctx); - goto urtunersuccess; - } - } - BLOSC_TRACE_ERROR("User-defined tuner %d not found\n", schunk->cctx->tuner_id); + if (update_schunk_properties(schunk) < 0) { + BLOSC_TRACE_ERROR("Error when updating schunk properties"); return NULL; } - urtunersuccess:; if (!storage->contiguous && storage->urlpath != NULL){ char* urlpath; @@ -1600,6 +1593,10 @@ int blosc2_vlmeta_add(blosc2_schunk *schunk, const char *name, uint8_t *content, } else { cctx = blosc2_create_cctx(BLOSC2_CPARAMS_DEFAULTS); } + if (cctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return BLOSC2_ERROR_NULL_POINTER; + } int csize = blosc2_compress_ctx(cctx, content, content_len, content_buf, content_len + BLOSC2_MAX_OVERHEAD); if (csize < 0) { @@ -1641,6 +1638,10 @@ int blosc2_vlmeta_get(blosc2_schunk *schunk, const char *name, uint8_t **content *content_len = nbytes; *content = malloc((size_t) nbytes); blosc2_context *dctx = blosc2_create_dctx(*schunk->storage->dparams); + if (dctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the decompression context"); + return BLOSC2_ERROR_NULL_POINTER; + } int nbytes_ = blosc2_decompress_ctx(dctx, meta->content, meta->content_len, *content, nbytes); blosc2_free_ctx(dctx); if (nbytes_ != nbytes) { @@ -1668,6 +1669,10 @@ int blosc2_vlmeta_update(blosc2_schunk *schunk, const char *name, uint8_t *conte } else { 
cctx = blosc2_create_cctx(BLOSC2_CPARAMS_DEFAULTS); } + if (cctx == NULL) { + BLOSC_TRACE_ERROR("Error while creating the compression context"); + return BLOSC2_ERROR_NULL_POINTER; + } int csize = blosc2_compress_ctx(cctx, content, content_len, content_buf, content_len + BLOSC2_MAX_OVERHEAD); if (csize < 0) { diff --git a/blosc/shuffle.c b/blosc/shuffle.c index 2fff396f..c2c2ed60 100644 --- a/blosc/shuffle.c +++ b/blosc/shuffle.c @@ -13,6 +13,10 @@ /* Include hardware-accelerated shuffle/unshuffle routines based on the target architecture. Note that a target architecture may support more than one type of acceleration!*/ +#if defined(SHUFFLE_USE_AVX512) + #include "bitshuffle-avx512.h" +#endif /* defined(SHUFFLE_USE_AVX512) */ + #if defined(SHUFFLE_USE_AVX2) #include "shuffle-avx2.h" #include "bitshuffle-avx2.h" @@ -41,17 +45,15 @@ #include "shuffle-generic.h" #include "bitshuffle-generic.h" -#include "blosc2/blosc2-common.h" #include "blosc2.h" -#include #include -#include #include - -#if !defined(__clang__) && defined(__GNUC__) && defined(__GNUC_MINOR__) && \ - __GNUC__ >= 5 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8) +// __builtin_cpu_supports() fixed in GCC 8: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85100 +// Also, clang added support for it in clang 10 at the very least (and possibly since 3.8) +#if (defined(__clang__) && (__clang_major__ >= 10)) || \ + (defined(__GNUC__) && defined(__GNUC_MINOR__) && __GNUC__ >= 8) #define HAVE_CPU_FEAT_INTRIN #endif @@ -61,8 +63,8 @@ typedef void(* shuffle_func)(const int32_t, const int32_t, const uint8_t*, const typedef void(* unshuffle_func)(const int32_t, const int32_t, const uint8_t*, const uint8_t*); // For bitshuffle, everything is done in terms of size_t and int64_t (return value) // and although this is not strictly necessary for Blosc, it does not hurt either -typedef int64_t(* bitshuffle_func)(void*, void*, const size_t, const size_t, void*); -typedef int64_t(* bitunshuffle_func)(void*, void*, const size_t, const size_t, void*); +typedef int64_t(* bitshuffle_func)(const void*, void*, const size_t, const size_t); +typedef int64_t(* bitunshuffle_func)(const void*, void*, const size_t, const size_t); /* An implementation of shuffle/unshuffle routines. */ typedef struct shuffle_implementation { @@ -83,20 +85,15 @@ typedef enum { BLOSC_HAVE_SSE2 = 1, BLOSC_HAVE_AVX2 = 2, BLOSC_HAVE_NEON = 4, - BLOSC_HAVE_ALTIVEC = 8 + BLOSC_HAVE_ALTIVEC = 8, + BLOSC_HAVE_AVX512 = 16, } blosc_cpu_features; /* Detect hardware and set function pointers to the best shuffle/unshuffle implementations supported by the host processor.
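With __builtin_cpu_supports() re-enabled for GCC >= 8 and Clang >= 10, the compiler-assisted detection path reduces to a couple of builtin calls, as the next hunk shows for AVX512F plus AVX512BW. A minimal sketch under the same HAVE_CPU_FEAT_INTRIN gate (helper name illustrative):

    static int cpu_has_avx512_bitshuffle(void) {
    #if defined(HAVE_CPU_FEAT_INTRIN)
      /* Both AVX512F and AVX512BW are required by the AVX-512 kernels. */
      return __builtin_cpu_supports("avx512f") &&
             __builtin_cpu_supports("avx512bw");
    #else
      return 0;  /* fall back to the manual CPUID/XGETBV probe */
    #endif
    }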
*/ #if defined(SHUFFLE_USE_AVX2) || defined(SHUFFLE_USE_SSE2) /* Intel/i686 */ -/* Disabled the __builtin_cpu_supports() call, as it has issues with - new versions of gcc (like 5.3.1 in forthcoming ubuntu/xenial: - "undefined symbol: __cpu_model" - For a similar report, see: - https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/ZM2L65WIZEEQHHLFERZYD5FAG7QY2OGB/ -*/ -#if defined(HAVE_CPU_FEAT_INTRIN) && 0 +#if defined(HAVE_CPU_FEAT_INTRIN) static blosc_cpu_features blosc_get_cpu_features(void) { blosc_cpu_features cpu_features = BLOSC_HAVE_NOTHING; if (__builtin_cpu_supports("sse2")) { @@ -105,6 +102,9 @@ static blosc_cpu_features blosc_get_cpu_features(void) { if (__builtin_cpu_supports("avx2")) { cpu_features |= BLOSC_HAVE_AVX2; } + if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw")) { + cpu_features |= BLOSC_HAVE_AVX512; + } return cpu_features; } #else @@ -193,10 +193,12 @@ static blosc_cpu_features blosc_get_cpu_features(void) { /* Check for AVX-based features, if the processor supports extended features. */ bool avx2_available = false; + bool avx512f_available = false; bool avx512bw_available = false; if (max_basic_function_id >= 7) { __cpuid(cpu_info, 7); avx2_available = (cpu_info[1] & (1 << 5)) != 0; + avx512f_available = (cpu_info[1] & (1 << 16)) != 0; avx512bw_available = (cpu_info[1] & (1 << 30)) != 0; } @@ -206,13 +208,14 @@ static blosc_cpu_features blosc_get_cpu_features(void) { extended control register XCR0 to see if the CPU features are enabled. */ bool xmm_state_enabled = false; bool ymm_state_enabled = false; - //bool zmm_state_enabled = false; // commented this out for avoiding an 'unused variable' warning + // Silence an unused variable compiler warning + // bool zmm_state_enabled = false; #if defined(_XCR_XFEATURE_ENABLED_MASK) if (xsave_available && xsave_enabled_by_os && ( sse2_available || sse3_available || ssse3_available || sse41_available || sse42_available - || avx2_available || avx512bw_available)) { + || avx2_available || avx512f_available || avx512bw_available)) { /* Determine which register states can be restored by the OS. */ uint64_t xcr0_contents = _xgetbv(_XCR_XFEATURE_ENABLED_MASK); @@ -221,7 +224,7 @@ static blosc_cpu_features blosc_get_cpu_features(void) { /* Require support for both the upper 256-bits of zmm0-zmm15 to be restored as well as all of zmm16-zmm31 and the opmask registers. */ - //zmm_state_enabled = (xcr0_contents & 0x70) == 0x70; + // zmm_state_enabled = (xcr0_contents & 0x70) == 0x70; } #endif /* defined(_XCR_XFEATURE_ENABLED_MASK) */ @@ -233,12 +236,13 @@ static blosc_cpu_features blosc_get_cpu_features(void) { printf("SSE4.1 available: %s\n", sse41_available ? "True" : "False"); printf("SSE4.2 available: %s\n", sse42_available ? "True" : "False"); printf("AVX2 available: %s\n", avx2_available ? "True" : "False"); + printf("AVX512F available: %s\n", avx512f_available ? "True" : "False"); printf("AVX512BW available: %s\n", avx512bw_available ? "True" : "False"); printf("XSAVE available: %s\n", xsave_available ? "True" : "False"); printf("XSAVE enabled: %s\n", xsave_enabled_by_os ? "True" : "False"); printf("XMM state enabled: %s\n", xmm_state_enabled ? "True" : "False"); printf("YMM state enabled: %s\n", ymm_state_enabled ? "True" : "False"); - //printf("ZMM state enabled: %s\n", zmm_state_enabled ? "True" : "False"); + // printf("ZMM state enabled: %s\n", zmm_state_enabled ? 
"True" : "False"); #endif /* defined(BLOSC_DUMP_CPU_INFO) */ /* Using the gathered CPU information, determine which implementation to use. */ @@ -250,6 +254,9 @@ static blosc_cpu_features blosc_get_cpu_features(void) { if (xmm_state_enabled && ymm_state_enabled && avx2_available) { result |= BLOSC_HAVE_AVX2; } + if (xmm_state_enabled && ymm_state_enabled && avx512f_available && avx512bw_available) { + result |= BLOSC_HAVE_AVX512; + } return result; } #endif /* HAVE_CPU_FEAT_INTRIN */ @@ -288,14 +295,26 @@ return BLOSC_HAVE_NOTHING; static shuffle_implementation_t get_shuffle_implementation(void) { blosc_cpu_features cpu_features = blosc_get_cpu_features(); +#if defined(SHUFFLE_USE_AVX512) + if (cpu_features & BLOSC_HAVE_AVX512) { + shuffle_implementation_t impl_avx512; + impl_avx512.name = "avx512"; + impl_avx512.shuffle = (shuffle_func)shuffle_avx2; + impl_avx512.unshuffle = (unshuffle_func)unshuffle_avx2; + impl_avx512.bitshuffle = (bitshuffle_func) bshuf_trans_bit_elem_AVX512; + impl_avx512.bitunshuffle = (bitunshuffle_func)bshuf_untrans_bit_elem_AVX512; + return impl_avx512; + } +#endif /* defined(SHUFFLE_USE_AVX512) */ + #if defined(SHUFFLE_USE_AVX2) if (cpu_features & BLOSC_HAVE_AVX2) { shuffle_implementation_t impl_avx2; impl_avx2.name = "avx2"; impl_avx2.shuffle = (shuffle_func)shuffle_avx2; impl_avx2.unshuffle = (unshuffle_func)unshuffle_avx2; - impl_avx2.bitshuffle = (bitshuffle_func)bshuf_trans_bit_elem_avx2; - impl_avx2.bitunshuffle = (bitunshuffle_func)bshuf_untrans_bit_elem_avx2; + impl_avx2.bitshuffle = (bitshuffle_func) bshuf_trans_bit_elem_AVX; + impl_avx2.bitunshuffle = (bitunshuffle_func)bshuf_untrans_bit_elem_AVX; return impl_avx2; } #endif /* defined(SHUFFLE_USE_AVX2) */ @@ -306,8 +325,8 @@ static shuffle_implementation_t get_shuffle_implementation(void) { impl_sse2.name = "sse2"; impl_sse2.shuffle = (shuffle_func)shuffle_sse2; impl_sse2.unshuffle = (unshuffle_func)unshuffle_sse2; - impl_sse2.bitshuffle = (bitshuffle_func)bshuf_trans_bit_elem_sse2; - impl_sse2.bitunshuffle = (bitunshuffle_func)bshuf_untrans_bit_elem_sse2; + impl_sse2.bitshuffle = (bitshuffle_func)bshuf_trans_bit_elem_SSE; + impl_sse2.bitunshuffle = (bitunshuffle_func) bshuf_untrans_bit_elem_SSE; return impl_sse2; } #endif /* defined(SHUFFLE_USE_SSE2) */ @@ -320,12 +339,11 @@ static shuffle_implementation_t get_shuffle_implementation(void) { impl_neon.unshuffle = (unshuffle_func)unshuffle_neon; //impl_neon.shuffle = (shuffle_func)shuffle_generic; //impl_neon.unshuffle = (unshuffle_func)unshuffle_generic; - //impl_neon.bitshuffle = (bitshuffle_func)bitshuffle_neon; - //impl_neon.bitunshuffle = (bitunshuffle_func)bitunshuffle_neon; + //impl_neon.bitshuffle = (bitshuffle_func)bshuf_trans_bit_elem_NEON; + //impl_neon.bitunshuffle = (bitunshuffle_func)bshuf_untrans_bit_elem_NEON; // The current bitshuffle optimized for NEON is not any faster // (in fact, it is pretty much slower) than the scalar implementation. - // Also, bitshuffle_neon (forward direction) is broken for 1, 2 and 4 bytes. - // So, let's use the the scalar one, which is pretty fast, at least on a M1 CPU. + // So, let's use the scalar one, which is pretty fast, at least on a M1 CPU. impl_neon.bitshuffle = (bitshuffle_func)bshuf_trans_bit_elem_scal; impl_neon.bitunshuffle = (bitunshuffle_func)bshuf_untrans_bit_elem_scal; return impl_neon; @@ -405,7 +423,7 @@ shuffle(const int32_t bytesoftype, const int32_t blocksize, init_shuffle_implementation(); /* The implementation is initialized. - Dispatch to it's shuffle routine. 
*/ + Dispatch to its shuffle routine. */ (host_implementation.shuffle)(bytesoftype, blocksize, _src, _dest); } @@ -426,15 +444,14 @@ unshuffle(const int32_t bytesoftype, const int32_t blocksize, hardware-accelerated routine at run-time. */ int32_t bitshuffle(const int32_t bytesoftype, const int32_t blocksize, - const uint8_t *_src, const uint8_t *_dest, - const uint8_t *_tmp) { + const uint8_t *_src, const uint8_t *_dest) { /* Initialize the shuffle implementation if necessary. */ init_shuffle_implementation(); size_t size = blocksize / bytesoftype; /* bitshuffle only supports a number of elements that is a multiple of 8. */ size -= size % 8; - int ret = (int) (host_implementation.bitshuffle)((void *) _src, (void *) _dest, - size, bytesoftype, (void *) _tmp); + int ret = (int) (host_implementation.bitshuffle) + ((const void *) _src, (void *) _dest, size, bytesoftype); if (ret < 0) { // Some error in bitshuffle (should not happen) BLOSC_TRACE_ERROR("the impossible happened: the bitshuffle filter failed!"); @@ -452,7 +469,7 @@ bitshuffle(const int32_t bytesoftype, const int32_t blocksize, hardware-accelerated routine at run-time. */ int32_t bitunshuffle(const int32_t bytesoftype, const int32_t blocksize, const uint8_t *_src, const uint8_t *_dest, - const uint8_t *_tmp, const uint8_t format_version) { + const uint8_t format_version) { /* Initialize the shuffle implementation if necessary. */ init_shuffle_implementation(); size_t size = blocksize / bytesoftype; @@ -462,9 +479,8 @@ int32_t bitunshuffle(const int32_t bytesoftype, const int32_t blocksize, if ((size % 8) == 0) { /* The number of elems is a multiple of 8 which is supported by bitshuffle. */ - int ret = (int) (host_implementation.bitunshuffle)((void *) _src, (void *) _dest, - blocksize / bytesoftype, - bytesoftype, (void *) _tmp); + int ret = (int) (host_implementation.bitunshuffle) + ((const void *) _src, (void *) _dest, blocksize / bytesoftype, bytesoftype); if (ret < 0) { // Some error in bitshuffle (should not happen) BLOSC_TRACE_ERROR("the impossible happened: the bitunshuffle filter failed!"); @@ -481,8 +497,8 @@ int32_t bitunshuffle(const int32_t bytesoftype, const int32_t blocksize, else { /* bitshuffle only supports a number of bytes that is a multiple of 8. */ size -= size % 8; - int ret = (int) (host_implementation.bitunshuffle)((void *) _src, (void *) _dest, - size, bytesoftype, (void *) _tmp); + int ret = (int) (host_implementation.bitunshuffle) + ((const void *) _src, (void *) _dest, size, bytesoftype); if (ret < 0) { BLOSC_TRACE_ERROR("the impossible happened: the bitunshuffle filter failed!"); return ret; diff --git a/blosc/shuffle.h b/blosc/shuffle.h index 24784eaa..3421e524 100644 --- a/blosc/shuffle.h +++ b/blosc/shuffle.h @@ -26,6 +26,10 @@ /* Toggle hardware-accelerated routines based on SHUFFLE_*_ENABLED macros and availability on the target architecture. */ +#if defined(SHUFFLE_AVX512_ENABLED) && defined(__AVX512F__) && defined (__AVX512BW__) +#define SHUFFLE_USE_AVX512 +#endif + #if defined(SHUFFLE_AVX2_ENABLED) && defined(__AVX2__) #define SHUFFLE_USE_AVX2 #endif @@ -58,8 +62,7 @@ BLOSC_NO_EXPORT void BLOSC_NO_EXPORT int32_t bitshuffle(const int32_t bytesoftype, const int32_t blocksize, - const uint8_t *_src, const uint8_t *_dest, - const uint8_t *_tmp); + const uint8_t *_src, const uint8_t *_dest); /** Primary unshuffle and bitunshuffle routine. 
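An aside on how the pieces above fit together: the dispatcher only selects the AVX512 kernels when both the build-time gate (SHUFFLE_AVX512_ENABLED plus __AVX512F__/__AVX512BW__) and the run-time CPU check pass. A standalone probe of the same run-time predicate, using the GCC/Clang builtins from the re-enabled HAVE_CPU_FEAT_INTRIN path (illustrative only, not part of the patch), could look like:

    #include <stdio.h>

    int main(void) {
    #if defined(__GNUC__) || defined(__clang__)
      /* Same predicate as the patched blosc_get_cpu_features(): both
         AVX512F and AVX512BW must be reported for the AVX512 kernels. */
      if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw"))
        printf("AVX512 bitshuffle path eligible\n");
      else
        printf("would fall back to AVX2/SSE2/scalar\n");
    #endif
      return 0;
    }
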
@@ -79,6 +82,6 @@ BLOSC_NO_EXPORT void BLOSC_NO_EXPORT int32_t bitunshuffle(const int32_t bytesoftype, const int32_t blocksize, const uint8_t *_src, const uint8_t *_dest, - const uint8_t *_tmp, const uint8_t format_version); + const uint8_t format_version); #endif /* BLOSC_SHUFFLE_H */ diff --git a/blosc/stune.c b/blosc/stune.c index 102c1999..4a8b2f3a 100644 --- a/blosc/stune.c +++ b/blosc/stune.c @@ -34,14 +34,16 @@ static bool is_HCR(blosc2_context * context) { } } -void blosc_stune_init(void * config, blosc2_context* cctx, blosc2_context* dctx) { +int blosc_stune_init(void * config, blosc2_context* cctx, blosc2_context* dctx) { BLOSC_UNUSED_PARAM(config); BLOSC_UNUSED_PARAM(cctx); BLOSC_UNUSED_PARAM(dctx); + + return BLOSC2_ERROR_SUCCESS; } // Set the automatic blocksize 0 to its real value -void blosc_stune_next_blocksize(blosc2_context *context) { +int blosc_stune_next_blocksize(blosc2_context *context) { int32_t clevel = context->clevel; int32_t typesize = context->typesize; int32_t nbytes = context->sourcesize; @@ -51,7 +53,7 @@ void blosc_stune_next_blocksize(blosc2_context *context) { // Protection against very small buffers if (nbytes < typesize) { context->blocksize = 1; - return; + return BLOSC2_ERROR_SUCCESS; } if (user_blocksize) { @@ -106,7 +108,8 @@ void blosc_stune_next_blocksize(blosc2_context *context) { } /* Now the blocksize for splittable codecs */ - if (clevel > 0 && split_block(context, typesize, blocksize)) { + int splitmode = split_block(context, typesize, blocksize); + if (clevel > 0 && splitmode) { // For performance reasons, do not exceed 256 KB (it must fit in L2 cache) switch (clevel) { case 1: @@ -120,6 +123,8 @@ blocksize = 64 * 1024; break; case 7: + blocksize = 128 * 1024; + break; case 8: blocksize = 256 * 1024; break; @@ -152,19 +157,29 @@ } context->blocksize = blocksize; + BLOSC_INFO("compcode: %d, clevel: %d, blocksize: %d, splitmode: %d, typesize: %d", + context->compcode, context->clevel, blocksize, splitmode, typesize); + + return BLOSC2_ERROR_SUCCESS; } -void blosc_stune_next_cparams(blosc2_context * context) { - BLOSC_UNUSED_PARAM(context); +int blosc_stune_next_cparams(blosc2_context * context) { + BLOSC_UNUSED_PARAM(context); + + return BLOSC2_ERROR_SUCCESS; } -void blosc_stune_update(blosc2_context * context, double ctime) { - BLOSC_UNUSED_PARAM(context); - BLOSC_UNUSED_PARAM(ctime); +int blosc_stune_update(blosc2_context * context, double ctime) { + BLOSC_UNUSED_PARAM(context); + BLOSC_UNUSED_PARAM(ctime); + + return BLOSC2_ERROR_SUCCESS; } -void blosc_stune_free(blosc2_context * context) { - BLOSC_UNUSED_PARAM(context); +int blosc_stune_free(blosc2_context * context) { + BLOSC_UNUSED_PARAM(context); + + return BLOSC2_ERROR_SUCCESS; } int split_block(blosc2_context *context, int32_t typesize, int32_t blocksize) { diff --git a/blosc/stune.h b/blosc/stune.h index 26400e4b..c4ac1485 100644 --- a/blosc/stune.h +++ b/blosc/stune.h @@ -25,15 +25,15 @@ #define BLOSC_STUNE 0 -void blosc_stune_init(void * config, blosc2_context* cctx, blosc2_context* dctx); +int blosc_stune_init(void * config, blosc2_context* cctx, blosc2_context* dctx); -void blosc_stune_next_blocksize(blosc2_context * context); +int blosc_stune_next_blocksize(blosc2_context * context); -void blosc_stune_next_cparams(blosc2_context * context); +int blosc_stune_next_cparams(blosc2_context * context); -void blosc_stune_update(blosc2_context * context, double ctime); 
+int blosc_stune_update(blosc2_context * context, double ctime); -void blosc_stune_free(blosc2_context * context); +int blosc_stune_free(blosc2_context * context); /* Conditions for splitting a block before compressing with a codec. */ int split_block(blosc2_context *context, int32_t typesize, int32_t blocksize); diff --git a/doc/reference/b2nd.rst b/doc/reference/b2nd.rst index 79219cea..e247e344 100644 --- a/doc/reference/b2nd.rst +++ b/doc/reference/b2nd.rst @@ -101,3 +101,10 @@ Destruction +++++++++++ .. doxygenfunction:: b2nd_free + +Utilities +--------- + +These functions may be used for working with plain C buffers representing multidimensional arrays. + +.. doxygenfunction:: b2nd_copy_buffer diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt index a8d5defb..385030f7 100644 --- a/examples/CMakeLists.txt +++ b/examples/CMakeLists.txt @@ -1,7 +1,7 @@ # Examples with correspondingly named source files set(EXAMPLES contexts instrument_codec delta_schunk_ex multithread simple frame_metalayers noinit find_roots schunk_simple frame_simple schunk_postfilter urcodecs urfilters frame_vlmetalayers - sframe_simple frame_backed_schunk compress_file frame_offset frame_roundtrip get_set_slice) + sframe_simple frame_backed_schunk compress_file frame_offset frame_roundtrip get_set_slice get_blocksize) add_subdirectory(b2nd) diff --git a/examples/README.rst b/examples/README.rst index d6646154..385960aa 100644 --- a/examples/README.rst +++ b/examples/README.rst @@ -5,8 +5,8 @@ In this directory you can find a series of examples on how to link your apps with the Blosc library. A few of them are: * simple.c -- The simplest way to add Blosc to your app -* simple_schunk.c -- Adding the more powerful super-chunk into the equation -* simple_frame.c -- Use a frame to serialize Blosc2 super-chunks +* schunk_simple.c -- Adding the more powerful super-chunk into the equation +* frame_simple.c -- Use a frame to serialize Blosc2 super-chunks +* compress_file.c -- Compress a file into a Blosc2 file-frame For more info, please visit the `official API documentation diff --git a/examples/get_blocksize.c b/examples/get_blocksize.c new file mode 100644 index 00000000..88c29168 --- /dev/null +++ b/examples/get_blocksize.c @@ -0,0 +1,72 @@ +/* + Copyright (c) 2021 The Blosc Development Team + https://blosc.org + License: BSD 3-Clause (see LICENSE.txt) + + Example program demonstrating the use of Blosc from C code. 
+ + To compile this program: + + $ gcc -O get_blocksize.c -o get_blocksize -lblosc2 + + To run: + + $ ./get_blocksize + Blosc version info: 2.10.3.dev ($Date:: 2023-08-19 #$) + Compression: 10000000 -> 32 (312500.0x) + osize, csize, blocksize: 10000000, 32, 16384 + Compression: 10000000 -> 32 (312500.0x) + osize, csize, blocksize: 10000000, 32, 131072 + Compression: 10000000 -> 32 (312500.0x) + osize, csize, blocksize: 10000000, 32, 65536 + Compression: 10000000 -> 32 (312500.0x) + osize, csize, blocksize: 10000000, 32, 131072 + Compression: 10000000 -> 32 (312500.0x) + osize, csize, blocksize: 10000000, 32, 262144 + Compression: 10000000 -> 32 (312500.0x) + osize, csize, blocksize: 10000000, 32, 262144 + Compression: 10000000 -> 32 (312500.0x) + osize, csize, blocksize: 10000000, 32, 524288 + Compression: 10000000 -> 32 (312500.0x) + osize, csize, blocksize: 10000000, 32, 1048576 + Compression: 10000000 -> 32 (312500.0x) + osize, csize, blocksize: 10000000, 32, 524288 + Compression: 10000000 -> 32 (312500.0x) + osize, csize, blocksize: 10000000, 32, 2097152 + + Process finished with exit code 0 + +*/ + +#include <stdio.h> +#include "blosc2.h" + + +int main(void) { + blosc2_init(); + + static uint8_t data_dest[BLOSC2_MAX_OVERHEAD]; + blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS; + cparams.typesize = sizeof(float); + cparams.compcode = BLOSC_ZSTD; + + printf("Blosc version info: %s (%s)\n", + BLOSC2_VERSION_STRING, BLOSC2_VERSION_DATE); + + /* Do the actual compression */ + for (int clevel=0; clevel < 10; clevel++) { + cparams.clevel = clevel; + cparams.splitmode = clevel % 2; + int isize = 10 * 1000 * 1000; + int osize, csize, blocksize; + csize = blosc2_chunk_zeros(cparams, isize, data_dest, BLOSC2_MAX_OVERHEAD); + printf("Compression: %d -> %d (%.1fx)\n", isize, csize, (1. * isize) / csize); + + BLOSC_ERROR(blosc2_cbuffer_sizes(data_dest, &osize, &csize, &blocksize)); + printf("osize, csize, blocksize: %d, %d, %d\n", osize, csize, blocksize); + } + + blosc2_destroy(); + + return 0; +} diff --git a/guix.scm b/guix.scm new file mode 120000 index 00000000..1fcd3bb9 --- /dev/null +++ b/guix.scm @@ -0,0 +1 @@ +.guix/modules/c-blosc2-package.scm \ No newline at end of file diff --git a/include/b2nd.h b/include/b2nd.h index 0e6a43aa..769819f1 100644 --- a/include/b2nd.h +++ b/include/b2nd.h @@ -599,6 +599,38 @@ static inline int b2nd_deserialize_meta( } + +// Utilities for C buffers representing multidimensional arrays + +/** + * @brief Copy a slice of a source array into another array. The arrays have + * the same number of dimensions (though their shapes may differ), the same + * item size, and they are stored as C buffers with contiguous data (any + * padding is considered part of the array). + * + * @param ndim The number of dimensions in both arrays. + * @param itemsize The size of the individual data item in both arrays. + * @param src The buffer for getting the data from the source array. + * @param src_pad_shape The shape of the source array, including padding. + * @param src_start The source coordinates where the slice will begin. + * @param src_stop The source coordinates where the slice will end. + * @param dst The buffer for setting the data into the destination array. + * @param dst_pad_shape The shape of the destination array, including padding. + * @param dst_start The destination coordinates where the slice will be placed. + * + * @return An error code. 
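To make these parameters concrete, here is a short usage sketch (illustrative only, not part of the header; it mirrors the new tests/b2nd/test_b2nd_copy_buffer.c further down and assumes src_stop is an exclusive bound, as that test suggests):

    #include <stdint.h>
    #include <stdio.h>
    #include "blosc2.h"
    #include "b2nd.h"

    int main(void) {
      blosc2_init();

      /* 4x4 row-major source; copy its central 2x2 block into dst. */
      const uint8_t src[] = { 0,  1,  2,  3,
                              4,  5,  6,  7,
                              8,  9, 10, 11,
                             12, 13, 14, 15};
      uint8_t dst[2 * 2] = {0};

      const int64_t src_pad_shape[] = {4, 4};
      const int64_t src_start[] = {1, 1};  /* slice begins at (1, 1)... */
      const int64_t src_stop[] = {3, 3};   /* ...and stops (exclusively) at (3, 3) */
      const int64_t dst_pad_shape[] = {2, 2};
      const int64_t dst_start[] = {0, 0};

      int rc = b2nd_copy_buffer(2, sizeof(uint8_t),
                                src, src_pad_shape, src_start, src_stop,
                                dst, dst_pad_shape, dst_start);
      if (rc < 0) {
        fprintf(stderr, "b2nd_copy_buffer failed: %d\n", rc);
        return 1;
      }
      printf("%d %d %d %d\n", dst[0], dst[1], dst[2], dst[3]);  /* expected: 5 6 9 10 */

      blosc2_destroy();
      return 0;
    }
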
+ * + * @note Please make sure that slice boundaries fit within the source and + * destination arrays before using this function, as it does not perform these + * checks itself. + */ +BLOSC_EXPORT int b2nd_copy_buffer(int8_t ndim, + uint8_t itemsize, + const void *src, const int64_t *src_pad_shape, + const int64_t *src_start, const int64_t *src_stop, + void *dst, const int64_t *dst_pad_shape, + const int64_t *dst_start); + + #ifdef __cplusplus } #endif diff --git a/include/blosc2.h b/include/blosc2.h index 5074e2ca..56d29b38 100644 --- a/include/blosc2.h +++ b/include/blosc2.h @@ -82,11 +82,11 @@ extern "C" { /* Version numbers */ #define BLOSC2_VERSION_MAJOR 2 /* for major interface/format changes */ -#define BLOSC2_VERSION_MINOR 10 /* for minor interface/format changes */ -#define BLOSC2_VERSION_RELEASE 2 /* for tweaks, bug-fixes, or development */ +#define BLOSC2_VERSION_MINOR 11 /* for minor interface/format changes */ +#define BLOSC2_VERSION_RELEASE 1 /* for tweaks, bug-fixes, or development */ -#define BLOSC2_VERSION_STRING "2.10.2" /* string version. Sync with above! */ -#define BLOSC2_VERSION_DATE "$Date:: 2023-08-19 #$" /* date version */ +#define BLOSC2_VERSION_STRING "2.11.1" /* string version. Sync with above! */ +#define BLOSC2_VERSION_DATE "$Date:: 2023-11-05 #$" /* date version */ /* The maximum number of dimensions for Blosc2 NDim arrays */ @@ -121,6 +121,13 @@ extern "C" { } \ } while (0) +#define BLOSC_INFO(msg, ...) \ + do { \ + const char *__e = getenv("BLOSC_INFO"); \ + if (!__e) { break; } \ + fprintf(stderr, "[INFO] - " msg "\n", ##__VA_ARGS__); \ + } while(0) + /* The VERSION_FORMAT symbols below should be just 1-byte long */ enum { @@ -287,7 +294,7 @@ enum { BLOSC2_GLOBAL_REGISTERED_CODECS_START = 32, BLOSC2_GLOBAL_REGISTERED_CODECS_STOP = 159, //!< Blosc-registered codecs must be between 31 - 159. - BLOSC2_GLOBAL_REGISTERED_CODECS = 1, + BLOSC2_GLOBAL_REGISTERED_CODECS = 5, //!< Number of Blosc-registered codecs at the moment. BLOSC2_USER_REGISTERED_CODECS_START = 160, BLOSC2_USER_REGISTERED_CODECS_STOP = 255, @@ -462,6 +469,7 @@ enum { BLOSC2_ERROR_INVALID_INDEX = -33, //!< Invalid index BLOSC2_ERROR_METALAYER_NOT_FOUND = -34, //!< Metalayer has not been found BLOSC2_ERROR_MAX_BUFSIZE_EXCEEDED = -35, //!< Max buffer size exceeded + BLOSC2_ERROR_TUNER = -36, //!< Tuner failure }; @@ -1077,15 +1085,15 @@ BLOSC_EXPORT blosc2_io_cb *blosc2_get_io_cb(uint8_t id); typedef struct blosc2_context_s blosc2_context; /* opaque type */ typedef struct { - void (*init)(void * config, blosc2_context* cctx, blosc2_context* dctx); + int (*init)(void * config, blosc2_context* cctx, blosc2_context* dctx); //!< Initialize tuner. Keep in mind dctx may be NULL. This should memcpy the cctx->tuner_params. - void (*next_blocksize)(blosc2_context * context); + int (*next_blocksize)(blosc2_context * context); //!< Only compute the next blocksize. Only it is executed if tuner is not initialized. - void (*next_cparams)(blosc2_context * context); + int (*next_cparams)(blosc2_context * context); //!< Compute the next cparams. Only is executed if tuner is initialized. - void (*update)(blosc2_context * context, double ctime); + int (*update)(blosc2_context * context, double ctime); //!< Update the tuner parameters. - void (*free)(blosc2_context * context); + int (*free)(blosc2_context * context); //!< Free the tuner. 
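Since the tuner entry points now return a status instead of void, a failing callback can surface as the new BLOSC2_ERROR_TUNER code. A minimal conforming set of callbacks (a sketch only; the mytuner_* names are illustrative, while the signatures are the ones from the stune.h hunk above) would be:

    #include "blosc2.h"

    static int mytuner_init(void *config, blosc2_context *cctx, blosc2_context *dctx) {
      (void)config; (void)cctx; (void)dctx;  /* dctx may be NULL, per the docs above */
      return BLOSC2_ERROR_SUCCESS;
    }

    static int mytuner_next_blocksize(blosc2_context *context) {
      (void)context;  /* a real tuner would pick a blocksize here, as blosc_stune_next_blocksize does */
      return BLOSC2_ERROR_SUCCESS;
    }

    static int mytuner_next_cparams(blosc2_context *context) {
      (void)context;
      return BLOSC2_ERROR_SUCCESS;
    }

    static int mytuner_update(blosc2_context *context, double ctime) {
      (void)context; (void)ctime;
      return BLOSC2_ERROR_SUCCESS;
    }

    static int mytuner_free(blosc2_context *context) {
      (void)context;
      return BLOSC2_ERROR_SUCCESS;
    }

Returning a negative value from any of these lets callers detect a misbehaving tuner instead of failing silently.
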
int id; //!< The tuner id diff --git a/include/blosc2/blosc2-common.h b/include/blosc2/blosc2-common.h index b3e34850..e03042e7 100644 --- a/include/blosc2/blosc2-common.h +++ b/include/blosc2/blosc2-common.h @@ -73,7 +73,7 @@ #if defined(__SSE2__) #include <emmintrin.h> #endif -#if defined(__AVX2__) +#if defined(__AVX2__) || defined(__AVX512F__) || defined (__AVX512BW__) #include <immintrin.h> #endif diff --git a/include/blosc2/blosc2-export.h b/include/blosc2/blosc2-export.h index 303502eb..99eccbe3 100644 --- a/include/blosc2/blosc2-export.h +++ b/include/blosc2/blosc2-export.h @@ -33,11 +33,11 @@ #define BLOSC_EXPORT #endif /* defined(BLOSC_SHARED_LIBRARY) */ -#if defined(__GNUC__) || defined(__clang__) +#if (defined(__GNUC__) || defined(__clang__)) && !defined(__MINGW32__) #define BLOSC_NO_EXPORT __attribute__((visibility("hidden"))) #else #define BLOSC_NO_EXPORT -#endif /* defined(__GNUC__) || defined(__clang__) */ +#endif /* (defined(__GNUC__) || defined(__clang__)) && !defined(__MINGW32__) */ /* When testing, export everything to make it easier to implement tests. */ #if defined(BLOSC_TESTING) diff --git a/include/blosc2/codecs-registry.h b/include/blosc2/codecs-registry.h index f772a0a1..058fe744 100644 --- a/include/blosc2/codecs-registry.h +++ b/include/blosc2/codecs-registry.h @@ -22,6 +22,7 @@ enum { BLOSC_CODEC_ZFP_FIXED_ACCURACY = 33, BLOSC_CODEC_ZFP_FIXED_PRECISION = 34, BLOSC_CODEC_ZFP_FIXED_RATE = 35, + BLOSC_CODEC_OPENHTJ2K = 36, }; void register_codecs(void); diff --git a/plugins/codecs/codecs-registry.c b/plugins/codecs/codecs-registry.c index def6452d..7fdccba3 100644 --- a/plugins/codecs/codecs-registry.c +++ b/plugins/codecs/codecs-registry.c @@ -47,4 +47,13 @@ void register_codecs(void) { zfp_rate.decoder = &zfp_rate_decompress; zfp_rate.compname = "zfp_rate"; register_codec_private(&zfp_rate); + + blosc2_codec openhtj2k; + openhtj2k.compcode = BLOSC_CODEC_OPENHTJ2K; + openhtj2k.version = 1; + openhtj2k.complib = BLOSC_CODEC_OPENHTJ2K; + openhtj2k.encoder = NULL; + openhtj2k.decoder = NULL; + openhtj2k.compname = "openhtj2k"; + register_codec_private(&openhtj2k); } diff --git a/plugins/codecs/ndlz/test_ndlz.c b/plugins/codecs/ndlz/test_ndlz.c index 64e378d1..fb2e727c 100644 --- a/plugins/codecs/ndlz/test_ndlz.c +++ b/plugins/codecs/ndlz/test_ndlz.c @@ -309,7 +309,7 @@ int some_matches() { int main(void) { int result; - blosc2_init(); // this is mandatory for initiallizing the plugin mechanism + blosc2_init(); // this is mandatory for initializing the plugin mechanism result = rand_(); printf("rand: %d obtained \n \n", result); if (result < 0) diff --git a/plugins/codecs/zfp/test_zfp_acc_float.c b/plugins/codecs/zfp/test_zfp_acc_float.c index 0ade56bb..d2f3c533 100644 --- a/plugins/codecs/zfp/test_zfp_acc_float.c +++ b/plugins/codecs/zfp/test_zfp_acc_float.c @@ -286,7 +286,7 @@ int item_prices() { int main(void) { int result; - blosc2_init(); // this is mandatory for initiallizing the plugin mechanism + blosc2_init(); // this is mandatory for initializing the plugin mechanism result = float_cyclic(); printf("float_cyclic: %d obtained \n \n", result); if (result < 0) diff --git a/plugins/codecs/zfp/test_zfp_prec_float.c b/plugins/codecs/zfp/test_zfp_prec_float.c index 22e4ce6a..a80f74b3 100644 --- a/plugins/codecs/zfp/test_zfp_prec_float.c +++ b/plugins/codecs/zfp/test_zfp_prec_float.c @@ -298,7 +298,7 @@ int item_prices() { int main(void) { int result; - blosc2_init(); // this is mandatory for initiallizing the plugin mechanism + blosc2_init(); // this is mandatory for initializing the plugin 
mechanism result = float_cyclic(); printf("float_cyclic: %d obtained \n \n", result); if (result < 0) diff --git a/plugins/codecs/zfp/test_zfp_rate_float.c b/plugins/codecs/zfp/test_zfp_rate_float.c index de99cc76..0d8c535d 100644 --- a/plugins/codecs/zfp/test_zfp_rate_float.c +++ b/plugins/codecs/zfp/test_zfp_rate_float.c @@ -309,7 +309,7 @@ int item_prices() { int main(void) { int result; - blosc2_init(); // this is mandatory for initiallizing the plugin mechanism + blosc2_init(); // this is mandatory for initializing the plugin mechanism result = float_cyclic(); printf("float_cyclic: %d obtained \n \n", result); if (result <= 0) diff --git a/plugins/codecs/zfp/test_zfp_rate_getitem.c b/plugins/codecs/zfp/test_zfp_rate_getitem.c index a628d31a..534de814 100644 --- a/plugins/codecs/zfp/test_zfp_rate_getitem.c +++ b/plugins/codecs/zfp/test_zfp_rate_getitem.c @@ -334,7 +334,7 @@ int item_prices() { int main(void) { int result; - blosc2_init(); // this is mandatory for initiallizing the plugin mechanism + blosc2_init(); // this is mandatory for initializing the plugin mechanism printf("float_cyclic: "); result = float_cyclic(); if (result < 0) diff --git a/tests/b2nd/test_b2nd_copy_buffer.c b/tests/b2nd/test_b2nd_copy_buffer.c new file mode 100644 index 00000000..c0e9af8f --- /dev/null +++ b/tests/b2nd/test_b2nd_copy_buffer.c @@ -0,0 +1,76 @@ +/********************************************************************* + Blosc - Blocked Shuffling and Compression Library + + Copyright (c) 2023 The Blosc Development Team + https://blosc.org + License: BSD 3-Clause (see LICENSE.txt) + + See LICENSE.txt for details about copyright and rights to use. +**********************************************************************/ + + +#include "test_common.h" + + +const int64_t result_length = 2 * 2 * 2; +const uint8_t result[] = {0, 1, + 2, 3, + + 4, 5, + 6, 7}; + + +CUTEST_TEST_SETUP(copy_buffer) { + blosc2_init(); +} + +CUTEST_TEST_TEST(copy_buffer) { + const int8_t ndim = 3; + const uint8_t itemsize = sizeof(uint8_t); + + const int64_t chunk_shape[] = {3, 3, 1}; + + const uint8_t chunk0x[] = {0, 0, 0, + 0, 0, 2, + 0, 4, 6}; + const int64_t chunk0s_start[] = {1, 1, 0}; + const int64_t chunk0s_stop[] = {3, 3, 1}; + const int64_t chunk0s_dest[] = {0, 0, 0}; + + const uint8_t chunk1x[] = {1, 3, 0, + 5, 7, 0, + 0, 0, 0}; + const int64_t chunk1s_start[] = {0, 0, 0}; + const int64_t chunk1s_stop[] = {2, 2, 1}; + const int64_t chunk1s_dest[] = {0, 0, 1}; + + uint8_t dest[] = {0, 0, + 0, 0, + + 0, 0, + 0, 0}; + const int64_t dest_shape[] = {2, 2, 2}; + + B2ND_TEST_ASSERT(b2nd_copy_buffer(ndim, itemsize, + chunk0x, chunk_shape, chunk0s_start, chunk0s_stop, + dest, dest_shape, chunk0s_dest)); + B2ND_TEST_ASSERT(b2nd_copy_buffer(ndim, itemsize, + chunk1x, chunk_shape, chunk1s_start, chunk1s_stop, + dest, dest_shape, chunk1s_dest)); + + for (int i = 0; i < result_length; ++i) { + uint8_t a = dest[i]; + uint8_t b = result[i]; + CUTEST_ASSERT("Elements are not equal!", a == b); + } + + return 0; +} + +CUTEST_TEST_TEARDOWN(copy_buffer) { + blosc2_destroy(); +} + +int main() { + CUTEST_TEST_RUN(copy_buffer); +} diff --git a/tests/test_contexts.c b/tests/test_contexts.c index 23d572d3..1059a75b 100644 --- a/tests/test_contexts.c +++ b/tests/test_contexts.c @@ -79,7 +79,7 @@ int main(void) { dctx = blosc2_create_dctx(dparams); blosc2_dparams dparams2 = {0}; - blosc2_ctx_get_dparams(cctx, &dparams2); + blosc2_ctx_get_dparams(dctx, &dparams2); if (dparams2.nthreads != dparams.nthreads) { printf("Nthreads are not equal!"); diff 
--git a/tests/test_nthreads.c b/tests/test_nthreads.c index 03c06e3f..db11fc75 100644 --- a/tests/test_nthreads.c +++ b/tests/test_nthreads.c @@ -69,10 +69,57 @@ static char *test_compress_decompress(void) { return 0; } +/* Check nthreads limits */ +static char *test_nthreads_limits(void) { + /* Get a compressed buffer */ + cbytes = blosc1_compress(clevel, doshuffle, typesize, size, src, + dest, size + BLOSC2_MAX_OVERHEAD); + mu_assert("ERROR: cbytes is not correct", cbytes < (int)size); + + int16_t nthreads = blosc2_set_nthreads((int16_t) (INT16_MAX + 1)); + mu_assert("ERROR: nthreads incorrect (1)", nthreads < 0); + /* Decompress the buffer */ + nbytes = blosc1_decompress(dest, dest2, size); + mu_assert("ERROR: nbytes incorrect(>=0)", nbytes < 0); + + nthreads = blosc2_set_nthreads(0); + mu_assert("ERROR: nthreads incorrect (2)", nthreads < 0); + /* Decompress the buffer */ + nbytes = blosc1_decompress(dest, dest2, size); + mu_assert("ERROR: nbytes incorrect(>=0)", nbytes < 0); + + return 0; +} + +/* Check nthreads limits */ +static char *test_nthreads_limits_envvar(void) { + /* Get a compressed buffer */ + cbytes = blosc1_compress(clevel, doshuffle, typesize, size, src, + dest, size + BLOSC2_MAX_OVERHEAD); + mu_assert("ERROR: cbytes is not correct", cbytes < (int)size); + + char strval[10]; + sprintf(strval, "%d", INT16_MAX + 1); + setenv("BLOSC_NTHREADS", strval, 1); + /* Decompress the buffer */ + nbytes = blosc1_decompress(dest, dest2, size); + mu_assert("ERROR: nbytes incorrect (1)", nbytes < 0); + + sprintf(strval, "%d", -1); + setenv("BLOSC_NTHREADS", strval, 1); + /* Decompress the buffer */ + nbytes = blosc1_decompress(dest, dest2, size); + mu_assert("ERROR: nbytes incorrect (2)", nbytes < 0); + + return 0; +} + static char *all_tests(void) { mu_run_test(test_compress); mu_run_test(test_compress_decompress); + mu_run_test(test_nthreads_limits); + mu_run_test(test_nthreads_limits_envvar); return 0; } From ac00aa758d2eda5c732f9c4e67af6995cb5b5ce1 Mon Sep 17 00:00:00 2001 From: Thomas VINCENT Date: Tue, 7 Nov 2023 12:06:48 +0100 Subject: [PATCH 2/3] Enable AVX512 for blosc2/bitshuffle --- setup.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/setup.py b/setup.py index 1409d688..a5a320ee 100644 --- a/setup.py +++ b/setup.py @@ -869,7 +869,7 @@ def get_blosc2_plugin(): # blosc sources sources = glob(f'{blosc2_dir}/blosc/*.c') include_dirs = [blosc2_dir, f'{blosc2_dir}/blosc', f'{blosc2_dir}/include'] - define_macros = [('SHUFFLE_NEON_ENABLED', 1)] + define_macros = [('SHUFFLE_AVX512_ENABLED', 1), ('SHUFFLE_NEON_ENABLED', 1)] extra_compile_args = [] extra_link_args = [] libraries = [] From d77c9df2555cdaf9d5968f215265e3e720291348 Mon Sep 17 00:00:00 2001 From: Thomas VINCENT Date: Tue, 7 Nov 2023 12:07:14 +0100 Subject: [PATCH 3/3] update c-blosc2 version in documentation --- doc/information.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/information.rst b/doc/information.rst index a2100e08..352bd32c 100644 --- a/doc/information.rst +++ b/doc/information.rst @@ -58,7 +58,7 @@ HDF5 compression filters and compression libraries sources were obtained from: * `hdf5-blosc plugin <https://github.com/Blosc/hdf5-blosc>`_ (v1.0.0) using `c-blosc <https://github.com/Blosc/c-blosc>`_ (v1.21.5), LZ4, Snappy, ZLib and ZStd. * hdf5-blosc2 plugin (from `PyTables <https://github.com/PyTables/PyTables>`_ v3.9.2.dev0, commit `3ba4e78 `_) - using `c-blosc2 <https://github.com/Blosc/c-blosc2>`_ (v2.10.2), LZ4, ZLib and ZStd. + using `c-blosc2 <https://github.com/Blosc/c-blosc2>`_ (v2.11.1), LZ4, ZLib and ZStd. * `FCIDECOMP plugin `_ (v1.0.2) using `CharLS <https://github.com/team-charls/charls>`_ (1.x branch, commit `25160a4 `_).
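Finally, the stricter nthreads validation exercised by the new tests can also be checked from user code. A small standalone check (a sketch; it assumes, as the tests above do, that blosc2_set_nthreads() returns a negative error code for out-of-range values and the previous setting otherwise) could read:

    #include <stdint.h>
    #include <stdio.h>
    #include "blosc2.h"

    int main(void) {
      blosc2_init();

      int16_t prev = blosc2_set_nthreads(4);  /* valid: 1 <= nthreads <= INT16_MAX */
      printf("previous nthreads: %d\n", prev);

      if (blosc2_set_nthreads(0) < 0)
        printf("0 threads rejected, as expected\n");

      /* (int16_t) (INT16_MAX + 1) wraps to a negative value, as in the new test */
      if (blosc2_set_nthreads((int16_t) (INT16_MAX + 1)) < 0)
        printf("out-of-range value rejected, as expected\n");

      blosc2_destroy();
      return 0;
    }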