Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore options for precision of div/rcp/sqrt/rsqrt #626

Merged
merged 1 commit into from
Jan 9, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -95,8 +95,8 @@ As a result, additional treatments should be applied to ensure consistency betwe
Though floating-point operations in NEON use the IEEE single-precision format, NEON does not fully comply to the IEEE standard when inputs or results are denormal or NaN values for minimizing power consumption as well as maximizing performance.
Considering the balance between correctness and performance, `sse2neon` recognizes the following compile-time configurations:
* `SSE2NEON_PRECISE_MINMAX`: Enable precise implementation of `_mm_min_{ps,pd}` and `_mm_max_{ps,pd}`. If you need consistent results such as handling with NaN values, enable it.
* `SSE2NEON_PRECISE_DIV` (deprecated): Enable precise implementation of `_mm_rcp_ps` and `_mm_div_ps` by additional Netwon-Raphson iteration for accuracy.
* `SSE2NEON_PRECISE_SQRT` (deprecated): Enable precise implementation of `_mm_sqrt_ps` and `_mm_rsqrt_ps` by additional Netwon-Raphson iteration for accuracy.
* `SSE2NEON_PRECISE_DIV`: Enable precise implementation of `_mm_rcp_ps` and `_mm_div_ps` by additional Netwon-Raphson iteration for accuracy.
* `SSE2NEON_PRECISE_SQRT`: Enable precise implementation of `_mm_sqrt_ps` and `_mm_rsqrt_ps` by additional Netwon-Raphson iteration for accuracy.
* `SSE2NEON_PRECISE_DP`: Enable precise implementation of `_mm_dp_pd`. When the conditional bit is not set, the corresponding multiplication would not be executed.

The above are turned off by default, and you should define the corresponding macro(s) as `1` before including `sse2neon.h` if you need the precise implementations.
Expand Down
13 changes: 11 additions & 2 deletions sse2neon.h
Original file line number Diff line number Diff line change
Expand Up @@ -1720,7 +1720,7 @@ FORCE_INLINE int64_t _mm_cvttss_si64(__m128 a)
// https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_div_ps
FORCE_INLINE __m128 _mm_div_ps(__m128 a, __m128 b)
{
#if defined(__aarch64__) || defined(_M_ARM64)
#if (defined(__aarch64__) || defined(_M_ARM64)) && !SSE2NEON_PRECISE_DIV
return vreinterpretq_m128_f32(
vdivq_f32(vreinterpretq_f32_m128(a), vreinterpretq_f32_m128(b)));
#else
Expand Down Expand Up @@ -2297,6 +2297,10 @@ FORCE_INLINE __m128 _mm_rcp_ps(__m128 in)
{
float32x4_t recip = vrecpeq_f32(vreinterpretq_f32_m128(in));
recip = vmulq_f32(recip, vrecpsq_f32(recip, vreinterpretq_f32_m128(in)));
#if SSE2NEON_PRECISE_DIV
// Additional Netwon-Raphson iteration for accuracy
recip = vmulq_f32(recip, vrecpsq_f32(recip, vreinterpretq_f32_m128(in)));
#endif
return vreinterpretq_m128_f32(recip);
}

Expand Down Expand Up @@ -2329,6 +2333,11 @@ FORCE_INLINE __m128 _mm_rsqrt_ps(__m128 in)

out = vmulq_f32(
out, vrsqrtsq_f32(vmulq_f32(vreinterpretq_f32_m128(in), out), out));
#if SSE2NEON_PRECISE_SQRT
// Additional Netwon-Raphson iteration for accuracy
out = vmulq_f32(
out, vrsqrtsq_f32(vmulq_f32(vreinterpretq_f32_m128(in), out), out));
#endif

// Set output vector element to infinity/negative-infinity if
// the corresponding input vector element is 0.0f/-0.0f.
Expand Down Expand Up @@ -2644,7 +2653,7 @@ FORCE_INLINE void _mm_lfence(void)
// https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_sqrt_ps
FORCE_INLINE __m128 _mm_sqrt_ps(__m128 in)
{
#if defined(__aarch64__) || defined(_M_ARM64)
#if (defined(__aarch64__) || defined(_M_ARM64)) && !SSE2NEON_PRECISE_SQRT
return vreinterpretq_m128_f32(vsqrtq_f32(vreinterpretq_f32_m128(in)));
#else
float32x4_t recip = vrsqrteq_f32(vreinterpretq_f32_m128(in));
Expand Down