Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized assembly for div128by64 and mul64x64to128 #70

Open
mratsim opened this issue Dec 3, 2018 · 0 comments
Open

Optimized assembly for div128by64 and mul64x64to128 #70

mratsim opened this issue Dec 3, 2018 · 0 comments

Comments

@mratsim
Copy link
Contributor

mratsim commented Dec 3, 2018

Without going full int128 there are 2 huge optimisations opportunities on 64-bit processors.

All 64-bit platforms implement the division of a 128-bit number by a 64-bit one and also implement the extended precision multiplication of 2 64-bit number to a 128-bit one. The reason why is that 64x64 multiplication and 64/64 division requires extended precision and then discarding part of the result.

There is currently no way in pure C to force those instructions besides using int128 and even then the int128 code would probably less efficient by introducing checks/branches even though we know that only the low 64-bit part is used.

The benefits would be tremendous, replacing ~50 instructions by a single one in both multiplication and division case.

Division

func div2n1n[T: SomeunsignedInt](q, r: var T, n_hi, n_lo, d: T) =
# assert countLeadingZeroBits(d) == 0, "Divisor was not normalized"
const
size = bitsof(q)
halfSize = size div 2
halfMask = (1.T shl halfSize) - 1.T
template halfQR(n_hi, n_lo, d, d_hi, d_lo: T): tuple[q,r: T] =
var (q, r) = divmod(n_hi, d_hi)
let m = q * d_lo
var r = (r shl halfSize) or n_lo
# Fix the reminder, we're at most 2 iterations off
if r < m:
dec q
r += d
if r >= d and r < m:
dec q
r += d
r -= m
(q, r)
let
d_hi = d shr halfSize
d_lo = d and halfMask
n_lohi = nlo shr halfSize
n_lolo = nlo and halfMask
# First half of the quotient
let (q1, r1) = halfQR(n_hi, n_lohi, d, d_hi, d_lo)
# Second half
let (q2, r2) = halfQR(r1, n_lolo, d, d_hi, d_lo)
q = (q1 shl halfSize) or q2
r = r2

x86_64 assembly implementation. The same divq opcodes exists as divl on all 32-bit platform for 64div32 and the equivalent exists on ARM, MIPS etc.

func asm_x86_64_div2n1n(q, r: var uint64, n_hi, n_lo, d: uint64) {.inline.}=
  ## Division uint128 by uint64

  # DIV r/m64
  # Divide RDX:RAX (n_hi:n_lo) by r/m64
  #
  # Inputs
  #   - numerator high word in RDX,
  #   - numerator low word in RAX,
  #   - divisor as r/m parameter (register or memory at the compiler discretion)
  # Result
  #   - Quotient in RAX
  #   - Remainder in RDX
  asm """
    divq %[divisor]             // We name the register/memory divisor
    : "=a" (`*q`), "=d" (`*r`)  // Don't forget to dereference the var hidden pointer
    : "d" (`n_hi`), "a" (`n_lo`), [divisor] "rm" (`d`)
    :  // no register clobbered besides explicitly used RAX and RDX
  """

Multiplication

template extPrecMulImpl(result: var UintImpl[uint64], op: untyped, u, v: uint64) =
const
p = 64 div 2
base: uint64 = 1 shl p
var
x0, x1, x2, x3: uint64
let
ul = lo(u)
uh = hi(u)
vl = lo(v)
vh = hi(v)
x0 = ul * vl
x1 = ul * vh
x2 = uh * vl
x3 = uh * vh
x1 += hi(x0) # This can't carry
x1 += x2 # but this can
if x1 < x2: # if carry, add it to x3
x3 += base
op(result.hi, x3 + hi(x1))
op(result.lo, (x1 shl p) or lo(x0))

x86_64 assembly implementation. The same mulq opcodes exists as mull on all 32-bit platform for 32x32to64 and the equivalent exists on ARM, MIPS etc.

func asm_x86_64_extMul(hi, lo: var uint64, a, b: uint64) {.inline.}=
  ## Extended precision multiplication uint64 * uint64 --> uint128

  # MUL r/m64
  # Multiply RAX by r/m64
  #
  # Inputs:
  #   - RAX
  #   - r/m
  # Outputs:
  #   - High word in RDX
  #   - Low word in RAX

  asm """
    mulq %[operand]
    : "=d" (`*hi`), "=a" (`*lo`) // Don't forget to dereference the var hidden pointer
    : "a" (`a`), [operand] "rm" (`b`)
    : // no clobbered registers
  """

Tests

when isMainModule:
  block: # Multiplication
    var hi, lo: uint64

    asm_x86_64_extMul(hi, lo, 1 shl 32, 1 shl 33) # 2^65
    doAssert hi == 2
    doAssert lo == 0

  block: # Division
    var q, r: uint64

    # (1 shl 64) div 3
    let n_hi = 1'u64
    let n_lo = 0'u64
    let d = 3'u64

    asm_x86_64_div2n1n(q, r, n_hi, n_lo, d)

    doAssert q == 6148914691236517205'u64
    doAssert r == 1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant