Optimized assembly for div128by64 and mul64x64to128 #70

mratsim · 2018-12-03T11:10:43Z

Without going full int128 there are 2 huge optimisations opportunities on 64-bit processors.

All 64-bit platforms implement the division of a 128-bit number by a 64-bit one and also implement the extended precision multiplication of 2 64-bit number to a 128-bit one. The reason why is that 64x64 multiplication and 64/64 division requires extended precision and then discarding part of the result.

There is currently no way in pure C to force those instructions besides using int128 and even then the int128 code would probably less efficient by introducing checks/branches even though we know that only the low 64-bit part is used.

The benefits would be tremendous, replacing ~50 instructions by a single one in both multiplication and division case.

Division

nim-stint/stint/private/uint_div.nim

Lines 131 to 169 in edb1ade

    
           func div2n1n[T: SomeunsignedInt](q, r: var T, n_hi, n_lo, d: T) = 
        
             # assert countLeadingZeroBits(d) == 0, "Divisor was not normalized" 
        
             const 
        
               size = bitsof(q) 
        
               halfSize = size div 2 
        
               halfMask = (1.T shl halfSize) - 1.T 
        
             template halfQR(n_hi, n_lo, d, d_hi, d_lo: T): tuple[q,r: T] = 
        
               var (q, r) = divmod(n_hi, d_hi) 
        
               let m = q * d_lo 
        
               var r = (r shl halfSize) or n_lo 
        
               # Fix the reminder, we're at most 2 iterations off 
        
               if r < m: 
        
                 dec q 
        
                 r += d 
        
                 if r >= d and r < m: 
        
                   dec q 
        
                   r += d 
        
               r -= m 
        
               (q, r) 
        
             let 
        
               d_hi = d shr halfSize 
        
               d_lo = d and halfMask 
        
               n_lohi = nlo shr halfSize 
        
               n_lolo = nlo and halfMask 
        
             # First half of the quotient 
        
             let (q1, r1) = halfQR(n_hi, n_lohi, d, d_hi, d_lo) 
        
             # Second half 
        
             let (q2, r2) = halfQR(r1, n_lolo, d, d_hi, d_lo) 
        
             q = (q1 shl halfSize) or q2 
        
             r = r2

x86_64 assembly implementation. The same divq opcodes exists as divl on all 32-bit platform for 64div32 and the equivalent exists on ARM, MIPS etc.

func asm_x86_64_div2n1n(q, r: var uint64, n_hi, n_lo, d: uint64) {.inline.}=
  ## Division uint128 by uint64

  # DIV r/m64
  # Divide RDX:RAX (n_hi:n_lo) by r/m64
  #
  # Inputs
  #   - numerator high word in RDX,
  #   - numerator low word in RAX,
  #   - divisor as r/m parameter (register or memory at the compiler discretion)
  # Result
  #   - Quotient in RAX
  #   - Remainder in RDX
  asm """
    divq %[divisor]             // We name the register/memory divisor
    : "=a" (`*q`), "=d" (`*r`)  // Don't forget to dereference the var hidden pointer
    : "d" (`n_hi`), "a" (`n_lo`), [divisor] "rm" (`d`)
    :  // no register clobbered besides explicitly used RAX and RDX
  """

Multiplication

nim-stint/stint/private/uint_mul.nim

Lines 48 to 73 in edb1ade

    
           template extPrecMulImpl(result: var UintImpl[uint64], op: untyped, u, v: uint64) = 
        
             const 
        
               p = 64 div 2 
        
               base: uint64 = 1 shl p 
        
             var 
        
               x0, x1, x2, x3: uint64 
        
             let 
        
               ul = lo(u) 
        
               uh = hi(u) 
        
               vl = lo(v) 
        
               vh = hi(v) 
        
             x0 = ul * vl 
        
             x1 = ul * vh 
        
             x2 = uh * vl 
        
             x3 = uh * vh 
        
             x1 += hi(x0)          # This can't carry 
        
             x1 += x2              # but this can 
        
             if x1 < x2:           # if carry, add it to x3 
        
               x3 += base 
        
             op(result.hi, x3 + hi(x1)) 
        
             op(result.lo, (x1 shl p) or lo(x0))

x86_64 assembly implementation. The same mulq opcodes exists as mull on all 32-bit platform for 32x32to64 and the equivalent exists on ARM, MIPS etc.

func asm_x86_64_extMul(hi, lo: var uint64, a, b: uint64) {.inline.}=
  ## Extended precision multiplication uint64 * uint64 --> uint128

  # MUL r/m64
  # Multiply RAX by r/m64
  #
  # Inputs:
  #   - RAX
  #   - r/m
  # Outputs:
  #   - High word in RDX
  #   - Low word in RAX

  asm """
    mulq %[operand]
    : "=d" (`*hi`), "=a" (`*lo`) // Don't forget to dereference the var hidden pointer
    : "a" (`a`), [operand] "rm" (`b`)
    : // no clobbered registers
  """

Tests

when isMainModule:
  block: # Multiplication
    var hi, lo: uint64

    asm_x86_64_extMul(hi, lo, 1 shl 32, 1 shl 33) # 2^65
    doAssert hi == 2
    doAssert lo == 0

  block: # Division
    var q, r: uint64

    # (1 shl 64) div 3
    let n_hi = 1'u64
    let n_lo = 0'u64
    let d = 3'u64

    asm_x86_64_div2n1n(q, r, n_hi, n_lo, d)

    doAssert q == 6148914691236517205'u64
    doAssert r == 1

mratsim added the Performance label Dec 3, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized assembly for div128by64 and mul64x64to128 #70

Optimized assembly for div128by64 and mul64x64to128 #70

mratsim commented Dec 3, 2018

Optimized assembly for div128by64 and mul64x64to128 #70

Optimized assembly for div128by64 and mul64x64to128 #70

Comments

mratsim commented Dec 3, 2018

Division

Multiplication

Tests