My blog has been moved to ariya.ofilabs.com.

Monday, February 12, 2007

Modulus with Mersenne prime

Consider the following simple operation, where k is integer and p is prime:

  int i = k % p;

Typically this will be assembled to (sorry, x86 only):

  mov  eax, k
  cdq
  idiv p

where the result is available in register EDX.

Such IDIV instruction has a latency of more than 40 cycles on Intel Pentium or AMD64 processor family. Hence, for optimization purposes, it is best to avoid integer division.

The above division/modulus operation can be avoided if the prime number p is chosen to be the Mersenne primes only, i.e there is a positive integer s such as p = 2s-1. In 32-bit range, there are Mersenne primes: 3, 7, 31, 127, 8191, 131071, 524287, and 2147483647.

The modulus operator with Mersenne prime can be simplified as:

  int i = (k & p) + (k >> s);
  return (i >= p) ? i - p : i;

Update: This works only for k up to (1 << (2 * s)) - 1, so careful with small Mersenne primes. Otherwise you need to call the function recursively.

One possible assembler implementation is as follows.

  ; assume edx = k, ebx = p, ecx = s
  ; result is in eax
  mov     eax, edx
  sar     edx, cl        ; k >> s
  and     eax, ebx       ; k & p
  add     eax, edx       ; eax is i = (k & p) + (k >> s)
  mov     edx, eax       ; edx is also i
  sub     edx, ebx       ; i - p
  cmp     eax, ebx       ; only if (i >= p)
  cmovge  eax, edx       ; then eax is (i-p)

Note that with the help of CMOVGE (6 cycles latency on Pentium 4), there is no need for real branching (which is expensive). Although the code is longer compared to the IDIV version, it executes much faster. Still faster if the range k is limited so that only a decrement operator is needed. Even faster of course if p is constant.

Last time I used this is for hash table (micro)optimization, because the number of stored items is known and I can live with a table whose size is a Mersenne prime. That's indeed a very special case only.

BTW this is off-topic, but thanks for those who mailed/texted me after this post. Nothing happened to me, I'm just fine. Those lines are from Keane's latest single (guess which one?), picked for no particular reason other that it's a good ballad.

6 comments:

Anonymous said...

Pretty neat.

Anonymous said...

Don't constain your hash-tables just accept floating point conversions:

float d = k / p;
return (k - ((int)d*p));

about 6 assember instructions with 8 clocks latency.

Anonymous said...

if p is constant, let the compiler optimise the division/modulus for you. Including the case where p is a Mersenne prime.

Anonymous said...

I dont think the trick works.
example if I want k%7 and k=500
then your prog gives 66-7=59
intstead we should make another
call to the function on 66

Ariya Hidayat said...

I added some clarification on the range where the trick can work.

Anonymous said...

You can use the SSE2 instruction pmaddwd to do the shifts (as multiplies) and adds in a single instruction with a latency of 3 cycles. Even if you only need one of the 4 results and need a 'movd reg,xmmreg', this should be faster than using GPRs.