Consider the following simple operation, where k is integer and p is prime:
int i = k % p;
Typically this will be assembled to (sorry, x86 only):
mov eax, k cdq idiv p
where the result is available in register EDX.
Such IDIV instruction has a latency of more than 40 cycles on Intel Pentium or AMD64 processor family. Hence, for optimization purposes, it is best to avoid integer division.
The above division/modulus operation can be avoided if the prime number p is chosen to be the Mersenne primes only, i.e there is a positive integer s such as p = 2s-1. In 32-bit range, there are Mersenne primes: 3, 7, 31, 127, 8191, 131071, 524287, and 2147483647.
The modulus operator with Mersenne prime can be simplified as:
int i = (k & p) + (k >> s); return (i >= p) ? i - p : i;
Update: This works only for k up to (1 << (2 * s)) - 1, so careful with small Mersenne primes. Otherwise you need to call the function recursively.
One possible assembler implementation is as follows.
; assume edx = k, ebx = p, ecx = s ; result is in eax mov eax, edx sar edx, cl ; k >> s and eax, ebx ; k & p add eax, edx ; eax is i = (k & p) + (k >> s) mov edx, eax ; edx is also i sub edx, ebx ; i - p cmp eax, ebx ; only if (i >= p) cmovge eax, edx ; then eax is (i-p)
Note that with the help of CMOVGE (6 cycles latency on Pentium 4), there is no need for real branching (which is expensive). Although the code is longer compared to the IDIV version, it executes much faster. Still faster if the range k is limited so that only a decrement operator is needed. Even faster of course if p is constant.
Last time I used this is for hash table (micro)optimization, because the number of stored items is known and I can live with a table whose size is a Mersenne prime. That's indeed a very special case only.
BTW this is off-topic, but thanks for those who mailed/texted me after this post. Nothing happened to me, I'm just fine. Those lines are from Keane's latest single (guess which one?), picked for no particular reason other that it's a good ballad.
6 comments:
Pretty neat.
Don't constain your hash-tables just accept floating point conversions:
float d = k / p;
return (k - ((int)d*p));
about 6 assember instructions with 8 clocks latency.
if p is constant, let the compiler optimise the division/modulus for you. Including the case where p is a Mersenne prime.
I dont think the trick works.
example if I want k%7 and k=500
then your prog gives 66-7=59
intstead we should make another
call to the function on 66
I added some clarification on the range where the trick can work.
You can use the SSE2 instruction pmaddwd to do the shifts (as multiplies) and adds in a single instruction with a latency of 3 cycles. Even if you only need one of the 4 results and need a 'movd reg,xmmreg', this should be faster than using GPRs.
Post a Comment