don't code today what you can't debug tomorrow: faster quaternion multiplication

Sweet memories, it was fun to derive it.

Faster here however must be taken with a grain of salt as the new code is not always guaranteed to be better pipelined.

And of course, it's trivial to beat this generic C code with architecture-specific hand-rolled assembly.

git show cbc22908
commit cbc229081a9df67a577b4bea61ad6aac52d470cb
Author: Ariya Hidayat 
Date:   Tue Jun 30 11:18:03 2009 +0200

    Faster quaternion multiplications.
    
    Use the known factorization trick to speed-up quaternion multiplication.
    Now we need only 9 floating-point multiplications, instead of 16 (but
    at the cost of extra additions and subtractions).
    
    Callgrind shows that the function now takes 299 instructions instead of
    318 instructions, which is not a big win. However I assume the speed-up
    has a better effect for mobile CPU, where multiplications are more
    expensive.
    
    Reviewed-by: Rhys Weatherley

diff --git a/src/gui/math3d/qquaternion.h b/src/gui/math3d/qquaternion.h
index 55c871d..9a1b590 100644
--- a/src/gui/math3d/qquaternion.h
+++ b/src/gui/math3d/qquaternion.h
@@ -198,24 +198,17 @@ inline QQuaternion &QQuaternion::operator*=(qreal factor)
 
 inline const QQuaternion operator*(const QQuaternion &q1, const QQuaternion& q2)
 {
-    // Algorithm from:
-    // http://www.j3d.org/matrix_faq/matrfaq_latest.html#Q53
-    float x = q1.wp * q2.xp +
-                    q1.xp * q2.wp +
-                    q1.yp * q2.zp -
-                    q1.zp * q2.yp;
-    float y = q1.wp * q2.yp +
-                    q1.yp * q2.wp +
-                    q1.zp * q2.xp -
-                    q1.xp * q2.zp;
-    float z = q1.wp * q2.zp +
-                    q1.zp * q2.wp +
-                    q1.xp * q2.yp -
-                    q1.yp * q2.xp;
-    float w = q1.wp * q2.wp -
-                    q1.xp * q2.xp -
-                    q1.yp * q2.yp -
-                    q1.zp * q2.zp;
+    float ww = (q1.zp + q1.xp) * (q2.xp + q2.yp);
+    float yy = (q1.wp - q1.yp) * (q2.wp + q2.zp);
+    float zz = (q1.wp + q1.yp) * (q2.wp - q2.zp);
+    float xx = ww + yy + zz;
+    float qq = 0.5 * (xx + (q1.zp - q1.xp) * (q2.xp - q2.yp));
+
+    float w = qq - ww + (q1.zp - q1.yp) * (q2.yp - q2.zp);
+    float x = qq - xx + (q1.xp + q1.wp) * (q2.xp + q2.wp);
+    float y = qq - yy + (q1.wp - q1.xp) * (q2.yp + q2.zp);
+    float z = qq - zz + (q1.zp + q1.yp) * (q2.wp - q2.xp);
+
     return QQuaternion(w, x, y, z, 1);
 }

3 comments:

Johan ThelinJuly 09, 2010 5:47 PM
qreal?
Benoit JacobJuly 10, 2010 11:01 AM
If you are interested in SSE code for quaternion multiplication, we have that in Eigen. It does not require assembly, just intrinsics, and should be easily portable to NEON.
Benoit JacobJuly 10, 2010 11:11 AM
By the way, the resulting assembly produced by gcc 4.4 (the source is only C++ with intrinsics) has only 20 instructions:

movaps (%rsi), %xmm5
movdqa (%rdx), %xmm2
pshufd $137, %xmm5, %xmm4
pshufd $146, %xmm2, %xmm1
movaps .LC0(%rip), %xmm0
pshufd $127, %xmm5, %xmm6
pshufd $18, %xmm5, %xmm3
mulps %xmm4, %xmm1
pshufd $100, %xmm2, %xmm4
mulps %xmm6, %xmm4
xorps %xmm0, %xmm1
xorps %xmm0, %xmm4
pshufd $9, %xmm2, %xmm0
addps %xmm4, %xmm1
mulps %xmm0, %xmm3
pshufd $255, %xmm2, %xmm0
mulps %xmm5, %xmm0
subps %xmm3, %xmm0
addps %xmm1, %xmm0
movaps %xmm0n, (%rdi)

Note: Only a member of this blog may post a comment.

Friday, July 09, 2010

faster quaternion multiplication

3 comments: