don't code today what you can't debug tomorrow: faster quaternion multiplication

Friday, July 09, 2010

faster quaternion multiplication

Labels: coding, hack, musing, optimization, qt, tip

Sweet memories, it was fun to derive it.

Faster here however must be taken with a grain of salt as the new code is not always guaranteed to be better pipelined.

And of course, it's trivial to beat this generic C code with architecture-specific hand-rolled assembly.

git show cbc22908
commit cbc229081a9df67a577b4bea61ad6aac52d470cb
Author: Ariya Hidayat 
Date:   Tue Jun 30 11:18:03 2009 +0200

    Faster quaternion multiplications.
    
    Use the known factorization trick to speed-up quaternion multiplication.
    Now we need only 9 floating-point multiplications, instead of 16 (but
    at the cost of extra additions and subtractions).
    
    Callgrind shows that the function now takes 299 instructions instead of
    318 instructions, which is not a big win. However I assume the speed-up
    has a better effect for mobile CPU, where multiplications are more
    expensive.
    
    Reviewed-by: Rhys Weatherley

diff --git a/src/gui/math3d/qquaternion.h b/src/gui/math3d/qquaternion.h
index 55c871d..9a1b590 100644
--- a/src/gui/math3d/qquaternion.h
+++ b/src/gui/math3d/qquaternion.h
@@ -198,24 +198,17 @@ inline QQuaternion &QQuaternion::operator*=(qreal factor)
 
 inline const QQuaternion operator*(const QQuaternion &q1, const QQuaternion& q2)
 {
-    // Algorithm from:
-    // http://www.j3d.org/matrix_faq/matrfaq_latest.html#Q53
-    float x = q1.wp * q2.xp +
-                    q1.xp * q2.wp +
-                    q1.yp * q2.zp -
-                    q1.zp * q2.yp;
-    float y = q1.wp * q2.yp +
-                    q1.yp * q2.wp +
-                    q1.zp * q2.xp -
-                    q1.xp * q2.zp;
-    float z = q1.wp * q2.zp +
-                    q1.zp * q2.wp +
-                    q1.xp * q2.yp -
-                    q1.yp * q2.xp;
-    float w = q1.wp * q2.wp -
-                    q1.xp * q2.xp -
-                    q1.yp * q2.yp -
-                    q1.zp * q2.zp;
+    float ww = (q1.zp + q1.xp) * (q2.xp + q2.yp);
+    float yy = (q1.wp - q1.yp) * (q2.wp + q2.zp);
+    float zz = (q1.wp + q1.yp) * (q2.wp - q2.zp);
+    float xx = ww + yy + zz;
+    float qq = 0.5 * (xx + (q1.zp - q1.xp) * (q2.xp - q2.yp));
+
+    float w = qq - ww + (q1.zp - q1.yp) * (q2.yp - q2.zp);
+    float x = qq - xx + (q1.xp + q1.wp) * (q2.xp + q2.wp);
+    float y = qq - yy + (q1.wp - q1.xp) * (q2.yp + q2.zp);
+    float z = qq - zz + (q1.zp + q1.yp) * (q2.wp - q2.xp);
+
     return QQuaternion(w, x, y, z, 1);
 }

3 comments:

Johan Thelin said...: qreal?; July 09, 2010 5:47 PM
Benoit Jacob said...: If you are interested in SSE code for quaternion multiplication, we have that in Eigen. It does not require assembly, just intrinsics, and should be easily portable to NEON.; July 10, 2010 11:01 AM
Benoit Jacob said...: By the way, the resulting assembly produced by gcc 4.4 (the source is only C++ with intrinsics) has only 20 instructions:

movaps (%rsi), %xmm5
movdqa (%rdx), %xmm2
pshufd $137, %xmm5, %xmm4
pshufd $146, %xmm2, %xmm1
movaps .LC0(%rip), %xmm0
pshufd $127, %xmm5, %xmm6
pshufd $18, %xmm5, %xmm3
mulps %xmm4, %xmm1
pshufd $100, %xmm2, %xmm4
mulps %xmm6, %xmm4
xorps %xmm0, %xmm1
xorps %xmm0, %xmm4
pshufd $9, %xmm2, %xmm0
addps %xmm4, %xmm1
mulps %xmm0, %xmm3
pshufd $255, %xmm2, %xmm0
mulps %xmm5, %xmm0
subps %xmm3, %xmm0
addps %xmm1, %xmm0
movaps %xmm0n, (%rdi); July 10, 2010 11:11 AM

don't code today what you can't debug tomorrow

Friday, July 09, 2010

faster quaternion multiplication

3 comments:

recent posts

recent comments

archive