Friday, July 09, 2010

faster quaternion multiplication

Sweet memories, it was fun to derive it.

Faster here however must be taken with a grain of salt as the new code is not always guaranteed to be better pipelined.

And of course, it's trivial to beat this generic C code with architecture-specific hand-rolled assembly.

git show cbc22908
commit cbc229081a9df67a577b4bea61ad6aac52d470cb
Author: Ariya Hidayat 
Date:   Tue Jun 30 11:18:03 2009 +0200

    Faster quaternion multiplications.
    
    Use the known factorization trick to speed-up quaternion multiplication.
    Now we need only 9 floating-point multiplications, instead of 16 (but
    at the cost of extra additions and subtractions).
    
    Callgrind shows that the function now takes 299 instructions instead of
    318 instructions, which is not a big win. However I assume the speed-up
    has a better effect for mobile CPU, where multiplications are more
    expensive.
    
    Reviewed-by: Rhys Weatherley

diff --git a/src/gui/math3d/qquaternion.h b/src/gui/math3d/qquaternion.h
index 55c871d..9a1b590 100644
--- a/src/gui/math3d/qquaternion.h
+++ b/src/gui/math3d/qquaternion.h
@@ -198,24 +198,17 @@ inline QQuaternion &QQuaternion::operator*=(qreal factor)
 
 inline const QQuaternion operator*(const QQuaternion &q1, const QQuaternion& q2)
 {
-    // Algorithm from:
-    // http://www.j3d.org/matrix_faq/matrfaq_latest.html#Q53
-    float x = q1.wp * q2.xp +
-                    q1.xp * q2.wp +
-                    q1.yp * q2.zp -
-                    q1.zp * q2.yp;
-    float y = q1.wp * q2.yp +
-                    q1.yp * q2.wp +
-                    q1.zp * q2.xp -
-                    q1.xp * q2.zp;
-    float z = q1.wp * q2.zp +
-                    q1.zp * q2.wp +
-                    q1.xp * q2.yp -
-                    q1.yp * q2.xp;
-    float w = q1.wp * q2.wp -
-                    q1.xp * q2.xp -
-                    q1.yp * q2.yp -
-                    q1.zp * q2.zp;
+    float ww = (q1.zp + q1.xp) * (q2.xp + q2.yp);
+    float yy = (q1.wp - q1.yp) * (q2.wp + q2.zp);
+    float zz = (q1.wp + q1.yp) * (q2.wp - q2.zp);
+    float xx = ww + yy + zz;
+    float qq = 0.5 * (xx + (q1.zp - q1.xp) * (q2.xp - q2.yp));
+
+    float w = qq - ww + (q1.zp - q1.yp) * (q2.yp - q2.zp);
+    float x = qq - xx + (q1.xp + q1.wp) * (q2.xp + q2.wp);
+    float y = qq - yy + (q1.wp - q1.xp) * (q2.yp + q2.zp);
+    float z = qq - zz + (q1.zp + q1.yp) * (q2.wp - q2.xp);
+
     return QQuaternion(w, x, y, z, 1);
 }

3 comments:

  1. If you are interested in SSE code for quaternion multiplication, we have that in Eigen. It does not require assembly, just intrinsics, and should be easily portable to NEON.

    ReplyDelete
  2. By the way, the resulting assembly produced by gcc 4.4 (the source is only C++ with intrinsics) has only 20 instructions:

    movaps (%rsi), %xmm5
    movdqa (%rdx), %xmm2
    pshufd $137, %xmm5, %xmm4
    pshufd $146, %xmm2, %xmm1
    movaps .LC0(%rip), %xmm0
    pshufd $127, %xmm5, %xmm6
    pshufd $18, %xmm5, %xmm3
    mulps %xmm4, %xmm1
    pshufd $100, %xmm2, %xmm4
    mulps %xmm6, %xmm4
    xorps %xmm0, %xmm1
    xorps %xmm0, %xmm4
    pshufd $9, %xmm2, %xmm0
    addps %xmm4, %xmm1
    mulps %xmm0, %xmm3
    pshufd $255, %xmm2, %xmm0
    mulps %xmm5, %xmm0
    subps %xmm3, %xmm0
    addps %xmm1, %xmm0
    movaps %xmm0n, (%rdi)

    ReplyDelete

Note: Only a member of this blog may post a comment.