My blog has been moved to ariya.ofilabs.com.

## Friday, July 09, 2010

### faster quaternion multiplication

Sweet memories, it was fun to derive it.

Faster here however must be taken with a grain of salt as the new code is not always guaranteed to be better pipelined.

And of course, it's trivial to beat this generic C code with architecture-specific hand-rolled assembly.

```git show cbc22908
Author: Ariya Hidayat
Date:   Tue Jun 30 11:18:03 2009 +0200

Faster quaternion multiplications.

Use the known factorization trick to speed-up quaternion multiplication.
Now we need only 9 floating-point multiplications, instead of 16 (but
at the cost of extra additions and subtractions).

Callgrind shows that the function now takes 299 instructions instead of
318 instructions, which is not a big win. However I assume the speed-up
has a better effect for mobile CPU, where multiplications are more
expensive.

Reviewed-by: Rhys Weatherley

diff --git a/src/gui/math3d/qquaternion.h b/src/gui/math3d/qquaternion.h
index 55c871d..9a1b590 100644
--- a/src/gui/math3d/qquaternion.h
+++ b/src/gui/math3d/qquaternion.h
@@ -198,24 +198,17 @@ inline QQuaternion &QQuaternion::operator*=(qreal factor)

inline const QQuaternion operator*(const QQuaternion &q1, const QQuaternion& q2)
{
-    // Algorithm from:
-    // http://www.j3d.org/matrix_faq/matrfaq_latest.html#Q53
-    float x = q1.wp * q2.xp +
-                    q1.xp * q2.wp +
-                    q1.yp * q2.zp -
-                    q1.zp * q2.yp;
-    float y = q1.wp * q2.yp +
-                    q1.yp * q2.wp +
-                    q1.zp * q2.xp -
-                    q1.xp * q2.zp;
-    float z = q1.wp * q2.zp +
-                    q1.zp * q2.wp +
-                    q1.xp * q2.yp -
-                    q1.yp * q2.xp;
-    float w = q1.wp * q2.wp -
-                    q1.xp * q2.xp -
-                    q1.yp * q2.yp -
-                    q1.zp * q2.zp;
+    float ww = (q1.zp + q1.xp) * (q2.xp + q2.yp);
+    float yy = (q1.wp - q1.yp) * (q2.wp + q2.zp);
+    float zz = (q1.wp + q1.yp) * (q2.wp - q2.zp);
+    float xx = ww + yy + zz;
+    float qq = 0.5 * (xx + (q1.zp - q1.xp) * (q2.xp - q2.yp));
+
+    float w = qq - ww + (q1.zp - q1.yp) * (q2.yp - q2.zp);
+    float x = qq - xx + (q1.xp + q1.wp) * (q2.xp + q2.wp);
+    float y = qq - yy + (q1.wp - q1.xp) * (q2.yp + q2.zp);
+    float z = qq - zz + (q1.zp + q1.yp) * (q2.wp - q2.xp);
+
return QQuaternion(w, x, y, z, 1);
}
```

Johan Thelin said...

qreal?

Benoit Jacob said...

If you are interested in SSE code for quaternion multiplication, we have that in Eigen. It does not require assembly, just intrinsics, and should be easily portable to NEON.

Benoit Jacob said...

By the way, the resulting assembly produced by gcc 4.4 (the source is only C++ with intrinsics) has only 20 instructions:

movaps (%rsi), %xmm5
movdqa (%rdx), %xmm2
pshufd \$137, %xmm5, %xmm4
pshufd \$146, %xmm2, %xmm1
movaps .LC0(%rip), %xmm0
pshufd \$127, %xmm5, %xmm6
pshufd \$18, %xmm5, %xmm3
mulps %xmm4, %xmm1
pshufd \$100, %xmm2, %xmm4
mulps %xmm6, %xmm4
xorps %xmm0, %xmm1
xorps %xmm0, %xmm4
pshufd \$9, %xmm2, %xmm0