My blog has been moved to ariya.ofilabs.com.

Thursday, June 30, 2011

quaternion multiplication: two years later

Sometime back I wrote (fun fact: it's Google first hit for faster quaternion multiplication) about my favorite commit I did exactly two years ago to Qt :

git show cbc22908
commit cbc229081a9df67a577b4bea61ad6aac52d470cb
Author: Ariya Hidayat 
Date:   Tue Jun 30 11:18:03 2009 +0200

    Faster quaternion multiplications.
    
    Use the known factorization trick to speed-up quaternion multiplication.
    Now we need only 9 floating-point multiplications, instead of 16 (but
    at the cost of extra additions and subtractions).

Ages ago, during my Ph.D research, when I worked with a certain hardware platform (hint: it's not generalized CPU), minimizing the needed number of hardware multipliers with a very little impact in the computation speed makes a huge different. With today's advanced processor architecture armed with vectorized instructions and a really smart optimizing compiler, there is often no need to use the factorized version of the multiplication.

Side note: if you want to like quaternion, see this simple rotatation quiz which can be solved quite easily once you know quaternion.

I try to apply the same trick to PhiloGL, an excellent WebGL framework from Nicolas. Recently, to my delight, he added quaternion support to the accompanying math library in PhiloGL. I think this is a nice chance to try the old trick, as I had the expectation that reducing the number of multiplications from 16 to just 9 could give some slight performance advantage.

It turns out that it is not the case, at least based on the benchmark tests running on modern browsers with very capable JavaScript engine. You can try the test yourself at jsperf.com/quaternion-multiplication. I have no idea whether this is due to JSPerf (very unlikely) or it's simply because the longer construct of the factorized version does not really speed-up anything. If any, seems that the amount of executed instruction matters more than whether addition is much faster than multiplication. And of course, we're talking about modern CPU, the difference is then becoming more subtle.

With the help of Nicolas, I tried various other tricks to help the JavaScript engine, mainly around different ways to prepare the persistent temporary variables: using normal properties, using Array, using Float32Array (at the cost of precision). Nothing leads to any significant improvement.

Of course if you have other tricks in your sleeve, I welcome you to try it with the benchmark. Meanwhile, let's hope that someday some JavaScript engine will run the factorized version faster. It's just a much cooler way to multiply quaternions!

6 comments:

Benoit Jacob said...

Well, multiplications are not really more expensive than additions or substractions on CPUs made in the past 5 years at least. What does matter though is pipelining so you can run a (multiplication,addition) pair in 1 cycle on recent enough Intel CPUs. This is one of the reasons why fast matrix product algorithms (that do fewer multiplications, but a bit more additions/substractions with a less regular pattern) are generally not yet worth it on current CPUs; I'd suspect the same thing to be biting you here.

Ariya Hidayat said...

@Benoit: I would be surprised if the factorized version can not be pipelined easily.

But in all cases, here I'm talking about running the different approaches *not* straight on the CPU, but via JavaScript engines. There are other factors kicking in, including the boxed model of JS object, JIT compiler, and trace optimization.

e.v.e said...

Actually putting things into Float64Array will help on recent V8 (see results for http://jsperf.com/quaternion-multiplication/7). Float64Array is always better then normal array if you do math with doubles --- on V8 it means one less indirection.

Regarding the factorized case. It suffers from increased register pressure. JIT has to work hard to do good register allocation. (I think in this case best register allocation requires quite complicated scheduling). On V8 due to some design decisions factorized case suffers excessive (non-factorized case uses 3 spill slots, factorized ~14) spilling (possibly into unaligned spill locations on ia32).

Ariya Hidayat said...

@e.v.e: Thanks for the comment, it's very insightful! And send my best to the rest of V8 team :)

The User said...

What is the result with C++ or ASM? There are many things which may influence the JavaScript performance, we could probably understand it better if the CPU aspect matters if you would try it with C++ or ASM.

Ariya Hidayat said...

@TheUser: For native (non JS), you just follow the previous blog entry I referred right in the beginning.