don't code today what you can't debug tomorrow: Memory efficient DOM (Part 2)

Friday, November 24, 2006

Memory efficient DOM (Part 2)

This is the follow-up to the first part of my writing on memory efficient DOM implementation for KOffice. Now my intention is to find out whether real-time compression can help to further reduce the memory consumption.

XML compression is a whole field of research. There are a bunch of compressor available, from XMill, XMLZip, XMLPPM, XGrind, XML-Xpress, Exalt, TREECHOP and many others. Since I'm not a scholar in this field (I already have my own graduate research to take care of), I have chosen a direct and brutal approach: just compress the packed nodes representation and decompress them on-the-fly. While iteration is very common, simple cache buffer for decompressed neighbouring nodes is implemented so there is no need to always decompress every single node. This method easily fulfills the requirements that I set: no need to load the whole XML (because it intercepts SAX events), allows queries and iteration (due to the QDom-like wrapper), and can be conveniently tested with the compression switched off.

There are many algorithms suitable for fast compression and decompression. Byte-aligned Lempel-Ziv is a logical and obvious choice. Again, I have very much zero knowledge about data compression. There are Ross Williams's LZRW1, Herman Vogt's LZV, Markus Oberhumer's LZO, Marc Lehmann's LZF and many others. LZRW1 is well-known although probably covered by patents. LZV has the poorest compression ratio so it goes out of the list. LZO's compression is the best, at the price of a bit more complicated state-machine, but my own implementation (since the reference implementation is GPL only) is not so fast and well debugged. LZF, the successor to LZV, is a good contender and already used (among others) in suspend2 project. On fairly modern machine with large CPU cache and reasonable branch predictor, liblzf (the reference implementation of LZF) is much slower than LZRW1. So what I did is to recreate the compressor. Armed with Valgrind's Cachegrind, my new implementation is finally faster than LZRW1.

What I am really eager to try (but no time for that yet) is Wilson-Kaplan algorithm. At least the WK4x4 and WKdm variants have been implemented for Linux compressed caching project. And since the compression/decompression inside KoXmlReader is actually involving in-memory data, this fast and low-overhead WK algorithm should be a good candidate. There are assorted patent applications for such cache compression, but I wonder whether the algorithm per se is patented.

Now on to the result, starting with the worst test document: Frequencies.ods. Just to remind you, loading and iterating this document (19 MB XML) takes 250 MB if we're using QDom but only 40 MB with KoXmlReader. Here what Valgrind/Massif reported when the compact form of the DOM is compressed and decompressed on the fly. At most, this barely consumes 8 MB. Compared to 250 MB, 40 MB is good enough, but 8 MB is great, isn't it?

Interesting is, during the parsing phase, the heap allocation increases only (almost) linearly.

The full comparison with other documents is shown below. Amazing to see how short the violet bars are.

As for speed, regardless the lightning fast packing and unpacking, there's some penalty. This graph below summarizes the timing required for parsing (light bars) and iterating (dark bars). Compare to what was shown in part 1, I only added the result for KoXmlReader with compression. Small documents are omitted because processing time is short anyway.

Of course, all these graphs must be taken with a grain of salt. But my objective is only to find out whether this all makes sense, not to write a full-brown research paper. So far I'm quite happy with the result.

It can be seen that KoXmlReader is a feasible alternative for QDom (in KOffice) as it does not incur any significant performance penalty yet the memory footprint in the worst case (Frequencies.ods) can only be one sixth (40 MB) compared to that of QDom (250 MB). We also see the additional benefit of using real-time compression. Compared to plain KoXmlReader, although in the worst case it is 1/5 slower (and there is still room for further improvements here), the allocated heap is only one fifth (8 MB vs 40 MB). Or even 1/30 if you bravely compare it against QDom (8 MB vs 250 MB).

I guess we should empirically find out how large the average documents that most people are working with. If these are less than the worst case that I tested, then enabling compression by default is a good thing because the performance penalty won't be noticable. If not, even without compression it is already much more better than using QDom.

Other than that, my aim now is to fully integrate KoXmlReader in KOffice. Hopefully someday bug #59510 can really be closed.

21 comments:

elvis said...: Amazing work Ariya, really amazing and sorely needed. But I want to see the violet bars you talk about! :); November 24, 2006 10:21 AM
Anonymous said...: I think it's definitely worth having this in KOffice -- and used whenever we read an xml-based document.

Boudewijn; November 24, 2006 10:25 AM
Anonymous said...: You said you recreated the compressor, is that something that can be improved in the original liblzf as well (en thereby improve suspend2 performance as well for example) or was it something very specific for this project?; November 24, 2006 6:02 PM
Ariya Hidayat said...: elvis: sometime the server where I put the images is overloaded, give it a try again.

boudewijn: at least kspread is already ready for that. other apps are on my TODO.

quintesse: possibly does not make too big difference for suspend2, there it is likely an I/O bound process.; November 26, 2006 1:22 AM
Anonymous said...: Hmmm, the reduction from 40 MB to 8 MB looks related to the new style storage. Have you measured the KoXmlReader performance with or without the new style storage?; December 06, 2006 3:33 PM
Ariya Hidayat said...: Stefan: of course nothing to do with style storage. I'm running *only* koxmlreadertest (as explained in Part I as well).; December 06, 2006 3:57 PM
Anonymous said...: Atomic SEO I really like this blog when I first signed up here i wasn't sure were to post...; February 19, 2007 5:32 AM
Anonymous said...: Good post.
I was searching the internet for cheap internet phone service (Voip) and found a company called Via Talk. They are cheaper and just as good as Vonage.

They are now offering 1 year phone service Free when you purchase 1 year – for a limited time. Check it out at Via Talk; February 26, 2007 6:38 PM
Anonymous said...: Nice blog.
If you’re interested in free weight loss tips please visit this site.; February 26, 2007 10:21 PM
Unknown said...: Hey you have a very good blog.

If by any chance you need weight loss help check out by blog: Weight Loss Tips; February 26, 2007 11:43 PM
Anonymous said...: FREE Business Advertising Tips The Most Powerful Internet Classified Advertising Methods On The Web! "TOP" Rated Money Making Website! A Must See!!!; March 01, 2007 2:32 AM
Anonymous said...: Nice Blog. I will continue reading........................

Tip For All Bloggers:

If I can sum up in three words how anyone or any website can totally "DOMINATE" the search engines and get to the very "TOP" of the search engines is: blogs, links, and content, period. That's all it takes. But you need "LOTS" of blogs, links and content.......................MajorEnterprise, Carael Knight

For more information, visit:

The Internet Marketing Genius, Carael KnightThe Most Powerful Internet Marketing Resources On The Web!; March 06, 2007 4:26 AM
Anonymous said...: There is something unique about your style, am not exactly sure what it is. The only thing I do not like about your site is that it is not mine. Free from all spyware; March 07, 2007 7:20 AM
Anonymous said...: A Must Have For An Online Business Millions use forms of income producing packages but the majority know that this is the greatest immediate income producer simply by posting it on your web business site. If you are needing a new website or web business visit today.; March 16, 2007 1:06 PM
Anonymous said...: This comment has been removed by a blog administrator.; March 19, 2007 6:49 PM
Anonymous said...: This comment has been removed by a blog administrator.; April 06, 2007 5:18 AM
Anonymous said...: This comment has been removed by a blog administrator.; April 10, 2007 8:23 PM
Anonymous said...: This comment has been removed by a blog administrator.; April 17, 2007 5:40 PM
Anonymous said...: This comment has been removed by a blog administrator.; May 07, 2007 11:17 AM
Anonymous said...: This comment has been removed by a blog administrator.; May 08, 2007 6:03 PM
Anonymous said...: This comment has been removed by a blog administrator.; May 11, 2007 11:42 AM

don't code today what you can't debug tomorrow

Friday, November 24, 2006

Memory efficient DOM (Part 2)

21 comments:

recent posts

recent comments

archive