My blog has been moved to

Wednesday, November 08, 2006

Memory efficient DOM

A known problem with KOffice applications is the inability to cope with very large data (e.g. bug #59510). This is often related to filters, e.g. when you are importing Excel documents (see bug #85372). Fortunately, as I wrote there, this has been solved already as the filter now can produce large documents without causing the machine to consume every single bytes of RAM.

Still, a very large file can force the application to use hundred MB memory. One of the reason is because we use QDom to hold some DOMs associated with the document. Unfortunately, once the DOM gets very big, QDom would take too much memory space. And once some parts are paged out to swap, performance will degrade significantly. So when people claimed that opening 1 MB file (which is actually the compressed size, not the actual size of the XML document) consumes more than 300 MB memory, it is not exaggerated. After looking at some colorful charts below, you would hardly be astonished.

When the document is loaded, we do not need the DOMs anymore and these can be destroyed. Thus, although loading takes quite some time, once the loading process is finished, the application won't take so much memory anymore.

This topic has been discussed couple of times in the past. Last year, I started to work on a memory efficient implementation of DOM, designed to be used in KOffice. After some attempts, looks like it may after all work. And quite promising indeed.

So, how can this DOM thing be improved?

First, it's likely that we do not need to modify any elements once they are loaded from the file. The way it works is always like this: get the document, load some XML files from the storage (e.g. content.xml, settings.xml, styles.xml, ...), parse them and construct the application-specific objects from those XML files. For example, for a spreadsheet this means that we create all the sheets, rows, columns, cells, and other objects. When the application saves the document, it creates the DOM from scratch and write it to the file. Thus, when opening a file, the corresponding DOMs are not modified at all.

Second, many namespace URIs, attribute names and element values are duplicated. That alone is already a very good reason to cache common strings. For example, in OpenDocument spreadsheet, almost every cell have attributes named table:style-name and office:value-type so we do not need to duplicate them for every single cells.

The replacement for QDom, dubbed as KoXmlReader, is using these tricks. The whole DOM tree is stored in a very compact form (where the overhead per node is minimized and all duplicated strings are stored efficiently), not using all instances of nodes, elements, texts, etc. However, there are KoXmlNode, KoXmlElement, KoXmlDocument classes which correspond to their QDom counterparts. The key is, those DOM nodes are created (from the compact representation) only when we need it and afterwards they can be destroyed again. In short: it's constructed only on-demand. This way, compatibility with QDom classes is preserved (otherwise we need to modify and test thousand lines of code) while the memory requirement can be kept to minimum.

(Side note: yes, we considered using a SAX parser directly. But that means changing a lot of code and is prone to mistakes. And seems nobody is willing to do that. My personal feeling is that the gained performance is not worth the hassle, though but I'm not so sad if I'm wrong here).

How can this help? Say we're dealing with a spreadsheet. When we parse the first sheet we need all the nodes and elements belong to this sheet, but we don't care for nodes and elements associated for the other sheets. So we just load what we need (this loading even happens automagically). After we finish with it, we can move to the next sheet and at the same purge everything from the first sheet (cause we don't bother with it anymore).

I tried to make the API for node, element, text etc very much compatible to QDom. There is even a compiler define so that it just wraps all those QDom classes (to ease transition), which allowed Stefan to change all QDom references with KoXml in one amazing sweep. Unit tests - with more than 1000 checks - provide (hopefully) almost 100% coverage and should help catching bugs or incompatibilities as early as possible (also, I notice that I have one more example of what I wrote some time ago: the whole test suite is longer than code itself).

So how does it perform?

Obviously this is not a scientifically correct benchmark, but just to give some illustrations, here are few spreadsheet documents that I tested (actually I have tried many many documents, these are just some of them). The column Content specifies the size of the content.xml of each document. As you can see, once it's uncompressed, the file can be very big. Other columns are self-explained, I hope.

Document File Size Content Sheets Cells
permfall.ods 18 KB 92 KB 4 271
test.ods 18 KB 144 KB 13 1138
Customer_Solution.ods 90 KB 1478 KB 18 11735
Financial_Report.ods 143 KB 2947 KB 24 8913
bug85372.ods 247 KB 4754 KB 81 347193
Frequencies.ods 627 KB 18952 KB 3 185334

So what happen if we load each of this document, give it to our XML parser, construct the DOM tree and iterate every single cell in the document? We won't load it into KSpread, and we won't create any data from that. We would just treat it like a plain DOM representation of a spreadsheet.

Using Valgrind's Massif, here what I got when I processed one of them using all QDom classes (QDomDocument, QDomElement, etc). As you witness, the memory consumption skyrocketed to around 250 MB although the compressed ODS file is not even 1 MB. In reality (i.e. within KOffice), of course the situation is much worse because KOffice needs also some more megabytes for its own code and data.

And this is how it looks like when I patch QDom so that the strings are cached (get the simple patch, if you're curious). It does not seem to help too much in this case.

However, using KoXmlReader stuff (KoXmlDocument, KoXmlElement, etc), at most we need only around 40 MB (see the vertical axis).

And follows is the complete comparison chart for other documents. Note that I omit Frequencies.ods - which you can already enjoy from the graphs above - because the bars for that file would be very very long compared to the others.

(There are perhaps better ways to instrument memory usage, so please take my lame attempt above with a grain of salt).

So we can see that the family of KoXml classes can be useful. But, what are the catches? For sure, we can't change the DOM tree. That is quite clear. However, as I write in the beginning, this does not matter a lot for KOffice. In other words, you can't just simply substitute your QDom with this KoXmlReader. It is designed and implemented for a very specific use case. It's not even a fully compliant implementation.

Another disadvantage is that iterating over nodes is a bit slower than QDom. The good news: parsing is faster. This is because the whole process is likely memory bound. True, KoXmlReader has some processing overhead while packing and unpacking the nodes (i.e. constructing them on-the-fly). But it definitely also accesses only a fraction of RAM compared to QDom, and we all know how slow memory operation is.

A few runs on P4 1.8 GHz gives the following chart. The light color is for parsing, the darker is for iterating. Looks like KoXmlReader does not really lag behind.

This also makes me rather excited, I wonder whether my next attempt (using additional real-time compression, wait for the Part II of this writing) would still give an acceptable performance in term of speed.

There are still few issues which must be solved before this stuff can be fully utilized in the upcoming KOffice 2.0. I'm working on that. But just for appetizer, here is the memory chart running KSpread and opening one of the test document (Customer_Solution.ods). Using QDom, the peak was about 110 MB before it settled down to around 80 MB when the loading is completed. I have no idea what caused the deep spike in the middle, this must be investigated (Is it really from KSpread? Or possibly Valgrind?).

However, when KoXmlReader stuff is used instead of QDom, the heap allocation (during the parsing and iterating) touched just about 90 MB, as evidenced below:

Please do not assume that 80 MB is what KSpread 2.0 needs to work with such simple document. What I am using is a rather work-in-progress version, so it has lots of extra stuff. KOffice 2.0 is still very far from its release date and there are still lots of on-going work on optimizations.

To be continued...


Stefan Nikolaus said...

> I have no idea what caused the deep
> spike in the middle, this must be
> investigated (Is it really from
> KSpread? Or possibly Valgrind?).

If you have started opening the file from a workbook, this is possibly due to deleting the view (and all the related stuff) for this workbook and creating a new one. After loading a new view is created, even if the workbook was empty.

You can check this by either opening the file directly through the open dialog on startup (prefered) or by modifying the old workbook (this should ensure, that it is not deleted, but you have at least two views in memory).

Anyway, great work both, the mem efficient DOM and this analysis. I'm already drooling over your next steps. :-)

Robert said...

We need more posts like this on planetkde.

Anonymous said...

I think it is similiar to the VTD-XML parser, or if you can use some of their ideas, here you are the URL:

Good Work.

Ariya Hidayat said...

Stefan: I doubt it's from the view, since I started KSpread directly. Beside, KSpread::Cluster (the pink region) was already building-up (during the parsing) but suddently vanished for a very short moment.

Anonymous: it's different. VTD-XML requires that the whole XML is held in memory, which is similar to my very first KoXmlReader attempt. The so-called multiplication factor of KoXmlReader should even be better (and wait until I talk about additional compression).

Thomas Zander said...

cool stuff, man :)

Looking forward to using it :)