What is CWXML?
CWXML is a high-performance, open-source C library for parsing and generating XML and BXML (described below) through a straightforward API. Initial testing indicates that CWXML parses XML (text data) at least 3 times faster than other popular libraries such as expat and libxml2 (and about 10 times faster than Xerces), and that it parses BXML data 5 to 60 times faster again. The library is being developed by CubeWerx as the reference implementation for the proposed BXML specification. The parser accepts and automatically recognizes the following formats: XML, GZIPped XML, BXML, BXML with internal GZIP, and BXML with external GZIP. It is licensed under the GNU LGPL.
What is BXML?
BXML (Binary eXtensible Markup Language) is a straightforward, open, patent-unencumbered binary encoding of XML data. A BXML file is a stand-alone, work-alike, drop-in replacement for an XML file, and it mirrors the XML markup structures in a way that is similar to the in-memory representations used by many parser libraries. BXML was developed by CubeWerx Inc. for the OpenGIS® Consortium. It makes all XML documents more compact and more efficient to parse and generate by using a symbol table for element and attribute names and by length-prefix encoding all arbitrary-length structures (strings, blobs, arrays). It makes dense-numeric XML documents especially efficient by allowing raw arrays of several common numeric types. For example, imagery data can be handled in BXML just as well as, if not better than, in PNG format. A numeric array can pass from end to end in a client/server environment as a raw chunk of data without ever being re-coded. Dense numeric data also compresses faster and more compactly when encoded in binary rather than text. BXML can also support random access.
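To make the length-prefix idea concrete, here is a minimal sketch in C. It is NOT the actual BXML wire format: the 4-byte little-endian count, the host-byte-order payload, and the function names put_u16_array, get_u16_array, and put_text_array are all assumptions made up for illustration. It only shows why a length-prefixed raw array can be copied in one step, while the text form has to be formatted (and later parsed) number by number.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical layout, not real BXML: a 4-byte little-endian element
   count followed by the raw values in host byte order. */
static size_t put_u16_array(uint8_t *out, const uint16_t *vals, uint32_t n)
{
    out[0] = (uint8_t)(n & 0xff);
    out[1] = (uint8_t)((n >> 8) & 0xff);
    out[2] = (uint8_t)((n >> 16) & 0xff);
    out[3] = (uint8_t)((n >> 24) & 0xff);
    memcpy(out + 4, vals, (size_t)n * sizeof vals[0]);  /* one bulk copy */
    return 4 + (size_t)n * sizeof vals[0];
}

/* Read the count, then copy the payload; no per-number parsing.
   Returns the element count, or 0 if it would not fit in vals. */
static uint32_t get_u16_array(const uint8_t *in, uint16_t *vals, uint32_t max)
{
    uint32_t n = (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
                 ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
    if (n > max)
        return 0;
    memcpy(vals, in + 4, (size_t)n * sizeof vals[0]);
    return n;
}

/* The same values as space-separated decimal text, as they would
   appear in XML character data. Returns the number of characters. */
static size_t put_text_array(char *out, size_t cap,
                             const uint16_t *vals, uint32_t n)
{
    size_t used = 0;
    for (uint32_t i = 0; i < n; i++) {
        int w = snprintf(out + used, cap - used,
                         i ? " %u" : "%u", (unsigned)vals[i]);
        assert(w > 0 && (size_t)w < cap - used);  /* caller sized the buffer */
        used += (size_t)w;
    }
    return used;
}
```

In this sketch the reader knows the size of the array before touching its contents, so it can allocate once, copy once, or skip the whole array without scanning for a terminator.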
SAX sux because it is backwards and too low-level. DOM is dum because it can blow up on large documents. The CWXML API has been designed to overcome these limitations while still being convenient and efficient. The API has three modes of operation: Raw Tokenizer, Whole-Document (DOM-like), and Node/Subtree. The Raw Tokenizer mode is "consumer-pull" rather than "producer-push" (SAX) and could, in fact, be inserted into the bottom level of other parser libraries to make them more efficient. The Whole-Document (DOM-like) mode should only be used when you are confident that the document isn’t too large. In this mode, the document can be accessed using limited-XPath expressions or by traversing the parse tree node by node. Binary numbers and arrays are also preserved through the API: it would make little sense to translate them to text only for your program to translate them back to binary, especially when reading BXML input. Our experience is that BXML binary arrays can be parsed about 60 times faster than XML text-encoded numeric arrays.
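The difference between "consumer-pull" and "producer-push" is easiest to see in code. The following is a minimal sketch of a pull-style tokenizer in C, not the CWXML API: the names Scanner, Token, next_token, and max_depth are hypothetical, and the toy scanner only understands plain start tags, end tags, and text (no attributes, comments, or entities). The point is the control flow: the consumer owns the loop and asks for the next token when it wants one, instead of registering callbacks that a parser invokes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_START, TOK_END, TOK_TEXT, TOK_EOF } TokType;

typedef struct {
    TokType     type;
    const char *s;    /* points into the input; not NUL-terminated */
    size_t      len;
} Token;

typedef struct { const char *p; } Scanner;   /* current read position */

/* Pull one token. The caller decides when, and whether, to call again. */
static TokType next_token(Scanner *sc, Token *t)
{
    const char *p, *end;

    assert(sc != NULL && t != NULL);
    p = sc->p;
    if (*p == '<' && (end = strchr(p, '>')) != NULL) {
        int closing = (p[1] == '/');
        t->type = closing ? TOK_END : TOK_START;
        t->s    = p + 1 + closing;           /* skip "<" or "</" */
        t->len  = (size_t)(end - t->s);
        sc->p   = end + 1;
    } else if (*p != '\0' && *p != '<') {
        end = strchr(p, '<');
        if (end == NULL)
            end = p + strlen(p);
        t->type = TOK_TEXT;
        t->s    = p;
        t->len  = (size_t)(end - p);
        sc->p   = end;
    } else {
        /* end of input, or an unterminated tag: stop here */
        t->type = TOK_EOF;
        t->s    = p;
        t->len  = 0;
    }
    return t->type;
}

/* Example consumer: an ordinary loop with ordinary local state,
   rather than state smuggled between callbacks. */
static int max_depth(const char *xml)
{
    Scanner sc = { xml };
    Token   t;
    int     depth = 0, deepest = 0;

    while (next_token(&sc, &t) != TOK_EOF) {
        if (t.type == TOK_START) {
            if (++depth > deepest)
                deepest = depth;
        } else if (t.type == TOK_END) {
            depth--;
        }
    }
    return deepest;
}
```

Because the consumer drives the loop, it can stop early, skip ahead, or hand the scanner to a helper function that reads a few tokens and returns, none of which is natural with push-style callbacks.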
The Node/Subtree mode is what you want for arbitrarily large documents. Such documents normally have one or more element types somewhere near the top that are repeated a large number of times, which is what gives the document its bulk. In the Node/Subtree mode, you can read the outer nesting layers one node/token at a time and then read the complete subtree of each element that you know isn’t too large. The subtrees can be accessed with the DOM mechanisms, viewing only the subtree. When you advance to the next node, the previous subtree is discarded. API functions are available to make the scanning and skipping more convenient. The attributes of an element are read in together with the element node, for convenience, since they aren’t likely to be too large.
For example, in the sample XML-image format defined for testing CWXML, the document has an arbitrary number of <Scanline> elements inside the <XmlDemoImage> root element (plus other stuff). If you tried to read the document in the Whole-Document (DOM) mode, you might run out of memory. However, if you scan along the outer levels node by node and read the subtrees of only the <Scanline> elements, you will only ever hold a single scanline in memory at once. If your application only needs to access one scanline at a time, you’re all set: you get the convenience of DOM without wasting the memory. The paradigm is fairly simple too; there’s no weird, hidden, complex handling of subtrees that you need to tickle in just the right way to keep it from blowing up.
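The memory behaviour described above can be sketched in a few lines of C. This is only a stand-in for the idea, not the CWXML API: the names Cursor, next_scanline, and sum_all_scanlines are hypothetical, the input is an in-memory string rather than a stream, the "subtree" is just the text between <Scanline> and </Scanline>, and the 256-byte buffer size is an arbitrary assumption. What it does show is the shape of the loop: skip along the outer level, load one subtree into a reusable buffer, use it, and let the next one overwrite it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SUBTREE 256   /* assumed upper bound on one scanline's size */

typedef struct { const char *p; } Cursor;   /* position in the outer level */

/* Skip forward to the next <Scanline> and copy its content into buf.
   Returns 1 if a subtree was loaded, 0 at end of document or if the
   subtree does not fit. Everything else at the outer level is skipped. */
static int next_scanline(Cursor *c, char *buf, size_t cap)
{
    static const char open_tag[]  = "<Scanline>";
    static const char close_tag[] = "</Scanline>";
    const char *start, *end;
    size_t len;

    assert(cap > 0);
    start = strstr(c->p, open_tag);
    if (start == NULL)
        return 0;
    start += sizeof open_tag - 1;
    end = strstr(start, close_tag);
    if (end == NULL)
        return 0;
    len = (size_t)(end - start);
    if (len >= cap)
        return 0;
    memcpy(buf, start, len);     /* overwrites the previous subtree */
    buf[len] = '\0';
    c->p = end + sizeof close_tag - 1;
    return 1;
}

/* Example consumer: add up every pixel value, one scanline at a time.
   Only the single `line` buffer is ever held, however long the document. */
static long sum_all_scanlines(const char *doc, int *count)
{
    Cursor c = { doc };
    char   line[MAX_SUBTREE];
    long   total = 0;

    *count = 0;
    while (next_scanline(&c, line, sizeof line)) {
        char *s = line, *e;
        for (;;) {
            long v = strtol(s, &e, 10);
            if (e == s)          /* no more numbers in this scanline */
                break;
            total += v;
            s = e;
        }
        (*count)++;
    }
    return total;
}
```

The working set here is one scanline no matter how many scanlines the document contains, which is the property the Node/Subtree mode is meant to give you.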
The design and performance of the BXML format and the CWXML library are characterized in this report.
There’s no external API documentation available yet, but the header files are quite well commented. You’ll want to read "cw_xmlscan.h" and "cw_xmltree.h" in the distribution for the parser and "cw_xmlgen.h" for the generator.
Some test programs are included with the distribution in the "cwxmlutil" directory so you can see some examples of using the API.
The FAQ is available here.
Production version 4.0.5 (712 KB) is available for download. Be sure to start by reading the "README" file in the distribution. It will build on Linux, Solaris, Alpha/OSF1, and Windows, though you may run into some problems building on Windows.
The distribution comes with the test programs "xmlscan", "ppmtoxdi", and "xditoppm". The xmlscan program translates from XML/BXML input to XML/BXML output, with controllable pretty re-formatting of XML output. The ppmtoxdi and xditoppm programs convert images between PPM format and the made-up XDI format for performance testing.
You can also download additional test data (5.5 MB) so that you can confirm the impressive performance results we show in our report mentioned above. Extract it while you are in the cwxml root directory. Be sure to make with BUILD=optimize when performance testing.
CubeWerx Inc. is an independent software vendor for Web-enabled geo-spatial database systems. Our interest in BXML is to make the transport and usage of XML-encoded geo-spatial data over the Web as efficient as possible.