Isn't "Binary XML" an oxymoron?

Lots of people use GZIP encoding with XML data to make it suitable for sending over a network. Surprise, GZIP is a _BINARY_ format! And yet, it interoperates between systems very nicely. The page you are viewing may have been GZIP encoded too on the way to your browser. Most XML parsers will accept a GZIPped XML stream and recover the original content transparently. Binary data can be just as interoperable as text data, and hugely more efficient when done right.
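The transparency is easy to demonstrate: a round trip through GZIP recovers the original XML bytes exactly. A minimal Python sketch (the document content is invented for illustration):

```python
import gzip

# GZIP is a binary format, yet the XML survives the round trip byte-for-byte.
text = b"<doc><value>42</value></doc>"
packed = gzip.compress(text)              # what actually travels over the network
assert gzip.decompress(packed) == text    # the receiver recovers the original
```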

Why not just use an efficient non-XML binary format? Using XML for everything is just misguided!

The problem with the plethora of binary file formats is that people have kept reinventing the same things over and over again, poorly. These are the misguided ones. This situation is what gave rise to XML format in the first place: people were tired of inventing one-off file formats that all required custom tools to use or edit or parse or transform or whatever. Now, one of XML’s biggest new problems is its success. People now want to use it for everything, including things that were previously only done with custom binary formats. But, imagine the flexibility of XML combined with the efficiency of binary encoding.

A problem that CubeWerx faces is that interoperable geographic-feature processing is being defined in terms of XML rather than more traditional binary formats such as Shapefile (which has its own problems). We’re faced with arrays of numbers represented as text that must be parsed as quickly as the old binary formats.

Have you ever heard of ASN.1?

ASN.1 is a very large and complex specification. It can be argued that some people who seek to use ASN.1 for this purpose do not seek to augment XML but to replace it. ASN.1 has been around almost forever, and it never managed to ‘catch fire’ the way XML did. ASN.1 has reportedly been used successfully in telecommunications standards, but is ASN.1 for everyone? There have been attempts at ISO to provide data-translation mechanisms between XML and ASN.1 representations, but the translation is not adequate for most purposes (e.g., there aren’t even any attributes as far as we can tell), and you would lose all of the XML tools and support when you switch. Good luck!

Can BXML be used to represent HTML web pages?

Probably. Obviously it can be used for any version of HTML that is derived directly from XML. We think it is also sufficient for all older versions of HTML, though we cannot be sure about every possible markup. Of course, older versions of HTML that are not valid XML will cause problems with XML parsers.

Why doesn't CWXML have a standard SAX/DOM API?

We didn’t implement a SAX-callback kind of interface because callbacks are very difficult to manage in interpreting software. We have implemented an open/read/close-paradigm parser. You can also scan by raw tokens, nodes, or subtrees. But someone could implement a callback mechanism on top of the raw token scanner if they wanted to.

The traditional DOM-style parsing has significant problems with large documents. If you examine the ‘xditoppm’ sample program, you will see that it scans in a node-oriented fashion (one node at a time) for the major structures of the document and then reads the subtrees of the individual <Scanline> elements. This combines the efficiency of node-oriented scanning with the convenience of DOM-style parsing. Only a single image scanline is ever stored in memory at a time.
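The same one-subtree-at-a-time pattern can be sketched with Python’s standard `iterparse`, used here as a stand-in for the cwxml node scanner (the tiny `<Scanline>` document is invented for illustration):

```python
import io
import xml.etree.ElementTree as ET

doc = io.BytesIO(
    b"<Image><Scanline>1 2 3</Scanline><Scanline>4 5 6</Scanline></Image>"
)

lines = []
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "Scanline":
        lines.append(elem.text)   # process one complete scanline subtree
        elem.clear()              # then discard it; one scanline in memory at a time

assert lines == ["1 2 3", "4 5 6"]
```

The `elem.clear()` call is what keeps memory flat, which is the same benefit the node/subtree scanner gives over whole-document DOM parsing.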

Also, we are not so sure how “standard” the SAX/DOM APIs really are, since different libraries seem to use different structures/functions. Finally, “standard” SAX/DOM has no mechanisms for efficiently handling numbers, blobs, and arrays, since standard XML is all text. They should have thought ahead.

Why not use a just-in-time-DOM paradigm?

In general, we think that the idea of "just-in-time" is a dangerous one: it generally implies a great deal of management, and if applications aren’t written to understand all of the idiosyncrasies of the underlying system, which change from version to version, they may perform poorly.

The node/subtree paradigm is simple to understand and has immutable semantics. You don’t need to tickle it in just the right way to make it work well.

Why not prefix the BXML elements with a size for efficient data skipping?

You have an interesting idea about allowing unwanted elements to be skipped over, but we don’t think it can be implemented practically as you describe. In order to prefix the total length of the body to an element, you either need to buffer the whole body before emitting it or you need to leave a ‘hole’ for the length as you are writing and then seek() back to the position and fill it in when you have the information. The first approach isn’t reasonable for a large document and the second approach cannot be used with a socket stream which must be written sequentially.
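The “leave a hole and seek() back” approach can be sketched as follows (a generic illustration, not the BXML wire format); the back-patch at the end is exactly the step a sequential socket cannot perform:

```python
import io
import struct

out = io.BytesIO()                         # a seekable sink; a socket is not
out.write(b"ELEM")                         # element header
hole = out.tell()
out.write(struct.pack(">I", 0))            # placeholder for the body length

body = b"payload bytes"
out.write(body)                            # stream the body without buffering it

end = out.tell()
out.seek(hole)                             # impossible on a sequential stream
out.write(struct.pack(">I", len(body)))    # back-patch the real length
out.seek(end)

assert struct.unpack(">I", out.getvalue()[4:8])[0] == len(body)
```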

Also, the information really cannot be taken advantage of when reading from a sequential socket stream. If you know that you do not want to read the current element and its total size is 60,000 bytes, you still can’t do a seek(); you must read the stream sequentially. The BXML tokenizing speed is certainly efficient enough to read a network stream at full speed, so you really haven’t gained anything in this case.

The best way to implement this feature would be as another optional index in the trailer node. For each element in the XML content, you can store its symbol-table index, starting position, and its total length using three Count numbers. You could also abbreviate the element index by omitting the lower-level elements, since seeking shorter distances gives less of a payoff than seeking longer distances. You can just scan through the shorter distances.

The index can only be utilized when reading a stream with random access, but as indicated above, this is the only case where it is useful anyway. The application would read the element index into memory and then when the parser wants to skip over a given element, it can look up the element’s file position in the element index and if it is present, the parser can seek() over the given number of bytes. If the element index is not present or the current element isn’t included, then you have to scan the element body sequentially. The stream can be written sequentially since the writer can just remember the positions and sizes of the important elements as it is writing, same as with the other index types.
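A sketch of that skip logic, with an invented index layout (a symbol-table index mapped to a start offset and byte length; the real trailer encoding would use Count numbers as described above):

```python
import io

# Hypothetical element index read from the trailer: symbol_id -> (start, length).
index = {7: (4, 11)}

def skip_element(stream, symbol_id, scan_sequentially):
    """Skip the current element, seeking past it if the index covers it."""
    entry = index.get(symbol_id)
    if entry is not None and stream.seekable():
        start, length = entry
        stream.seek(start + length)        # jump over the whole element body
    else:
        scan_sequentially(stream)          # fall back to token-by-token scanning

stream = io.BytesIO(b"HEAD<Unwanted/>TAIL")
stream.read(4)                             # parser has consumed up to the element
skip_element(stream, 7, lambda s: None)
assert stream.read() == b"TAIL"
```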

If you examine the cwxml API, you will see calls CwXmlScan_SkipToElement(), CwXmlScan_SkipToCloseTag(), and others which can benefit from this feature. Conceptually, even the CwXmlScan_ReadSubTree() function could have an option to keep only the nodes that match a given XPath expression (and skip the rest), but this would probably take more effort to implement than it is worth at the present time.

Why not make the BXML format always use an index for accessing XML content so that the content can be dynamically updated?

A huge problem with this approach is that it makes it so that the stream cannot be read or written sequentially. The first-and-foremost requirement for BXML is that it be a direct drop-in replacement for an XML file. Eliminating streamability is not an option.

Why not enable and disable compression on a per-element level so that you don't waste time trying to compress pre-compressed content (e.g. JPEG blobs)?

The present implementation doesn’t do this, but a new design would segment the BXML file into multiple user-defined "compression blocks" where each can be compressed using GZIP or be uncompressed. In this way, you could optionally not compress large blobs of pre-compressed data. Random access will also be defined relative to these compression blocks so that you can randomly seek to any compression block and access sequentially from there. For example, if you had a 1000-page manual encoded in a revised BXML, you could keep it all compressed and use a different compression block for all of the markups of each page. Then, you could randomly access each page and only uncompress the data for that page as you render it.
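The compression-block idea can be sketched as follows; the block layout and the shape of the table of contents are invented for illustration:

```python
import gzip
import io

def write_blocks(out, blocks):
    """Write (data, compress_flag) blocks; return a TOC of (offset, stored_len, flag)."""
    toc = []
    for data, flag in blocks:
        stored = gzip.compress(data) if flag else data   # leave pre-compressed blobs alone
        toc.append((out.tell(), len(stored), flag))
        out.write(stored)
    return toc

def read_block(stream, entry):
    """Randomly access one block and decompress only that block."""
    offset, stored_len, flag = entry
    stream.seek(offset)
    stored = stream.read(stored_len)
    return gzip.decompress(stored) if flag else stored

buf = io.BytesIO()
toc = write_blocks(buf, [(b"<Page n='1'>markup...</Page>", True),
                         (b"JPEGBLOB", False)])
assert read_block(buf, toc[1]) == b"JPEGBLOB"
assert read_block(buf, toc[0]).startswith(b"<Page")
```

This is the 1000-page-manual scenario in miniature: each page is its own block, so rendering one page decompresses only that page.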

Why would a for-profit company release open-source software?

CubeWerx is committed to geoprocessing interoperability and has already invested resources and money in interoperability for Geographic Information Systems with the OpenGIS® Consortium. Following the rules of the OpenGIS® Consortium, the BXML format and its specification document are therefore open. CubeWerx has a vested interest in resolving the performance problems associated with CubeWerx software products directed at the Spatial Data Infrastructure market, and in releasing efficient software products when data is represented with text-based XML. It is therefore in our best interest to promote efficient processing of XML-based scientific data.

Why use the LGPL license?

Using the LGPL license allows everyone, including corporations, to benefit from our software without allowing any proprietary interests to co-opt it. With the LGPL, everybody wins. Certain proprietary interests which consistently slam the GPL may not want you to know that the LGPL exists.

Why are GZIP-compressed BXML files around the same size as GZIP-compressed XML files?

The BXML format and the CWXML library are aimed at optimizing parser time and not compressed-document size. In fact, CubeWerx results show that the compressed file is usually around the same size for both XML and BXML. The only files that are significantly smaller are dense-numeric ones that are written properly, since GZIP is particularly bad at compressing dense numerics.
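The dense-numeric effect is easiest to see before compression: a raw double array is much smaller than its textual rendering. A hedged sketch (compressed sizes depend heavily on the data, so only the pre-compression comparison is asserted):

```python
import random
import struct
import zlib

random.seed(0)
values = [random.random() for _ in range(1000)]

as_text = " ".join(repr(v) for v in values).encode()   # XML-style textual numbers
as_binary = struct.pack(f"<{len(values)}d", *values)   # raw double array, 8 bytes each

# Binary is roughly half the size (or less) before any compression is applied.
assert len(as_binary) < len(as_text)

# After GZIP/zlib the gap narrows, since digit text compresses while raw
# doubles are nearly incompressible; measure on your own data before deciding.
_ = (len(zlib.compress(as_text)), len(zlib.compress(as_binary)))
```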

However, one thing to note is that a BXML file is about half the size of the XML file (give or take) before compression. GZIP uses quite a slow compression algorithm (and BZIP2 even slower), so you will get a significant compress/decompress speedup over XML because the expensive compress/decompress process is run on only half as much data. Documents that have a great deal of markup relative to textual content will compact well. We have a few more ideas to increase the amount of pre-compaction.

What if I have an insane hatred of binary formats? They pollute the pristine XML!

Well, the CWXML library is a high-performance XML parser that blows the doors off anything else we are aware of. And if you’re dead-set against binary encoding, the XML parser is your proof that XML parsing can still be optimized.

Most proponents of binary encoding want to add lots of other features, including: indexing, random access, random access into compressed documents, uncompressed compactness for limited devices, update-in-place, update deltas, packaging, and fragmentation. Things like indexing could be added to XML, but you couldn’t prevent users from going in and editing your XML document thereby messing up the index (damned editor-friendly format!). BXML provides pre-arranged random access, but we have developed a solution for full XPath random access using an element index materialized as a list of (symbol_id, offset, size) triples.

As for pollution, perhaps you’re not aware of the damage that the succession of "flavour of the year" validation languages has inflicted on XML. Many validating parsers won’t accept newer or older XML files if they aren’t written relative to the parser’s favorite validation language. So much for there being only ONE XML out there! And then there are character encodings. And a pending XML 1.1. In short, XML already has a pollution problem.

What do you have to say about lossless transformation to and from XML? How would you round-trip numbers like 7.3000, 7.3, +7.3, +007.3, 0007.3?

The BXML specification doesn’t address this issue specifically, but it can effectively represent all of the examples you give, since it is not bound to any schema definitions but instead lets the writer select which data type to use to represent each fragment of the output content. In this sense, it is self-describing for data types. So, in a trivial (if inefficient) way, the parser can always represent given content exactly, because it can always fall back to representing the content as a string, regardless of its schema type.

BXML doesn’t define a canonical representation for regenerating numbers, but one could be defined as whatever is equivalent to the "%.16g" printf-formatting string in C (16 significant digits), substituting the broken "INF", "-INF", and "NaN" definitions of XML Schema. Thus the above numbers could be represented either purely as strings or as "{double}000", "{double}", "+{double}", "+00{double}", and "000{double}", respectively. A parser would have a hard time optimizing these odd usages, however. Our parser would convert them to a string and then parse them as double numbers if the user called the get-content-as-double function.
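Assuming a "%.16g" canonical form, the behaviour can be sketched in Python (note that 17 significant digits are needed to guarantee exact round-tripping of arbitrary doubles, though the values below survive 16):

```python
def canonical(x: float) -> str:
    """Hypothetical canonical text form for a regenerated double."""
    return "%.16g" % x

# Leading '+' signs and padding zeros are normalized away...
assert canonical(float("7.3")) == "7.3"
assert canonical(float("+007.3")) == "7.3"

# ...but the double value itself round-trips exactly through the text form.
assert float(canonical(7.3)) == 7.3
```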

Normally, however, one wouldn’t emit XML only to parse it and generate a binary encoding since that’s a waste of time; one would emit the binary directly in the first place, making that the canonical version of the document. A web-service program wouldn’t do silly things like emit "0007.3" if it wants the value to be processed efficiently.

Overall, there are three levels of ‘losslessness’ that are probably generally useful: no loss of information (quotes, whitespace, etc. allowed to change), produces the same XML signature (allows changing of whitespace between attribute definitions, etc.), and byte-for-byte identicalness. Normally, byte-for-byte equality isn’t necessary, but it is feasible to achieve it when it is desired by keeping track of a few more things.

Normally, an intermediate translator will be told whether to preserve the content and to what degree and will have a default setting short of byte-for-byte identity. For most purposes, it doesn’t matter to the application or the user and, in the most efficient case, the document will never materialize in XML format at all, only the binary encoding. It may be important that a binary encoding be capable of preserving every byte of an XML representation, but that capability won’t normally be used.

How did you compare your parser to libxml2 and others?

We used the "testReader" program from libxml2 on our test data (which is downloadable). The libxml2 test time is for non-validated parsing: if we try the "--valid" switch, the program blows up because it doesn’t understand XML Schema (only DTD). Maybe most validating implementations will blow up every time the flavour of the year changes for validation languages! We didn’t try Xerces directly but used the comparison chart on the libxml2 web page.

Can't schema-aware parser/generators produce a more compacted pre-compressed output?

Yes, but they are also horribly complex and probably slow to process because they need to follow along inside of an arbitrarily complex schema document. But, simpler non-schema-processing systems can also achieve a similar or better amount of compaction in a simple way using a macro mechanism. We have designed a simple byte-oriented macro mechanism that can be used to encode record-oriented structures very compactly, and it can be processed at the very low level of an input buffer filler.

In the body of a BXML file, you can define a sequence of bytes as a macro, with substitution slots for fixed blocks, variable-length blocks, and variable-length blocks that include the byte-length ‘Count’ value.

So, for example, if you were to define an XML image format like "<Scanline><Pixel><Red>123</Red>…", you could define a macro for the whole <Pixel> definition, including instance-data byte references for the red, green, and blue component values. When using it, you would say to repeat this macro, say, 1024 times, and you would only need to supply the three single-byte color-component values for each of the 1024 pixels. This is pretty much optimally compact, since the data effectively materializes as a raw array.

However, it will be much slower to process than using a raw array as in the XDI format, since the full element-oriented encoding will need to be regenerated and consumed in a general parser, so you should avoid doing something like that. But you can see the compactness.

You can do something similar for general record-oriented data by referring to instance string bodies (including lengths), numbers, or general content. And you can define any number of macros using a string-table-like mechanism and refer to them arbitrarily, to handle variant records or whatnot. You would only want this kind of compaction for lower-level structures, since there won’t be much overhead from less-frequently-used higher-level structures (e.g., the <Scanline> and higher elements can be encoded in the normal way without making the file much bigger).
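The macro idea can be sketched at a high level; the template syntax, hole markers, and expansion function below are invented for illustration (the real mechanism is byte-oriented and runs at the input-buffer level):

```python
# A macro is a template with byte-sized holes; an invocation supplies a
# repeat count plus the per-instance bytes for each repetition.
MACRO = b"<Pixel><Red>%r</Red><Green>%g</Green><Blue>%b</Blue></Pixel>"

def expand(template: bytes, repeat: int, instance_values) -> bytes:
    """Regenerate the full element-oriented encoding from macro instances."""
    out = bytearray()
    it = iter(instance_values)
    for _ in range(repeat):
        chunk = template
        for hole in (b"%r", b"%g", b"%b"):
            chunk = chunk.replace(hole, str(next(it)).encode(), 1)
        out += chunk
    return bytes(out)

pixels = [123, 45, 67, 89, 10, 11]        # two RGB pixels, one value per component
xml = expand(MACRO, 2, pixels)
assert xml.count(b"<Pixel>") == 2
assert b"<Red>123</Red>" in xml
```

The stored form is just the macro definition plus the six component values, which is why the compaction approaches that of a raw array even though a general parser still sees full elements after expansion.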

A very compact token scheme is here
