|
|
FAQ
Isn't "Binary XML" an oxymoron? Why not just use an efficient non-XML binary format? Using XML for everything is just misguided! A problem that CubeWerx is being faced with is that interoperable geographic-feature processing is being defined in terms of XML rather than more traditional binary formats such as Shapefile (which has its own problems). We're faced with arrays of numbers represented in text that must be parsed as quickly as the old formats. Have you ever heard of ASN.1? Can BXML be used to represent HTML web pages? Probably. Obviously it can be for any version of HTML that is derived directly from XML. But we think it is also sufficient for all older versions of HTML, though we cannot be sure about every possible markup. Obviously, older versions of HTML that are not valid XML will cause problems with XML parsers. Why doesn't CWXML have a standard SAX/DOM API? The traditional DOM-style parsing has significant problems with large documents. If you examine the 'xditoppm' sample program, you will see that it scans in a node-oriented fashion (one node at a time) for the major structures of the document and then reads the subtrees of the individual <Scanline> elements. This combines the efficiency of node-oriented scanning with the convenience of DOM-style parsing. Only a single image scanline is ever stored in memory at a time. Also, we not so sure how "standard" the SAX/DOM APIs really are since different libraries seem to use different structures/functions. Finally, "standard" SAX/DOM has no mechanisms for efficiently handling numbers, blobs, and arrays since standard XML is all text. They should have thought ahead. Why not use a just-in-time-DOM paradigm? The node/subtree paradigm is simple to understand and has immutable semantics. You don't need to tickle it in just the right way to make it work well. Why not prefix the BXML elements with a size for efficient data skipping? Also, the information really cannot be taken advantage of when reading from a sequential socket stream. If you know that you do not want to read the current element and its total size is 60,000 bytes, you still can't do a seek(); you must read the stream sequentially. The BXML tokenizing speed is certainly efficient enough to read a network stream at full speed, so you really haven't gained anything in this case. The best way to implement this feature would be as another optional index in the trailer node. For each element in the XML content, you can store its symbol-table index, starting position, and its total length using three Count numbers. You could also abbreviate the element index by omitting the lower-level elements, since seeking shorter distances gives less of a payoff than seeking longer distances. You can just scan through the shorter distances. The index can only be utilized when reading a stream with random access, but as indicated above, this is the only case where it is useful anyway. The application would read the element index into memory and then when the parser wants to skip over a given element, it can look up the element's file position in the element index and if it is present, the parser can seek() over the given number of bytes. If the element index is not present or the current element isn't included, then you have to scan the element body sequentially. The stream can be written sequentially since the writer can just remember the positions and sizes of the important elements as it is writing, same as with the other index types. If you examine the cwxml API, you will see calls CwXmlScan_SkipToElement(), CwXmlScan_SkipToCloseTag(), and others which can benefit from this feature. Conceptually, even the CwXmlScan_ReadSubTree() function could have an option to keep only the nodes that match a given XPath expression (and skip the rest), but this would probably take more effort to implement than it is worth at the present time. Why not make the BXML format always use an index for accessing XML content so that the content can be dynamically updated? Why not enable and disable compression on a per-element level so that you don't waste time trying to compress pre-compressed content (e.g. JPEG blobs)? Why would a for-profit company release open-source software? Why use the LGPL license? Why are GZIP-compressed BXML files around the same size as GZIP-compressed XML files? However, one thing to note is that a BXML file is about half the size of the XML file (give or take) before compression. GZIP uses quite a slow compression algorithm (and BZIP2 even slower), so you will get a significant compress/decompress speedup over XML because the expensive compress/decompress process is run on only half as much data. Documents that have a great deal of markup relative to textual content will compact well. We have a few more ideas to increase the amount of pre-compaction. What if I have an insane hatred of binary formats? They pollute the pristine XML! Most proponents of binary encoding want to add lots of other features, including: indexing, random access, random access into compressed documents, uncompressed compactness for limited devices, update-in-place, update deltas, packaging, and fragmentation. Things like indexing could be added to XML, but you couldn't prevent users from going in and editing your XML document thereby messing up the index (damned editor-friendly format!). BXML provides pre-arranged random access, but we have developed a solution for full XPath random access using an element index materialized as a list of (symbol_id, offset, size) triples. As for pollution, perhaps you're not aware of the damage that multiple passing "flavours of the year" worth of validation languages have inflicted on XML. Many validating parsers won't accept newer or older XML files if they aren't written relative to the parser's favorite validation language. So much for there being only ONE XML out there! And then there are character encodings. And a pending XML 1.1. In short, XML already has a pollution problem. What do you have to say about lossless tranformation to and from XML? How would you round-trip numbers like 7.3000, 7.3, +7.3, +007.3, 0007.3? BXML doesn't define a canonical representation for regenerating numbers, but it could be defined as something like whatever is equivalent to the "%.16g" printf-formatting string in C (16 significant digits), substituting the broken "INF", "-INF" and "NaN" definitions of XML Schema. This the above numbers could be represented as purely strings or as "{double}000", "{double}", "+{double}", "+00{double}", and "000{double}", respectively. A parser would have a hard time optimizating these odd usages, however. Our parser would convert them to a string and then parse them as double numbers if the user called the get-content-as-double function. Normally, however, one wouldn't emit XML only to parse it and generate a binary encoding since that's a waste of time; one would emit the binary directly in the first place, making that the canonical version of the document. A web-service program wouldn't do silly things like emit "0007.3" if it wants the value to be processed efficiently. Overall, there are three levels of 'losslessness' that are probably generally useful: no loss of information (quotes, whitespace, etc. allowed to change), produces the same XML signature (allows changing of whitespace between attribute definitions, etc.), and byte-for-byte identicalness. Normally, byte-for-byte equality isn't necessary, but it is feasible to achieve it when it is desired by keeping track of a few more things. Normally, an intermediate translator will be told whether to preserve the content and to what degree and will have a default setting short of byte-for-byte identity. For most purposes, it doesn't matter to the application or the user and, in the most efficient case, the document will never materialize in XML format at all, only the binary encoding. It may be important that a binary encoding be capable of preserving every byte of an XML representation, but that capability won't normally be used. How did you compare your parser to libxml2 and others? Can't schema-aware parser/generators produce a more compacted pre-compressed output? In the body of a BXML file, you can define a sequence of bytes as a macro and define the substitution of fixed blocks, variable-length blocks, and variable-length blocks including the byte-length 'Count' value. So, for example, if you were to define an XML image format like: "<Scanline><Pixel><Red>123</Red>...", you could define a macro for the whole <Pixel> definition including instance-data byte references for the red, green, and blue component values. When used, you would say to repeat this macro, say, 1024 times and you would only need to supply the three single-byte color-component values for each of the 1024 pixels. This is pretty much optimally compact, since the data effectively materialzes as a raw array. However, it will be much slower to process than using a raw array as in the XDI format, since the full element-oriented encoding will need to be regenerated and consumed in a general parser, so you should avoid doing something like that. But, you can see the compactness. You can do something similar for general record-oriented data by referring to instance string bodies (including lengths), numbers, or general content. And you can define any number of macros using a string-table-like mechanism and refer to them arbitrarily to handle variant records or whatnot. You would only want this kind of compaction for lower-level structures since there won't be much overhead resulting from less-frequently-used higher-level structures (e.g., the <Scanline> and higher elements can be encoded in the normal way without making the file much bigger). A very compact token scheme is here. |