API Design

SAX sux because it is backwards and too low-level. DOM is dum because it can blow up on large documents. The CWXML API has been designed to overcome these limitations while still being convenient and efficient. The API has three modes of operation: Raw Tokenizer, Whole-Document (DOM-like), and Node/Subtree. The Raw Tokenizer mode is "consumer-pull" rather than "producer-push" (as in SAX) and could, in fact, be inserted into the bottom level of other parser libraries to make them more efficient. The Whole-Document (DOM-like) mode should only be used when you are confident that the document isn't too large. In this mode, the document can be accessed using limited-XPath expressions or by traversing the parse tree node-by-node. Binary numbers and arrays are also preserved through the API; it would not be very smart to translate them to text only for your program to have to translate them back to binary, especially when reading BXML input. Our experience is that BXML binary arrays can be parsed about 60 times faster than XML text-encoded numeric arrays.
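To make the "consumer-pull" versus "producer-push" distinction concrete, here is a minimal sketch using Python's standard library rather than CWXML itself. The names DOC, Collector, pushed_names, and pulled_names are made up for illustration, and xml.dom.pulldom is only a stand-in for the idea of a pull interface, not for the Raw Tokenizer API.

```python
import xml.sax
from xml.dom import pulldom

DOC = "<a><b>hi</b><c/></a>"  # tiny illustrative document


# Producer-push (SAX style): the parser drives and calls back into the handler.
class Collector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def pushed_names(text):
    handler = Collector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


# Consumer-pull: the caller's loop drives and asks for the next token.
def pulled_names(text):
    names = []
    for event, node in pulldom.parseString(text):
        if event == pulldom.START_ELEMENT:
            names.append(node.tagName)
            if node.tagName == "b":
                break  # the consumer can simply stop asking for tokens
    return names
```

In the push version, state has to live in a handler object because the parser owns the loop; in the pull version, the loop belongs to the caller, so stopping early or branching on context is ordinary control flow.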

The Node/Subtree mode is what you want for arbitrarily large documents. Such documents will normally have one or more element types somewhere near the top that are repeated a large number of times, which is what gives the document its bulk. In the Node/Subtree mode, you can read the outer nesting layers one node/token at a time and then read the complete subtrees of the elements that you know aren't too large. Each subtree can be accessed with the DOM mechanisms, viewing only that subtree. When you advance to the next node, the previous subtree is discarded. API functions are available to make the scanning & skipping more convenient. The attributes of an element are read in with the element node, for convenience, since they aren't likely to be too large.

For example, in the sample XML-image format defined for testing CWXML, the document has an arbitrary number of <Scanline> elements inside the <XmlDemoImage> root element (plus other stuff). If you tried to read the document in the Whole-Document (DOM-like) mode, you might run out of memory. However, if you scan along the outer levels node-by-node and read the subtrees of only the <Scanline> elements, you will only ever hold a single scanline in memory at once. If your application only needs to access one scanline at a time, you're all set--you get the convenience of DOM without wasting the memory. The paradigm is fairly simple too; there's no weird, hidden, complex handling of subtrees that you need to tickle in just the right way to keep it from blowing up.
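The scan-then-expand pattern described above can be sketched with Python's xml.dom.pulldom, which offers a similar idea (pull outer nodes, then expand one subtree on demand). This is an analogy under stated assumptions, not the CWXML API: only the element names <XmlDemoImage> and <Scanline> come from the text, while the document content, the width attribute, the <Title> element, and the function scanline_sums are hypothetical.

```python
from xml.dom import pulldom

# Hypothetical stand-in for the test image format; only the element names
# XmlDemoImage and Scanline are taken from the description above.
DOC = (
    "<XmlDemoImage width='3'>"
    "<Title>demo</Title>"
    "<Scanline>1 2 3</Scanline>"
    "<Scanline>4 5 6</Scanline>"
    "</XmlDemoImage>"
)


def scanline_sums(text):
    """Walk the outer levels node by node, expanding only Scanline subtrees."""
    sums = []
    events = pulldom.parseString(text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "Scanline":
            # Read the complete subtree of this one element into a small DOM.
            events.expandNode(node)
            content = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            sums.append(sum(int(value) for value in content.split()))
            # The next loop iteration moves on; this subtree is no longer needed.
    return sums
```

Elements other than Scanline are stepped over as plain events and never expanded, so the only DOM structure built at any moment is the scanline currently being processed.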

The parser is non-validating, but the API provides various means for conveniently validating the document manually as you interpret the content. We believe that schema-based validation is overrated and probably quite dangerous from a security perspective. A validating parser will tell you if a document matches the schema, but it cannot guarantee that the schema is what it's supposed to be. Someone could accidentally or maliciously substitute the schema referenced, or could hack or 'middle-man' some remote computer that supposedly contains an authoritative canonical schema. If your interpreter relies on validation too heavily, you could easily create many security holes in your system. Perhaps one day, schema-trust hacking will be as common as e-mail viruses. We assert that the proper place for validation is as a debugging tool.
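As an illustration of validating manually while interpreting the content, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. It does not use CWXML's validation helpers; the function read_image_header, the width attribute, and the max_dim limit are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET


def read_image_header(text, max_dim=65536):
    """Return the image width, checking every assumption as it is used.

    The 'width' attribute and the max_dim limit are hypothetical.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc

    # Check the structure the interpreter depends on, right where it is read.
    if root.tag != "XmlDemoImage":
        raise ValueError(f"unexpected root element: {root.tag}")

    raw = root.get("width")
    if raw is None:
        raise ValueError("missing 'width' attribute")
    try:
        width = int(raw)
    except ValueError:
        raise ValueError(f"'width' is not an integer: {raw!r}") from None

    # Range-check before the value is used to size anything.
    if not 0 < width <= max_dim:
        raise ValueError(f"'width' out of range: {width}")
    return width
```

The checks live in the code that consumes the values, so their correctness does not depend on whichever schema a document happens to reference.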