BXML version 0.0.9

The 'Count' type is defined in an independent space from the token space, as follows:

CODES     MEANING
-------   --------------
00 - EF   literal count values 0 to 239
F0 - FB   plus 1 byte: count values 240 to 3311
FC        plus 2 bytes: count values that fit in uint16
FD        plus 4 bytes: count values that fit in int32, negative illegal
FE        plus 8 bytes: count values that fit in int64, negative illegal
FF        special 'null' value, normally illegal

File organization:
--header
--sequence of physical content blocks
--trailer
--random-access stream offsets refer to logical-content

Trailer node: can include the following indexes:
--content blocks {physical offset, logical-content length}
--string-table-fragment index {base symbol index, logical-content offset}
--macro index
--explicit XPath index {XPath, {value, {logical offset}*}*}
--element index {symbol_id, logical offset, length}
allow most of the content of the trailer block to be compressed

Tokens are defined according to the following codes:

CODES     MEANING
-------   --------------
00 - 1F   integer content: integer in range of 0 to 31
20 - 7E   literal character content: unicode between 0x20 and 0x7E
7F        string content: string-table reference
80        string content: followed by byte length, followed by content
81 - 9F   string content: byte length 1 to 31, followed by content
A0        scalar boolean content: "0"
A1        scalar boolean content: "1"
A2        scalar boolean content: "false"
A3        scalar boolean content: "true"
A4        boolean array content: "0" and "1" (bitpacked)
A5        boolean array content: "false" and "true" (bitpacked)
A6        scalar number content: uint8
A7        scalar number content: int16
A8        scalar number content: uint16
A9        scalar number content: int32
AA        scalar number content: int64
AB        scalar number content: float using 'e'
AC        scalar number content: float using 'E'
AD        scalar number content: double using 'e'
AE        scalar number content: double using 'E'
AF        reserved
B0        numeric array content: uint8
B1        numeric array content: int16
B2        numeric array content: uint16
B3        numeric array content: int32
B4        numeric array content: int64
B5        numeric array content: float using 'e'
B6        numeric array content: float using 'E'
B7        numeric array content: double using 'e'
B8        numeric array content: double using 'E'
B9        reserved
BA        content: potentially insignificant whitespace
BB        content: denormalized whitespace (CR+LF/LF)
BC        content: blob
BD        content: general macro reference
BE        content: macro reference to macro index 0, count = 1
BF        reserved
C0        string-table fragment: length, namespace per entry
C1 - CF   string-table fragment: implied length 1 to 15, namespace
D0        string-table fragment: length, no namespaces
D1 - DF   string-table fragment & content: implied length 1 to 15, no namespace
E0        element: no-attr empty element start
E1        element: no-attr content element start
E2        element: general element-tag start
E8        element: empty-element-tag end .../>
E9        close: close tag
EA        close: general close tag
EB        markup: XML declaration
EC        markup: comment code
ED        markup: processing instruction
EE        markup: bang tag
EF        markup: bang-bracket tag
F0        content: CDATA-section start
F2        content: entity ref &entity;
F3        content: char-entity ref &#unicode_num;
F4        content: general char-entity ref &#unicode_num_string;
F5 - F7   reserved
F8        control: intra-random-access uncompressed physical content block
F9        control: encoded physical content block
FA        control: no-op byte
FB        control: no-op block
FC        control: macro definition
FD - FE   reserved
FF        control: trailer token

If people insist on schema-aware processing for greater compactness, I could suggest using macros instead. Basically, a macro is defined as a sequence of tokens that can include external-data-value references. The macros are defined and referenced in a parallel way to the string table.

A macro-def tag is:
   {token_code, n_defs, {block_count, {literal_or_instance_data_ref}*}*}

A macro-ref tag is:
   {token_code, macro_index, count, external_instance_data_values}

There could be a special compact code that references macro index 0 with a count of 1.

The macro expansion is done in a naive byte-for-byte way at a very low level in the parser; the input-buffer filler can expand it. For simplicity, we can define that the byte content must expand to an integral number of tokens, and that there must not be nested macro references.

An external instance-data reference can refer to a fixed block of bytes, or a variable-length block of bytes, or a variable-length block with the length included in the expansion (e.g., for strings).

A particular problem is how to index content that is included in a macro instance. This concept could possibly be mixed in with compression blocks, but I think that I probably want to keep them as orthogonal as possible. I think the thing to do is to have a special compression-block type of uncompressed sequential access which is allowed to include macros. Normal random-access uncompressed blocks are not allowed to include macros, so if a macro is generated, a new block may need to be started. This will only cost two bytes, once, but it turns off random access unless it is counteracted.
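
To make the byte-for-byte expansion of a macro reference more concrete, here is a rough Python sketch. Everything in it is an assumption made for illustration: the class names (Literal, FixedRef, VarRef), the function expand_macro, and the choice to write an included length as a single byte (so values are limited to 239 bytes, the literal range of 'Count'). It takes one macro definition as a list of blocks plus one list of instance-data values per instance, so the 'count' field of a macro-ref corresponds to the number of instances. It does not parse the macro-def or macro-ref tags themselves.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class Literal:
    data: bytes        # bytes copied verbatim into every expansion


@dataclass(frozen=True)
class FixedRef:
    size: int          # instance value must be exactly this many bytes


@dataclass(frozen=True)
class VarRef:
    emit_length: bool  # True: write the value's length before the value


Block = Union[Literal, FixedRef, VarRef]


def expand_macro(blocks: Sequence[Block],
                 instances: Sequence[Sequence[bytes]]) -> bytes:
    """Expand one macro definition once per instance, byte for byte."""
    out = bytearray()
    for values in instances:
        remaining = iter(values)
        for block in blocks:
            if isinstance(block, Literal):
                out += block.data
                continue
            try:
                value = next(remaining)
            except StopIteration:
                raise ValueError("too few instance-data values") from None
            if isinstance(block, FixedRef):
                if len(value) != block.size:
                    raise ValueError("fixed block has the wrong size")
                out += value
            else:
                if block.emit_length:
                    # Assumption: length written as a one-byte literal count.
                    if len(value) > 239:
                        raise ValueError("value too long for a 1-byte length")
                    out.append(len(value))
                out += value
        if next(remaining, None) is not None:
            raise ValueError("too many instance-data values")
    return bytes(out)


# Hypothetical macro: token 80 (string with explicit byte length) followed
# by token A6 (uint8 scalar, assumed here to carry one raw byte).
blocks = [Literal(b"\x80"), VarRef(emit_length=True),
          Literal(b"\xA6"), FixedRef(1)]
expanded = expand_macro(blocks, [[b"hi", b"\x07"], [b"hello", b"\x2A"]])
print(expanded.hex())
```

Because the expansion is a plain concatenation, it can live in the input-buffer filler as described above. The sketch does not check that the result is an integral number of tokens; that would need a token-level scan on top.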
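
The 'Count' encoding defined at the start of this document can also be sketched in a few lines of Python. Two things here are assumptions, since the table does not state them: multi-byte payloads are taken to be big-endian, and the F0 - FB form is taken to mean 240 + (code - 0xF0) * 256 + next byte, which is one natural mapping that reproduces the stated range of 240 to 3311 (12 codes times 256 values). The function names decode_count and encode_count are made up for this sketch.

```python
import struct
from typing import Optional, Tuple


def decode_count(buf: bytes, pos: int = 0) -> Tuple[Optional[int], int]:
    """Return (value, next_pos); value is None for the FF 'null' code."""
    code = buf[pos]
    if code <= 0xEF:                       # literal count 0..239
        return code, pos + 1
    if code <= 0xFB:                       # 240..3311, one extra byte
        return 240 + ((code - 0xF0) << 8) + buf[pos + 1], pos + 2
    if code == 0xFC:                       # uint16 (big-endian assumed)
        (value,) = struct.unpack_from(">H", buf, pos + 1)
        return value, pos + 3
    if code == 0xFD:                       # int32, negative illegal
        (value,) = struct.unpack_from(">i", buf, pos + 1)
        size = 5
    elif code == 0xFE:                     # int64, negative illegal
        (value,) = struct.unpack_from(">q", buf, pos + 1)
        size = 9
    else:                                  # FF: special 'null' value
        return None, pos + 1
    if value < 0:
        raise ValueError("negative count is illegal")
    return value, pos + size


def encode_count(n: Optional[int]) -> bytes:
    """Encode n using the shortest form that can hold it."""
    if n is None:
        return b"\xFF"
    if n < 0:
        raise ValueError("negative count is illegal")
    if n <= 239:
        return bytes([n])
    if n <= 3311:
        offset = n - 240
        return bytes([0xF0 + (offset >> 8), offset & 0xFF])
    if n <= 0xFFFF:
        return b"\xFC" + struct.pack(">H", n)
    if n <= 0x7FFFFFFF:
        return b"\xFD" + struct.pack(">i", n)
    if n <= 0x7FFFFFFFFFFFFFFF:
        return b"\xFE" + struct.pack(">q", n)
    raise ValueError("count does not fit in int64")
```

Whether an encoder must always pick the shortest form is not specified above; the sketch does so only because it is the simplest policy.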