BXML version 0.0.9
The 'Count' type is defined in a code space independent of the token
space, as follows:
CODES MEANING
------- --------------
00 - EF literal count values 0 to 239
F0 - FB plus 1 byte: count values 240 to 3311
FC plus 2 bytes: count values that fit in uint16
FD plus 4 bytes: count values that fit in int32, negative illegal
FE plus 8 bytes: count values that fit in int64, negative illegal
FF special 'null' value, normally illegal
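A minimal decoding sketch for the Count table above. Two things here are assumptions and not stated in the spec: multi-byte values are read as little-endian, and the F0 - FB codes are taken to form the high part of the value, so that the twelve codes times 256 byte values cover exactly 240 to 3311. The name decode_count is hypothetical.

```python
import struct

def decode_count(buf, pos=0):
    """Decode one Count at buf[pos]; return (value, next_pos).

    value is None for the special 'null' code FF.
    """
    code = buf[pos]
    pos += 1
    if code <= 0xEF:
        # Literal count values 0 to 239.
        return code, pos
    if code <= 0xFB:
        # Assumed layout: 12 codes x 256 byte values cover 240..3311.
        return 240 + ((code - 0xF0) << 8) + buf[pos], pos + 1
    if code == 0xFF:
        return None, pos
    # Assumption: little-endian byte order (the spec text does not say).
    fmt, size = {0xFC: ("<H", 2), 0xFD: ("<i", 4), 0xFE: ("<q", 8)}[code]
    (value,) = struct.unpack_from(fmt, buf, pos)
    if value < 0:
        raise ValueError("negative count is illegal")
    return value, pos + size
```

The caller decides whether a None result is acceptable, since the 'null' value is only normally illegal.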
File organization:
--header
--sequence of physical content blocks
--trailer
--random-access stream offsets refer to logical-content
Trailer node:
can include the following indexes:
--content blocks {physical offset, logical-content length}
--string-table-fragment index {base symbol index, logical-content offset}
--macro index
--explicit XPath index {XPath, {value, {logical offset}*}*}
--element index {symbol_id, logical offset, length}
allow most of the content of the trailer block to be compressed
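As an illustration of how the content-block index could serve random access, the sketch below maps a logical-content offset to the physical block that holds it. It assumes that index entries are listed in file order and that logical offsets start at 0 with the first content block; neither is stated above. The function names are hypothetical.

```python
from bisect import bisect_right

def build_logical_starts(blocks):
    """blocks: list of (physical_offset, logical_length) in file order.

    Returns (starts, total): the logical start offset of each block and
    the total logical-content length.
    """
    starts = []
    total = 0
    for _physical_offset, logical_length in blocks:
        starts.append(total)
        total += logical_length
    return starts, total

def locate(blocks, starts, total, logical_offset):
    """Return (physical offset of the block, offset within its logical content)."""
    if not 0 <= logical_offset < total:
        raise IndexError("logical offset outside the content")
    # Last block whose logical start is <= the requested offset.
    i = bisect_right(starts, logical_offset) - 1
    return blocks[i][0], logical_offset - starts[i]
```

The reader would then seek to the returned physical offset, decode that block if it is encoded, and skip forward by the returned inner offset.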
Tokens are defined according to the following codes:
CODES MEANING
------- --------------
00 - 1F integer content: integer in range of 0 to 31
20 - 7E literal character content: unicode between 0x20 and 0x7E
7F string content: string-table reference
80 string content: followed by byte length, followed by content
81 - 9F string content: byte length 1 to 31, followed by content
A0 scalar boolean content: "0"
A1 scalar boolean content: "1"
A2 scalar boolean content: "false"
A3 scalar boolean content: "true"
A4 boolean array content: "0" and "1" (bitpacked)
A5 boolean array content: "false" and "true" (bitpacked)
A6 scalar number content: uint8
A7 scalar number content: int16
A8 scalar number content: uint16
A9 scalar number content: int32
AA scalar number content: int64
AB scalar number content: float using 'e'
AC scalar number content: float using 'E'
AD scalar number content: double using 'e'
AE scalar number content: double using 'E'
AF reserved
B0 numeric array content: uint8
B1 numeric array content: int16
B2 numeric array content: uint16
B3 numeric array content: int32
B4 numeric array content: int64
B5 numeric array content: float using 'e'
B6 numeric array content: float using 'E'
B7 numeric array content: double using 'e'
B8 numeric array content: double using 'E'
B9 reserved
BA content: potentially insignificant whitespace
BB content: denormalized whitespace (CR+LF/LF)
BC content: blob
BD content: general macro reference
BE content: macro reference to macro index 0, count = 1
BF reserved
C0 string-table fragment: length, namespace per entry
C1 - CF string-table fragment: implied length 1 to 15, namespace
D0 string-table fragment: length, no namespaces
D1 - DF string-table fragment & content: implied length 1 to 15, no namespace
E0 element: no-attr empty element start
E1 element: no-attr content element start
E2 element: general element-tag start
E8 element: empty-element-tag end .../>
E9 close: close tag
EA close: general close tag
EB markup: XML declaration
EC markup: comment code
ED markup: processing instruction
EE markup: bang tag
EF markup: bang-bracket tag
F0 content: CDATA-section start >
F2 content: entity ref &entity;
F3 content: char-entity ref &#unicode_num;
F4 content: general char-entity ref &#unicode_num_string;
F5 - F7 reserved
F8 control: intra-random-access uncompressed physical content block
F9 control: encoded physical content block
FA control: no-op byte
FB control: no-op block
FC control: macro definition
FD - FE reserved
FF control: trailer token
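To show how the low end of the token table reads in practice, here is a sketch that decodes only the three self-contained content forms: small integers (00 - 1F), literal characters (20 - 7E), and short strings with an implied byte length (81 - 9F). String content is returned as raw bytes because the character encoding is not specified above. All other codes are left out of this sketch, and the function name is hypothetical.

```python
def decode_simple_token(buf, pos=0):
    """Decode one token at buf[pos]; return ((kind, value), next_pos)."""
    code = buf[pos]
    if code <= 0x1F:
        # Integer content in the range 0 to 31.
        return ("int", code), pos + 1
    if 0x20 <= code <= 0x7E:
        # Literal character content: the code is the character itself.
        return ("char", chr(code)), pos + 1
    if 0x81 <= code <= 0x9F:
        # Short string: byte length 1 to 31 is implied by the code.
        length = code - 0x80
        end = pos + 1 + length
        if end > len(buf):
            raise ValueError("truncated string token")
        return ("string", bytes(buf[pos + 1:end])), end
    raise NotImplementedError(f"token 0x{code:02X} is not handled in this sketch")
```

A full parser would dispatch on every code in the table; this only illustrates that the common cases cost one byte of overhead or none.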
If people insist on schema-aware processing for greater compactness,
I could suggest using macros instead. Basically, a macro is defined as
a sequence of tokens that can include external-data-value references.
The macros are defined and referenced in a parallel way to the string
table.
A macro-def tag is:
{token_code, n_defs, {block_count, {literal_or_instance_data_ref}*}*}
A macro-ref tag is:
{token_code, macro_index, count, external_instance_data_values}
There could be a special compact code that references macro index 0 with
a count of 1.
The macro expansion is done in a naive byte-for-byte way at a very low
level in the parser. The input-buffer filler can expand it. We can
define that the byte content must expand to an integral number of tokens,
for simplicity. We can also define that there must not be nested macro
references for simplicity.
An external instance-data reference can refer to a fixed block of bytes,
or a variable-length block of bytes, or a variable-length block with the
length included in the expansion (e.g., for strings).
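The naive byte-for-byte expansion described above might look roughly like the following. This is only a sketch under several assumptions: the macro body is modeled as a list of literal byte runs and data-reference slots, the three reference kinds are named here for illustration, the 'count' of a macro reference is taken to be the number of repetitions, and an included length is written as a single literal Count byte (so only lengths 0 to 239 are handled). DataRef and expand_macro are hypothetical names.

```python
from dataclasses import dataclass

FIXED = "fixed"                              # fixed block of bytes
VARIABLE = "variable"                        # variable-length block
VARIABLE_WITH_LENGTH = "variable+length"     # length is emitted in the expansion

@dataclass
class DataRef:
    kind: str
    size: int = 0  # only meaningful for FIXED

def expand_macro(template, instances):
    """Expand a macro body once per entry in instances.

    template: list of bytes literals and DataRef slots.
    instances: one list of bytes values per repetition.
    """
    out = bytearray()
    for values in instances:
        it = iter(values)
        for part in template:
            if isinstance(part, (bytes, bytearray)):
                out += part  # literal bytes are copied through unchanged
                continue
            value = next(it, None)
            if value is None:
                raise ValueError("too few instance-data values")
            if part.kind == FIXED and len(value) != part.size:
                raise ValueError("fixed block has the wrong size")
            if part.kind == VARIABLE_WITH_LENGTH:
                if len(value) > 239:
                    raise ValueError("sketch handles literal counts only")
                out.append(len(value))  # assumed: length as a literal Count
            out += value
    return bytes(out)
```

For example, a body of [b"\x80", DataRef(VARIABLE_WITH_LENGTH), b"\xA6", DataRef(FIXED, 1)] expands each instance into a string token (80, length, content) followed by a uint8 scalar token (A6, one byte), assuming the byte length after 80 is a Count.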
A particular problem is how to index content that is included in a macro
instance. This concept could possibly be mixed in with compression blocks,
but I think that I probably want to keep them as orthogonal as possible.
I think the thing to do is to have a special compression-block type
of uncompressed sequential access which is allowed to include macros.
Normal random-access uncompressed blocks are not allowed to include
macros, so if a macro is generated, a new block may need to be started.
This will only cost two bytes, once, but it turns off random access unless
it is counteracted.