[Matroska-devel] EBML Namespaces

HAESSIG Jean-Christophe haessije at eps.e-i.com
Mon Apr 10 18:17:49 CEST 2006

Steve Lhomme wrote:
> This is a good solution. But it probably wouldn't work with 
> Matroska (the only known format to use EBML so far). Because 
> the bit(s) used to mark the namespace are probably already 
> used by some IDs. Also, the limit to 3 (or 2 or 5) is 
> arbitrary and doesn't meet the goal of EBML to be a format 
> with no limits. (the only one we have is Matroska legacy, but 
> we could evolve EBML independently of Matroska too)

Out of luck, the first EBML (lev.0) master element [1A45DFA3]
(VINT=0A45DFA3) falls in that category  since it has no unset
bit after the first 1. The solution I proposed in my first post
was simply to move the conflicting bits out the way by sliding
them one byte to the right. Since the current Class-Ids are
supposed to be represented as their shortest form, this doesn't
introduce new conflicts. 0A45DFA3 would then be encoded in the
byte stream as [080A45DFA3], which makes room for 7 bits.

Of course this violently breaks compatibility, but the problem
could be solved by clever rules about when namespaces are
enabled and to what scope they apply.

> Setting the namespace width might be a good idea. Because 
> older formats (like Matroska) could set it to 0 (default 
> value of the new element in the EBML header). And new files 
> could use more space for the IDs (with namespace).

I was thinking the same. If no namespaces declarations are
found in the file, the parser should assume there is no use
of namespaces and just talk to the EBML engine.

I also just found out that being able to set the namespace
length is not enough to meet the 'no limits' goal of EBML.
It is not even technically good in a binary format which
needs terseness, because it is very likely that in the same
file, some elements from namespace A are used very often and
elements from namespaces B,C,D, and E more rarely. My
current proposal would force the EBML writer to add either :
* a constant overhead of N bits in front of each ID, which is
  bad because 1) all possible NSIDs are not used and 2) many
  IDs will be stretched to make room for the NSID
* a constant overhead of 1 bit (to distinguish between EBML
  and the private namespace) plus occasional (long) namespace
  declarations for local use of extra elements from the least
  used namespaces.

A way better solution, that will remove the need for an element
to declare the length of NSIDs is to use prefix codes (as in
Huffman compression) to identify namespaces.

> In that case each namespace could use a custom (to the file) 
> ID and be defined by a URL (string ID) or a (EBML) DTD.
> Given we extend the ID size (to include the namespace of each 
> ID) I don't see a problem here. The scope applies to the ID itself.

In fact, the problem is that we need to support random seeking
in a file, since at least one format (Matroska) will do it (and
many others will, it is a performance requirement). Consider a
cue head that points to the beginning of an EBML element
somewhere in the file : what context do we have about that
element ? Do we know its level, its parent, etc ? While this
information might be more or less important to a particular
vocabulary, we'll have a hard time guessing which namespaces
apply to that specific element, depending on the scoping rules
we agreed on.

In the "parent & parent's children" model, there must be a way
to find the parent of an element. In the "following-siblings",
one must scan all the preceding siblings of an element to check
for namespace declarations (I said I didn't like it). In the
"global" model, everything is easier, but it limits the
flexibility of the format, i.e. it will not be possible to have
local namespace declarations for rarely used namespaces...
> > Of course, this doesn't work if the data looks like 
> legitimate EBML, 
> > but in fact isn't. There I can see only one solution : escape it.
> No, the data in each EBML should not be modified because of 
> the EBML ID it's in. That will make parsers way too complex. 
> There could be a rule that all EBML Master IDs have a certain 
> bit set, and the others don't. 
> That could mean that one of the ID bits would be used for EBML Master.
> That will break Matroska compatibility but it could be added 
> as an EBML
> 2 version (and Matroska v3 bitstream).

Adding a bit to tell whether an element is master or not is not
a solution IMHO, because adding namespaces in EBML should
enable "annotation" :

Consider an EBML snippet :

[C5]                    # this is an element
	[C7]              # this is an element holding some integer
		1234        # the value itself
	.                 # C7 element ends here
.                       # C5 element ends here

Now imagine you want to use an element in another vocabulary to describe
that particular C7 instance :

[C5]                    # this is an element
	[C7]              # this is an element holding some integer
		[D1]        # this is an element holding a string value
			This value has been entered by user john
		            # D1 element ends here
	.                 # C7 element ends here
.                       # C5 element ends here

As you can see, with namespaces all elements are potential "masters".
I can not see another way to do it cleanly. Annotating as sibling nodes
would be a mess, since annotations should themselves be annotable.
Because of this, there must be some 'end-of-elements' tag to tell the
parser that raw data starts here, if necessary.

I can imagine a solution for future versions of EBML parsers that use
namespaces. A lowlevel module will parse the file, identify to which
namespace the elements belong and talk to the specific vocabulary
modules. Namespace-unaware formats will extend the EBML vocabulary, as
usual. Namespace-aware formats will register their own vocabulary module
to the lowlevel parser. The EBML vocabulary module will be responsible
for notifying the lowlevel parser about namespace declarations.


More information about the Matroska-devel mailing list