[Matroska-devel] EBML Namespaces

Steve Lhomme steve.lhomme at free.fr
Thu Apr 13 11:04:28 CEST 2006


HAESSIG Jean-Christophe wrote:
> Steve Lhomme wrote:
>> This is a good solution. But it probably wouldn't work with 
>> Matroska (the only known format to use EBML so far). Because 
>> the bit(s) used to mark the namespace are probably already 
>> used by some IDs. Also, the limit to 3 (or 2 or 5) is 
>> arbitrary and doesn't meet the goal of EBML to be a format 
>> with no limits. (the only one we have is Matroska legacy, but 
>> we could evolve EBML independently of Matroska too)
> 
> Out of luck, the first EBML (lev.0) master element [1A45DFA3]
> (VINT=0A45DFA3) falls in that category  since it has no unset
> bit after the first 1. The solution I proposed in my first post
> was simply to move the conflicting bits out of the way by sliding
> them one byte to the right. Since the current Class-Ids are
> supposed to be represented as their shortest form, this doesn't
> introduce new conflicts. 0A45DFA3 would then be encoded in the
> byte stream as [080A45DFA3], which makes room for 7 bits.
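
The quoted re-encoding can be checked with a short sketch. It assumes
IDs follow the usual EBML variable-length layout: the position of the
first set bit in the first byte gives the width in bytes, and that bit
is the length marker. The names id_width, payload_bits and widen_id are
made up for this illustration.

```python
def id_width(first_byte: int) -> int:
    """Width in bytes of an EBML ID, taken from its first set bit."""
    if not 0 < first_byte <= 0xFF:
        raise ValueError("first byte must contain a length marker")
    return 9 - first_byte.bit_length()


def payload_bits(width: int) -> int:
    """Bits left for the value once the marker and its leading zeros are removed."""
    return 7 * width


def widen_id(encoded: bytes) -> bytes:
    """Re-encode an ID one byte wider while keeping its value unchanged."""
    width = id_width(encoded[0])
    if len(encoded) != width:
        raise ValueError("length does not match the length marker")
    if width >= 8:
        raise ValueError("cannot widen past 8 bytes in this sketch")
    # Strip the old marker to get the bare value (0A45DFA3 for 1A45DFA3).
    value = int.from_bytes(encoded, "big") ^ (1 << payload_bits(width))
    # Put the marker one position lower in the first byte of a wider field.
    new_width = width + 1
    return ((1 << payload_bits(new_width)) | value).to_bytes(new_width, "big")
```

Under that assumption, widen_id(bytes.fromhex("1A45DFA3")) gives
08 0A 45 DF A3, and the payload grows from payload_bits(4) = 28 bits to
payload_bits(5) = 35 bits.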

Sliding 8 bits to the right should make room for 8 bits. Depending 
on the EBML header we could know whether IDs are supposed to have a 
namespace or not. But I may have another option: why not put the bits 
*after* the current bits used for the ID? All the processing of IDs 
would remain unchanged. And we would only need code to handle the 
namespace, the same way we handle the length. So parsing would be split 
like this:

[ID][namespace][size][data]
it could also be
[ID][size][namespace][data]
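
As a rough sketch of the first layout, here is how a reader could split
an element header. Nothing here is specified anywhere: it assumes the
namespace field is coded like the size (a variable-length integer with
the marker bit removed), and read_vint, Header and read_header are
hypothetical names.

```python
import io
from typing import BinaryIO, NamedTuple


def read_vint(stream: BinaryIO, keep_marker: bool) -> int:
    """Read one EBML-style variable-length integer from the stream."""
    first = stream.read(1)
    if not first:
        raise EOFError("no more data")
    if first[0] == 0:
        raise ValueError("fields wider than 8 bytes are not handled here")
    width = 9 - first[0].bit_length()
    rest = stream.read(width - 1)
    if len(rest) != width - 1:
        raise EOFError("truncated field")
    raw = int.from_bytes(first + rest, "big")
    # IDs are conventionally kept with their marker, sizes without it.
    return raw if keep_marker else raw ^ (1 << (7 * width))


class Header(NamedTuple):
    element_id: int
    namespace: int  # assumed field, placed between the ID and the size
    size: int


def read_header(stream: BinaryIO) -> Header:
    """Parse [ID][namespace][size]; the data follows in the stream."""
    element_id = read_vint(stream, keep_marker=True)
    namespace = read_vint(stream, keep_marker=False)
    size = read_vint(stream, keep_marker=False)
    return Header(element_id, namespace, size)


stream = io.BytesIO(bytes.fromhex("1A45DFA3" "82" "84") + b"abcd")
header = read_header(stream)
data = stream.read(header.size)
```

The ID is read exactly as it is today; only the one extra read_vint call
for the namespace is new. Swapping the last two reads gives the second
layout.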

What we need is to make one of the namespaces in the document the 
"default", i.e. not marked, the same way we don't have to write mandatory 
elements that have the default value. This way Matroska can keep its low 
overhead and be extended by new namespaces.

Of course for other formats than Matroska, this default namespace 
doesn't need to be mandatory.

Also, if we use an EBML element to say "all lower elements use namespace 
XYZ", it could override the default namespace for those elements. 
Namespace switching would only occur in very localized places. That's 
the difference between having the "using namespace XYZ" approach and the 
"XYZ::element" one. We might use both (as in C++).
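
To make the two approaches concrete, here is a minimal sketch of how a
reader could decide which namespace applies to an element when both are
allowed. The precedence rule (explicit mark first, then the nearest
enclosing declaration, then the default) is an assumption, and
DEFAULT_NAMESPACE and effective_namespace are hypothetical names.

```python
from typing import Optional, Sequence

DEFAULT_NAMESPACE = 0  # hypothetical number for the unmarked namespace


def effective_namespace(
    explicit: Optional[int],
    scope: Sequence[Optional[int]],
) -> int:
    """Resolve the namespace of one element.

    explicit: namespace written on the element itself, or None.
    scope: "using namespace" declarations of the ancestors, outermost
           first, with None for ancestors that declare nothing.
    """
    if explicit is not None:
        return explicit  # "XYZ::element" style wins
    for declared in reversed(scope):
        if declared is not None:
            return declared  # nearest enclosing "using namespace XYZ"
    return DEFAULT_NAMESPACE  # nothing marked anywhere
```

With this rule a plain Matroska file never carries a namespace at all:
every lookup falls through to the default.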

>> In that case each namespace could use a custom (to the file) 
>> ID and be defined by a URL (string ID) or a (EBML) DTD.
>>
>> Given we extend the ID size (to include the namespace of each 
>> ID) I don't see a problem here. The scope applies to the ID itself.
> 
> In fact, the problem is that we need to support random seeking
> in a file, since at least one format (Matroska) will do it (and
> many others will, it is a performance requirement). Consider a
> cue head that points to the beginning of an EBML element
> somewhere in the file : what context do we have about that
> element ? Do we know its level, its parent, etc ? While this
> information might be more or less important to a particular
> vocabulary, we'll have a hard time guessing which namespaces
> apply to that specific element, depending on the scoping rules
> we agreed on.

Yes, I was thinking about that too. That's why I prefer to keep the IDs 
intact, and the format proposed above does that. Seeking (at least in 
Matroska) can remain unchanged. For other formats we would need to take 
the namespace into account to make sure the element is in the namespace 
we're looking for.
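
A sketch of that check after a seek, assuming the namespace field sits
right after the ID and is coded like a size. The helper names read_vint
and element_at are hypothetical; passing no namespace models the
Matroska case where nothing changes.

```python
import io
from typing import BinaryIO, Optional


def read_vint(stream: BinaryIO, keep_marker: bool) -> int:
    """Read one EBML-style variable-length integer from the stream."""
    first = stream.read(1)
    if not first or first[0] == 0:
        raise ValueError("missing or unsupported length marker")
    width = 9 - first[0].bit_length()
    rest = stream.read(width - 1)
    if len(rest) != width - 1:
        raise EOFError("truncated field")
    raw = int.from_bytes(first + rest, "big")
    return raw if keep_marker else raw ^ (1 << (7 * width))


def element_at(
    stream: BinaryIO,
    offset: int,
    expected_id: int,
    expected_namespace: Optional[int] = None,
) -> bool:
    """Seek to offset and check that the expected element starts there."""
    stream.seek(offset)
    if read_vint(stream, keep_marker=True) != expected_id:
        return False
    if expected_namespace is None:
        return True  # ID-only check, as seeking works today
    # Assumed layout: the namespace field follows the ID directly.
    return read_vint(stream, keep_marker=False) == expected_namespace
```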

> In the "parent & parent's children" model, there must be a way
> to find the parent of an element. In the "following-siblings",
> one must scan all the preceding siblings of an element to check
> for namespace declarations (I said I didn't like it). In the
> "global" model, everything is easier, but it limits the
> flexibility of the format, i.e. it will not be possible to have
> local namespace declarations for rarely used namespaces...

Indeed. But there is no case where you seek randomly in an EBML stream 
and look around for the context. It's just not possible. Seeking is only 
allowed for elements where you can recover the context (namely Level 0 
and Level 1 in matroska). And these elements should contain the 
namespace (unless it's the default one). So the "using namespace" could 
still work just fine.

>>> Of course, this doesn't work if the data looks like legitimate EBML,
>>> but in fact isn't. There I can see only one solution : escape it.
>> No, the data in each EBML should not be modified because of 
>> the EBML ID it's in. That will make parsers way too complex. 
>> There could be a rule that all EBML Master IDs have a certain 
>> bit set, and the others don't. 
>> That could mean that one of the ID bits would be used for EBML Master.
>>
>> That will break Matroska compatibility but it could be added as an
>> EBML 2 version (and Matroska v3 bitstream).
> 
> Adding a bit to tell whether an element is master or not is not
> a solution IMHO, because adding namespaces in EBML should
> enable "annotation" :

Indeed, if we use this format [ID][namespace][size][data] the namespace 
part could also be used to say if the element is master or not. Maybe 
even the element type (3 or 4 bits).
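
One possible packing, as a sketch only: the low 3 bits of the namespace
field carry the element type and the remaining bits carry the namespace
number. The type numbering, TYPE_BITS, pack_field and unpack_field are
all made up for this example; 3 bits happen to be enough for the eight
element types EBML defines.

```python
from enum import IntEnum
from typing import Tuple


class ElementType(IntEnum):
    """Hypothetical numbering of the EBML element types."""
    SIGNED = 0
    UNSIGNED = 1
    FLOAT = 2
    STRING = 3
    UTF8 = 4
    DATE = 5
    MASTER = 6
    BINARY = 7


TYPE_BITS = 3  # assumed width of the type part


def pack_field(namespace: int, element_type: ElementType) -> int:
    """Combine a namespace number and an element type into one value."""
    if namespace < 0:
        raise ValueError("namespace must not be negative")
    return (namespace << TYPE_BITS) | int(element_type)


def unpack_field(field: int) -> Tuple[int, ElementType]:
    """Split the combined value back into (namespace, element type)."""
    type_mask = (1 << TYPE_BITS) - 1
    return field >> TYPE_BITS, ElementType(field & type_mask)


def is_master(field: int) -> bool:
    """Tell whether the element can be descended into without a DTD."""
    return unpack_field(field)[1] is ElementType.MASTER
```

A one-byte variable-length field carries 7 payload bits, so this split
would leave 4 bits for the namespace number before a second byte is
needed.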

I think we're getting close to a solution that could solve our namespace 
& DTD problem! Then you could compress any XML document to/from EBML. 
Well, there would still be the need to know if a sub-element is an XML 
attribute or an XML value:
<element attr="val"/>  vs  <element>val</element>

Which roughly translates to EBML as:

EBML_element (Master)
   EBML_attr val (String)
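
The ambiguity shows up as soon as you try the mapping. This is a
minimal sketch using Python's xml.etree.ElementTree; the (name,
children) tuples stand in for master elements and (name, text) for
strings, and to_tree is a hypothetical helper, not an agreed mapping.

```python
import xml.etree.ElementTree as ET


def to_tree(node: ET.Element):
    """Map an XML element to a master (name, children) or a string (name, text).

    Attributes and child elements both become sub-elements. Text mixed in
    between child elements is ignored in this sketch.
    """
    children = [(name, value) for name, value in node.attrib.items()]
    children += [to_tree(child) for child in node]
    if children:
        return (node.tag, children)
    return (node.tag, node.text or "")


as_attribute = to_tree(ET.fromstring('<element attr="val"/>'))
as_child = to_tree(ET.fromstring("<element><attr>val</attr></element>"))
as_value = to_tree(ET.fromstring("<element>val</element>"))
```

Here as_attribute and as_child come out identical, so going back to XML
needs one more piece of information per sub-element, for example a flag
in the namespace part.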



