[Matroska-devel] EBML Namespaces

HAESSIG Jean-Christophe haessije at eps.e-i.com
Mon Apr 10 11:08:46 CEST 2006

> Hi,
> Well, extending an EBML document with more tags has been 
> discussed in the past. The idea was to include a DTD in the 
> header. But using DTDs means we can use external ones too (as 
> in HTML/XML). But there's always the issue of ID collision.

Indeed, DTDs provide semantic information about a file, but no
namespace isolation.

> This is similar to the DTD system. Except you're changing the 
> ID parsing. I think Class 3, 4 (and even Class 2) level 
> offers enough IDs to avoid collision for formats in the same 
> field of work (multimedia, banking, tagging, etc).

Yes, this is possible, but vocabulary writers will need to
cooperate to avoid using the same IDs. Some will release a
format and afterwards notice that they have clashes. Some
will take clashing IDs on purpose, to be incompatible
with their competitor.

> Now the idea of a namespace would mean that the same ID would 
> be used by
> 2 formats but with a different meaning. But given you set the 
> different namespace in each ID, de facto they have a 
> different ID. So I don't really see how it solves the problem 
> of collision.

In effect, yes. But the namespace ID should not be seen as
part of the Class-ID, since the namespace ID in use can hold
virtually any value, and will probably differ for the same
vocabulary used in two distinct files. If you had two files, each
using two namespaces: File A [0 (EBML); 1 (Private NS A)]
and File B [0 (EBML); 1 (Private NS B)], a suitable program could
merge them into file C [0 (EBML); 1 (Private NS A); 2 (Private NS B)].
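
To make the merge concrete, here is a minimal Python sketch. It is only
an illustration: the function name merge_namespaces, the dict layout
(local namespace number -> globally unique key) and the key strings are
all assumptions of mine, not part of any spec or existing tool.

```python
def merge_namespaces(table_a, table_b):
    """Merge two per-file namespace tables.

    Each table maps a file-local namespace number to a globally unique
    namespace key (hypothetical layout). Returns the merged table plus
    one renumbering map per input file.
    """
    merged = {}   # new number -> key
    by_key = {}   # key -> new number

    def add(table):
        remap = {}
        for local, key in sorted(table.items()):
            if key not in by_key:
                by_key[key] = len(merged)   # next free number
                merged[by_key[key]] = key
            remap[local] = by_key[key]
        return remap

    remap_a = add(table_a)
    remap_b = add(table_b)
    return merged, remap_a, remap_b


file_a = {0: "EBML", 1: "Private NS A"}
file_b = {0: "EBML", 1: "Private NS B"}
merged, remap_a, remap_b = merge_namespaces(file_a, file_b)
```

Elements copied from File B would have their namespace number rewritten
through remap_b (here 1 becomes 2), while their ID values stay untouched.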

I think a little confusion has been introduced in Class-ID naming
because their representation in the specs is the full byte dump,
size descriptor included. So ID [A1] is represented as [A1], while
its real VINT value, once the descriptor bit is removed, is 21 (hex).
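
As a small illustration of the difference between the byte dump and the
value, the following Python sketch strips the length marker from an ID.
The function name vint_value is mine; the marker layout (leading zero
bits, then a single 1 bit) is the usual EBML variable-length integer
encoding.

```python
def vint_value(raw: bytes) -> int:
    """Return the numeric value of an EBML ID without its length marker."""
    first = raw[0]
    if first == 0:
        raise ValueError("no length marker in the first byte")
    # The position of the highest set bit encodes the total byte count.
    length = 9 - first.bit_length()
    if len(raw) != length:
        raise ValueError("byte count does not match the length marker")
    # Clear the marker bit, then append the remaining bytes.
    value = first & ((1 << (8 - length)) - 1)
    for b in raw[1:]:
        value = (value << 8) | b
    return value
```

For instance, vint_value(bytes([0xA1])) gives 0x21, matching the example
above.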

After realizing that the size descriptor is not part of the class
ID value we can introduce another object that will not be counted
as part of the class ID : the namespace ID.
For example, if the namespace ID width is 3 bits, placed right after
the size descriptor, the ID represented as [81] has VINT value 1 in
namespace 0. The same ID would read [91] in namespace 1 and [F1] in
namespace 7. Notice that only the representation in the byte stream
changes, not the real value of the ID.
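
The bit layout can be sketched in Python as follows. This is only my
reading of the [81]/[91]/[F1] examples: I assume the namespace bits sit
immediately after the length marker and the ID takes the remaining bits.
The names encode_id and decode_id are made up for the illustration.

```python
def encode_id(ident: int, namespace: int, ns_width: int, length: int = 1) -> bytes:
    """Pack marker, namespace bits and ID value into `length` bytes."""
    payload_bits = 7 * length            # bits left after the length marker
    id_bits = payload_bits - ns_width    # bits left for the ID itself
    if id_bits <= 0 or not 0 <= ident < (1 << id_bits):
        raise ValueError("ID does not fit")
    if not 0 <= namespace < (1 << ns_width):
        raise ValueError("namespace does not fit")
    marker = 1 << payload_bits           # the single 1 bit of the size descriptor
    word = marker | (namespace << id_bits) | ident
    return word.to_bytes(length, "big")


def decode_id(raw: bytes, ns_width: int):
    """Return (namespace, ID value) from the byte representation."""
    payload_bits = 7 * len(raw)
    id_bits = payload_bits - ns_width
    word = int.from_bytes(raw, "big")
    if id_bits <= 0 or word >> payload_bits != 1:
        raise ValueError("length marker does not match the byte count")
    payload = word & ((1 << payload_bits) - 1)
    return payload >> id_bits, payload & ((1 << id_bits) - 1)
```

With a width of 3, encode_id(1, 0, 3), encode_id(1, 1, 3) and
encode_id(1, 7, 3) produce the bytes 81, 91 and F1 respectively, and
decode_id returns the same ID value 1 for all three.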

> Yes, but you still need to map, at the lowest level, the 
> namespaces for each upper level reader.

Of course. This job will be done by new elements in the EBML
namespace (further noted as NSDE -- namespace declaration elements).
I proposed value 0 for the namespace for EBML elements, mainly for
convenience reasons.
We need one element to set the namespace width and one container
element to declare a namespace. This container must hold one
sub-element to set the namespace value and one sub-element to
associate a namespace key with it (the key is the only thing that
needs to be globally unique across formats).

The trickiest problem I can see is deciding in what scope a namespace
is active. The cleanest rule would be (just as in XML): an NSDE
controls the namespace of its parent and its parent's children (the
NSDE itself is therefore included). But this would be harder to
implement because it requires look-ahead to decide to which namespace
the current element belongs. Happily we are allowed to add
restrictions on where NSDEs can be used in elements, if any.

Another scoping rule can be "following-siblings", where an NSDE changes
NS rules for the next elements and their children. It is technically
correct and easy to implement, but for the moment I dislike it, though
I can't tell why...
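
To show how little machinery the "following-siblings" rule needs, here
is a rough Python sketch over an already-parsed tree. Everything in it is
hypothetical: the tuple layout, the "NSDE" tag, the key strings, and in
particular my assumption that a declaration made inside a child does not
leak back to the parent's later siblings.

```python
def resolve(siblings, inherited=None):
    """Replace local namespace numbers by their keys.

    `siblings` is a list of nodes in document order. A node is either
    ("NSDE", number, key) or (namespace_number, name, children).
    """
    # Namespace 0 is EBML, as proposed; children start from a copy.
    mapping = dict(inherited or {0: "EBML"})
    out = []
    for node in siblings:
        if node[0] == "NSDE":
            _, number, key = node
            mapping[number] = key      # affects following siblings only
            continue
        ns, name, children = node
        out.append((mapping[ns], name, resolve(children, mapping)))
    return out


tree = [
    ("NSDE", 1, "urn:a"),
    (1, "x", [("NSDE", 1, "urn:b"), (1, "y", [])]),
    (1, "z", []),
]
resolved = resolve(tree)
```

A single pass in document order is enough; no look-ahead is needed, which
is what makes this rule easier to implement than the parent-scoped one.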

A third option is to only allow the use of NSDE near the beginning
of the file and make the rules global to the whole EBML file, but this
is rather gory.

> Sure. Maybe I didn't get your solution right. But I'm glad 
> someone is trying to extend EBML. The main missing feature 
> for the moment is the inability at the lower level to know if 
> an element is EBMLMaster or not. 
> So it's impossible to display a map of an EBML document 
> without knowing the semantics.

I've had some success with that, but not complete success, which means
my solution cannot be used in real programs (see the attached Python
script -- GTK2 libs needed). The idea is to always parse the content
of elements, be they master or not. The data in the element is
searched for sub-elements. If the length of a found sub-element
overflows the parent, then parsing is cancelled and the data reverts
to raw status.
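
The heuristic can be sketched in a few lines of Python. This is not the
attached script, just a minimal reconstruction of the idea with names of
my own (read_vint, try_parse); it ignores unknown-size elements and does
not enforce the usual limit on ID length.

```python
def read_vint(data, pos, keep_marker):
    """Read one variable-length integer; return (value, next_pos) or None."""
    if pos >= len(data) or data[pos] == 0:
        return None
    first = data[pos]
    length = 9 - first.bit_length()          # byte count from the marker bit
    if pos + length > len(data):
        return None
    # IDs are conventionally shown with the marker, sizes without it.
    value = first if keep_marker else first & ((1 << (8 - length)) - 1)
    for b in data[pos + 1:pos + length]:
        value = (value << 8) | b
    return value, pos + length


def try_parse(data):
    """Return [(id, children_or_raw_bytes), ...], or None if data is not
    a clean run of elements (for example when a child overflows)."""
    pos, out = 0, []
    while pos < len(data):
        head = read_vint(data, pos, keep_marker=True)
        if head is None:
            return None
        ident, pos = head
        size = read_vint(data, pos, keep_marker=False)
        if size is None:
            return None
        length, pos = size
        end = pos + length
        if end > len(data):                  # overflows the parent: cancel
            return None
        payload = data[pos:end]
        children = try_parse(payload) if payload else None
        # Fall back to raw bytes when the payload does not parse.
        out.append((ident, children if children is not None else payload))
        pos = end
    return out
```

For example, try_parse(bytes([0xA1, 0x83, 0x81, 0x81, 0x2A])) finds a
sub-element [81] inside [A1], while in bytes([0xA1, 0x82, 0x81, 0x85])
the would-be child claims 5 bytes, so the payload of [A1] stays raw.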

Of course, this doesn't work if the data looks like legitimate EBML
but in fact isn't. There I can see only one solution: escape it.
A code that says 'EBML stops here' should be inserted just before
the raw data that needs it. This job can be done by a normal EBML
element (with size 0), which is at minimum 2 bytes long. In practice
I didn't encounter many cases where bogus EBML was interpreted, so
the escape would rarely be needed and would cost little in terseness.
As an added benefit, this code could be used as a marker for the end
of unbounded (size unknown) container elements and totally relieve
the class ID from providing hints about the level of that element
(which is currently the case).

And last, but not least, to provide real compositing and annotation
with namespaces, all elements should be allowed to contain
sub-elements (except size unbounded ones).

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ebml.py
Type: application/octet-stream
Size: 7752 bytes
Desc: ebml.py
URL: <http://lists.matroska.org/pipermail/matroska-devel/attachments/20060410/941f51ac/attachment.obj>
