[Matroska-devel] EBML Namespaces

Steve Lhomme steve.lhomme at free.fr
Fri Apr 21 13:48:43 CEST 2006

HAESSIG Jean-Christophe wrote:
> Sorry for the long break, I couldn't find time to answer
> properly, since the subject isn't trivial.

No problem, there's no rush.

>>> would then be encoded in the byte stream as [080A45DFA3], 
>> which makes 
>>> room for 7 bits.
>> Sliding of 8 bits to the right, should make room for 8 bits. 
> Except that one bit out of the 8 is eaten by the size descriptor
> because the ID is made longer.

It all depends on the solution adopted. But I still didn't understand 
how your solution can be backward compatible (EBML & Matroska) since it 
modifies the rules already in place.

BTW, while backward compatibility is the goal, we should first think about 
all the possible options without that in mind. Then we'll see what is 
possible, and only then whether it's worth keeping compatibility or not.

>> Depending on the EBML header we could know whether IDs are 
>> supposed to have a namespace or not. But I may have another 
>> option: why not put the bits
>> *after* the current bits used for the ID ? All the ID 
> Putting the namespace value before or after the Class-ID would
> basically have the same effect, except that values are more
> likely to change in their low order digits, and therefore it's
> harder to find unused space here. 
>> processing of IDs would remain unchanged. And we would only 
> I'm not quite sure how you see this, but AFAIC imagine, one
> should seek for the namespace part of the ID, remove it, and then
> resume with normal ID interpretation.

Why do you want to remove the ID ? As seen below there are different 
options of where we could put it.

Now, instead of using another byte (or more), or splitting the IDs, 
we could reuse bits in the data length. It would be no more backward 
compatible than using bits in the IDs, but there would be more room for 
improvement (as the size is known to be encoded in different byte sizes).

>> need code to handle the namespace, the same way we have the 
>> length. So parsing would be split like this:
>> [ID][namespace][size][data]
>> it could also be
>> [ID][size][namespace][data]
> I sense that you want to encode the namespace value as a
> totally separate field, with equal status compared to Class-ID,
> Size, and Data. However there is a slight problem with this :

Yes, that's the idea.
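
A minimal sketch of how a parser could read the [ID][namespace][size][data] layout. The namespace field is hypothetical: nothing in this thread fixes its encoding, so the sketch assumes it is coded like the size (a variable-length integer with the length marker stripped) and that 0 means the default namespace. The names `vint_length`, `read_vint` and `parse_element` are made up for illustration.

```python
def vint_length(first_byte: int) -> int:
    """Length in bytes of an EBML variable-length integer, from its first byte."""
    if first_byte == 0:
        raise ValueError("more than 8 bytes: not handled in this sketch")
    # The number of leading zero bits, plus one, gives the total length.
    return 9 - first_byte.bit_length()


def read_vint(buf: bytes, pos: int, keep_marker: bool) -> tuple[int, int]:
    """Read one variable-length integer at pos; return (value, next_pos)."""
    length = vint_length(buf[pos])
    if pos + length > len(buf):
        raise ValueError("truncated integer")
    value = int.from_bytes(buf[pos:pos + length], "big")
    if not keep_marker:
        # Clear the length-marker bit; the low 7 * length bits are the value.
        value &= (1 << (7 * length)) - 1
    return value, pos + length


def parse_element(buf: bytes, pos: int = 0, with_namespace: bool = False):
    """Parse [ID][namespace][size][data]; the namespace field is hypothetical."""
    # IDs are kept with their length marker, as in current EBML.
    element_id, pos = read_vint(buf, pos, keep_marker=True)
    namespace = 0  # assumption: 0 stands for the default namespace
    if with_namespace:
        namespace, pos = read_vint(buf, pos, keep_marker=False)
    size, pos = read_vint(buf, pos, keep_marker=False)
    data = buf[pos:pos + size]
    if len(data) != size:
        raise ValueError("truncated data")
    return element_id, namespace, data, pos + size
```

With `with_namespace=False` the same function reads today's [ID][size][data] elements, so the ID and size handling stays unchanged and only one extra read is needed for the namespace.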

> EBML is supposed to be a byte-aligned format and it would
> require at least 1 extra byte for each element. This is not bad
> in itself, but it would waste a great amount of bits, since I do
> not expect files with more than 5 mixed namespaces to be
> frequent. Therefore, I expect the namespace value to take up to
> 3 bits in most cases, this is why I try to pack it into an
> existing field. 

Well, what happens when you need 6 or 7 ? You don't have any more bits 
left. Adding another byte or 2 gives room for unlimited extensions 
(including the ability to use some other bits for other things like 
marking an element as EBML Master). That's what EBML is good at: no 
limits and still having very basic rules.

I'm not too concerned about the overhead because right now if you need a 
lot of IDs you need to use 2-octet ones, whereas with a namespace 
most IDs for each namespace won't need a lot of room (127 possibilities 
for Class A IDs). So in the end there should be a good balance.

Using 3 bits in the ID header would reduce the number of possible Class 
A IDs of a format to 2^4-1 = 15 ! That's too small IMO. So I think 
adding another byte will give us more freedom and space at almost no cost.
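
The arithmetic behind these numbers, as a small sketch. It only covers one-octet (Class A) IDs and follows the counts used in this mail (one value excluded per class); `class_a_ids` is a name made up for the example.

```python
def class_a_ids(namespace_bits: int = 0) -> int:
    """Count one-octet (Class A) IDs left after reserving bits for a namespace."""
    value_bits = 7 - namespace_bits  # one of the 8 bits is the length marker
    if value_bits < 1:
        raise ValueError("no bits left for the ID itself")
    return 2 ** value_bits - 1  # one value excluded, as in the counts above


print(class_a_ids(0))  # 127
print(class_a_ids(3))  # 15
```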

> You seem to be prepared to make big changes to the format, but
> I don't know to what extent we should break compatibility...

Again, for current Matroska files it shouldn't be a problem as matroska 
would be the default namespace. In that case the namespace shouldn't be 
used for such IDs. That means older files will play without any problem 
in namespace-aware parsers. Only newer files containing some namespace 
will not be usable by older parsers. Which AFAIK is the same as what 
you propose.

BTW, we still haven't discussed how to define the namespace in the EBML 
header, but the DocType existing today will remain. And that is the way 
to define the special namespace that will be used as default... The 
other namespaces will probably go in a list. It's like the 
DocType in XML (like html). At least we got the right name for that field ;)
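
One possible shape for this, purely as an illustration since the header layout has not been discussed yet: the DocType names the default namespace and the other namespaces sit in an ordered list. The class `NamespaceTable`, the convention that index 0 means "default", and the namespace names in the usage note are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class NamespaceTable:
    """Hypothetical namespace declarations read from the EBML header."""
    doc_type: str                                     # default namespace
    others: list[str] = field(default_factory=list)   # extra namespaces, in order

    def resolve(self, index: int) -> str:
        """Map a per-element namespace index to a namespace name."""
        if index == 0:
            return self.doc_type  # unmarked elements use the DocType
        if not 1 <= index <= len(self.others):
            raise KeyError(f"undeclared namespace index {index}")
        return self.others[index - 1]
```

For example, `NamespaceTable("matroska", ["comments"]).resolve(0)` gives `"matroska"`, which is what every element of an existing file would resolve to.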

>> What we need is to make one of the namespaces in the document 
>> be set as "default", ie not marked. The same way we don't 
>> have to write mandatory elements that have the default value. 
>> This way Matroska can keep its low overhead and be extended 
>> by new namespaces.
> This could be the best solution, if we can find a way to
> express the namespace descriptors in a space-efficient manner
> *and* not making it a pain in the a** for random-seeking
> applications to recover the namespace state. However if it
> can't be done I would rather have the namespace expressed for
> each element in a file using them, and have files with no
> namespaces at all (ns desc length=0) like plain Matroska.
> With proper prefix-coding of the ns descriptor, one could use
> only one or two *bits* per element.

You want to use external files ? I don't really understand. The 
namespace 'tagging' of each element/ID has to be done inside the file... 
Well, actually, not really: you can use some namespaces in the file 
without defining them. You'd just know how to map IDs to different 
namespaces. The only thing missing for the file to be usable is the 
semantics. That's where the DTD (internal and/or external) comes in. Is 
that what you want to make external ?

>> Also, if we use an EBML element to say: all lower elements 
>> use namespace XYZ it could replace the default value. 
>> Namespace switching would only occur in very localized 
>> places. That's the difference between having the "using 
>> namespace XYZ" approach and the "XYZ::element" one. We might 
>> use both (as in C++).
> Using such a following-sibling approach would hurt seeking as
> it is currently done. Of course we can add specific rules
> like : an element containing namespace switches MUST NOT have
> its sub-elements indexed by seek heads, except if these seek
> heads point the parser to all relevant ns switches. This

Yes, that would be a limit, but seeking in a file format is a very special 
feature that most formats don't use much. It makes sense in A/V formats 
where there is a timeline, but then you need to know the semantics to 
know what you're looking for when seeking.
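
A small sketch of the two styles side by side, on an already-parsed tree. Everything here is an assumption made for illustration: elements are plain dicts, a child named `UsingNamespace` plays the role of "using namespace XYZ" for the siblings that follow it (and their children), and an `ns` key plays the role of "XYZ::element".

```python
def walk(element: dict, default_ns: str):
    """Yield (name, namespace) for an element and all its descendants."""
    # An explicit "XYZ::element" style qualifier wins for this element only.
    yield element["name"], element.get("ns", default_ns)
    scope = default_ns
    for child in element.get("children", []):
        if child["name"] == "UsingNamespace":
            # "using namespace XYZ": changes the default for what follows here.
            scope = child["value"]
            continue
        yield from walk(child, scope)


doc = {"name": "Segment", "children": [
    {"name": "Info"},
    {"name": "Comments", "children": [
        {"name": "UsingNamespace", "value": "comments"},
        {"name": "Note"},
        {"name": "Author", "ns": "people"},
    ]},
    {"name": "Cluster"},
]}

for name, ns in walk(doc, "matroska"):
    print(f"{ns}::{name}")
```

The seeking problem shows up directly: a parser that jumps straight to `Note` never sees the `UsingNamespace` switch before it, whereas `Author` carries its own qualifier and can be interpreted in isolation.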

> raises an important issue about the effective structure of
> libraries (of course, people who implement the whole parsing
> for their own application will have less problems here)
> dedicated to do the parsing. I believe that namespace
> processing really should be unknown to the specific
> applications.

Yes. But again, the namespace implies the semantics. All namespace-related 
functionality should be handled at the EBML level, but there will always 
be the need to map the semantics for the application.

>> Yes, I was thinking about that too. That's why I prefer to 
>> keep the IDs intact and the format proposed above is good. 
>> Seeking (at least in
>> matroska) can remain unchanged. For other formats we would 
> Since there is no foreign-format mixing possibility due to
> the lack of namespaces, there is indeed no problem.
>> need to take the namespace into account to make sure the 
>> element is in the namespace we're looking for.
> I was thinking a little more about seeking and I came to the
> conclusion that seeking (indexing and pointing to some part
> of the file, and the like) should go in EBML (or some seeking
> NS), and not in each specific application. Why ? Imagine you
> have some program to add comments in EBML files. You could
> take any element in the file and add a string comment. The app
> has its private elements and would use a separate namespace,
> so it wouldn't interfere with the existing data. The file would
> still be readable by the original program, the natural rule
> being to simply ignore unknown elements. However, adding
> elements changes the size of the file, and therefore the
> positions to which the seek heads point. Moving seeking into
> EBML would enable automatic relocation of the seek-heads.

That's indeed a good point. Remuxing a Matroska file with seek/cue 
elements requires knowing the Matroska semantics.
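
The relocation itself is simple once the library knows which values are positions. A minimal sketch, assuming the seek entries are available as a name-to-offset mapping (the function name and the example offsets are made up). It deliberately ignores the other things a real remux has to fix, such as the size fields of the parent elements.

```python
def relocate(seek_positions: dict[str, int], insert_at: int, inserted_len: int) -> dict[str, int]:
    """Shift every seek target located at or after the insertion point."""
    return {
        name: pos + inserted_len if pos >= insert_at else pos
        for name, pos in seek_positions.items()
    }


# A 40-byte comment element inserted at offset 250:
seeks = {"Info": 100, "Tracks": 300, "Cues": 900}
print(relocate(seeks, insert_at=250, inserted_len=40))
# {'Info': 100, 'Tracks': 340, 'Cues': 940}
```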

When I read that, it made me think of XPath in XML. I have no idea how 
it works for XML, but it seems to be an extension used to define pointers 
between elements and/or documents (one direction or bidirectional). And 
that's something that could make sense at the EBML level too. That 
wouldn't be backward compatible, but depending on the solution we come 
up with, it could be a good replacement.

> A more interesting thing with this is that local namespace
> state can be recovered while seeking, since it would be the
> job of the EBML library to make seek heads and it could
> include all the necessary information.

Yes, that could work. But again, seeking (at the EBML or semantic level) 
is a special/tricky feature. In the case of Matroska it's only for Level 
1 elements and therefore you always know the upper context (segment, 
since that's the only level 0 element). I can hardly imagine a format 
that would need to seek at level 2 or more. Especially because after 
seeking you probably need the context of upper elements (one of the 
features of nested formats is that each element has a context to 
interpret it).

While I'm all for seeking at the EBML level (it would help the format's 
resistance to errors too), we shouldn't over-design it for cases that 
won't make sense in the real world. We'll see what the discussion leads 
to :)


robUx4 on blog <http://robux4.blogspot.com/>
