[Matroska-devel] EBML

Steve Lhomme steve.lhomme at free.fr
Fri Feb 13 13:49:05 CET 2004


Martin Nilsson wrote:

> Steve Lhomme wrote:
> 
>>
>> "As an example ID 1 from class A, encoded as 0x81, and ID 1 from class 
>> B, encoded as 0x4001, are considered different IDs."
>>
>> Interesting point that was never raised before. IMO this case should 
>> be avoided, and that's what we did with Matroska. It could impose 
>> constraints on the parser behaviour which are IMO not necessary. I 
>> think there are enough IDs to avoid this (and BTW that means the 
>> number of IDs for classes B, C and D is wrong, because they should 
>> not contain elements of the other classes).
> 
> 
> I disagree. Even if you represented the ID in its parsed numerical state 
> you would still have to save the size as well to prevent 0x4001 from 
> being compressed into 0x81. So disallowing two IDs from having the same 
> decoded value only imposes extra constraints on ID generation, unless 
> one uses the decoded ID as the actual ID (i.e. having 0x81 and 0x4001 
> meaning the same element).

From a coding perspective, since we decided that IDs are at most 4 
octets long, one could decide to store the decoded ID value in a 32-bit 
variable (an int on most computers). And if this stored ID doesn't keep 
track of the way it was coded, some IDs may have the same value even 
though they are not coded the same way (different length).
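
To make the collision concrete, here is a minimal sketch in Python. It is 
not libebml code; decode_id is a hypothetical helper, and it assumes the 
usual EBML layout where the position of the first set bit in the first 
octet gives the ID length (class A = 1 octet, ... class D = 4 octets).

```python
def decode_id(data):
    """Return (length, value) for the EBML ID at the start of data.

    Hypothetical helper: the length is given by the position of the
    first set bit in the first octet, and the value is what remains
    once that marker bit is stripped.
    """
    first = data[0]
    length = 1
    mask = 0x80
    while length <= 4 and not (first & mask):
        mask >>= 1
        length += 1
    if length > 4:
        raise ValueError("ID longer than 4 octets")
    if len(data) < length:
        raise ValueError("truncated ID")
    value = first & (mask - 1)      # drop the length marker bit
    for octet in data[1:length]:
        value = (value << 8) | octet
    return length, value


class_a = decode_id(b"\x81")        # ID 1 from class A
class_b = decode_id(b"\x40\x01")    # ID 1 from class B

# Same decoded value, different coded length.
same_value = class_a[1] == class_b[1]
same_id = class_a == class_b
```

A parser that keeps only the decoded value sees the two IDs as equal 
(same_value is true), while one that keeps the length alongside the 
value still tells them apart (same_id is false).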

> And imposing extra ID generation constraints is very bad since 
> (hopefully) more people will be writing EBML DTDs than EBML parsers. 
> Making correct IDs is already the most complicated step in formulating 
> an EBML language.

That's true from this point of view. But I think it's better to avoid 
this case. Any writer of a format based on EBML should be able (but not 
be obliged) to take that into account.

>> FYI, one of the EBML enhancements that is planned is to allow some kind 
>> of embedded DTD inside the EBML header, that would describe the 
>> hierarchy of the elements, their ID and their type (all in EBML format 
>> of course). This way it would be easy to spot known/unknown elements 
>> and interpret the value of some elements (maybe even with a human 
>> friendly name) even if you don't know the actual meaning.
> 
> 
> Good idea. There should probably be some sort of ID range reserved for 
> EBML elements. Actually the possible ID clashes between different DTDs 
> upon compositioning is probably the biggest weakness of EBML right now 
> as far as I can see. And it is a difficult problem to solve. W3C never 
> actually solved the problem for XML. They sidestepped the issue with 
> namespaces as "solution".

That's why we have a DocType. It is not possible to merge 2 DocTypes 
into one. And XML has more "space" to use namespaces. Because of the way 
we try to store values, it would cost a lot of space to make all EBML 
formats compatible with each other.

>  > The problem is in what scope the 'other' element should be.
> 
> My first thought was to only refer back to parents, but in practice you 
> would probably more often want to refer to siblings. The real issue here 
> of course is to cap the memory consumption for the "semantic EBML 
> parser". Do we need to save the latest value of every parsed element? 
> No, fortunately not. When we read the EBML DTD we compile a list of all 
> referred elements and only save the latest one of those in a history. 

Yeah, that makes sense this way.
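
A rough sketch of that idea in Python, just to fix the picture. The class 
ReferenceHistory and its methods are hypothetical, and the element names 
and values are only illustrative: the set of referred elements is 
collected once from the DTD, and the parser then keeps only the latest 
value of those.

```python
class ReferenceHistory:
    """Keep the latest value of each element that another element refers to."""

    def __init__(self, referred_names):
        # Names collected once, while reading the (hypothetical) DTD.
        self._referred = frozenset(referred_names)
        self._latest = {}

    def record(self, name, value):
        # Called for every parsed element; unreferenced ones are dropped,
        # so memory use is bounded by the number of referred elements.
        if name in self._referred:
            self._latest[name] = value

    def lookup(self, name):
        # Latest value seen for a referred element, or None.
        return self._latest.get(name)

    def __len__(self):
        return len(self._latest)


history = ReferenceHistory({"TrackNumber"})
parsed = [("TrackNumber", 1), ("CodecID", "some-codec"), ("TrackNumber", 2)]
for name, value in parsed:
    history.record(name, value)
```

After the loop, history holds a single entry, the latest TrackNumber, 
no matter how many elements were parsed.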

> Further we could state that once we leave the top level element the 
> history must be cleared. That in itself is a good idea, but the fewer 
> rules the better. Even further we could state that once we leave the 
> current parent element, the history for that level should be cleared. 
> That is probably a bad idea. Lastly we could introduce another DTD 
> property specifying that an element has a private scope that should be 
> cleared when the element is exited. That is a flexible solution that 
> allows the DTD author to control how values are stored. I think it adds 
> unnecessary complexity for too little gain.

Yep. Well, in the case of Matroska, all referred data makes sense in the 
scope of a Segment (well, there *is* a way to link segments with IDs, so 
the scope is even wider). And from the DTD you know at which level 
Element A, referenced by Element B, meets it (the first level they have 
in common, which in Matroska is probably always the Segment). So even 
with just the DTD you should have enough knowledge of the scope in which 
you need to keep the references. But then what about references that 
could match many elements in the same scope?
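
The "first level they have in common" can be computed directly from the 
element paths in the DTD. A minimal sketch in Python, where common_scope 
is a hypothetical helper and the two paths are only an illustration of a 
referencing and a referenced element:

```python
def common_scope(path_a, path_b):
    """Return the deepest path prefix shared by two element paths."""
    shared = []
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared.append(a)
    return tuple(shared)


# Illustrative paths: an element that refers to a track, and the
# element it refers to.
referencing = ("Segment", "Tags", "Tag", "Targets", "TagTrackUID")
referenced = ("Segment", "Tracks", "TrackEntry", "TrackUID")

scope = common_scope(referencing, referenced)
```

Here scope comes out as ("Segment",), so the reference only has to be 
kept while the parser is inside that Segment. Two elements with no 
common parent would give an empty tuple.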

> Right now my focus is on going through all the layers of Matroska and at 
> least try to change everything that I don't like... I hope you guys are 
> open for debate when I reach the difficult parts.

Of course, we are always open to improvements. But right now we have 
lots of files based on the current specs in the wild, and we do not want 
to break backward compatibility with those files. So changes would only 
be made if they meet this requirement. Otherwise they would be an option 
for Matroska 2, which will have more features and some elements 
duplicated in a more efficient form (maybe a new Block element). That 
would still not break forward compatibility (Matroska 1 files should 
play in Matroska 2 players).

BTW, some SMTP servers don't like your matroska at mani.user.lysator.liu.se 
address.



