[Matroska-devel] EBML Namespaces
Steve Lhomme
steve.lhomme at free.fr
Fri Apr 21 13:48:43 CEST 2006
HAESSIG Jean-Christophe wrote:
>
> Sorry for the long break, I couldn't find time to answer
> properly, since the subject isn't trivial.
No problem, there's no rush.
>>> would then be encoded in the byte stream as [080A45DFA3], which makes
>>> room for 7 bits.
>> Sliding of 8 bits to the right, should make room for 8 bits.
>
> Except that one bit out of the 8 is eaten by the size descriptor
> because the ID is made longer.
It all depends on the solution adopted. But I still don't understand
how your solution can be backward compatible (EBML & Matroska), since it
modifies the rules already in place.
BTW, while backward compatibility is the goal, we should first think about
all the possible options without that in mind. Then we'll see everything
that is possible, and only then decide whether it's worth keeping
compatibility or not.
>> Depending on the EBML header we could know whether IDs are
>> supposed to have a namespace or not. But I may have another
>> option: why not put the bits
>> *after* the current bits used for the ID ? All the ID
>
> Putting the namespace value before or after the Class-ID would
> basically have the same effect, except that values are more
> likely to change in their low order digits, and therefore it's
> harder to find unused space here.
>
>> processing of IDs would remain unchanged. And we would only
>
> I'm not quite sure how you see this, but as far as I can imagine, one
> should seek for the namespace part of the ID, remove it, and then
> resume with normal ID interpretation.
Why do you want to remove the ID ? As seen below there are different
options of where we could put it.
Now, instead of using another byte (or more), or splitting the IDs, we
could reuse bits in the data length. It would be no more backward
compatible than using bits in the IDs, but there would be more room for
improvement (as the size is already known to be encoded in different byte
lengths).
>> need code to handle the namespace, the same way we have the
>> length. So parsing would be split like this:
>>
>> [ID][namespace][size][data]
>> it could also be
>> [ID][size][namespace][data]
>
> I sense that you want to encode the namespace value as a
> totally separate field, with equal status compared to Class-ID,
> Size, and Data. However there is a slight problem with this :
Yes, that's the idea.
> EBML is supposed to be a byte-aligned format and it would
> require at least 1 extra byte for each element. This is not bad
> in itself, but it would waste a great amount of bits, since I do
> not expect files with more than 5 mixed namespaces to be
> frequent. Therefore, I expect the namespace value to take up to
> 3 bits in most cases, this is why I try to pack it into an
> existing field.
Well, what happens when you need 6 or 7 ? You don't have any more bits
left. Adding another byte or 2 gives room for unlimited extensions
(including the ability to use some other bits for other things like
marking an element as EBML Master). That's what EBML is good at: no
limits and still having very basic rules.
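
To make the separate-field idea concrete, here is a minimal Python sketch
of reading one element laid out as [ID][namespace][size][data]. It is only
an illustration under stated assumptions: the namespace field does not
exist in EBML today, and encoding it as a variable-length integer like the
size (and writing it on every element) is a hypothetical choice, as are
the names read_vint and parse_element.

```python
def read_vint(buf, pos, keep_marker=False):
    """Read one EBML variable-length integer starting at buf[pos].

    The number of leading zero bits in the first byte gives the total
    length in bytes (1 to 8). IDs keep the length marker bit, sizes do not.
    Returns (value, position just after the integer).
    """
    first = buf[pos]
    if first == 0:
        raise ValueError("integers longer than 8 bytes are not handled here")
    length = 9 - first.bit_length()
    if pos + length > len(buf):
        raise ValueError("truncated variable-length integer")
    value = first if keep_marker else first & (0xFF >> length)
    for byte in buf[pos + 1:pos + length]:
        value = (value << 8) | byte
    return value, pos + length


def parse_element(buf, pos=0):
    """Parse one element laid out as [ID][namespace][size][data].

    ASSUMPTION: the namespace field is hypothetical and is encoded here
    like a size field, so it can grow beyond one byte when needed.
    Returns (element_id, namespace, data, position after the element).
    """
    element_id, pos = read_vint(buf, pos, keep_marker=True)
    namespace, pos = read_vint(buf, pos)
    size, pos = read_vint(buf, pos)
    if pos + size > len(buf):
        raise ValueError("truncated element data")
    return element_id, namespace, buf[pos:pos + size], pos + size
```

For example, parse_element(bytes([0x42, 0x86, 0x83, 0x81, 0x01])) reads a
2-byte ID 0x4286, namespace 3, and one byte of data. A namespace above 126
would simply spill into a second byte, which is the "no limits" property
argued for above.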
I'm not too concerned about the overhead, because right now if you need a
lot of IDs you have to use 2-octet ones. With a namespace, most IDs in
each namespace won't need a lot of room (127 possibilities for Class A
IDs). So in the end there should be a good balance.
Using 3 bits in the ID header would reduce the number of possible Class
A IDs of a format to 2^4-1 = 15! That's too small IMO. So I think
adding another byte will give us more freedom and space at almost no cost.
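
The counting behind those figures can be checked with a few lines of
Python. This is only a sketch of the arithmetic: the function name
class_a_id_count is made up for illustration, and it follows the counting
used above (7 value bits in a one-byte ID, one value excluded).

```python
def class_a_id_count(namespace_bits=0):
    """Number of usable one-byte (Class A) IDs.

    A one-byte ID has 7 value bits after the length marker. Every bit
    borrowed for a namespace halves the space that is left, and one
    value is excluded, as in the 2^7-1 and 2^4-1 figures above.
    """
    value_bits = 7 - namespace_bits
    if value_bits < 1:
        raise ValueError("no value bits left for the ID itself")
    return 2 ** value_bits - 1


print(class_a_id_count())    # no bits borrowed
print(class_a_id_count(3))   # 3 bits borrowed, i.e. at most 8 namespaces
```

With no bits borrowed this gives 127, and with 3 bits borrowed it gives
15, while the 3 bits themselves can only distinguish 8 namespaces.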
> You seem to be prepared to make big changes to the format, but
> I don't know to what extent we should break compatibility...
Again, for current Matroska files it shouldn't be a problem, as Matroska
would be the default namespace. In that case the namespace shouldn't be
used for such IDs. That means older files will play without any problem
in namespace-aware parsers. Only newer files containing some namespace
will not be usable by older parsers. Which AFAIK is the same as with what
you propose.
BTW, we still haven't discussed how to define the namespace in the EBML
header, but the DocType existing today will remain. And that is the way
to define the special namespace that will be used as default... The
other namespaces will probably end up in a list. It's like the
DocType in XML (like html). At least we got the right name for that field ;)
>> What we need is to make one of the namespace in the document
>> be set as "default", ie not marked. The same way we don't
>> have to write mandatory elements that have the default value.
>> This way Matroska can keep its low overhead and be extended
>> by new namespaces.
>
> This could be the best solution, if we can find a way to
> express the namespace descriptors in a space-efficient manner
> *and* not making it a pain in the a** for random-seeking
> applications to recover the namespace state. However if it
> can't be done I would rather have the namespace expressed for
> each element in a file using them, and have files with no
> namespaces at all (ns desl length=0) like plain Matroska.
> With proper prefix-coding of the ns descriptor, one could use
> only one or two *bits* per element.
You want to use external files? I don't really understand. The
namespace 'tagging' of each element/ID has to be done inside the file...
Well, actually, not really: you can use some namespaces in the file
without defining them. You'd just need to know how to map IDs to the
different namespaces. The only thing missing for the file to be usable is
the semantics. That's where the DTD (internal and/or external) comes in.
Is that what you want to make external?
>> Also, if we use an EBML element to say: all lower elements
>> use namespace XYZ it could replace the default value.
>> Namespace switching would only occur in very localized
>> places. That's the difference between having the "using
>> namespace XYZ" approach and the "XYZ::element" one. We might
>> use both (as in C++).
>
> Using such a following-sibling approach would hurt seeking as
> it is currently done. Of course we can add specific rules
> like : an element containing namespace switches MUST NOT have
> its sub-elements indexed by seek heads, except if these seek
> heads point the parser to all relevant ns switches. This
Yes, that would be a limitation, but seeking in a file format is a very
special feature that most formats don't use much. It makes sense in A/V
formats where there is a timeline, but then you need to know the semantics
to know what you're looking for when seeking.
> raises an important issue about the effective structure of
> libraries (of course, people who implement the whole parsing
> for their own application will have less problems here)
> dedicated to do the parsing. I believe that namespace
> processing really should be unknown to the specific
> applications.
Yes. But again, the namespace implies the semantics. All namespace-related
functionality should be handled at the EBML level, but there will always
be the need to map the semantics for the application.
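
A rough Python sketch of that split, assuming a purely hypothetical
design: the EBML layer resolves which namespace an element belongs to
(explicit "XYZ::element" tag first, then the nearest enclosing "using
namespace XYZ" scope, then the DocType default), and the application only
supplies a table mapping (namespace, ID) pairs to meanings. The function
names, the "comments" namespace and its ID 0x81 are invented for the
example; 0x18538067 is the Matroska Segment ID.

```python
def resolve_namespace(explicit_ns, scope_stack, default_ns):
    """EBML-level step: decide which namespace an element belongs to.

    explicit_ns -- namespace written on the element itself, or None
    scope_stack -- namespaces set by enclosing "using namespace" elements,
                   outermost first; None for parents that set nothing
    default_ns  -- the unmarked namespace named by the DocType
    """
    if explicit_ns is not None:
        return explicit_ns
    for scoped_ns in reversed(scope_stack):
        if scoped_ns is not None:
            return scoped_ns
    return default_ns


# Application-level step: the semantics live in a table that the
# application provides (entries here are illustrative only).
SEMANTICS = {
    ("matroska", 0x18538067): "Segment",
    ("comments", 0x81): "Comment",  # hypothetical foreign element
}


def describe(element_id, explicit_ns, scope_stack, default_ns="matroska"):
    """Return the meaning of an element, or None so it can be skipped."""
    namespace = resolve_namespace(explicit_ns, scope_stack, default_ns)
    return SEMANTICS.get((namespace, element_id))
```

An application that does not know the "comments" namespace would simply
have no entry for it and skip the element, which matches the usual rule
of ignoring unknown elements.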
>> Yes, I was thinking about that too. That's why I prefer to
>> keep the IDs intact and the format proposed above is good.
>> Seeking (at least in
>> matroska) can remain unchanged. For other formats we would
>
> Since there is no foreign-format mixing possibility due to
> the lack of namespaces, there is indeed no problem.
>
>> need to take the namespace into account to make sure the
>> element is in the namespace we're looking for.
>
> I was thinking a little more about seeking and I came to the
> conclusion that seeking (indexing and pointing to some part
> of the file, and the like) should go in EBML (or some seeking
> NS), and not in each specific application. Why ? Imagine you
> have some program to add comments in EBML files. You could
> take any element in the file and add a string comment. The app
> has its private elements and would use a separate namespace,
> so it wouldn't interfere with the existing data. The file would
> still be readable by the original program, the natural rule
> being to simply ignore unknown elements. However, adding
> elements changes the size of the file, and therefore the
> positions to which the seek heads point. Moving seeking into
> EBML would enable automatic relocation of the seek-heads.
That's indeed a good point. Remuxing a Matroska file with seek/cue
elements requires knowing the Matroska semantics.
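
The relocation itself is simple once the positions are known, which is
why it could live in a generic layer. A minimal Python sketch, assuming
the seek entries have already been extracted as plain byte offsets (the
function name and arguments are hypothetical, and it ignores the case
where a parent's size field changes length because of the insertion):

```python
def relocate_seek_positions(positions, insert_at, inserted_len):
    """Shift stored offsets after inserting bytes into the file.

    positions    -- byte offsets stored in the seek entries
    insert_at    -- offset at which the new element was inserted
    inserted_len -- total length in bytes of the inserted element
    Offsets at or after the insertion point move; earlier ones do not.
    """
    return [
        pos + inserted_len if pos >= insert_at else pos
        for pos in positions
    ]
```

For instance, inserting a 32-byte comment element at offset 400 turns the
offsets [100, 500, 900] into [100, 532, 932]. Finding those offsets in the
first place is the part that currently needs the Matroska semantics.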
When I read that it made me think of XPath in XML. I have no idea how
it works for XML, but it seems to be an extension used to define pointers
between elements and/or documents (one direction or bidirectional). And
that's something that could make sense at the EBML level too. That
wouldn't be backward compatible, but depending on the solution we come
up with, it could be a good replacement.
> A more interesting thing with this is that local namespace
> state can be recovered while seeking, since it would be the
> job of the EBML library to make seek heads and it could
> include all the necessary information.
Yes, that could work. But again, seeking (at the EBML or semantic level)
is a special/tricky feature. In the case of Matroska it's only for Level
1 elements and therefore you always know the upper context (the Segment,
since that's the only Level 0 element). I can hardly imagine a format
that would need to seek at Level 2 or deeper. Especially because after
seeking you probably need the context of the upper elements (one of the
features of nested formats is that each element has a context in which to
interpret it).
While I'm all for seeking at the EBML level (it would help the format's
resistance to errors too), we shouldn't over-design it for cases that
won't make sense in the real world. We'll see what the discussion leads
to :)
Steve
--
robUx4 on blog <http://robux4.blogspot.com/>