[Matroska-devel] EBML specification component for review - Element Data Size

Dave Rice dave at dericed.com
Sat May 2 20:45:32 CEST 2015

> On May 2, 2015, at 12:08 PM, wm4 <nfxjfg at googlemail.com> wrote:
> On Sat, 2 May 2015 17:59:22 +0200
> Steve Lhomme <slhomme at matroska.org> wrote:
>> On Fri, May 1, 2015 at 12:48 PM, wm4 <nfxjfg at googlemail.com> wrote:
>>> On Fri, 1 May 2015 12:15:04 +0200
>>> Moritz Bunkus <moritz at bunkus.org> wrote:
>>> There's EBMLMaxIDLength, which gives the length of IDs. I see
>>> absolutely no point in making this value different from 4.
>> You never know. Maybe someone will want to put a SHA1 someday.
> Doesn't make sense. The ID has to follow a certain formatting (the
> variable length encoding), so you can't have arbitrary data as ID. Only
> a subset of SHA1 hashes are valid EBML IDs.

Perhaps I'm missing it, but I don't think the EBML spec is strong enough to say that Element IDs must use variable length encoding. The closest is where it says "Element ID coded with an UTF-8 like system", but it's missing a declaration of mandate or "MUST".
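
To make the SHA1 point concrete, here is a rough sketch of the check a parser would apply if the "UTF-8 like" length prefix were mandatory for Element IDs. The function names and the max_id_length parameter are mine, not from the spec, and the sketch only tests the length prefix; it does not attempt any other validity rule a stricter spec might add.

```python
import hashlib


def vint_length(data: bytes) -> int:
    """Total length in octets signalled by the leading-zero prefix, or 0 if no marker bit is set."""
    for index, byte in enumerate(data):
        if byte:
            # Zero bits before the first set bit in this octet.
            leading_zeros = 8 - byte.bit_length()
            return index * 8 + leading_zeros + 1
    return 0


def looks_like_element_id(data: bytes, max_id_length: int = 4) -> bool:
    """True if the prefix-signalled length matches the actual length and fits the limit.

    Assumption: IDs must use the variable length encoding, which is the very
    point the current spec text does not state with a MUST.
    """
    length = vint_length(data)
    return 0 < length <= max_id_length and length == len(data)


ebml_header_id = bytes.fromhex("1A45DFA3")   # the EBML header ID
sha1_digest = hashlib.sha1(b"").digest()     # 20 arbitrary-looking octets

print(looks_like_element_id(ebml_header_id))              # 4-octet ID, prefix agrees
print(looks_like_element_id(sha1_digest, max_id_length=20))
```

Under that assumption a 20-octet ID would have to start with 19 zero bits followed by a one bit, so only a small fraction of SHA1 digests would pass, even with EBMLMaxIDLength raised to 20.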

>>> Then there's EBMLMaxSizeLength. With 8 you're apparently limited to
>>> 2^56-2. This will probably be high enough forever. Maybe you could argue
>>> that larger sizes should be possible; then I would suggest that the
>>> limit should be fixed to 2^64-1, which would require a maximum length
>>> of 10 bytes, with some representable values out of spec. Or maybe limit
>>> it to 9 bytes, which would make the maximum 2^63-2 or so.
>>> But going higher than 2^64 ever doesn't seem useful. Why make it
>>> possible? Handling up to 2^64 on the other hand is easy.
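
The arithmetic behind those limits can be checked with a short sketch. It assumes, as the figures above imply, that each octet of the size field contributes 7 usable bits and that the all-ones value is reserved for "unknown"; the function name is made up for illustration.

```python
def max_known_size(length_octets: int) -> int:
    """Largest known size representable in a size field of the given length.

    Assumption: 7 data bits per octet, with the all-ones pattern reserved
    for the unknown size, hence the "- 2" rather than "- 1".
    """
    data_bits = 7 * length_octets
    return (1 << data_bits) - 2


for octets in (8, 9, 10):
    limit = max_known_size(octets)
    covers_64_bit = limit >= 2**64 - 1
    print(f"{octets} octets: 2^{7 * octets}-2, covers 2^64-1: {covers_64_bit}")
```

So 8 octets give 2^56-2, 9 octets give 2^63-2, and 10 octets are the first length that covers every 64-bit value, at the cost of also being able to encode values above 2^64-1.
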
>> Hopefully someday the whole universe can be contained in one EBML stream.
>> We don't need to put limits where we don't need to.
>> Now that we're talking about it, in the Matroska specs we should specify
>> that the vint size cannot exceed 8 octets and the id size cannot exceed 4
>> octets.
>>> Existing demuxers will probably handle them anyway. There are a lot of
>>> buggy Matroska files around, and demuxers potentially contain hacks and
>>> concessions for such broken files. But this doesn't mean a tool that
>>> generates valid files according to the new spec should be allowed to
>>> create such "problematic" files.
>> If there are hacks in Matroska parsers due to bad muxing, they are never at
>> the EBML level. If the EBML is broken, it's OK not to care.
>>> So some cleanup is necessary, unless you fully want to concentrate
>>> on 1) - and even then you don't want to formalize every single broken
>>> file you can find.
>>> In this case, I think unknown lengths should be disallowed in most
>>> contexts because they make parsers more complicated for a feature that
>>> is almost never used. It also makes extending the format harder: in my
>>> understanding, if a sub-element has an unknown element ID, the parser
>>> can't continue.
>> Even if it doesn't make sense in Matroska (it may be deprecated but it's
>> been used widely in GStreamer), it should not go away from EBML. Low latency
>> transmission of data is a very nice feature and an advantage over a lot of
>> other binary formats. XML or JSON can be streamed because they have end
>> markers ('>' or '}') but that's usually not the case for binary. So we
>> should keep this feature.
> That's why I'm saying EBML should explicitly require formats
> to specify where exactly unknown lengths can happen, and the Matroska
> format specification should restrict them to cases useful for
> streaming, such as the Segment and Cluster elements.
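
As a sketch of what that restriction could look like in a parser: decode the Element Data Size, treat the all-ones pattern as unknown, and accept an unknown size only for a short list of elements. The allow-list below is a hypothetical policy based on the proposal above, not something the spec currently says, and the function names are mine. The sketch only handles size fields of up to 8 octets, since it inspects just the first octet for the length prefix.

```python
from typing import Optional, Tuple

SEGMENT_ID = 0x18538067
CLUSTER_ID = 0x1F43B675

# Hypothetical policy: only these elements may carry an unknown size.
UNKNOWN_SIZE_ALLOWED = {SEGMENT_ID, CLUSTER_ID}


def read_data_size(buf: bytes, max_size_length: int = 8) -> Tuple[Optional[int], int]:
    """Decode an Element Data Size; returns (size, octets_used), size None meaning unknown."""
    if not buf or buf[0] == 0:
        raise ValueError("empty input or size field longer than 8 octets")
    length = 8 - buf[0].bit_length() + 1      # leading zero bits + 1
    if length > max_size_length or length > len(buf):
        raise ValueError("size field too long or truncated")
    data_mask = (1 << (7 * length)) - 1       # keeps the 7*length data bits
    value = int.from_bytes(buf[:length], "big") & data_mask
    if value == data_mask:                    # all data bits set: unknown size
        return None, length
    return value, length


def size_is_acceptable(element_id: int, size: Optional[int]) -> bool:
    """Known sizes are always fine; unknown sizes only for allow-listed elements."""
    return size is not None or element_id in UNKNOWN_SIZE_ALLOWED


size, used = read_data_size(b"\xFF")
print(size, used, size_is_acceptable(CLUSTER_ID, size))
```

With a rule like this, a parser that meets an unknown size on any other element can reject the file outright instead of having to guess where the element ends.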
