[Matroska-devel] EBML specification component for review - Element Data Size

wm4 nfxjfg at googlemail.com
Sat May 2 18:08:07 CEST 2015

On Sat, 2 May 2015 17:59:22 +0200
Steve Lhomme <slhomme at matroska.org> wrote:

> On Fri, May 1, 2015 at 12:48 PM, wm4 <nfxjfg at googlemail.com> wrote:
> > On Fri, 1 May 2015 12:15:04 +0200
> > Moritz Bunkus <moritz at bunkus.org> wrote:
> >
> > There's EBMLMaxIDLength, which gives the length of IDs. I see
> > absolutely no point in making this value different from 4.
> >
> >
> You never know. Maybe someone will want to put a SHA1 someday.

Doesn't make sense. The ID has to follow a certain formatting (the
variable length encoding), so you can't have arbitrary data as ID. Only
a subset of SHA1 hashes are valid EBML IDs.
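
To make that concrete, here is a rough sketch of the length check. The names vint_length and is_well_formed_id are made up for illustration and do not come from any existing library. It only checks that the leading bits announce the same length as the number of octets actually present; the specification may add further restrictions (reserved values, shortest-form encoding) that are not checked here.

```python
import hashlib


def vint_length(data: bytes) -> int:
    """Length in octets announced by the leading zero bits, or 0 if no marker bit exists."""
    for i, byte in enumerate(data):
        if byte:
            leading_zeros = 8 - byte.bit_length()
            # Every leading zero bit adds one octet; the marker bit ends the prefix.
            return i * 8 + leading_zeros + 1
    return 0


def is_well_formed_id(data: bytes, max_id_length: int = 4) -> bool:
    """True if data is exactly one variable-length-encoded ID within max_id_length."""
    n = vint_length(data)
    return n != 0 and n == len(data) and n <= max_id_length


# The EBML header ID is well formed: 0x1A has three leading zero bits, so 4 octets.
print(is_well_formed_id(bytes.fromhex("1A45DFA3")))

# A SHA-1 digest is 20 octets, so even with EBMLMaxIDLength raised to 20 it
# would need 19 leading zero bits followed by a 1 bit to qualify.
digest = hashlib.sha1(b"").digest()
print(is_well_formed_id(digest, max_id_length=20))
```

With a uniformly distributed hash, only about one digest in 2^20 starts with the required bit pattern.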

> > Then there's EBMLMaxSizeLength. With 8 you're apparently limited to
> > 2^56-2. This will probably be high enough forever. Maybe you could argue
> > that larger sizes should be possible; then I would suggest that the
> > limit should be fixed to 2^64-1, which would require a maximum length
> > of 10 bytes, with some representable values out of spec. Or maybe limit
> > it to 9 bytes, which would make the maximum 2^63-2 or so.
> >
> > But ever going higher than 2^64 doesn't seem useful. Why make it
> > possible? Handling up to 2^64 on the other hand is easy.
> >
> Hopefully someday the whole universe can be contained in one EBML stream.
> We don't need to put limits where we don't need to.
> Now that we're talking about it, in the Matroska specs we should specify
> that the vint size cannot exceed 8 octets and the id size cannot exceed 4
> octets.
> > Existing demuxers will probably handle them anyway. There are a lot of
> > buggy Matroska files around, and demuxers potentially contain hacks and
> > concessions for such broken files. But this doesn't mean a tool that
> > generates valid files according to the new spec should be allowed to
> > create such "problematic" files.
> >
> If there are hacks in Matroska parsers due to bad muxing, they are never at
> the EBML level. If the EBML is broken, it's OK not to care.
> > So some cleanup is necessary, unless you fully want to concentrate
> > on 1) - and even then you don't want to formalize every single broken
> > file you can find.
> >
> > In this case, I think unknown lengths should be disallowed in most
> > contexts because they make parsers more complicated for a feature that
> > is almost never used. It also makes extending the format harder: in my
> > understanding, if a sub-element has an unknown element ID, the parser
> > can't continue.
> >
> Even if it doesn't make sense in Matroska (it may be deprecated but it's
> been used widely in GStreamer), it should not go away for EBML. Low latency
> transmission of data is a very nice feature and an advantage over a lot of
> other binary formats. XML or JSON can be streamed because they have end
> markers ('>' or '}') but that's usually not the case for binary. So we
> should keep this feature.

That's why I'm saying EBML should explicitly require formats
to specify where exactly unknown lengths can happen, and the Matroska
format specification should restrict them to cases useful for
streaming, such as the Segment and Cluster elements.
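
A rough sketch of what such a restriction could look like in a parser. The function names and the UNKNOWN_SIZE_ALLOWED set are hypothetical; the set only encodes the policy suggested above and is not taken from any specification text. The size decoding assumes a maximum size length of 8 octets, with the all-ones data value reserved for "unknown", which is why the largest known size in 8 octets is 2^56-2.

```python
SEGMENT_ID = 0x18538067
CLUSTER_ID = 0x1F43B675

# Assumption: the policy proposed in this message, not an existing rule.
UNKNOWN_SIZE_ALLOWED = {SEGMENT_ID, CLUSTER_ID}

MAX_SIZE_LENGTH = 8


def decode_size(data: bytes):
    """Return (value, length); value is None when the size is 'unknown'."""
    if not data or data[0] == 0:
        # A zero first octet would announce a length of more than 8 octets.
        raise ValueError("empty input or size longer than 8 octets")
    length = 8 - data[0].bit_length() + 1
    if length > MAX_SIZE_LENGTH or length > len(data):
        raise ValueError("truncated or over-long size field")
    mask = (1 << (7 * length)) - 1  # 7 data bits per octet
    value = int.from_bytes(data[:length], "big") & mask  # strips the marker bit
    if value == mask:
        return None, length  # all data bits set: unknown size
    return value, length


def check_size(element_id: int, size_bytes: bytes):
    """Decode a size and reject unknown sizes outside the allowed elements."""
    value, _ = decode_size(size_bytes)
    if value is None and element_id not in UNKNOWN_SIZE_ALLOWED:
        raise ValueError(f"unknown size not allowed for element 0x{element_id:X}")
    return value
```

With this, check_size(CLUSTER_ID, b"\xff") returns None and the parser knows it has to find the end of the Cluster by other means, while the same size octet on, say, a SimpleBlock (ID 0xA3) is rejected outright.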
