[Matroska-devel] EBML specification component for review - Element Data Size

wm4 nfxjfg at googlemail.com
Fri May 1 12:48:16 CEST 2015

On Fri, 1 May 2015 12:15:04 +0200
Moritz Bunkus <moritz at bunkus.org> wrote:

> Hey,
> > > Element Data Size
> > >
> > > The EBML element data size is encoded as a variable size integer with, by
> > > default, widths up to 8. Another maximum width value can be set by setting
> > > another value to EBMLMaxSizeWidth in the EBML header. See section 5.1.
> >
> > What's the point of such a header?
> Do you mean the general EBML header or the EBMLMaxSizeWidth element?
> You need a general EBML header in order to tell the parser which type of
> content this file carries. Remember, EBML does not only carry
> Matroska. It's similar to XML files where you need some namespace
> declarations so that the parser can actually understand where the
> element names come from and what they mean; in our case: so that the
> parser knows what EBML ID 0x4711 is semantically.

Yes, sorry, I meant the EBMLMaxSizeWidth element.

> EBMLMaxSizeWidth on the other hand is… Well… I guess it can be useful
> for allowing even more element IDs if you ever invent such a crazy
> format. It's good to be flexible here. I'm pretty sure, though, that no
> existing parser has ever been tested with EBML IDs (and with
> EBMLMaxSizeWidth by extension) being longer than four bytes, especially
> as Matroska only uses four-byte long IDs.

There's EBMLMaxIDLength, which gives the maximum length of IDs. I see
absolutely no point in making this value different from 4.

Then there's EBMLMaxSizeLength. With 8 you're apparently limited to
2^56-2. This will probably be high enough forever. Maybe you could argue
that larger sizes should be possible; then I would suggest that the
limit should be fixed to 2^64-1, which would require a maximum length
of 10 bytes, with some representable values out of spec. Or maybe limit
it to 9 bytes, which would make the maximum 2^63-2.

But ever going higher than 2^64 doesn't seem useful. Why make it
possible? Handling up to 2^64, on the other hand, is easy.
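
To make the arithmetic above concrete, here is a minimal sketch in Python.
It assumes the usual EBML layout: the number of leading zero bits in the
first byte gives the width, each byte contributes 7 data bits, and the
all-ones value is reserved for "unknown size". The function names are
mine, not from any spec or library, and widths above 8 are deliberately
not handled.

```python
def max_data_size(width: int) -> int:
    """Largest known size that fits in `width` bytes.

    7 data bits per byte; the all-ones value is reserved for "unknown".
    """
    return (1 << (7 * width)) - 2


def read_size(buf: bytes, max_width: int = 8):
    """Decode a variable-size integer from the start of `buf`.

    Returns (value, width); value is None for the reserved unknown size.
    Assumption: max_width <= 8, so the marker bit is in the first byte.
    """
    if not buf or buf[0] == 0:
        raise ValueError("empty input or width above 8 (not handled here)")
    width = 9 - buf[0].bit_length()      # leading zero bits + 1
    if width > max_width or len(buf) < width:
        raise ValueError("size field too wide or truncated")
    value = buf[0] & (0xFF >> width)     # strip the width marker bit
    for b in buf[1:width]:
        value = (value << 8) | b
    if value == (1 << (7 * width)) - 1:  # all data bits set
        return None, width
    return value, width
```

With this, max_data_size(8) is 2^56-2 and max_data_size(9) is 2^63-2,
while 10 bytes is the first width whose range covers 2^64-1.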

> > In the context of Matroska, I'd disallow unknown lengths in almost all
> > contexts (i.e. disallow them by default, allow them only when
> > explicitly specified). Because they make demuxing a pain, and most
> > Matroska demuxers probably support unknown length at a few places at
> > best.
> While this is true, we have to be careful not to invalidate existing
> files retroactively. Unknown sizes have been in the specs since the
> beginning; changing them now is not something we should do on a whim (if
> at all).

Making a formal spec for an informal standard that has been around for
a decade can do 2 things:

1.) formalize existing practice
2.) restrict the existing specification to a subset of what commonly
    occurs "in the wild"

The previous specification was not very precise in a lot of things. But
this doesn't mean a new specification has to be the same. It's
desirable to make the specification as minimal as possible (without
losing precision). If there are some valid Matroska files which would
become invalid with the new spec, and if not many such files are
around, I see no problem with this.

Existing demuxers will probably handle them anyway. There are a lot of
buggy Matroska files around, and demuxers potentially contain hacks and
concessions for such broken files. But this doesn't mean a tool that
generates valid files according to the new spec should be allowed to
create such "problematic" files.

So some cleanup is necessary, unless you want to concentrate fully
on 1) - and even then you don't want to formalize every single broken
file you can find.

In this case, I think unknown lengths should be disallowed in most
contexts because they make parsers more complicated for a feature that
is almost never used. It also makes extending the format harder: in my
understanding, if an element of unknown length contains a sub-element
whose ID the parser doesn't recognize, the parser can't tell whether
the parent has ended, so it can't continue.
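
A rough sketch of the problem, in Python. It assumes a demuxer that ends
an unknown-length element when it sees an ID that cannot be a child of
it. The ID tables are illustrative only (a real demuxer would derive
them from the schema), and classify() is a made-up helper, not anyone's
actual API.

```python
# Illustrative ID tables, not a complete schema.
CLUSTER_CHILDREN = {0xE7, 0xA3, 0xA0}   # e.g. Timecode, SimpleBlock, BlockGroup
TOP_LEVEL = {0x1F43B675, 0x1C53BB6B}    # e.g. Cluster, Cues


def classify(element_id, children=CLUSTER_CHILDREN, elsewhere=TOP_LEVEL):
    """Decide what an ID means while inside an unknown-length parent."""
    if element_id in children:
        return "child"          # keep parsing inside the parent
    if element_id in elsewhere:
        return "parent ended"   # implicit end of the unknown-length parent
    # ID from a newer spec or an extension: could be either of the above.
    return "ambiguous"
```

With a known parent length the third case is harmless: the parser skips
the unrecognized element using its own size and carries on until the
parent's end offset.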
