[Matroska-devel] EBML specification component for review - Element Data Size

Dave Rice dave at dericed.com
Fri May 1 14:05:46 CEST 2015

> On May 1, 2015, at 6:48 AM, wm4 <nfxjfg at googlemail.com> wrote:
> On Fri, 1 May 2015 12:15:04 +0200
> Moritz Bunkus <moritz at bunkus.org> wrote:
>> Hey,
>>>> Element Data Size
>>>> The EBML element data size is encoded as a variable size integer with, by
>>>> default, widths up to 8. A different maximum width can be set by changing
>>>> the value of EBMLMaxSizeWidth in the EBML header. See section 5.1.
>>> What's the point of such a header?
>> Do you mean the general EBML header or the EBMLMaxSizeWidth element?
>> You need a general EBML header in order to tell the parser which type of
>> content this file carries. Remember, EBML does not only carry
>> Matroska. It's similar to XML files where you need some namespace
>> declarations so that the parser can actually understand where the
>> element names come from and what they mean; in our case: so that the
>> parser knows what EBML ID 0x4711 is semantically.
> Yes, sorry, I meant the EBMLMaxSizeWidth element.
>> EBMLMaxSizeWidth on the other hand is… Well… I guess it can be useful
>> for allowing even more element IDs if you ever invent such a crazy
>> format. It's good to be flexible here. I'm pretty sure, though, that no
>> existing parser has ever been tested with EBML IDs (and with
>> EBMLMaxSizeWidth by extension) being longer than four bytes, especially
>> as Matroska only uses four-byte long IDs.
> There's EBMLMaxIDLength, which gives the length of IDs. I see
> absolutely no point in making this value different from 4.

You have to read between the lines, but I think the spec mandates that this value is set to 4. The spec says "4 or less", but since the EBML ID length itself is 4, EBMLMaxIDLength has no other valid value. Correct?

> Then there's EBMLMaxSizeLength. With 8 you're apparently limited to
> 2^56-2. This will probably be high enough forever. Maybe you could argue
> that larger sizes should be possible; then I would suggest that the
> limit should be fixed to 2^64-1, which would require a maximum length
> of 10 bytes, with some representable values out of spec. Or maybe limit
> it to 9 bytes, which would make the maximum 2^63-2 or so.
> But ever going higher than 2^64 doesn't seem useful. Why make it
> possible? Handling up to 2^64 on the other hand is easy.
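
To make the arithmetic above concrete, here is a minimal Python sketch of how a data size could be decoded and what the largest known size is for a given width. It assumes the usual EBML layout (a width of n octets carries n-1 leading zero bits, a marker bit, and 7*n value bits, with the all-ones value reserved for "unknown size"). The function names and the max_size_width parameter are made up for illustration and are not taken from any existing parser.

```python
from typing import Optional, Tuple


def max_data_size(width: int) -> int:
    """Largest known size for a given width: 7 value bits per octet,
    minus the all-ones pattern that is reserved for 'unknown size'."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return (1 << (7 * width)) - 2


def decode_data_size(data: bytes, max_size_width: int = 8) -> Tuple[int, Optional[int]]:
    """Return (width, value); value is None when the size is unknown.

    max_size_width is a hypothetical parameter standing in for the
    maximum width declared in the EBML header."""
    width = None
    # The position of the first set bit (the marker) gives the width.
    for index, octet in enumerate(data[:max_size_width]):
        if octet:
            leading_zeros = 8 - octet.bit_length()
            width = index * 8 + leading_zeros + 1
            break
    if width is None or width > max_size_width:
        raise ValueError("no width marker within the allowed width")
    if len(data) < width:
        raise ValueError("truncated data size")
    raw = int.from_bytes(data[:width], "big")
    all_ones = (1 << (7 * width)) - 1
    value = raw & all_ones  # strip the marker bit and the zeros before it
    return width, (None if value == all_ones else value)
```

With these definitions, max_data_size(8) is 2^56-2 and max_data_size(9) is 2^63-2, while covering 2^64-1 needs a width of 10, which matches the numbers quoted above.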
>>> In the context of Matroska, I'd disallow unknown lengths in almost all
>>> contexts (i.e. disallow them by default, allow them only when
>>> explicitly specified). Because they make demuxing a pain, and most
>>> Matroska demuxers probably support unknown length at a few places at
>>> best.
>> While this is true, we have to be careful not to invalidate existing
>> files retroactively. Unknown sizes have been in the specs since the
>> beginning; changing them now is not something we should do on a whim (if
>> at all).
> Making a formal spec for an informal standard that has been around for
> a decade can do 2 things:
> 1.) formalize existing practice
> 2.) restrict the existing specification to a subset of what commonly
>    occurs "in the wild"
> The previous specification was not very precise in a lot of things. But
> this doesn't mean a new specification has to be the same. It's
> desirable to make the specification as minimal as possible (without
> losing precision). If there are some valid Matroska files which would
> become invalid with the new spec, and if not many such Matroska
> files are around, I see no problem in this.

With spec refinements that cause some 'in the wild' files to become invalid, we could:
1. consider the refinement to be a bug fix (i.e. the 'in the wild' file was invalid this entire time)
2. enforce the refinement only in version 4 (i.e. the 'in the wild' file stays as is). Here we could say versions 0-3 SHOULD do X, but version 4 MUST do X.

I prefer option 2 generally, but it'll probably be decided on a case-by-case basis.
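
As a rough illustration of option 2, a validator could grade the same finding differently depending on the document version. The sketch below is only an assumption about how that might look: Severity, refinement_severity, and the strict_from parameter are hypothetical names, not part of any spec text or existing tool.

```python
from enum import Enum


class Severity(Enum):
    WARNING = "warning"  # a SHOULD was violated
    ERROR = "error"      # a MUST was violated


def refinement_severity(doc_version: int, strict_from: int = 4) -> Severity:
    """Grade a violation of refinement X: a SHOULD for versions below
    strict_from (hypothetical default: 4), a MUST from that version on."""
    if doc_version >= strict_from:
        return Severity.ERROR
    return Severity.WARNING
```

An existing version 2 file that violates X would then only produce a warning, while a version 4 file with the same issue would be rejected.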

> Existing demuxers will probably handle them anyway. There are a lot of
> buggy Matroska files around, and demuxers potentially contain hacks and
> concessions for such broken files. But this doesn't mean a tool that
> generates valid files according to the new spec should be allowed to
> create such "problematic" files.
> So some cleanup is necessary, unless you fully want to concentrate
> on 1) - and even then you don't want to formalize every single broken
> file you can find.

Good point, certainly not.

> In this case, I think unknown lengths should be disallowed in most
> contexts because they make parsers more complicated for a feature that
> is almost never used. It also makes extending the format harder: in my
> understanding, if a sub-element has an unknown element ID, the parser
> can't continue.
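
The concern quoted above can be sketched as follows. With an unknown-size parent, a parser has to infer the parent's end from the IDs it encounters, so an ID it does not recognize leaves it unable to decide whether it is still inside the parent. This is a simplified model under stated assumptions: the ID sets are illustrative, and scan_unknown_size_parent is a hypothetical helper, not code from an existing demuxer.

```python
# Illustrative ID sets for one unknown-size parent element.
KNOWN_CHILD_IDS = {0xE7, 0xA3, 0xA0}
# IDs that can only appear outside the parent, so they mark its end.
KNOWN_TERMINATOR_IDS = {0x1F43B675, 0x1C53BB6B}


def scan_unknown_size_parent(headers):
    """Count children of an unknown-size parent.

    headers: iterable of (element_id, data_size) pairs in stream order,
    starting right after the parent's own header."""
    children = 0
    for element_id, _data_size in headers:
        if element_id in KNOWN_CHILD_IDS:
            children += 1
        elif element_id in KNOWN_TERMINATOR_IDS:
            return children  # the parent ended just before this element
        else:
            # Could be a new child from a later spec version, or something
            # that follows the parent: the parser cannot tell which.
            raise ValueError(f"cannot place unknown ID 0x{element_id:X}")
    return children  # end of stream also ends the parent
```

With a known parent size the same unknown ID would be harmless, since the parser could skip its payload and stop at the declared end of the parent.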
