[Matroska-devel] EBML specification component for review - Element Data Size

wm4 nfxjfg at googlemail.com
Fri May 1 17:24:31 CEST 2015


On Fri, 1 May 2015 11:12:14 -0400
Dave Rice <dave at dericed.com> wrote:

> 
> > On May 1, 2015, at 10:55 AM, wm4 <nfxjfg at googlemail.com> wrote:
> > 
> > On Fri, 1 May 2015 10:15:17 -0400
> > Dave Rice <dave at dericed.com> wrote:
> > 
> >> Hi,
> >> 
> >>> On May 1, 2015, at 9:18 AM, Moritz Bunkus <moritz at bunkus.org> wrote:
> >>> 
> >>> Hey,
> >>> 
> >>>> I know, EBML was supposed to be generic and flexible, but I would say
> >>>> experience with Matroska taught us that this specific aspect is not
> >>>> really needed.
> >>> 
> >>> I know of at least four (private) projects that use EBML as a basis for
> >>> various in-house solutions. Sure, it hasn't reached any kind of global
> >>> use, but Matroska is certainly not the only one.
> >>> 
> >>> Anyway, I don't care that strongly about the point, so I won't argue it
> >>> further. Limiting the maximum ID length to 4 is fine with me.
> >> 
> >> I'm going to try to summarize this thread. Apologies for a long post.
> >> 
> >> Regarding EBMLMaxSizeWidth, the EBML spec is flexible about what range is allowed, though Matroska constrains it. We aren't currently working on a new version of EBML but trying to clarify the existing one, so I propose this stay as is.
> > 
> > I think constraining it a bit is ok too. What value is a spec if it's
> > as loose as the previous version? (It'd act as better documentation,
> > which of course is fine too.)
> > 
> >> Unknown EBMLMaxSizeWidth are technically allowed, but the rules about their use potentially make the document nearly unparseable, as wm4 pointed out (especially if the parser doesn't know all possible Element IDs). Is it possible to parse an EBML document where all elements without sub-elements have unknown values? For me, "When an element that isn't a sub-element of the element with unknown size arrives, the element list is ended." isn't clear enough. For instance, an unknown element may contain various known sub-elements and then a VOID element, but the VOID element could be a child of either the unknown element or the grandparent element.
> > 
> > You're mixing several terms here. We were talking about elements with
> > unknown size, not missing EBMLMaxSizeWidth elements.
> 
> At any rate, EBMLMaxSizeWidth is mandatory. A missing EBMLMaxSizeWidth is invalid both in EBML and MKV.
> 
> > Yes, it would be possible to parse such an EBML document, as long as
> > the parser knows about all elements, and maintains a knowledge about
> > their relative ordering. But as I've said, this makes the parser a bit
> > more complicated than a simple one, and breaks future extensibility.
> > 
> > To be precise, for each element, the parser needs to know the set of
> > allowed sub-elements. The set of sub-elements must not overlap with the
> > set of sub-elements allowed in the grandparent element (which affects
> > in particular VOID and CRC elements).
> 
> Right. So perhaps we need specific constraints on the use of VOID and CRC within an element of unknown size. It's hard to imagine a use case for a CRC sub-element within an element of unknown size; perhaps a CRC within an unknown-length element can simply be forbidden.
> 
> Would there be an issue with SimpleTag in the case of an unknown size parent? SimpleTag can appear at several different levels.
> 
> Also the rules we're discussing presume that an EBML-based format makes all sub-elements known. Should we add a statement to the EBML spec that defines this to keep the unknown-size parsing possible?

It gets messy fast, which is why I think the use of elements with
unknown length should be severely restricted.
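
To illustrate the end-of-element rule discussed above, here is a minimal
sketch. The SCHEMA table is a hypothetical, heavily simplified excerpt
(real Matroska allows many more children), and both function names are
made up for this example. It shows that a Void following the children of
an unknown-size Cluster cannot be attributed unambiguously, because Void
is allowed in both the Cluster and its parent Segment.

```python
# Hypothetical, simplified schema: element name -> allowed direct children.
SCHEMA = {
    "Segment": {"Cluster", "Void"},
    "Cluster": {"SimpleBlock", "Void"},
}

def closes_unknown_size_element(open_element, next_id, schema=SCHEMA):
    """Apply the rule from the draft: an element that is not an allowed
    sub-element of the open unknown-size element ends its element list.
    Note that an ID the parser does not know at all also ends it, which
    is what hurts future extensibility."""
    return next_id not in schema[open_element]

def ambiguous_children(parent, child, schema=SCHEMA):
    """IDs allowed in both `child` and its `parent`. If `child` has
    unknown size, the parser cannot tell which of the two owns them."""
    return schema[parent] & schema[child]

print(closes_unknown_size_element("Cluster", "SimpleBlock"))  # False
print(closes_unknown_size_element("Cluster", "Cluster"))      # True
print(ambiguous_children("Segment", "Cluster"))               # {'Void'}
```

With this rule a Void is silently treated as a child of the open Cluster,
even if the writer meant it to sit at Segment level.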

Elements with unknown length hardly even have any use, except in some
very restricted scenarios like endless streaming. And not all Matroska
demuxers fully support even that.

I bet most third party implementers actually ignored the existence of
unknown length, as files in the wild rarely (if at all) use them.

> > Instead of "each element", it might also be required to consider the
> > list of parent elements (the "path"), depending on the structure of the
> > format.
> > 
> >> Question from our discussion:
> >> - Are EBMLMaxSizeWidth and EBMLMaxSizeLength synonymous? The RFC draft uses 'Width' whereas the spec uses 'Length'. I propose preferring Length.
> >> 
> >> Back to the origins of this thread...
> >> 
> >> For reference, the RFC draft states:
> >> 
> >>> The EBML element data size is encoded as a variable size integer with, by default, widths up to 8. Another maximum width value can be set by setting another value to EBMLMaxSizeWidth in the EBML header. See section 5.1. There is a range overlap between all different widths, so that 1 encoded with width 1 is semantically equal to 1 encoded with width 8. This allows for the element data to shrink without having to shrink the width of the size descriptor.
> >>> 
> >>> Values with all data bits set to 1 means size unknown, which allows for dynamically generated EBML streams where the final size isn't known beforehand. The element with unknown size MUST be an element with an element list as data payload. The end of the element list is determined by the ID of the element. When an element that isn't a sub-element of the element with unknown size arrives, the element list is ended.
> > 
> > Here I'd ask what the exact value of "unknown size" is. Sure, all bits
> > are 1, but how many 1 bits are needed? It's a variable-length encoding.
> > For example, the byte 0xFF has all bits set. Does this mean the length
> > is unknown? (I think it is in EBML, not quite sure right now.)
> > 
> >> 
> >> For cleanup, I propose (knowing that the unknown size issue is not yet resolved):
> >> 
> >> "The EBML element data size is encoded as a variable size integer. Another maximum width value can be set by setting another value to EBMLMaxSizeWidth in the EBML header. There is a range overlap between all different widths, so that 1 encoded with width 1 is semantically equal to 1 encoded with width 8. This allows for the element data to shrink without having to shrink the width of the size descriptor.
> > 
> > This doesn't sound very clear to me. EBMLMaxSizeWidth is the maximum
> > number of bytes that can be used to encode a single,
> > variable-length-encoded size field.
> 
> Thanks, I'll review and cleanup.
> 
> >> An EBML element data size with all data bits set to 1 indicates that the data size is unknown. This allows for dynamically generated EBML streams where the final size isn't known beforehand. The element with unknown size MUST be an element with an element list as data payload. The end of the element list is determined by the ID of the element. When an element that isn't a sub-element of the element with unknown size arrives, the element list is ended."
> >> 
> >> The RFC Draft also has this line just after that in the Data Size section:
> >> 
> >>> Since the highest value is used for unknown size the effective maximum data size is 2^56-2, using variable size integer width 8.
> >> 
> >> Whereas the EBML spec doesn't give an 8 bit limit, this line seems to imply it. There is a problem with the 1-filled unknown length if EBMLMaxSizeWidth is greater than 8. Is the unknown length value supposed to match the length of the EBMLMaxSizeWidth or be fixed at 8?
> > 
> > Bytes, not bits.
> > 
> > The unknown length is just unbounded. It has nothing to do with
> > EBMLMaxSizeWidth AFAIK.
> 
> Doesn't it? If EBMLMaxSizeWidth=12 or EBMLMaxSizeWidth=8 then the length of the 1-filled unknown size value changes accordingly, right?
> Dave Rice

In my understanding, there are multiple ways to encode the unknown
size, and 0xFF is the shortest one. The encoding is independent of
EBMLMaxSizeWidth, which only restricts the maximum number of bytes
you can use to encode the unknown size.
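
A minimal sketch of that reading, assuming the usual EBML variable-length
layout (the number of leading zero bits in the first byte gives the
width, followed by a marker bit and 7 data bits per byte of width). The
function name decode_data_size and its max_size_width parameter are made
up for this example, and widths above 8 are not handled.

```python
def decode_data_size(buf, max_size_width=8):
    """Decode one EBML data size field from the start of `buf`.

    Returns (size, width); size is None when all data bits are 1,
    i.e. the size is unknown. Sketch only: widths above 8 bytes
    (first byte 0x00) are rejected rather than decoded.
    """
    first = buf[0]
    if first == 0:
        raise ValueError("width > 8 is not handled in this sketch")
    # Position of the marker bit gives the width: 0x80 -> 1, 0x01 -> 8.
    width = 9 - first.bit_length()
    if width > max_size_width:
        raise ValueError("size field wider than EBMLMaxSizeWidth")
    if len(buf) < width:
        raise ValueError("truncated size field")
    raw = int.from_bytes(buf[:width], "big")
    data_mask = (1 << (7 * width)) - 1   # strips the marker bit
    value = raw & data_mask
    if value == data_mask:               # all data bits set: unknown size
        return None, width
    return value, width

print(decode_data_size(b"\xff"))                   # (None, 1)
print(decode_data_size(b"\x7f\xff"))               # (None, 2)
print(decode_data_size(b"\x01" + b"\xff" * 7))     # (None, 8)
print(decode_data_size(b"\x81"))                   # (1, 1)
print(decode_data_size(b"\x01" + b"\x00" * 6 + b"\x01"))  # (1, 8)
```

Under this sketch, EBMLMaxSizeWidth only limits which of the all-ones
encodings are acceptable. It does not change what 0xFF means.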

