[Matroska-devel] EBML specification component for review - Element Data Size

wm4 nfxjfg at googlemail.com
Fri May 1 16:55:21 CEST 2015


On Fri, 1 May 2015 10:15:17 -0400
Dave Rice <dave at dericed.com> wrote:

> Hi,
> 
> > On May 1, 2015, at 9:18 AM, Moritz Bunkus <moritz at bunkus.org> wrote:
> > 
> > Hey,
> > 
> >> I know, EBML was supposed to be generic and flexible, but I would say
> >> experience with Matroska taught us that this specific aspect is not
> >> really needed.
> > 
> > I know of at least four (private) projects that use EBML as a basis for
> > various in-house solutions. Sure, it hasn't reached any kind of global
> > use, but Matroska is certainly not the only one.
> > 
> > Anyway, I don't care that strongly about the point, so I won't argue it
> > further. Limiting the maximum ID length to 4 is fine with me.
> 
> I'm going to try to summarize this thread. Apologies for a long post.
> 
> Regarding EBMLMaxSizeWidth, the EBML spec is flexible about what range is allowed, though Matroska constrains it. We aren't currently working on a new version of EBML but trying to clarify the existing one. I propose this stay as is.

I think constraining it a bit is ok too. What value is a spec if it's
as loose as the previous version? (It'd act as better documentation,
which of course is fine too.)

> Unknown EBMLMaxSizeWidth are technically allowed, but the rules about use potentially make the document nearly unparseable as wm4 pointed out (especially if all possible Element IDs aren't known by the parser). Is it possible to parse an EBML document where all elements without sub-elements have unknown values? For me "When an element that isn't a sub-element of the element with unknown size arrives, the element list is ended." isn't clear enough. For instance an unknown element may contain various known sub-elements and then a VOID element, but the VOID element could be a child of the unknown element or the grandparent element.

You're mixing several terms here. We were talking about elements with
unknown size, not missing EBMLMaxSizeWidth elements.

Yes, it would be possible to parse such an EBML document, as long as
the parser knows about all elements and keeps track of their relative
ordering. But as I've said, this makes the parser a bit more
complicated, and it breaks future extensibility.

To be precise, for each element, the parser needs to know the set of
allowed sub-elements. The set of sub-elements must not overlap with the
set of sub-elements allowed in the grandparent element (which affects
in particular VOID and CRC elements).

Instead of "each element", it might also be required to consider the
list of parent elements (the "path"), depending on the structure of the
format.
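
As a rough illustration of that requirement, here is a minimal sketch
in Python. The CHILDREN table is a made-up toy schema (the names are
borrowed from Matroska, but the structure is simplified), and
place_next() is a hypothetical helper, not part of any EBML library.
It assumes every element on the current path has unknown size, and
reports an error whenever the next element could belong to more than
one level or is not known at all.

```python
# Hypothetical toy schema: element name -> set of allowed direct children.
# Void is allowed at every level, which is exactly the overlap that
# causes trouble.
CHILDREN = {
    "Root": {"Segment", "Void"},
    "Segment": {"Cluster", "Void"},
    "Cluster": {"Block", "Void"},
}


def place_next(path, next_elem, children=CHILDREN):
    """Decide where next_elem belongs when all open elements have unknown size.

    path is the list of currently open elements, outermost first.
    Returns the list of elements that stay open; next_elem becomes a
    child of the last one. Raises ValueError if the placement is
    ambiguous or if no open element allows next_elem.
    """
    # Every depth at which next_elem would be a legal child.
    candidates = [
        depth
        for depth, name in enumerate(path)
        if next_elem in children.get(name, set())
    ]
    if not candidates:
        raise ValueError(f"{next_elem!r} is not allowed anywhere in {path}")
    if len(candidates) > 1:
        levels = [path[d] for d in candidates]
        raise ValueError(f"{next_elem!r} is ambiguous: allowed in {levels}")
    # Close everything below the single level that accepts next_elem.
    return path[: candidates[0] + 1]
```

With path ["Root", "Segment", "Cluster"], a Block keeps all three
elements open, a second Cluster closes the first one, and a Void
cannot be placed because all three levels allow it. An element that
is missing from the table cannot be placed either, which is the
extensibility problem.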

> Question from our discussion:
> - EBMLMaxSizeWidth and EBMLMaxSizeLength are synonymous? The RFC draft uses 'Width' whereas the spec uses 'Length'. I propose preference to Length.
> 
> Back to the origins of this thread...
> 
> For reference, the RFC draft states:
> 
> > The EBML element data size is encoded as a variable size integer with, by default, widths up to 8. Another maximum width value can be set by setting another value to EBMLMaxSizeWidth in the EBML header. See section 5.1. There is a range overlap between all different widths, so that 1 encoded with width 1 is semantically equal to 1 encoded with width 8. This allows for the element data to shrink without having to shrink the width of the size descriptor.
> > 
> > Values with all data bits set to 1 means size unknown, which allows for dynamically generated EBML streams where the final size isn't known beforehand. The element with unknown size MUST be an element with an element list as data payload. The end of the element list is determined by the ID of the element. When an element that isn't a sub-element of the element with unknown size arrives, the element list is ended.

Here I'd ask what the exact value of "unknown size" is. Sure, all bits
are 1, but how many 1 bits are needed? It's a variable-length encoding.
For example, the byte 0xFF has all bits set. Does this mean the length
is unknown? (I think it is in EBML, not quite sure right now.)
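
To make the question concrete, here is a minimal decoding sketch in
Python. It assumes the reading where "all data bits set to 1" is
reserved at every width, so 0xFF (width 1, seven data bits all set)
counts as unknown. That reading is an assumption of the sketch, as
are the function name read_size() and its parameters. Widths above 8
are not handled.

```python
def read_size(buf, pos=0, max_size_width=8):
    """Decode one element data size starting at buf[pos].

    Returns (size, width). size is None when all data bits are 1,
    which this sketch treats as "unknown size" at any width.
    """
    first = buf[pos]
    if first == 0:
        # The width marker would sit in a later byte (width > 8).
        raise ValueError("width above 8 is not handled by this sketch")
    # Width is 1 plus the number of leading zero bits in the first byte.
    width = 9 - first.bit_length()
    if width > max_size_width:
        raise ValueError("size field is wider than EBMLMaxSizeWidth")
    if pos + width > len(buf):
        raise ValueError("truncated size field")
    # Drop the marker bit, then append the remaining bytes.
    value = first & ((1 << (8 - width)) - 1)
    for b in buf[pos + 1 : pos + width]:
        value = (value << 8) | b
    # 7 data bits per byte of width; all ones means unknown here.
    all_ones = (1 << (7 * width)) - 1
    return (None if value == all_ones else value), width
```

Under that assumption, read_size(b"\xff") gives (None, 1) and
read_size(b"\x7f\xff") gives (None, 2), so there is one "unknown"
pattern per width rather than a single fixed value.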

> 
> For cleanup, I propose (knowing that the unknown size issue is not yet resolved):
> 
> "The EBML element data size is encoded as a variable size integer. Another maximum width value can be set by setting another value to EBMLMaxSizeWidth in the EBML header. There is a range overlap between all different widths, so that 1 encoded with width 1 is semantically equal to 1 encoded with width 8. This allows for the element data to shrink without having to shrink the width of the size descriptor.

This doesn't sound very clear to me. EBMLMaxSizeWidth is the maximum
number of bytes that can be used to encode a single,
variable-length-encoded size field.
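
A small encoding sketch in Python may show what I mean. The names
write_size() and max_size_for_width() are made up for this example,
and the sketch assumes the all-ones pattern is reserved for "unknown"
at every width. It shows the range overlap (the value 1 can be
written with width 1 or width 8) and treats EBMLMaxSizeWidth purely
as an upper bound on the number of bytes.

```python
def max_size_for_width(width):
    """Largest known size that fits in `width` bytes.

    Each byte contributes 7 data bits; the all-ones pattern is
    assumed to be reserved for "unknown size".
    """
    return (1 << (7 * width)) - 2


def write_size(size, width=None, max_size_width=8):
    """Encode a known size, optionally forcing a specific width."""
    if size < 0:
        raise ValueError("size must be non-negative")
    if width is None:
        # Pick the shortest width that can hold the value.
        width = 1
        while size > max_size_for_width(width):
            width += 1
    if not 1 <= width <= max_size_width:
        raise ValueError("width exceeds EBMLMaxSizeWidth")
    if size > max_size_for_width(width):
        raise ValueError("size does not fit in the requested width")
    # The marker bit sits just above the 7 * width data bits.
    marker = 1 << (7 * width)
    return (marker | size).to_bytes(width, "big")
```

Here write_size(1, 1) yields the single byte 0x81, while
write_size(1, 8) yields 0x01 followed by six zero bytes and 0x01.
Both mean the same size. max_size_for_width(8) is 2^56-2, which
matches the limit quoted from the RFC draft further down.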

> An EBML element data size with all data bits set to 1 indicates that the data size is unknown. This allows for dynamically generated EBML streams where the final size isn't known beforehand. The element with unknown size MUST be an element with an element list as data payload. The end of the element list is determined by the ID of the element. When an element that isn't a sub-element of the element with unknown size arrives, the element list is ended."
> 
> The RFC Draft also has this line just after that in the Data Size section:
> 
> > Since the highest value is used for unknown size the effective maximum data size is 2^56-2, using variable size integer width 8.
> 
> Whereas the EBML spec doesn't give an 8 bit limit, this line seems to imply it. There is a problem with the 1 filled unknown length if EBMLMaxSizeWidth is greater than 8. Is the unknown length value supposed to match the length of the EBMLMaxSizeWidth or be fixed at 8?

Bytes, not bits.

The unknown length is just unbounded. It has nothing to do with
EBMLMaxSizeWidth AFAIK.

