[Matroska-devel] Several (minor) issues or underspecified areas in the MKV spec

Steve Lhomme slhomme at matroska.org
Sat Oct 17 13:28:42 CEST 2015


On 16/10/2015 08:51, Dave Rice wrote:
>
>> On Oct 11, 2015, at 3:43 AM, Steve Lhomme <slhomme at matroska.org
>> <mailto:slhomme at matroska.org>> wrote:
>>
>> 2015-10-05 18:47 GMT+02:00 Michael Bradshaw <mjbshaw at google.com
>> <mailto:mjbshaw at google.com>>:
>>> On Mon, Oct 5, 2015 at 8:03 AM, Dave Rice <dave at dericed.com
>>> <mailto:dave at dericed.com>> wrote:
>>>>
>>>> I'm working on the EBML specification (the one being drafted on GitHub)
>>>> quite a bit. What are your questions about EBML?
>>>
>>>
>>> Preface: some of these are weird corner cases that are extremely
>>> unlikely to occur for anyone doing anything sane. That said, I think
>>> parsers should consistently (or even gracefully) handle the insane,
>>> and in order to do that I think these corner cases should be
>>> clarified in the spec.
>>>
>>> Can a global element (i.e. Void, CRC-32) occur before an EBML element?
>>> If so, are they considered part of the document? (As is, it seems
>>> like an EBML document is implicitly defined as everything between an
>>> EBML header and the next EBML header (or EOF), in which case they are
>>> not considered part of the EBML document.)
>>
>> The CRC-32 cannot because it has to be in a Master-Element to make sense.
>>
>> The Void element could be placed between the "EBML Header" and the
>> "EBML Stream", to reserve space for later editing, for example. It may
>> belong to the EBML Header if its size fits the inside. Or it may be
>> in-between the Header and the Stream. The Stream actually starts with
>> a "level 0" element of the Stream described in the Doctype. What's
>> before and is not in the EBML Header can be discarded.
>
> I used the term "EBML Stream" in a different way in
> https://github.com/Matroska-Org/ebml-specification/pull/28. Here I used
> EBML Stream to mean a stream of many EBML Documents within a file or
> data stream, but here you use the term to mean the non-EBML Header part
> of the EBML Document. Preference as to which meaning of EBML Stream is
> correct? If using the EBML Stream of my PR then perhaps we need another
> term to mean the non-header part of an EBML Document.

You're right, it should read "EBML Document" where I wrote "EBML Stream".

>>> How should an EBMLMaxSizeLength > 8 be handled if it occurs after the
>>> element that needs it? (Specific edge case: DocType has a size length
>>> of 9, but DocType occurs before EBMLMaxSizeLength in the header; how
>>> should that be handled? Alternate edge case: a Void element occurring
>>> in (or before) an EBML element with a size length > 8 and occurring
>>> before EBMLMaxSizeLength.) Should the spec explicitly require parsers
>>> to parse as if EBMLMaxSizeLength is 8 unless and until explicitly told
>>> otherwise? Do the limitations of EBMLMaxSizeLength apply to the
>>> document immediately?
>>
>> The values in the EBML Header describe what the EBML parser will need
>> to parse the EBML Stream. On the other hand it should always be safe
>> to read the EBML Header even if your parser cannot handle the Stream
>> due to internal limitations. So we may define in the EBML specs that,
>> for the EBML Header, the ID Length must not be longer than 4 and the
>> Size Length may never be more than 8, maybe even 4 (I'd favor 4).
>
> I added this to https://github.com/Matroska-Org/ebml-specification/pull/28.
>
>>> Shouldn't EBMLMaxIDLength have a range of > 3 (given that the EBML
>>> element has an ID length of 4)?
>>
>> Not necessarily, as small EBML Doctypes may not need that much and
>> favor saving container space. As said above, the values in the EBML
>> Header describe the Doctype, the EBML Stream. Not the EBML Header
>> itself. We should definitely clarify that in the specs.
>
> Ah, I wasn't clear on this from any document before. I can add to the PR
> but will do this another day.
>
>>> Shouldn't EBMLMaxSizeLength have a range of > 0?
>>
>> Correct, it cannot be 0, just like EBMLMaxIDLength. I edited the
>> Matroska specs accordingly.
>
> I edited the EBML spec via PR here
> https://github.com/Matroska-Org/ebml-specification/pull/25.
>
>>> That is, if EBMLMaxSizeLength is 1, does that apply to elements in
>>> the EBML header immediately after it is encountered, meaning that if
>>> DocType followed it, it must have a length < 127?
>>
>> No, the "parsing context" defined by the EBML Header cannot be used
>> while it's being created.
>>
>>> Typo in the EBML spec in the Length definition for the Binary data
>>> type: “A Master-element” should be “A Byte Element”
>>
>> Which document? I could not find this.
>
> My mistake, PR here:
> https://github.com/Matroska-Org/ebml-specification/pull/29
>
>>> The EBML spec says that the Reserved ID (all bits set to 1) is the
>>> only ID that may change the Length Descriptor (the count of leading
>>> zeroes + 1). What exactly does it mean to "change the Length
>>> Descriptor"? Does this mean a Length Descriptor can be > 4 (even if
>>> EBMLMaxIDLength = 4) if the ID is the Reserved ID?
>>
>> I think it doesn't make sense as it is. I think it refers to the fact
>> that IDs should always be coded in their lowest form. But when all set
>> to 1 or 0, there's no default size.
>>
>> Dave: I'm not sure we defined that in the EBML specs yet but I think
>> we should (in a clearer form). I'm also not sure we defined what a
>> parser should do when encountering a reserved element (all ID data
>> set to 1 or 0). IMO it should just skip the element, rather than
>> consider the stream invalid/broken. That may be an easy way to
>> remove/clear some elements in the stream rather than rewriting a Void
>> element on top.
>
> I think it's fairly clear, see
> https://github.com/Matroska-Org/ebml-specification/blob/master/specification.markdown#element-id.
> It says that Element IDs must not have a vint_data of all 0 or all 1
> and must be in the shortest possible expression. It doesn't give
> instructions on what a parser should do if those rules aren't followed.
> In a few other places we say if THIS then the ELEMENT is INVALID.

Excellent.
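
To make the Element ID rules quoted above concrete, here is a rough
Python sketch of a checker. It is not taken from libebml or from the
spec text; the function name and the max_id_length default are
hypothetical. It also assumes that a longer ID whose data would be all
1s in the next shorter length (e.g. 0x407F) still counts as being in
shortest form, since the shorter all-1s value is itself reserved.

```python
def element_id_problems(raw: bytes, max_id_length: int = 4) -> list:
    """Return a list of rule violations for an encoded Element ID.

    Hypothetical helper for illustration only; an empty list means the
    ID passes the three rules discussed in this thread.
    """
    if not raw or raw[0] == 0:
        return ["no length marker bit in the first octet"]

    # Length descriptor: count of leading zero bits in the first octet, plus 1.
    length = 8 - raw[0].bit_length() + 1
    if len(raw) != length:
        return ["expected %d octet(s), got %d" % (length, len(raw))]

    problems = []
    if length > max_id_length:
        problems.append("ID is longer than the maximum ID length")

    # Strip the marker bit; what remains is vint_data (7 bits per octet).
    data_bits = 7 * length
    all_ones = (1 << data_bits) - 1
    vint_data = int.from_bytes(raw, "big") & all_ones

    if vint_data == 0:
        problems.append("vint_data is all 0")
    elif vint_data == all_ones:
        problems.append("vint_data is all 1")
    elif length > 1 and vint_data < (1 << (7 * (length - 1))) - 1:
        # Assumption: the all-1s value of the shorter form is reserved,
        # so only strictly smaller values could have been written shorter.
        problems.append("not in the shortest possible form")
    return problems


if __name__ == "__main__":
    for sample in ("1A45DFA3", "EC", "80", "FF", "4001", "407F"):
        print(sample, element_id_problems(bytes.fromhex(sample)))
```

What a parser does with a non-empty result (skip the element, or treat
the stream as broken) is exactly the part that is still undefined.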

>> We may use only one of these 2 reserved values and call it "Clear"
>> element (so probably all 0).
>
> Not sure I understand. You're talking about overwriting invalid elements
> with void or 'clear'?

Overwriting with Clear, although I can't think of a case where we
couldn't write a Void when we want to clear an element. We do not allow
elements without a size defined, so a Void will fit everywhere there's
already an element.

There's an edge case with EBML Void when you edit some content like
tags: if you use all the available space minus 1, you do not have space
to put an EBML Void at the end because you only have one octet to use.
When editing text you can use 0 padding at the end to get around the
issue. But libebml has a case where replacing an element is not possible
because of that.
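
The one-octet problem can be sketched in a few lines of Python. This is
only an illustration, not how libebml does it: make_void and encode_size
are hypothetical names. It relies on the Void ID being the single octet
0xEC and on the size field needing at least one more octet, so the
smallest possible Void is 2 octets.

```python
VOID_ID = 0xEC  # one-octet Element ID of the Void element


def encode_size(size: int, length: int) -> bytes:
    """Encode `size` as an element data size using exactly `length` octets."""
    if not 1 <= length <= 8:
        raise ValueError("size length must be between 1 and 8")
    if size >= (1 << (7 * length)) - 1:
        # The all-1s value is excluded here (reserved / unknown size).
        raise ValueError("size does not fit in %d octet(s)" % length)
    # Set the marker bit above the 7*length data bits.
    return ((1 << (7 * length)) | size).to_bytes(length, "big")


def make_void(gap: int) -> bytes:
    """Return a Void element that occupies exactly `gap` octets.

    Hypothetical helper: raises ValueError when the gap is too small,
    which is the one-octet edge case described above.
    """
    if gap < 2:
        raise ValueError("a Void needs at least 2 octets (ID + size)")
    for length in range(1, 9):
        payload = gap - 1 - length  # octets left after ID and size field
        if payload < 0:
            break
        if payload < (1 << (7 * length)) - 1:
            return bytes([VOID_ID]) + encode_size(payload, length) + bytes(payload)
    raise ValueError("gap too large for an 8-octet size field")


if __name__ == "__main__":
    print(make_void(2).hex())      # smallest Void: ID plus a zero size
    print(len(make_void(129)))     # needs a 2-octet size field
    try:
        make_void(1)
    except ValueError as err:
        print("gap of 1:", err)
```

A gap of exactly 1 is the only size that cannot be covered this way,
which is why padding inside the edited element is the usual way around it.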

> [...]
>
> Dave Rice
>
>