[Matroska-devel] EBML Schema

wm4 nfxjfg at googlemail.com
Fri Oct 2 12:29:16 CEST 2015

On Thu, 1 Oct 2015 07:10:04 -0400
Dave Rice <dave at dericed.com> wrote:

> > On Oct 1, 2015, at 5:50 AM, wm4 <nfxjfg at googlemail.com> wrote:
> > 
> > On Wed, 30 Sep 2015 22:46:09 -0400
> > Dave Rice <dave at dericed.com> wrote:
> > 
> >> Hi,
> >>>> On Aug 28, 2015, at 10:52 AM, Dave Rice <dave at dericed.com> wrote:
> >>>> On Aug 28, 2015, at 2:50 AM, Moritz Bunkus <moritz at bunkus.org> wrote:
> >>>> 
> >>>> Hey,
> >>>> 
> >>>> I have no objections, however I don't know a lot about XML schemas in
> >>>> the first place (neither about DTDs, to be honest).
> >>> 
> >>> Honestly, I know a lot more about XML Schemas than I do about DTDs. As Wikipedia mentions at https://en.wikipedia.org/wiki/Document_type_definition, DTDs have largely been superseded by XML Schemas. At this point I think that XML Schemas may be a more familiar analogy to use.
> >>> 
> >>> I think XML Schemas also have more in common with specdata.xml than DTDs do. Schemas use the <element> node and have maxOccurs and minOccurs attributes (specdata has semantically the same thing with mandatory and multiple). Both have a similar declaration of element type, element name and element description. Actually, I think a semantically equivalent version of specdata.xml could be written as an XML Schema.
> >>> 
> >>> XML Schemas also offer a few advantages for machine-readable expressions; for instance, XML Schemas can mandate a particular pattern or regex for a value.
> >>> 
> >>>>> I propose the specdata.xml file here
> >>>>> https://github.com/Matroska-Org/foundation-source/blob/master/spectool/specdata.xml
> >>>>> is a good basis for the consideration of an EBML Schema. From what I
> >>>>> can see, specdata.xml is an expression of the EBML + Matroska
> >>>>> specifications to support automated creation of documentation, but the
> >>>>> structure of this already shares a lot of similarity to XML Schemas.
> >>>> 
> >>>> For both documentation (e.g. the table on the matroska.org specs page is
> >>>> generated from this file) and code (libMatroska's class hierarchy is
> >>>> generated automatically from this file) actually.
> >>> 
> >>> Does specdata.xml play a role in mkvalidator? I'm thinking of the potential to have an ebmlvalidator where you can provide the EBML Schema to validate a particular EBML docType.
> >>> 
> >>>>> Is there a preference in handling the standardization of Matroska:
> >>>>> documenting it in a similar fashion to our work in the EBML spec or to
> >>>>> define what an EBML Schema is and consider matroska an expression of
> >>>>> it?
> >>>> 
> >>>> I'm not sure whether or not I understand the implications. But my gut
> >>>> feeling is that having a definition for an EBML Schema would benefit
> >>>> other formats than Matroska, too, therefore the latter seems the way to
> >>>> go.
> >>> 
> >>> I have the same feeling:
> >>> - document EBML as a specification that includes rules for defining a docType in the form of an EBML Schema
> >>> - write an EBML Schema (updated specdata.xml) for Matroska and maybe WebM
> >>> 
> >>>>> Are some changes to specdata.xml acceptable? Such as a filename change
> >>>>> or changing the name of the <table> element or of some attributes?
> >>>> 
> >>>> Well, like I said above the specdata.xml is used for generating both
> >>>> documentation and code. Both should stay viable. If changes to it are
> >>>> made then the accompanying tools must be updated as well.
> >>>> 
> >>>>> Neither the current EBML specs nor the specdata.xml specifically refer
> >>>>> to the hierarchical arrangement of the elements, but this could be
> >>>>> presumed by their ordering. For instance, could any level 3 element be
> >>>>> a child of any level 2 Master-element? I presume not, but I don't
> >>>>> think it's clear anywhere what parent-child relationships are
> >>>>> feasible. Possibly specdata.xml and/or the EBML Schema Definition
> >>>>> could define the relationship between levels of related elements
> >>>>> similar to how an XML Schema (XSD) does.
> >>>> 
> >>>> So far it is understood that an element not marked as a global element
> >>>> must only occur as a child of its parent. Its parent is the last element
> >>>> located before the child element in the specdata file with a lower level
> >>>> than the child element. Or something like that.
> >>> 
> >>> This will need some documentation. That's how I've understood the mkv spec as well but the definition for how an EBML Schema works should be explicit about this.
> >> 
> >> I created a first draft of Matroska's specdata.xml with nested elements here: https://gist.github.com/dericed/f0a4bb0e7dc635ed1347. The content of the XML is the same, but the definition is moved from element to element/documentation, and elements are nested within elements according to their level and allowed location. I think a nested structure in an EBML Schema would make the location clearer than the current rule, which is that an element is a child of the closest preceding element with a lower element level value. With nesting, the element is simply a child of its parent element, and the level attribute becomes redundant to the element structure.
> >> 
> >> Another advantage of this structure is that it allows the EBML Schema to be better adapted to foreign-language descriptions. Just as in XML Schema, one could have multiple <documentation> nodes per <element> with different language attributes.
> >> 
> >> I'd also like to propose changing the EBML Schema attributes mandatory and multiple to their familiar XML Schema counterparts: minOccurs and maxOccurs. Here all mandatory="1" would become minOccurs="1" and multiple="1" would become maxOccurs="unbounded".
> >> 
> >> Another idea is that the next version of EBML could add a schemaLocation element holding a URL to the EBML Schema. A Matroska file could then have an EBML header schemaLocation of https://github.com/Matroska-Org/foundation-source/blob/master/spectool/specdata.xml so that validators could pull the appropriate schema for validation.
> >> 
> >> Comments?
> >> Dave Rice
> > 
> > Maybe I'm way too late for this, but: does it really have to be XML?
> > It's neither readable, nor inviting to add lots of details to the
> > documentation elements.
> The basis of the Matroska spec is currently in XML, see https://github.com/Matroska-Org/foundation-source/blob/master/spectool/specdata.xml. The XML is then used to create the human-readable documentation. I think the XML is also used programmatically (I think in mkvalidator). So an XML document that defines an EBML document is not a new idea, but I would like to standardize how an EBML Schema should be expressed. I think that following the analogy of XML Schema makes sense.

In my opinion, the existing spec is so vague exactly because nobody
knew how to edit the spec, or because the spec was hard to edit, or
because this XML file simply doesn't look very inviting to edit. It
feels a bit crammed in there, and it's hard to do good text
formatting. Most real specs are not edited in ad-hoc XML formats.

Having an XML file defining Matroska elements might be useful, but I
don't understand why it should be the definitive document.
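
For reference, the level rule from the quoted discussion (an element's
parent is the last preceding element with a lower level) can be made
concrete with a small Python sketch. This is only an illustration, not
code from spectool or libMatroska. It assumes a flat list of
(name, level, mandatory, multiple) tuples in file order, ignores global
elements entirely, and applies the proposed mandatory/multiple to
minOccurs/maxOccurs mapping. The fallback values 0 and "1" follow XML
Schema conventions and are an assumption, as are the Element class, the
nest() function and the flags in SAMPLE.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    """Hypothetical in-memory form of one schema element."""
    name: str
    level: int
    min_occurs: int
    max_occurs: str
    children: List["Element"] = field(default_factory=list)


def nest(flat: List[Tuple[str, int, bool, bool]]) -> List[Element]:
    """Turn a flat, level-annotated list into a nested tree.

    The parent of an element is the last preceding element with a
    lower level. Global elements are not handled here.
    """
    roots: List[Element] = []
    stack: List[Element] = []  # currently open ancestors, levels increasing
    for name, level, mandatory, multiple in flat:
        elem = Element(
            name=name,
            level=level,
            min_occurs=1 if mandatory else 0,             # mandatory="1" -> minOccurs="1"
            max_occurs="unbounded" if multiple else "1",  # multiple="1" -> maxOccurs="unbounded"
        )
        # Drop every open element that cannot be a parent of this one.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(elem)
        else:
            roots.append(elem)
        stack.append(elem)
    return roots


def dump(elements: List[Element], indent: int = 0) -> None:
    """Print the tree, one element per line, indented by depth."""
    for e in elements:
        print(" " * indent + f"{e.name} minOccurs={e.min_occurs} maxOccurs={e.max_occurs}")
        dump(e.children, indent + 2)


# Illustrative input only; the flags are not copied from specdata.xml.
SAMPLE = [
    ("Segment", 0, True, True),
    ("Info", 1, True, True),
    ("TimecodeScale", 2, True, False),
    ("Tracks", 1, False, True),
    ("TrackEntry", 2, True, True),
    ("TrackNumber", 3, True, False),
]

if __name__ == "__main__":
    dump(nest(SAMPLE))
```

Once the tree is built, the level field carries no extra information,
which is the redundancy mentioned in the quoted proposal.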
