[Matroska-devel] using ISO 639-3 language codes in Matroska

Moritz Bunkus moritz at bunkus.org
Tue Jan 12 14:28:03 CET 2016


Hey,

the Matroska container currently uses the bibliographic versions of
ISO 639-2 codes for marking anything that requires language
information. Back when Matroska was specified, ISO 639-2 was the latest
standard available and therefore a good choice.

However, 639-2 is incomplete and has been largely superseded by
639-3[1] which covers pretty much each and every language out there;
citing Wikipedia:

"It provides an enumeration of languages as complete as possible,
including living and extinct, ancient and constructed, major and
minor, written and unwritten."

Over the course of the last couple of years users have often asked me
to extend MKVToolNix to use 639-3 codes instead of 639-2 ones. One
example of a rather common question I get is why people cannot use
e.g. Mandarin as a language; I even have a FAQ entry for that[2]. I've
always told those people that Matroska itself doesn't support
that.

Right now, with work being done to extend Matroska for standardization,
may be the best time to introduce 639-3 to Matroska.

Problem is I don't know the best way to do this. I see three possible
avenues each with their own sets of pros and cons, and I'd like some
feedback in order to turn this into a proper proposal:

1. Change the specs so that all language elements use 639-3 codes

2. Introduce new elements on the same level as the existing language
   elements that determine the standard the corresponding language
   element uses, defaulting to 639-2 if missing

3. Introduce new elements on the same level as the existing language
   elements that contain a 639-3 code

Here are the details:

1. Change the specs so that all language elements use 639-3 codes

Pros: no new elements required. As most of 639-2 is included in 639-3
this should work mostly OK for existing applications; adding support
should be easy for both players and muxers

Cons: 639-3 is not a superset of 639-2 as far as I know (Wikipedia
agrees), though only corner cases should be affected. Matroska uses
the bibliographic versions of the 639-2 codes while 639-3's codes are
derived from the terminologic ones of 639-2 (example: German is
currently "ger" in Matroska, 639-3 uses "deu"), which could confuse
existing players
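
To illustrate the "ger" vs. "deu" problem, here is a minimal sketch of
the kind of translation a reader or muxer would need under this option.
The table is deliberately partial (only a handful of the languages that
have two different 639-2 codes), and the helper name to_terminologic is
made up for this example; it is not part of any existing tool.

```python
# Partial, illustrative table: ISO 639-2 bibliographic -> terminologic code.
# Only a few of the languages that have two distinct 639-2 codes are listed.
B_TO_T = {
    "ger": "deu",  # German
    "fre": "fra",  # French
    "dut": "nld",  # Dutch
    "chi": "zho",  # Chinese
    "cze": "ces",  # Czech
    "gre": "ell",  # Greek, Modern
}


def to_terminologic(code):
    """Return the terminologic form of a 639-2 code (hypothetical helper).

    Codes that are not in the table are returned unchanged (lowercased),
    which is correct for languages with only one 639-2 code.
    """
    code = code.lower()
    return B_TO_T.get(code, code)
```

An old file carrying "ger" would then be treated like a new one
carrying "deu", while codes such as "eng" pass through untouched.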

2. Introduce new elements on the same level as the existing language
   elements that determine the standard the corresponding language
   element uses, defaulting to 639-2 if missing

For example for the TrackLanguage element we'd introduce
TrackLanguageStandard, unsigned integer, default value 0; 0 meaning
ISO 639-2 bibliographic, 1 meaning 639-3.

A conforming player would have to look for such a
TrackLanguageStandard element and interpret TrackLanguage as ISO 639-2
bibliographic if TrackLanguageStandard is missing or if it is set to 0
and as 639-3 if it's present and set to 1.
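
The reader logic described above could look roughly like this. It is
only a sketch: the track is assumed to be a plain dict of already
parsed elements, the function name is invented, and raising an error
for unknown values of TrackLanguageStandard is my assumption, as the
proposal doesn't say what should happen in that case.

```python
# Values proposed for TrackLanguageStandard: 0 = 639-2 bibliographic, 1 = 639-3.
LANGUAGE_STANDARDS = {0: "ISO 639-2/B", 1: "ISO 639-3"}


def interpret_track_language(track):
    """Return (standard, code) for a track given as a dict of elements."""
    code = track["TrackLanguage"]
    # A missing TrackLanguageStandard means the default value 0.
    standard = track.get("TrackLanguageStandard", 0)
    if standard not in LANGUAGE_STANDARDS:
        # Assumption: reject values this reader does not know about.
        raise ValueError("unknown TrackLanguageStandard: %r" % (standard,))
    return LANGUAGE_STANDARDS[standard], code
```

A file written by an existing muxer has no TrackLanguageStandard and is
therefore read as 639-2 bibliographic, exactly as today.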

Pros: clear distinction which standard was used; extensible for future
changes

Cons: introduces three new elements; for non-conforming players the
same cons as for 1. apply

3. Introduce new elements on the same level as the existing language
   elements that contain a 639-3 code

For example for the TrackLanguage element we'd introduce
TrackLanguageIso639_3, ASCII string, no default value. Restrictions:
if present a TrackLanguage element SHOULD be written, too, that
corresponds to the language in TrackLanguageIso639_3 for backwards
compatibility (possible exception: if the producing system knows that
only conforming players will ever read such a file).

A conforming player would have to look for a TrackLanguageIso639_3
element. If it's present then this element is used. Otherwise the
player looks for TrackLanguage just as it always has.
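
As a rough sketch of that lookup order, again assuming the track is a
plain dict of already parsed elements and using an invented function
name:

```python
def effective_language(track):
    """Return (standard, code), preferring the new 639-3 element."""
    code_639_3 = track.get("TrackLanguageIso639_3")
    if code_639_3 is not None:
        return ("ISO 639-3", code_639_3)
    # Fall back to the existing element. None here means the element is
    # absent and the player applies whatever default it already uses.
    return ("ISO 639-2/B", track.get("TrackLanguage"))
```

An older player simply never looks at TrackLanguageIso639_3 and keeps
reading TrackLanguage.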

Pros: no confusion over the domain of the information in existing
language elements; should work best with non-conforming/older readers

Cons: not extensible for new standards; introduces three new elements;
complex mapping requirements for writing both elements
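
The mapping requirement on the writing side could be sketched like
this. The table is a tiny hand-picked sample, mapping Mandarin ("cmn")
to "chi" via its macrolanguage is just one possible choice, and falling
back to "und" (undetermined) when no 639-2 code is known is my
assumption; none of this is settled by the proposal.

```python
# Tiny, illustrative sample: ISO 639-3 code -> ISO 639-2 bibliographic code.
ISO639_3_TO_2B = {
    "deu": "ger",  # German
    "fra": "fre",  # French
    "eng": "eng",  # English
    "cmn": "chi",  # Mandarin, mapped to its macrolanguage Chinese
}


def language_elements(code_639_3):
    """Return both language elements a muxer would write for a track."""
    # Assumption: use "und" when there is no known 639-2 equivalent.
    legacy = ISO639_3_TO_2B.get(code_639_3, "und")
    return {
        "TrackLanguageIso639_3": code_639_3,
        "TrackLanguage": legacy,
    }
```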

------------------------------------------------------------

Those are my thoughts. Ideas? Is this even worth it or should we
just stick with 639-2? Any other ways to add support? Preferred
solutions?

Thanks.

Kind regards,
mosu

[1] https://en.wikipedia.org/wiki/ISO_639-3
[2] https://github.com/mbunkus/mkvtoolnix/wiki/Chinese-not-selectable-as-language