[Matroska-devel] [Cellar] using ISO 639-3 language codes in Matroska

Jerome Martinez jerome at mediaarea.net
Tue Jan 12 15:19:51 CET 2016


On 12/01/2016 at 14:28, Moritz Bunkus wrote:
> Hey,
>
> the Matroska container currently uses the bibliographic versions of
> ISO 639-2 codes for marking anything that requires language
> information. Back when Matroska was specified ISO 639-2 was the latest
> standard available and therefore a good choice.
>
> However, 639-2 is incomplete and has been largely superseded by
> 639-3[1] which covers pretty much each and every language out there;
> citing Wikipedia:
>
> "It provides an enumeration of languages as complete as possible,
> including living and extinct, ancient and constructed, major and
> minor, written and unwritten."
>
> Over the course of the last couple of years users have often asked me
> to extend MKVToolNix to use 639-3 codes instead of 639-2 ones. One
> example of a rather common question I get is why people cannot use
> e.g. Mandarin as a language; I even have a FAQ entry for that[2]. I've
> always told those people that Matroska itself doesn't support
> that.

With 639-3, you'll get a problem with people from Hong Kong:
https://en.wikipedia.org/wiki/Cantonese
The ISO 639-3 line is... empty :(
Note: this is a real use case. I get such issues with my software, and I 
use RFC 5646, so zh-HK for Cantonese aka "Hong Kong, traditional 
characters".
It looks like RFC 5646 takes care of the Chinese language issue.

Actually, I don't understand the issue with Mandarin: I understood that 
it is widely accepted to use zh-CN for »Chinese (Simplified)« and zh-TW for 
»Chinese (Traditional)«, and the current spec says:
"(...) followed by a dash and a country code for specialities in languages"
so it looks like "zho-cn" and "zho-tw" are acceptable with the current spec.
What is the issue with such strings and the current Matroska specs?
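
To make the question concrete, here is a rough Python sketch of how a
reader could split such a string. The helper name split_matroska_language
is hypothetical, and the accepted shape (three-letter code, optional dash
plus two-letter country code) is only my reading of the sentence quoted
above, not something the spec spells out in this form.

```python
import re

# Assumed shape: three letters, optionally "-" plus two letters (e.g. "zho-cn").
_LANG_RE = re.compile(r"^([a-z]{3})(?:-([a-z]{2}))?$")


def split_matroska_language(value):
    """Return (language, country) or None if the string does not match.

    Hypothetical helper: country is None when no "-xx" suffix is present.
    """
    match = _LANG_RE.match(value.strip().lower())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_matroska_language("zho-cn"))
print(split_matroska_language("zho-tw"))
print(split_matroska_language("chi"))
```

A two-letter tag such as "zh-HK" is rejected by this sketch, since it is
not an ISO 639-2 code followed by a country code.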

>
> Right now with work being done to extend Matroska for standardization
> may be the best time to introduce 639-3 to Matroska.

As we are moving to IETF, maybe RFC 5646 is another possibility.
I have no strong opinion about ISO 639-2 / ISO 639-3 / RFC 5646; I just 
want to point out that you may solve one issue but maybe not all of them.

>
> Problem is I don't know the best way to do this. I see three possible
> avenues each with their own sets of pros and cons, and I'd like some
> feedback in order to turn this into a proper proposal:
>
> 1. Change the specs so that all language elements use 639-3 codes
>
> 2. Introduce new elements on the same level as the existing language
>     elements that determine the standard the corresponding language
>     element uses defaulting to 639-2 if missing
>
> 3. Introduce new elements on the same level as the existing language
>     elements that contain a 639-3 code
>
> Here are the details:
>
> 1. Change the specs so that all language elements use 639-3 codes
>
> Pros: no new elements required. As most of 639-2 is included in 639-3
> this should work mostly OK for existing applications; adding support
> should be easy for both players and muxers
>
> Cons: 639-3 is not a superset of 639-2 as far as I know (Wikipedia
> agrees), though only corner cases should be affected. Matroska uses
> bibliographic versions of the 639-2 codes while 639-3's codes are
> derived from the terminology ones of 639-2 (example: German is
> currently "ger" in Matroska, 639-3 uses "deu") potentially confusing
> players

"ger" (bibliographic code, ISO 639-2/B) and "deu" (terminological code, 
ISO 639-2/T) are synonyms in ISO 639-2, so current players are expected 
to support both (the current spec does not say that there is a restriction 
to B or T).
Cons: does not resolve all issues (e.g. Cantonese)
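
A player that wants to treat B and T codes as synonyms could normalize
both to one form before comparing. The Python sketch below shows one way
to do it. The function names are hypothetical, and the table is written
from memory of the ISO 639-2 code list, so it should be checked against
the official list before being relied on.

```python
# ISO 639-2 languages with distinct bibliographic (B) and terminological (T)
# codes. Assumption: listed from memory, verify against the official table.
B_TO_T = {
    "alb": "sqi", "arm": "hye", "baq": "eus", "bur": "mya", "chi": "zho",
    "cze": "ces", "dut": "nld", "fre": "fra", "geo": "kat", "ger": "deu",
    "gre": "ell", "ice": "isl", "mac": "mkd", "mao": "mri", "may": "msa",
    "per": "fas", "rum": "ron", "slo": "slk", "tib": "bod", "wel": "cym",
}


def normalize_to_terminological(code):
    """Map a B code to its T code; any other code is returned lowercased."""
    code = code.lower()
    return B_TO_T.get(code, code)


def same_language(a, b):
    """True when two ISO 639-2 codes name the same language."""
    return normalize_to_terminological(a) == normalize_to_terminological(b)


print(same_language("ger", "deu"))
print(normalize_to_terminological("eng"))
```

Normalizing to the T form has the side effect that the result is also
the 639-3 code for these languages, as far as I know.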

>
> 2. Introduce new elements on the same level as the existing language
>     elements that determine the standard the corresponding language
>     element uses defaulting to 639-2 if missing
>
> For example for the TrackLanguage element we'd introduce
> TrackLanguageStandard, unsigned integer, default value 0; 0 meaning
> ISO 639-2 bibliographic, 1 meaning 639-3.
>
> A conforming player would have to look for such a
> TrackLanguageStandard element and interpret TrackLanguage as ISO 639-2
> bibliographic if TrackLanguageStandard is missing or if it is set to 0
> and as 639-3 if it's present and set to 1.
>
> Pros: clear distinction which standard was used; extensible for future
> changes
>
> Cons: introduces three new elements; for non-conforming players the
> same cons as for 1. apply
>
> 3. Introduce new elements on the same level as the existing language
>     elements that contain a 639-3 code
>
> For example for the TrackLanguage element we'd introduce
> TrackLanguageIso639_3, ASCII string, no default value. Restrictions:
> if present a TrackLanguage element SHOULD be written, too, that
> corresponds to the language in TrackLanguageIso639_3 for backwards
> compatibility (possible exception: if the producing system knows that
> only conforming players will ever read such a file).
>
> A conforming player would have to look for a TrackLanguageIso639_3
> element. If it's present then this element is used. Otherwise the
> player looks for TrackLanguage just as it always has.
>
> Pros: no confusion over the domain of the information in existing
> language elements; should work best with non-conforming/older readers
>
> Cons: not extensible for new standards; introduces three new elements;
> complex mapping requirements for writing both elements

Three new elements? I understand there is only one new element in that 
case (TrackLanguageIso639_3).
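
For reference, the reader-side rule of option 3 could look like the
following Python sketch. It assumes a track is available as a plain dict
of element values, which is a simplification for illustration;
effective_language is a hypothetical name, and the "eng" fallback is my
assumption about the TrackLanguage default.

```python
def effective_language(track):
    """Pick the language of a track given as a dict of element values.

    Returns a (standard, code) pair. Hypothetical helper for illustration.
    """
    # Option 3: the new element wins when it is present and non-empty.
    iso639_3 = track.get("TrackLanguageIso639_3")
    if iso639_3:
        return ("iso639-3", iso639_3)
    # Otherwise behave as before. The "eng" default is an assumption here.
    return ("iso639-2", track.get("TrackLanguage", "eng"))


# A muxer following the SHOULD rule writes both elements.
print(effective_language({"TrackLanguage": "chi", "TrackLanguageIso639_3": "cmn"}))
# An older file only carries TrackLanguage.
print(effective_language({"TrackLanguage": "ger"}))
```

An old player simply ignores the unknown element and keeps reading
TrackLanguage, which is why this option is the least risky for existing
readers.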

> ------------------------------------------------------------
>
> Those are my thoughts. Ideas? Is this even worth it or should we
> just stick with 639-2? Any other ways to add support? Preferred
> solutions?

3b. Introduce new elements on the same level as the existing language 
elements that provide the RFC 5646 language tag.

Pro: more IETF style (reusing an RFC), includes ISO 639-3 (if I 
understand the RFC correctly).

Cons: more complex
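
To give an idea of the extra complexity, here is a Python sketch that
handles only the simplest shape of an RFC 5646 tag:
language[-Script][-REGION]. Extlang, variant, extension and private-use
subtags are deliberately left out, and nothing is checked against the
IANA registry, so this is far from a full implementation. The name
parse_simple_tag is hypothetical.

```python
import re

# Simplified subset of RFC 5646: language[-Script][-REGION].
# Extlang, variants, extensions and private use are NOT handled.
_TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_simple_tag(tag):
    """Split a simple language tag into subtags, or return None."""
    match = _TAG_RE.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # Conventional casing: language lower, Script title case, REGION upper.
    parts["language"] = parts["language"].lower()
    if parts["script"]:
        parts["script"] = parts["script"].title()
    if parts["region"]:
        parts["region"] = parts["region"].upper()
    return parts


print(parse_simple_tag("zh-HK"))
print(parse_simple_tag("zh-hant-tw"))
```

Even this reduced form needs three kinds of subtags, compared with a
single three-letter code for options 1 to 3.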

------------------------------------------------------------


My preference is:
3b
3 (not IETF, remaining issues)
1 (minor compatibility break, remaining issues)
2 (could be a major break for old players if we decide to switch to a 
brand-new standard totally incompatible with ISO 639-2, remaining issues)

Jérôme

