[Matroska-devel] mmg.exe 2 Bugs Related to Charset

Liisachan Liisachan at faireal.net
Wed Dec 29 04:14:37 CET 2004


Testing MMG 1.0.1, I found that [Copy to Clipboard]
doesn't work in some situations, for two reasons:

(1) MMG produces this:

"mkvmerge" -o "C:\FILENAME.mkv" --command-line-charset UTF-8

This won't work. Because MMG is handling FILENAME as UTF-8, 
--command-line-charset must be declared BEFORE -o.
That is,

"mkvmerge" --command-line-charset UTF-8 -o "C:\FILENAME.mkv"
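
A minimal sketch of the ordering fix, in Python. The helper name 
build_mkvmerge_args is hypothetical (it is not MMG's actual code), 
and only the two options quoted above are used:

```python
def build_mkvmerge_args(output_path, charset="UTF-8"):
    """Return an argument list with the charset option placed first.

    Hypothetical helper for illustration; not taken from MMG's source.
    """
    args = ["mkvmerge"]
    # Declare the charset before any argument that has to be read in it.
    args += ["--command-line-charset", charset]
    args += ["-o", output_path]
    return args


args = build_mkvmerge_args(r"C:\FILENAME.mkv")
print(" ".join(args))
```

The point is only the order: the charset declaration is emitted 
before the first argument that may contain non-ASCII text.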



(2) Even if you fix that, you still cannot make a valid .bat 
file via the clipboard, because handling UTF-8 as CF_TEXT is 
sometimes lossy, especially for MBCS.

# UTF-8 is "ASCII-transparent", but not MBCS-transparent.

Example:
1. Let's take the first five letters of the Japanese 'alphabet' (Hiragana),

[U+3042][U+3044][U+3046][U+3048][U+304A].mkv

2. In UTF-8, this will be:
E3 81 82 E3 81 84 E3 81 86 E3 81 88 E3 81 8A 2E 6D 6B 76
-------- -------- -------- -------- -------- .  m  k  v
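
The byte sequence above can be checked with a few lines of Python 
(a small sketch; the variable names are just for illustration):

```python
# The five Hiragana letters from step 1, followed by ".mkv".
name = "\u3042\u3044\u3046\u3048\u304A.mkv"

# Encode to UTF-8 and print the bytes as upper-case hex pairs.
utf8_bytes = name.encode("utf-8")
hex_dump = " ".join(f"{b:02X}" for b in utf8_bytes)
print(hex_dump)
```

Each Hiragana letter takes three bytes in UTF-8, while ".mkv" stays 
one byte per character.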

3. When copied to the clipboard as CF_TEXT,
(3-1) If the locale is US_ASCII, the bytes might be interpreted 
simply as
E3 81 82 E3...
which should be OK.

(3-2) However, if the locale is a DBCS one, such as SHIFT_JIS,
CF_TEXT must be valid as SHIFT_JIS.
The problem is that you didn't take this limitation into account.

This means the bytes get read as SHIFT_JIS pairs:

E3 81 82 E3 81 84 E3 81 86 E3 81 88 E3 81 8A 2E
----- ----- ----- ----- ===== ----- ----- =====

[E3 81][82 E3][81 84][E3 81] are valid code points in SHIFT_JIS, 
so they are basically OK. Then the SHIFT_JIS parser gets [86 E3], 
which is an illegal code point in SHIFT_JIS and will be 
converted to the DefaultChar (probably the one defined in the 
CPINFO structure), which in this case is [81 45].
Next, [81 88] and [E3 81] are OK, but [8A 2E] is again illegal 
(2E cannot be a trail byte) and is converted to [81 45].
As a result, what you paste from the clipboard is very lossy.

E3 81 82 E3 81 84 E3 81 81 45 81 88 E3 81 81 45
----- ----- ----- ----- xxxxx ----- ----- xxxxx

There is no way to correctly convert this back into UTF-8. So 
the bottom line is that MMG's [Copy to Clipboard] doesn't work.
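
The walk-through above can be modelled in Python. This is only a 
sketch under assumptions: it uses Python's "cp932" codec as a stand-in 
for the Windows SHIFT_JIS code page, the function names are made up, 
and it assumes an invalid two-byte unit is replaced as a whole by 
[81 45], as described above. The real Windows conversion may behave 
differently in detail.

```python
DEFAULT_CHAR = b"\x81\x45"  # replacement bytes used in the example above


def is_lead_byte(b):
    # Lead-byte ranges of code page 932 (SHIFT_JIS as used on Windows).
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def mangle_as_cp932(data):
    """Reread bytes as CP932 and replace every invalid pair.

    Assumption: a bad two-byte unit is replaced as a whole.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]) and i + 1 < len(data):
            pair = data[i:i + 2]
            try:
                pair.decode("cp932")  # raises if the pair is no character
                out += pair
            except UnicodeDecodeError:
                out += DEFAULT_CHAR
            i += 2
        else:
            out.append(data[i])  # single-byte character, copied as-is
            i += 1
    return bytes(out)


original = "\u3042\u3044\u3046\u3048\u304A.mkv".encode("utf-8")
mangled = mangle_as_cp932(original)
mangled_hex = " ".join(f"{b:02X}" for b in mangled)
print(mangled_hex)

# Check whether the mangled bytes are still valid UTF-8.
try:
    mangled.decode("utf-8")
    still_utf8 = True
except UnicodeDecodeError:
    still_utf8 = False
print(still_utf8)
```

Two of the eight pairs are replaced, so the original bytes [86 E3] 
and [8A 2E] are gone and the result is no longer valid UTF-8.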

For the same reason (the DOS box is lossy for 
WCHAR to MBCS), even a .bat file made manually with 
correct UTF-8 won't work.
It may work in lucky situations, but generally it doesn't.

(3) Solution...
A. First, as in older versions, MMG should let the user 
customize/disable "--command-line-charset".
When --command-line-charset is unchecked, the default charset 
defined by the user's locale is used.

B. More fundamentally, MMG can stop relying on stdin (DOS box)
and aggressively tell Windows to give it everything in 
Unicode from the beginning, by calling
LPWSTR GetCommandLineW( VOID )
But this has a downside... again, Win98 users will get angry. So,

C. Something like this might be the best:

mkvmerge -j job.utf8

job.utf8 is a text file in UTF-8. It says, for instance,
-o [U+3042][U+3044][U+3046][U+3048][U+304A].mkv

mkvmerge can simply open the file, read it, and parse the text 
as UTF-8, so nothing can mess with it.
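
A rough sketch of how reading such a job file could look, in Python. 
Both the "-j" option and the file format are only the proposal above, 
not an existing mkvmerge feature. read_job_file is a hypothetical name, 
and splitting the text shell-style with shlex is an assumption.

```python
import os
import shlex
import tempfile


def read_job_file(path):
    """Read a UTF-8 job file and return its contents as an argument list.

    Hypothetical format: whitespace-separated arguments with shell-style
    quoting. Note that shlex treats backslashes as escapes, so Windows
    paths would need quoting or a different parser.
    """
    with open(path, encoding="utf-8") as f:
        return shlex.split(f.read())


# Write a sample job file, then read it back.
job_text = "-o \u3042\u3044\u3046\u3048\u304A.mkv\n"
with tempfile.TemporaryDirectory() as tmp:
    job_path = os.path.join(tmp, "job.utf8")
    with open(job_path, "w", encoding="utf-8") as f:
        f.write(job_text)
    job_args = read_job_file(job_path)

print(job_args)
```

Since the file is always opened as UTF-8, the locale's code page 
never touches the file name.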

Liisachan




