Wireshark-dev: Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Wed, 29 Jun 2011 09:44:14 -0700
On Jun 29, 2011, at 2:37 AM, Graham Bloice wrote:

> For reference, here's the test executable output on Win7, using the SDK 7.0 build environment (a cmd.prompt):

Not surprisingly, it doesn't work.

Microsoft introduced Unicode support when they introduced Win32; as they were introducing a new API, they could make the versions of the API that support Unicode take UCS-2 (later UTF-16) strings as arguments.  They also offered "ASCII" versions, which took strings in the local code page as arguments.  This also applies to the C library's routines, such as open()/_open().
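To illustrate the "UCS-2 (later UTF-16)" remark: a UCS-2 string is one 16-bit unit per character, which only covers U+0000 through U+FFFF, while UTF-16 uses a pair of units (a surrogate pair) for anything above that. A minimal sketch of the encoding step follows; utf16_encode() is a hypothetical helper name, not anything from Win32 or Wireshark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid scalar value (a surrogate, or above U+10FFFF).
 * Hypothetical helper, for illustration only. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* Basic Multilingual Plane: identical to UCS-2. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Above the BMP: split the 20 remaining bits into a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

So U+00F8 is the single unit 0x00F8, whereas U+1F600 becomes the two units 0xD83D 0xDE00.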

UN*X systems already had a well-established API when they introduced Unicode support, and they had what amounted to code pages (the various ISO 8859/x encodings, the EUC encodings, assorted other encodings); so, instead of adding a new API, they added a new "code page", with UTF-8 encoding.
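On UN*X, then, whether a char string is UTF-8 depends on the locale's character set. A rough sketch of how a program might check that, assuming a POSIX system with nl_langinfo(); codeset_is_utf8() and user_locale_is_utf8() are hypothetical names made up for this example, and this is not the test program discussed in this thread:

```c
#include <assert.h>
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Return 1 if a codeset name denotes UTF-8 ("UTF-8", "utf8", ...),
 * ignoring case and '-' / '_' separators; 0 otherwise. */
int codeset_is_utf8(const char *codeset)
{
    static const char want[] = "UTF8";
    size_t i = 0;

    assert(codeset != NULL);
    for (; *codeset != '\0'; codeset++) {
        if (*codeset == '-' || *codeset == '_')
            continue;               /* "UTF-8" and "utf8" both match */
        if (want[i] == '\0' ||
            toupper((unsigned char)*codeset) != want[i])
            return 0;
        i++;
    }
    return want[i] == '\0';
}

/* Return 1 if the locale selected by the environment (LANG, LC_*)
 * uses UTF-8 as its character encoding. */
int user_locale_is_utf8(void)
{
    setlocale(LC_ALL, "");          /* adopt the user's locale */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

The name comparison is deliberately loose because the spelling of the codeset name varies between systems.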

The program was written for UN*X, to test whether UTF-8 strings work in the user's locale.  On Windows, the "ASCII" API it was using to create a file takes the string in your local code page, not UTF-8, and I suspect cmd.exe also expects "ASCII" output from programs - such as when the test program was printing Stig's name - to be in the local code page, not UTF-8.
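To make the mismatch concrete, here is a small sketch of what happens when UTF-8 bytes are read as if each byte were one character in a single-byte code page. It assumes ISO 8859-1 purely because its bytes map directly to code points; the code page cmd.exe actually uses is typically a different (OEM) one, so the garbage you see will differ. latin1_to_utf8() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Treat each input byte as one ISO 8859-1 character and re-encode the
 * result as UTF-8.  Returns the number of bytes written (excluding the
 * terminating NUL), or (size_t)-1 if out is too small.
 * Hypothetical helper; ISO 8859-1 is an assumption for illustration. */
size_t latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;

        if (c < 0x80) {
            if (n + 1 >= outsize)
                return (size_t)-1;
            out[n++] = (char)c;
        } else {
            /* U+0080..U+00FF take two bytes in UTF-8. */
            if (n + 2 >= outsize)
                return (size_t)-1;
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    if (n >= outsize)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

Feeding it the UTF-8 encoding of U+00F8 (the bytes 0xC3 0xB8) yields the four bytes 0xC3 0x83 0xC2 0xB8, i.e. two characters where there should be one. The same kind of misreading applies to file names passed to the "ASCII" file routines.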

This is why GLib has file functions that do mapping on file names; the page at

	http://developer.gnome.org/glib/stable/glib-File-Utilities.html

says

	There is a group of functions which wrap the common POSIX functions dealing with filenames (g_open(), g_rename(), g_mkdir(), g_stat(), g_unlink(), g_remove(), g_fopen(), g_freopen()). The point of these wrappers is to make it possible to handle file names with any Unicode characters in them on Windows without having to use ifdefs and the wide character API in the application code.

	The pathname argument should be in the GLib file name encoding. On POSIX this is the actual on-disk encoding which might correspond to the locale settings of the process (or the G_FILENAME_ENCODING environment variable), or not.

	On Windows the GLib file name encoding is UTF-8. Note that the Microsoft C library does not use UTF-8, but has separate APIs for current system code page and wide characters (UTF-16). The GLib wrappers call the wide character API if present (on modern Windows systems), otherwise convert to/from the system code page.

	Another group of functions allows to open and read directories in the GLib file name encoding. These are g_dir_open(), g_dir_read_name(), g_dir_rewind(), g_dir_close().
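The general shape of such a wrapper might look like the following. This is only a sketch of the technique the GLib text describes (convert the UTF-8 name to UTF-16 and call the wide-character routine on Windows, pass it straight through elsewhere); it is not GLib's or Wireshark's actual code, utf8_open() is a hypothetical name, and the errno values chosen for conversion failures are my own assumption.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

#ifdef _WIN32
#include <io.h>
#include <stdlib.h>
#include <windows.h>

/* Open a file whose name is given in UTF-8, via the wide-character API. */
int utf8_open(const char *path_utf8, int flags, int mode)
{
    wchar_t *wpath;
    int n, fd, saved_errno;

    assert(path_utf8 != NULL);
    /* First call: how many UTF-16 units are needed, including the NUL? */
    n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            path_utf8, -1, NULL, 0);
    if (n == 0) {
        errno = EINVAL;             /* not valid UTF-8 (assumed mapping) */
        return -1;
    }
    wpath = malloc((size_t)n * sizeof *wpath);
    if (wpath == NULL) {
        errno = ENOMEM;
        return -1;
    }
    /* Second call: do the actual conversion. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            path_utf8, -1, wpath, n) == 0) {
        free(wpath);
        errno = EINVAL;
        return -1;
    }
    fd = _wopen(wpath, flags, mode);
    saved_errno = errno;            /* keep _wopen()'s errno across free() */
    free(wpath);
    errno = saved_errno;
    return fd;
}
#else
/* On UN*X the name is passed through unchanged; it is up to the caller
 * to supply it in the file name encoding. */
int utf8_open(const char *path_utf8, int flags, int mode)
{
    assert(path_utf8 != NULL);
    return open(path_utf8, flags, mode);
}
#endif
```

The caller then uses one function with one string encoding on every platform, and the ifdefs stay in a single place.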

This is also why we have our own copies of some of those functions on Windows, and wrap them ourselves (so that we don't require GLib 2.6, which introduced them, for all platforms).