
UTF Chaos

Joan Andrés edited this page Dec 6, 2019 · 2 revisions

MSVC, source code and string literals

MSVC offers no fewer than five UTF encoding options for saving source files:

  • Plain UTF-8, the sane default
  • UTF-8 with a BOM (byte order mark; the Unicode standard does not recommend one for UTF-8, and it is a headache in its own right)
  • The rare beast UTF-7 (is it used anywhere other than IMAP?)
  • UTF-16 LE
  • UTF-16 BE

Saving sources in any UTF encoding other than UTF-8 is not cross-platform, so those options are not considered here. That leaves UTF-8 with and without a BOM.

String literals in the code can be "single-byte" (char) or "wide" (wchar_t):

const char*    cs =  "Я";
const wchar_t* ws = L"Я";

The 'Я' character (U+042F) is represented as D0 AF in UTF-8 and as 042F in UTF-16.
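The byte values for U+042F can be checked by hand. The following is a minimal sketch of a UTF-8 encoder; the function name `encode_utf8` is made up for this illustration and is not part of MSVC or the standard library.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode a single Unicode code point as UTF-8.
// Assumption for this sketch: cp is a valid scalar value (surrogates are not rejected).
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: inside the Unicode range
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+042F the two-byte branch applies: 0x042F >> 6 is 0x10, giving the lead byte 0xD0, and 0x042F & 0x3F is 0x2F, giving the continuation byte 0xAF. In UTF-16, a code point below U+10000 (outside the surrogate range) is stored as a single 16-bit unit equal to the code point itself, hence 042F.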

Literal     On disk   In memory (with BOM)   In memory (without BOM)
char*       D0 AF     DF 00                  D0 AF 00
wchar_t*    D0 AF     042F 0000              0420 0407 0000
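In-memory values like those above can be obtained by dumping the code units of a literal, terminator included. The sketch below is one way to do it; `dump_units` is a hypothetical helper name, not something MSVC provides.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <type_traits>

// Format every code unit of a null-terminated string as hex, including the
// terminator. char units are printed as 2 hex digits, wider units as at least 4.
template <typename Char>
std::string dump_units(const Char* s) {
    assert(s != nullptr);
    const int width = sizeof(Char) == 1 ? 2 : 4;
    std::string out;
    for (;; ++s) {
        // Convert through the unsigned type so bytes >= 0x80 are not sign-extended.
        using Unsigned = typename std::make_unsigned<Char>::type;
        const unsigned long value = static_cast<unsigned long>(static_cast<Unsigned>(*s));
        char buf[2 * sizeof(unsigned long) + 1];
        std::snprintf(buf, sizeof buf, "%0*lX", width, value);
        if (!out.empty()) out += ' ';
        out += buf;
        if (*s == 0) break;  // stop after the terminator has been printed
    }
    return out;
}
```

Calling `dump_units("Я")` and `dump_units(L"Я")` from a source file saved first with and then without a BOM is a way to check the table rows against a particular compiler and system code page. Passing escape sequences instead, for example `dump_units("\xD0\xAF")`, gives a result that does not depend on how the source file was saved.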

With a BOM, the char* literal holds the single byte DF, which is the 'Я' character in Windows-1251; that is, MSVC converts the string to the system ANSI encoding (obviously with losses). The wchar_t* literal, however, is fine. Without a BOM it is the other way round: char* holds the intended UTF-8 representation of the string, but wchar_t* holds something unexpected, 0420 0407. That is the UTF-16 representation of the characters РЇ, whose Windows-1251 codes are D0 AF (compare with the original UTF-8 bytes).
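Both conversions can be reproduced by hand. The following sketch models them with a deliberately partial Windows-1251 table: only ASCII, the contiguous range 0xC0..0xFF (U+0410..U+044F) and the byte 0xAF (U+0407) are covered. The function names and the fallback values for unmapped input are assumptions made for this illustration; this is not how MSVC is implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Partial Windows-1251 -> Unicode mapping.
// Assumption: bytes outside the covered subset become U+FFFD in this sketch.
std::uint16_t cp1251_to_unicode(unsigned char b) {
    if (b < 0x80) return b;                                                 // ASCII
    if (b >= 0xC0) return static_cast<std::uint16_t>(0x0410 + (b - 0xC0));  // U+0410..U+044F
    if (b == 0xAF) return 0x0407;                                           // Ї
    return 0xFFFD;
}

// Partial Unicode -> Windows-1251 mapping (the lossy direction).
// Assumption: unmapped characters become '?' in this sketch.
unsigned char unicode_to_cp1251(std::uint16_t cp) {
    if (cp < 0x80) return static_cast<unsigned char>(cp);
    if (cp >= 0x0410 && cp <= 0x044F)
        return static_cast<unsigned char>(0xC0 + (cp - 0x0410));
    if (cp == 0x0407) return 0xAF;
    return '?';
}

// Model of the "without BOM" wide-literal case: every source byte is treated
// as one Windows-1251 character and widened to a UTF-16 code unit.
std::vector<std::uint16_t> widen_as_cp1251(const std::string& bytes) {
    std::vector<std::uint16_t> out;
    for (unsigned char b : bytes) out.push_back(cp1251_to_unicode(b));
    assert(out.size() == bytes.size());  // single-byte encoding: one unit per byte
    return out;
}
```

With these helpers, `unicode_to_cp1251(0x042F)` yields DF (the char* result with a BOM), and `widen_as_cp1251("\xD0\xAF")` yields 0420 0407 (the wchar_t* result without a BOM).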

In other words, whether Unicode string literals are processed correctly depends on how the file was saved, and char* and wchar_t* literals cannot both be used correctly in the same file:

  • When the source is saved as UTF-8 with a BOM, char* literals are corrupted (converted to the "system" ANSI encoding).
  • When the source is saved as UTF-8 without a BOM, wchar_t* literals are corrupted (converted to UTF-16 as if the source were in ANSI).