UTF Chaos

MSVC, source code and string literals

MSVC offers as many as five UTF encoding options for saving source files:

Plain UTF-8, the sane choice
UTF-8 with a BOM (byte order mark; the Unicode standard neither requires nor recommends one for UTF-8, and it is a headache of its own)
UTF-7, a rare beast (is it used anywhere other than IMAP?)
UTF-16 LE
UTF-16 BE

Saving sources in any UTF encoding other than UTF-8 is not portable across platforms, so those options are not considered here. That leaves UTF-8 with and without a BOM.

String literals in code can be "single-byte" (char) or "wide" (wchar_t):

const char*    cs =  "Я";
const wchar_t* ws = L"Я";

The character 'Я' (U+042F) is encoded as D0 AF in UTF-8 and as 042F in UTF-16.
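These byte values can be checked by hand. The following is a minimal sketch of the standard UTF-8 and UTF-16 encoding rules; the function names `encode_utf8` and `encode_utf16` are hypothetical helpers introduced for illustration, not part of MSVC or any library.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values.
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one Unicode scalar value as UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        // Outside the BMP: split into a surrogate pair.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

For U+042F the two-byte branch applies: 0x042F >> 6 is 0x10, giving the lead byte D0, and 0x042F & 0x3F is 0x2F, giving the continuation byte AF. In UTF-16 the code point fits in a single code unit, 042F.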

Literal	On disk	In memory (with BOM)	In memory (without BOM)
char*	D0 AF	DF 00	D0 AF 00
wchar_t*	D0 AF	042F 0000	0420 0407 0000
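The in-memory columns can be reproduced by dumping the code units of each literal. Below is a small sketch of such a dump; `hex_dump` is a hypothetical helper written for this page, and it prints each code unit at its natural width, so wchar_t shows four hex digits on MSVC, where wchar_t is 16 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <type_traits>

// Format `count` code units (terminator included if the caller counts it)
// as space-separated uppercase hex, two digits per byte of CharT.
template <typename CharT>
std::string hex_dump(const CharT* units, std::size_t count) {
    assert(units != nullptr);
    // Go through the unsigned type so that bytes >= 0x80 are not sign-extended.
    using Unsigned = typename std::make_unsigned<CharT>::type;
    const int width = static_cast<int>(sizeof(CharT) * 2);
    std::string out;
    char buf[32];
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned long long value = static_cast<Unsigned>(units[i]);
        std::snprintf(buf, sizeof buf, "%0*llX", width, value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling it as `hex_dump(cs, std::strlen(cs) + 1)` and `hex_dump(ws, std::wcslen(ws) + 1)` (with `<cstring>` and `<cwchar>` included) should print one cell of the table each; which column you get depends on how the source file was saved.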

With a BOM, char* holds the single byte DF, which is 'Я' in Windows-1251: MSVC converts the string to the system ANSI code page (obviously with losses for any character that code page lacks). The wchar_t* literal, however, comes out correct. Without a BOM it is the other way around: char* holds the desired UTF-8 bytes, but wchar_t* holds the puzzling sequence 0420 0407. That is the UTF-16 representation of "РЇ", the two characters whose Windows-1251 codes are D0 and AF (compare with the original UTF-8 bytes).
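The without-BOM result is what you get when the UTF-8 bytes are decoded as if they were Windows-1251. Here is a rough sketch of that misinterpretation; `cp1251_to_utf16` and `decode_as_cp1251` are hypothetical names, and the mapping is deliberately partial (ASCII, the contiguous А..я block, and the single byte 0xAF), which is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <string>

// Partial Windows-1251 -> UTF-16 mapping, enough for this example only.
char16_t cp1251_to_utf16(unsigned char byte) {
    if (byte < 0x80) return byte;          // ASCII maps to itself
    if (byte >= 0xC0) {                    // 0xC0..0xFF are U+0410..U+044F (А..я)
        return static_cast<char16_t>(0x0410 + (byte - 0xC0));
    }
    if (byte == 0xAF) return 0x0407;       // Ї
    return 0xFFFD;                         // not covered by this sketch
}

// Decode a byte string as if every byte were a Windows-1251 character.
std::u16string decode_as_cp1251(const std::string& bytes) {
    std::u16string out;
    for (unsigned char byte : bytes) out += cp1251_to_utf16(byte);
    return out;
}

// Self-check mirroring the table: the UTF-8 bytes D0 AF become 0420 0407,
// while the Windows-1251 byte DF is the real 'Я' (042F).
void check_mojibake() {
    assert(decode_as_cp1251("\xD0\xAF") == u"\u0420\u0407");
    assert(cp1251_to_utf16(0xDF) == 0x042F);
}
```

The same mapping read in the opposite direction explains the with-BOM case: 'Я' (042F) sits at offset 0x1F in the А..я block, so its Windows-1251 byte is 0xC0 + 0x1F = DF.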

That is, correct handling of Unicode string literals depends on how the file was saved, and char* and wchar_t* literals cannot both come out right in the same file:

When the source is saved as UTF-8 with a BOM, char* literals are corrupted (converted to the "system" ANSI code page).
When the source is saved as UTF-8 without a BOM, wchar_t* literals are corrupted (converted to UTF-16 as if the source were in ANSI).
