UTF Chaos
MSVC, source code and string literals
MSVC offers as many as five UTF encoding options for saving source files:
- Plain UTF-8 (without a BOM)
- UTF-8 with a BOM (byte order mark; the Unicode standard neither requires nor recommends it for UTF-8, and it is mostly a headache)
- The rare beast UTF-7 (is it used anywhere other than IMAP?)
- UTF-16 LE
- UTF-16 BE
Saving sources in any UTF encoding other than UTF-8 is not cross-platform, so those options are not considered here. That leaves UTF-8 with and without a BOM.
String literals in code can be "single-byte" (char) or "wide" (wchar_t):
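To tell the two remaining variants apart, it is enough to look at the first three bytes of the file: a UTF-8 BOM is the byte sequence EF BB BF. The following is a minimal sketch; the function names `starts_with_utf8_bom` and `file_has_utf8_bom` are made up for this example.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Returns true if the buffer begins with the UTF-8 BOM bytes EF BB BF.
bool starts_with_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Reads up to the first three bytes of a file and checks them for the BOM.
// Returns false if the file cannot be opened or is shorter than three bytes.
bool file_has_utf8_bom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    char head[3] = {};
    in.read(head, sizeof head);
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    return starts_with_utf8_bom(std::string(head, got));
}
```

Calling `file_has_utf8_bom("main.cpp")` on a source file (the file name is only an illustration) shows which of the two cases discussed below applies to it.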
const char* cs = "Я";
const wchar_t* ws = L"Я";
The character 'Я' (U+042F) is represented as D0 AF in UTF-8 and as 042F in UTF-16.
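These byte values can be checked by hand with a small encoder. The sketch below is illustrative only: `encode_utf8` and `encode_utf16` are hypothetical helper names, and the input is assumed to be a valid Unicode scalar value (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <string>

// Encodes one code point as UTF-8 (1 to 4 bytes).
// Assumption: cp is a valid Unicode scalar value; no validation is done.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes one code point as UTF-16 (one unit, or a surrogate pair).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

For U+042F the two-byte branch applies: `0xC0 | (0x42F >> 6)` is D0 and `0x80 | (0x42F & 0x3F)` is AF, while the UTF-16 form is the single unit 042F.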
Literal | On disk | In memory (with BOM) | In memory (without BOM) |
---|---|---|---|
char* | D0 AF | DF 00 | D0 AF 00 |
wchar_t* | D0 AF | 042F 0000 | 0420 0407 0000 |
With a BOM, the char* literal holds the single byte DF, which is the character 'Я' in Windows-1251; that is, MSVC converts the string to the system ANSI encoding (obviously with losses). The wchar_t* literal, however, is fine. Without a BOM it is the other way around: char* holds the desired UTF-8 representation of the string, but wchar_t* holds something odd, 0420 0407. This is the UTF-16 representation of the characters РЇ, which have the codes D0 AF in Windows-1251 (compare with the original UTF-8).
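Both corruptions can be reproduced by hand, assuming (as in the description above) that the system ANSI code page is Windows-1251. The sketch below uses a deliberately partial Windows-1251 mapping that covers only ASCII, the bytes C0..FF (А..я) and AF (Ї); the function names are hypothetical, and this is not a claim about how MSVC implements the conversion internally.

```cpp
#include <string>

// Partial Windows-1251 decoder: one byte to one UTF-16 code unit.
// Bytes outside the covered ranges become U+FFFD.
char16_t decode_cp1251_byte(unsigned char b) {
    if (b < 0x80) return b;                                        // ASCII
    if (b >= 0xC0) return static_cast<char16_t>(0x0410 + (b - 0xC0)); // А..я
    if (b == 0xAF) return 0x0407;                                  // Ї
    return 0xFFFD;
}

// Inverse mapping: one UTF-16 code unit to one Windows-1251 byte.
// Anything not covered becomes '?', which is where the losses come from.
unsigned char encode_cp1251_unit(char16_t u) {
    if (u < 0x80) return static_cast<unsigned char>(u);
    if (u >= 0x0410 && u <= 0x044F)
        return static_cast<unsigned char>(0xC0 + (u - 0x0410));
    if (u == 0x0407) return 0xAF;
    return '?';
}

// Treats every byte of the input as a Windows-1251 character.
std::u16string widen_as_cp1251(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) out += decode_cp1251_byte(b);
    return out;
}
```

Feeding the UTF-8 bytes D0 AF to `widen_as_cp1251` yields 0420 0407 (the "without BOM" wchar_t* column), and `encode_cp1251_unit(0x042F)` yields DF (the "with BOM" char* column).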
That is, whether Unicode string literals are handled correctly depends on how the file was saved, and char* and wchar_t* literals cannot both be used correctly at the same time:
- When the source is saved as UTF-8 with a BOM, char* literals are corrupted (converted to the "system" ANSI encoding).
- When the source is saved as UTF-8 without a BOM, wchar_t* literals are corrupted (converted to UTF-16 as if the source were in ANSI).
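To see which of the two cases a particular build falls into, the literals can be compared against spellings that do not depend on the file encoding: hex escapes for the narrow string and a universal character name for the wide one. This is only a diagnostic sketch, and the helper names `narrow_is_utf8` and `wide_is_correct` are invented for it.

```cpp
#include <cstring>
#include <cwchar>

// True if a narrow string holds exactly the UTF-8 bytes of 'Я' (D0 AF).
// The reference is written with hex escapes, so it does not depend on
// how the source file itself is saved.
bool narrow_is_utf8(const char* s) {
    return std::strcmp(s, "\xD0\xAF") == 0;
}

// True if a wide string holds exactly the single character U+042F.
// The reference is written as a universal character name for the same reason.
bool wide_is_correct(const wchar_t* s) {
    return std::wcscmp(s, L"\u042F") == 0;
}
```

Going by the table above, `narrow_is_utf8("Я")` would be expected to fail when the file is saved with a BOM, and `wide_is_correct(L"Я")` would be expected to fail when it is saved without one.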