xtensa wchar_t is 16 bits? #829

keith-packard · 2024-10-22T04:09:49Z

I just did a survey of all of the SDK compilers and only xtensa uses a 2-byte wchar_t. This means that applications built on xtensa will not be able to handle the full Unicode range. Looking at the xtensa gcc config, only the embedded build uses this size; other xtensa toolchain options use a 4-byte wchar_t. This seems like an opportunity for errors when porting software between xtensa and other architectures.
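For anyone repeating this survey on another toolchain, a minimal sketch of the check might look like the following. The function names `wchar_bits` and `wchar_covers_unicode` are made up for illustration; only `WCHAR_MAX`, `CHAR_BIT` and `sizeof` come from the standard headers.

```c
#include <limits.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: storage width of wchar_t in bits for this target. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/*
 * Hypothetical helper: nonzero when wchar_t can hold every Unicode code
 * point (the largest is 0x10FFFF), zero for a 2-byte wchar_t.
 */
int wchar_covers_unicode(void)
{
    return (uintmax_t)WCHAR_MAX >= 0x10FFFFu;
}
```

On a target with a 2-byte wchar_t, `wchar_covers_unicode()` should return 0; with a 4-byte wchar_t it should return 1.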

stephanosio · 2024-11-01T03:33:43Z

There is nothing wrong with 2-byte wchar_t as far as standard compliance goes, and 4-byte wchar_t is not exactly embedded-friendly.

I would say one should never use wchar_t in an embedded application because it is very space inefficient, especially with a 4-byte wchar_t. The only reason for it to be supported would be for portability (i.e. for compiling an existing codebase that uses wchar_t) ...

The whole C "wide character" business is a big mess ... even for non-embedded applications, I would go as far as to say it should never be used in general and one should just stick with char/UTF-8 for everything.

keith-packard · 2024-11-01T04:46:54Z

Yeah, I realize 2-byte wchar_t is valid, but wchar_t is supposed to handle any codepoint the system supports in the wide character encoding. That used to be system-dependent, but it's essentially always Unicode now. Two bytes can't hold every Unicode codepoint, so a system with 2-byte wchar_t fails the basic requirements as far as I'm concerned. But, I did get uchar.h working on Xtensa, which was really the point of this rather pointless exercise (the APIs in uchar.h being essentially pointless).

wchar_t, like the newer char32_t, shouldn't be used as a storage format, but you still need a way to do character-by-character analysis of data, so a function which iterates over a utf-8 string extracting one code point at a time into a char32_t local variable would have been really useful here. On systems where wchar_t is 32 bits, you can get that as long as you sign up for the whole locale adventure. That's not enabled on our picolibc builds because of the size penalty you get on so many core C library functions. Maybe someday uchar.h will include simple translations between utf-8 encoded strings and char32_t values. Given the perils of open-coded utf-8 encoding, that would be really useful...
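To make the idea concrete, here is a rough sketch of the kind of locale-free helper described above. `utf8_next` is a hypothetical name, not something uchar.h or picolibc provides, and this is only one way to write it; it rejects overlong forms, surrogates and values above 0x10FFFF.

```c
#include <stddef.h>
#include <uchar.h>

/*
 * Hypothetical helper (not part of uchar.h): decode one code point from
 * the first n bytes of the UTF-8 string s into *out.
 * Returns the number of bytes consumed, or 0 if the input is empty,
 * truncated or malformed.
 */
size_t utf8_next(const char *s, size_t n, char32_t *out)
{
    const unsigned char *p = (const unsigned char *)s;
    char32_t cp, min;
    size_t len, i;

    if (n == 0)
        return 0;

    if (p[0] < 0x80) {                    /* single-byte (ASCII) */
        *out = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {   /* 2-byte sequence */
        len = 2; cp = p[0] & 0x1F; min = 0x80;
    } else if ((p[0] & 0xF0) == 0xE0) {   /* 3-byte sequence */
        len = 3; cp = p[0] & 0x0F; min = 0x800;
    } else if ((p[0] & 0xF8) == 0xF0) {   /* 4-byte sequence */
        len = 4; cp = p[0] & 0x07; min = 0x10000;
    } else {
        return 0;                         /* stray continuation or bad lead byte */
    }

    if (n < len)
        return 0;                         /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return len;
}
```

A caller would loop, advancing its pointer by the return value and stopping on 0. Because the result is a char32_t, this works the same way regardless of how wide wchar_t is on the target.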

keith-packard · 2024-11-01T04:57:39Z

Hrm. Actually, the C spec says:

3.7.3 wide character
value representable by an object of type wchar_t, capable of representing any character in the current locale.

That means xtensa cannot support a Unicode locale (like en_US.UTF-8). So, if we ever enable locale support, then things like mbtowc(&wchar, "🚀", 4) will not work correctly -- you'll get a high surrogate instead of the correct value, 0x1f680. I think we should fix that; we shouldn't restrict the SDK to locales that live entirely in the BMP.
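The arithmetic behind the surrogate remark can be sketched as follows, assuming a 16-bit wide character encoding that behaves like UTF-16. `utf16_encode` is a hypothetical helper written for illustration, not a library function.

```c
#include <stdint.h>

/*
 * Hypothetical helper: encode code point cp as UTF-16 into out[].
 * Returns the number of 16-bit units written: 1 for a BMP code point,
 * 2 (high surrogate, then low surrogate) for anything above 0xFFFF.
 * The caller is assumed to pass a valid code point (<= 0x10FFFF).
 */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For 0x1f680 this gives the pair 0xD83D, 0xDE80. A single 16-bit wchar_t has room for only one of those two units, which is why the call above cannot return the real code point.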

stephanosio added the "area: GCC" and "enhancement" labels on Nov 1, 2024
