xtensa wchar_t is 16 bits? #829

Open
keith-packard opened this issue Oct 22, 2024 · 3 comments
Labels
area: GCC Issues related to GCC (GNU Compiler Collection) enhancement

Comments

@keith-packard
Collaborator

I just did a survey of all of the SDK compilers and only xtensa uses a 2-byte wchar_t. This means that applications built on xtensa will not be able to handle the full Unicode range. Looking at the xtensa gcc config, only the embedded build uses this size; other xtensa toolchain options use a 4-byte wchar_t. This seems like an opportunity for errors when porting software between xtensa and other architectures.

@stephanosio stephanosio added area: GCC Issues related to GCC (GNU Compiler Collection) enhancement labels Nov 1, 2024
@stephanosio
Member

There is nothing wrong with 2-byte wchar_t as far as standard compliance goes, and 4-byte wchar_t is not exactly embedded-friendly.

I would say one should never use wchar_t in embedded applications because it is very space-inefficient, especially with a 4-byte wchar_t. The only reason for it to be supported would be for portability (i.e. for compiling an existing codebase that uses wchar_t) ...

The whole C "wide character" business is a big mess ... even for non-embedded applications, I would go as far as to say it should never be used in general and one should just stick with char/UTF-8 for everything.

@keith-packard
Collaborator Author

Yeah, I realize 2-byte wchar_t is valid, but wchar_t is supposed to handle any codepoint the system supports in the wide character encoding. That used to be system-dependent, but it's essentially always Unicode now. Two bytes aren't enough to hold every Unicode codepoint, so a system with 2-byte wchar_t fails the basic requirements as far as I'm concerned. But, I did get uchar.h working on Xtensa, which was really the point of this rather pointless exercise (the APIs in uchar.h being essentially pointless).

wchar_t, like the newer char32_t, shouldn't be used as a storage format, but you still need a way to do character-by-character analysis of data, so a function which iterates over a UTF-8 string extracting one code point at a time into a char32_t local variable would have been really useful here. On systems where wchar_t is 32 bits, you can get that as long as you sign up for the whole locale adventure. That's not enabled on our picolibc builds because of the size penalty you get on so many core C library functions. Maybe someday uchar.h will include simple translations between UTF-8 encoded strings and char32_t values. Given the perils of open-coded UTF-8 encoding, that would be really useful...
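
To make the wished-for iterator concrete, here is a rough, locale-free sketch of what it might look like. The names `utf8_next` and `utf8_count` are hypothetical; nothing like them exists in uchar.h or picolibc today, and this is an illustration of the idea rather than a proposed API. It rejects overlong forms, surrogates, and values above U+10FFFF, which are the usual traps in open-coded decoders.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: decode one code point from the n bytes at s.
 * Returns bytes consumed (1..4), 0 if n == 0, (size_t)-1 if malformed. */
static size_t utf8_next(const char *s, size_t n, char32_t *out)
{
    const unsigned char *p = (const unsigned char *)s;
    char32_t cp, min;
    size_t len, i;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    /* The lead byte gives the sequence length and the payload bits;
     * min is the smallest value that legitimately needs that length. */
    if (p[0] < 0x80)                { cp = p[0];        len = 1; min = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; min = 0x10000; }
    else
        return (size_t)-1;          /* stray continuation or invalid lead byte */

    if (len > n)
        return (size_t)-1;          /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return (size_t)-1;      /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates, and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return (size_t)-1;

    *out = cp;
    return len;
}

/* Example of character-by-character iteration: count the code points in
 * a NUL-terminated string, or return (size_t)-1 on malformed input. */
static size_t utf8_count(const char *s)
{
    size_t n = strlen(s), count = 0;

    while (n > 0) {
        char32_t cp;
        size_t used = utf8_next(s, n, &cp);
        if (used == (size_t)-1)
            return (size_t)-1;
        s += used;
        n -= used;
        count++;
    }
    return count;
}
```

The code point lives only in a char32_t local, so the result is the same whatever size wchar_t happens to be on the target.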

@keith-packard
Collaborator Author

Hrm. Actually, the C spec says:

3.7.3 wide character
value representable by an object of type wchar_t, capable of representing any character in the current locale.

That means xtensa cannot support a Unicode locale (like en_US.UTF-8). So, if we ever enable locale support, then things like mbtowc(&wchar, "🚀", 4) will not work correctly -- you'll get a high surrogate instead of the correct value, 0x1f680. I think we should fix that; we shouldn't restrict the SDK to locales that live entirely in the BMP.
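
The arithmetic behind that can be sketched as follows. This only shows the standard UTF-16 surrogate split for a code point outside the BMP; it says nothing about what any particular mbtowc implementation actually returns. The helper name `utf16_surrogates` is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a supplementary-plane code point
 * (U+10000..U+10FFFF) into its UTF-16 high and low surrogates. */
static void utf16_surrogates(uint_least32_t cp,
                             uint_least16_t *hi, uint_least16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                                  /* 20 bits remain */
    *hi = (uint_least16_t)(0xD800 + (cp >> 10));    /* top 10 bits */
    *lo = (uint_least16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
}
```

For 0x1f680 this gives 0xD83D and 0xDE80: two 16-bit units, so a single 2-byte wchar_t can hold at most one half of the character.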


2 participants