MultiByte Encoding Support #81

Open
hnakamur opened this issue Sep 17, 2012 · 5 comments

Comments

@hnakamur
Collaborator

Something like a Unix pipe would be best, for example Shift_JIS -> UTF-8 -> to_lower.

We should avoid iconv because of license incompatibility.
Maybe we can use some parts of libnkf, bsdconv, or PHP's mbstring.

This is a very important subject, so let's take time to think it through.

@dvv
Contributor

dvv commented Sep 17, 2012

Could you explain the subject in a bit more detail? TIA

@kristate
Member

@dvv I just changed the title to MultiByte Encoding Support

@hnakamur
Collaborator Author

Strings in Lua may contain any 8-bit value, including embedded zeros, which can be specified as ‘\0’,
according to http://www.lua.org/ftp/refman-5.0.pdf

So a Lua string can hold text in any multibyte character encoding, such as Shift_JIS or EUC-JP, as well as UTF-8.
Therefore we need APIs for converting between encodings in strings or cBuffers.
We'd like those APIs to be

  • able to work on partial input, which might end in the middle of a multibyte character. A later call to the conversion API with the input from the next read should continue the conversion where the previous call stopped. Here I am assuming input bytes are read into a fixed-size cBuffer, say 4 KB.
  • able to chain multiple conversions like Unix pipes, for example Shift_JIS -> UTF-8 followed by UTF-8 -> to_lower.

We would like to define APIs that satisfy these goals, so we need more than a simple API like convert(src, dest_encoding) that returns dest.
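
To make the two goals concrete, here is a rough sketch of what such an API shape could look like. It is written in Python rather than Lua, only because Python's standard `codecs` module already has incremental decoders that keep a half-read multibyte character between calls. The names `DecodeStage`, `LowerStage`, `EncodeStage`, `Pipeline`, and `run` are hypothetical and made up for this illustration; nothing here is an existing or proposed API of this project.

```python
import codecs


class DecodeStage:
    """Bytes in some encoding -> text. Hypothetical name, for illustration only."""

    def __init__(self, encoding):
        # The incremental decoder buffers a trailing partial character
        # until the remaining bytes arrive in a later call.
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def feed(self, chunk, final=False):
        return self._decoder.decode(chunk, final)


class LowerStage:
    """Text -> lowercased text (the to_lower step of the pipe)."""

    def feed(self, chunk, final=False):
        # Simplification: lowercasing chunk by chunk ignores the few
        # context-sensitive case mappings that span a chunk boundary.
        return chunk.lower()


class EncodeStage:
    """Text -> bytes in the destination encoding."""

    def __init__(self, encoding):
        self._encoder = codecs.getincrementalencoder(encoding)()

    def feed(self, chunk, final=False):
        return self._encoder.encode(chunk, final)


class Pipeline:
    """Chains stages like a Unix pipe: each stage's output feeds the next."""

    def __init__(self, *stages):
        self._stages = stages

    def feed(self, chunk, final=False):
        for stage in self._stages:
            chunk = stage.feed(chunk, final)
        return chunk


def run(pipeline, chunks):
    """Push every chunk through the pipeline, then signal end of input."""
    out = [pipeline.feed(chunk) for chunk in chunks]
    out.append(pipeline.feed(b"", final=True))
    return b"".join(out)


if __name__ == "__main__":
    source = "日本語 HELLO".encode("shift_jis")
    # One byte per chunk, so every two-byte character is split across reads.
    chunks = [source[i:i + 1] for i in range(len(source))]
    pipe = Pipeline(DecodeStage("shift_jis"), LowerStage(), EncodeStage("utf-8"))
    print(run(pipe, chunks).decode("utf-8"))
```

The point of the sketch is that each stage carries its own state, so the caller can hand over whatever the last read produced without caring where character boundaries fall. A one-shot convert(src, dest_encoding) has nowhere to keep that state.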

@dvv I hope this explanation is clear enough.

@hnakamur
Collaborator Author

We should decide on rules for which encodings strings and cBuffers may use.

This is just an idea, but my take is to use only UTF-8 for strings and allow any encoding for cBuffers.
We would then add APIs to cBuffers for interoperability with strings, so that we can pass cBuffers to APIs which expect strings.

However, I have not thought it through yet, so I'm not sure this actually works. Maybe allowing any encoding for both strings and cBuffers is a better way.
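
As a rough illustration of the first option (UTF-8 only for strings, any encoding for cBuffers), the sketch below uses a made-up `EncodedBuffer` class as a stand-in for a cBuffer that remembers its encoding. It is Python, not Lua, and `EncodedBuffer`, `to_utf8`, `from_utf8`, and `byte_length_of_string` are all hypothetical names invented for this example. Python `bytes` stands in for a Lua string here, since a Lua string is just a sequence of bytes.

```python
class EncodedBuffer:
    """Hypothetical stand-in for a cBuffer: raw bytes plus their encoding."""

    def __init__(self, data, encoding):
        self.data = data
        self.encoding = encoding

    def to_utf8(self):
        # Conversion happens only at the boundary where a string is needed.
        return self.data.decode(self.encoding).encode("utf-8")

    @classmethod
    def from_utf8(cls, utf8_bytes, encoding):
        # Going the other way: a UTF-8 string into a buffer in another encoding.
        return cls(utf8_bytes.decode("utf-8").encode(encoding), encoding)


def byte_length_of_string(utf8_bytes):
    """Example of an API that expects a UTF-8 string (hypothetical)."""
    return len(utf8_bytes)


if __name__ == "__main__":
    buf = EncodedBuffer("日本語".encode("euc_jp"), "euc_jp")
    # The buffer is passed to a string API through its UTF-8 view.
    print(byte_length_of_string(buf.to_utf8()))
```

Under this rule, every string API can assume a single encoding, and the cost of supporting other encodings is concentrated in the buffer's conversion methods.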

@dvv
Contributor

dvv commented Sep 18, 2012

Right. As previously stated, IMHO it's better to keep things small and clean as far as we can, so UTF-8 should be enough for starters.
I have something related to this domain -- unicode -- though I don't know whether it's useful.
