MultiByte Encoding Support #81

hnakamur · 2012-09-17T16:46:04Z

Something like a Unix pipe would be best, for example Shift_JIS -> UTF-8 -> to_lower.

We should avoid iconv because of license incompatibility.
Maybe we can use parts of libnkf, bsdconv, or PHP's mbstring.

This is a very important subject, so let's take the time to consider it carefully.

dvv · 2012-09-17T17:33:17Z

Could you explain the subject in a bit more detail? Thanks in advance.

kristate · 2012-09-17T18:15:53Z

@dvv I just changed the title to MultiByte Encoding Support

hnakamur · 2012-09-17T21:08:07Z

According to the Lua reference manual (http://www.lua.org/ftp/refman-5.0.pdf), strings in Lua may contain any 8-bit value, including embedded zeros, which can be specified as '\0'.

So strings can hold any multibyte character encoding, such as Shift_JIS or EUC-JP, as well as UTF-8.
Therefore we need APIs for converting encodings in strings or cBuffers.
We'd like those APIs to be:

- able to work on partial input (which might end in the middle of a multibyte character), so that a later call to the conversion API with the next chunk of input continues the conversion. Here I assume input bytes are read into a fixed-size cBuffer, say 4 KB.
- able to chain multiple conversions like Unix pipes, for example Shift_JIS -> UTF-8 followed by UTF-8 -> to_lower.

We would like to define APIs that satisfy these goals, so we need more than a simple API like convert(src, dest_encoding) that returns dest.
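
To make the two goals concrete, here is a minimal sketch in Python (used only for illustration, since no Lua API exists yet). It relies on Python's standard incremental codecs to carry state across chunk boundaries. The names DecodeStage, LowerStage, EncodeStage, Pipeline, and feed are hypothetical, not a proposed API. Because Python lowercases decoded text rather than UTF-8 bytes, the stages here run as decode -> to_lower -> encode.

```python
import codecs


class DecodeStage:
    """Decode bytes to text; bytes of an incomplete character are kept for the next call."""

    def __init__(self, encoding):
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def feed(self, chunk, final=False):
        return self._decoder.decode(chunk, final)


class LowerStage:
    """Lowercase text chunk by chunk (good enough for this illustration)."""

    def feed(self, text, final=False):
        return text.lower()


class EncodeStage:
    """Encode text to bytes in the target encoding."""

    def __init__(self, encoding):
        self._encoder = codecs.getincrementalencoder(encoding)()

    def feed(self, text, final=False):
        return self._encoder.encode(text, final)


class Pipeline:
    """Chain stages like a Unix pipe: the output of one stage feeds the next."""

    def __init__(self, *stages):
        self._stages = stages

    def feed(self, chunk, final=False):
        for stage in self._stages:
            chunk = stage.feed(chunk, final)
        return chunk


if __name__ == "__main__":
    source = "日本語 ABC".encode("shift_jis")
    pipe = Pipeline(DecodeStage("shift_jis"), LowerStage(), EncodeStage("utf-8"))
    # 3-byte chunks deliberately split 2-byte Shift_JIS characters.
    parts = [pipe.feed(source[i:i + 3]) for i in range(0, len(source), 3)]
    parts.append(pipe.feed(b"", final=True))  # flush any pending state
    print(b"".join(parts).decode("utf-8"))
```

The point of the sketch is that each stage owns its own conversion state, so the caller only passes chunks through and signals the end of input once.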

@dvv I hope this explanation is clear enough.

hnakamur · 2012-09-17T22:29:59Z

We should decide on rules for the encodings of strings and cBuffers.

This is just an idea, but my take is to use only UTF-8 for strings and allow any encoding for cBuffers.
We would then add APIs to cBuffers for interoperability with strings, so that we can pass cBuffers to APIs that expect strings.

However, I have not thought about it thoroughly yet, so I'm not sure this actually works. Maybe allowing any encoding for both strings and cBuffers is a better way.
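
The first option (UTF-8 only for strings, any encoding for cBuffers) could be sketched roughly as follows. This is Python for illustration only: a Lua string is modelled as a bytes value holding UTF-8, and EncodedBuffer and as_string are hypothetical names, not existing cBuffer APIs.

```python
class EncodedBuffer:
    """Hypothetical stand-in for a cBuffer that remembers its own encoding."""

    def __init__(self, data, encoding):
        self.data = data          # raw bytes in any encoding
        self.encoding = encoding  # e.g. "shift_jis", "euc_jp", "utf-8"

    def to_utf8(self):
        """Return the contents re-encoded as UTF-8 bytes."""
        return self.data.decode(self.encoding).encode("utf-8")


def as_string(value):
    """Accept either a UTF-8 string (bytes) or an EncodedBuffer.

    An API that expects strings could call this on its arguments,
    so buffers in other encodings can be passed in directly.
    """
    if isinstance(value, EncodedBuffer):
        return value.to_utf8()
    return value
```

Under this rule the conversion happens once, at the boundary where a buffer is handed to a string API, and everything on the string side can assume UTF-8.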

dvv · 2012-09-18T10:23:43Z

Right. As previously stated, IMHO it's better to keep things small and clean as far as we can, so UTF-8 should be enough for starters.
I have something related to this domain (unicode), though I don't know whether it's useful.
