Unicode, in friendly terms: ASCII, UTF-8, code points, character encodings, and more #166

luantranminh · 2023-10-20T08:18:41Z

Data has to be stored (in RAM, on disk, ...) as bits (0s and 1s).

ASCII maps basic Western characters to numbers between 0 and 127.

Each ASCII character fits in one byte, so 5 characters -> 5 bytes.
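A quick way to check the "5 characters -> 5 bytes" claim is to encode a short string and look at the raw bytes. This is only a sketch; the sample word `hello` is my own choice, not from the article.

```python
# Sample text chosen for illustration (an assumption, not from the article).
text = "hello"

# Encode to ASCII: every character becomes exactly one byte.
encoded = text.encode("ascii")

# Iterating over a bytes object yields the numeric value of each byte.
codes = list(encoded)

print(len(text), "characters ->", len(encoded), "bytes")
print(codes)  # every value is in the ASCII range 0..127
```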

Unicode was created to cover the world's many languages and everything ASCII cannot represent: accents, emojis, modifiers, and other special characters.
Grapheme: a single unit of a human writing system (d or 华 or 🙀 ...).
Code point: a number that Unicode assigns to a character; one or more code points combine to form a grapheme.
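To see the difference between a grapheme and its code points, here is a small sketch in Python, where `len()` on a `str` counts code points, not graphemes. The example strings are my own picks. Note that the standard library has no grapheme splitter, so this only shows the code-point side.

```python
import unicodedata

# "é" written as two code points: the letter e + a combining acute accent.
# It renders as ONE grapheme on screen.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Thumbs-up emoji followed by a skin-tone modifier: also one grapheme,
# but built from two code points.
thumbs_up = "\U0001F44D\U0001F3FD"

print(len(decomposed))  # code points in the decomposed form
print(len(composed))    # code points after NFC composition
print([f"U+{ord(c):04X}" for c in thumbs_up])
```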

Different encodings

UTF-32: every code point is encoded as exactly 4 bytes:
- pros: fixed size per code point -> easier to search and index
- cons: wastes space on simple characters (ASCII text takes 4x the space)
UTF-8: every code point is encoded as 1 to 4 bytes:
- pros: backward compatible with ASCII, saves space
- cons: harder to index because code points take a varying number of bytes
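The trade-off above can be checked with a rough sketch that encodes the three sample graphemes from earlier in both encodings. I use Python's `utf-32-le` codec so that no byte-order mark is prepended; that choice is mine, not the article's.

```python
# The three sample characters from the notes above.
sample = "d华🙀"

# Bytes needed per character: (UTF-8 size, UTF-32 size).
sizes = {
    ch: (len(ch.encode("utf-8")), len(ch.encode("utf-32-le")))
    for ch in sample
}
print(sizes)

# UTF-8 is backward compatible with ASCII: pure ASCII text encodes identically.
ascii_compatible = "hello".encode("utf-8") == "hello".encode("ascii")

# UTF-32 makes indexing trivial: code point n lives at bytes 4*n .. 4*n + 4.
utf32 = sample.encode("utf-32-le")
second = utf32[4:8].decode("utf-32-le")

# In UTF-8 the same characters have different widths, so the byte offset of
# code point n cannot be computed without scanning from the start.
utf8 = sample.encode("utf-8")
print(len(utf8), "bytes in UTF-8 vs", len(utf32), "bytes in UTF-32")
```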

luantranminh added the dev label Oct 20, 2023
