Unicode, in friendly terms: ASCII, UTF-8, code points, character encodings, and more #166

Open
luantranminh opened this issue Oct 20, 2023 · 0 comments

Data has to be stored (in RAM, on disk, ...) as bits (0s and 1s), so every character must first be mapped to a number.

ASCII maps basic Western characters to numbers between 0 and 127.
image
Each ASCII character fits in one byte: 5 characters -> 5 bytes
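
A minimal Python sketch of the mapping above. The string `"hello"` is only an assumed 5-character sample, not necessarily the one in the image:

```python
# Assumed sample text; any 5 basic ASCII characters behave the same way.
text = "hello"

# ord() returns the number each character maps to (0 to 127 for ASCII).
codes = [ord(ch) for ch in text]
print(codes)         # [104, 101, 108, 108, 111]

# Encoding as ASCII stores one byte per character.
encoded = text.encode("ascii")
print(len(encoded))  # 5
```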

Unicode was created to cover the world's many languages and everything ASCII cannot represent: accented letters, emojis, modifiers, and other symbols.
Grapheme: a single unit of a human writing system (d or 华 or 🙀 ...).
Code point: a number that Unicode assigns to a character or modifier. One or more code points combine to form a grapheme.
image
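
A small sketch of the grapheme vs. code point distinction, using only Python's standard `unicodedata` module. The letter é is an assumed example: it can be written as one code point or as two that render as the same grapheme.

```python
import unicodedata

single = "\u00e9"     # é as one precomposed code point
combined = "e\u0301"  # e followed by a combining acute accent

# Python's len() counts code points, not graphemes.
print(len(single), len(combined))  # 1 2

# Same grapheme on screen, but different code point sequences.
print(single == combined)          # False

# NFC normalization composes the pair into the single code point.
print(unicodedata.normalize("NFC", combined) == single)  # True

# The names of the two code points behind the combined form.
print([unicodedata.name(cp) for cp in combined])
```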

Different encodings

  • UTF-32: each code point is encoded as exactly 4 bytes:
    • pros: same size for every code point -> easier to search and index
    • cons: wastes space on simple characters
  • UTF-8: each code point is encoded as 1 to 4 bytes:
    • pros: backward compatible with ASCII, saves space
    • cons: harder to index because code points take a varying number of bytes
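
The trade-off above can be checked with a short Python sketch. The sample characters are the ones from the grapheme examples. The `utf-32-be` codec is chosen here only so that no byte order mark is added to the output.

```python
text = "d华🙀"

# Bytes needed per character: (UTF-8, UTF-32).
sizes = {}
for ch in text:
    sizes[ch] = (len(ch.encode("utf-8")), len(ch.encode("utf-32-be")))
    print(ch, sizes[ch])
# d  -> (1, 4)
# 华 -> (3, 4)
# 🙀 -> (4, 4)

# UTF-32: code point number i always starts at byte offset 4 * i.
data32 = text.encode("utf-32-be")
second = data32[4:8].decode("utf-32-be")
print(second)      # 华

# UTF-8: sizes vary, so finding code point i means scanning from the start.
data8 = text.encode("utf-8")
print(len(data8))  # 8 (1 + 3 + 4)
```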