Releases: BYVoid/uchardet

Version 0.0.5 released.

05 Dec 12:25
  • Revert the UTF-16 and UTF-32 label change:
    it was an error to specify endianness for texts with a BOM.
    The Unicode standard explicitly warns against it, and it even
    (partially) breaks conversions.
  • Added support for:
    • French: Windows-1252
    • German: ISO-8859-1, Windows-1252
    • Esperanto: ISO-8859-3
    • Turkish: ISO-8859-3 and ISO-8859-9
    • Thai: ISO-8859-11 (and TIS-620 model rebuilt)
  • Single-byte charset detection algorithm improved:
    detecting control characters now lowers confidence.
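
The label revert above can be illustrated with a minimal sketch of BOM sniffing. It is not uchardet's actual code: `bom_label` is a hypothetical helper, and the only assumption about labels is what the notes state, namely that a text starting with a BOM is reported without an endianness suffix. In common iconv implementations the bare "UTF-16" decoder reads the BOM to pick the byte order, whereas "UTF-16LE" or "UTF-16BE" treat it as an ordinary character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of uchardet's API: map a leading BOM to a
 * label without endianness. Returns NULL when no BOM is recognised. */
const char *bom_label(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* UTF-32 is tested first: its little-endian BOM (FF FE 00 00)
     * begins with the UTF-16 little-endian BOM (FF FE). */
    if (len >= 4 && (memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0 ||
                     memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0))
        return "UTF-32";
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    if (len >= 2 && (memcmp(buf, "\xFF\xFE", 2) == 0 ||
                     memcmp(buf, "\xFE\xFF", 2) == 0))
        return "UTF-16";
    return NULL;
}
```

Calling `bom_label(data, size)` on the first bytes of a file yields a label that can be handed to a converter as-is; the byte order is then recovered from the BOM itself rather than from the label.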

Version 0.0.4 released.

03 Dec 19:08
  • Add support of ISO-8859-1 and ISO-8859-15 for French.
  • Re-enable Hungarian language models (ISO-8859-2 and Windows-1250) which used to conflict with other charsets (should be better now).
  • Differentiate ASCII detection and detection failure.
  • Improve single-byte charset detection confidence algorithm (fixes for instance Windows-1251 Russian text detection).
  • "UTF-16" is now output with endianness information (UTF-16LE/BE).
  • Add UTF-32 BOM detection.
  • Discard single byte charsets upon illegal codepoint detection.
  • Internal redesign of single-byte charmaps with more semantics, and variable sample size length (different languages have different sizes of grapheme lists).
  • Many more test files (33 unit tests should pass with make test).
  • Add Python scripts to generate language models from Wikipedia data in a single command.
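
The idea of discarding a single-byte charset on an illegal codepoint can be sketched as follows. This is only an illustration under stated assumptions, not the library's implementation: both function names are hypothetical, and Windows-1252 is used as the example because Microsoft's mapping leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) unassigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Byte values that Windows-1252 leaves unassigned. */
bool cp1252_is_unassigned(unsigned char b)
{
    return b == 0x81 || b == 0x8D || b == 0x8F || b == 0x90 || b == 0x9D;
}

/* Hypothetical check: the charset stops being a candidate as soon as
 * one byte that it cannot represent shows up in the input. */
bool cp1252_is_candidate(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (cp1252_is_unassigned(buf[i]))
            return false;
    }
    return true;
}
```

A detector built this way would run one such check per candidate charset and only score the candidates that survive, so a single impossible byte outweighs any amount of statistically plausible text.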

Version 0.0.3 Released.

19 Nov 14:35

A quick release after 0.0.2 mostly to fix a bad crash on the command
line tool when charset detection failed (or detected ASCII).

Additionally:

  • The build now includes more test files for various languages and
    encodings, and a make test target for unit testing (20 encoding
    detection tests should pass when running it).
  • The build has a new BUILD_STATIC option, set to ON by default,
    which lets you disable the static library build when it is not needed.
  • All encoding names are iconv-compatible, enabling developers to
    directly feed the result of uchardet_get_charset() into libiconv.
  • Compilation warnings fixed.
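
Because the names are iconv-compatible, a detected charset can be passed straight to iconv_open(). The sketch below shows only the iconv side so that it stays self-contained: `to_utf8` is a hypothetical helper, its `charset` argument stands in for the string returned by uchardet_get_charset(), and treating an empty name as "detection failed" is an assumption made here. It performs a one-shot conversion and does not handle stateful encodings or partial input.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `in` from `charset` to UTF-8.
 * Returns the number of bytes written to `out`, or (size_t)-1 on error. */
size_t to_utf8(const char *charset, const char *in, size_t in_len,
               char *out, size_t out_cap)
{
    assert(charset != NULL && in != NULL && out != NULL);

    /* Assumption: an empty name means nothing was detected. */
    if (charset[0] == '\0')
        return (size_t)-1;

    /* iconv_open() takes the target encoding first, then the source. */
    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;   /* iconv() wants a non-const char ** */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;
    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : out_cap - dst_left;
}
```

In a real program the first argument would come from the detector, for example `to_utf8(uchardet_get_charset(handle), data, size, out, sizeof out)`, with no name translation table in between.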

Version 0.0.2

16 Nov 15:18

The primary goal of this release is to set a fixed point in time for distributions, since most are using various commits as their source while still calling it 0.0.1 (there was actually a version 0.0.1 tarball available on Google Code, dating from 2011).

Version 0.0.2 mostly fixes various bugs and allows querying charsets for multiple files in the same command with the uchardet command-line tool.