PyThaiNLP · bact · Feb 11, 2024 · Feb 11, 2024
diff --git a/Dockerfile b/Dockerfile
@@ -1,3 +1,6 @@
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0
+
 FROM python:3.8-slim-buster
 
 COPY . .

diff --git a/README.md b/README.md
@@ -13,20 +13,19 @@
   <a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>
 </div>
 
-PyThaiNLP is a Python package for text processing and linguistic analysis, similar to [NLTK](https://www.nltk.org/) with a focus on the Thai language.
+PyThaiNLP is a Python package for text processing and linguistic analysis, similar to [NLTK](https://www.nltk.org/) with a focus on the Thai language.
 
 PyThaiNLP เป็นไลบารีภาษาไพทอนสำหรับประมวลผลภาษาธรรมชาติ คล้ายกับ NLTK โดยเน้นภาษาไทย [ดูรายละเอียดภาษาไทยได้ที่ README_TH.MD](https://github.com/PyThaiNLP/pythainlp/blob/dev/README_TH.md)
 
-**News**
+## News
 
 > Now, You can contact with or ask any questions of the PyThaiNLP team. <a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>
 
 | Version | Description | Status |
 |:------:|:--:|:------:|
-| [5.0](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/788) |
+| [5.0.1](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/788) |
 | [`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev) | Release Candidate for 5.1 | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/900) |
 
-
 ## Getting Started
 
 - PyThaiNLP 2 requires Python 3.7+. Python 2.7 users can use PyThaiNLP 1.6. See [2.0 change log](https://github.com/PyThaiNLP/pythainlp/issues/118) | [Upgrading from 1.7](https://pythainlp.github.io/docs/2.0/notes/pythainlp-1_7-2_0.html) | [Upgrading ThaiNER from 1.7](https://github.com/PyThaiNLP/pythainlp/wiki/Upgrade-ThaiNER-from-PyThaiNLP-1.7-to-PyThaiNLP-2.0)
@@ -37,24 +36,20 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนสำหร
 
 ## Capabilities
 
-PyThaiNLP provides standard NLP functions for Thai, for example part-of-speech tagging, linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via the command-line interface.
+PyThaiNLP provides standard linguistic analysis for the Thai language and standard Thai locale utility functions.
+Some of these functions are also available via the command-line interface (run `thainlp` in your shell).
 
-<details>
-  <summary>List of Features</summary>
+Partial list of features:
 
 - Convenient character and word classes, like Thai consonants (`pythainlp.thai_consonants`), vowels (`pythainlp.thai_vowels`), digits (`pythainlp.thai_digits`), and stop words (`pythainlp.corpus.thai_stopwords`) -- comparable to constants like `string.letters`, `string.digits`, and `string.punctuation`
-- Thai linguistic unit segmentation/tokenization, including sentence (`sent_tokenize`), word (`word_tokenize`), and subword segmentations based on Thai Character Cluster (`subword_tokenize`)
-- Thai part-of-speech tagging (`pos_tag`)
-- Thai spelling suggestion and correction (`spell` and `correct`)
-- Thai transliteration (`transliterate`)
-- Thai soundex (`soundex`) with three engines (`lk82`, `udom83`, `metasound`)
-- Thai collation (sorted by dictionary order) (`collate`)
-- Read out number to Thai words (`bahttext`, `num_to_thaiword`)
-- Thai datetime formatting (`thai_strftime`)
+- Linguistic unit segmentation at different levels: sentence (`sent_tokenize`), word (`word_tokenize`), and subword (`subword_tokenize`)
+- Part-of-speech tagging (`pos_tag`)
+- Spelling suggestion and correction (`spell` and `correct`)
+- Phonetic algorithm and transliteration (`soundex` and `transliterate`)
+- Collation (sorted by dictionary order) (`collate`)
+- Number read out (`num_to_thaiword` and `bahttext`)
+- Datetime formatting (`thai_strftime`)
 - Thai-English keyboard misswitched fix (`eng_to_thai`, `thai_to_eng`)
-- Command-line interface for basic functions, like tokenization and POS tagging (run `thainlp` in your shell)
-</details>
-
 
 ## Installation
 
@@ -78,46 +73,43 @@ Some functionalities, like Thai WordNet, may require extra packages. To install
 pip install pythainlp[extra1,extra2,...]
 ```
 
-<details>
-  <summary>List of possible <code>extras</code></summary>
+Possible `extras`:
 
--  `full` (install everything)
--  `attacut` (to support attacut, a fast and accurate tokenizer)
--  `benchmarks` (for [word tokenization benchmarking](tokenization-benchmark.md))
--  `icu` (for ICU, International Components for Unicode, support in transliteration and tokenization)
--  `ipa` (for IPA, International Phonetic Alphabet, support in transliteration)
--  `ml` (to support ULMFiT models for classification)
--  `thai2fit` (for Thai word vector)
--  `thai2rom` (for machine-learnt romanization)
--  `wordnet` (for Thai WordNet API)
-</details>
+- `full` (install everything)
+- `attacut` (to support attacut, a fast and accurate tokenizer)
+- `benchmarks` (for [word tokenization benchmarking](tokenization-benchmark.md))
+- `icu` (for ICU, International Components for Unicode, support in transliteration and tokenization)
+- `ipa` (for IPA, International Phonetic Alphabet, support in transliteration)
+- `ml` (to support ULMFiT models for classification)
+- `thai2fit` (for Thai word vector)
+- `thai2rom` (for machine-learnt romanization)
+- `wordnet` (for Thai WordNet API)
 
 For dependency details, look at the `extras` variable in [`setup.py`](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py).
 
-
 ## Data Directory
 
 - Some additional data, like word lists and language models, may be automatically downloaded during runtime.
 - PyThaiNLP caches these data under the directory `~/pythainlp-data` by default.
 - The data directory can be changed by specifying the environment variable `PYTHAINLP_DATA_DIR`.
 - See the data catalog (`db.json`) at https://github.com/PyThaiNLP/pythainlp-corpus
 
-
 ## Command-Line Interface
 
 Some of PyThaiNLP functionalities can be used via command line with the `thainlp` command.
 
 For example, to display a catalog of datasets:
+
 ```sh
 thainlp data catalog
 ```
 
 To show how to use:
+
 ```sh
 thainlp help
 ```
 
-
 ## Licenses
 
 | | License |
@@ -127,7 +119,6 @@ thainlp help
 | Language models created by PyThaiNLP | [Creative Commons Attribution 4.0 International Public License (CC-by)](https://creativecommons.org/licenses/by/4.0/)  |
 | Other corpora and models that may be included in PyThaiNLP | See [Corpus License](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md) |
 
-
 ## Contribute to PyThaiNLP
 
 - Please fork and create a pull request :)
@@ -137,7 +128,6 @@ thainlp help
 
 You can read [INTHEWILD.md](https://github.com/PyThaiNLP/pythainlp/blob/dev/INTHEWILD.md).
 
-
 ## Citations
 
 If you use `PyThaiNLP` in your project or publication, please cite the library as follows:

diff --git a/README_TH.md b/README_TH.md
@@ -14,18 +14,17 @@
 </div>
 PyThaiNLP เป็นไลบารีภาษาไพทอนสำหรับประมวลผลภาษาธรรมชาติ โดยเน้นภาษาไทย
 
-**ข่าวสาร**
+## ข่าวสาร
 
 > คุณสามารถพูดคุยหรือแชทกับทีม PyThaiNLP หรือผู้สนับสนุนคนอื่น ๆ ได้ที่ <a href="https://matrix.to/#/#thainlp:matrix.org" rel="noopener" target="_blank"><img src="https://matrix.to/img/matrix-badge.svg" alt="Chat on Matrix"></a>
 
 | รุ่น | คำอธิบาย | สถานะ |
 |:------:|:--:|:------:|
-| [5.0](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/788) |
+| [5.0.1](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/788) |
 | [`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev) | Release Candidate for 5.1  | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/900) |
 
 ติดตามพวกเราบน [PyThaiNLP Facebook page](https://www.facebook.com/pythainlp/) เพื่อรับข่าวสารเพิ่มเติม
 
-
 ## เริ่มต้นกับ PyThaiNLP
 
 พวกเราได้จัดทำ [PyThaiNLP Get Started Tutorial](https://pythainlp.github.io/tutorials/notebooks/pythainlp_get_started.html) สำหรับสำรวจความสามารถของ PyThaiNLP; พวกเรามีเอกสารสอนใช้งาน สามารถศึกษาได้ที่ [หน้า tutorial](https://pythainlp.github.io/tutorials).
@@ -34,7 +33,6 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนสำหร
 
 พวกเราพยายามทำให้โมดูลใช้งานได้ง่ายที่สุดเท่าที่จะเป็นไปได้; ตัวอย่างเช่น บางชุดข้อมูล (เช่น รายการคำและตัวแบบภาษา) จะถูกดาวน์โหลดอัตโนมัติเมื่อมีการเรียกใช้งาน โดย PyThaiNLP จะจัดเก็บข้อมูลเหล่านั้นไว้ในโฟลเดอร์ `~/pythainlp-data` เป็นค่าเริ่มต้น แต่ผู้ใช้งานสามารถระบุตำแหน่งที่ต้องการได้เองผ่านค่า environment variable `PYTHAINLP_DATA_DIR` อ่านรายละเอียดคลังข้อมูลเพิ่มเติมได้ที่ [PyThaiNLP/pythainlp-corpus](https://github.com/PyThaiNLP/pythainlp-corpus).
 
-
 ## ความสามารถ
 
 PyThaiNLP มีความสามารถพื้นฐานสำหรับการประมวลผลภาษาไทย ตัวอย่างเช่นการกำกับหน้าที่ของคำ (part-of-speech tagging) การแบ่งหน่วยของข้อความตามหลักภาษาศาสตร์ (พยางค์ คำ และประโยค) บางความสามารถสามารถใช้งานได้ผ่านทางคอมมานด์ไลน์
@@ -84,43 +82,42 @@ pip install pythainlp[extra1,extra2,...]
 <details>
   <summary>รายการสำหรับติดตั้งผ่าน <code>extras</code></summary>
 
--  `full` (ติดตั้งทุกอย่าง)
--  `attacut` (เพื่อสนับสนุน attacut ซึ่งเป็นตัวตัดคำที่ทำงานได้รวดเร็วและมีประสิทธิภาพ)
--  `benchmarks` (สำหรับ [word tokenization benchmarking](tokenization-benchmark.md))
--  `icu` (สำหรับการรองรับ ICU หรือ International Components for Unicode ในการถอดเสียงเป็นอักษรและการตัดแบ่งคำ)
--  `ipa` (สำหรับการรองรับ IPA หรือ International Phonetic Alphabet ในการถอดเสียงเป็นอักษร)
--  `ml` (เพื่อให้สนับสนุนตัวแบบภาษา ULMFiT สำหรับการจำแนกข้อความ)
--  `thai2fit` (สำหรับ Thai word vector)
--  `thai2rom` (สำหรับการถอดอักษรไทยเป็นอักษรโรมัน)
--  `wordnet` (สำหรับ Thai WordNet API)
+- `full` (ติดตั้งทุกอย่าง)
+- `attacut` (เพื่อสนับสนุน attacut ซึ่งเป็นตัวตัดคำที่ทำงานได้รวดเร็วและมีประสิทธิภาพ)
+- `benchmarks` (สำหรับ [word tokenization benchmarking](tokenization-benchmark.md))
+- `icu` (สำหรับการรองรับ ICU หรือ International Components for Unicode ในการถอดเสียงเป็นอักษรและการตัดแบ่งคำ)
+- `ipa` (สำหรับการรองรับ IPA หรือ International Phonetic Alphabet ในการถอดเสียงเป็นอักษร)
+- `ml` (เพื่อให้สนับสนุนตัวแบบภาษา ULMFiT สำหรับการจำแนกข้อความ)
+- `thai2fit` (สำหรับ Thai word vector)
+- `thai2rom` (สำหรับการถอดอักษรไทยเป็นอักษรโรมัน)
+- `wordnet` (สำหรับ Thai WordNet API)
 </details>
 
 สำหรับโมดูลที่ต้องการ สามารถดูรายละเอียดได้ที่ตัวแปร `extras` ใน [`setup.py`](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py).
 
-
 ## Command-line
 
 บางความสามารถของ PyThaiNLP สามารถใช้งานผ่าน command line ได้โดยใช้ `thainlp`
 
 ตัวอย่าง, แสดงรายละเอียดของชุดข้อมูล:
+
 ```sh
 thainlp data catalog
 ```
 
 แสดงวิธีใช้งาน:
+
 ```sh
 thainlp help
 ```
 
-
 ## ผู้ใช้งาน Python 2
 
 - PyThaiNLP 2 สนับสนุน Python 3.6 ขึ้นไป บางความสามารถ สามารถใช้งานกับ Python 3 รุ่นก่อนหน้าได้ แต่ไม่ได้มีการทดสอบว่าใช้งานได้หรือไม่ อ่านเพิ่มเติม [1.7 -> 2.0 change log](https://github.com/PyThaiNLP/pythainlp/issues/118).
   - [Upgrading from 1.7](https://pythainlp.github.io/docs/2.0/notes/pythainlp-1_7-2_0.html)
   - [Upgrade ThaiNER from 1.7](https://github.com/PyThaiNLP/pythainlp/wiki/Upgrade-ThaiNER-from-PyThaiNLP-1.7-to-PyThaiNLP-2.0)
 - ผู้ใช้งาน Python 2.7 สามารถใช้งาน PyThaiNLP 1.6
 
-
 ## การอ้างอิง
 
 หากคุณใช้ซอฟต์แวร์ `PyThaiNLP` ในโครงงานหรืองานวิจัยของคุณ คุณสามารถอ้างอิงได้ตามนี้
@@ -184,7 +181,6 @@ Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Sur
 
 คุณสามารถอ่านได้ที่ [INTHEWILD.md](https://github.com/PyThaiNLP/pythainlp/blob/dev/INTHEWILD.md)
 
-
 ## สัญญาอนุญาต
 
 | | สัญญาอนุญาต |
@@ -194,12 +190,10 @@ Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Sur
 | Language models created by PyThaiNLP | [Creative Commons Attribution 4.0 International Public License (CC-by)](https://creativecommons.org/licenses/by/4.0/)  |
 | สำหรับฐานข้อมูลภาษาและโมเดลอื่นที่อาจมาพร้อมกับซอฟต์แวร์ PyThaiNLP | ดู [Corpus License](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md) |
 
-
 ## บัตรโมเดล
 
 สำหรับรายละเอียดทางเทคนิค ข้อควรระวัง และข้อคำนึงทางจริยธรรมของตัวแบบ (โมเดล) ที่ใช้ใน PyThaiNLP กรุณาดูที่ [Model cards](https://github.com/PyThaiNLP/pythainlp/wiki/Model-Cards)
 
-
 ## ผู้สนับสนุน
 
 [![VISTEC-depa Thailand Artificial Intelligence Research Institute](https://airesearch.in.th/assets/img/logo/airesearch-logo.svg)](https://airesearch.in.th/)

diff --git a/docs/conf.py b/docs/conf.py
@@ -1,4 +1,6 @@
 # -*- coding: utf-8 -*-
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0
 #
 # Configuration file for the Sphinx documentation builder.
 # http://www.sphinx-doc.org/en/master/config
@@ -21,8 +23,8 @@
 # -- Project information -----------------------------------------------------
 
 project = "PyThaiNLP"
-copyright = "2019, pythainlp_builders"
-author = "pythainlp_builders"
+copyright = "2016-2024 PyThaiNLP Project"
+author = "PyThaiNLP Project"
 
 curyear = datetime.today().year
 copyright = f"2017-{curyear}, {project} (Apache Software License 2.0)"

diff --git a/pyproject.toml b/pyproject.toml
@@ -1,3 +1,6 @@
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0
+
 [tool.ruff]
 line-length = 79
 indent-width = 4

diff --git a/pythainlp/__init__.py b/pythainlp/__init__.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 __version__ = "5.0.1"
 

diff --git a/pythainlp/__main__.py b/pythainlp/__main__.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 import argparse
 import sys

diff --git a/pythainlp/ancient/__init__.py b/pythainlp/ancient/__init__.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 """
 Ancient versions of the Thai language

diff --git a/pythainlp/ancient/aksonhan.py b/pythainlp/ancient/aksonhan.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 from pythainlp.util import Trie
 from pythainlp import thai_consonants, thai_tonemarks

diff --git a/pythainlp/augment/__init__.py b/pythainlp/augment/__init__.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 """
 Thai text augment

diff --git a/pythainlp/augment/lm/__init__.py b/pythainlp/augment/lm/__init__.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 """
 Language Models

diff --git a/pythainlp/augment/lm/fasttext.py b/pythainlp/augment/lm/fasttext.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 import itertools
 from typing import List, Tuple

diff --git a/pythainlp/augment/lm/phayathaibert.py b/pythainlp/augment/lm/phayathaibert.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 
 from typing import List

diff --git a/pythainlp/augment/lm/wangchanberta.py b/pythainlp/augment/lm/wangchanberta.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 
 from typing import List

diff --git a/pythainlp/augment/word2vec/__init__.py b/pythainlp/augment/word2vec/__init__.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 """
 Word2Vec

diff --git a/pythainlp/augment/word2vec/bpemb_wv.py b/pythainlp/augment/word2vec/bpemb_wv.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 from typing import List, Tuple
 from pythainlp.augment.word2vec.core import Word2VecAug

diff --git a/pythainlp/augment/word2vec/core.py b/pythainlp/augment/word2vec/core.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 from typing import List, Tuple
 import itertools

diff --git a/pythainlp/augment/word2vec/ltw2v.py b/pythainlp/augment/word2vec/ltw2v.py
@@ -1,5 +1,5 @@
 # -*- coding: utf-8 -*-
-# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project
+# SPDX-FileCopyrightText: 2016-2024 PyThaiNLP Project
 # SPDX-License-Identifier: Apache-2.0
 from typing import List, Tuple
 from pythainlp.augment.word2vec.core import Word2VecAug
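
The header change repeated across the Python modules above is mechanical: the word `Copyright` is dropped from the `SPDX-FileCopyrightText` line. A minimal sketch of that rewrite follows. It is not taken from this change set; the names `OLD_HEADER` and `normalize_spdx_header` are hypothetical and used only for illustration.

```python
import re

# Hypothetical pattern: the older header form, which carries a redundant
# "Copyright" word before the year range.
OLD_HEADER = re.compile(
    r"^(# SPDX-FileCopyrightText:) Copyright (?=\d{4})",
    re.MULTILINE,
)


def normalize_spdx_header(source: str) -> str:
    """Drop the redundant 'Copyright' word from SPDX-FileCopyrightText lines.

    Lines already in the new form are left untouched, so the function
    can safely be applied more than once.
    """
    return OLD_HEADER.sub(r"\1 ", source)


# Sample input mirroring the removed ("-") header lines in the diff.
sample = (
    "# -*- coding: utf-8 -*-\n"
    "# SPDX-FileCopyrightText: Copyright 2016-2024 PyThaiNLP Project\n"
    "# SPDX-License-Identifier: Apache-2.0\n"
)

print(normalize_spdx_header(sample))
```

The pattern only matches when a four-digit year follows `Copyright`, so unrelated comment lines that mention copyright are not rewritten.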