Merge pull request #2 from OpenPecha/add/documentation
Add/documentation
spsither authored May 1, 2024
2 parents b05ea8f + 8eb1b79 commit 8e3cbc8
Showing 1 changed file (README.md) with 25 additions and 7 deletions.

## Description

Tibetan sentence tokenizer designed specifically for data preparation.

## Project owner(s)

<!-- Link to the repo owners' github profiles -->

- [@tenzin3](https://github.com/tenzin3)

## Installation

```py
pip install git+https://github.com/OpenPecha/bo_sent_tokenizer.git
```

## Usage

```py
from bo_sent_tokenizer import tokenize

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"

tokenized_text = tokenize(text)
print(tokenized_text)
# Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n'
```

## Explanation
The sentence 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།' is clean Tibetan text.

The sentence 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།' contains an illegal token, 'བབབབབབབབནམ'.

The sentence 'ངའི་མིང་ལ་Thomas་ཟེར།' contains characters from another language ('Thomas').

The sentence 'ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།' contains the non-Tibetan symbols '(' and ')'.

If a sentence is clean, it is retained. If a sentence contains an illegal token or characters from another language, that sentence is excluded. If a sentence contains non-Tibetan symbols, those symbols are filtered out and the sentence is retained.
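
The sketch below illustrates this filtering logic for readers. It is only an approximation, not the package's actual implementation: the sentence splitting on the shad, the Unicode ranges, and the repeated-letter rule used to stand in for "illegal token" detection are all assumptions.

```py
import re

# Hypothetical illustration of the filtering described above (not the real implementation).
LATIN = re.compile(r"[A-Za-z]")                       # characters from another language
NON_TIBETAN = re.compile(r"[^\u0F00-\u0FFF\s]")       # anything outside the Tibetan Unicode block
ILLEGAL_RUN = re.compile(r"([\u0F40-\u0FBC])\1{3,}")  # assumed rule: a letter repeated 4+ times

def filter_text(text: str) -> str:
    kept = []
    for chunk in text.split("།"):                     # split on the Tibetan shad
        sentence = chunk.strip()
        if not sentence:
            continue
        sentence += "།"
        if ILLEGAL_RUN.search(sentence):              # illegal token -> drop the sentence
            continue
        if LATIN.search(sentence):                    # foreign characters -> drop the sentence
            continue
        sentence = NON_TIBETAN.sub("", sentence)      # strip symbols such as '(' and ')'
        kept.append(sentence)
    return "\n".join(kept) + "\n" if kept else ""
```

On the example from the Usage section, this sketch keeps the first sentence and the de-parenthesised fourth one, matching the output shown there.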
