Merge pull request #2 from OpenPecha/add/documentation
Add/documentation
spsither authored May 1, 2024
2 parents b05ea8f + 8eb1b79 commit 8e3cbc8
Showing 1 changed file (README.md) with 25 additions and 7 deletions.

## Description

Tibetan sentence tokenizer designed specifically for data preparation.

## Project owner(s)

<!-- Link to the repo owners' github profiles -->

- [@tenzin3](https://github.com/tenzin3)

## Installation

```py
pip install git+https://github.com/OpenPecha/bo_sent_tokenizer.git
```

## Usage

```py
from bo_sent_tokenizer import tokenize

text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"

tokenized_text = tokenize(text)
print(tokenized_text)
# Output: 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n'
```

## Explanation
The sentence 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།' is clean Tibetan text.

The sentence 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།' contains an illegal token, 'བབབབབབབབནམ'.

The sentence 'ངའི་མིང་ལ་Thomas་ཟེར།' contains characters from another language ('Thomas').

The sentence 'ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།' contains the non-Tibetan symbols '(' and ')'.

If a sentence is clean, it is retained. If a sentence contains an illegal token or characters from another language, that sentence is excluded. If a sentence contains non-Tibetan symbols, those symbols are filtered out and the sentence is retained.
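
The sketch below illustrates this filtering logic for readers. It is only an approximation, not the package's actual implementation: the sentence splitting on the shad, the Unicode ranges, and the repeated-letter rule used to stand in for "illegal token" detection are all assumptions.

```py
import re

# Hypothetical illustration of the filtering described above (not the real implementation).
LATIN = re.compile(r"[A-Za-z]")                       # characters from another language
NON_TIBETAN = re.compile(r"[^\u0F00-\u0FFF\s]")       # anything outside the Tibetan Unicode block
ILLEGAL_RUN = re.compile(r"([\u0F40-\u0FBC])\1{3,}")  # assumed rule: a letter repeated 4+ times

def filter_text(text: str) -> str:
    kept = []
    for chunk in text.split("།"):                     # split on the Tibetan shad
        sentence = chunk.strip()
        if not sentence:
            continue
        sentence += "།"
        if ILLEGAL_RUN.search(sentence):              # illegal token -> drop the sentence
            continue
        if LATIN.search(sentence):                    # foreign characters -> drop the sentence
            continue
        sentence = NON_TIBETAN.sub("", sentence)      # strip symbols such as '(' and ')'
        kept.append(sentence)
    return "\n".join(kept) + "\n" if kept else ""
```

On the example from the Usage section, this sketch keeps the first sentence and the de-parenthesised fourth one, matching the output shown there.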
