From 2456825550a6f64336fba38a485b78a0e935165a Mon Sep 17 00:00:00 2001
From: tabuna
Date: Sat, 11 May 2024 02:18:23 +0300
Subject: [PATCH] Added docs for tokenize

---
 README.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/README.md b/README.md
index f5faf84..eaf8ae6 100644
--- a/README.md
+++ b/README.md
@@ -54,6 +54,21 @@ items: array:2 [
 */
 ```
 
+## Tokenizer
+
+The algorithm uses a tokenizer to split the text into words. By default, it splits on spaces and keeps only
+words longer than three characters. You can also define your own tokenizer, as in the following example:
+
+```php
+$classifier = new Classifier();
+
+$classifier->setTokenizer(function (string $string) {
+    return Str::of($string)
+        ->lower()
+        ->matchAll('/[[:alpha:]]+/u')
+        ->filter(fn (string $word) => Str::length($word) > 3);
+});
+```
 
 ## Wrapping up
 
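As a quick check on the documented example: the closure passed to `setTokenizer()` relies on Laravel's `Illuminate\Support\Str`, whose `matchAll()` returns an `Illuminate\Support\Collection`, so the filter yields a collection of lowercase words. The stand-alone sketch below runs the same logic on a made-up sentence; the `$tokenize` variable, the sample input, and the autoload path are illustrative assumptions, not part of the patch.

```php
<?php

use Illuminate\Support\Str;

// Assumes illuminate/support is installed via Composer.
require __DIR__ . '/vendor/autoload.php';

// Same logic as the closure given to setTokenizer() in the patch:
// lowercase the input, extract alphabetic runs, and keep only
// words longer than three characters.
$tokenize = function (string $string) {
    return Str::of($string)
        ->lower()
        ->matchAll('/[[:alpha:]]+/u')
        ->filter(fn (string $word) => Str::length($word) > 3);
};

// Hypothetical sample input, not taken from the README.
print_r($tokenize('The QUICK brown fox jumps over the lazy dog!')->values()->all());
// Array ( [0] => quick [1] => brown [2] => jumps [3] => over [4] => lazy )
```

Filtering on `Str::length($word) > 3` matches the default behaviour described in the prose: only words of four or more characters are kept.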