From 2456825550a6f64336fba38a485b78a0e935165a Mon Sep 17 00:00:00 2001
From: tabuna
Date: Sat, 11 May 2024 02:18:23 +0300
Subject: [PATCH] Added docs for tokenize

---
 README.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/README.md b/README.md
index f5faf84..eaf8ae6 100644
--- a/README.md
+++ b/README.md
@@ -54,6 +54,21 @@ items: array:2 [
 */
 ```
 
+## Tokenizer
+
+The algorithm uses a tokenizer to split the text into words. By default, it splits on spaces and keeps only
+words longer than three characters. You can also define your own tokenizer, as in the following example:
+
+```php
+$classifier = new Classifier();
+
+$classifier->setTokenizer(function (string $string) {
+    return Str::of($string)
+        ->lower()
+        ->matchAll('/[[:alpha:]]+/u')
+        ->filter(fn (string $word) => Str::length($word) > 3);
+});
+```
 
 ## Wrapping up
 
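As a quick check on the documented example: the closure passed to `setTokenizer()` relies on Laravel's `Illuminate\Support\Str`, whose `matchAll()` returns an `Illuminate\Support\Collection`, so the filter yields a collection of lowercase words. The stand-alone sketch below runs the same logic on a made-up sentence; the `$tokenize` variable, the sample input, and the autoload path are illustrative assumptions, not part of the patch.

```php
<?php

use Illuminate\Support\Str;

// Assumes illuminate/support is installed via Composer.
require __DIR__ . '/vendor/autoload.php';

// Same logic as the closure given to setTokenizer() in the patch:
// lowercase the input, extract alphabetic runs, and keep only
// words longer than three characters.
$tokenize = function (string $string) {
    return Str::of($string)
        ->lower()
        ->matchAll('/[[:alpha:]]+/u')
        ->filter(fn (string $word) => Str::length($word) > 3);
};

// Hypothetical sample input, not taken from the README.
print_r($tokenize('The QUICK brown fox jumps over the lazy dog!')->values()->all());
// Array ( [0] => quick [1] => brown [2] => jumps [3] => over [4] => lazy )
```

Filtering on `Str::length($word) > 3` matches the default behaviour described in the prose: only words of four or more characters are kept.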