feat: Add Chunking Strategies: Regex and Substring Methods #735
+184
−96
Because
This PR adds Regex and Substring chunking methods to our text processing capabilities. Regex chunking lets users define custom splitting rules with regular expressions, and Substring chunking lets them select text by predefined indices, which gives more flexibility when dealing with complex text formats. The update aligns with our project requirements for supporting diverse text processing strategies and lays the groundwork for future enhancements based on user feedback.
This commit
- Implemented the Regex chunking method, allowing users to specify custom regular expression patterns for text splitting.
- Introduced properties for `chunk-size`, `chunk-overlap`, `model-name`, and `pattern` to configure the chunking behavior.
- Implemented the Substring chunking method, enabling users to define start and end indices for chunking the text.
- Introduced properties for `chunk-size`, `chunk-overlap`, `model-name`, `start-index`, and `end-index` for detailed configuration.
- Enhanced the existing JSON schema to include the new chunking strategies within the `strategy` properties.
- Ensured all new properties are properly documented and formatted for clarity and usability.
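
For reviewers who want a concrete picture of the two strategies, here is a minimal Python sketch of how they could behave. It is not the code in this PR: the function names (`window`, `regex_chunks`, `substring_chunks`) are hypothetical, sizes and indices are assumed to be measured in characters, the end index is assumed to be exclusive, and `model-name` (presumably used for token counting) is ignored entirely.

```python
import re


def window(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Slide a character window over text; consecutive windows share chunk_overlap chars."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        i += step
    return chunks


def regex_chunks(text: str, pattern: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Assumed behavior: split on `pattern`, then window any piece longer than chunk_size."""
    pieces = [p for p in re.split(pattern, text) if p]  # drop empty pieces
    out = []
    for piece in pieces:
        out.extend(window(piece, chunk_size, chunk_overlap))
    return out


def substring_chunks(text: str, start_index: int, end_index: int,
                     chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Assumed behavior: take text[start_index:end_index], then window the result."""
    if not 0 <= start_index <= end_index <= len(text):
        raise ValueError("indices must satisfy 0 <= start_index <= end_index <= len(text)")
    return window(text[start_index:end_index], chunk_size, chunk_overlap)


if __name__ == "__main__":
    print(regex_chunks("alpha\n\nbeta gamma\n\ndelta", r"\n\n", chunk_size=6, chunk_overlap=2))
    print(substring_chunks("0123456789", 2, 8, chunk_size=4, chunk_overlap=1))
```

Under these assumptions, the regex splits the text first and the size limit applies only within each piece, so chunks never span a pattern match. Whether the actual implementation merges small pieces or counts tokens instead of characters should be checked against the diff.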