feat: Add Chunking Strategies: Regex and Substring Methods #735

Open
wants to merge 3 commits into
base: main

NailaRais
Because

This PR adds Regex and Substring chunking methods to our text processing capabilities. Users can now define custom chunking rules through regular expressions or predefined indices, which gives them more flexibility and efficiency when dealing with complex text formats. The update aligns with our project requirements for diverse text processing strategies and lays the groundwork for future enhancements based on user feedback.

This commit

  1. Added Regex Chunking Method:

Implemented the Regex chunking method, allowing users to specify custom regular expression patterns for text splitting.
Introduced properties for chunk-size, chunk-overlap, model-name, and pattern to configure the chunking behavior.

  2. Added Substring Chunking Method:

Implemented the Substring chunking method, enabling users to define start and end indices for chunking the text.
Introduced properties for chunk-size, chunk-overlap, model-name, start-index, and end-index for detailed configuration.

  3. Updated JSON Schema:

Enhanced the existing JSON schema to include the new chunking strategies within the strategy properties.
Ensured all new properties are properly documented and formatted for clarity and usability.
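
To make the intended behavior of the two strategies concrete, here is a minimal Python sketch. It is an illustration, not the code in this PR: the function names (`sliding_windows`, `regex_chunks`, `substring_chunks`) are hypothetical, sizes are assumed to be measured in characters, and `model-name` is left out because the description does not say how it affects chunking. The parameters mirror the `pattern`, `chunk-size`, `chunk-overlap`, `start-index`, and `end-index` properties listed above.

```python
import re


def sliding_windows(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters (assumption:
    sizes are counted in characters, not tokens).
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def regex_chunks(text: str, pattern: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Hypothetical Regex strategy: split on pattern, then window each piece."""
    chunks: list[str] = []
    for piece in re.split(pattern, text):
        # re.split can yield empty strings (or None with capture groups); skip them.
        if piece:
            chunks.extend(sliding_windows(piece, chunk_size, chunk_overlap))
    return chunks


def substring_chunks(text: str, start_index: int, end_index: int,
                     chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Hypothetical Substring strategy: window only text[start_index:end_index]."""
    if not 0 <= start_index <= end_index <= len(text):
        raise ValueError("need 0 <= start_index <= end_index <= len(text)")
    return sliding_windows(text[start_index:end_index], chunk_size, chunk_overlap)
```

For example, `regex_chunks("alpha\n\nbeta gamma\n\ndelta", r"\n{2,}", chunk_size=6, chunk_overlap=2)` splits on blank lines first and then windows the one piece that is longer than six characters, while `substring_chunks("abcdefghij", 2, 9, chunk_size=4, chunk_overlap=1)` only ever looks at `"cdefghi"`.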

@NailaRais NailaRais changed the title Add Chunking Strategies: Regex and Substring Methods Improvement by add Chunking Strategies: Regex and Substring Methods Oct 14, 2024
@NailaRais NailaRais changed the title Improvement by add Chunking Strategies: Regex and Substring Methods feat: Add Chunking Strategies: Regex and Substring Methods Oct 14, 2024
@kuroxx kuroxx linked an issue Oct 14, 2024 that may be closed by this pull request
Successfully merging this pull request may close these issues.

[Text] Regular expression for data cleansing
3 participants