New nvtext::wordpiece_tokenizer APIs #17600

Draft: wants to merge 18 commits into base branch-25.02


davidwendt
Contributor

Description

Creates a new word-piece tokenizer that replaces the existing subword-tokenizer in nvtext.
The subword-tokenizer logic is split out and specialized to perform basic tokenizing with the word-piece logic only.
The normalizing part is already a separate API. The output is returned only as a lists column of tokens.
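
For readers unfamiliar with word-piece tokenizing, here is a minimal host-side sketch of the greedy longest-match-first scheme commonly used for it. It is not the code in this PR and does not use the nvtext API. The function names, the `##` continuation prefix, the `[UNK]` token, and the whitespace splitting are all assumptions for illustration, and the input is assumed to be already normalized ASCII text.

```cpp
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of nvtext): greedy longest-match-first
// word-piece tokenizing of one already-normalized ASCII word.
std::vector<std::string> wordpiece_tokenize_word(
  const std::string& word,
  const std::unordered_set<std::string>& vocab,
  const std::string& unknown = "[UNK]")  // assumed unknown-token spelling
{
  std::vector<std::string> tokens;
  std::size_t start = 0;
  while (start < word.size()) {
    std::size_t end = word.size();
    std::string match;
    // Shrink the candidate from the right until it is found in the vocabulary.
    while (end > start) {
      std::string piece = word.substr(start, end - start);
      if (start > 0) { piece = "##" + piece; }  // assumed continuation prefix
      if (vocab.count(piece) > 0) {
        match = piece;
        break;
      }
      --end;
    }
    // If no piece matches at this position, the whole word maps to the unknown token.
    if (match.empty()) { return {unknown}; }
    tokens.push_back(match);
    start = end;
  }
  return tokens;
}

// Hypothetical driver: splits each row on whitespace and returns one token
// list per row, mirroring the shape of a lists column of tokens on the host.
std::vector<std::vector<std::string>> wordpiece_tokenize(
  const std::vector<std::string>& rows,
  const std::unordered_set<std::string>& vocab)
{
  std::vector<std::vector<std::string>> result;
  for (const auto& row : rows) {
    std::vector<std::string> row_tokens;
    std::istringstream stream(row);
    std::string word;
    while (stream >> word) {
      auto pieces = wordpiece_tokenize_word(word, vocab);
      row_tokens.insert(row_tokens.end(), pieces.begin(), pieces.end());
    }
    result.push_back(row_tokens);
  }
  return result;
}
```

With a vocabulary of `un`, `##aff`, `##able`, `play`, and `##ing`, the row `unaffable playing` yields the pieces `un`, `##aff`, `##able`, `play`, `##ing`, while a word with no matching pieces yields a single `[UNK]`. A GPU implementation would parallelize this across rows and words; the sketch only illustrates the matching rule and the one-list-per-row output shape.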

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the feature request, 2 - In Progress, libcudf, and strings labels Dec 16, 2024
@davidwendt self-assigned this Dec 16, 2024

copy-pr-bot bot commented Dec 16, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions bot added the CMake label Dec 16, 2024