
How to implement late chunking when my context limit is more than 8192 tokens? #2

Open
venkatana-kore opened this issue Sep 10, 2024 · 2 comments

Comments

@venkatana-kore

Jina.ai supports a token limit of 8192 for generating embeddings. For late chunking, if my context is longer than 8192 tokens, what are the best strategies to implement it?

@guenthermi
Member

I think if you have very long documents, not all of the context might be necessary. So if you can split the text into chapters or longer sections, there might be enough context for the embedding model to interpret all of the tokens correctly. Otherwise, you can also pass a bit more text before and after the chunk you are interested in. Maybe also adding summaries before the text chunks could further improve it, but I haven't tried something like that.
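
A minimal sketch of the "extra surrounding text" idea, assuming a Hugging Face transformers model (the model name, window construction, and mean pooling are illustrative, not something prescribed in this thread): embed the chunk together with some text before and after it, then pool only the token embeddings of the chunk itself.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = 'jinaai/jina-embeddings-v2-small-en'  # supports 8192 tokens
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

def embed_chunk_with_context(before: str, chunk: str, after: str) -> torch.Tensor:
    """Embed `chunk` while letting the model attend to surrounding text."""
    # Tokenize the pieces separately so we know where the chunk starts and ends.
    ids_before = tokenizer(before, add_special_tokens=False)['input_ids']
    ids_chunk = tokenizer(chunk, add_special_tokens=False)['input_ids']
    ids_after = tokenizer(after, add_special_tokens=False)['input_ids']

    # The combined window must stay within the model's 8192-token limit.
    input_ids = torch.tensor([ids_before + ids_chunk + ids_after])
    with torch.no_grad():
        token_embeddings = model(input_ids=input_ids).last_hidden_state[0]

    # Mean-pool only the tokens of the chunk of interest; the surrounding
    # tokens have already contributed context through attention.
    start = len(ids_before)
    end = start + len(ids_chunk)
    return token_embeddings[start:end].mean(dim=0)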

@guenthermi
Member

Now we have implemented a strategy that uses overlapping macro chunks to solve this problem. Just set --long-late-chunking-embed-size to the maximum context length of the model that you are using and it will automatically use this strategy.

Here is the argument in the script:

@click.option(
'--long-late-chunking-embed-size',
default=DEFAULT_LONG_LATE_CHUNKING_EMBED_SIZE,
type=int,
help='Token length of the embeddings that come before/after soft boundaries (i.e. overlapping embeddings). Above zero, overlap is used between neighbouring embeddings.',
)
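
For example (assuming the evaluation script is called run_chunked_eval.py; the script name here is an assumption, the flag is the one quoted above), with a model whose context window is 8192 tokens:

python3 run_chunked_eval.py --model-name jinaai/jina-embeddings-v2-small-en --long-late-chunking-embed-size 8192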

For more information on how it works, take a look at Section 3.1 of the paper: https://arxiv.org/pdf/2409.04701
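
Conceptually, the strategy works roughly like the following sketch (a simplified reading of Section 3.1, not the exact implementation; embed_tokens is a hypothetical helper that runs the model once and returns one embedding per token, and the sizes are illustrative):

def long_late_chunking(token_ids, embed_size=8192, overlap=512):
    """Embed a token sequence longer than the model limit by splitting it
    into overlapping macro chunks; returns one embedding per token."""
    token_embeddings = []
    start = 0
    while start < len(token_ids):
        macro = token_ids[start:start + embed_size]
        embs = embed_tokens(macro)  # hypothetical helper: (len(macro), dim)
        # The first `overlap` tokens of every macro chunk after the first
        # were already embedded (with more left context) in the previous
        # macro chunk, so here they only serve as context.
        skip = 0 if start == 0 else overlap
        token_embeddings.extend(embs[skip:])
        if start + embed_size >= len(token_ids):
            break
        start += embed_size - overlap
    # Pool slices of token_embeddings per fine-grained chunk, as in
    # ordinary late chunking.
    return token_embeddings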
