Implementation of long late chunking #20

Open · Fleandre opened this issue Oct 23, 2024 · 3 comments
@Fleandre

I noticed that in the implementation of long late chunking, line 147 calculates the split indices based on the macro chunk size and the overlap. Does this step need to account for the instruction and special tokens (e.g., CLS, EOS) that are added to each macro chunk?

(Screenshot of the referenced code attached to the original comment.)
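
For context, here is a minimal sketch of the kind of split-index computation being asked about (hypothetical function and parameter names, not the actual code at line 147):

```python
# Hypothetical sketch: compute macro-chunk (start, end) token indices from a
# chunk size and an overlap (illustrative only, not the repository's line 147).
def macro_chunk_indices(n_tokens: int, chunk_size: int, overlap: int):
    step = chunk_size - overlap  # each macro chunk advances by this many tokens
    indices, start = [], 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        indices.append((start, end))
        if end == n_tokens:
            break
        start += step
    return indices

print(macro_chunk_indices(10, chunk_size=5, overlap=2))
# [(0, 5), (3, 8), (6, 10)]
```
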
@guenthermi (Member)

_embed_with_overlap takes the token sequence as input and outputs the sequence of token embeddings. It therefore receives the full tokenization of the input, which includes all special tokens and instructions, and which can be longer than the number of tokens the model can fit. The model is then passed a sliding window of the input tokens in each iteration. This means, for example, that the model only receives a [CLS] input token in the first iteration but not in the second one (as it is only added at the beginning of the sequence). After all token embeddings have been calculated by the function, the actual chunking is done outside of the model, based on the annotations. Does this answer your question?
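
To make the mechanics concrete, here is a simplified sketch of such a sliding-window loop, assuming a Hugging Face-style encoder (illustrative names; this is not the repository's exact `_embed_with_overlap`):

```python
import torch

@torch.no_grad()
def embed_windows(model, input_ids: torch.Tensor, window: int, overlap: int):
    """Yield (start, token_embeddings) for each sliding window.

    input_ids has shape (1, n_tokens) and holds the *full* tokenization,
    special tokens and instruction included, which may be longer than the
    model's maximum sequence length.
    """
    step = window - overlap
    n = input_ids.shape[1]
    start = 0
    while start < n:
        end = min(start + window, n)
        # Only the first window begins at position 0, so only it contains
        # the [CLS]/instruction tokens added at the start of the sequence.
        hidden = model(input_ids[:, start:end]).last_hidden_state[0]
        yield start, hidden  # shape: (end - start, hidden_dim)
        if end == n:
            break
        start += step
```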

@Fleandre (Author)

This is exactly what I wanted to ask. Across the iterations, the model receives the complete instruction and the CLS token only in the first iteration; these are not part of the input in subsequent iterations. Since the embedding model is stateless across iterations, the input the model receives is incomplete in every iteration except the first. Is this understanding correct?

@Fleandre reopened this Oct 28, 2024

@guenthermi (Member)

Yes, in this sense it is incomplete. Since the tokenizer generates a [SEP] token at the end for most models, the first iteration is incomplete in this sense as well. Also note that the token sequences passed to the model overlap, i.e., every macro chunk except the first one gets a certain number of tokens from the previous sequence so that this context is not lost. The embeddings of those additional tokens are not used in the end.
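
As a toy illustration of that last point (made-up numbers; the "embeddings" here are just token positions so the bookkeeping stays visible):

```python
import torch

# With a window of 5 tokens and an overlap of 2, the second window re-embeds
# tokens 3 and 4; those duplicate embeddings are dropped before concatenation.
overlap = 2
window_outputs = [
    (0, torch.arange(5)),     # embeddings for tokens 0..4
    (3, torch.arange(3, 8)),  # embeddings for tokens 3..7 (3 and 4 overlap)
]
stitched = torch.cat(
    [out if start == 0 else out[overlap:] for start, out in window_outputs]
)
print(stitched)  # tensor([0, 1, 2, 3, 4, 5, 6, 7]) -- one embedding per token
```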
