Implementation of long late chunking #20
This is exactly what I wanted to ask. Across the multiple iterations, the model receives the complete instruction and the CLS token only in the first iteration; they are missing from the input in subsequent iterations. Since the embedding model is stateless across iterations, the input it receives is incomplete in every iteration except the first. Is this understanding correct?
Yes, in this sense it is incomplete. Since the tokenizer generates a [SEP] token at the end for most models, the first macro chunk is incomplete in this sense as well. Also note that the token sequences passed to the model overlap: every macro chunk except the first receives a certain number of tokens from the preceding sequence so that this context is not missed. The embeddings of those additional overlap tokens are not used at the end.
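For illustration, here is a minimal sketch of that overlapping macro-chunk scheme. The model name, parameter names, and default sizes are assumptions for the sake of the example, not the repository's actual code; it only demonstrates prefixing each later macro chunk with the last tokens of the previous one and discarding the embeddings of that prefix.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed model and sizes, chosen only to make the sketch concrete.
MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
model.eval()


def token_embeddings_long(text: str, macro_chunk_size: int = 8192, overlap: int = 512) -> torch.Tensor:
    """Embed a text longer than the model window, one vector per token."""
    # Tokenize the whole document once, without truncation or special tokens.
    ids = tokenizer(text, add_special_tokens=False, return_tensors="pt")["input_ids"][0]
    step = macro_chunk_size - overlap
    outputs = []
    for start in range(0, len(ids), step):
        # Every macro chunk except the first is prefixed with the last
        # `overlap` tokens of the previous sequence to carry over context.
        prefix = overlap if start > 0 else 0
        piece = ids[start - prefix : start + step]
        with torch.no_grad():
            hidden = model(input_ids=piece.unsqueeze(0)).last_hidden_state[0]
        # Discard the embeddings of the prefix tokens: they were only passed
        # as context and are already covered by the previous macro chunk.
        outputs.append(hidden[prefix:])
    return torch.cat(outputs, dim=0)  # shape: (num_tokens, hidden_dim)
```

Note that this sketch passes no instruction or CLS/SEP tokens at all, which is exactly the gap discussed above: only the first macro chunk would normally carry them.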
I noticed that in the implementation of long late chunking, line 147 computes the split indices from the macro chunk size and the overlap. Does this step need to account for the instruction and the special tokens such as CLS, EOS, etc. that are added to each macro chunk?
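To make the question concrete, here is one hedged way such a split-index computation could reserve room for special tokens. The function name, parameters, and the default of two reserved tokens are assumptions for illustration only, not what the repository actually does at line 147:

```python
def macro_chunk_bounds(num_tokens: int, max_seq_len: int, overlap: int,
                       num_special_tokens: int = 2) -> list[tuple[int, int]]:
    """Split `num_tokens` text tokens into overlapping macro chunks.

    `num_special_tokens` reserves space for e.g. [CLS] and [SEP]/EOS so that
    text tokens plus special tokens still fit into `max_seq_len`. Whether an
    instruction prefix must also be reserved depends on the model and is not
    handled here.
    """
    effective = max_seq_len - num_special_tokens  # room left for text tokens
    step = effective - overlap
    bounds = []
    start = 0
    while start < num_tokens:
        begin = max(start - overlap, 0)  # prefix overlap, except first chunk
        bounds.append((begin, min(start + step, num_tokens)))
        start += step
    return bounds


# Example: 10 text tokens, window of 6, overlap of 1, 2 special tokens
# -> [(0, 3), (2, 6), (5, 9), (8, 10)]
print(macro_chunk_bounds(10, 6, 1))
```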