Implementation of long late chunking #20

Open · Fleandre opened this issue Oct 23, 2024 · 3 comments
@Fleandre

I noticed that in the implementation of long late chunking, line 147 calculates the split indices based on the macro chunk size and the overlap. Does this step need to account for the instruction and special tokens (e.g., CLS, EOS) that are added to each macro chunk?

(Screenshot of the referenced code attached to the original comment.)
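
For context, here is a minimal sketch of the kind of split-index computation being asked about (hypothetical function and parameter names, not the actual code at line 147):

```python
# Hypothetical sketch: compute macro-chunk (start, end) token indices from a
# chunk size and an overlap (illustrative only, not the repository's line 147).
def macro_chunk_indices(n_tokens: int, chunk_size: int, overlap: int):
    step = chunk_size - overlap  # each macro chunk advances by this many tokens
    indices, start = [], 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        indices.append((start, end))
        if end == n_tokens:
            break
        start += step
    return indices

print(macro_chunk_indices(10, chunk_size=5, overlap=2))
# [(0, 5), (3, 8), (6, 10)]
```
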
@guenthermi (Member)

_embed_with_overlap takes the token sequence as input and outputs the sequence of token embeddings. It therefore receives the full tokenization of the input, which includes all special tokens and instructions, and which can be longer than the number of tokens the model can fit. The model is then passed a sliding window of the input tokens in each iteration. This means, for example, that the model only receives a [CLS] input token in the first iteration but not in the second one (as it is only added at the beginning of the sequence). After all token embeddings have been calculated by the function, the actual chunking is done outside of the model, based on the annotations. Does this answer your question?
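
To make the mechanics concrete, here is a simplified sketch of such a sliding-window loop, assuming a Hugging Face-style encoder (illustrative names; this is not the repository's exact `_embed_with_overlap`):

```python
import torch

@torch.no_grad()
def embed_windows(model, input_ids: torch.Tensor, window: int, overlap: int):
    """Yield (start, token_embeddings) for each sliding window.

    input_ids has shape (1, n_tokens) and holds the *full* tokenization,
    special tokens and instruction included, which may be longer than the
    model's maximum sequence length.
    """
    step = window - overlap
    n = input_ids.shape[1]
    start = 0
    while start < n:
        end = min(start + window, n)
        # Only the first window begins at position 0, so only it contains
        # the [CLS]/instruction tokens added at the start of the sequence.
        hidden = model(input_ids[:, start:end]).last_hidden_state[0]
        yield start, hidden  # shape: (end - start, hidden_dim)
        if end == n:
            break
        start += step
```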

@Fleandre (Author)

This is exactly what I wanted to ask. Across the iterations, the model receives the complete instruction and the CLS token only in the first iteration; these are not part of the input in subsequent iterations. Since the embedding model is stateless across iterations, the input the model receives is incomplete in every iteration except the first. Is this understanding correct?

@Fleandre reopened this Oct 28, 2024

@guenthermi (Member)

Yes, in this sense it is incomplete. Since the tokenizer generates a [SEP] token at the end for most models, the first iteration is incomplete in this sense as well. Also note that the token sequences passed to the model overlap, i.e., every macro chunk except the first one gets a certain number of tokens from the previous sequence so that this context is not lost. The embeddings of those additional tokens are not used in the end.
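
As a toy illustration of that last point (made-up numbers; the "embeddings" here are just token positions so the bookkeeping stays visible):

```python
import torch

# With a window of 5 tokens and an overlap of 2, the second window re-embeds
# tokens 3 and 4; those duplicate embeddings are dropped before concatenation.
overlap = 2
window_outputs = [
    (0, torch.arange(5)),     # embeddings for tokens 0..4
    (3, torch.arange(3, 8)),  # embeddings for tokens 3..7 (3 and 4 overlap)
]
stitched = torch.cat(
    [out if start == 0 else out[overlap:] for start, out in window_outputs]
)
print(stitched)  # tensor([0, 1, 2, 3, 4, 5, 6, 7]) -- one embedding per token
```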
