Questions about token-level features and causal attention mechanism #230
jingedawang asked this question in Q&A (unanswered)
I've read the book, but a couple of things are still not clear in my mind.
Question 1
In the training stage, we want the logits at each token position to correspond to the respective token in the target. The model architecture processes the token-level features in parallel. But in the inference stage, we only need the logits at the last position, which correspond to the token that follows the last token of the input text. The logits before it are not used. Can anyone give an intuitive explanation of what these unused logits stand for? Why are they still necessary in the inference stage?
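To make the question concrete, here is a minimal sketch of the inference step I mean (assuming a PyTorch GPT-style model that maps token IDs of shape `[batch, seq_len]` to logits of shape `[batch, seq_len, vocab_size]`; `model` and the function name are my own placeholders, not the book's code):

```python
import torch

def generate_next_token(model, input_ids):
    # input_ids: [batch, seq_len] token IDs
    with torch.no_grad():
        logits = model(input_ids)      # [batch, seq_len, vocab_size]
    last_logits = logits[:, -1, :]     # only the last position is kept
    next_token = torch.argmax(last_logits, dim=-1, keepdim=True)
    return next_token                  # [batch, 1]
```

The model still computes logits for every earlier position; my question is what those discarded rows represent and why computing them cannot simply be skipped.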
Question 2
The causal attention mechanism is enforced by the upper-triangular mask applied to the attention weight matrix. That part is easy to understand. But when we apply another linear layer afterwards and repeat the transformer block many times, how is causality still preserved?
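For reference, this is the masking step I have in mind (a minimal sketch for a single head, assuming attention scores of shape `[seq_len, seq_len]`; the variable names are my own):

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)   # raw query-key attention scores

# Upper-triangular mask: position i must not attend to positions j > i
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)  # each row sums to 1 over allowed positions only
```

The mask clearly restricts a single attention operation, but I don't see why the restriction still holds after the outputs are mixed by the subsequent linear layers and by many stacked blocks.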