Questions about token-level features and causal attention mechanism #230
jingedawang asked this question in Q&A (unanswered)
I've read the book, but a couple of things are still not clear in my mind.
Question 1
In the training stage, we want the logits at each token position to correspond to the respective token in the target. The model architecture processes the token-level features in parallel. But in the inference stage, we only need the logits at the last position, which correspond to the token that follows the last token of the input text. The logits before it are not used. Can anyone give an intuitive explanation of what these unused logits stand for? Why are they still necessary in the inference stage?
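To make the question concrete, here is a minimal sketch of the inference step I mean (assuming a PyTorch GPT-style model that maps token IDs of shape `[batch, seq_len]` to logits of shape `[batch, seq_len, vocab_size]`; `model` and the function name are my own placeholders, not the book's code):

```python
import torch

def generate_next_token(model, input_ids):
    # input_ids: [batch, seq_len] token IDs
    with torch.no_grad():
        logits = model(input_ids)      # [batch, seq_len, vocab_size]
    last_logits = logits[:, -1, :]     # only the last position is kept
    next_token = torch.argmax(last_logits, dim=-1, keepdim=True)
    return next_token                  # [batch, 1]
```

The model still computes logits for every earlier position; my question is what those discarded rows represent and why computing them cannot simply be skipped.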
Question 2
The causal attention mechanism is enforced by the upper-triangular mask applied to the attention weight matrix. That part is easy to understand. But when we apply another linear layer afterwards and repeat the transformer block many times, how is causality still preserved?
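For reference, this is the masking step I have in mind (a minimal sketch for a single head, assuming attention scores of shape `[seq_len, seq_len]`; the variable names are my own):

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)   # raw query-key attention scores

# Upper-triangular mask: position i must not attend to positions j > i
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)  # each row sums to 1 over allowed positions only
```

The mask clearly restricts a single attention operation, but I don't see why the restriction still holds after the outputs are mixed by the subsequent linear layers and by many stacked blocks.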