Currently it is possible to do either of these:
Use a language-specific SentencePiece model and subword vocabulary, together with a language-specific embedding matrix. This is the default usage.
Set a pseudo-language-code "all". This forces all languages to use the same SentencePiece model, the same subword vocabulary, and a shared embedding matrix.
It is not possible to reduce the number of parameters by sharing the embedding matrix while allowing each language to use only a subset of it.
E.g. if language A has the subwords {a, b} and language B has the subwords {b, c}, then the joint set is {a, b, c}. Both languages use the same embedding for the shared subword "b". When decoding language A, the softmax is only over {a, b}; it is not possible to decode "c".
(It would be easier, but slower, if the softmax were over the entire embedding matrix; the question is then whether "c" should be mapped to <unk> when decoding A. Not doing so would mean actually having a shared subword vocabulary, but language-specific SentencePiece models.)
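The full-softmax variant could still forbid decoding "c" for language A by masking disallowed logits before the softmax. A minimal sketch, assuming PyTorch and a toy joint vocabulary (all names and values here are illustrative, not from the actual implementation):

```python
import torch

# Toy joint vocabulary {a, b, c}; language A may only emit {a, b}.
allowed_for_A = torch.tensor([True, True, False])
logits = torch.tensor([1.0, 2.0, 3.0])  # logits over the full joint vocabulary

# Mask disallowed entries to -inf before the softmax: "c" gets
# probability exactly 0 and can never be decoded for language A.
masked = logits.masked_fill(~allowed_for_A, float("-inf"))
probs = torch.softmax(masked, dim=-1)
# probs[2] is 0.0; the probability mass is distributed over {a, b} only.
```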
Implementing this would require:
Joining all language-specific subword vocabularies into a joint vocabulary. Each language should retain a set of indices into this joint vocabulary.
Initializing a single large embedding matrix for the joint vocabulary.
Creating a language-specific view of the embedding matrix, by slicing it according to the language's indices.
Using the view as the embedding matrix in the forward pass.
After the backward pass, the original full embedding matrix should have a gradient (while the view does not have a gradient) [1].
All devices communicate and apply the gradient of the full embedding matrix (all-reduce across all devices).
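The steps above (minus the cross-device all-reduce) can be sketched in PyTorch; the vocabularies and names below are hypothetical toy data, not the project's actual API:

```python
import torch
import torch.nn as nn

# Toy per-language subword vocabularies (hypothetical example data).
lang_vocabs = {"A": ["a", "b"], "B": ["b", "c"]}

# 1. Join all language-specific vocabularies into one joint vocabulary;
#    each language retains a tensor of indices into it.
joint_vocab = sorted({tok for vocab in lang_vocabs.values() for tok in vocab})
joint_index = {tok: i for i, tok in enumerate(joint_vocab)}
lang_indices = {
    lang: torch.tensor([joint_index[tok] for tok in vocab])
    for lang, vocab in lang_vocabs.items()
}

# 2. A single large embedding matrix for the joint vocabulary.
emb_dim = 4
embedding = nn.Embedding(len(joint_vocab), emb_dim)

# 3./4. A language-specific view of the matrix, used in the forward pass.
#    index_select is differentiable, so the view needs no parameters of its own.
lang = "A"
view = embedding.weight.index_select(0, lang_indices[lang])

hidden = torch.randn(1, emb_dim)        # stand-in for a decoder state
logits = hidden @ view.t()              # softmax is only over {a, b}
loss = logits.logsumexp(dim=-1).sum()   # dummy loss for the demo
loss.backward()

# 5. The gradient lands on the full embedding matrix, not on the view;
#    rows outside language A's subset (here "c") stay zero.
assert embedding.weight.grad is not None
assert torch.all(embedding.weight.grad[joint_index["c"]] == 0)
```

Because `index_select` builds the view inside the autograd graph, the backward pass accumulates directly into `embedding.weight.grad`, which is exactly the tensor the all-reduce in the last step would operate on.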
vocab_size ≤ joint_vocab_size ≤ n_tasks_total * vocab_size
Assuming small subword vocabularies with a fairly large overlap between languages:
VRAM usage should be reduced: instead of having n_tasks_on_device * vocab_size embeddings loaded, we have joint_vocab_size embeddings. As the embedding size does not depend on the number of tasks per device, the latter can be increased if the number of language-specific parameters is otherwise low.
Communication would be increased: instead of communicating vocab_size embeddings to a small group, we would communicate the larger joint_vocab_size embeddings to all devices.
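A back-of-the-envelope comparison makes the trade-off concrete. All numbers below are made-up assumptions for illustration, not measurements:

```python
# Hypothetical setup: 10 languages total, 4 per device, 8k subwords per
# language, embedding dim 512, fp32 parameters (4 bytes each). With a
# fairly large overlap, assume a joint vocabulary of 20k subwords
# (it must lie between vocab_size and n_tasks_total * vocab_size).
n_tasks_total, n_tasks_on_device = 10, 4
vocab_size, joint_vocab_size = 8_000, 20_000
emb_dim, bytes_per_param = 512, 4
assert vocab_size <= joint_vocab_size <= n_tasks_total * vocab_size

# Bytes of embedding parameters held on one device.
per_lang_vram = n_tasks_on_device * vocab_size * emb_dim * bytes_per_param
shared_vram = joint_vocab_size * emb_dim * bytes_per_param

# Bytes of embedding gradient each scheme communicates per step.
per_lang_comm = vocab_size * emb_dim * bytes_per_param      # small group
shared_comm = joint_vocab_size * emb_dim * bytes_per_param  # all devices

print(per_lang_vram, shared_vram)  # 65_536_000 vs 40_960_000: VRAM down
print(per_lang_comm, shared_comm)  # 16_384_000 vs 40_960_000: comm up
```

Under these assumptions the shared scheme saves VRAM whenever joint_vocab_size < n_tasks_on_device * vocab_size, while the communicated gradient grows from vocab_size to joint_vocab_size rows.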
[1] pytorch/pytorch#19778