Currently it is possible to do either of these:
Use a language-specific SentencePiece model and subword vocabulary, together with a language-specific embedding matrix. This is the default usage.
Set a pseudo-language-code "all". This forces all languages to use the same SentencePiece model, the same subword vocabulary, and a shared embedding matrix.
It is not possible to reduce the number of parameters by sharing the embedding matrix while allowing each language to use only a subset of it.
E.g. if language A has the subwords {a, b} and language B has the subwords {b, c}, then the joint set is {a, b, c}. Both languages use the same embedding for the shared subword "b". When decoding language A, the softmax is only over {a, b}; it is not possible to decode "c".
(It would be easier, but slower, if the softmax were over the entire embedding matrix; the question is then whether "c" should be mapped to <unk> when decoding A. Not doing so would mean actually having a shared subword vocabulary, but language-specific SentencePiece models.)
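The full-softmax variant could still forbid decoding "c" for language A by masking disallowed logits before the softmax. A minimal sketch, assuming PyTorch and a toy joint vocabulary (all names and values here are illustrative, not from the actual implementation):

```python
import torch

# Toy joint vocabulary {a, b, c}; language A may only emit {a, b}.
allowed_for_A = torch.tensor([True, True, False])
logits = torch.tensor([1.0, 2.0, 3.0])  # logits over the full joint vocabulary

# Mask disallowed entries to -inf before the softmax: "c" gets
# probability exactly 0 and can never be decoded for language A.
masked = logits.masked_fill(~allowed_for_A, float("-inf"))
probs = torch.softmax(masked, dim=-1)
# probs[2] is 0.0; the probability mass is distributed over {a, b} only.
```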
Implementing this would require:
Joining all language-specific subword vocabularies into a joint vocabulary. Each language should retain a set of indices into this joint vocabulary.
Initializing a single large embedding matrix for the joint vocabulary.
Creating a language-specific view of the embedding matrix, by slicing it according to the language's indices.
Using the view as the embedding matrix in the forward pass.
After the backward pass, the original full embedding matrix should have a gradient (while the view does not have a gradient) [1].
All devices communicate and apply the gradient of the full embedding matrix (all-reduce across all devices).
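The steps above (minus the cross-device all-reduce) can be sketched in PyTorch; the vocabularies and names below are hypothetical toy data, not the project's actual API:

```python
import torch
import torch.nn as nn

# Toy per-language subword vocabularies (hypothetical example data).
lang_vocabs = {"A": ["a", "b"], "B": ["b", "c"]}

# 1. Join all language-specific vocabularies into one joint vocabulary;
#    each language retains a tensor of indices into it.
joint_vocab = sorted({tok for vocab in lang_vocabs.values() for tok in vocab})
joint_index = {tok: i for i, tok in enumerate(joint_vocab)}
lang_indices = {
    lang: torch.tensor([joint_index[tok] for tok in vocab])
    for lang, vocab in lang_vocabs.items()
}

# 2. A single large embedding matrix for the joint vocabulary.
emb_dim = 4
embedding = nn.Embedding(len(joint_vocab), emb_dim)

# 3./4. A language-specific view of the matrix, used in the forward pass.
#    index_select is differentiable, so the view needs no parameters of its own.
lang = "A"
view = embedding.weight.index_select(0, lang_indices[lang])

hidden = torch.randn(1, emb_dim)        # stand-in for a decoder state
logits = hidden @ view.t()              # softmax is only over {a, b}
loss = logits.logsumexp(dim=-1).sum()   # dummy loss for the demo
loss.backward()

# 5. The gradient lands on the full embedding matrix, not on the view;
#    rows outside language A's subset (here "c") stay zero.
assert embedding.weight.grad is not None
assert torch.all(embedding.weight.grad[joint_index["c"]] == 0)
```

Because `index_select` builds the view inside the autograd graph, the backward pass accumulates directly into `embedding.weight.grad`, which is exactly the tensor the all-reduce in the last step would operate on.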
vocab_size ≤ joint_vocab_size ≤ n_tasks_total * vocab_size
Assuming small subword vocabularies with a fairly large overlap between languages:
VRAM usage should be reduced: instead of having n_tasks_on_device * vocab_size embeddings loaded, we have joint_vocab_size embeddings. As the embedding size does not depend on the number of tasks per device, the latter can be increased if the number of language-specific parameters is otherwise low.
Communication would be increased: instead of communicating vocab_size embeddings to a small group, we would communicate the larger joint_vocab_size embeddings to all devices.
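A back-of-the-envelope comparison makes the trade-off concrete. All numbers below are made-up assumptions for illustration, not measurements:

```python
# Hypothetical setup: 10 languages total, 4 per device, 8k subwords per
# language, embedding dim 512, fp32 parameters (4 bytes each). With a
# fairly large overlap, assume a joint vocabulary of 20k subwords
# (it must lie between vocab_size and n_tasks_total * vocab_size).
n_tasks_total, n_tasks_on_device = 10, 4
vocab_size, joint_vocab_size = 8_000, 20_000
emb_dim, bytes_per_param = 512, 4
assert vocab_size <= joint_vocab_size <= n_tasks_total * vocab_size

# Bytes of embedding parameters held on one device.
per_lang_vram = n_tasks_on_device * vocab_size * emb_dim * bytes_per_param
shared_vram = joint_vocab_size * emb_dim * bytes_per_param

# Bytes of embedding gradient each scheme communicates per step.
per_lang_comm = vocab_size * emb_dim * bytes_per_param      # small group
shared_comm = joint_vocab_size * emb_dim * bytes_per_param  # all devices

print(per_lang_vram, shared_vram)  # 65_536_000 vs 40_960_000: VRAM down
print(per_lang_comm, shared_comm)  # 16_384_000 vs 40_960_000: comm up
```

Under these assumptions the shared scheme saves VRAM whenever joint_vocab_size < n_tasks_on_device * vocab_size, while the communicated gradient grows from vocab_size to joint_vocab_size rows.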
[1] pytorch/pytorch#19778