Shared embeddings without a shared subword vocabulary #50

Open
Waino opened this issue Feb 5, 2024 · 0 comments
Labels
enhancement New feature or request

Waino commented Feb 5, 2024

Currently it is possible to do either of these:

  1. Use a language-specific sentencepiece model and subword vocabulary, together with a language-specific embedding matrix. This is the default usage.
  2. Set the pseudo-language-code "all". This forces all languages to use the same sentencepiece model, the same subword vocabulary, and a shared embedding matrix.

It is not currently possible to reduce the number of parameters by sharing the embedding matrix while allowing each language to use only a subset of it.

E.g. if language A has the subwords {a, b} and language B has the subwords {b, c}, then the joint set is {a, b, c}. Both languages use the same embedding for the shared subword "b". When decoding language A, the softmax is only over {a, b}, so it is not possible to decode c.

(It would be easier, but slower, if the softmax were over the entire embedding matrix: then the question is whether c should be mapped to `<unk>` when decoding A. Not doing so would mean actually having a shared subword vocabulary but language-specific sentencepiece models.)
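To make the toy example above concrete, here is a minimal sketch in plain Python (all names are illustrative, not existing code in this repo) of joining the vocabularies and retaining per-language index sets:

```python
lang_vocabs = {"A": ["a", "b"], "B": ["b", "c"]}

# Joint vocabulary: sorted union of all language-specific subwords.
joint_vocab = sorted({sw for vocab in lang_vocabs.values() for sw in vocab})
# -> ["a", "b", "c"]
token_to_index = {sw: i for i, sw in enumerate(joint_vocab)}

# Each language retains the indices of its own subwords in the joint vocabulary.
lang_indices = {
    lang: [token_to_index[sw] for sw in vocab]
    for lang, vocab in lang_vocabs.items()
}
# -> {"A": [0, 1], "B": [1, 2]}
# When decoding language A, the output softmax would cover only indices [0, 1];
# the subword "c" (index 2) stays unreachable unless the full matrix is used.
```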

Implementing this would require:

  • Joining all language-specific subword vocabularies into a joint vocabulary. Each language should retain a set of indices into this joint vocabulary.
  • Initializing a single large embedding matrix for the joint vocabulary.
  • Creating a language-specific view of the embedding matrix by slicing it according to the language's indices (see the sketch after this list).
  • Using the view as the embedding matrix in the forward pass.
  • After the backward pass, the original full embedding matrix should have a gradient, while the view does not [1].
  • All devices communicate and apply the gradient of the full embedding matrix (all-reduce across all devices).
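A minimal PyTorch sketch of the slicing and gradient flow described in the list above; `full_embedding`, `lang_indices` and the toy sizes are assumptions for illustration, and `index_select` is just one way to build a differentiable language-specific view:

```python
import torch
import torch.nn.functional as F

joint_vocab_size, emb_dim = 3, 4
full_embedding = torch.nn.Parameter(torch.randn(joint_vocab_size, emb_dim))

# Language A uses only indices [0, 1] of the joint vocabulary.
lang_indices = torch.tensor([0, 1])

# Language-specific view: a differentiable slice of the full matrix.
lang_embedding = full_embedding.index_select(0, lang_indices)

# Forward pass for language A: token ids are local indices into the view.
token_ids = torch.tensor([1, 0, 1])
hidden = F.embedding(token_ids, lang_embedding)
loss = hidden.sum()
loss.backward()

print(full_embedding.grad)     # populated; rows not used by language A stay zero
print(lang_embedding.is_leaf)  # False: the view accumulates no .grad of its own (see [1])
```

In a multi-device setup, the `full_embedding.grad` produced this way is what would be all-reduced across all devices before the optimizer step.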

The joint vocabulary size is bounded: vocab_size ≤ joint_vocab_size ≤ n_tasks_total * vocab_size.
Assuming small subword vocabularies with a fairly large overlap between languages:

  • VRAM usage should be reduced: instead of loading n_tasks_on_device * vocab_size embeddings, we load joint_vocab_size embeddings. As the size of the embedding matrix no longer depends on the number of tasks per device, the number of tasks per device can be increased if the number of other language-specific parameters is low (see the rough numbers below).
  • Communication would increase: instead of communicating vocab_size embeddings to a small group, we would communicate the larger joint_vocab_size embeddings to all devices.
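For a rough sense of scale, a back-of-the-envelope calculation (all numbers below are assumptions for illustration, not measurements):

```python
# Assumed setup: 24 languages with 8k subwords each, heavy overlap so the
# joint vocabulary is ~40k subwords, 8 tasks per device, 512-dim fp32 embeddings.
vocab_size = 8_000
n_tasks_on_device = 8
joint_vocab_size = 40_000
emb_dim = 512
bytes_per_param = 4

current = n_tasks_on_device * vocab_size * emb_dim * bytes_per_param
proposed = joint_vocab_size * emb_dim * bytes_per_param
print(f"embeddings per device, current:  {current / 2**20:.0f} MiB")   # 125 MiB
print(f"embeddings per device, proposed: {proposed / 2**20:.0f} MiB")  # 78 MiB

# Gradient communication per step, on the other hand, grows from vocab_size
# rows (within a small group) to joint_vocab_size rows (across all devices).
```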

[1] pytorch/pytorch#19778

Waino added the enhancement label on Feb 5, 2024