Question about Gemma tensor parallel sharding policy #1464
@qlzh727 for thoughts.
Thanks for reporting the issue. Can you share more references for the "typical" sharding/layout here? I would like to take a closer look at that.
@qlzh727 Thanks for the reply. The typical tensor parallel sharding policy I described comes from the Megatron-LM implementation and the HuggingFace Transformers documentation.
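For concreteness, here is a minimal sketch of the Megatron-LM-style layout described above, written against the Keras 3 `keras.distribution` API. The mesh shape, axis names, and variable-path regexes are illustrative placeholders (not the actual Gemma variable paths), and the kernels are assumed to be plain 2-D matrices.

```python
import keras

# Hypothetical mesh: one "batch" (data parallel) dim, one "model" (tensor parallel) dim.
devices = keras.distribution.list_devices()
mesh = keras.distribution.DeviceMesh(
    shape=(1, len(devices)), axis_names=("batch", "model"), devices=devices
)

layout_map = keras.distribution.LayoutMap(mesh)
# Token embedding [vocab_size, hidden_dim]: shard axis 0 (the vocab axis).
layout_map["token_embedding/embeddings"] = ("model", None)
# Q/K/V kernels [hidden_dim, hidden_dim]: column-parallel, shard axis 1.
layout_map["decoder.*attention.*(query|key|value).*kernel"] = (None, "model")
# Attention output kernel [hidden_dim, hidden_dim]: row-parallel, shard axis 0.
layout_map["decoder.*attention_output.*kernel"] = ("model", None)
# MLP up-projection [hidden_dim, intermediate_dim]: column-parallel, shard axis 1.
layout_map["decoder.*ffw_gating.*kernel"] = (None, "model")
# MLP down-projection [intermediate_dim, hidden_dim]: row-parallel, shard axis 0.
layout_map["decoder.*ffw_linear.*kernel"] = ("model", None)
```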
Thanks for the information. I think in general these are just different ways to shard the tensors/weights, suited to different conditions. In your approach, the QKV matmul is done without an all-gather, because the Q/K/V weights are sharded along the output dimension, and the collective happens afterwards (at the QK dot product and softmax). The current Keras implementation instead does the collective at the QKV matmul (since the contracting dimension is sharded) and avoids the collective afterwards. The better choice also depends on the cost of collectives (network interconnect) versus local computation speed, and on whether the model is only used for prediction or also needs fine-tuning and weight updates. I ran some benchmarks for this on a TPU v3-8, and your setting does have an advantage for the fine-tuning use case. Feel free to provide more benchmark results with GPU testing as well.
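To make the collective-placement point concrete, here is a toy JAX sketch (not Gemma code) of the two layouts. It assumes several local devices (e.g. a TPU v3-8); the exact collectives are ultimately chosen by the XLA GSPMD partitioner.

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(jax.devices(), axis_names=("model",))
d = 512
x = jnp.ones((8, d))   # activations
w = jnp.ones((d, d))   # a projection kernel

# (a) Shard the output dimension of the kernel ("column parallel").
#     x is replicated, so each device computes its slice of y = x @ w locally;
#     any collective is deferred to whatever consumes y.
w_col = jax.device_put(w, NamedSharding(mesh, P(None, "model")))
y_col = jax.jit(jnp.matmul)(x, w_col)

# (b) Shard the contracting dimension ("row parallel"). With x sharded to match,
#     each device produces a partial sum of y, and the usual GSPMD lowering
#     inserts an all-reduce at this matmul.
x_row = jax.device_put(x, NamedSharding(mesh, P(None, "model")))
w_row = jax.device_put(w, NamedSharding(mesh, P("model", None)))
y_row = jax.jit(jnp.matmul)(x_row, w_row)

print("col-parallel output sharding:", y_col.sharding)
print("row-parallel output sharding:", y_row.sharding)
```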
Should be addressed by #1491
Thanks for the Gemma model implementation. I found that the `layout_map` in `GemmaBackbone.get_layout_map()` seems to show a sharding policy that is completely opposite to the typical TP sharding policy used in the Transformer architecture.

Typical:

- token embedding (`[vocab_size, hidden_dim]`, sharded along axis=0);
- Q/K/V projections (`[hidden_dim, hidden_dim]`, sharded along axis=1);
- attention output projection (`[hidden_dim, hidden_dim]`, sharded along axis=0);
- MLP up-projection (`[hidden_dim, intermediate_dim]`, sharded along axis=1);
- MLP down-projection (`[intermediate_dim, hidden_dim]`, sharded along axis=0).

Gemma:

- `layout_map["token_embedding/embeddings"] = (None, model_dim)`, which seems to be sharded along the hidden_dim axis;
- `layout_map["decoder_block.*attention.*(query|key|value).*kernel"] = (None, model_dim, None)`, which seems to be sharded along the row, excluding the num_heads axis;
- `layout_map["decoder_block.*attention_output.*kernel"] = (None, None, model_dim)`, which seems to be sharded along the column, excluding the num_heads axis;
- `layout_map["decoder_block.*ffw_gating.*kernel"] = (model_dim, None)`, which seems to be sharded along the row;
- `layout_map["decoder_block.*ffw_linear.*kernel"] = (None, model_dim)`, which seems to be sharded along the column.

Is my understanding correct? If they are opposite, can you please explain the reason?
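For context, here is a rough sketch of how this layout map gets consumed via the Keras 3 distribution API. The mesh shape, the `get_layout_map` arguments, and the `ModelParallel` constructor signature are assumptions that may vary across Keras / KerasNLP versions.

```python
import keras
import keras_nlp

# Assumed 1x8 mesh for a TPU v3-8: "batch" for data parallel, "model" for tensor parallel.
devices = keras.distribution.list_devices()
device_mesh = keras.distribution.DeviceMesh(
    shape=(1, 8), axis_names=("batch", "model"), devices=devices
)

# The layout map under discussion, as produced by the backbone itself.
layout_map = keras_nlp.models.GemmaBackbone.get_layout_map(device_mesh)

# Weights created after this point are laid out according to layout_map.
keras.distribution.set_distribution(
    keras.distribution.ModelParallel(device_mesh, layout_map, batch_dim_name="batch")
)
gemma = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
```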