
PhiMoE #33363

Merged: 31 commits into huggingface:main on Oct 4, 2024

Conversation

@garg-amit (Contributor) commented Sep 6, 2024

What does this PR do?

Integrates PhiMoE into transformers. https://huggingface.co/microsoft/Phi-3.5-MoE-instruct

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker @gante

@garg-amit (Contributor Author)

@ArthurZucker @gante can I please get a review?

@merryHunter

Hi, this seems to be a very important and long-awaited PR! :) Other frameworks want to integrate MoE too, e.g. litgpt (Lightning-AI/litgpt#1686).

@ArthurZucker (Collaborator)

We are very much willing to integrate it as well 🤗 just came back from the torch conf and was a bit OOO (out of office) because of it 😢

@ArthurZucker (Collaborator) left a comment:

Thanks!
Let's go with camel-cased classes. If we want to be compile compatible, we need a conversion script and should use the formulation from the gpt-fast MoE, with a version implemented here: https://github.com/huggingface/transformers/pull/30793/files#diff-733ab0a772c69f78b1d8ed361e6ae1fda7243652887aed0bab5d3ecf07794c01R789

Lots of the code seems similar to Phi-3, so we can probably copy from it!

Review comments (since outdated/resolved) were left on:
  • docs/source/en/perf_infer_gpu_one.md
  • src/transformers/__init__.py
  • src/transformers/models/phimoe/configuration_phimoe.py (×2)
  • src/transformers/models/phimoe/modeling_phimoe.py (×3)
@ArthurZucker (Collaborator)

TL;DR: overall, the mixer needs to be properly documented and written so that it is more understandable!

@garg-amit (Contributor Author)

@ArthurZucker Thanks for reviewing the PR. I’ve refactored the code according to your suggestions, and it’s ready for another look. Also, the failing test case appears to be unrelated to this PR. Please let me know if it needs to be addressed.

@ArthurZucker (Collaborator)

Reviewing! 🤗

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment:

LGTM, the only things needed to merge:

  1. The "Copied from" comments need a capital letter.
  2. The core part needs a bit more documentation, as I said: why do we need a specific gradient computation? (I had to go through the paper to see that you indeed need a special gradient approximation; a generic illustration of the issue follows below.)
  3. That part of the code is IMO less readable than the rest, but fine for now!

Thanks, and sorry for the late review!
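For readers wondering about point 2: the router picks experts with a discrete top-k selection, which has no useful gradient, so MoE routers need some form of gradient workaround. The snippet below is a generic straight-through estimator, shown only to illustrate the problem; it is not the sparsemixer approximation this PR actually implements (see the paper referenced in the PR for that).

import torch

def straight_through_top1(router_logits: torch.Tensor) -> torch.Tensor:
    # Forward pass: hard one-hot selection of the best expert.
    # Backward pass: gradients flow through the softmax as if it had been
    # used directly, keeping the router trainable despite the discrete choice.
    probs = torch.softmax(router_logits, dim=-1)
    index = probs.argmax(dim=-1, keepdim=True)
    hard = torch.zeros_like(probs).scatter_(-1, index, 1.0)
    return hard - probs.detach() + probs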

Collaborator:

You no longer need this complicated structure! See the __init__ for Albert, for example.
You just need to define an __all__ in the modeling and config files, and that's it.
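Concretely, the pattern being referred to looks roughly like this (an illustrative sketch; the exact export lists are placeholders, see Albert's files for the real template):

# src/transformers/models/phimoe/configuration_phimoe.py
__all__ = ["PhimoeConfig"]

# src/transformers/models/phimoe/modeling_phimoe.py
__all__ = ["PhimoePreTrainedModel", "PhimoeModel", "PhimoeForCausalLM"]

The subpackage __init__ then only needs the standard lazy-import boilerplate and re-exports whatever the submodules list in __all__, instead of a hand-maintained import structure.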

return torch.cat((-x2, x1), dim=-1)


# copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb

Suggested change
# copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb
# Copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb

Comment on lines 319 to 322
self.rotary_emb = PhimoeRotaryEmbedding(
    config=self.config,
)

Collaborator:

IMO you can already move this outside the Attention layer, and remove the "copied from" Mixtral so that you can pass in the position embeddings!

Contributor Author:

Sure, moved it to the PhimoeModel class
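Roughly what that refactor looks like (a sketch only; the attribute name follows the snippet above, the rest is illustrative):

class PhimoeModel(PhimoePreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.layers = nn.ModuleList(
            [PhimoeDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        # rotary embedding is created once on the model instead of inside
        # every PhimoeAttention layer
        self.rotary_emb = PhimoeRotaryEmbedding(config=config)

The position embeddings can then be computed once per forward pass and handed down to each decoder layer, which is why the attention class no longer matches the Mixtral "Copied from".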

return attn_output, attn_weights, past_key_value


# copied from transformers.models.mixtral.modeling_mixtral.MixtralFlashAttention2 with Mixtral->Phimoe

Suggested change
# copied from transformers.models.mixtral.modeling_mixtral.MixtralFlashAttention2 with Mixtral->Phimoe
# Copied from transformers.models.mixtral.modeling_mixtral.MixtralFlashAttention2 with Mixtral->Phimoe

}


# copied from transformers.models.mixtral.modeling_mixtral.MixtralBlockSparseTop2MLP with Mixtral->Phimoe

Suggested change
# copied from transformers.models.mixtral.modeling_mixtral.MixtralBlockSparseTop2MLP with Mixtral->Phimoe
# Copied from transformers.models.mixtral.modeling_mixtral.MixtralBlockSparseTop2MLP with Mixtral->Phimoe

Returns:
    Tuple[torch.Tensor, torch.Tensor]: Multiplier and selected experts tensors.
"""
assert top_k == 2
Collaborator:

Also, let's raise an error rather than use an assert!

Contributor Author:

Fixed
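For illustration, the kind of change being asked for (a sketch, not the exact merged diff):

if top_k != 2:
    raise ValueError(f"Unsupported top_k value: {top_k}. Only top_k=2 is supported.")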


routing_weights, selected_experts = sparsemixer(
    router_logits,
    top_k=2,
Collaborator:

If it's hardcoded, we can also just not pass it!

Contributor Author:

@ArthurZucker I’ve removed top_k from here and instead created it as a keyword argument.
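A sketch of what the routing function then looks like (illustrative only; this shows plain top-2 routing, not the actual sparsemixer with its special gradient approximation, and only top_k comes from the discussion above):

import torch

def sparsemixer(scores: torch.Tensor, jitter_eps: float = 0.0, training: bool = False, top_k: int = 2):
    if top_k != 2:
        raise ValueError(f"Unsupported top_k value: {top_k}. Only top_k=2 is supported.")
    if training and jitter_eps > 0:
        # multiplicative jitter on the router logits during training (illustrative)
        scores = scores * torch.empty_like(scores).uniform_(1.0 - jitter_eps, 1.0 + jitter_eps)
    routing_weights, selected_experts = torch.topk(torch.softmax(scores, dim=-1), top_k, dim=-1)
    return routing_weights, selected_experts

The call site above can then drop the explicit top_k=2 and rely on the default keyword value.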

config.hidden_size, eps=config.rms_norm_eps, elementwise_affine=True
)

# copied from transformers.models.mixtral.modeling_mixtral.MixtralDecoderLayer.forward

Suggested change
# copied from transformers.models.mixtral.modeling_mixtral.MixtralDecoderLayer.forward
# Copied from transformers.models.mixtral.modeling_mixtral.MixtralDecoderLayer.forward

@garg-amit (Contributor Author)

@ArthurZucker Thanks for reviewing! I've addressed the comments and moved PhimoeRotaryEmbedding out of the PhimoeAttention class.

@ArthurZucker (Collaborator) left a comment:

Great work! Thanks for integrating this new model 🔥

kv_seq_len = hidden_states.shape[-2]
if past_key_values is not None:
    kv_seq_len += past_key_values.get_usable_length(kv_seq_len)
position_embeddings = self.rotary_emb(hidden_states, seq_len=kv_seq_len)
Collaborator:

Pretty sure you should be using cache positions here: cache_position[0]!

Collaborator:

it's the last nit!

Contributor Author:

Thanks for the suggestion! I've updated it to cache_position[-1]+1 as cache_position[0] would return 0 when the kv cache is empty.

Collaborator:

indeed! 🤗
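Putting that exchange together, the change looks roughly like this (a before/after sketch, not the exact merged diff):

# before (quoted above): sequence length reconstructed from the cache
kv_seq_len = hidden_states.shape[-2]
if past_key_values is not None:
    kv_seq_len += past_key_values.get_usable_length(kv_seq_len)

# after: cache_position holds the absolute positions of the current tokens,
# so the last entry + 1 is the total length, even when the kv cache is empty
kv_seq_len = cache_position[-1] + 1
position_embeddings = self.rotary_emb(hidden_states, seq_len=kv_seq_len)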

@ArthurZucker merged commit e377553 into huggingface:main on Oct 4, 2024. 24 checks passed.
@ArthurZucker (Collaborator)

Thanks everyone and @garg-amit for bearing with me! Congrats on the model release again 🤗

dataKim1201 pushed a commit to dataKim1201/transformers that referenced this pull request Oct 7, 2024
* onboard phimoe model
* removed debug code
* added unit tests
* updated docs
* formatted
* fixed unit tests
* fixed test case
* fixed format
* refactored code
* fixed expected outputs in the integration tests
* Added a warning msg
* Addressed comments
* Addressed comments
* fixed test cases
* added paper link
* Addressed comments
* Refactored PhimoeForCausalLM forward fn
* Refactored PhimoeRotaryEmbedding class
* fixed test cases
* fixed testcase
* fixed test case
* Addressed comments
* fixed test cases
* fixed testcases
* Used cache position instead to get the seq len