Support Mixtral-8x7B #71
Conversation
…e_descent_tuning is enabled (#116582): We found this perf optimization opportunity at pytorch-labs/gpt-fast#71. This would bring a 5%+ perf gain for Mixtral 8x7B on gpt-fast. Pull Request resolved: #116582. Approved by: https://github.com/lezcano
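For context, the coordinate descent tuning mentioned in that commit is an Inductor configuration flag that gpt-fast enables before compiling the decoding step. A minimal sketch of turning it on is below; the `decode_step` function is a placeholder, not gpt-fast's actual decode path.

```python
import torch
import torch._inductor.config as inductor_config

# Inductor knob referenced by the commit above; gpt-fast enables it before
# compiling the token-decoding step (a sketch, not the exact gpt-fast code).
inductor_config.coordinate_descent_tuning = True

# Placeholder stand-in for the model's decode step, just to show the compile call.
def decode_step(x: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.silu(x) @ x.T

compiled_decode = torch.compile(decode_step, mode="reduce-overhead", fullgraph=True)
print(compiled_decode(torch.randn(4, 4)))
```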
Does the model compile without any graph breaks?
@raghukiran1224 Yes, no graph breaks!
Summary: Pull Request resolved: #1533. Support the MoE structure, where there can be multiple experts in the FFN layer. The change in model.py is based on pytorch-labs/gpt-fast#71. Note that this is a functional verification with random weights; it can be successfully exported and lowered to ExecuTorch. TODO: test the runtime side. Reviewed By: larryliu0820. Differential Revision: D52543030. fbshipit-source-id: 5d4220f1e8ea9eb1e4be398fe2a47bfb0b89c975
@yanboliang Great to see this PR. What work remains before merging? It would also help to update the main README benchmarks to include the model.
@chauhang I think we need to figure out how to structure this under gpt-fast; we probably need a separate folder. There are no other blockers, so I'll prioritize this work and hopefully we can merge it in a few days.
Closing this as it has been merged at #105.
@yanboliang It doesn't seem like the gate networks are quantized. Is that expected?
@guangy10 Yes, it's expected! We don't quantize the gate networks, to preserve accuracy, since they are used to choose the experts (see gpt-fast/mixtral-moe/quantize.py, lines 56 to 57 at commit 1c23b94).
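As an illustration of that design choice, here is a minimal sketch of int8 weight-only quantization that leaves any linear layer whose name marks it as a gate/router in full precision. The helper name and the `skip_substrings` filter are hypothetical, not the actual gpt-fast quantize.py code.

```python
import torch
import torch.nn as nn

def quantize_int8_weight_only(model: nn.Module, skip_substrings=("gate",)) -> None:
    """Quantize Linear weights to int8 with per-channel scales, skipping gate layers.

    Illustrative only; the real skip logic lives in gpt-fast/mixtral-moe/quantize.py
    (lines 56-57 referenced above).
    """
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if any(s in name for s in skip_substrings):
            # Gate/router layers pick which experts run, so they stay
            # unquantized to preserve routing accuracy.
            continue
        w = module.weight.detach()
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
        # Attach quantized weight and scale; a real implementation would swap in
        # a custom int8 Linear module instead of just registering buffers.
        module.register_buffer("weight_int8", q)
        module.register_buffer("weight_scale", scale)
```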
This is based on #57. Please check out https://github.com/yanboliang/gpt-fast/tree/mixtral-moe to try it.
Performance numbers (tokens/second):
Note: Benchmarks were run on 8x A100-80GB GPUs, power limited to 330W, with a hybrid cube mesh topology.
How to reproduce it:
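The command listing from the original description is not reproduced here; the following is a sketch of the usual gpt-fast workflow (download, convert, optionally quantize, then benchmark with compilation), assuming the standard gpt-fast script names and the Hugging Face checkpoint id. The exact paths and flags in the PR may differ.

```bash
export MODEL_REPO=mistralai/Mixtral-8x7B-v0.1

# Download and convert the Hugging Face checkpoint (standard gpt-fast scripts).
python scripts/download.py --repo_id $MODEL_REPO
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/$MODEL_REPO

# Optional: int8 weight-only quantization (gate networks are left unquantized).
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8

# Benchmark with torch.compile enabled.
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth
```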