Support Mixtral-8x7B #71
Conversation
…e_descent_tuning is enabled (#116582): We found this perf optimization opportunity at pytorch-labs/gpt-fast#71. This would bring a 5%+ perf gain for Mixtral 8x7B on gpt-fast. Pull Request resolved: #116582. Approved by: https://github.com/lezcano
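For context, the coordinate descent tuning mentioned in that commit is an Inductor configuration flag that gpt-fast enables before compiling the decoding step. A minimal sketch of turning it on is below; the `decode_step` function is a placeholder, not gpt-fast's actual decode path.

```python
import torch
import torch._inductor.config as inductor_config

# Inductor knob referenced by the commit above; gpt-fast enables it before
# compiling the token-decoding step (a sketch, not the exact gpt-fast code).
inductor_config.coordinate_descent_tuning = True

# Placeholder stand-in for the model's decode step, just to show the compile call.
def decode_step(x: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.silu(x) @ x.T

compiled_decode = torch.compile(decode_step, mode="reduce-overhead", fullgraph=True)
print(compiled_decode(torch.randn(4, 4)))
```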
Does the model compile without any graph breaks?
@raghukiran1224 Yes, no graph breaks!
Summary: Pull Request resolved: #1533. Support the MoE structure, where there can be multiple experts in the FFN layer. The change in model.py is based on pytorch-labs/gpt-fast#71. Note that this is a functional verification with random weights; it can be successfully exported and lowered to ExecuTorch. TODO: test the runtime side. Reviewed By: larryliu0820. Differential Revision: D52543030. fbshipit-source-id: 5d4220f1e8ea9eb1e4be398fe2a47bfb0b89c975
@yanboliang Great to see this PR. What work remains before merging? It would also help to update the main README benchmarks to include the model.
@chauhang I think we need to figure out how to structure this under gpt-fast; we probably need a separate folder. There are no other blockers, so I'll prioritize this work and hopefully we can merge it in a few days.
Closing this as it has been merged at #105.
@yanboliang It doesn't seem like the gate networks are quantized. Is that expected?
@guangy10 Yes, it's expected! We don't quantize the gate networks, to preserve accuracy, since they are used to choose the experts (see gpt-fast/mixtral-moe/quantize.py, lines 56 to 57 at commit 1c23b94).
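As an illustration of that design choice, here is a minimal sketch of int8 weight-only quantization that leaves any linear layer whose name marks it as a gate/router in full precision. The helper name and the `skip_substrings` filter are hypothetical, not the actual gpt-fast quantize.py code.

```python
import torch
import torch.nn as nn

def quantize_int8_weight_only(model: nn.Module, skip_substrings=("gate",)) -> None:
    """Quantize Linear weights to int8 with per-channel scales, skipping gate layers.

    Illustrative only; the real skip logic lives in gpt-fast/mixtral-moe/quantize.py
    (lines 56-57 referenced above).
    """
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if any(s in name for s in skip_substrings):
            # Gate/router layers pick which experts run, so they stay
            # unquantized to preserve routing accuracy.
            continue
        w = module.weight.detach()
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
        # Attach quantized weight and scale; a real implementation would swap in
        # a custom int8 Linear module instead of just registering buffers.
        module.register_buffer("weight_int8", q)
        module.register_buffer("weight_scale", scale)
```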
This is based on #57. Please check out https://github.com/yanboliang/gpt-fast/tree/mixtral-moe to try it.
Performance numbers (tokens/second):
Note: Benchmarks were run on 8x A100-80GB GPUs, power limited to 330W, with a hybrid cube mesh topology.
How to reproduce it:
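The command listing from the original description is not reproduced here; the following is a sketch of the usual gpt-fast workflow (download, convert, optionally quantize, then benchmark with compilation), assuming the standard gpt-fast script names and the Hugging Face checkpoint id. The exact paths and flags in the PR may differ.

```bash
export MODEL_REPO=mistralai/Mixtral-8x7B-v0.1

# Download and convert the Hugging Face checkpoint (standard gpt-fast scripts).
python scripts/download.py --repo_id $MODEL_REPO
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/$MODEL_REPO

# Optional: int8 weight-only quantization (gate networks are left unquantized).
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8

# Benchmark with torch.compile enabled.
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth
```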