what's the meaning of "Groupwise 4-bit (128)" #3559
In the case of the Llama 2 Linear operation, the weights are quantized. There are various methods to perform quantization; in this instance we used "symmetric, per-channel groupwise" quantization to convert the original fp32 weight elements to int4. "Groupwise" refers to the number of weight elements in the same output channel that are quantized together and share the same quantization scale. For this particular case we empirically chose a group size of 128, but other "standard" values like 32 or 256 are also supported.
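Here is a minimal, illustrative sketch of symmetric, per-channel groupwise 4-bit quantization with group size 128 (not the ExecuTorch implementation; the function names and shapes are made up for illustration):

```python
import torch

def quantize_groupwise_4bit(weight: torch.Tensor, group_size: int = 128):
    """weight: fp32 tensor of shape [out_channels, in_channels]."""
    out_ch, in_ch = weight.shape
    assert in_ch % group_size == 0, "in_channels must be divisible by group_size"
    # Split each output-channel row into groups of `group_size` elements.
    w = weight.reshape(out_ch, in_ch // group_size, group_size)
    # Symmetric quantization: one scale per group, zero point fixed at 0.
    max_abs = w.abs().amax(dim=-1, keepdim=True)          # [out_ch, n_groups, 1]
    scales = (max_abs / 7.0).clamp(min=1e-9)              # int4 symmetric range ~[-8, 7]
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    return q.reshape(out_ch, in_ch), scales.squeeze(-1)   # int4 values stored in int8

def dequantize_groupwise_4bit(q: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    out_ch, in_ch = q.shape
    w = q.reshape(out_ch, in_ch // group_size, group_size).float()
    return (w * scales.unsqueeze(-1)).reshape(out_ch, in_ch)
```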
"Groupwise" in "Groupwise 4-bit" refers to how many elements are in a group that shares the quantization parameters. See this and this for details on what the quantization parameters are, namely scale and zero point. So, for example, the weight tensor of a linear layer might be of shape (N x K) = [4096, 1024], where N = 4096 is the number of output channels and K = 1024 is the number of input channels. If we quantize the entire tensor with one set of quantization parameters, then we have per-tensor quantization.
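To make the granularity concrete, here is the number of scale values you end up with for the hypothetical [4096, 1024] weight above, under different quantization granularities (a rough illustration, not ExecuTorch code):

```python
# Number of quantization scales for a [4096, 1024] linear weight,
# depending on quantization granularity.
N, K, group_size = 4096, 1024, 128

per_tensor  = 1                      # one scale for the whole tensor
per_channel = N                      # one scale per output channel -> 4096
groupwise   = N * (K // group_size)  # one scale per 128-element group -> 4096 * 8 = 32768

print(per_tensor, per_channel, groupwise)
```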
@kimishpatel @digantdesai Thank you. After reading your answer, I have the following questions:
Answering some:
Yes
The naming is a bit confusing, but 8da4w represents 4-bit weight quantization; "8da" refers to quantizing the activations to 8 bits on the fly during inference. This is known as dynamic quantization. https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
Use --quantization_mode 8da4w. Please see https://github.com/pytorch/executorch/tree/main/examples/models/llama2#option-c-download-and-export-llama3-8b-model. (You can use the Llama 2 7B or Llama 3 8B model; just highlighting this because the readme contains the repro instructions.)
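As an illustration of the "8da" part, here is a rough sketch of per-token dynamic 8-bit activation quantization, where the scale and zero point are computed from the actual activations at inference time (this only sketches the idea; it is not the kernel ExecuTorch uses, and the function name is hypothetical):

```python
import torch

def dynamic_quantize_activation_8bit(x: torch.Tensor):
    """x: fp32 activations of shape [num_tokens, in_features]."""
    # Asymmetric per-token 8-bit quantization; parameters are computed on the
    # fly from the observed min/max of each token's activations.
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-9) / 255.0
    zero_point = torch.round(-x_min / scale) - 128        # maps x_min -> -128
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127).to(torch.int8)
    return q, scale, zero_point
```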
Hi, could some kind helper tell me what the "128" in "Groupwise 4-bit (128)" indicates in https://github.com/pytorch/executorch/tree/main/examples/models/llama2?
Thank you.