
what's the meaning of "Groupwise 4-bit (128)" #3559


Closed
l2002924700 opened this issue May 9, 2024 · 4 comments
Labels
module: quantization (Issues related to quantization)
module: xnnpack (Issues related to xnnpack delegation and the code under backends/xnnpack/)
rfc (Request for comment and feedback on a post, proposal, etc.)

Comments

@l2002924700

l2002924700 commented May 9, 2024

Hi, could someone kindly tell me what the "128" in "Groupwise 4-bit (128)" indicates in https://github.com/pytorch/executorch/tree/main/examples/models/llama2?
Thank you.

@mergennachin added the module: quantization and module: xnnpack labels May 9, 2024
@digantdesai
Contributor

In the case of the Llama2 Linear operation, the weights are quantized. There are various methods to perform quantization. In this instance, we used "symmetric, per-channel groupwise" quantization to convert and represent the original fp32 weight elements as int4.

The term "groupwise" refers to the number of weight elements in the same output channels that are quantized together and share the same quantization scale. For this particular case, we empirically chose a group size of 128. But support other 'standard' values like 32 or 256.

@mergennachin added the rfc label May 9, 2024
@kimishpatel
Contributor

Groupwise in "Groupwise 4-bit" refers to how many elements are in a group that shares the quantization parameters. See this and this for details on what the quantization parameters are, namely scale and zero point.

So, for example, the weight tensor of a linear layer might have shape (N x K) = [4096, 1024], where 4096 = N = number of output channels and 1024 = K = number of input channels.

If we quantize the entire tensor with one set of quantization parameters, we have per-tensor quantization.
If we quantize each output channel (all 1024 elements of a row, spanning the input channels) with one set of quantization parameters, we have per-channel quantization. In this case each channel is one group and the group size is 1024, so the quantization parameters have size [4096, 1], one per output channel.
If we quantize each channel with more than one set of quantization parameters, we get groupwise per-channel quantization. For example, group size = 128 means each channel is divided into groups of 128 elements. In our example that gives 1024/128 = 8 groups per channel, so the quantization parameters have size [4096, 8].
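The three granularities above differ only in the shape of the scale tensor. A small illustrative sketch (the tensor names and the max-abs/7 scale rule are assumptions for demonstration, not the exact library implementation):

```python
import torch

N, K, group_size = 4096, 1024, 128
W = torch.randn(N, K)  # linear weight: N output channels x K input channels

# Per-tensor: one scale shared by the whole weight.
scale_per_tensor = W.abs().max() / 7.0                       # scalar

# Per-channel: one scale per output channel (each group spans all K = 1024 inputs).
scale_per_channel = W.abs().amax(dim=1, keepdim=True) / 7.0  # shape [4096, 1]

# Groupwise (group size 128): each output channel is split into K // 128 = 8 groups.
W_grouped = W.reshape(N, K // group_size, group_size)        # shape [4096, 8, 128]
scale_groupwise = W_grouped.abs().amax(dim=-1) / 7.0         # shape [4096, 8]

print(scale_per_tensor.shape, scale_per_channel.shape, scale_groupwise.shape)
# torch.Size([]) torch.Size([4096, 1]) torch.Size([4096, 8])
```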

@l2002924700
Author

@kimishpatel @digantdesai Thank you. After reading your answers, I have the following questions:

1. Does the "4-bit" in "Groupwise 4-bit" refer to the int4 in "convert and represent the original fp32 weight elements as int4", since an int4 occupies 4 bits? Am I right about this?
2. Does the "128" in "Groupwise 4-bit (128)" mean that each input channel is divided into groups of 128 elements?
3. How can I set the "4-bit" and the "128" in "Groupwise 4-bit (128)" when I want to evaluate Llama2 with the "examples.models.llama2.eval_llama" function?
4. Since "--quantization_mode" in the "examples.models.llama2.eval_llama" function only includes {int8, 8da4w, 8da4w-gptq}, how can I convert the original fp32 weight elements to int4?
5. Which quantization_mode ({int8, 8da4w, 8da4w-gptq}) did you choose when you evaluated the results in "https://github.com/pytorch/executorch/tree/main/examples/models/llama2"?

@kimishpatel
Contributor

Answering some of these:

Does the "128" in "Groupwise 4-bit (128)" mean that each input channel is divided into groups of 128 elements?

Yes

Since "--quantization_mode" in the "examples.models.llama2.eval_llama" function only includes {int8, 8da4w, 8da4w-gptq}, how can I convert the original fp32 weight elements to int4?

The naming is a bit confusing, but 8da4w means 4-bit weight quantization. The "8da" refers to the need to quantize activations to 8 bits during inference; this is known as dynamic quantization: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
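As a rough sketch of what the "8da" part adds on top of the 4-bit weights (conceptual only, not the kernel XNNPACK actually runs; the per-token max-abs scaling is an illustrative assumption), the int8 activation scale is derived from the live activation values at inference time:

```python
import torch

def dynamic_quantize_activation_int8(x: torch.Tensor):
    # "Dynamic" = the scale comes from the actual activations at inference
    # time (here per token/row), rather than from an offline calibration pass.
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0   # shape [num_tokens, 1]
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

x = torch.randn(4, 1024)   # 4 tokens, hidden size 1024
q_act, act_scale = dynamic_quantize_activation_int8(x)
print(q_act.dtype, act_scale.shape)  # torch.int8 torch.Size([4, 1])
```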

Which quantization_mode ({int8, 8da4w, 8da4w-gptq}) did you choose when you evaluated the results in

--quantization_mode 8da4w. Please see https://github.com/pytorch/executorch/tree/main/examples/models/llama2#option-c-download-and-export-llama3-8b-model. (You can use the Llama2 7B or Llama3 8B model; I'm just highlighting this link because the README contains the repro instructions.)
