AutoAWQ

AutoAWQ combines ease of use and fast inference in one package. This documentation shows how to quantize a model and run inference with it.

Example inference speed (RTX 4090, Ryzen 9 7950X, 64 tokens):

Vicuna 7B (GEMV kernel): 198.848 tokens/s
Mistral 7B (GEMM kernel): 156.317 tokens/s
Mistral 7B (ExLlamaV2 kernel): 188.865 tokens/s
Mixtral 46.7B (GEMM kernel): 93 tokens/s (2x 4090)
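
The figures above are throughput numbers: generated tokens divided by wall-clock generation time. A minimal sketch of that arithmetic follows. The `generate` callable and `fake_generate` are hypothetical stand-ins for a real decoding call and are not part of AutoAWQ.

```python
import time
from typing import Callable


def throughput(n_tokens: int, elapsed_s: float) -> float:
    """Return tokens per second for n_tokens generated in elapsed_s seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return n_tokens / elapsed_s


def measure(generate: Callable[[int], int], n_tokens: int = 64) -> float:
    """Time one call to `generate` and return its throughput.

    `generate` is a hypothetical stand-in: it takes the number of tokens
    to produce and returns how many it actually produced.
    """
    start = time.perf_counter()
    produced = generate(n_tokens)
    elapsed = time.perf_counter() - start
    return throughput(produced, elapsed)


def fake_generate(n_tokens: int) -> int:
    # Stand-in decoder: sleeps about 1 ms per token instead of running a model.
    time.sleep(0.001 * n_tokens)
    return n_tokens


if __name__ == "__main__":
    print(f"{measure(fake_generate):.1f} tokens/s")
```

When timing a real model on a GPU, the device usually needs to be synchronized before each clock reading, since GPU work is launched asynchronously.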

Installation notes

Install: pip install autoawq.
Your torch version must match the version the wheel was built with, e.g. you cannot use torch 2.0.1 with a wheel that was built with torch 2.2.0.
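
A quick way to catch such a mismatch before loading a model is to compare version strings. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that "match" means the same major.minor release, which the text above does not spell out.

```python
def major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '2.2.0+cu121'."""
    base = version.split("+", 1)[0]  # drop a local suffix like '+cu121'
    major, minor = base.split(".")[:2]
    return int(major), int(minor)


def torch_matches_build(installed: str, built_with: str) -> bool:
    # Assumption: two versions "match" when major and minor are equal.
    return major_minor(installed) == major_minor(built_with)


if __name__ == "__main__":
    # The mismatch named in the text: torch 2.0.1 vs a wheel built with 2.2.0.
    print(torch_matches_build("2.0.1", "2.2.0"))
    print(torch_matches_build("2.2.0+cu121", "2.2.0"))
```

The installed version can be read from `torch.__version__`. The version the wheel was built with has to come from the release notes or the wheel name.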
On AMD GPUs, inference runs through the ExLlamaV2 kernels without fused layers. Pass the following arguments when loading the quantized model:

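from awq import AutoAWQForCausalLM

# AMD GPUs: disable fused layers and use the ExLlamaV2 kernels.
model = AutoAWQForCausalLM.from_quantized(
    ...,
    fuse_layers=False,
    use_exllama_v2=True,
)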
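
If the same script has to run on both AMD and other GPUs, the two arguments can be chosen at runtime. This is a sketch under stated assumptions: both helper names are hypothetical, and the non-AMD branch (fused layers enabled) is an illustrative default rather than something the notes above prescribe. The ROCm check relies on `torch.version.hip`, which PyTorch sets to a version string on ROCm builds and leaves as `None` otherwise.

```python
def from_quantized_kwargs(is_amd: bool) -> dict:
    """Keyword arguments to pass to AutoAWQForCausalLM.from_quantized.

    The AMD branch mirrors the arguments given in the installation notes.
    The non-AMD branch is an assumption made for illustration.
    """
    if is_amd:
        return {"fuse_layers": False, "use_exllama_v2": True}
    return {"fuse_layers": True}


def running_on_rocm() -> bool:
    # Imported lazily so from_quantized_kwargs works without torch installed.
    import torch

    return getattr(torch.version, "hip", None) is not None


if __name__ == "__main__":
    print(from_quantized_kwargs(is_amd=True))
    print(from_quantized_kwargs(is_amd=False))
```

The resulting dictionary can then be unpacked into the loading call as `**from_quantized_kwargs(running_on_rocm())`.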

Supported models

The detailed support list:

Models	Sizes
LLaMA-2	7B/13B/70B
LLaMA	7B/13B/30B/65B
Mistral	7B
Vicuna	7B/13B
MPT	7B/30B
Falcon	7B/40B
OPT	125m/1.3B/2.7B/6.7B/13B/30B
Bloom	560m/3B/7B
Aquila	7B
Aquila2	7B/34B
Yi	6B/34B
Qwen	1.8B/7B/14B/72B
BigCode	1B/7B/15B
GPT NeoX	20B
GPT-J	6B
LLaVa	7B/13B
Mixtral	8x7B
Baichuan	7B/13B