AutoAWQ

AutoAWQ combines ease of use and fast inference in one package. The following documentation shows you how to quantize models and run inference on them.
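
As a quick orientation, the typical quantization flow looks like the sketch below (the model checkpoint, output path, and quantization config are example choices, not requirements):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # example FP16 checkpoint
quant_path = "mistral-7b-awq"             # example output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize to 4-bit AWQ and save the result
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)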

Example inference speed (RTX 4090, Ryzen 9 7950X, 64 tokens):

  • Vicuna 7B (GEMV kernel): 198.848 tokens/s
  • Mistral 7B (GEMM kernel): 156.317 tokens/s
  • Mistral 7B (ExLlamaV2 kernel): 188.865 tokens/s
  • Mixtral 46.7B (GEMM kernel): 93 tokens/s (2x 4090)
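
To reproduce numbers like these, load a quantized model and generate. A minimal inference sketch (the quantized model path and prompt are placeholders):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "mistral-7b-awq"  # placeholder: any AWQ-quantized checkpoint

# fuse_layers=True enables the fused modules behind the fastest results above
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

tokens = tokenizer("What is AWQ quantization?", return_tensors="pt").input_ids.cuda()
generation_output = model.generate(tokens, streamer=streamer, max_new_tokens=64)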

Installation notes

  • Install: pip install autoawq.
  • Your torch version must match the version the wheel was built against, e.g. you cannot use torch 2.0.1 with a wheel built against torch 2.2.0 (a quick version check is shown after this list).
  • For AMD GPUs, inference runs through the ExLlamaV2 kernels without fused layers. Pass the following arguments to run on AMD GPUs:

from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    ...,                 # model path and any other loading arguments
    fuse_layers=False,   # fused layers are not available on AMD
    use_exllama_v2=True  # route inference through the ExLlamaV2 kernels
)
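
To confirm which torch version is installed before choosing a wheel (assuming torch is already installed):

import torch
print(torch.__version__)  # must match the version the AutoAWQ wheel was built against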

Supported models

The detailed support list:

Models     Sizes
---------  ---------------------------
LLaMA-2    7B/13B/70B
LLaMA      7B/13B/30B/65B
Mistral    7B
Vicuna     7B/13B
MPT        7B/30B
Falcon     7B/40B
OPT        125m/1.3B/2.7B/6.7B/13B/30B
Bloom      560m/3B/7B
Aquila     7B
Aquila2    7B/34B
Yi         6B/34B
Qwen       1.8B/7B/14B/72B
BigCode    1B/7B/15B
GPT NeoX   20B
GPT-J      6B
LLaVa      7B/13B
Mixtral    8x7B
Baichuan   7B/13B