Quantization demo

A demonstration of how to quantize LLMs in practice.

We have mentioned several ways to do quantization in previous posts. Here we will demonstrate how to convert the weights of Llama-family models into quantized models via llama.cpp.

What is llama.cpp

After Llama was released, the open source community was inspired to work on solutions that let it run on a wider range of devices. Georgi Gerganov (https://github.com/ggerganov), who specialises in C++, released llama.cpp, which is used to quantize model weights and convert them into the GGUF format.

The project also provides a Python binding, the package llama-cpp-python, which allows you to call the LLM from Python.

With this setup, if you want to run an LLM, all you need is:

  • a GGUF file for the model; if one does not exist, you can create it yourself with llama.cpp

  • the Python package llama-cpp-python installed

Then you are free to use it as shown below:

>>> from llama_cpp import Llama
>>> llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
>>> output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}

Quantization

There are quite detailed instructions on how to convert a model into the GGUF format in the llama.cpp README.

Summary of the steps:

  • Clone llama.cpp and build it with make

  • Then download the model weights

  • Write your own script or use the convert.py script shipped with the repo to convert the model weights into GGUF, then quantize them into the precision you want (a scripted sketch follows this list)

  • Run the quantized model to test it out
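
Concretely, the convert-then-quantize flow can be scripted end to end. Below is a minimal sketch, driven from Python via subprocess; the script and binary names (convert.py, quantize) and all paths are assumptions that depend on your llama.cpp version (newer checkouts ship convert_hf_to_gguf.py and llama-quantize), so adjust them to your local build.

import subprocess

# Assumed locations -- adjust to where you cloned/built llama.cpp and stored the weights.
LLAMA_CPP = "./llama.cpp"                        # cloned and built llama.cpp checkout
MODEL_DIR = "./models/7B"                        # original (unquantized) model weights
F16_GGUF = "./models/7B/llama-model-f16.gguf"    # intermediate full-precision GGUF
Q4_GGUF = "./models/7B/llama-model-q4_k_m.gguf"  # final quantized GGUF

# Step 1: convert the original weights into a (still unquantized) GGUF file.
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert.py", MODEL_DIR, "--outfile", F16_GGUF],
    check=True,
)

# Step 2: quantize the GGUF file down to 4 bits (Q4_K_M in this sketch).
subprocess.run(
    [f"{LLAMA_CPP}/quantize", F16_GGUF, Q4_GGUF, "Q4_K_M"],
    check=True,
)

The resulting quantized file is what you load with llama-cpp-python, exactly like the llama-model.gguf used in the example above.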

As you can see, this is a one-off data transformation, so you can do it once and publish the GGUF files somewhere. That is in fact what people do.

On Hugging Face, a user called TheBloke does exactly this. For example, from https://huggingface.co/TheBloke/CodeLlama-70B-Python-GGUF/tree/main you can download quantized versions of CodeLlama-70B-Python.
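
If you would rather fetch one of those files programmatically, a small sketch with the huggingface_hub package looks like this. The exact filename inside the repository is an assumption here, so check the repo's file listing and pick the quantization level that fits your hardware.

from huggingface_hub import hf_hub_download

# Download a pre-quantized GGUF file from the Hugging Face Hub.
model_path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-70B-Python-GGUF",
    filename="codellama-70b-python.Q4_K_M.gguf",  # assumed filename -- verify it on the repo page
)
print(model_path)  # local path you can pass to Llama(model_path=...)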

Run it with Python bindings

llama.cpp does not only have Python bindings; there are bindings for most popular programming languages, so you are free to pick an open LLM and get it running inside your production environment, or build an application on top of it.
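
As a taste of what a small application on top of the Python binding can look like, llama-cpp-python also exposes a chat-style API. The sketch below reuses the placeholder model path from the earlier example; point it at whichever GGUF file you have.

from llama_cpp import Llama

# Load a quantized GGUF model (placeholder path -- use your own file).
llm = Llama(model_path="./models/7B/llama-model.gguf")

# Ask a question through the chat completion interface.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Name the planets in the solar system."},
    ],
    max_tokens=64,
)
print(response["choices"][0]["message"]["content"])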

Other notes

llama.cpp is not the only way to do quantization, and not every model architecture can be quantized with llama.cpp, so keep that in mind. We will keep this post updated.
