Quantization demo
A demonstration of how to quantize an LLM.
We have mentioned several ways to do quantization in previous posts. Here we will demonstrate how to convert model weights into quantized models for Llama-family models via llama.cpp.
What is llama.cpp
After Llama was released, the open source community was inspired to work on solutions that make it run on a wider range of devices. Georgi Gerganov (https://github.com/ggerganov), who specialises in C++, released llama.cpp, which is used to convert the weights into the GGUF format after quantization.
The project also provides a Python binding, the package llama-cpp-python, which allows you to call the LLM from Python.
In this way, if you want to run an LLM, all you need to do is:
1. Get a GGUF file for the model; if one does not exist, you can create it yourself with llama.cpp.
2. Install the Python package llama-cpp-python.
Then you are free to use it from Python, as in the example below.
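Here is a minimal sketch with llama-cpp-python; the GGUF file name is a placeholder, so point model_path at whatever GGUF file you actually downloaded or created:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load a local GGUF model (the path below is a placeholder for your own file).
llm = Llama(model_path="./models/codellama-7b.Q4_K_M.gguf", n_ctx=2048)

# Simple completion-style call; adjust max_tokens/stop for your use case.
output = llm(
    "Q: Write a Python function that reverses a string.\nA:",
    max_tokens=128,
    stop=["Q:"],
    echo=False,
)
print(output["choices"][0]["text"])
```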
Quantization
There are quite detailed instructions on how to convert a model into the GGUF format in the llama.cpp README.
Summary of the steps (a scripted sketch follows the list):
1. Clone llama.cpp and run make to build it.
2. Download the model weights.
3. Write your own script, or use the convert.py script shipped with the repository, to convert the model weights into the quantized version you want.
4. Run the quantized model to test it out.
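As a rough sketch, the same steps can be scripted from Python with subprocess. The script and binary names (convert.py, quantize), the intermediate file name ggml-model-f16.gguf, and the my-model directory are assumptions based on the llama.cpp README at the time of writing and may differ in newer versions:

```python
import subprocess

def run(cmd, cwd=None):
    """Run a command, echo it, and fail loudly on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)

# 1. Clone llama.cpp and build it with make.
run(["git", "clone", "https://github.com/ggerganov/llama.cpp"])
run(["make"], cwd="llama.cpp")

# 2. Convert the downloaded weights (placeholder path ../models/my-model)
#    into an f16 GGUF file with the convert script shipped in the repo.
run(["python", "convert.py", "../models/my-model"], cwd="llama.cpp")

# 3. Quantize the f16 GGUF down to 4 bits (Q4_K_M here; pick what you need).
run(["./quantize",
     "../models/my-model/ggml-model-f16.gguf",
     "../models/my-model/ggml-model-Q4_K_M.gguf",
     "Q4_K_M"], cwd="llama.cpp")
```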
As you can see, quantization is really a one-off data transformation, so you can do it once and publish the GGUF files somewhere. That is in fact what people do.
On Hugging Face, the user TheBloke does exactly this. For example, from this link: https://huggingface.co/TheBloke/CodeLlama-70B-Python-GGUF/tree/main you can download quantized versions of CodeLlama-70B-Python.
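For example, a single GGUF file can be fetched with the huggingface_hub package; the exact filename below is an assumed example, so check the repository's file list for the quantization level you want:

```python
from huggingface_hub import hf_hub_download

# Download one quantized GGUF file from TheBloke's repository.
# The filename is an assumed example; pick the real one from the repo listing.
model_path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-70B-Python-GGUF",
    filename="codellama-70b-python.Q4_K_M.gguf",
)
print(model_path)  # local path you can pass to llama-cpp-python as model_path
```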
Run it with Python bindings
llama.cpp actually has more than Python bindings; it has bindings for most popular programming languages, so you are free to pick one of the open LLMs and get it running inside your production environment, or build an application on top of it.
Other notes
llama.cpp is not the only way to do quantization, and not all models can be quantized with llama.cpp, so keep this in mind; we will keep this post updated.