Self-Hosted LLM

An introduction to self-hosting LLMs.

Introduction

ChatGPT is great, and Claude is arguably even better.

However, these models are controlled by other people or companies.

When we try to build an application on top of them, the first thing we want to check is: will my data be leaked?

OpenAI does say that it will not use your data for further training or leak it.

However, this brings us back to an old debate: trust a big vendor, or control everything yourself?

We do not have a definitive answer, but the research community does want to provide options.

Luckily, Meta decided to open-source its LLMs: LLAMA and LLAMA2.

And with effort from both research and industry, more and more open-source LLMs have been released. Some are built for specific domains, others for general purposes.

However, this is all relatively new. How can the wider WA community benefit from it?

We will try to work out the answers together.

What is an Open Source LLM?

As the name suggests, OpenAI used to open-source its AI to make sure humanity could benefit from it. However, for various reasons, it is no longer open source. The papers it publishes now tend to describe technology it applied perhaps 6-9 months earlier.

The debate between open source and closed source is a very old one. I would say we need them both.

While ChatGPT was attracting attention from all around the world, Meta released its open-source LLM, LLAMA, followed by LLAMA2.

Open source means you can download the model weights from the vendor's website, run the model on your own machine, and do the inference (which is the chat) yourself.

In short, you can have your own ChatGPT with open-source LLMs.

Here is a list of LLMs you can use commercially: https://github.com/eugeneyan/open-llms

Problems with Open Source LLMs?

If open-source LLMs could match ChatGPT's performance, ChatGPT would die immediately, or you would see a lot of companies like OpenAI spring up.

Sadly, the truth is no: there is currently no open-source LLM that matches GPT-4's performance.

So what are the problems with the current open-source LLMs?

  • Accuracy: the responses may not make sense.

    • The smaller the model, the higher the chance that a response makes no sense.

    • The larger the model (for example 70B), the better the responses.

  • Latency: how long does it take to generate a response?

    • Larger models normally have higher latency, unless you throw more resources (such as GPUs) at them.

So there is a basic trade-off between cost, accuracy, and model size.

At the same time, I would say GPT-4 is the best LLM, so another big problem for the research community is: how can these open-source models match the performance of ChatGPT?

However, that is not the main problem for industry.

For industry, the questions are:

How can we fit an LLM into our applications?

And what performance level does your application scenario actually require?

For example, if you want to create an application that generates bedtime stories for kids, you can achieve a good one fairly easily; most of the effort goes into making sure the content delivered is safe and suitable for kids.

However, if you want to use an LLM on a mining site to schedule and manage all the vehicles, you will expect it to make no mistakes at all. That will be very, very hard.

So even with these problems, industry and the research community should work together to explore the future of LLMs in our lives.

Why do we need self-hosted LLMs?

For researchers:

  • Can smaller models with better accuracy enable a better future for us?

  • Can we unlock why GPT works?

  • ....

  • It is research.

For industry:

  • More in-context Q&A over private knowledge bases

  • Improve efficiency within organisations

  • Automate some of the processes

  • More applications that can benefit different groups of people

I do not really have a good answer for this. I think our starting point is simple: we should have more options.


Technology Part

There are two key concepts for LLMs: training and inference.

  • Training is the process of using your data and your model architecture to train or fine-tune a model, so that the model weights are optimised for later inference.

  • Inference is the stage where the model weights are ready, you ask (prompt) the model some questions, and the model runs inference to give you the answers.

For an application, we care more about the inference stage. So what we need to do is:

  1. Download the weights of the models

  2. Run the model with the questions you have

  3. Get the answer

Looks like an easy win, right?

No.

The first challenge comes from the size of models.

You can indeed try to load a model from Hugging Face (where most people host their model weights) and then run a script to do the inference; this is one solution.

However, both loading the model and running the inference this way will be very slow. That is fine for research purposes, but not for application development.
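For reference, a minimal sketch of that straightforward approach with the Hugging Face transformers library might look like the following (the model name is only an example, and a gated model like this one requires access approval):

    # Load an open-source LLM from Hugging Face and run a single inference.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-chat-hf"  # example; swap in any open LLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # device_map="auto" (needs the accelerate package) spreads the weights across
    # whatever GPU/CPU memory is available.
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    prompt = "Tell a short bedtime story about a turtle."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generation is slow without a decent GPU; fine for experiments, not serving.
    outputs = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))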

During training, most LLMs use a huge number of powerful GPUs, and parameter counts range from 7B to 130B. This means that if you want to run the models on your own, you will need a significant amount of GPU/CPU/memory resources.

At minimum, you need to load all 7B to 130B parameters into GPU or CPU memory, which is already a big blocker for most companies.

Even if you have all the resources, making it run fast enough to return responses in a timely manner is another big blocker.

Is this a dead end?

No.

A research concept called LLM quantization is trying to mitigate the problem.

A reference link is here: https://www.tensorops.ai/post/what-are-quantized-llms

LLM Quantization

The theory behind LLM quantization is quite simple. The parameters/weights of a model are normally float32 or float16 during training. If we keep the same precision when we load the model into memory, it requires a huge amount of resources.

But if we reduce the precision a bit, with a small trade-off in quality, we can save a huge amount of memory, and the LLM will run faster.
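To get a feel for the savings, here is a quick back-of-the-envelope calculation (it counts the weights only; real usage adds activations, the KV cache, and framework overhead):

    # Memory needed just to hold the weights of a 7B-parameter model.
    params = 7e9

    for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
        gigabytes = params * bits / 8 / 1e9
        print(f"{name}: ~{gigabytes:.1f} GB")

    # Prints roughly: float32 ~28 GB, float16 ~14 GB, int8 ~7 GB, int4 ~3.5 GB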

Some researchers have reduced the precision of the weights to INT4 and seen no big drop in performance in experiments on the LLAMA models.

This is one of the ways to reduce the size of LLM models so that we get acceptable quality and speed. Most importantly, it means we can run an LLM on a local machine or an ordinary server.

The initial work on LLM quantization started from llama.cpp, whose author is trying to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud.

After you download the model weights, and with proper setup, you can convert them into a quantized version in the .gguf format.

After this, you can run the model easily through the C++, command-line, or Python interfaces, or bindings for other languages; the open-source community is actively working on these.
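As one example, a minimal sketch with the llama-cpp-python bindings could look like this (the model path is hypothetical; point it at whichever quantized file you produced or downloaded):

    # Run a quantized GGUF model through the llama-cpp-python bindings.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

    result = llm(
        "Q: Explain quantization in one sentence. A:",
        max_tokens=100,
        stop=["Q:"],
    )
    print(result["choices"][0]["text"])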

Starting from this, model compression (which includes quantization) has attracted a lot of attention, and more quantization methods have been proposed to maintain both quality and inference speed for LLMs, for example AWQ, GPTQ, etc.
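As another easy-to-try illustration of the same idea from Python, the transformers library can load a model with 4-bit bitsandbytes quantization. This is a different method from AWQ and GPTQ (which ship their own tooling), and the sketch below assumes a CUDA GPU, the bitsandbytes package, and an example model name:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_name = "meta-llama/Llama-2-7b-chat-hf"  # example model

    # Quantize the weights to 4-bit on the fly while loading.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
    )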

After quantization, the model is supposed to run with a minimal setup, which means a normal computer with just a CPU can also host an LLM. This makes Nvidia nervous.

So Nvidia offers Nvidia TensorRT and Nvidia TensorRT-LLM, which have quantization methods built in and run them on the GPU.

To be clear, llama.cpp can also offload layers to the GPU on your compute node; it depends on your configuration, and CPU-only can also be sufficient, depending on the performance you require.
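In llama-cpp-python, for example, that choice comes down to the n_gpu_layers argument (paths are hypothetical): 0 keeps every layer on the CPU, a positive number offloads that many layers to the GPU (if the library was built with GPU support), and -1 offloads all of them.

    from llama_cpp import Llama

    # CPU-only inference.
    cpu_only = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=0)

    # Offload every layer to the GPU, if available.
    gpu_offload = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=-1)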

Thanks to quantization, we can now host an LLM even on a Raspberry Pi. It is not perfect, but it is sufficient for us to explore whether there are LLMs that fit various application scenarios.
