Version: 2.1.0

Fine-tune an open-source LLM with llama.cpp

Fine-tune LLMs

You could fine-tune an open-source LLM to

Teach it to follow conversations.
Teach it to respect and follow instructions.
Make it refuse to answer certain questions.
Give it a specific "speaking" style.
Make it response in certain formats (e.g., JSON).
Give it focus on a specific domain area.
Teach it certain knowledge.

To do that, you need to create a set of question and answer pairs to show the model the prompt and the expected response. Then, you can use a fine-tuning tool to perform the training and make the model respond the expected answer for each question.

How to fine-tune an open-source LLM with llama.cpp

The popular llama.cpp tool comes with a finetune utility. It works well on CPUs! This fine-tune guide is reproduced with permission from Tony Yuan's Finetune an open-source LLM for the chemistry subject project.

Build the fine-tune utility from llama.cpp

The finetune utility in llama.cpp can work with quantized GGUF files on CPUs, and hence dramatically reducing the hardware requirements and expenses for fine-tuning LLMs.

Check out and download the llama.cpp source code.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Build the llama.cpp binary.

mkdir build
cd build
cmake ..
cmake --build . --config Release

If you have NVIDIA GPU and CUDA toolkit installed, you should build llama.cpp with CUDA support.

mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
cmake --build . --config Release

Get the base model

We are going to use Meta's Llama2 chat 13B model as the base model. Note that we are using a Q5 quantized GGUF model file directly to save computing resources. You can use any of the Llama2 compatible GGUF models on Hugging Face.

cd .. # change to the llama.cpp directory
cd models/
curl -LO https://huggingface.co/gaianet/Llama-2-13B-Chat-GGUF/resolve/main/llama-2-13b-chat.Q5_K_M.gguf

Create a question and answer set for fine-tuning

Next we came up with 1700+ pairs of QAs for the chemistry subject. It is like the following in a CSV file.

Question	Answer
What is unique about hydrogen?	It's the most abundant element in the universe, making up over 75% of all matter.
What is the main component of Jupiter?	Hydrogen is the main component of Jupiter and the other gas giant planets.
Can hydrogen be used as fuel?	Yes, hydrogen is used as rocket fuel. It can also power fuel cells to generate electricity.
What is mercury's atomic number?	The atomic number of mercury is 80
What is Mercury?	Mercury is a silver colored metal that is liquid at room temperature. It has an atomic number of 80 on the periodic table. It is toxic to humans.

We used GPT-4 to help me come up many of these QAs.

Then, we wrote a Python script to convert each row in the CSV file into a sample QA in the Llama2 chat template format. Notice that each QA pair starts with <SFT> as an indicator for the fine-tune program to start a sample. The result train.txt file can now be used in fine-tuning.

Put the train.txt file in the llama.cpp/models directory with the GGUF base model.

Finetune!

Use the following command to start the fine-tuning process on your CPUs. I am putting it in the background so that it can run continuously now. It could take several days or even a couple of weeks depending on how many CPUs you have.

nohup ../build/bin/finetune --model-base llama-2-13b-chat.Q5_K_M.gguf --lora-out lora.bin --train-data train.txt --sample-start '<SFT>' --adam-iter 1024 &

You can check the process every few hours in the nohup.out file. It will report the loss for each iteration. You can stop the process when the loss goes consistently under 0.1.

Note 1 If you have multiple CPUs (or CPU cores), you can speed up the fine-tuning process by adding a -t parameter to the above command to use more threads. For example, if you have 60 CPU cores, you could do -t 60 to use all of them.

Note 2 If your fine-tuning process is interrupted, you can restart it from checkpoint-250.gguf. The next file it outputs is checkpoint-260.gguf.

nohup ../build/bin/finetune --model-base llama-2-13b-chat.Q5_K_M.gguf --checkpoint-in checkpoint-250.gguf --lora-out lora.bin --train-data train.txt --sample-start '<SFT>' --adam-iter 1024 &

Merge

The fine-tuning process updates several layers of the LLM's neural network. Those updated layers are saved in a file called lora.bin and you can now merge them back to the base LLM to create the new fine-tuned LLM.

../build/bin/export-lora --model-base llama-2-13b-chat.Q5_K_M.gguf --lora lora.bin --model-out chemistry-assistant-13b-q5_k_m.gguf

The result is this file.

curl -LO https://huggingface.co/juntaoyuan/chemistry-assistant-13b/resolve/main/chemistry-assistant-13b-q5_k_m.gguf

Note 3 If you want to use a checkpoint to generate a lora.bin file, use the following command. This is needed when you believe the final lora.bin is an overfit.

../build/bin/finetune --model-base llama-2-13b-chat.Q5_K_M.gguf --checkpoint-in checkpoint-250.gguf --only-write-lora --lora-out lora.bin

Fine-tune LLMs

How to fine-tune an open-source LLM with llama.cpp

Build the fine-tune utility from llama.cpp​

Get the base model​

Create a question and answer set for fine-tuning​

Finetune!​

Merge​

Build the fine-tune utility from llama.cpp

Get the base model

Create a question and answer set for fine-tuning

Finetune!

Merge