Stormin' The Castle


Alpaca Finetuning of Llama on a 24G Consumer GPU

by John Robinson @johnrobinsn

Baby Alpaca

Llama, the large language model released by Meta AI just a month ago, has been getting a lot of attention over the past few weeks despite having a research-only license. It was released in several sizes: a 7B, a 13B, a 30B, and a 65B model (B is for a billion parameters!). These massive language models have been trained on troves of Internet data; in their raw form they take a small bit of input text and predict the text that is likely to come next.

But in a recent paper, Stanford's Tatsu Lab shows how to finetune the Llama models into an instruction-following variant, somewhat like ChatGPT, where you can give the model a prompt in the form of a question or instruction and it will produce a response that is likely to be correct. For example:

Prompt: Give three tips for staying healthy.
Response: 
1. Eat a balanced diet and make sure to include plenty of fruits and vegetables.
2. Exercise regularly to keep your body active and strong.
3. Get enough sleep and maintain a consistent sleep schedule.

It's amazing that such large language models are available. But if you want to finetune these models for your own experiments or research, are they out of reach for the average data scientist or researcher? Even the 13B model at half precision (fp16) would take over 26G of VRAM for the weights alone (never mind the memory required for gradients and other training overhead), putting it beyond the reach of even high-end consumer GPUs.
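
To put some rough numbers on that, here is a quick back-of-the-envelope calculation of the weight memory alone (a sketch; real usage also includes activations, gradients, and optimizer state):

params = 13e9        # 13B parameters
bytes_fp16 = 2       # half precision (fp16): 2 bytes per parameter
bytes_int8 = 1       # 8-bit quantization: 1 byte per parameter

print(f"fp16 weights: ~{params * bytes_fp16 / 1e9:.0f} GB")  # ~26 GB
print(f"int8 weights: ~{params * bytes_int8 / 1e9:.0f} GB")  # ~13 GB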

Key Technical Advances

Thankfully, due to the heroic work of many researchers and engineers to reduce the memory required to work with these large models, you too can finetune them on a consumer-grade GPU. In this quick experiment overview, I describe how I finetuned the 13B Llama model into an instruction-following model on a single 24G consumer-grade GPU in about 18 hours, which is utterly amazing for a model of this size.

Note: You can find a used Nvidia 3090 with 24G of VRAM on Ebay for around $700.

Think about it... your very own LLM variant finetuned on your own data and on your own equipment.

Some of the key technical innovations that make this possible:

* LoRA (Low-Rank Adaptation), which keeps the large base model weights frozen and trains only a small set of adapter weights.
* 8-bit (int8) quantization of the base model weights, which greatly reduces the VRAM needed to hold the model.

A few details on my setup:

* 1x NVidia Titan RTX 24G
* 13B Llama model
* The cleaned Alpaca dataset.
* 18 hours of training time.

Finetuning Llama 13B on a 24G GPU

All of this, along with the training scripts for Alpaca-style finetuning, has been pulled together in the GitHub repository, Alpaca-Lora. Just download the repo using git clone and follow the instructions for setup.

Note: This is a forked repository with some minor deltas from the upstream. So make sure to keep an eye on the upstream repo too for new goodies! Things are moving fast!

The finetune.py script is configured by default to download and finetune the 7B Llama model, which takes only 8 hours to train. But if you want the 13B model like I did, you can pass the desired pretrained model name via a command-line argument. For example:

python finetune.py --data_path="./alpaca_data_cleaned.json" --model_pretrained_name="decapoda-research/llama-13b-hf"

Note: You might have to look around in Hugging Face's model repository if decapoda removes the Llama models from this location.

Note: The 30B model is still out of reach for a single 24G card using int8 quantization. But I talk more about 30B a little later in this post.

The finetune script will download the appropriate checkpoint and start training. It is conveniently set up to use Weights & Biases, so if you create an account there you can monitor the training process from your own dashboard.

The file "alpaca_data_cleaned.json" contains the cleaned-up version of the Alpaca dataset. This JSON file contains a series of instructions, each with a corresponding output that is an example of how a good "Alpaca" would respond. There are some great notes on how the dataset was cleaned here. Here is just one example:

    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
    },
    ...
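
During training, each record is flattened into a single prompt string before being tokenized. The sketch below approximates the Alpaca-style template used by the repo's finetune.py; the exact wording and helper name in the repo may differ slightly.

def generate_prompt(example: dict) -> str:
    """Render one dataset record into the text the model is trained on."""
    if example["input"]:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )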

The dataset includes over 52K such examples. By keeping the base model weights frozen and using supervised learning, the LoRA adapter weights can be trained on this dataset, with the goal that the model will generalize and respond appropriately to prompts it hasn't seen before.
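
For the curious, here is a minimal sketch of what that int8 + LoRA setup looks like, assuming the Hugging Face transformers, peft, and bitsandbytes libraries. The repo's finetune.py does the equivalent of this, and the hyperparameter values shown here are only illustrative.

from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Load the base model with its weights quantized to 8 bits; they stay frozen.
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = prepare_model_for_int8_training(model)

# Attach small trainable low-rank adapters to the attention projections.
lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # which Llama layers get adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the 13B is trainable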

Note: I didn't try adjusting any of the hyperparameters related to training, so it's likely that training time could be reduced a bit.

Inference

After training, you can try out your finetuned model using the generate.py script. The trained LoRA adapter weights are saved in the "lora-alpaca" directory by default.

python generate.py \
--path_to_lora_adapters="lora-alpaca" \
--pretrained_model="decapoda-research/llama-13b-hf"

Just type in a prompt and hit enter and you'll get a response from your finetuned Alpaca model.
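
Under the hood, generate.py loads the frozen base model, applies the saved LoRA adapters on top, and runs ordinary text generation. Here is a minimal sketch of that flow, again assuming the transformers and peft libraries (the repo's actual script adds more options):

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = "decapoda-research/llama-13b-hf"
tokenizer = LlamaTokenizer.from_pretrained(base)
model = LlamaForCausalLM.from_pretrained(base, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, "lora-alpaca")  # apply the trained adapters
model.eval()

prompt = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))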

Your own data

The whole point of this is not really to use an Alpaca model. For that, you can just download the Alpaca weights from the Internet and use them (no training required!). The point is for you to be able to do your own finetuning of the Llama models with your own data and instructions, allowing you to conduct your own experiments and explore this fascinating new technology.

Going Beyond 13B

If 13B parameters is not enough for you, folks have been pushing the envelope even further, not stopping at 8-bit quantization but pushing on to 4-bit quantization and beyond. This repository is allowing me to finetune the Llama 30B model on a single 24G GPU (it will probably take a couple of weeks to train). It's still training, but I'm hoping to have some results in the next week or so. I'll post an update when I do and let you know how it turns out.

The other path to 30B is to use two 24G GPUs, and at $700 a pop it's a fairly reasonable path to take.

Conclusion

Even though Llama has been getting the lion's share of attention lately, it is not the only game in town. There are several other large LLMs available that are truly open, such as the 20B GPT-NeoX model and the 20B Flan-UL2 model. Hopefully we will see some of this energy shift toward open-source LLMs such as these. As soon as I'm able to free up a GPU or two, that's where I'll be headed.



John Robinson © 2022-2023