Stormin' The Castle


Finetuning Redpajama (OpenLlama)

by John Robinson @johnrobinsn

Llama, the large language model from Meta, was effectively released to the world a little over two months ago, resulting in an explosion of experimentation, exploration and innovation. Unfortunately, Llama has a fairly restrictive "research-only" license, which prohibits commercial use.

RedPajama is an effort to create an open-source clone of Llama. The project aims to create leading open-source language models and starts by reproducing the LLaMA training dataset of over 1.2 trillion tokens. Leveraging this dataset, the team has just released initial versions of their 3B and 7B RedPajama models. Both come in a number of flavors, including a base model and a couple of finetuned variants (an instruction-tuned variant and a chat-tuned variant). All of these have been released under the liberal Apache-2.0 open source license, making them suitable for commercial applications.

Out of the box, the base model (given a sequence of words) has been trained to "statistically" predict what word should come next based on the very large dataset described above. As such, the language model has in a way compressed and encoded much of the information contained within that dataset and when prompted can regenerate information that it has learned, one token at a time. These generated sequences echo many of the ideas, concepts and "thoughts" that were encoded in the dataset, reflecting thoughts about society, humanity, general knowledge and more. You can sort of think of the base model as having compressed a large portion of the Internet into a set of a few billion numbers.
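To make "predict what word should come next" concrete, here is a toy sketch. It is not how RedPajama works internally (a real LLM uses a neural network over subword tokens); it only counts which word follows which in a tiny made-up corpus and then generates greedily, one word at a time. The names train_bigram and generate_greedy are mine, purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def generate_greedy(counts, start, max_tokens=5):
    """Repeatedly append the most frequent follower of the last word."""
    out = [start]
    for _ in range(max_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last word was never followed by anything in the corpus
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

corpus = "the cat sat on the mat and the cat sat on the rug"
bigram_counts = train_bigram(corpus)
print(generate_greedy(bigram_counts, "cat", max_tokens=3))  # cat sat on the
```

Everything the toy "knows" is whatever it saw in its corpus, which is the same intuition as the compression analogy above, just at a vastly smaller scale.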

But while predicting what word comes next is a powerful capability, it is fairly limited in practical application. Enter the process of alignment or finetuning. As you might imagine, the base model has been exposed to many different patterns of written language, ranging from freeform prose to poetry, Q&As, dialogue, etc. Finetuning is about bringing one or more of these latent language patterns to the surface, so that the language model performs better at a specific desired behavior or task. Common downstream tasks that large language models have proven to be good at include instruction following and summarization.

In this notebook, I'll demonstrate finetuning the RedPJ model into an "Instruction Following" variant using the Alpaca dataset. Once RedPJ has been finetuned for the task of Instruction Following, we'll expect our model to be able to take in a natural language instruction and generate a coherent response to it. My goal here is two-fold: one, to demonstrate the mechanics of how to finetune an LLM on a dataset of your own choosing, and two, to show the profound impact that finetuning can have on the behavior of an LLM.

There are several approaches that we can take to finetuning an LLM. Full finetuning involves using backpropagation to iteratively modify all of the model weights, which can be quite resource intensive from both a compute and a memory perspective. Another popular approach to finetuning is through the use of LoRA adapters. LoRA is a technique that allows you to finetune a large language model on a much smaller GPU. It does this by freezing the base model weights and injecting trainable "adapter" layers into the model. The number of trainable parameters in these adapter layers is much smaller than in the base model. We'll also use int8 quantization of the base model weights to further reduce the amount of memory needed for training.

One additional benefit: since LoRA adapters are a separate set of weights from the base model and are much smaller, you can keep just one copy of the large base model weights and dynamically load different LoRA adapter weights to support multiple downstream tasks on the same shared base model.
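Here is a minimal pure-Python sketch of the idea behind a LoRA adapter on a single linear layer. It is an illustration under my own assumptions (tiny made-up dimensions, helper names matvec and lora_forward), not the peft implementation: the frozen weight W is left alone, and a scaled low-rank product B·A is added on top.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base projection plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)               # frozen weights, never updated
    update = matvec(B, matvec(A, x))  # trainable adapter path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_out, d_in, r, alpha = 64, 64, 4, 16
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, small random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# B starts at zero, so the adapter is a no-op before training.
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))  # True

full_params = d_out * d_in        # what full finetuning would update
lora_params = r * (d_in + d_out)  # what the adapter trains instead
print(full_params, lora_params)   # 4096 512
```

The gap between the two counts widens as the layer gets larger, and because A and B live outside W, a different pair can be swapped in for a different task without touching the base weights.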

Later in this article, I'll demonstrate interacting with the base model before and after finetuning, so that you can see the effects of finetuning more clearly.

The HuggingFace ecosystem makes downloading, training, and saving models and datasets very quick and easy. This notebook heavily leverages the HuggingFace libraries and platform.

Note: This notebook should support any of the upcoming RedPajama model releases (bigger ones are coming... assuming you have enough VRAM on your GPU). I've trained and tested this notebook on both the 3B and 7B models that have been released so far.

We can start by defining a few variables that determine which specific base model we're finetuning (either the 3B or the 7B model).


Configure the base model and a few other variables that we'll use later.

model = '3B' #'7B' # Pick your poison

if model == '7B':
    model_name = ("togethercomputer/RedPajama-INCITE-Base-7B-v0.1","togethercomputer/RedPajama-INCITE-Base-7B-v0.1")
    run_name = 'redpj7B-lora-int8-alpaca'
    dataset = 'johnrobinsn/alpaca-cleaned'
    peft_name = 'redpj7B-lora-int8-alpaca'
    output_dir = 'redpj7B-lora-int8-alpaca-results'
else: #3B
    model_name = ("togethercomputer/RedPajama-INCITE-Base-3B-v1","togethercomputer/RedPajama-INCITE-Base-3B-v1")
    run_name = 'redpj3B-lora-int8-alpaca'
    dataset = 'johnrobinsn/alpaca-cleaned'
    peft_name = 'redpj3B-lora-int8-alpaca'
    output_dir = 'redpj3B-lora-int8-alpaca-results'



Install the required dependencies.

def install_dependencies():
    !pip install -Uqq git+
    !pip install -Uqq transformers datasets accelerate bitsandbytes
    !pip install -Uqq wandb

# uncomment the following line to install the required dependencies
# install_dependencies()

Note: If you just want to do inference you can jump all the way down to the "Evaluate" cell and start running from there to download my adapter weights from HF hub and try some prompts through the finetuned model.

But if you want to train keep going...

Setting Up Tracking and Monitoring using Weights and Biases

This notebook has support for logging the training run to Weights & Biases (wandb). This makes it very easy to track, monitor and annotate your training sessions from anywhere.

Run the next cell and follow the directions to authenticate with wandb.

report_to = "wandb" # "none"

if report_to != "none":
    import wandb
    wandb.login()

After authenticating, we have to initialize wandb. We add a few key-value pairs about the run to the information that will be logged to the wandb dashboard.

Note: You can add more key/values if you'd like.

"model": model_name[1],

Tracking run with wandb version 0.15.2

Run data is saved locally in /mnt2/code/redpajama/wandb/run-20230515_065006-xrvsv1au

Syncing run dutiful-wave-23 to Weights & Biases (docs)

View project at

View run at

After you get training started below, you can revisit the wandb links shown above to monitor the status of your training run from anywhere with Internet connectivity.

Note: I like to send the link (View run) to my phone so that I can monitor on the go...


The tokenizer converts words into a list/tensor of numbers so that the model can process them. Each language model has been trained using a specific tokenizer. If your base model is already supported by HuggingFace, then the transformers library makes it very easy to load the correct tokenizer for your model: use the AutoTokenizer class and specify the model name.

from transformers import AutoTokenizer

print("Loading tokenizer for model: ", model_name[1])
tokenizer = AutoTokenizer.from_pretrained(model_name[1],add_eos_token=True)
tokenizer.pad_token_id = 0

Loading tokenizer for model: togethercomputer/RedPajama-INCITE-Base-3B-v1

One problem that I've found with many of the finetuning scripts and notebooks found online is that the "end-of-stream" handling is not done correctly, so in many cases the finetuned models don't know when to stop emitting tokens and tend to "blabber" on. Since we are finetuning on an instruction following task, we would like the model to respond to the instruction prompt succinctly and then stop. There are a number of ways to approach this, but the way I approach it here is to explicitly add a new token to represent end-of-stream, <eos>, and use that token during training to teach the model when it should stop. Then during inference, we can use that token to recognize when the model is done responding.


eos_token_id: 50277

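CUTOFF_LEN = 256  # 256 accounts for about 96% of the data in the alpaca dataset

def tokenize(prompt, tokenizer, add_eos_token=True):
    result = tokenizer(
        prompt+"<eos>",  # add the end-of-stream token
        truncation=True,        # assumed: the remaining arguments are cut off in the source
        max_length=CUTOFF_LEN,  # assumed: truncate to CUTOFF_LEN tokens
    )
    return {
        "input_ids": result["input_ids"],
        "attention_mask": result["attention_mask"],
    }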

Let's give it a quick try and note the token id at the end of the sequence.

tokenizer('hi there<eos>')

{'input_ids': [5801, 627, 50277], 'attention_mask': [1, 1, 1]}
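The comment on CUTOFF_LEN says that 256 tokens covers about 96% of the dataset. A figure like that can be checked with a few lines of Python. This is a minimal sketch with made-up lengths (the helper name coverage and the sample numbers are mine, not measurements from Alpaca):

```python
def coverage(token_lengths, cutoff):
    """Fraction of examples whose tokenized length fits within `cutoff`."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    fits = sum(1 for n in token_lengths if n <= cutoff)
    return fits / len(token_lengths)

# Hypothetical token counts for ten examples (not measured from Alpaca).
lengths = [40, 75, 120, 180, 200, 230, 250, 256, 300, 512]
print(f"{coverage(lengths, 256):.0%} of examples fit in 256 tokens")  # 80% of examples fit in 256 tokens
```

With the real data, the lengths would come from tokenizing each templatized prompt and taking the length of its input_ids.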


When finetuning your model, the dataset that you choose has to be aligned with your downstream task. We're using a popular Instruction Following dataset called Alpaca. For convenience, I have a copy of the Alpaca dataset that has been cleaned and published on the HuggingFace hub. We can just download it and access it from cache using the load_dataset API shown below.

from datasets import load_dataset

# Load dataset from the hub
data = load_dataset(dataset)

Found cached dataset json (/mnt2/hfcache/johnrobinsn___json/johnrobinsn--alpaca-cleaned-59cdb2a1c179039a/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)

0%| | 0/1 [00:00<?, ?it/s]

train: Dataset({
features: ['output', 'instruction', 'input'],
num_rows: 51942

We can see that the dataset consists of 51,942 rows with the following features ['instruction','input','output']. Let's take a look at one.


{'output': 'Telegram',
'instruction': 'Identify the odd one out.',
'input': 'Twitter, Instagram, Telegram'}

Each item includes an 'instruction' to direct our model, an optional 'input' that provides context for the instruction, and an expected 'output' for the model.

Our goal in finetuning is to use this dataset to train our model to "behave" in a similar way: given an instruction, respond with an appropriate response, generalizing from the knowledge already encoded in the base model.

But we can't directly use this JSON object to train our model. Our model can only process an ordered sequence of tokens that represent words. So we use a "prompt template" to convert each of these JSON objects in our dataset into a sequence of words. The prompt template follows a consistent pattern.

def generate_prompt(data_point):
    # sorry about the formatting disaster gotta move fast
    if data_point["input"]:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Input:
{data_point["input"]}

### Response:
{data_point["output"]}"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Response:
{data_point["output"]}"""

Let's see what our example looks like when "templatized".


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Identify the odd one out.

### Input:
Twitter, Instagram, Telegram

### Response:
Telegram

The exact wording of the template is somewhat arbitrary. It's more of a consistent pattern that after training will drive the model into responding similarly when exposed to a similar prompt. You should be able to pick out the "instruction", "input", and "output" from the example.

It is important that the output from the dataset is at the end of the templatized prompt, since at inference time we will only provide the prompt up to but not including the output. We'll expect our model to respond to our instruction on its own.

We now split out a validation dataset from our training dataset so that we can track how well the finetuning process is learning to generalize to unseen prompts, and so that we only checkpoint our model when the validation loss is improving.

VAL_SET_SIZE = 2000  # we set aside 2000 items from our dataset for validation during training

train_val = data["train"].train_test_split(
    test_size=VAL_SET_SIZE, shuffle=True, seed=42
)
train_data = train_val["train"]
val_data = train_val["test"]

Loading cached split indices for dataset at /mnt2/hfcache/johnrobinsn___json/johnrobinsn--alpaca-cleaned-59cdb2a1c179039a/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-e7d69fe316bd0241.arrow and /mnt2/hfcache/johnrobinsn___json/johnrobinsn--alpaca-cleaned-59cdb2a1c179039a/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-1c2cc6dcfc59cba7.arrow

We prepare the training dataset and the validation dataset by running the data through the prompt templating process and then by tokenizing the prompts.

train_data = train_data.shuffle().map(lambda x: tokenize(generate_prompt(x), tokenizer))
val_data = val_data.shuffle().map(lambda x: tokenize(generate_prompt(x), tokenizer))

Map: 0%| | 0/49942 [00:00<?, ? examples/s]

Map: 0%| | 0/2000 [00:00<?, ? examples/s]

Load and Configure the Model for Training

Load the specified RedPajama base model from the HuggingFace hub.

Note: Llama, RedPajama and other decoder-only models are supported by the AutoModelForCausalLM class. For encoder-decoder models such as the google/t5 models, you'll need to use the AutoModelForSeq2SeqLM class, and the training details are a little bit different. Here is a similar notebook for finetuning t5* models.

from transformers import AutoModelForCausalLM

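print("Loading model for model: ", model_name[0])

model = AutoModelForCausalLM.from_pretrained(
    model_name[0],
    load_in_8bit=True,   # int8 weights via bitsandbytes, as described above
    device_map="auto",   # assumed: the remaining arguments are cut off in the source
)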

Loading model for model: togethercomputer/RedPajama-INCITE-Base-3B-v1

Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.


Now, we can prepare our model for the LoRA int-8 training using the HF peft library.

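from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

# Define LoRA Config
lora_config = LoraConfig(
    r=8,
    # The remaining arguments are cut off in the source. The two below are assumptions:
    # adapting only query_key_value is consistent with the trainable-parameter count printed below.
    target_modules=["query_key_value"],
    task_type=TaskType.CAUSAL_LM,
)

# prepare int-8 model for training
model = prepare_model_for_int8_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()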

trainable params: 2621440 || all params: 2778485760 || trainable%: 0.09434779323828531

Note: After installing the LoRA adapters into the model, notice the significant reduction in the number of trainable parameters.
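The trainable-parameter count in the log can be reproduced with a little arithmetic. The sketch below assumes that adapters of rank 8 are attached only to the query_key_value projection of each layer (my assumption, though it matches the reported number exactly) and takes the layer shapes from the 3B model summary printed later in this article.

```python
def lora_param_count(in_features, out_features, rank):
    """A rank-r adapter adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

# Shapes from the 3B model summary in this article:
# 32 layers, each with query_key_value: Linear(2560 -> 7680).
num_layers, in_features, out_features, rank = 32, 2560, 7680, 8

trainable = num_layers * lora_param_count(in_features, out_features, rank)
all_params = 2778485760  # "all params" value from the log line above

print(f"trainable params: {trainable}")                   # trainable params: 2621440
print(f"trainable%: {100 * trainable / all_params:.4f}")  # trainable%: 0.0943
```

So less than a tenth of a percent of the weights are being trained, which is why this fits on a much smaller GPU than full finetuning would.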

We'll leverage the training loop from the transformers library since it does a pretty good job with handling the details.

import transformers
eval_steps = 200
save_steps = 200
logging_steps = 20

trainer = transformers.Trainer(
report_to=report_to if report_to else "none",
data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),

model.config.use_cache = False # silence the warnings. Please re-enable for inference!
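The data collator is where the end-of-stream handling discussed earlier either works or quietly breaks. With mlm=False, DataCollatorForLanguageModeling copies the input ids into the labels and replaces every position equal to the pad token id with -100, which the loss ignores. The pure-Python sketch below imitates just that labeling rule (the function name collate_causal_lm is mine; the real collator works on tensors and the one-position shift happens inside the model):

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def collate_causal_lm(batch, pad_token_id):
    """Pad a batch of token-id lists and derive labels for causal-LM training."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; every position holding the pad id is ignored by the loss.
        labels.append([IGNORE_INDEX if t == pad_token_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

EOS_ID = 50277  # id of <eos> reported earlier in this article
batch = [[5801, 627, EOS_ID], [5801, EOS_ID]]

distinct_pad = collate_causal_lm(batch, pad_token_id=0)
shared_pad = collate_causal_lm(batch, pad_token_id=EOS_ID)
print(distinct_pad["labels"])  # [[5801, 627, 50277], [5801, 50277, -100]]
print(shared_pad["labels"])    # [[5801, 627, -100], [5801, -100, -100]]
```

When the pad id and the <eos> id are the same, <eos> disappears from the labels and the model is never trained to emit it. Keeping pad_token_id = 0 separate from the <eos> token, as this notebook does, avoids that.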


Run the training loop.


Save the Trained Adapter Model to Disk

Now that we've trained the model we'll want to save our weights. First I demonstrate how to save them to disk.

# Save our LoRA model & tokenizer results

# if you want to save the base model to disk call
# trainer.model.base_model.save_pretrained(peft_model_id)

Push the Trained Adapter Model to the HuggingFace Hub

Even better than saving your trained weights to disk, you can push them up to the HuggingFace Hub. This makes it super easy to share your trained adapter with others or to set up your model for inference on other devices.

!pip install -Uqq huggingface_hub
import huggingface_hub
# If you don't already have the git extensions for large file storage you might have to install it now.
# Here is how you can do this for Linux from the shell. For other OSs please refer to the git-lfs documentation.
# sudo apt install git-lfs
repo_id = f'{huggingface_hub.whoami()["name"]}/{peft_name}'

You should be able to check out HuggingFace and see your LoRA Adapter Model.

Free Up Memory

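Since we likely used a lot of memory during training, and we'll need that memory back to try the model out, we take a few steps to free up VRAM here.

import torch
import gc
config = None
model = None
gc.collect()               # collect the now-unreferenced Python objects
torch.cuda.empty_cache()   # release cached, unused CUDA memory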


Here we'll try out the model for inference.

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
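# load base LLM model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name[0],
    load_in_8bit=True,   # matches the int8 loading message shown below
    device_map="auto",   # assumed: the remaining arguments are cut off in the source
)
tokenizer = AutoTokenizer.from_pretrained(model_name[1])
tokenizer.pad_token_id = 0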


Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.

(gpt_neox): GPTNeoXModel(
(embed_in): Embedding(50432, 2560)
(layers): ModuleList(
(0-31): 32 x GPTNeoXLayer(
(input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
(post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
(attention): GPTNeoXAttention(
(rotary_emb): RotaryEmbedding()
(query_key_value): Linear8bitLt(in_features=2560, out_features=7680, bias=True)
(dense): Linear8bitLt(in_features=2560, out_features=2560, bias=True)
(mlp): GPTNeoXMLP(
(dense_h_to_4h): Linear8bitLt(in_features=2560, out_features=10240, bias=True)
(dense_4h_to_h): Linear8bitLt(in_features=10240, out_features=2560, bias=True)
(act): GELUActivation()
(final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
(embed_out): Linear(in_features=2560, out_features=50432, bias=False)

Here is the prompt template we'll use for inference.

Note: It's important that it's identical to the one we used for training above, except that it omits the "output/response", as our model will generate that for us.

def generate_prompt(data_point):
    # sorry about the formatting disaster gotta move fast
    if data_point["input"]:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Input:
{data_point["input"]}

### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Response:"""

Here is a small utility function that lets us easily prompt our model with an instruction and an optional input. It handles templating the prompt, tokenizing the templatized prompt, decoding the result and then finally stripping off the prompt from the response and just leaving us with the model response.

def generate(instruction,input=None,maxTokens=256):
    prompt = generate_prompt({'instruction':instruction,'input':input})
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
    outputs = model.generate(input_ids=input_ids, max_new_tokens=maxTokens,
                             do_sample=True, top_p=0.9,pad_token_id=tokenizer.eos_token_id)
    outputs = outputs[0].tolist()
    # Stop decoding when hitting the EOS token
    if tokenizer.eos_token_id in outputs:
        eos_index = outputs.index(tokenizer.eos_token_id)
        decoded = tokenizer.decode(outputs[:eos_index])
        # Don't show the prompt template
        sentinel = "### Response:"
        sentinelLoc = decoded.find(sentinel)
        if sentinelLoc >= 0:
            print(decoded[sentinelLoc+len(sentinel):])
        else:
            print('Warning: Expected prompt template to be emitted. Ignoring output.')
    else:
        print('Warning: no <eos> detected ignoring output')
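The post-processing in that utility is easy to try in isolation. Below is a minimal sketch of the same two steps (cut at the first end-of-stream token, then drop everything up to the "### Response:" sentinel) using a toy decoder in place of a real tokenizer. The function name extract_response, the toy vocabulary and its ids are all made up for illustration.

```python
def extract_response(token_ids, eos_token_id, decode, sentinel="### Response:"):
    """Cut generated ids at the first <eos>, decode, and drop everything up to the sentinel.

    Returns None when no <eos> is present or the sentinel is missing.
    """
    if eos_token_id not in token_ids:
        return None
    text = decode(token_ids[:token_ids.index(eos_token_id)])
    start = text.find(sentinel)
    if start < 0:
        return None
    return text[start + len(sentinel):].strip()

# Toy vocabulary standing in for a real tokenizer (hypothetical ids).
toy_vocab = {1: "### Instruction:\nSay hi\n\n", 2: "### Response:", 3: "\nHello!", 4: " ignored", 9: "<eos>"}

def toy_decode(ids):
    """Join the text pieces for a list of toy token ids."""
    return "".join(toy_vocab[i] for i in ids)

print(extract_response([1, 2, 3, 9, 4], eos_token_id=9, decode=toy_decode))  # Hello!
```

Anything the model emits after the end-of-stream token (the " ignored" piece here) is discarded, which is exactly why teaching the model to emit that token matters.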

Generating using the Base Model

This demonstrates the behavior of the RedPajama model with no finetuning applied.


generate('Write a short story in third person narration about a protagonist who has to make an important career decision.',maxTokens=300)

Write a short story in third person narration about a protagonist who has to make an important career decision. The protagonist’s character is presented from the point of view of the protagonist. The first paragraph should describe a decision the protagonist made. In the second paragraph, the reader should learn more about the protagonist and why she made this decision. In the last paragraph, the reader should learn more about what the protagonist decided.

### Examples:
Write about a character from a novel who makes an important decision.
Write about a character from a film that makes an important decision.
Write about a character from a television show that makes an important decision.

# Writing Prompt 13: Write a Short Story

### Instructions:
In the following prompt, write a short story in third person narration. The story can take place in the past or in the present. Write a story that contains:

- An unreliable narrator
- A dramatic situation
- A situation that takes place during a specific time

### Response:
Write a short story in third person narration about a character who finds an important item. The character finds the item during a specific time. The story contains the following characteristics:

- An unreliable narrator
- A dramatic situation
- A situation that takes place during a specific time

### Examples:
Write a short story about a character who finds an important item during a specific time.

Load the LoRA Adapter

As you can see, the generated text doesn't seem very responsive to the prompt. Now let's load the trained LoRA adapter and see what happens.

Note: Here you can either load my pretrained LoRA adapter from the HuggingFace hub or, if you trained your own adapter above, uncomment the specified line below to load your adapter from disk.

peft_model_id = f'johnrobinsn/{peft_name}' # By default use my pretrained adapter weights
#peft_model_id = peft_name # Uncomment to use locally saved adapter weights if you trained above

# Load the LoRA model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})

print("Peft model adapter loaded")

Peft model adapter loaded

Let's try the same prompt again.


generate('Write a short story in third person narration about a protagonist who has to make an important career decision.',maxTokens=300)

John had been thinking about what he should do with his life for a while. He was an ambitious man, but his job responsibilities were limited and he was unsure how to advance his career. He was torn between staying with the company he had worked for for several years or leaving to pursue his own dreams. He had a hard time making the right decision.

One day, he was sitting at his desk, thinking. Suddenly, a bright idea came to him. He decided to create a blog about his own ideas and inspirational stories. He was sure he could make this work as a side hustle. He knew it would be difficult to get started, but he was determined to try.

He decided to start his blog by setting up a website and putting up his first few posts. He had no idea what to write or how it would turn out. He was nervous, but he knew this was a new beginning and he was ready to try something new.

As the weeks went by, he wrote about various subjects and topics. He soon realized that writing and posting on his blog was a great outlet for him. He could talk to people about the ideas and insights he had and it felt good to have something to focus on.

One day, he received an unexpected message from someone on his blog. They had read one of his posts and they were so inspired by it that they were inspired to pursue their own ideas. They wanted to connect with John

As you can see, this response is much more responsive to the provided instruction.

A Few More Prompts

generate('Who was the first man to walk on the moon and tell me where he was born.')

Neil Armstrong was the first man to walk on the moon and he was born in 1930 in Wapakoneta, Ohio.

In this example, we provide not only an instruction but also some context for the instruction, which is a list of possible answers.

generate('Identify the odd one out','Twitter, Instagram, Telegram')


generate('Write a poem about about a cat',maxTokens=1000)

Soft fur, warm and bright,
Cats look so cute and cuddly.
Their gentle eyes show their true selves,
Filling your heart with warmth and love.

Their purring is a lullaby,
Calming and soothing you can't ignore.
Their laughter is a song of joy,
It brings out the happiest of days.

Their eyes are as bright as stars,
A light that brings joy that no-one can deny.
Their purrs are a symphony of sound,
Bringing love to every home.

So take a moment to pet a cat,
It's a purrfect experience, that's for sure.
They fill the room with sweet and soothing sound,
They bring light and joy in every corner.

Meh. Not that great... But Llama doesn't seem to be very good at poetry either in my experience. It would be worthwhile to see how the larger RedPJ models fare here... Still, lots of fun probing the limits of what works well and what doesn't.


I hope you've enjoyed this quick tour of finetuning. I've also included a "cleaned" version of this notebook in the github repo without the blog narrative.

If you'd like to try your hand at a different finetuning task, you could give summarization a try. Please check out the samsum summarization dataset on HF. Primarily, you'll need to adjust the prompt templates during training and inference.

Please like my content on twitter, @johnrobinsn


John Robinson © 2022-2023