Training & Fine-Tuning Your LLMs

Part 6 of a 13-part series about LLMs

Janani Srinivasan Anusha
8 min read · Dec 25, 2024

1. Introduction

In this part of the series, we will dive deep into the practical aspects of training and fine-tuning large language models (LLMs). This guide covers essential topics such as dataset preparation, pretraining techniques, and fine-tuning strategies, with a focus on optimization methods to enhance model performance. By the end of this guide, you will have functional code snippets and a comprehensive understanding of:

  • Pretraining your own LLM from scratch
  • Fine-tuning pretrained models for specific tasks
  • Implementing optimizers and learning rate schedulers

2. Environment Setup

Before starting with pretraining or fine-tuning, ensure your development environment is correctly configured. This minimizes compatibility issues and ensures seamless execution of the code.

# Create a virtual environment
python -m venv llm_env
source llm_env/bin/activate # On Windows use llm_env\Scripts\activate
# Install dependencies
pip install torch transformers datasets accelerate

Explanation:

  • python -m venv llm_env: Creates a virtual environment to isolate dependencies.
  • source llm_env/bin/activate: Activates the virtual environment.
  • pip install: Installs PyTorch (core ML library), Hugging Face Transformers (model architectures), Datasets (dataset loading), and Accelerate (distributed and mixed-precision training).
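
As a quick sanity check, the following snippet (run inside the activated environment) confirms that the libraries import correctly and whether a GPU is visible to PyTorch; the printed versions will depend on when you install.

import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())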

3. Pretraining from Scratch

Pretraining teaches the model basic language structures by exposing it to vast text corpora. Hugging Face’s transformers and datasets simplify this process.

3.1 Dataset Preparation

from datasets import load_dataset
from transformers import AutoTokenizer

# Load a dataset
raw_dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# Pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply tokenization
tokenized_datasets = raw_dataset.map(tokenize_function, batched=True)

Explanation:

  • load_dataset: Loads WikiText (language modeling dataset).
  • AutoTokenizer: Automatically fetches a pretrained tokenizer.
  • map: Applies tokenization to every example in the dataset, ensuring text is split into tokens that models can interpret.
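
A quick way to confirm the preprocessing worked is to inspect a single processed example; the snippet below is just a sanity check on the tokenized_datasets object created above.

example = tokenized_datasets["train"][0]
print(example.keys())             # 'text' plus 'input_ids', 'token_type_ids', 'attention_mask'
print(len(example["input_ids"]))  # padded to the tokenizer's max length (512 for BERT)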

3.2 Masked Language Modeling (MLM)

from transformers import BertForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Load BERT for MLM
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# MLM data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    num_train_epochs=3
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator
)
trainer.train()

Explanation:

  • BertForMaskedLM: Loads BERT for Masked Language Modeling. Because from_pretrained pulls in the published weights, this run is continued pretraining rather than training from scratch (see the note below).
  • DataCollatorForLanguageModeling: Automatically handles masking of tokens.
  • Trainer: Manages the training process with pre-defined training arguments.
  • trainer.train(): Begins model training.
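
To pretrain truly from scratch, initialize the model from a configuration with random weights instead of loading pretrained ones; a minimal sketch:

from transformers import BertConfig, BertForMaskedLM

# Randomly initialized BERT (no pretrained weights) for genuine from-scratch pretraining
config = BertConfig()                     # default bert-base-sized architecture
scratch_model = BertForMaskedLM(config)

# The data collator, TrainingArguments, and Trainer setup above stay the same;
# just pass scratch_model instead. Expect from-scratch pretraining to need far
# more data and compute than WikiText-2 provides.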

4. Fine-Tuning for Specific Tasks

Fine-tuning adapts a pre-trained model to a specific task, such as sentiment analysis, text classification, or named entity recognition. Starting from pre-trained weights gives you task-specific performance on a variety of downstream tasks without having to train from scratch.

4.1 Sentiment Analysis

Sentiment analysis classifies text into categories based on sentiment, such as positive or negative. Below is an example of fine-tuning BERT for a binary sentiment analysis task using the IMDB dataset.

from transformers import AutoTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained model and tokenizer
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load IMDB dataset
dataset = load_dataset("imdb")

# Preprocessing function
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply preprocessing to the dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer
)

# Train the model
trainer.train()

Explanation:

  • BertForSequenceClassification: Fine-tunes BERT for binary sentiment classification.
  • load_dataset("imdb"): Loads the IMDB dataset, a common dataset for sentiment analysis tasks.
  • map(preprocess_function): Tokenizes the dataset for training.
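
After training, the fine-tuned model can be tried out directly through a pipeline. A small usage sketch (label names stay at the default LABEL_0/LABEL_1 unless you set id2label on the model config; for IMDB, label 1 means positive):

from transformers import pipeline

# Run the fine-tuned model on new text
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("This movie was an absolute delight to watch."))
# -> a list like [{'label': 'LABEL_1', 'score': ...}]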

4.2 Text Classification (Multi-Class)

Text classification can be used to categorize text into multiple classes, such as news article topics (sports, politics, etc.) or product types (electronics, furniture, etc.). Here’s how you can fine-tune a BERT model for a multi-class classification task using the AG News dataset.

from transformers import AutoTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained model and tokenizer
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)  # 4 classes for AG News
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load AG News dataset
dataset = load_dataset("ag_news")

# Preprocessing function
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply preprocessing to the dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer
)

# Train the model
trainer.train()

Explanation:

  • BertForSequenceClassification: Fine-tunes BERT for multi-class classification (4 categories in this case for AG News).
  • load_dataset("ag_news"): Loads the AG News dataset, which consists of news articles labeled into 4 categories.
  • map(preprocess_function): Tokenizes the dataset for training.

4.3 Named Entity Recognition (NER)

Named Entity Recognition (NER) involves identifying and classifying named entities in text into predefined categories, such as persons, organizations, locations, and more. Here’s how to fine-tune BERT for NER using the CoNLL-03 dataset, a standard NER dataset.

from transformers import AutoTokenizer, BertForTokenClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained model and tokenizer for NER
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)  # 9 entity labels in CoNLL-03
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load CoNLL-03 dataset
dataset = load_dataset("conll2003")

# Preprocessing function for token classification:
# the word-level ner_tags must be re-aligned to BERT's subword tokens
def preprocess_function(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, padding="max_length", is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # Special tokens and padding get -100 so the loss ignores them
        all_labels.append([-100 if word_id is None else word_labels[word_id] for word_id in word_ids])
    tokenized_inputs["labels"] = all_labels
    return tokenized_inputs

# Apply preprocessing to the dataset
encoded_dataset = dataset.map(preprocess_function, batched=True)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer
)

# Train the model
trainer.train()

Explanation:

  • BertForTokenClassification: Fine-tunes BERT for token-level classification (NER).
  • load_dataset("conll2003"): Loads the CoNLL-03 dataset, which is annotated for NER tasks.
  • map(preprocess_function): Tokenizes the dataset and aligns the word-level NER labels with BERT's subword tokens (special tokens and padding get -100 so the loss ignores them).

5. Fine-Tuning Techniques & Applications

Fine-tuning allows flexibility depending on the availability of data and computational resources. Beyond full fine-tuning, parameter-efficient fine-tuning (PEFT), and feature extraction (a minimal sketch of which follows), several newer methods adapt pre-trained models to specific tasks even more efficiently.
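
Of these, feature extraction is the simplest: the pretrained encoder is frozen and only a newly added task head is trained. A minimal sketch, reusing the sentiment-analysis setup from Section 4.1:

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Feature extraction: freeze the BERT encoder, train only the classification head
for param in model.bert.parameters():
    param.requires_grad = False

# The Trainer setup from Section 4.1 can be reused unchanged;
# only the randomly initialized classifier weights are updated.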

5.1 LoRA (Low-Rank Adaptation)

LoRA is a cutting-edge technique aimed at reducing the number of trainable parameters in a model. Instead of fine-tuning all model parameters, LoRA focuses on adapting low-rank matrices inserted into the model’s layers. By introducing these low-rank matrices, LoRA allows the model to learn task-specific features while keeping most of the model parameters frozen, making it more computationally efficient. This technique is particularly valuable for models with a large number of parameters, as it reduces the training burden while preserving performance.

Implementation Example:

from peft import LoraConfig, TaskType, get_peft_model
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# LoRA config: r is the rank of the low-rank update matrices
lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.1)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only the LoRA matrices (plus the classifier head) are trainable
trainer.model = peft_model
trainer.train()

LoRA is particularly useful when fine-tuning on smaller datasets or when computational resources are limited but you still need the flexibility of fine-tuning.

For more details on LoRA, see the original paper, "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021).

Additional fine-tuning techniques:

5.2 Adapter-Tuning (Adapters)

Adapter tuning introduces small, lightweight modules between layers of the pre-trained model. These adapters are trainable while the rest of the model remains frozen. Adapter modules are small, and thus, they require fewer resources for training. This technique is especially valuable when you want to fine-tune a model on multiple tasks, as you can add task-specific adapters without retraining the entire model.

Implementation Example:

# Note: adapter support is not part of vanilla transformers; this requires the
# `adapters` library (formerly adapter-transformers): pip install adapters
import adapters
from adapters import AdapterConfig
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
adapters.init(model)  # attach adapter support to the plain transformers model
adapter_config = AdapterConfig.load("pfeiffer")  # bottleneck adapter configuration
model.add_adapter("task_adapter", config=adapter_config)
model.train_adapter("task_adapter")  # freezes the base weights, trains only the adapter
trainer.model = model
trainer.train()

Adapter tuning can be more efficient than traditional full fine-tuning because it allows the same model to be adapted for different tasks by simply adding or modifying the adapter modules without affecting the underlying pre-trained weights.
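
For example, with the adapters library (assumed here, as in the adjusted example above), a second task-specific adapter can be added and activated without touching the shared BERT weights:

# Assumes the adapters-enabled `model` and `adapter_config` from the example above
model.add_adapter("second_task_adapter", config=adapter_config)

# Switch training and inference to the new adapter; the base model stays frozen
model.train_adapter("second_task_adapter")
model.set_active_adapters("second_task_adapter")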

For further reading, see the adapter paper, "Parameter-Efficient Transfer Learning for NLP" (Houlsby et al., 2019).

5.3 Prompt Tuning

Prompt tuning is a technique where you prepend or append learned “prompts” to the input text in order to influence the model’s response. The prompts are trainable embeddings that guide the model to produce the desired output without modifying the original model parameters. This approach can be particularly useful when you don’t want to fine-tune the entire model but still want to improve its performance on specific tasks.

Implementation Example:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Example prompt tuning: 5 learnable "virtual token" embeddings in the model's embedding space
prompt = torch.nn.Parameter(torch.randn(1, 5, model.config.n_embd))

input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")
# Embed the real tokens, then prepend the learnable prompt embeddings
input_embeds = model.transformer.wte(input_ids)
inputs_with_prompt = torch.cat([prompt, input_embeds], dim=1)
output = model(inputs_embeds=inputs_with_prompt)

This method provides a very lightweight way to adjust the behavior of the model, especially useful for few-shot learning tasks where you need the model to perform well without extensive retraining.

For further exploration, see the prompt tuning paper, "The Power of Scale for Parameter-Efficient Prompt Tuning" (Lester et al., 2021).

5.4 BitFit

BitFit is a parameter-efficient fine-tuning technique where only the bias terms in the model are fine-tuned, while the rest of the parameters remain frozen. This approach dramatically reduces the number of trainable parameters while maintaining high performance. It’s especially useful when computational resources are limited and when the task requires minimal adaptation from the pre-trained model.

Implementation Example:

from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Load pre-trained model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze all parameters except biases
# (the randomly initialized classifier head is also left trainable, as in the BitFit paper)
for name, param in model.named_parameters():
    if "bias" not in name and "classifier" not in name:
        param.requires_grad = False

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results"),
    train_dataset=train_dataset,  # a tokenized training set, e.g. from Section 4.1
    eval_dataset=eval_dataset
)
trainer.train()

While BitFit requires minimal computation, it has been shown to be effective in certain scenarios, especially when the dataset is small and the task does not require extensive model adaptation.
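
To see how dramatic the reduction is, you can count the trainable parameters after the freezing loop above; for bert-base-uncased only roughly 0.1% of the weights remain trainable (the exact figure depends on the model and head).

# Count trainable vs. total parameters after freezing everything but the biases
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")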

To read more, see the BitFit paper, "BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models" (Ben Zaken et al., 2022).

6. Optimization Techniques

The choice of optimizer and learning-rate schedule has a large impact on how quickly, and how well, an LLM converges during training and fine-tuning.

6.1 AdamW Optimizer

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

  • AdamW: A variant of the Adam optimizer with decoupled weight decay, the standard choice for training transformers.

6.2 Learning Rate Scheduling

from transformers import get_scheduler

num_training_steps = len(trainer.get_train_dataloader()) * training_args.num_train_epochs
scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

Explanation:

  • get_scheduler: Builds a schedule that linearly decreases the learning rate over the course of training.
  • scheduler.step(): Updates the learning rate after each optimizer step; the Trainer makes this call for you if you pass the scheduler in, as shown below.
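
When using the Trainer, you can hand both objects over directly and it will call optimizer.step() and scheduler.step() after every training step; in a manual loop you would make those calls yourself. A minimal sketch, reusing the model, datasets, and training arguments from Section 4 and the optimizer and scheduler defined above:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    optimizers=(optimizer, scheduler)  # Trainer steps both after each batch
)
trainer.train()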

7. Notable Pre-trained Models

  • BERT (bert-base-uncased): A powerful pre-trained model designed to understand the context of words in both directions. Ideal for tasks like sentence classification, NER, and question answering.
  • GPT-2 (gpt2): An autoregressive model that excels at generating coherent, human-like text. Great for text generation and creative writing.
  • T5 (t5-small): A text-to-text transformer that treats all tasks as a form of text generation. It’s versatile, supporting tasks like summarization, question answering, and classification.
  • RoBERTa (roberta-base): An improved BERT variant trained longer, on more data, with a refined pretraining recipe. Used for the same kinds of tasks as BERT but often delivers better results.
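
All of these checkpoints load through the same Auto* pattern used throughout this guide; a quick sketch (choose the head class that matches your task):

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
)

# Same loading pattern for every checkpoint; only the head class changes
bert = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
roberta = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")      # autoregressive text generation
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # text-to-text tasks

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # each model needs its matching tokenizer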

8. Conclusion

Training and fine-tuning LLMs are central to customizing models for specialized tasks. As discussed, LoRA, Adapter-Tuning, Prompt Tuning, and BitFit represent a range of strategies that balance computational efficiency with task-specific performance. Whether you’re fine-tuning on large datasets or working with limited resources, these techniques provide a way to get the most out of your pre-trained models.

In the next part of this series, Part 7: Major LLMs on the Market Today, we will explore some of the most prominent LLMs available today, comparing their architectures and ideal applications.
