Large Language Model Fine Tuning Techniques

Manu Suryavansh
6 min read · Oct 28, 2023
Image generated by Bing Image Creator (DALL·E 3)

Large language models (LLMs) like LLaMA 2 and Mistral-7B have demonstrated strong natural language abilities. However, their billions of parameters make them difficult to fine-tune for specific downstream tasks. In this post, we’ll explore techniques that enable efficient fine-tuning of these massive models, along with ways to increase their context length.

Parameter-Efficient Fine-Tuning (PEFT)

Parameter-efficient fine-tuning (PEFT) refers to methods that update only a small subset of parameters during training. This makes fine-tuning more practical compared to updating all parameters. Two promising PEFT techniques are LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation).

PEFT is also the name of a library from Hugging Face that implements various fine-tuning techniques such as LoRA, Prefix Tuning, and Prompt Tuning.

LoRA

LoRA is a PEFT technique that injects a small number of trainable low-rank adapter matrices into a pre-trained LLM while the original weights stay frozen. Only these adapters are trained on the new dataset, improving the LLM’s performance on a specific task at a fraction of the memory cost of full fine-tuning.

Source: https://huggingface.co/docs/peft/conceptual_guides/lora

Great Post on LoRA from HuggingFace —
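To make this concrete, here is a minimal sketch of attaching LoRA adapters to a causal language model with the PEFT library; the base model name and the hyperparameters (r, lora_alpha, target_modules) are illustrative choices, not recommendations:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Low-rank adapters are injected into the attention projections;
# only these small matrices are trained, the base weights stay frozen.
lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all parameters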

QLoRA

QLoRA is a variation of LoRA that quantizes the base LLM to 4 bits before adding the adapter layers, which further reduces the memory and compute requirements of fine-tuning. QLoRA was proposed by Dettmers et al. (the authors of the earlier LLM.int8() work) in May 2023, and it can be implemented using the bitsandbytes library.

Great post on QLoRA by HuggingFace —
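As a rough sketch (model name and settings are again illustrative), the QLoRA recipe combines 4-bit loading via bitsandbytes with LoRA adapters from PEFT:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit NF4 precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training and attach LoRA adapters on top.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)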

Proximal Policy Optimization (PPO)

PPO is a reinforcement learning (RL) algorithm that can be used to train a policy to generate text aligned with a desired reward function. In this setting the LLM itself acts as the policy: it generates completions for a batch of prompts, the reward function scores them, and PPO updates the model to increase the expected reward while constraining each update so the policy does not drift too far from its previous behaviour.

PPO has been shown to be effective at fine-tuning LLMs on a variety of tasks, including translation, summarization, and question answering. However, it can be complex to implement and tune.
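As a hedged sketch of what this looks like in code, here is a single PPO step with Hugging Face’s TRL library; the API has changed across TRL versions, this follows the 2023-era PPOTrainer interface, and the constant reward of 1.0 is a placeholder for a real reward model:

import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=2, mini_batch_size=2)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)  # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

prompts = ["The movie was", "I thought the food"]
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# Sample completions from the current policy.
response_tensors = ppo_trainer.generate(
    query_tensors, return_prompt=False, max_new_tokens=16, pad_token_id=tokenizer.eos_token_id
)

# Placeholder rewards; in practice these come from a reward model or another scoring function.
rewards = [torch.tensor(1.0) for _ in response_tensors]

# One PPO optimization step over this batch.
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)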

Direct Preference Optimization (DPO)

DPO is a simpler approach to preference fine-tuning than PPO. It does not require training a separate reward model; instead, it directly optimizes the LLM’s parameters so that completions preferred by humans become more likely than dispreferred ones.

DPO is trained on a dataset of human preference pairs, each consisting of a prompt and two possible completions, one preferred and one dispreferred. The LLM is then fine-tuned to maximize the likelihood of generating the preferred completions.
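A hedged sketch with TRL’s DPOTrainer (late-2023 API; the tiny in-memory dataset is purely illustrative):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")   # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Each row is one preference pair: a prompt, the preferred completion, and the dispreferred one.
train_dataset = Dataset.from_dict({
    "prompt":   ["Explain LoRA in one sentence."],
    "chosen":   [" LoRA trains small low-rank adapter matrices while the base model stays frozen."],
    "rejected": [" LoRA is a kind of rope."],
})

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,                               # temperature of the implicit reward
    args=TrainingArguments(output_dir="dpo-out", per_device_train_batch_size=1,
                           max_steps=10, remove_unused_columns=False),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=256,
    max_prompt_length=128,
)
trainer.train()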

Reinforcement Learning from Human Feedback (RLHF)

RLHF is a hybrid approach to fine-tuning LLMs that combines PPO with human feedback. It is similar to PPO, but instead of using a reward function that is hand-crafted by experts, it uses a reward function that is learned from human feedback.

RLHF starts from a dataset of human-labeled examples, typically ratings or rankings of the model’s outputs for a given prompt. A reward model is trained to predict these human judgments, and once it is trained, it supplies the reward signal used to optimize the LLM’s policy with PPO.
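To make the reward-model step concrete, here is a small, library-agnostic PyTorch sketch (the backbone model and the example texts are hypothetical) of training a scalar reward head on a single human comparison with the standard pairwise log-sigmoid loss:

import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A small backbone with a single-logit head serves as the reward model.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=1)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

# One human comparison: the same prompt with a preferred and a dispreferred answer.
prompt = "Summarize: the meeting was moved to Friday."
chosen = prompt + " The meeting is now on Friday."
rejected = prompt + " I like turtles."

r_chosen = reward_model(**tokenizer(chosen, return_tensors="pt")).logits.squeeze()
r_rejected = reward_model(**tokenizer(rejected, return_tensors="pt")).logits.squeeze()

# Pairwise (Bradley-Terry style) loss: push the preferred answer's reward above the other's.
loss = -F.logsigmoid(r_chosen - r_rejected)
loss.backward()
optimizer.step()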

RLHF has been shown to be effective at fine-tuning LLMs on a variety of tasks, including translation, summarization, and question answering. It is more complex to implement and tune than plain PPO or DPO, but it can achieve better performance on some tasks.

RoPE: Rotary Positional Embedding

RoPE is a type of position embedding that encodes absolute positional information with a rotation matrix while naturally incorporating explicit relative-position dependency into the self-attention formulation. RoPE was proposed in the RoFormer paper in April 2021 and is used by models such as LLaMA 2.

Source: https://arxiv.org/pdf/2104.09864v4.pdf

Great post on RoPE from Eleuther.ai
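To make the idea concrete, here is a minimal sketch of applying rotary embeddings to a (seq_len, dim) tensor of queries or keys, following the RoFormer formulation; the pairing of dimensions used here is one common convention, and production implementations differ in layout and caching details:

import torch

def apply_rope(x, base=10000):
    # x: (seq_len, dim) queries or keys; dim must be even.
    seq_len, dim = x.shape
    # Per-pair rotation frequencies theta_i = base^(-2i/dim).
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each 2-D pair by a position-dependent angle; dot products between
    # rotated queries and keys then depend only on their relative distance.
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.reshape(seq_len, dim)

q_rotated = apply_rope(torch.randn(16, 64))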

Fine-Tuning to Larger Context Lengths

Many LLMs, such as LLaMA 2 and Mistral-7B, ship with base context lengths of 4K and 8K tokens respectively. However, many use cases require a longer context window, and below are some techniques for increasing the context length through fine-tuning.

Position Interpolation (PI) is a technique for extending the context length of RoPE-based LLMs like LLaMA 2. It was proposed in a paper by Meta researchers and was also discovered around the same time by an independent researcher.

Extrapolation vs Interpolation — https://arxiv.org/pdf/2306.15595.pdf

With this technique, extending the context window comes down to roughly a one-line change in the code that computes the position indices: positions from the longer sequence are scaled down so they land back inside the range the model was trained on. For example, to stretch a model trained on 2048 tokens to a 4096-token context, the rotary position indices are interpolated rather than extrapolated:

position_ids = position_ids / 2  # squeeze positions 0..4095 back into the trained range 0..2047

ALiBi (Attention with Linear Biases) is a positional method that allows Transformer language models to consume, at inference time, sequences longer than the ones they were trained on. Unlike traditional methods that add positional embeddings to the word embeddings, ALiBi biases the query-key attention scores with a penalty proportional to their distance: the attention value a query can assign to a key is penalized according to how far apart the key and query are. When they are close, the penalty is very low; when they are far apart, it is very high. The method is motivated by the simple intuition that nearby words matter much more than distant ones.
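A small sketch of how the ALiBi bias matrix can be constructed (assuming the head count is a power of two, which matches the slope schedule described in the paper); the bias is simply added to the query-key attention scores before the softmax:

import torch

def alibi_bias(num_heads, seq_len):
    # Per-head slopes form a geometric sequence 2^(-8/n), 2^(-16/n), ... with n = num_heads.
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Signed distance from each query position i to each key position j (negative for past keys).
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0).float()   # future positions get 0 (the causal mask handles them)
    # The bias grows more negative the farther back the key is, scaled per head.
    return slopes[:, None, None] * distance[None, :, :]             # (num_heads, seq_len, seq_len)

# Usage inside attention (illustrative):
# scores = q @ k.transpose(-2, -1) / head_dim ** 0.5 + alibi_bias(num_heads, seq_len)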

Because the bias depends only on distance, no additional training on longer sequences is needed: a model trained on 1024 tokens can, for example, run inference on 2048 (or far more) tokens without any fine-tuning. MosaicML’s MPT-7B and MPT-30B models leverage ALiBi to extrapolate to context lengths of up to 65k tokens.

Resources

Libraries

TRL (Transformer Reinforcement Learning) Library

PEFT (Parameter Efficient Fine Tuning)

Datasets

Blog Posts
