Large Language Model Fine Tuning Techniques

Manu Suryavansh
6 min read · Oct 28, 2023
Image generated by Bing Image Creator (DALL·E 3)

Large language models (LLMs) like LLaMA 2 and Mistral-7B have demonstrated strong natural language abilities. However, their billions of parameters make them difficult to fine-tune for specific downstream tasks. In this post, we’ll explore techniques that enable efficient fine-tuning of these massive models, along with ways to increase their context length.

Parameter-Efficient Fine-Tuning (PEFT)

Parameter-efficient fine-tuning (PEFT) refers to methods that update only a small subset of parameters during training. This makes fine-tuning more practical compared to updating all parameters. Two promising PEFT techniques are LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation).

PEFT is also the name of a library from Hugging Face that implements various fine-tuning techniques such as LoRA, Prefix Tuning, and Prompt Tuning.

LoRA

LoRA is a PEFT technique that injects a small number of trainable low-rank adapter matrices into a pre-trained LLM while the original weights stay frozen. Only these adapters are trained on the new dataset, improving the LLM’s performance on a specific task at a fraction of the memory cost of full fine-tuning.

Source: https://huggingface.co/docs/peft/conceptual_guides/lora

Great Post on LoRA from HuggingFace —
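To make this concrete, here is a minimal sketch of attaching LoRA adapters to a causal language model with the PEFT library; the base model name and the hyperparameters (r, lora_alpha, target_modules) are illustrative choices, not recommendations:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Low-rank adapters are injected into the attention projections;
# only these small matrices are trained, the base weights stay frozen.
lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all parameters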

QLoRA

QLoRA is a variation of LoRA that quantizes the base LLM to 4 bits before adding the adapter layers, which further reduces the memory and compute requirements of fine-tuning. QLoRA was proposed by Dettmers et al. (the authors of the earlier LLM.int8() work) in May 2023, and it can be implemented using the bitsandbytes library.

Great post on QLoRA by HuggingFace —
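As a rough sketch (model name and settings are again illustrative), the QLoRA recipe combines 4-bit loading via bitsandbytes with LoRA adapters from PEFT:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit NF4 precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training and attach LoRA adapters on top.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)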

Proximal Policy Optimization (PPO)

PPO is a reinforcement learning (RL) algorithm that can be used to train a policy to generate text aligned with a desired reward function. In this setting the LLM itself acts as the policy: it generates completions for a batch of prompts, the reward function scores them, and PPO updates the model to increase the expected reward while constraining each update so the policy does not drift too far from its previous behaviour.

PPO has been shown to be effective at fine-tuning LLMs on a variety of tasks, including translation, summarization, and question answering. However, it can be complex to implement and tune.
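As a hedged sketch of what this looks like in code, here is a single PPO step with Hugging Face’s TRL library; the API has changed across TRL versions, this follows the 2023-era PPOTrainer interface, and the constant reward of 1.0 is a placeholder for a real reward model:

import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=2, mini_batch_size=2)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)  # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

prompts = ["The movie was", "I thought the food"]
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# Sample completions from the current policy.
response_tensors = ppo_trainer.generate(
    query_tensors, return_prompt=False, max_new_tokens=16, pad_token_id=tokenizer.eos_token_id
)

# Placeholder rewards; in practice these come from a reward model or another scoring function.
rewards = [torch.tensor(1.0) for _ in response_tensors]

# One PPO optimization step over this batch.
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)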

Direct Preference Optimization (DPO)

DPO is a simpler approach to preference fine-tuning than PPO. It does not require training a separate reward model; instead, it directly optimizes the LLM’s parameters so that completions preferred by humans become more likely than dispreferred ones.

DPO is trained on a dataset of human preference pairs, each consisting of a prompt and two possible completions, one preferred and one dispreferred. The LLM is then fine-tuned to maximize the likelihood of generating the preferred completions.
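A hedged sketch with TRL’s DPOTrainer (late-2023 API; the tiny in-memory dataset is purely illustrative):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")   # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Each row is one preference pair: a prompt, the preferred completion, and the dispreferred one.
train_dataset = Dataset.from_dict({
    "prompt":   ["Explain LoRA in one sentence."],
    "chosen":   [" LoRA trains small low-rank adapter matrices while the base model stays frozen."],
    "rejected": [" LoRA is a kind of rope."],
})

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,                               # temperature of the implicit reward
    args=TrainingArguments(output_dir="dpo-out", per_device_train_batch_size=1,
                           max_steps=10, remove_unused_columns=False),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=256,
    max_prompt_length=128,
)
trainer.train()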

Reinforcement Learning from Human Feedback (RLHF)

RLHF is a hybrid approach to fine-tuning LLMs that combines PPO with human feedback. It is similar to PPO, but instead of using a reward function that is hand-crafted by experts, it uses a reward function that is learned from human feedback.

RLHF starts from a dataset of human-labeled examples, typically ratings or rankings of the model’s outputs for a given prompt. A reward model is trained to predict these human judgments, and once it is trained, it supplies the reward signal used to optimize the LLM’s policy with PPO.
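To make the reward-model step concrete, here is a small, library-agnostic PyTorch sketch (the backbone model and the example texts are hypothetical) of training a scalar reward head on a single human comparison with the standard pairwise log-sigmoid loss:

import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A small backbone with a single-logit head serves as the reward model.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=1)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

# One human comparison: the same prompt with a preferred and a dispreferred answer.
prompt = "Summarize: the meeting was moved to Friday."
chosen = prompt + " The meeting is now on Friday."
rejected = prompt + " I like turtles."

r_chosen = reward_model(**tokenizer(chosen, return_tensors="pt")).logits.squeeze()
r_rejected = reward_model(**tokenizer(rejected, return_tensors="pt")).logits.squeeze()

# Pairwise (Bradley-Terry style) loss: push the preferred answer's reward above the other's.
loss = -F.logsigmoid(r_chosen - r_rejected)
loss.backward()
optimizer.step()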

RLHF has been shown to be effective at fine-tuning LLMs on a variety of tasks, including translation, summarization, and question answering. It is more complex to implement and tune than plain PPO or DPO, but it can achieve better performance on some tasks.

RoPE: Rotary Positional Embedding

RoPE is a type of position embedding that encodes absolute positional information with a rotation matrix while naturally incorporating explicit relative-position dependency into the self-attention formulation. RoPE was proposed in the RoFormer paper in April 2021 and is used by models such as LLaMA 2.

Source: https://arxiv.org/pdf/2104.09864v4.pdf

Great post on RoPE from Eleuther.ai
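To make the idea concrete, here is a minimal sketch of applying rotary embeddings to a (seq_len, dim) tensor of queries or keys, following the RoFormer formulation; the pairing of dimensions used here is one common convention, and production implementations differ in layout and caching details:

import torch

def apply_rope(x, base=10000):
    # x: (seq_len, dim) queries or keys; dim must be even.
    seq_len, dim = x.shape
    # Per-pair rotation frequencies theta_i = base^(-2i/dim).
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each 2-D pair by a position-dependent angle; dot products between
    # rotated queries and keys then depend only on their relative distance.
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.reshape(seq_len, dim)

q_rotated = apply_rope(torch.randn(16, 64))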

Fine-Tuning to Larger Context Lengths

Many LLMs, such as LLaMA 2 and Mistral-7B, ship with base context lengths of 4K and 8K tokens respectively. However, many use cases require a longer context window, and below are some techniques for increasing the context length through fine-tuning.

Position Interpolation (PI) is a technique for extending the context length of RoPE-based LLMs like LLaMA 2. It was proposed in a paper by Meta researchers and was also discovered around the same time by an independent researcher.

Extrapolation vs Interpolation — https://arxiv.org/pdf/2306.15595.pdf

With this technique, extending the context window comes down to roughly a one-line change in the code that computes the position indices: positions from the longer sequence are scaled down so they land back inside the range the model was trained on. For example, to stretch a model trained on 2048 tokens to a 4096-token context, the rotary position indices are interpolated rather than extrapolated:

position_ids = position_ids / 2  # squeeze positions 0..4095 back into the trained range 0..2047

ALiBi (Attention with Linear Biases) is a positional method that allows Transformer language models to consume, at inference time, sequences longer than the ones they were trained on. Unlike traditional methods that add positional embeddings to the word embeddings, ALiBi biases the query-key attention scores with a penalty proportional to their distance: the attention value a query can assign to a key is penalized according to how far apart the key and query are. When they are close, the penalty is very low; when they are far apart, it is very high. The method is motivated by the simple intuition that nearby words matter much more than distant ones.
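A small sketch of how the ALiBi bias matrix can be constructed (assuming the head count is a power of two, which matches the slope schedule described in the paper); the bias is simply added to the query-key attention scores before the softmax:

import torch

def alibi_bias(num_heads, seq_len):
    # Per-head slopes form a geometric sequence 2^(-8/n), 2^(-16/n), ... with n = num_heads.
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Signed distance from each query position i to each key position j (negative for past keys).
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0).float()   # future positions get 0 (the causal mask handles them)
    # The bias grows more negative the farther back the key is, scaled per head.
    return slopes[:, None, None] * distance[None, :, :]             # (num_heads, seq_len, seq_len)

# Usage inside attention (illustrative):
# scores = q @ k.transpose(-2, -1) / head_dim ** 0.5 + alibi_bias(num_heads, seq_len)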

Because the bias depends only on distance, no additional training on longer sequences is needed: a model trained on 1024 tokens can, for example, run inference on 2048 (or far more) tokens without any fine-tuning. MosaicML’s MPT-7B and MPT-30B models leverage ALiBi to extrapolate to context lengths of up to 65k tokens.

Resources

Libraries

TRL (Transformer Reinforcement Learning) Library

PEFT (Parameter Efficient Fine Tuning)

Datasets

Blog Posts
