Pelayo Arbués

Search

❯

Literature Notes

❯

Finetuning an LLM: RLHF and Alternatives

Sep 12, 2023, 8 min read

#articles
#literature-note

Metadata

Author: Argilla S.L.U.
Full Title: Finetuning an LLM: RLHF and Alternatives
URL: https://argilla.io/blog/mantisnlp-rlhf-part-2/

Highlights

Unfortunately, SFT alone is often insufficient for refining the large language models to meet specific requirements. With it, you can complement the lack of information you may have in pretrained LLM, change the style, structure, teach to answer differently to different outputs. However, there will always be multiple ways you can answer a question. State-of-the-art conversational models, such as ChatGPT, have been traditionally requiring an additional step called Reinforcement Learning by Human Feedback (RLHF). (View Highlight)
Reinforcement learning is a paradigm in which an agent learns to make decisions by receiving feedback from its environment. In the context of language models, this feedback is provided by human reviewers who assess and rate the model’s responses. By leveraging human expertise and judgments, reinforcement learning facilitates the iterative improvement of the model’s performance and fine-tunes its responses. (View Highlight)
The process of reinforcement learning by human feedback involves several important steps:
1. Guidelines are defined to guarantee unique criteria when deciding what is a good and a bad answer to an input.
2. A Reward Model (RM) should be trained, which will evaluate each of the responses in terms of accuracy, relevance, and adherence to guidelines.
3. To train the RM, some prompts are selected and sent to human reviewers. We call them Preference Data (PD)
4. The reviewers then interact with the model and manually evaluate and rate the corresponding outputs.
5. The collected feedback, in the form of ratings or rankings, is used to train the RM.
6. With the RM trained, we can train a Policy Optimizer, a required component which will guide the fine-tuning of the LLM.
7. We fine-tune the LLM with Policy Optimization.
8. This iterative feedback loop allows the model to gradually learn from human guidance and refine its behavior accordingly. (View Highlight)
The main requirement at this point is having Preference Data (PD). Preference Data is a collection of options / alternatives to a prompt, which can be sent to a group of Annotators and/or Subject Matter Experts (SME) so that they rate it, given some annotation guidelines, from the best to the worst. Several approaches have been used to create the Preference Data.
1. For selecting the best prompts, you can either have a predefined set of them, have a predefined template and generate some on the fly, and or combine/those with random prompts from a database, for example.
2. For selecting the answers, you can send the prompt to one specific version of the model (the last one you have fine-tuned) or to different checkpoints. Decide how many answers you want to rank: you can use just 2 and have a binary ranking schema of best-worst or have your annotators ranking from 1 to 5, as an example.
3. Always define first the Annotation Guidelines for the different ranking levels, to minimize individual interpretations and standardize the answers. (View Highlight)
This is what the OpenAI Summarization Preference Data looks like (again, using Hugging Face Dataset viewer): given 1 prompt (left, a summarization task on an article), you have N answers (right, the summaries) with a ranking/score (in this case there are several axes/metrics, like accuracy, coverage, etc. being the final aggregated metric called ‘overall’) After collecting a dataset of prompts and answers (the Preference Data - PD), you can already send them to your Annotators for a review. (View Highlight)
Preference data implies ranking outputs from best to worst, as mentioned above. Argilla introduced RankingQuestion in version 1.12.0 to the FeedbackDatasets that can be used for this kind of ranking. As a quick recap, FeedbackDatasets in Argilla require both a list of fields and a list of questions so that the annotators can provide answers to those. In this case, to rank the answers for a dataset as the one mentioned above, we would need the following: • TextFields to define both the prompts and the answers generated for those prompts. • RankingQuestion to ask the annotators to rank the responses for a given prompt from best to worst, following the pre-defined annotation guidelines. Translated to code, it would look like this: (View Highlight)
Running the above in Argilla, would generate the following dataset in the Argilla UI, ready for the annotators to start ranking the answers to a collection of prompts, to later on be collected to train/fine-tune a Reinforcement Learning model based on Preference Data. (View Highlight)
Preference Data can contain several alternatives to be ranked from 0 to N. ChatGPT used 4 alternatives ranked from A to D (see figure below). However, simplified methodologies, as Direct Preference Optimization, will just require a binary classification of chosen vs rejected. With that, you could set up Argilla to just show, for example, two responses (Response-1 and Reponse-2) and choose the best as Chosen and the least preferred as Rejected. (View Highlight)
So far we have seen we need 2 datasets to get our LLM-based chatbot working: the Instruction Data for fine-tuning and the Preference Data for Reinforcement Learning. Let’s clarify the differences: • The Instruction Data is used in the fine-tuning step, and consists of prompts and expected answers which aim to improve or correct the way an LLM generates text based on a prompt. You may even want to train the model on a generation of a new type of content, for example, a piece of advertisement, SEO content, a report, etc - and the model may not have been trained to do that! That is where Instruction Data comes into play, and consists of rows with 1 prompt and 1 generated example to it. • The Preference Data is used to train a Policy (see next section). Basically, here you are not teaching the model anymore to write specific emails, or reports, or any content of your needs. You are training a Policy Model to make the model distinguish, given N acceptable answers to a prompt, which are better than others. Usually Preference Data is a collection of 1 prompt and N answers to that prompt, to then be ranked for a group of Annotators or Subject Matter Experts. This is how ChatGPT depicts it: (View Highlight)
RLHF adds a big complexity to the process of training your custom LLM. In summary of what we described above. This is the picture which summarizes everything you need to do: In summary, it requires the following artifacts:
1. Annotation Guidelines to define what is a good or a bad answer.
2. Preference Data: Selecting prompts for Human Labellers. Generating the outputs for evaluators.
3. A Reward Model (RM);
4. A Policy, optimized with the data from the RM; It’s evident the process is complex and very expensive in terms of human and computational resources. But there are alternatives that combine the fine-tuning of the model (SFT) with the Reward Modeling, simplifying the process a lot. (View Highlight)

New highlights added March 4, 2024 at 2:03 PM

Direct Preference Optimization (DPO) In Direct Preference Optimization: Your Language Model is Secretly a Reward Model[1] the authors introduce a method called Direct Preference Optimization (DPO) for achieving precise control over Large-scale unsupervised Language Models (LLMs). As mentioned above, existing methods rely on Reinforcement Learning from Human Feedback (RLHF) are based on training a Reward Model and using Proximal Policy Optimization (PPO) (as described in our second post), to align the LM’s output with human preferences. But this approach is very complex and unstable. DPO solves the constrained reward maximization problem by treating it as a classification problem on human preference data. The algorithm is stable, efficient, and computationally lightweight, eliminating the need for reward model fitting, LM sampling, and extensive hyperparameter tuning. Unlike existing methods that train a Reward Model and then optimizes a Policy based on that RM (Reinforcement Learning by Human Feedback using PPO — left of the figure), DPO directly defines the preference loss as a function of the Policy (right). In short, it removes the requirement of training first a Reward Model. (View Highlight)
The DPO pipeline has two stages:
1. Run supervised fine-tuning (SFT) on the dataset(s) of interest.
2. Run preference learning on the model from step 1, using Preference Data (ideally from the same distribution as the SFT examples) As you see, Preference Data is still required to train the Policy. But it gets rid of the complexity of training a Reward Model first to then optimize a LM Policy to then fine-tune a LLM. How is that achieved? The idea is using the very same LLM as a Reward Model. During fine-tuning time,DPO optimizes the policy using a binary cross-entropy objective. It uses the human preference data to determine what responses are preferred and what responses are not. By comparing the model’s responses to the preferred ones, it can adjust the policy to improve its performance. (View Highlight)
fine-tuning a large language model (LLM) for generating text using specific guidelines can be highly useful in various applications. By providing specific guidelines, we can shape the LLM’s output to align with our desired objectives, ensuring more accurate and controlled text generation. This process has been historically very complex, requiring to train first a Reward Model to then optimize the Policy of the LLM. However, recent approaches, such as Direct Policy Optimization (DPO) and Chain of Hindsight (CoH) have shown how you can use Large Language as Reward Models. (View Highlight)

Graph View

Metadata
Highlights
New highlights added March 4, 2024 at 2:03 PM

Recent Notes

AI Enhanced Knowledge Management
May 13, 2024
Sucessful Model
May 06, 2024
The Rise of the Dataset Engineer
Apr 25, 2024

See 87 more →

Now Reading

A Guide to Structured Generation Using Constrained Decoding
May 16, 2024

See 656 more →

Created with Quartz, © 2024

Twitter
Linkedin
Mastodon
Unsplash
GitHub
RSS