Do Direct Preference Optimization (DPO) with Arcee AI's training platform

We're excited to announce support for DPO training on Arcee AI's training APIs - allowing Arcee users to align their small language models (SLMs) directly with human preferences.

What is Direct Preference Optimization (DPO)?

Direct Preference Optimization (DPO) is a method for fine-tuning large language models (LLMs) that aims to improve their performance and align their outputs with human preferences.

DPO directly optimizes a language model's policy (its decision-making process) based on examples of preferred and non-preferred outputs. The key idea is to adjust the model's behavior without needing a separate reward model.

At its core, DPO works by:

  1. Using paired examples: It takes pairs of model outputs for a given prompt, where one output is preferred over the other.
  2. Directly updating the model: Instead of training a separate reward model, DPO uses these paired examples to directly update the language model's parameters.
  3. Leveraging probability distributions: DPO uses the model's own token probabilities to guide the optimization process, encouraging the model to assign higher probabilities to preferred outputs and lower probabilities to non-preferred ones (see the loss sketch just after this list).
  4. Maintaining a balance: The method includes a constraint to prevent the model from deviating too far from its original behavior, preserving its general knowledge and capabilities.
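
Concretely, steps 3 and 4 boil down to a single loss term that compares the current policy against a frozen reference copy of the model, with a scaling factor (often called beta) acting as the "don't drift too far" constraint. The snippet below is a minimal PyTorch sketch of that loss for illustration only; the function and tensor names are ours, not part of the Arcee APIs.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument holds the summed log-probability (one value per prompt)
    # that the policy or the frozen reference model assigns to the preferred
    # ("chosen") or non-preferred ("rejected") response.
    # How much more than the reference the policy favors each response:
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that pushes the chosen margin above the rejected one,
    # i.e. raises the probability of preferred outputs relative to the rest.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

A larger beta keeps the policy closer to the reference model (step 4), while a smaller beta lets the preference pairs pull it further from its original behavior.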

The main advantages of DPO include reduced data and computational requirements, quicker adaptation to new preferences, and improved ability to avoid undesired outputs. This makes it an efficient method for creating more specialized and potentially safer language models.

Key Point: DPO is particularly useful after model merging, where it anneals the merged model: it smooths out inconsistencies and aligns the combined model with desired preferences, yielding a more coherent and effective final product.

How to Launch DPO on the Arcee Platform

Launch your DPO alignment from the Arcee pre-trained, aligned, merged, or Hugging Face model of your choice. The example below starts from a merged model:

import arcee  # Arcee AI Python client

# Launch a DPO alignment job. This example starts from a merged model,
# so the other model-source arguments are left as None.
arcee.start_alignment(
    alignment_name="my_dpo_alignment",       # name for this alignment job
    alignment_type="dpo",                    # use Direct Preference Optimization
    pretrained_model=None,
    merging_model="your_merged_model_name",  # the merged model to align
    alignment_model=None,
    hf_model=None
)

It's that simple!
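
To start from a Hugging Face checkpoint instead, the sketch below assumes the hf_model argument takes the Hugging Face model ID (the ID shown is a placeholder) while the other model-source arguments stay None:

import arcee

arcee.start_alignment(
    alignment_name="my_dpo_alignment",
    alignment_type="dpo",
    pretrained_model=None,
    merging_model=None,
    alignment_model=None,
    hf_model="your-org/your-hf-model"  # placeholder Hugging Face model ID
)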

(Coming to the UI soon, too.)

Happy preference optimizing!