Evolutionary Model Merging For All

Sakana.ai made a very big splash about a month ago, releasing a paper on Evolutionary Model Merging, and the subsequent model and eval results of this game-changing merge method. Unfortunately for the community, they never released the algorithm behind these amazing results!

Since this release, here at Arcee we've been fully focused on developing this groundbreaking technique for the community. We're now excited to announce the launch of this state-of-the-art functionality in MergeKit.

Evolutionary Model Merging lets people target specific competencies or qualities in their merges. Without it, model merging is an extremely manual, exploratory process: trying dozens of merges, manually evaluating them, and trying to come up with a mental framework that explains how the merging parameters relate to the performance of the final model. With Evolutionary Model Merging, we can instead specify what qualities we want a model to have, and the optimizer will take care of the rest for us.

Tutorial: How to get started with Evolutionary Model Merging

I've created a tutorial to help you get started: mergekit-evolve.

Evolutionary Model Merging with mergekit-evolve

Hardware Requirements

mergekit-evolve needs at least one GPU. It doesn't necessarily need a huge one! You need to be able to run inference on a model in FP16. If you're working with models in the 7B size range, 24GB of VRAM will do just fine. If you're a big spender, you can use a Ray cluster with however many GPUs you want. For this little demo I'm using a RunPod instance with a single A100.
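As a rough sanity check on that 24GB figure, here's a back-of-the-envelope sketch. FP16 stores each parameter in 2 bytes, so the weights alone are easy to estimate. The 7e9 parameter count is an assumption (real "7B" models vary a bit), and the KV cache, activations, and any working memory for the merge itself come on top, so treat the result as a lower bound.

```python
def fp16_weight_gib(num_params: float) -> float:
    """Approximate memory for model weights alone in FP16 (2 bytes per parameter)."""
    bytes_per_param = 2
    return num_params * bytes_per_param / 1024**3


# Roughly 7 billion parameters; an assumption, real "7B" models vary a little.
weights_gib = fp16_weight_gib(7e9)
print(f"{weights_gib:.1f} GiB for weights alone")
```

That leaves a comfortable chunk of a 24GB card free for everything else.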

Installing

First let's set up our environment with an installation of mergekit. We need to use the evolve feature flag, and I'm using vllm as well because it's faster.

cd /workspace
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e ".[evolve,vllm]"
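If you want to confirm the install worked before going further, a quick check along these lines can help. This is only a sketch: the import names mergekit, lm_eval (the evaluation harness), and vllm are my assumptions about what the extras pull in, so adjust the list if your environment differs.

```python
import importlib.util


def installed(module_names):
    """Map each module name to True if Python can locate it, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


# Import names are assumptions: mergekit, lm_eval (the evaluation harness), vllm.
for name, ok in installed(["mergekit", "lm_eval", "vllm"]).items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```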

Defining Tasks

To optimize a merge recipe we need to first decide what exactly to optimize. mergekit-evolve uses EleutherAI's language model evaluation harness as a backend, so in theory all of the benchmarks supported by lm-eval can be used. Since I'm not an evil little gnome I'm going to define some custom tasks instead of directly optimizing against Open LLM Leaderboard scores.

Let's say that we want some spatial awareness in our model. The spartqa-mchoice dataset is a set of synthetic question-and-answer pairs involving the arrangement of objects that aims to test the spatial reasoning capabilities of language models. Let's take a random sampling of their training split and use that for one part of our scoring.

import datasets

ds = datasets.load_dataset("metaeval/spartqa-mchoice")["train"]
ds_p = ds.shuffle(seed=9163).select(range(1000))
ds_p.push_to_hub("my-hf-username/spartqa-train-1k", private=True)

Now we need to define an lm-eval task that scores against this data. This can be done by writing a YAML file (and any necessary helper code). For more details on how to do this look at the New Task Guide.

mkdir /workspace/eval_tasks

In /workspace/eval_tasks/spartqa_1k_train.yaml:

task: spartqa_train
dataset_path: my-hf-username/spartqa-train-1k
output_type: multiple_choice
training_split: train
validation_split: train
test_split: train
doc_to_text: !function preprocess_spartqa.doc_to_text
doc_to_choice: [ 'A', 'B', 'C', 'D' ]
doc_to_target: "{{answer}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

And in /workspace/eval_tasks/preprocess_spartqa.py:

def doc_to_text(doc) -> str:
    answer_chunks = []
    for idx, answer in enumerate(doc["candidate_answers"]):
        letter = "ABCD"[idx]
        answer_chunks.append(f"{letter}. {answer}")
    answers = "\n".join(answer_chunks)
    return f"Context:\n{doc['story']}\n\nQuestion: {doc['question']}\n{answers}\nAnswer:"
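To see what prompt the model will actually be scored on, it can help to run the helper on a single record. The sketch below repeats the function so it runs on its own; the example record is made up for illustration and is not taken from the dataset, though it uses the same story, question, and candidate_answers fields the helper reads.

```python
def doc_to_text(doc) -> str:
    # Same helper as in preprocess_spartqa.py, repeated so this runs standalone.
    answer_chunks = []
    for idx, answer in enumerate(doc["candidate_answers"]):
        letter = "ABCD"[idx]
        answer_chunks.append(f"{letter}. {answer}")
    answers = "\n".join(answer_chunks)
    return f"Context:\n{doc['story']}\n\nQuestion: {doc['question']}\n{answers}\nAnswer:"


# Made-up record for illustration only; not a real row from the dataset.
example = {
    "story": "A red circle is above a blue square.",
    "question": "Where is the circle relative to the square?",
    "candidate_answers": ["above", "below", "left", "right"],
}
prompt = doc_to_text(example)
print(prompt)
```

The prompt ends with "Answer:", so the model only has to continue with one of the letters listed in doc_to_choice.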

One common problem with merges is that the result often doesn't conform to any one particular prompting style. When manually creating merge recipes it's fairly easy to get the behavior you want by varying weights across layers, but since we're letting an algorithm optimize things let's make a silly little task for it instead. Alpaca is a very common standard and all it really needs from the model is to correctly output an EOS token after a completed response.

First, let's put together another tiny set of data for evaluating our metric with. I'll use a few hundred prompts from vicgalle/alpaca-gpt4.

import datasets

ds = datasets.load_dataset("vicgalle/alpaca-gpt4")["train"]
df = ds.to_pandas()

no_input = df[df.input.map(len) < 1]
examples = no_input.sample(n=500, replace=False, random_state=749)
ds_p = datasets.Dataset.from_pandas(examples)
ds_p.push_to_hub("my-hf-username/alpaca-gpt4-500", private=True)

And now the actual task definition is quite simple.

In /workspace/eval_tasks/alpaca_prompt_format.yaml:

task: alpaca_prompt_format
dataset_path: my-hf-username/alpaca-gpt4-500
output_type: multiple_choice
training_split: train
validation_split: train
test_split: train
doc_to_text: "### Instruction:\n{{instruction}}\n### Response:\n{{output}}"
doc_to_choice:
  - "</s>" # replace with your model's EOS token if it is different
  # and now some incorrect options
  - "<|im_end|>"
  - "<|im_start|>"
  - "### Instruction:"
  - "USER:"
doc_to_target: 0
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

There are definitely more robust ways to evaluate this but the multiple choice setup is nice in that it evaluates really quickly. Experiment at will!
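For intuition about what this task measures, here's a minimal sketch of multiple-choice accuracy as I understand it: each choice is scored as a continuation of the prompt, and a document counts as correct when the highest-scoring choice is the target (index 0, the EOS token, in the YAML above). The log-likelihood numbers are hypothetical, and the function names are mine rather than anything from lm-eval.

```python
def pick_choice(logprobs):
    """Return the index of the continuation with the highest log-likelihood."""
    return max(range(len(logprobs)), key=lambda i: logprobs[i])


def accuracy(per_doc_logprobs, target=0):
    """Fraction of documents where the top-scoring choice is the target index."""
    hits = sum(pick_choice(lp) == target for lp in per_doc_logprobs)
    return hits / len(per_doc_logprobs)


# Hypothetical log-likelihoods for the five choices, in the YAML's order:
# "</s>", "<|im_end|>", "<|im_start|>", "### Instruction:", "USER:"
scores = [
    [-0.2, -6.1, -7.3, -3.0, -5.5],  # prefers EOS: correct
    [-4.0, -6.5, -7.0, -0.9, -5.1],  # rambles on into a new instruction: wrong
    [-0.5, -5.9, -6.8, -2.2, -4.7],  # prefers EOS: correct
    [-1.1, -0.3, -6.0, -3.5, -4.9],  # emits a different end token: wrong
]
print(accuracy(scores))
```

Because only a handful of short continuations are scored per prompt and nothing is generated, this kind of task stays cheap to evaluate.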

Writing an Evolutionary Merge Config

We now have all the parts in place needed to actually define the merge we want to optimize. mergekit-evolve takes a YAML configuration file that defines what models we want to include, what merge method to use, and what tasks to optimize against.

For this example, I'm going to throw three models into the soup:

  • Hermes 2 Pro Mistral 7B, because it's a generally good model
  • Dan's Adventurous Winds Mk2 7B, because it's a really fun model but answers to no prompt format
  • Zephyr 7B beta for its quality instruction following

Most of the methods implemented by mergekit can be used. I chose Task Arithmetic pretty much arbitrarily.

genome:
    models:
      - NousResearch/Hermes-2-Pro-Mistral-7B
      - PocketDoc/Dans-AdventurousWinds-Mk2-7b
      - HuggingFaceH4/zephyr-7b-beta
    merge_method: task_arithmetic
    base_model: mistralai/Mistral-7B-v0.1
    layer_granularity: 8 # sane default
    allow_negative_weights: true # useful with task_arithmetic
tasks:
  - name: alpaca_prompt_format
    weight: 0.4
  - name: spartqa_train
    weight: 0.6
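If task arithmetic is new to you, the core idea fits in a few lines: each fine-tuned model contributes a "task vector" (its weights minus the base model's), and the merge adds a weighted sum of those vectors back onto the base. The sketch below uses flat lists of floats as toy stand-ins for tensors and is not mergekit's implementation. It also shows why allowing negative weights can be useful, since a negative weight subtracts a model's task vector.

```python
def task_arithmetic(base, models, weights):
    """merged = base + sum_i weights[i] * (models[i] - base), element by element."""
    merged = list(base)
    for model, weight in zip(models, weights):
        for j, (m, b) in enumerate(zip(model, base)):
            merged[j] += weight * (m - b)
    return merged


# Toy "tensors" as flat lists; real merges operate on full weight tensors.
base = [1.0, 2.0, 3.0]
model_a = [2.0, 2.0, 4.0]
model_b = [1.0, 4.0, 3.0]

# A negative weight subtracts a model's task vector from the result.
merged = task_arithmetic(base, [model_a, model_b], [0.5, -0.25])
print(merged)
```

The per-model weights are the knobs the optimizer gets to turn.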

Tasks can be weighted arbitrarily. I made the spartqa_train task slightly more important than alpaca_prompt_format purely as an example.
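To make the effect of the weights concrete, here's a small sketch that assumes the overall score for a candidate merge is a weighted sum of the per-task metrics. The accuracy values are hypothetical, not real results, and combined_score is a name I made up for illustration.

```python
def combined_score(task_weights, task_results):
    """Weighted sum of per-task metric values."""
    return sum(weight * task_results[name] for name, weight in task_weights.items())


weights = {"alpaca_prompt_format": 0.4, "spartqa_train": 0.6}

# Hypothetical accuracies for one candidate merge; not real results.
results = {"alpaca_prompt_format": 0.90, "spartqa_train": 0.50}

print(round(combined_score(weights, results), 4))
```

Under that assumption, a point of accuracy on spartqa_train is worth one and a half times as much as a point on alpaca_prompt_format.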

Running the Merge

Now we finally have all the pieces set up to actually run mergekit-evolve. Here's the command I used:

mergekit-evolve ./evol_merge_config.yml \
		--storage-path /workspace/evol_merge_storage \
		--task-search-path /workspace/eval_tasks \
		--vllm \
		--in-memory \
		--merge-cuda \
		--wandb

This will kick off the process and start merging and evaluating models. If you used the --wandb option then metrics on the evaluated models will be reported to Weights & Biases. This can be useful to obsessively refresh while you should be doing other things.
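If you're wondering what "merging and evaluating" looks like as a loop, here's a deliberately tiny stand-in. It is a simple mutate-and-keep-the-best search, not mergekit-evolve's actual optimizer, and toy_fitness replaces the expensive "merge with these weights, then run the eval tasks" step with a made-up function. The max_fevals argument caps the number of fitness evaluations, mirroring the idea of an evaluation budget.

```python
import random


def evolve(fitness, dim, max_fevals=100, sigma=0.3, seed=0):
    """Toy search: mutate the best candidate so far and keep any improvement."""
    rng = random.Random(seed)
    best = [0.0] * dim
    best_score = fitness(best)
    fevals = 1
    while fevals < max_fevals:
        candidate = [x + rng.gauss(0.0, sigma) for x in best]
        score = fitness(candidate)
        fevals += 1
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, fevals


def toy_fitness(weights):
    # Stand-in for "merge with these weights, then score the result".
    # The peak at (0.7, -0.2, 0.4) is arbitrary.
    target = [0.7, -0.2, 0.4]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


best, best_score, fevals = evolve(toy_fitness, dim=3)
print(fevals, round(best_score, 4))
```

In the real tool every fitness call is a full merge plus an evaluation run, which is why the budget matters.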

By default mergekit-evolve will keep going until it has evaluated over 100 merges or you stop it with CTRL+C. You can increase this limit by passing the --max-fevals argument. Once the script has terminated, the mergekit configuration for the best-scoring merge will be written to /workspace/evol_merge_storage/best_config.yaml. You can get your final model by running it through mergekit-yaml, like so:

mergekit-yaml /workspace/evol_merge_storage/best_config.yaml --cuda /workspace/final_merge

We are thrilled that this novel merging technique is now available to everyone through MergeKit. We will also be integrating evolutionary model merging into the core Arcee product, which will provide a complete compute backend and remove the need to secure your own GPUs, so stay tuned for that!

Let us know how you're using mergekit-evolve, and happy merging!