How model merging fits into Arcee's SLM system

Why this state-of-the-art technique works so well for our Small Language Model (SLM) system and domain-specific tasks

For Day 2 of MARCH MERGE MADNESS, we’re going to tell you more about how model merging and mergekit fit into Arcee’s Small Language Model (SLM) universe. 

Yesterday, our CEO Mark McQuade explained our three-layered domain adaptation system: 

  • Continual Pre-Training
  • Alignment (Supervised Fine-Tuning) 
  • Retrieval-Augmented Generation (RAG)

Model merging applies to the first layer: Continual Pre-Training. 

Instead of doing Continual Pre-Training over an entire model, you train only a much smaller model – which you then merge with a much larger model. It’s an elegant technique that’s incredibly compute-efficient. 
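In practice, a merge like this can be expressed as a short mergekit recipe. The sketch below is illustrative only: the model names are placeholders, it uses a simple linear (weighted-average) merge, and the authoritative recipe schema and full list of merge methods are documented in the mergekit repo.

```python
# Hedged sketch: drive mergekit from Python by writing a recipe file and
# invoking its documented `mergekit-yaml` CLI. The model names below are
# placeholders, not Arcee releases; consult the mergekit repo for the
# authoritative recipe schema.
import subprocess
import textwrap

recipe = textwrap.dedent("""\
    # Weighted average of a base model and a continually pre-trained checkpoint.
    models:
      - model: example-org/base-model  # placeholder: general-purpose base
        parameters:
          weight: 0.5
      - model: example-org/domain-cpt-checkpoint  # placeholder: CPT checkpoint
        parameters:
          weight: 0.5
    merge_method: linear
    dtype: bfloat16
    """)

with open("merge-recipe.yml", "w") as f:
    f.write(recipe)

# Produces the merged model in ./merged-model (assumes mergekit is installed).
subprocess.run(["mergekit-yaml", "merge-recipe.yml", "./merged-model"], check=True)
```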

Today, our Senior Research Engineer (and mergekit Founder) Charles Goddard goes into more detail, explaining why model merging is superior to dataset blending techniques like DoReMi. 

Charles also explains why model merging is a particularly good fit for domain-specific tasks – which is exactly what Arcee's Small Language Model (SLM) systems are designed for. 

To learn more about model merging, check out the Arcee/mergekit repo… And don’t hesitate to hit us up with your questions, either on X or on LinkedIn.

Interview with Arcee Senior Research Engineer / mergekit Founder Charles Goddard.

TRANSCRIPT:
Charles Goddard, mergekit Founder & Arcee Senior Research Engineer

Arcee does Continual Pre-Training of checkpoints with whatever private data you want a large language model (LLM) to be able to reason over. 

And that's a well-understood technique. It works, but it's quite expensive and involves huge amounts of compute.

And then at the end you have a model that generally has degraded performance on the original tasks the language model excelled at.

You'll have some degradation of general intelligence.

You know, catastrophic forgetting is a thing. It's there.

And traditionally you can do further training with instructional datasets...

Of course, there are techniques like DoReMi that use dataset blending to reduce that catastrophic forgetting effect.

The way that Arcee is using model merging, which is quite cool, is that they're doing that Continual Pre-Training, producing a checkpoint that's really good at a domain-specific task and not so great at everything else, and then merging that back into a base model and getting the best of both worlds more or less for free.

You get the capabilities of the base model, and you also get a lot of the domain-specific reasoning from the Continual Pre-Training checkpoint.
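
To make the "best of both worlds" idea concrete, here is a toy sketch of the simplest merge method: plain linear interpolation of parameters between the base model and the Continual Pre-Training checkpoint. The model names are placeholders, both checkpoints are assumed to share an architecture, and mergekit implements this along with more sophisticated methods (such as SLERP, TIES, and DARE) for real use.

```python
# Toy illustration of merging a continually pre-trained (CPT) checkpoint
# back into its base model by linear interpolation of parameters.
# Placeholder model names; both models must share the same architecture.
import torch
from transformers import AutoModelForCausalLM

BASE = "example-org/base-model"               # placeholder: general-purpose base
DOMAIN = "example-org/domain-cpt-checkpoint"  # placeholder: CPT checkpoint
ALPHA = 0.5  # 0.0 = pure base model, 1.0 = pure domain checkpoint

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
domain = AutoModelForCausalLM.from_pretrained(DOMAIN, torch_dtype=torch.bfloat16)

domain_state = domain.state_dict()
merged_state = {}
for name, base_param in base.state_dict().items():
    # Interpolate each tensor between the base and domain checkpoints.
    merged_state[name] = (1 - ALPHA) * base_param + ALPHA * domain_state[name]

base.load_state_dict(merged_state)
base.save_pretrained("./merged-model")
```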