Arcee launches The Small Language Show, first episode goes deep on Model Merging

It’s the place to be for anyone who wants to learn about the next big thing in LLMs: The SMALL Language Show, live-streaming biweekly and bringing you chats with the world's top experts in LLM and SLM research.

As the industry leader in Small Language Models (SLMs), we figured it was about time we started our own show devoted to, well, Small Language Models.

The SLM Show is a biweekly live-streamed chat about all things SLMs, co-hosted by Arcee CEO/Co-Founder Mark McQuade and yours truly (Mary MacCarthy).

We’ll be broadcasting every other Wednesday at 11am PST / 2pm EST, and you can find us on the following platforms:
• Arcee's LinkedIn page
• Arcee's Twitter
• Arcee's YouTube

[Video clip, 2:17]

In Episode 1, Charles Goddard gave a primer on Model Merging, and explained why he created MergeKit–and why he felt strongly about sharing it with the open source community.

The gist of the show: an informal and riveting discussion of the latest developments in building, training, deploying, and maintaining SLMs. We’ll bring you candid sessions with the industry’s top researchers and business leaders, as well as insights from our own team of world-class experts. 

[Video clip, 2:14]

Mark McQuade & Maxime Labonne answer a viewer question RE merging two models vs. fine-tuning a single model.

Our first episode was already a huge hit, with lots of people tuning in to hear us explain and explore the innovative technique known as Model Merging. The show featured Arcee's own Senior Research Engineer and MergeKit creator Charles Goddard, as well as special guest Maxime Labonne, the machine learning engineer known for his articles on Model Merging.

[Video clip, 2:30]

How does Model Merging work under the hood? Charles Goddard got into the nitty-gritty...

Check out Episode 1 and drop us a note over on LinkedIn or Twitter to tell us what you think, or to ask us questions. And tune in for more, every two weeks–here’s a sneak peek at upcoming episodes: 
• On Wednesday, April 3: We discuss training and deploying SLMs on AWS' Trainium & Inferentia, with special guests Kamran Khan and Scott Perry.
• On Wednesday, April 17: A look at how the open source community is using MergeKit, with Arcee's Charles Goddard and featuring ML/AI Developer Maya Akim.

EPISODE 1 TRANSCRIPT (unabridged and unedited; please excuse any typos!)
00:00:00:27 - 00:00:29:16

Mary: We are live. I'm Mary MacCarthy, and welcome to the SLM Show, The Small Language Model Show. This is the first edition. I am your co-host here in Los Angeles, and I'm joined by my co-host with the same initials, who is over on the other coast, in Miami. Hi, Mark. Mark is the CEO and co-founder of Arcee. Mark, let's dive right in and talk about the name of the show, The Small Language Model Show.

00:00:29:19 - 00:00:33
Mary: What are small language models and what do they have to do with Arcee?

00:00:33:14 - 00:00:57:28
Mark: Yeah, small language models. Small is a relative term, I guess, but small language models with four S's… Seven to 13 billion parameters where you are kind of going against the grain of the general large “one model to rule them all.”

00:00:58:00 - 00:01:21:04
Mark: Right. So that's to say, you know, we believe that if you can work in the 7 to 13 billion parameter model range, you can probably fulfill about 99% of business use cases with that size model, if you train and align the model accordingly. Right. But Arcee doesn't just stand for small, right? It stands for Small, Specialized, Secure, and Scalable.

00:01:21:09 - 00:01:23:14
Mark: So it's kind of four S's.

00:01:23:16 - 00:01:24:26
Mary: And sexy.

00:01:24:28 - 00:01:31:17
Mark: And sexy. And I think it would sound strange from a show perspective if we used the five S's to describe the show.

00:01:31:23 - 00:01:37:13
Mary: So I just realized that I put in my caption there, The SML show. So I’ve got to correct that… 

00:01:41:07 - 00:02:07:16
Mary: So many S's! We'll talk about all those S's in later shows. But today we're getting into one very particular sub-topic, which is a new and very innovative technique when it comes to building and training language models, called Model Merging. Some people know about it; for some people this will be new.

00:02:07:18 - 00:02:29:17
Mary: And to talk about Model Merging, we actually have two of the world's experts on it. I'm going to bring in now Charles Goddard, who is in Pasadena here in California with me, and Maxime Labonne, who is over in London. Welcome, Maxime and Charles. Why don't you both, first of all, just introduce yourselves before we get into Model Merging…

00:02:29:20 - 00:02:32:02
Mary: Tell us a little bit about yourselves. I'll start with you, Charles.

00:02:32:04 - 00:02:46:19
Charles: Sure. I'm Charles Goddard. I'm a Senior Research Engineer at Arcee, and I also wrote MergeKit. I started out my career at NASA's Jet Propulsion Laboratory. I spent a bunch of years there and did a stint at Apple, and now here I am in the language model space.

00:02:46:22 - 00:03:02:09
Mary: Fantastic. And MergeKit is the GitHub repo - the increasingly popular repo - where people are learning about Model Merging and implementing it, and which recently joined forces with Arcee. Fantastic - and Maxime?

00:03:02:12 - 00:03:24:21
Maxime: Hi. Yeah. My name is Maxime Labonne and I'm a machine learning scientist - and a MergeKit user, too. I worked at JP Morgan, and I've made several open source contributions on Hugging Face using MergeKit, by merging models. So yeah, happy to be here.

00:03:24:23 - 00:03:34
Mary: Fantastic. Yeah. And Mark, you yourself are a big model merger. So why don't you dive in with your hard questions on model merging.

00:03:34:12 - 00:03:58:23
Mark: Yeah. I mean, it was a few months ago that I became somewhat obsessed with Model Merging. I could see Model Merging really starting to take off from within the Hugging Face ecosystem. They have a leaderboard called the Open LLM Leaderboard, and merged models were appearing more and more at the top of that leaderboard, so my obsession took over.

00:03:58:23 - 00:04:22:23
Mark: But yeah, I mean, it's great to be here, kind of letting everyone know what Model Merging is. And, you know, I'd love to hear Charles describe it the way he envisions it and why he created MergeKit…

00:04:22:25 - 00:04:41:11
Charles: So, you know, Model Merging is a way to take machine learning models that have been trained on some task or another, take the weights of those models, and combine them to get a model that has some of the strengths and abilities of all of them. There are a whole bunch of techniques to do that.

00:04:41:13 - 00:04:58:08
Charles: I'd like to think of it as a way to extend the shelf life of models. So, you know, you invest all of these resources into fine-tuning something to do really one task. And then, you know, two days later, it turns out that someone comes out with a model that does your thing ten times better or something… And there's no reason that the model you trained has to stop being useful.

00:04:58:08 - 00:05:08
Charles: You can merge it with newer innovations in the field and get the strengths of everything…
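
[Editor's note: for readers who want a concrete picture of what "combining the weights" means, here is a minimal sketch of the simplest case - a uniform linear average of two checkpoints that share the same architecture. The model names are placeholder assumptions, and a real merge is better done with MergeKit, which handles the many details this toy example ignores.]

```python
# Minimal sketch: uniform linear average of two same-architecture checkpoints.
# The model names are placeholders; any two fine-tunes of the same base work.
import torch
from transformers import AutoModelForCausalLM

MODEL_A = "your-org/finetune-a"  # placeholder
MODEL_B = "your-org/finetune-b"  # placeholder

model_a = AutoModelForCausalLM.from_pretrained(MODEL_A, torch_dtype=torch.float32)
model_b = AutoModelForCausalLM.from_pretrained(MODEL_B, torch_dtype=torch.float32)

state_b = model_b.state_dict()
merged_state = {
    # Average each parameter tensor; both models must share the same keys/shapes.
    name: 0.5 * tensor_a + 0.5 * state_b[name]
    for name, tensor_a in model_a.state_dict().items()
}

# Load the averaged weights back into one of the models and save the result.
model_a.load_state_dict(merged_state)
model_a.save_pretrained("./merged-model")
```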

00:05:08:14 - 00:05:18:08
Mark: Yeah, I mean, that's a great way to describe it. And why did you create MergeKit? 

00:05:18:10 - 00:05:38:01 Charles: Right. So I got into, you know, playing with local language models around when Llama 1 got released and I'd been aware of language models for a while… they've been around for a good long while. But I kind of looked at them almost as a toy until ChatGPT came out. And then when Llama 1 came out, I realized, oh, this is something I could play with here on my computer.

00:05:38:03 - 00:06:06:15
Charles: So, you know, I started reading all the papers that came out in the space, and I found a whole bunch of interesting literature on merging language models, and I wanted to use those. But as is often the case with research papers, there's not often stellar production-quality code released. Often there's a Jupyter notebook or something released, and it might work, but probably it's designed to be used on a, you know, a cluster of A100s or something, or whatever…

00:06:06:15 - 00:06:25:08 Charles: … the researchers who wrote that paper originally did their experiments on. You know, they're not going to put a whole lot of resources into making everything generalized. They just want to make the technique work. But I wanted to use this, and I just had my own GPU resources available. So I had to write my own implementations.

00:06:25:08 - 00:06:34:04 Charles: So I did and I threw them up on GitHub. People started using them and you know, it's kind of snowballed into the state of things we have today.

00:06:34:06 - 00:06:51:03 Mark: That's great. Yeah, obviously we love what you've created. We thank you for creating it. So it's a fantastic library. It's a fantastic toolset. So, Maxime, I guess I'll say the same thing to you… What does Model Merging mean to you? And then also, secondarily, because you've built a ton of great things on top of MergeKit… I'd love to hear about how you discovered MergeKit…

00:06:57:23 - 00:07:19:08 Maxime: Yeah, I consider Model Merging as a fourth step in the fine-tuning, the training pipeline. You have, like, the pre-training, you have the supervised fine-tuning, you have the preference alignment, and now we have model merging - and it's really, like, the idea of, yeah, we have this rich ecosystem of models that have been trained and fine-tuned and they're quite good.

00:07:19:08 - 00:07:39:18 Maxime: And… they've been trained on different data sets, so they can be quite complementary too, and now we can merge them to create better models. So I started getting curious about Model Merging… I think it was in the summer, there were some scripts, people were extending scripts, and I was like… this is not possible…

00:07:39:18 - 00:08:01:15 Maxime: …Like, this cannot work… And then I got really interested in it, and I discovered MergeKit and I thought, wow, that's super convenient… because it's not just a script anymore. It's quite reliable. You have this YAML config file. So I decided to give it a try, because it was so convenient that now I didn't have an excuse not to do it.
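
[Editor's note: to make the "YAML config file" concrete - a MergeKit merge is described declaratively in a small YAML file, roughly like the sketch below. The model names, interpolation factor, and keys shown are illustrative assumptions based on MergeKit's published examples; the MergeKit README is the authoritative reference for the current schema and CLI usage.]

```python
# Illustrative sketch of a MergeKit-style SLERP config, written out from Python.
# Model names and parameter values are placeholders; check the MergeKit README
# for the authoritative schema and CLI usage.
import pathlib

CONFIG = """\
slices:
  - sources:
      - model: your-org/finetune-a   # placeholder
        layer_range: [0, 32]
      - model: your-org/finetune-b   # placeholder
        layer_range: [0, 32]
merge_method: slerp
base_model: your-org/finetune-a
parameters:
  t: 0.5            # interpolation factor: 0 = model A, 1 = model B
dtype: bfloat16
"""

pathlib.Path("slerp-config.yml").write_text(CONFIG)
# The merge itself is then run with the MergeKit CLI, e.g.:
#   mergekit-yaml slerp-config.yml ./merged-model
```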

00:08:01:17 - 00:08:18:09 Maxime: And at the beginning I was still skeptical. I was still very skeptical. I was like, no, this cannot work. Like, I'm just hacking the Open LLM Leaderboard. And I started writing an article about it… and I became more and more convinced that actually, you know, it was not just a hack…

00:08:18:09 - 00:08:42:17 Maxime: …Back then, in December 2023, it was very easy to take the best models on the Open LLM Leaderboard, merge them using, like, any technique, and it would create a better model. Now it's really difficult to do, because it's really popular now.

00:08:42:19 - 00:08:51:25 Maxime: People merge models everyday. So now the competition is quite tough.

00:08:51:27 - 00:09:09:08 Mark: Yeah, absolutely. Yeah. Like you said, people merge models every day. I merge about five models per day. Still, I just go do some merging, have a little fun, try to beat that. It's kind of like a game, right? We try to beat the models on the leaderboard and see what happens.

00:09:09:08 - 00:09:31:27 Mark: So, you know, through MergeKit, Charles has made it absolutely simple to do that, which is fantastic. And then you as well - you are building some great tools on top of that. You know, I use your AutoEval notebook quite often to do the benchmarking on the Open LLM side of it, which was a bit broken. But I think I found a fix - I should give you a PR on that.

00:09:31:27 - 00:09:56:20 Mark: So yeah, it's just a fantastic way to utilize the best of both worlds when it comes to both models, right? So I guess I'll say, you know, I'd love to hear what you both see as kind of the next - I'll say now and in the future - what is the biggest revolution when it comes to Model Merging?

00:09:56:24 - 00:10:08:06 Mark: And that's kind of a two-part question. Like - what has it done? What does it revolutionize now? And what do you think it's going to revolutionize going forward? Charles, you can go first.

00:10:08:09 - 00:10:27:18 Charles: Sure. So one of the things I like most about Model Merging, and one of the things that, you know, initially excited me the most when I open sourced MergeKit, was the fact that it doesn't require any GPU resources or expensive servers. You can do it on a personal laptop if you want to…

00:10:27:20 - 00:10:49:00 Charles: It enables this sort of huge-scale, decentralized, global experimentation - you can sort of see it in the scores over time on the Open LLM Leaderboard - just by giving these tools to so many people and letting them all plug away with it and see what they come up with, we end up doing more trials, more experiments, than any one researcher could ever hope to do.

00:10:49:03 - 00:11:12:17 Charles: So that's one of the things that initially excited me most about it. What I'm most looking forward to is - you know, right now one of my main research thrusts is trying to extend these merging techniques to being able to merge models that weren't fine-tuned from a common base. So, you know, when you merge models right now you'll say… you'll pick, like, a couple of Llama-based models.

00:11:12:17 - 00:11:29:10 Charles: So they were all trained from a 7 billion parameter base, and then you can merge those, and that works great. If you try to merge models that were trained from different bases, like a DeepSeek Coder model and a Mistral, for example, you just can't do that right now.

00:11:29:12 - 00:11:46:21 Charles: So I'm investigating techniques to enable that. And should that become possible, then suddenly we'd get access to this whole fount of potential again… if we get some sort of synergistic effects between different foundation models, then that's hugely exciting.

00:11:46:24 - 00:11:56:25 Mary: I have a question. I don't know if you have an answer to this, Charles, but to what degree are we still in the very early stages of the potential of what can be done with Model Merging?

00:11:56:28 - 00:12:23:10 Charles: That's a good question. I think it's barely been touched, honestly. So, you know, there are a whole bunch of great model merges being done in the community - those are fantastic. I think there are a lot more concrete applications that could benefit hugely from MergeKit, from model merging, that haven't really been fully utilized yet.

00:12:23:12 - 00:12:36:20 Charles: Arcee is actually a great example – the way that we're using it to speed up and reduce the costs involved in Continual Pre-training, there's huge value there that the market as a whole is yet to really tap into.

00:12:36:22 - 00:12:45:24 Mary: Yeah, it's incredible. Maxime, tell us about what you've been doing with MergeKit, maybe the most fun you've had with it.

00:12:45:27 - 00:13:10:02 Maxime: Yeah, I've done a few projects with it. Started with, like, a regular merge. I don't even remember the name, to be honest with you… It was part of a really big merge of 14 different models. I thought it was insane. So I wanted to use that one, plus another one, because I also needed a good base.

00:13:10:02 - 00:13:46:08 Maxime: And this is quite funny, because this is a model that has been used over and over and over again. I added a little tool to plot the family tree of these merges, so you can see that the parents were these models. And now it's quite wild, because there have been so many merges, so many generations, that you can see. I can trace my latest model back to the first one, but it's a long way home now, because there have been so, so many people using them in some strange ways.

00:13:46:08 - 00:14:11:14 Maxime: Like, very strange, actually, because now I also have cycles in my family tree. So this is not something that you usually see in real life, but that happens with LLMs, I guess. So that's been a lot of fun. This was the newer deco model that has been very popular. It's based on, I think, a SLERP merge.

00:14:11:16 - 00:14:37:21 Maxime: And this has been fine-tuned to make the model a little better after the merging. And this is the same recipe that I've tried to … but this time I changed the dataset to add more data that is a bit more about multi-turn conversation, and that allows the model to become better at this kind of stuff.

00:14:37:24 - 00:14:58:28 Maxime: It's also funny to see that you can have, like, this really great base model after you merge, and then you can kind of tweak it the way you want using different data sets, and fine-tune it the way you want. And the latest project I have related to merging models is called AutoMerger…

00:14:59:01 - 00:15:31:04 Maxime: And this is my attempt at automating my own job, because it's so difficult to merge models now - you can really see the diminishing returns. So I leave it to AI to do it for me, and we'll see who wins. But no, the main purpose of this AutoMerger is to automatically merge a lot of models and get insights into, okay, how am I supposed to design a good merge?

00:15:31:04 - 00:16:10:18 Maxime: Like - which models should I choose? So right now I'm still waiting to get more results, but it's quite funny. I already have a model that is better than … without any fine-tuning. Maybe it's overfitting the benchmarks. Actually, it's probably overfitting the benchmarks, but still, it's probably a good sign that, even with the diminishing returns, there is still potential to improve this 7 billion parameter model, and even stronger models.

00:16:10:21 - 00:16:27:26 Mary: That's terrific. And we'll post in the comments afterwards where people can find your work. Obviously, some of it is on the Hugging Face Open LLM Leaderboard, but also on Towards Data Science on Medium… and then also on LinkedIn. OK, we can bring in a few comments here. James R says you guys are rock stars. I'm going to include myself in that.

00:16:27:28 - 00:16:47:00 Mary: He wants to know if this call is recorded. Yes, you can find it on Arcee's LinkedIn page and also on our YouTube afterwards. Ignacio Ferreira has a question: he says it would be good to have more insights into how the models are merged, which he hasn't been able to find easily. So, whoever wants to take that.

00:16:47:03 - 00:16:50:17 Mark: We'll give that to Charles. Charles, let's go deep.

00:16:50:19 - 00:17:17:21
Charles: So that's a big question. I mean, you could talk for four weeks about, you know, all of these techniques. The foundation of most of these is the principle of linear mode connectivity, which is basically the discovery that if you take a base model and then you fine-tune it on different data sets for different tasks, they end up, you know, close enough together in parameter space that you can actually just linearly interpolate between those final weights.

00:17:17:21 - 00:17:42:07
Charles: And every point along that line is a valid model and gives you, you know, some decent performance on all the tasks involved. So most of these methods are basically just taking that principle and running with it. You know, the most basic is linear merging - you're actually just averaging the parameters. SLERP is currently a favorite of people right now.

00:17:42:09 - 00:18:13:20
Charles: It's the same thing, except it's interpolating spherically instead of linearly. Basically it's just preserving the magnitude of the vectors in addition to interpolating the direction, which works a little bit better with the interpretation of some of these parameters. Then there are a bunch of cool papers that take this and apply different techniques to the parameters to reduce interference when merging multiple models…
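
[Editor's note: Charles's description of SLERP translates almost directly into code. Below is a minimal sketch of spherical linear interpolation between two parameter tensors, with the usual fallback to plain linear interpolation when the vectors are nearly parallel. It illustrates the idea only; MergeKit's actual implementation adds normalization and per-tensor handling on top of this.]

```python
# Minimal sketch of SLERP (spherical linear interpolation) between two
# parameter tensors, falling back to linear interpolation when the vectors
# are nearly parallel. An illustration of the idea, not MergeKit's code.
import torch

def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    a = v0.flatten().float()
    b = v1.flatten().float()
    # Angle between the two weight vectors.
    cos_omega = torch.dot(a, b) / (a.norm() * b.norm() + eps)
    cos_omega = cos_omega.clamp(-1.0, 1.0)
    omega = torch.acos(cos_omega)
    if omega.abs() < 1e-4:
        # Nearly parallel: plain linear interpolation is numerically safer.
        out = (1.0 - t) * a + t * b
    else:
        sin_omega = torch.sin(omega)
        out = (torch.sin((1.0 - t) * omega) / sin_omega) * a \
            + (torch.sin(t * omega) / sin_omega) * b
    return out.reshape(v0.shape).to(v0.dtype)

# Example: halfway between two random tensors standing in for one layer's weights.
w0, w1 = torch.randn(512, 512), torch.randn(512, 512)
w_merged = slerp(0.5, w0, w1)
```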

00:18:13:23 - 00:18:37:23
Charles: The origin of MergeKit was an implementation of the paper on resolving interference when merging. I forget the exact name, but the method is called TIES-Merging, and basically it's sparsifying before you merge. And the original release of MergeKit was actually just that - a script that did that technique. So I'm quite fond of that line of methods.
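
[Editor's note: the paper Charles is referring to appears to be "TIES-Merging: Resolving Interference When Merging Models" (Yadav et al., 2023). The recipe boils down to three steps on the task vectors - the deltas between each fine-tune and the shared base: trim small-magnitude updates, elect a sign per parameter, and average only the updates that agree with the elected sign. The rough per-tensor sketch below illustrates that recipe and is not MergeKit's actual code.]

```python
# Rough per-tensor sketch of TIES-style merging: trim, elect sign, merge.
# This is an illustration of the idea, not MergeKit's implementation.
import torch

def ties_merge(base: torch.Tensor, finetunes: list[torch.Tensor],
               density: float = 0.2) -> torch.Tensor:
    # 1) Task vectors: how each fine-tune moved away from the base weights.
    deltas = [ft - base for ft in finetunes]

    # 2) Trim: keep only the top `density` fraction of each delta by magnitude.
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))

    # 3) Elect a sign per parameter by total mass, then average only the
    #    deltas whose sign agrees with the elected sign.
    stacked = torch.stack(trimmed)                  # [num_models, *shape]
    elected_sign = torch.sign(stacked.sum(dim=0))
    agrees = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    summed = (stacked * agrees).sum(dim=0)
    counts = agrees.sum(dim=0).clamp(min=1)
    merged_delta = summed / counts

    return base + merged_delta

# Example with toy tensors standing in for one layer's weights.
base = torch.randn(8, 8)
merged = ties_merge(base, [base + 0.1 * torch.randn(8, 8) for _ in range(3)])
```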

00:18:39:23 - 00:19:03:24 Mary: I'd add to that, for Ignacio... send in your further questions on our LinkedIn page, and we will get Charles back for future shows to go more into depth. And MergeKit is remarkably easy to get started with, so if you start playing around with it and have specific questions, send those in too. Let's bring in another question. We've got two questions from Shelbi Nayak. She says: Great conversation on model merging and leveraging small models, something we want to do at Goodfin as well…

00:19:04:01 - 00:19:14:12 Mary: That's not a question, but here's a question, she says: I'm curious, what is the additional cost associated on top of fine-tuning a base model?

00:19:14:14 - 00:19:34:23 Mark: I think she's probably referring to the additional cost of merging… It's very efficient. It's going to be very, very low cost from the merge perspective. Charles, you've probably never done any pricing or costing of usage on merges, I assume, because it's so little.

00:19:34:23 - 00:19:56:04 Charles: I actually have looked at it a little bit. I just, you know, when I was initially writing this entirely in my spare time, like nights and weekends, I occasionally would run pods for bigger merges. So I looked at the cost of it, and you can round it to zero. The cost is two or three minutes of a CPU virtual instance, if you spin up a new instance for it.

00:19:56:04 - 00:20:01:17 Mark: So you could maybe pay a couple of cents if you were really spendy about it.

00:20:01:19 - 00:20:08:27 Mark: We got another question from Ignacio. Yeah.

00:20:11:04 - 00:20:35:14 Mark: All right. So basically: if I train a model in the legal domain and one in the math domain, and I use some of these methods, I get one that works for both. That is exactly it, Ignacio. You're combining two models into one, and you're fusing the information of both models into one model. But as Charles kind of mentioned before, you know, model merging is extremely strong, right?

00:20:35:14 - 00:20:51:18 Mark: It's a method that is extremely strong. I think it's going to revolutionize LLMs, to be honest - if it hasn't already, it's going to really transform the way people think of LLMs. And pairing it with Continual Pre-Training is actually extremely powerful, right? …

00:20:51:18 - 00:21:12:14
Mark: So if I train a model for legal and I train one for math - right now, presently, the most common way to do that is you would train two full models. You would say, use Mistral as a base or Llama as a base, and then you train one model for math, and then you train a legal model…

00:21:12:14 - 00:21:37:28 Mark: So you still had to train them, right? You still had to train over the two 7 billion parameter models. So what we're striving towards - pairing with what Charles had mentioned earlier - is merging different-size models and merging different architectures. What this does is open up the ability to train a much smaller model, say a 1 billion parameter model, and then merge it with a 7 billion to 30 billion parameter model.

00:21:38:01 - 00:21:56:22 Mark: So the training itself becomes much more efficient and cost-effective… And then you still reap the rewards and the massive benefits, because you're using transfer learning and you're getting the great benefits of the already-trained model that exists with great general capabilities - general reasoning or whatever that may be - and you're combining it into one.

00:21:56:24 - 00:21:58:29 Mark: So hopefully that makes sense.

00:21:59:02 - 00:22:04:15 Mary: A quick follow up on that from Ignacio…

00:22:04:17 - 00:22:25:06 Mark: What could be some of the use cases compared to fine-tuning one model only? Well, I'll let Maxime handle this. But briefly - there's a real problem when you fine-tune a model, right? It's called catastrophic forgetting, right? Models forget what they used to know when you go over them with new data.

00:22:25:08 - 00:22:39:09
Mark: So catastrophic forgetting is a real thing. If you fine-tune the model - you do, like, a fine-tuning of a model - it's going to forget some of its general capabilities. The general reasoning in the model is probably going to be degraded. But go ahead, Maxime, I'll let you answer.

00:22:39:11 - 00:23:13:11 Maxime: Yeah, I think that model merging, it's not just about merging two models. Now we have, like, an ecosystem of merges, and it's built iteratively, so each generation is better than the previous one. I would say that it provides really good base models. I've had a lot of feedback from people using the models I've made, which are merges, and they fine-tune them to do something else, like different languages, too.

00:23:13:14 - 00:23:40:17 Maxime: And I think this is the main strength: it gives you excellent base models. Then you can fine-tune them if you want to, and you're going to get even better models for your use case. But I think this is really the main point behind model merging, beyond just: I have a math model and I have another model, I merge them, and I retain both capabilities.

00:23:40:19 - 00:24:06:14 Maxime: It's also: my model is better in general, and that's very, very valuable. I think the reason behind it too, to give some intuition, is that it makes the weights less redundant. It's better at compressing the information inside of the weights, and this is what you get by iteratively merging the parameters of these models.

00:24:06:17 - 00:24:23:18 Mary: If you guys are okay with this, we're just going to keep going with viewer questions. I'll read this one, and then whoever wants to can answer. This is from Cirkunov Orinda: What would be the most effective technique when it comes to knowledge composition - MoE (Mixture of Experts), model merging, or block expansion?

00:24:32:07 - 00:25:05:22 Charles: Sure. So, you know, that's a hard question to answer. Model merging is great at it. Block expansion is also great - I love block expansion, that's a really cool technique, and it's one of my favorite things to play with lately as well. A mixture of experts might be a good way to do this, but it hasn't really been demonstrated yet that it can be effective at this - you know, an actual, classically trained mixture of experts.

00:25:05:24 - 00:25:32:18 Charles: …The experts don't end up being, like, subject experts so much as they end up specializing on very, very abstract, hard-for-humans-to-parse things, like parts of speech, or… even that's, you know, more specific than what an actual trained mixture of experts would specialize on… There is a type of merge that you can do where you combine two or more dense models into a mixture of experts…

00:25:32:18 - 00:25:41:18 Charles: And that works. But it actually doesn't work quite as well as just straight up merging the models into one.

00:25:41:21 - 00:25:57:27 Mark: Yeah, that's one thing that I've mentioned to Charles - like, that kind of merging is big. That's something that's really exciting to me personally. I think that's something we definitely have on the roadmap to dig into more from the research side. So, definitely.

00:25:57:29 - 00:26:00:24 Mary: Did you want to add something on that, Maxime? 

00:26:00:25 - 00:26:20:24 Maxime: I wanted to say that - it’s my experience that mixture of experts is really exciting, it works really well, like better than what I would have expected. But if you really want performance, if you want to be efficient - model merging, it's very difficult to beat. 

00:26:20:26 - 00:26:25:28 Mary: Another question from James R…

00:26:26:01 - 00:26:33:11 Mark: What role do you believe there is or is not for leveraging model merging in the eventual future development of AGI?

00:26:33:13 - 00:26:35:25 Mary: I imagine that's the trillion dollar question.

00:26:36:02 - 00:26:50:17 Mark: Yeah, the question assumes that we achieve AGI. But let's just say we do. I'm not 100% sold that we do, but let's say we do. What do you think, Maxime?

00:26:50:20 - 00:27:09:13 Maxime: Yeah, well, I'm not very comfortable with this question. I would say that the CTO of Hugging Face said AutoMerger was probably the way of getting to AGI… or that was Julien at Hugging Face, actually…

00:27:09:16 - 00:27:11:18 Mark: Charles, any opinion on that?

00:27:11:21 - 00:27:36:23 Charles: Yeah, I mean, if it happens, it'll probably be involved in some small part, just because it's such a convenient way to pack more capability, more intelligence, into the same number of parameters. I can't really guess beyond that. I mean, whether or not the transformer architecture just by itself is enough to reach AGI, I don't think anybody knows that.

00:27:36:23 - 00:27:53:03 Charles: And frankly, even at this point, it's hard to tell where the line is for AGI. Something that I find pretty funny is how, you know, like five years ago, the Turing test was the gold standard. And then sometime in the last three years, we took the Turing test out behind the shed. It doesn't cut it anymore.

00:27:53:03 - 00:27:57:20 Charles: It's just an "I'll know it when I see it" kind of thing.

00:27:57:20 - 00:27:59:09 Mary: Yeah, exactly.

00:27:59:12 - 00:28:23:29 Mary: James, stay tuned to future episodes of the show - we'll get back to you with a definitive answer on that question, haha. Now, a question from Giovanni: How can I be sure that when performing a merge, there's no overfitting on the benchmark tests? By the way, I have been following Maxime for several months, using MergeKit daily.

00:28:23:29 - 00:28:38:15 Mark: Yeah, I've been doing like five models per day. But while I'm convinced that the models are improving, part of me also suspects that they might be overfitting… On Hugging Face, there is a thread about data contamination that discusses this issue.

00:28:38:17 - 00:28:41:15 Mark: Yeah, go ahead, Maxime, because I know you brought this up earlier.

00:28:41:17 - 00:29:12:27
Maxime: Yeah, this is a really good question. There is definitely data contamination, and the benchmark that is the most contaminated and the most useless - also, unrelated, but it's just not a very good benchmark - is TruthfulQA. And it's really been bothering me for a while. For a moment I had this project of, like, rebuilding this merge pyramid from scratch, using only models that were not contaminated.

00:29:12:29 - 00:29:43
Maxime: But I was quite alone in these efforts and it was really a lot of work. So in the end, I kind of gave up on that. I think the best way to convince yourself that you're not just overfitting one benchmark is to use different benchmarks. So you have, of course, the Open LLM Leaderboard, and you have the Nous suite, which is pretty good, and which is very correlated to it.

00:29:43:12 - 00:30:15:19
Maxime: So it's pretty nice. And then you have other benchmarks, like MT-Bench for multi-turn conversation. You also have things like EQ-Bench, and a lot of other benchmarks that you could use, and that provides a really good overview of the performance of the model in different areas. I think that's the best way we can tackle this issue - because if you create five merges a day, they're probably going to overfit it.

00:30:15:21 - 00:30:44:27
Maxime: And that's fine, honestly. That's also okay. And about the data contamination issue, I think it's fine. In the end, I mean, I'm at peace with it now. I think that, indeed, it's contaminated, but also the models are better. So in the end, it's what we want - we want to create better models first. So I think it's okay, as long as you acknowledge it and you know about this issue.

00:30:44:29 - 00:31:06:14
Mary: Okay. Let's bring in another question. But I will mention that Mark may have to drop off. Charles and Maxime are good to go for a few more minutes to answer questions. And Mark, if you need to go, just give us a wave and we'll see you in two weeks. All right. This is from Nabi Sharma: Great talk, and thanks for your work on MergeKit and AutoMerger.

00:31:06:17 - 00:31:22:05
Mary: The question is, in a scenario where you are using model merging to update the learning of other LLMs, are there techniques to ensure that catastrophic forgetting does not happen?

00:31:23:29 - 00:31:44:27
Charles: So it's not necessarily a way to make catastrophic forgetting not happen. It's more that Model Merging offers an avenue to undo the damage of catastrophic forgetting. So you know, you can take one model, fine-tune it with your huge new data set, and you get something that ends up really, really good on the task.

00:31:44:27 - 00:32:01
Charles: You turn it on and it's forgotten everything else. If you merge that back into your base model - and there are other settings you can play with to make this work better - you end up getting the capabilities of both, you know. So you still retain the base model's abilities to do the original tasks it was trained on.

00:32:01:12 - 00:32:07:17
Charles: And then you also get, you know, most of the performance of your downstream fine-tuned model.
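
[Editor's note: one minimal way to picture "merging it back into the base" is to scale the fine-tune's task vector - its weights minus the base weights - and add it back onto the base, which is roughly what a weighted linear merge does. The model names and blend factor below are illustrative assumptions, not recommendations; in practice this is the kind of knob a MergeKit config exposes.]

```python
# Sketch of "merging it back into the base": scale the fine-tune's task vector
# (fine-tuned weights minus base weights) and add it back onto the base.
# ALPHA = 1.0 keeps the full fine-tune; smaller values give back more of the
# base model's general abilities. Model names and ALPHA are placeholders.
import torch
from transformers import AutoModelForCausalLM

BASE = "your-org/base-model"            # placeholder
FINETUNED = "your-org/finetuned-model"  # placeholder
ALPHA = 0.6                             # illustrative blend factor only

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float32)
tuned = AutoModelForCausalLM.from_pretrained(FINETUNED, torch_dtype=torch.float32)

tuned_state = tuned.state_dict()
blended = {
    name: param + ALPHA * (tuned_state[name] - param)
    for name, param in base.state_dict().items()
}

base.load_state_dict(blended)
base.save_pretrained("./merged-back-into-base")
```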

00:32:07:19 - 00:32:41:21
Mark: Yeah, I'll give an example. You know, when I do merging of, like, PEFT adapter-based models - just a simple merge back with a base variation - it increases the performance of the model by an average of 8 to 10% across the Open LLM benchmarks. It increases it from the original tuned model by 8 to 10%, just by, you know, having that merge back with the base model, as Charles says.

00:32:41:21 - 00:32:57
Mark: Right. It's kind of - it's allowing it to undo the damage of catastrophic forgetting, because fusing with the base model gets it to relearn the information that it had lost when you did the fine-tune.

00:32:57:12 - 00:33:09:00
Mary: Let's bring this question in real quick before you drop off, Mark. This is a question from Vic: what are the limitations of SLMs?

00:33:09:02 - 00:33:27:29
Mark: Yeah, I mean, obviously I'm a little biased, because I'm all for them. But I mean, of course, SLMs are smaller models, right? They don't know as much as large models, right? So if you are hoping you can get a 7 billion parameter model to function in the way that Claude or ChatGPT does, it's not going to happen.

00:33:28:03 - 00:33:55:13

Mark: But SLMs are right for 99% of business use cases. The reason behind that is, if you want to utilize a language model for your business, do you care that it can recite poetry or rap music, or tell you who won Best Picture at the Oscars in '95?

00:33:55:13 - 00:34:24
Mark: You don't care, right? You want a narrow, grounded, use-case-and-task-associated model, right? So are there limitations? Yes. You know, there are limitations on the general capabilities and general reasoning of the smaller model in comparison to the large model. But I'll say that's what we're doing here with model merging, right? With an SLM paired with model merging, you get the general reasoning capabilities back from the model that you do the merge with.

00:34:35:01 - 00:34:51
Mary: That's actually a terrific place to wrap. And what I'm going to tell you as we wrap the show - people who still have questions, Ignacio, Acuna, and some others… we'll take those questions and build them into our next show coming up in two weeks, and hopefully we'll have Charles back. Maxime, you're welcome as well…

00:35:11:03 - 00:35:31:01
Mary: Thank you, everybody, for watching. We'll get it up on YouTube soon. We'll get the audio version out as a podcast, and the recording will stay on our LinkedIn page. So keep chiming in there with the questions you all want answered next time. Any final words from Maxime, Charles, Mark?

00:35:31:04 - 00:35:33:11
Mark: Maxime, we'll start with you.

00:35:33:13 - 00:35:51:24
Maxime: Yeah, keep merging. It's great. It is really easy to do, and if you don't have experience with it, you can do it in a few minutes and then evaluate it very easily. So really, it's very concrete. And I encourage everyone to get this experience.

00:35:51:27 - 00:35:55:24
Mary: Hashtag #keepmerging, love it.

00:35:55:26 - 00:35:57:17
Mary: Charles, if you want to add anything?

00:35:57:19 - 00:36:01:02
Charles: I mean, what else is there to say? Haha.

00:36:01:05 - 00:36:21:00
Mark: Stay tuned, because Charles and Maxime and Arcee - we are doing some fantastic things in the world of model… things that are really groundbreaking. We'll be releasing some great things in the coming weeks and months.

00:36:25:25 - 00:36:33:24
Mary: All right. Thanks, everybody, for joining. Take care. And we'll see you in two weeks for episode two of The Small Language Model Show. Take care. Bye.