Introducing SuperNova-Medius: Arcee AI's 14B Small Language Model That Rivals a 70B

First came our flagship 70B SuperNova, followed by the 8B SuperNova-Lite. Today we add to this family of super-powerful Small Language Models with the release of the 14B SuperNova-Medius.

Arcee-SuperNova-Medius is an extremely compact (14B) yet powerful language model that balances size and performance, offering capabilities closer to those of our full-size 70B SuperNova model than to those of our 8B SuperNova-Lite variant.

SuperNova-Medius excels at high-quality instruction-following and complex reasoning tasks, and has a deep reservoir of world knowledge. This makes it an excellent choice for a wide variety of business use cases including customer support, content creation, and advanced technical assistance.

How We Trained Arcee-SuperNova-Medius

How did we pack so much into this 14B powerhouse? The development of SuperNova-Medius was unique, to say the least. Like SuperNova and SuperNova-Lite, it was distilled from Llama-3.1-405B – except this time the student model is built on a different architecture than its teacher, which is no small feat.

Details below, straight from Arcee AI's own Charles Goddard (the creator of MergeKit), who explains the cross-architecture distillation process.

💡 Where Can I Access SuperNova-Medius?

Arcee-SuperNova-Medius is available for open access under the Apache-2.0 license on Hugging Face, and provides a cost-effective solution that runs efficiently on much smaller hardware than larger models require. For those who need even higher performance, our full-size 70B SuperNova model is available via an Arcee-hosted API or for local deployment. To learn more or to explore deployment options, please reach out to sales@arcee.ai.
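If you just want to try it, here's a minimal loading sketch using Hugging Face transformers, assuming the model is published under arcee-ai/SuperNova-Medius; the prompt and generation settings are purely illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/SuperNova-Medius"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Example chat-style prompt (illustrative only)
messages = [{"role": "user", "content": "Give me three bullet points on why small language models matter."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```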

Distillation

The current hotness in the world of Small Language Models (SLMs) is definitely distillation.

Distillation is a process by which the knowledge and capabilities of a large "teacher" model can be transferred to a smaller "student" model (ideally using less compute than it would take to train the student model from scratch). There are a number of approaches to this, which I'll briefly cover here.

Synthetic Data Distillation

Often when people refer to distillation, they are referring to the process of generating synthetic data from a large model and using it to train a smaller model. This approach was used (to great effect) by Meta for their Llama 3.1 series of models, by Microsoft for the Phi series, and by many others.
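As a rough sketch of what this looks like in practice (model name and seed prompts are just placeholders, not a recipe we used):

```python
from transformers import pipeline

# Hypothetical teacher model; any large instruction-tuned model works the same way.
teacher = pipeline("text-generation", model="meta-llama/Llama-3.1-405B-Instruct", device_map="auto")

prompts = ["Explain model distillation in two sentences."]  # placeholder seed prompts

synthetic_examples = []
for prompt in prompts:
    out = teacher(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, return_full_text=False)
    # Store prompt/completion pairs; these become ordinary supervised fine-tuning data for the student.
    synthetic_examples.append({"prompt": prompt, "completion": out[0]["generated_text"]})
```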

Pros:

  • Architecture agnostic, vocabulary agnostic, everything agnostic.
  • Extremely well-established and understood.
  • Easy to implement, easy to scale.

Cons:

  • Learns the sampled output of the teacher model, not the underlying distribution.
  • Unlikely to capture long-tail or rare events.
  • Kinda boring (okay maybe that's just me).

Logit Distillation

A more direct approach to knowledge transfer is to use the logits of the large model as the target for the small model. By using the KL divergence between the student and teacher model's logits as the loss function, the student model can be trained to mimic not just the sampled output of the teacher model but the entire distribution of probabilities that the teacher model assigns to each token. This can be done either online (with the teacher model being inferenced concurrently with the student model) or offline (with the teacher model's logits being computed separately and stored for later use). This approach is implemented in Arcee AI's DistillKit library.
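Here's a rough sketch of what that loss looks like in PyTorch (illustrative only, not DistillKit's exact implementation; the temperature scaling is just a common convention):

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the teacher's and student's next-token distributions.

    Both tensors are [batch, seq_len, vocab_size]; the vocabularies must match.
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student); "batchmean" is PyTorch's recommended reduction for KL,
    # and the t**2 factor is the usual compensation when softening with a temperature.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)
```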

Pros:

  • Directly trains the student model to mimic the teacher model's output distribution.
  • Works on any architecture that outputs logits.
  • Can be used with synthetic or real data.

Cons:

  • (For online distillation) Requires the teacher model to be inferenced at the same time as the student model.
  • (For offline distillation) Requires the storage and transfer of the teacher model's logits, which can be quite large.
  • Student and teacher must have the same vocabulary.

Hidden State Distillation

Hidden state distillation is an approach that similarly attempts to train a student model to mimic a teacher model, but instead of trying to match logits to logits, it tries to match the hidden states of the models (with a linear projection for dimensionality matching). This can similarly be done online or offline. This approach is also implemented in Arcee AI's DistillKit library.
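Conceptually, the loss looks something like the sketch below (again illustrative, not DistillKit's exact code): a learned linear projection bridges the dimensionality gap, and the student is pushed toward the teacher's hidden representations.

```python
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateDistillLoss(nn.Module):
    """Match student hidden states to teacher hidden states via a learned projection."""

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Linear projection so the student's hidden size can differ from the teacher's.
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, student_hidden, teacher_hidden):
        # student_hidden: [batch, seq, student_dim], teacher_hidden: [batch, seq, teacher_dim]
        return F.mse_loss(self.proj(student_hidden), teacher_hidden)
```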

Pros:

  • Teacher and student models can have different vocabularies (and even different architectures).
  • Can be used with synthetic or real data.

Cons:

  • Same online/offline requirements as logit distillation.
  • Less direct than logit distillation.

A Fourth, More Upsetting Approach

For the training of Arcee's SuperNova model, we used an offline logit-based approach to distill Llama 3.1 405B into a 70B model. We inferenced the 405B model on a large dataset ahead of time and stored the logits for later use. Storing the full set of logits would be prohibitively expensive (think petabytes) so we instead stored the top K logits for each token – K having been selected to capture most of the probability mass but still be manageable in size. This was fantastically effective - the model turned out extremely well. As a byproduct of this, though, we had a set of sparse 405B logits sitting around that my brain was just itching to make trouble with. When Qwen released the Qwen2.5 14B model it became obvious what I had to do.
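To give a feel for the top-K trick, here's a simplified sketch (not our production pipeline): store only the K largest teacher logits per position, then train the student against a distribution renormalized over those K entries.

```python
import torch
import torch.nn.functional as F

def top_k_logit_targets(teacher_logits, k=64):
    """Keep only the top-K teacher logits per token position for offline storage.

    Everything outside the top K is discarded, trading a small amount of
    probability mass for a huge reduction in storage.
    """
    values, indices = torch.topk(teacher_logits, k, dim=-1)
    return values, indices

def sparse_kl_loss(student_logits, topk_values, topk_indices, temperature=1.0):
    """Distillation loss restricted to the stored top-K teacher entries.

    This renormalizes the teacher distribution over the K stored logits, which is
    one simple way to handle the missing tail of the distribution.
    """
    t = temperature
    # Gather the student's logits at the teacher's top-K vocabulary indices.
    student_topk = torch.gather(student_logits, -1, topk_indices)
    teacher_probs = F.softmax(topk_values / t, dim=-1)
    student_log_probs = F.log_softmax(student_topk / t, dim=-1)
    return -(teacher_probs * student_log_probs).sum(-1).mean() * (t ** 2)
```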

Since Qwen has a different vocabulary and architecture from Llama, the logits are obviously not directly usable. The hidden states weren't stored, so hidden state distillation was out. Instead, I went with good ol' fashioned tensor surgery.

mergekit-tokensurgeon, part of mergekit, is a tool I wrote that replaces the vocabulary of a model with that of another. It uses k-nearest neighbors in embedding space to approximate embeddings for tokens not found in the intersection of the two vocabularies. This is absolutely a heuristic approach, but it works surprisingly well - models are generally coherent, and even in the worst case, a few hundred thousand tokens of training can bring the model back to where it should be.
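The core idea, in a very simplified sketch (not mergekit-tokensurgeon's actual code): for a token the base model has never seen, look up its nearest neighbors among tokens that both vocabularies share, then combine the base model's embeddings of those neighbors.

```python
import torch

def approximate_embedding(target_vec, donor_shared, base_shared, k=8):
    """Approximate a base-model embedding for a token only the donor vocabulary has.

    target_vec:   the token's embedding in the donor model's space, shape [d_donor]
    donor_shared: donor-model embeddings of tokens present in both vocabularies, [n, d_donor]
    base_shared:  base-model embeddings of those same shared tokens, [n, d_base]
    """
    # Find the k nearest shared tokens in the donor's embedding space.
    dists = torch.cdist(target_vec.unsqueeze(0), donor_shared).squeeze(0)  # [n]
    knn_dists, knn_idx = torch.topk(dists, k, largest=False)
    # Distance-weighted average of the base model's embeddings for those neighbors.
    weights = torch.softmax(-knn_dists, dim=0)
    return (weights.unsqueeze(1) * base_shared[knn_idx]).sum(dim=0)  # [d_base]
```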

To start off, I used mergekit-tokensurgeon to create a version of Qwen2.5-14B that uses the vocabulary of Llama 3.1 405B. I then trained this model with the 405B logits as the target. The result was quite good! The process felt too straightforward to me, though. So I also separately distilled Qwen2.5-72B into Qwen2.5-14B. Then I used mergekit-tokensurgeon to change the vocabulary of the Llama 405B-distilled Qwen 14B back to the Qwen vocabulary, and merged the two distilled models together (plus an in-house Arcee AI fine-tune for flavor).

The result is Arcee-SuperNova-Medius (14B). It's a highly capable model for its size, with killer instruction-following, and a surprising amount of world knowledge. It's both a proof of concept for multi-architecture, multi-teacher distillation through merging and a powerful tool in its own right. Bone apple teeth.
– Charles Goddard, Arcee AI Chief of Frontier Research


Which SuperNova is Right For Me?


Are you wondering which SuperNova variant is the best fit for your use case – the 70B flagship model, the 8B SuperNova-Lite, or perhaps now the 14B SuperNova-Medius?

Our team would be happy to tell you more about how companies of all sizes are putting various versions of SuperNova to work for practical use cases.

Reach out here to schedule a time with us. We look forward to hearing from you!