Arcee is excited to exit stealth with an open-core platform for contextualizing domain-adapted language models (DALMs).
Unifying the Retriever and Generator
As language modeling techniques have evolved, we have seen increasing effectiveness from retrieval-augmented generation (RAG), where the generator model is provided with relevant context documents from a context database. Documents are fetched with a retriever model, which embeds them into a semantic space where they can be matched to queries at inference time.
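The retrieval step above can be sketched in a few lines. This is a toy illustration, assuming a bag-of-words "embedding" for simplicity; a real retriever uses a learned dense encoder, but the shape of the computation (embed documents, embed the query, rank by similarity) is the same:

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words token counts. A real retriever
    # would use a trained dense encoder (e.g. a BERT-style model).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    # Rank documents by similarity to the query in the embedding space.
    q = embed(query)
    scored = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:k]

docs = [
    "a method for coating turbine blades",
    "an antibody treatment for arthritis",
]
print(retrieve("turbine blade coatings", docs))
# → ['a method for coating turbine blades']
```

At inference time, the retrieved documents are prepended to the generator's prompt as context.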
Despite the effectiveness of RAG, the generator and retrieval systems have evolved in separate stacks and are served at inference time together through multiple API calls: embed with OpenAI, query Pinecone, query OpenAI; or embed with a Hugging Face embedder, query Weaviate, query Cohere; and so on. Not only do we think the multiple API calls are a problem, we believe that these systems should be trained jointly and then served via a single API that combines both the generator and the retrieval database. That is why we are releasing the DALM repo, which extends the work that has been done in end-to-end retrieval-augmented generation, or what some of the more hardcore NLP community call "True RAG".
The DALM repo unifies the retriever and generator with an end-to-end differentiable loss to train both models jointly. When you train a DALM, you provide both the instructions you will fine-tune your language model with and the context you will allow it to draw from at inference time. This process allows the model to rely more on the context, which offloads some parameter requirements and results in smaller generator weights. Our initial research shows that a strong training signal is shared with the retrieval model.
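One common way to make the retriever differentiable through the generator is the RAG-style marginal likelihood: weight the generator's likelihood of the target answer under each retrieved document by the retriever's probability of that document, and minimize the negative log of the sum. The toy numeric sketch below illustrates this idea only; `joint_rag_loss` and its inputs are illustrative and are not the DALM repo's actual loss implementation:

```python
import math

def softmax(scores):
    # Convert raw retriever scores into a probability distribution.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def joint_rag_loss(doc_scores, gen_log_likelihoods):
    # Negative log marginal likelihood of the target output:
    #   -log( sum_d  p_retriever(d | query) * p_generator(target | query, d) )
    # Because both factors are differentiable, gradients reach the
    # retriever and the generator in a single backward pass.
    p_docs = softmax(doc_scores)
    marginal = sum(p * math.exp(ll)
                   for p, ll in zip(p_docs, gen_log_likelihoods))
    return -math.log(marginal)

doc_scores = [2.0, 0.5]   # retriever scores for two retrieved docs
gen_ll = [-1.0, -5.0]     # generator log-likelihood of the target under each doc
print(joint_rag_loss(doc_scores, gen_ll))
```

Note that raising the retriever's score on the document the generator finds most useful lowers the loss, which is exactly the signal that lets joint training teach the retriever what the generator needs.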
Initial Results and Effectiveness
Our initial results show significant improvement over general retrievers within domains that are underrepresented on the web: language from healthcare, legal, and financial domains tends to be difficult for general models to retrieve. And if your LM system must capture a particular nuance, like company acronyms, the domain adaptation of end-to-end training is especially important.
For some initial experiments, we trained our end-to-end model on US patent abstracts. With a context of 10K patent abstracts, we saw a 50% improvement over baseline retrieval. When we scaled the context to 200K abstracts, we saw improvements of around 30%, which still significantly surpasses typical contrastive-loss retriever training.
The best part: all of these routines are open source, so you can run them on your own context and replicate our results.
Why Open Core?
We are building Arcee open core because we believe in a world of millions, if not billions, of in-domain language models operating within organizations. It is important that the ownership and creation of these models stays with the community and the engineers who are designing these language modeling systems. General models will occupy a wide breadth of general tasks where language is plentiful on the web, like Stack Overflow Q&A, but uniquely nuanced LMs that push the intellectual capabilities of organizations will be tailored specifically to in-domain data.
We are also developing DALM scaling infrastructure for organizations that want to take their models on-prem into production faster and with more reliability. We believe that pairing this infrastructure with our repository will encourage broader adoption of domain-adapted models and propel the industry forward.
Getting Started With DALM
Check out our demo models for DALM-Patent, DALM-PubMed, and DALM-SEC to see how DALM models query and behave in production. You can also check out our open-source DALM modeling repository to run training and inference on your own GPUs and contribute to Project DALM. Happy joint retrieval and generation!