How Do I Prep my Data to Train an LLM?

So you want to train a custom language model, and you do have the requisite large set of text data. But how do you know that the data is *really actually ready* for model training? Our researchers here at Arcee AI tell you what to look out for.

We all know the data adage "Garbage-In, Garbage-Out" – any results you get from your data can only be as good as the data itself. It's a saying that applies, of course, in the world of artificial intelligence: the quality of any AI model depends on the quality of the data that you've fed into it.

Here at Arcee AI, every day we talk to organizations that are eager to build, train and deploy custom LLMs (actually, what we call Small Language Models or SLMs – because our models are so efficient). As we get them started on their SLM journey, we start by reminding them – or teaching them – how to properly prepare their text data before using it to train a language model.

Some of our brilliant researchers put together this guide to what you need to know as you prep your data to train an SLM or LLM. They've divided their advice into the two main considerations you need to keep in mind: the amount of data you're working with, and the quality of that data.

Data Quantity

Data quantity plays a pivotal role in shaping the capabilities of Large Language Models (LLMs). The more extensive your dataset, the more nuanced and accurate the understanding your models can achieve. Data quantity empowers language models to generalize effectively across diverse topics and tasks, underpinning their ability to comprehend and generate human-like text.

Large-Scale Corpus

The pre-training corpus should be large to ensure the model has acquired extensive knowledge. Typically, this involves processing billions or trillions of tokens for pre-training a general-purpose model to capture the complexity and variability of language. Integrating diverse data sources further enhances the effectiveness of pre-training language models.
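If you want a quick sense of whether your corpus is anywhere near that scale, a rough token count is a good first check. Below is a minimal sketch, assuming a local directory of .txt files and the Hugging Face transformers library; the GPT-2 tokenizer is just an example stand-in for whatever tokenizer your model will actually use.

```python
# Rough sketch: estimate how many tokens a local text corpus contains.
# Assumes a directory of .txt files and the `transformers` library;
# "gpt2" is an example tokenizer, not a recommendation.
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def count_corpus_tokens(corpus_dir: str) -> int:
    total = 0
    for path in Path(corpus_dir).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Tokenizing whole files at once is fine for counting, even though
        # it exceeds the model's usual context length.
        total += len(tokenizer(text)["input_ids"])
    return total

print(f"Approximate token count: {count_corpus_tokens('my_corpus'):,}")
```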

In Continual Pre-training (CPT) tasks, focusing on large-scale domain-specific datasets is crucial. These datasets, whether supervised or unsupervised, play a pivotal role in refining models for specific domains. Interestingly, CPT requires significantly less data than initial pre-training, while still delivering comparable performance.

Augmenting with Synthetic Data

Acquiring large, diverse, and high-quality datasets is a challenge, often due to data scarcity, privacy concerns, and the high costs of collecting and labeling data. Numerous analysts predict that we will run out of fresh text data by 2050 and image data by 2060. To tackle these issues, synthetic data has emerged as a promising solution. Augmenting with synthetic data is a complex topic in itself, and we'll be publishing an upcoming blog devoted just to that.

Data Quality

Beyond sheer volume, the quality of data defines the foundation of reliable and effective language models. Ensuring data quality and utilizing filtering techniques to exclude undesirable text are vital for optimizing the performance of the language models.

Data Diversity and Distribution

In the previous section, we focused on corpus size – but what if we have a huge corpus that still fails to capture the varied patterns and distributions of the language? This is why your input data also needs to be a diverse corpus.

More diverse pre-training data enhances the model's ability to acquire a wider range of knowledge – similar to the impact of a large-scale corpus, but different in that the model is exposed to various and broad ranges of possible input data. Including varied data sources can help in developing comprehensive language models and improving the model’s generalizability.

Deduplicated Dataset

Deduplicating the pre-training data is the process of removing duplicates and redundant data samples, which helps the model by preventing it from memorizing repeated sequences and instead encourages generalization. This procedure enhances models’ robustness against forgetting, and boosts the model’s performance in acquiring factual knowledge.
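A hedged sketch of the simplest version of this – exact deduplication by hashing normalized text – is shown below; near-duplicate detection (for example, MinHash) is a common next step that we leave out here.

```python
# Minimal sketch of exact deduplication: drop documents whose normalized
# text hashes to a value we've already seen.
import hashlib

def deduplicate(docs):
    seen = set()
    unique = []
    for doc in docs:
        # Collapse whitespace and lowercase so trivially different copies collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["The cat sat.", "the  cat   sat.", "A different sentence."]
print(deduplicate(docs))  # keeps only two of the three documents
```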

Data Filtering 

Data filtering is another crucial step that significantly enhances the efficiency of the Continual Pre-Training (CPT) process. By meticulously selecting only the most effective and relevant tokens from the pre-training data, we can reduce the overall computational cost and resource consumption. In Efficient Continual Pre-training for Building Domain Specific Large Language Models, the authors illustrate how strategic data selection can lead to substantial performance gains with minimal data and compute resources. They propose simple yet effective data selection strategies that outperform standard CPT with just 10% of the corpus size and cost, without compromising performance on open-domain tasks. This approach not only ensures that the model focuses on high-quality domain-specific data, but also reduces redundancy and noise in the training process. There are a number of different filtering techniques, and some of the most useful strategies are introduced below.
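To make the idea concrete, here is one hedged sketch of a data-selection strategy – not the exact method from the paper – that scores candidate documents by embedding similarity to a small set of in-domain seed documents and keeps only the top fraction. It assumes the sentence-transformers package; the model name is just an example.

```python
# Sketch of similarity-based data selection (an illustration, not the
# method from the cited paper). Assumes `sentence-transformers`.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def select_top_fraction(candidates, seed_docs, keep_fraction=0.1):
    cand_emb = model.encode(candidates, normalize_embeddings=True)
    seed_emb = model.encode(seed_docs, normalize_embeddings=True)
    # Score each candidate by its best cosine similarity to any seed document.
    scores = (cand_emb @ seed_emb.T).max(axis=1)
    k = max(1, int(len(candidates) * keep_fraction))
    top_idx = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in top_idx]
```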

Language Filtering
The first necessary step in collecting pre-training data for language modeling is to filter the data based on the target languages that the model will work with, and to filter out data from other languages. Language filtering can be applied to both natural languages and programming languages.
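A common way to do this in practice is with an off-the-shelf language-identification model. The sketch below assumes fastText's publicly available lid.176.bin model file has been downloaded locally, and keeps only documents confidently identified as English.

```python
# Minimal sketch of language filtering with fastText language identification.
# Assumes the pre-trained lid.176.bin file is available locally.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def keep_english(docs, threshold=0.9):
    kept = []
    for doc in docs:
        # fastText's predict() expects a single line of text.
        labels, probs = lid_model.predict(doc.replace("\n", " "))
        if labels[0] == "__label__en" and probs[0] >= threshold:
            kept.append(doc)
    return kept
```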

Content Filtering
The content filtering step involves eliminating data that contains toxic, explicit, or otherwise extremely inappropriate content, to enhance the model's fairness and safety. While this step can reduce harmful outputs from the model, it might also limit the model's ability to perform well on standard benchmarks and tasks. Thus, there is a trade-off between preserving the model's generalization ability and removing toxic content from the pre-training dataset to mitigate the risk of toxic content generation.
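One hedged sketch of this step is shown below: it runs each document through a publicly available toxicity classifier and drops anything scored above a threshold. The model name is only an example, and the label names and threshold should be verified against whatever classifier you actually use.

```python
# Sketch of classifier-based content filtering. The model name is an
# example of a public toxicity classifier; check its label set, license,
# and a sensible threshold before relying on it.
from transformers import pipeline

toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def is_clean(doc, threshold=0.5):
    # Assumes the classifier's top label is something like "toxic" for
    # content we want to remove; verify this for your chosen model.
    result = toxicity(doc, truncation=True)[0]
    return not (result["label"].lower() == "toxic" and result["score"] >= threshold)

clean_docs = [d for d in ["a harmless sentence"] if is_clean(d)]
```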

The content in the samples might also leak people's Personally Identifiable Information (PII). Recent experiments have shown that language models can reproduce PII at inference time. Therefore, filtering PII out of the collected dataset can enhance the model's performance while protecting users' privacy.
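A lightweight starting point is regular-expression scrubbing of the most obvious identifiers, as in the hedged sketch below; production pipelines typically add named-entity recognition and far stricter patterns than these two illustrative ones.

```python
# Sketch of simple PII scrubbing with regular expressions (emails and
# phone numbers only – illustrative, not exhaustive).
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(scrub_pii("Contact jane.doe@example.com or +1 (555) 123-4567."))
# -> "Contact [EMAIL] or [PHONE]."
```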

Domain-Specific Filtering
While some language models aim to learn general-purpose knowledge, others can be trained on domain-specific data. CPT and alignment are examples of settings where a language model focuses on domain-specific knowledge. Domain-specific filtering gives the model access to in-domain data, which improves performance in the targeted areas.
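As a toy illustration of the idea, the sketch below keeps documents that contain enough hits from a hand-picked in-domain vocabulary. A trained domain classifier (or the embedding-similarity selection shown earlier) is the more robust choice; the medical terms here are just placeholders.

```python
# Toy sketch of keyword-based domain filtering; the term list is a placeholder.
import re

MEDICAL_TERMS = {"diagnosis", "dosage", "clinical", "patient", "symptom"}

def in_domain(doc: str, min_hits: int = 2) -> bool:
    tokens = re.findall(r"[a-z]+", doc.lower())
    return sum(t in MEDICAL_TERMS for t in tokens) >= min_hits

docs = ["The patient reported no change after the first dosage.",
        "The football match ended in a draw."]
print([d for d in docs if in_domain(d)])  # keeps only the first document
```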

Data Age

It's crucial to consider the temporal relevance of the input data. Significant model drift over time or across different domains after domain adaptation and alignment may lead to performance degradation. There are three different distributional shifts that can cause model drift: temporal, content, and language.

Temporal Shift
Temporal shift describes the distributional changes of the data over time, occurring when the input variables no longer align with the target variables. There are several types of temporal shifts:

  • Recurring: the shift happens regularly, such as seasonal shopping patterns.
  • Sudden: the shift occurs unexpectedly, like the sharp decline in restaurant visits after the onset of the COVID-19 pandemic.
  • Gradual: the shift progresses slowly and predictably, such as the continuous advancement in fraud detection methods as both detection techniques and fraud attempts become more sophisticated over time.

Content Shift
Content shift occurs when the LLM learns from different fields: the distribution of the input data changes, such as when the knowledge domain switches from Chemistry to Biology.

Language Shift
Language shift occurs when the model learns from different language corpora, resulting in a change in the data pipeline – such as switching the data language from English to French.

Detecting and addressing model drift is vital because a language model's accuracy can deteriorate rapidly after deployment if the production data diverges from the model’s training data, resulting in incorrect predictions and increased risk. Deploying a model is merely the first step; ongoing monitoring is necessary to ensure sustained performance.
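What that monitoring looks like will vary, but one simple drift signal is to compare the word distribution of recent production text against a reference sample of the training data. The hedged sketch below uses Jensen-Shannon divergence for that comparison; a value that creeps upward over time suggests drift worth investigating.

```python
# Sketch of a simple drift signal: Jensen-Shannon divergence between the
# word distributions of a training-data sample and recent production text.
import math
from collections import Counter

def word_dist(texts):
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0) + q.get(w, 0)) for w in vocab}

    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

reference = word_dist(["placeholder training-era documents go here"])
production = word_dist(["placeholder recent production documents go here"])
print(f"JS divergence: {js_divergence(reference, production):.3f}")
```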

Data Mixing

Data mixing plays a critical role in CPT of language models by ensuring the effective use of diverse pre-training data domains – such as Wikipedia, books, and web text. The composition of these sources significantly impacts model performance across various downstream tasks. Earlier studies suggest that computing an optimal ratio of pre-training data domains is beneficial for improving language model performance. The DoReMi paper, for example, introduces a method to determine optimal domain weights through distributionally robust optimization, leading to improved performance and efficiency in training large models. 
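In the simplest terms, data mixing boils down to choosing how often each domain is sampled during training. The hedged sketch below just samples documents according to a fixed set of placeholder weights; DoReMi-style methods learn these weights rather than hand-picking them.

```python
# Minimal sketch of weighted domain mixing with placeholder pools and weights.
import random

domain_pools = {
    "wikipedia": ["wiki doc 1", "wiki doc 2"],
    "books":     ["book excerpt 1"],
    "web":       ["web page 1", "web page 2", "web page 3"],
}
domain_weights = {"wikipedia": 0.3, "books": 0.2, "web": 0.5}  # placeholder ratios

def sample_mixed_batch(n: int):
    domains = list(domain_weights)
    weights = [domain_weights[d] for d in domains]
    picked = random.choices(domains, weights=weights, k=n)
    return [random.choice(domain_pools[d]) for d in picked]

print(sample_mixed_batch(5))
```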

More recent work, such as Efficient Continual Pre-training by Mitigating the Stability Gap, highlights the importance of strategies such as using high-quality sub-corpora and data mixtures similar to the pre-training data to reduce distribution gaps. These strategies help mitigate the "stability gap" phenomenon, where models initially experience a performance drop before recovering.

Additionally, replay buffer strategies, as discussed in Simple and Scalable Strategies to Continually Pre-train Large Language Models, demonstrate that combining learning rate re-warming, re-decaying, and replaying previous data can match the performance of fully re-training models – while using significantly less compute. These approaches collectively enhance the efficiency and effectiveness of CPT for adapting large language models to new domains.
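To give a flavor of what this looks like in code, here is a hedged sketch in the spirit of (but not identical to) those strategies: a fixed fraction of previously seen data is replayed alongside the new domain, and the learning rate is re-warmed and then decayed again. The replay fraction, peak learning rate, and step counts are all illustrative.

```python
# Sketch of replay mixing plus a re-warmed cosine learning-rate schedule;
# all hyperparameters below are illustrative placeholders.
import math
import random

def build_cpt_stream(new_domain_docs, replay_docs, replay_fraction=0.25):
    # Interleave a fixed share of previously seen data with the new domain.
    n_replay = int(len(new_domain_docs) * replay_fraction)
    stream = new_domain_docs + random.sample(replay_docs, min(n_replay, len(replay_docs)))
    random.shuffle(stream)
    return stream

def rewarmed_cosine_lr(step, warmup_steps=1000, total_steps=20000,
                       peak_lr=1e-4, min_lr=1e-5):
    if step < warmup_steps:  # re-warm from a low starting LR
        return min_lr + (peak_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```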

The Arcee AI Team is Here to Help

We hope this article has helped clarify your questions about how to prepare your text data to train a language model. Here at Arcee AI, we guide companies through this process every day, as they prep their data to train and deploy custom Small Language Models (SLMs) in our end-to-end platform. Please feel free to write to us with your questions – the easiest way to find us is to drop a note on one of our LinkedIn posts here!