Partner Spotlight: Arcee AI 🤝 MongoDB

Joint customers use MongoDB and Arcee AI to take data from JSON files and turn it into world-class custom language models with practical business use cases, in just a few clicks.

As the leading platform for training Small Language Models (SLMs), we make it easy for customers to build domain-specific language models while working with their existing data and storage infrastructure. For many LLM developers, that includes MongoDB as their database solution – and here at Arcee AI we’re proud to work closely with MongoDB as one of our leading partners. 

For those who don’t know: Intro to MongoDB

Built by developers, for developers, MongoDB's data platform is a database with an integrated set of related services. It lets development teams meet the growing requirements of a wide variety of modern applications, all in a unified and consistent user experience.

How do Arcee AI & MongoDB work together?

Arcee AI empowers MongoDB customers to build, train, and deploy their own custom language models, starting with their JSON data. Arcee AI converts that data into parquet files for upload to our end-to-end platform, where users create high-performing, efficient language models tailored to their data in just a few clicks.

“Joint Arcee AI / MongoDB customers, which are primarily in the finance and insurance verticals, are thrilled with the ease-of-use of the Arcee training platform and the short timeline to go from dataset to a deployed custom language model,” says Arcee AI Head of Solutions Engineering Tyler Odenthal.

🔰 Here’s how to get started:

  • Make sure your MongoDB server is running.
  • Have your mongodb_uri, db_name, and collection_name on hand.
  • Make sure the collection named by collection_name contains your training data as JSON documents.
  • Create an S3 bucket whose name matches the bucket_name used in the script.
  • Save your Arcee API key and AWS credentials as environment variables.
  • Run the following Python script:

```python
import json
import os
import boto3
import requests
from pymongo import MongoClient
from uuid import uuid4

def upload_data_to_s3(bucket_name, folder_name, data):
    # Initialize the S3 client using the default credentials provider chain
    # (this picks up AWS credentials from your environment variables)
    s3_client = boto3.client('s3')

    # Upload each document as a separate JSON file
    for i, document in enumerate(data):
        s3_file_name = f"{folder_name}/{uuid4()}.json"
        s3_client.put_object(Bucket=bucket_name, Key=s3_file_name, Body=json.dumps(document))
        print(f"Document {i + 1} uploaded to S3 bucket '{bucket_name}' as '{s3_file_name}'.")

    # Return the S3 folder URL
    s3_folder_url = f"s3://{bucket_name}/{folder_name}"
    return s3_folder_url

def fetch_data_from_mongodb(mongodb_uri, db_name, collection_name, query=None):
    # Use None instead of a mutable default argument; an empty filter matches all documents
    if query is None:
        query = {}

    # Connect to MongoDB and close the connection when done
    with MongoClient(mongodb_uri) as client:
        collection = client[db_name][collection_name]

        # Fetch data
        data = list(collection.find(query))

    for doc in data:
        doc['_id'] = str(doc['_id'])  # Convert ObjectId to string for JSON serialization

    return data

def call_arcee_api(corpus_name, s3_folder_url, tokenizer_name, block_size, arcee_api_url, arcee_api_key, arcee_organization):
    payload = {
        "corpus_name": corpus_name,
        "s3_folder_url": s3_folder_url,
        "tokenizer_name": tokenizer_name,
        "block_size": block_size,
    }

    headers = {
        'Content-Type': 'application/json',
        'x-arcee-org': arcee_organization,  # Set the x-arcee-org header
        'x-token': arcee_api_key
    }
    # A timeout keeps the script from hanging if the API does not respond
    response = requests.post(arcee_api_url, headers=headers, data=json.dumps(payload), timeout=30)

    if response.status_code == 200:
        print("Successfully called Arcee API.")
    else:
        print(f"Failed to call Arcee API: {response.status_code}, {response.text}")

if __name__ == "__main__":
    # MongoDB configuration
    mongodb_uri = 'mongodb://localhost:27017/'  # Your MongoDB URI
    db_name = 'Test'  # Your database name
    collection_name = 'Training'  # Your collection name

    # AWS S3 configuration
    bucket_name = 'your-s3-bucket'  # Your S3 bucket name (S3 bucket names cannot contain underscores)
    folder_name = 'mongo_data'  # The folder name to be created in S3

    # Arcee API configuration
    arcee_api_url = 'https://app.arcee.ai/api/v2/pretraining/corpus'  # Your Arcee API URL
    arcee_organization = 'ARCEE_ORG'  # Your Arcee organization
    # Read the API key from the environment (the variable name ARCEE_API_KEY is an assumption)
    arcee_api_key = os.environ['ARCEE_API_KEY']
    corpus_name = 'test_corpus'  # Your corpus name
    tokenizer_name = 'meta-llama/Meta-Llama-3-8B'  # Your tokenizer name
    block_size = 512  # Your block size

    # Fetch data from MongoDB
    data = fetch_data_from_mongodb(mongodb_uri, db_name, collection_name)

    # Upload data to S3
    s3_folder_url = upload_data_to_s3(bucket_name, folder_name, data)
    print(s3_folder_url)

    # Call Arcee API
    call_arcee_api(corpus_name, s3_folder_url, tokenizer_name, block_size, arcee_api_url, arcee_api_key, arcee_organization)
```
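
One thing to watch for: the script above converts only `_id` to a string before calling `json.dumps`. If your documents hold other values that JSON cannot represent, such as dates or nested ObjectIds, serialization raises a `TypeError`. Here is a minimal sketch of one way to handle that, assuming it is acceptable to store such values as plain strings. The names `to_json_safe` and `FakeObjectId` are hypothetical and used only for illustration; they are not part of the Arcee AI or MongoDB APIs.

```python
import json
from datetime import datetime, timezone


def to_json_safe(document):
    """Serialize a document, turning values json cannot encode into strings."""
    # json.dumps calls `default` only for objects it cannot encode natively
    return json.dumps(document, default=str)


class FakeObjectId:
    """Hypothetical stand-in for bson.ObjectId so this sketch runs without pymongo."""

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return self.value


# Example document with two values that plain json.dumps would reject
doc = {
    "_id": FakeObjectId("64b7f0c2a1b2c3d4e5f60718"),
    "text": "Quarterly claims summary",
    "created_at": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
}

serialized = to_json_safe(doc)
restored = json.loads(serialized)
print(restored)
```

In the script, you could pass `default=str` to the `json.dumps` call inside `upload_data_to_s3` to get the same effect. Whether string-encoded dates suit your training data is a judgment call for your own dataset.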

Now, you’re ready to start training your custom language model, which you can deploy to any environment. Learn more here.  

MongoDB as part of your Generative AI solutions

Thousands of companies across the globe trust MongoDB with the critical workloads that power their GenAI applications. Arcee AI is one of a broad ecosystem of trusted partners that MongoDB collaborates with to ensure that their customers have access to the most innovative AI solutions. “Our partners are increasingly important as organizations look to innovate with generative AI, and they play a major role in making this a reality,” says Alan Chhabra, Executive Vice President of Partners at MongoDB. 

Learn more about MongoDB as part of leading Generative AI solutions


• MongoDB Launches New Program for Enterprises to Build Modern Applications with Advanced Generative AI Capabilities (MAAP)
• MongoDB for AI
• Generative AI is shaping the future of search
• MongoDB highlights partners at annual awards