# Building Production-Ready RAG Systems with LangChain

## Introduction

Retrieval-Augmented Generation (RAG) has become the standard approach for building AI systems that need to work with your specific data. Unlike fine-tuning, RAG allows you to keep your LLM general while providing it with relevant context at query time. But building a RAG system that works in demos is very different from building one that works in production.

At Commit Software, we've deployed RAG systems processing thousands of queries daily across multiple industries. This guide shares the patterns and practices that made the difference between a demo and a production system.

## The RAG Architecture That Scales

A production RAG system consists of four main components:

### 1. Document Processing Pipeline

The ingestion pipeline is where most RAG systems fail. A naive approach of chunking documents by character count leads to lost context and poor retrieval quality.

Semantic Chunking Strategy:

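Instead of cutting at a fixed character count, we split recursively on structural boundaries (paragraphs first, then lines, then sentences), so chunks follow the document's structure wherever possible: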

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader


def create_semantic_chunks(documents):
    """
    Create semantically meaningful chunks that respect
    document structure while maintaining context.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        # Try paragraph breaks first, then lines, sentences, words, characters
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )
    chunks = []
    for doc in documents:
        doc_chunks = splitter.split_documents([doc])
        # Add metadata for retrieval context
        for i, chunk in enumerate(doc_chunks):
            chunk.metadata["chunk_index"] = i
            chunk.metadata["total_chunks"] = len(doc_chunks)
        chunks.extend(doc_chunks)
    return chunks


# Example usage (the file path is a placeholder):
# chunks = create_semantic_chunks(PyPDFLoader("example.pdf").load())
```

Key insight: The 200-character overlap means text near a chunk boundary appears in both neighboring chunks, so a sentence that straddles a boundary is less likely to be cut off from its context. We've found this ratio (20% overlap) works well for most document types.
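To make the overlap concrete, here is a minimal sketch of a plain sliding-window chunker. It is a simplified stand-in and not how `RecursiveCharacterTextSplitter` works internally, since that splitter also prefers separator boundaries. The function name `sliding_chunks` and the sample text are illustrative assumptions.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Illustrative input: 2,600 characters of repeating digits.
sample_text = "".join(str(i % 10) for i in range(2600))
chunks = sliding_chunks(sample_text, chunk_size=1000, overlap=200)

# The last 200 characters of each chunk reappear at the start of the next one.
shared = [a[-200:] == b[:200] for a, b in zip(chunks, chunks[1:])]
overlap_ratio = 200 / 1000  # the 20% ratio mentioned above
```

Each window advances by `chunk_size - overlap` characters, which is why a larger overlap produces more chunks (and more embedding calls) for the same document.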

### 2. Embedding and Vector Storage

The choice of embedding model significantly impacts retrieval quality. We recommend:

- OpenAI text-embedding-3-small for general use cases (cost-effective, good quality)
- OpenAI text-embedding-3-large for precision-critical applications
- Cohere embed-multilingual-v3.0 for multilingual documents

For vector storage in production, we use Pinecone for its managed infrastructure and Qdrant for self-hosted deployments.

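```python
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536,  # must match the dimension of the Pinecone index
)

# Connects to an existing index; expects PINECONE_API_KEY in the environment
vectorstore = PineconeVectorStore(
    index_name="production-rag",
    embedding=embeddings,
    namespace="documents",
)
```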
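For intuition about what the vector store does at query time, the following is a small sketch of top-k retrieval by cosine similarity. The three-dimensional vectors and document IDs are made up for illustration; real embeddings have far more dimensions, and a managed store uses approximate indexes rather than this brute-force scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Return the IDs of the k vectors most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy index: hypothetical document IDs mapped to made-up embeddings.
toy_index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-auth": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]
nearest = top_k(query_vec, toy_index, k=2)
```

Because the ranking depends only on vector geometry, the quality of the embedding model directly bounds the quality of what can be retrieved.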

### 3. Retrieval Strategy

Simple similarity search isn't enough for production. We implement a hybrid retrieval strategy that combines multi-query retrieval with contextual compression:

Multi-Query Retrieval:

The user's query is often ambiguous. We use an LLM to generate multiple query variations:

```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.3),
)
```
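Conceptually, multi-query retrieval runs each query variation against the index and merges the results while dropping duplicates. The sketch below shows only that merge step. The `fake_search` function and its canned results are hypothetical stand-ins for a vector store lookup, and the query variations are hard-coded rather than generated by an LLM.

```python
from typing import Callable


def multi_query_retrieve(
    queries: list[str], search: Callable[[str], list[str]]
) -> list[str]:
    """Run every query variation and return the de-duplicated union of hits."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for doc_id in search(query):
            if doc_id not in seen:  # keep the first occurrence only
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical canned results standing in for a vector store.
FAKE_RESULTS = {
    "how do refunds work": ["doc-refunds", "doc-billing"],
    "refund policy": ["doc-refunds", "doc-terms"],
    "getting money back after purchase": ["doc-billing", "doc-returns"],
}


def fake_search(query: str) -> list[str]:
    return FAKE_RESULTS.get(query, [])


variations = list(FAKE_RESULTS)
merged_hits = multi_query_retrieve(variations, fake_search)
```

Note the cost trade-off: every variation is an extra retrieval call, and the merged set can be several times larger than `k`, which is one reason a compression or reranking step usually follows.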

Contextual Compression:

Not all retrieved chunks are equally relevant. We compress and filter:

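```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# `retriever` is the MultiQueryRetriever defined above.
# Assumption: a small, deterministic model is sufficient for extraction.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
```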
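To illustrate the idea of compression without calling a model, here is a rough sketch that keeps only the sentences sharing a keyword with the query and drops chunks with nothing relevant. This keyword-overlap heuristic is a deliberately crude stand-in for the LLM-based extraction above, not an equivalent; the stopword list and function names are assumptions.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "do", "how", "what", "in", "of", "to"}


def _terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def extract_relevant(query: str, chunk: str) -> str:
    """Keep only the sentences of `chunk` that share a term with the query."""
    query_terms = _terms(query)
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if query_terms & _terms(s)]
    return " ".join(kept)


def compress(query: str, chunks: list[str]) -> list[str]:
    """Compress every chunk and drop those with nothing relevant left."""
    compressed = (extract_relevant(query, chunk) for chunk in chunks)
    return [text for text in compressed if text]


retrieved = [
    "Refunds are issued within 14 days. Our office is in Berlin.",
    "Shipping takes 3 days.",
]
compressed_chunks = compress("How long do refunds take?", retrieved)
```

The second chunk is dropped even though it is loosely related, which shows why exact keyword matching is too brittle for production and why an LLM or a reranker is used instead.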

### 4. Generation with Guardrails

The final generation step needs careful prompting and guardrails:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

RAG_PROMPT = ChatPromptTemplate.from_template("""
You are an assistant that answers questions based on the provided context.

Rules:
1. Only use information from the provided context
2. If the context doesn't contain the answer, say "I don't have enough information to answer that question"
3. Cite the source when possible
4. Be concise but complete

Context:
{context}

Question: {question}

Answer:
""")

# The retriever fills {context}; the raw user question passes through unchanged
chain = (
    {"context": compression_retriever, "question": RunnablePassthrough()}
    | RAG_PROMPT
    | ChatOpenAI(model="gpt-4o", temperature=0)
    | StrOutputParser()
)
```
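One guardrail that can live outside the prompt is how the context is assembled. The sketch below numbers each retrieved chunk and attaches its source so the model has something concrete to cite, and it short-circuits when nothing was retrieved. The dictionary shape (`text`, `source`) and the helper names are assumptions for illustration, not part of LangChain.

```python
from typing import Optional

NO_ANSWER = "I don't have enough information to answer that question"


def format_context(docs: list[dict]) -> str:
    """Render chunks as numbered entries that carry their source label."""
    entries = []
    for number, doc in enumerate(docs, start=1):
        source = doc.get("source", "unknown")
        entries.append(f"[{number}] (source: {source}) {doc['text'].strip()}")
    return "\n\n".join(entries)


def build_prompt_inputs(question: str, docs: list[dict]) -> Optional[dict]:
    """Return prompt variables, or None when there is no context to ground on."""
    if not docs:
        return None
    return {"context": format_context(docs), "question": question}


docs = [
    {"text": "Refunds are issued within 14 days.", "source": "policy.pdf"},
    {"text": " Contact support for exceptions. "},
]
inputs = build_prompt_inputs("How long do refunds take?", docs)

# With no retrieved chunks, skip the LLM call and return the fallback directly.
empty_inputs = build_prompt_inputs("How long do refunds take?", [])
answer_when_empty = NO_ANSWER if empty_inputs is None else "(call the chain)"
```

Returning the fallback message without calling the model saves tokens and removes any chance of the model answering from its own prior knowledge.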

## Production Considerations

### Caching Strategy

Cache at multiple levels to reduce latency and costs:

- Query-level caching: Cache responses for identical queries
- Embedding caching: Cache embeddings for frequently queried documents
- LLM response caching: Use semantic caching for similar queries
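As an illustration of the first level, here is a minimal in-memory query cache with a time-to-live. It normalizes case and whitespace so trivially different spellings of the same query share an entry. The class name, the default TTL, and the `fake_rag_chain` function are assumptions; a production deployment would more likely use a shared store such as Redis.

```python
import time
from typing import Callable


class QueryCache:
    """In-memory cache keyed by the normalized query text, with expiry."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so near-identical queries match.
        return " ".join(query.lower().split())

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        key = self._key(query)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh cached answer
        answer = compute(query)
        self._entries[key] = (now, answer)
        return answer


calls: list[str] = []


def fake_rag_chain(query: str) -> str:
    """Hypothetical stand-in for an expensive chain invocation."""
    calls.append(query)
    return f"answer to: {query}"


cache = QueryCache(ttl_seconds=60)
first = cache.get_or_compute("What is RAG?", fake_rag_chain)
second = cache.get_or_compute("  what is   rag? ", fake_rag_chain)
```

The TTL matters for RAG in particular: when the underlying documents are re-ingested, cached answers can go stale, so the expiry should be no longer than the acceptable staleness window.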

### Monitoring and Observability

Track these metrics in production:

- Retrieval relevance: Are the retrieved chunks actually relevant?
- Answer groundedness: Is the answer supported by the context?
- Latency P50/P99: How fast are queries being answered?
- Token usage: Are you staying within budget?

We use LangSmith for tracing and Ragas for automated evaluation.
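For the latency metric, the sketch below computes P50 and P99 from raw samples using the nearest-rank method. The sample values are invented for illustration, and tracing tools typically compute these figures for you; the point is to show why the two numbers can differ so much.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Invented end-to-end latencies in milliseconds; one request hit a slow path.
latencies_ms = [120, 135, 150, 160, 180, 210, 240, 300, 450, 2100]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Here the median looks healthy while P99 is dominated by a single slow request, which is the kind of tail behavior (cold caches, retries, long contexts) that an average would hide.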

### Error Handling

Production RAG systems must gracefully handle:

- Empty retrieval results
- Token limit exceeded
- API rate limiting
- Malformed documents
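For the rate-limiting case, a common pattern is to retry with exponential backoff. The following is a minimal sketch under stated assumptions: `RateLimitError` is a hypothetical stand-in for whatever exception your provider's client raises, and the delays are deterministic here, whereas production code usually adds random jitter.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with delays of base_delay * 2**attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


attempts = {"count": 0}


def flaky_llm_call() -> str:
    """Fails twice with a rate-limit error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


recorded_delays: list[float] = []
result = call_with_backoff(flaky_llm_call, sleep=recorded_delays.append)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; only rate-limit errors are retried, so genuine bugs still fail fast.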

## Cost Optimization

A production RAG system can get expensive quickly. Here's how we optimize:

- Use smaller models for retrieval reranking
- Implement aggressive caching
- Batch embedding requests
- Use gpt-4o-mini for simple queries, gpt-4o for complex ones

Our typical production system processes 10,000 queries/day at approximately $50-100 in API costs.
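Two of the optimizations listed above, model routing and batched embedding requests, can be sketched in a few lines. The routing heuristic (word count plus a handful of trigger phrases) and its thresholds are illustrative assumptions only; a real router would be tuned on your own traffic or use a classifier.

```python
from typing import Iterator

# Hypothetical phrases suggesting a query needs multi-step reasoning.
COMPLEX_HINTS = ("compare", "explain why", "summarize", "step by step")


def choose_model(query: str, max_simple_words: int = 20) -> str:
    """Route long or reasoning-heavy queries to the larger model."""
    text = query.lower()
    is_long = len(text.split()) > max_simple_words
    has_hint = any(hint in text for hint in COMPLEX_HINTS)
    return "gpt-4o" if is_long or has_hint else "gpt-4o-mini"


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices so each embedding request carries many texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


simple_choice = choose_model("What is our refund window?")
complex_choice = choose_model("Compare the 2023 and 2024 refund policies")
batches = list(batched(["a", "b", "c", "d", "e"], batch_size=2))
```

Each batch would then go out as a single embedding request, which cuts per-request overhead and makes it easier to stay under rate limits.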

## Conclusion

Building production-ready RAG systems requires careful attention to document processing, retrieval strategies, and operational concerns. The patterns we've shared here have been battle-tested across multiple deployments processing thousands of queries daily.

The key takeaways:

- Semantic chunking with proper overlap preserves context
- Hybrid retrieval with multi-query and compression improves relevance
- Proper guardrails in generation reduce hallucinations
- Comprehensive monitoring catches issues before users do

At Commit Software, we specialize in building these production-grade RAG systems. If you're looking to implement RAG for your business data, [contact us](/contact) for a consultation.
