## Introduction
Retrieval-Augmented Generation (RAG) has become the standard approach for building AI systems that need to work with your specific data. Unlike fine-tuning, RAG allows you to keep your LLM general while providing it with relevant context at query time. But building a RAG system that works in demos is very different from building one that works in production.
At Commit Software, we've deployed RAG systems processing thousands of queries daily across multiple industries. This guide shares the patterns and practices that made the difference between a demo and a production system.
## The RAG Architecture That Scales
A production RAG system consists of four main components:
### 1. Document Processing Pipeline
The ingestion pipeline is where most RAG systems fail. A naive approach of chunking documents by character count leads to lost context and poor retrieval quality.
Semantic Chunking Strategy:
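Instead of cutting fixed-size chunks at arbitrary character offsets, we split along document structure: the splitter tries paragraph breaks first, then line breaks, then sentence boundaries, and falls back to word- and character-level cuts only when a piece is still too large: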
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader


def create_semantic_chunks(documents):
    """
    Create semantically meaningful chunks that respect
    document structure while maintaining context.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )
    chunks = []
    for doc in documents:
        doc_chunks = splitter.split_documents([doc])
        # Add metadata for retrieval context
        for i, chunk in enumerate(doc_chunks):
            chunk.metadata["chunk_index"] = i
            chunk.metadata["total_chunks"] = len(doc_chunks)
        chunks.extend(doc_chunks)
    return chunks
```
Key insight: With a 200-character overlap, text that straddles a chunk boundary appears in both neighboring chunks, so context is less likely to be lost there. We've found this ratio (20% of the 1,000-character chunk size) works well for most document types.
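To make the overlap arithmetic concrete, here is a minimal sketch in plain Python. It deliberately ignores the separator logic of the splitter above and only isolates how a window of `chunk_size` characters advances by `chunk_size - overlap` each step. The function name `chunk_with_overlap` is ours, for illustration only, and is not part of LangChain.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; each window repeats the
    last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 250  # 2,500 characters
    pieces = chunk_with_overlap(sample)
    # The tail of one chunk is the head of the next
    print(pieces[0][-200:] == pieces[1][:200])
```

With the default values, the window moves 800 characters at a time, so every boundary region is covered twice.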
### 2. Embedding and Vector Storage
The choice of embedding model significantly impacts retrieval quality. We recommend:
- OpenAI text-embedding-3-small for general use cases (cost-effective, good quality)
- OpenAI text-embedding-3-large for precision-critical applications
- Cohere embed-multilingual-v3.0 for multilingual documents
For vector storage in production, we use Pinecone for its managed infrastructure and Qdrant for self-hosted deployments.
```python
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536
)
vectorstore = PineconeVectorStore(
    index_name="production-rag",
    embedding=embeddings,
    namespace="documents"
)
```
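For readers new to vector search, the sketch below shows roughly what a similarity lookup does under the hood: score every stored vector against the query vector and keep the best `k`. It is a brute-force illustration with toy 2-dimensional vectors, not how Pinecone or Qdrant are implemented (managed stores use approximate nearest-neighbor indexes), and the names `cosine_similarity` and `top_k` are our own.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], toy_index, k=2))
```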
### 3. Retrieval Strategy
Simple similarity search isn't enough for production. We implement a hybrid retrieval strategy that combines multi-query expansion with contextual compression:
Multi-Query Retrieval:
The user's query is often ambiguous. We use an LLM to generate multiple query variations:
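```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# The LLM rewrites the user's question into several variations;
# each variation is run against the vector store with k=10.
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
)
```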
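The merging step of multi-query retrieval can be sketched without any framework: run each query variation, then keep the unique union of the results in first-seen order. This is a simplified illustration; `multi_query_retrieve` and the `retrieve` callable are hypothetical names, and documents are represented here as plain string IDs.

```python
from typing import Callable


def multi_query_retrieve(
    queries: list[str],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Run every query variation and return the de-duplicated union of results."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for doc_id in retrieve(query):
            if doc_id not in seen:  # skip documents another variation already found
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    # Hypothetical retriever backed by a lookup table
    fake_results = {
        "reset password": ["doc-1", "doc-2"],
        "recover account access": ["doc-2", "doc-3"],
    }
    print(multi_query_retrieve(list(fake_results), lambda q: fake_results.get(q, [])))
```

Because several variations often surface the same chunk, de-duplicating before generation avoids paying for repeated context tokens.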
Contextual Compression:
Not all retrieved chunks are equally relevant. We compress and filter:
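```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# Assumption: the original snippet does not define `llm`; a small model is used here
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)
```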
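To show the shape of the compress-and-filter step without an LLM call, the sketch below swaps in a crude keyword-overlap heuristic: it keeps only the sentences that share at least one term with the query and drops chunks where nothing survives. This is a stand-in for illustration, not what `LLMChainExtractor` does internally, and it will over-match on common words unless stopwords are removed. All names are hypothetical.

```python
import re


def _terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compress_chunks(query: str, chunks: list[str], min_shared_terms: int = 1) -> list[str]:
    """Keep only query-related sentences; drop chunks with no related sentence."""
    query_terms = _terms(query)
    compressed = []
    for chunk in chunks:
        # Naive sentence split on ., ! or ? followed by whitespace
        sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
        kept = [s for s in sentences if len(_terms(s) & query_terms) >= min_shared_terms]
        if kept:
            compressed.append(" ".join(kept))
    return compressed


if __name__ == "__main__":
    docs = [
        "Our refund policy lasts 30 days. The office is in Berlin.",
        "We sell shoes.",
    ]
    print(compress_chunks("refund policy", docs))
```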
### 4. Generation with Guardrails
The final generation step needs careful prompting and guardrails:
```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

RAG_PROMPT = ChatPromptTemplate.from_template("""
You are an assistant that answers questions based on the provided context.

Rules:
1. Only use information from the provided context
2. If the context doesn't contain the answer, say "I don't have enough information to answer that question"
3. Cite the source when possible
4. Be concise but complete

Context:
{context}

Question: {question}

Answer:
""")

# The retriever fills {context}; the raw user input passes through as {question}
chain = (
    {"context": compression_retriever, "question": RunnablePassthrough()}
    | RAG_PROMPT
    | ChatOpenAI(model="gpt-4o", temperature=0)
    | StrOutputParser()
)
```
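Prompt rules are one layer of guardrail; a cheap second layer is to check the retrieved context in code before calling the model at all. The sketch below, under our own assumptions, returns the refusal message from the prompt when retrieval comes back empty and tags each chunk with its source so the model has something to cite. `answer_with_guardrail`, `build_context`, and the `generate` callable are hypothetical names; chunks are modeled as `(source, text)` tuples.

```python
from typing import Callable

REFUSAL = "I don't have enough information to answer that question"


def build_context(chunks: list[tuple[str, str]]) -> str:
    """Join chunks into one context string, labeling each with its source."""
    return "\n\n".join(f"[source: {source}]\n{text}" for source, text in chunks)


def answer_with_guardrail(
    question: str,
    chunks: list[tuple[str, str]],
    generate: Callable[[str, str], str],
) -> str:
    """Refuse without an LLM call when there is no context; otherwise generate."""
    if not chunks:
        return REFUSAL  # nothing retrieved, so nothing to ground an answer in
    return generate(build_context(chunks), question)


if __name__ == "__main__":
    # Hypothetical generator that just reports how much context it received
    echo = lambda context, question: f"{len(context)} context chars for: {question}"
    print(answer_with_guardrail("What is the refund window?", [], echo))
    print(answer_with_guardrail(
        "What is the refund window?",
        [("policy.pdf", "Refunds are accepted within 30 days.")],
        echo,
    ))
```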
## Production Considerations
### Caching Strategy
Cache at multiple levels to reduce latency and costs:
- Query-level caching: Cache responses for identical queries
- Embedding caching: Cache embeddings for frequently queried documents
- LLM response caching: Use semantic caching for similar queries
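As one possible shape for the query-level cache, here is a minimal in-memory sketch with a time-to-live. It treats queries as identical after lowercasing and collapsing whitespace, which is our assumption rather than a rule from the systems described above. A production deployment would more likely use a shared store such as Redis; the class name `QueryCache` is hypothetical.

```python
import time
from typing import Callable, Optional


class QueryCache:
    """In-memory cache of answers keyed by a normalized query string."""

    def __init__(self, ttl_seconds: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[str, float]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Assumption: case and extra whitespace do not change the meaning of a query
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        answer, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # expired: evict and report a miss
            return None
        return answer

    def put(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = (answer, self._clock())


if __name__ == "__main__":
    cache = QueryCache(ttl_seconds=60)
    cache.put("What is RAG?", "Retrieval-Augmented Generation")
    print(cache.get("  what is   rag? "))
```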
### Monitoring and Observability
Track these metrics in production:
- Retrieval relevance: Are the retrieved chunks actually relevant?
- Answer groundedness: Is the answer supported by the context?
- Latency P50/P99: How fast are queries being answered?
- Token usage: Are you staying within budget?
We use LangSmith for tracing and Ragas for automated evaluation.
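For the latency metric, P50 and P99 can be computed from raw request timings with the nearest-rank method, sketched below. This is an illustrative helper, not part of LangSmith or Ragas, and monitoring tools may use a different interpolation, so small differences from dashboard values are expected.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [120, 80, 200, 95, 1500]
    print("P50:", percentile(latencies_ms, 50))
    print("P99:", percentile(latencies_ms, 99))
```

Tracking both matters because a healthy median can hide a slow tail: in the sample above, one outlier dominates P99 while P50 stays low.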
### Error Handling
Production RAG systems must gracefully handle:
- Empty retrieval results
- Token limit exceeded
- API rate limiting
- Malformed documents
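Of the failure modes listed, rate limiting is usually handled with retries and exponential backoff. The sketch below is a generic version under stated assumptions: `RateLimitError` is a hypothetical stand-in for whatever exception your provider's client raises, and the delay values are illustrative defaults rather than recommendations from any API's documentation.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            # Up to 10% jitter so concurrent clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_call() -> str:
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimitError()
        return "ok"

    print(call_with_backoff(flaky_call, base_delay=0.01))
```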
## Cost Optimization
A production RAG system can get expensive quickly. Here's how we optimize:
- Use smaller models for retrieval reranking
- Implement aggressive caching
- Batch embedding requests
- Use gpt-4o-mini for simple queries, gpt-4o for complex ones
Our typical production system processes 10,000 queries/day at approximately $50-100 in API costs.
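Two of the points above lend themselves to a quick sketch: grouping texts into batches before sending them to an embedding endpoint, and turning the quoted figures into a per-query cost. The arithmetic assumes the $50-100 figure is a daily total for the 10,000 queries, which the text does not state explicitly. The helper names are our own, and the batch size of 100 is an arbitrary illustration, not a provider limit.

```python
from typing import Iterator


def batch(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def cost_per_query(total_cost_usd: float, query_count: int) -> float:
    """Average API cost of a single query."""
    if query_count < 1:
        raise ValueError("query_count must be at least 1")
    return total_cost_usd / query_count


if __name__ == "__main__":
    texts = [f"chunk {i}" for i in range(250)]
    # One embedding request per batch instead of one per chunk
    print([len(group) for group in batch(texts, 100)])

    # Assumption: $50-100 is the daily spend for 10,000 queries
    low = cost_per_query(50, 10_000)
    high = cost_per_query(100, 10_000)
    print(f"${low:.3f} to ${high:.3f} per query")
```

Under that assumption the range works out to between half a cent and one cent per query.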
## Conclusion
Building production-ready RAG systems requires careful attention to document processing, retrieval strategies, and operational concerns. The patterns we've shared here have been battle-tested across multiple deployments processing thousands of queries daily.
The key takeaways:
- Semantic chunking with proper overlap preserves context
- Hybrid retrieval with multi-query and compression improves relevance
- Proper guardrails in generation reduce hallucinations
- Comprehensive monitoring catches issues before users do
At Commit Software, we specialize in building these production-grade RAG systems. If you're looking to implement RAG for your business data, [contact us](/contact) for a consultation.