How to Build a Production RAG Pipeline: LLMs, Embeddings, and Vector Search
Retrieval-Augmented Generation (RAG) has become the dominant architecture pattern for building AI applications that need to reason over proprietary data. If you are building a customer support bot, an internal knowledge assistant, a legal document analyzer, or any system where a large language model needs access to your data, a RAG pipeline is almost certainly the right approach. This guide walks through the architecture, component choices, and production considerations for building a RAG pipeline that actually works at scale.
RAG systems are a core focus of our AI engineering services at BigInt Studio. This post captures the technical depth behind what goes into a production-grade implementation.
What RAG Is and Why It Matters
Large language models like Claude and GPT-4 are trained on massive public datasets, but they know nothing about your company's internal documents, your product catalog, your customer conversations, or your proprietary research. Fine-tuning can inject some of this knowledge, but it is expensive, slow to update, and does not scale well when your data changes frequently.
RAG solves this by decoupling knowledge from the model. Instead of baking information into model weights, a RAG pipeline retrieves relevant documents at query time and passes them as context to the LLM. The model generates its response grounded in your actual data, with source citations, up-to-date information, and domain-specific accuracy that a base LLM cannot provide.
The result: your AI application answers questions using your data, not its training data. And when your data changes, you update the index, not the model.
RAG Pipeline Architecture: The Four Stages
Every production RAG pipeline follows four stages: embed, store, retrieve, generate. Understanding each stage, and the decisions within it, is what separates a prototype from a production system.
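Before drilling into each stage, the whole pipeline can be sketched end to end. This is a minimal outline under stand-in components, not a real implementation: here `embed` is a toy function, the store is a plain list, and `answer` stops short of calling an LLM. In production these would be an embedding API, a vector database, and a generation call.

```python
# Minimal sketch of the four RAG stages with stub components.
# embed() is a toy placeholder for a real embedding model, and the final
# stage returns the assembled prompt instead of calling an LLM.

def embed(text: str) -> list[float]:
    # Stub: a real system calls an embedding model API here.
    return [float(len(text)), float(text.count(" "))]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def answer(query: str, store: list[tuple[str, list[float]]], top_k: int = 3) -> str:
    # Stage 3: retrieve -- rank stored chunks by similarity to the query.
    q_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    context = "\n".join(chunk for chunk, _ in ranked[:top_k])
    # Stage 4: generate -- a real system passes this to an LLM.
    return f"Context:\n{context}\n\nQuestion: {query}"

# Stages 1-2: embed document chunks and store them.
docs = ["Refunds are processed in 5 days.", "Shipping takes 2 weeks."]
store = [(d, embed(d)) for d in docs]
print(answer("How long do refunds take?", store, top_k=1))
```

Every decision discussed below is a refinement of one of these four steps.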
Stage 1: Embeddings - Turning Text into Vectors
The foundation of any RAG system is the embedding model. An embedding model converts text (a document chunk, a user query) into a high-dimensional numerical vector that captures semantic meaning. Similar texts produce similar vectors, which is what makes semantic search possible.
Choosing an embedding model:
- OpenAI text-embedding-3-small: The pragmatic default. Affordable ($0.02 per million tokens), high quality, 1536 dimensions. Good multilingual support. If you are starting out, start here.
- OpenAI text-embedding-3-large: Higher quality, 3072 dimensions. Use when retrieval precision is critical and you can afford the larger vector storage.
- Cohere embed-v3: Strong multilingual performance, especially for non-English languages. Worth evaluating if your documents are in Hindi, Tamil, or other Indian languages.
- Open-source options (bge-large-en-v1.5, e5-large-v2, GTE-large): Self-hosted, no API costs, full data privacy. Requires GPU infrastructure but eliminates vendor dependency. These models rank highly on the MTEB leaderboard and perform competitively with commercial offerings.
The critical rule: your query embedding model and your document embedding model must be the same. Mixing models produces vectors in different semantic spaces, and your retrieval will fail silently, returning results that look plausible but are semantically wrong.
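Cosine similarity is the standard way to compare embeddings, and it also illustrates why mixing models fails: vectors from different models live in different semantic spaces, often with different dimensions, so comparing them is meaningless or impossible. A sketch with toy vectors (real embeddings have 1536 or more dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Vectors from different models often do not even share a dimension count.
    if len(a) != len(b):
        raise ValueError("vectors come from different embedding spaces")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy vectors standing in for real embeddings.
query_vec = [0.2, 0.9, 0.1]
doc_vec   = [0.25, 0.85, 0.15]  # embedded with the SAME model: comparable
other_vec = [0.1, 0.3]          # a different model's 2-dim space: not comparable

print(round(cosine_similarity(query_vec, doc_vec), 3))  # close to 1.0
# cosine_similarity(query_vec, other_vec) would raise ValueError
```

The dimension mismatch is the lucky case, since it fails loudly. Two models with the same dimensionality fail silently, which is exactly the failure mode described above.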
Stage 2: Vector Store - Where Embeddings Live
Once you have embeddings, you need a vector database that supports fast similarity search across millions of vectors. The choices break down into three categories:
Managed vector databases:
- Pinecone: The most popular managed option. Serverless pricing, automatic scaling, metadata filtering, and a clean API. Excellent for teams that want zero infrastructure management. Supports hybrid search (vector + keyword) natively.
- Weaviate: Open-source with a managed cloud option. Supports multiple vectorization modules, GraphQL API, and built-in classification. Good if you want more flexibility than Pinecone.
- Qdrant: Open-source, Rust-based, fast. Strong filtering capabilities and payload storage. Can be self-hosted or used as a managed service.
PostgreSQL with pgvector:
If you already run PostgreSQL, the pgvector extension adds vector similarity search directly to your existing database. This is the best starting point for most teams: no new infrastructure, familiar tooling, and it handles tens of millions of vectors with proper indexing (HNSW or IVFFlat). The trade-off is that dedicated vector databases offer better performance at very large scale (100M+ vectors) and more sophisticated indexing options.
CREATE EXTENSION vector;
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT,
embedding VECTOR(1536),
metadata JSONB
);
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
When to use what: Start with pgvector if you are already on PostgreSQL and have fewer than 10 million vectors. Move to Pinecone or Qdrant when you need dedicated scaling, advanced filtering, or when vector operations start competing with your transactional database for resources.
Stage 3: Retrieval - Finding the Right Context
Retrieval is where most RAG pipelines succeed or fail. The goal is simple: given a user query, find the document chunks most relevant to answering it. The execution is nuanced.
Semantic search uses cosine similarity (or dot product) between the query embedding and document embeddings to find the closest matches. This works well for natural language queries but misses exact keyword matches. If a user asks for "policy number PLN-2024-0847," semantic search may not find it.
Hybrid search combines semantic search with traditional keyword search (BM25). This catches both semantic matches and exact matches. Most production RAG systems use hybrid search. Pinecone, Weaviate, and pgvector (with full-text search) all support this pattern.
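One common way to merge the two result lists is reciprocal rank fusion (RRF). This sketch assumes you already have a semantic ranking and a BM25 keyword ranking as lists of chunk IDs, best first; the chunk IDs are illustrative, and k=60 is the conventional smoothing constant.

```python
# Reciprocal rank fusion: each ranking contributes 1/(k + rank) per chunk,
# and chunks are re-sorted by their summed score.

def rrf(semantic: list[str], keyword: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (semantic, keyword):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["doc_policy", "doc_faq", "doc_intro"]
keyword_hits  = ["doc_plan_0847", "doc_policy"]  # exact-match hit semantic search missed

print(rrf(semantic_hits, keyword_hits))
```

A chunk that appears in both rankings ("doc_policy" here) rises to the top, while the keyword-only exact match still makes it into the fused results, which is the whole point of hybrid search.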
Reranking adds a second pass after initial retrieval. A cross-encoder model (like Cohere Rerank or a self-hosted model) scores each retrieved chunk against the original query with higher accuracy than embedding similarity alone. This typically improves retrieval quality by 10-20% at the cost of additional latency (50-200ms).
Retrieval parameters to tune:
- Top-k: How many chunks to retrieve. Start with 5-10. Too few misses relevant context; too many dilutes the signal and increases token costs.
- Similarity threshold: Minimum similarity score to include a chunk. Filter out low-relevance results rather than passing noise to the LLM.
- Metadata filters: Filter by document type, date range, department, or access level before vector search. This narrows the search space and improves relevance.
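The three parameters compose into a single selection step. In a real system the metadata filter is pushed down into the vector store query so it narrows the search space before the similarity computation; this sketch applies it in-process for illustration, and the hit format (chunk_id, similarity, metadata) is an assumption.

```python
# Applying the three tunable parameters to raw vector-search hits:
# metadata filter, then similarity threshold, then top-k cutoff.

def select_chunks(hits, top_k=5, min_score=0.75, department=None):
    if department is not None:
        hits = [h for h in hits if h[2].get("department") == department]
    hits = [h for h in hits if h[1] >= min_score]
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits[:top_k]

hits = [
    ("c1", 0.91, {"department": "legal"}),
    ("c2", 0.88, {"department": "support"}),  # wrong department: dropped
    ("c3", 0.62, {"department": "legal"}),    # below threshold: dropped
]
print(select_chunks(hits, top_k=5, min_score=0.75, department="legal"))
```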
Stage 4: Generation - The LLM Responds
The final stage passes the retrieved context and the user query to a large language model for response generation. This is where prompt engineering for RAG becomes critical.
A production RAG prompt structure:
You are a helpful assistant that answers questions based on the provided context.
Use ONLY the information in the context below to answer. If the context does not
contain enough information to answer, say so explicitly.
Context:
{retrieved_chunks}
Question: {user_query}
Answer:
Key prompt engineering decisions:
- Instruct the model to cite sources. Include chunk metadata (document name, section, page number) in the context and instruct the model to reference them. This builds user trust and makes answers verifiable.
- Handle insufficient context gracefully. The model should say "I don't have enough information to answer this" rather than hallucinate. This is the single most important instruction in a RAG prompt.
- Control response format. Specify whether you want bullet points, paragraphs, tables, or structured JSON. LLMs like Claude and GPT-4 follow formatting instructions reliably.
- Set the tone. Match the response style to your application: formal for legal, conversational for customer support, technical for engineering docs.
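Tying these decisions together, prompt assembly is ordinary string templating: each chunk carries its source metadata into the context so the model can cite it. The chunk fields and citation format here are illustrative, not a fixed convention.

```python
# Assembling a RAG prompt with citeable sources: each chunk is prefixed
# with its metadata so the model can reference it in the answer.

PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on the provided context.
Use ONLY the information in the context below to answer. If the context does not
contain enough information to answer, say so explicitly. Cite the source in
[brackets] after each claim.

Context:
{context}

Question: {question}

Answer:"""

def build_prompt(chunks: list[dict], question: str) -> str:
    context = "\n\n".join(
        f"[{c['source']}, p.{c['page']}]\n{c['text']}" for c in chunks
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)

chunks = [{"source": "refund_policy.pdf", "page": 3,
           "text": "Refunds are processed within 5 business days."}]
print(build_prompt(chunks, "How long do refunds take?"))
```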
LLM selection for RAG:
- Claude API (Anthropic): Excellent instruction following, strong citation behavior, large context window (200K tokens). Our default recommendation for RAG generation.
- OpenAI API (GPT-4o): Strong general performance, good structured output support, wide ecosystem compatibility.
- Open-source models (Llama 3, Mistral): Self-hosted, no per-token costs, full data privacy. Requires GPU infrastructure and more prompt engineering effort.
Chunking Strategies: The Make-or-Break Decision
How you split documents into chunks determines retrieval quality more than any other factor. Bad chunking produces chunks that are too generic (losing specificity) or too fragmented (losing context).
Fixed-size chunking splits documents at regular intervals (e.g., every 512 tokens). Simple to implement but breaks content at arbitrary points, often mid-sentence or mid-paragraph.
Semantic chunking splits at natural document boundaries: paragraphs, sections, headings. This preserves the semantic coherence of each chunk. Use heading-based splitting for structured documents (technical docs, policies) and paragraph-based splitting for unstructured text (emails, chat logs).
Recursive chunking (used by LangChain's RecursiveCharacterTextSplitter) tries multiple split strategies in order - first by section, then by paragraph, then by sentence, then by character - to find the best split point within your target chunk size.
Overlapping chunks include 50-100 tokens of overlap between adjacent chunks. This preserves context that spans chunk boundaries. A question about a concept explained across two paragraphs will fail without overlap.
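The sliding-window mechanics of fixed-size chunking with overlap can be sketched as follows. This toy version counts words as a stand-in for tokens; a production chunker would count real tokens with a tokenizer and prefer semantic boundaries, as discussed above.

```python
# Fixed-size chunking with overlap: each chunk starts (chunk_size - overlap)
# words after the previous one, so adjacent chunks share `overlap` words.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(500))
chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(len(chunks))  # 3 chunks for 500 words
# chunk 2 restarts 50 words before chunk 1 ends, preserving boundary context
print(chunks[0].split()[-1], chunks[1].split()[0])  # w199 w150
```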
Practical recommendations:
- Target chunk size: 256-512 tokens for most use cases. Smaller chunks (128-256) for precise Q&A. Larger chunks (512-1024) for summarization tasks.
- Always include metadata: source document, section title, page number, date. This enables filtered retrieval and source citation.
- Test chunking empirically. Build a set of 50 test queries, try different chunking strategies, and measure retrieval precision. The best strategy depends on your data.
Evaluation and Monitoring in Production
A RAG pipeline without evaluation is a liability. You need to measure both retrieval quality and generation quality, independently.
Retrieval metrics:
- Precision@k: Of the top-k retrieved chunks, how many are actually relevant?
- Recall@k: Of all relevant chunks in your corpus, how many were retrieved?
- Mean Reciprocal Rank (MRR): How high does the first relevant chunk rank?
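All three retrieval metrics reduce to simple arithmetic over a ranked list of retrieved chunk IDs and a labeled set of relevant ones. This shows the computation for a single query; in practice you average each metric across your full benchmark set.

```python
# Precision@k, recall@k, and MRR for one query, given which chunk IDs
# a human labeled as relevant.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, c in enumerate(retrieved, start=1):
        if c in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}
print(precision_at_k(retrieved, relevant, 5))  # 2 of 5 retrieved are relevant: 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant were retrieved
print(mrr(retrieved, relevant))                # first relevant chunk at rank 2: 0.5
```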
Generation metrics:
- Faithfulness: Does the response only contain information present in the retrieved context? (Catches hallucination.)
- Relevance: Does the response actually answer the user's question?
- Completeness: Does the response cover all aspects of the question that the context supports?
Tools for automated evaluation: Ragas, DeepEval, and LangSmith all provide frameworks for automated RAG evaluation. Set up a benchmark dataset of question-answer-context triples and run evaluations on every pipeline change: new embedding model, new chunking strategy, new prompt, new retrieval parameters.
Production monitoring:
- Track retrieval latency, generation latency, and total end-to-end latency.
- Monitor token usage and cost per query.
- Log every query, retrieved context, and generated response for debugging.
- Set up alerts for retrieval quality degradation (e.g., average similarity scores dropping).
- Sample 1-2% of production queries for manual review weekly.
When to Use RAG vs Fine-Tuning
RAG and fine-tuning solve different problems, and choosing wrong wastes time and money.
Use RAG when:
- Your knowledge base changes frequently (weekly or more)
- You need source citations and traceability
- Your data is large and diverse (thousands of documents)
- You need to respect access controls (different users see different data)
- You want to avoid training costs and latency
Use fine-tuning when:
- You need the model to adopt a specific writing style or tone
- Domain-specific terminology is critical and the base model gets it wrong
- Latency is paramount and you want to skip the retrieval step
- Your knowledge is stable and well-curated
The hybrid approach: In practice, many production systems combine both. Fine-tune a model on your domain's style and terminology, then use RAG to ground its responses in current data. This gives you the best of both worlds: domain-adapted language generation with up-to-date, cited information.
Putting It All Together
Building a production RAG pipeline is not a weekend project. It requires careful decisions about embedding models, vector databases, chunking strategies, retrieval methods, prompt engineering, and evaluation frameworks. But the payoff is substantial: an AI system that answers questions accurately using your data, scales with your knowledge base, and improves over time.
If you have read our introduction to RAG for Indian startups, this guide is the next step, moving from understanding to implementation. The architecture patterns here work whether you are building a customer support chatbot, an internal knowledge assistant, or a document analysis platform.
Ready to build a production RAG pipeline? Our AI engineering team can help you implement RAG pipelines, LLM integration, vector search infrastructure, and AI agent development. We work with the Claude API, OpenAI API, LangChain, and custom orchestration frameworks. Let's talk about your project.