Why RAG? The problem with generic LLMs
Large language models (GPT-4, Claude, Llama) have knowledge frozen at their training cutoff date and do not know your internal documents, knowledge base, or specific business data. Adapting them through fine-tuning is expensive (compute time, labeled data, GPU infrastructure), and because business data changes often, a fine-tuned model quickly goes out of date.
RAG addresses this problem by augmenting the prompt sent to the LLM with relevant document passages retrieved in real time. The LLM no longer answers from memory alone; it answers based on the documents provided in its context. Result: up-to-date, sourceable responses, adapted to your specific domain.
RAG = adding context at inference time (flexible, no retraining, data can change). Fine-tuning = modifying model weights to adapt its style or knowledge (expensive, rigid, better for adapting behavior). For business applications with evolving data: RAG. To permanently adapt tone/style or inject very specific stable knowledge: fine-tuning.
The RAG pipeline: indexing, retrieval, generation
A RAG pipeline consists of two phases: the indexing phase (offline) and the query phase (online). The indexing phase prepares the documents; the query phase retrieves relevant passages and sends them to the LLM.
Indexing phase: preparing documents
1. Loading documents (PDFs, Markdown, HTML, databases, Confluence, Notion...).
2. Chunking: splitting into appropriately sized passages (see dedicated section).
3. Embedding: transforming each chunk into a numerical vector via an embedding model (OpenAI's text-embedding-3-small, Cohere embed, or open-source models like sentence-transformers).
4. Storing in a vector database with associated metadata (source, date, chapter...).
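The four indexing steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `toy_embed` is a hypothetical stand-in for a real embedding model (it only hashes words into buckets), the "vector database" is a plain Python list, and the names `Record`, `chunk` and `build_index` are invented for this example.

```python
import hashlib
import math
from dataclasses import dataclass, field

DIM = 64  # toy dimensionality; real models use hundreds to thousands


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        # md5 gives a stable bucket across runs (the built-in hash() is randomized).
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def chunk(text: str, size: int = 40) -> list[str]:
    """Step 2: split into passages of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


@dataclass
class Record:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def build_index(documents: dict[str, str]) -> list[Record]:
    """Steps 1-4: load (here, an in-memory dict), chunk, embed, store with metadata."""
    index = []
    for source, text in documents.items():
        for position, passage in enumerate(chunk(text)):
            metadata = {"source": source, "position": position}
            index.append(Record(passage, toy_embed(passage), metadata))
    return index


docs = {"handbook.md": "Employees accrue two days of leave per month. " * 10}
index = build_index(docs)
print(len(index), index[0].metadata)
```

In a real system, `toy_embed` would be replaced by a call to the chosen embedding model and the list by a vector database, but the shape of each stored record (text, vector, metadata) stays the same.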
Query phase: retrieve and generate
1. The user's question is embedded using the same model as during indexing.
2. Similarity search in the vector database: the k most similar vectors (cosine similarity, dot product, or Euclidean distance) are retrieved.
3. The corresponding chunks are injected into the prompt: 'Answer the following question based only on these documents: [chunks]. Question: [question]'.
4. The LLM generates a response based on the provided context.
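As a rough sketch of steps 1 to 3 of the query phase, the example below embeds a question, ranks a tiny in-memory index by cosine similarity, and assembles the prompt. The embedding here is a word-count vector, a deliberately crude placeholder for a real embedding model (it matches words, not meaning), and the function names are illustrative. Step 4, the LLM call, is left out because it depends on the provider.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, index, k=2):
    """Steps 1-2: embed the question with the same function, keep the k closest chunks."""
    query_vector = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Step 3: inject the retrieved chunks into the prompt template."""
    context = "\n\n".join(chunks)
    return (
        "Answer the following question based only on these documents:\n"
        f"{context}\n\nQuestion: {question}"
    )


passages = [
    "Refunds are issued within 14 days of purchase.",
    "The office is closed on public holidays.",
    "Refunds require the original receipt.",
]
index = [(p, toy_embed(p)) for p in passages]

question = "How do refunds work?"
prompt = build_prompt(question, retrieve(question, index, k=2))
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM in step 4.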
Embeddings: the semantic representation of text
An embedding is a numerical representation (a vector of a few hundred to a few thousand dimensions, for example 384, 768, 1536 or 3072 depending on the model) that captures the semantic meaning of a text. Two texts with similar meanings have vectors that are close in vector space, even if they use different words.
This is the fundamental property that enables semantic search: 'car' and 'automobile' are close, while 'river bank' and 'financial bank' are distant. Keyword search (BM25, TF-IDF) matches on the terms themselves and does not capture these semantics; that is the main value added by embeddings.
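To make "close in vector space" concrete, here is a small sketch of the three usual measures (cosine similarity, dot product, Euclidean distance). The 3-dimensional vectors are hand-made for illustration and are not the output of any real model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Angle-based similarity in [-1, 1]; ignores vector length."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hand-made vectors, purely illustrative.
car = [0.9, 0.1, 0.0]
automobile = [0.8, 0.2, 0.0]
river_bank = [0.0, 0.2, 0.9]

print(cosine_similarity(car, automobile), cosine_similarity(car, river_bank))
print(euclidean_distance(car, automobile), euclidean_distance(car, river_bank))

# Cosine is unchanged when a vector is rescaled; the dot product is not.
doubled = [2 * x for x in automobile]
print(cosine_similarity(car, doubled), dot(car, automobile), dot(car, doubled))
```

When vectors are normalized to unit length, the dot product and cosine similarity give the same ranking, which is why some models and databases recommend the cheaper dot product for normalized embeddings.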
Choosing an embedding model
text-embedding-3-small and text-embedding-3-large (OpenAI): excellent quality/price ratio, 1536 or 3072 dimensions, proprietary. Cohere Embed v3: multilingual, excellent for business use cases, supports query vs document request types. sentence-transformers (Hugging Face): open-source, hundreds of models, deployable on-premise. For the best multilingual performance: paraphrase-multilingual-mpnet-base-v2 or multilingual-e5-large.
The MTEB (Massive Text Embedding Benchmark), whose leaderboard is hosted on Hugging Face, evaluates embedding models on 8 task types spanning 58 datasets and 112 languages. It is the reference for comparing models before choosing. Rankings vary significantly by language and task type (retrieval, classification, clustering).
Muennighoff et al. - MTEB: Massive Text Embedding Benchmark, 2023
Vector databases
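A vector database stores vectors and enables large-scale similarity search (millions to billions of vectors) with millisecond response times. It achieves this with approximate nearest-neighbor indexes such as HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index), which avoid comparing the query against every stored vector at the cost of occasionally missing a true nearest neighbor.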
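The idea behind an IVF index can be sketched as follows: vectors are grouped into cells by their nearest centroid, and a query scans only the closest cells. This toy version assumes hand-picked centroids (real implementations learn them, typically with k-means) and 2-dimensional points; `ToyIVF` and `nprobe` are names chosen for this illustration. HNSW, which is graph-based, is not shown.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyIVF:
    """Illustrative inverted-file index: vectors are bucketed by nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]  # one inverted list per cell

    def _nearest_cells(self, vector, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: dist(vector, self.centroids[i]))
        return order[:n]

    def add(self, item_id, vector):
        cell = self._nearest_cells(vector, 1)[0]
        self.lists[cell].append((item_id, vector))

    def search(self, query, k=3, nprobe=1):
        """Scan only the `nprobe` closest cells instead of every stored vector."""
        candidates = []
        for cell in self._nearest_cells(query, nprobe):
            candidates.extend(self.lists[cell])
        candidates.sort(key=lambda item: dist(query, item[1]))
        return [item_id for item_id, _ in candidates[:k]]


rng = random.Random(0)
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
index = ToyIVF(centroids)
points = {}
for item_id in range(200):
    cx, cy = centroids[item_id % 4]
    points[item_id] = (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    index.add(item_id, points[item_id])

query = (0.5, 0.5)
approx = index.search(query, k=3, nprobe=1)   # scans roughly a quarter of the data
exact = sorted(points, key=lambda i: dist(query, points[i]))[:3]  # brute force
print(approx, exact)
```

Raising `nprobe` trades speed for recall: with `nprobe` equal to the number of cells, the search degenerates into an exact brute-force scan.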
Managed solutions: Pinecone, Weaviate Cloud, Qdrant Cloud
Pinecone is one of the most popular managed vector services: simple API, automatic scalability, metadata filtering, hybrid search support (dense + sparse). Weaviate is open-source with a managed cloud offering: it supports multi-tenancy, GraphQL, and built-in embedding modules. Qdrant is a high-performance open-source alternative, often noted for its results on precision benchmarks.
pgvector: the PostgreSQL solution
pgvector is a PostgreSQL extension that adds a vector column type and similarity operators. Major advantage: no new infrastructure — you store embeddings and relational data in the same PostgreSQL database. Supabase and Neon offer pgvector natively. Limitation: lower performance than dedicated vector databases beyond a few million vectors.
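A sketch of what this looks like in practice, written as Python that only builds the SQL and its parameters (nothing here connects to a database). The table `chunks`, its columns, and the helper names are hypothetical; the `%s` placeholders assume a psycopg-style driver. The `vector(n)` type, the `<=>` cosine-distance operator, and the `hnsw` index with `vector_cosine_ops` come from pgvector itself.

```python
def to_vector_literal(values):
    """Format a Python sequence in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


# Hypothetical schema: embeddings live next to ordinary relational columns.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    source text,
    content text,
    embedding vector(3)  -- use your model's dimension, e.g. 1536
);
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# Metadata filter (WHERE) and similarity ranking (ORDER BY) in one query.
SEARCH_SQL = """
SELECT content, source, embedding <=> %s::vector AS cosine_distance
FROM chunks
WHERE source = %s
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def search_params(query_embedding, source, k):
    """Parameters for SEARCH_SQL, in placeholder order."""
    literal = to_vector_literal(query_embedding)
    return (literal, source, literal, k)


params = search_params([0.1, 0.2, 0.3], "handbook.md", 5)
print(SEARCH_SQL.strip())
print(params)
```

With a driver such as psycopg, these would typically be passed as `cursor.execute(SEARCH_SQL, params)`. pgvector also provides `<->` for Euclidean distance and `<#>` for negative inner product.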
Chunking and reranking: improving RAG quality
The quality of a RAG system depends as much on chunking as on the generation model. Poor chunking produces incoherent passages that degrade the final response.
Chunking strategies
Fixed-size chunking (500-1000 tokens with overlap): simple, functional, but can cut sentences mid-way.
Semantic chunking: splits at natural semantic boundaries (paragraphs, sections, sentences).
Hierarchical chunking: stores chunks at multiple granularities (full section + individual sentences) to first retrieve the section then refine.
Parent-child chunking: parent chunks (broad context) and child chunks (precise passages) are linked: children are retrieved and parents are injected into the prompt.
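Two of these strategies are easy to sketch. The example below shows fixed-size chunking with overlap and a naive parent-child split. It makes simplifying assumptions: tokens are just list items (a real pipeline would use the model's tokenizer), paragraphs are separated by blank lines, and sentences are split on ". ", which is too crude for real text. Function names are illustrative.

```python
def fixed_size_chunks(tokens, size, overlap):
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def parent_child_chunks(text):
    """Parents are paragraphs; children are sentences that point to their parent."""
    parents = [p.strip() for p in text.split("\n\n") if p.strip()]
    children = []
    for parent_id, paragraph in enumerate(parents):
        for sentence in paragraph.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                children.append({"text": sentence, "parent_id": parent_id})
    return parents, children


windows = fixed_size_chunks(list(range(10)), size=4, overlap=1)
print(windows)

text = "RAG retrieves passages. It then generates.\n\nChunking matters."
parents, children = parent_child_chunks(text)
# Search over the children, then inject the linked parent into the prompt.
hit = children[1]
print(hit["text"], "->", parents[hit["parent_id"]])
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is long enough.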
Reranking: refining passage selection
Vector search retrieves the k semantically closest passages, but semantic proximity does not guarantee relevance for answering the specific question. Reranking passes the top-k candidates through a cross-encoder, a more accurate but slower model that reads the question and each passage together and scores how well the passage answers that question. Because it is slower, it is applied only to the shortlist, not to the whole corpus. Cohere Rerank, Jina Reranker, and sentence-transformers cross-encoders are among the most widely used solutions.
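The two-stage pattern can be sketched as below. The scoring function `toy_relevance` is a hypothetical stand-in for a cross-encoder: it only measures word overlap and is far weaker than a real reranking model, but it shows where such a model plugs in. The candidate list is imagined output from a first-stage vector search.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_relevance(question, passage):
    """Hypothetical stand-in for a cross-encoder: share of question terms in the passage."""
    q = tokens(question)
    return len(q & tokens(passage)) / len(q) if q else 0.0


def rerank(question, candidates, score_fn, top_n):
    """Re-score every (question, passage) pair and keep the best `top_n`."""
    scored = [(score_fn(question, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


# Imagined first-stage results, ordered by vector similarity.
candidates = [
    "Our refund policy was updated last year.",
    "Refunds are issued within 14 days of the return request.",
    "Shipping takes 3 to 5 business days.",
]
question = "How many days until refunds are issued?"

best = rerank(question, candidates, toy_relevance, top_n=2)
print(best)
```

Passing the scorer as a parameter keeps the pipeline unchanged when `toy_relevance` is swapped for a call to an actual reranking model or API.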
RAG reduces but does not eliminate hallucinations. The LLM can still invent details not present in the provided documents, create incorrect syntheses, or ignore relevant retrieved chunks. Safeguards are necessary: mandatory source citation, confidence scores, and automated response evaluation (RAGAS framework: Faithfulness, Answer Relevancy, Context Precision).
Anchoring RAG concepts with spaced repetition
RAG combines concepts from NLP (embeddings, similarity search), infrastructure (vector databases, indexing) and system architecture (pipeline, latency, cost). Flashcards help maintain a clear mastery of each pipeline component and their interactions.
Concepts to master: the difference between RAG and fine-tuning, the four steps of each pipeline phase (indexing and query), cosine similarity vs dot product, HNSW vs IVF indexes, fixed-size vs semantic chunking, the reranker's role, and the RAGAS framework for evaluating a RAG system. These are classic questions in ML Engineer interviews.
Frequently asked questions about RAG and augmented generation
What is RAG (Retrieval Augmented Generation)?
RAG is an architecture that augments an LLM's responses by providing it with relevant document passages retrieved in real time. Rather than answering from memory, the LLM bases its response on documents injected into its prompt. This allows querying an LLM about specific data (internal documentation, knowledge base) without retraining.
What is the difference between RAG and fine-tuning?
RAG adds context at inference time: documents are retrieved and injected into the prompt. Flexible, no retraining, data can be updated. Fine-tuning modifies model weights to permanently adapt its behavior or knowledge. Expensive, rigid, better for adapting style or injecting very stable knowledge.
What is an embedding and why is it useful for RAG?
An embedding is a numerical vector representation of a text that captures its semantic meaning. Two texts with similar meanings have close vectors. RAG uses embeddings to retrieve passages semantically similar to the question, even if the exact words differ, which keyword search cannot do.
What are the main vector databases?
Pinecone (managed service, simple API, popular), Weaviate (open-source with cloud offering, multi-tenancy, GraphQL), Qdrant (high-performance open-source), Chroma (lightweight open-source, ideal for prototyping) and pgvector (PostgreSQL extension, ideal if already on PostgreSQL/Supabase). The choice depends on volume, existing infrastructure and filtering needs.
What is chunking and how do you choose it?
Chunking is splitting documents into appropriately sized passages before indexing. Fixed-size chunking (500-1000 tokens) is simple but may cut an idea or a sentence in two. Semantic chunking respects natural boundaries (paragraphs, sections). Parent-child chunking stores two levels of granularity: small chunks for precise retrieval and larger ones for context. Optimal size depends on content type and the LLM's context window.
What is reranking in a RAG pipeline?
Reranking is a refinement step after initial retrieval. Vector search retrieves the k semantically closest candidates. The reranker (cross-encoder) re-scores each candidate specifically for the question asked, with better precision than simple vector similarity. Cohere Rerank and Jina Reranker are among the most widely used solutions.
Does RAG eliminate LLM hallucinations?
No, RAG reduces but does not eliminate hallucinations. The LLM can still invent details not present in the provided documents, create incorrect syntheses, or ignore relevant retrieved passages. Additional measures are needed: mandatory source citation, confidence scores, and automated evaluation with the RAGAS framework (Faithfulness, Answer Relevancy, Context Precision).