# RAG — Retrieval-Augmented Generation for Grounded AI

<!-- hint:slides topic="RAG pipeline: indexing, chunking, embedding, retrieval, augmented generation, and evaluation" slides="6" -->

## The Problem RAG Solves

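LLMs don't know your data. They have a **knowledge cutoff**, can **hallucinate**, and can't access private docs, live data, or domain-specific content that was not in their training set. **RAG (Retrieval-Augmented Generation)** addresses this by retrieving relevant context at query time and adding it to the prompt, so the model answers from supplied evidence rather than from memory alone. RAG reduces hallucination but does not eliminate it, which is why evaluation matters.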

## How RAG Works

```mermaid
flowchart TB
    subgraph index["Indexing (offline)"]
        D[Docs] --> C[Chunk]
        C --> E[Embed]
        E --> S[Store]
    end
    subgraph query["Query (online)"]
        Q[Query] --> EQ[Embed Query]
        EQ --> R[Retrieve]
        S --> R
        R --> A[Augment Prompt]
        A --> G[Generate]
        G --> Out[Answer]
    end
```

**Indexing:** Docs → Chunk → Embed → Store (vector DB)  
**Query:** Query → Embed → Retrieve → Augment prompt → Generate
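
The two stages above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, `store` is a plain list rather than a vector DB, and the final Generate step is left out because it depends on whichever LLM API you use. All names here are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing (offline): docs -> chunk -> embed -> store
docs = [
    "To reset your password open settings and choose reset password.",
    "Invoices are emailed on the first day of each month.",
]
store = [(chunk, embed(chunk)) for chunk in docs]  # one chunk per doc for brevity

# Query (online): embed query -> retrieve -> augment prompt
def build_prompt(query: str, k: int = 1) -> str:
    q_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    context = "\n".join(chunk for chunk, _ in ranked[:k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("how do I reset my password")
print(prompt)  # this string is what would be sent to the LLM in the Generate step
```

Swapping `embed` for a real model and `store` for a vector DB changes the components but not the shape of the flow.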

## The Full RAG Pipeline

Both stages as a single left-to-right flow, with the assembled prompt shown as its own step:

```mermaid
flowchart LR
    D[Docs] --> C[Chunk]
    C --> E[Embed]
    E --> V[Vector Store]
    Q[Query] --> EQ[Embed Query]
    EQ --> R[Retrieve]
    V --> R
    R --> A[Augment]
    A --> P[Prompt]
    P --> G[Generate]
    G --> Out[Output]
```
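
The Augment → Prompt step is mostly string assembly. A minimal sketch, assuming the retrieved chunks arrive already sorted by relevance; the function name, the instruction wording, and the character budget are illustrative choices, not a standard:

```python
def augment_prompt(query: str, chunks: list[str], max_chars: int = 1000) -> str:
    """Build a grounded prompt from retrieved chunks, stopping at a character budget."""
    selected, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        if used + len(chunk) > max_chars:
            break  # budget exhausted; lower-ranked chunks are dropped
        selected.append(f"[{i}] {chunk}")  # numbered so the answer can cite sources
        used += len(chunk)
    context = "\n".join(selected)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

print(augment_prompt(
    "How do I reset my password?",
    ["Open Settings and choose Reset password.", "Passwords expire after 90 days."],
))
```

A real system would usually budget in tokens rather than characters, since model context limits are expressed in tokens.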

## Embeddings

**Embeddings** convert text to vectors (lists of numbers) that capture semantic meaning. Similar texts → similar vectors. Use cosine similarity or dot product to find the "nearest" chunks; the two produce the same ranking when the vectors are normalized to unit length.

- "How do I reset my password?" ≈ "Password reset instructions"
- Different from keyword search: captures meaning, not just words
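
To make "similar vectors" concrete, here is cosine similarity written out by hand. The three vectors are made up for illustration; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hand-made 3-d vectors, not output from any real model.
query = [0.9, 0.1, 0.0]        # "How do I reset my password?"
reset_doc = [0.8, 0.2, 0.1]    # "Password reset instructions"
billing_doc = [0.0, 0.2, 0.9]  # an unrelated billing chunk

print(cosine_similarity(query, reset_doc))    # close to 1: similar meaning
print(cosine_similarity(query, billing_doc))  # close to 0: unrelated
```

Because cosine similarity divides out vector length, scaling a vector does not change its score.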

## Vector Databases

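Vector databases store embeddings and support **similarity search**, usually through approximate nearest-neighbor (ANN) indexes so that lookups stay fast as the collection grows: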

| Tool | Notes |
|------|-------|
| **Pinecone** | Managed, scalable |
| **Weaviate** | Open-source, hybrid search |
| **Chroma** | Lightweight, embedded |
| **pgvector** | PostgreSQL extension |

## Chunking Strategies

| Strategy | How It Works |
|----------|-------------|
| **Fixed-size** | Simple; split every N tokens |
| **Semantic** | Split on meaning boundaries (paragraphs, sections) |
| **Recursive** | Hierarchical: try the largest boundary first (section → paragraph → sentence), splitting further only when a piece is still too big |
| **Overlap** | Overlap chunks to preserve context at boundaries |

Bad chunking = retrieval misses relevant context or returns fragments that don't make sense alone.
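
Fixed-size chunking with overlap is the simplest strategy to write down. A minimal sketch, assuming whitespace-separated words as a rough stand-in for tokens (a real pipeline would count tokens with the embedding model's tokenizer); the function name and defaults are illustrative:

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words; adjacent chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_fixed(text, size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.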

## Retrieval Quality

- **Top-k** — Return k most similar chunks. Tune k (often 3–10).
- **Similarity threshold** — Only return chunks above a score. Filters noise.
- **Re-ranking** — Second pass: cross-encoder or LLM to rank top candidates. Improves precision.
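
The three controls above compose naturally: filter by threshold, keep the top k, then re-rank the survivors. A sketch under stated assumptions: similarity scores are already computed, and `word_overlap` is a toy scorer standing in for a cross-encoder or LLM judge. All function names are hypothetical.

```python
from typing import Callable

def retrieve(scored: list[tuple[str, float]], k: int = 5, min_score: float = 0.0) -> list[tuple[str, float]]:
    """Drop chunks below min_score, then return the k highest-scoring ones."""
    kept = [(chunk, score) for chunk, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Second pass: re-order candidates with a (usually slower, more precise) scorer."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def word_overlap(query: str, chunk: str) -> float:
    """Toy scorer: number of distinct query words that also appear in the chunk."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))

scored = [("a", 0.9), ("b", 0.2), ("c", 0.75), ("d", 0.6)]
print(retrieve(scored, k=2, min_score=0.5))  # [('a', 0.9), ('c', 0.75)]

candidates = ["billing info", "reset your password", "password rules"]
print(rerank("reset password", candidates, word_overlap, top_n=2))
# ['reset your password', 'password rules']
```

A common pattern is to retrieve a generous k cheaply, then let the re-ranker cut it down to the few chunks that go into the prompt.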

## Hybrid Search

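Combine **keyword** (BM25, full-text) with **semantic** (embeddings). Keyword finds exact terms such as IDs, error codes, and product names; semantic finds paraphrases. Merge the two result lists with a weighted average of normalized scores, or with Reciprocal Rank Fusion (RRF), which uses only each result's rank and so avoids comparing scores that live on different scales.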
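
Reciprocal Rank Fusion is short enough to show in full. Each document earns `1 / (k + rank)` from every list it appears in, and the sums decide the final order. The sketch below assumes the keyword and semantic searches have already run and returned ranked ID lists; `k = 60` is a commonly used constant, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d1", "d2", "d3"]   # e.g. from BM25
semantic_hits = ["d2", "d4", "d1"]  # e.g. from embedding search
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
# ['d2', 'd1', 'd4', 'd3']
```

Documents found by both searches (`d2`, `d1`) rise above those found by only one, without any score normalization.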

## Evaluation

| Metric | What It Measures |
|--------|------------------|
| **Faithfulness** | Does the answer stay grounded in the retrieved context? |
| **Relevance** | Do retrieved chunks match the query? |
| **Answer correctness** | Is the final answer factually correct? |

## Common Pitfalls

| Pitfall | Fix |
|---------|-----|
| **Bad chunking** | Use semantic or recursive; tune chunk size and overlap |
| **No metadata filtering** | Filter by source, date, type before retrieval |
| **Stuffing too much context** | Limit how many chunks go into the prompt; use re-ranking; summarize if needed |
| **Wrong embedding model** | Match model to domain (e.g., code vs. prose) |

---

## Key Takeaways

1. **RAG** — Retrieve relevant docs → augment prompt → generate
2. **Embeddings** — Text → vectors; similarity = semantic match
3. **Chunking** — Fixed, semantic, or recursive; overlap helps
4. **Retrieval** — Top-k, threshold, re-ranking
5. **Evaluate** — Faithfulness, relevance, correctness