Build a RAG Pipeline from Scratch with Python and Claude
Retrieval-Augmented Generation (RAG) lets an LLM answer questions about your own data. Instead of relying solely on the model’s training knowledge, a RAG system retrieves relevant documents at query time and feeds them into the prompt.
Architecture Overview
User Question → Embed Question → Search Vector DB → Retrieve Top-K Docs → Prompt Claude → Answer
Step 1: Install Dependencies
```bash
pip install anthropic chromadb sentence-transformers pypdf
```
Step 2: Load and Chunk Documents
```python
from pypdf import PdfReader

def load_pdf(path: str) -> list[str]:
    reader = PdfReader(path)
    # extract_text() can return None for image-only pages, so fall back to ""
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    # Chunk into ~500-word segments with a 50-word overlap (stride of 450)
    words = text.split()
    chunks = []
    for i in range(0, len(words), 450):
        chunks.append(" ".join(words[i:i + 500]))
    return chunks
```
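The stride/window arithmetic above (window of 500 words, stride of 450) is what produces the 50-word overlap between consecutive chunks. A quick stdlib-only sanity check of that logic in isolation (the synthetic word list is just for illustration):

```python
def chunk_words(words: list[str], window: int = 500, stride: int = 450) -> list[str]:
    """Split a word list into overlapping chunks: `window` words each, `stride` apart."""
    return [" ".join(words[i:i + window]) for i in range(0, len(words), stride)]

words = [f"w{i}" for i in range(1000)]
chunks = chunk_words(words)
# Consecutive chunks share exactly window - stride = 50 words
first, second = chunks[0].split(), chunks[1].split()
assert first[-50:] == second[:50]
print(len(chunks))  # 3 chunks: words[0:500], [450:950], [900:1000]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.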
Step 3: Embed and Store in ChromaDB
```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()  # in-memory; see Production Improvements for persistence
collection = client.create_collection("docs")

chunks = load_pdf("your_document.pdf")
embeddings = model.encode(chunks).tolist()
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
)
```
Step 4: Retrieve and Generate
```python
import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rag_query(question: str, top_k: int = 3) -> str:
    # Embed the question with the same model used for the documents
    q_embedding = model.encode([question]).tolist()
    # Retrieve the top-k most similar chunks
    results = collection.query(query_embeddings=q_embedding, n_results=top_k)
    context = "\n\n".join(results["documents"][0])
    # Ask Claude to answer only from the retrieved context
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Answer the question based on the context below.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}""",
        }],
    )
    return response.content[0].text
```

Creating the Anthropic client once at module level (rather than inside the function) avoids rebuilding it on every query and keeps the name distinct from the ChromaDB `client` above.
```python
# Use it
answer = rag_query("What are the main conclusions of the report?")
print(answer)
```
Production Improvements
- Better chunking: Use semantic chunking instead of fixed-size windows
- Reranking: Add a cross-encoder to rerank retrieved chunks before passing to Claude
- Hybrid search: Combine vector search with BM25 for better recall
- Persistent storage: Replace the in-memory ChromaDB with a persistent one or Pinecone/Weaviate
- Prompt caching: Cache your system prompt with Anthropic’s prompt caching to reduce costs
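To make the hybrid-search bullet concrete: BM25 is a keyword-based ranking function whose scores can be fused with vector-similarity scores (for example via reciprocal rank fusion). A minimal stdlib-only sketch of BM25 itself; the toy corpus, query, and the `k1`/`b` defaults are illustrative assumptions, not part of the pipeline above:

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: how many docs contain each term
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "vector search finds semantic matches".split(),
    "bm25 rewards exact keyword matches".split(),
]
print(bm25_scores("keyword matches".split(), docs))
```

BM25 catches exact-term matches (product names, error codes, acronyms) that embedding models sometimes blur together, which is why combining the two retrievers tends to improve recall.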