Build a RAG Pipeline from Scratch with Python and Claude
Retrieval-Augmented Generation (RAG) lets an LLM answer questions about your own data. Instead of relying solely on the model’s training knowledge, a RAG system retrieves relevant documents at query time and feeds them into the prompt.
Architecture Overview
User Question → Embed Question → Search Vector DB → Retrieve Top-K Docs → Prompt Claude → Answer
Step 1: Install Dependencies
```bash
pip install anthropic chromadb sentence-transformers pypdf
```
Step 2: Load and Chunk Documents
```python
from pypdf import PdfReader

def load_pdf(path: str) -> list[str]:
    reader = PdfReader(path)
    # extract_text() can return None for image-only pages, so fall back to ""
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    # Chunk into ~500-word segments with a 50-word overlap (stride of 450)
    words = text.split()
    chunks = []
    for i in range(0, len(words), 450):
        chunks.append(" ".join(words[i:i + 500]))
    return chunks
```
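The stride/window arithmetic above (window of 500 words, stride of 450) is what produces the 50-word overlap between consecutive chunks. A quick stdlib-only sanity check of that logic in isolation (the synthetic word list is just for illustration):

```python
def chunk_words(words: list[str], window: int = 500, stride: int = 450) -> list[str]:
    """Split a word list into overlapping chunks: `window` words each, `stride` apart."""
    return [" ".join(words[i:i + window]) for i in range(0, len(words), stride)]

words = [f"w{i}" for i in range(1000)]
chunks = chunk_words(words)
# Consecutive chunks share exactly window - stride = 50 words
first, second = chunks[0].split(), chunks[1].split()
assert first[-50:] == second[:50]
print(len(chunks))  # 3 chunks: words[0:500], [450:950], [900:1000]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.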
Step 3: Embed and Store in ChromaDB
```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()  # in-memory; see Production Improvements for persistence
collection = client.create_collection("docs")

chunks = load_pdf("your_document.pdf")
embeddings = model.encode(chunks).tolist()
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
)
```
Step 4: Retrieve and Generate
```python
import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rag_query(question: str, top_k: int = 3) -> str:
    # Embed the question with the same model used for the documents
    q_embedding = model.encode([question]).tolist()
    # Retrieve the top-k most similar chunks
    results = collection.query(query_embeddings=q_embedding, n_results=top_k)
    context = "\n\n".join(results["documents"][0])
    # Ask Claude to answer only from the retrieved context
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Answer the question based on the context below.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}""",
        }],
    )
    return response.content[0].text
```

Creating the Anthropic client once at module level (rather than inside the function) avoids rebuilding it on every query and keeps the name distinct from the ChromaDB `client` above.
```python
# Use it
answer = rag_query("What are the main conclusions of the report?")
print(answer)
```
Production Improvements
- Better chunking: Use semantic chunking instead of fixed-size windows
- Reranking: Add a cross-encoder to rerank retrieved chunks before passing to Claude
- Hybrid search: Combine vector search with BM25 for better recall
- Persistent storage: Replace the in-memory ChromaDB with a persistent one or Pinecone/Weaviate
- Prompt caching: Cache your system prompt with Anthropic’s prompt caching to reduce costs
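To make the hybrid-search bullet concrete: BM25 is a keyword-based ranking function whose scores can be fused with vector-similarity scores (for example via reciprocal rank fusion). A minimal stdlib-only sketch of BM25 itself; the toy corpus, query, and the `k1`/`b` defaults are illustrative assumptions, not part of the pipeline above:

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: how many docs contain each term
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "vector search finds semantic matches".split(),
    "bm25 rewards exact keyword matches".split(),
]
print(bm25_scores("keyword matches".split(), docs))
```

BM25 catches exact-term matches (product names, error codes, acronyms) that embedding models sometimes blur together, which is why combining the two retrievers tends to improve recall.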