How to Run Local LLMs with Ollama in 10 Minutes
Running a large language model locally gives you complete privacy, zero API costs, and offline access. With Ollama, it takes about 10 minutes from zero to a working local AI assistant.
Step 1: Install Ollama
macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com and run it.
Verify the install:
ollama --version
# ollama version 0.3.14 (your version may be newer)
Step 2: Pull a Model
# Llama 4 Scout (large MoE model; the 4-bit download is roughly 67GB)
ollama pull llama4:scout
# Mistral 7B (smaller, faster; great for testing)
ollama pull mistral
# Gemma 3 1B (very small, runs on almost any machine)
ollama pull gemma3:1b
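After pulling, you can confirm what's available locally. The sketch below queries Ollama's /api/tags endpoint, which is the same information `ollama list` shows; the `model_names` helper is an illustrative name of mine, not part of Ollama.

```python
import json
import urllib.request

def model_names(tags_json):
    # Extract just the model names from the /api/tags payload.
    return [m["name"] for m in tags_json.get("models", [])]

if __name__ == "__main__":
    # Equivalent to running `ollama list` on the command line.
    with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
        print(model_names(json.load(resp)))
```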
Step 3: Run Your First Chat
ollama run llama4:scout
You'll see a >>> prompt. Type a message and press Enter (type /bye to exit):
>>> Explain quantum entanglement in simple terms
Step 4: Use the REST API
Ollama runs a local server on port 11434. You can call it like any HTTP API:
import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama4:scout",
    "prompt": "Write a haiku about machine learning",
    "stream": False,
})
print(response.json()["response"])
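Setting "stream": True instead makes Ollama return one JSON object per line as tokens are generated, which is how you get output to appear word by word. A minimal sketch; `join_chunks` is an illustrative helper of mine, not an Ollama API.

```python
import json

def join_chunks(lines):
    # Each streamed line is a JSON object with a "response" text fragment;
    # concatenate the fragments into the full reply.
    return "".join(json.loads(line)["response"] for line in lines if line)

if __name__ == "__main__":
    import requests  # same library as the example above
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama4:scout", "prompt": "Write a haiku", "stream": True},
        stream=True,
    )
    print(join_chunks(r.iter_lines()))
```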
Step 5: OpenAI-Compatible Endpoint
Ollama also exposes an OpenAI-compatible endpoint, so you can drop it into any app that supports OpenAI:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
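The endpoint is stateless, so multi-turn chat means resending the whole conversation history on every request. A sketch of that pattern, assuming the same local endpoint; `add_turn` is an illustrative helper, not part of the OpenAI client.

```python
def add_turn(history, role, content):
    # Append one message without mutating the original list;
    # the full list is resent with every request.
    return history + [{"role": role, "content": content}]

if __name__ == "__main__":
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    history = add_turn([], "user", "Name one prime number.")
    reply = client.chat.completions.create(model="llama4:scout", messages=history)

    # Keep the model's answer in the history so the next turn has context.
    history = add_turn(history, "assistant", reply.choices[0].message.content)
    history = add_turn(history, "user", "Name another one.")
    reply = client.chat.completions.create(model="llama4:scout", messages=history)
    print(reply.choices[0].message.content)
```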
Hardware Tips
| Model | RAM Required | GPU Needed? |
|---|---|---|
| Gemma 3 1B | 4GB | No |
| Mistral 7B | 8GB | No |
| Llama 4 Scout (4-bit) | ~67GB | Recommended |
| Llama 4 Maverick (4-bit) | ~245GB | Recommended |
Without a GPU, models run on CPU; expect 5–15 tokens/second for 7B models, which is perfectly usable.
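You can measure your own throughput rather than guessing: a non-streaming /api/generate response includes an eval_count (tokens generated) and eval_duration (time spent, in nanoseconds). A quick sketch; `tokens_per_second` is an illustrative helper of mine.

```python
import json
import urllib.request

def tokens_per_second(eval_count, eval_duration_ns):
    # eval_duration is reported in nanoseconds, so convert to seconds first.
    return eval_count / (eval_duration_ns / 1e9)

if __name__ == "__main__":
    body = json.dumps({"model": "mistral", "prompt": "Hello", "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    print(f"{tokens_per_second(data['eval_count'], data['eval_duration']):.1f} tokens/s")
```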