How to Install Ollama on Linux (Ubuntu, Debian, Arch)

Tested on: Ubuntu 26.04 LTS, Ubuntu 24.04 LTS, Debian 12 — Last updated: June 2026

This guide shows you how to install Ollama on Linux, including Ubuntu 26.04, Debian 12, Arch Linux and Docker. Ollama downloads models, uses GPU acceleration when it is available, and exposes a local API, and installing it takes a single command. No cloud required, no data leaves your machine.

Contents

Prerequisites
Step 1 — Install Ollama
Step 2 — Download and Run Your First Model
Which Ollama Model Should You Use?
Step 3 — Check GPU Acceleration
Step 4 — Use the Ollama API
Step 5 — Manage the Ollama Service
Install Ollama on Arch Linux
Install Ollama with Docker
Useful Ollama Commands
Customize Ollama with a Modelfile
Access Ollama from Another Machine on Your Network
Use Ollama from Python
Performance Tips
Troubleshooting
What to Do Next
Further Reading

Prerequisites

Ubuntu 26.04 LTS / 24.04 LTS, or Debian 12 (Bookworm)
At least 8 GB RAM (16 GB recommended for 7B models)
Optional: NVIDIA GPU with CUDA drivers installed, or AMD GPU with ROCm
curl installed (sudo apt install curl)

Step 1 — Install Ollama

Ollama provides an official install script that detects your OS and GPU automatically:

curl -fsSL https://ollama.com/install.sh | sh

The script installs Ollama as a systemd service and starts it automatically. To verify the installation:

ollama --version

You should see something like ollama version is 0.x.x. The service also starts automatically at boot.

Step 2 — Download and Run Your First Model

Pull a model from the Ollama library. Start with Llama 3.2 3B if you have limited RAM, or Llama 3.1 8B if you have 16 GB or more:

# Lightweight — works on 8 GB RAM
ollama pull llama3.2:3b

# Recommended — needs 16 GB RAM
ollama pull llama3.1:8b

Once downloaded, start an interactive chat session:

ollama run llama3.2:3b

Type your prompt and press Enter. Use /bye to exit the session.

Which Ollama Model Should You Use?

The right model depends on your hardware. More parameters generally mean better quality, but also higher RAM and VRAM requirements. The practical rule: use the largest model that fits entirely in your GPU's VRAM. Anything that spills over to CPU runs noticeably slower.

Model	Size	VRAM (GPU)	RAM (CPU)	Best for
`llama3.2:1b`	1B	1.5 GB	4 GB	Fast replies on very low-spec hardware
`llama3.2:3b`	3B	3 GB	8 GB	Good balance of speed and quality
`llama3.1:8b`	8B	6 GB	16 GB	Best quality for most consumer GPUs
`mistral:7b`	7B	5 GB	12 GB	Fast, strong at reasoning and coding
`codellama:7b`	7B	5 GB	12 GB	Code generation and completion
`deepseek-coder:6.7b`	6.7B	5 GB	12 GB	Strong Python and Go coder
`phi3:mini`	3.8B	3 GB	8 GB	Efficient reasoning on low RAM
`gemma2:9b`	9B	7 GB	16 GB	Strong reasoning, Google model
`nomic-embed-text`	—	1 GB	4 GB	Text embeddings for RAG pipelines
`llama3.1:70b`	70B	48 GB	64 GB	Near GPT-4 quality, needs high-end GPU

Browse the full library at ollama.com/library. Each model page lists available quantizations. The q4_K_M variants are the best default choice: smaller download, minimal quality loss compared to the full-precision version.
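The "largest model that fits in VRAM" rule can be sketched in a few lines of Python. This is only an illustration: the VRAM figures are the estimates from the table above, the function name is made up for this example, and headroom_gb is an assumed safety margin for context and other GPU users, not a value Ollama defines.

```python
from typing import Optional

# (tag, approximate VRAM in GB) for the general-purpose models in the table above.
# These are the guide's estimates, not measurements.
MODELS = [
    ("llama3.2:1b", 1.5),
    ("llama3.2:3b", 3.0),
    ("mistral:7b", 5.0),
    ("llama3.1:8b", 6.0),
    ("gemma2:9b", 7.0),
    ("llama3.1:70b", 48.0),
]


def largest_model_that_fits(vram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the tag with the highest VRAM need that still fits, or None.

    headroom_gb is a hypothetical safety margin chosen for this sketch.
    """
    budget = vram_gb - headroom_gb
    fitting = [(need, tag) for tag, need in MODELS if need <= budget]
    if not fitting:
        return None
    # max() compares the VRAM requirement first, so the biggest fitting model wins.
    return max(fitting)[1]
```

For example, largest_model_that_fits(6) suggests mistral:7b under these assumptions, while a 2 GB card returns None, which means CPU inference with a small model is the realistic option.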

Step 3 — Check GPU Acceleration

If you have a compatible GPU, Ollama uses it automatically. Verify with:

ollama ps

This shows running models and whether they are loaded on GPU or CPU. For NVIDIA cards, you can also confirm with:

nvidia-smi | grep ollama

If no GPU is detected and you have an NVIDIA card, check that the CUDA drivers are installed correctly. Ollama will fall back to CPU automatically if no GPU is available — models just run slower.
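The same check can be scripted. The sketch below assumes the server's /api/ps endpoint, which reports size and size_vram for each loaded model; the helper names are invented for this example, and nothing here runs unless you call loaded_models() against a live server.

```python
import json
import urllib.request


def gpu_fraction(model: dict) -> float:
    """Share of a loaded model's memory that sits in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return model.get("size_vram", 0) / size


def describe(model: dict) -> str:
    """One-line summary such as 'llama3.2:3b: 100% GPU'."""
    pct = round(gpu_fraction(model) * 100)
    return f"{model.get('name', '?')}: {pct}% GPU"


def loaded_models(base_url: str = "http://localhost:11434") -> list:
    """Fetch the currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])
```

Calling describe() on each entry returned by loaded_models() gives a quick view of whether a model is fully on the GPU or partly spilled to CPU.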

Step 4 — Use the Ollama API

Ollama runs a local REST API on http://localhost:11434. You can query it directly from the terminal or from any application:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Explain what a kernel is in one sentence.",
  "stream": false
}'

Ollama also serves OpenAI-compatible endpoints under the /v1 path, which means any tool that supports OpenAI's API (Open WebUI, Continue.dev, shell scripts) works with Ollama once its base URL points to http://localhost:11434/v1.
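The curl call above translates directly to Python's standard library. This is a minimal sketch rather than a full client: the function names are made up for this example, and it assumes the server is on the default address and that the model has already been pulled.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "llama3.2:3b",
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    req = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        # With "stream": false the server replies with a single JSON object.
        return json.load(resp)["response"]
```

With the service running, generate("Explain what a kernel is in one sentence.") returns the model's answer as a string.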

Step 5 — Manage the Ollama Service

Ollama runs as a systemd service. Standard service commands apply:

# Check service status
sudo systemctl status ollama

# Stop Ollama
sudo systemctl stop ollama

# Restart after config changes
sudo systemctl restart ollama

# View logs
journalctl -u ollama -f
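
After a restart the API can take a moment to accept connections. A script that restarts the service and then immediately sends requests may want to wait first. The sketch below polls the root URL until it answers; the function names and the timeout values are assumptions for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_up(base_url: str = "http://localhost:11434") -> bool:
    """True if the Ollama HTTP endpoint answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool] = server_is_up,
                     timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Call probe() repeatedly until it succeeds or timeout_s has passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Passing the probe as an argument keeps the waiting logic testable without a live server.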

Install Ollama on Arch Linux

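On Arch Linux, Ollama is packaged in the official extra repository, so pacman can install it directly. AUR helpers such as yay and paru also install repository packages, so they work too:

# Using pacman
sudo pacman -S ollama

# Or using yay
yay -S ollama

# Or using paru
paru -S ollama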

After installation, enable and start the service:

sudo systemctl enable --now ollama

Install Ollama with Docker

If you prefer containers or want to isolate Ollama from your system, Docker is the cleanest option:

# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# With NVIDIA GPU (requires the NVIDIA Container Toolkit on the host)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Models are stored in a named volume (ollama), so they persist across container restarts. Pull and run models the same way — just exec into the container or use the API on port 11434.
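
Because the container publishes port 11434, a model can be pulled from the host through the API instead of exec-ing into the container. The following is a rough sketch that assumes the /api/pull endpoint streams one JSON status object per line, with total and completed byte counts while layers download; the helper names are invented here.

```python
import json
import urllib.request
from typing import Optional


def pull_progress(line: str) -> Optional[float]:
    """Percent complete for one status line, or None if it has no byte counts."""
    status = json.loads(line)
    total = status.get("total")
    if not total:
        return None
    return 100.0 * status.get("completed", 0) / total


def pull(model: str, base_url: str = "http://localhost:11434") -> None:
    """Ask the server to pull a model and print progress (needs a running server)."""
    req = urllib.request.Request(
        f"{base_url}/api/pull",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # the response body is a stream of JSON lines
            line = raw.decode("utf-8").strip()
            if not line:
                continue
            pct = pull_progress(line)
            status = json.loads(line).get("status", "")
            print(status if pct is None else f"{status}: {pct:.0f}%")
```

Calling pull("llama3.2:3b") stores the model in the ollama volume, exactly as if it had been pulled inside the container.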

Useful Ollama Commands

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2:3b

# Pull a specific version
ollama pull mistral:7b

# Show model details
ollama show llama3.1:8b

# Copy a model under a new name
ollama cp llama3.1:8b my-custom-model

Customize Ollama with a Modelfile

A Modelfile lets you set a custom system prompt, adjust generation parameters, and save the result as a named model. This is how you turn a generic base model into a specialized assistant — without fine-tuning or touching weights.

Create a file called Modelfile (no extension):

FROM llama3.1:8b

SYSTEM """
You are a Linux sysadmin assistant. Answer questions about Linux, bash scripting,
server administration and networking. Be direct and concise. Always provide
working commands that can be copy-pasted.
"""

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

Build and run your custom model:

# Build the custom model
ollama create linux-assistant -f Modelfile

# Run it
ollama run linux-assistant

The key parameters to know:

temperature: 0.0 is close to deterministic, 1.0 is creative. Use 0.3–0.5 for code, 0.7–0.9 for writing.
num_ctx: context window in tokens. Higher values let the model remember more of the conversation but use more VRAM. The default depends on your Ollama version (2048 in older releases), so set it explicitly if you rely on a particular size.
top_p: controls output diversity. 0.9 is a solid default for most uses.
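
If you maintain several assistants, generating the Modelfile from a script keeps the parameters in one place. The helper below is a hypothetical convenience function, not part of Ollama; it only writes the FROM, SYSTEM and PARAMETER lines shown above and assumes the system prompt contains no triple quotes.

```python
def render_modelfile(base: str, system: str, **params) -> str:
    """Return Modelfile text with one PARAMETER line per keyword argument."""
    lines = [f"FROM {base}", "", 'SYSTEM """', system.strip(), '"""', ""]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.1:8b",
    "You are a Linux sysadmin assistant. Be direct and concise.",
    temperature=0.3,  # lower value for more predictable command output
    num_ctx=4096,
    top_p=0.9,
)
print(text)
```

Write the returned text to a file named Modelfile and build it with ollama create as shown above.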

Access Ollama from Another Machine on Your Network

By default Ollama only listens on 127.0.0.1:11434. If you want to run Ollama on a server and query it from another machine on the same network — a laptop, a VM, Open WebUI on a different host — you need to change the bind address.

Create a systemd override:

sudo systemctl edit ollama

Add this between the comment lines in the editor that opens:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Save and restart:

sudo systemctl restart ollama

Other machines on the same LAN can now reach the API at http://YOUR_SERVER_IP:11434. Do not expose port 11434 directly to the internet: the API has no built-in authentication. If you need remote access, put a reverse proxy with authentication (for example Nginx with basic auth) in front of it, or reach the server over a private VPN such as Tailscale.
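
From the client machine, a quick way to confirm the server is reachable is to list its models over the network. This sketch assumes the /api/tags endpoint; the helper names and the 192.168.1.50 address used in the example are placeholders, and the URL helper does not handle IPv6 literals or URLs with a path.

```python
import json
import urllib.request


def base_url(host: str, default_port: int = 11434) -> str:
    """Turn 'server', 'server:11435' or a full URL into a base URL."""
    if "://" not in host:
        host = "http://" + host
    scheme, rest = host.split("://", 1)
    rest = rest.rstrip("/")
    if ":" not in rest:
        rest = f"{rest}:{default_port}"
    return f"{scheme}://{rest}"


def list_models(host: str) -> list:
    """Return the model names installed on a remote Ollama server."""
    with urllib.request.urlopen(f"{base_url(host)}/api/tags", timeout=5) as resp:
        return [m["name"] for m in json.load(resp).get("models", [])]
```

Running list_models("192.168.1.50") from the laptop should return the same names that ollama list prints on the server. A connection error usually points to the bind address or a firewall.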

Use Ollama from Python

The official Python library wraps the Ollama API cleanly. Install it with pip:

pip install ollama

Basic chat call:

import ollama

response = ollama.chat(
    model='llama3.2:3b',
    messages=[{'role': 'user', 'content': 'What is a Linux inode?'}]
)
print(response['message']['content'])

Stream the response token by token — better for interactive scripts and CLIs:

import ollama

for chunk in ollama.chat(
    model='llama3.2:3b',
    messages=[{'role': 'user', 'content': 'Explain systemd in plain English.'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)

Generate text embeddings for RAG pipelines:

import ollama

result = ollama.embed(
    model='nomic-embed-text',
    input='How do I configure iptables?'
)
print(result['embeddings'])
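
In a RAG pipeline the embeddings are compared to find the documents closest to a query. A minimal sketch of that step using cosine similarity in pure Python follows; the function names are made up for this example, and the short vectors in the usage note stand in for real embeddings, which have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: list, doc_vecs: list) -> list:
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

With toy vectors, rank([1, 0], [[0, 1], [1, 0], [1, 1]]) puts document 1 first. In practice both the query vector and the document vectors would come from ollama.embed with the same model.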

Performance Tips

If inference feels slow, these adjustments make a real difference without switching to a smaller model:

Use q4_K_M quantization. Best tradeoff between size and quality. The default tag is already quantized, but some models have better variants worth specifying. Pull one explicitly by its full tag, for example ollama pull llama3.1:8b-instruct-q4_K_M, and check the model's Tags page for the exact names.
Reduce context length. Setting num_ctx to a lower value such as 1024 in a Modelfile reduces VRAM usage and can speed up generation when you don't need long conversations.
Avoid switching between models. When two models don't fit in memory together, Ollama unloads one to load the other. Sticking to a single active model avoids reload cycles that add several seconds per request.
Store models on NVMe. Ollama loads a model from disk into RAM/VRAM the first time it is requested. An NVMe SSD cuts load time from ~20s (spinning disk) to 2–3s. The default path is ~/.ollama/models, or /usr/share/ollama/.ollama/models when Ollama runs as the systemd service. Change it with the OLLAMA_MODELS environment variable.
Check VRAM before pulling large models. Run nvidia-smi or rocm-smi to see available VRAM. A model that doesn't fit in VRAM is split between GPU and CPU, which is much slower. Size the model to your VRAM, not your total system RAM.
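
For a rough idea of whether a model will fit before downloading it, multiply the parameter count by the bits stored per weight. The sketch below is back-of-the-envelope arithmetic only: 4-bit quantizations store roughly 4 to 5 bits per weight, and overhead_gb is an assumed allowance for the context cache and runtime buffers, not a figure from Ollama.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


# An 8B model at about 4.5 bits per weight needs about 4.5 GB for weights.
print(weights_gb(8, 4.5))
# A 70B model at the same precision does not fit on a 24 GB card.
print(fits_in_vram(70, 4.5, 24))
```

Treat the result as a first filter and confirm with ollama ps after loading, since real usage grows with the context length.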

Troubleshooting

Error: `could not connect to ollama app, is it running?`

The Ollama service is not running. Start it:

sudo systemctl start ollama

Error: `model not found`

Pull the model first before running it:

ollama pull llama3.2:3b
ollama run llama3.2:3b

GPU not detected after installation

For NVIDIA cards, install or reinstall the NVIDIA driver and reboot. When Ollama is installed via the official script, the CUDA libraries are bundled, so no separate CUDA toolkit is required. If the GPU is still missing from ollama ps, check the service logs:

journalctl -u ollama --no-pager | grep -iE "gpu|cuda|error"

Model runs extremely slowly (CPU instead of GPU)

This happens when GPU VRAM is not enough to hold the full model. Ollama runs the layers that don't fit on the CPU. Use a smaller model (3B instead of 8B), or a more aggressively quantized version, for example ollama pull llama3.1:8b-instruct-q4_0. The model's Tags page lists the exact names.

Port 11434 already in use

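Another process is using the port, often a second Ollama instance. Find and stop it, or change Ollama's port via a systemd override: add Environment="OLLAMA_HOST=127.0.0.1:11435" and restart the service. Use 0.0.0.0 instead of 127.0.0.1 only if other machines need access. The ollama CLI needs the same OLLAMA_HOST value in its environment to find the server on the new port.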

Error: `pull model manifest: context deadline exceeded`

Network timeout pulling the model from ollama.com. Check your connection and DNS. Large model downloads sometimes fail on the first attempt — run ollama pull again and it resumes from where it left off.
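
On a flaky connection, the "run it again" advice can be automated with a small retry loop. This is a generic sketch with exponential backoff; the function name, attempt count and delays are arbitrary choices for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 4, base_delay_s: float = 2.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Run action(); on failure wait 2s, 4s, 8s, ... and try again.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
```

A typical use is retry(lambda: subprocess.run(["ollama", "pull", "llama3.2:3b"], check=True)) after importing subprocess. Because pulls resume, each retry only fetches what is still missing.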

Error: `unable to allocate memory`

Not enough VRAM or system RAM. Switch to a smaller model, use a more aggressively quantized version (q4_0 instead of q8_0), or close other GPU-heavy applications before starting Ollama.

Model outputs repetitive or incoherent text

Usually a temperature or quantization issue. Try a different quantization variant of the same model, or set PARAMETER temperature 0.7 in a Modelfile. Persistent gibberish can indicate a corrupted download. In that case, remove and re-pull the model:

ollama rm llama3.2:3b
ollama pull llama3.2:3b

What to Do Next

Ollama running locally is useful on its own, but it becomes significantly more powerful with a web interface. The next step is installing Open WebUI — a full ChatGPT-like interface that connects to your local Ollama instance and runs entirely in your browser. No data leaves your machine.