n8n with Ollama: Local AI Workflows

Updated May 2026

Step 1: Install Ollama

Ollama runs local LLMs on your own hardware with a simple API. There are two installation paths depending on your setup. If you are using the n8n Self-Hosted AI Starter Kit, Ollama is already included as a Docker container and pre-configured to communicate with n8n. Just run docker compose up and Ollama is ready.

For standalone installation, download Ollama from ollama.com for macOS, Linux, or Windows. The macOS and Windows installers provide native applications with Metal and CUDA GPU acceleration respectively. On Linux, the installation script handles everything including NVIDIA GPU driver detection. After installation, verify Ollama is running by visiting http://localhost:11434 in your browser or running ollama list in your terminal.

For Docker-based setups outside the starter kit, run the official Ollama Docker image. Use the gpu-nvidia tag for NVIDIA GPU support or the standard tag for CPU-only inference. Map port 11434 and mount a volume for model storage so downloaded models persist across container restarts.

Step 2: Pull Models

You need at least one language model for AI workflows. Run ollama pull followed by the model name. Recommended starting models include llama3.1 (8B parameters, strong general reasoning, 4.7GB download), mistral (7B parameters, fast inference, good for simple tasks, 4.1GB), and phi3 (3.8B parameters, smallest but still capable for basic tasks, 2.3GB).

For RAG pipelines, you also need an embedding model. Run ollama pull nomic-embed-text (274MB) for a solid general-purpose embedding model that produces 768-dimensional vectors. Alternative embedding models include mxbai-embed-large for higher quality at the cost of speed, and all-minilm for faster embedding at lower dimensionality.

Larger models like llama3.1:70b or mixtral:8x7b provide better quality but require significantly more resources (32GB+ RAM or 24GB+ VRAM). Start with smaller models and upgrade only if the quality does not meet your requirements.

Step 3: Configure n8n Credentials

In n8n, go to Settings, then Credentials, then Add Credential. Search for "Ollama" and select it. The only required field is the Base URL. If Ollama runs on the same machine, use http://localhost:11434. If using Docker Compose with the starter kit, use http://ollama:11434 (the Docker service name). If Ollama runs on a different machine on your network, use that machine's IP address and port.

Test the connection by creating a quick workflow with a Chat Trigger and Ollama Chat Model node. If the model dropdown populates with your installed models, the connection is working correctly.

Step 4: Build a Local AI Workflow

With Ollama configured, building workflows is identical to using cloud LLM providers. Add an Ollama Chat Model node wherever you would use an OpenAI or Anthropic model node. The same applies to embeddings: use the Ollama Embeddings node wherever you would use OpenAI Embeddings.

A complete local RAG pipeline uses Ollama for both inference and embeddings. The ingestion workflow loads documents, splits them with a Text Splitter, generates embeddings with Ollama Embeddings (nomic-embed-text), and stores vectors in Qdrant. The query workflow retrieves relevant chunks from Qdrant, passes them as context to an Ollama Chat Model (llama3.1), and returns the answer. The entire pipeline runs locally with zero API costs.

For conversational agents, add a Buffer Memory or PostgreSQL Memory node. The agent maintains conversation context across messages, calls tools as needed, and generates responses using the local model. Response quality depends on the model size, with larger models providing more accurate and nuanced answers but requiring more hardware resources.

Step 5: Optimize for Performance

Model selection is the biggest performance lever. For simple classification and extraction tasks, smaller models (phi3, gemma2:2b) provide fast inference with adequate quality. For complex reasoning, summarization, and multi-step tasks, larger models (llama3.1, mistral) are worth the slower inference speed.

Context window management matters for performance. Ollama models default to a 2048-token context window. For RAG pipelines with multiple retrieved chunks, increase the context window using the num_ctx parameter in the Ollama Chat Model node configuration. Common values are 4096 for moderate RAG and 8192 for larger context needs. Larger context windows use more memory and slow inference.

GPU utilization is critical for acceptable speed. Ensure your GPU drivers are installed and Ollama detects the GPU (check with ollama run llama3.1 and watch GPU utilization). If using Docker, the NVIDIA Container Toolkit must be installed for GPU passthrough. On macOS, Metal GPU acceleration is automatic with native Ollama installation.

For production workloads serving multiple concurrent users, consider running Ollama on a dedicated machine with sufficient GPU memory. A single consumer GPU (RTX 3090, 24GB VRAM) can handle 2 to 3 concurrent inference requests with a 7B model. For higher concurrency, multiple GPUs or a dedicated inference server is recommended.