How to Run AI Locally on Linux
Linux is the preferred platform for local AI in server and developer environments. It offers the best GPU driver support, the most configuration flexibility, and the ability to run models on headless machines accessible over the network. Whether you are setting up a personal workstation, a shared AI server for your team, or a dedicated inference machine, Linux provides the tools and stability to make it work.
Check Your Hardware and GPU
Before installing anything, check what hardware you are working with. Open a terminal and run free -h to see your total system RAM. You need at least 8 GB for small models and 16 GB or more for 8B-parameter models.
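To check for a GPU, run lspci | grep -iE 'vga|3d|display'. This lists your graphics devices, including discrete cards that lspci labels as a 3D controller rather than a VGA controller (common on laptops and data-center GPUs). If you see an NVIDIA card, you will use CUDA drivers. If you see an AMD card, you will use ROCm. If you only see an integrated Intel GPU or no discrete GPU, you can still run models on CPU only.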
For NVIDIA GPUs, check if drivers are already installed by running nvidia-smi. If this command works and shows your GPU information including CUDA version, your drivers are ready and you can skip the driver installation step. For AMD GPUs, check with rocm-smi.
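The checks above can be scripted. The following is a minimal sketch, assuming a POSIX shell with grep and awk available. The function names total_ram_gib and classify_gpu, and the labels nvidia, amd, and cpu, are illustrative choices and not part of any tool. Both functions read text on standard input, so they can be tried against saved output as well as on a live machine.

```shell
# Print total system RAM in whole GiB.
# Reads /proc/meminfo-style text on stdin (MemTotal is reported in kB).
total_ram_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 }'
}

# Classify the GPU backend from lspci output on stdin.
# Prints nvidia, amd, or cpu (labels are this sketch's own convention).
classify_gpu() {
    # Keep only display-class devices so that, for example, an NVIDIA
    # audio function does not count as a GPU.
    gpu_lines=$(grep -iE 'vga compatible controller|3d controller|display controller' || true)
    case "$gpu_lines" in
        *NVIDIA*) echo nvidia ;;
        *"Advanced Micro Devices"*|*AMD*) echo amd ;;
        *) echo cpu ;;
    esac
}

# Usage on a real machine:
#   total_ram_gib < /proc/meminfo
#   lspci | classify_gpu
```

Note that this sketch reports amd for AMD integrated graphics as well, and that a machine sold with 16 GB typically reports 15 GiB, so treat the results as a starting point rather than a verdict.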
Install GPU Drivers
NVIDIA (CUDA): Install the driver from your distribution's repository or from NVIDIA's package repository for your distribution. On Ubuntu, you can use sudo apt install nvidia-driver-560 (or the latest available version); on Debian, the package is named nvidia-driver and comes from the non-free component. On Fedora, use RPM Fusion. After installation, reboot and verify with nvidia-smi. The CUDA version that nvidia-smi reports should be 11.8 or later for Ollama GPU support.
AMD (ROCm): Install ROCm from AMD's official repository. The process varies by distribution, but AMD provides installation guides for Ubuntu, RHEL, and SUSE. After installation, verify with rocm-smi. ROCm support in Ollama has improved significantly but may require specific ROCm versions for best compatibility.
CPU only: If you have no discrete GPU or prefer not to use one, skip this step entirely. Ollama runs on CPU without any additional drivers. Performance is lower but adequate for small to medium models.
Install Ollama
The install script handles everything automatically. Run this command in your terminal:
curl -fsSL https://ollama.com/install.sh | sh
The script downloads the Ollama binary, installs it to /usr/local/bin/, creates an ollama system user, sets up a systemd service, and starts the service. After the script completes, Ollama is running and ready to use.
Verify the installation with ollama --version and check the service status with systemctl status ollama. You should see the service listed as active and running.
For distributions without systemd (such as certain container-oriented Linux installs), you can run Ollama manually with ollama serve in a terminal or configure your init system to manage it.
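Whether Ollama runs under systemd or from a manual ollama serve, a script can confirm the server is answering before it sends any work. The sketch below polls the /api/version endpoint of Ollama's HTTP API with curl. The function name wait_for_ollama, its arguments, and the one-second retry interval are assumptions made for illustration.

```shell
# Poll the Ollama HTTP API until it answers or the retry budget runs out.
# Usage: wait_for_ollama [BASE_URL] [TRIES]
wait_for_ollama() {
    base_url=${1:-http://127.0.0.1:11434}
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        # -f makes curl fail on HTTP errors; output is discarded because
        # only reachability matters here.
        if curl -fsS "$base_url/api/version" >/dev/null 2>&1; then
            echo "ollama is up at $base_url"
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep 1
        fi
    done
    echo "ollama did not respond at $base_url" >&2
    return 1
}

# Usage:
#   wait_for_ollama                      # local server, up to 10 tries
#   wait_for_ollama http://127.0.0.1:11434 30
```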
Download and Run a Model
With Ollama running, download and start your first model:
ollama run qwen3:8b
The model downloads (approximately 4.9 GB) and loads into memory. Once loaded, you see a prompt where you can type messages. Ollama automatically detects your GPU and offloads model layers to it for acceleration.
Check the acceleration status with ollama ps in another terminal. On an NVIDIA GPU with proper CUDA drivers, the PROCESSOR column should read 100% GPU, meaning all layers are offloaded. If the model exceeds your VRAM, Ollama splits it between GPU and CPU automatically and the column shows the CPU/GPU percentage split; acceleration is then only partial.
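For scripted checks, the same information can be read from the command's output. The sketch below assumes that ollama ps prints one row per loaded model and that a fully offloaded model shows the text 100% GPU in its PROCESSOR column. That layout may change between Ollama versions, and the function name model_fully_on_gpu is hypothetical.

```shell
# Succeed if the named model's row in `ollama ps` output (read on stdin)
# reports full GPU offload. The match is a plain substring match, so a
# name that is a prefix of another loaded model's name can match both rows.
model_fully_on_gpu() {
    model=${1:?usage: model_fully_on_gpu MODEL}
    grep -F -- "$model" | grep -q '100% GPU'
}

# Usage:
#   if ollama ps | model_fully_on_gpu qwen3:8b; then
#       echo "fully accelerated"
#   else
#       echo "partly or entirely on CPU (or not loaded)"
#   fi
```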
Configure for Remote Access or Headless Use
By default, Ollama listens only on localhost (127.0.0.1:11434). For headless servers or to allow access from other devices on your network, configure Ollama to listen on all interfaces.
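Create a systemd override for the service: sudo systemctl edit ollama. This opens a drop-in override file rather than the installed unit file, so the setting survives Ollama upgrades. Add the following in the editor: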
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Save and restart the service: sudo systemctl restart ollama. Ollama is now accessible from any device on your network at http://your-server-ip:11434.
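On machines provisioned by script, the interactive editor can be replaced by writing the drop-in file directly. The sketch below generates the same [Service] block from NAME=value arguments. The function name render_override is hypothetical, and the usage comments assume the unit is named ollama.service and that the drop-in lives at the path systemctl edit would normally create.

```shell
# Print a systemd drop-in that sets each NAME=value argument as an
# Environment= line. Values containing double quotes are not handled.
render_override() {
    echo '[Service]'
    for kv in "$@"; do
        printf 'Environment="%s"\n' "$kv"
    done
}

# Usage (needs root):
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   render_override OLLAMA_HOST=0.0.0.0 |
#       sudo tee /etc/systemd/system/ollama.service.d/override.conf
#   sudo systemctl daemon-reload
#   sudo systemctl restart ollama
```

Because the function only prints text, the result can be reviewed before it is written anywhere, and more variables can be passed as extra arguments.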
For a browser-based chat interface on a headless server, install Open WebUI via Docker on the same machine or on a different machine that can reach the Ollama server. Configure Open WebUI's Ollama connection URL to point to your server's IP address and port 11434.
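As a rough sketch of that setup, the function below starts Open WebUI in Docker and points it at a remote Ollama server. The image name, the OLLAMA_BASE_URL variable, the volume path, and the 3000-to-8080 port mapping follow the Open WebUI project's published quick-start as best recalled here, so check its current README before relying on them. The function name start_open_webui is hypothetical.

```shell
# Start Open WebUI and connect it to the Ollama server at the given URL.
# Usage: start_open_webui http://your-server-ip:11434
start_open_webui() {
    ollama_url=${1:?usage: start_open_webui http://your-server-ip:11434}
    docker run -d \
        -p 3000:8080 \
        -e OLLAMA_BASE_URL="$ollama_url" \
        -v open-webui:/app/backend/data \
        --name open-webui \
        --restart always \
        ghcr.io/open-webui/open-webui:main
}
```

With this mapping, the chat interface would be served on port 3000 of the machine running the container.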
Ollama's API has no built-in authentication, so anyone who can reach the port can use your models. If your server is exposed to the internet (not just your local network), place Ollama behind a reverse proxy with authentication rather than exposing it directly.
Linux-Specific Tips and Configuration
Model storage: Ollama stores models in /usr/share/ollama/.ollama/models when running as a systemd service. To use a different location (such as a larger drive or faster SSD), set the OLLAMA_MODELS environment variable in the systemd service override file, and make sure the ollama user can read and write the new directory.
Multiple GPUs: If your machine has multiple NVIDIA GPUs, Ollama can distribute model layers across them for larger models. This happens automatically when a model exceeds the VRAM of a single GPU. You can control which GPUs Ollama uses with the CUDA_VISIBLE_DEVICES environment variable.
Automatic updates: Re-running the install script updates Ollama to the latest version. The script detects the existing installation and upgrades in place. Some users add this to a cron job for automatic weekly updates.
Firewall: If you configured Ollama for remote access and other devices cannot connect, check your firewall rules. On Ubuntu with ufw, run sudo ufw allow 11434/tcp to allow incoming connections to the Ollama port.
Log access: View Ollama service logs with journalctl -u ollama -f. This is useful for debugging GPU detection issues, model loading failures, and connection problems. The logs show which GPU backend was detected, how many layers are offloaded, and any errors during model loading.
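When the log is long, a simple filter narrows it to the lines most likely to matter for acceleration problems. This is a minimal sketch: the keyword list is a guess at useful terms and does not reflect a documented log format, and the function name gpu_log_lines is hypothetical.

```shell
# Keep only log lines (read on stdin) that mention GPU-related keywords.
# Exits non-zero when nothing matches, as grep does.
gpu_log_lines() {
    grep -iE 'gpu|cuda|rocm|offload'
}

# Usage: scan the service log for the current boot.
#   journalctl -u ollama --no-pager -b | gpu_log_lines
```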
Running Multiple Models and Load Management
Linux servers with sufficient RAM can run multiple models concurrently. Ollama loads each model into memory independently, so a server with 64 GB of RAM could keep a small 3B model for quick tasks and a large 30B model for complex queries both resident in memory simultaneously. Requests route to whichever model you specify in the API call or command.
For production-like deployments, consider setting OLLAMA_KEEP_ALIVE to -1 so frequently used models stay loaded permanently rather than unloading after the default five-minute timeout. This eliminates the cold-start delay when a user sends a request to a model that was previously unloaded. Monitor memory usage with standard Linux tools like htop or free to ensure the loaded models leave enough room for the operating system and other services.
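Keep-alive can also be set per request instead of service-wide. The sketch below sends a request to /api/generate with no prompt and keep_alive set to -1, which Ollama's documentation describes as a way to load a model and keep it resident. The function name pin_model and its arguments are assumptions for illustration.

```shell
# Load a model and ask Ollama to keep it in memory indefinitely.
# Usage: pin_model MODEL [BASE_URL]
# The model name is inserted into the JSON verbatim, so it must not
# contain double quotes or backslashes.
pin_model() {
    model=${1:?usage: pin_model MODEL [BASE_URL]}
    base_url=${2:-http://127.0.0.1:11434}
    curl -fsS "$base_url/api/generate" \
        -H 'Content-Type: application/json' \
        -d "{\"model\": \"$model\", \"keep_alive\": -1}"
}

# Usage:
#   pin_model qwen3:8b
#   ollama ps     # the model should now be listed as loaded
```

Pinning per request lets you keep one heavily used model loaded while the others still unload after the default timeout.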
Docker-based deployments are also popular on Linux. Running Ollama inside a Docker container with NVIDIA GPU passthrough (using the NVIDIA Container Toolkit) provides isolation and makes deployment reproducible. This approach is especially useful for teams that manage multiple servers or use container orchestration tools.
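A containerized start might look like the following sketch, which wraps the docker run invocation published with the ollama/ollama image. It assumes Docker and the NVIDIA Container Toolkit are already installed. The function name start_ollama_container is hypothetical.

```shell
# Start Ollama in a container with all NVIDIA GPUs passed through.
# Models persist in the named volume "ollama" across container restarts.
start_ollama_container() {
    docker run -d \
        --gpus=all \
        -v ollama:/root/.ollama \
        -p 11434:11434 \
        --name ollama \
        ollama/ollama
}

# Usage:
#   start_ollama_container
#   docker exec -it ollama ollama run qwen3:8b
```

The -p 11434:11434 mapping publishes the API on every host interface, so the firewall and authentication advice for remote access applies here as well.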
Linux offers the best flexibility for local AI, with a one-line install script, full NVIDIA and AMD GPU support, and systemd integration for server deployments. Install Ollama, ensure your GPU drivers are current, and you are running local AI with full acceleration.