Troubleshooting Self-Hosted AI Agent Issues
GPU and Memory Issues
CUDA out of memory errors are the single most common problem. They occur when the model and its KV cache exceed available GPU VRAM. The error message typically reads "CUDA out of memory" or "RuntimeError: CUDA error: out of memory."
To diagnose, check current VRAM usage with nvidia-smi. The output shows total VRAM, used VRAM, and which processes are consuming memory. If another process is using VRAM (a previous crashed inference server, a desktop environment, or another application), kill it to free memory.
To resolve, you have several options. Switch to a smaller quantization: moving from FP16 to 4-bit roughly halves VRAM usage. Reduce the context window length in your inference server configuration, since longer contexts require more KV cache memory. Use a smaller model that fits your hardware. If running Ollama, set OLLAMA_NUM_PARALLEL to 1 to reduce concurrent inference sessions, which reduces KV cache memory.
GPU not detected usually indicates a driver problem. Verify NVIDIA drivers are installed by running nvidia-smi. If the command fails, reinstall the NVIDIA driver. Inside Docker containers, ensure the NVIDIA Container Toolkit is installed and that your docker run command includes --gpus all or your docker-compose.yml includes the deploy.resources.reservations.devices section for GPU access.
Thermal throttling causes gradually degrading inference speed. GPUs reduce clock speeds when temperatures exceed safe thresholds (typically 83 to 90 degrees Celsius). Monitor GPU temperature with nvidia-smi. If temperatures are high, improve case airflow, clean dust from fans and heatsinks, increase fan speed curves, or reduce ambient room temperature. In server environments, verify that rack cooling is adequate.
Model Loading Problems
Model fails to load can have several causes. Corrupted downloads are common, especially for large model files. Verify the file integrity by checking its size against the expected size listed on the model's download page, or compare checksums if available. Re-download the model if sizes do not match.
Incompatible quantization formats can also prevent loading. Not all inference engines support all formats. Ollama uses GGUF format; vLLM uses safetensors or GPTQ; TGI supports safetensors. Verify that your model format matches your inference engine's requirements.
Extremely slow model loading (minutes instead of seconds) typically indicates the model is being loaded from a slow storage device (HDD instead of SSD) or from a network drive. Model files should be stored on local NVMe or SATA SSD storage for acceptable load times. A 7B model loads from NVMe in 2 to 5 seconds; from HDD it can take 30 to 60 seconds.
Model produces nonsensical output after loading correctly usually means the prompt format is wrong. Different models expect different prompt templates (ChatML, Llama format, Mistral format). Using the wrong template causes the model to treat system prompts as user input or miss instruction boundaries entirely. Check the model's documentation for the correct prompt template and configure your inference server or orchestration platform accordingly.
Docker and Container Issues
Container restart loops appear in docker ps as containers with status "Restarting." Check logs with docker logs [container_name] to identify the cause. Common culprits include: port conflicts (another service using the same port), missing environment variables, insufficient memory (the container is being OOM-killed), and configuration errors in the compose file.
Container cannot access GPU inside Docker requires the NVIDIA Container Toolkit. Install it with your package manager, then restart the Docker daemon. Verify by running docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi. If this command shows your GPU, the toolkit is working. If not, check that the Docker daemon configuration includes the nvidia runtime.
Permission denied errors in containers often occur with volume mounts. The user ID inside the container may not have permission to read or write files on mounted volumes. Fix by setting appropriate file permissions on the host (chmod), or by specifying the user ID in the compose file to match the host user.
Docker compose services not communicating usually means services are trying to connect to localhost instead of the service name. Within a Docker Compose network, services reach each other by service name (for example, http://ollama:11434), not by localhost. Check connection URLs in your configuration and replace localhost references with the appropriate service name.
Inference Performance Problems
Slow token generation (under 10 tokens per second for 7B models on an RTX 4090) suggests a configuration issue rather than a hardware limitation. Check that inference is actually running on GPU, not falling back to CPU. In vLLM, verify the tensor-parallel-size matches your GPU count. In Ollama, ensure CUDA is enabled in the logs. CPU inference is typically 5 to 20 times slower than GPU inference.
Other causes of slow inference include: excessively long context windows (longer context = slower generation), too many concurrent requests exceeding GPU capacity, and background processes consuming GPU resources.
Inconsistent response times where some requests are fast and others are slow often indicate queuing under load. When multiple requests hit the inference server simultaneously, requests queue and wait. Reduce concurrent request volume, enable batching (vLLM handles this automatically), or add GPU capacity.
High latency on first request after startup is normal. The model needs to be loaded from disk to GPU memory on the first inference request. Subsequent requests use the already-loaded model and are much faster. If this cold start latency is problematic, configure your inference server to preload the model at startup rather than loading it on first request.
Agent Workflow Failures
Agent gets stuck in loops where it repeats the same action or oscillates between two states is a common behavior issue. This usually stems from ambiguous instructions in the system prompt, insufficient context for the agent to determine it has completed a task, or a tool that consistently returns results the agent misinterprets. Fix by clarifying stop conditions in the system prompt, adding explicit completion criteria, or improving tool output formatting.
Tool calls failing manifests as the agent attempting to use a tool but receiving errors. Common causes include: the tool endpoint being unreachable (check network connectivity and URLs), authentication credentials being expired or incorrect, the tool input format not matching what the tool expects, and timeout errors from tools that take too long to respond. Check tool logs independently of agent logs to isolate the failure point.
RAG returning irrelevant results means the vector search is not finding documents that match the query intent. This can be caused by: poor quality document embeddings (try a different embedding model), document chunks that are too large or too small (experiment with chunk sizes between 256 and 1024 tokens), or missing documents in the vector index (verify your document ingestion pipeline completed successfully).
Agent memory not persisting between sessions indicates a storage or configuration issue with the conversation memory system. Verify that the memory backend (database or file storage) is properly configured and that data volumes are correctly mounted in Docker. Check that the session ID is being passed consistently between requests so the agent retrieves the correct conversation history.
Networking and Connectivity
Cannot access the agent platform from other devices on the network usually means the service is bound to localhost (127.0.0.1) instead of all interfaces (0.0.0.0). Check the service bind address in the configuration and change it to 0.0.0.0 or the machine's LAN IP address. Also verify that firewall rules (iptables, ufw, or firewalld) allow connections on the service port.
TLS certificate errors when connecting to external tools or APIs from agents indicate certificate validation failures. This can happen when: the system clock is wrong (certificates are time-sensitive), corporate proxy certificates are not installed in the container, or self-signed certificates are not trusted. Fix by correcting the system time, adding proxy CA certificates to the container, or configuring the HTTP client to trust specific certificates.
General Diagnostic Approach
When facing an unfamiliar problem, follow this systematic approach. First, check component logs: Docker logs for container issues, inference server logs for model problems, orchestration platform logs for agent issues. Second, isolate the failing component by testing each layer independently: can the inference server respond to a direct API call? Can the orchestration platform reach the inference server? Can tools execute independently of the agent? Third, check resource utilization: GPU memory, system RAM, disk space, and CPU usage. Resource exhaustion causes subtle failures that look like application bugs. Fourth, compare against a known working configuration. If something changed recently (model update, configuration change, Docker image update), revert and verify the system works, then reapply changes one at a time to identify the culprit.
Most self-hosted AI agent problems fall into five categories: GPU memory, model loading, Docker configuration, inference performance, and agent behavior. Systematic diagnosis starting with component logs and resource checks resolves the majority of issues quickly.