Models Supported by Hermes Agent
Model-Agnostic Architecture
Hermes Agent is designed to work with any language model that supports tool calling. The framework treats the language model as a replaceable component, meaning you can switch between providers and models without losing your accumulated skills, memories, or configuration. When you configure a model provider, Hermes automatically detects capabilities including vision support, streaming, tool calling, and context window size.
The framework supports configuring multiple model providers simultaneously. You can set a primary model for complex reasoning tasks, a secondary model for quick classification, and a fallback model in case the primary provider experiences downtime. The agent handles failover automatically.
Cloud API Providers
OpenAI models are fully supported through the OpenAI API. Compatible models include GPT-4o, GPT-4.1, o3, o4-mini, and all others supporting function calling. GPT-4o is one of the most popular choices due to its strong tool-calling accuracy.
Anthropic models work through the Anthropic API. Claude Opus, Claude Sonnet, and Claude Haiku are all compatible. Claude Sonnet is frequently recommended for production workloads due to its balance of capability, speed, and cost.
Google models are supported through the Google AI API. Gemini 2.5 Pro and Gemini 2.5 Flash are the primary compatible models. Gemini 2.5 Flash is particularly cost-effective for high-volume operations.
DeepSeek provides V4 and R2 models through their API. DeepSeek V4 has become a popular budget option, offering competitive performance at significantly lower per-token costs than GPT-4o or Claude Sonnet.
Nous Portal is Nous Research's own model serving platform providing access to Hermes 4 and Hermes 3 model families. These models are specifically optimized for agent use cases with strong tool-calling accuracy.
OpenRouter acts as a meta-provider offering access to 200+ models from dozens of providers through a single API endpoint. This is the most flexible option for experimenting with different models.
Local Inference Options
Ollama is the most popular local inference server for Hermes users. It provides a simple interface for downloading and running open-source models locally. Compatible models include Hermes 3 (8B, 70B), Llama 3.1, Mistral, Mixtral, Phi-3, and many others. Ollama supports GPU acceleration with NVIDIA, AMD, and Apple Silicon hardware.
vLLM is a high-performance inference engine for production deployment with continuous batching, PagedAttention for efficient memory management, and tensor parallelism across multiple GPUs.
SGLang provides another high-performance option with a focus on structured generation and efficient prompt caching. Any server implementing the OpenAI-compatible API format works with Hermes as well, including LM Studio and text-generation-webui.
Model Routing and Cost Optimization
Hermes supports model routing, a feature that assigns different models to different types of tasks based on complexity, cost, or speed requirements. A typical routing configuration might use Claude Haiku or Gemini Flash for message classification (fast, cheap), Claude Sonnet or GPT-4o for complex reasoning (capable, moderate cost), and Claude Opus for critical operations requiring maximum reliability.
Model routing can reduce monthly API costs by 40 to 60% compared to using a single high-end model for all tasks. The savings come from the observation that most interactions do not require the full reasoning power of a frontier model.
Performance Benchmarks by Model
Tool-call accuracy is the most important metric for agent performance. GPT-4o and Claude Sonnet consistently achieve 95%+ accuracy across diverse task types. DeepSeek V4 achieves approximately 88 to 92% accuracy, making it a strong budget option. Hermes 3 8B through Ollama reaches 91% accuracy, remarkable for a local model running on modest hardware.
Response latency varies widely. Cloud models typically respond in 1 to 3 seconds for simple tasks and 5 to 15 seconds for complex operations. Local models through Ollama range from 3 to 10 seconds on GPU and 15 to 60 seconds on CPU.
Choosing Your First Model
For users new to Hermes Agent, model selection can feel overwhelming given the number of options. The community consensus has converged on a few recommended starting points based on priorities. If cost is the primary concern, start with DeepSeek V4 through the DeepSeek API. It offers strong performance at the lowest per-token cost among capable models, and a typical personal assistant workload costs $2 to $5 per month.
If quality is the priority and budget allows, Claude Sonnet through the Anthropic API is the most consistently recommended option. It achieves 95%+ tool-call accuracy, handles complex multi-step tasks reliably, and produces natural, well-structured responses. Monthly API costs for moderate usage range from $10 to $20.
For users who want complete data sovereignty with no external API calls, Hermes 3 8B through Ollama is the recommended local model. It requires a GPU with at least 8GB VRAM and achieves 91% tool-call accuracy, which is remarkably close to frontier cloud models. The main trade-off is inference speed, which ranges from 3 to 10 seconds per response on GPU versus 1 to 3 seconds for cloud APIs.
Regardless of which model you start with, the skill system means your initial model choice has long-term implications. Skills created by a more capable model tend to be higher quality and more generalizable. Starting with a capable model for the first few weeks of skill building and then downgrading to a cheaper model for daily operation can be an effective cost optimization strategy.
Model Updates and Compatibility
Language model providers release new model versions regularly, and Hermes handles these transitions gracefully. Because the framework interacts with models through standardized API interfaces, new model versions typically work without configuration changes. The auto-detection system re-evaluates model capabilities on each startup, so new features (like expanded context windows or improved tool calling) are picked up automatically.
The community maintains a compatibility matrix on the Hermes wiki that tracks which models have been tested, their tool-call accuracy scores, and any known issues. This matrix is updated with each new model release from major providers and serves as a reliable reference for model selection decisions. Models are rated on a simple three-tier system: fully compatible (all features work reliably), compatible with limitations (most features work but some edge cases fail), and incompatible (tool calling is unreliable or missing).
Choosing Your First Model
For new Hermes users, selecting a starting model can be overwhelming given the 200+ options available. The community's general recommendation is to start with one of three well-tested configurations depending on your priorities. For the lowest cost, DeepSeek V4 through OpenRouter or the DeepSeek API provides strong performance at approximately $0.14 per million input tokens. For the best balance of quality and cost, Claude Haiku routed to Claude Sonnet for complex tasks keeps monthly API spending between $7 and $15 while delivering excellent response quality. For maximum capability regardless of cost, GPT-4o or Claude Sonnet as the primary model for all tasks provides the highest accuracy and most natural responses.
If you plan to use model routing (assigning different models to different task types), start with conservative routing thresholds that send most tasks to the capable model and only route clearly simple queries to the cheaper one. Monitor the quality of responses from each model over the first week, then gradually adjust the thresholds to route more tasks to the cheaper model as you gain confidence in its performance for your specific use cases. The agent's model routing dashboard shows task classification statistics that help you tune these thresholds empirically.
Remember that model selection is not permanent. Hermes makes it easy to switch models at any time through a configuration change, so you can experiment freely without committing to a long-term choice upfront.
The breadth of model support ensures that Hermes users are never locked into a single provider or pricing tier. As the model landscape evolves and new options emerge, Hermes's model-agnostic architecture allows you to adopt improvements immediately without waiting for framework updates or compatibility patches.
Hermes Agent works with any tool-calling language model, from budget options like DeepSeek V4 to frontier models like GPT-4o, and supports model routing to optimize cost and performance across different task types.