Infrastructure Costs for Running AI Agents
Serverless Infrastructure
Serverless computing offers the most cost-effective entry point for AI agents because you pay only for actual execution time. There are no idle costs, no servers to maintain, and no capacity planning decisions to make. For agents with variable or unpredictable traffic patterns, serverless architectures keep infrastructure costs proportional to actual usage.
AWS Lambda charges $0.20 per million requests plus $0.0000166667 per GB-second of compute time. A typical AI agent invocation that runs for 2 seconds with 512 MB of memory consumes 1 GB-second and costs approximately $0.0000169 per execution, request fee included. At 10,000 invocations per day, the monthly Lambda bill comes to about $5. Even at 100,000 daily invocations, the bill stays around $50 per month.
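The arithmetic above can be reproduced with a short script. This is a minimal sketch that uses only the two rates quoted in the text; the function names are illustrative, a 30-day month is assumed, and free-tier allowances are ignored.

```python
# Rates quoted in the text; free-tier allowances are ignored.
PRICE_PER_REQUEST = 0.20 / 1_000_000   # USD per request
PRICE_PER_GB_SECOND = 0.0000166667     # USD per GB-second of compute


def lambda_cost_per_invocation(duration_s: float, memory_mb: int) -> float:
    """Return the USD cost of one invocation: compute plus request fee."""
    gb_seconds = duration_s * (memory_mb / 1024)
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST


def lambda_monthly_cost(daily_invocations: int, duration_s: float,
                        memory_mb: int, days: int = 30) -> float:
    """Return the USD cost of a month at a steady daily invocation rate."""
    per_call = lambda_cost_per_invocation(duration_s, memory_mb)
    return daily_invocations * days * per_call


if __name__ == "__main__":
    for daily in (10_000, 100_000):
        cost = lambda_monthly_cost(daily, duration_s=2.0, memory_mb=512)
        print(f"{daily:>7,} invocations/day: ${cost:.2f}/month")
```

With the request fee included, the 10,000-per-day case comes to roughly $5.06 and the 100,000-per-day case to roughly $50.60, so the request fee is what pushes the larger figure slightly past $50.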
Google Cloud Functions and Azure Functions offer comparable pricing. Google charges $0.40 per million invocations plus $0.0000025 per GB-second of memory, with CPU time billed separately. Azure charges $0.20 per million executions plus similar per-second compute rates. The differences between providers are negligible for most agent workloads, and the choice usually comes down to which cloud ecosystem your other services already use.
The hidden cost of serverless architectures is cold start latency. When a function has not been invoked recently, the first call takes 1 to 3 seconds longer while the runtime initializes. For agents that need consistent sub-second response times, provisioned concurrency eliminates cold starts but adds a fixed monthly cost of $10 to $50 depending on the concurrency level you reserve.
API Gateway costs add to the serverless bill. AWS API Gateway charges $3.50 per million API calls for REST APIs, which can exceed the Lambda compute cost for agents with short, lightweight invocations and adds noticeably to the bill at high volume. Alternative approaches like Lambda function URLs or Application Load Balancers reduce this expense to under $1 per million requests.
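Whether the gateway fee outweighs compute depends on how heavy each invocation is. The sketch below compares the two per million requests using the rates quoted in this section; the "light" workload of 100 ms at 128 MB is a hypothetical example, not a figure from the text.

```python
API_GATEWAY_USD_PER_MILLION = 3.50          # quoted in the text
LAMBDA_USD_PER_REQUEST = 0.20 / 1_000_000   # quoted in the text
LAMBDA_USD_PER_GB_SECOND = 0.0000166667     # quoted in the text


def lambda_usd_per_million(duration_s: float, memory_mb: int) -> float:
    """Return the Lambda cost of one million invocations in USD."""
    gb_seconds = duration_s * memory_mb / 1024
    per_call = gb_seconds * LAMBDA_USD_PER_GB_SECOND + LAMBDA_USD_PER_REQUEST
    return per_call * 1_000_000


def gateway_share(duration_s: float, memory_mb: int) -> float:
    """Return the fraction of the combined bill that goes to API Gateway."""
    compute = lambda_usd_per_million(duration_s, memory_mb)
    return API_GATEWAY_USD_PER_MILLION / (API_GATEWAY_USD_PER_MILLION + compute)


if __name__ == "__main__":
    # (label, duration in seconds, memory in MB); the light case is hypothetical.
    for label, duration_s, memory_mb in (("heavy", 2.0, 512), ("light", 0.1, 128)):
        compute = lambda_usd_per_million(duration_s, memory_mb)
        share = gateway_share(duration_s, memory_mb)
        print(f"{label}: Lambda ${compute:.2f}/M, gateway share {share:.0%}")
```

For the 2-second, 512 MB invocation the gateway is a minority of the bill, while for the short, small invocation it dominates, which is the case where a cheaper front door matters most.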
Container-Based Deployments
Containers provide a middle ground between serverless simplicity and dedicated server control. Services like AWS ECS, Google Cloud Run, and Azure Container Instances let you run Docker containers with more predictable performance characteristics than serverless while avoiding the management overhead of full virtual machines.
Google Cloud Run bridges the gap between serverless and containers. It runs containers that scale to zero when idle, meaning you pay nothing during quiet periods, but provides the full container runtime environment including persistent connections, background processes, and custom system libraries. Pricing starts at $0.00002400 per vCPU-second and $0.00000250 per GiB-second, making it comparable to Lambda for bursty workloads while offering more flexibility.
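As a rough illustration of how these two rates combine, the sketch below estimates the cost of a single request. The instance size (1 vCPU, 0.5 GiB) and the `concurrency` parameter are assumptions made for the example, and any per-request fee or free tier is left out.

```python
VCPU_SECOND_USD = 0.000024    # per vCPU-second, quoted in the text
GIB_SECOND_USD = 0.0000025    # per GiB-second, quoted in the text


def cloud_run_cost_per_request(duration_s: float, vcpus: float,
                               memory_gib: float, concurrency: int = 1) -> float:
    """Estimate the USD cost of one request.

    Simplifying assumption: `concurrency` requests overlap fully on one
    instance and therefore split its billed time evenly.
    """
    instance_cost = duration_s * (vcpus * VCPU_SECOND_USD
                                  + memory_gib * GIB_SECOND_USD)
    return instance_cost / concurrency


if __name__ == "__main__":
    monthly_requests = 10_000 * 30  # same volume as the Lambda example
    for concurrency in (1, 10):
        per_request = cloud_run_cost_per_request(2.0, 1.0, 0.5, concurrency)
        print(f"concurrency {concurrency}: "
              f"${per_request * monthly_requests:.2f}/month")
```

Under these assumptions the monthly figure depends heavily on how many requests share an instance, so measured concurrency is worth checking before comparing Cloud Run against Lambda.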
AWS ECS with Fargate charges based on the vCPU and memory resources you allocate. A small agent container using 0.25 vCPU and 0.5 GB memory costs approximately $10 per month running continuously. A more capable setup with 1 vCPU and 2 GB memory runs $40 to $50 per month. These costs are fixed regardless of whether the agent is actively processing requests or sitting idle.
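Because an always-on container is a fixed cost and Lambda is a per-invocation cost, there is a volume at which the two cross. The sketch below estimates that break-even point from figures quoted in this article (the $10 monthly container and the 2-second, 512 MB Lambda invocation); it assumes the small container could actually serve that load, which would need to be verified.

```python
LAMBDA_USD_PER_REQUEST = 0.20 / 1_000_000
LAMBDA_USD_PER_GB_SECOND = 0.0000166667


def break_even_invocations(fixed_monthly_usd: float, duration_s: float,
                           memory_mb: int) -> float:
    """Return the monthly invocation count at which Lambda costs as much
    as a container with the given fixed monthly price."""
    gb_seconds = duration_s * memory_mb / 1024
    per_call = gb_seconds * LAMBDA_USD_PER_GB_SECOND + LAMBDA_USD_PER_REQUEST
    return fixed_monthly_usd / per_call


if __name__ == "__main__":
    monthly = break_even_invocations(10.0, duration_s=2.0, memory_mb=512)
    print(f"break-even: {monthly:,.0f} invocations/month "
          f"({monthly / 30:,.0f} per day)")
```

With these inputs the crossover lands a little under 600,000 invocations per month, or roughly 20,000 per day; below that volume the pay-per-use model is cheaper.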
Kubernetes clusters on managed services like EKS, GKE, or AKS provide the most operational flexibility but carry higher baseline costs. The cluster management fee alone is $70 to $75 per month on AWS EKS, before adding any worker nodes. A minimal production cluster with two small worker nodes runs $150 to $250 per month. Kubernetes becomes cost-justified only when you run multiple services and need the orchestration features it provides.
GPU Instances for Local Inference
Running AI models locally on GPU instances eliminates per-token API charges but introduces significant infrastructure costs. This approach makes economic sense only at high usage volumes where the fixed GPU cost is spread across enough interactions to beat per-token API pricing.
NVIDIA T4 instances represent the entry-level GPU option for AI inference. AWS g4dn.xlarge instances with a single T4 cost approximately $0.526 per hour or $380 per month for continuous use. With 16 GB of memory, T4 GPUs handle smaller models like Llama 3 8B and Mistral 7B, comfortably so when the model is quantized to 8-bit or 4-bit precision, delivering adequate inference speed for agents that do not need frontier-model capabilities.
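NVIDIA A10G instances offer a step up in performance. AWS g5.xlarge instances with a single A10G cost approximately $1.006 per hour or $725 per month. The A10G's 24 GB of memory supports larger models, roughly up to the 30B-parameter class with 4-bit quantization, and delivers faster inference speeds for smaller models, reducing latency for real-time agent interactions. A 70B model such as Llama 3 70B needs about 35 GB for its weights alone at 4-bit precision, so it requires multiple GPUs or a larger card.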
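A rough way to check whether a model fits on a given GPU is to estimate the memory its weights occupy at a chosen precision. The sketch below does that for the T4 (16 GB) and A10G (24 GB); the 20 percent overhead factor for KV cache and activations is an assumption, and real requirements vary with context length and serving stack.

```python
GPU_MEMORY_GB = {"T4": 16, "A10G": 24}   # memory per GPU

OVERHEAD_FACTOR = 1.2  # assumed headroom for KV cache and activations


def weights_gb(params_billions: float, bits_per_param: int) -> float:
    """Return the approximate size of the model weights in GB."""
    return params_billions * bits_per_param / 8


def fits(params_billions: float, bits_per_param: int, gpu: str) -> bool:
    """Return True if weights plus assumed overhead fit in the GPU's memory."""
    needed = weights_gb(params_billions, bits_per_param) * OVERHEAD_FACTOR
    return needed <= GPU_MEMORY_GB[gpu]


if __name__ == "__main__":
    # (parameters in billions, bits per parameter, GPU name)
    for params, bits, gpu in ((8, 16, "T4"), (8, 8, "T4"),
                              (8, 16, "A10G"), (70, 4, "A10G")):
        verdict = "fits" if fits(params, bits, gpu) else "does not fit"
        print(f"{params}B at {bits}-bit on {gpu}: "
              f"{weights_gb(params, bits):.0f} GB weights, {verdict}")
```

By this estimate an 8B model is tight on a T4 at 16-bit precision but comfortable once quantized, and a 70B model at 4-bit needs about 35 GB for weights alone, which is more than a single 24 GB A10G provides.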
NVIDIA A100 and H100 instances provide maximum performance for running the largest open source models. A single A100 instance costs $3 to $4 per hour or $2,200 to $2,900 per month. H100 instances run $4 to $6 per hour or $2,900 to $4,300 per month. These costs are justified only for teams running multiple large models simultaneously or serving very high request volumes that require the fastest possible inference.
Reserved instances and spot pricing reduce GPU costs significantly. One-year reserved instances save 30 to 40 percent compared to on-demand pricing. Spot instances offer 60 to 70 percent savings but can be interrupted with as little as two minutes' notice on AWS, making them suitable for batch processing and non-critical agent tasks but risky for real-time production workloads.
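To see how these discounts change the picture, the sketch below applies them to the g4dn.xlarge hourly rate quoted earlier and then estimates the token volume at which a GPU beats per-token API pricing. The 35 and 65 percent discounts are midpoints of the ranges above, and the API price of $0.50 per million tokens is a hypothetical placeholder, not a quoted rate.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_cost(hourly_usd: float, discount: float = 0.0) -> float:
    """Return the monthly USD cost of an always-on instance after a discount."""
    return hourly_usd * HOURS_PER_MONTH * (1 - discount)


def break_even_tokens(monthly_gpu_usd: float,
                      api_usd_per_million_tokens: float) -> float:
    """Return the monthly token count at which API spend equals GPU spend."""
    return monthly_gpu_usd / api_usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    hourly = 0.526           # g4dn.xlarge on-demand rate from the text
    api_price = 0.50         # hypothetical USD per million tokens
    for label, discount in (("on-demand", 0.0), ("reserved", 0.35),
                            ("spot", 0.65)):
        cost = monthly_cost(hourly, discount)
        tokens = break_even_tokens(cost, api_price)
        print(f"{label}: ${cost:.2f}/month, "
              f"break-even at {tokens / 1e6:,.0f}M tokens/month")
```

The break-even volume only applies if the GPU can actually serve that many tokens in a month at acceptable latency, so throughput should be measured before relying on the result.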
Database and Storage Costs
Every AI agent with memory capabilities needs a database to store conversation history, embeddings, and state. The choice between managed and self-hosted databases creates a significant cost differential, particularly for vector storage.
Managed vector databases like Pinecone start at about $70 per month on paid plans, scaling to $300 or more per month for production workloads with higher storage and throughput needs. Weaviate Cloud offers a free sandbox tier for development and production plans starting at $25 per month. Qdrant Cloud provides a free tier with 1 GB of storage, sufficient for small agent deployments, with paid plans starting at $25 per month.
Self-hosted vector databases using PostgreSQL with pgvector eliminate the separate database cost entirely. If you already run a PostgreSQL instance for other application data, adding vector search capabilities requires only the pgvector extension, which is free and open source. The marginal cost is the additional storage and compute required for vector operations, typically adding $10 to $30 per month to an existing database instance.
Object storage for conversation logs, documents, and artifacts costs $0.023 per GB per month on AWS S3, with similar pricing on other cloud providers. A busy agent generating 100,000 interactions per month with an average log size of 5 KB produces approximately 500 MB of log data monthly, which costs about one cent per month to store. Even with a year of retention, the accumulated 6 GB costs roughly $0.14 per month, so storage costs remain negligible compared to compute and API expenses.
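The storage estimate can be checked with a few lines of arithmetic. This sketch uses the S3 rate and log volume from the text, treats 1 GB as 1,000,000 KB, and ignores request and data transfer charges; the function names are illustrative.

```python
S3_USD_PER_GB_MONTH = 0.023  # quoted in the text


def monthly_log_gb(interactions: int, avg_log_kb: float) -> float:
    """Return the log volume produced per month in GB (decimal units)."""
    return interactions * avg_log_kb / 1_000_000


def storage_bill(months_retained: int, gb_per_month: float) -> float:
    """Return the monthly USD storage bill once `months_retained` months
    of logs have accumulated."""
    return months_retained * gb_per_month * S3_USD_PER_GB_MONTH


if __name__ == "__main__":
    volume = monthly_log_gb(100_000, 5)
    print(f"log volume: {volume:.1f} GB/month")
    for months in (1, 12):
        print(f"bill with {months} month(s) retained: "
              f"${storage_bill(months, volume):.4f}/month")
```

With these inputs the bill starts near one cent per month and reaches about 14 cents per month after a full year of logs has accumulated.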
Monitoring and Observability
Production agents require monitoring to track performance, detect issues, and identify cost optimization opportunities. The cost of observability tooling ranges from free for basic open source solutions to $300 or more per month for comprehensive managed platforms.
LangSmith, the observability platform from LangChain, offers a free tier for individual developers and paid plans starting at $39 per month for teams. It provides trace logging, evaluation frameworks, and cost tracking specifically designed for LLM-powered applications. For agent builders already using the LangChain ecosystem, LangSmith integrates seamlessly.
General-purpose monitoring with Datadog, New Relic, or Grafana Cloud starts at $15 to $25 per host per month. These platforms provide infrastructure monitoring, log aggregation, and custom dashboards. For agents running on serverless or container infrastructure, the monitoring cost scales with the number of services and the volume of data ingested.
Open source monitoring stacks using Prometheus for metrics and Grafana for dashboards eliminate monitoring subscription costs entirely. The tradeoff is the operational overhead of running and maintaining the monitoring infrastructure yourself, which typically requires 2 to 4 hours per month of maintenance for a small deployment.
Start with serverless infrastructure and scale up only when you have the usage data to justify it. Most agents operate well within the cost-effective range of Lambda or Cloud Run, and upgrading to containers or GPU instances later is straightforward when the numbers support it.