Building an AI Stack with Cloud API Keys

Updated May 2026
A cloud API stack uses commercial AI services for model inference while keeping everything else self-hosted: your orchestration, your vector database, your memory, your tools, and your data pipeline all run on your own infrastructure. This hybrid approach gives you access to frontier model quality without GPU hardware, while maintaining control over your data flow, prompt engineering, and agent architecture.

When Cloud APIs Make Sense

Cloud APIs are the right choice when you need frontier model quality that open-source models cannot match for your specific tasks, when you lack the hardware budget for GPU inference, when your usage volume is low enough that per-token pricing is cheaper than hardware investment, or when you need to move fast without spending time on model selection and optimization. Many production AI applications use cloud APIs because the quality difference on complex reasoning, nuanced instruction following, and creative tasks remains significant in 2026.

The hybrid approach (cloud APIs for inference, self-hosted for everything else) preserves most of the control benefits of full self-hosting. Your RAG pipeline processes and stores documents locally, so sensitive content never leaves your infrastructure. Your orchestration logic runs locally, so your agent architecture and prompt engineering are not dependent on any provider's platform. Your conversation history and memory stay in your own database. Only the model prompts and responses transit through the API provider's servers.

This approach also enables multi-provider strategies. You can route simple tasks to cheaper API models (GPT-4o Mini, Claude Haiku) and complex tasks to expensive frontier models (GPT-4o, Claude Opus), optimizing cost without sacrificing quality where it matters. You can even mix cloud APIs with local Ollama models, using local inference for high-volume routine tasks and cloud APIs for occasional complex operations that need frontier quality.

Step 1: Choose Providers and Get API Keys

The major AI API providers as of mid-2026 are OpenAI (GPT-4o family, strong at instruction following and code), Anthropic (Claude family, strong at reasoning and analysis), Google (Gemini family, strong at multimodal tasks), and several others with specialized offerings. Each provider requires creating an account, generating an API key, and adding billing information. Most offer free tiers or trial credits sufficient for initial testing.

Consider having keys from at least two providers for redundancy. API outages happen, and having a fallback provider means your agents keep working when one provider is down. Your orchestration layer can implement automatic failover: try the primary provider, and if it returns an error or times out, retry with the secondary provider. This resilience is a significant advantage of self-hosted orchestration over using a single provider's built-in tools.

Store API keys securely. Never hardcode them in configuration files that might be committed to version control. Use environment variables, Docker secrets, or a dedicated secrets manager. In n8n, store keys as credentials that are encrypted at rest. In custom code, load keys from environment variables at runtime. A leaked API key can generate substantial unauthorized charges before you notice and rotate it.

Step 2: Configure Open WebUI for API Access

Open WebUI supports both Ollama (local models) and OpenAI-compatible API connections simultaneously. Add your cloud API provider through the admin settings by specifying the API base URL and your API key. OpenAI uses https://api.openai.com/v1 as the base URL. Anthropic, Google, and other providers have their own endpoints, though many offer OpenAI-compatible interfaces. Once configured, cloud models appear alongside any local Ollama models in the model selector.

This dual configuration lets users choose between local and cloud models per conversation. Use a local 7B model for routine questions where speed matters and privacy is important. Switch to a cloud frontier model for complex analysis, nuanced writing, or tasks where the quality difference is significant. The conversation history stays in your local Open WebUI database regardless of which model generated the responses.

Step 3: Self-Hosted Embedding and Vector Search

Even when using cloud APIs for generation, keep your embedding and vector search pipeline local. Running embedding models locally through Ollama (nomic-embed-text is the standard choice) means your documents are never sent to cloud APIs for processing. This is particularly important for sensitive data: the embedding model processes your raw documents, but only the resulting numerical vectors (which cannot be reverse-engineered back to the original text) are stored and searched.

Deploy Qdrant locally for vector storage and search, exactly as you would in a fully self-hosted stack. The RAG pipeline works the same way: chunk documents, embed them locally, store vectors in Qdrant, and retrieve relevant chunks at query time. The only difference is that the retrieved chunks are sent to a cloud API (along with the user's query) for response generation rather than to a local model. This means your documents stay local, but the retrieved chunks that are included in prompts do pass through the API provider.

Step 4: Configure n8n with API Credentials

n8n connects to cloud APIs through its OpenAI Chat Model and Anthropic Chat Model credential types. Add your API keys as credentials in n8n's admin interface, then reference these credentials in any AI workflow node. The AI Agent node, LLM Chain node, and Text Classifier node all support cloud API models as their LLM backend. The workflow logic (branching, tool calling, data processing) executes locally in n8n while only the model inference calls transit to the cloud.

For workflows that process sensitive data, implement a data sanitization step before the LLM call. A pre-processing node can strip or mask personally identifiable information, confidential numbers, and proprietary terms before sending the prompt to the cloud API. The response can then be post-processed to restore original values. This pattern reduces the privacy risk of using cloud APIs for sensitive workloads.

Step 5: Implement Cost Controls

Cloud API costs can grow unexpectedly, especially with automated workflows that process data continuously. Implement several layers of cost control: set monthly budget limits in your API provider's dashboard (most providers support spending alerts and hard caps), add rate limiting in your orchestration layer (limit requests per minute and per hour), monitor token usage per workflow and per user, and implement circuit breakers that disable workflows if costs exceed thresholds.

n8n's execution logging provides visibility into how many LLM calls each workflow makes and what their approximate cost is. Build a monitoring workflow that aggregates these logs, calculates daily and weekly spending, and sends alerts when spending exceeds normal patterns. This visibility prevents surprise bills and helps you identify workflows that should be optimized or migrated to cheaper models.

Consider implementing a tiered model strategy within your workflows. Route simple tasks (classification, extraction, yes/no decisions) to the cheapest available model. Route complex tasks (analysis, creative writing, code generation) to mid-tier models. Reserve frontier models for tasks that explicitly require maximum quality. This routing can be automated: a cheap model classifies the task complexity, and the orchestrator routes accordingly. This approach often reduces API costs by 50 to 70 percent compared to using the best model for everything.

Key Takeaway

A cloud API stack gives you frontier model quality with self-hosted control. Keep embedding, vector search, and orchestration local for data privacy and architectural independence. Implement cost controls early because automated workflows can generate significant API charges if left unmonitored.