How to Build an Agent with the OpenAI Agents SDK

Updated May 2026
This tutorial walks you through building an AI agent using the OpenAI Agents SDK. Unlike Claude's batteries-included approach, OpenAI's SDK gives you three primitives (agents, handoffs, and guardrails) and expects you to compose them into the architecture your application needs. You will define custom tools, create specialized agents, set up guardrails for safety, implement multi-agent handoffs, and configure tracing for production observability.

The OpenAI Agents SDK is available on PyPI and works with Python 3.9+. This tutorial uses Python, as the TypeScript SDK has partial feature coverage for the newer capabilities.

Step 1: Install and Configure

Install the openai-agents package from PyPI using pip; the SDK is imported in Python as agents. Set your OpenAI API key as the OPENAI_API_KEY environment variable. Create a project structure with separate files for tool definitions, agent configurations, and the main entry point. This separation keeps your code organized as the agent system grows.
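A small startup check can catch a missing key before the first model call. The sketch below is only an illustration: the helper name require_api_key is hypothetical and not part of the SDK.

```python
import os


def require_api_key(env=None) -> str:
    """Return the OpenAI API key, or fail fast with a clear message.

    `env` defaults to the process environment; passing a dict makes the
    check easy to test. Illustrative helper, not part of the SDK.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the agent."
        )
    return key
```

Calling require_api_key() at the top of the main entry point turns a confusing authentication failure deep inside a run into an immediate, readable error.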

Choose your model based on the task. GPT-5.5 provides the highest reasoning capability for complex tasks. GPT-5.2-Codex is optimized for coding workflows and is more affordable. You specify the model when creating the agent, and each agent in a multi-agent system can use a different one.

Step 2: Define Custom Tools

In the OpenAI Agents SDK, tools are Python functions wrapped with the function_tool decorator. Define a function that performs the action, write a docstring that tells the model when to use the tool, and declare the inputs with type hints; Pydantic models and typed dictionaries work for structured arguments. The decorator builds the input schema from the signature and handles the marshalling between the model's JSON tool call format and your Python function's parameters.

Start with simple tools that handle a single, well-defined operation. A file reader tool, a web search tool, and a calculation tool are good starting points. Each tool should have clear documentation in its description because the model uses this text to decide when to call the tool. Vague descriptions lead to incorrect tool selection.
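A single-purpose tool might look like the sketch below. The word_count function is a made-up example; function_tool is the SDK's decorator, applied here as a plain call inside a helper so that the underlying function can still be exercised without the SDK installed.

```python
def word_count(text: str) -> int:
    """Count the whitespace-separated words in a piece of text.

    Use this when the user asks how long a passage is.

    Args:
        text: The passage to measure.
    """
    return len(text.split())


def as_sdk_tool():
    """Wrap the plain function as an SDK tool (requires openai-agents)."""
    # Imported lazily so word_count stays testable without the SDK.
    from agents import function_tool

    return function_tool(word_count)
```

The docstring does double duty here: it documents the function for people and supplies the description the model reads when deciding whether to call the tool.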

Test each tool in isolation before attaching it to an agent. Call the function directly with sample inputs to verify it handles edge cases, errors, and unexpected inputs gracefully. Tools that crash or return confusing error messages will degrade the agent's performance.
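As a minimal sketch of isolated testing, the hypothetical safe_divide tool below returns an explicit error message for bad input, and a few direct calls check normal and edge cases with no agent involved. Returning a message rather than raising is one possible design choice, not a requirement of the SDK.

```python
def safe_divide(numerator: float, denominator: float) -> str:
    """Divide two numbers and return the result as text.

    Returns a short, explicit error message instead of raising, so the
    model receives something it can act on.
    """
    if denominator == 0:
        return "error: denominator must not be zero"
    return str(numerator / denominator)


def run_tool_checks() -> None:
    """Direct calls with normal and edge-case inputs."""
    assert safe_divide(6, 3) == "2.0"
    assert safe_divide(-9, 3) == "-3.0"
    assert safe_divide(1, 0) == "error: denominator must not be zero"


run_tool_checks()
```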

Step 3: Build the Agent

Create an agent by specifying the model, system instructions, and available tools. The system instructions define the agent's persona, capabilities, and behavioral constraints. Be specific about what the agent should and should not do, what quality standards to apply, and how to handle ambiguous situations.

Run the agent with a test task. The SDK handles the agent loop: sending the task to the model, routing tool calls to your functions, passing results back to the model, and repeating until the model indicates the task is complete. Observe the execution to verify the agent uses tools appropriately and produces quality results.
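The sketch below shows one way to assemble specific instructions and run a single task. Agent, Runner.run_sync, and final_output come from the SDK; build_instructions, the agent name, and the example rules are illustrative assumptions, and the model identifier is left as a parameter for you to supply.

```python
def build_instructions(persona: str, must: list, must_not: list) -> str:
    """Assemble system instructions from a persona plus explicit rules."""
    lines = [persona, "", "Always:"]
    lines += [f"- {rule}" for rule in must]
    lines += ["", "Never:"]
    lines += [f"- {rule}" for rule in must_not]
    return "\n".join(lines)


def build_agent(model: str, tools: list):
    """Create the agent (requires openai-agents)."""
    from agents import Agent  # lazy import keeps the helper above testable

    return Agent(
        name="Research assistant",  # hypothetical name
        instructions=build_instructions(
            "You are a careful research assistant.",
            must=["cite the tool output you relied on",
                  "ask a clarifying question when the task is ambiguous"],
            must_not=["guess at facts you could look up with a tool"],
        ),
        model=model,
        tools=tools,
    )


def run_task(agent, task: str) -> str:
    """Run one task through the SDK's agent loop and return the final text."""
    from agents import Runner

    result = Runner.run_sync(agent, task)
    return str(result.final_output)
```

Keeping the rules in lists makes it easy to adjust one constraint at a time while you observe how the agent's behavior changes.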

If the agent makes poor tool choices, refine the tool descriptions. If it produces low-quality results, refine the system instructions. Most agent quality issues stem from unclear instructions or poorly described tools rather than from model limitations.

Step 4: Add Guardrails

Guardrails validate agent behavior at three points: input (what the user sends), output (what the agent returns), and tool calls (what arguments the agent passes to tools). Define guardrail functions that check for safety issues, quality problems, or policy violations.

A common pattern is using a smaller, cheaper model as a guardrail evaluator. The primary agent uses GPT-5.5 for complex reasoning, while a GPT-5.2 instance evaluates the output for safety and accuracy before it reaches the user. This layered approach keeps costs reasonable while maintaining safety standards.

Implement at minimum an output guardrail that checks for harmful content, personally identifiable information, and obvious factual errors. Add input guardrails if your agent handles untrusted user input that could contain injection attempts or adversarial prompts.
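A minimal output guardrail for personally identifiable information could be sketched as follows. The regular expressions are deliberately simple assumptions that will miss many real-world formats; find_pii is a hypothetical helper, while output_guardrail and GuardrailFunctionOutput are SDK names.

```python
import re

# Illustrative patterns only; production PII detection needs far more care.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text: str) -> dict:
    """Return the matches for each PII pattern found in the text."""
    hits = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits


def make_pii_guardrail():
    """Build an SDK output guardrail around find_pii (requires openai-agents)."""
    from agents import GuardrailFunctionOutput, output_guardrail

    @output_guardrail
    async def pii_guardrail(ctx, agent, output):
        hits = find_pii(str(output))
        # Tripping the wire stops the response before it reaches the user.
        return GuardrailFunctionOutput(
            output_info=hits, tripwire_triggered=bool(hits)
        )

    return pii_guardrail
```

The returned guardrail would then be passed to the agent through its output_guardrails list. Checks for harmful content or factual errors generally need a model-based evaluator rather than patterns like these.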

Step 5: Implement Handoffs

Handoffs let one agent delegate work to a specialist. Create multiple agents with different system instructions and tool sets. For example, a triage agent determines what the user needs and hands off to a research agent, a coding agent, or a data analysis agent based on the request.

Configure handoffs by registering specialist agents as handoff targets on the primary agent. The primary agent decides when to hand off based on its system instructions. Include clear criteria in the instructions about what types of tasks should be delegated to which specialist.
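The registration step might look like the sketch below, which also generates the triage criteria from a single table so the instructions and the handoff targets cannot drift apart. The specialist names and criteria are invented for illustration; Agent, handoffs, and handoff_description are SDK names.

```python
# Hypothetical specialists and the routing criteria for each.
SPECIALISTS = {
    "Research agent": "questions that need sources or background reading",
    "Coding agent": "requests to write, review, or debug code",
    "Data analysis agent": "requests to summarize or chart a dataset",
}


def triage_instructions(specialists: dict) -> str:
    """Spell out which kinds of task go to which specialist."""
    lines = ["Hand each request off to exactly one specialist:"]
    for name, criteria in specialists.items():
        lines.append(f"- {name}: {criteria}")
    return "\n".join(lines)


def build_triage_agent(model: str):
    """Create the specialists and a triage agent (requires openai-agents)."""
    from agents import Agent

    targets = [
        Agent(
            name=name,
            handoff_description=criteria,
            instructions=f"You handle {criteria}.",
            model=model,
        )
        for name, criteria in SPECIALISTS.items()
    ]
    return Agent(
        name="Triage agent",
        instructions=triage_instructions(SPECIALISTS),
        handoffs=targets,
        model=model,
    )
```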

Test the handoff flow end-to-end with representative tasks. Verify that the triage agent routes correctly, that the specialist receives adequate context, and that the final response reaches the user. Common issues include lost context during handoff (the specialist does not have enough information) and incorrect routing (the wrong specialist receives the task).
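Routing checks can be expressed as a table of representative tasks and the specialist each should reach. In this sketch, check_routing is a hypothetical harness that accepts any routing function, and route_with_agent adapts a real triage agent by reading the last_agent attribute of the SDK's run result.

```python
def check_routing(route, cases):
    """Return (task, expected, actual) for every misrouted case.

    route: callable mapping a task string to a specialist name.
    cases: iterable of (task, expected_specialist_name) pairs.
    """
    failures = []
    for task, expected in cases:
        actual = route(task)
        if actual != expected:
            failures.append((task, expected, actual))
    return failures


def route_with_agent(triage_agent):
    """Adapt a triage agent to the `route` interface (requires openai-agents)."""
    from agents import Runner

    def route(task: str) -> str:
        result = Runner.run_sync(triage_agent, task)
        # last_agent is the agent that produced the final response.
        return result.last_agent.name

    return route
```

Because check_routing only needs a callable, the same cases can run against a stub during development and against the live agent before release.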

Step 6: Enable Tracing and Deploy

Enable the SDK's built-in tracing to capture every execution step. Traces record model calls, tool invocations, handoffs, guardrail evaluations, and timing data. View traces in OpenAI's dashboard to debug issues and optimize performance. Export traces to your own observability platform for centralized monitoring.

For production deployment, add error handling around the agent loop. Catch API errors, rate limit responses, and tool execution failures. Implement retry logic with exponential backoff for transient errors. Set execution limits (maximum turns, token budget, wall-clock timeout) to prevent runaway agents.
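Retry with exponential backoff can be sketched in a few lines of standard-library Python. TransientError is a stand-in exception defined here for illustration; in practice you would pass the retryable exception types raised by your API client through retry_on.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for retryable failures such as rate limits or timeouts."""


def call_with_backoff(fn, *, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(TransientError,), sleep=time.sleep):
    """Call fn(), retrying retryable errors with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Wrapping the agent run in a function and passing it to call_with_backoff covers transient failures. For the turn limit, the SDK's Runner methods accept a max_turns argument; token budgets and wall-clock timeouts need their own checks around the run.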

Use the sandbox execution feature for agents that run code or execute commands. The sandbox provides an isolated environment where the agent can operate freely without risking the host system. Configure the sandbox with appropriate resource limits and network access controls for your use case.

Traces from production can feed into OpenAI's fine-tuning pipeline, creating a feedback loop where your agent's real-world behavior improves the underlying model. This is a unique advantage of the OpenAI ecosystem that can significantly improve agent quality over time.

Next Steps After Your First Agent

Once your basic agent is running, expand its capabilities in several directions. Multi-agent composition lets you create teams of specialized agents that handle different aspects of complex workflows. A triage agent evaluates incoming requests and routes them to specialists, each with their own tools, guardrails, and model configuration. The handoff system maintains context across these transitions, so the specialist has enough information to handle the delegated task without asking the user to repeat themselves.

Tool refinement is an ongoing process that improves agent quality more than any other single change. As you observe your agent in production, you will discover that some tools are called too frequently, others not enough, and some produce results that confuse the model. Refine tool descriptions to be more specific about when each tool should be used. Add concrete examples to the description if the model consistently makes poor tool choices. Remove tools that the agent never needs for your specific use case, because fewer tools mean faster and more accurate tool selection.
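Finding overused and unused tools can start with a simple count. The sketch below assumes you have already extracted a flat list of tool names, one per invocation, from your traces; that input shape and the tool_usage_report helper are hypothetical, not an SDK export format.

```python
from collections import Counter


def tool_usage_report(tool_calls, registered_tools):
    """Summarize how often each registered tool was called.

    tool_calls: iterable of tool names, one entry per invocation.
    registered_tools: names of the tools attached to the agent.
    """
    counts = Counter(tool_calls)
    return {
        "counts": {name: counts[name] for name in registered_tools},
        # Candidates for removal or for a clearer description.
        "never_called": [n for n in registered_tools if counts[n] == 0],
        # Names seen in traces that are no longer registered.
        "unregistered": sorted(set(counts) - set(registered_tools)),
    }
```

A tool that appears under never_called across a representative period is either unnecessary or described too vaguely for the model to select it.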

The tracing data you collect feeds into a continuous improvement cycle. Analyze traces to identify patterns where the agent takes unnecessary steps, makes incorrect tool calls, or produces suboptimal output. Use these insights to refine system instructions, tool descriptions, and guardrail rules. Over time, this data can feed into OpenAI's fine-tuning pipeline, creating custom models that are specifically optimized for your agent's workflow and producing better results with lower latency and cost. This trace-to-fine-tune loop is unique to the OpenAI ecosystem and represents a significant long-term advantage for teams that invest in structured tracing from the start.

Key Takeaway

Building with the OpenAI SDK requires more initial setup than Claude's batteries-included approach, but the primitives-first design means you understand every component in your system, and the tracing-to-fine-tuning pipeline creates a unique path for continuous improvement.