Code Execution with AutoGen Agents

Updated May 2026

Code execution is one of AutoGen's most powerful features, enabling AI agents to write Python code, run it in sandboxed environments, analyze the output, and iteratively fix errors until the code produces correct results. This capability transforms agents from passive text generators into active problem solvers that can interact with data, APIs, file systems, and computational tools to accomplish real-world tasks.

How Code Execution Works

The code execution pipeline in AutoGen follows a simple but effective pattern. An AssistantAgent receives a task that requires computation, data processing, or interaction with external systems. The agent generates Python code to accomplish the task and embeds it in a message using markdown code block formatting. The UserProxyAgent detects the code block, extracts the Python code, and sends it to the configured execution backend.
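The extraction step can be sketched with the standard library alone. This is an illustration of the idea, not AutoGen's own parser; the name extract_code_blocks and the regular expression are assumptions made for this example.

```python
import re

# Build the fence marker programmatically so this example can itself
# sit inside a fenced block without ending it early.
FENCE = "`" * 3

# Opening fence, optional language tag, newline, body, closing fence.
CODE_BLOCK_RE = re.compile(
    rf"{FENCE}[ \t]*([A-Za-z0-9_+-]*)[ \t]*\r?\n(.*?){FENCE}",
    re.DOTALL,
)


def extract_code_blocks(message: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs for every fenced block in a message."""
    blocks = []
    for lang, code in CODE_BLOCK_RE.findall(message):
        # Normalize the language tag and drop surrounding blank lines.
        blocks.append((lang.lower(), code.strip("\n")))
    return blocks
```

A proxy built this way would typically run only the blocks whose language tag it recognizes and ignore the rest.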

The execution backend runs the code in an isolated environment, captures all output (including print statements, return values, and error tracebacks), and sends the results back to the UserProxyAgent. The proxy formats the output as a conversation message and sends it to the AssistantAgent, which analyzes the results. If the code succeeded, the agent reports the findings. If it failed, the agent reads the error message, diagnoses the problem, generates corrected code, and the cycle repeats.

This iterative debugging loop often resolves common issues within a few attempts. Import errors are fixed by adding the correct import statements. Type errors are corrected by adjusting data types or adding conversions. Logic errors are addressed by examining the output and revising the algorithm. The conversation context helps the agent understand what went wrong because it can see both the code it wrote and the exact error that occurred.
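The generate, run, and retry cycle described above can be sketched as a small loop. This is a minimal illustration that does not use AutoGen's API: the generate callable is a hypothetical stand-in for the model, and run_python and debug_loop are names invented for this example.

```python
import subprocess
import sys
from typing import Callable, Optional


def run_python(code: str, timeout: float = 30.0) -> tuple[bool, str]:
    """Run code in a child interpreter; return (succeeded, combined output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"Timed out after {timeout} seconds"
    return proc.returncode == 0, proc.stdout + proc.stderr


def debug_loop(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> tuple[bool, str, int]:
    """Ask `generate` for code, run it, and feed errors back until it succeeds.

    `generate(task, feedback)` receives None on the first call and the
    previous error output on every later call.
    """
    feedback: Optional[str] = None
    output = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(task, feedback)
        succeeded, output = run_python(code)
        if succeeded:
            return True, output, attempt
        feedback = output  # the traceback becomes context for the next try
    return False, output, max_attempts
```

Capping the number of attempts matters in practice: without a limit, a task the model cannot solve would loop indefinitely.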

Execution Backends

The local executor runs code in a subprocess on the same machine where AutoGen is running. It is the simplest option and requires no additional infrastructure. The subprocess has configurable timeouts, working directories, and environment variables. The local executor is appropriate for development and trusted environments where the code being executed is generated by controlled agents working on known tasks.
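To make the three knobs concrete (timeout, working directory, environment variables), here is a rough standard-library sketch of what a local executor does. It is not AutoGen's implementation; run_locally, ExecutionResult, and the file name snippet.py are assumptions for illustration.

```python
import os
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ExecutionResult:
    exit_code: int
    output: str
    timed_out: bool = False


def run_locally(
    code: str,
    work_dir: str,
    timeout: float = 60.0,
    extra_env: Optional[dict] = None,
) -> ExecutionResult:
    """Write code to a file in work_dir and run it in a subprocess."""
    directory = Path(work_dir)
    directory.mkdir(parents=True, exist_ok=True)
    script = directory / "snippet.py"
    script.write_text(code, encoding="utf-8")

    # Start from the parent's environment and overlay any extra variables.
    env = {**os.environ, **(extra_env or {})}
    try:
        proc = subprocess.run(
            [sys.executable, script.name],
            cwd=directory,      # relative paths in the code resolve here
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout,    # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return ExecutionResult(-1, f"Timed out after {timeout} seconds", True)
    return ExecutionResult(proc.returncode, proc.stdout + proc.stderr)
```

Note that setting a working directory only changes where relative paths resolve. The subprocess still runs with the parent's permissions, which is why this style of execution suits trusted environments only.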

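The Docker executor runs code inside isolated Docker containers, providing much stronger security boundaries. The executor starts a container from a specified image, runs the code, captures the output, and removes the container when it is no longer needed; depending on the executor and AutoGen version, the container is either created fresh for each execution or reused for the duration of a session. The code can reach only the host directories that are explicitly mounted into the container, typically the working directory. Outbound network access is a separate matter: Docker's default bridge network allows it, so it must be disabled or restricted explicitly if the code should not reach external services. Docker execution is the recommended approach for production systems where the code might process untrusted data or when agents are exposed to user-provided inputs that could influence the generated code.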
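As a rough picture of what container execution involves, the sketch below assembles a docker run command line with a disposable container, a single mounted working directory, and networking switched off. It only builds the argument list and does not call Docker. The function name, the /workspace mount point, and the python:3.12-slim image are assumptions for this example, not AutoGen defaults.

```python
from pathlib import Path


def docker_run_command(
    image: str,
    work_dir: str,
    script_name: str,
    memory: str = "512m",
    cpus: float = 1.0,
    allow_network: bool = False,
) -> list[str]:
    """Build a `docker run` argument list for one sandboxed execution."""
    host_dir = Path(work_dir).resolve()
    cmd = [
        "docker", "run",
        "--rm",                                 # remove the container on exit
        "--memory", memory,                     # hard memory cap
        "--cpus", str(cpus),                    # CPU share cap
        "--volume", f"{host_dir}:/workspace",   # the only host path exposed
        "--workdir", "/workspace",
    ]
    if not allow_network:
        cmd += ["--network", "none"]            # no network interfaces at all
    cmd += [image, "python", script_name]
    return cmd
```

The resulting list can be passed to subprocess.run on a machine with Docker installed, for example docker_run_command("python:3.12-slim", "./work", "snippet.py").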

The Azure backend, which in recent AutoGen releases is built on Azure Container Apps dynamic sessions, extends the container model to the cloud by running code in managed containers on Azure infrastructure. This option provides automatic scaling, centralized monitoring, and integration with Azure networking and security controls. It is the right choice for enterprise deployments that need managed infrastructure with compliance and governance features.

Execution can be constrained with resource limits: a maximum execution time (to stop infinite loops), memory caps (to prevent resource exhaustion), and network access policies (to control which external services the code can reach). Timeouts are usually set on the executor itself, while memory and network limits are typically enforced by the container runtime or the operating system, so the local executor offers the fewest controls. These limits protect against both accidental problems, such as a bug that leaks memory, and intentional abuse, such as a prompt injection attack that tries to run malicious code.
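For a sense of how such limits work at the operating-system level, the following sketch applies CPU and memory limits to a child process with Python's resource module. It assumes a Linux host (the resource module is POSIX-only, and address-space limits behave differently on other systems), and run_with_limits is a name invented here rather than an AutoGen function.

```python
import resource
import subprocess
import sys


def _lower_limit(kind: int, value: int) -> None:
    """Lower soft and hard limits to value, never above the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_with_limits(
    code: str,
    cpu_seconds: int = 2,
    memory_bytes: int = 1024 ** 3,
    wall_timeout: float = 10.0,
) -> subprocess.CompletedProcess:
    """Run code in a child interpreter under CPU and address-space limits."""

    def apply_limits() -> None:
        # Runs in the child after fork and before exec, so the limits
        # apply to the executed code and not to the parent process.
        _lower_limit(resource.RLIMIT_CPU, cpu_seconds)
        _lower_limit(resource.RLIMIT_AS, memory_bytes)

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=wall_timeout,  # wall-clock backstop for sleeping or blocked code
    )
```

The CPU limit counts processor time only, so a script that sleeps or waits on I/O never hits it; that is why the wall-clock timeout is kept as a separate control.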

Security Considerations

Executing AI-generated code introduces security risks that must be managed carefully. The most fundamental risk is that the code might do something unintended, whether due to a bug in the LLM's reasoning, a misunderstanding of the task, or a prompt injection attack that manipulates the agent into generating harmful code. Sandboxing mitigates this risk by limiting what the code can access and do; it does not remove the risk entirely.

Filesystem isolation prevents code from reading or modifying files outside a designated working directory. Network restrictions can block outbound connections entirely or limit them to approved endpoints. Process isolation ensures that the executing code cannot interfere with the AutoGen framework itself or other agents running on the same system. Resource limits prevent denial-of-service through excessive CPU, memory, or disk usage.

The human-in-the-loop option is the strongest safety mechanism. When the UserProxyAgent is configured to require human approval, every code execution must be explicitly approved by a human operator who reviews the code before it runs. This is appropriate for sensitive environments where the consequences of executing incorrect code are significant, such as database modification, financial transactions, or infrastructure management.
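The approval step amounts to a gate in front of the executor. The sketch below shows the shape of such a gate; execute_with_approval, default_prompt, and the rejection message are hypothetical names chosen for this example and are not part of AutoGen.

```python
from typing import Callable


def default_prompt(code: str) -> bool:
    """Show the code on the console and ask the operator to approve it."""
    print("The agent wants to run the following code:\n")
    print(code)
    answer = input("\nApprove execution? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


def execute_with_approval(
    code: str,
    execute: Callable[[str], str],
    approve: Callable[[str], bool] = default_prompt,
) -> str:
    """Run code through `execute` only if `approve` returns True."""
    if not approve(code):
        # The rejection is returned as text so the agent can see it
        # in the conversation and respond to it.
        return "Execution rejected by human reviewer."
    return execute(code)
```

Defaulting to rejection on anything other than an explicit yes keeps an accidental keypress from approving code.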

Defense in depth is the recommended approach: use Docker or container isolation as the primary security boundary, add resource limits as a secondary control, implement network restrictions to contain the blast radius, and enable human approval for operations that affect critical systems. No single control is sufficient by itself, but the combination provides robust protection for production deployments.

Best Practices for Code Execution

Writing effective system messages is critical for code execution quality. The AssistantAgent's system message should specify the preferred coding style, required imports, error handling expectations, and output format. Clear instructions like "always include error handling for file operations" and "print intermediate results for debugging" help the agent generate code that works correctly on the first attempt more often.
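One way to keep these four elements consistent across agents is to assemble the system message from named parts. The helper below is a hypothetical sketch; the wording of each line is an assumption and should be adapted to the task at hand.

```python
def build_system_message(
    coding_style: str,
    required_imports: list[str],
    error_handling: str,
    output_format: str,
) -> str:
    """Assemble a system message covering style, imports, errors, and output."""
    lines = [
        "You are a Python coding assistant. Put all code in one fenced python block.",
        f"Coding style: {coding_style}",
        "Start every script with these imports: " + ", ".join(required_imports),
        f"Error handling: {error_handling}",
        f"Output format: {output_format}",
    ]
    return "\n".join(lines)


SYSTEM_MESSAGE = build_system_message(
    coding_style="small functions with type hints",
    required_imports=["import pandas as pd", "import json"],
    error_handling="always include error handling for file operations",
    output_format="print intermediate results for debugging, then a final summary line",
)
```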

Pre-installing common packages in the execution environment avoids frequent import errors. If agents routinely work with pandas, numpy, matplotlib, or requests, including these packages in the base image or local environment eliminates a common class of first-attempt failures and reduces the number of debugging iterations needed.
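A quick way to check an environment before handing it to agents is to ask the import system which modules it can find. This sketch uses only the standard library; missing_packages is a name made up for this example, and it expects top-level import names (which can differ from the names used to install a package).

```python
import importlib.util


def missing_packages(import_names: list[str]) -> list[str]:
    """Return the top-level import names that cannot be found."""
    return [
        name
        for name in import_names
        if importlib.util.find_spec(name) is None
    ]
```

Running missing_packages(["pandas", "numpy", "matplotlib", "requests"]) inside the execution environment, or as a step in the image build, lists anything that still needs to be installed.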

Setting appropriate timeout values requires balancing reliability against cost. If the timeout is too short, legitimate long-running computations are terminated prematurely. If it is too long, infinite loops or deadlocked code consume resources unnecessarily. A good starting point is 60 seconds for most operations, with longer timeouts for known heavy computations like model training or large data processing.

Logging all code executions and their results provides an audit trail that is essential for debugging, compliance, and continuous improvement. The execution logs show what code was generated, what output it produced, how many iterations were needed to reach a correct result, and what errors occurred along the way. This data helps teams identify patterns in agent behavior and refine system messages to improve first-attempt success rates.
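An audit trail of this kind can be as simple as one JSON record per execution appended to a file. The sketch below assumes a JSON Lines layout; the field names and both function names are illustrative choices, not a format defined by AutoGen.

```python
import json
import time
from pathlib import Path


def log_execution(
    log_path: str,
    task_id: str,
    attempt: int,
    code: str,
    output: str,
    succeeded: bool,
) -> None:
    """Append one execution record to a JSON Lines log file."""
    record = {
        "timestamp": time.time(),
        "task_id": task_id,
        "attempt": attempt,
        "code": code,
        "output": output,
        "succeeded": succeeded,
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def attempts_per_task(log_path: str) -> dict[str, int]:
    """Return the highest attempt number recorded for each task."""
    counts: dict[str, int] = {}
    with Path(log_path).open(encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            task = record["task_id"]
            counts[task] = max(counts.get(task, 0), record["attempt"])
    return counts
```

Tasks that regularly need several attempts are good candidates for a more specific system message.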

Key Takeaway

AutoGen's code execution capability transforms AI agents from text generators into active problem solvers that write, run, and debug code iteratively. Docker-based sandboxing provides the primary security boundary for production use, while the iterative debugging loop handles common errors automatically. Because no single control is sufficient, security should follow defense-in-depth principles that combine isolation, resource limits, network restrictions, and human approval for critical operations.