Tracing AI Agent Decision Making
From Microservice Tracing to Agent Tracing
Distributed tracing was invented to follow a request through a network of microservices, where a single API call might touch ten different services before returning a response. The core abstraction is the span: a named, timed operation with metadata. Spans nest to form a tree, where parent spans represent higher-level operations and child spans represent the sub-operations they trigger. A trace is the complete tree for one request, and it lets you see both the structure (what called what) and the timing (where the time went) in a single view.
AI agents are a natural fit for this model because they are already structured as a sequence of operations: receive input, plan, call a tool, evaluate the result, call another tool, synthesize the output. Each of these is a span. The agent framework is the equivalent of the API gateway that initiates the trace, and each LLM call or tool invocation is the equivalent of a downstream service call. The trace shows you the complete decision tree for one task, with timing and metadata at every node.
The key extension for agents is that standard microservice spans carry structural metadata (service name, endpoint, status code), while agent spans must also carry reasoning metadata. An LLM call span in a traditional system might record only the model name and the latency. An LLM call span in an agent trace also needs to capture a summary of the prompt, the model's output (or a summary of it), the tool call the model decided to make, and the token counts. A tool call span needs the arguments, the response, and the error details if it failed. Without this reasoning context, the trace tells you what happened: the agent made three LLM calls and two tool calls. It does not tell you why: the first tool call returned an empty result, so the model reformulated the query; the second call returned a useful result, which the model then synthesized into the final answer. The why is what makes traces invaluable for debugging.
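As a minimal sketch of the difference, the Python dataclass below models a span that carries both structural and reasoning metadata. The `Span` class, its field names, and attribute keys such as `prompt_summary` and `decision` are illustrative assumptions, not the schema of any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
import uuid


@dataclass
class Span:
    """A hypothetical span record; all field names are assumptions."""
    name: str
    kind: str                      # e.g. "task", "llm_call", "tool_call"
    trace_id: str
    start_ms: float
    end_ms: float
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    attributes: Dict[str, Any] = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


# An LLM call span: structural fields plus reasoning metadata.
llm_span = Span(
    name="plan_next_step", kind="llm_call", trace_id="trace-1",
    start_ms=0.0, end_ms=450.0,
    attributes={
        "model": "example-model",  # placeholder model name
        "prompt_summary": "user asks for current pricing",
        "output_summary": "search results lack pricing; query the database",
        "decision": {"tool": "database_lookup"},
        "prompt_tokens": 900,
        "output_tokens": 300,
    },
)

# A tool call span linked to the LLM call that triggered it.
tool_span = Span(
    name="database_lookup", kind="tool_call", trace_id="trace-1",
    start_ms=450.0, end_ms=700.0, parent_id=llm_span.span_id,
    attributes={"arguments": {"table": "pricing"}, "response": "[]", "error": None},
)
```

The structural fields answer what happened and when; the `attributes` dictionary is where the why is recorded.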
Designing Spans for Agent Operations
The root span covers the entire task from the moment the user's input arrives to the moment the agent returns its final response. Its metadata should include the session ID, the user's input (or a hash for privacy), the final output, the overall outcome (success, failure, partial), and aggregate metrics like total tokens, total cost, and total step count. The root span is the entry point for investigation: when you see a failed task in your dashboard, the root span is the first thing you open.
Under the root span, create a child span for each LLM call. This is where the agent's reasoning happens, so these spans carry the most diagnostic value. At minimum, each LLM call span should include the model name and version, the prompt token count and output token count, the latency, the model's output (full text or a truncated version depending on your privacy and storage constraints), and any structured decisions the model made, such as which tool to call or whether to return a final answer. If the model produces chain-of-thought reasoning, capture it here: it is often the only record of why the model chose one action over another, and post-hoc debugging is much harder without it.
Create a child span for each tool invocation, nested under the LLM call span that triggered it. The tool call span should include the tool name, the arguments the model passed (which often reveal misunderstandings or formatting errors), the tool's response (truncated if necessary), the HTTP status code if the tool is an API, the latency, and the outcome (success, failure, retry). When a tool call fails and the agent retries, the retry should be a sibling span under the same parent LLM call, so the trace clearly shows the sequence of attempts.
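The nesting described above (root span, LLM call spans beneath it, tool call spans beneath the LLM call that triggered them, retries as siblings) can be sketched with a small in-memory tracer. The `Tracer` class and its dictionary-based span records are hypothetical and exist only for illustration; a production system would more likely use an established tracing SDK.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Hypothetical in-memory tracer that derives nesting from a stack."""

    def __init__(self):
        self.spans = []    # every span, in the order it was started
        self._stack = []   # spans that are currently open

    @contextmanager
    def span(self, name, kind, **attributes):
        record = {
            "id": len(self.spans),
            "parent": self._stack[-1]["id"] if self._stack else None,
            "name": name,
            "kind": kind,
            "attributes": dict(attributes),
            "start": time.monotonic(),
        }
        self.spans.append(record)
        self._stack.append(record)
        try:
            yield record
        except Exception as exc:
            # Record the failure on the span, then let the error propagate.
            record["attributes"]["outcome"] = "failure"
            record["attributes"]["error"] = repr(exc)
            raise
        finally:
            record["end"] = time.monotonic()
            self._stack.pop()


tracer = Tracer()
with tracer.span("task", "task", session_id="s-1") as root:
    with tracer.span("llm_call_1", "llm_call", model="example-model") as llm:
        llm["attributes"]["decision"] = "call the search tool"
        with tracer.span("search", "tool_call", arguments={"q": "pricing"}) as first:
            first["attributes"]["outcome"] = "failure"
        # The retry is a sibling of the failed attempt under the same LLM call.
        with tracer.span("search", "tool_call", arguments={"q": "pricing"},
                         retry_of=first["id"]) as retry:
            retry["attributes"]["outcome"] = "success"
    root["attributes"]["outcome"] = "success"
```

Because the parent is taken from the top of the stack, the failed attempt and its retry both end up with the LLM call as their parent, which keeps the sequence of attempts visible in the trace.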
For agents that use memory or retrieval, create spans for memory reads and writes. A memory read span should include the query used for retrieval, the number of results returned, and the relevance scores if available. This level of detail is what lets you diagnose retrieval failures, where the agent could not find the information it needed, separately from reasoning failures, where the agent had the information but misinterpreted it.
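As a rough sketch of how that metadata supports triage, the helper below flags a memory read span whose retrieval looks like it came back empty-handed. The span layout, the `looks_like_retrieval_failure` name, and the `min_score` threshold of 0.5 are all assumptions made for illustration; a sensible threshold depends on the retrieval system in use.

```python
def looks_like_retrieval_failure(memory_span, min_score=0.5):
    """Heuristic: True if the read returned nothing, or nothing relevant enough.

    `min_score` is a hypothetical cutoff, not a recommended value.
    """
    attrs = memory_span["attributes"]
    if attrs["result_count"] == 0:
        return True
    scores = attrs.get("relevance_scores")
    if not scores:
        # No scores were recorded, so the span gives no further evidence.
        return False
    return max(scores) < min_score


empty_read = {"attributes": {"query": "refund policy", "result_count": 0}}
weak_read = {"attributes": {"query": "refund policy", "result_count": 2,
                            "relevance_scores": [0.21, 0.18]}}
good_read = {"attributes": {"query": "refund policy", "result_count": 2,
                            "relevance_scores": [0.91, 0.40]}}
```

If a failed task's memory reads all pass this check, the investigation can move on to the LLM call spans, since the agent apparently had relevant information and the problem is more likely in how it was used.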
Visualizing Agent Traces
A trace visualization turns the span tree into a visual representation that humans can quickly parse. The two most common views are the waterfall and the tree, and each reveals different things about agent behavior.
The waterfall view lays spans out on a timeline, with the horizontal axis representing time and the vertical axis showing nesting depth. This immediately reveals bottlenecks: a tool call span that stretches across most of the timeline is obviously the dominant source of latency, while a cluster of short LLM call spans in quick succession suggests the agent is reasoning efficiently. The waterfall also reveals parallelism or the lack of it: if the agent could have called two tools simultaneously but instead called them sequentially, the wasted time is visually obvious.
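A text-only sketch of a waterfall makes the idea concrete. The `render_waterfall` function and the span fields it reads (`name`, `depth`, `start_ms`, `end_ms`) are assumptions for illustration; real trace viewers render the same information graphically.

```python
def render_waterfall(spans, width=40):
    """Draw each span as a bar positioned and sized by its timing."""
    t0 = min(s["start_ms"] for s in spans)
    t1 = max(s["end_ms"] for s in spans)
    total = (t1 - t0) or 1.0  # avoid dividing by zero for instantaneous traces
    lines = []
    for s in spans:
        offset = int((s["start_ms"] - t0) * width / total)
        length = max(1, int((s["end_ms"] - s["start_ms"]) * width / total))
        label = "  " * s["depth"] + s["name"]  # indent by nesting depth
        lines.append(f"{label:<24}|{' ' * offset}{'#' * length}")
    return "\n".join(lines)


example_spans = [
    {"name": "task",       "depth": 0, "start_ms": 0,   "end_ms": 1000},
    {"name": "llm_call_1", "depth": 1, "start_ms": 0,   "end_ms": 200},
    {"name": "search",     "depth": 2, "start_ms": 200, "end_ms": 900},
    {"name": "llm_call_2", "depth": 1, "start_ms": 900, "end_ms": 1000},
]
print(render_waterfall(example_spans))
```

In this made-up trace the `search` bar covers 700 of the 1000 milliseconds, so the tool call, not the model, is where the time went.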
The tree view shows the parent-child relationships between spans, emphasizing the decision structure rather than the timing. Walking through a tree for a failed task, you can follow the agent's reasoning step by step: it received the input, it decided to search, the search returned results, it evaluated them and decided they were insufficient, it reformulated the query, the second search returned better results, it synthesized the answer. Each node in the tree is a decision point, and the metadata at each node tells you what information the agent had and what it chose to do with it.
For agent traces specifically, annotating spans with the model's reasoning text transforms the visualization from a performance diagram into a reasoning replay. Instead of seeing "LLM call, 450ms, 1200 tokens," you see "LLM call: decided to call the database tool because the search results did not contain pricing information." This annotation is what makes trace investigation feel like reading the agent's thought process rather than analyzing a timing chart, and it is the single most valuable extension of standard tracing for agent systems.
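The tree view with reasoning annotations can be sketched in a few lines. The `render_tree` function and the `reasoning` field are hypothetical names; the point is only that attaching the model's reasoning text to each node turns a timing listing into something closer to a replay.

```python
from collections import defaultdict


def render_tree(spans):
    """Print spans as an indented tree, appending reasoning text when present."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent"]].append(s)

    lines = []

    def walk(parent_id, depth):
        for s in children[parent_id]:
            note = s.get("reasoning")
            suffix = f": {note}" if note else ""
            lines.append(f"{'  ' * depth}{s['name']} ({s['duration_ms']}ms){suffix}")
            walk(s["id"], depth + 1)

    walk(None, 0)  # root spans are those with no parent
    return "\n".join(lines)


trace = [
    {"id": 1, "parent": None, "name": "task", "duration_ms": 1000},
    {"id": 2, "parent": 1, "name": "llm_call", "duration_ms": 450,
     "reasoning": "decided to call the database tool because the search "
                  "results did not contain pricing information"},
    {"id": 3, "parent": 2, "name": "database_lookup", "duration_ms": 250},
]
print(render_tree(trace))
```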
Sampling Strategies
Tracing every task at full fidelity is expensive. A single agent task can generate dozens of spans, each with substantial metadata, and at scale the storage and indexing costs of full-fidelity tracing become prohibitive. Sampling, the practice of tracing only a subset of tasks, is the standard mitigation, but the choice of sampling strategy significantly affects how useful the resulting data is.
Head-based sampling makes the trace/no-trace decision at the start of the task, before any processing happens. You might trace ten percent of tasks at random, or all tasks from a specific user, or all tasks of a particular type. Head-based sampling is simple and predictable: you know in advance what fraction of your traffic will be traced and can budget storage accordingly. Its weakness is that it cannot know at the start which tasks will be interesting. A task that starts normally but fails at step eight will only be traced if it happened to be selected at the start. With a random ten percent sampling rate, roughly ninety percent of failures will go unrecorded, because a failing task is no more likely to be selected than any other.
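One common way to implement a random head-based decision is to hash a task identifier into a number between 0 and 1 and compare it with the sampling rate. The sketch below assumes each task has a string `task_id`; the function name `head_sample` is hypothetical.

```python
import hashlib


def head_sample(task_id: str, rate: float) -> bool:
    """Decide at task start whether to trace, using only the task ID."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


task_ids = [f"task-{i}" for i in range(10_000)]
traced = [t for t in task_ids if head_sample(t, rate=0.10)]
fraction_traced = len(traced) / len(task_ids)
```

Hashing rather than calling a random number generator makes the decision repeatable: any component that sees the same task ID reaches the same answer without coordination. The weakness described above is unchanged, since the decision still ignores how the task turns out.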
Tail-based sampling records all spans for every task into a temporary buffer but only persists the complete trace if the task meets some criterion after it completes: it failed, it took too long, it exceeded a cost threshold, or it was randomly selected from the successes. Tail-based sampling guarantees you will have the full trace for every interesting task because the decision is made after the task's outcome is known. The trade-off is that you must buffer all spans in memory or temporary storage during task execution, which uses more resources, and the buffering infrastructure adds complexity. For agent systems, tail-based sampling is almost always worth the added complexity because agent failures are relatively rare and each one is expensive to investigate without a trace.
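The buffering step can be sketched as follows. `TailBuffer`, its method names, and the `should_keep` and `persist` callbacks are assumptions for illustration; the sketch keeps everything in process memory and ignores concerns such as buffer size limits and tasks that never finish.

```python
from collections import defaultdict


class TailBuffer:
    """Hold spans per trace until the task ends, then keep or drop them."""

    def __init__(self, should_keep, persist):
        self._should_keep = should_keep   # callable: outcome dict -> bool
        self._persist = persist           # callable: (trace_id, spans) -> None
        self._pending = defaultdict(list)

    def record(self, trace_id, span):
        self._pending[trace_id].append(span)

    def finish(self, trace_id, outcome):
        # The buffered spans are removed either way once the task completes.
        spans = self._pending.pop(trace_id, [])
        if self._should_keep(outcome):
            self._persist(trace_id, spans)
            return True
        return False


stored = {}
buffer = TailBuffer(
    should_keep=lambda outcome: outcome["status"] == "failure",
    persist=lambda trace_id, spans: stored.update({trace_id: spans}),
)

buffer.record("t1", {"name": "llm_call"})
buffer.record("t1", {"name": "search"})
buffer.record("t2", {"name": "llm_call"})

kept_t1 = buffer.finish("t1", {"status": "failure"})
kept_t2 = buffer.finish("t2", {"status": "success"})
```

The keep-or-drop decision runs only in `finish`, after the outcome is known, which is the defining property of tail-based sampling.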
A practical tiered approach is to buffer every task, as tail-based sampling requires, and then apply different persistence criteria once the outcome is known. Trace every failure at full fidelity. Trace every task that exceeds a cost or latency threshold at full fidelity. Sample five to ten percent of normal successes at full fidelity for baseline comparison. For the remaining successes, persist only the root span with its aggregate metadata (total tokens, total cost, step count, outcome) and discard the child spans. This gives you complete investigation capability for every problem while keeping storage manageable.
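The persistence tiers described above can be expressed as a small decision function. The thresholds (`cost_limit_usd`, `latency_limit_ms`), the five percent `baseline_rate`, the field names on the root span, and the choice to treat any outcome other than "success" as worth keeping in full are all assumptions for this sketch.

```python
import random


def persistence_level(root, *, cost_limit_usd=0.50, latency_limit_ms=30_000,
                      baseline_rate=0.05, rng=random.random):
    """Return "full" or "root_only" for a finished task's root span."""
    if root["outcome"] != "success":
        return "full"          # failures (and, by assumption, partials)
    if root["total_cost_usd"] > cost_limit_usd:
        return "full"          # unusually expensive
    if root["latency_ms"] > latency_limit_ms:
        return "full"          # unusually slow
    if rng() < baseline_rate:
        return "full"          # baseline sample of normal successes
    return "root_only"


def spans_to_persist(spans, level):
    """Keep every span for "full", only parentless (root) spans otherwise."""
    if level == "full":
        return list(spans)
    return [s for s in spans if s["parent"] is None]


spans = [
    {"id": 1, "parent": None, "name": "task"},
    {"id": 2, "parent": 1, "name": "llm_call"},
    {"id": 3, "parent": 2, "name": "search"},
]
normal = {"outcome": "success", "total_cost_usd": 0.02, "latency_ms": 4_000}
failed = {"outcome": "failure", "total_cost_usd": 0.02, "latency_ms": 4_000}

# A fixed rng makes the example deterministic: 0.99 never hits the baseline.
level_normal = persistence_level(normal, rng=lambda: 0.99)
level_failed = persistence_level(failed, rng=lambda: 0.99)
kept_normal = spans_to_persist(spans, level_normal)
```

Passing `rng` as a parameter keeps the random baseline sample testable; in normal use the default `random.random` applies.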
Agent tracing extends distributed tracing by enriching every span with reasoning context, not just timing data. The combination of structural span design, reasoning annotations, and tail-based sampling gives you the ability to fully reconstruct any failed task's decision process while keeping storage costs under control.