Tool Calling
Tool calling is the mechanic by which a language model invokes external code through a structured message. The application defines a set of tools with names, descriptions, and JSON Schema for their parameters. The model produces a tool-call message specifying which tool to invoke and with what arguments. The application executes the call and returns the result as another message. The conversation continues with the result in context.
The pattern is older than the term. Early LLM agents stitched it together with regex parsing of free-text output. Modern provider APIs (Anthropic's, OpenAI's, Google's) now expose tool calling as a first-class message type with structured inputs and outputs, which removes the brittle parsing step and gives the model a clean signal that it is requesting a tool rather than emitting prose.
The message-shape pattern
A tool-calling turn has three logical messages even though they happen in the same loop:
- The assistant emits a tool-use message. The message contains the tool's name, a JSON object with the arguments, and a unique tool-use ID. The model emits this in place of (or alongside) regular text.
- The application executes the tool. The application receives the tool-use message, looks up the named tool, validates the arguments against the schema, runs the code, and produces a result.
- The application sends a tool-result message back. The message references the tool-use ID and contains the result as text, JSON, or a structured content block. The model reads the result on the next turn and continues.
The model never executes anything. It emits a request. The application stays in control of what happens.
Some providers wrap this loop in a higher-level abstraction (a single complete_with_tools call that runs the loop internally). The underlying mechanic is the same.
Tool definitions
A tool definition has four parts:
- A name. Unique per session. The model uses the name to reference the tool.
- A description. Natural-language guidance the model reads to decide when to call the tool. This is the most important part of tool design, and the most frequently underwritten.
- A parameter schema. JSON Schema describing what arguments the tool accepts. The model uses this to construct valid calls.
- The implementation. Code that runs when the tool is invoked. Lives in the application, not in the protocol.
The description is where most tool design lives. A description that names the tool's purpose, lists its inputs in plain language, mentions when not to use it, and gives one example of a typical call gives the model what it needs to use the tool well. A description that says "Calls the search endpoint" leaves the model guessing.
Anthropic's Writing Tools for Agents post recommends consolidating tools by intent rather than wrapping every existing API endpoint. A schedule_event tool that handles search, conflict check, and booking inside one call leaves the model less to figure out than three separate tools wired in sequence.
Error handling
When a tool fails (bad input, network error, downstream service down), the application returns the error as a tool-result message with an is_error flag set, an error code, or distinct content the model recognizes as an error. The model reads the failure on the next turn and either retries with adjusted arguments, picks a different tool, or asks the user for guidance.
Returning a raw exception as the tool result usually works in practice. The model recognizes a stack trace and adjusts. A structured error block (a short message and a code) gives the model a cleaner signal and avoids wasting context on noise.
What to avoid: silently returning an empty result on failure. The model assumes the tool succeeded with no useful output and proceeds. The downstream effect is a confidently wrong response from the assistant.
Parallel tool calls
Modern providers support multiple tool-use blocks in a single assistant message. The model uses this when several independent calls answer parts of the same question. A query like "What's the weather in San Francisco and New York?" triggers two parallel get_weather calls in one turn rather than two sequential turns.
The application runs the calls in parallel, collects the results, and returns all of them in a single user message containing multiple tool-result blocks. The model sees both results on its next turn and answers.
Parallel calls reduce latency on multi-source queries. They also stress the application's concurrency model. A tool that holds a database connection from a small pool serializes the parallel calls in practice even when the model issued them concurrently.
Tool choice and forcing
Provider APIs expose a tool_choice parameter (different names across providers) that controls the model's behavior at the tool-selection boundary:
- Auto. The model decides whether to call a tool or respond with text. Default for most workloads.
- Any. The model is required to call a tool. Useful when the application needs a structured output shaped like a tool call.
- Specific. The model is required to call a named tool. Useful for testing or for workflows where the next action is fixed.
- None. The model emits text only. Useful for a final summarization turn after a multi-step tool-using interaction.
Most production workloads stay on auto. Forcing tool choice fits narrow cases: structured-output extraction, evaluation harnesses, deterministic workflow steps.
Tool calling and context budget
Every tool definition lives in the system context. Every tool-use and tool-result lives in the conversation history. A long agentic session accumulates tool traffic faster than user prose, and the context budget tightens correspondingly. See Context Engineering for the patterns that keep this manageable: just-in-time retrieval, structured note-taking, sub-agent architectures, compaction.
Tool design decisions ripple into context cost:
- A verbose tool description costs tokens on every turn the tool is available. Tight, complete descriptions beat long ones.
- A tool returning a 10KB result spends a 10KB of context on every subsequent turn the result is in scope. Return identifiers, not full payloads, when the next step is another tool call rather than a final answer.
- A tool that should rarely fire still occupies its slot in the description block. A tool the model has used twice in a month is a candidate for removal.
Failure modes worth naming
- Hallucinated tool calls. The model invokes a tool name not in the schema, or invents an argument the schema does not accept. Validate every tool call against the schema before execution. Return a structured error if validation fails so the model self-corrects.
- Looping on the same call. The model retries the same failing tool with the same arguments. Detect repetition at the application layer and either inject a guidance message ("This call has failed three times. Try a different approach.") or stop the loop and surface to the user.
- Tool-result poisoning. A tool returns attacker-controlled content (web search results, scraped pages, untrusted file contents) and the model treats it as instructions. The defense is the same as for any untrusted input: mark the content as data, not as instruction, and apply the prompt-injection guard.
- Confidence mismatch. The model invokes a tool to "check" a fact when the fact is already in the system prompt or the recent conversation. Wastes a turn. Usually a sign the system prompt is unclear about what the model already knows.
Related
- Context Engineering — the broader practice tool calling sits inside.
- Model Context Protocol — the open standard for exposing tools to multiple hosts.
- System Prompts — where most tool-usage guidance lives.
- Prompt Injection — the threat model for tool-result content.