Tool Use and Function Calling: How AI Agents Actually Take Action

Most explanations of how AI agents work are either too abstract to be useful or too shallow to build from. This article covers the actual mechanics: how function calling works at the protocol level, how to design tools that agents use reliably, how agents decide which tool to call, what happens when tool calls fail, and what the production considerations look like for a real agentic system with many tools and high volume. By the end, you should be able to design a tool interface for a specific agent use case and anticipate the failure modes before they hit you in production.

What function calling actually is

A language model on its own takes text in and produces text out. It has no ability to run code, query a database, or call an external API. Function calling is the bridge between the model's reasoning and the world it needs to affect.

The implementation works like this. When you call the model API, you include a list of tool definitions alongside your prompt. Each tool definition describes a function: its name, what it does, and the parameters it accepts, including the type and description of each parameter. The model reads these definitions as part of its context, reasons about what it needs to do, and if it determines that calling a tool is the right next step, it returns a structured response specifying which function to call and what arguments to pass.

The critical point is that the model does not execute the function. It returns a request to execute the function. Your application code receives that request, executes the actual function call against the real system, and passes the result back to the model in the next turn. The model then incorporates that result into its reasoning and either calls another tool, continues reasoning toward the final output, or concludes the task.

A minimal example clarifies this. The model is given a tool called get_account_balance that accepts a customer_id parameter. The user asks about a customer's balance. The model returns a function call request: call get_account_balance with customer_id equal to "C-10042". Your application executes the database query, retrieves the balance, and passes it back to the model. The model uses that balance in its response.
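The round trip described above can be sketched in a few lines. This is a minimal illustration, not any provider's actual API: the request shape (a dict with "name" and "arguments"), the BALANCES stand-in for the database, and the execute_tool_call helper are all assumptions made for the example.

```python
import json

# Hypothetical in-memory stand-in for the real database.
BALANCES = {"C-10042": 1250.75}

def get_account_balance(customer_id: str) -> dict:
    """Look up a balance; a real system would query the database here."""
    return {"customer_id": customer_id, "balance": BALANCES[customer_id]}

# Registry mapping tool names to the application functions that implement them.
TOOLS = {"get_account_balance": get_account_balance}

def execute_tool_call(request: dict) -> str:
    """Run the function the model asked for and serialize the result for the next turn."""
    func = TOOLS[request["name"]]
    result = func(**request["arguments"])
    return json.dumps(result)

# Shape of a request the model might return (field names are illustrative).
model_request = {"name": "get_account_balance", "arguments": {"customer_id": "C-10042"}}

# The application, not the model, performs the lookup.
tool_result = execute_tool_call(model_request)
```

The string in tool_result is what the application would pass back to the model as the tool's output in the next turn.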

The model never touches the database. It never executes code. It reasons about what information it needs and what it wants to do, expresses that reasoning as a structured function call request, and your application executes the actual work. This separation between reasoning and execution is fundamental to understanding how agents work and how to build them safely.

How the model decides which tool to call

The model's tool selection is driven by the quality of two things: the tool definitions you provide and the reasoning context the model has built up from prior steps in the task.

Tool definitions are not just technical specifications. They are instructions to the model about when to use a tool. A tool called get_customer_data with a description that says "retrieves customer data" gives the model very little to work with. A tool called get_customer_data with a description that says "retrieves full customer profile including name, account history, risk rating, and last interaction date for a given customer ID. Use when you need to look up a specific customer's details before making a recommendation or updating their record" gives the model the context it needs to choose correctly between this tool and, say, search_customers, which finds customer records by name or partial information.

The parameter descriptions matter just as much as the function description. A parameter called customer_id described only as "string" forces the model to guess what format is expected. A parameter described as "the unique customer identifier in the format C-XXXXX, found in the CRM system and on all account documents" removes that ambiguity.
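As an illustration, a definition along these lines might look like the following. The overall layout (name, description, JSON Schema style parameters) is a common convention, but the exact field names differ between providers, and the regular expression assumes the X characters in C-XXXXX are digits, which the format description does not confirm.

```python
import re

# Illustrative tool definition; exact field names vary by model provider.
get_customer_data_tool = {
    "name": "get_customer_data",
    "description": (
        "Retrieves full customer profile including name, account history, "
        "risk rating, and last interaction date for a given customer ID. "
        "Use when you need to look up a specific customer's details before "
        "making a recommendation or updating their record."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                # Assumption: the five X characters are digits.
                "pattern": "^C-[0-9]{5}$",
                "description": (
                    "The unique customer identifier in the format C-XXXXX, "
                    "found in the CRM system and on all account documents."
                ),
            }
        },
        "required": ["customer_id"],
    },
}

# The application can reuse the same pattern to check arguments before executing.
pattern = get_customer_data_tool["parameters"]["properties"]["customer_id"]["pattern"]
is_valid = re.fullmatch(pattern, "C-10042") is not None
```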

When the model has multiple tools available, it uses the task context, the descriptions, and any examples you have provided to reason about which tool is appropriate for the current step. If two tools have overlapping descriptions, the model will make unpredictable choices between them. If a tool's description does not clearly indicate when it should be used versus when it should not, the model will call it in situations where it is inappropriate. Designing tool definitions is as important as implementing the tools themselves.

Designing tools that agents use reliably

The principles for designing tools that agents use correctly in production are different from the principles for designing APIs that humans use. Humans can read documentation, ask questions, and recover from mistakes. Agents interpret descriptions at inference time, cannot ask clarifying questions mid-call, and will propagate errors silently if the tool's behavior is not what the description implied.

Each tool should do exactly one thing. A tool called process_customer_request that might send an email, update a record, or trigger a workflow depending on the input is a tool the model will misuse. Separate functions for each action, with clear descriptions of exactly what each one does and does not do, give the model the precision it needs to choose correctly.

Parameter types should be as specific as possible. Enum types for fields with a fixed set of valid values prevent the model from passing invalid arguments. Date parameters should specify the expected format. Amount parameters should specify the currency and precision. The more specific the type definition, the fewer invalid calls the model will make.

Describe the side effects explicitly. If calling a tool sends an email, modifies a database record, charges a payment method, or triggers a downstream process, say so in the description. The model cannot know about side effects from the function signature alone. A model that does not know a tool sends a customer email will call it in testing scenarios and in situations where sending an email is not appropriate. Explicit side effect documentation prevents this.

Design for reversibility where possible. Where a tool takes an action that cannot be undone, flag this in the description and consider whether the action should require an explicit confirmation step before the model can call the execute variant. A pattern that works well in production is separating read tools from write tools, and separating prepare tools from execute tools. The model can call prepare_payment to construct the payment details and validate them, review the prepared payment, and only then call execute_payment to submit it. This gives a human reviewer or a small language model (SLM) judge a natural checkpoint at which to intervene before the irreversible action.
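One way the prepare and execute split could look in code is sketched below. The in-memory _PREPARED store, the approve_payment function (called by the reviewer, not exposed to the model), and the amount_cents parameter are assumptions for illustration; a real system would persist prepared payments and call a payment provider.

```python
import uuid

# Hypothetical store of prepared payments awaiting approval.
_PREPARED: dict = {}

def prepare_payment(customer_id: str, amount_cents: int, currency: str) -> dict:
    """Validate and stage a payment. Has no external side effects."""
    if amount_cents <= 0:
        return {"ok": False, "error": "amount_cents must be positive"}
    payment_id = str(uuid.uuid4())
    _PREPARED[payment_id] = {
        "customer_id": customer_id,
        "amount_cents": amount_cents,
        "currency": currency,
        "approved": False,
    }
    return {"ok": True, "payment_id": payment_id}

def approve_payment(payment_id: str) -> None:
    """Checkpoint for the human or judge; not part of the model's tool set."""
    _PREPARED[payment_id]["approved"] = True

def execute_payment(payment_id: str) -> dict:
    """Submit a prepared payment. Irreversible, so it requires prior approval."""
    payment = _PREPARED.get(payment_id)
    if payment is None:
        return {"ok": False, "error": "unknown payment_id"}
    if not payment["approved"]:
        return {"ok": False, "error": "payment has not been approved"}
    del _PREPARED[payment_id]  # a payment can be executed only once
    return {"ok": True, "status": "submitted", "payment_id": payment_id}
```

Because execute_payment refuses anything that has not passed through approve_payment, the checkpoint is enforced in code rather than relying on the model to remember it.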

Return structured results the model can reason about. A tool that returns raw JSON from an external API forces the model to parse and interpret an inconsistent structure. A tool that returns a clean, normalized response with consistent field names and a clear indication of success or failure gives the model reliable information to reason from. Normalize and validate tool responses in your application layer before returning them to the model.
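A normalization step of this kind might look like the sketch below. The upstream payload shape (status, msg, acct, bal, ccy) is invented for the example; the point is only that the model always sees the same three top-level fields.

```python
def normalize_balance_response(raw: dict) -> dict:
    """Map an inconsistent upstream payload to a stable shape for the model."""
    account = raw.get("acct") or {}
    balance = account.get("bal")
    if raw.get("status") != "OK" or balance is None:
        return {
            "success": False,
            "error": raw.get("msg", "upstream returned no balance"),
            "data": None,
        }
    return {
        "success": True,
        "error": None,
        "data": {
            "customer_id": account.get("id"),
            "balance": float(balance),  # upstream may send numbers as strings
            "currency": account.get("ccy"),
        },
    }

# Hypothetical raw payloads from an external API.
good = normalize_balance_response(
    {"status": "OK", "acct": {"id": "C-10042", "bal": "1250.75", "ccy": "EUR"}}
)
bad = normalize_balance_response({"status": "ERR", "msg": "account locked"})
```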

Parallel tool calls

Many modern model APIs support calling multiple tools in a single model turn. The model can return several function call requests at once, your application executes them in parallel, and the results are all returned to the model in the next turn.

For agentic systems, parallel tool calling is one of the most significant performance levers available. An onboarding workflow that runs five independent checks of similar latency sequentially takes roughly five times as long as one that runs them in parallel. For any set of tool calls whose results do not depend on each other, calling them simultaneously reduces the wall-clock time for that step from the sum of the individual latencies to roughly the latency of the slowest call.

Not all tool calls can be parallelized. A tool call whose input depends on the result of a previous tool call must run sequentially. But in most complex workflows, there are meaningful opportunities for parallelization that naive sequential implementations miss. When designing the tool set for an agent, explicitly map out which calls are independent and ensure the agent instructions encourage parallel calling where appropriate.
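On the application side, executing a batch of independent tool calls concurrently can be as simple as the following sketch. The run_check function and the check names are placeholders, and the 0.1 second sleep stands in for network latency; a thread pool is one reasonable choice for I/O-bound calls, not the only one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_check(name: str) -> dict:
    """Placeholder for an I/O-bound tool call such as an external API request."""
    time.sleep(0.1)  # simulated latency
    return {"check": name, "passed": True}

# Five independent checks requested by the model in a single turn (illustrative names).
CHECKS = ["identity", "sanctions", "credit", "address", "documents"]

def run_checks_in_parallel(names: list) -> list:
    """Execute all checks concurrently; results come back in request order."""
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        return list(pool.map(run_check, names))

start = time.perf_counter()
results = run_checks_in_parallel(CHECKS)
elapsed = time.perf_counter() - start  # close to one call's latency, not five
```

Since pool.map preserves input order, each result can be matched back to the function call request that produced it.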

Error handling and retry logic

Tools fail. APIs are unavailable. Databases return unexpected results. Rate limits are hit. External services return errors. An agent that does not handle tool failures gracefully will either stall, produce wrong outputs by proceeding without the information it needed, or enter retry loops that exhaust resources.

Every tool interface needs a consistent error response structure that tells the model what went wrong and what the agent should do about it. A raw exception stack trace is not a useful error response for a model. A structured response with an error code, a human-readable explanation, and a suggested action (retry, use a fallback, or escalate) gives the model the information it needs to handle the failure intelligently.
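One possible shape for such an error response is sketched below. The field names, the SuggestedAction values, and the error code are illustrative choices, not a standard.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional

class SuggestedAction(Enum):
    RETRY = "retry"
    USE_FALLBACK = "use_fallback"
    ESCALATE = "escalate"

@dataclass
class ToolError:
    code: str                            # stable, machine-readable identifier
    message: str                         # human-readable explanation
    suggested_action: SuggestedAction    # what the agent should do next
    fallback_tool: Optional[str] = None  # set only when a fallback exists

def to_tool_result(error: ToolError) -> dict:
    """Convert the error into the plain dict returned to the model."""
    payload = asdict(error)
    payload["suggested_action"] = error.suggested_action.value
    return {"success": False, "error": payload}

# Hypothetical failure: the primary lookup is down and a fallback tool exists.
result = to_tool_result(
    ToolError(
        code="UPSTREAM_UNAVAILABLE",
        message="The CRM did not respond.",
        suggested_action=SuggestedAction.USE_FALLBACK,
        fallback_tool="search_customers",
    )
)
```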

Define the retry policy for each tool explicitly in your application layer, not in the model's reasoning. If a tool call fails with a transient error, your application should retry it with exponential backoff before returning the error to the model. The model should receive a tool failure only after your application has already exhausted its retry budget. This prevents the model from making retry decisions that are better handled in code, and it keeps the model's context clean of noise from transient infrastructure failures.
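A retry wrapper in the application layer might look like this sketch. TransientError, the attempt limit, the base delay, and the RETRIES_EXHAUSTED code are assumptions for the example; the sleep function is injectable so the policy can be exercised without waiting.

```python
import time

class TransientError(Exception):
    """Raised by a tool for failures worth retrying (timeouts, rate limits)."""

def call_with_retry(func, *, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Retry with exponential backoff; the model sees only the final outcome."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"success": True, "data": func()}
        except TransientError as exc:
            if attempt == max_attempts:
                return {
                    "success": False,
                    "error": {
                        "code": "RETRIES_EXHAUSTED",
                        "message": str(exc),
                        "attempts": attempt,
                    },
                }
            # Delay doubles after each failed attempt.
            sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical tool that fails twice before succeeding.
attempts = {"count": 0}

def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("service unavailable")
    return {"balance": 1250.75}

delays = []  # record requested delays instead of actually sleeping
outcome = call_with_retry(flaky_lookup, sleep=delays.append)
```

Only non-transient exceptions and the final RETRIES_EXHAUSTED result ever reach the code that builds the model's next turn.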

For tool failures that cannot be recovered by retry, the error response should tell the model whether a fallback tool is available, whether the task should be escalated to a human, or whether the task should be abandoned. The model will follow this guidance if it is clearly expressed. It will make unpredictable decisions if the error response is ambiguous.

Managing tool context at scale

As the number of tools in an agent's toolkit grows, two problems emerge. The model's context window fills with tool definitions, reducing the space available for task context. And the model's tool selection accuracy degrades as the number of available tools increases, because the signal-to-noise ratio in the tool selection decision decreases.

Both problems have the same solution: do not give the agent access to all tools at all times. Give it access to the tools relevant to the current step or the current task type.

This can be implemented statically, where different agent configurations have different tool subsets for different workflow stages, or dynamically, where the orchestrator selects the appropriate tool subset based on the current task context and passes only those tools to the model. Dynamic tool selection requires a tool retrieval layer, essentially a vector search over tool descriptions, but it scales to large tool sets where static configuration becomes unmanageable.
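The dynamic variant can be illustrated with a deliberately simplified retrieval step. The sketch below ranks tools by word overlap between the task text and each description, which is a stand-in for the vector search mentioned above, not an implementation of it; the tool names and descriptions are illustrative.

```python
# Illustrative tool catalogue; a real one would hold full tool definitions.
TOOL_DESCRIPTIONS = {
    "get_customer_data": "retrieve full customer profile by customer id",
    "search_customers": "find customer records by name or partial information",
    "prepare_payment": "construct and validate payment details before submission",
    "run_sanctions_check": "screen a customer name against the sanctions database",
}

def select_tools(task: str, limit: int = 2) -> list:
    """Return up to `limit` tool names whose descriptions best match the task.

    Word overlap is used here only to keep the example self-contained;
    an embedding-based similarity search would replace the scoring step.
    """
    task_words = set(task.lower().split())
    scored = []
    for name, description in TOOL_DESCRIPTIONS.items():
        overlap = len(task_words & set(description.split()))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:limit]]

payment_tools = select_tools("validate payment details for submission")
screening_tools = select_tools("screen customer against sanctions database")
```

Only the definitions for the selected names would then be passed to the model for that call.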

A practical rule of thumb from production deployments: model tool selection accuracy starts to degrade meaningfully above fifteen to twenty tools in a single model call. Treat this as a starting point rather than a fixed limit, since the threshold depends on the model and on how distinct the tool descriptions are. If your agent needs access to more tools than that, implement tool selection to narrow the available set before each model call.

What the production audit trail needs to capture

Every tool call in a production agentic system should be logged as part of a tamper-evident audit trail. The log record for a tool call should capture the tool name, the arguments passed, the timestamp of the call, the response received, the latency of the call, and the task context that led the model to make this call at this point in the workflow.

This logging serves two purposes. Operationally, it is how you debug production failures: the complete sequence of tool calls and their results is the trace you need to understand what the agent did and why. For compliance, it is the evidence that the agent's actions were authorized, appropriate, and traceable to the task that triggered them.

Log tool calls at the application layer, not by parsing model responses. Your application executes every tool call. Your application has the complete context of every call: the arguments, the response, and the task context. Capturing this at the application layer is more reliable and more complete than attempting to reconstruct it from model outputs.
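A minimal sketch of application-layer logging follows. Tamper evidence is approximated here with a hash chain, where each record stores the hash of the previous one; that mechanism, the AuditLog class, and the field names are assumptions for illustration, and a production system would also need durable, access-controlled storage.

```python
import hashlib
import json
import time

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record

def _digest(body: dict) -> str:
    """Hash a record body deterministically."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

class AuditLog:
    def __init__(self) -> None:
        self.records = []

    def append(self, tool_name, arguments, response, latency_ms, task_context) -> dict:
        """Record one tool call, chained to the previous record's hash."""
        body = {
            "tool_name": tool_name,
            "arguments": arguments,
            "response": response,
            "latency_ms": latency_ms,
            "task_context": task_context,
            "timestamp": time.time(),
            "previous_hash": self.records[-1]["hash"] if self.records else GENESIS_HASH,
        }
        record = dict(body, hash=_digest(body))
        self.records.append(record)
        return record

    def verify(self) -> bool:
        """Return False if any record was altered or the chain was broken."""
        previous = GENESIS_HASH
        for record in self.records:
            body = {k: v for k, v in record.items() if k != "hash"}
            if record["previous_hash"] != previous or _digest(body) != record["hash"]:
                return False
            previous = record["hash"]
        return True

log = AuditLog()
log.append("get_account_balance", {"customer_id": "C-10042"}, {"balance": 1250.75}, 42, "balance inquiry")
log.append("search_customers", {"name": "Smith"}, {"matches": 2}, 57, "balance inquiry")
```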

A note on tool security

Every tool you give an agent is a capability the agent can use in ways you did not anticipate. A tool that sends emails can be used to send emails to unintended recipients if the agent misinterprets a task. A tool that modifies database records can modify the wrong records. A tool that executes financial transactions can execute transactions the agent was not authorized to execute.

Security for tool use requires three things. Least privilege: each tool should have the minimum permissions required to do its job, no more. The sanctions screening tool should have read access to the sanctions database and nothing else. It should not have access to customer records or the ability to write to any system. Validation: your application layer should validate the arguments the model passes to every tool before executing the call. The model can hallucinate arguments that look plausible but are invalid or unauthorized. Validate against a schema and against the authorization context before execution. Logging: as described above, every tool call should be logged with full context. The log is your only way to detect unauthorized or anomalous tool use after the fact.
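The validation step might be sketched as follows for a payment-style tool. The argument names, the currency list, and the assumption that customer IDs are "C-" followed by five digits are illustrative; a schema validation library could replace the hand-written checks.

```python
import re

# Assumption: customer IDs are "C-" followed by five digits.
CUSTOMER_ID_PATTERN = re.compile(r"C-\d{5}")
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # illustrative enum of valid values

def validate_payment_args(args: dict, authorized_customer_ids: set) -> list:
    """Return a list of problems; an empty list means the call may proceed."""
    errors = []

    customer_id = args.get("customer_id")
    if not isinstance(customer_id, str) or not CUSTOMER_ID_PATTERN.fullmatch(customer_id):
        errors.append("customer_id must match the format C-XXXXX")
    elif customer_id not in authorized_customer_ids:
        # Well-formed but outside the authorization context of this task.
        errors.append("customer_id is outside the scope of this task")

    currency = args.get("currency")
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        errors.append("currency must be one of " + ", ".join(sorted(ALLOWED_CURRENCIES)))

    amount = args.get("amount_cents")
    if not isinstance(amount, int) or isinstance(amount, bool) or amount <= 0:
        errors.append("amount_cents must be a positive integer")

    return errors

# A plausible-looking but unauthorized call, as a model might hallucinate it.
problems = validate_payment_args(
    {"customer_id": "C-99999", "currency": "USD", "amount_cents": 5000},
    authorized_customer_ids={"C-10042"},
)
```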

For any tool that takes an irreversible or high-consequence action, add an explicit authorization check in your application layer that verifies the action is within the permitted scope of the current task before executing it. This check runs in code, not in the model's reasoning, so it cannot be bypassed through prompt manipulation.
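Such a check could be as small as the sketch below. TaskScope, its two fields, and the use of PermissionError are hypothetical choices for the example; the scope would be set by the orchestrator when the task starts, never by the model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskScope:
    """Permissions fixed by the orchestrator at task start (illustrative fields)."""
    allowed_tools: frozenset
    max_amount_cents: int

def authorize(tool_name: str, arguments: dict, scope: TaskScope) -> None:
    """Raise PermissionError if the call falls outside the task's scope."""
    if tool_name not in scope.allowed_tools:
        raise PermissionError(f"{tool_name} is not permitted for this task")
    if arguments.get("amount_cents", 0) > scope.max_amount_cents:
        raise PermissionError("amount exceeds the limit for this task")

# Hypothetical scope for a refund task capped at 100.00 in minor units.
refund_scope = TaskScope(
    allowed_tools=frozenset({"prepare_payment", "execute_payment"}),
    max_amount_cents=10_000,
)

def is_authorized(tool_name: str, arguments: dict) -> bool:
    """Convenience wrapper that reports the decision instead of raising."""
    try:
        authorize(tool_name, arguments, refund_scope)
        return True
    except PermissionError:
        return False
```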

The principle that holds this together

Function calling is powerful precisely because it separates reasoning from execution. The model reasons about what to do. Your application code executes it. That separation is where you build in all the safety, validation, logging, and authorization that makes a production agentic system trustworthy rather than just capable.

The quality of the tool interface (the precision of the descriptions, the consistency of the responses, the completeness of the error handling, and the rigor of the authorization checks) determines the reliability of the agent more than the capability of the model does. A highly capable model with a poorly designed tool interface will behave unpredictably. A less capable model with a well-designed tool interface will behave consistently and safely.

Design the tool interface first. The model will follow it.