Ollama disables tool calling silently in two independent ways by default

Published: 2026-05-01Last verified: 2026-04-29empirical

Published Fact-checked 2026-04-29 · 0 corrections

Ollama disables tool calling silently in two independent ways by default

What you expect

Ollama is the dominant local LLM runtime, advertised as OpenAI-compatible with reliable tool-calling support since v0.5. You pull a model, configure your agent framework, and expect tool calls to work. When they don’t, you expect an error.

What actually happens

Ollama has two orthogonal silent failure modes that each independently disable all tool calling — no exception, no log entry, no HTTP error.

Failure 1: The 2048-token context default. OLLAMA_CONTEXT_LENGTH defaults to 2048 tokens. In multi-turn agentic sessions — where system prompts, tool results, and conversation history accumulate — this ceiling is hit within 3–5 exchanges. When Ollama silently truncates the context, the model receives an incomplete view of the conversation and stops producing tool calls. From the agent framework’s perspective, tool calling has inexplicably stopped working mid-session. One environment variable fixes it (OLLAMA_CONTEXT_LENGTH=32768 or higher), but Ollama does not warn when truncation occurs and agent frameworks that wrap Ollama do not set this config. Automatic v0.8.0 — a local agent config registry used by Claude Code and other agents — ships its Ollama integration without this variable, silently inheriting the broken default.

Failure 2: The streaming protocol bug. When streaming is enabled (the Ollama default), Ollama returns an empty content chunk with finish_reason: "stop" in place of tool_calls delta chunks. The model internally generates tool call intentions, but the streaming layer never delivers those chunks to the calling agent. Every tool-dependent skill — web search, file operations, shell execution, MCP tool dispatch — silently fails. The agent receives a completion event with no tool calls and no indication that a tool call was attempted. This failure fires immediately, on the first tool call, regardless of context length or model quality. An agent correctly configured with OLLAMA_CONTEXT_LENGTH=131072 and a capable model still hits this bug if streaming is enabled.

The critical interaction: these two failure modes are orthogonal. Fixing one does not fix the other. A builder who fixes the context length trap still has all tool calling disabled via the streaming bug. Both require explicit configuration to avoid: OLLAMA_CONTEXT_LENGTH for the context trap, and stream: false in API calls for the streaming bug.

There is also a third class of pre-inference failure: chat template bugs in Ollama’s /api/chat integration corrupt tool schemas before the model ever sees them, producing malformed tool definitions at the API boundary. GBNF enforcement (added in v0.5) does not prevent this because the corruption happens before token sampling.

Independently confirmed by the BetterClaw/OpenClaw production bug report (2026): streaming returns finish_reason: stop instead of tool_calls chunks; stream: false resolves it.

What this means for you

Every agent framework that defaults to streaming against Ollama — which is most of them — silently loses all tool calling regardless of model quality, GBNF configuration, or context settings. Local-first inference is not a fringe pattern — this is a mainstream failure surface.

The 2048-token default means agentic RAG is also broken out of the box: retrieved chunks injected into the generation context saturate the window before the user query is appended. This is not just a multi-turn chat problem — it affects any workflow where tool results or retrieved content accumulates in context.

Industry surveys (Q1 2026) report the majority of enterprise inference runs on-premises or at the edge — local-first is a structural architectural branch, and these failures affect that entire segment. Tools that wrap Ollama without setting OLLAMA_CONTEXT_LENGTH have accepted a latent failure in every agentic workflow they enable.

What to do

Always set OLLAMA_CONTEXT_LENGTH — minimum 32768 for agentic use; 131072 for RAG or long sessions. Add it to your shell profile or Docker environment. Never rely on the 2048 default for agent workloads.
Set stream: false in API calls or agent framework config when tool calling is required. Accept the UX tradeoff: non-streaming means no visible output until the full response is generated. For interactive use, implement a separate streaming path that does not require tool calling; for agent workflows, streaming is not needed.
Audit your agent framework’s Ollama config. Check whether it sets OLLAMA_CONTEXT_LENGTH. If it does not, treat all tool-calling results from that framework as potentially silently dropped.
Pin Qwen3 tool count below 5. At more than 5–6 active tools, Qwen3-coder switches from JSON tool calls to XML format — integrations using JSON-only parsers silently lose tool calling at that threshold. This is a third independent failure mode, separate from the context trap and streaming bug.

Falsification criterion: This finding would be disproved by Ollama releasing a version where (a) the default context length is set to a value sufficient for multi-turn agentic sessions (≥8192), (b) the streaming layer correctly delivers tool_calls delta chunks without requiring stream: false, and both behaviors are confirmed in the default configuration with no workaround required.

Evidence

Tool	Version	Evidence	Result
Ollama	v0.5+	source-reviewed	OLLAMA_CONTEXT_LENGTH defaults to 2048; streaming drops tool_calls delta chunks silently
Automatic v0.8.0	v0.8.0	source-reviewed	Ollama integration config omits OLLAMA_CONTEXT_LENGTH; agents inherit 2048 default
BetterClaw/OpenClaw	production 2026	independently-confirmed	Streaming returns finish_reason:stop instead of tool_calls chunks; stream:false resolves it
Qwen3-14B via Ollama	v0.5+ GBNF	source-reviewed	F1=0.971 tool selection with GBNF — streaming bug fires regardless of model quality

Confidence: empirical — 4 environments reviewed, 1 independently confirmed.

Strongest case against: The streaming bug is confirmed from a single production bug report (BetterClaw/OpenClaw); other agent frameworks may work around this in their Ollama integration code. The context-length default is well-documented in Ollama’s own configuration reference, though not prominently. Teams using Ollama with frameworks that do set OLLAMA_CONTEXT_LENGTH (e.g., direct API users who read the docs) would not encounter failure 1. The v0.21.0 Hermes Agent addition and active v0.20+ development cadence means these behaviors may be changing.

Open questions: Does stream: false introduce other correctness failures in concurrent request handling? Which agent frameworks already set OLLAMA_CONTEXT_LENGTH in their default Ollama integrations? Does the MLX backend (switched March 2026 for Apple Silicon) exhibit the same streaming bug, or does it have a different implementation?

Seen different? Contribute your evidence — share a repro or counter-example and we’ll review it against this finding. Reader evidence is what keeps these findings accurate.