Error suppression patterns that are harmless to humans are catastrophic to agents
From Theory Delta | Methodology | Published 2026-03-29
What the docs say
Shell scripting guides treat 2>/dev/null, || true, and set +e as style preferences or minor code smells. Python documentation presents broad except: blocks as acceptable for quick scripts. Neither human-oriented guide flags these as structurally dangerous — the assumption is that the person running the script will notice if something looks wrong.
What actually happens
Agents have exactly one feedback channel: exit code + stdout/stderr. When error suppression patterns destroy this channel, agents cannot detect, diagnose, or recover from failures. Unlike humans, agents cannot observe unexpected side effects, notice a file is missing, or apply intuition that “something feels wrong.” Exit code 0 with empty stderr means success — there is no second opinion.
The “Tools Fail” paper (arXiv:2406.19228) provides the strongest quantitative evidence. When tools returned incorrect output without error signals, GPT-3.5 accuracy dropped from 98.7% to 22.7% — a 76 percentage point collapse. The failure mode is not “agent notices something is wrong but cannot fix it.” It is “agent does not notice anything is wrong.” Models copied incorrect tool output and generated hallucinated justifications for why it was correct.
This pattern appears across the full agent toolchain, not just bash scripts:
SDK layer: Vercel AI SDK streamText() silently swallows errors by default — no logs, no stack trace unless an onError callback is supplied. Fixed in v4.1.22, but the silent default shipped in all prior versions.
CLI layer: OpenAI Codex CLI issue #9091 documents a regression between v0.36 and v0.72 where all commands returned exit code 0 with empty stdout and stderr. The CLI executed commands but output capture was broken. The agent could not observe any command output and could not detect whether commands succeeded or failed. Root cause: local shell environment configuration — but the failure mode (stdout invisible to agent) is the same regardless of cause. Related issues #7317, #6033, and #6991 report the same class of problem, confirming it is a recurring failure mode in agent tooling.
Shell layer: Bash 3.2 (macOS default) silently crashes agent scripts on empty array expansion under set -u — ${arr[@]} expands to nothing and the script terminates with an unbound variable error. This affected 27 sites in production agent scripts before detection. Scripts that passed CI on Linux (bash 5.x) silently failed on macOS developer environments. Bash 3.2 also does not support declare -A associative arrays — scripts using them return exit code 0 with no output rather than surfacing the unsupported syntax error.
Observability layer: AgentOps silent 401 authentication failures occur without visible error output by default. An agent using AgentOps for observability may be operating without any telemetry while believing instrumentation is active. The observability layer itself silently fails — compounding the problem that silent failures in agent scripts are already hard to detect.
CI layer: SFEIR Institute documentation on Claude Code headless mode reports approximately 8% of CI/CD executions produce a silent failure: exit code 0 returned, but response is empty, incomplete, or irrelevant. “Always validate response content — an exit code 0 does not guarantee a usable response.”
Compounding factor: Dagster’s “Dignified Python” analysis found that LLMs default to broad exception swallowing because try: ... except: pass patterns dominate training data. Agents writing scripts actively introduce error suppression unless their context explicitly prohibits it. This creates a compounding failure: agents both cannot detect silent errors and generate code that produces them.
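The shell-layer failure above has a portable workaround. A minimal sketch of the `${arr[@]+...}` guard, which expands to the array's elements when any exist and to nothing when the array is empty, and behaves the same from bash 3.2 through 5.x:

```bash
#!/bin/bash
set -euo pipefail

# BAD on bash 3.2 (and 4.0-4.3): with set -u, "${arr[@]}" on an empty
# array aborts with "unbound variable". The ${arr[@]+...} guard expands
# to the elements when present and to nothing when empty, on every version.
arr=()
set -- ${arr[@]+"${arr[@]}"}
echo "empty array: $# arguments"      # 0 arguments, no crash

arr=(one two)
set -- ${arr[@]+"${arr[@]}"}
echo "populated array: $# arguments"  # 2 arguments
```

The same guard works anywhere an array is expanded, e.g. `some_command ${flags[@]+"${flags[@]}"}`.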
The difference between human and agent execution is not degree but kind:
| Property | Human-executed script | Agent-executed script |
|---|---|---|
| Visual feedback | Terminal output visible | None — only structured return value |
| Error detection | Intuition, domain knowledge | Exit code + stderr only |
| Recovery path | Re-run, inspect logs, ask a colleague | Retry same command, or hallucinate success |
| Suppression impact | Inconvenient — human will eventually notice | Catastrophic — agent proceeds on false premise |
Note: the observatory prevalence claim (58% of repos exhibit error suppression) is sourced from an internal scan tagged as [legacy, unverified] in the source block and should not be treated as a confirmed statistic until independently reproduced.
This finding would be disproved by: a controlled study showing agent task accuracy is not significantly degraded by silent tool failures in real-world (non-synthetic) codebases, or by agent architectures that achieve >80% accuracy on tasks where tool failures are silently suppressed, at a scale comparable to the Tools Fail paper.
What to do instead
Add set -euo pipefail as the first line of every agent-executed bash script. This converts the largest class of silent failures into explicit errors that agents can detect and act on. The pipefail flag is especially important: without it, failing_command | grep pattern returns the exit code of grep, not the failing command.
```bash
#!/bin/bash
set -euo pipefail
```
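A two-line demonstration of what pipefail changes:

```bash
#!/bin/bash
# A pipeline's default exit status is the LAST command's, so an upstream
# failure vanishes. pipefail propagates the first failing status instead.
default=$(bash -c 'false | cat; echo $?')
strict=$(bash -c 'set -o pipefail; false | cat; echo $?')
echo "without pipefail: $default"   # without pipefail: 0
echo "with pipefail:    $strict"    # with pipefail:    1
```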
For commands where non-zero exit is not an error (e.g., grep returning 1 for “no matches”), use targeted handling with logging rather than blanket suppression:
```bash
# BAD: suppresses all error information
command 2>/dev/null || true

# GOOD: capture and surface errors explicitly
if ! output=$(command 2>&1); then
  echo "ERROR: command failed with: $output" >&2
  exit 1
fi

# Acceptable: grep exit 1 means "no matches", which is not an error here.
# Treat only exit 1 as success; real errors (grep exit 2) still fail.
matches=$(grep -r "pattern" dir/) || [ $? -eq 1 ]
```
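Beyond strict mode, an ERR trap can report which command failed before set -e aborts the script, giving the agent a concrete diagnostic on stderr instead of a bare non-zero exit. A sketch (the trap's message wording is illustrative):

```bash
#!/bin/bash
set -euo pipefail

# Sketch: an ERR trap logs WHICH command failed before set -e exits.
demo=$(mktemp)
cat > "$demo" <<'EOF'
set -euo pipefail
trap 'echo "ERROR: \"$BASH_COMMAND\" failed at line $LINENO" >&2' ERR
false  # stands in for any failing command
EOF

bash "$demo" 2>&1 || true   # prints: ERROR: "false" failed at line 3
rm -f "$demo"
```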
For Python code agents execute or write:
```python
import sys

# BAD: swallows all errors (a bare except even catches KeyboardInterrupt)
try:
    result = risky_operation()
except:
    pass

# GOOD: catch specific exceptions, surface the error, re-raise
try:
    result = risky_operation()
except SpecificError as e:
    print(f"ERROR: {e}", file=sys.stderr)
    raise
```
When agents write scripts, include an explicit rule in their context: “Never use 2>/dev/null, || true, or bare except: blocks. Use set -euo pipefail. Catch specific exceptions and re-raise or log to stderr.”
The Tools Fail paper also evaluated a cheap mitigation: adding a simple disclaimer (“The tool can sometimes give incorrect answers”) to tool descriptions improved silent error detection by roughly 30 percentage points (GPT-3.5: from 23% to 53% zero-shot). This can be added to tool definitions without changing the tool implementation.
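The SFEIR guidance above — validate response content, not just exit code — can be encoded as a wrapper that rejects exit code 0 with empty output. A sketch, where `run_agent` is a hypothetical helper (the `echo`/`true` calls below are stand-ins; substitute your actual headless agent invocation):

```bash
#!/bin/bash
set -euo pipefail

# Sketch: treat "exit code 0 but empty output" as a failure in CI.
run_agent() {
  local response
  if ! response=$("$@" 2>&1); then
    echo "ERROR: agent command failed: $response" >&2
    return 1
  fi
  # Exit 0 alone is not success: require non-whitespace output too.
  if [ -z "$(printf '%s' "$response" | tr -d '[:space:]')" ]; then
    echo "ERROR: exit code 0 but empty response" >&2
    return 1
  fi
  printf '%s\n' "$response"
}

run_agent echo "hello from agent"  # non-empty output: passes through
if run_agent true; then            # exit 0 but NO output: rejected
  echo "unexpected success"
else
  echo "caught silent failure"
fi
```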
Environments tested
| Tool | Version | Result |
|---|---|---|
| GPT-3.5 (OpenAI) | Unspecified (2024) | independently-confirmed: 98.7% → 22.7% accuracy with silent tool failures (arXiv:2406.19228) |
| Claude Code (Anthropic) | March 2026 headless | source-reviewed: ~8% of CI/CD runs return exit code 0 with empty/irrelevant response (SFEIR Institute) |
| Codex CLI (OpenAI) | v0.36–v0.72 regression window | source-reviewed: all commands returned exit code 0 with empty stdout/stderr (#9091) |
| Vercel AI SDK | < v4.1.22 | source-reviewed: streamText() swallows errors silently by default (#4726) |
Confidence and gaps
Confidence: medium — primary quantitative evidence from one academic paper (arXiv:2406.19228), independently confirmed by GitHub issues in three separate tools. Not tested by execution in Theory Delta’s environment. The Tools Fail paper uses synthetic calculator tools, not real-world bash scripts; the magnitude of accuracy degradation in production agentic systems may differ.
Strongest case against: The 76pp accuracy drop is from a controlled study with synthetic faulty tools, not production code. Real agent pipelines have more context, more tool calls, and more opportunities for indirect failure detection than the paper’s task design. The Codex CLI issue (#9091) was caused by local shell environment configuration, not a systematic platform bug. Experienced shell programmers already know set -euo pipefail — this finding may not change behavior for practitioners who already follow strict mode conventions.
Open questions: Does the accuracy degradation replicate on more capable models (Claude 3.5+, GPT-4o)? Does set -euo pipefail actually improve end-task accuracy in agent-executed scripts, or just surface more visible errors? What is the real-world prevalence of error suppression in agent-executed scripts (the internal observatory stat is unverified)?
Seen different? Contribute your evidence — theory delta is what makes this knowledge base work.