Most of the text flowing through an agent’s context window isn’t code, reasoning, or instructions. It’s logs.

Table of contents

  1. The problem nobody talks about
  2. Where the waste happens
  3. The math on wasted tokens
  4. Context pollution is worse than context cost
  5. Why agents are bad at filtering their own input
  6. What we can do about it
  7. Where this is going

The problem nobody talks about

Here’s something I’ve been noticing while watching AI coding agents work. You ask Cursor or Claude Code to fix a failing test. The agent runs the test suite. The test suite spits out 400 lines of output — framework initialization, 47 passing tests, deprecation warnings, timing information, coverage summaries, and somewhere buried in all that, the 6 lines that actually matter: the one test that failed and why.

All 400 lines get consumed as tokens. The agent reads every single one. You pay for every single one. And the vast majority of those tokens carry zero useful information for the task at hand.

This isn’t a small problem. In a typical agentic coding session, I’d estimate that 30-60% of the tokens consumed are output noise — test results for tests that passed, build logs for steps that succeeded, stack trace frames from framework internals, verbose compiler warnings about things unrelated to the current task. The agent dutifully reads all of it, and you get billed for all of it.

Nobody seems to be talking about this much, probably because the waste is invisible. You don’t see a line item on your bill that says “342,000 tokens spent reading pytest output for passing tests.” It’s just folded into the overall token consumption of the session, and most people don’t break it down.

But I keep coming back to this because it’s not just a cost problem. It’s a quality problem.

Where the waste happens

Test output

This is the worst offender. Run a test suite with 200 tests, and you get output for every single one — whether it passed or failed. In pytest, that’s a line per test with the module path and a green dot or a PASSED label. In Jest, it’s the test name and the suite name. In Go, it’s --- PASS for every test (under -v) and ok for every package.

When everything passes, the entire output is noise. The agent needed to know “all tests pass” and instead got 200 lines saying the same thing 200 times. When one test fails, the agent needs maybe 10-20 lines (the failure message, the assertion diff, the relevant stack trace) and instead gets those lines buried in roughly 190 lines of passing test output.

A concrete example. I ran a Python project’s test suite through a Cursor agent recently. The pytest output was 847 tokens. The actual failure that the agent needed to act on was 94 tokens. That’s an 89% waste rate on a single tool call.

Jest is worse. A moderately sized React project can produce test output that runs to 2,000-3,000 tokens, because Jest prints the file path, the suite description, every individual test name with its timing, and then a summary table with columns for statements, branches, functions, and lines of coverage. For every file. If you have 30 test files, the coverage table alone can be 500+ tokens.

Build and compilation output

Webpack, Vite, Next.js, cargo, go build — they all generate output during compilation. Most of it is progress information. “Compiling 142 modules…” “Compiled successfully in 3.2s.” “Bundled in 1.4s.”

When the build succeeds, none of this matters. The agent needs one bit of information: it worked. Instead, it reads paragraphs of output about module resolution, chunk optimization, and asset sizes.

When the build fails, the useful information is the error message and the file/line reference. But it’s wrapped in the same progress output, plus potentially misleading warnings that preceded the error.

TypeScript compilation is a special case. tsc can produce hundreds of lines of type errors, many of which cascade from a single root cause. The agent reads all of them, tries to understand the pattern, and often fixes the downstream errors individually instead of identifying the one root type that’s wrong. The extra noise actively hurts its reasoning.

Stack traces

Stack traces are extremely token-expensive and mostly useless. A typical Python traceback might be 15-20 frames, of which maybe 2-3 are in your actual code. The rest are framework internals — Django middleware, pytest fixtures, asyncio event loop machinery, SQLAlchemy session management.

Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/app/venv/lib/python3.11/site-packages/django/utils/autoreload.py", line 64, in wrapper
    fn(*args, **kwargs)
  File "/app/venv/lib/python3.11/site-packages/django/core/management/commands/runserver.py", line 125, in inner_run
    handler = self.get_handler(*args, **options)
  ... 12 more framework frames ...
  File "/app/myproject/views.py", line 47, in get_user
    return User.objects.get(id=user_id)
  File "/app/venv/lib/python3.11/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
DoesNotExist: User matching query does not exist.

The agent needs the last application frame and the exception. It gets 15+ frames of threading and Django internals first. In JavaScript, it’s similar — Node.js internal modules, Express middleware layers, Promise resolution chains. In Java, the frames from Spring Boot alone can run 40+ lines.
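In Python at least, trimming a traceback like this is mechanical: keep the header line, any frame whose file path sits under the application tree, and the final exception line. Here is a minimal sketch; the `/app/myproject` prefix and the function name are illustrative assumptions, not part of any framework:

```python
import re

def trim_traceback(trace: str, app_prefix: str = "/app/myproject") -> str:
    """Keep only application frames and the final exception line."""
    lines = trace.splitlines()
    kept = []
    i = 0
    while i < len(lines):
        line = lines[i]
        m = re.match(r'\s*File "([^"]+)"', line)
        if m:
            # Keep frames from our code; drop stdlib and site-packages frames
            if m.group(1).startswith(app_prefix):
                kept.append(line)
                # Also keep the source line printed beneath the frame, if any
                if i + 1 < len(lines) and not lines[i + 1].lstrip().startswith("File"):
                    kept.append(lines[i + 1])
                    i += 1
        elif not line.startswith(" "):
            # Unindented lines are the header and the final exception message
            kept.append(line)
        i += 1
    return "\n".join(kept)
```

Applied to the Django traceback above, this keeps the `views.py` frame and the `DoesNotExist` line and drops the threading and framework frames — typically an 80%+ reduction.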

Linter and formatter output

ESLint, Pylint, Rubocop, clippy — these tools often report on every issue in a file, not just the ones the agent introduced. The agent changes 3 lines, runs the linter, and gets 25 warnings, 20 of which were pre-existing. It has no way to distinguish what it caused from what was already there, so it either tries to fix everything (burning more tokens and potentially breaking things) or gets confused about which issues are relevant.

Program stdout during development

When an agent runs a development server or a script, the stdout can be extremely chatty. Express logs every request. Django prints SQL queries in debug mode. Docker compose outputs interleaved logs from multiple containers. A database migration tool prints a line for every migration that’s already been applied before getting to the new one.

I’ve seen agents consume 5,000+ tokens on Docker compose output when all they needed to know was whether the containers started successfully.

The math on wasted tokens

Let’s put rough numbers on this.

A typical agentic coding session where the agent is fixing a bug might involve:

Action                                    Total tokens   Useful tokens   Waste
Run test suite (200 tests, 3 failures)        1,200            180        85%
Read build output (successful build)            400             20        95%
Read stack trace from error                     600             90        85%
Run linter after fix                            800            120        85%
Run test suite again (all pass)                 900             30        97%
Read dev server logs                          1,500            100        93%
Total                                         5,400            540        90%

That’s a single bug fix cycle. A real session might have 5-10 of these cycles. So you’re looking at 25,000-50,000 wasted tokens per session on output noise alone.

At current API pricing (input tokens on Claude Sonnet at $3/million, GPT-4o at $2.50/million), this is a few cents per session. That sounds cheap, and for an individual developer it is. But scale it up.

A team of 20 engineers, each running 10 agentic sessions per day, each session wasting ~40,000 tokens on output noise: that’s 8 million wasted tokens per day. At $3/million, that’s $24/day, or about $500/month across ~21 working days, just on reading test output and build logs for things that worked fine. For a large company with hundreds of engineers, multiply accordingly.
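The arithmetic scales linearly, so it’s easy to plug in your own numbers. A quick back-of-the-envelope calculator (the defaults are the assumptions above, nothing more):

```python
def monthly_waste_usd(engineers: int,
                      sessions_per_day: int,
                      wasted_tokens_per_session: int,
                      usd_per_million_tokens: float = 3.0,
                      working_days: int = 21) -> float:
    """Rough monthly cost of tool-output noise for a team."""
    daily_tokens = engineers * sessions_per_day * wasted_tokens_per_session
    return daily_tokens / 1_000_000 * usd_per_million_tokens * working_days

# 20 engineers x 10 sessions x 40K wasted tokens = 8M tokens/day
# -> $24/day -> ~$504/month at $3 per million input tokens
```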

The cost is real but not catastrophic. The context pollution problem is worse.

Context pollution is worse than context cost

Here’s the thing that bothers me more than the cost: wasted tokens don’t just cost money. They take up space in the context window. And the context window is a fixed resource that directly affects how well the model reasons.

Think of it this way. An agent working with a 200K token context window has, in theory, plenty of room. But in practice, that window fills up fast: system prompts, conversation history, file contents, tool call results, the agent’s own reasoning. By the time you’re a few iterations into a debugging session, you might have 80K tokens of context, of which 30K is output noise from test runs and build logs.

That noise degrades performance in measurable ways.

Attention dilution. Transformer attention is not unlimited. The more tokens in the context, the more the model has to distribute its attention. Irrelevant tokens compete for attention with relevant ones. A 6-line error message buried in 400 lines of passing test output is harder for the model to focus on than the same 6-line error message presented alone. Research on “lost in the middle” effects shows that models are worse at utilizing information that appears in the middle of long contexts. Guess where your test failure ends up when it’s surrounded by 200 passing tests? Right in the middle.

Reasoning pollution. When the agent reads verbose output, it sometimes latches onto irrelevant details. I’ve watched agents notice a deprecation warning in test output and spend several turns “fixing” it, even though the actual task was to fix a failing assertion. The deprecation warning was noise, but it was in the context, and the model treated it as signal.

Context eviction. When the context window fills up, something has to get dropped. In most agent frameworks, the oldest messages get summarized or truncated. If your context is 40% output noise, the useful content gets evicted sooner than it should. The agent forgets the original instructions or earlier reasoning because test output from three runs ago is still taking up space.

Slower responses. More input tokens means longer time-to-first-token. This is a direct relationship. Every extra token of noise adds latency to the response. On a session with 30K tokens of accumulated noise, you’re waiting noticeably longer for each agent turn.

Why agents are bad at filtering their own input

You might think: can’t the agent just ignore the irrelevant parts of the output? It’s a language model. It should be able to read 400 lines of test output and focus on the failure.

In theory, yes. In practice, it’s unreliable for a few reasons.

The model processes all tokens. There’s no way for a language model to “skip” tokens in its input. Every token gets processed through every attention layer. The model doesn’t read the first few lines, decide the test output is mostly passing tests, and skim the rest. It processes all of it with equal computational cost. The filtering has to happen before the tokens enter the context, not after.

Agents don’t control their own input. When a coding agent runs a shell command, it gets back whatever the command printed to stdout and stderr. The agent didn’t choose to read 400 lines of test output. It chose to run pytest, and 400 lines is what came back. The agent framework stuffs that output into the context, and the model has to deal with it.

Output formats aren’t designed for LLMs. Test runners, build tools, and linters were designed for human developers reading a terminal. Humans are good at scanning output visually, skipping the green dots, and focusing on the red text. LLMs don’t have that visual scanning ability. They process text linearly. A format that’s scannable for humans (green dots for passes, red blocks for failures) is just more tokens for an LLM.

Tool output is append-only. In most agent frameworks, tool results get appended to the conversation history as a single message. The agent can’t go back and trim a previous tool result. Once those 400 lines are in the context, they’re in the context until the window rolls over.

What we can do about it

There are practical approaches, some of which exist today and some that should.

Smarter tool wrappers

Instead of running raw pytest and dumping the output into context, wrap the tool to filter the output before it reaches the model.

import shlex
import subprocess

def run_tests_filtered(command: str) -> str:
    # shlex.split avoids shell=True; capture both streams as text
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)

    if result.returncode == 0:
        # In practice, parse the count and duration out of the summary line
        return "All tests passed (247 tests in 12.3s)"

    # Only return the failures and summary
    return extract_failures(result.stdout + result.stderr)

Some agent frameworks are starting to do this. Claude Code, for instance, does some output truncation on long command results. But it’s blunt — it truncates by length, not by relevance. Cutting off the last 200 lines of a 500-line output might drop the summary that contains the most useful information.
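For pytest specifically, the extract_failures helper above doesn’t need a model at all — a pass over pytest’s section banners gets most of the way. A rough sketch, assuming pytest’s default long-form output (the exact banner text varies across versions and plugins):

```python
import re

def extract_failures(output: str) -> str:
    """Pull the FAILURES section and the final summary line out of pytest output."""
    kept = []
    in_failure = False
    for line in output.splitlines():
        # Section banner like "=========== FAILURES ==========="
        if re.match(r"=+ FAILURES =+", line):
            in_failure = True
            continue
        # Any new "====" banner (short summary, coverage) ends the section
        if in_failure and line.startswith("="):
            in_failure = False
        if in_failure:
            kept.append(line)
        # Keep the one-line tally, e.g. "=== 1 failed, 199 passed in 4.2s ==="
        if line.startswith("=") and "failed" in line:
            kept.append(line)
    return "\n".join(kept).strip()
```

On the 847-token pytest run described earlier, this kind of extraction is what gets you down to the ~94 tokens the agent actually needs.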

Structured output from tools

The real fix is for test runners and build tools to offer structured output modes designed for machine consumption.

pytest has --tb=short and --tb=line to reduce traceback verbosity. It also has --no-header and -q (quiet mode). Combining these can reduce output by 80%+. But most agents don’t use these flags by default, because the agent doesn’t know to ask for less verbose output unless you tell it to.

Jest has --verbose=false and reporters that output JSON. Go has -json for JSON test output. But again, agents default to the human-readable format.

The agent should use these. Better yet, agent frameworks should have built-in knowledge of common tools and automatically use the quietest output format that still captures failures.

# What agents typically run:
pytest tests/

# What they should run:
pytest tests/ -q --tb=short --no-header -x

# Or, with the pytest-json-report plugin installed:
pytest tests/ --json-report --json-report-file=- 2>/dev/null | jq '.tests[] | select(.outcome == "failed")'

Output summarization as middleware

A lightweight model (or even regex-based filtering) can sit between the tool output and the agent’s context, summarizing verbose output before it gets consumed.

Raw output: 847 tokens (200 test results, 3 failures, coverage table)
Summarized: 94 tokens (3 failure messages with file/line references)

This is the approach with the most potential. The summarization doesn’t need a powerful model — it’s a structured extraction task. A small model or even pattern matching can identify failures in test output, errors in build output, and application frames in stack traces.
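A sketch of what that middleware layer might look like: per-tool filters keyed on the command, with unrecognized commands passed through untouched. All of the names here are illustrative, not any real framework’s API:

```python
import re

def summarize_build(output: str) -> str:
    # Keep only lines that mention an error; collapse a clean build to one line
    errors = [l for l in output.splitlines() if re.search(r"\berror\b", l, re.I)]
    return "\n".join(errors) if errors else "Build succeeded."

def summarize_tests(output: str) -> str:
    # Keep failure lines and the pass/fail tally; fall back to the tail
    kept = [l for l in output.splitlines()
            if "FAIL" in l or re.search(r"\d+ (passed|failed)", l)]
    return "\n".join(kept) if kept else output[-500:]

# Map a command prefix to a filter function
FILTERS = {
    "pytest": summarize_tests,
    "npm test": summarize_tests,
    "cargo build": summarize_build,
    "webpack": summarize_build,
}

def filter_tool_output(command: str, output: str) -> str:
    for prefix, fn in FILTERS.items():
        if command.startswith(prefix):
            return fn(output)
    return output  # unknown tool: pass through unchanged
```

The interesting design question is where this sits: it has to run between the subprocess and the context window, which is exactly the hook most agent frameworks don’t expose today.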

Diff-aware linting

When the agent runs a linter after making changes, the framework should diff the linter output against the pre-change baseline and only show the agent the new issues. This eliminates the pre-existing warning problem entirely.

Before change: 23 warnings
After change:  25 warnings
Show agent:    2 new warnings (with file/line/message)

Some CI systems already do this (GitHub’s annotation system only shows new issues). Agent frameworks should adopt the same pattern.
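A naive version of this is just a set difference over the linter’s line-oriented output. A sketch, with the obvious caveat noted in the comment:

```python
def new_lint_issues(baseline: str, current: str) -> list[str]:
    """Return lint lines present after the change but absent before.

    Assumes one issue per line. A real implementation would key on
    rule ID + message rather than raw lines, since line numbers shift
    whenever code above an old issue moves.
    """
    before = set(baseline.splitlines())
    return [line for line in current.splitlines() if line and line not in before]
```

In the 23-warnings-before, 25-after scenario above, this hands the agent exactly the 2 new warnings and nothing else.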

Streaming with early termination

For test suites, the agent often doesn’t need to wait for all 200 tests to finish. If the first failure appears at test 15, the agent can stop reading and start fixing. The -x (fail fast) flag in pytest does this. --bail in Jest does it. But agents rarely use them, and frameworks don’t enforce them.

Even without fail-fast, streaming tool output with a callback that lets the agent decide “I’ve seen enough” would help. Run the test suite, stream the output, and the moment a failure appears, pause and let the agent act on it before continuing.
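A rough sketch of the streaming idea, assuming a pytest-style FAILED marker in the output (other runners would need their own markers):

```python
import subprocess

def run_until_failure(args: list[str], marker: str = "FAILED") -> str:
    """Stream a command's output line by line and stop at the first failure.

    Returns only the first failing line, so a long suite never floods
    the context. A fuller version would also capture the lines that
    follow the marker, where the assertion diff usually lives.
    """
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    collected = []
    for line in proc.stdout:
        if marker in line:
            collected.append(line.rstrip())
            proc.terminate()  # stop the run; the agent can act now
            break
    proc.wait()
    return "\n".join(collected)
```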

Context budgets for tool output

Agent frameworks should enforce budgets on how much context tool output can consume. If a test run produces 1,200 tokens of output, but the tool output budget is 300 tokens, the framework should automatically summarize or truncate before injecting into context.

This is different from the current approach of truncating the entire context window when it gets full. It’s preemptive — limiting noise at the source rather than cleaning it up after the context is already polluted.
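A sketch of such a budget gate, using the common ~4 characters per token rule of thumb (an assumption — a real implementation would use the model’s tokenizer, and would summarize rather than blindly truncate):

```python
def enforce_output_budget(output: str, budget_tokens: int = 300) -> str:
    """Clamp tool output to a token budget before it enters the context.

    Keeps the head and tail of the output, since setup info tends to
    cluster at the start and errors and summaries at the end.
    """
    budget_chars = budget_tokens * 4  # rough chars-per-token heuristic
    if len(output) <= budget_chars:
        return output
    half = budget_chars // 2
    omitted = len(output) - budget_chars
    return (output[:half]
            + f"\n[... {omitted} chars of tool output elided ...]\n"
            + output[-half:])
```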

Where this is going

I think we’ll see three things happen over the next year or so.

Agent frameworks will get opinionated about tool output. Right now, most frameworks treat tool output as an opaque blob. Run command, get string, stuff it in context. That’s going to change. Frameworks will ship with parsers for common tool output formats (pytest, jest, tsc, eslint, cargo, go test) and automatically extract the relevant information. The raw output might still be logged somewhere for debugging, but only the parsed summary enters the context.

Test runners and build tools will add agent-friendly output modes. Some already have JSON output, but it’s usually designed for CI systems, not LLMs. I’d expect to see --llm or --agent flags that produce output specifically optimized for machine consumption: just the failures, just the errors, just the new warnings, with file paths and line numbers in a consistent format. The first test framework that ships a native “agent mode” output format is going to get copied quickly.

Token-aware agent architectures will emerge. Right now, agents treat the context window as free until it’s full, and then panic. Smarter architectures will track token budgets per category — so much for system prompt, so much for conversation history, so much for file contents, so much for tool output — and enforce these budgets actively. Tool output will get the smallest budget because it has the lowest information density.

The underlying insight is simple: not all tokens are created equal. A token of the user’s instructions is worth more than a token of the agent’s reasoning, which is worth more than a token of source code, which is worth a lot more than a token of passing test output. Agent architectures that understand this hierarchy and allocate context accordingly will outperform those that treat every token the same.

We’re still early. Right now, the default behavior of every major coding agent is to dump raw tool output into the context and hope the model sorts it out. That works well enough when context windows are cheap and large. But as agents take on longer, more complex tasks — multi-file refactors, full feature implementations, debugging sessions that span dozens of iterations — the waste compounds. A 90% waste rate on tool output is sustainable for a 5-minute task. It’s not sustainable for a 2-hour autonomous coding session.

The fix isn’t complicated. Filter the noise before it enters the context. Use quiet output modes. Parse structured output. Diff against baselines. Budget token allocations by category. None of this requires new research. It’s engineering work, and it matters more than most people realize.

“Detectives follow the money. AI engineers should follow the tokens — that’s where the inefficiencies hide.” – Rushi
