A viral Claude Code skill claims to cut 65% of output tokens by making LLMs talk like cavemen. Two research papers suggest forced brevity can actually improve accuracy in large models. But tokens are also compute — and nobody has benchmarked whether caveman-speak helps or hurts code quality. A look at the arguments on both sides.

Table of contents:

  1. What CAVEMAN does
  2. The research angle
  3. The case for fewer tokens
  4. The case against
  5. The interesting middle ground
  6. What this is really about

A Claude Code skill called CAVEMAN went viral for making LLM agents drop articles, filler, and pleasantries from their output. The README claims ~65% fewer output tokens with “full technical accuracy.” 12,000+ GitHub stars later, the internet is split on whether this is clever engineering or self-inflicted brain damage.

Here’s the actual question worth asking: when you force an LLM to compress its output, what are you gaining, and what are you quietly losing?

What CAVEMAN does

CAVEMAN is a skill file (a markdown prompt) that you drop into Claude Code or Codex. It rewrites the agent’s communication style: no preamble, no “I’d be happy to help,” no filler paragraphs explaining what it’s about to do. Output comes back clipped, direct, sometimes genuinely funny.
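
A skill file is nothing exotic: markdown with a bit of frontmatter that Claude Code loads as instructions. A minimal sketch of roughly that shape (the actual CAVEMAN skill is more elaborate; every rule below is illustrative, not the real file):

```markdown
---
name: caveman
description: Respond in compressed, article-free style. Code stays normal.
---

Respond like caveman. Rules:
- No greetings, no preamble, no apologies.
- Drop articles (a, an, the) and filler verbs in prose.
- Keep code, identifiers, paths, and commands exactly as usual.
- One short sentence per action: problem, fix, result.
```

The important design constraint is the third rule: the style rewrite applies only to the conversational wrapper, never to the code itself.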

Normal Claude:

I’ll analyze the error in your authentication middleware. The issue appears to be that the token validation function is not properly handling expired tokens. Let me fix this by adding an expiration check before the signature verification step.

CAVEMAN Claude:

Auth middleware no check expired token. Add expiry check before sig verify. Fix now.

The code changes themselves aren’t affected. The skill only compresses the conversational wrapper around the code. It also ships with a WENYAN mode (even more terse), a compression tool that cuts input context by ~45%, and one-line code review formatting.
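
The input-side compression is easy to sketch. A toy version, assuming nothing about the real tool's implementation — just dropping low-information words from prose while leaving anything that looks like code alone:

```python
# Toy context compressor. The real CAVEMAN compression tool is more
# sophisticated; this only illustrates the idea of stripping filler words.
FILLER = {
    "a", "an", "the", "please", "just", "really", "basically",
    "actually", "very", "that", "simply", "certainly",
}

def compress(text: str) -> str:
    """Naively compress prose by dropping low-information words."""
    out_lines = []
    for line in text.splitlines():
        if line.startswith(("    ", "\t")):  # crude "looks like code" check
            out_lines.append(line)           # never touch code lines
            continue
        kept = [w for w in line.split() if w.lower().strip(".,!?") not in FILLER]
        out_lines.append(" ".join(kept))
    return "\n".join(out_lines)

prose = "Please take a look at the function and just fix the bug."
short = compress(prose)
print(short)  # take look at function and fix bug.
ratio = 1 - len(short.split()) / len(prose.split())
print(f"{ratio:.0%} fewer words")  # 42% fewer words
```

Even this crude version lands in the neighborhood of the claimed ~45% input reduction on filler-heavy prose, which is a hint that most of the savings come from exactly this kind of low-hanging fruit.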

The research angle

Two papers lend the idea some credibility.

Renze & Guven (2024) tested “Concise Chain-of-Thought” prompting against standard CoT across GPT-3.5 and GPT-4. CCoT cut response length by 48.7% with negligible accuracy loss on most tasks. The catch: GPT-3.5 took a 27.69% accuracy hit on math problems specifically. The bigger model handled brevity fine; the smaller one struggled. (arXiv:2401.05618)

Hakim (March 2026) went further, testing 31 models from 0.5B to 405B parameters across 1,485 problems. The finding that turned heads: on certain benchmarks, larger models actually underperformed smaller ones because they over-elaborated. Constraining large models to brief answers improved accuracy by 26 percentage points and reversed the performance hierarchy entirely on math and science tasks. The mechanism is “scale-dependent verbosity” — bigger models ramble more, and rambling introduces errors. (arXiv:2604.00025)

So there’s real evidence that forced brevity can help. But neither paper studied anything resembling “talk like a caveman.” They studied “be concise” — a different instruction with different downstream effects.

The case for fewer tokens

Cost and latency are real. Output tokens are expensive. If you’re running an agent across dozens of turns in a session, the conversational overhead accumulates. Cutting “Sure, I’d be happy to help you with that! Let me take a look at the code” from every response adds up.
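
Back-of-the-envelope numbers make the cost argument concrete. All figures below are assumptions for illustration (a guessed ~30 wrapper tokens per turn and a placeholder output price), not measurements:

```python
# Rough cost of conversational wrapper across agent sessions.
# Every number here is an assumption for illustration, not a measurement.
wrapper_tokens_per_turn = 30      # "Sure, I'd be happy to help..." etc.
turns_per_session = 50
sessions_per_day = 20
price_per_million_output = 15.00  # USD, placeholder rate

daily_wrapper_tokens = wrapper_tokens_per_turn * turns_per_session * sessions_per_day
daily_cost = daily_wrapper_tokens / 1_000_000 * price_per_million_output
print(daily_wrapper_tokens)   # 30000
print(round(daily_cost, 2))   # 0.45
```

Under these assumptions the dollar savings are modest — pennies per day per seat. The stronger version of the argument is latency: each wrapper token is a serial forward pass the user sits through on every single turn.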

Filler tokens carry near-zero information. One HN commenter made a sharp observation: if a tiny language model and a frontier model would both predict the same next token (like “the” or “is” after certain contexts), that token contains almost no meaningful signal. The big model isn’t smuggling secret computations into the word “the.” Low-entropy tokens are low-entropy tokens.
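
The “near-zero information” claim is literal Shannon information: a token whose predicted probability is close to 1 carries almost no bits. A quick illustration with made-up next-token distributions (the probabilities are invented, not measured from any model):

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# After "I'd be happy to", the token "help" is nearly certain (made-up numbers).
filler_dist = [0.97, 0.02, 0.01]
# Choosing which function to name in a diagnosis is genuinely uncertain.
content_dist = [0.30, 0.25, 0.20, 0.15, 0.10]

print(round(entropy_bits(filler_dist), 2))   # 0.22 bits: almost no signal
print(round(entropy_bits(content_dist), 2))  # 2.23 bits: real information
```

This is the sense in which a tiny model and a frontier model “agree” on filler: when the distribution is that peaked, there is nothing left for the big model to add.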

It’s easier to read. Multiple people in the HN thread reported that caveman output was paradoxically clearer than standard LLM output. One person put it well: “seeing something framed by a caveman in a couple of occasions peeled back a layer I didn’t see before.” Standard LLM prose is smooth in a way that lets your eyes glaze over. Compressed output forces you to engage with each word.

Chinese developers have been doing this by accident. Chinese lacks articles, doesn’t conjugate verbs, and a single character often carries as much meaning as an English word. Multiple commenters noted that Chinese “vibe coding” works fine while apparently using 30-40% fewer tokens. If an entire language can function without English’s grammatical filler, maybe the filler was never carrying the weight we assumed.

The case against

Tokens are compute, not just text. This is the strongest counterargument by far. Each time an LLM produces a token, the entire model runs a forward pass through all its layers. Those layers perform computations that build on the KV cache of all prior tokens. Even a “filler” token like “sure” triggers a forward pass whose internal hidden states influence every subsequent token’s generation.

One commenter put it precisely: “Outputting filler tokens basically doesn’t require much thinking for an LLM, so the attention budget can be used to compute something else during the forward passes of producing that token.” The implication is counterintuitive — low-information output tokens might be free slots where the model does useful planning work internally, even if the token itself is predictable.

Think of it like a chess player fidgeting with a piece while thinking. The fidgeting isn’t the thinking, but the time spent fidgeting is time spent thinking.
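
The scale of “tokens are compute” is worth making concrete. A dense decoder-only transformer spends roughly 2 FLOPs per parameter per generated token (one multiply-add through each weight), so even pleasantries are expensive in absolute terms. The model size below is an assumption chosen for illustration:

```python
# Rough FLOPs per generated token for a dense decoder-only transformer:
# ~2 FLOPs per parameter per token (multiply-add), attention cost aside.
params = 70e9                 # assumed 70B-parameter model, for illustration
flops_per_token = 2 * params

wrapper_tokens = 30           # one turn's worth of pleasantries (assumed)
wasted = flops_per_token * wrapper_tokens
print(f"{wasted:.1e} FLOPs")  # 4.2e+12 FLOPs on "Sure, I'd be happy to..."
```

Whether those trillions of FLOPs are “wasted” is exactly the dispute: the hidden states computed during those passes land in the KV cache and feed every subsequent token, which is what the chess-fidgeting analogy is gesturing at.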

Training distribution mismatch. LLMs are trained on mountains of well-formed English. Caveman speak is not well-formed English. When you force a model into a register it rarely saw during training, you’re pushing it into a region of token-space where its predictions are less calibrated.

As one skeptic noted: “in most contexts it has seen ‘caveman’ talk the conversations haven’t been about rigorously explained maths/science/computing/etc… so it is less likely to predict that output.” The model’s internal associations between caveman-style phrasing and correct technical content are weaker than its associations between standard prose and correct content.

The style constraint itself costs attention. Producing caveman-style output isn’t free. The model has to simultaneously solve the user’s problem and figure out how to express the answer without articles, with simplified grammar, in a voice it wasn’t optimized for. That’s two objectives competing for the same attention budget.

Someone who does a lot of LLM fiction writing noticed the corollary: “the harder the LLM has to reason about what it’s saying, the spottier its adherence to any output style or character voicing instructions will be.” Complex reasoning and style constraints fight over the same resources.

No benchmarks exist. The author admits this. The ~65% token reduction claim comes from preliminary testing, not a rigorous eval. Nobody has published results comparing CAVEMAN Claude vs. standard Claude on SWE-bench, HumanEval, or any other coding benchmark. The author is reportedly working on proper evals, but until those land, the accuracy claim is a vibe.
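
The missing eval is straightforward to sketch, at least in shape: same tasks, two system prompts, paired pass rates. Everything here is hypothetical — `run_agent` is a stub standing in for whatever harness actually drives the agent (a SWE-bench runner, say), and the prompts are illustrative:

```python
# Skeleton of the missing A/B eval. `run_agent` is a stand-in for a real
# harness; here it is stubbed so the skeleton runs end to end.
def run_agent(task: str, style_prompt: str) -> bool:
    """Stub: pretend to run the agent and check its patch against tests."""
    return True  # a real harness would apply the patch and run the task's tests

CAVEMAN_PROMPT = "Respond like caveman. No filler."  # illustrative
DEFAULT_PROMPT = ""                                  # stock behavior

def compare(tasks):
    """Run every task under both styles; return per-style pass rates."""
    results = {"default": 0, "caveman": 0}
    for task in tasks:
        results["default"] += run_agent(task, DEFAULT_PROMPT)
        results["caveman"] += run_agent(task, CAVEMAN_PROMPT)
    n = len(tasks)
    return {k: v / n for k, v in results.items()}

print(compare(["fix-auth-expiry", "add-retry-logic"]))
```

The hard part isn't the loop; it's running enough paired tasks, with token counts logged alongside pass rates, to say whether the 65% reduction costs anything. That's the table the README currently can't show.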

The interesting middle ground

The most thoughtful take from the HN thread came from someone who used CAVEMAN not to save money, but as an alternative reading mode:

“its kind of great for the eli5, not because it’s any more right or wrong, but sometimes presenting it in caveman presents something to me in a way that’s almost like… really clear and simple. it feels like it cuts through bullshit just a smidge.”

This points at something the token-counting debate misses. CAVEMAN might be most valuable not as an optimization technique but as a cognitive tool. When everything an LLM says is clipped and compressed, you can’t skim. You actually have to process each word. That changes how you interact with the output.

There’s also a plausible task-dependent split. For coding agents doing straightforward file edits and running tests, the conversational wrapper genuinely adds nothing. “Auth middleware no check expired token. Fix now.” contains every bit of information you need. But for tasks requiring nuanced explanation — debugging a subtle race condition, explaining architectural tradeoffs — forcing brevity might strip out the connective tissue that makes the explanation useful.

The author said something reasonable in the thread: “my current position is: interesting idea, narrower claim than some people think, needs benchmarks, and the README should be more precise until those exist.”

What this is really about

CAVEMAN is a single markdown file that rewrites how an agent talks. It didn’t invent compressed prompting, and it’s not a research breakthrough. But it touched a nerve because it asks a question a lot of people have been dancing around: how much of what LLMs say is actually for the LLM, and how much is for us?

The filler, the pleasantries, the “I’d be happy to help” — some fraction of that is genuine computation substrate (hidden states piggybacking on low-entropy tokens). Some fraction is RLHF conditioning that makes the model sound helpful. And some fraction is pure waste: tokens that cost money, fill context windows, and communicate nothing.

We don’t have good tools yet for telling these apart. Until we do, CAVEMAN is a bet: that for coding tasks specifically, the waste fraction is large enough that cutting it produces a net win. That bet might be right. But treat it as what the author says it is — an interesting idea that needs benchmarks — and not as a free 65% efficiency gain with zero tradeoffs.

The 12,000 stars say people want this to work. The missing evals say we don’t know yet if it does.

“Caveman talk, AI listens.” – Rushi