This skill is not intended to reduce hidden reasoning / thinking tokens. Anthropic’s own docs suggest more thinking budget can improve performance, so I would not claim otherwise.
What it targets is the visible completion: less preamble, less filler, less polished-but-nonessential text. Therefore, since post-completion output is “cavemanned” the code hasn’t been affected by the skill at all :)
Also surprising to hear so little faith in RL. Quite sure that the models from Anthropic have been so heavily tuned to be coding agents that you cannot “force” a model to degrade immensely.
The fair criticism is that my “~75%” README number is from preliminary testing, not a rigorous benchmark. That should be phrased more carefully, and I’m working on a proper eval now.
Also yes, skills are not free: Anthropic notes they consume context when loaded, even if only skill metadata is preloaded initially.
So the real eval is end-to-end: - total input tokens - total output tokens - latency - quality/task success
There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality, though it is task-dependent and can hurt in some domains. (https://arxiv.org/html/2401.05618v3)
So my current position is: interesting idea, narrower claim than some people think, needs benchmarks, and the README should be more precise until those exist.
(And it's for a similar reason, I think, that deliberative models like rewriting your question in their own terms before reasoning about it. They're decreasing the per-token re-parsing overhead of attending to the prompt [by distilling a paraphrase that obviates any need to attend to the literal words of it], so that some of the initial layers that would either be doing "figure out what the user was trying to say" [i.e. "NLP stuff"] or "figure out what the user meant" [i.e. deliberative-reasoning stuff] — but not both — can focus on the latter.)
I haven't done the exact experiment you'd want to do to verify this effect, i.e. "measuring LLM benchmark scores with vs without an added requirement to respond in a certain speaking style."
But I have (accidentally) done an experiment that's kind of a corollary to it: namely, I've noticed that in the context of LLM collaborative fiction writing / role-playing, the harder the LLM has to reason about what it's saying (i.e. the more facts it needs to attend to), the spottier its adherence to any "output style" or "character voicing" instructions will be.
I.e. by demanding the model to be concise, you're literally making it dumber.
(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)
> Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman
For the first part of this: couldn’t this just be a UserSubmitPrompt hook with regex against these?
See additionalContext in the json output of a script: https://code.claude.com/docs/en/hooks#structured-json-output
For the second, /caveman will always invoke the skill /caveman: https://code.claude.com/docs/en/skills
Thank God there is still neverending wars, otherwise authoritarian governments would have no fun left.
Like "Sea world" or "see the world".
This only makes sense if you assume that you are the consumer of the response. When compacting, harnesses typically save a copy of the text exchange but strip out the tool calls in between. Because the agent relies on this text history to understand its own past actions, a log full of caveman-style responses leaves it with zero context about the changes it made, and the decisions behind them.
To recover that lost context, the agent will have to execute unnecessary research loops just to resume its task.
But combining this with caveman? Gold!
I don't think it would be fundamentally very surprising if something like this works, it seems like the natural extension to tokenisation. It also seems like the natural path towards "neuralese" where tokens no longer need to correspond to units of human language.
if goal make code, few word better. if goal make insight, more word better. depend on task. machine linear, mind not. consider LLM "thinking" is just edge-weights. if can set edge-weights into same setting with fewer tokens, you are winning.
https://developers.openai.com/api/reference/resources/respon...
I don't know their internal eval, but I think I have heard it does not hurt or improve performance. But at least this parameter may affect how many comments are in the code.
Thanks to chain of thought, actually having the LLM be explicit in its output allows it to have more quality.
Mass fun. Starred.
Not sure how effective it will be to dirve down costs, but honestly it will make my day not to have to read through entire essays about some trivial solution.
tldr; Claude skill, short output, ++good.
I have a feeling these same people will complain “my model is so dumb!”. There’s a reason why Claude had that “you’re absolutely right!” for a while. Or codex’s “you’re right to push on this”.
We’re basically just gaslighting GPUs. That wall of text is kinda needed right now.