If it's the latter, then users will pay uncached rates for the entire token history since the change: https://platform.claude.com/docs/en/build-with-claude/prompt...
How is this better?
It seems like a tool that solves a problem that won't last longer than a couple of months, and one that e.g. Claude Code can, and probably will, tackle itself soon.
The intent-conditioned compression is the interesting part here. Most context management I've seen is either naive truncation or generic summarization that doesn't account for why the tool was called. Training classifiers on model internals to
figure out which tokens carry signal for a given task -- that's doing something different from what frameworks offer out of the box.
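To make the idea concrete, here is a minimal sketch of intent-conditioned token scoring: a tiny linear classifier over per-token "hidden state" features, gated by a task-intent vector. All names, shapes, and numbers are illustrative stand-ins, not the project's actual API or training setup.

```python
# Hypothetical sketch: score each token's relevance to the current task
# intent, then keep only tokens that clear a threshold. The "hidden states"
# here are toy 2-dim feature vectors, not real model internals.
from dataclasses import dataclass
import math


@dataclass
class ScoredToken:
    text: str
    score: float


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score_tokens(tokens, hidden, intent, weights):
    """Score each token via a linear form over (intent x feature) products."""
    out = []
    for tok, feats in zip(tokens, hidden):
        logit = sum(w * i * f for w, i, f in zip(weights, intent, feats))
        out.append(ScoredToken(tok, sigmoid(logit)))
    return out


def compress(tokens, hidden, intent, weights, threshold=0.5):
    """Drop tokens whose relevance score falls below the threshold."""
    scored = score_tokens(tokens, hidden, intent, weights)
    return [s.text for s in scored if s.score >= threshold]


# Toy example: a hypothetical "debugging" intent weights the first feature,
# so error-related tokens survive and filler is stripped.
tokens = ["ERROR", "at", "line", "42", "lorem", "ipsum"]
hidden = [[3.0, 0.1], [0.2, 0.1], [1.5, 0.0],
          [2.0, 0.2], [-2.0, 0.1], [-3.0, 0.0]]
intent = [1.0, 0.0]
weights = [1.0, 1.0]
print(compress(tokens, hidden, intent, weights))
# → ['ERROR', 'at', 'line', '42']
```

The point of the gating is that the same tool output compresses differently under a different intent vector; a retrieval-style intent would keep a different subset than a debugging one.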
I poked around the repo and didn't see any evals measuring compression quality. You cite the GPT-5.4 long-context accuracy drop as motivation, which makes sense -- but the natural follow-up is: does your compression actually recover that accuracy?
Something like SWE-bench pass rates with and without the gateway at various context lengths would go a long way. Without that, it's hard to tell if the SLM is making good decisions or just making the context shorter.
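The comparison suggested above could be scripted as a simple A/B harness: run the same task suite with compression on and off and compare pass rates. `run_agent`, `fake_agent`, and the task list below are hypothetical stand-ins, not the repo's real interface; a real version would plug in actual SWE-bench tasks.

```python
# Hypothetical eval harness: measure pass rate with and without the
# compression gateway. The stub agent simulates long-context failures.
from typing import Callable


def pass_rate(tasks, run_agent: Callable[[dict, bool], bool],
              compress: bool) -> float:
    """Fraction of tasks the agent solves under the given compression setting."""
    passed = sum(1 for t in tasks if run_agent(t, compress))
    return passed / len(tasks)


def fake_agent(task: dict, compress: bool) -> bool:
    # Stand-in behavior: pretend tasks over 50k context tokens fail
    # unless compression keeps the context short.
    return task["context_tokens"] < 50_000 or compress


tasks = [{"context_tokens": n} for n in (10_000, 60_000, 120_000)]
baseline = pass_rate(tasks, fake_agent, compress=False)
gateway = pass_rate(tasks, fake_agent, compress=True)
print(f"baseline={baseline:.2f} gateway={gateway:.2f}")
# → baseline=0.33 gateway=1.00
```

Sweeping the task set over several context-length buckets would show whether the compression actually recovers the long-context accuracy drop or merely shortens the prompt.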
A few other things I'm curious about:
• How does the SLM handle ambiguous tool calls? E.g., a broad grep where the agent isn't sure what it's looking for yet -- does the compressor tend to be too aggressive in those cases?
• What's the latency overhead per tool call? If the SLM inference adds even 200-300ms per compression step, that compounds fast in agentic loops with dozens of tool calls.
• How often does expand() get triggered in practice? If the agent frequently needs to recover stripped content, that's a signal the compression is too lossy.

The framework I use (ADK) already handles this. It's very low-hanging fruit that should be part of any framework, not something external. In ADK, this is a boolean you can turn on per tool or per subagent; you can even decide turn by turn, or based on any context you see fit, by supplying a function.
YC over-indexed on AI startups too early, not realizing how trivial these startup "products" are: more of a line item in the feature list of a mature agent framework.
I've also seen dozens of this same project submitted by the claws that led to our new rule addition this week. If your project can be vibe coded by dozens of people in mere hours...
The fact that you’re using Small Language Models (SLMs) to detect signal matches my philosophy of using AI as a sparring partner to check its own work. Most developers spend 30% of their day context switching or debugging "hallucinations" that only happen because the model got lost in its own bloated history. The expand() feature is the "trust but verify" layer that every production-ready AI system needs. You’re effectively treating the LLM like a senior architect who doesn't need to see every line of a dependency file unless they specifically ask for it, which is the only way we scale these systems to 10M+ users solo.
Finally, those spending caps and Slack pings are the ultimate "millionaire cheat codes" for leverage. I tell founders all the time that running a business is boring drudgery—it's about fixing bugs and managing resources—and this proxy handles the resource management part on autopilot. If this saves an indie hacker $500/month in token waste while keeping their agent from rage-quitting due to context limits, you’ve built a high-leverage asset. I’m definitely adding this to my links database; it removes a huge excuse for why people "can't afford" to build complex apps.