Blazing fast but it definitely has a small model feel.
It's tearing up Bluey Bench (my personal agent speed benchmark), a file system benchmark where I have the agent generate transcripts for untitled episodes of a season of Bluey, perform a web search to find the episode descriptions, and then match the transcripts against the descriptions to generate file names and metadata for each episode.
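For readers curious what the matching step might look like, here is a minimal sketch, assuming a simple word-overlap (Jaccard) score and greedy one-to-one assignment. This is not the benchmark's actual method (the agent decides how to match); the file names, titles, and text below are invented placeholders, not real episode data.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, used as a crude bag-of-words signature."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_episodes(transcripts: dict[str, str],
                   descriptions: dict[str, str]) -> dict[str, str]:
    """Greedily pair each untitled file with its best-scoring title."""
    scored = sorted(
        ((jaccard(tokens(text), tokens(desc)), path, title)
         for path, text in transcripts.items()
         for title, desc in descriptions.items()),
        reverse=True,
    )
    assigned: dict[str, str] = {}
    used: set[str] = set()
    for _score, path, title in scored:
        # Take the highest-scoring pair whose file and title are both free.
        if path not in assigned and title not in used:
            assigned[path] = title
            used.add(title)
    return assigned


# Hypothetical data for illustration only.
transcripts = {
    "untitled_01.mkv": "we planted seeds in the garden and watered the flowers",
    "untitled_02.mkv": "the kite got stuck in a tree at the park",
}
descriptions = {
    "Example Episode A": "A kite gets stuck in a tree during a trip to the park",
    "Example Episode B": "The kids plant seeds and water flowers in the garden",
}
matches = match_episodes(transcripts, descriptions)
```

A real run would likely need something sturdier than word overlap (stemming, or asking the model itself to judge), since transcripts are much longer and noisier than one-line descriptions.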
Downsides:
- It has to be prompted to do actions in my media library AGENTS.md that the larger models adhere to without additional prompting.
- It's less careful about how it handles context, so its actions are less context-efficient. Combine that with the smaller context window and I'm seeing frequent compactions.
Bluey Bench* (minus transcription time):
Codex CLI
- gpt-5.3-codex-spark low: 20s
- gpt-5.3-codex-spark medium: 41s
- gpt-5.3-codex-spark xhigh: 1m 09s (1 compaction)
- gpt-5.3-codex low: 1m 04s
- gpt-5.3-codex medium: 1m 50s
- gpt-5.2 low: 3m 04s
- gpt-5.2 medium: 5m 20s
Claude Code
- opus-4.6 (no thinking): 1m 04s
Antigravity
- gemini-3-flash: 1m 40s
- gemini-3-pro low: 3m 39s
*Season 2, 52 episodes
If 60% of the work is "edit this file with this content" or "refactor according to this abstraction", then low-latency, high-token-throughput inference seems like a needed improvement.
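To put the Bluey Bench timings posted above in perspective, they can be converted to seconds and compared as ratios. This is a small sketch using only numbers from that list; the helper names `to_seconds` and `speedup` are made up for illustration, and these are single runs from one person's benchmark, so treat the ratios as rough.

```python
import re


def to_seconds(duration: str) -> int:
    """Parse strings like '20s' or '1m 04s' into whole seconds."""
    m = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", duration.strip())
    if m is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return int(m.group(1) or 0) * 60 + int(m.group(2))


# Timings copied from the list above (a subset).
timings = {
    "gpt-5.3-codex-spark low": "20s",
    "gpt-5.3-codex-spark medium": "41s",
    "gpt-5.3-codex low": "1m 04s",
    "gpt-5.3-codex medium": "1m 50s",
    "gpt-5.2 low": "3m 04s",
    "gpt-5.2 medium": "5m 20s",
}


def speedup(slower: str, faster: str) -> float:
    """How many times faster `faster` finished than `slower`."""
    return to_seconds(timings[slower]) / to_seconds(timings[faster])


for effort in ("low", "medium"):
    spark = f"gpt-5.3-codex-spark {effort}"
    for other in (f"gpt-5.3-codex {effort}", f"gpt-5.2 {effort}"):
        print(f"{spark} vs {other}: {speedup(other, spark):.1f}x")
```

At matching effort levels, Spark comes out roughly 3x faster than gpt-5.3-codex and roughly 8-9x faster than gpt-5.2 on this one benchmark.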
Recently someone made a Claude plugin to offload low-priority work to the Anthropic Batch API [1].
Also, I expect both Nvidia and Google to deploy custom silicon for inference [2].
1: https://github.com/s2-streamstore/claude-batch-toolkit/blob/...
2: https://www.tomshardware.com/tech-industry/semiconductors/nv...
I have yet to see this (produce anything actually useful).
> "As we trained Codex-Spark, it became apparent that model speed was just part of the equation for real-time collaboration—we also needed to reduce latency across the full request-response pipeline. We implemented end-to-end latency improvements in our harness that will benefit all models [...] Through the introduction of a persistent WebSocket connection and targeted optimizations inside of Responses API, we reduced overhead per client/server roundtrip by 80%, per-token overhead by 30%, and time-to-first-token by 50%. The WebSocket path is enabled for Codex-Spark by default and will become the default for all models soon."
I wonder if all other harnesses (Claude Code, OpenCode, Cursor, etc.) can make similar improvements to reduce latency. I've been vibe coding (or doing agentic engineering) with Claude Code a lot for the last few days and I've had some tasks take as long as 30 minutes.
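To get a feel for what the quoted percentages could mean for a single agent turn, here is a toy model that treats harness overhead as time-to-first-token plus a fixed cost per roundtrip plus a fixed cost per token. The 80%, 30%, and 50% reductions come from the quote; the baseline millisecond values and the roundtrip and token counts are pure assumptions, since the announcement gives no absolute numbers.

```python
from dataclasses import dataclass


@dataclass
class Overheads:
    ttft_ms: float           # time to first token
    per_roundtrip_ms: float  # overhead per client/server roundtrip
    per_token_ms: float      # overhead per streamed token

    def total_ms(self, roundtrips: int, tokens: int) -> float:
        """Total overhead for one turn under this simple additive model."""
        return (self.ttft_ms
                + roundtrips * self.per_roundtrip_ms
                + tokens * self.per_token_ms)


def apply_quoted_reductions(base: Overheads) -> Overheads:
    """Apply the reductions from the quote: TTFT -50%, roundtrip -80%, per-token -30%."""
    return Overheads(
        ttft_ms=base.ttft_ms * (1 - 0.50),
        per_roundtrip_ms=base.per_roundtrip_ms * (1 - 0.80),
        per_token_ms=base.per_token_ms * (1 - 0.30),
    )


# Hypothetical baseline; these values are NOT from the announcement.
baseline = Overheads(ttft_ms=400.0, per_roundtrip_ms=100.0, per_token_ms=1.0)
improved = apply_quoted_reductions(baseline)

# Hypothetical turn: 20 tool-call roundtrips and 2000 streamed tokens.
before = baseline.total_ms(roundtrips=20, tokens=2000)
after = improved.total_ms(roundtrips=20, tokens=2000)
print(f"overhead before: {before:.0f} ms, after: {after:.0f} ms")
```

With these made-up inputs the overhead drops from about 4.4 s to about 2.0 s. The more roundtrip-heavy the workload, the closer the overall saving gets to the 80% figure, which is why tool-call-heavy agent loops stand to benefit most.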
Example repo that Codex with spark made in about 15 minutes for me since `claude --resume` has been finicky lately: https://github.com/mzxrai/claude-sessions
Quick/Instant LLMs for human use (think UI). Slow, deep thinking LLMs for autonomous agents.
I imagine it's a win-win. This could significantly help their tokenomics.
The example showing a plan being generated instantaneously is interesting. Human understanding will end up as the last, true bottleneck.
Got to wonder why Wall Street is dumping NVIDIA.
(Yes, I know they released /fast last week, but I’m loving the constant one-upmanship.)
Cerebras is a winner here.
> more than 1000 tokens per second
Perhaps, no more?
(Not to mention, if you're waiting for one LLM, sometimes it makes sense to multi-table. I think Boris from Anthropic says he runs 5 CC instances in his terminal and another 5-10 in his browser on CC web.)
> Under the hood, we streamlined how responses stream from client to server and back, rewrote key pieces of our inference stack, and reworked how sessions are initialized so that the first visible token appears sooner and Codex stays responsive as you iterate. Through the introduction of a persistent WebSocket connection and targeted optimizations inside of Responses API, we reduced overhead per client/server roundtrip by 80%, per-token overhead by 30%, and time-to-first-token by 50%. The WebSocket path is enabled for Codex-Spark by default and will become the default for all models soon.
When they partnered with Cerebras, I kind of had a gut feeling that they wouldn't be able to use their technology for larger models because Cerebras doesn't have a track record of serving models larger than GLM.
It pains me that five days before my Codex subscription ends, I have to switch to Anthropic because despite getting less quota compared to Codex, at least I'll be able to use my quota _and_ stay in the flow.
But even setting Codex's slowness aside, it's just not as good an "agentic" model as Opus. Here's what drove me crazy: https://x.com/OrganicGPT/status/2021462447341830582?s=20. The Codex model (gpt-5.3-xhigh) has no idea how to call agents, smh.
Releasing for real? Is it an open model?
Nevermind. [0]