Remix Hacker News Clone

409

GPT‑5.3‑Codex‑Spark

by meetpateltech1770919569 | 187 comments

I love this! I use coding agents to generate web-based slide decks where “master slides” are just components, and we already have rules + assets to enforce corporate identity. With content + prompts, it’s straightforward to generate a clean, predefined presentation. What I’d really want on top is an “improv mode”: during the talk, I can branch off based on audience questions or small wording changes, and the system proposes (say) 3 candidate next slides in real time. I pick one, present it, then smoothly merge back into the main deck. Example: if I mention a recent news article / study / paper, it automatically generates a slide that includes a screenshot + a QR code link to the source, then routes me back to the original storyline. With realtime voice + realtime code generation, this could turn the boring old presenter view into something genuinely useful.

by beklein1770922435

First thoughts using gpt-5.3-codex-spark in Codex CLI:

Blazing fast but it definitely has a small model feel.

It's tearing up bluey bench (my personal agent speed benchmark), which is a file system benchmark where I have the agent generate transcripts for untitled episodes of a season of bluey, perform a web search to find the episode descriptions, and then match the transcripts against the descriptions to generate file names and metadata for each episode.

Downsides:

- It has to be prompted to do actions in my media library AGENTS.md that the larger models adhere to without additional prompting.

- It's less careful with how it handles context which means that its actions are less context efficient. Combine that with the smaller context window and I'm seeing frequent compactions.

  Bluey Bench* (minus transcription time):

  Codex CLI
  gpt-5.3-codex-spark low        20s
  gpt-5.3-codex-spark medium     41s
  gpt-5.3-codex-spark xhigh   1m 09s (1 compaction)

  gpt-5.3-codex low           1m 04s
  gpt-5.3-codex medium        1m 50s

  gpt-5.2 low                 3m 04s
  gpt-5.2 medium              5m 20s

  Claude Code
  opus-4.6 (no thinking)      1m 04s

  Antigravity
  gemini-3-flash              1m 40s
  gemini-3-pro low            3m 39s

  *Season 2, 52 episodes
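To put the table in relative terms, here is a minimal sketch that converts each time to seconds and expresses every run as a multiple of a chosen baseline. The timings are copied from the table above; the helper names (`to_seconds`, `slowdown_vs`) are illustrative and not part of the benchmark itself.

```python
# Timings copied from the Bluey Bench table (minus transcription time).
TIMINGS = {
    "gpt-5.3-codex-spark low": "20s",
    "gpt-5.3-codex-spark medium": "41s",
    "gpt-5.3-codex-spark xhigh": "1m 09s",
    "gpt-5.3-codex low": "1m 04s",
    "gpt-5.3-codex medium": "1m 50s",
    "gpt-5.2 low": "3m 04s",
    "gpt-5.2 medium": "5m 20s",
    "opus-4.6 (no thinking)": "1m 04s",
    "gemini-3-flash": "1m 40s",
    "gemini-3-pro low": "3m 39s",
}


def to_seconds(text: str) -> int:
    """Convert a duration such as '1m 09s' or '20s' to whole seconds."""
    total = 0
    for part in text.split():
        if part.endswith("m"):
            total += int(part[:-1]) * 60
        elif part.endswith("s"):
            total += int(part[:-1])
        else:
            raise ValueError(f"unrecognized duration part: {part!r}")
    return total


def slowdown_vs(baseline: str) -> dict[str, float]:
    """Return each run's time as a multiple of the baseline run's time."""
    base = to_seconds(TIMINGS[baseline])
    return {name: round(to_seconds(t) / base, 2) for name, t in TIMINGS.items()}


if __name__ == "__main__":
    ratios = slowdown_vs("gpt-5.3-codex-spark low")
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:<30} {ratio:>6.2f}x")
```

Read this way, gpt-5.3-codex low (64s) takes 3.2x as long as spark low (20s), and gpt-5.2 medium (320s) takes 16x as long. These are single runs of one personal benchmark, so the ratios are only rough indications.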

by postalcoder1770926540

Continue to believe that Cerebras is one of the most underrated companies of our time. It's a dinner-plate sized chip. It actually works. It's actually much faster than anything else for real workloads. Amazing

by pjs_1770924115

This has been the industry standard for the last 20 minutes. I can't believe people are still using GPT-5.3-Codex.

by perdomon1770932609

This is interesting for offloading "tiered" workloads / priority queue with coding agents.

If 60% of the work is "edit this file with this content", or "refactor according to this abstraction" then low latency - high token inference seems like a needed improvement.

Recently someone made a Claude plugin to offload low-priority work to the Anthropic Batch API [1].

Also I expect both Nvidia and Google to deploy custom silicon for inference [2]

1: https://github.com/s2-streamstore/claude-batch-toolkit/blob/...

2: https://www.tomshardware.com/tech-industry/semiconductors/nv...

by jryio1770920344

My stupid pelican benchmark proves to be genuinely quite useful here, you get a visual representation of the quality difference between GPT-5.3-Codex-Spark and full GPT-5.3-Codex: https://simonwillison.net/2026/Feb/12/codex-spark/

by simonw1770931206

really too bad that the codex models are so tightly coupled to the codex harness as to be useless for everything else

by jbellis1770935490

> Our latest frontier models have shown particular strengths in their ability to do long-running tasks, working autonomously for hours, days or weeks without intervention.

I have yet to see this (produce anything actually useful).

by nikkwong1770921484

Interesting to note that the reduced latency is not just due to the improved model speed, but also because of improvements made to the harness itself:

> "As we trained Codex-Spark, it became apparent that model speed was just part of the equation for real-time collaboration—we also needed to reduce latency across the full request-response pipeline. We implemented end-to-end latency improvements in our harness that will benefit all models [...] Through the introduction of a persistent WebSocket connection and targeted optimizations inside of Responses API, we reduced overhead per client/server roundtrip by 80%, per-token overhead by 30%, and time-to-first-token by 50%. The WebSocket path is enabled for Codex-Spark by default and will become the default for all models soon."

I wonder if all other harnesses (Claude Code, OpenCode, Cursor etc.,) can make similar improvements to reduce latency. I've been vibe coding (or doing agentic engineering) with Claude Code a lot for the last few days and I've had some tasks take as long as 30 minutes.
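A rough way to see how the three quoted reductions add up over an agent session is a small back-of-the-envelope model. This is only a sketch: the baseline overheads, roundtrip count, and token count are made-up assumptions, and the model covers harness overhead only, not the time the model spends generating.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overheads:
    roundtrip_s: float  # fixed overhead per client/server roundtrip
    per_token_s: float  # overhead added to every streamed token
    ttft_s: float       # time to first token on each roundtrip


# Reductions quoted in the announcement: 80%, 30%, and 50%.
REDUCTIONS = {"roundtrip_s": 0.80, "per_token_s": 0.30, "ttft_s": 0.50}


def apply_reductions(before: Overheads) -> Overheads:
    """Scale each overhead by (1 - quoted reduction)."""
    return Overheads(
        roundtrip_s=before.roundtrip_s * (1 - REDUCTIONS["roundtrip_s"]),
        per_token_s=before.per_token_s * (1 - REDUCTIONS["per_token_s"]),
        ttft_s=before.ttft_s * (1 - REDUCTIONS["ttft_s"]),
    )


def session_overhead(o: Overheads, roundtrips: int, tokens: int) -> float:
    """Total overhead: each roundtrip pays roundtrip + TTFT, each token pays per-token."""
    return roundtrips * (o.roundtrip_s + o.ttft_s) + tokens * o.per_token_s


if __name__ == "__main__":
    # Hypothetical baseline, chosen only to make the arithmetic easy to follow.
    before = Overheads(roundtrip_s=0.2, per_token_s=0.002, ttft_s=1.0)
    after = apply_reductions(before)
    for label, o in (("before", before), ("after", after)):
        print(label, round(session_overhead(o, roundtrips=50, tokens=20_000), 1), "s")
```

With these assumed numbers the overhead drops from about 100 s to about 55 s. The split matters: a session with many tool calls gains mostly from the roundtrip and time-to-first-token reductions, while a session that streams long outputs gains mostly from the per-token reduction.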

by raahelb1770929032

Is this the first time one of the big 3 is using Cerebras? I've been waiting for this day...

by kachapopopow1770921235

Works pretty well as a general-purpose computer. The speed is really enjoyable. Could replace some of my Claude Code use actually. For coding, set to xhigh and use it for personal tools or small projects.

Example repo that Codex with spark made in about 15 minutes for me since `claude --resume` has been finicky lately: https://github.com/mzxrai/claude-sessions

by mbm1770932506

Off topic but how is it always this HN user sharing model releases within a couple of minutes of their announcement?

by mudkipdev1770920233

This is closer to 5.1 mini it seems and tied to Pro account. GLM 4.7 is available on-demand on Cerebras today [1] and performs better and cheaper... [1] https://www.cerebras.ai/blog/glm-4-7

by pdeva11770920419

Great move by OpenAI. With coding agents, if you have access to a fast and cheap model, you can afford to let it rip, making lots of mistakes, and iterate until it gets things right. With the right scaffolding (AGENTS.md, SKILLS.md, etc.), a fast and light model can do great things. And when it's done, you can still have the heavyweight model come in to clean up any messes.
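The loop described here (let the fast model retry against a check, then hand leftovers to the heavyweight model) can be sketched as below. Everything in it is hypothetical: `fast_attempt`, `check`, and `heavy_attempt` are placeholder callables standing in for whatever model calls and test runner you actually use, not real Codex APIs.

```python
from typing import Callable, Optional


def iterate_then_escalate(
    task: str,
    fast_attempt: Callable[[str, Optional[str]], str],
    check: Callable[[str], Optional[str]],
    heavy_attempt: Callable[[str, str, str], str],
    max_fast_attempts: int = 5,
) -> tuple[str, str]:
    """Retry with the fast model until `check` passes, else escalate.

    `check` returns None on success or a feedback string on failure.
    Returns (which_model_finished, final_candidate).
    """
    feedback: Optional[str] = None
    candidate = ""
    for _ in range(max_fast_attempts):
        # The fast model sees the previous failure message, if any.
        candidate = fast_attempt(task, feedback)
        feedback = check(candidate)
        if feedback is None:
            return "fast", candidate
    # Out of cheap attempts: give the last candidate and its failure
    # feedback to the slower, stronger model for cleanup.
    return "heavy", heavy_attempt(task, candidate, feedback or "")


if __name__ == "__main__":
    attempts = []

    def fake_fast(task: str, feedback: Optional[str]) -> str:
        attempts.append(feedback)
        return f"attempt-{len(attempts)}"

    def fake_check(candidate: str) -> Optional[str]:
        return None if candidate == "attempt-3" else "tests failed"

    def fake_heavy(task: str, candidate: str, feedback: str) -> str:
        return "heavy-fix"

    print(iterate_then_escalate("rename label", fake_fast, fake_check, fake_heavy))
```

The approach only works as well as `check` does: with a weak check (no tests, no linter), fast retries just produce mistakes faster.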

by ttul1770929608

This could probably work amazingly with an orchestrator on 5.3-high and coding agents with Spark. But it would need some decent instructions for both.

by alecco1770932245

When I saw Spark my mind went to Apache Spark and wondered if we were learning all the lessons in orchestration of driver/worker and data shuffling from that space.

by alexhans1770920804


The search for speed is vain. Often Claude Code Opus 4.6, on hard enough problems, can give the impression of acting fast without really making progress, because it lacks focus on what matters. Then you spin up the much slower GPT 5.3-Codex and it fixes everything in 3 minutes of doing the right thing.

by antirez1770920735

Seems like the industry is moving further towards having low-latency/high-speed models for direct interaction, and having slow, long thinking models for longer tasks / deeper thinking.

Quick/Instant LLMs for human use (think UI). Slow, deep thinking LLMs for autonomous agents.

by capevace1770921517

Anyone using OpenClaw to manage a bunch of coding agents so that you only set the high-level vision and leave all the prompting, testing, debugging, forking to agents? If yes, how did you glue it all together? Are you using local models? What is the SOTA for what I can run locally with a 512GB M3 Ultra, 2x DGX Spark, 2x RTX Pro 6000 Max-Q in one machine and 1x RTX Pro 6000 WS in another machine?

by storus1770926265

No hint on pricing. I'm curious if faster is more expensive, given a slight trade-off in accuracy

by OsrsNeedsf2P1770920111


128k context window!

by Computer01770934792

Cerebras out here catching dubs. Does anyone know if Groq is running DGX Cloud inference or am I tripping?

by hchak1770926917

Great stuff. People are getting used to agents as the interface for everything, even work as simple as "change label X to label Y". More speed on that front is welcome. The Codex "blended mode" they refer to will be useful (similar to Claude Code bouncing between haiku and opus).

I imagine it's a win-win. This could significantly help their tokenomics.

The example showing a plan being generated instantaneously is interesting. Human understanding will end up as the last, true bottleneck.

by wxw1770922246

This is a win for agents: speed and intelligence are both crucial to the loop. If the time and token cost is small, you can iterate many times to correct mistakes.

Got to wonder why Wall Street is dumping NVIDIA.

by dalemhurley1770928311

Going by the rough numbers from the blog post, ~1k tokens a second on Cerebras, it should be about the same size as GLM 4.7, which is also available at 1k tokens a second. And they say that it is a smaller model than the normal Codex model.

by mynti1770925924

OpenAI naming is a meme at this point

by Aeroi1770933798

Damn, this is the first thing to make me decide to try Codex, as a loyal Claude Code user.

by rprend1770927536

This would be interesting if it was an open weights model.

by jannniii1770929519

It'll be nice when there's smarter routing between models, or easier routing, so some things get sent to the fast model, some get sent to the cheap model, some get sent to the smart model, etc.
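As a minimal sketch of what such routing could look like, assuming just two hand-set flags per task: the tier names (`fast`, `cheap`, `smart`) and the `Task` fields are illustrative assumptions, and a real router would need a way to estimate difficulty rather than being told.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    prompt: str
    interactive: bool = False  # a human is waiting on the answer
    hard: bool = False         # needs careful multi-step reasoning


def route(task: Task) -> str:
    """Pick a model tier: difficulty wins, then latency, then cost."""
    if task.hard:
        return "smart"
    if task.interactive:
        return "fast"
    return "cheap"


if __name__ == "__main__":
    examples = [
        Task("change label X to label Y", interactive=True),
        Task("regenerate docs for the whole repo overnight"),
        Task("find the race condition in the scheduler", interactive=True, hard=True),
    ]
    for t in examples:
        print(f"{route(t):<6} {t.prompt}")
```

The ordering of the checks is the design choice here: a hard task goes to the smart tier even when someone is waiting, on the view that a fast wrong answer costs more than a slow right one.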

by cjbarber1770920712

Why are they obscuring the price? It must be outrageously expensive.

by modeless1770924440

Your move, Anthropic.

(Yes I know they released /fast last week but I’m loving the constant oneupsmanship)

by throwup2381770920088

Been using glm 4.7 for this with opencode. Works really well.

by anonzzzies1770923397

I stopped using OpenAI tools recently after they increased the censorship. I can't even tell it to read a screencapture software I am building because it thinks I might use it for evil purposes.

by system21770927171

Is it not available in Codex? I think this is fantastic and can't wait to try it. This is exactly the use case I need: something fast that performs based on my instructions.

Cerebras is a winner here.

by desireco421770930802

These graphs are really weird. One only shows 30-60% range with the model(s) close to 60%, the other shows 80% but the top model is at 77%.

by nusl1770923103

Does anyone want this? Speed has never been the problem for me; in fact, higher latency means less work for me as a replaceable corporate employee. What I need is the most intelligence possible; I don't care if I have to wait a day for an answer if the answer is perfect. Small code edits, which are presented as the use case here, I can do much better myself than by trying to explain to some AI what exactly I want done.

by tsss1770923506

For a bit, waiting for LLMs was like waiting for code to compile: https://xkcd.com/303/

> more than 1000 tokens per second

Perhaps, no more?

(Not to mention, if you're waiting for one LLM, sometimes it makes sense to multi-table. I think Boris from Anthropic says he runs 5 CC instances in his terminal and another 5-10 in his browser on CC web.)

by cjbarber1770920298

Anyway token eaters are upgrading their consumption capabilities.

by deskithere1770921269

Normal Codex itself is subpar compared to Opus. This might be even worse.

by allisdust1770920733


I was really hoping it would support codex xhigh first.

by cactusplant73741770924962

Wasn't aware there was an effort to move to websockets. Is there any standards work for this, or is this just happening purely within the walled OpenAI garden?

> Under the hood, we streamlined how responses stream from client to server and back, rewrote key pieces of our inference stack, and reworked how sessions are initialized so that the first visible token appears sooner and Codex stays responsive as you iterate. Through the introduction of a persistent WebSocket connection and targeted optimizations inside of Responses API, we reduced overhead per client/server roundtrip by 80%, per-token overhead by 30%, and time-to-first-token by 50%. The WebSocket path is enabled for Codex-Spark by default and will become the default for all models soon.

by jauntywundrkind1770920904

In my opinion, they solved the wrong problem. The main issue I have with Codex is that the best model is insanely slow, except at nights and weekends when Silicon Valley goes to bed. I don't want a faster, smaller model (already have that with GLM and MiniMax). I want a faster, better model (at least as fast as Opus).

When they partnered with Cerebras, I kind of had a gut feeling that they wouldn't be able to use their technology for larger models because Cerebras doesn't have a track record of serving models larger than GLM.

It pains me that five days before my Codex subscription ends, I have to switch to Anthropic because despite getting less quota compared to Codex, at least I'll be able to use my quota _and_ stay in the flow.

But even Codex's slowness aside, it's just not as good of an "agentic" model as Opus: here's what drove me crazy: https://x.com/OrganicGPT/status/2021462447341830582?s=20. The Codex model (gpt-5.3-xhigh) has no idea about how to call agents smh

by behnamoh1770920116

> Today, we’re releasing

Releasing for real? Is it an open model?

by cowpig1770927852

> Today, we’re releasing a research preview of GPT‑5.3-Codex-Spark, a smaller version of GPT‑5.3-Codex, and our first model designed for real-time coding. Codex-Spark marks the first milestone in our partnership with Cerebras, which we announced in January.

Nevermind. [0]

[0] https://news.ycombinator.com/item?id=35490837

by rvz1770922548