GLM-5: From Vibe Coding to Agentic Engineering

Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...

Solid bird, not a great bicycle frame.

by simonw
I am using it with Claude Code and so far so good. Can't tell yet whether it's as good as Opus 4.6.
by mohsen1
While GLM-5 itself seems impressive, this release also included lots of other cool new stuff!

> GLM-5 can turn text or source materials directly into .docx, .pdf, and .xlsx files—PRDs, lesson plans, exams, spreadsheets, financial reports, run sheets, menus, and more.

A new type of model has joined the series: GLM-5-Coder.

GLM-5 was trained on Huawei Ascend chips. Last time DeepSeek tried to use this chip, it flopped and they went back to Nvidia. This time it looks like a success.

Looks like they also released their own agentic IDE, https://zcode.z.ai

I don’t know if anyone else has noticed, but Z.ai also released new tools besides the chat! There’s Zread (https://zread.ai), OCR (seems new? https://ocr.z.ai), GLM-Image gen (https://image.z.ai) and voice cloning (https://audio.z.ai).

If you go to chat.z.ai, there is a new toggle in the prompt field that switches between chat and agentic mode. It is only visible when you select GLM-5.

Very fascinating stuff!

by Alifatisk
The benchmarks are impressive, but they compare against last-generation models (Opus 4.5 and GPT-5.2). The newer competitor models are recent, but there would easily have been enough time to re-run the benchmarks and update the press release by now.

It doesn't really matter much, though. All of the open-weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

by Aurornis
It's live on OpenRouter now.

In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.

For those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format. I give it coding tasks using chat.md [1] + MCPs, and so far it's just not able to follow the format at all.

[1] https://github.com/rusiaaman/chat.md
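
To illustrate what "following a custom tool-calling format" means in practice, here's a minimal Python sketch; the XML-ish syntax and the follows_format checker below are made up for illustration, not chat.md's actual format:

    import json
    import re

    # A made-up tool-call syntax, NOT chat.md's real format: the model
    # must wrap each call as <tool name="...">{json args}</tool>.
    TOOL_CALL = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.DOTALL)

    def follows_format(model_output: str) -> bool:
        """Pass only if at least one call parses and all args are valid JSON."""
        calls = TOOL_CALL.findall(model_output)
        if not calls:
            return False
        for _name, args in calls:
            try:
                json.loads(args)
            except json.JSONDecodeError:
                return False
        return True

    print(follows_format('<tool name="read_file">{"path": "main.py"}</tool>'))  # True
    print(follows_format('read_file(main.py)'))                                 # False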

by pcwelder
Been playing with it in opencode for a bit and I'm pretty impressed so far. Certainly more of an incremental improvement than a big-bang change, but it does seem a good bit better than 4.7, which in turn was a modest but real improvement over 4.6.

Certainly seems to remember things better and is more stable on long-running tasks.

by Havoc
Been using GLM-4.7 for a couple of weeks now. Anecdotally, it’s comparable to Sonnet, but requires a bit more instruction and clarity to get things right. For bigger, complex changes I still use Anthropic’s family, but for concise, well-defined smaller tasks the price of GLM-4.7 is hard to beat.
by justinparus
What is truly amazing here is that they trained this entirely on Huawei Ascend chips, per reporting [1]. Hence we can conclude the Chinese semiconductor-to-model tech stack is only about three months behind the US, considering Opus 4.5 was released in November. (Excluding the lithography equipment here, as SMIC still uses older ASML DUV machines.) This is huge, especially since just a few months ago it was reported that DeepSeek was not using Huawei chips due to technical issues [2].

US attempts to contain Chinese AI tech have totally failed. Not only that, they may have cost Nvidia trillions of dollars of exports over the next decade, as the Chinese government called the American bluff and now actively disallows imports of Nvidia chips as a direct result of past sanctions [3], at a time when the Trump admin is trying to do whatever it can to reduce the US trade imbalance with China.

[1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...

[2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch...

[3] https://www.reuters.com/world/china/chinas-customs-agents-to...

by cherryteastain
So that was Pony Alpha [1]. Now what's Aurora Alpha?

[1] https://openrouter.ai/openrouter/pony-alpha

by kristianp
Interesting timing: GLM-4.7 was already impressive for local use on 24 GB+ setups. Curious to see when the distilled/quantized versions of GLM-5 drop. The gap between what you can run via API versus locally keeps shrinking. I've been tracking which models actually run well at each RAM tier, and the Chinese models (Qwen, DeepSeek, GLM) are dominating the local inference space right now.
by CDieumegard
I got fed up with GLM-4.7 after using it for a few weeks; it was slow through z.ai and not as good as the benchmarks led me to believe (especially with regard to instruction following), but I'm willing to give it another try.
by esafak
It might be impressive on benchmarks, but there's just no way for them to break through the noise from the frontier models. At these prices they're just hemorrhaging money. I can't see a path forward for the smaller companies in this space.
by woeirua
If you're tired of cross-referencing the cherry-picked benchmarks, here's the geometric mean of SWE-bench Verified & HLE-tools:

Claude Opus 4.6: 65.5%

GLM-5: 62.6%

GPT-5.2: 60.3%

Gemini 3 Pro: 59.1%
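
The aggregation itself is easy to reproduce. A minimal Python sketch, where the two per-benchmark splits are invented placeholders chosen only so the combined score lands near GLM-5's 62.6% figure:

    from math import sqrt

    # Hypothetical per-benchmark splits (percent). Only the combined
    # figures above come from the list; these two numbers are made up
    # so the geometric mean lands near GLM-5's 62.6%.
    swe_bench_verified = 74.0
    hle_tools = 53.0

    # Geometric mean of the two scores.
    combined = sqrt(swe_bench_verified * hle_tools)
    print(f"{combined:.1f}%")  # -> 62.6%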

by goldenarm
Here is the pricing per M tokens: https://docs.z.ai/guides/overview/pricing

Why is GLM 5 more expensive than GLM 4.7 even when using sparse attention?

There is also a GLM-5-Coder model.

by algorithm314
GLM-4.7-Flash was the first local coding model that I felt was intelligent enough to be useful. It feels something like Claude 4.5 Haiku, at a parameter size where other coding models are still getting stuck in loops and making bewilderingly stupid tool calls. It also has very clear reasoning traces that feel like Claude's, which makes it possible to inspect its reasoning and figure out why it made certain decisions.

So far I haven't managed to get comparably good results out of any other local model including Devstral 2 Small and the more recent Qwen-Coder-Next.

by 2001zhaozhao
I'd say that they're super confident about the GLM-5 release, since they're directly comparing it with Opus 4.5 and don't mention Sonnet 4.5 at all.

I am still waiting to see if they'll launch a GLM-5 Air series, which would run on consumer hardware.

by beAroundHere
Really impressive benchmarks. It was commonly said that open-source models were lagging six months behind the state of the art, but they are likely even closer now.
by pu_pe
What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1].

We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.

I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?

How do you read this?

[1] https://imgur.com/a/EwW9H6q
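
One way to eyeball that relationship is to fit each model's scores against the log of its reasoning-token budget and compare slopes and intercepts. A minimal sketch, with invented data points standing in for real measurements:

    import numpy as np

    # Invented (reasoning tokens, benchmark score) pairs, purely for
    # illustration; these are not measurements of any real model.
    tokens = np.array([500, 1000, 2000, 4000, 8000])
    scores = np.array([48.0, 52.5, 57.0, 61.5, 66.0])

    # Fit score = a * ln(tokens) + b. A steeper slope a means the model
    # converts extra reasoning tokens into accuracy more efficiently; a
    # higher intercept b means more "intelligence per output token"
    # before any extended reasoning.
    a, b = np.polyfit(np.log(tokens), scores, 1)
    print(f"score ~ {a:.2f} * ln(tokens) + {b:.2f}")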

by mnicky
GLM-5 beats Kimi on SWE-bench and Terminal-Bench. If it's anywhere near Kimi in price, this looks great.

Edit: Input tokens are twice as expensive. That might be a deal breaker.

by nullbyte
They increased their prices substantially.
by ExpertAdvisor01
I kind of feel this benchmarking thing with Chinese models is like university Olympiads: they study specifically for those, but when the time comes for real-world work they seriously lag behind.
by mohas
Why are we not comparing to Opus 4.6 and GPT-5.3 Codex?

Honestly, these companies are so hard to take seriously with these release details. If it's an open-source model and you're only comparing against open source, cool.

If you're not top in your segment, maybe show how your token cost and output speed more than make up for that.

Purposely showing prior-gen models in your release comparison immediately discredits you in my eyes.

by tgtweak
I predict a new speculative market will emerge where adherents buy and sell vibe-coded companies.

Betting on whether they can actually perform their sold behaviors.

Passing around code repositories for years without ever trying to run them, factory sealed.

by unltdpower
It will be tough to run on our 4x H200 node… I wish they had stayed around the 350B range. MLA will reduce KV cache usage, but I don't think the reduction will be significant enough.
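
For a rough sense of the numbers, here is a back-of-the-envelope sketch of per-sequence KV cache size under plain GQA versus MLA; every model shape below is a placeholder, not GLM-5's actual config:

    # Back-of-the-envelope KV cache sizes. All shapes are invented
    # placeholders, not GLM-5's real architecture.
    layers = 80
    kv_heads = 8            # GQA-style key/value heads
    head_dim = 128
    bytes_per_elem = 2      # fp16/bf16
    context = 128_000       # tokens

    # Standard GQA attention caches K and V per layer, head, and token.
    gqa_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * context
    print(f"GQA KV cache: {gqa_bytes / 1e9:.1f} GB")   # ~41.9 GB

    # MLA instead caches one compressed latent per layer and token
    # (ignoring the small decoupled RoPE key it also keeps around).
    latent_dim = 512        # placeholder compressed-KV dimension
    mla_bytes = layers * latent_dim * bytes_per_elem * context
    print(f"MLA KV cache: {mla_bytes / 1e9:.1f} GB")   # ~10.5 GB
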
by meffmadd
The number of times in the past year that a competitor's benchmarks claimed something was close to Claude and it turned out to be even remotely close in practice: 0.
by karolist
Why don't they publish ARC-AGI results? Too expensive?
by eugene3306
I wish China would start copying Demis's biotech models soon as well.
by seydor
Is this a lot cheaper to run (on their service or rented GPUs) than Claude or ChatGPT?
by woah
We're seeing so many LLM releases that they can't even keep their benchmark comparisons updated.
by surrTurr
Just tried it; it's practically the same as GLM-4.7. It isn't as "wide" as Claude or Codex, so even on a simple prompt it misses an important detail: instead of investigating fully before starting a project, it ploughs ahead with the next best thing it thinks you asked for.
by dana321
Whoa, I think GPT-5.3-Codex was a disappointment, but GLM-5 is definitely the future!
by petetnt