Hacker News

246 points

Flash-MoE: Running a 397B Parameter Model on a Laptop

by mft_ | 91 comments
Note that this is not the only way to run Qwen 3.5 397B on consumer devices, there are excellent ~2.5 BPW quants available that make it viable for 128G devices.

I've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:

    mmlu: 87.86%

    gpqa diamond: 82.32%

    gsm8k: 86.43%

    ifeval: 75.90%
More details of my experience:

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

Overall an excellent model to have for offline inference.

by tarruda
Reading the details: he's using 2-bit quantization and reduced the number of experts per token from 10 down to 4 to get 5 tokens/sec. Cool proof of concept, but it's far from the quality and performance of the 397B model as normally used. Dropping the number of experts is particularly misleading.

This is some interesting work, but applying such extreme measures to LLMs to get them to run severely degrades quality. I know he claims negligible quality loss, but in my experience 2-bit quantizations are completely useless for real work. You can get them to respond to prompts, but they lose their intelligence and will go around in circles.

He also shows 5-6 tokens per second. Again, that's impressive for a large model on limited hardware, but it's very slow. Between the severely degraded abilities and the extremely slow output, the 397B result should be considered an attempt at proving something can technically run, not evidence that it can run well and produce the output you'd expect from a 397B model.

He even mentions the obvious problems with his changes:

> 2-bit quantization produces \"name\" instead of "name" in JSON output, making tool calling unreliable.

So right out of the gate this isn't useful if you want to do anything with it. He could have tried smaller models or less aggressive quantization to get actually useful output, but it wouldn't look as impressive. It's honestly getting kind of exhausting to read all of these AI-coded (admitted in the link) and AI-written papers made more for resume building. It would have been interesting to see this work applied to running a useful model that hadn't been lobotomized, instead of applying tricks to get an impressive headline but useless output.

by Aurornis
To be honest, I'm getting tired of the "laptop" in every one of these clickbait titles turning out to be a $3000 MacBook. Sure, it's impressive to achieve this degree of LLM compression, but I really don't like that the title implies local LLMs are becoming viable for the average person when the actual hardware is out of reach for 99%.
by jllyhill
The github page mentions that a naïve mmap approach is bottlenecked by per-page overhead. Can this be mitigated by setting up explicit "huge" pages? (2M using the CONT PTE feature if the "native" page size is 16k; 32M using a PMD level block mapping; or 1G using the CONT PMD feature.) Does macOS support this out of the box? Alternatively, one might use a simple mmap and then something like posix_fadvise to set up prefetching of the data.
by zozbot234
The quality degradation at 2-bit is a real issue. For actual work tasks, a well-tuned 30B at 4-bit usually outperforms a 70B+ at 2-bit in my experience. The expert reduction on top of that compounds things - you're essentially running a fairly different model. Still interesting to see the upper bound of what consumer hardware can attempt, even if the result isn't production-ready.
by justacatbot
> Metal Compute Shaders — Hand-written Metal kernels

Hand written... by GPT? ;)

by andai
Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe there's even a case for the return of some kind of ROM for weights?
by bertili
It seems strange to me that the only way to use an LLM is to fit it entirely in volatile memory from the get-go.

To render movies we happily wait for the computer to calculate how lights bounce around, for hours even days.

So why not do the same with AIs? Ask a big question to a big model and get the answer to the universe tomorrow?

by qiine
This is a very impressive result. If I understand correctly, the bottleneck in this architecture is the SSD - the author seems to get almost 15GB/s - but I seem to remember the max bandwidth was about 8GB/s. What am I missing?
by JSR_FDED
TLDR: I took a stab at leveraging Dan's work and making it more practical:

https://github.com/matt-k-wong/mlx-flash

2-bit quantization lobotomizes the model, but this is impressive nonetheless! Maybe one day we'll be able to have intelligent 2-bit quants... I wonder.

My version supports 4-bit quantization, hybrid streaming (disk + RAM), and arbitrary model compatibility; it's been tested on Mamba2 and sets up the framework for LM Studio integration.

I leveraged this work (credit to Danveloper) and am in the middle of making it run on more practical models and quants. It still uses flash streaming, but with a control knob so you can choose how much or how little RAM to use. At the extreme, it uses as little RAM as possible but is very slow; in the balanced case you spend some RAM and it's much faster.

I designed it around the intelligence-dense Nemotron 3 Nano 30B and Nemotron Cascade 2 30B models (smaller, with more intelligence per parameter), which can run on low-end 16GB machines, though you can run arbitrarily large models on bigger hardware (designed for the very low end, but capable of the high end).

by mkw
Can you add a license to the repo? Legally, we can't run any code without a license attached to it.
by maxloh
Really interesting approach. Curious how the 2-bit quantization affects the model's reasoning ability on longer chains of thought vs. shorter prompts. The benchmarks look solid, but real-world usage seems like a different story based on the comments here.
by haomingkoo
As frontier models get closer and closer to running on consumer hardware, what's the moat for the API-driven $trillion labs?
by m-hodges
Everyone is focused on the bad 2-bit result, but who cares? He says don't use it because it's bad.
by mannyv
yeah, 4 tok/s is kinda unusable though
by 383toast
this is awesome Dan!
by matchbox
Does this mean it should be possible to load up a system with ~10 SSDs (which seems to be at least the number of active experts) to get 40 tok/s even on truly gigantic models?
by spwa4
How large is the KV cache?
by lostmsu
Impressive. I wish someone would take a stab at using this technique on mobile GPUs; even if it doesn't use storage, it would still be a win. I am running llama.cpp on an Adreno 830 with OpenCL and getting a pathetic 2-3 t/s for output tokens.
by pdyc
Why so much RAM?
by vilequeef
Seems promising, this is the way. Can someone benchmark this?
by harshhhhhhhhh
The technical write-up is great, but Mac users shouldn't get too excited just yet about running 300B+ parameter models locally, as the tokens/sec isn't that good.

>...at 4.4+ tokens/second

That's even with 4-bit quantization, and it's still at that speed.

> The entire 209GB model streams from SSD through a custom Metal compute pipeline.

This is my main problem.

If I were to run this on a Mac's SSD 24/7 for heavy usage such as Openclaw, it is going to significantly reduce the SSD's lifetime.

Can't imagine using this long-term right now, but improvements will follow. Still a great write-up anyway.

by rvz