I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
brew upgrade llama.cpp # or brew install if you don't have it yet
Then: llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
That opened a CLI interface. For a web UI on port 8080, plus an OpenAI chat-completions-compatible endpoint, run: llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
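Once llama-server is up, any OpenAI-style client can talk to it. A minimal sketch using only the standard library (assumes the server from the command above is listening on the default port 8080; the prompt text is just an example):

```python
import json
import urllib.request

# Build a chat-completion request against llama-server's
# OpenAI-compatible endpoint, mirroring the sampling flags above.
payload = {
    "messages": [{"role": "user", "content": "Write a hello-world in Go"}],
    "temperature": 1.0,
    "top_p": 0.95,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment with the server running
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

No API key is needed for a local llama-server by default.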
It's using about 28GB of RAM. Video is sped up. I ran it through LM Studio and then OpenCode. Wrote a bit about how I set it all up here: https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
FP8: https://huggingface.co/Qwen/Qwen3-Coder-Next-FP8
Sequential (single request)
Prompt   Gen      Prompt Processing   Token Gen
Tokens   Tokens   (tokens/sec)        (tokens/sec)
------   ------   -----------------   ------------
   521       49               3,157           44.2
 1,033       83               3,917           43.7
 2,057       77               3,937           43.6
 4,105       77               4,453           43.2
 8,201       77               4,710           42.2
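A rough way to read the table above: end-to-end latency per request is prompt tokens divided by the prompt-processing rate, plus generated tokens divided by the generation rate. A quick check against the measured rows:

```python
# Estimate single-request latency from the measured rates above.
def turn_latency(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    # time to ingest the prompt + time to generate the reply
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# The 8K-prompt row: 8,201 tokens at 4,710 t/s in, 77 tokens at 42.2 t/s out
secs = turn_latency(8201, 77, 4710, 42.2)
print(f"{secs:.1f} s")  # roughly 3.6 s for the whole turn
```

Notably, even the shortest row (521 prompt tokens) is dominated by generation time, not prompt processing.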
Parallel (concurrent requests)
pp4096+tg128 (4K context, 128 gen):
n t/s
-- ----
1 28.5
2 39.0
4 50.4
8 57.5
16 61.4
32 62.0
pp8192+tg128 (8K context, 128 gen):
n t/s
-- ----
1 21.6
2 27.1
4 31.9
8 32.7
16 33.7
32 31.7

cat docker-compose.yml
services:
  llamacpp:
    volumes:
      - llamacpp:/root
    container_name: llamacpp
    restart: unless-stopped
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    network_mode: host
    command: |
      -hf unsloth/Qwen3-Coder-Next-GGUF:Q4_K_XL --jinja --cpu-moe --n-gpu-layers 999 --ctx-size 102400 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on
    # unsloth/gpt-oss-120b-GGUF:Q2_K
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  llamacpp:

From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.
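For reference, bringing the compose file above up is the usual two commands (assumes Docker Compose v2 and the NVIDIA container toolkit installed; with network_mode: host the server answers on the host's port 8080, llama.cpp's default):

```shell
# Start the container in the background, then watch the model download/load
docker compose up -d
docker compose logs -f llamacpp

# Once loaded, the OpenAI-compatible endpoint is on the host network
curl -s http://localhost:8080/health
```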
Are there any clear winners per domain? Code, voice-to-text, text-to-voice, text editing, image generation, text summarization, business-text-generation, music synthesis, whatever.
Hope they update the model page soon https://chat.qwen.ai/settings/model
Does anyone see a reason to still use elevenlabs etc. ?
Does anyone have any experience with these, and is this release actually workable in practice?
Granted, these 80B models are probably optimized for H100/H200, which I don't have. Here's hoping that OpenClaw compat. survives quantization.
I'm currently using Qwen 2.5 16b, and it works really well.
It's one thing running the model without any context, but coding agents build it up close to the max and that slows down generation massively in my experience.
On a misc note: What's being used to create the screen recordings? It looks so smooth!
In practice, I've found the economics work like this:
1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability
2. Architecture decisions, debugging subtle issues - worth the cost of frontier models
3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more
The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.
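The three-way split above amounts to a router in front of the models. A toy sketch (the model names and the keyword heuristic are illustrative placeholders, not anything from this thread; a real setup would classify tasks far more carefully):

```python
# Toy task router for the three categories above.
LOCAL_MODEL = "qwen3-coder-next"   # category 1: boilerplate, tests, migrations
FRONTIER_MODEL = "frontier-model"  # categories 2-3: architecture, debugging, refactoring

def pick_model(task: str) -> str:
    # Naive keyword heuristic standing in for a real task classifier.
    cheap = ("boilerplate", "test", "migration", "scaffold")
    if any(word in task.lower() for word in cheap):
        return LOCAL_MODEL
    return FRONTIER_MODEL

print(pick_model("generate boilerplate CRUD endpoints"))  # qwen3-coder-next
print(pick_model("debug a subtle race condition"))        # frontier-model
```

The point is only that category-1 traffic, which is the bulk of agent turns, is exactly where a cheap local model pays off if the SWE-Bench numbers hold.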