> Code interleaves fork positions, where several continuations are genuinely plausible and may correspond to different solution approaches, with lock positions, where syntax and semantics leave little ambiguity but a low-probability distractor tail still remains… The best global decoding setting is therefore necessarily a compromise; we call this tension the precision-exploration conflict.
In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).
What this paper shows is that their simple technique (SSD) can improve the ranking of optimal tokens in both lock and fork positions, meaning the model is more likely to explore when it should be exploring, and more likely to be precise when it needs to be.
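The precision-exploration conflict is easy to see with toy numbers: at a "fork" position several continuations have nearly equal logits, while at a "lock" position one token dominates but a small distractor tail remains. A sketch (the logit values are made up purely for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities at a given sampling temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits: a "fork" position (several plausible continuations)
# and a "lock" position (one correct token plus a distractor tail).
fork_logits = [2.0, 1.9, 1.8, 0.0]   # three near-equivalent options
lock_logits = [5.0, 1.0, 0.9, 0.8]   # one clear winner, small tail

for t in (0.3, 1.0):
    fork = softmax_with_temperature(fork_logits, t)
    lock = softmax_with_temperature(lock_logits, t)
    print(f"T={t}: fork top prob={fork[0]:.2f}, "
          f"lock distractor mass={1 - lock[0]:.4f}")
```

At low temperature the lock position's distractor mass becomes negligible (good), but the fork position collapses toward one option (bad for exploration); at T=1.0 the fork stays diverse, but the lock's distractor tail carries several percent of the mass. No single global temperature serves both, which is the tension the paper names.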
I love that we're still learning the emergent properties of LLMs!
Self-distillation was already shown to be remarkably efficient and effective back in January this year by an MIT and ETH team with their Self-Distillation Fine-Tuning (SDFT) system [1].
That same approach also appears in this paper's comparison table, under the name On-Policy Self-Distillation, as their closest competitor.
I hope they keep the original work's real name, Self-Distillation Fine-Tuning (SDFT). Imagine later papers citing this very paper as cross-entropy self-distillation instead of its own given name, Simple Self-Distillation (SSD). Although, admittedly, it's a lousy name that collides with the common use of SSD for solid-state drive, as others have rightly pointed out.
I think they should have given proper credit to that earlier seminal work on SDFT, but apparently they just list it as one of the systems in their benchmark without explaining the connection and lineage, which is a big deal in research publication.
[1] Self-Distillation Enables Continual Learning:
https://arxiv.org/abs/2601.19897
That already looks like Sonnet 3.x and 4 level capabilities to me, where the model in question (Gemma 4) sets up a whole Python project with a UI and installs Python libraries using uv, etc.
Add Simple Self-Distillation to the picture, and by 2028 I see cheaper coding-model providers with much more generous usage limits, while power users will mostly be running their own models anyway.
Anyone using these models as "non-deterministic transpilers" from natural language to code (experienced engineers who can write code themselves) would probably not be paying any AI provider at all.
https://ai.meta.com/research/publications/adaptive-decoding-...
(Not fine-tuning, but interesting nonetheless. If a model can so easily find a more elegant solution, why didn't it pick that in the first place?)
We really need better tools for understanding what's happening inside these NNs. We're not good at working with high-dimensional spaces, and we're basically throwing stuff at the wall and seeing what sticks.
I suppose we just don't have a deeper underlying theory to lean on and help us 'design' anything.
So you prompt the base model for an answer and then rerun the prompt with the answer from the first run?
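If that reading is right, the two-pass loop would look something like the sketch below. This is purely hypothetical on my part, not the paper's actual pipeline; `query_model` is a stand-in stub for a real model call.

```python
def query_model(prompt: str) -> str:
    """Stub for a base-model call; a real pipeline would sample here."""
    return f"<answer to: {prompt}>"

def self_distill_example(prompt: str) -> dict:
    # Pass 1: sample a candidate answer from the base model.
    first_answer = query_model(prompt)
    # Pass 2: rerun with the first answer in context so the model can
    # refine it; the (prompt, refined answer) pair becomes training data.
    refined = query_model(
        f"{prompt}\nDraft answer:\n{first_answer}\nImprove it:"
    )
    return {"prompt": prompt, "target": refined}

example = self_distill_example("Write a function that reverses a string.")
```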
Some of the claims about models training on their own data, in their enthusiasm to frame it as a failure, went further and suggested that it magnified biases. I had my doubts about those conclusions. If it were true, it would be a much greater breakthrough, because the ability to magnify a property is effectively a way to measure a weak version of that property. That, in turn, would mean they had found a training signal for avoiding bias. It would be great if that's what they did, but I suspect there would have been more news about it.
Perhaps this paper will put to rest the notion that AI output is useless as training data. It has only ever been the case that it was useless as an indiscriminate source of data.
It's the first thing anyone would think of (like a self-hosted compiler) but everything I've read said "it doesn't work."
EDIT: For context:
> Shumailov et al. (2024) — "AI models collapse when trained on recursively generated data" (Nature, 2024)

I know virtually nothing about this area, but my naive take is that a technique that still only passes tests around half the time doesn't seem like a particularly big jump forward.
What am I missing?
This feels eerily similar to sleep consolidation or synaptic pruning.
My locally running Mistral 7B is 100x better at modern JavaScript than any model on the market, mainly just from RAG on my own code samples.
That's basically what they are describing with "post-training"; the TL;DR is that code, especially code of a particular style, is vastly simpler than written language.
You really don't need a huge model or data centers; you just need a small but good model like Mistral 7B and literally a few good samples.
But you guys keep doing you lol. A bunch of non-devs trying to solve code is pretty funny to watch.
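For the curious, the retrieval half of a setup like the one described above can be surprisingly small. A toy sketch of RAG over personal code samples, with a bag-of-words "embedding" standing in for a real vector model (all snippet contents here are made up for illustration):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real setups use a vector model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, snippets: list[str], k: int = 1) -> list[str]:
    """Return the k snippets most similar to the query."""
    q = embed(query)
    ranked = sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]

# A couple of hypothetical personal code samples to ground the model.
samples = [
    "async function fetchJson(url) { const r = await fetch(url); return r.json(); }",
    "const sum = (xs) => xs.reduce((a, b) => a + b, 0);",
]

context = retrieve("fetch json from a url with async await", samples)[0]
prompt = f"Using this style:\n{context}\n\nWrite a modern JS helper."
```

The retrieved snippet is simply prepended to the prompt, so the small model sees the user's own idioms before generating.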
Mutation-rate modulation is the AI engineer's heat, and selection does the trimming of the outliers.
Some more serious biomorphic thinking and we may get to the next big insight, courtesy of 3+ billion years of evolution: evolution that enabled a great ape species to write a paper like this and build LLMs like Gemma 4 that totally rock on a 3.5-pound MacBook Pro M5 Max with 128 GB of RAM.
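The mutation/selection analogy maps onto a few lines of code: mutation proposes variants (the "heat"), selection keeps only the best (the "trimming"). A minimal (1+λ) evolution-strategy sketch on a toy objective, with all parameters chosen arbitrarily for illustration:

```python
import random

random.seed(0)  # deterministic run for the example

def fitness(x: float) -> float:
    # Toy objective: a single peak at x = 3.
    return -(x - 3.0) ** 2

def evolve(mutation_rate: float, generations: int = 200) -> float:
    """(1+lambda) strategy: mutation proposes variants ('heat'),
    selection keeps only the best ('trimming the outliers')."""
    parent = 0.0
    for _ in range(generations):
        children = [parent + random.gauss(0, mutation_rate) for _ in range(4)]
        # Elitist selection: the parent survives unless a child beats it.
        parent = max(children + [parent], key=fitness)
    return parent

best = evolve(mutation_rate=0.5)
```

Turning `mutation_rate` up or down plays the same role that sampling temperature does for an LLM: more heat means more exploration, while selection keeps the population anchored to what works.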
If you sample from the base model with T=1.6, top_k=20, top_p=0.8, i.e., the decoding settings used for the distillation's ground truth, does it match the SSD'd model plus some decoding, performance-wise?
Their sweep is missing this and only covers "standard" decoding settings.
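For anyone who wants to try that comparison locally, the quoted decode settings compose in the usual order: temperature scaling, then top-k, then top-p (nucleus) filtering, then sampling. A toy, self-contained implementation (the logit values at the bottom are arbitrary):

```python
import math
import random

def sample_token(logits, temperature=1.6, top_k=20, top_p=0.8, rng=random):
    """Temperature -> top-k -> top-p filtering, then sample one token.
    Mirrors the decode settings quoted above; toy implementation."""
    # Temperature scaling and softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Keep the top-k tokens by probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:top_k]
    # Nucleus: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the surviving tokens and sample one.
    z = sum(probs[i] for i in kept)
    r, acc = rng.random() * z, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

token = sample_token([4.0, 2.0, 1.0, 0.5, -1.0])
```

With these particular logits, temperature 1.6 plus the 0.8 nucleus leaves only the top two tokens in the candidate set, which is exactly the fork/lock behavior the sweep would need to probe.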
Sorry Apple, SSD is already taken; you can't use that acronym.
Title should be: Simple Self-Distillation Improves Code Generation
There are two distinct billions. https://en.wikipedia.org/wiki/Billion