Most teams treat SFT (Supervised Fine-Tuning) and GRPO (Group Relative Policy Optimization) as competing choices. They're not. One has to come before the other, and getting the order wrong produces a model that's confidently, systematically wrong.
SFT trains a model by showing it correct input–output examples. The model learns to reproduce the patterns in your training data. No trial, no reward, no discovery. Accuracy scales with data quality.
GRPO trains a model by having it generate many candidate responses, scoring each against a reward function, and reinforcing what earned reward. The model discovers good behavior through trial. No labeled examples required.
I was training QwQ-32B (Qwen with Questions, 32 billion parameters), an open-source reasoning model from Alibaba's Qwen team, to evaluate Hamiltonian geometry. The training went beautifully. The loss curve (training loss over time, which measures how well the model reproduces your examples, not whether those examples are correct) was clean. Validation metrics looked right. I deployed the instruct model (the instruction-following version, as opposed to a base model that has only been pretrained on raw text) and ran a test set. It scored 7 out of 8.
Every answer looked perfect. Then I checked the math.
The model had learned to write equations with correct LaTeX formatting, proper structural form, confident technical language. It had also learned one hallucinated parameter, which I later named golden_hour_sun, that appeared in roughly 40% of its outputs. The parameter was applied consistently, with internal logic, in positions that made structural sense. The math was systematically wrong in a way that looked completely right.
I had run SFT on contaminated data. SFT had done exactly what SFT does: reproduce the surface patterns in the training data, exactly and confidently, without knowing what is correct.
Here is the simplest way to understand what SFT does: SFT is a photocopier. You hand it a stack of examples and it learns to reproduce the surface patterns of those examples with increasing accuracy. This is not a criticism. Photocopiers are useful. When the originals are right, the copies are right. When the originals are wrong, the copies are wrong. The machine does not know the difference.
When I fed golden_hour_sun into the training set, SFT treated it as a valid pattern. The parameter appeared in 40% of examples. It occupied consistent structural positions. It behaved like a real constant in the surrounding equations. SFT learned it as a feature, not an error.
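A pattern like this can be caught before training if you can list the symbols that legitimately belong in your domain. The following is a minimal sketch under that assumption. The `KNOWN_SYMBOLS` allowlist, the `unknown_symbol_rates` helper, and the toy examples are all hypothetical and exist only to illustrate the idea.

```python
import re
from collections import Counter

# Hypothetical allowlist: symbols you have verified by hand for your domain.
KNOWN_SYMBOLS = {"H", "p", "q", "m", "omega"}

# Matches variable-like tokens; bare numbers are ignored.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def unknown_symbol_rates(examples, known=KNOWN_SYMBOLS):
    """Return {symbol: fraction of examples containing it} for symbols not in `known`."""
    counts = Counter()
    for text in examples:
        # Count each unknown symbol at most once per example.
        counts.update(set(IDENTIFIER.findall(text)) - set(known))
    total = len(examples)
    return {sym: n / total for sym, n in counts.items()} if total else {}

# Toy training set, made up for illustration: 2 of 5 examples are contaminated.
examples = [
    "H = p**2 / (2*m) + m * omega**2 * q**2 / 2",
    "H = p**2 / (2*m) + golden_hour_sun * q**2",
    "H = p**2 / (2*m)",
    "H = golden_hour_sun * p**2 / (2*m)",
    "H = m * omega**2 * q**2 / 2",
]

rates = unknown_symbol_rates(examples)
```

Any symbol that shows up in a large share of examples but is missing from the allowlist deserves a manual look before the training run. The check only works if a human built the allowlist; SFT itself has no such list.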
SFT cannot distinguish between the surface pattern and the underlying mechanism. Neither can you, if your training data contains both. This is not a defect in SFT: it is the definition of what SFT is.
The instruct model knew how to write equations that looked like Hamiltonian evaluations. It knew the shape of the output. It did not know Hamiltonian geometry. It had never been asked to discover anything. It had only been asked to copy.
The failure was not that SFT learned the wrong thing. The failure was that I asked SFT to operate in a domain that had not yet graduated. I did not have correct examples. I had a collection of outputs from a model that had not yet discovered the right pattern. SFT did exactly what it was supposed to do. I failed, not the method.
The photocopier was working correctly. I fed it a corrupted original.
When I switched to GRPO, training through trial and reward rather than imitation, the failure mode changed entirely.
I train dogs. Two of them. The core methodology, understood by any experienced trainer: you cannot tell the dog what to do. You can only make it more or less likely that the dog will discover the behavior itself, and then you mark and reward that moment when it arrives. You vary the difficulty progressively. You build on success. You never add a new criterion before the previous one is solid. The dog finds the behavior through trial and reward. The trainer does not install it.
GRPO works identically. You give the model a problem. You define a reward function: code you write that takes a model output and returns a number, where higher is better. The model sees nothing except this number: not your intentions, not the "right" answer, not your code. You are responsible for making the score accurately reflect what "correct" means. The model generates candidates and receives reward signals. Over thousands of completions, it discovers the reasoning methodology that earns reward.
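The "group relative" in the name describes how those reward signals are used: each candidate is scored against the other candidates generated for the same prompt. Below is a minimal sketch of just that normalization step, assuming scalar rewards. It leaves out the policy update, clipping, and KL penalty that a real GRPO implementation adds. The function name and example rewards are illustrative, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four candidate answers to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Candidates that beat their group's average get a positive advantage and are reinforced. If every candidate earns the same reward, all advantages are zero and that group carries no learning signal.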
When I ran GRPO on the Hamiltonian geometry task, the base model had no idea how to format its output. (A base model is pretrained only on raw text, with no fine-tuning for instructions or helpfulness. Most fine-tuning starts from an instruct model; starting from base gives GRPO a clean slate with no prior SFT baked in.) Consistency at epoch 1, the first full pass through the training data, was 0.25. By epoch 3, formatting had snapped into place without me touching it. The model had discovered consistent formatting as an instrumental goal because formatted responses earned better reward. I had not told it to format consistently. It found that formatting correlated with reward.
There was no contaminated parameter to copy. There was only a reward function that measured correctness. The model had to discover what correctness meant.
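What "a reward function that measured correctness" can look like in practice: a minimal sketch, assuming each problem has a known numeric answer and the model is asked to end with a line of the form `ANSWER: <number>`. That output convention, the `correctness_reward` name, and the tolerance are all hypothetical choices for illustration.

```python
import math
import re

# Assumed convention (hypothetical): the answer sits on its own line as "ANSWER: <number>".
ANSWER_LINE = re.compile(
    r"^ANSWER:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$",
    re.MULTILINE,
)

def correctness_reward(completion: str, expected: float, rel_tol: float = 1e-6) -> float:
    """Return 1.0 if the last ANSWER line matches `expected`, else 0.0."""
    matches = ANSWER_LINE.findall(completion)
    if not matches:
        return 0.0  # nothing parseable, so nothing to score
    value = float(matches[-1])  # use the last ANSWER line if there are several
    return 1.0 if math.isclose(value, expected, rel_tol=rel_tol) else 0.0
```

Note that nothing here rewards formatting for its own sake. An unparseable answer simply earns zero, which is one way consistent formatting can emerge as an instrumental goal.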
GRPO cannot copy your mistakes because it cannot copy anything. It can only discover what earns reward. This means it cannot absorb your contaminated data, and also that it can discover things you did not put in the reward function. Both are simultaneously true.
GRPO and SFT are not competing choices. They are sequential phases of the same learning lifecycle. Domains start in GRPO: generating candidates, learning what earns reward, building the reasoning methodology. When that methodology is stable, when you can trust what correct looks like, you graduate to SFT: crystallizing the discovery for reliable deployment. The method you use is not a design choice. It is a diagnostic result.
Nature built this same lifecycle at a much larger scale. Evolution has been running GRPO for 3.8 billion years: random genetic variants, differential survival as the reward function, no organism understanding what it's building. This process discovered photosynthesis, the immune system, and complex language. It also visits catastrophic minima constantly: 99.9% of species that have ever existed are gone. The failure modes are permanent and irreversible.
Nuclear reactor safety is the deployment phase of the same process. Operators do not discover safe procedures through trial. They follow audited protocols built from every incident that came before. Each procedure in the manual was paid for by an actual failure somewhere. The protocol is a crystallized discovery. The copy is the deployment.
Both phases are necessary. Neither is optional. The problem is not choosing the wrong one. It's running them in the wrong order, or staying in discovery mode past the point where the domain has already graduated.
That last failure has a body count.
The lifecycle framing sounds clean. It sounded clean to me until I looked at what happens when a system is still discovering in a context that requires certainty.
On March 18, 2018, an Uber self-driving test vehicle struck and killed Elaine Herzberg as she walked her bicycle across a road in Tempe, Arizona. The car's radar detected her about six seconds before impact.
Six seconds is a long time. Here is what the system did with those six seconds: it classified her as a vehicle. Then as a bicycle. Then as an unknown object. Then as a vehicle again. For five full seconds, the perception system oscillated between classifications. It had never resolved what a pedestrian looks like outside a crosswalk. It was still discovering.
Then, one second before impact, the system decided to brake. But Uber's engineers had built an action suppression layer: a one-second delay on emergency maneuvers, meant to filter out false alarms during the discovery phase. The system recognized danger. The suppression layer overrode it. The car did not slow down. The NTSB found that this delay was a direct contributor to the fatality.
The NTSB found that the perception system could not classify pedestrians outside crosswalks at all. The domain had known solutions: pedestrian detection is a mature, well-studied problem in computer vision. But Uber's system was still exploring its classification space in production traffic. A discovery-phase system, deployed where a deployment-phase system was required.
Emergency medicine does not permit discovery. When a patient goes into cardiac arrest, ACLS says exactly what to do, in exactly what order, with exactly what doses. This protocol exists because clinical trials, decades of GRPO-equivalent research, discovered it. The discovery phase is over.
If you return to GRPO during a resuscitation, if you start generating variants and measuring reward in real time on a patient whose heart has stopped, you kill the patient. The SFT copy is the correct tool because the domain has graduated. The protocol is not a shortcut. It is the crystallized result of every trial that came before.
The failure mode is not GRPO. It is running in discovery mode when the stakes require deployment certainty. GRPO explores. Exploration visits wrong answers. When the wrong answer is a person on a road who your system cannot classify, there is no learning signal that matters. SFT saves lives not because it is smarter than GRPO, but because in the right phase it delivers consistent behavior that GRPO, by its nature, cannot guarantee.
The question that actually matters is simpler than the theory.
Has the discovery phase in your domain produced validated, stable patterns? Do you have examples of correct behavior that you trust? Can you verify whether a new output matches the pattern, or do you still need a reward function to find out?
Run this diagnostic before choosing a training method.
Run GRPO until you can trust your training data. Then run SFT. This is the sequence. Not the alternative. Running SFT first asks a photocopier to reproduce originals that do not yet exist.
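The diagnostic questions above can be written down as a small decision function. This is a sketch of the sequencing rule as I read it, not a formal procedure. The `DomainStatus` fields and the `recommend_method` name are hypothetical, and the answers still have to come from a human who has checked the domain.

```python
from dataclasses import dataclass

@dataclass
class DomainStatus:
    patterns_are_stable: bool        # discovery has produced validated, stable patterns
    has_validated_examples: bool     # you hold examples of correct behavior you trust
    can_verify_without_reward: bool  # you can check a new output against the pattern directly

def recommend_method(status: DomainStatus) -> str:
    """Return "SFT" only when every diagnostic question is answered yes."""
    graduated = (
        status.patterns_are_stable
        and status.has_validated_examples
        and status.can_verify_without_reward
    )
    return "SFT" if graduated else "GRPO"
```

A single "no" sends you back to GRPO. The asymmetry is deliberate: copying too early costs more than discovering for one more round.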
The experiment I ran with QwQ-32B was not a failure of method. It was a failure of sequencing. I ran SFT before the domain had graduated. I did not have validated examples. I had outputs from a model that had not yet discovered the right pattern. The golden_hour_sun parameter was not a GRPO artifact. It was the artifact of running SFT too early.
DNA is nature's SFT, the most successful copying mechanism in the history of life, replicating molecular information with extraordinary fidelity across billions of cells, generating consistent proteins from consistent genes across billions of years. It did not get designed. It got discovered. Whatever chemical process started in warm shallow water 3.8 billion years ago was GRPO: molecular structures generating variants, survival selecting what persisted. At some point in that process, something discovered that copying worked. That fidelity had survival value. That inheritance was an advantage.
The mechanism of copying was itself discovered through trial.
I caught the golden_hour_sun failure because I manually verified eight test cases on a Saturday. The eval score was 7 out of 8. The loss curve had been clean. Everything looked right. The only reason I found it was that I happened to check the math, not because the tooling flagged it, not because the eval failed, but because I sat down and worked through the outputs by hand.
Most teams don't do that. Not because they're negligent. The standard tooling gives you no reason to. A clean loss curve and a passing eval score look identical whether the domain graduated or not. The model answers in the same confident voice either way. The outputs have the same shape. The only difference is whether the discovery happened, and that difference is invisible to most automated metrics we have.
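One way to make that difference visible is to report two numbers instead of one: how many outputs have the right shape, and how many survive an independent recomputation. The sketch below assumes you have some reference calculation to check against. Here a textbook harmonic-oscillator Hamiltonian stands in for it, and the cases are made up for illustration.

```python
import math

def reference_energy(p: float, q: float, m: float, omega: float) -> float:
    """Harmonic-oscillator Hamiltonian, used here only as a stand-in ground truth."""
    return p**2 / (2 * m) + 0.5 * m * omega**2 * q**2

def score_outputs(cases):
    """Each case is (inputs dict, claimed value), with None for a malformed output."""
    if not cases:
        return {"well_formed": 0.0, "verified": 0.0}
    well_formed = sum(claimed is not None for _, claimed in cases)
    verified = sum(
        claimed is not None
        and math.isclose(claimed, reference_energy(**inputs), rel_tol=1e-9)
        for inputs, claimed in cases
    )
    n = len(cases)
    return {"well_formed": well_formed / n, "verified": verified / n}

# Made-up cases: every output is well formed, but the last value is wrong.
cases = [
    ({"p": 1.0, "q": 0.0, "m": 1.0, "omega": 1.0}, 0.5),
    ({"p": 0.0, "q": 2.0, "m": 1.0, "omega": 1.0}, 2.0),
    ({"p": 2.0, "q": 0.0, "m": 2.0, "omega": 1.0}, 1.0),
    ({"p": 1.0, "q": 1.0, "m": 1.0, "omega": 1.0}, 1.5),
]
report = score_outputs(cases)
```

A gap between the two rates is the signal that a shape-only eval hides. The hard part is the reference calculation, which is exactly the thing a domain that has not graduated does not yet have.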
The field has discovered an extraordinary amount through pretraining at scale. For tasks where the correct patterns appear densely in training data (general reasoning, language understanding, common knowledge), something like graduation probably happened somewhere in the ten trillion tokens. SFT on top of that works. You're copying something real.
But for specialized domains (Hamiltonian geometry, drug interaction prediction, niche legal reasoning, any task where correct patterns are sparse or absent in pretraining data), graduation happened nowhere in the stack. And the model doesn't know the difference. It fills in the shape of the answer with whatever it has, confidently, structurally correctly, with internal logic that passes every automated check. That's what golden_hour_sun was. Not a bug. A demonstration of the mechanism, at small scale, with the math visible enough to catch.
We are running that experiment at large scale, across most of the fine-tuning being done in production right now, on tasks that matter. The loss curves look clean everywhere.
Layer 1: Base model: pretrained on raw internet text. Some discovery happened here, incidentally, during pretraining. You don't control it and can't inspect it.
Layer 2: Instruct model: the base model + SFT on (instruction, response) pairs. This is what GPT-4, Claude, and most open models ship as. The SFT crystallized general instruction-following. No GRPO was run on your specific domain.
Layer 3: Your fine-tune: the instruct model + SFT on your domain examples. You're adding a third layer of copying. If your examples were generated by the instruct model, they're copies of copies of copies.
At no point in this stack did anything discover what correct looks like in your domain. The loss curve will still go down. The model will still sound confident.
The diagnostic in section 5 is not a theoretical exercise. It's the question most teams aren't asking at the point in the pipeline where it actually matters, before the training run, not after. Can you construct a validated training set you trust? Has the base model completed this task correctly without prompting? Do you have ground truth, or do you have outputs that have the shape of ground truth?
If the answer is no, you're not choosing between SFT and GRPO. You're choosing between discovering what correct looks like, or skipping that step and hoping the copies are good enough.
Uber's engineers did not think they were skipping a step. The perception system was detecting objects. The test drives were accumulating miles. Nobody sat down and asked whether the system could actually classify what it was seeing until someone died on a road in Tempe.
golden_hour_sun was a Saturday and eight test cases. Most production domains won't give you that.
What happens when you train SFT on outputs from a model that has never discovered anything, only copied?