AI Fine-Tuning · SFT vs GRPO · Why Sequence Matters

SFT Copies.
GRPO Discovers.

Most teams treat SFT (Supervised Fine-Tuning) and GRPO (Group Relative Policy Optimization) as competing choices. They're not. One has to come before the other, and getting the order wrong produces a model that's confidently, systematically wrong.

Marcos Damasceno · March 2026 · 11 min read

SFT (Supervised Fine-Tuning): Train on labeled examples. The model learns to copy the patterns in your data.

GRPO (Group Relative Policy Optimization): Train through trial and reward. The model discovers what earns reward. No labeled examples required.

I was training QwQ-32B (Qwen with Questions, an open-source 32-billion-parameter reasoning model from Alibaba's Qwen team) to evaluate Hamiltonian geometry. The training went beautifully. The loss curve, which measures how closely the model reproduces its training examples and not whether those examples are correct, was clean. Validation metrics looked right. I deployed the instruct model (the instruction-following version, as opposed to a raw pretrained base model) and ran a test set. It scored 7 out of 8.

Every answer looked perfect. Then I checked the math.

The model had learned to write equations with correct LaTeX formatting, proper structural form, confident technical language. It had also learned one hallucinated parameter, which I later named golden_hour_sun, that appeared in roughly 40% of its outputs. The parameter was applied consistently, with internal logic, in positions that made structural sense. The math was systematically wrong in a way that looked completely right.

I had run SFT on contaminated data. SFT had done exactly what SFT does.

How SFT Memorized My Hallucinated Parameter

Here is the simplest way to understand what SFT does: SFT is a photocopier. You hand it a stack of examples and it learns to reproduce the surface patterns of those examples with increasing accuracy. This is not a criticism. Photocopiers are useful. When the originals are right, the copies are right. When the originals are wrong, the copies are wrong. The machine does not know the difference.

When I fed golden_hour_sun into the training set, SFT treated it as a valid pattern. The parameter appeared in 40% of examples. It occupied consistent structural positions. It behaved like a real constant in the surrounding equations. SFT learned it as a feature, not an error.

Key Insight

SFT cannot distinguish between the surface pattern and the underlying mechanism. Neither can you, if your training data contains both. This is not a defect in SFT: it is the definition of what SFT is.

Case Study: SFT Contamination

QwQ-32B Hamiltonian Geometry Training

The instruct model knew how to write equations that looked like Hamiltonian evaluations. It knew the shape of the output. It did not know Hamiltonian geometry. It had never been asked to discover anything. It had only been asked to copy.

The failure was not that SFT learned the wrong thing. The failure was that I asked SFT to operate in a domain that had not yet graduated. I did not have correct examples. I had a collection of outputs from a model that had not yet discovered the right pattern. SFT did exactly what it was supposed to do. I failed, not the method.

The photocopier was working correctly. I fed it a corrupted original.

SFT copies surface patterns. Contaminated training data produces confident, systematic, invisible errors.
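One way to make this kind of contamination less invisible is to audit the training set before the run. The sketch below is illustrative only: the allowlist, the example strings, and the function name unknown_symbol_rates are assumptions made up for this post, not the tooling used in the original experiment. It reports what share of examples contain a symbol you have not independently verified.

```python
import re
from collections import Counter

# Hypothetical allowlist: symbols you have independently verified belong in the domain.
KNOWN_SYMBOLS = {"H", "p", "q", "m", "k", "omega"}

# Matches identifier-like tokens such as "omega" or "golden_hour_sun".
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def unknown_symbol_rates(examples, known=KNOWN_SYMBOLS):
    """Return {symbol: fraction of examples containing it} for symbols not in `known`."""
    counts = Counter()
    for text in examples:
        # Count each symbol once per example so the rate means "share of examples".
        counts.update(set(IDENTIFIER.findall(text)) - known)
    total = len(examples)
    return {sym: n / total for sym, n in counts.items()} if total else {}


# Made-up training examples; two of the five contain the hallucinated parameter.
examples = [
    "H = p**2 / (2*m) + k * q**2 / 2",
    "H = p**2 / (2*m) + golden_hour_sun * q**2",
    "H = omega * p * q",
    "H = p**2 / (2*m) + golden_hour_sun * k",
    "H = k * q**2 / 2",
]
rates = unknown_symbol_rates(examples)
print(rates)
```

A check like this only catches symbols you did not expect. It cannot tell you whether the expected symbols are being used correctly, so it complements hand verification and does not replace it.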

Why GRPO Can't Copy Your Mistakes

When I switched to GRPO, the failure mode changed entirely.

I train dogs. Two of them. The core methodology, understood by any experienced trainer: you cannot tell the dog what to do. You can only make it more or less likely that the dog will discover the behavior itself, and then you mark and reward that moment when it arrives. You vary the difficulty progressively. You build on success. You never add a new criterion before the previous one is solid. The dog finds the behavior through trial and reward. The trainer does not install it.

GRPO works the same way. You give the model a problem. You define a reward function: code you write that takes a model output and returns a number, where higher is better. The model sees nothing except this number: not your intentions, not the "right" answer, not your code. You are responsible for making the score accurately reflect what "correct" means. The model generates candidates and receives reward signals. Over thousands of completions, it discovers the reasoning methodology that earns reward.
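The "group relative" part can be sketched in a few lines. For each prompt, the model samples a group of completions, each is scored, and every score is compared against its own group's average. The function name, the epsilon value, and the choice of population standard deviation below are assumptions for illustration; real implementations differ in these details and add a policy-gradient update on top.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    Completions that beat the group average get a positive advantage and are
    reinforced; completions below it get a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored by a reward function.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Note what happens when every completion in a group earns the same reward: all advantages are zero and there is nothing to learn from that group. The signal comes entirely from differences between attempts, which is why there is no labeled answer for the model to copy.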

A particle navigating a bounded space toward a reward zone it cannot see. The path was not prescribed. The zone was not labeled. The trajectory found it through trial.

Case Study: Dog Training as GRPO

Operant Conditioning and Reward Discovery

When I ran GRPO on the Hamiltonian geometry task, the base model (pretrained on raw text only, with no instruction tuning and therefore no prior SFT baked in) had no idea how to format its output. Consistency at epoch 1, the first full pass through the training set, was 0.25. By epoch 3, formatting had snapped into place without me touching it. The model had discovered consistent formatting as an instrumental goal because formatted responses earned better reward. I had not told it to format consistently. It found that formatting correlated with reward.

There was no contaminated parameter to copy. There was only a reward function that measured correctness. The model had to discover what correctness meant.
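A reward function that measures correctness needs an independent source of truth. The sketch below is a toy stand-in, not the reward function from this experiment: it assumes a harmonic-oscillator Hamiltonian, a hypothetical convention that the completion ends with a line like "H = 2.5", and an arbitrary tolerance.

```python
import re

# Hypothetical output convention: the completion ends with "H = <number>".
ANSWER = re.compile(r"H\s*=\s*(-?\d+(?:\.\d+)?)\s*$")


def reference_energy(q, p, m=1.0, k=1.0):
    """Trusted ground truth: harmonic-oscillator Hamiltonian H = p^2/(2m) + k*q^2/2."""
    return p * p / (2.0 * m) + 0.5 * k * q * q


def correctness_reward(completion, q, p, tol=1e-6):
    """Return 1.0 if the stated value matches the reference within tol, else 0.0."""
    match = ANSWER.search(completion.strip())
    if match is None:
        return 0.0  # an answer that cannot be parsed earns nothing
    stated = float(match.group(1))
    return 1.0 if abs(stated - reference_energy(q, p)) <= tol else 0.0


print(correctness_reward("Therefore H = 2.5", q=1.0, p=2.0))
print(correctness_reward("Therefore H = 3.0", q=1.0, p=2.0))
```

In a sketch like this, an unparseable answer scores zero even if the reasoning was right. That is one plausible way formatting can emerge as an instrumental goal without anyone asking for it: only parseable answers can be checked, so only parseable answers can earn reward.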

GRPO discovers patterns through trial and reward. Cannot copy training-data errors; can only discover what earns reward. Also discovers things not explicitly in the reward function.

Key Insight

GRPO cannot copy your mistakes because it cannot copy anything. It can only discover what earns reward. This means it cannot absorb your contaminated data, and also that it can discover things you did not put in the reward function. Both are simultaneously true.

These Aren't Two Methods. They're Two Phases.

GRPO and SFT are not competing choices. They are sequential phases of the same learning lifecycle. Domains start in GRPO: generating candidates, learning what earns reward, building the reasoning methodology. When that methodology is stable, when you can trust what correct looks like, you graduate to SFT: crystallizing the discovery for reliable deployment. The method you use is not a design choice. It is a diagnostic result.

Nature built this same lifecycle at a much larger scale. Evolution has been running GRPO for 3.8 billion years: random genetic variants, differential survival as the reward function, no organism understanding what it's building. This process discovered photosynthesis, the immune system, and complex language. It also visits catastrophic minima constantly: 99.9% of species that have ever existed are gone. The failure modes are permanent and irreversible.

Nuclear reactor safety is the deployment phase of the same process. Operators do not discover safe procedures through trial. They follow audited protocols built from every incident that came before. Each procedure in the manual was paid for by an actual failure somewhere. The protocol is a crystallized discovery. The copy is the deployment.

Discovery phase: Evolution. 3.8 billion years of GRPO.

↔

Deployment phase: Nuclear Safety. Crystallized protocols from every incident.

You discover with GRPO. You deploy with SFT. They are not competing methods. They are sequential phases of the same process. Nature found this first: evolution discovered the immune system. Vaccination deploys it. You never see both running simultaneously on the same problem.

Left: discovery phase. Variants generating, most dissolving, some persisting. Right: deployment phase. Stable patterns propagating in strict sequence. The bridge: discovery crystallizes into deployment.

Both phases are necessary. Neither is optional. The problem is not choosing the wrong one. It's running them in the wrong order, or staying in discovery mode past the point where the domain has already graduated.

That last failure has a body count.

When the Wrong Phase Kills

The lifecycle framing sounds clean. It sounded clean to me until I looked at what happens when a system is still discovering in a context that requires certainty.

Case Study: Autonomous Driving

Uber ATG / Tempe, Arizona

On March 18, 2018, an Uber self-driving test vehicle struck and killed Elaine Herzberg as she walked her bicycle across a road in Tempe, Arizona. The car's radar detected her six seconds before impact.

Six seconds is a long time. Here is what the system did with those six seconds: it classified her as a vehicle. Then as a bicycle. Then as an unknown object. Then as a vehicle again. For five full seconds, the perception system oscillated between classifications. It had never resolved what a pedestrian looks like outside a crosswalk. It was still discovering.

Then, one second before impact, the system decided to brake. But Uber's engineers had built an action suppression layer: a one-second delay on emergency maneuvers, meant to filter out false alarms during the discovery phase. The NTSB's report on the crash documented this delay. The system recognized danger. The suppression layer overrode it. The car did not slow down.

The NTSB found that the perception system could not classify pedestrians outside crosswalks at all. The domain had known solutions: pedestrian detection is a mature, well-studied problem in computer vision. But Uber's system was still exploring its classification space in production traffic. A discovery-phase system, deployed where a deployment-phase system was required.

The system was still learning to see. It was deployed where it needed to know. One person died in the gap between those two phases.

Case Study: Emergency Medicine

ACLS (Advanced Cardiovascular Life Support) as Crystallized Discovery

Emergency medicine does not permit discovery. When a patient goes into cardiac arrest, ACLS says exactly what to do, in exactly what order, with exactly what doses. This protocol exists because clinical trials, decades of GRPO-equivalent research, discovered it. The discovery phase is over.

If you return to GRPO during a resuscitation, if you start generating variants and measuring reward in real time on a patient whose heart has stopped, you kill the patient. The SFT copy is the correct tool because the domain has graduated. The protocol is not a shortcut. It is the crystallized result of every trial that came before.

GRPO discovered the protocol (clinical trials → evidence). SFT deployed it (standardized procedure → lives saved).

The Void

The failure mode is not GRPO. It is running in discovery mode when the stakes require deployment certainty. GRPO explores. Exploration visits wrong answers. When the wrong answer is a person on a road who your system cannot classify, there is no learning signal that matters. SFT saves lives not because it is smarter than GRPO, but because in the right phase it delivers consistent behavior that GRPO, by its nature, cannot guarantee.

How to Know Which Phase You're In

The question that actually matters is simpler than the theory.

Has the discovery phase in your domain produced validated, stable patterns? Do you have examples of correct behavior that you trust? Can you verify whether a new output matches the pattern, or do you still need a reward function to find out?

Run this diagnostic before choosing a training method:

$ check_domain_phase --task "hamiltonian geometry evaluation"
Checking graduation criteria...
[✗] Can you construct a validated training set?
[✗] Do you have correct examples you trust?
[✗] Has the base model completed the task reliably?
→ Status: GRPO PHASE
Action: Run GRPO training. Discover the correct pattern first.
Do not run SFT until this status resolves.

$ check_domain_phase --task "hamiltonian geometry evaluation" --after-grpo
Checking graduation criteria...
[✓] Validated training set: 847 examples confirmed
[✓] Base model correctness verified (4/4 test runs)
[✓] Correct pattern identified and stable
→ Status: GRADUATED
Action: SFT is now correct. Crystallize for deployment.
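The check_domain_phase command above is an illustration, not a real tool. A minimal sketch of the same logic might look like the following. The DomainEvidence fields and the min_examples threshold are assumptions, and the inputs still come from your own verification work: the code only records the judgment, it cannot make it for you.

```python
from dataclasses import dataclass


@dataclass
class DomainEvidence:
    validated_examples: int  # examples you have checked against ground truth
    base_model_passes: int   # verification runs where the output was correct
    base_model_runs: int     # total verification runs attempted
    pattern_stable: bool     # your judgment that the correct pattern has stopped changing


def check_domain_phase(ev, min_examples=1):
    """Return "GRADUATED" only if every criterion holds, else "GRPO PHASE"."""
    criteria = [
        ev.validated_examples >= min_examples,
        ev.base_model_runs > 0 and ev.base_model_passes == ev.base_model_runs,
        ev.pattern_stable,
    ]
    return "GRADUATED" if all(criteria) else "GRPO PHASE"


before = DomainEvidence(validated_examples=0, base_model_passes=0,
                        base_model_runs=0, pattern_stable=False)
after = DomainEvidence(validated_examples=847, base_model_passes=4,
                       base_model_runs=4, pattern_stable=True)
print(check_domain_phase(before))
print(check_domain_phase(after))
```

The design choice worth noting is the all(): a single failed criterion keeps the domain in the discovery phase, which mirrors the rule of never adding a new criterion before the previous one is solid.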

The Sequence

Run GRPO until you can trust your training data. Then run SFT. This is the sequence. Not the alternative. Running SFT first asks a photocopier to reproduce originals that do not yet exist.

Still in GRPO phase if:

Cannot construct a validation set with confidence
Unsure whether training examples capture the right pattern
Base model has never completed the task correctly

Graduated if:

GRPO-validated examples exist and are trustworthy
Outputs can be verified against known-correct behavior
Need reliable deployment more than continued discovery

The experiment I ran with QwQ-32B was not a failure of method. It was a failure of sequencing. I ran SFT before the domain had graduated. I did not have validated examples. I had outputs from a model that had not yet discovered the right pattern. The golden_hour_sun parameter was not a GRPO artifact. It was the artifact of running SFT too early.

Closing

One More Thing: Even Copying Had to Be Discovered First

DNA is nature's SFT, the most successful copying mechanism in the history of life, replicating molecular information with extraordinary fidelity across billions of cells, generating consistent proteins from consistent genes across billions of years. It did not get designed. It got discovered. Whatever chemical process started in warm shallow water 3.8 billion years ago was GRPO: molecular structures generating variants, survival selecting what persisted. At some point in that process, something discovered that copying worked. That fidelity had survival value. That inheritance was an advantage.

The mechanism of copying was itself discovered through trial.

I caught the golden_hour_sun failure because I manually verified eight test cases on a Saturday. The eval score was 7 out of 8. The loss curve had been clean. Everything looked right. The only reason I found it was that I happened to check the math, not because the tooling flagged it, not because the eval failed, but because I sat down and worked through the outputs by hand.

Most teams don't do that. Not because they're negligent. The standard tooling gives you no reason to. A clean loss curve and a passing eval score look identical whether the domain graduated or not. The model answers in the same confident voice either way. The outputs have the same shape. The only difference is whether the discovery happened, and that difference is invisible to most automated metrics we have.

The field has discovered an extraordinary amount through pretraining at scale. For tasks where the correct patterns appear densely in training data (general reasoning, language understanding, common knowledge), something like graduation probably happened somewhere in the ten trillion tokens. SFT on top of that works. You're copying something real.

But for specialized domains (Hamiltonian geometry, drug interaction prediction, niche legal reasoning, any task where correct patterns are sparse or absent in pretraining data), graduation happened nowhere in the stack. And the model doesn't know the difference. It fills in the shape of the answer with whatever it has, confidently, structurally correctly, with internal logic that passes every automated check. That's what golden_hour_sun was. Not a bug. A demonstration of the mechanism, at small scale, with the math visible enough to catch.

We are running that experiment at large scale, across most of the fine-tuning being done in production right now, on tasks that matter. The loss curves look clean everywhere.

The Stack Most Teams Actually Use

Layer 1, base model: pretrained on raw internet text. Some discovery happened here, incidentally, during pretraining. You don't control it and can't inspect it.

Layer 2, instruct model: the base model + SFT on (instruction, response) pairs. This is what GPT-4, Claude, and most open models ship as. The SFT crystallized general instruction-following. No GRPO was run on your specific domain.

Layer 3, your fine-tune: the instruct model + SFT on your domain examples. You're adding a third layer of copying. If your examples were generated by the instruct model, they're copies of copies of copies.

At no point in this stack did anything discover what correct looks like in your domain. The loss curve will still go down. The model will still sound confident.

The diagnostic in "How to Know Which Phase You're In" is not a theoretical exercise. It's the question most teams aren't asking at the point in the pipeline where it actually matters: before the training run, not after. Can you construct a validated training set you trust? Has the base model completed this task correctly without prompting? Do you have ground truth, or do you have outputs that have the shape of ground truth?

If the answer is no, you're not choosing between SFT and GRPO. You're choosing between discovering what correct looks like, or skipping that step and hoping the copies are good enough.

Uber's engineers did not think they were skipping a step. The perception system was detecting objects. The test drives were accumulating miles. Nobody sat down and asked whether the system could actually classify what it was seeing until someone died on a road in Tempe.

golden_hour_sun was a Saturday and eight test cases. Most production domains won't give you that.

The question this post is really asking

What happens when you train SFT on outputs from a model that has never discovered anything, only copied?
