Self-Trained Verification for
Training- and Test-Time Self-Improvement
Carnegie Mellon University
Abstract
Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to the training set. Better verifiers would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks a training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14× on scientific reasoning tasks (1.5% → 21%). At training time, we additionally train the generator with STV verifier feedback inside the V-R loop, a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% relative gain in V-R pass@1. More notably, the generator’s standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning may lie in how we train for and with verification.
1. The verification bottleneck
Reasoning models can self-improve in two natural places: at test time, by interleaving generation, verification and refinement; and at training time, by using self-generated data as supervision. Both are gated by the same thing — the verifier.
Verification-refinement loops stall when verifier scores climb while accuracy stagnates and feedback is too generic for the generator to act on (in-context reward hacking). Self-training fails similarly when bad self-generated data poison the training set. The fix on both ends is a better verifier: one that can catch errors the generator can’t detect in its own work and surface them as useful natural-language feedback. But that capability has neither labelled finetuning data nor a verifiable scalar reward, so the usual scalable training recipes (SFT on traces, RL on verifier scores, meta-verifiers) saturate quickly.
Our starting observation: while a model can’t reliably catch a flawed solution’s errors from scratch, it can when shown the reference solution alongside the attempt. The reference-conditioned model becomes a more informed version of itself — and that asymmetry is enough of a signal to train against.
2. Self-trained verification (STV)
A V-R pipeline couples a generator G with a verifier V. At round 0, the generator produces an initial solution. At each subsequent round, the verifier examines the latest solution and returns a verdict together with natural-language feedback; if rejected, the generator refines based on the feedback, and the loop continues until acceptance or a round budget. The hard problem is to train V to provide useful feedback — a capability with no labeled finetuning data and no verifiable scalar reward.
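The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Verdict` type, the `generate` and `verify` callables, and their signatures are all hypothetical stand-ins for calls to the generator G and verifier V.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    accepted: bool   # verifier's accept/reject decision
    feedback: str    # natural-language critique for the generator


# Hypothetical signatures: generate(problem, previous_solution, feedback) -> solution
GenerateFn = Callable[[str, Optional[str], Optional[str]], str]
# Hypothetical signature: verify(problem, solution) -> Verdict
VerifyFn = Callable[[str, str], Verdict]


def vr_loop(problem: str, generate: GenerateFn, verify: VerifyFn,
            max_rounds: int = 20) -> Tuple[str, List[str]]:
    """Run one V-R chain and return (final solution, all solutions so far)."""
    solution = generate(problem, None, None)      # round 0: initial attempt
    history = [solution]
    for _ in range(max_rounds):                   # rounds 1..max_rounds
        verdict = verify(problem, solution)
        if verdict.accepted:                      # stop on acceptance
            break
        # Rejected: refine the latest solution using the verifier's feedback.
        solution = generate(problem, solution, verdict.feedback)
        history.append(solution)
    return solution, history
```

The loop stops either when the verifier accepts or when the round budget is exhausted, so an overly lenient verifier ends the chain early on a wrong answer, while an uninformative one burns rounds without progress.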
The key observation is the one from Section 1: a model can’t reliably catch a flawed solution’s errors from scratch, but it can when shown the reference solution alongside the attempt. We use the reference-conditioned model as a teacher and distill its behaviour into the unconditioned student verifier using on-policy distillation (OPD). Teacher and student share the same base model, so the supervision is “the same model, with privileged information stripped away.” The student is also trained with verifiable RL on verdict accuracy: OPD supplies rich supervision for the feedback, while the RL term keeps the verdicts calibrated. At inference, only the student is used, so no oracle access is needed at test time.
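To make the teacher/student asymmetry concrete, the sketch below shows the one difference between the two roles (whether the reference solution appears in the prompt) and one common choice of distillation signal. Everything here is an assumption for illustration: the prompt template, the function names, and the use of a per-token reverse KL, KL(student ‖ teacher), are not taken from the source, which does not specify the exact loss or how it is weighted against the RL term.

```python
import math
from typing import Dict, Optional


def verifier_prompt(problem: str, attempt: str,
                    reference: Optional[str] = None) -> str:
    """Build the verifier prompt (hypothetical template).

    The teacher passes `reference`; the student leaves it as None.
    Both prompts are fed to the same base model.
    """
    parts = [f"Problem:\n{problem}", f"Attempt:\n{attempt}"]
    if reference is not None:
        parts.append(f"Reference solution:\n{reference}")  # privileged info
    parts.append("Give a verdict and feedback.")
    return "\n\n".join(parts)


def reverse_kl(student: Dict[str, float], teacher: Dict[str, float]) -> float:
    """KL(student || teacher) for one next-token distribution.

    Assumes the teacher assigns non-zero probability to every token
    the student can emit.
    """
    return sum(p * math.log(p / teacher[tok])
               for tok, p in student.items() if p > 0)
```

In an on-policy setup the tokens being scored are sampled from the student, so the teacher only grades text the student actually produces. Averaging a quantity like `reverse_kl` over those positions would give one possible distillation loss.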
3. Verifier-in-the-loop training (ViL)
With a trained STV verifier, the natural next step is to train the generator to better use the verifier’s feedback. We call this verifier-in-the-loop training (ViL): each RL episode unrolls a multi-turn V-R rollout on a problem, and the reward is the verifiable correctness of the final solution. Only the generator’s parameters are updated; the STV verifier stays frozen.
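A ViL episode can be pictured as a multi-turn transcript with an outcome-only reward. The sketch below is a hedged illustration: the `Turn` type, the role names, and the idea of a boolean mask over turns are hypothetical, and the source only states that the reward is the verifiable correctness of the final solution and that the verifier stays frozen.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str   # "generator" or "verifier" (hypothetical role names)
    text: str


def final_solution(turns: List[Turn]) -> str:
    """Return the last generator turn, i.e. the solution that gets scored."""
    for turn in reversed(turns):
        if turn.role == "generator":
            return turn.text
    raise ValueError("episode contains no generator turn")


def episode_reward(turns: List[Turn],
                   is_correct: Callable[[str], bool]) -> float:
    """Outcome-only reward: 1.0 if the final solution is correct, else 0.0."""
    return 1.0 if is_correct(final_solution(turns)) else 0.0


def trainable_mask(turns: List[Turn]) -> List[bool]:
    """Mark generator turns as trainable; verifier turns are treated as fixed context."""
    return [turn.role == "generator" for turn in turns]
```

Intermediate solutions receive no direct reward in this sketch, so credit for a correct final answer flows back through every generator turn in the rollout, including the first attempt.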
A natural expectation is that ViL improves V-R pass@1 at test time, since the generator learns to use feedback. What is less obvious is that the generator’s performance without the loop should change at all. Yet in our experiments its standalone pass@1 climbs sharply past where standard RL had converged. We call this training-time self-improvement: training inside the V-R loop teaches skills that surface in the generator’s first unscaffolded attempt.
4. Experiments
4.1 Setup
We split hard DAPO math problems by Qwen3-8B’s pass@1 into two bins: Hardest (pass@1 = 0) and Hard (0 < pass@1 < 0.2). After cross-source deduplication via text embeddings, we keep ~150 problems per bin and run 32 independent V-R chains for up to 20 rounds. We also evaluate on SciKnowEval, a multi-domain science benchmark spanning chemistry, biology, physics, and materials science, split by the same protocol.
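The binning rule above is simple enough to state as code. This is a small sketch with hypothetical function names; the number of samples used to estimate each problem's pass@1 is not given in the source, so the example outcome lists are illustrative only.

```python
from typing import List, Optional


def pass_at_1(outcomes: List[bool]) -> float:
    """Empirical pass@1: fraction of sampled attempts that are correct."""
    return sum(outcomes) / len(outcomes)


def difficulty_bin(p: float) -> Optional[str]:
    """Assign a problem to a bin from its pass@1, or None if it is too easy."""
    if p == 0:
        return "Hardest"      # never solved by the base model
    if 0 < p < 0.2:
        return "Hard"         # solved occasionally
    return None               # pass@1 >= 0.2: excluded from both bins
```

For instance, a problem solved in 1 of 8 sampled attempts has pass@1 = 0.125 and lands in the Hard bin, while one solved in 2 of 8 attempts (0.25) is excluded.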
4.2 Verifier-guided refinement
Base generator (post-trained Qwen3-8B). Without verifier training, pipeline pass@1 stagnates quickly. Training the verifier with RL on verdict accuracy alone helps only marginally. A meta-verifier (using GPT-5.2 as a proxy reward) shows little effect. SFT on teacher-generated traces yields no gains. STV lifts final-round pass@1 by up to 2× over the untrained-verifier pipeline.
Scientific reasoning. The same recipe applied to SciKnowEval lifts the base Qwen3-8B’s final-round pass@1 from 1.5% to 21.0% on the Hardest bin and from 11.4% to 42.4% on the Hard bin. The STV-guided 8B even beats the 30× larger Qwen3-235B-A22B without verification on both bins (21.0% vs 8.0% on Hardest; 42.4% vs 23.6% on Hard).
Continual-trained generator. On a Qwen3-8B already trained to convergence with standard RL, STV still adds substantial lift over its self-verification baseline. The benefit of a trained verifier is therefore not absorbed by stronger RL training of the generator.
4.3 Weak-to-strong verification
4.4 ViL generator training
The expected result: ViL improves test-time V-R pass@1 by 33% relative. The surprising result: ViL also lifts the generator’s standalone pass@1 (no verifier at inference) by 30% relative, past where standard RL had converged. Training inside the V-R loop teaches skills that surface in the generator’s first unscaffolded attempt.
Is the oracle necessary? STV uses reference solutions for verifier training; what if we use the oracle elsewhere, or not at all?
| Method | Round 0 pass@1 (%) | Round 20 pass@1 (%) |
|---|---|---|
| **Not using oracle** | | |
| RLVR-only (converged) | 23.7 | 30.4 |
| ViL + self-verify (ours) | 29.8 | 39.4 |
| **Using oracle** | | |
| Prefix-conditioning | 29.1 | 38.5 |
| ViL + STV verifier (ours) | 31.2 | 43.3 |
Two takeaways: (i) the V-R loop structure itself drives most of the training-time gain: ViL with self-verification comes close to ViL with the STV verifier at round 0 (29.8% vs 31.2%) even though it uses no oracle anywhere in the pipeline; (ii) the trained verifier’s advantage shows up at test time, where its better feedback compounds across rounds (43.3% vs 39.4% at round 20).
4.5 Why STV works
Calibrated test-time scaling. The untrained-verifier failure mode is in-context reward hacking: as the loop progresses, the generator finds increasingly plausible-looking but still incorrect solutions, and verifier scores rise without accuracy rising. STV breaks this dynamic, producing scores that track accuracy and a precision–coverage curve that improves with more verification compute.
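One way to read a precision–coverage curve is as a sweep over an acceptance threshold on the verifier score. The sketch below assumes that definition, which the source does not spell out: coverage is the fraction of solutions the verifier accepts at a given threshold, and precision is the fraction of those accepted solutions that are actually correct. The function name and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple


def precision_coverage(scores: List[float], correct: List[bool],
                       threshold: float) -> Tuple[Optional[float], float]:
    """Return (precision, coverage) when accepting scores >= threshold.

    Precision is None when nothing is accepted.
    """
    accepted = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(accepted) / len(scores)
    precision = sum(accepted) / len(accepted) if accepted else None
    return precision, coverage


# Toy data: verifier scores and ground-truth correctness for four solutions.
scores = [0.9, 0.8, 0.4, 0.2]
correct = [True, False, True, False]
curve = [precision_coverage(scores, correct, t) for t in (0.85, 0.5, 0.0)]
```

Under in-context reward hacking, scores for wrong solutions drift upward across rounds, so precision at any fixed threshold falls. A verifier whose scores track accuracy keeps precision high as coverage grows.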
Value of trained feedback. Even with ground-truth verdicts in place, STV’s natural-language feedback lifts pass@1 over a verdict-only baseline, confirming that STV is doing work beyond verdict calibration.
4.6 How STV shapes the output distribution
Verification does not suppress diversity. Pass@k rises alongside pass@1 over the first 10 rounds, so the V-R gain is not a sharpening trade-off that collapses diversity.
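For readers who want to reproduce a pass@k comparison, the widely used unbiased estimator computes pass@k from n samples of which c are correct as 1 − C(n−c, k)/C(n, k). The source does not say which estimator it uses, so treating this as the one behind the reported numbers is an assumption; n = 32 below simply mirrors the 32 independent chains per problem.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct.

    Equals the probability that a random size-k subset of the n samples
    contains at least one correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimator reduces to c/n, the empirical pass@1. A sharpening effect would raise pass@1 while leaving pass@k at larger k flat or lower, which is the pattern the paragraph above rules out.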
Refinement vs resampling. At matched compute, V-R refinement dominates best-of-N resampling for the base and STV generators. This is evidence that natural-language feedback lets V-R make targeted corrections that no resample could have produced, rather than just sharpening an existing mode.
5. Takeaways
- The bottleneck for self-improvement at both ends is the verifier.
- Reference-conditioning turns a non-verifiable capability (catching self-generated errors) into a distillation target, without any human-graded feedback.
- Putting the trained verifier in the loop during generator training (ViL) transfers gains back to the unscaffolded generator — a new scaling direction that breaks through the standard RL plateau.