Self-Trained Verification for
Training- and Test-Time Self-Improvement

Carnegie Mellon University

May 2026

**Overview.** *(a)* **STV** trains a verifier by distilling from a reference-conditioned teacher: the same base model becomes a more informed version of itself when shown the reference solution, and that asymmetry is the supervision target. *(b)* The trained verifier scales better with test-time compute than untrained or verdict-RL alternatives. *(c)* Putting the verifier in the loop during generator training (**ViL**) yields a further pass@1 gain, including a *standalone* gain that persists with no verifier at test time.

Abstract

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verifiers would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14× on scientific reasoning tasks (1.5% → 21%). At training time, we additionally train the generator with STV verifier feedback inside the V-R loop — a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator’s standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning may lie in how we train for and with verification.

2× accuracy on hard math at test time

14× on hardest scientific reasoning (1.5% → 21%)

+30% standalone pass@1, past the RL plateau

1. The verification bottleneck

Reasoning models can self-improve in two natural places: at test time, by interleaving generation, verification and refinement; and at training time, by using self-generated data as supervision. Both are gated by the same thing — the verifier.

Verification-refinement loops stall when verifier scores climb while accuracy stagnates (in-context reward hacking) and when feedback is too generic for the generator to act on. Self-training fails similarly when bad self-generated data poison the training set. The fix on both ends is a better verifier: one that can catch errors the generator can’t detect in its own work and surface them as useful natural-language feedback. But that capability has neither labelled finetuning data nor a verifiable scalar reward, so the usual scalable training recipes (SFT on traces, RL on verifier scores, meta-verifiers) saturate quickly.

Our starting observation: while a model can’t reliably catch a flawed solution’s errors from scratch, it can when shown the reference solution alongside the attempt. The reference-conditioned model becomes a more informed version of itself — and that asymmetry is enough of a signal to train against.

2. Self-trained verification (STV)

A V-R pipeline couples a generator G with a verifier V. At round 0, the generator produces an initial solution. At each subsequent round, the verifier examines the latest solution and returns a verdict together with natural-language feedback; if the solution is rejected, the generator refines it based on the feedback, and the loop continues until acceptance or until the round budget is exhausted. The hard problem is to train V to provide useful feedback, a capability with no labelled finetuning data and no verifiable scalar reward.
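The loop described above can be sketched in a few lines. This is a minimal illustration, not our implementation: the `Verdict` type, the `Generator` and `Verifier` call signatures, and the name `run_vr_loop` are all hypothetical, and both models are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    accepted: bool   # the verifier's accept/reject decision
    feedback: str    # natural-language feedback for the generator


# Hypothetical interfaces (assumptions, not from the paper):
#   generator(problem, previous_solution, feedback) -> new solution
#   verifier(problem, solution) -> Verdict
Generator = Callable[[str, Optional[str], Optional[str]], str]
Verifier = Callable[[str, str], Verdict]


def run_vr_loop(problem: str, generate: Generator, verify: Verifier,
                max_rounds: int = 20) -> Tuple[str, List[Verdict]]:
    """Run one V-R chain; return the final solution and the verdict history."""
    solution = generate(problem, None, None)      # round 0: initial solution
    history: List[Verdict] = []
    for _ in range(max_rounds):
        verdict = verify(problem, solution)       # examine the latest solution
        history.append(verdict)
        if verdict.accepted:
            break                                 # stop on acceptance
        # Rejected: refine using the verifier's feedback.
        solution = generate(problem, solution, verdict.feedback)
    return solution, history
```

Note that in this sketch the solution produced by the last refinement is returned without a final verification pass; whether the budget counts verifier calls or refinements is a design choice left open here.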

STV method diagram. — **STV.** The reference-conditioned teacher sees the reference solution along with the attempt; the student verifier sees only the attempt. On-policy distillation pulls the student toward the teacher’s distribution, turning the privileged-vs-unprivileged asymmetry into a supervision target.

Our key observation: a model can’t reliably catch a flawed solution’s errors from scratch, but it can when shown the reference solution alongside the attempt. We use the reference-conditioned model as a teacher and distill its behaviour into the unconditioned student verifier using on-policy distillation (OPD). Teacher and student share the same base model, so the supervision is “the same model, with privileged information stripped away.” The student is also trained with RL on a verifiable reward, its verdict accuracy. The two terms play different roles: OPD provides rich supervision for the feedback, and the RL term keeps verdict accuracy calibrated. At inference, only the student is used, so no oracle access is needed at test time.
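To make the distillation term concrete, here is a small sketch of one common form of on-policy distillation loss: a per-token reverse KL, KL(student || teacher), averaged over a feedback sequence sampled from the student. The choice of reverse KL is an assumption on our part for illustration; the text above does not fix the divergence. The function names are hypothetical. The student logits would come from conditioning on the problem and attempt only, and the teacher logits from the same base model additionally conditioned on the reference solution.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: Sequence[float],
               teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def opd_loss(student_logits_per_token: Sequence[Sequence[float]],
             teacher_logits_per_token: Sequence[Sequence[float]]) -> float:
    """Mean per-token reverse KL along one student-sampled sequence.

    Assumption: both models are scored on the SAME tokens, sampled from
    the student (that is what makes the distillation on-policy).
    """
    kls = [reverse_kl(s, t)
           for s, t in zip(student_logits_per_token, teacher_logits_per_token)]
    return sum(kls) / len(kls)
```

The loss is zero when the student already matches the reference-conditioned teacher at every position, and grows as the unconditioned student's feedback distribution drifts from the teacher's. The RL term on verdict accuracy is not shown.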

3. Verifier-in-the-loop training (ViL)

With a trained STV verifier, the natural next step is to train the generator to better use the verifier’s feedback. We call this verifier-in-the-loop training (ViL): each RL episode unrolls a multi-turn V-R rollout on a problem, and the reward is the verifiable correctness of the final solution. Only the generator’s parameters are updated; the STV verifier stays frozen.
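A single ViL episode might look like the sketch below. It is illustrative only: `vil_episode`, the callable signatures, and the binary 0/1 reward are assumptions, and the RL algorithm that consumes the episode (including how the terminal reward is credited across generator turns) is deliberately left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the paper).
Generator = Callable[[str, Optional[str], Optional[str]], str]
FrozenVerifier = Callable[[str, str], Tuple[bool, str]]  # (accepted, feedback)


def vil_episode(problem: str,
                reference_answer: str,
                generate: Generator,
                verify: FrozenVerifier,
                is_correct: Callable[[str, str], bool],
                max_rounds: int = 4) -> Dict[str, object]:
    """Unroll one multi-turn V-R rollout and attach a terminal reward."""
    solution = generate(problem, None, None)
    generator_turns: List[str] = [solution]       # only these are trained on
    for _ in range(max_rounds):
        # The verifier is frozen: it is only queried, never updated.
        accepted, feedback = verify(problem, solution)
        if accepted:
            break
        solution = generate(problem, solution, feedback)
        generator_turns.append(solution)
    # Reward: verifiable correctness of the FINAL solution (assumed binary).
    reward = 1.0 if is_correct(solution, reference_answer) else 0.0
    return {"generator_turns": generator_turns, "reward": reward}
```

Because the reward depends only on the final solution, an episode in which the verifier wrongly accepts a flawed first attempt earns zero reward, even though the verifier itself is never penalised.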

A natural expectation is that ViL improves V-R pass@1 at test time, since the generator learns to use feedback. What is less obvious is that the generator’s behaviour without the loop should change at all. Yet in our experiments its standalone pass@1 climbs sharply past where standard RL had converged. We call this training-time self-improvement: training inside the V-R loop teaches skills that surface in the generator’s first unscaffolded attempt.

4. Experiments

4.1 Setup

We split hard DAPO math problems by Qwen3-8B’s pass@1 into two bins: Hardest (pass@1 = 0) and Hard (0 < pass@1 < 0.2). After cross-source deduplication via text embeddings, we keep ~150 problems per bin and run 32 independent V-R chains for up to 20 rounds. We also evaluate on SciKnowEval, a multi-domain science benchmark spanning chemistry, biology, physics, and materials science, split by the same protocol.
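The difficulty split is simple enough to state as code. A minimal sketch, assuming per-problem pass@1 estimates are already available as a mapping from a problem id to a value in [0, 1]; the function name and the dictionary layout are hypothetical, and the deduplication and subsampling to ~150 problems per bin are not shown.

```python
from typing import Dict, List


def bin_by_pass_at_1(pass_at_1: Dict[str, float]) -> Dict[str, List[str]]:
    """Assign problems to the Hardest and Hard bins by base-model pass@1."""
    bins: Dict[str, List[str]] = {"hardest": [], "hard": []}
    for problem_id, p in pass_at_1.items():
        if p == 0.0:
            bins["hardest"].append(problem_id)    # Hardest: pass@1 = 0
        elif 0.0 < p < 0.2:
            bins["hard"].append(problem_id)       # Hard: 0 < pass@1 < 0.2
        # Problems with pass@1 >= 0.2 are excluded from both bins.
    return bins
```

For example, `bin_by_pass_at_1({"a": 0.0, "b": 0.1, "c": 0.2})` places `a` in the Hardest bin, `b` in the Hard bin, and drops `c`.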

The three settings we study. **Top:** an STV verifier on the base generator (post-trained Qwen3-8B). **Middle:** an STV verifier on a continual-trained (RLVR) generator. **Bottom:** verifier-in-the-loop (ViL) RLVR trains the generator with STV verifier feedback, yielding the STV generator.

4.3 Weak-to-strong verification

Training smaller verifiers to verify a larger generator. — Training smaller verifiers to verify an 8B generator. After STV, the 4B verifier becomes competitive with the 8B STV verifier; the 1.7B verifier matches the untrained 8B verifier. Verifier compute can often be moved to a smaller model without losing the STV gain.

4.4 ViL generator training

Closing the loop: ViL produces a standalone pass@1 gain past the RLVR plateau. — Pass@1 across V-R rounds for the continual-trained Qwen3-8B generator. *RLVR-only (converged)* plateaus; *ViL + STV verifier* climbs past it at *every* round — including round 0, before any verification. Continuing standard RL for the same compute (*RLVR-only (longer)*) closes none of the gap.

The expected result: ViL improves test-time V-R pass@1 by 33% relative. The surprising result: ViL also lifts the generator’s standalone pass@1 (no verifier at inference) by 30% relative, past where standard RL had converged. Training inside the V-R loop teaches skills that surface in the generator’s first unscaffolded attempt.

Is the oracle necessary? STV uses reference solutions for verifier training; what if we use the oracle elsewhere, or not at all?

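| Method | Round 0 pass@1 (%) | Round 20 pass@1 (%) |
|---|---|---|
| *Not using oracle* | | |
| RLVR-only (converged) | 23.7 | 30.4 |
| ViL + self-verify (ours) | 29.8 | 39.4 |
| *Using oracle* | | |
| Prefix-conditioning | 29.1 | 38.5 |
| ViL + STV verifier (ours) | 31.2 | 43.3 |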

Two takeaways. (i) The V-R loop structure itself drives most of the training-time gain: ViL with self-verification comes within 1.4 points of ViL with the STV verifier at round 0 (29.8% vs 31.2%), even though it uses no oracle anywhere in the pipeline. (ii) The trained verifier’s advantage shows up at test time, where its better feedback compounds across rounds (43.3% vs 39.4% at round 20).

4.5 Why STV works

Calibrated test-time scaling and precision-coverage. — *(left)* Pass@1 vs verifier score across V-R rounds. Untrained verifiers exhibit in-context reward hacking: scores climb while accuracy stagnates. STV remains calibrated — higher scores mean genuinely higher accuracy. *(right)* Precision–coverage frontier: STV reaches 3–5× higher precision at matched coverage, and precision *rises* with each verification round.

Calibrated test-time scaling. The untrained-verifier failure mode is in-context reward hacking: as the loop progresses, the generator finds increasingly plausible-looking but still incorrect solutions, and verifier scores rise without accuracy rising. STV breaks this dynamic, producing scores that track accuracy and a precision–coverage curve that improves with more verification compute.
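One way to read a precision-coverage curve is sketched below. The definitions are our assumption for illustration and may differ in detail from those behind the figure: at a score threshold, coverage is the fraction of solutions the verifier accepts, and precision is the fraction of accepted solutions that are actually correct. The function names and the `(score, correct)` input layout are hypothetical.

```python
from typing import List, Tuple


def precision_coverage(scored: List[Tuple[float, bool]],
                       threshold: float) -> Tuple[float, float]:
    """Return (precision, coverage) when accepting scores >= threshold."""
    accepted = [correct for score, correct in scored if score >= threshold]
    coverage = len(accepted) / len(scored)
    # Precision is undefined when nothing is accepted.
    precision = sum(accepted) / len(accepted) if accepted else float("nan")
    return precision, coverage


def frontier(scored: List[Tuple[float, bool]]) -> List[Tuple[float, float, float]]:
    """Sweep every observed score as a threshold, strictest first.

    Each entry is (threshold, precision, coverage).
    """
    thresholds = sorted({score for score, _ in scored}, reverse=True)
    return [(t, *precision_coverage(scored, t)) for t in thresholds]
```

Under in-context reward hacking, scores of incorrect solutions drift upward across rounds, so precision at a fixed coverage falls; a calibrated verifier keeps high-scoring solutions mostly correct.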

Decomposing feedback value from verdict accuracy. We replace the verifier’s verdicts with ground-truth correctness labels and vary the feedback source. Untrained feedback already lifts pass@1 over a verdict-only baseline (+5.2% on Hard); STV feedback adds a further +3.2%, showing that STV improves feedback *quality* independent of verdict accuracy.

Value of trained feedback. Even with ground-truth verdicts in place, STV’s natural-language feedback lifts pass@1 over a verdict-only baseline, confirming that STV is doing work beyond verdict calibration.

4.6 How STV shapes the output distribution

Refinement (V-R) vs resampling (best-of-N) at matched compute. Refinement dominates resampling for both the base and STV-trained generators, consistent with V-R *reshaping* the distribution rather than only sharpening it. The STV verifier helps *both* refinement and resampling in every setting.

Verification does not suppress diversity. Pass@k also rises alongside pass@1 in the first 10 rounds, so the V-R gain is not a sharpening trade-off that collapses diversity.
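For readers who want to reproduce the diversity check, pass@k can be estimated from n sampled solutions of which c are correct. The sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k); whether this exact estimator was used for the numbers reported here is an assumption, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct.

    Equals the probability that a uniformly random size-k subset of the
    n samples contains at least one correct solution.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 32 chains per problem, `pass_at_k(32, c, 1)` reduces to `c / 32`, the usual pass@1. If verification only sharpened the distribution, pass@1 could rise while pass@k at larger k stayed flat or fell.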

Refinement vs resampling. At matched compute, V-R refinement dominates best-of-N resampling for both the base and STV-trained generators. This is evidence that natural-language feedback lets V-R make targeted corrections that resampling alone would not have produced, rather than just sharpening an existing mode.

5. Takeaways

The bottleneck for self-improvement at both ends is the verifier.
Reference-conditioning turns a non-verifiable capability (catching self-generated errors) into a distillation target, without any human-graded feedback.
Putting the trained verifier in the loop during generator training (ViL) transfers gains back to the unscaffolded generator — a new scaling direction that breaks through the standard RL plateau.

Citation

@article{wu2026stv,
  title   = {Self-Trained Verification for Training- and Test-Time Self-Improvement},
  author  = {Wu, Chen Henry and Raghunathan, Aditi},
  journal = {arXiv preprint arXiv:2605.30290},
  year    = {2026},
}