Mode-conditioning unlocks superior test-time compute scaling

Chen Henry Wu, Sachin Goyal, Aditi Raghunathan

Carnegie Mellon University
Preprint

Modern LLMs often collapse to a single strategy at test time, which makes Pass@k scaling suboptimal. We introduce mode-conditioning (ModC), a framework that explicitly allocates test-time compute across diverse modes, achieving up to 8× sample-efficiency gains while improving the maximum attainable Pass@k.

The Challenge: Diversity Collapse in Test-Time Scaling

Parallel sampling (e.g., Pass@k) promises substantial gains in test-time compute scaling—instead of one attempt, the model gets k independent tries to solve a problem. This approach is especially effective for tasks like mathematics and coding where solutions can be automatically verified.
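To make the metric concrete, here is a minimal sketch of the unbiased Pass@k estimator commonly used in code-generation evaluation. The text above does not say which estimator is used, so treat this choice, and the function name pass_at_k, as assumptions for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled attempts, c of which were verified correct.

    Assumption: this is the standard unbiased estimator, not necessarily
    the exact procedure behind the results reported here.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n attempts contains no correct attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, pass_at_k(4, 2, 2) evaluates 1 - 1/6: of the six ways to pick two of the four attempts, only one picks both incorrect ones.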

The Problem: Despite its promise, parallel scaling suffers from diversity collapse—models concentrate on a few dominant modes and repeated samples produce the same mistakes. As a result:

Additional samples often reproduce identical errors
Models converge on indistinguishable strategies
Scaling test-time compute yields diminishing returns

Figure: Standard training fails to balance diverse modes per problem under repeated sampling. Models trained on mixed data (DFS and BFS algorithms) tend to commit to just one strategy per problem, rather than exploring both. Mode-conditioning (ModC) successfully achieves balanced allocation.

💡 Key Insight: Explicit mode allocation beats sampling from collapsed distributions

🧮 Mathematical Intuition

Even when a model is balanced across two modes with equal weights, explicitly allocating k/2 samples to each mode strictly outperforms drawing k samples from the mixture—whenever the modes have different success probabilities on an input. The advantage is especially pronounced when the dominant mode fails but a lower-probability one succeeds.
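The claim can be checked numerically. The sketch below assumes two modes with per-sample success probabilities p1 and p2 on a given input, independent samples, and an even budget k; the function names are hypothetical. The gap follows from the AM-GM inequality: the per-pair failure probability (1 - p1)(1 - p2) under explicit allocation is at most the squared mixture failure probability (1 - (p1 + p2)/2)², with equality only when p1 = p2.

```python
def mixture_pass_at_k(p1: float, p2: float, k: int) -> float:
    """Pass@k when all k samples come from the balanced 50/50 mixture."""
    p_mix = 0.5 * (p1 + p2)
    return 1.0 - (1.0 - p_mix) ** k


def allocated_pass_at_k(p1: float, p2: float, k: int) -> float:
    """Pass@k when exactly k/2 samples are drawn from each mode."""
    if k % 2 != 0:
        raise ValueError("this sketch assumes an even budget k")
    half = k // 2
    # All attempts fail only if every sample from both modes fails.
    return 1.0 - (1.0 - p1) ** half * (1.0 - p2) ** half
```

With p1 = 0.0, p2 = 0.2, and k = 8, the mixture gives roughly 0.570 while explicit allocation gives roughly 0.590. When p1 = p2 the two quantities coincide.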

Mode-Conditioning: The Solution

Given the challenges of diversity collapse, we propose mode-conditioning (ModC), a framework that explicitly structures test-time scaling around multiple reasoning modes. Rather than drawing repeatedly from a collapsed distribution, we enforce coverage across strategies by conditioning on modes.

Core Idea

Explicit Mode Allocation: Instead of sampling k times from a single collapsed distribution, explicitly access N diverse modes and draw k/N samples from each.

Better Coverage: A problem that remains intractable under one distribution might be solvable under another, dramatically expanding the range of solvable problems.
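As a small illustration of the allocation rule, the sketch below splits a budget of k samples across N modes. How a remainder is handled when k is not divisible by N is not specified above; giving the extra samples to the first modes is an assumption of this sketch, as is the function name split_budget.

```python
def split_budget(k: int, n_modes: int) -> list[int]:
    """Return the number of samples assigned to each of n_modes modes.

    Assumption: leftover samples (when k is not a multiple of n_modes)
    go to the lowest-indexed modes.
    """
    if k < 0 or n_modes < 1:
        raise ValueError("need k >= 0 and n_modes >= 1")
    base, extra = divmod(k, n_modes)
    return [base + (1 if i < extra else 0) for i in range(n_modes)]
```

In the two-mode case with k = 8, this yields four samples per mode, matching the k/2 split described for the specialists.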

Two Approaches to Mode-Conditioning

🔀 Approach 1: Separate Specialist Models

Train distinct models, each specialized to a particular mode of reasoning. The training data is partitioned into subsets corresponding to different strategies, and a separate model is trained on each subset. At test time, the sampling budget is divided across the specialists (e.g., k/2 samples from each in the two-mode case).

✅ Advantages

Strong Separation: Ensures clear specialization with no interference between modes.

Reduced Correlated Errors: Different specialists fail on different problems, improving parallel scaling.

❌ Limitations

No Knowledge Sharing: Prevents transfer of common linguistic or mathematical foundations across modes.

🏷️ Approach 2: Mode-Specific Prefixes

Train a single model with explicit condition tokens (e.g., [Mode 1], [Mode 2]) prepended to each training example. The model learns to associate each prefix with a distinct reasoning strategy. At inference, balanced compute allocation is enforced by sampling evenly across the conditioning prefixes.
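A minimal sketch of how the prefixes might be used for data formatting and balanced inference follows. The prefix strings come from the example above; placing the prefix before the question, the single-space separator, and the helper names are assumptions made for illustration.

```python
from itertools import cycle, islice

MODE_PREFIXES = ["[Mode 1]", "[Mode 2]"]


def format_training_example(mode_index, question, solution, prefixes=MODE_PREFIXES):
    """Prepend the mode's condition token to one training example.

    Assumption: the prefix precedes the question; the exact template
    is not specified in the text.
    """
    return f"{prefixes[mode_index]} {question}\n{solution}"


def build_inference_prompts(question, k, prefixes=MODE_PREFIXES):
    """Build k prompts that cycle through the prefixes in round-robin order."""
    return [f"{prefix} {question}" for prefix in islice(cycle(prefixes), k)]
```

Round-robin cycling keeps the per-mode counts within one of each other even when k is not a multiple of the number of prefixes.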

✅ Advantages

Knowledge Sharing: Enables transfer of shared knowledge across modes while maintaining specialization.

More Scalable: Requires training only a single model instead of multiple specialists.

❌ Limitations

Capacity Limits: May face challenges if trying to capture too many modes in a single model.

Imperfect Control: Model might not cleanly separate behaviors in all cases.

🤖 Automated Mode Discovery via Gradient Clustering

What if we don't know the modes in advance? Most real-world training data contains mixed modes but lacks explicit labels. We propose using gradient clustering to automatically discover and annotate modes in training data.

💡 Method

Key Intuition: Examples that induce similar parameter updates likely come from the same mode.

Algorithm:

For each training example (x, y), compute gradients: g_θ(x, y) = ∇_θ log p_θ(y|x)
Apply Rademacher random projection to reduce dimensionality
Cluster the projected gradient vectors into C clusters
Treat each cluster as a "mode" and apply ModC with mode-specific prefixes
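The projection and clustering steps above can be sketched as follows. This sketch assumes the per-example gradients have already been computed and stacked into an (n, d) NumPy array. The steps above only say "cluster", so the use of k-means with a deterministic farthest-point initialization, the 1/sqrt(out_dim) scaling, and all function names are assumptions.

```python
import numpy as np


def rademacher_project(grads: np.ndarray, out_dim: int, seed: int = 0) -> np.ndarray:
    """Project (n, d) gradients to (n, out_dim) using a random +1/-1 matrix."""
    rng = np.random.default_rng(seed)
    signs = rng.choice(np.array([-1.0, 1.0]), size=(grads.shape[1], out_dim))
    # Scaling by 1/sqrt(out_dim) approximately preserves pairwise distances.
    return grads @ signs / np.sqrt(out_dim)


def kmeans(points: np.ndarray, n_clusters: int, n_iters: int = 50) -> np.ndarray:
    """Plain Lloyd's k-means; returns one cluster index per row of points."""
    # Farthest-point initialization: start from the first point, then
    # repeatedly add the point farthest from all chosen centers.
    centers = [points[0]]
    for _ in range(1, n_clusters):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.stack(centers)

    for _ in range(n_iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.stack([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def discover_modes(grads: np.ndarray, n_clusters: int, out_dim: int = 16) -> np.ndarray:
    """Assign a mode index to every example from its projected gradient."""
    return kmeans(rademacher_project(grads, out_dim), n_clusters)
```

The returned indices can then select the mode-specific prefix for each training example. In practice the gradient dimension d is very large, which is why the projection is applied before clustering.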

Experimental Results

🎯 Controlled Task: Countdown Graph Search

We first validate ModC on Countdown, a graph search task that can be solved using either depth-first search (DFS) or breadth-first search (BFS). Some problems are only solvable by DFS, others only by BFS—making mode coverage crucial.

Figure: ModC dramatically improves test-time scaling on Countdown. On both natural and adversarial test sets, ModC (both separate models and prefixes) consistently outperforms standard training, with gains up to 20% on adversarial problems that require mode diversity.

📊 Real-World Math Reasoning

Short Chain-of-Thought: We apply ModC to distillation from two teachers (DeepSeek-R1 and GPT-OSS-120B) on the NuminaMath dataset, evaluated on MATH500.

Figure: ModC improves short CoT reasoning. Pass@k on MATH500. Naively mixing teacher data underperforms the single-teacher baseline, while ModC shows consistent gains. ModC with prefixes generally outperforms ModC with separate models, underscoring the importance of sharing knowledge across modes (teacher strategies) in math reasoning.

Long Chain-of-Thought: We apply ModC to the OpenThoughts dataset with QwQ-32B and DeepSeek-R1 as teachers, and evaluate on AIME 2025.

Figure: 8× efficiency gains with ModC on long CoT. ModC matches the Pass@1024 of standard training with only k=128 samples, while also improving the maximum attainable Pass@k.

🔍 Automated Mode Discovery via Gradient Clustering

Can we discover modes automatically? We test gradient clustering, which groups examples that induce similar parameter updates, as a way to find meaningful modes in training data.

Validating on Multi-Teacher Data: We first validate gradient clustering on the short CoT dataset where ground-truth teacher labels exist. Gradient clustering achieves 98.7% F1 score in recovering teacher assignments, and more importantly, yields nearly identical test-time scaling benefits as using true teacher labels.

Figure: Validating gradient clustering on multi-teacher data. ModC with gradient clustering almost completely matches ModC with access to teacher annotations, confirming that gradient patterns effectively capture the underlying modes.
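Cluster indices are arbitrary, so scoring recovered clusters against teacher labels requires first matching each cluster to a teacher. The sketch below shows one way to do this: try every cluster-to-teacher assignment and keep the best macro-averaged F1. The text does not say how the reported F1 was computed, so the macro averaging, the exhaustive matching (practical only for a handful of clusters), and the function names are all assumptions.

```python
from itertools import permutations


def f1_for_class(true_labels, pred_labels, cls) -> float:
    """F1 score for one class, treating it as the positive label."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_macro_f1(true_labels, cluster_ids) -> float:
    """Best macro-F1 over all one-to-one mappings from clusters to classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        pred = [mapping[c] for c in cluster_ids]
        score = sum(f1_for_class(true_labels, pred, c) for c in classes) / len(classes)
        best = max(best, score)
    return best
```

A clustering that separates the two teachers perfectly scores 1.0 regardless of which index each cluster happens to receive.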

General Data Without Known Modes: We apply gradient clustering to NuminaMath, a diverse dataset where modes are unknown. ModC on automatically discovered modes yields significant improvements across model scales.

Figure: ModC on automatically discovered modes via gradient clustering improves short CoT. Pass@k on MATH500 shows consistent gains across Qwen2.5-Base model scales even without any mode annotations.

🎉 Key Takeaways

Consistent Gains: ModC improves test-time scaling across all settings—controlled tasks, short CoT, long CoT, and multiple model families (Qwen, OLMo2)
Efficiency: Up to 8× fewer samples, matching the Pass@1024 of standard training with only 128 samples
Scalable: Works with both explicit modes (teacher identity, search algorithms) and automated discovery (gradient clustering)
Model Scales: Benefits hold across 0.5B to 7B parameter models