Jun 18, 2025 · 14 min read
Weaver: Closing the Generation-Verification Gap with Weak Verifiers
TL;DR Large language models often generate correct answers but struggle to reliably distinguish their correct responses from incorrect ones, a costly challenge known as the generation-verification gap. Weaver closes this gap without relying on expensive frontier models. Instead, it aggregates multiple smaller, weak verifiers (like reward models and LM judges) without labeled data, reaching accuracy comparable to o3-mini on the math, science, and general reasoning tasks we study (MATH500, GPQA Diamond, and MMLU Pro, respectively). Crucially, Weaver further distills the ensemble of weak verifiers into a compact 400M model that fits on your laptop, preserving up to 98.7% of the ensemble's performance while reducing verification inference FLOPs by up to 99.97%. Weaver isn't about building bigger models; it's about smarter aggregation: efficiently transforming multiple weak verifiers into a single, powerful verification engine.
📄 Paper
💻 GitHub
🤗 Datasets and Models
Full team: Jon Saad‑Falcon*, E. Kelly Buchanan*, Mayee F. Chen*, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Ré

Figure 1: The Beaver Weaver
🔥 The Problem: The Generation-Verification Gap
Here's a frustrating reality: your LM often already "knows" the correct answer to your question (see LLMonkeys, Scaling TTC, and Archon). If you repeatedly query the LM—for example, 100 times—it'll usually generate the correct answer at least once. But the problem is, it doesn't know which of its responses is correct.
To demonstrate this clearly, we ran an experiment using Llama 3.3 70B on the GPQA Diamond dataset—a challenging benchmark of PhD-level science questions for evaluating reasoning performance. We generated 100 responses per query and analyzed the results using three different methods:
- Pass@100 (Oracle): 82.8% — Checks if the ground-truth answer is among the 100 generated answers, without needing to identify which one.
- Majority Voting: 45.5% — Selects the most frequently generated answer.
- First Sample: 42.9% — Simply takes the first generated answer.
Note: While GPQA Diamond uses a multiple-choice format, the PhD-level scientific reasoning required makes even oracle performance challenging, as evidenced by the 82.8% ceiling rather than near-perfect scores.
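To make these three selection methods concrete, here is a minimal sketch of how they can be computed from sampled answers; the data layout below is a toy assumption for illustration, not the paper's evaluation code.

```python
from collections import Counter

def pass_at_k(answers, gold):
    """Oracle: is the gold answer among the k sampled answers?"""
    return gold in answers

def majority_vote(answers):
    """Select the most frequently generated answer."""
    return Counter(answers).most_common(1)[0][0]

def first_sample(answers):
    """Simply take the first generated answer."""
    return answers[0]

# Hypothetical data: each item holds the extracted answers from 100 generations plus the gold label.
dataset = [
    {"answers": ["B"] * 40 + ["A"] * 35 + ["C"] * 25, "gold": "A"},
    {"answers": ["D"] * 60 + ["A"] * 40, "gold": "D"},
]

n = len(dataset)
print("Pass@k:         ", sum(pass_at_k(d["answers"], d["gold"]) for d in dataset) / n)
print("Majority voting:", sum(majority_vote(d["answers"]) == d["gold"] for d in dataset) / n)
print("First sample:   ", sum(first_sample(d["answers"]) == d["gold"] for d in dataset) / n)
```

In the first toy item the correct answer appears among the samples but is not the most frequent one, so Pass@k counts it while majority voting and first-sample selection miss it, which is exactly the gap described above.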
These results highlight a significant gap—37.3 percentage points—between the model's potential capability (oracle accuracy) and what a straightforward selection method (majority voting over the generated answers) actually delivers. While this isn't the largest gap we observe, it illustrates how much accuracy is lost at the selection step alone.
This discrepancy is known as the generation-verification gap: the model generates correct answers but struggles to reliably identify them. It's like having a student who can guess the correct answer but struggles to confidently explain why they wrote it or consistently identify it among distractors (see Mind the Gap).

Figure 2: Verification Accuracy Lags Behind Generation Capacity Across Models and Tasks
This gap exists across many tasks:
| Model | Dataset | Oracle (Pass@100) | Majority Voting | Generation-Verification Gap |
|---|---|---|---|---|
| Llama 3.3 70B Instruct | GPQA Diamond | 82.8% | 45.5% | 37.3% |
| Llama 3.3 70B Instruct | MATH500 | 98.6% | 82.3% | 16.3% |
| Llama 3.3 70B Instruct | MMLU Pro | 92.0% | 74.4% | 17.6% |
| Llama 3.1 8B Instruct | GPQA Diamond | 95.2% | 30.7% | 64.5% |
| Llama 3.1 8B Instruct | MATH500 | 99.2% | 69.8% | 29.4% |
| Llama 3.1 8B Instruct | MMLU Pro | 89.0% | 56.5% | 32.5% |
🧠 Weak Verifiers, Strong Together
⚖️ The Wisdom of Imperfect Verifiers
At the heart of Weaver is a simple idea with surprisingly broad power: when you can't trust any single verifier, trust the pattern of their agreement.
Like a jury, individual verifiers may be noisy or biased, but their collective agreement tends to be accurate. To see this, we used 30+ verifiers (reward models ranging from 7B to 72B parameters, and LM judges from 27B to 72B) and had them judge responses to GPQA Diamond questions.
Each verifier scored candidate responses, and we recorded which response each verifier judged as best.
- Individual verifier accuracy: each verifier correctly judged samples with 43-62% accuracy—barely better than flipping a coin
- High verifier agreement: on samples where 20+ verifiers agreed, their collective decision had 91% accuracy
- Low verifier agreement: on samples where verifiers were split evenly, their collective decision had 48% accuracy, reverting back to random chance
📘 Example:
Question: Two quantum states with lifetimes of 10⁻⁹ and 10⁻⁸ seconds need to be resolved. What could be their energy difference?
Verifier votes:
- Answer B (correct): 18 votes
- Answer A: 7
- Answer C: 3
- Answer D: 2
Weaver's confidence: 89%
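To build intuition for why weighting helps, here is a toy calculation (not Weaver's actual estimator; the per-verifier reliabilities are made up) contrasting a plain vote fraction with a reliability-weighted vote share:

```python
# Toy example: 30 verifiers each back one answer; 18 back the correct answer B.
votes = {"B": 18, "A": 7, "C": 3, "D": 2}

# Unweighted consensus: confidence in B is just its vote share.
total_votes = sum(votes.values())
print("Unweighted confidence in B:", votes["B"] / total_votes)  # 0.60

# Reliability-weighted consensus (illustrative weights only): suppose the
# verifiers backing B happen to be the more reliable ones.
weights = {"B": 0.75, "A": 0.55, "C": 0.50, "D": 0.50}  # hypothetical per-verifier reliability
weighted = {ans: n * weights[ans] for ans, n in votes.items()}
print("Weighted confidence in B:  ", weighted["B"] / sum(weighted.values()))  # ~0.68
```

The toy simplification of one shared weight per answer is only for readability; the point is that trusting reliable verifiers more moves the confidence beyond raw vote counts.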
The key insight: verifier agreement is a strong signal of verifier correctness. We don't need to know which verifiers are accurate beforehand—their agreement patterns tell us. Even on challenging questions, it's uncommon for many weak verifiers to confidently agree on the same wrong answer. Weaver builds on this insight by using a statistical aggregation technique that learns which verifiers are most reliable from their rates of agreement. By weighting verifiers according to their estimated reliability, Weaver goes beyond simple consensus, as in the example above, and captures more sophisticated patterns—turning noisy judgments into confident decisions.
🚀 Why This Matters Now
New model ecosystems and statistical tools make this approach feasible at scale.
🤖 The Explosion of Available Verifiers
We're swimming in models and tools that can act as verifiers:
- 🏆 Reward models: Hundreds on HuggingFace, trained on human preferences and reasoning generations (see RewardBenchV2, ProcessRewardBench)
- 🧑‍⚖️ LM judges: Practically every chat LM can evaluate answers with binary verdicts (see ChatBotArena)
- 🧮 Specialized critics: Math checkers, code validators, safety filters (see LeanProver)
- 🧪 Unit tests: Manually or synthetically generated binary checks (see Yang et al. (2024))

Figure 3: Growth of Open-Source RMs and LM Judges
🧬 Weak Supervision Comes to Weak Verifiers
So, we've seen an explosion in the number of available verifiers. But this raises a natural question: how should we combine their outputs—is a simple unweighted average sufficient?
While naive aggregation can perform better than the first-sample and majority-voting baselines, it often fails to capture the varying reliability of individual verifiers. As a result, we find that weighted aggregation—such as ensembling over the top-K models or fitting a logistic regression model to combine outputs—significantly outperforms naive aggregation.
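As a concrete illustration of that gap, here is a minimal sketch on synthetic data (the shapes and score distributions are assumptions) comparing an unweighted average of verifier scores with a logistic regression fit on labeled examples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic setup: rows are candidate responses, columns are verifier scores in [0, 1];
# y marks whether each response is actually correct (labels are required for this baseline).
n_responses, n_verifiers = 1000, 10
y = rng.integers(0, 2, n_responses)
reliability = rng.uniform(0.5, 0.9, n_verifiers)   # some verifiers are much better than others
agrees = rng.random((n_responses, n_verifiers)) < reliability
scores = np.where(agrees, y[:, None], 1 - y[:, None]) + rng.normal(0, 0.1, (n_responses, n_verifiers))

# Naive aggregation: unweighted average of verifier scores.
naive_pred = scores.mean(axis=1) > 0.5
print("Unweighted accuracy:", (naive_pred == y).mean())

# Weighted aggregation: logistic regression learns per-verifier weights, but needs labels.
clf = LogisticRegression().fit(scores[:500], y[:500])
print("Weighted accuracy:  ", (clf.predict(scores[500:]) == y[500:]).mean())
```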

Figure 4: Weighted Verifier Ensembles Outperform Naive Verifier Ensembles
However, determining the optimal weights for aggregation typically requires large amounts of labeled data, which can be time-consuming and expensive to obtain. How can we learn the weights without labeled data?
This is where Weak Supervision comes in—a classical statistical framework for aggregating the outputs of multiple noisy "voters" (e.g., heuristics, models, crowdworkers) in settings where ground-truth labels are scarce. Originally developed for data labeling (e.g. Snorkel, MeTaL, Dawid-Skene Model), Weak Supervision infers true labels by examining the rates of agreement among noisy voters. These techniques have since been successfully adapted to LLM settings (e.g. Ask-Me-Anything, Smoothie), where LLM outputs serve as noisy signals of response correctness.
These recent developments create a timely opportunity: we have an abundance of imperfect verifiers, increasingly cheap inference costs, and well-established statistical aggregation tools. Weaver brings these elements together, and its capabilities are poised to only grow stronger as the ecosystem of models continues to expand.
⚙️ How Weaver Works: From Noisy Signals to Consensus

Figure 5: Weaver Framework
🧊 Step 1: Binarize (and Drop)
Before we can combine verifier outputs, we need them on the same scale.
Verifiers come in many forms—reward models (which output continuous scores), LM judges (which give discrete ratings), and rule-based critics (e.g., math or code checkers). Weaver first normalizes and binarizes all verifier signals, such that their outputs reflect a simple question: Is this response likely correct or not?
This makes different verifier types comparable and lets us treat them equally in the next step.
But not all verifiers are helpful. Some are too noisy, too biased, or too redundant. So Weaver drops them with a lightweight filtering pass, customized per dataset:
- Adaptive marginal filtering: Weaver drops verifiers with extreme positive/negative rates based on dataset difficulty. For easier datasets (>80% correct samples), it removes pessimistic verifiers that rarely predict positive. For harder datasets (<20% correct samples), it filters optimistic verifiers that predict positive too often. For medium-difficulty datasets, both extremes are pruned.
- Smart binarization thresholds: Using just 1% of the data as a labeled development set, Weaver estimates optimal thresholds for converting continuous reward model scores to binary signals—maximizing the signal-to-noise ratio rather than using naive 0.5 cutoffs.
By binarizing verifier outputs and filtering out low-quality signals, we end up with a cleaner ensemble to aggregate.
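Here is a minimal sketch of what this preprocessing could look like; the specific thresholds, the grid search, and the filtering cutoffs are illustrative assumptions rather than the paper's exact recipe:

```python
import numpy as np

def binarize_and_filter(scores, dev_scores=None, dev_labels=None):
    """scores: (n_responses, n_verifiers) raw verifier scores normalized to [0, 1]."""
    n, m = scores.shape

    # Binarize: tune a per-verifier threshold on a tiny labeled dev split (~1% of data)
    # if one is available; otherwise fall back to a naive 0.5 cutoff.
    thresholds = np.full(m, 0.5)
    if dev_scores is not None:
        grid = np.linspace(0.1, 0.9, 17)
        for j in range(m):
            accs = [((dev_scores[:, j] > t) == dev_labels).mean() for t in grid]
            thresholds[j] = grid[int(np.argmax(accs))]
    votes = (scores > thresholds).astype(int)

    # Filter: drop verifiers whose positive rate is extreme for the dataset's difficulty.
    pos_rate = votes.mean(axis=0)
    difficulty = pos_rate.mean()          # crude proxy for the fraction of correct samples
    if difficulty > 0.8:                  # easy dataset: drop overly pessimistic verifiers
        keep = pos_rate > 0.1
    elif difficulty < 0.2:                # hard dataset: drop overly optimistic verifiers
        keep = pos_rate < 0.9
    else:                                 # medium difficulty: prune both extremes
        keep = (pos_rate > 0.1) & (pos_rate < 0.9)
    return votes[:, keep], keep
```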
🧼 Step 2: Denoise with Weak Supervision
Once we have a binarized matrix of verifier votes, the next challenge is assigning trust—determining how much weight to give each verifier's opinion based on their estimated accuracy. Weaver tackles this by modeling response correctness as a latent variable, using tools from weak supervision to estimate how accurate each verifier is—without any access to ground-truth labels.
This is formalized as a latent variable graphical model, where:
- The correctness of a response (latent variable) is predicted from multiple binary verifier votes (observed variables)
- A conditional independence assumption is made: verifiers behave independently given the true label
- Weaver then applies a method-of-moments estimator that matches second-order statistics of verifier co-voting patterns to infer each verifier's true/false positive rates, based on the conditional independence assumption
These inferred accuracy parameters become the weights used in aggregation—quantifying how much to trust each verifier based solely on unlabeled agreement structure.
Crucially, this estimation remains robust across domains, model types, and verifier formats—even when verifiers vary widely in calibration, score ranges, or quality.
For deeper technical details on weak supervision methods, see Data Programming, Snorkel, MeTaL, and recent LLM applications like Ask-Me-Anything and Smoothie.
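For intuition, here is a minimal sketch of a triplet-style method-of-moments estimator in the spirit of these weak supervision methods, assuming conditional independence and a symmetric accuracy model; it is not Weaver's exact implementation.

```python
import numpy as np

def estimate_verifier_accuracies(votes):
    """
    votes: (n_responses, n_verifiers) matrix with entries in {-1, +1}
           (+1 = "looks correct", -1 = "looks incorrect").
    Returns an estimate of each verifier's accuracy P(vote agrees with the true label),
    using only pairwise agreement statistics -- no ground-truth labels.
    Assumes verifiers are conditionally independent given the latent true label
    and are equally accurate on the two classes.
    """
    n, m = votes.shape
    M = (votes.T @ votes) / n                      # M[i, j] ~= E[v_i * v_j]
    a = np.zeros(m)                                # a[i] ~= E[v_i * y]
    for i in range(m):
        # Under the model, E[v_i v_j] * E[v_i v_k] / E[v_j v_k] = a_i^2 for any j, k != i.
        # A single triplet is used here for simplicity; averaging many triplets is more robust.
        j, k = [x for x in range(m) if x != i][:2]
        a[i] = np.sqrt(abs(M[i, j] * M[i, k] / M[j, k]))
    return (1 + a) / 2                             # convert E[v_i * y] into P(v_i = y)
```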
📊 Step 3: Aggregate Smartly
With verifier weights in hand, Weaver computes a posterior probability that each candidate answer is correct. This final score represents the model's belief over correctness—a soft estimate of the latent label behind noisy verifier observations.
Unlike majority vote or averaging, this aggregation step is grounded in probabilistic inference:
- It combines scores using learned verifier reliabilities
- It naturally downweights low-quality or redundant signals
- And it produces a confidence score for each response that is both statistically principled and empirically calibrated
This latent-variable-based aggregation makes Weaver more than just an ensemble—it's a full probabilistic verifier that adapts to verifier diversity, dataset difficulty, and scaling constraints.
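Concretely, the final score can be computed as a naive-Bayes-style posterior from the binarized votes and the estimated verifier accuracies. The sketch below is a simplified version under the conditional independence assumption, not Weaver's exact code; the accuracies and candidate votes are made up.

```python
import numpy as np

def posterior_correct(votes, accuracies, prior=0.5):
    """
    votes:      (n_verifiers,) binary votes in {0, 1} for one candidate response.
    accuracies: (n_verifiers,) estimated probability that each verifier votes correctly.
    prior:      prior probability that a candidate response is correct.
    Returns P(response is correct | votes) under conditional independence.
    """
    v = np.asarray(votes, dtype=float)
    p = np.asarray(accuracies, dtype=float)
    log_like_correct = np.sum(v * np.log(p) + (1 - v) * np.log(1 - p))
    log_like_wrong = np.sum(v * np.log(1 - p) + (1 - v) * np.log(p))
    log_odds = np.log(prior / (1 - prior)) + log_like_correct - log_like_wrong
    return 1.0 / (1.0 + np.exp(-log_odds))

# Weaver-style selection: score every candidate response and pick the highest posterior.
acc = np.array([0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.60, 0.55])
candidates = {"A": [1, 1, 0, 1, 1, 1, 0, 1], "B": [0, 1, 1, 0, 0, 1, 1, 0]}
best = max(candidates, key=lambda c: posterior_correct(candidates[c], acc))
print(best, {c: round(posterior_correct(v, acc), 3) for c, v in candidates.items()})
```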
📊 The Payoff: Shrinking the Generation-Verification Gap

Figure 6: Weaver's Verification Boost Across Benchmarks as Generations Increase
Weaver helps us close the generation-verification gap across four challenging benchmarks—MATH500, MMLU Pro, MMLU College, and GPQA Diamond. Weaver boosts first-sample accuracy by 11.2-27.8 percentage points across tasks, matching what only the most expensive frontier models can do.
As the number of generations increases, naive methods plateau early. Majority vote on the generations and unweighted ensembling of verifier outputs barely budge past 20–30 generations. In contrast, Weaver continues climbing—pushing 72.1% on GPQA Diamond, and closing in on the oracle Pass@100 upper bound. This ability to scale with generation count is key for high-stakes domains like math and science reasoning.
Weaver reaches 87.7% average accuracy across the tasks we evaluate—1.0 points above o3-mini, a closed, post-trained reasoning model. It also outperforms GPT-4o (69.0%) and Claude 3.7 Sonnet (70.4%)—without any parameter updates, preference tuning, or alignment data.

Figure 7: Weaver Outperforms Baseline Verification Methods and Shrinks Gap with Frontier LMs
🪶 Making It Practical: Distilling Verification to 400M Parameters
✅ One 400M verifier = 98.7% of ensemble accuracy
🚀 Fits on a laptop. No human labels needed for training.

Figure 8: Distilling the Weaver Verifier Ensemble to a Compact Cross-Encoder
While Weaver's ensemble verifier achieves strong performance, it incurs significant computational overhead—querying 30+ models per response is not always feasible in latency- or resource-constrained settings.
To address this, we distill the full verifier ensemble into a single 400M-parameter cross-encoder. This model is trained to predict Weaver's final verification scores directly from the query and candidate response, treating ensemble outputs as soft supervision. No human annotations are required to train the cross-encoder.
This approach yields a verifier that:
- Retains up to 98.7% of the ensemble's selection accuracy
- Reduces verification inference compute by 99.97% compared to Weaver
- Outperforms majority voting by 23.2 percentage points
- Runs efficiently on a single H100 or smaller consumer-grade GPU
We emphasize that the generation model remains entirely frozen. All improvements are realized via a post-hoc verifier that can be deployed independently or integrated into existing pipelines, effectively pushing the Pareto front of verification accuracy versus computational cost.
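A rough sketch of what such a distillation setup could look like follows; the backbone model, hyperparameters, and loss are illustrative assumptions (a roughly 400M cross-encoder regressed onto Weaver's scores), not the exact training recipe from the paper.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in ~400M-parameter encoder; any cross-encoder backbone of similar size could be used.
MODEL_NAME = "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
optimizer = AdamW(model.parameters(), lr=1e-5)

def distill_step(queries, responses, weaver_scores):
    """One training step: regress the cross-encoder onto Weaver's soft verification scores."""
    batch = tokenizer(queries, responses, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    logits = model(**batch).logits.squeeze(-1)           # one scalar score per (query, response)
    targets = torch.tensor(weaver_scores, dtype=torch.float)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference time, score each candidate with torch.sigmoid(model(...).logits)
# and select the highest-scoring response -- no verifier ensemble needed.
```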

Figure 9: Distilled Model Captures Verifier Ensemble Performance While Dramatically Reducing Inference Costs, Pushing the Pareto Front of Verification Accuracy vs. Computational Cost
🧭 Looking Forward: Making Verification Practical at Scale
Weaver illustrates that better decisions—not just larger models—can significantly improve downstream accuracy. By aggregating weak, unlabeled verifier signals using weak supervision, Weaver closes the generation-verification gap without fine-tuning, significant human labeling, or architectural changes. Our distilled model further enables efficient inference-time scoring—even across 100+ generations per query.
This represents a shift in how LLM pipelines scale:
- Efficient: High-throughput candidate evaluation with minimal inference cost
- Composable: Compatible with existing systems like RAG, speculative decoding, or CoT reranking
- Modular: Easily retrainable as new verifier models are introduced—no changes to generation required
More broadly, Weaver is part of a growing trend: moving beyond generation alone, toward systems that make informed use of model outputs. This echoes long-standing ideas in AI.
As Rich Sutton wrote:
"An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself."
Similarly, Andrej Karpathy recently emphasized that:
"Generation alone is not enough. The real problem is verification. All the LLMs can write code, but the real problem is verifying it works."
Weaver turns this insight into practice. It demonstrates that verification—done efficiently and scalably—is a viable mechanism for improving system performance across reasoning and math tasks. It is not a post hoc filter, but a central component of modern LLM pipelines.

Figure 10: Weaver vs. Majority Voting Performance at Different Inference Compute Budgets
🔮 What's Next?
Weaver highlights how verifier aggregation can be more powerful (and more practical) than scaling generation alone. But there's still a lot to build and explore.
Verifier-aware training. Right now, Weaver operates entirely post-hoc. But what if we could incorporate its feedback directly into model training? One direction we're excited about is using Weaver-style aggregation to supervise generation models—either during instruction tuning or RLHF. This could reduce our dependence on noisy pairwise preferences or small human-labeled datasets.
More adaptive verification. Weaver currently uses a static set of verifiers per task. But different queries need different levels of scrutiny. For example, "Who is the president of the U.S.?" needs less verification than "Derive the energy levels of a quantum harmonic oscillator." We're exploring ways to dynamically route queries to subsets of verifiers based on confidence and context—saving compute when possible and spending it when it matters.
Weaver as infrastructure. Just as we use indexes in databases or caches in compilers, we think verification should become a first-class primitive in LLM systems. Weaver can be dropped into generation pipelines, CoT reranking stages, dataset curation workflows, and even speculative decoding loops. Think of it as a pluggable "verification layer" for any model output.
Smarter distillation. Our current distillation strategy just learns to copy Weaver's scores. But there's room for smarter compression: multitask distillation (across datasets), query-aware verifiers (e.g. using metadata), or architectures optimized specifically for verification latency. There's also the open question of how few parameters we can get away with while still preserving the gains from the verifier ensemble.
We're excited to see Weaver evolve from a research prototype into a general-purpose verification backbone. If you're thinking about building LLM systems that reason better, align better, or evaluate better—we'd love to chat.
🙏 Acknowledgments
Weaver was the result of a collaboration between researchers at Stanford, UW–Madison, and Together AI. We're especially grateful to the open-source community—RewardBench, Hugging Face, Chatbot Arena, and many others—for making the verifier ecosystem rich and extensible.
We thank Dan Biderman, Bradley Brown, Ryan Ehrlich, Sabri Eyuboglu, Anna Goldie, Neel Guha, Simon Guo, Jordan Juravsky, Hermann Kumbong, Jerry Liu, Avanika Narayan, Anne Ouyang, Benjamin Spector, Shayan Talaei, Benjamin Viggiano, and Michael Zhang for their constructive feedback during the composition of the paper.