Abstract
We investigate whether structured mathematical knowledge helps LLMs reason and explore mathematics more effectively. We introduce MathAgent, a multi-agent system that augments LLM-based formal theorem proving in Lean 4 with a semantically enriched knowledge graph (MathKG) and Mathlib library retrieval. MathKG contains 364 Lean-verified theorems from Mathlib connected by 9,434 LLM-extracted semantic edges across seven mathematical domains including number theory, combinatorics, algebra, and analysis.
Evaluated across 1,520 problems from miniF2F, PutnamBench, and MathOlympiadBench with multiple open-weight and proprietary LLMs, we find a complementarity effect: each knowledge augmentation mode solves problems the others cannot, with the union of all strategies solving up to 53% more problems than the best single approach. We further evaluate on open problems from DeepMind’s Formal Conjectures repository, identifying compositional subgoal decomposition as a key bottleneck of single-shot proving approaches.
Paper, code, and data will be made publicly available upon completion of experiments.
MathKG
MathKG is a directed graph where each node represents a Lean-verified mathematical object from Mathlib and edges encode semantic relationships extracted by an LLM. Unlike simple dependency graphs, MathKG captures mathematical meaning — how theorems relate through generalization, analogy, and cross-domain structure.
Edges are typed across eight semantic relations — prerequisite, generalizes, specializes, equivalent-to, applied-in, analogous-to, cross-domain-bridge, and instance-of — capturing how theorems relate through logical dependency, structural analogy, and cross-domain transfer. The graph spans number theory, combinatorics, algebra, analysis, geometry, linear algebra, and abstract algebra, with balanced coverage across all domains.
Interactive MathKG visualization — drag, zoom, hover, and click to explore.System Overview
MathAgent consists of five agents:
- Explorer: Navigates the knowledge graph and retrieves relevant Mathlib declarations
- Conjecturer: Generates formal conjectures grounded in KG structure
- Prover: Generates Lean 4 tactic proofs with compiler feedback
- Integrator: Validates and integrates proven theorems back into the KG
- Orchestrator: Coordinates the pipeline across four evaluation modes
The Orchestrator coordinates the discovery loop: Explorer → Conjecturer → Prover → Integrator → Explorer…, with MathKG serving as the structured knowledge source that guides proof strategy.
The system supports two complementary settings: targeted problem-solving, where the Prover attempts to prove a given formal statement, and open-ended exploration, where the full agent loop generates novel conjectures grounded in KG structure and attempts formal verification.
Evaluation Modes
| Mode | Knowledge Source | Description |
|---|---|---|
prover_only | None | Baseline: LLM proves from theorem statement alone |
with_kg | MathKG | KG-guided: relevant theorems retrieved via semantic edges |
with_mathlib | Mathlib | Retrieval-augmented: LLM-guided search over 400K+ Mathlib declarations |
full_system | MathKG + Mathlib | Full pipeline: KG + Mathlib retrieval combined |
Benchmarks
We evaluate on three established, publicly available formal mathematics benchmarks:
- miniF2F (488 problems): Problems from AMC, AIME, and IMO mathematical competitions formalized in Lean 4
- MathOlympiadBench (360 problems): High-school competition problems (AMC/AIME level) with Lean 4 formalizations
- PutnamBench (672 problems): Undergraduate/research-level problems from the Putnam Mathematical Competition
For open-ended exploration, we additionally evaluate on problems from:
- OEIS Conjectures: Open conjectures about integer sequences from DeepMind’s Formal Conjectures repository
- Erdos Problems: Open number theory and combinatorics conjectures from the same repository
Results
We present pass@k solve rate curves (where pass@k measures whether at least one of k attempts succeeds), pass@32 ablation results, and complementarity analysis for Claude Sonnet 4.6 across all four modes on 1,520 benchmark problems. Each problem is given 32 attempts, with Lean compiler error feedback guiding each subsequent attempt.
Benchmark Performance (pass@32)
| Benchmark | prover_only | with_kg | with_mathlib | full_system |
|---|---|---|---|---|
| miniF2F (488) | 366 (75.0%) | 359 (73.6%) | 360 (73.8%) | 366 (75.0%) |
| MathOlympiadBench (360) | 45 (12.5%) | 38 (10.6%) | 41 (11.4%) | 42 (11.7%) |
| PutnamBench (672) | 85 (12.6%) | 73 (10.9%) | 78 (11.6%) | 80 (11.9%) |
| Total (1,520) | 496 (32.6%) | 470 (30.9%) | 479 (31.5%) | 488 (32.1%) |
prover_only achieves the highest aggregate solve rate on 2 of 3 benchmarks, with full_system tying on miniF2F — suggesting that Sonnet 4.6 possesses sufficient internal Lean 4 and Mathlib knowledge that additional context acts as a distractor on aggregate. However, among augmented modes, full_system consistently dominates both with_kg and with_mathlib individually across all benchmarks, indicating that combining KG and Mathlib produces synergy even when neither helps in isolation.
Solve rates increase monotonically with attempt budget for all modes, and the gap between prover_only and augmented modes narrows as attempts increase — closing from -22 at pass@3 to -8 at pass@32 — suggesting that error-guided refinement eventually leverages the provided context. Notably, context hurts early attempts most severely: on miniF2F, with_mathlib leads at pass@1 (+12 over baseline) but prover_only dominates at pass@3-20, while on PutnamBench, prover_only leads at every k.
Complementarity: Union of Strategies
While aggregate solve rates favor prover_only, the more important finding emerges at the problem level: each augmentation mode solves problems that others cannot. The union of all four strategies substantially exceeds any single mode.
| Benchmark | Baseline (prover_only) | +KG unique | +Mathlib unique | +Full unique | Union | Lost |
|---|---|---|---|---|---|---|
| miniF2F | 366 | +10 | +9 | +18 | 389 | -7 |
| MathOlympiadBench | 45 | +7 | +6 | +10 | 58 | -5 |
| PutnamBench | 85 | +20 | +21 | +26 | 130 | -13 |
Union = problems solved by any of the four modes (the maximum achievable with adaptive selection). Lost = problems solved by prover_only but not by any augmented mode (where context caused failure).
An adaptive strategy that tries all four modes per problem would solve 29-53% more problems than the best single mode on hard benchmarks (MathOlympiadBench and PutnamBench), and 6% more on miniF2F where most problems are already solvable. PutnamBench benefits most (+53%: 85→130), indicating that research-level problems exhibit the strongest complementarity between proof strategies.
At the problem level: 81 problems are solved by at least one augmented mode but not by prover_only (“context helps”), while 25 are solved only by prover_only (“context hurts”) — a 3:1 ratio in favor of augmentation, even though aggregate counts favor the baseline.
Note: Experiments are ongoing for Goedel-V2-32B, Goedel-V2-8B, Qwen3-8B, and Qwen3-32B across all benchmarks. Early results suggest complementarity effects hold across model scales. Full results will be updated as they become available.
Open-Ended Exploration: Pushing Beyond Benchmarks
We additionally evaluate MathAgent on open problems from DeepMind’s Formal Conjectures repository using Goedel-V2-32B (a Lean-specialized open-weight model). Our system uses error-guided retry (pass@32 for Erdos, pass@64 for OEIS), where each failed attempt’s compiler error is fed back to guide the next proof attempt.
On OEIS conjectures, we observed that over 80% of attempts reach “unsolved goals” — meaning the proof structure (induction, case splits, intermediate lemmas) is correct, but the final mathematical step cannot be discharged. Knowledge augmentation drives diverse proof strategies across modes, with different configurations attempting qualitatively different mathematical approaches to the same problem.
These results suggest that for problems beyond the model’s capability frontier, a key bottleneck is the lack of compositional subgoal decomposition. A promising future direction is an architecture where a high-level proof strategy is established first, then individual subgoals are tackled independently across retries, rather than regenerating the entire proof from scratch with each attempt.
Acknowledgments
We gratefully acknowledge the Laude Institute for generously providing the compute resources that made this research possible. We also thank Prof. Tatsunori Hashimoto and Prof. Michael Genesereth for their support and insightful feedback.
Citation
@article{nabi2026mathagent, title={MathAgent: Does Structured Knowledge Help Formal Mathematical Reasoning and Exploration?}, author={Nabi, Sareh and Nabi, Marzieh and Vogl, Roland}, year={2026}}