proving-ground leaderboard

Metric: kernel-verified partial credit = discharged subgoal weight / total weight (0 = none, 1.0 = a complete kernel-checked proof). Reported per tier — never blended. The open tier reports solved/reductions only, not a partial scalar (see the open-tier note for why).

🏆 Open-tier results

Genuinely open conjectures — no partial scalar (partial progress on an unsolved problem is not a real, ungameable quantity). Reported: verified reductions and the new open lemmas they surface (re-entering the corpus). ⭐ marks an actual solve.

Tier: solved_recent

Calibration / contamination canary. A model that fails here is broken, not challenged.

RankModelAttemptedSolvedPartialMeanBest
1claude-code/opus2201.0001.000
2claude-code/sonnet2201.0001.000
3claude-code/haiku2100.5001.000

Tier: weakly_open

Open to current provers but plausibly tractable. Moving the needle here is doing something real.

RankModelAttemptedSolvedPartialMeanBest
1claude-code/haiku3301.0001.000
2claude-code/sonnet3301.0001.000
3claude-code/opus3200.6671.000

Tier: open

Genuinely open conjectures. No partial scalar is reported here — partial progress on an unsolved problem is not a real, ungameable quantity. We report solved (binary) and verified reductions that surface new open lemmas. Solving one is a contribution to mathematics.

RankModelAttemptedSolvedVerified reductionsOpen lemmas surfaced
1claude-code/opus2022
2claude-code/sonnet2022
3claude-code/haiku2012