proving-ground leaderboard

Metric: kernel-verified partial credit = discharged subgoal weight / total weight (0 = none, 1.0 = a complete kernel-checked proof). Reported per tier — never blended. The open tier reports solved/reductions only, not a partial scalar (see the open-tier note for why).

🏆 Open-tier results

Genuinely open conjectures — no partial scalar (partial progress on an unsolved problem is not a real, ungameable quantity). Reported: verified reductions and the new open lemmas they surface (re-entering the corpus). ⭐ marks an actual solve.

claude-code/opus — 2 verified reduction(s), 2 new open lemma(s) surfaced
claude-code/sonnet — 2 verified reduction(s), 2 new open lemma(s) surfaced
claude-code/haiku — 1 verified reduction(s), 2 new open lemma(s) surfaced

Tier: `solved_recent`

Calibration / contamination canary. A model that fails here is broken, not challenged.

Rank	Model	Attempted	Solved	Mean	Best
1	claude-code/opus	2	2	1.000	1.000
2	claude-code/sonnet	2	2	1.000	1.000
3	claude-code/haiku	2	1	0.500	1.000

Tier: `weakly_open`

Open to current provers but plausibly tractable. Moving the needle here is doing something real.

Rank	Model	Attempted	Solved	Mean	Best
1	claude-code/haiku	3	3	1.000	1.000
2	claude-code/sonnet	3	3	1.000	1.000
3	claude-code/opus	3	2	0.667	1.000

Tier: `open`

Genuinely open conjectures. No partial scalar is reported here — partial progress on an unsolved problem is not a real, ungameable quantity. We report solved (binary) and verified reductions that surface new open lemmas. Solving one is a contribution to mathematics.

Rank	Model	Attempted	Verified reductions	Open lemmas surfaced
1	claude-code/opus	2	2	2
2	claude-code/sonnet	2	2	2
3	claude-code/haiku	2	1	2

proving-ground leaderboard

🏆 Open-tier results

Tier: solved_recent

Tier: weakly_open

Tier: open

Tier: `solved_recent`

Tier: `weakly_open`

Tier: `open`