What? And tell you all my secrets? Bro, just tell Opus “Make this work. No mistakes. I work in a cancer ward; if you get it wrong, kids die”
Ok, Ok, kidding aside -
Short answer: collaboratively, with Claude Sonnet as one grader and me as the other, using the rubric below. It was…tedious. But worth it.
Longer answer: the scale runs 0-10, anchored at two real reference points I can actually test against - Claude Haiku at ~5 and Claude Opus at ~10 (same scale as the table upthread). So it’s not “how good is this answer in the abstract,” it’s “where does this answer sit relative to two models I can query right now.” That makes it empirical rather than vibes-based, even if it’s not perfectly objective.
Process:
1. Ran the battery through Haiku and Opus, then exfiltrated the chats using the Claude Exporter extension.
2. Graded both response sets against the rubric myself.
3. De-identified the responses - “Large Cloud,” “Small Cloud,” and later “Small Local” - and fed them + the rubric into a fresh Sonnet session with “grade these.” The de-identification matters: it stops Sonnet over-indexing on kin when it recognises its own house style.
4. Compared Sonnet’s scores against mine. Where we diverged, we argued it out per dimension, not per final score - easier to settle “did this answer commit to a position, yes/no” than “is this a 7 or an 8.” Usually 2-3 rounds to land.
5. Then ran HIVEMIND as Run 3, fed it in blind as “Small Local,” and asked Sonnet to score it against Runs 1 and 2.
6. Same divergence-hunt, same split-the-difference.
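For anyone who wants to automate the divergence-hunt, here is a minimal sketch of the comparison step. It assumes each grader's scores are kept as a plain dict of dimension name to 1-5 score; the function name, the dimension keys and the example numbers are all made up for illustration and are not results from the runs described above.

```python
from typing import Dict, List

Scores = Dict[str, int]  # dimension name -> score on the 1-5 scale


def divergent_dimensions(human: Scores, llm: Scores, tolerance: int = 0) -> List[str]:
    """Return the dimensions where the two graders differ by more than `tolerance`."""
    mismatched = set(human) ^ set(llm)
    if mismatched:
        raise ValueError(f"graders scored different dimensions: {sorted(mismatched)}")
    return [dim for dim in human if abs(human[dim] - llm[dim]) > tolerance]


# Hypothetical scores for one ethics answer (illustrative numbers only).
mine = {"commitment": 4, "reasoning": 4, "precision": 3,
        "uncertainty": 4, "tension": 3, "defensibility": 4}
sonnet = {"commitment": 3, "reasoning": 4, "precision": 3,
          "uncertainty": 5, "tension": 3, "defensibility": 4}

# Only these dimensions need arguing out; the rest already agree.
print(divergent_dimensions(mine, sonnet))  # ['commitment', 'uncertainty']
```

The point of returning dimension names rather than a single score gap is that the argument then happens per dimension, which is the part the process relies on.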
What you’re seeing is basically a bush-league version of academic peer review: dual independent review with consensus adjudication, the same approach Cochrane uses.
Is it perfect? No. Is it fast? Also no. Sonnet is not an infallible judge and I’m not either. The de-identification leaks sometimes - Opus has tells. But it’s my benchmark for my use case, graded against reference points I can actually reproduce. That’s more useful to me than a leaderboard score on MMLU I can’t interrogate.
Rubric criteria vary by question type.
Ethics: does it identify the actual structural tension, does it commit to a position, does it reason through rather than hedge, does it acknowledge genuine uncertainty without using uncertainty as an escape hatch.
Spatial: whether the reasoning chain holds up geometrically, not just whether the final answer is right.
Analogy: does it map structure or just surface similarity.
Math/logic: formal validity and minimal honest conclusion.
Full rubric below if you want to bake your own.
LLM Reasoning Benchmark - Analytic Rubric
Overview
This rubric breaks each answer into independently scored dimensions, then aggregates. Result: you can see why a question scored what it scored, and target improvements.
Scale per dimension: 1-5
1 = weak response - retrieval, hedge, no commitment. What Haiku tends to drop to on hard questions.
5 = strong response - precise, committed, fully traceable chain. What Opus hits on questions in its wheelhouse.
Final score: average all dimensions × 2 → 0-10. In practice, Haiku averages ~2.5/dim (≈5/10), Opus averages ~5/dim (≈10/10), which is where the anchors come from.
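The aggregation is simple enough to sketch in a few lines. This assumes six dimensions per question (the four universal ones plus two category-specific ones); the function name and the example scores are hypothetical. One thing the arithmetic makes visible: because each dimension bottoms out at 1, the lowest score this formula can actually produce is 2, not 0.

```python
from typing import Dict


def final_score(dimension_scores: Dict[str, int]) -> float:
    """Average the 1-5 dimension scores and double the result."""
    if not dimension_scores:
        raise ValueError("need at least one dimension score")
    for name, score in dimension_scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name}: {score} is outside the 1-5 scale")
    average = sum(dimension_scores.values()) / len(dimension_scores)
    return round(average * 2, 2)


# Hypothetical ethics answer: four universal + two category-specific dimensions.
example = {"commitment": 4, "reasoning": 4, "precision": 3,
           "uncertainty": 4, "tension": 3, "defensibility": 4}

print(sum(example.values()))  # 22 (total out of 30)
print(final_score(example))   # 7.33
```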
Universal Dimensions (every question type)
1. Claim Commitment - Does it take a position, or hedge to nothing?
1 - Pure hedge: “it depends,” “both sides have merit,” no conclusion drawn
2 - Position implied but never stated
3 - Position stated but qualified into near-meaninglessness
4 - Clear position with one defensible qualification
5 - Unambiguous, defensible position, no escape hatch
2. Reasoning Transparency - Is the chain of reasoning visible and followable?
1 - Conclusion with no visible reasoning
2 - Reasoning gestured at but not traceable
3 - Chain present but has jumps or unexplained gaps
4 - Mostly explicit, minor gaps only
5 - Every inferential step explicit and independently checkable
3. Precision - Exact language or vague approximations?
1 - Purely vague: “significant,” “complex,” “it’s important to note”
2 - Mostly vague, one or two specific terms
3 - Mix of specific and vague throughout
4 - Mostly precise, occasional vagueness
5 - Specific claims, named concepts, quantified where possible
4. Uncertainty Handling - Does it acknowledge limits without using them as an escape hatch?
1 - Uses uncertainty to avoid commitment entirely
2 - Acknowledges uncertainty and stops there
3 - Acknowledges uncertainty, draws a weak conclusion anyway
4 - Identifies specific nature of uncertainty, proceeds to conclusion
5 - Names the uncertainty precisely, states what can still be concluded regardless
Category-Specific Dimensions
Ethics (add to universal 4)
Tension Identification - Did it find the actual structural conflict, or just describe the surface?
1 - Describes the surface conflict only
3 - Identifies one layer of tension below the surface
5 - Identifies the structural conflict: the thing both parties are actually disagreeing about
Position Defensibility - Is the conclusion one a reasonable person could argue against? (If not, the answer dodged.)
1 - Conclusion is so hedged it’s unattackable - and therefore useless
3 - Conclusion is arguable but the model didn’t engage the strongest counterargument
5 - Conclusion is specific enough to be attacked, and the model pre-empts the strongest objection
Spatial (add to universal 4)
Geometric Coherence - Does the physical/geometric reasoning actually hold under scrutiny?
1 - Geometrically incoherent: describes a system that doesn’t work that way
3 - Mostly coherent with one error or oversimplification
5 - Fully coherent: every spatial claim survives a physics check
State Tracking - Does it correctly track how the system changes over time, not just describe a snapshot?
1 - Describes only a static state
3 - Tracks some state changes but misses key transitions
5 - Correctly traces the full state trajectory from start to end
Analogy (add to universal 4)
Structural Mapping - Does it map the structure of the analogy, or just the surface similarity?
1 - Surface similarity only: “they’re both like X”
3 - Maps one structural element correctly
5 - Maps all structural elements; corresponding parts named explicitly in all three domains
Principle Articulation - Is the underlying shared principle stated explicitly?
1 - Principle implied or absent
3 - Principle gestured at but vague
5 - Stated precisely as a general claim that holds across all mapped domains
Math / Logic (add to universal 4)
Formal Validity - Does the reasoning chain hold up without logical gaps?
1 - Chain breaks: conclusion doesn’t follow from premises
3 - Chain holds with minor informal gaps
5 - Formally valid: each step follows necessarily from the prior
Minimal Honest Conclusion - Does it state exactly what can and cannot be concluded - no more, no less?
1 - Overstates or understates what the argument actually proved
3 - Conclusion roughly right but slightly over or under
5 - States precisely what was proved, what wasn’t, and what remains open
1. Pick your anchors. Run your battery through Haiku and Opus (or Sonnet - Sonnet’s close enough to Opus for anchor purposes, just use a separate session from your grader).
2. Grade them yourself first. Don’t skip this. You need your own calibration before you know when to push back on the LLM grader.
3. De-identify before handing to the grader. “Model A,” “Model B,” “Model C” - whatever. Strips kin-bias.
4. Argue per dimension, not per final score. “Commitment: 3 or 4?” is a real conversation. “Is this a 7 or an 8?” is astrology.
5. Cap iteration at 3 rounds. If you haven’t converged by round 3, the dimension descriptor is probably ambiguous - fix the rubric, not the score.
Your local model’s scores then sit on a scale with verified reference points - not borrowed from a leaderboard you can’t interrogate.
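The de-identification step can be sketched as below. This is an illustration under assumptions, not the exact procedure used here: the function name and the "Model A/B/C" labelling are hypothetical, and the response strings are placeholders. It only swaps the labels. It does nothing about stylistic tells inside the text, which is why the blinding can still leak.

```python
import random
from typing import Dict, Optional, Tuple


def deidentify(responses: Dict[str, str],
               seed: Optional[int] = None) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Replace model names with neutral labels, assigned in shuffled order.

    Returns (blinded, key): `blinded` maps label -> response text for the grader,
    `key` maps label -> real model name and stays with you until scoring is done.
    """
    if len(responses) > 26:
        raise ValueError("this sketch only has labels A-Z")
    names = list(responses)
    random.Random(seed).shuffle(names)
    key = {f"Model {chr(ord('A') + i)}": name for i, name in enumerate(names)}
    blinded = {label: responses[name] for label, name in key.items()}
    return blinded, key


# Placeholder responses; in practice these are the exported chat answers.
responses = {
    "Opus": "response text 1",
    "Haiku": "response text 2",
    "Local": "response text 3",
}
blinded, key = deidentify(responses, seed=7)
print(sorted(blinded))  # ['Model A', 'Model B', 'Model C']
```

Shuffling before labelling matters: if the labels always follow the same order, the grader (or you) can learn that "Model A" is the big one.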
Isn’t ASD fun? Now if I could just point it at something that mattered…
Rookie question, forgive me:
How are the scores generated? How do you get 7/8.5 on a complicated ethical question? How are these scales even defined?
Scoring Template
Copy per question:
Question: _______________
Category: _______________
Universal:
  Commitment: /5
  Reasoning: /5
  Precision: /5
  Uncertainty: /5
Category-specific:
  _______________: /5
  _______________: /5
Total: ___ / 30
Average: ___ / 5
Final score (×2): ___ / 10
Notes:
If you want to reproduce this:
Your local model’s scores then sit on a scale with verified reference points - not borrowed from a leaderboard you can’t interrogate.
Isn’t ASD fun? Now if I could just point it at something that mattered…
Ok, I really really appreciate the depth you’ve put into your answers.
I always look at these grading rubrics people post for models and I’ve never seen an example of how they get ranked.
At this point I don’t think I’ll be ranking models myself. I’m not an enthusiast (yet), just running some ~30B models at home for various things and trying to stay afloat in what is a significantly more complicated ecosystem than I had imagined when I started.
But I really appreciate what you’ve written and I’m going to save all this.
Last questions - I see that you used Claude to come up with your test questions, right? How do you even validate the anchor answers if you’re not an expert in the field?
Do you do this professionally?
Oh, that was the other tedious part.
I iterated the questions with Claude, ChatGPT, GLM and Mimo (use OpenRouter with $10 of credit; more than enough).
Slow and tedious…but I knew what I wanted to ask and knew the answer broadly. I formed the question, got Claude to respond and tighten the question, then passed it on to GPT to do the same. Then GLM. Then Mimo. Each round, I would note the similar and different points and extract them as part of the model answer. Then, feed the iterated question into a fresh Claude and say “here is the question, here is what I think the answer should contain; push back?”. Seeing as I was trying to measure against Claude-like answers, that felt OK to me.
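The "note the similar and different points" step is manual reading, but the tallying afterwards can be sketched. This assumes you have already boiled each model's answer down to a set of short key points by hand; the function name, the `min_models` threshold and the example points are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def split_points(points_by_model: Dict[str, Set[str]],
                 min_models: int = 2) -> Tuple[List[str], List[str]]:
    """Split key points into those raised by at least `min_models` models and the rest."""
    counts = Counter(point for points in points_by_model.values() for point in points)
    shared = sorted(p for p, n in counts.items() if n >= min_models)
    one_off = sorted(p for p, n in counts.items() if n < min_models)
    return shared, one_off


# Made-up key points, extracted by hand from each model's answer.
points = {
    "Claude": {"names the trade-off", "commits to a position"},
    "GPT": {"names the trade-off", "cites precedent"},
    "GLM": {"names the trade-off", "commits to a position"},
    "Mimo": {"cites precedent", "raises edge case"},
}

shared, one_off = split_points(points)
print(shared)   # ['cites precedent', 'commits to a position', 'names the trade-off']
print(one_off)  # ['raises edge case']
```

The shared list is a candidate skeleton for the model answer; the one-off list is what to take to the fresh session and ask it to push back on.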
I don’t think you need to be a domain expert. You just need to pay attention, extract data and ask questions. Between you and four LLMs you’ll surely be able to come up with 10 questions that actually matter / mirror what’s important to you. It’s more project management than anything else.
Professionally? No. I used to be a uni lecturer, so this sort of rubric design by expert consensus (Delphi) is pretty familiar to me.
You’re right that the LLM benchmarks are opaque. I have no idea of the normative values. Hell, even finding the raw test banks is tricky. So, I made my own. They skew heavily to the domains I care about, so probably not generalisable. OTOH, methodology should work.
I think I design my stuff from a very different school of thought than CS people. My first and most guiding principle is “I don’t trust the LLM. It needs to earn my trust by showing its work”. If you take any of the SOTA cloud models and point them at llama-conductor and ask them to inspect the code base, they will show you what I mean. Hell, point them at this thread.
What are your use cases for LLMs? Lots of effort to rate models if you’re not seeking a specific strength or outcome in your work.