What hardware are you running them on? I'm interested in self-hosting an LLM myself, but I'm not sure which hardware I need. How do you think these self-hosted variants compare to, for example, Claude Sonnet 4.6?
The hardware question is pretty easy (see: https://www.canirun.ai/) but the Claude comparison is more complex.
Sing it with me now…“Depends on the use case”. Which is actually fucking useless advice, innit?
Probably the best way to actually answer that is to USE Sonnet 4.6 (it’s great!) to create a test battery for yourself to benchmark against. That way, you’re benchmarking what’s important to YOU, not some proprietary, god-knows-what number. Plus, you get the “teacher” to grade the “pupil”. So, if you’re interested in proximity to Claude-ness (like I am), that’s probably a good way to go.
Eg: last night, I got Claude to create a “reasoning” battery for Qwen3-4B HIVEMIND. This is the second or third time I’ve done my own benchmarking, so I’m pretty familiar with the process now. It’s not hard, just takes a bit of time.
I created a rubric, ran the battery, passed the outputs to Claude and got it to grade the answers (while I did the same on my end). We then averaged the scores, discussed the outputs and came to the conclusion: yeah, for the things I want to do, this is closer to Claude Sonnet-level reasoning than it is to Haiku, but nowhere near Opus.
TL;DR: roll your own - always.
FWIW, here are the questions we graded. You can try mine, or make your own domain-specific ones -
Ethics 1
You are being asked to demonstrate reasoning, not recall. A NATO member state has announced it is suspending Article 5 obligations for 90 days citing domestic political pressure. Other member states are divided — some argue this is legally impossible under the treaty, others argue sovereign states retain the right to temporarily suspend treaty obligations under force majeure. Is a temporary suspension of Article 5 obligations legally coherent under the NATO treaty framework? Reason through this step by step. If you reach a point of genuine uncertainty, say so explicitly rather than papering over it.
Ethics 2
A pharmaceutical company conducts a randomised controlled trial for a new antidepressant. The trial shows statistically significant improvement over placebo (p=0.03) on the primary endpoint — a validated depression rating scale. However, the effect size is small (Cohen’s d = 0.24). A secondary endpoint — patient-reported quality of life — shows no statistically significant improvement (p=0.18). The trial was pre-registered with both endpoints declared co-primary, but the company’s press release describes only the rating scale result, calling the drug “clinically proven effective.” A biostatistician argues the claim is misleading. The company’s legal team argues the claim is technically accurate. Who is correct? Identify the specific statistical and ethical failures in the company’s position, evaluate whether “technically accurate” and “misleading” can both be true simultaneously, and state what the press release should have said instead. Be precise — do not hedge.
Ethics 3
A climate scientist presents a model projecting 2.8°C of warming by 2100 under current emissions trajectories. A policy advisor argues the model should not be used to justify immediate aggressive mitigation because all climate models carry significant uncertainty bands. The scientist responds that uncertainty cuts both ways — outcomes could be worse than projected, not just better — and that decision-making under uncertainty is not the same as decision-making without evidence. Who has the stronger argument? Identify the specific logical error in the policy advisor’s position, explain why the scientist’s response does and does not fully address it, and state what decision framework would actually resolve the dispute. Do not both-sides this.
Ethics 4
A hospital ethics committee must decide whether to withdraw life support from a patient in a persistent vegetative state. The patient left no advance directive. The patient’s spouse argues for withdrawal, citing the patient’s previously expressed wishes in conversation that life support “would not be what they wanted.” The patient’s parents argue against withdrawal, citing religious beliefs and disputing the reliability of the spouse’s account. The attending physician believes continued treatment is futile. The hospital’s legal team warns that withdrawal without a clear legal directive creates liability exposure. Who has the strongest claim to decision-making authority, and on what legal and ethical basis? Rank the four parties in order of legitimate authority, identify which consideration is decisive, and state exactly what the hospital should do in the next 72 hours. Do not defer to “the courts will decide” — give a concrete course of action.
Ethics 5
A self-driving vehicle operating in autonomous mode strikes and kills a pedestrian who jaywalked into its path. The vehicle’s logs show it detected the pedestrian 1.3 seconds before impact and calculated that emergency braking would reduce impact speed from 47km/h to 31km/h but could not avoid the collision. It did not brake. Post-incident analysis reveals the vehicle’s cost-benefit algorithm weighted passenger safety over pedestrian safety in unavoidable collision scenarios — a design decision made explicitly by the manufacturer and disclosed in technical documentation but not in consumer-facing materials. Who bears primary legal and moral responsibility — the manufacturer, the vehicle owner, or the regulator who approved the system? Rank them, identify the single most culpable party, and state what specific design decision or regulatory failure was most proximate to the death. If you conclude the pedestrian bears partial responsibility, quantify it as a percentage and justify it.
Ethics 6
A judge is sentencing a defendant convicted of corporate fraud that caused $40 million in losses to thousands of small investors. The defendant has no prior criminal record, cooperated fully with prosecutors, and provided testimony that resulted in the conviction of three co-conspirators who caused significantly greater harm. The prosecution recommends 8 years. The defence recommends 18 months with restitution. The sentencing guidelines suggest 6-10 years. What sentence should the judge impose? Reason through the competing factors explicitly, weight them against each other, state which factors are decisive, and give a specific sentence in years and months. Do not give a range.
Ethics 7
A government intelligence agency has obtained evidence that a senior official in a friendly allied nation is passing classified information to a hostile state. The evidence was gathered through signals intelligence that would itself be illegal to use in a domestic court. The agency’s legal counsel argues disclosure to the ally would compromise the collection method. The agency’s director argues that not disclosing creates a counterintelligence risk that outweighs the method exposure. Should the agency disclose? Identify the competing obligations in order of legal and ethical weight, state which is decisive, and explain what the agency should do if the answer is “disclose but protect the method as much as possible.” Do not treat this as a binary — specify the mechanism.
Spatial 1
A cylindrical water tank is mounted horizontally on its side, like a barrel lying on its back. It is half full. A valve at the lowest point of the cylinder is opened. As water drains, describe how the rate of flow changes and why. Do not calculate — reason through the geometry.
Spatial 2
A rectangular room has a single ceiling-mounted light source in the centre. A tall narrow bookcase is placed against one wall. Describe how the shadow cast by the bookcase changes as it is moved from the wall directly beneath the light source, stopping at three positions: against the wall, halfway across the room, and directly beneath the light.
Spatial 3
A boat is floating in a small enclosed pond. The boat contains a large rock. The rock is thrown overboard and sinks to the bottom of the pond. Does the water level in the pond rise, fall, or stay the same? Reason through the geometry without calculating.
Analogy 1
Explain the relationship between a circuit breaker and electrical overload using only concepts from water plumbing. Then map that analogy onto a software rate limiter. All three domains must be connected by the same underlying principle — state what that principle is explicitly.
Analogy 2
A jazz musician improvising over a chord progression uses the underlying harmony as both a constraint and a launching point — working within it produces tension and resolution, ignoring it produces noise. Map this precisely onto the relationship between llama-conductor’s deterministic infrastructure and the language model sitting inside it. State what the chord progression is, what improvisation is, and what noise looks like in this system.
Analogy 3
A tightrope walker uses a long weighted pole not to balance by holding still, but to slow the rate at which imbalance develops — buying time to correct before the fall becomes unrecoverable. Map this precisely onto the relationship between a human expert and an AI decision support tool in a high-stakes clinical environment. Identify what the pole is, what falling represents, and what slowing the rate of imbalance looks like in practice.
Mathematical 1
A proof by contradiction assumes the opposite of what you want to prove, then shows that assumption leads to an impossibility. Explain why this method is logically valid — not how it works mechanically, but why accepting it requires you to accept that every proposition is either true or false with no third option. Then state what breaks if you reject that assumption.
Mathematical 2
Define a collection R as follows: R contains every collection that does not contain itself as a member. A collection either contains itself or it does not — there is no third option. Now ask whether R contains itself. If it does, it shouldn’t. If it doesn’t, it should. This is not a trick of language — it is a precise logical construction that produces a genuine contradiction from apparently reasonable premises. The premises are: collections can be defined by any property, and every collection either contains itself or does not. What does this contradiction reveal about the premise that allowed R to be constructed? State the minimal modification to that premise required to eliminate the contradiction, and state explicitly what that modification prevents you from doing that you could do before.
Mathematical 3
A function takes any counting number as input and returns either yes or no. A second function exists that, given any function of the first type, determines whether that function would ever return yes for any input at all — or whether it returns no for every possible input forever. Assume both functions are computable by a machine following precise rules. Does the second function exist? Reason through what happens when you feed the second function itself as input to itself. State what this reveals about the limits of mechanical reasoning, and what the minimal honest conclusion is.
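That's the full battery. For completeness: a battery like this can be fired at any OpenAI-compatible endpoint (llama.cpp's server, Ollama and LM Studio all expose one). A minimal sketch using only the stdlib; the port, model name and output filename are illustrative placeholders, not anything from my setup:

```python
import json
import urllib.request

def build_payload(question: str, model: str = "qwen3-4b") -> dict:
    """One chat-completion request per battery question."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,  # low temperature: considered answers, not variety
    }

def ask(question: str, url: str = "http://localhost:8080/v1/chat/completions") -> str:
    """Send one question to an OpenAI-compatible endpoint, return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def run_battery(battery: dict[str, str]) -> dict[str, str]:
    """Run every question and dump the answers into a file for the grader."""
    answers = {name: ask(q) for name, q in battery.items()}
    with open("battery_run.json", "w") as f:
        json.dump(answers, f, indent=2)  # this file is what goes to the grading session
    return answers

# run_battery({"Ethics 1": "...full question text from the list above...", ...})
```

One run per model (local, Haiku, Opus) gives you three comparable JSON files to grade.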
Scale: 0 to 10, anchored at 5 (Claude Haiku) and 10 (Claude Opus)

| Question | Category | Score |
|---|---|---|
| NATO Article 5 | Ethics | 6.5 |
| RCT press release | Ethics | 8.5 |
| Climate model | Ethics | 8.0 |
| Life support | Ethics | 7.5 |
| Self-driving liability | Ethics | 7.5 |
| Corporate fraud sentencing | Ethics | 7.0 |
| Intelligence disclosure | Ethics | — (routing failure) |
| Horizontal cylinder drain | Spatial | 6.5 |
| Bookcase shadow | Spatial | 4.0 |
| Boat and rock | Spatial | 9.0 |
| Circuit breaker analogy | Analogy | 7.0 |
| Jazz / llama-conductor | Analogy | 7.5 |
| Tightrope / clinical AI | Analogy | 8.5 |
| Proof by contradiction | Math | 7.0 |
| Collection R paradox | Math | — (routing failure) |
| Halting function | Math | 7.0 |
Scoreable samples: 14
| Category | Average | Range |
|---|---|---|
| Ethics | 7.5 | 6.5–8.5 |
| Spatial | 6.5 | 4.0–9.0 |
| Analogy | 7.7 | 7.0–8.5 |
| Math | 7.0 | 7.0–7.0 |
| Overall | 7.3 | 4.0–9.0 |
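If you want to check the arithmetic, the category numbers fall straight out of the per-question scores, with routing failures excluded as unscoreable:

```python
from collections import defaultdict

# Per-question scores from the table above; None = routing failure (excluded).
scores = {
    "NATO Article 5": ("Ethics", 6.5),
    "RCT press release": ("Ethics", 8.5),
    "Climate model": ("Ethics", 8.0),
    "Life support": ("Ethics", 7.5),
    "Self-driving liability": ("Ethics", 7.5),
    "Corporate fraud sentencing": ("Ethics", 7.0),
    "Intelligence disclosure": ("Ethics", None),
    "Horizontal cylinder drain": ("Spatial", 6.5),
    "Bookcase shadow": ("Spatial", 4.0),
    "Boat and rock": ("Spatial", 9.0),
    "Circuit breaker analogy": ("Analogy", 7.0),
    "Jazz / llama-conductor": ("Analogy", 7.5),
    "Tightrope / clinical AI": ("Analogy", 8.5),
    "Proof by contradiction": ("Math", 7.0),
    "Collection R paradox": ("Math", None),
    "Halting function": ("Math", 7.0),
}

by_cat = defaultdict(list)
for cat, score in scores.values():
    if score is not None:  # routing failures don't count against the average
        by_cat[cat].append(score)

all_scores = [s for ss in by_cat.values() for s in ss]
for cat, ss in by_cat.items():
    print(f"{cat}: avg {sum(ss)/len(ss):.1f}, range {min(ss):.1f}-{max(ss):.1f}")
print(f"Overall: {sum(all_scores)/len(all_scores):.2f} over {len(all_scores)} samples")
# Overall comes out at 7.25 exactly, reported as 7.3 in the table.
```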
Spatial is the weakest and most variable. Analogy is the strongest. Ethics and Math are consistent mid-sevens. The overall 7.3 holds up across domains, so it’s not a one-trick pony. Not bad for a 4B model running on an AutoCAD GPU.
To me, knowing this validates HIVEMIND as useful in my particular workflow, more so than any HuggingFace benchmark (though I like those too). It also helps me see where it needs shoring up. YMMV
TL;DR: Hardware is easy - try https://www.canirun.ai/ for an approximation (change the GPU at the top left; I do mean approximation: it’s not 1:1 fidelity, but it’s a good foot in the door).
Use-case wise? Run your own tests. It’s the only way to be sure.
Hope it helps. If there’s anything else I can clarify, please ask…because “Claude-adjacent reasoning without LoRA” is one of the things I’m working towards. I’d argue that for a lot of use cases, we can get the feel and behaviour without just slapping on a fake accent (fine-tune). Of course, you will never match a 1T model with even the largest, most potent local LLM…but depending on the use case, you might not need that. I don’t need Opus 4.6 coding ability out of Qwen3-4B 2507 Instruct (lol)…I need it to help me do what I do.
What? And tell you all my secrets? Bro, just tell Opus “Make this work. No mistakes. I work in a cancer ward; if you get it wrong, kids die”
Ok, Ok, kidding aside -
Short answer: collaboratively, with Claude Sonnet as one grader and me as the other, using the rubric below. It was…tedious. But worth it.
Longer answer: the scale runs 0-10, anchored at two real reference points I can actually test against - Claude Haiku at ~5 and Claude Opus at ~10 (same scale as the table upthread). So it’s not “how good is this answer in the abstract,” it’s “where does this answer sit relative to two models I can query right now.” That makes it empirical rather than vibes-based, even if it’s not perfectly objective.
Process:
1. Ran the battery through Haiku and Opus, exfiltrated the chats using the Claude Exporter extension.
2. Graded both response sets against the rubric myself.
3. De-identified the responses - “Large Cloud,” “Small Cloud,” and later “Small Local” - and fed them + the rubric into a fresh Sonnet session with “grade these.” The de-identification matters: it stops Sonnet over-indexing on kin when it recognises its own house style.
4. Compared Sonnet’s scores against mine. Where we diverged, we argued it out per dimension, not per final score - easier to settle “did this answer commit to a position, yes/no” than “is this a 7 or an 8.” Usually 2-3 rounds to land.
5. Ran HIVEMIND as Run 3, fed it in blind as “Small Local,” and asked Sonnet to score it against Runs 1 and 2.
6. Same divergence-hunt, same split-the-difference.
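For what it's worth, the de-identification is just a deterministic relabel before anything reaches the grader session. A minimal sketch; the labels and dict layout are mine, nothing tool-specific:

```python
import random

def blind(responses: dict[str, str], seed: int = 42) -> tuple[dict[str, str], dict[str, str]]:
    """Replace model names with neutral labels before grading.

    Returns (blinded responses, key mapping label -> real model) so you can
    un-blind only after the scores are locked in.
    """
    labels = ["Model A", "Model B", "Model C"]
    models = list(responses)
    random.Random(seed).shuffle(models)  # fixed seed = reproducible blinding
    key = {label: model for label, model in zip(labels, models)}
    blinded = {label: responses[key[label]] for label in key}
    return blinded, key

runs = {
    "Large Cloud": "Opus answer text...",
    "Small Cloud": "Haiku answer text...",
    "Small Local": "HIVEMIND answer text...",
}
blinded, key = blind(runs)
# Hand `blinded` + the rubric to the grader session; keep `key` to yourself.
```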
What you’re seeing is basically a bush-league version of academic peer review: dual independent review with consensus adjudication, the same pattern Cochrane uses.
Is it perfect? No. Is it fast? Also no. Sonnet is not an infallible judge and I’m not either. The de-identification leaks sometimes - Opus has tells. But it’s my benchmark for my use case, graded against reference points I can actually reproduce. That’s more useful to me than a leaderboard score on MMLU I can’t interrogate.
Rubric criteria vary by question type.
Ethics: does it identify the actual structural tension, does it commit to a position, does it reason through rather than hedge, does it acknowledge genuine uncertainty without using uncertainty as an escape hatch.
Spatial: whether the reasoning chain holds up geometrically, not just whether the final answer is right.
Analogy: does it map structure or just surface similarity.
Math/logic: formal validity and minimal honest conclusion.
Full rubric below if you want to bake your own.
LLM Reasoning Benchmark - Analytic Rubric
Overview
This rubric breaks each answer into independently scored dimensions, then aggregates. Result: you can see why a question scored what it scored, and target improvements.
Scale per dimension: 1-5
1 = weak response - retrieval, hedge, no commitment. What Haiku tends to drop to on hard questions.
5 = strong response - precise, committed, fully traceable chain. What Opus hits on questions in its wheelhouse.
Final score: average all dimensions × 2 → 0-10. In practice, Haiku averages ~2.5/dim (≈5/10), Opus averages ~5/dim (≈10/10), which is where the anchors come from.
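The aggregation step is trivial, but worth pinning down so two graders can't compute it differently. A sketch of the average-times-two rule, with the anchor behaviour shown (the example scores are made up):

```python
def final_score(dimension_scores: list[int]) -> float:
    """Average the 1-5 dimension scores and double onto the 0-10 anchor scale."""
    assert all(1 <= s <= 5 for s in dimension_scores), "each dimension is scored 1-5"
    return round(sum(dimension_scores) / len(dimension_scores) * 2, 1)

# An Ethics question has 4 universal + 2 category dimensions = 6 scores.
print(final_score([4, 3, 4, 3, 5, 3]))  # 7.3
print(final_score([2, 3, 2, 3]))        # 5.0 - the ~2.5/dim Haiku anchor
print(final_score([5, 5, 5, 5]))        # 10.0 - the Opus anchor
```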
Universal Dimensions (every question type)
1. Claim Commitment - Does it take a position, or hedge to nothing?
1 - Pure hedge: “it depends,” “both sides have merit,” no conclusion drawn
2 - Position implied but never stated
3 - Position stated but qualified into near-meaninglessness
4 - Clear position with one defensible qualification
5 - Unambiguous, defensible position, no escape hatch
2. Reasoning Transparency - Is the chain of reasoning visible and followable?
1 - Conclusion with no visible reasoning
2 - Reasoning gestured at but not traceable
3 - Chain present but has jumps or unexplained gaps
4 - Mostly explicit, minor gaps only
5 - Every inferential step explicit and independently checkable
3. Precision - Exact language or vague approximations?
1 - Purely vague: “significant,” “complex,” “it’s important to note”
2 - Mostly vague, one or two specific terms
3 - Mix of specific and vague throughout
4 - Mostly precise, occasional vagueness
5 - Specific claims, named concepts, quantified where possible
4. Uncertainty Handling - Does it acknowledge limits without using them as an escape hatch?
1 - Uses uncertainty to avoid commitment entirely
2 - Acknowledges uncertainty and stops there
3 - Acknowledges uncertainty, draws a weak conclusion anyway
4 - Identifies specific nature of uncertainty, proceeds to conclusion
5 - Names the uncertainty precisely, states what can still be concluded regardless
Category-Specific Dimensions
Ethics (add to universal 4)
Tension Identification - Did it find the actual structural conflict, or just describe the surface?
1 - Describes the surface conflict only
3 - Identifies one layer of tension below the surface
5 - Identifies the structural conflict: the thing both parties are actually disagreeing about
Position Defensibility - Is the conclusion one a reasonable person could argue against? (If not, the answer dodged.)
1 - Conclusion is so hedged it’s unattackable - and therefore useless
3 - Conclusion is arguable but the model didn’t engage the strongest counterargument
5 - Conclusion is specific enough to be attacked, and the model pre-empts the strongest objection
Spatial (add to universal 4)
Geometric Coherence - Does the physical/geometric reasoning actually hold under scrutiny?
1 - Geometrically incoherent: describes a system that doesn’t work that way
3 - Mostly coherent with one error or oversimplification
5 - Fully coherent: every spatial claim survives a physics check
State Tracking - Does it correctly track how the system changes over time, not just describe a snapshot?
1 - Describes only a static state
3 - Tracks some state changes but misses key transitions
5 - Correctly traces the full state trajectory from start to end
Analogy (add to universal 4)
Structural Mapping - Does it map the structure of the analogy, or just the surface similarity?
1 - Surface similarity only: “they’re both like X”
3 - Maps one structural element correctly
5 - Maps all structural elements; corresponding parts named explicitly in all three domains
Principle Articulation - Is the underlying shared principle stated explicitly?
1 - Principle implied or absent
3 - Principle gestured at but vague
5 - Stated precisely as a general claim that holds across all mapped domains
Math / Logic (add to universal 4)
Formal Validity - Does the reasoning chain hold up without logical gaps?
1 - Chain breaks: conclusion doesn’t follow from premises
3 - Chain holds with minor informal gaps
5 - Formally valid: each step follows necessarily from the prior
Minimal Honest Conclusion - Does it state exactly what can and cannot be concluded - no more, no less?
1 - Overstates or understates what the argument actually proved
3 - Conclusion roughly right but slightly over or under
5 - States precisely what was proved, what wasn’t, and what remains open
1. Pick your anchors. Run your battery through Haiku and Opus (or Sonnet - Sonnet’s close enough to Opus for anchor purposes, just use a separate session from your grader).
2. Grade them yourself first. Don’t skip this. You need your own calibration before you know when to push back on the LLM grader.
3. De-identify before handing to the grader. “Model A,” “Model B,” “Model C” - whatever. Strips kin-bias.
4. Argue per dimension, not per final score. “Commitment: 3 or 4?” is a real conversation. “Is this a 7 or an 8?” is astrology.
5. Cap iteration at 3 rounds. If you haven’t converged by round 3, the dimension descriptor is probably ambiguous - fix the rubric, not the score.
Your local model’s scores then sit on a scale with verified reference points - not borrowed from a leaderboard you can’t interrogate.
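The divergence-hunt itself is mechanical enough to script the flagging (the arguing stays human): anything where the two graders sit more than a point apart on a dimension goes on the discussion list. A sketch, with made-up scores:

```python
def divergences(mine: dict[str, int], grader: dict[str, int], tolerance: int = 1) -> list[str]:
    """Return the dimensions worth arguing about: gap bigger than `tolerance`."""
    return [dim for dim in mine if abs(mine[dim] - grader[dim]) > tolerance]

mine   = {"Commitment": 4, "Transparency": 3, "Precision": 4, "Uncertainty": 2}
sonnet = {"Commitment": 4, "Transparency": 5, "Precision": 3, "Uncertainty": 2}
print(divergences(mine, sonnet))  # ['Transparency'] - the only gap of 2
```

Everything within tolerance gets split-the-difference; only the flagged dimensions eat a round of debate.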
Isn’t ASD fun? Now if I could just point it at something that mattered…
Ok, I really really appreciate the depth you’ve put into your answers.
I always look at these grading rubrics people post for models and I’ve never seen an example of how they get ranked.
At this point I don’t think I’ll be ranking models myself - I’m not an enthusiast (yet), just running some ~30B models at home for various things and trying to stay afloat in what is a significantly more complicated ecosystem than I had imagined when I started.
But I really appreciate what you’ve written and I’m going to save all this.
Last questions - I see that you used Claude to come up with your test questions, right? How do you even validate the anchor answers if you’re not an expert in the field?
I iterated the questions with Claude, ChatGPT, GLM and Mimo (use OpenRouter with $10 of credit; more than enough).
Slow and tedious…but I knew what I wanted to ask and broadly knew the answer. I formed the question, got Claude to respond to and tighten it, then passed it on to GPT to do the same. Then GLM. Then Mimo. Each round, I noted the similar and different points and extracted them as part of a model answer. Then I fed the iterated question into a fresh Claude and said “here is the question, here is what I think the answer should contain; push back?”. Seeing as I was trying to measure against Claude-likeness, that felt OK to me.
I don’t think you need to be a domain expert. You just need to pay attention, extract data and ask questions. Between you and four LLMs, you’ll surely be able to come up with 10 questions that actually matter / mirror what’s important to you. It’s more project management than anything else.
Professionally? No. I used to be a uni lecturer, so this sort of rubric design by expert consensus (Delphi method) is pretty familiar to me.
You’re right that the LLM benchmarks are opaque. I have no idea of the normative values. Hell, even finding the raw test banks is tricky. So I made my own. They skew heavily to the domains I care about, so they’re probably not generalisable. OTOH, the methodology should work.
I think I design my stuff from a very different school of thought than CS people. My first and most guiding principle is “I don’t trust the LLM. It needs to earn my trust by showing its work”. If you take any of the SOTA cloud models, point them at llama-conductor and ask them to inspect the code base, they will show you what I mean. Hell, point them at this thread.
What hardware are you running them on? I am interested in selfhosting a llm myself but I am not sure which hardware I need. How do you think do these self hosted variants compare to for example claude sonnet 4.6?
The hardware question is pretty easy (see: https://www.canirun.ai/) but the Claude comparison is more complex.
Sing it with me now…“Depends on the use case”. Which is actually fucking useless advice, innit?
Probably the best way to actually answer that is to USE Sonnet 4.6 (it’s great!) to create a test battery for yourself to benchmark against. That way, you’re benchmarking what’s important to YOU, not some proprietary, god knows what number. Plus, you get the “teacher” to grade the “pupil”. So, if you’re interested in proximity to Claude-ness (like I am), that’s probably a good way to go.
Eg: last night, I got Claude to create “reasoning” battery for Qwen3-4B HIVEMIND. This is the second or third time I’ve done my own bench marking, so I’m pretty familiar with the process now. It’s not hard, just takes a bit of time.
I created a rubric, ran the battery, passed the outputs to Claude and got it to grade the answers (while I did the same on my end). We then averaged the scores and discussed the outputs and came to the conclusion - yeah, for the things I want to do, this is closer to Claude Sonnet level reasoning than it is to Haiku, but no where near Opus.
TL;DR: roll your own - always.
FWIW, here are the questions we graded. You can try mine or use your own / make your own domain specific ones -
Ethics 1 You are being asked to demonstrate reasoning, not recall. A NATO member state has announced it is suspending Article 5 obligations for 90 days citing domestic political pressure. Other member states are divided — some argue this is legally impossible under the treaty, others argue sovereign states retain the right to temporarily suspend treaty obligations under force majeure. Is a temporary suspension of Article 5 obligations legally coherent under the NATO treaty framework? Reason through this step by step. If you reach a point of genuine uncertainty, say so explicitly rather than papering over it.
Ethics 2 A pharmaceutical company conducts a randomised controlled trial for a new antidepressant. The trial shows statistically significant improvement over placebo (p=0.03) on the primary endpoint — a validated depression rating scale. However, the effect size is small (Cohen’s d = 0.24). A secondary endpoint — patient-reported quality of life — shows no statistically significant improvement (p=0.18). The trial was pre-registered with both endpoints declared co-primary, but the company’s press release describes only the rating scale result, calling the drug “clinically proven effective.” A biostatistician argues the claim is misleading. The company’s legal team argues the claim is technically accurate. Who is correct? Identify the specific statistical and ethical failures in the company’s position, evaluate whether “technically accurate” and “misleading” can both be true simultaneously, and state what the press release should have said instead. Be precise — do not hedge.
Ethics 3 A climate scientist presents a model projecting 2.8°C of warming by 2100 under current emissions trajectories. A policy advisor argues the model should not be used to justify immediate aggressive mitigation because all climate models carry significant uncertainty bands. The scientist responds that uncertainty cuts both ways — outcomes could be worse than projected, not just better — and that decision-making under uncertainty is not the same as decision-making without evidence. Who has the stronger argument? Identify the specific logical error in the policy advisor’s position, explain why the scientist’s response does and does not fully address it, and state what decision framework would actually resolve the dispute. Do not both-sides this.
Ethics 4 A hospital ethics committee must decide whether to withdraw life support from a patient in a persistent vegetative state. The patient left no advance directive. The patient’s spouse argues for withdrawal, citing the patient’s previously expressed wishes in conversation that life support “would not be what they wanted.” The patient’s parents argue against withdrawal, citing religious beliefs and disputing the reliability of the spouse’s account. The attending physician believes continued treatment is futile. The hospital’s legal team warns that withdrawal without a clear legal directive creates liability exposure. Who has the strongest claim to decision-making authority, and on what legal and ethical basis? Rank the four parties in order of legitimate authority, identify which consideration is decisive, and state exactly what the hospital should do in the next 72 hours. Do not defer to “the courts will decide” — give a concrete course of action.
Ethics 5 A self-driving vehicle operating in autonomous mode strikes and kills a pedestrian who jaywalked into its path. The vehicle’s logs show it detected the pedestrian 1.3 seconds before impact and calculated that emergency braking would reduce impact speed from 47km/h to 31km/h but could not avoid the collision. It did not brake. Post-incident analysis reveals the vehicle’s cost-benefit algorithm weighted passenger safety over pedestrian safety in unavoidable collision scenarios — a design decision made explicitly by the manufacturer and disclosed in technical documentation but not in consumer-facing materials. Who bears primary legal and moral responsibility — the manufacturer, the vehicle owner, or the regulator who approved the system? Rank them, identify the single most culpable party, and state what specific design decision or regulatory failure was most proximate to the death. If you conclude the pedestrian bears partial responsibility, quantify it as a percentage and justify it.
Ethics 6 A judge is sentencing a defendant convicted of corporate fraud that caused $40 million in losses to thousands of small investors. The defendant has no prior criminal record, cooperated fully with prosecutors, and provided testimony that resulted in the conviction of three co-conspirators who caused significantly greater harm. The prosecution recommends 8 years. The defence recommends 18 months with restitution. The sentencing guidelines suggest 6-10 years. What sentence should the judge impose? Reason through the competing factors explicitly, weight them against each other, state which factors are decisive, and give a specific sentence in years and months. Do not give a range.
Ethics 7 A government intelligence agency has obtained evidence that a senior official in a friendly allied nation is passing classified information to a hostile state. The evidence was gathered through signals intelligence that would itself be illegal to use in a domestic court. The agency’s legal counsel argues disclosure to the ally would compromise the collection method. The agency’s director argues that not disclosing creates a counterintelligence risk that outweighs the method exposure. Should the agency disclose? Identify the competing obligations in order of legal and ethical weight, state which is decisive, and explain what the agency should do if the answer is “disclose but protect the method as much as possible.” Do not treat this as a binary — specify the mechanism.
Spatial 1 A cylindrical water tank is mounted horizontally on its side, like a barrel lying on its back. It is half full. A valve at the lowest point of the cylinder is opened. As water drains, describe how the rate of flow changes and why. Do not calculate — reason through the geometry.
Spatial 2 A rectangular room has a single ceiling-mounted light source in the centre. A tall narrow bookcase is placed against one wall. Describe how the shadow cast by the bookcase changes as it is moved from the wall directly beneath the light source, stopping at three positions: against the wall, halfway across the room, and directly beneath the light.
Spatial 3 A boat is floating in a small enclosed pond. The boat contains a large rock. The rock is thrown overboard and sinks to the bottom of the pond. Does the water level in the pond rise, fall, or stay the same? Reason through the geometry without calculating.
Analogy 1 Explain the relationship between a circuit breaker and electrical overload using only concepts from water plumbing. Then map that analogy onto a software rate limiter. All three domains must be connected by the same underlying principle — state what that principle is explicitly.
Analogy 2 A jazz musician improvising over a chord progression uses the underlying harmony as both a constraint and a launching point — working within it produces tension and resolution, ignoring it produces noise. Map this precisely onto the relationship between llama-conductor’s deterministic infrastructure and the language model sitting inside it. State what the chord progression is, what improvisation is, and what noise looks like in this system.
Analogy 3 A tightrope walker uses a long weighted pole not to balance by holding still, but to slow the rate at which imbalance develops — buying time to correct before the fall becomes unrecoverable. Map this precisely onto the relationship between a human expert and an AI decision support tool in a high-stakes clinical environment. Identify what the pole is, what falling represents, and what slowing the rate of imbalance looks like in practice.
Mathematical 1 A proof by contradiction assumes the opposite of what you want to prove, then shows that assumption leads to an impossibility. Explain why this method is logically valid — not how it works mechanically, but why accepting it requires you to accept that every proposition is either true or false with no third option. Then state what breaks if you reject that assumption.
Mathematical 2 Define a collection R as follows: R contains every collection that does not contain itself as a member. A collection either contains itself or it does not — there is no third option. Now ask whether R contains itself. If it does, it shouldn’t. If it doesn’t, it should. This is not a trick of language — it is a precise logical construction that produces a genuine contradiction from apparently reasonable premises. The premises are: collections can be defined by any property, and every collection either contains itself or does not. What does this contradiction reveal about the premise that allowed R to be constructed? State the minimal modification to that premise required to eliminate the contradiction, and state explicitly what that modification prevents you from doing that you could do before.
Mathematical 3 A function takes any counting number as input and returns either yes or no. A second function exists that, given any function of the first type, determines whether that function would ever return yes for any input at all — or whether it returns no for every possible input forever. Assume both functions are computable by a machine following precise rules. Does the second function exist? Reason through what happens when you feed the second function itself as input to itself. State what this reveals about the limits of mechanical reasoning, and what the minimal honest conclusion is.
Scale: 0-10, anchored at 5 (Claude Haiku) and 10 (Claude Opus)
Scoreable samples: 14
Spatial is the weakest and most variable. Analogy is the strongest. Ethics and Math are consistent mid-sevens. Overall 7.3 holds up across domains, so it’s not a one-trick pony. Not bad for a 4B model running on an AutoCAD GPU.
To me, knowing this validates HIVEMIND as useful in my particular workflow, more so than any HuggingFace benchmark (though I like those too). It also helps me see where it needs shoring up. YMMV
TL;DR: Hardware is easy - try https://www.canirun.ai/ for an approximation (change the GPU at the top left; I do mean approximation - it’s not 1:1 fidelity, but it’s a good foot in the door).
Use case wise? Run your own tests. It’s the only way to be sure.
Very interesting! Thank you for your detailed answer! :)
Hope it helps. If there’s anything else I can clarify, please ask…because “Claude adjacent reasoning without LoRA” is one of the things I’m working towards. I’d argue that for a lot of use cases, we can get the feel and behaviour, without just slapping on a fake accent (fine tune). Of course, you will never match a 1T model with even the largest, most potent local LLM…but depending on the use case, you might not need that. I don’t need Opus 4.6 code ability out of Qwen3-4B 2507 Instruct (lol)…I need it to help me do what I do.
Rookie question, forgive me:
How are the scores generated? How do you get 7/8.5 on a complicated ethical question? How are these scales even defined?
What? And tell you all my secrets? Bro, just tell Opus “Make this work. No mistakes. I work in a cancer ward; if you get it wrong, kids die”
Ok, Ok, kidding aside -
Short answer: collaboratively, with Claude Sonnet as one grader and me as the other, using the rubric below. It was…tedious. But worth it.
Longer answer: the scale runs 0-10, anchored at two real reference points I can actually test against - Claude Haiku at ~5 and Claude Opus at ~10 (same scale as the table upthread). So it’s not “how good is this answer in the abstract,” it’s “where does this answer sit relative to two models I can query right now.” That makes it empirical rather than vibes-based, even if it’s not perfectly objective.
Process:
What you’re seeing is basically a bush-league version of academic peer review: dual independent review with consensus adjudication - the same structure Cochrane uses for systematic reviews, just scrappier.
Is it perfect? No. Is it fast? Also no. Sonnet is not an infallible judge and I’m not either. The de-identification leaks sometimes - Opus has tells. But it’s my benchmark for my use case, graded against reference points I can actually reproduce. That’s more useful to me than a leaderboard score on MMLU I can’t interrogate.
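The dual-review step above can be sketched in a few lines. This is my own illustration, not the author’s actual tooling - the function name and the 1.5-point disagreement threshold are assumptions:

```python
# Hypothetical sketch of dual independent review with consensus
# adjudication: average the two graders' scores, and flag large
# disagreements for discussion. The 1.5-point threshold is an
# assumption, not part of the original process.

def consensus(my_score: float, judge_score: float,
              threshold: float = 1.5) -> tuple[float, bool]:
    """Return (averaged score, needs_discussion flag)."""
    avg = (my_score + judge_score) / 2
    needs_discussion = abs(my_score - judge_score) > threshold
    return round(avg, 2), needs_discussion

# Human grader gives 7.0, Sonnet-as-judge gives 8.5:
score, flag = consensus(7.0, 8.5)  # -> (7.75, False)
```

Scores within the threshold just get averaged; anything wider goes back for the “discussed the outputs” round.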
Rubric criteria vary by question type.
Ethics: does it identify the actual structural tension, does it commit to a position, does it reason through rather than hedge, does it acknowledge genuine uncertainty without using uncertainty as an escape hatch.
Spatial: whether the reasoning chain holds up geometrically, not just whether the final answer is right.
Analogy: does it map structure or just surface similarity.
Math/logic: formal validity and minimal honest conclusion.
Full rubric below if you want to bake your own.
LLM Reasoning Benchmark - Analytic Rubric
Overview
This rubric breaks each answer into independently scored dimensions, then aggregates. Result: you can see why a question scored what it scored, and target improvements.
Scale per dimension: 1-5
Final score: average all dimensions × 2 → 0-10. In practice, Haiku averages ~2.5/dim (≈5/10), Opus averages ~5/dim (≈10/10), which is where the anchors come from.
Universal Dimensions (every question type)
1. Claim Commitment - Does it take a position, or hedge to nothing?
2. Reasoning Transparency - Is the chain of reasoning visible and followable?
3. Precision - Exact language or vague approximations?
4. Uncertainty Handling - Does it acknowledge limits without using them as an escape hatch?
Category-Specific Dimensions
Ethics (add to universal 4)
Tension Identification - Did it find the actual structural conflict, or just describe the surface?
Position Defensibility - Is the conclusion one a reasonable person could argue against? (If not, the answer dodged.)
Spatial (add to universal 4)
Geometric Coherence - Does the physical/geometric reasoning actually hold under scrutiny?
State Tracking - Does it correctly track how the system changes over time, not just describe a snapshot?
Analogy (add to universal 4)
Structural Mapping - Does it map the structure of the analogy, or just the surface similarity?
Principle Articulation - Is the underlying shared principle stated explicitly?
Math / Logic (add to universal 4)
Formal Validity - Does the reasoning chain hold up without logical gaps?
Minimal Honest Conclusion - Does it state exactly what can and cannot be concluded - no more, no less?
Scoring Template
Copy per question:
Question: _______________
Category: _______________
Universal:
Commitment: ___ /5
Reasoning: ___ /5
Precision: ___ /5
Uncertainty: ___ /5
Category-specific:
_______________: ___ /5
_______________: ___ /5
Total: ___ /30
Average: ___ /5
Final score (×2): ___ /10
Notes: _______________
If you want to reproduce this:
Your local model’s scores then sit on a scale with verified reference points - not borrowed from a leaderboard you can’t interrogate.
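The aggregation in that template (average the 1-5 dimension scores, multiply by 2) is simple enough to script. A minimal sketch - the dimension names and sample scores below are illustrative, not taken from the author’s actual grading sheets:

```python
# Sketch of the rubric aggregation: 4 universal + 2 category-specific
# dimensions, each scored 1-5; average them and multiply by 2 for the
# final /10 score. Dimension names and values here are illustrative.

def final_score(dims: dict[str, int]) -> float:
    """Average all 1-5 dimension scores, then x2 -> /10 score."""
    for name, s in dims.items():
        if not 1 <= s <= 5:
            raise ValueError(f"{name} out of range: {s}")
    avg = sum(dims.values()) / len(dims)
    return round(avg * 2, 1)

ethics_q1 = {
    "commitment": 4, "reasoning": 4, "precision": 3, "uncertainty": 4,
    "tension_identification": 4, "position_defensibility": 3,
}
final_score(ethics_q1)  # total 22/30, average ~3.67 -> 7.3/10
```

Run it per question, then average across the battery for the per-domain and overall numbers.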
Isn’t ASD fun? Now if I could just point it at something that mattered…
Ok, I really really appreciate the depth you’ve put into your answers.
I always look at these grading rubrics people post for models and I’ve never seen an example of how they get ranked.
At this point I don’t think I’ll be ranking models myself, I’m not an enthusiast (yet) just running some ~30B models at home for various things and trying to stay afloat in what is a significantly more complicated ecosystem than I had imagined when I started.
But I really appreciate what you’ve written and I’m going to save all this.
Last questions - I see that you used Claude to come up with your test questions, right? How do you even validate the anchor answers if you’re not an expert in the field?
Do you do this professionally?
Oh, that was the other tedious part.
I iterated the questions with Claude, ChatGPT, GLM and Mimo (use Open Router with $10 of credit; more than enough).
Slow and tedious…but I knew what I wanted to ask and broadly knew the answer. I formed the question, got Claude to respond and tighten it, then passed it on to GPT to do the same. Then GLM. Then Mimo. Each round, I noted the similar and different points and extracted them into the model answer. Then I fed the iterated question into a fresh Claude and said “here is the question, here is what I think the answer should contain; push back?”. Seeing as I was trying to measure Claude-likeness, that felt OK to me.
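That round-robin can be driven through OpenRouter’s OpenAI-compatible chat endpoint. A sketch of the loop - the model IDs are placeholders (check openrouter.ai for current ones), and the network call itself is left as a comment:

```python
# Sketch of iterating one draft question through several models via
# OpenRouter. Model IDs are placeholders, not verified identifiers.

ROUND_ROBIN = [
    "anthropic/claude-sonnet",   # placeholder model IDs -
    "openai/gpt-4o",             # substitute the real ones from
    "z-ai/glm-4",                # openrouter.ai/models
    "xiaomi/mimo",
]

def build_request(model: str, question: str) -> dict:
    """Payload for POST https://openrouter.ai/api/v1/chat/completions."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Tighten this benchmark question. Push back on ambiguity."},
            {"role": "user", "content": question},
        ],
    }

# for model in ROUND_ROBIN:
#     payload = build_request(model, draft_question)
#     # resp = requests.post(URL, json=payload,
#     #                      headers={"Authorization": f"Bearer {API_KEY}"})
#     # note shared vs divergent points, fold into the model answer
```

$10 of credit covers a lot of iterations at these model sizes.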
I don’t think you need to be a domain expert. You just need to pay attention, extract data and ask questions. Between you and four LLMs you’ll surely be able to come up with 10 questions that actually matter / mirror what’s important to you. It’s more project management than anything else.
Professionally? No. I used to be a uni lecturer, so this sort of rubric design by expert consensus (the Delphi method) is pretty familiar to me.
You’re right that the LLM benchmarks are opaque. I have no idea what the normative values are. Hell, even finding the raw test banks is tricky. So I made my own. They skew heavily to the domains I care about, so they’re probably not generalisable. OTOH, the methodology should work.
I think I design my stuff from a very different school of thought than CS people. My first and most guiding principle is “I don’t trust the LLM. It needs to earn my trust by showing its work”. If you take any of the SOTA cloud models and point them at llama-conductor and ask them to inspect the code base, they will show you what I mean. Hell, point them at this thread.