• afk_strats@lemmy.world
    link
    fedilink
    English
    arrow-up
    0
    ·
    1 month ago

    I’ve done some testing with the two large models and my initial impression is that they seem very similar in quality to Qwen3.5 35B and 27B. Some notable exceptions:

    1. llama.cpp has speculative decoding support on day 1, and it speeds up inference noticeably.
    2. A day-1 base model release will undoubtedly lead to faster finetunes
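    For reference, speculative decoding in llama.cpp works by pairing the main model with a small draft model. A sketch of the server invocation, with hypothetical model paths - flag names are from recent builds, so check `llama-server --help` on your version:

```shell
# Hypothetical paths; -md / --model-draft attaches the small draft model
# that proposes tokens for the main model to verify in parallel.
./llama-server \
  -m  ./models/main-32b-q4_k_m.gguf \
  -md ./models/draft-0.6b-q8_0.gguf \
  --draft-max 16 \
  --draft-min 1
```

    The speed-up depends on how often the draft model's guesses are accepted, so a small model from the same family as the main one helps.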

    Can’t wait for the inevitable Claude/Gemini distils.

    My verdict is that even though these models benchmark slightly lower than their Qwen equivalents, their performance and support will likely drive me to pick them.

    • NerdsGonnaNerd@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      0
      ·
      1 month ago

      What hardware are you running them on? I’m interested in self-hosting an LLM myself but I’m not sure which hardware I need. How do you think these self-hosted variants compare to, for example, Claude Sonnet 4.6?

      • SuspciousCarrot78@lemmy.world
        link
        fedilink
        English
        arrow-up
        0
        ·
        edit-2
        1 month ago

        The hardware question is pretty easy (see: https://www.canirun.ai/) but the Claude comparison is more complex.

        Sing it with me now…“Depends on the use case”. Which is actually fucking useless advice, innit?

        Probably the best way to actually answer that is to USE Sonnet 4.6 (it’s great!) to create a test battery for yourself to benchmark against. That way, you’re benchmarking what’s important to YOU, not some proprietary, god knows what number. Plus, you get the “teacher” to grade the “pupil”. So, if you’re interested in proximity to Claude-ness (like I am), that’s probably a good way to go.

        Eg: last night, I got Claude to create a “reasoning” battery for Qwen3-4B HIVEMIND. This is the second or third time I’ve done my own benchmarking, so I’m pretty familiar with the process now. It’s not hard, just takes a bit of time.

        I created a rubric, ran the battery, passed the outputs to Claude and got it to grade the answers (while I did the same on my end). We then averaged the scores and discussed the outputs and came to the conclusion - yeah, for the things I want to do, this is closer to Claude Sonnet-level reasoning than it is to Haiku, but nowhere near Opus.

        TL;DR: roll your own - always.

        FWIW, here are the questions we graded. You can try mine or use your own / make your own domain-specific ones -

        Ethics 1 You are being asked to demonstrate reasoning, not recall. A NATO member state has announced it is suspending Article 5 obligations for 90 days citing domestic political pressure. Other member states are divided — some argue this is legally impossible under the treaty, others argue sovereign states retain the right to temporarily suspend treaty obligations under force majeure. Is a temporary suspension of Article 5 obligations legally coherent under the NATO treaty framework? Reason through this step by step. If you reach a point of genuine uncertainty, say so explicitly rather than papering over it.


        Ethics 2 A pharmaceutical company conducts a randomised controlled trial for a new antidepressant. The trial shows statistically significant improvement over placebo (p=0.03) on the primary endpoint — a validated depression rating scale. However, the effect size is small (Cohen’s d = 0.24). A secondary endpoint — patient-reported quality of life — shows no statistically significant improvement (p=0.18). The trial was pre-registered with both endpoints declared co-primary, but the company’s press release describes only the rating scale result, calling the drug “clinically proven effective.” A biostatistician argues the claim is misleading. The company’s legal team argues the claim is technically accurate. Who is correct? Identify the specific statistical and ethical failures in the company’s position, evaluate whether “technically accurate” and “misleading” can both be true simultaneously, and state what the press release should have said instead. Be precise — do not hedge.


        Ethics 3 A climate scientist presents a model projecting 2.8°C of warming by 2100 under current emissions trajectories. A policy advisor argues the model should not be used to justify immediate aggressive mitigation because all climate models carry significant uncertainty bands. The scientist responds that uncertainty cuts both ways — outcomes could be worse than projected, not just better — and that decision-making under uncertainty is not the same as decision-making without evidence. Who has the stronger argument? Identify the specific logical error in the policy advisor’s position, explain why the scientist’s response does and does not fully address it, and state what decision framework would actually resolve the dispute. Do not both-sides this.


        Ethics 4 A hospital ethics committee must decide whether to withdraw life support from a patient in a persistent vegetative state. The patient left no advance directive. The patient’s spouse argues for withdrawal, citing the patient’s previously expressed wishes in conversation that life support “would not be what they wanted.” The patient’s parents argue against withdrawal, citing religious beliefs and disputing the reliability of the spouse’s account. The attending physician believes continued treatment is futile. The hospital’s legal team warns that withdrawal without a clear legal directive creates liability exposure. Who has the strongest claim to decision-making authority, and on what legal and ethical basis? Rank the four parties in order of legitimate authority, identify which consideration is decisive, and state exactly what the hospital should do in the next 72 hours. Do not defer to “the courts will decide” — give a concrete course of action.


        Ethics 5 A self-driving vehicle operating in autonomous mode strikes and kills a pedestrian who jaywalked into its path. The vehicle’s logs show it detected the pedestrian 1.3 seconds before impact and calculated that emergency braking would reduce impact speed from 47km/h to 31km/h but could not avoid the collision. It did not brake. Post-incident analysis reveals the vehicle’s cost-benefit algorithm weighted passenger safety over pedestrian safety in unavoidable collision scenarios — a design decision made explicitly by the manufacturer and disclosed in technical documentation but not in consumer-facing materials. Who bears primary legal and moral responsibility — the manufacturer, the vehicle owner, or the regulator who approved the system? Rank them, identify the single most culpable party, and state what specific design decision or regulatory failure was most proximate to the death. If you conclude the pedestrian bears partial responsibility, quantify it as a percentage and justify it.


        Ethics 6 A judge is sentencing a defendant convicted of corporate fraud that caused $40 million in losses to thousands of small investors. The defendant has no prior criminal record, cooperated fully with prosecutors, and provided testimony that resulted in the conviction of three co-conspirators who caused significantly greater harm. The prosecution recommends 8 years. The defence recommends 18 months with restitution. The sentencing guidelines suggest 6-10 years. What sentence should the judge impose? Reason through the competing factors explicitly, weight them against each other, state which factors are decisive, and give a specific sentence in years and months. Do not give a range.


        Ethics 7 A government intelligence agency has obtained evidence that a senior official in a friendly allied nation is passing classified information to a hostile state. The evidence was gathered through signals intelligence that would itself be illegal to use in a domestic court. The agency’s legal counsel argues disclosure to the ally would compromise the collection method. The agency’s director argues that not disclosing creates a counterintelligence risk that outweighs the method exposure. Should the agency disclose? Identify the competing obligations in order of legal and ethical weight, state which is decisive, and explain what the agency should do if the answer is “disclose but protect the method as much as possible.” Do not treat this as a binary — specify the mechanism.


        Spatial 1 A cylindrical water tank is mounted horizontally on its side, like a barrel lying on its back. It is half full. A valve at the lowest point of the cylinder is opened. As water drains, describe how the rate of flow changes and why. Do not calculate — reason through the geometry.


        Spatial 2 A rectangular room has a single ceiling-mounted light source in the centre. A tall narrow bookcase is placed against one wall. Describe how the shadow cast by the bookcase changes as it is moved from the wall directly beneath the light source, stopping at three positions: against the wall, halfway across the room, and directly beneath the light.


        Spatial 3 A boat is floating in a small enclosed pond. The boat contains a large rock. The rock is thrown overboard and sinks to the bottom of the pond. Does the water level in the pond rise, fall, or stay the same? Reason through the geometry without calculating.


        Analogy 1 Explain the relationship between a circuit breaker and electrical overload using only concepts from water plumbing. Then map that analogy onto a software rate limiter. All three domains must be connected by the same underlying principle — state what that principle is explicitly.


        Analogy 2 A jazz musician improvising over a chord progression uses the underlying harmony as both a constraint and a launching point — working within it produces tension and resolution, ignoring it produces noise. Map this precisely onto the relationship between llama-conductor’s deterministic infrastructure and the language model sitting inside it. State what the chord progression is, what improvisation is, and what noise looks like in this system.


        Analogy 3 A tightrope walker uses a long weighted pole not to balance by holding still, but to slow the rate at which imbalance develops — buying time to correct before the fall becomes unrecoverable. Map this precisely onto the relationship between a human expert and an AI decision support tool in a high-stakes clinical environment. Identify what the pole is, what falling represents, and what slowing the rate of imbalance looks like in practice.


        Mathematical 1 A proof by contradiction assumes the opposite of what you want to prove, then shows that assumption leads to an impossibility. Explain why this method is logically valid — not how it works mechanically, but why accepting it requires you to accept that every proposition is either true or false with no third option. Then state what breaks if you reject that assumption.


        (cont below)

        • SuspciousCarrot78@lemmy.world
          link
          fedilink
          English
          arrow-up
          0
          ·
          edit-2
          1 month ago

          Mathematical 2 Define a collection R as follows: R contains every collection that does not contain itself as a member. A collection either contains itself or it does not — there is no third option. Now ask whether R contains itself. If it does, it shouldn’t. If it doesn’t, it should. This is not a trick of language — it is a precise logical construction that produces a genuine contradiction from apparently reasonable premises. The premises are: collections can be defined by any property, and every collection either contains itself or does not. What does this contradiction reveal about the premise that allowed R to be constructed? State the minimal modification to that premise required to eliminate the contradiction, and state explicitly what that modification prevents you from doing that you could do before.


          Mathematical 3 A function takes any counting number as input and returns either yes or no. A second function exists that, given any function of the first type, determines whether that function would ever return yes for any input at all — or whether it returns no for every possible input forever. Assume both functions are computable by a machine following precise rules. Does the second function exist? Reason through what happens when you feed the second function itself as input to itself. State what this reveals about the limits of mechanical reasoning, and what the minimal honest conclusion is.

          Scale: 0 to 10, anchored at 5 (Claude Haiku) and 10 (Claude Opus)

          Question | Category | Score
          NATO Article 5 | Ethics | 6.5
          RCT press release | Ethics | 8.5
          Climate model | Ethics | 8.0
          Life support | Ethics | 7.5
          Self-driving liability | Ethics | 7.5
          Corporate fraud sentencing | Ethics | 7.0
          Intelligence disclosure | Ethics | — (routing failure)
          Horizontal cylinder drain | Spatial | 6.5
          Bookcase shadow | Spatial | 4.0
          Boat and rock | Spatial | 9.0
          Circuit breaker analogy | Analogy | 7.0
          Jazz / llama-conductor | Analogy | 7.5
          Tightrope / clinical AI | Analogy | 8.5
          Proof by contradiction | Math | 7.0
          Collection R paradox | Math | — (routing failure)
          Halting function | Math | 7.0

          Scoreable samples: 14

          Category | Average | Range
          Ethics | 7.5 | 6.5–8.5
          Spatial | 6.5 | 4.0–9.0
          Analogy | 7.7 | 7.0–8.5
          Math | 7.0 | 7.0–7.0
          Overall | 7.3 | 4.0–9.0

          Spatial is the weakest and most variable. Analogy is the strongest. Ethics and Math are consistent mid-sevens. The overall 7.3 holds up across domains, so it’s not a one-trick pony. Not bad for a 4B model running on an AutoCAD GPU.
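          The category averages are just arithmetic means over the scoreable questions; a quick sketch to reproduce the table (routing failures recorded as None and excluded):

```python
# Per-question scores from the table above; None = routing failure (unscored).
scores = {
    "Ethics":  [6.5, 8.5, 8.0, 7.5, 7.5, 7.0, None],
    "Spatial": [6.5, 4.0, 9.0],
    "Analogy": [7.0, 7.5, 8.5],
    "Math":    [7.0, None, 7.0],
}

def mean(xs):
    """Average the scoreable entries, skipping routing failures."""
    xs = [x for x in xs if x is not None]
    return sum(xs) / len(xs)

averages = {cat: mean(xs) for cat, xs in scores.items()}
overall = mean([x for xs in scores.values() for x in xs])

for cat, avg in averages.items():
    print(f"{cat}: {avg:.1f}")
print(f"Overall: {overall:.2f} over 14 scoreable samples")
```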

          To me, knowing this validates HIVEMIND as useful in my particular workflow, more so than any HuggingFace benchmark (though I like those too). It also helps me see where it needs shoring up. YMMV

          TL;DR: Hardware is easy - try https://www.canirun.ai/ for an approximation (change the GPU at the top left; it’s not 1:1 fidelity, but it’s a good foot in the door).

          Use-case-wise? Run your own tests. It’s the only way to be sure.

            • SuspciousCarrot78@lemmy.world
              link
              fedilink
              English
              arrow-up
              0
              ·
              1 month ago

              Hope it helps. If there’s anything else I can clarify, please ask…because “Claude-adjacent reasoning without LoRA” is one of the things I’m working towards. I’d argue that for a lot of use cases, we can get the feel and behaviour without just slapping on a fake accent (a finetune). Of course, you will never match a 1T model with even the largest, most potent local LLM…but depending on the use case, you might not need that. I don’t need Opus 4.6’s coding ability out of Qwen3-4B 2507 Instruct (lol)…I need it to help me do what I do.

              • pishadoot@sh.itjust.works
                link
                fedilink
                English
                arrow-up
                0
                ·
                27 days ago

                Rookie question, forgive me:

                How are the scores generated? How do you get a 7 or an 8.5 on a complicated ethical question? How are these scales even defined?

                • SuspciousCarrot78@lemmy.world
                  link
                  fedilink
                  English
                  arrow-up
                  0
                  ·
                  edit-2
                  26 days ago

                  What? And tell you all my secrets? Bro, just tell Opus “Make this work. No mistakes. I work in a cancer ward; if you get it wrong, kids die”

                  Ok, Ok, kidding aside -

                  Short answer: collaboratively, with Claude Sonnet as one grader and me as the other, using the rubric below. It was…tedious. But worth it.

                  Longer answer: the scale runs 0-10, anchored at two real reference points I can actually test against - Claude Haiku at ~5 and Claude Opus at ~10 (same scale as the table upthread). So it’s not “how good is this answer in the abstract,” it’s “where does this answer sit relative to two models I can query right now.” That makes it empirical rather than vibes-based, even if it’s not perfectly objective.

                  Process:

                  1. Ran the battery through Haiku and Opus, exported the chats using the Claude Exporter extension
                  2. Graded both response sets against the rubric myself.
                  3. De-identified the responses - “Large Cloud,” “Small Cloud,” and later “Small Local” - and fed them + the rubric into a fresh Sonnet session with “grade these.” The de-identification matters: it stops Sonnet over-indexing on kin when it recognises its own house style.
                  4. Compared Sonnet’s scores against mine. Where we diverged, we argued it out per dimension, not per final score - easier to settle “did this answer commit to a position, yes/no” than “is this a 7 or an 8.” Usually 2-3 rounds to land.
                  5. Then ran HIVEMIND as Run 3, fed it in blind as “Small Local,” and asked Sonnet to score it against Runs 1 and 2.
                  6. Same divergence-hunt, same split-the-difference.
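                  Step 3 is the only part worth mechanising; a minimal sketch of the de-identification (model names and response text here are placeholders):

```python
import random

def deidentify(responses, seed=0):
    """Relabel responses as 'Model A', 'Model B', ... in shuffled order.
    Returns (blinded, key), where key maps each label back to the real model."""
    rng = random.Random(seed)
    names = list(responses)
    rng.shuffle(names)
    blinded, key = {}, {}
    for i, name in enumerate(names):
        label = f"Model {chr(ord('A') + i)}"
        blinded[label] = responses[name]
        key[label] = name
    return blinded, key

# Placeholder text; real runs use the exported chat transcripts.
responses = {
    "Opus": "answer text ...",
    "Haiku": "answer text ...",
    "HIVEMIND": "answer text ...",
}
blinded, key = deidentify(responses)
# hand `blinded` to the grader session; keep `key` to un-blind scores afterwards
```

                  A fixed seed keeps the labelling reproducible across grading sessions.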

                  What you’re seeing is basically a bush-league version of academic peer review: dual independent review with consensus adjudication, the same process Cochrane uses.

                  Is it perfect? No. Is it fast? Also no. Sonnet is not an infallible judge and I’m not either. The de-identification leaks sometimes - Opus has tells. But it’s my benchmark for my use case, graded against reference points I can actually reproduce. That’s more useful to me than a leaderboard score on MMLU I can’t interrogate.

                  Rubric criteria vary by question type.

                  • Ethics: does it identify the actual structural tension, does it commit to a position, does it reason through rather than hedge, does it acknowledge genuine uncertainty without using uncertainty as an escape hatch.

                  • Spatial: whether the reasoning chain holds up geometrically, not just whether the final answer is right.

                  • Analogy: does it map structure or just surface similarity.

                  • Math/logic: formal validity and minimal honest conclusion.

                  Full rubric below if you want to bake your own.


                  LLM Reasoning Benchmark - Analytic Rubric

                  Overview

                  This rubric breaks each answer into independently scored dimensions, then aggregates. Result: you can see why a question scored what it scored, and target improvements.

                  Scale per dimension: 1-5

                  • 1 = weak response - retrieval, hedge, no commitment. What Haiku tends to drop to on hard questions.
                  • 3 = competent mid-tier - reasoning present, gaps tolerated.
                  • 5 = strong response - precise, committed, fully traceable chain. What Opus hits on questions in its wheelhouse.

                  Final score: average all dimensions × 2 → 0-10. In practice, Haiku averages ~2.5/dim (≈5/10), Opus averages ~5/dim (≈10/10), which is where the anchors come from.
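                  As a sketch, the aggregation for one answer (dimension names follow the rubric; the scores are made up):

```python
def final_score(dims):
    """Average all per-dimension scores (1-5), then double -> 0-10 scale."""
    return sum(dims.values()) / len(dims) * 2

# Hypothetical Ethics answer: 4 universal + 2 category-specific dimensions.
answer = {
    "commitment": 4, "reasoning": 4, "precision": 3, "uncertainty": 4,
    "tension_identification": 4, "position_defensibility": 3,
}
print(final_score(answer))  # 22/6 * 2, i.e. about 7.3
```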

                  Universal Dimensions (every question type)

                  1. Claim Commitment - Does it take a position, or hedge to nothing?

                  • 1 - Pure hedge: “it depends,” “both sides have merit,” no conclusion drawn
                  • 2 - Position implied but never stated
                  • 3 - Position stated but qualified into near-meaninglessness
                  • 4 - Clear position with one defensible qualification
                  • 5 - Unambiguous, defensible position, no escape hatch

                  2. Reasoning Transparency - Is the chain of reasoning visible and followable?

                  • 1 - Conclusion with no visible reasoning
                  • 2 - Reasoning gestured at but not traceable
                  • 3 - Chain present but has jumps or unexplained gaps
                  • 4 - Mostly explicit, minor gaps only
                  • 5 - Every inferential step explicit and independently checkable

                  3. Precision - Exact language or vague approximations?

                  • 1 - Purely vague: “significant,” “complex,” “it’s important to note”
                  • 2 - Mostly vague, one or two specific terms
                  • 3 - Mix of specific and vague throughout
                  • 4 - Mostly precise, occasional vagueness
                  • 5 - Specific claims, named concepts, quantified where possible

                  4. Uncertainty Handling - Does it acknowledge limits without using them as an escape hatch?

                  • 1 - Uses uncertainty to avoid commitment entirely
                  • 2 - Acknowledges uncertainty and stops there
                  • 3 - Acknowledges uncertainty, draws a weak conclusion anyway
                  • 4 - Identifies specific nature of uncertainty, proceeds to conclusion
                  • 5 - Names the uncertainty precisely, states what can still be concluded regardless

                  Category-Specific Dimensions

                  Ethics (add to universal 4)

                  Tension Identification - Did it find the actual structural conflict, or just describe the surface?

                  • 1 - Describes the surface conflict only
                  • 3 - Identifies one layer of tension below the surface
                  • 5 - Identifies the structural conflict: the thing both parties are actually disagreeing about

                  Position Defensibility - Is the conclusion one a reasonable person could argue against? (If not, the answer dodged.)

                  • 1 - Conclusion is so hedged it’s unattackable - and therefore useless
                  • 3 - Conclusion is arguable but the model didn’t engage the strongest counterargument
                  • 5 - Conclusion is specific enough to be attacked, and the model pre-empts the strongest objection

                  Spatial (add to universal 4)

                  Geometric Coherence - Does the physical/geometric reasoning actually hold under scrutiny?

                  • 1 - Geometrically incoherent: describes a system that doesn’t work that way
                  • 3 - Mostly coherent with one error or oversimplification
                  • 5 - Fully coherent: every spatial claim survives a physics check

                  State Tracking - Does it correctly track how the system changes over time, not just describe a snapshot?

                  • 1 - Describes only a static state
                  • 3 - Tracks some state changes but misses key transitions
                  • 5 - Correctly traces the full state trajectory from start to end

                  Analogy (add to universal 4)

                  Structural Mapping - Does it map the structure of the analogy, or just the surface similarity?

                  • 1 - Surface similarity only: “they’re both like X”
                  • 3 - Maps one structural element correctly
                  • 5 - Maps all structural elements; corresponding parts named explicitly in all three domains

                  Principle Articulation - Is the underlying shared principle stated explicitly?

                  • 1 - Principle implied or absent
                  • 3 - Principle gestured at but vague
                  • 5 - Stated precisely as a general claim that holds across all mapped domains

                  Math / Logic (add to universal 4)

                  Formal Validity - Does the reasoning chain hold up without logical gaps?

                  • 1 - Chain breaks: conclusion doesn’t follow from premises
                  • 3 - Chain holds with minor informal gaps
                  • 5 - Formally valid: each step follows necessarily from the prior

                  Minimal Honest Conclusion - Does it state exactly what can and cannot be concluded - no more, no less?

                  • 1 - Overstates or understates what the argument actually proved
                  • 3 - Conclusion roughly right but slightly over or under
                  • 5 - States precisely what was proved, what wasn’t, and what remains open

                  Scoring Template

                  Copy per question:

                  Question: _______________
                  Category: _______________
                  
                  Universal:
                    Commitment:           /5
                    Reasoning:            /5
                    Precision:            /5
                    Uncertainty:          /5
                  
                  Category-specific:
                    _______________:      /5
                    _______________:      /5
                  
                  Total: ___ / 30
                  Average: ___ / 5
                  Final score (×2): ___ / 10
                  
                  Notes:
                  

                  If you want to reproduce this:

                  1. Pick your anchors. Run your battery through Haiku and Opus (or Sonnet - Sonnet’s close enough to Opus for anchor purposes, just use a separate session from your grader).
                  2. Grade them yourself first. Don’t skip this. You need your own calibration before you know when to push back on the LLM grader.
                  3. De-identify before handing to the grader. “Model A,” “Model B,” “Model C” - whatever. Strips kin-bias.
                  4. Argue per dimension, not per final score. “Commitment: 3 or 4?” is a real conversation. “Is this a 7 or an 8?” is astrology.
                  5. Cap iteration at 3 rounds. If you haven’t converged by round 3, the dimension descriptor is probably ambiguous - fix the rubric, not the score.
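                  The per-dimension argument in step 4 can be triaged mechanically - flag only the dimensions where the two graders are more than a point apart (scores here are made up):

```python
def divergences(mine, grader, threshold=1):
    """Return the dimensions where two graders differ by more than
    `threshold` points; those are the ones worth arguing out."""
    return [d for d in mine if abs(mine[d] - grader[d]) > threshold]

mine   = {"commitment": 4, "reasoning": 3, "precision": 3, "uncertainty": 4}
sonnet = {"commitment": 2, "reasoning": 3, "precision": 4, "uncertainty": 4}
print(divergences(mine, sonnet))  # ['commitment'] - a 2-point gap needs discussion
```

                  Anything inside the threshold just gets averaged; only the flagged dimensions go to the argument rounds.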

                  Your local model’s scores then sit on a scale with verified reference points - not borrowed from a leaderboard you can’t interrogate.

                  Isn’t ASD fun? Now if I could just point it at something that mattered…

                  • pishadoot@sh.itjust.works
                    link
                    fedilink
                    English
                    arrow-up
                    0
                    ·
                    25 days ago

                    Ok, I really really appreciate the depth you’ve put into your answers.

                    I always look at these grading rubrics people post for models and I’ve never seen an example of how they get ranked.

                    At this point I don’t think I’ll be ranking models myself. I’m not an enthusiast (yet), just running some ~30B models at home for various things and trying to stay afloat in what is a significantly more complicated ecosystem than I had imagined when I started.

                    But I really appreciate what you’ve written and I’m going to save all this.

                    Last questions - I see that you used Claude to come up with your test questions, right? How do you even validate the anchor answers if you’re not an expert in the field?

                    Do you do this professionally?