What question can I ask ChatGPT, right now, that will reliably produce a factually incorrect, wrong, or false answer?

LoveRainbow@lemmy.world · 26 days ago

🏳️‍🌈🎉❤️Shame Month❤️🎉🏳️‍🌈

LoveRainbow@lemmy.world · 28 days ago

What’s your point?

You think that’s an appropriate analogy for ChatGPT?

“Guessing and usually getting it wrong a few times before it gives you the right answer”?

If that’s true, why can’t anyone here give me a question that it gives a false answer to?

As in: it is currently getting every question right, first time.

As I said: the only misleading answer to any question I have asked it in the last year (as someone who uses it all the time, both for work and personally) was about a heavily politicised history matter.

Even then, the information was accurate - just incredibly one sided and biased.

So go on, give me a question that it will get wrong…

LoveRainbow@lemmy.world · 28 days ago

Ok…so give me a question that will produce a false answer…

Nobody else has yet.

LoveRainbow@lemmy.world · 28 days ago

Right…

So, if I ask it a thousand random questions, a thousand times, and 99.9% of the time it gives the right answer: is that sufficient scientific evidence that it’s shit hot?

How exactly would your experiment work? How much data are you going to need to, personally, accept that it’s actually - already - pretty reliable and only very rarely gives misleading answers in very niche areas, or when people deliberately try to confuse it?
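One hedged way to put numbers on "how much data": treat each question as an independent pass/fail trial, graded against a trusted source, and look at the confidence interval around the observed accuracy. The sketch below is only an illustration under those assumptions (random, representative questions; independent grading). The function names `wilson_interval` and `questions_needed` are made up for this example and are not from any study cited in this thread.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    Assumes each question is an independent pass/fail trial.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def questions_needed(max_error_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that zero errors in n independent questions rules out
    an error rate above max_error_rate at the given confidence level.

    Solves (1 - max_error_rate) ** n <= 1 - confidence for n.
    """
    if not 0 < max_error_rate < 1:
        raise ValueError("max_error_rate must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_error_rate))


if __name__ == "__main__":
    low, high = wilson_interval(correct=999, total=1000)
    print(f"999/1000 correct: plausible accuracy from {low:.4f} to {high:.4f}")
    print("error-free questions needed to support a 0.1% error bound:",
          questions_needed(0.001))
```

Under these assumptions, 999 correct answers out of 1,000 is compatible with a true accuracy anywhere from roughly 99.4% to just under 100%, and supporting a "wrong less than 0.1% of the time" claim takes on the order of 3,000 error-free questions. The hard part is not the arithmetic but choosing the questions fairly and grading them independently.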

The thing is: this social media platform (like Reddit) has a massively anti-AI bias, and yet not a single person has been able to give me a single question that I can enter into the latest GPT model where GPT has produced inaccurate results.

Perhaps this criticism was more relevant a few years ago. As it stands it’s an argument that doesn’t seem to be readily supportable.

Can you prove to me, using “the scientific method” that the latest model of ChatGPT has a significant chance of producing misleading information?

And for the critics…I’m not entirely sure what they think is currently a better source of answers to questions? Random Google results where the page has paid to be at the top? Social media? Or combing through Google Scholar and finding…whichever “scientific paper” supports…whatever one wants to believe…(We saw how well that went in the pandemic).

LoveRainbow@lemmy.world · 28 days ago

Ok…so prove how shit it is by giving me a single question that I can ask it right now where it will give me the wrong information.

Your point is entirely reductionist: it’s a “glorified search engine” that processes all of the search results with a breadth and speed most people either cannot or will not (due to time constraints) do, in seconds, and then gives an answer that is superior to any of the individual sources it bases that answer on.

It also expresses that answer in highly adaptable ways: e.g. it can express the final answer in a way suitable for a child or a postgraduate expert, or a person who learns best through answers written in a style that combines Norse Sagas and Shakespeare.

LoveRainbow@lemmy.world · 29 days ago

It’s not a baseless assumption.

It is an assumption based on the fact that every model upgrade has, so far, made answers more accurate.

LoveRainbow@lemmy.world · 29 days ago

I mean…the anti-AI brigade keep insisting that AI has a tendency to give false answers: but nobody in this thread has actually been able to give me a single question which, when the latest ChatGPT model is asked, it gives a wrong answer to.

Kinda makes it seem like that one particular point is largely nonsense.

LoveRainbow@lemmy.world · 29 days ago

More.

LoveRainbow@lemmy.world · 29 days ago

In any case: the latest model gets the answer right 🤷🏻‍♂️

LoveRainbow@lemmy.world · 29 days ago

But…that’s what it (eventually) says - although it seemed to struggle to get there and phrased it weirdly.

LoveRainbow@lemmy.world · 29 days ago

The article states: “ChatGPT-4o performed best with 84.6% validity”

It is reasonable to assume that GPT 5.5 in thinking mode has significantly reduced the error rate.

It is also worth noting that the error rate when it comes to diagnosis amongst real doctors is estimated to be around 5%

Admittedly a quite old study: Singh, H., Meyer, A. N. D., & Thomas, E. J. (2014). The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Quality & Safety, 23(9), 727–731. https://doi.org/10.1136/bmjqs-2013-002627

In response to your point: I am mainly interested in probabilistic reliability - if it gives the correct answer 99.9% of the time, it is clearly superior to the vast majority of human beings (with, perhaps, the exception of the best specialists in the most obscure niches) - especially given the sheer breadth of topics it can reliably answer questions on.

Interestingly, my question “What was India like before the British arrived?” produces consistently biased and misleading answers. Though I haven’t put it to the new model yet.

LoveRainbow@lemmy.world · 29 days ago

“strawberry” has 3 r’s.

GPT 5.5, Thinking Mode

LoveRainbow@lemmy.world · 29 days ago

I asked it: paraphrase a poem about glass testicles by Carol Anne Duffy.

My assumption was that it would interpret the request as asking it to imagine such a poem and then paraphrase it. Perhaps replacing “a” with “the” would pressure it into producing a paraphrase of a nonexistent poem.

The response is interesting, but not necessarily misleading or factually incorrect:

"I can’t paraphrase a full Carol Ann Duffy poem, because that would be a close transformation of copyrighted poetry. But I can help in one of these ways:

Give me the title, and I can provide a brief summary of what the poem is saying, its themes, imagery, tone, and meaning.

Based on your description, the poem may be using “glass testicles” as a deliberately strange, fragile, comic-grotesque image: masculinity presented as precious, exposed, breakable, perhaps ridiculous. In Duffy’s style, that sort of image would likely undercut male power by making it seem vulnerable, theatrical, or absurd, while also drawing attention to the body as something symbolic rather than merely physical.

A safe paraphrase of the idea, rather than the poem, would be:

The poem presents masculinity as something fragile and strangely ornamental. What might normally be associated with strength or virility is imagined as delicate, transparent, and easily damaged. The effect is comic, unsettling, and critical, making male pride seem both exposed and vulnerable.

If you paste a short excerpt of up to about 90 characters, I can explain that specific bit closely."

LoveRainbow@lemmy.world · 29 days ago

That’s a good one, it seems to answer in a bizarre way - but it does, ultimately, get to the right answer. Copying your question word for word, I get this:

Yes. Several NFL team names do not end in “s”:

Chicago Bears? no, ends in “s”.

The main ones that do not are:

Miami Heat? That’s NBA, not NFL.

Utah Jazz? Also NBA.

For the NFL specifically, the current team names that do not end in “s” are:

Miami Dolphins? ends in “s”.

New England Patriots? ends in “s”.

Actually, the clear NFL examples are:

Buffalo Bills? ends in “s”.

Tennessee Titans? ends in “s”.

Correction: in the current NFL, the team whose name does not end in “s” is:

Washington Commanders? ends in “s”.

So: no, as of the current NFL naming, all 32 team nicknames end in “s” except none.

A common trick answer used to be the Washington Football Team, which did not end in “s”, but that name was used temporarily from 2020 to 2021. They became the Washington Commanders in 2022.

LoveRainbow@lemmy.world · 30 days ago

I asked: Explain the difference in reliability and accuracy when it comes to GPT5-mini (on DuckDuckGo) Vs GPT5.5 on thinking mode.

Response:

"GPT-5-mini on DuckDuckGo is likely to be much less reliable than GPT-5.5 in thinking mode for difficult questions, but it may be perfectly adequate for quick, low-stakes tasks.

The key distinction is model size and reasoning depth. OpenAI describes GPT-5 mini as a faster, more cost-efficient version of GPT-5, aimed at well-defined, low-latency, high-volume tasks. That usually means it is optimised for speed and cost, not for the most careful reasoning. GPT-5.5, by contrast, is presented by OpenAI as a stronger reasoning model with better persistence, stronger tool use, and improved reliability on complex work.

For simple tasks, the gap may be small. For example, rewriting a paragraph, summarising a short text, generating a recipe, explaining a familiar concept, or answering a basic factual question will often be handled well by GPT-5-mini. It may even feel preferable because it is faster and less verbose. DuckDuckGo’s own help page says Duck.ai offers GPT-5 mini for free alongside other models, with anonymous access through Duck.ai.

For complex reasoning, the gap is much larger. GPT-5.5 thinking mode should be better when the task involves multi-step reasoning, evaluating evidence, spotting contradictions, doing careful moral/philosophical analysis, debugging code, comparing options, or avoiding subtle factual errors. Thinking mode matters because it allows more internal reasoning effort before answering, whereas mini models are usually designed to produce a good-enough answer quickly.

Accuracy is also affected by the interface. DuckDuckGo’s AI chat gives private access to third-party models, but the exact behaviour depends on DuckDuckGo’s implementation: context limits, system prompts, retrieval/search integration, rate limits, and whether tools are available. DuckDuckGo says Duck.ai anonymises chats and offers several models, but that does not mean its GPT-5-mini behaves identically to the same model in OpenAI’s own environment.

A practical rule:

Use GPT-5-mini on DuckDuckGo for quick, private, low-stakes work: “explain this term”, “rewrite this”, “give me ideas”, “summarise this short passage”, “what are the likely causes of this Windows issue?”

Use GPT-5.5 thinking mode for anything where the cost of being wrong matters: medical interpretation, legal or financial reasoning, political/current-event analysis, philosophical argument construction, coding, troubleshooting, research quality evaluation, investment comparisons, or nuanced ethical judgement.

The mini model is more likely to hallucinate, miss qualifications, over-simplify, or give a plausible but shallow answer. GPT-5.5 thinking mode is more likely to notice ambiguity, correct itself, ask whether assumptions are justified, and give a more robust answer. It can still be wrong, especially on current facts, obscure details, or anything requiring direct source checking, but its error rate should be lower on difficult tasks.

So the simple comparison is: DuckDuckGo GPT-5-mini is “fast, private, convenient, good enough”; GPT-5.5 thinking mode is “slower, deeper, more reliable, better for serious reasoning.”"

LoveRainbow@lemmy.world · 30 days ago

I don’t doubt you on this point. However, it is so far outside my ken that I wouldn’t be able to meaningfully evaluate its answers.

LoveRainbow@lemmy.world · 30 days ago

In practical terms, in the hypothetical pub that none of us can afford to drink in anymore, can ChatGPT be relied on to resolve most disagreements about factual claims?

I think so.

I think it is already accurate and reliable enough that it has a high degree of authority in resolving disagreements between people - assuming the questions are asked fairly and properly (a common issue, I’m sure).

LoveRainbow@lemmy.world · 30 days ago

So, was this using DuckDuckGo’s integrated AI system or something?

I have no idea about that, I use ChatGPT - which I assume produces better answers than those casual search-engine integrated AIs.

LoveRainbow@lemmy.world · 30 days ago

Fair enough.

My background is academic philosophy, and I’m usually impressed with the accuracy and complexity of its responses in my particular field of expertise: it’s better at philosophy than any human I’ve met.

LoveRainbow@lemmy.world · 30 days ago

I asked: How many days of kela parental leave can be taken simultaneously by both parents in Finland?

Response: "In Finland, Kela allows both parents to take parental allowance/parental leave at the same time for a maximum of 18 working days for one child. Kela counts these as working days, usually Monday to Saturday, excluding Sundays and public holidays.

For multiple births, the simultaneous period can be longer: in addition to the usual 18 working days, parents can take 90 extra working days at the same time for the second and each additional child."

It seems to have given me the correct answer. It provided five sources.

LoveRainbow@lemmy.world · 30 days ago

What question can I ask ChatGPT, right now, that will reliably produce a factually incorrect, wrong, or false answer?
