this post was submitted on 01 Jun 2026
-14 points (31.6% liked)
Asklemmy
The article states: "ChatGPT-4o performed best with 84.6% validity"
It is reasonable to assume that GPT 5.5 in thinking mode has significantly reduced the error rate.
It is also worth noting that the diagnostic error rate among real doctors is estimated to be around 5%.
Admittedly, quite an old study: Singh, H., Meyer, A. N. D., & Thomas, E. J. (2014). The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Quality & Safety, 23(9), 727–731. https://doi.org/10.1136/bmjqs-2013-002627
In response to your point: I am mainly interested in probabilistic reliability - if it gives the correct answer 99.9% of the time, it is clearly superior to the vast majority of human beings (with, perhaps, the exception of the best specialists in the most obscure niches) - especially given the sheer breadth of topics it can reliably answer questions on.
Interestingly, my question "What was India like before the British arrived?" produces consistently biased and misleading answers, though I haven't tried it on the new model.
I am sorry to burst the bubble, but that assumption has no basis outside of marketing. GPT models have been sold as having "PhD-" or "MD-level intelligence" since GPT-3. Anecdotally, recent models have been improving in some areas but regressing in others. "Frontier models" have incredibly opaque performance and safety benchmarks, and as time goes on, more and more of the training data is LLM-generated and less and less comes from humans, so models start breaking down.
Again, nowhere near the actual accuracy of current models. It is a big jump from 85% (wrong >1/10 of the time) to 99.9% (wrong 1 in 1000 times). At best it would barely break 90%, which is still 1 in 10.
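To make that gap concrete, here is a quick Python sketch that converts an accuracy figure into "wrong about 1 in N answers". The helper name `one_in_n` is made up for illustration. The 84.6% figure is the one quoted from the article, 95% corresponds to the roughly 5% doctor error rate cited above, and 90% and 99.9% are just the hypotheticals from this thread, not measured results.

```python
def one_in_n(accuracy: float) -> float:
    """Return N such that the error rate is roughly 1 in N.

    `accuracy` is a fraction between 0 and 1 (exclusive of 1).
    """
    error_rate = 1.0 - accuracy
    return 1.0 / error_rate


# Figures discussed in the thread; only the first is from the article.
figures = [
    ("ChatGPT-4o (article)", 0.846),
    ("hypothetical 90%", 0.90),
    ("doctors, ~5% error (Singh et al.)", 0.95),
    ("hypothetical 99.9%", 0.999),
]

for label, accuracy in figures:
    print(f"{label}: wrong about 1 in {one_in_n(accuracy):.1f}")
```

Going from 84.6% to 99.9% means cutting the error rate from about 1 in 6.5 to 1 in 1000, which is roughly a 150-fold reduction, not an incremental improvement.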
An LLM's knowledge, its "intelligence", is its training data, nothing more, nothing less. Its scope, or "purpose", is its context/prompt, nothing more, nothing less. That means answering the question through the lens of British colonialism, based on a corpus of mostly "white history". I bet that if you ask the same question using a timeframe (e.g. "before the 14th century") and don't use the word "British", you'll get a slightly less biased, but still biased, answer.
It's not a baseless assumption.
It is an assumption based on the fact that every model upgrade has, so far, made answers more accurate.