I think this article does a good job of asking the question "what are we really measuring when we talk about LLM accuracy?" If you judge an LLM by its hallucinations, its ability to analyze images, its ability to critically analyze text, etc., you're going to see low scores for all LLMs.
The only metric an LLM should excel at is "did it generate human-readable and contextually relevant text?" I think we've all forgotten the humble origins of "AI" chat bots. They often struggled to generate anything more than a few sentences of relevant text. They often made syntactical errors. Modern LLMs have solved these issues quite well. They can produce long-form content which is coherent and syntactically error-free.
However, the content comes with no guarantee of being accurate or critically meaningful. Whilst it is often critically meaningful, it is certainly capable of half-assed answers that dodge difficult questions. LLMs are approaching 95% "accuracy" if you think of them as good human text fakers. They are pretty impressive at that. But people keep expecting them to do their math homework, analyze contracts, and generate perfectly valid content. They just aren't built to do that. We work really hard just to keep them from hallucinating as much as they do.
I think the desperation to see these things essentially become indistinguishable from humans is causing us to lose sight of the real progress that's been made. We're probably going to hit a wall with this method. But this breakthrough has made AI a viable technology for a lot of jobs. So it's definitely a breakthrough. I just think either infinitely larger models (for which we can't seem to generate the data) or new models will be required to leap to the next level.