That's the second model announcement in a row by the major LLM vendor where the supposed advantage over the current state of the art is presented as... better vibes. He actually doesn't even call the output good, just successfully metafictional.
Meanwhile, over at Anthropic, Dario just declared that we're about 12 months away from all written computer code being AI-generated, with 90% of all code by the summer.
This is not a serious industry.
Claude's system prompt leaked at one point. It was a whopping 15K words, and it included a directive that if it were asked a math question it can't do in its head (or some very similar language), it should forward the question to the calculator module.
Just tried it, Sonnet 4 got even fewer digits right:
425,808 × 547,958 = 233,325,693,264
(correct is 233,324,900,064)
I'd love to see benchmarks on exactly how bad LLMs are at numbers, since I'm assuming there's very little useful syntactic information you can encode in a word embedding that corresponds to a number. I know RAG was notoriously bad at matching facts with their proper year, for instance. Using an LLM as a shopping assistant ("ChatGPT, what's the best 2K monitor for less than $500 made after 2020?") is an incredibly obvious use case, yet the CEOs who love to claim that such-and-such profession will be done as a human endeavor by next Tuesday after lunch won't even allude to it.
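For anyone who wants to check the multiplication quoted above, here is a minimal Python sketch. It computes the exact product with Python's arbitrary-precision integers and counts how many leading digits of the model's answer match. The helper name `leading_digits_correct` and the idea of scoring by matching leading digits are my own assumptions for illustration, not an established benchmark metric.

```python
def leading_digits_correct(claimed: int, exact: int) -> int:
    """Count how many leading digits of `claimed` match `exact`.

    Hypothetical scoring helper: one possible way a benchmark
    could grade an arithmetic answer, not a standard metric.
    """
    count = 0
    for a, b in zip(str(claimed), str(exact)):
        if a != b:
            break
        count += 1
    return count


a, b = 425_808, 547_958
exact = a * b                  # Python ints are arbitrary precision, so this is exact
claimed = 233_325_693_264      # the model's answer quoted in the thread

print(f"{exact:,}")            # 233,324,900,064
matched = leading_digits_correct(claimed, exact)
print(f"{matched} of {len(str(exact))} leading digits match")  # 5 of 12 leading digits match
```

So the answer looks plausible at a glance (right length, right first five digits) and is wrong from the sixth digit on.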