overview for diz

Google's Gemini 2.5 pro is out of beta. by diz in c/[email protected]

[-] [email protected] 11 points 1 week ago

Thing is, it has tool integration. Half of the time it uses python to calculate it. If it uses a tool, that means it writes a string that isn't shown to the user, which runs the tool, and tool results are appended to the stream.
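To make that flow concrete, here is a rough sketch of such a loop. Everything in it is made up for illustration (the `<tool>` / `<result>` marker format, `run_tool`, `fake_model`, `answer`); it is not how Gemini is actually wired, just the general shape: a hidden tool string goes out, the tool result gets appended to the stream.

```python
import re

# Hypothetical marker format; real products use their own special tokens.
TOOL_CALL = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

def run_tool(expression: str) -> str:
    """Stand-in for a sandboxed Python tool: only evaluates plain arithmetic."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        raise ValueError("not plain arithmetic")
    return str(eval(expression, {"__builtins__": {}}, {}))

def fake_model(stream: str) -> str:
    """Stand-in for the LLM: asks for the tool once, then answers from the result."""
    if "<result>" not in stream:
        return "<tool>123456 * 654321</tool>"
    result = re.search(r"<result>(.*?)</result>", stream).group(1)
    return f"The product is {result}."

def answer(prompt: str) -> tuple[str, str]:
    """Return (full stream including hidden tool tokens, text shown to the user)."""
    stream = prompt
    while True:
        out = fake_model(stream)
        stream += "\n" + out
        call = TOOL_CALL.search(out)
        if call is None:
            return stream, out  # no tool call: this is the visible reply
        # The tool call is hidden from the user; its result is appended to the stream.
        stream += f"\n<result>{run_tool(call.group(1))}</result>"

stream, visible = answer("What is 123456 * 654321? Be precise.")
```

In a setup like this, a claim of "I used a calculator" is only backed up if the stream actually contains the tool tokens, which is exactly the kind of thing the model could be conditioning on.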

What is curious is that instead of a request for precision (or just any request to do math) causing it to use the tool, and then the presence of the tool tokens causing it to claim that a tool was used, the request for precision causes it to claim that a tool was used, directly.

Also, all of it is highly unnatural text, so it is coming either from fine-tuning or from training data contamination.

Google's Gemini 2.5 pro is out of beta. by diz in c/[email protected]

[-] [email protected] 13 points 1 week ago* (last edited 1 week ago)

misinterpreted as deliberate lying by ai doomers.

I actually disagree. I think they correctly interpret it as deliberate lying, but they misattribute the intent to the LLM rather than to the company making it (and its employees).

edit: it's like you are watching TV, the ads come on, and you say that a very, very flat demon who lives in the TV is lying, because the bargain with the demon is that you get to watch entertaining content in exchange for having to listen to its lies. That's fundamentally correct about the lying, just not about the very flat demon.

Google's Gemini 2.5 pro is out of beta. by diz in c/[email protected]

[-] [email protected] 12 points 1 week ago* (last edited 1 week ago)

Hmm, fair point, it could be training data contamination / model collapse.

It's curious that it is a lot better at converting free-form requests for accuracy into assurances that it used a tool than into actually using a tool.

And when it uses a tool, it has a bunch of fixed form tokens in the log. It's a much more difficult language processing task to assure me that it used a tool conditionally on my free form, indirect implication that the result needs to be accurate, than to assure me it used a tool conditionally on actual tool use.

The human equivalent to this is "pathological lying", not "bullshitting". I think a good term for this is "lying sack of shit", with the "sack of shit" specifying that "lying" makes no claim of any internal motivations or the like.

edit: also, testing it on 2.5 flash, it is quite curious: https://g.co/gemini/share/ea3f8b67370d . I did that sort of query several times and it follows the same pattern: it doesn't use a calculator, it assures me the result is accurate, if asked again it uses a calculator, if asked if the numbers are equal it says they are not, if asked which one is correct it picks the last one and argues that the last one actually used a calculator. I haven't ever managed to get it to output a correct result and then follow up with an incorrect result.

edit: If I use the wording "use an external calculator", it gives a correct result, and then I can't get it to produce an incorrect result to see whether it just picks the last result as correct or not.

I think this is lying without scare quotes, because it is a product of Google putting a lot more effort into trying to exploit the ELIZA effect to convince you that it is intelligent than into actually making a useful tool. It, of course, doesn't have any intent, but Google and its employees do.

Google's Gemini 2.5 pro is out of beta. by diz in c/[email protected]

[-] [email protected] 12 points 1 week ago

That's why I say "sack of shit" and not "bastard".

Google's Gemini 2.5 pro is out of beta. by diz in c/[email protected]

[-] [email protected] 13 points 1 week ago* (last edited 1 week ago)

The funny thing is, even though I wouldn't expect it to be, it is still a lot more arithmetically sound than whatever it is that is going on with it claiming to use a code interpreter and a calculator to double check the result.

It is OK (7 out of 12 correct digits) at being a calculator and it is awesome at being a lying sack of shit.
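For what it's worth, a count like "7 out of 12 correct digits" can be checked mechanically. A small sketch, where the operands and the "claimed" answer are made up for illustration (they are not the numbers from my actual chats), and which assumes both numbers have the same number of digits:

```python
def matching_leading_digits(claimed: int, exact: int) -> int:
    """Count how many digits agree, from the left, before the first mismatch."""
    count = 0
    for a, b in zip(str(abs(claimed)), str(abs(exact))):
        if a != b:
            break
        count += 1
    return count

# Made-up operands; Python integers are exact, so this is the reference value.
exact = 123456 * 7654321

# Made-up "model answer": the first 7 digits of the exact result, then junk.
claimed = int(str(exact)[:7] + "12345")

correct = matching_leading_digits(claimed, exact)
total = len(str(exact))
print(f"{correct} out of {total} correct digits")
```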

Wake up babe, new "in this moment I am enlightened" copypasta just dropped by diz in c/[email protected]

[-] [email protected] 13 points 2 weeks ago

Maybe he didn't read Dune, he just had AI summarize it.

OpenAI engineers are flocking to its rival Anthropic. “They let us huff our own farts,” says one by diz in c/[email protected]

[-] [email protected] 12 points 3 weeks ago

making LLMs not say racist shit

That is so 2024. The new big thing is making LLMs say racist shit.

Where Scoot makes the case about how an AGI could build an army of terminators in a year if it wanted. by diz in c/[email protected]

[-] [email protected] 13 points 1 month ago* (last edited 1 month ago)

It is as if there were people fantasizing about automaton mouths and lips and tongues and vocal cords for some reason, and coming up with all these fantasies of how it'll be when automatons can talk.

And then Edison invents the phonograph.

And then they stick their you know what in the gearing between the cylinder and the screw.

Except somehow more stupid, because these guys are worried about AI apocalypse while boosting AI hype that pays for this supposed apocalypse.

edit: If someone said in the 1850s "automatons won't be able to talk for another 150 years or longer because the vocal tract is too intricate", and some automaton fetishist said that they will be able to talk in 20 years, the phonograph shouldn't lend any credence whatsoever to the latter. What is different this time is that the phonograph was genuinely extremely useful for what it is, while generative AI is not quite as useful and they're going for the automaton fetishist money.

Latest AI-hallucinated legal filing, from AI vendor Anthropic by diz in c/[email protected]

[-] [email protected] 12 points 1 month ago* (last edited 1 month ago)

When confronted with a problem like “your search engine imagined a case and cited it”, the next step is to wonder what else it might be making up, not to just quickly slap a bit of tape over the obvious immediate problem and declare everything to be great.

Exactly. Even if you ensure the cited cases or articles are real, it will misrepresent what said articles say.

Fundamentally it is just blah-blah-blahing until the point comes when a citation would be likely to appear, then it blah-blah-blahs the citation based on the preceding text that it just made up. It plain should not be producing real citations. That it can produce real citations is deeply at odds with it being able to pretend at reasoning, for example.

Ensuring the citation is real, RAG-ing the articles in there, having AI rewrite drafts, none of these hacks do anything to address any of the underlying problems.

Gemini seem to have "solved" my duck river crossing, lol. by diz in c/[email protected]

[-] [email protected] 11 points 2 months ago* (last edited 2 months ago)

Yeah I think the best examples are everyday problems that people solve all the time but don't explicitly write out solutions step by step for, or not in the puzzle-answer form.

It's not even a novel problem at all, I'm sure there are even plenty of descriptions of solutions to it as part of stories and such. Just not as "logical puzzles", due to triviality.

What really annoys me is when they claim high performance on benchmarks consisting of fairly difficult problems. This is basically fraud, since they know full well it is still entirely "knowledge" reliant, and even take steps to augment it with generated problems and solutions.

I guess the big sell is that it could use bits and pieces of logic gleaned from other solutions to solve a "new" problem. Except it cannot.

MIT review selling a horrifying dystopia where an AI will monitor your rectum 24/7 and you repair your own fridge using AR glasses and haptics or something by diz in c/[email protected]

[-] [email protected] 13 points 8 months ago

I seriously wonder, do any of the folks with the "AR glasses to assist repair" thing ever actually repair anything, or do they get their ideas of how you repair stuff from computer games?

[long] Some tests of how much AI "understands" what it says (spoiler: very little) by diz in c/[email protected]

[-] [email protected] 12 points 1 year ago

I feel like letter counting and other letter manipulation problems kind of under-sell the underlying failure to count - LLMs work on tokens, not letters, so they are expected to have difficulty with letters.
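A toy illustration of the tokens-vs-letters point. The vocabulary and ids here are invented, and greedy longest-match is a simplification of what real tokenizers (BPE and friends) do, but the effect is the same: the model is handed a couple of opaque ids, not a sequence of letters.

```python
# Made-up vocabulary and ids; real tokenizers have tens of thousands of entries.
VOCAB = {
    "straw": 101, "berry": 102,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match split into token ids (simplified on purpose)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

word = "strawberry"
ids = tokenize(word)             # what the model gets: two opaque ids
letter_count = word.count("r")   # what a letter-counting question is about
print(ids, letter_count)
```

Nothing in the ids says how many "r"s are inside each token, so letter questions test memorized spelling trivia more than they test counting.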

The inability to count is of course wholly general - in a river crossing puzzle an LLM cannot keep track of what's on either side of the river, for example, and sometimes misreports how many steps it has output.
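For contrast, the bookkeeping involved is trivial to do explicitly. A minimal sketch (the function name, the move format and the example puzzle are my own invention for illustration) that replays a list of crossings and reports what is on each bank and how many steps there were:

```python
def apply_moves(items, moves):
    """Replay crossings and return (left bank, right bank, step count).

    Each move is the collection of things in the boat for one crossing.
    The boat starts on the left bank and alternates direction every crossing.
    """
    left, right = set(items), set()
    boat_on_left = True
    for step, cargo in enumerate(moves, start=1):
        cargo = set(cargo)
        src, dst = (left, right) if boat_on_left else (right, left)
        if not cargo <= src:
            missing = sorted(cargo - src)
            raise ValueError(f"step {step}: {missing} not on the departing bank")
        src -= cargo   # in-place: removes the cargo from the departing bank
        dst |= cargo   # in-place: adds the cargo to the arriving bank
        boat_on_left = not boat_on_left
    return left, right, len(moves)

# Made-up trivial puzzle: a farmer and a duck, one crossing.
left, right, steps = apply_moves({"farmer", "duck"}, [{"farmer", "duck"}])
print(sorted(left), sorted(right), steps)
```

Two sets and a counter are all the state there is, which is what makes it notable when the output loses track of it.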