submitted 16 hours ago by fubarx@lemmy.world to c/technology@lemmy.world

A screenshot of this question was making the rounds last week, but this article covers testing it against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

[-] Bluewing@lemmy.world 4 points 28 minutes ago

I just asked Google Gemini 3 "The car is 50 miles away. Should I walk or drive?"

In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled "Recovery: 3 days of ice baths and regret."

And under reasons to walk, "You are a character in a post-apocalyptic novel."

Methinks I detect notes of sarcasm...

[-] myfunnyaccountname@lemmy.zip 4 points 20 minutes ago

There are a lot of humans that would fail this as well. Just sayin.

[-] eronth@lemmy.world 1 points 3 minutes ago

Yeah I straight up misread the question, so I would have gotten it wrong.

[-] TankovayaDiviziya@lemmy.world 6 points 1 hour ago

We poked fun at this meme, but it goes to show that the LLM is still like a child that needs to be taught to make implicit assumptions and possess contextual knowledge. The current generation of LLMs needs a lot more input and instruction to do specifically what you want, like a child.

[-] rob_t_firefly@lemmy.world 2 points 4 minutes ago* (last edited 3 minutes ago)

LLMs are not children. Children can have experiences, learn things, know things, and grow. Spicy autocomplete will never actually do any of these things.

[-] kshade@lemmy.world 1 points 17 minutes ago

We have already thrown just about all of the Internet and then some at them. It shows that LLMs cannot think or reason, which isn't surprising: they weren't meant to.

[-] eronth@lemmy.world 1 points 2 minutes ago

Or at least they can't reason the way we do about our physical world.

[-] prole@lemmy.blahaj.zone 4 points 1 hour ago

I'm sure it'll be worth it at some point 🙄

[-] melsaskca@lemmy.ca 2 points 1 hour ago

I don't use AI but read a lot about it. I now want to google how it attacks the trolley problem.

[-] vane@lemmy.world 13 points 4 hours ago

I want to wash my train. The train wash is 50 meters away. Should I walk or drive?

[-] SkaveRat@discuss.tchncs.de 14 points 4 hours ago

Fly, you fool

[-] FatVegan@leminal.space 3 points 4 hours ago

100 Chinese people can lay approximately 30m of track a day

[-] Fmstrat@lemmy.world 1 points 2 hours ago

Qwen3 feels left out. All the 30B models I have failed the test.

[-] SuspciousCarrot78@lemmy.world 2 points 1 hour ago* (last edited 1 hour ago)

Qwen3-4B HIVEMIND (abliterated) got it in 2, though it scores a lot higher on PIQA, HellaSwag and Winogrande benchmarks than normal Qwen3-30B. I think the new abliteration methods actually strengthen real world understanding.

https://imgur.com/a/7YZme4i

https://imgur.com/a/25ApzDN

I wonder if an abliterated VL model could do even better? They tend to have the best real world model benchmarks. Perhaps a Qwen3-VL-30B ablit (if such a thing exists) could one shot this.

I'd like to think a lot of these gotcha prompts rely on verbal misunderstanding, rather than failure in world models, but I can't say that for certain.

PS: Saw a pearler of a response to this: ChatGPT recommended "yeah, lift the car and carry it on your back. Make sure to bend your knees" (though I'm guessing someone edited that for the lulz).

[-] imetators@lemmy.dbzer0.com 15 points 6 hours ago

Went to test Google AI first and it says "You can't wash your car at a carwash if it is parked at home, dummy"

ChatGPT and DeepSeek say it is dumb to drive because it is fuel inefficient.

I am honestly surprised that google AI got it right.

[-] rumba@lemmy.zip 55 points 6 hours ago

They probably added a system guardrail as soon as they heard about this test. It's been going around for a while now :)

[-] 73ms@sopuli.xyz 1 points 3 hours ago

Did this say whether the reasoning models get this right more than the others? Was curious about that but missed it if it was mentioned.

[-] tover153@lemmy.world 6 points 5 hours ago

After the LLM I use most got it wrong:

Me: You can't wash your car if it isn't there.

Ah. Yes. That is an excellent and devastatingly practical correction.

In that case: drive. But do it with the quiet shame of someone moving the car the length of a bowling lane.

This is one of those situations where logic beats virtue. The car must be present to be washed. Walking was philosophically elegant but mechanically flawed.

So:

Start engine.

Travel 50 meters.

Avoid eye contact with pedestrians.

Commit fully.

You are not lazy. You are complying with system requirements.

[-] teft@piefed.social 1 points 24 minutes ago

"You are not lazy. You are complying with system requirements."

How does this AI know me so well?

[-] SaltySalamander@fedia.io 2 points 1 hour ago

"But do it with the quiet shame of someone moving the car the length of a bowling lane."

A bowling lane is a bit over 18 meters. =)

[-] Slashme@lemmy.world 44 points 8 hours ago

The most common pushback on the car wash test: "Humans would fail this too."

Fair point. We didn't have data either way. So we partnered with Rapidata to find out. They ran the exact same question with the same forced choice between "drive" and "walk," no additional context, past 10,000 real people through their human feedback platform.

71.5% said drive.

So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽
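For anyone who wants to sanity-check the "almost 3 in 10" figure, here is a rough sketch of the arithmetic. It assumes the sample was exactly 10,000 respondents (the article only says "10,000 real people"), and the Wilson interval is my own addition, not something the article reports. The helper name `wilson_interval` is just for illustration.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Assumption: exactly 10,000 respondents, 71.5% of whom answered "drive".
n = 10_000
drive = round(0.715 * n)   # respondents who said "drive"
walk = n - drive           # respondents who said "walk"

low, high = wilson_interval(drive, n)
print(f"drive: {drive}, walk: {walk} ({walk / n:.1%})")
print(f"share answering 'drive', approx. 95% interval: {low:.3f} to {high:.3f}")
```

With a sample that size the interval is narrow, so sampling noise alone would not explain the gap; it says nothing about respondents answering carelessly or as a joke.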

[-] snooggums@piefed.world 2 points 31 minutes ago

Have you seen the results of elections?

[-] bluesheep@sh.itjust.works 5 points 3 hours ago

I saw that and hoped it is because of the dead Internet theory. At least I hope so, because I'll be losing the last bit of faith in humanity if it isn't.

[-] T156@lemmy.world 24 points 7 hours ago

It is an online poll. You also have to consider that some people don't care/want to be funny, and so either choose randomly, or choose the most nonsensical answer.

this post was submitted on 23 Feb 2026
431 points (97.6% liked)