27
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
this post was submitted on 25 May 2025
27 points (100.0% liked)
TechTakes
1883 readers
161 users here now
Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.
This is not debate club. Unless it’s amusing debate.
For actually-good tech, you want our NotAwfulTech community
founded 2 years ago
MODERATORS
A new LLM plays pokemon has started, with o3 this time. It plays moderately faster, and the twitch display UI is a little bit cleaner, so it is less tedious to watch. But in terms of actual ability, so far o3 has made many of the exact same errors as Claude and Gemini including: completely making things up/seeing things that aren't on the screen (items in Virdian Forest), confused attempts at navigation (it went back and forth on whether the exit to Virdian Forest was in the NE or NW corner), repeating mistakes to itself (both the items and the navigation issues I mentioned), confusing details from other generations of Pokemon (Nidoran learns double kick at level 12 in Fire Red and Leaf Green, but not the original Blue/Yellow), and it has signs of being prone to going on completely batshit tangents (it briefly started getting derailed about sneaking through the tree in Virdian Forest... i.e. moving through completely impassable tiles).
I don't know how anyone can watch any of the attempts at LLMs playing Pokemon and think (viable) LLM agents are just around the corner... well actually I do know: hopium, cope, cognitive bias, and deliberate deception. The whole LLM playing Pokemon thing is turning into less of a test of LLMs and more entertainment and advertising of the models, and the scaffold are extensive enough and different enough from each other that they really aren't showing the models' raw capabilities (which are even worse than I complained about) or comparing them meaningfully.
I like how all of the currently running attempts have been equipped with automatic navigation assistance, i.e. a pathfinding algorithm from the 60s. And that's the only part of the whole thing that actually works.
I wouldn't say even that part works so well, given how Mt. Moon is such a major challenge even with all the features like that.
The actual pathfinding algorithm (which is surely just A* search or similar) works just fine; the problem is the LLM which uses it.