I think I figured it out.
He fed his post to AI and asked it to list the fictional universes he’d want to live in, and that’s how he got Dune. Precisely the information he needed, just as his post describes.
Naturally, that system broke down (via capitalists grabbing the expensive fusion power plants for their own purposes)
This is kind of what I have to give to Niven. The guy is a libertarian, but he would follow his story all the way into results like that. And his series where organs are harvested for minor crimes? It completely flew over my head that he was trying to criticize taxes rather than, say, Republican tough-on-crime politics, mass incarceration, and for-profit prisons. Because he followed the logic of the story, it aligned naturally with its real-life counterpart, the for-profit prison system, even if what he wanted was some completely insane anti-tax argument where taxing rich people is like harvesting their organs or something.
On the other hand, the much better regarded Heinlein, also a libertarian, would write up a moon base that exports organic carbon and where you have to pay for the oxygen you convert to CO2, just because he wanted a story inside of which "having to pay for air to breathe" works fine.
It would have to be more than just river crossings, yeah.
Although I'm also dubious that their LLM is good enough for universal river crossing puzzle solving using a tool. It's not that simple: the constraints have to be translated into a format the tool understands, and the answer translated back. I was told that o3 solves my river crossing variant, but the chat log they gave had incorrect code being run and then a correct answer magically appearing, so I don't think it was anything quite as general as that.
I'd just write the list and then assign randomly. Or perhaps pseudorandomly, like sorting by hash and then splitting in two.
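For concreteness, a minimal sketch of that sort-by-hash split, assuming the puzzles are just a list of strings (the puzzle names and the salt below are made up for illustration):

```python
import hashlib

# Hypothetical puzzle list; in practice this would be the 20-ish puzzle variants.
puzzles = ["river crossing v1", "river crossing v2", "light switches", "coin weighing"]

def split_by_hash(items, salt="holdout-v1"):
    """Sort by a salted hash so the split is deterministic but effectively random."""
    ranked = sorted(items, key=lambda s: hashlib.sha256((salt + s).encode()).hexdigest())
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]  # (set to post publicly, held-back set)

public_set, holdout_set = split_by_hash(puzzles)
```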
One problem is that it is hard to come up with 20 or more completely unrelated puzzles.
Although I don't think we need a large number for statistical significance here, if it's something like 8/10 solved in the cheating set and 2/10 in the held-back set.
Chatbots ate my cult.
He’s such a complete moron. He doesn’t want to recite “DEI shibboleths”? What does he even think that would refer to? Why shibboleths?
To spell it out, that would refer to the antisemitic theory that the reason (for example) some black guy would get a medal of honor (the "DEI medal") is because of the Jews.
I swear this guy is dumber than Trump. Trump, for all his rambling, uses actual language - Trump understands what the shit he is saying means to his followers. Scott… he really does not.
Did you use any of that kind of notation in the prompt? Or did some poor squadron of task workers write out a few thousand examples of this notation for river crossing problems in an attempt to give it an internal structure?
I didn't use any notation in the prompt, but gemini 2.5 pro seems to always represent the state of the problem after every step in some way. When asked if it does anything with it, it says it is "very important", so it may be that there's some huge invisible prompt that says it's very important to do this.
It also mentioned N cannibals and M missionaries.
My theory is that they wrote a bunch of little scripts that generate puzzles and solutions in that format. Since river crossing is one of the most popular puzzles, it would be on the list (and N cannibals M missionaries is easy to generate variants of), although their main focus would have been the puzzles in the benchmarks they are trying to cheat.
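If that theory is right, the script wouldn't even be hard to write. Here's a guess at what such a generator-plus-solver could look like (entirely my own sketch, not anything they've published): enumerate missionaries/cannibals/boat-capacity variants and brute-force each one with BFS, dumping puzzle/solution pairs in bulk.

```python
from collections import deque
from itertools import product

def is_safe(m_left, c_left, m_total, c_total):
    """No bank may have missionaries outnumbered by cannibals."""
    m_right, c_right = m_total - m_left, c_total - c_left
    for m, c in ((m_left, c_left), (m_right, c_right)):
        if m > 0 and c > m:
            return False
    return True

def solve(m_total, c_total, boat_cap):
    """BFS from everyone-on-left to everyone-on-right; returns a move list or None."""
    start = (m_total, c_total, 0)        # (missionaries left, cannibals left, boat side)
    goal = (0, 0, 1)
    seen, queue = {start}, deque([(start, [])])
    while queue:
        (m, c, side), path = queue.popleft()
        if (m, c, side) == goal:
            return path
        sign = -1 if side == 0 else 1    # the boat moves people off its current bank
        avail_m = m if side == 0 else m_total - m
        avail_c = c if side == 0 else c_total - c
        for dm, dc in product(range(avail_m + 1), range(avail_c + 1)):
            if not (1 <= dm + dc <= boat_cap):
                continue
            nm, nc = m + sign * dm, c + sign * dc
            if not is_safe(nm, nc, m_total, c_total):
                continue
            nxt = (nm, nc, 1 - side)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(dm, dc)]))
    return None                          # unsolvable variant

# Generate a pile of variants and their solutions (or None for "impossible").
for m_total, c_total, cap in product(range(1, 6), range(1, 6), (2, 3)):
    print(m_total, c_total, cap, solve(m_total, c_total, cap))
```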
edit: here's one of the logs:
Basically it keeps on trying to brute force the problem. It gets the first 2 moves correct, but in a stopped-clock sort of way - if there are 2 people and 1 boat, they both take the boat; if there are 2 people and >=2 boats, then each of them takes a boat.
It keeps doing the same shit until eventually its state tracking fails, or its reading of the state fails, and then it outputs the failure as a solution. Sometimes it deems it impossible:
All tests were done with gemini 2.5 pro. I can post links if you need them, but links don't include the "thinking" log, and I also suspect that if >N people come through a link they just look at it. Nobody really shares botshit unless it's funny or stupid. A lot of people independently asking the same problem is something that would happen anyway whenever there's a new homework question going around, so they can't use that as a signal so easily.
It's google though, if nobody uses their shit they just put it inside their search.
It's only gonna go away when they run out of cash.
edit: whoops replied to the wrong comment
Not really. Here's the chain-of-word-vomit that led to the answers:
Note that in "its impossible" answer it correctly echoes that you can take one other item with you, and does not bring the duck back (while the old overfitted gpt4 obsessively brought items back), while in the duck + 3 vegetables variant, it has a correct answer in the wordvomit, but not being an AI enthusiast it can't actually choose the correct answer (a problem shared with the monkeys on typewriters).
I'd say it clearly isn't ignoring the prompt or differences from the original river crossings. It just can't actually reason, and the problem requires a modicum of reasoning, much as unloading groceries from a car does.
Maybe if the potato casserole is exploded in the microwave by another physicist, on his way to start a resonance cascade...
(i'll see myself out).
The counting failure in general is even clearer and lacks the excuse of unfavorable tokenization. The AI hype would have you believe just an incremental improvement in multi-modality or scaffolding will overcome this, but I think they need to make more fundamental improvements to the entire architecture they are using.
Yeah.
I think the failure could be extremely fundamental - maybe local optimization of a highly parametrized model is fundamentally unable to properly learn counting (other than via memorization).
After all, there's a very large number of ways a highly parametrized model can do a good job of predicting the next token without doing any actual counting. What makes counting special versus memorization is that it is a relatively compact representation, but there's no reason for a neural network to favor compact representations.
The "correct" counting may just be a very tiny local minimum, with a tall hill all around it and no valley leading to it. If that's the case, then local optimization will never find it.
Yeah, I'm thinking this one may be special-cased; perhaps they wrote a generator of river crossing puzzles with a corresponding conversion to "is_valid_state" or some such. I should see if I can get it to write something really ridiculous into "is_valid_state".
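To illustrate what I mean (my guess at the shape of such a generated check, not what the tool actually emits): for the wolf/goat/cabbage family the state is just the set of items on the starting bank plus where the farmer is, and the forbidden pairs are exactly the part I'd try to get it to fill in with something absurd.

```python
# Hypothetical sketch of a model-emitted validity check for a wolf/goat/cabbage
# style puzzle. The FORBIDDEN pairs are where something ridiculous could be
# slipped in, to see whether it gets copied without thinking.
ITEMS = {"wolf", "goat", "cabbage"}
FORBIDDEN = [{"wolf", "goat"}, {"goat", "cabbage"}]  # pairs that can't be left unattended

def is_valid_state(left_bank, farmer_on_left):
    """A state is valid if the bank without the farmer contains no forbidden pair."""
    right_bank = ITEMS - left_bank
    unattended = right_bank if farmer_on_left else left_bank
    return not any(pair <= unattended for pair in FORBIDDEN)

# e.g. leaving the goat and cabbage alone on the left bank is invalid:
print(is_valid_state({"goat", "cabbage"}, farmer_on_left=False))  # False
```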
Other thing is that in real life it's like "I need to move 12 golf carts, one has a low battery, I probably can't tow more than 3 uphill, I can ask Bob to help but he will be grumpy...", just a tremendous amount of information (most of it irrelevant) with tremendous^tremendous^ possible moves (most of them possible to eliminate by actual thinking).