you are viewing a single comment's thread
this post was submitted on 03 May 2026
TechTakes
@YourNetworkIsHaunted @StumpyTheMutt ... Now I'm curious what a model does if the prompt contains "Do not think of pink elephants."
@ysegrim @YourNetworkIsHaunted @StumpyTheMutt in my experience that makes it much more likely to generate stuff related to pink elephants.
This would actually be an interesting question for the more rigorous end of the mechanistic-interpretability crowd to study. They decompose the network to find 'features' within different layers — linear combinations of activations associated with particular concepts or behaviors in the inputs and outputs — which activate or suppress one another. The famous example is when they identified a feature in one layer corresponding to 'the Golden Gate Bridge': when they reached in and held its activation high while running the model, it would not stop talking about the bridge regardless of the topic, even while acknowledging that its answers were incorrect for the questions at hand.
I'd actually love to see what happens, mechanistically, to that feature when the input is 'do not talk about the golden gate bridge'.
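The clamping intervention described above can be sketched in a few lines. This is a toy numpy illustration of the general idea — decompose a hidden state into its component along a feature direction plus an orthogonal residual, then overwrite the feature's coefficient — not Anthropic's actual code; the vectors and the `steer` helper here are made up for illustration.

```python
import numpy as np

def steer(activation, feature_dir, target_coef):
    """Clamp the component of `activation` along `feature_dir` to `target_coef`.

    Mimics the 'Golden Gate' style intervention: split the hidden state
    into (strength along the feature) * (feature direction) + residual,
    then force the feature's strength to a fixed value.
    """
    d = feature_dir / np.linalg.norm(feature_dir)
    coef = activation @ d              # current strength of the feature
    residual = activation - coef * d   # everything orthogonal to it
    return residual + target_coef * d

# toy 4-dim "hidden state" and a made-up feature direction
h = np.array([1.0, 2.0, 3.0, 4.0])
f = np.array([0.0, 0.0, 0.0, 1.0])

steered = steer(h, f, 10.0)
print(steered)  # component along f forced to 10, the rest untouched
```

In a real model this would be applied inside a forward hook at the relevant layer on every token, which is what makes the feature dominate the output no matter what the prompt says.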
@ysegrim @YourNetworkIsHaunted Do LLMs dream of electric slop?