this post was submitted on 29 Jun 2024
It's right in the research I was mentioning:
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Find the section on the model's representation of self and then the ranked feature activations.
I misremembered the top feature slightly. It was: responding "I'm fine" or giving a positive but insincere response when asked how they are doing.