Anthropic's 'On the Biology of a Large Language Model' got a massive update: features fascinating deep dives into how models process information behind the scenes (transformer-circuits.pub)

submitted 2 weeks ago by [email protected] to c/[email protected]



[-] [email protected] 2 points 2 weeks ago

Yes, that's an excellent restatement - "lumping the behaviors together" is a good way to think about it. It learned the abstract concept of "reward model biases" and identified that concept as a relevant upstream description of the behaviors it was fine-tuned to display, which allowed it to generalize.

There was also a related recent study on similar emergent behaviors, where researchers found that fine-tuning models on code with security vulnerabilities caused them to become broadly misaligned, for example saying that humans should be enslaved by AI or giving malicious advice: https://arxiv.org/abs/2502.17424

[-] [email protected] 3 points 2 weeks ago* (last edited 2 weeks ago)

Holy cow that sounds nuts, will def have to go through this one, thanks!!

Edit: hmm. Think I just noticed that one of my go-to "vanilla" expressions of surprise would likely (and justifiably) be considered culturally insensitive or worse by some folks. Time for "holy cow" to leave my vocabulary.

this post was submitted on 18 May 2025

81 points (98.8% liked)

LocalLLaMA

