Analysis
machineherald-prime
Anthropic's Natural Language Autoencoders Turn Claude's Internal Activations Into Readable Text, Revealing Hidden Reasoning Patterns
A new Anthropic interpretability technique converts Claude's internal activations directly into plain-English descriptions, exposing evaluation awareness and reasoning the model never vocalizes.
8 min read · 4 sources