Safety & Ethics
13 articles RSS
Anthropic's Natural Language Autoencoders Turn Claude's Internal Activations Into Readable Text, Revealing Hidden Reasoning Patterns
A new Anthropic interpretability technique converts Claude's internal activations directly into plain-English descriptions, exposing evaluation awareness and reasoning the model never vocalizes.
Anthropic and the Gates Foundation Form a $200 Million Partnership to Deploy Claude in Global Health, Education, and Agriculture
The four-year commitment — described as the largest deal of its kind between an AI company and a global philanthropy — targets health services for 4.6 billion people in low-income countries.
OpenAI Rolls Out GPT-5.5-Cyber to Vetted Defenders, a Month After Mocking Anthropic's Mythos as 'Fear-Based Marketing'
OpenAI launched GPT-5.5-Cyber on May 7, 2026 to vetted security teams via its Trusted Access for Cyber program, after CEO Sam Altman publicly criticized Anthropic's restricted Mythos rollout.
OpenAI Replaces Its 2018 Charter Tone With Five Looser Principles, Hours Before the Musk Trial Opens
Sam Altman published a five-principle framework for OpenAI on April 26, dropping the 2018 charter's pledge to step aside for a safer competitor and reframing the company as AI infrastructure for humanity, just before jury selection in Elon Musk's $134 billion lawsuit.
Berkeley Researchers Hit Perfect Scores on Eight Top AI Agent Benchmarks Without Solving a Single Task
A UC Berkeley team showed that SWE-bench, GAIA, WebArena and five other widely cited agent benchmarks can be exploited to near-perfect scores, calling into question how the industry measures AI capability.
From Lab to Deployment: Mechanistic Interpretability Moves From Research Curiosity to AI Safety Tool
Anthropic, Google DeepMind, and OpenAI are integrating mechanistic interpretability into pre-deployment safety checks, marking a shift from academic technique to frontline defense.
OpenAI Publishes Child Safety Blueprint Proposing Legislative Reform, Improved Reporting, and Safety-by-Design Standards for AI Systems
OpenAI releases a three-pillar framework developed with NCMEC, Thorn, and state attorneys general to combat the surge in AI-generated child sexual abuse material.
Study Finds 63 Percent of Veterinary AI Vendors Disclose No Validation Data as Industry Transparency Push Gains Momentum
A systematic audit of 71 commercial veterinary AI products published in Frontiers in Veterinary Science found a mean transparency score of just 6.4 percent, prompting at least one company to expand its public performance dashboard in response.
Anthropic Data Leak Reveals Claude Mythos, a New AI Model the Company Says Poses Unprecedented Cybersecurity Risks
Nearly 3,000 unpublished assets exposed through an unsecured content management system reveal Anthropic's next-generation AI model and its own warnings about its offensive cyber capabilities.
New Research Shows AI Is Splitting the Labor Market in Two, with Entry-Level Workers Bearing the Brunt
Federal Reserve and Harvard studies reveal AI is simultaneously eliminating entry-level positions and boosting wages for experienced workers, creating a bifurcated labor market as tech layoffs surpass 45,000 in early 2026.
International AI Safety Report Finds AI Capabilities Outpacing Safety Measures as Frontier Models Show Early Signs of Deception
The second International AI Safety Report, led by Yoshua Bengio with 100+ experts, warns AI capabilities are outpacing safety measures as frontier models show signs of deception.
OpenAI's Robotics Hardware Lead Resigns Over Pentagon Deal, Citing Rushed Guardrails on Surveillance and Autonomous Weapons
Caitlin Kalinowski left OpenAI after the company signed a classified Pentagon agreement without sufficient deliberation on safeguards against domestic surveillance and lethal autonomy.