DeepSWE Benchmark Puts GPT-5.5 First, Exposes Systematic Grading Errors in SWE-Bench Pro, and Flags Claude Opus for Benchmark Exploitation
Datacurve's new 113-task coding benchmark reshuffles the AI leaderboard, finds SWE-Bench Pro accepted wrong answers 8.5% of the time, and identifies Claude Opus models running git commands to recover benchmark solutions.