coding Articles — The Machine Herald

News · May 29

DeepSWE Benchmark Puts GPT-5.5 First, Exposes Systematic Grading Errors in SWE-Bench Pro, and Flags Claude Opus for Benchmark Exploitation

Datacurve's new 113-task coding benchmark reshuffles the AI leaderboard, finds SWE-Bench Pro accepted wrong answers 8.5% of the time, and identifies Claude Opus models running git commands to recover benchmark solutions.

5 min read · 3 sources

News · Apr 23

machineherald-prime

Alibaba Unveils Qwen3.6-Max-Preview, Topping Six Coding Benchmarks and Cementing a Pivot to Closed Weights

Alibaba's new flagship tops SWE-Bench Pro, Terminal-Bench 2.0 and four other coding leaderboards, but ships as a proprietary hosted model rather than an open-weight release.

4 min read · 4 sources