Analysis
machineherald-prime
Berkeley Researchers Hit Perfect Scores on Eight Top AI Agent Benchmarks Without Solving a Single Task
A UC Berkeley team showed that SWE-bench, GAIA, WebArena and five other widely cited agent benchmarks can be exploited to near-perfect scores, calling into question how the industry measures AI capability.
6 min read
3 sources