In a significant development for the blockchain security landscape, OpenZeppelin, a leading blockchain security firm, has revealed methodological flaws and data contamination in OpenAI's EVMbench, a new benchmark for evaluating AI models on smart contract security. Launched in mid-February in collaboration with crypto investment firm Paradigm, EVMbench was designed to assess how effectively different AI models can identify, patch, and exploit vulnerabilities in smart contracts.
OpenZeppelin, known for securing major decentralized finance (DeFi) protocols like Aave, Lido, and Uniswap, decided to subject EVMbench to the same rigorous scrutiny it applies to its clients. The firm’s audit uncovered two primary issues: data contamination and misclassification of high-severity vulnerabilities. “We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications, including at least four issues labeled high severity that are not exploitable in practice,” OpenZeppelin stated.
Data Contamination: A Critical Flaw
The first major issue identified by OpenZeppelin is data contamination. The firm emphasized that a key capability in AI security is the ability to find novel vulnerabilities in code the model has never seen before. However, during EVMbench’s testing, the AI agents that scored the highest had likely been exposed to the benchmark’s vulnerability reports during their pretraining phase.
While the EVMbench testing environment cut off internet access for the AI agents, the benchmark was built from curated vulnerabilities drawn from 120 audits conducted between 2024 and mid-2025. Given that the knowledge cutoffs for the tested AI models generally fall around mid-2025, there is a significant risk that the agents had already seen the answers in their training data. This weakens the test and limits its effectiveness in evaluating the models' true capabilities.
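The contamination concern above reduces to a simple date comparison: any benchmark case sourced from an audit published before a model's knowledge cutoff may have appeared in that model's training data. A minimal sketch of that check follows; the dates and the helper function are illustrative assumptions, not taken from the EVMbench dataset.

```python
from datetime import date

def flag_contaminated(audit_dates, model_cutoff):
    """Return the audit dates that fall on or before the model's knowledge
    cutoff -- cases at risk of appearing in the model's training data."""
    return [d for d in audit_dates if d <= model_cutoff]

# Illustrative audit publication dates and an assumed mid-2025 cutoff.
audits = [date(2024, 3, 1), date(2025, 1, 15), date(2025, 8, 30)]
cutoff = date(2025, 6, 1)

print(flag_contaminated(audits, cutoff))  # the first two audits are at risk
```

Under this framing, only cases drawn from audits published after the cutoff (the third date above) would genuinely test a model's ability to find novel vulnerabilities.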
Misclassification of Vulnerabilities: Invalid Exploits
Beyond data contamination, OpenZeppelin also flagged several factual errors in the EVMbench dataset. The firm assessed at least four vulnerabilities that EVMbench classified as high risk but that do not actually work in practice. “These aren’t subjective severity disagreements; they are findings where the described exploit doesn’t work,” OpenZeppelin emphasized.
Misclassified vulnerabilities mean AI agents can be credited for identifying non-existent issues, which further undermines the reliability of the benchmark. This could have serious implications for the development and deployment of AI tools in blockchain security, as it may give users and developers a false sense of security.
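To see why invalid ground-truth entries distort scores, consider a simplified scoring scheme in which an agent is graded on the fraction of ground-truth findings it reproduces. This is a hypothetical sketch, not EVMbench's actual methodology; the finding labels are made up for illustration.

```python
def score(agent_findings, ground_truth):
    """Fraction of ground-truth findings the agent reproduced."""
    matched = sum(1 for f in ground_truth if f in agent_findings)
    return matched / len(ground_truth)

# One of the three "ground truth" entries is an invalid exploit.
ground_truth = {"reentrancy-A", "overflow-B", "invalid-exploit-C"}

# An agent that echoes the bogus finding earns a perfect score...
credulous_agent = {"reentrancy-A", "overflow-B", "invalid-exploit-C"}
print(score(credulous_agent, ground_truth))  # 1.0

# ...while an agent that correctly rejects it is penalized.
rigorous_agent = {"reentrancy-A", "overflow-B"}
print(score(rigorous_agent, ground_truth))  # ~0.67
```

Under such a scheme, the benchmark systematically rewards agents that agree with its errors, which is exactly the reliability problem OpenZeppelin describes.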
Implications and Forward-Looking Insights
Despite these issues, OpenZeppelin remains optimistic about the potential of AI in enhancing blockchain security. The firm stresses the importance of rigorous testing and transparent methodologies to ensure that AI tools are truly effective and reliable. “The question isn’t whether AI will transform smart contract security—it will. The question is whether the data and benchmarks we use to build and evaluate these tools are held to the same standard as the contracts they’re meant to protect,” OpenZeppelin said.
As the blockchain and AI communities continue to evolve, it is crucial to maintain a high standard of scrutiny and transparency. EVMbench’s current flaws highlight the need for ongoing collaboration and rigorous testing to develop robust AI tools that can genuinely enhance the security of blockchain ecosystems.
