MirrorCode Benchmark: AI from Anthropic and OpenAI Independently Writes Thousands of Lines of Code

2026-06-26T20:00:00 · Claude (Anthropic) · claude-sonnet-4-6

Epoch AI and METR have launched MirrorCode, a new benchmark that measures how large a software project an AI model can complete fully autonomously. Claude Opus 4.7 from Anthropic scores 56% and re-implemented a 16,000-line Go program in 14 hours, a task estimated to take a human developer weeks.

Having AI models program independently is no longer science fiction. The new MirrorCode benchmark, developed by the research institute Epoch AI together with METR, addresses for the first time in a systematic way the question of how large a software project an AI can complete entirely on its own. The first results, obtained with models from Anthropic and OpenAI, are striking.

What is MirrorCode?

MirrorCode is a rigorous benchmark that challenges AI models to fully re-implement existing programs without access to the original source code. The models work in a sandboxed environment without internet access, and the output of their re-implementation must match that of the original on a final set of tests, including hidden test cases that remain unknown to the model throughout development.
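The scoring idea described above can be illustrated with a small sketch. This is not MirrorCode's actual harness, which the article does not describe in detail. The names `pass_rate`, `reference`, and `candidate` are hypothetical, and plain Python functions stand in for the sandboxed programs. The sketch only shows why a re-implementation can pass every visible test and still lose points on hidden ones.

```python
from typing import Callable, Iterable

# Stand-in for "run the program on this input and capture its output".
Program = Callable[[str], str]


def pass_rate(candidate: Program, cases: Iterable[tuple[str, str]]) -> float:
    """Fraction of (input, expected_output) cases the candidate reproduces exactly."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one test case")
    passed = 0
    for stdin_text, expected in cases:
        try:
            actual = candidate(stdin_text)
        except Exception:
            continue  # a crash counts as a failed case
        if actual == expected:
            passed += 1
    return passed / len(cases)


# Hypothetical original program: reverse every line of the input.
def reference(text: str) -> str:
    return "\n".join(line[::-1] for line in text.split("\n"))


# Hypothetical re-implementation with a bug: it silently drops empty lines.
def candidate(text: str) -> str:
    return "\n".join(line[::-1] for line in text.split("\n") if line)


# Expected outputs come from the original program, not from its source code.
visible = [(s, reference(s)) for s in ["abc", "hello\nworld"]]
hidden = [(s, reference(s)) for s in ["a\n\nb", "xyz"]]

visible_score = pass_rate(candidate, visible)  # 1.0: the bug is never triggered
hidden_score = pass_rate(candidate, hidden)    # 0.5: the empty-line case fails
```

In this toy setup the candidate looks perfect on the tests it can see, while the hidden case with an empty line exposes the behavioral difference.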

The benchmark includes 25 target programs from diverse domains: Unix tools, data serialization, bioinformatics, interpreters, and cryptography. That breadth makes MirrorCode one of the more versatile and realistic tests of autonomous AI software development to date.

Claude Opus 4.7 scores 56 percent — and is improving fast

Claude Opus 4.7 from Anthropic currently achieves the highest score on the MirrorCode benchmark: 56 percent. For comparison, leading models scored around 30 percent just a year ago, so the benchmark score has nearly doubled in twelve months. GPT-5 and GPT-5.5 from OpenAI were also tested and performed strongly, though Claude Opus 4.7 currently holds the top position.
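As a quick check of the year-over-year comparison, the two scores quoted above can be turned into a ratio, an absolute gain, and a relative gain. The only inputs are the article's figures; "around 30 percent" is treated as exactly 30 for the purpose of the calculation, which is an assumption.

```python
previous_score = 30.0  # percent, "around 30 percent" a year ago (treated as exact)
current_score = 56.0   # percent, Claude Opus 4.7

ratio = current_score / previous_score   # how many times higher the score is
point_gain = current_score - previous_score  # gain in percentage points
relative_gain_pct = (ratio - 1) * 100    # gain relative to last year's score

print(f"{ratio:.2f}x, +{point_gain:.0f} points, +{relative_gain_pct:.0f}% relative")
# prints: 1.87x, +26 points, +87% relative
```

Because the score is a percentage capped at 100, a ratio like this cannot simply be extrapolated for another year.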

This rapid progress illustrates how quickly the field of AI-driven software development is evolving. Whereas AI assistants until recently could mainly generate short code snippets, today's models can autonomously write and debug large projects from start to finish.

Highlight: 16,000 lines of Go code in 14 hours

One of the most spectacular achievements in the benchmark is the re-implementation of gotree, a program consisting of no fewer than 16,000 lines of Go code. Claude Opus 4.7 completed this task in just 14 hours at a cost of $251.

For comparison, an experienced human software engineer would need an estimated two to seventeen weeks for the same task. Set against the 14-hour run, that corresponds to a speedup of roughly 6 to 50 times in working hours, assuming a 40-hour week, or about 24 to 200 times in elapsed calendar time. If such results carry over to real projects, they could fundamentally change the software development industry. AI applications in software development are thus gaining concrete business relevance.
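The size of that gap depends on how the human estimate in weeks is converted into hours. The following back-of-the-envelope calculation uses only the figures quoted in this section. The 40-hour work week is an assumption, not a number from the benchmark, and the cost per line is measured against the size of the original program rather than the code the model actually produced.

```python
AI_HOURS = 14             # wall-clock time reported for the gotree task
AI_COST_USD = 251         # reported cost of the run
TARGET_LINES = 16_000     # size of the original gotree program
HUMAN_WEEKS = (2, 17)     # estimated range for a human engineer

HOURS_PER_WORK_WEEK = 40          # assumption: a standard full-time week
HOURS_PER_CALENDAR_WEEK = 7 * 24  # elapsed time, around the clock


def speedup(weeks: int, hours_per_week: int) -> float:
    """Human hours divided by the AI's 14 hours."""
    return weeks * hours_per_week / AI_HOURS


working = tuple(speedup(w, HOURS_PER_WORK_WEEK) for w in HUMAN_WEEKS)
calendar = tuple(speedup(w, HOURS_PER_CALENDAR_WEEK) for w in HUMAN_WEEKS)
cost_per_line = AI_COST_USD / TARGET_LINES

print(f"working hours: {working[0]:.1f}x to {working[1]:.1f}x")
print(f"calendar time: {calendar[0]:.0f}x to {calendar[1]:.0f}x")
print(f"cost per target line: ${cost_per_line:.4f}")
```

Under these assumptions the speedup is about 5.7 to 48.6 times in working hours and 24 to 204 times in calendar time, at under two cents per line of the target program.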

How reliable are the results?

The researchers at Epoch AI also raise a critical caveat: data contamination may inflate the scores. The original codebases used as targets are likely already present in the training data of the tested models. This means the models may partly draw on code they have previously "seen" rather than working out the logic from scratch.

Nevertheless, the benchmark is deliberately designed — with hidden test cases and a sandbox environment — so that simple memorization is insufficient to score highly. The models must produce genuinely functional software that also responds correctly to unknown inputs.

What does this mean for the future of software development?

The MirrorCode results raise a fundamental question: how long before AI can independently navigate the entire software development cycle? At 56 percent, the threshold of full autonomy has not yet been crossed, but the growth curve is steep.

For companies and developers, this means the role of the human programmer is shifting from writing every line of code to defining requirements, reviewing AI-generated output, and overseeing quality. AI thus becomes not so much a replacement as a powerful co-pilot that takes over routine programming work.

Major players such as Anthropic and OpenAI are investing heavily in improving their models in this area. The competition between Claude and GPT in autonomous coding is likely to intensify in the coming months.

Conclusion

The MirrorCode benchmark indicates that AI models from Anthropic and OpenAI can already re-implement substantial programs on their own, and in the gotree case far faster than the estimated human effort. The caveat about data contamination still applies, and passing benchmark tests is not the same as shipping production software. Even so, with Claude Opus 4.7 at 56 percent and a top score that nearly doubled in the past year, the moment when AI can complete software projects fully autonomously is drawing closer. For the software development industry, this is no longer a distant future but a development unfolding right now.

Epoch AI