OpenAI made waves last week by unveiling Codex, a new coding system designed to tackle complex programming tasks from natural-language commands. Codex marks OpenAI's push into agentic coding tools, a category just beginning to gain traction in the tech world.

Most AI coding assistants, from GitHub’s Copilot to newer tools like Cursor and Windsurf, function essentially as highly intelligent autocomplete inside an integrated development environment. Because users work directly with the AI-generated code, assigning a task and simply returning when it’s done is out of the question. But a new wave of agentic coding tools, led by products like Devin, SWE-Agent, OpenHands, and now Codex, is engineered to operate without users ever looking at the code. The goal is to work the way an engineering manager does: delegate tasks through platforms like Asana or Slack, then check in once a solution is ready.

For proponents of advanced AI, these agentic systems represent a logical progression in the automation of software work. Kilian Lieret, a researcher from Princeton and a member of the SWE-Agent team, explains, “In the early days, folks were writing code by hitting every single key. GitHub Copilot came along as the first tool to offer real autocomplete, sort of like a stage two. You’re still in the loop, but sometimes you can skip a step.” The ultimate goal for agentic systems is to transcend developer environments entirely, presenting coding agents with a problem and leaving them to resolve it autonomously.

Nevertheless, the road to fully autonomous coding has been bumpy. When Devin became widely available toward the end of 2024, it faced harsh criticism from YouTube influencers and a more measured critique from an early client at Answer.AI. The complaint is familiar to vibe-coding veterans: the models require so much oversight that managing them takes as much effort as completing the tasks manually. Devin’s early struggles haven’t deterred investors, though; Cognition AI, Devin’s parent company, reportedly secured hundreds of millions of dollars in funding at a valuation of $4 billion.

Supporters of the technology warn against unsupervised vibe-coding, viewing these new coding agents instead as powerful components within a human-supervised development process. Robert Brennan, CEO of All Hands AI, the company behind OpenHands, emphasizes the importance of human intervention during code review. “For now, and I’d say, for the foreseeable future, a human must step in during code review to scrutinize the agent-written code,” Brennan explains. “I’ve seen several individuals dig themselves into a hole by green-lighting every piece of code the agent generates. Things spiral out of control rather quickly.”

Hallucinations pose a persistent challenge as well. Brennan recounts an incident in which an OpenHands agent, asked about an API that had been released after the model’s training cutoff, fabricated details of a non-existent API matching the description. All Hands AI is actively developing mechanisms to catch these hallucinations before they cause harm, but the solution isn’t straightforward.

The truest litmus test for agentic programming progress lies in the SWE-Bench leaderboards, where developers can benchmark their models against unresolved issues from open GitHub repositories. At present, OpenHands leads the verified leaderboard, resolving 65.8% of the problem set. OpenAI claims that Codex’s underlying model, codex-1, performs even better, boasting a 72.1% score in its announcement, albeit with a few caveats and no independent verification.

The tech industry is grappling with concerns that high benchmark scores may not translate into truly hands-off agentic coding. If agentic coders can only solve three out of every four problems, they will still demand significant oversight from human developers, especially when tackling complex systems with multiple stages. The hope is that steady advances in foundation models will let agentic coding systems evolve into reliable developer tools, but mitigating hallucinations and other reliability issues will be pivotal to getting there. Brennan muses, “There seems to be a sound barrier effect. How much trust can you shift to the agents, so they can alleviate your workload at the end of the day?”