OpenAI’s GPT-5.4 Targets Real Work: Build Apps Faster, Automate Tests, and Ship Business-Ready Docs With Agentic AI

6 March, 2026


OpenAI has released GPT-5.4, a new flagship model that it says is optimized for agentic workflows. The model combines stronger reasoning and coding with native computer-use capabilities (the ability to operate software via screenshots and mouse/keyboard actions) and supports up to 1 million tokens of context for long-horizon tasks.

A model built for “do the work” agents

In its announcement, OpenAI frames GPT-5.4 as its first general-purpose model shipping with state-of-the-art computer use, aimed at developers building agents that can complete real tasks across websites and software systems. The company highlights use cases such as automating workflows across apps, and notes the model can drive computer interactions directly and also write automation code via tools like Playwright.

OpenAI also says GPT-5.4 is more token efficient than GPT-5.2—using fewer tokens to solve problems—positioning it as both faster and cheaper in practice for certain workloads despite higher per-token pricing.
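The trade-off between a higher per-token price and lower token use comes down to simple arithmetic. The sketch below illustrates it with hypothetical numbers: the prices, token counts, and the single blended price per million tokens are assumptions for illustration, not OpenAI's published rates.

```python
def task_cost(tokens_used: int, price_per_million: float) -> float:
    """Cost in dollars for one task at a given price per 1M tokens."""
    return tokens_used / 1_000_000 * price_per_million


def break_even_token_ratio(old_price: float, new_price: float) -> float:
    """Fraction of the old token count at which both models cost the same.

    If the new model uses a smaller fraction of tokens than this,
    it is cheaper per task despite the higher per-token price.
    """
    return old_price / new_price


# Hypothetical figures, for illustration only.
OLD_PRICE = 10.0  # dollars per 1M tokens (assumed)
NEW_PRICE = 12.5  # dollars per 1M tokens (assumed, 25% higher)

old_cost = task_cost(40_000, OLD_PRICE)   # older model, more tokens
new_cost = task_cost(30_000, NEW_PRICE)   # newer model, fewer tokens
ratio = break_even_token_ratio(OLD_PRICE, NEW_PRICE)
```

Under these assumed numbers, the newer model is cheaper per task whenever it uses less than 80% of the tokens the older model needs.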

Benchmark claims emphasize professional work, tools, and desktop navigation

OpenAI’s post spotlights gains across a mix of “knowledge work” and agent benchmarks. It reports 83.0% “wins or ties” on GDPval (a professional work eval spanning 44 occupations), compared with 70.9% for GPT-5.2.

For computer-use tasks, OpenAI reports a 75.0% success rate on OSWorld-Verified, up from 47.3% for GPT-5.2, and notes that this exceeds the 72.4% human performance figure cited for the benchmark.

Mercor: “Top of the leaderboard” for professional services agent work

OpenAI’s announcement includes early customer validation from Mercor CEO Brendan Foody:

“GPT-5.4 is the best model we’ve ever tried. It’s now top of the leaderboard on our APEX-Agents benchmark, which measures model performance for professional services work. It excels at creating long-horizon deliverables such as slide decks, financial models, and legal analysis, delivering top performance while running faster and at a lower cost than competitive frontier models.”

— Brendan Foody, CEO at Mercor

OpenAI also claims improved web-browsing and tool-use performance, including higher results on BrowseComp and Toolathlon, as part of its pitch that GPT-5.4 is better at selecting and operating tools in complex workflows.

What’s in it for developers

For software teams, OpenAI is positioning GPT-5.4 as a more capable “agentic” engine for end-to-end engineering work—especially when paired with tooling and computer-use interfaces.

GPT-5.4 is designed to sustain longer development loops without losing context. Its context window of up to 1 million tokens can help when the relevant material spans large repositories, extensive logs, multi-step incident timelines, or large test outputs.
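Whether a given repository or log bundle actually fits in such a window can be estimated before sending anything to a model. The following is a rough sketch: the four-characters-per-token ratio is a common rule of thumb rather than the model's real tokenizer, and the output reserve is an arbitrary assumption.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # upper bound cited in the announcement
CHARS_PER_TOKEN = 4                # rough heuristic (assumption), not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count using ceiling division on character length."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(documents: list[str], reserve_for_output: int = 50_000) -> bool:
    """Check whether the documents plus an output reserve stay within the window."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserve_for_output <= CONTEXT_WINDOW_TOKENS
```

For example, `fits_in_context([source_code, build_log])` gives a quick go/no-go signal (both names are placeholders); an exact count would require the provider's tokenizer.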

OpenAI also emphasizes improvements in coding and debugging, with GPT-5.4 used inside Codex and integrated into workflows where the model can not only propose code changes but also drive tooling through a computer-use layer—opening the door to agents that can run commands, inspect outputs, and iterate.

For QA and test engineering, the company’s “computer use” capability is a notable shift: GPT-5.4 can be used to generate automated UI testing flows (for example, by producing Playwright-style scripts) and to execute multi-step test procedures where results must be validated and corrected across iterations. OpenAI’s OSWorld-Verified results are presented as evidence the model can reliably operate desktop environments for task completion.
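To make "Playwright-style scripts" concrete, here is a minimal sketch of the kind of UI check such an agent might produce, written against Playwright's Python sync API. The URL, form labels, button name, heading text, and credentials are all hypothetical, and nothing here is taken from OpenAI's announcement.

```python
def check_login_flow(page, base_url: str, email: str, password: str) -> None:
    """Drive a login form and wait for the post-login heading.

    `page` is a Playwright Page. The labels, button name, and heading
    text are hypothetical; adjust them to the application under test.
    """
    page.goto(f"{base_url}/login")
    page.get_by_label("Email").fill(email)
    page.get_by_label("Password").fill(password)
    page.get_by_role("button", name="Sign in").click()
    # Raises a Playwright TimeoutError if the heading never becomes visible.
    page.get_by_role("heading", name="Dashboard").wait_for(state="visible", timeout=5_000)


def main() -> None:
    # Imported here so the helper above can also be exercised with a stub page.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        try:
            check_login_flow(page, "http://localhost:3000", "qa@example.com", "secret")
        finally:
            browser.close()
```

Calling `main()` runs the check in a headless Chromium instance, assuming Playwright and its browsers are installed. Keeping the steps in a function that takes `page` as a parameter lets the same flow be reused across browsers or test fixtures.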

OpenAI’s claim that GPT-5.4 is “more token efficient” than GPT-5.2 matters most for developer workloads where tools generate verbose outputs (stack traces, logs, diffs) and cost scales directly with tokens processed.

What’s in it for business teams

OpenAI is also aiming GPT-5.4 squarely at knowledge work—particularly tasks that combine research, synthesis, and output formatting into business-ready deliverables.

The company says GPT-5.4 improves generation and editing of documents, spreadsheets, and presentations, and ties those improvements to its “long-horizon” planning approach in ChatGPT through GPT-5.4 Thinking, which can surface an upfront plan for complex tasks that users can steer.

In practical terms, that’s the workflow Mercor describes: producing slide decks, financial models, and legal analysis as complete, multi-step deliverables, rather than short answers.

OpenAI’s own metrics are meant to reinforce the “business usefulness” angle: the 83.0% wins-or-ties score on GDPval, which measures task performance across 44 occupations, compares with 70.9% for GPT-5.2.

The model’s computer-use capability also matters outside engineering: GPT-5.4-powered agents could navigate web dashboards, move data between tools, generate reports, and update systems of record—work that’s often manual across operations, finance, HR, and customer support. OpenAI frames this as part of its broader shift toward agents that can “do” work across applications, not just answer questions.

Reliability and rollout

On accuracy, OpenAI says that—based on de-identified prompts where users flagged factual errors—GPT-5.4’s individual claims are 33% less likely to be false and its full responses are 18% less likely to contain any errors, relative to GPT-5.2.
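These are relative reductions, not percentage-point drops, so their practical size depends on the baseline error rate. A small sketch shows the difference; the baseline rates are hypothetical, since the announcement gives only the relative figures.

```python
def apply_relative_reduction(baseline_rate: float, reduction: float) -> float:
    """Rate after a relative reduction, e.g. 0.33 means 33% fewer occurrences."""
    return baseline_rate * (1.0 - reduction)


# Baseline rates below are assumed for illustration only.
claim_error_rate = apply_relative_reduction(0.10, 0.33)     # 10% of claims false (assumed)
response_error_rate = apply_relative_reduction(0.40, 0.18)  # 40% of responses with an error (assumed)
```

With these assumed baselines, a 33% relative reduction takes the claim error rate from 10% to about 6.7%, and an 18% relative reduction takes the response error rate from 40% to about 32.8%.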

OpenAI says GPT-5.4 Thinking is rolling out in ChatGPT to Plus, Team, and Pro users, replacing GPT-5.2 Thinking, which is scheduled to be retired on June 5, 2026, after a three-month legacy period.

