OpenAI GPT-5 Unveiled: Expert AI Model, Capabilities & Early Reactions

Gradientflow

OpenAI has unveiled GPT-5, positioning its latest large language model as an “expert-level” foundation system poised to redefine how AI handles complex tasks. Billed as a unified architecture that routes each query to a specialized sub-model based on its complexity, GPT-5 promises “PhD-caliber” responses for demanding problems while keeping latency low for simpler requests. This marks a significant shift from previous models, where users manually chose between speed and depth, and the announcement has drawn a mix of enthusiasm and skepticism.

At the core of GPT-5’s advances are substantial performance gains, particularly in coding and factual accuracy. On the real-world software engineering benchmark SWE-bench Verified, GPT-5 scored 74.9%, a notable improvement over its predecessor o3’s 69.1%, and it reached 88% on Aider Polyglot for multi-language coding. In practice, the model can scaffold complete full-stack applications from a single prompt, handling everything from dependency installation to live UI previews, and it excels at complex front-end generation. Crucially, GPT-5 dramatically reduces hallucinations. With web search enabled, its responses are roughly 45% less likely to contain factual errors than GPT-4o’s, and in its dedicated reasoning mode the reduction reaches roughly 80% relative to OpenAI o3. Practical tests bear this out: on open-ended fact-seeking prompts, GPT-5 produced six times fewer hallucinations, and when asked about missing images it gave confident but incorrect answers only 9% of the time, versus 86.7% for o3.

Beyond text, GPT-5 pushes multimodal boundaries, setting a new state of the art of 84.2% on the MMMU benchmark for visual reasoning. It can interpret images, charts, and diagrams with high accuracy, generate or edit front-end assets, create SVG animations, and even build 3D games on the fly. The ChatGPT voice interface now sounds more natural and human-like, interprets camera feeds, and adapts its reply style dynamically. For developers, the API introduces parameters such as reasoning_effort, which trades latency for depth, and verbosity, which controls how terse the output is. Custom tools now accept plain-text input rather than requiring JSON, and the context window has expanded to 400K tokens, double o3’s 200K, making the model well suited to synthesizing extensive documents.
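For a sense of how these knobs fit together, here is a minimal sketch using the OpenAI Python SDK’s Responses API, with parameter shapes as documented at launch; the prompt and settings are illustrative, so verify exact names against the current API reference:

```python
# Minimal sketch: calling GPT-5 through the OpenAI Python SDK's
# Responses API. Parameter shapes follow the launch documentation
# and may evolve; check the current API reference before relying on them.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5",
    input="Explain the trade-offs between B-trees and LSM-trees.",
    reasoning={"effort": "high"},  # spend more reasoning time for depth
    text={"verbosity": "low"},     # keep the final answer terse
)

print(response.output_text)
```

Dropping the reasoning effort to "minimal" is the lever for latency-sensitive paths, while verbosity changes answer length without rewriting the prompt.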

GPT-5 has been trained specifically to act as a collaborative AI teammate, exhibiting autonomy, communication, and context management. It provides upfront plans, offers progress updates, automatically runs tests, and can even self-debug while iterating on a build. Its ability to maintain context across long chains of tool calls is evidenced by a 70% score on Scale’s MultiChallenge benchmark, and it led Cursor to adopt GPT-5 as its default model. Early enterprise testers have already identified compelling use cases: Amgen uses it for deep reasoning over complex scientific data, BBVA has seen financial analysis tasks shrink from weeks to hours, and Oscar Health applies it to clinical reasoning, particularly mapping complex medical policies. The U.S. federal government plans to make it available to two million employees.
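To illustrate the agentic pattern those tool-call claims refer to, here is a hedged sketch of a function-calling loop with the Responses API; the run_tests tool and its stubbed result are hypothetical, and the item shapes follow the API’s standard function-calling pattern:

```python
# Sketch of an agentic tool loop with the Responses API. The run_tests
# tool and its stubbed result are hypothetical; the loop follows the
# standard function-calling pattern (verify shapes against the docs).
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "name": "run_tests",
    "description": "Run the project's test suite and return its output.",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}]

input_items = [{"role": "user", "content": "Fix the failing tests in ./src."}]

while True:
    response = client.responses.create(model="gpt-5", input=input_items, tools=tools)
    calls = [item for item in response.output if item.type == "function_call"]
    if not calls:                   # no more tool requests: final answer
        print(response.output_text)
        break
    input_items += response.output  # keep the model's turn in context
    for call in calls:
        args = json.loads(call.arguments)
        result = f"pytest {args['path']}: 2 failed, 40 passed"  # stub output
        input_items.append({
            "type": "function_call_output",
            "call_id": call.call_id,
            "output": result,
        })
```

The point of the 70% MultiChallenge figure is that the model keeps a loop like this coherent over many iterations, not just a single round trip.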

OpenAI has structured GPT-5’s pricing in tiers. The full-fidelity GPT-5 costs $1.25 per million input tokens and $10.00 per million output tokens and serves as the default for ChatGPT and the API. A more economical GPT-5 Mini sits alongside the heavily optimized GPT-5 Nano, which is roughly 25 times cheaper than the full model and designed for edge and latency-critical applications. Access is also tiered: free users start on GPT-5 and fall back to Mini once they hit usage caps, while Plus and Pro subscribers get progressively higher caps or unlimited usage. Team, Enterprise, and EDU accounts receive generous default access, and all verified organizations get immediate API access.
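To make the rates concrete, here is a back-of-the-envelope cost calculation; only the per-token rates come from the announcement, and the request sizes are hypothetical:

```python
# Back-of-the-envelope cost estimate at GPT-5's quoted rates.
INPUT_RATE = 1.25 / 1_000_000    # dollars per input token
OUTPUT_RATE = 10.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single API call."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical workload: summarize a 50K-token document in 2K tokens.
print(f"${request_cost(50_000, 2_000):.4f}")  # -> $0.0825
```

At these rates, even a near-context-window request of 400K input tokens costs about $0.50 before any output, which is why the Mini and Nano tiers matter for high-volume pipelines.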

On safety, GPT-5 introduces a “safe completions” approach that moves beyond outright refusal of sensitive requests: it aims to maximize helpfulness within safety boundaries, offering partial answers or explaining its limitations, particularly in “dual-use” domains, which cuts down on unhelpful boilerplate refusals. Despite these advances, early reactions are mixed. Its improved coding, reduced hallucinations, API refinements, and reported time savings have drawn praise, but some observers see GPT-5 as an incremental “GPT-4.5” rather than a revolutionary leap. Critics have also flagged “vibecharting” in the benchmark presentations, charts that visually exaggerate small gains such as a 0.4% improvement on SWE-bench over the prior state of the art. Technical errors in the demos, like an incorrect explanation of the Bernoulli effect, have fueled skepticism about the “PhD-level” framing. And questions persist about whether GPT-5 is truly a unified model or a clever orchestration of separate models behind a router, which could limit its advantages for latency-sensitive applications.