GPT-5: The Stone Age of AI Tools & AGI Progress
OpenAI’s highly anticipated GPT-5 has finally arrived, following nearly two years of industry speculation. As early access partners, we’ve had the opportunity to extensively test this new model across a variety of applications, from our own platform, Raindrop.ai, to development environments like Cursor and Codex. Our overarching impression? GPT-5 represents a profound leap towards Artificial General Intelligence (AGI), particularly in the realm of software engineering, where it demonstrates an exceptional ability to tackle complex applications and resolve intricate issues within vast codebases, often in a single attempt.
However, the narrative isn’t as straightforward as simply being “better” across the board. Surprisingly, GPT-5 underperforms its predecessors, GPT-4.5 and even GPT-4o, when it comes to writing. In many common tasks, it won’t immediately strike users as a super-genius. These apparent flaws, paradoxically, illuminate a fundamental shift in the journey towards AGI. To understand this, we must look back to the Stone Age.
What defines the dawn of human intelligence? It wasn’t winning a chess match or proving a complex theorem. The Stone Age is distinctly marked by one crucial development: humans learned to use tools. We shaped tools, and in turn, our tools shaped us, fundamentally altering our cognitive capabilities. Human intelligence, at its core, manifests through and is extended by tools. GPT-5 ushers in a new Stone Age for AI agents and large language models. This model doesn’t merely use tools; it thinks with them and builds with them.
Consider OpenAI’s “Deep Research” feature, a significant evolution from basic web search. While previous ChatGPT versions could search the web, Deep Research was taught to conduct research—planning, iterating, and exploring. Searching the web became an intrinsic part of its thought process. GPT-5 extends this philosophy to virtually any tool it can access, provided those tools are designed to be powerful, capable, and open-ended, often accepting natural language descriptions as input. Effective tools for GPT-5 generally fall into four categories: internal retrieval (like RAG systems or SQL queries), web search, code interpreters, and actions that produce side effects (such as editing files or triggering UI elements). A prime example of a powerful tool is web search itself, where GPT-5 decides what to search for, and the tool handles the how.
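To make the "powerful, open-ended tool" idea concrete, here is a minimal sketch of a web-search tool declared in the OpenAI function-calling schema. The tool name and description text are our own illustrative choices, not anything prescribed by the API; the key design point is a single natural-language parameter, so the model decides *what* to search for and the tool implementation handles the *how*.

```python
# A hedged sketch of an open-ended tool definition in the OpenAI
# function-calling schema. The name "web_search" and its description
# are hypothetical; the structure (one free-form string parameter)
# is what makes the tool "powerful" in the sense described above.
web_search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": (
            "Search the web for up-to-date information. "
            "Describe the information you need in plain language."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": (
                        "A natural-language description of what to find."
                    ),
                }
            },
            "required": ["query"],
        },
    },
}
```

A narrow tool with many rigid parameters forces the model to plan around the tool; a single free-form input lets the tool become part of the model's thought process.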
Another significant advancement is GPT-5’s proficiency in parallel tool calling. While earlier models technically possessed this capability, they rarely executed it correctly or consistently. GPT-5, however, demonstrates the intelligence to discern which tools can and should run simultaneously versus sequentially for a given task. This parallelization dramatically reduces latency and extends the model’s operational horizons, enabling entirely new product possibilities.
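On the application side, parallel tool calls only pay off if the client actually dispatches them concurrently. Below is a minimal sketch of that dispatch loop; the tool-call shape and the handler names are simplified placeholders rather than the exact API response format, but the pattern (run independent calls concurrently, key results by call id so they can be returned to the model in any order) is the general one.

```python
# A hedged sketch of concurrent dispatch for parallel tool calls.
# The call dicts and handler functions below are hypothetical
# stand-ins for real tool implementations.
from concurrent.futures import ThreadPoolExecutor

HANDLERS = {
    "web_search": lambda args: f"results for {args['query']}",
    "read_file": lambda args: f"contents of {args['path']}",
}

def run_tool(call):
    # Execute one tool call and tag the result with its call id.
    result = HANDLERS[call["name"]](call["args"])
    return call["id"], result

def run_parallel(tool_calls):
    # Independent calls run concurrently; the id -> result mapping
    # lets each result be matched back to the call that produced it.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run_tool, tool_calls))
```

When one turn requests, say, a web search and a file read, running them in parallel means the turn takes as long as the slowest call rather than the sum of both.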
Interacting with GPT-5 requires a shift in perspective. Instead of prompting a “model,” users must think of themselves as prompting an “agent.” Rather than pre-loading extensive context, the agent needs a “compass”—clear, structured guidance to navigate its environment. For instance, when working with GPT-5 in a large codebase, it’s crucial to specify the project’s purpose, relevant files, organizational structure, and clear criteria for task completion. If the model gets stuck, a simple “No, that is wrong” is less effective than asking, “What did we learn from trying that?” This approach mirrors teaching, as GPT-5, without intrinsic memory, needs to be onboarded to code standards and given hints for starting each task.
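The "compass" can be as simple as a consistently structured system prompt. The helper below is a sketch of one way to assemble it; the section names and example values are our own invention, but they mirror the four pieces of guidance above: purpose, relevant files, layout, and explicit completion criteria.

```python
# A hedged sketch of a "compass" prompt builder for agent tasks in a
# large codebase. Section headings and all example values are
# hypothetical; the structure is the point.
def compass_prompt(purpose, relevant_files, layout, done_when):
    sections = [
        f"Project purpose: {purpose}",
        "Relevant files:\n" + "\n".join(f"- {f}" for f in relevant_files),
        f"Repository layout: {layout}",
        f"The task is complete when: {done_when}",
    ]
    return "\n\n".join(sections)

prompt = compass_prompt(
    purpose="billing service for usage-based invoicing",
    relevant_files=["api/invoices.py", "api/models.py"],
    layout="monorepo; each service lives under services/<name>/",
    done_when="all existing tests pass and the new endpoint returns 200",
)
```

The completion criterion matters most: without it, an agent has no way to decide when to stop exploring and commit to an answer.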
Our observations confirm that GPT-5 is a highly practical, industry-oriented model, distinct from the more “academic” lean of some predecessors. It is remarkably instruct-able and literal, directly executing requests rather than exhibiting the distinct “personality” seen in models like Claude.
GPT-5’s coding prowess is its undeniable highlight. In a particularly challenging test involving nested dependency conflicts when integrating new SDKs, GPT-5 solved the problem in a single attempt, a feat that eluded Claude Opus and other advanced models. GPT-5 approached this like a seasoned researcher, examining folders, running diagnostic commands, taking notes, and pausing to reason when inconsistencies arose, ultimately editing the necessary lines across multiple directories with precision. This iterative, reasoning-based debugging was a stark contrast to other models’ trial-and-error approaches. Further demonstrations of its coding capabilities included generating a full Mac OS 9-themed website using pure HTML, CSS, and JavaScript, complete with a functional paint application and persistent data storage—all created from a single prompt and surprisingly robust. For production-ready applications, GPT-5 also excelled, generating a complex ClickHouse query and a full-stack website with a SQLite database in a single prompt, a task where other models often provided only plans or incomplete scaffolding.
The enhanced tool use, parallel processing, and cost efficiency of GPT-5 make it uniquely suited for developing long-running AI agents. Our company, an AI monitoring firm, has long sought to integrate a reliable agent into our product. GPT-5’s capabilities, including its improved recovery from tool call failures and its ability to discern when to generate graphs versus charts, have finally made this a practical reality, enabling a beta rollout to customers.
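"Improved recovery from tool call failures" still needs support on the application side: a failed call should come back to the agent as information it can reason about, not as a crash. The wrapper below is a minimal sketch of that pattern; the retry counts and backoff values are illustrative choices, not anything specific to GPT-5's API.

```python
# A hedged sketch of failure-tolerant tool execution for a
# long-running agent. Retry count and backoff are arbitrary
# illustrative values.
import time

def call_with_recovery(tool, args, retries=3, base_delay=0.01):
    # Retry transient failures with exponential backoff; after the
    # final attempt, surface the error text so the agent can reason
    # about what went wrong and adjust its next call.
    last_error = None
    for attempt in range(retries):
        try:
            return {"ok": True, "result": tool(**args)}
        except Exception as exc:
            last_error = str(exc)
            if attempt < retries - 1:
                time.sleep(base_delay * 2 ** attempt)
    return {"ok": False, "error": last_error}
```

Returning `{"ok": False, "error": ...}` instead of raising keeps the agent loop alive over long horizons: the model sees the error message as a tool result and can change strategy rather than abandoning the task.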
However, GPT-5 is not a strong writer. In fact, GPT-4.5 and DeepSeek R1 significantly outperform it. For professional writing, like refining LinkedIn posts, GPT-4.5 adheres more closely to the user’s tone, providing usable text, while GPT-5 tends towards a generic, “LinkedIn-slop” style. Similarly, for less structured, personal writing, GPT-4.5 maintains a more authentic tone, sounding less like typical LLM output.
In conclusion, our hands-on experience aligns with OpenAI’s official benchmarks: GPT-5 is unequivocally the world’s leading coding model. It has advanced the automation of software engineering from an estimated 65% completion to roughly 72%, marking the most significant leap since Claude 3.5 Sonnet. While developers will immediately grasp its profound impact, general users may not fully appreciate its capabilities until it is seamlessly integrated into everyday products over the coming months.