GPT-5's Vision: Frontier VLM, Not New SOTA
OpenAI’s much-anticipated GPT-5 has recently undergone a rigorous evaluation of its vision and visual reasoning capabilities, with researchers from Roboflow putting the new model through its paces. While GPT-5 demonstrates formidable advancements in general visual understanding, the initial assessment suggests its performance on visual recognition and localization tasks is on par with the best models currently available, rather than establishing a new state of the art. Interestingly, the evaluation revealed that GPT-5-Mini achieved identical vision scores to its larger counterpart, a testament to what the evaluators describe as an effective model router at work.
The integration of robust visual understanding into large language models (LLMs) has long been a significant hurdle. Many models still struggle with seemingly simple tasks, such as accurately counting specific objects in a photograph or precisely identifying the location of items within an image. Yet the ability of LLMs to interpret and interact with the real world in real time is considered a critical breakthrough, one that would pave the way for autonomous robotics, more intuitive human-computer interaction, and potentially personalized superintelligence.
The current landscape of vision-language models (VLMs) includes offerings from major players like OpenAI (GPT and ‘o’ series), Google (Gemini), Anthropic (Claude), and Meta (Llama). These models exhibit varying strengths and weaknesses across different visual tasks. Generally, they perform well on straightforward challenges such as reading text from signs, receipts, or CAPTCHAs, and understanding colors. However, more complex demands—including precise counting, spatial understanding, detailed object detection, and comprehensive document analysis—reveal significant performance inconsistencies, particularly when the underlying pre-training data might lack sufficient examples for these specific scenarios.
To address the challenges of comparing performance across diverse tasks, Roboflow launched Vision Checkup, an open-source evaluation leaderboard designed to assess “hard task frontier performance.” OpenAI models consistently dominate this leaderboard, with GPT-5 now securing a spot among the top five. This strong showing is primarily attributed to the models’ advanced reasoning capabilities, developed during extensive pre-training and applied at test time. This marks a crucial evolution in multimodal LLMs: the enhanced ability to reason across both textual and visual information. Nevertheless, scores can fluctuate because reasoning models are non-deterministic, so the same prompt might yield different answers. Real-world deployment of image reasoning also faces practical limitations: processing an image can take upwards of 10 seconds, and that latency, combined with answer variability, makes these models difficult to rely on for real-time applications. Developers often face a trade-off between speed and comprehensive capability, sometimes opting for faster, more narrowly focused models.
To move beyond general “vibe checks” and provide a more rigorous assessment of how well LLMs truly comprehend the real world, Roboflow introduced a new benchmark at this year’s CVPR conference: RF100-VL. This benchmark comprises 100 open-source datasets featuring object detection bounding boxes, multimodal few-shot instructions, visual examples, and rich textual descriptions across novel image domains. Results are reported as mAP50:95, the standard object detection metric that averages precision over intersection-over-union (IoU) thresholds from 0.50 to 0.95. On RF100-VL, most top LLMs have scored below 10; Google’s Gemini 2.5 Pro currently leads the pack among LLMs with a zero-shot mAP50:95 of 13.3.
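For readers unfamiliar with the metric, mAP50:95 gives a prediction credit only when it overlaps a ground-truth box tightly enough, and it repeats that check at increasingly strict IoU thresholds before averaging. The sketch below is a deliberately simplified, single-class illustration of that IoU sweep (greedy matching, no confidence ranking); it is not the evaluation code behind RF100-VL.

```python
# Minimal sketch of the IoU-threshold sweep behind mAP50:95 (COCO-style).
# Simplified: single image, single class, greedy matching, no confidence
# ranking, so it illustrates the idea rather than reproducing the benchmark.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_at_threshold(preds, gts, thresh):
    """Fraction of predictions matching an unused ground-truth box at IoU >= thresh."""
    matched, used = 0, set()
    for p in preds:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(gts):
            if j in used:
                continue
            score = iou(p, g)
            if score > best_iou:
                best_j, best_iou = j, score
        if best_j is not None and best_iou >= thresh:
            matched += 1
            used.add(best_j)
    return matched / len(preds) if preds else 0.0

# mAP50:95 averages over IoU thresholds 0.50, 0.55, ..., 0.95.
thresholds = [0.50 + 0.05 * i for i in range(10)]
preds = [[48, 30, 122, 110]]   # predicted box (toy example)
gts = [[50, 32, 120, 108]]     # ground-truth box
scores = [precision_at_threshold(preds, gts, t) for t in thresholds]
print(sum(scores) / len(scores))  # loose, simplified stand-in for mAP50:95
```

A box that is only slightly misaligned still passes the 0.50 threshold but fails the stricter ones, which is why models that “roughly” localize objects can still post very low mAP50:95 scores.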
In stark contrast, GPT-5 registered an mAP50:95 score of just 1.5 on the RF100-VL benchmark. This significant disparity is largely attributed to an apparent lack of object detection-specific data in GPT-5’s pre-training. For instance, in an evaluation involving a volleyball dataset, GPT-5 demonstrated a clear understanding of the image’s content, correctly identifying a ball, blockers, and defenders. However, it consistently failed to localize these objects accurately, with bounding boxes often misaligned or incorrectly sized. This pattern, also observed in other datasets such as one featuring sheep, indicates that while the model comprehends the visual scene, it struggles to “ground” specific objects within it, a direct consequence of insufficient object detection pre-training. GPT-5 likewise showed no significant improvement in quality when evaluated on UI element datasets.
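To make the failure mode concrete, the sketch below shows the kind of zero-shot box prompting such an evaluation relies on: the model is asked to return labeled pixel coordinates, which are then parsed and scored against ground-truth boxes. The prompt wording, the "gpt-5" model identifier, the file name, and the expected JSON schema are illustrative assumptions, not Roboflow’s actual harness.

```python
# Rough sketch of zero-shot bounding-box prompting with the OpenAI Python SDK.
# Model name, prompt, and JSON schema are assumptions for illustration only.
import base64
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("volleyball.jpg", "rb") as f:  # hypothetical test image
    image_b64 = base64.b64encode(f.read()).decode()

prompt = (
    "Detect every ball, blocker, and defender in this image. "
    "Reply with JSON only: a list of objects with keys "
    "'label' and 'box' ([x1, y1, x2, y2] in pixels)."
)

response = client.chat.completions.create(
    model="gpt-5",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

# Assumes the model returns raw JSON with no surrounding prose or code fences.
detections = json.loads(response.choices[0].message.content)
for det in detections:
    # Each predicted box would then be scored against ground truth,
    # e.g. with an IoU sweep like the one sketched earlier.
    print(det["label"], det["box"])
```

In a setup like this, a model can name every object correctly yet still score near zero if the returned coordinates drift off the actual objects, which matches the volleyball and sheep examples described above.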
While GPT-5 does represent a slight improvement over previous OpenAI models such as GPT-4o on simpler visual tasks, and benefits from more detailed instructions, its performance on RF100-VL highlights a critical distinction: comprehension does not equate to precise localization. The enhanced reasoning capabilities that propel GPT-5 near the top of the Vision Checkup leaderboard do not translate into better object detection on RF100-VL, even when “reasoning effort” is increased. This underscores a clear path forward for vision-language models: the next generation must not only process visual information more deeply but also accurately pinpoint and understand objects within their real-world context, moving beyond abstract comprehension to tangible, localized understanding.