Databricks: Evaluate Agentic AI by Behavior, Not Just Data

Fast Company

Over the past five years, rapid advances in artificial intelligence models’ data processing and reasoning capabilities have fueled a relentless pursuit among enterprise and industrial developers: building ever-larger models and chasing ever-higher benchmark scores. Now, as agentic AI emerges as the anticipated successor to generative AI, demand for smarter, more nuanced AI agents is escalating. Yet, paradoxically, the prevailing measure of an AI’s intelligence too often remains simplistic, tied merely to model size or the sheer volume of training data.

Data analytics and AI company Databricks contends that this AI arms race misses a crucial point. In a production environment, the true measure of an AI’s value is not what it “knows” in the abstract, but how effectively it performs when stakeholders depend on it. Jonathan Frankle, Databricks’ chief AI scientist, emphasizes that genuine trust and a tangible return on investment from AI models stem directly from their behavior in real-world production settings, not from the sheer quantity of information they contain.

Unlike traditional software, which operates on deterministic rules to produce predictable outputs, AI models generate probabilistic results. This inherent characteristic fundamentally changes how they must be evaluated. “The only thing you can measure about an AI system is how it behaves. You can’t look inside it. There’s no equivalent to source code,” Frankle explains. He argues that while public benchmarks offer a useful snapshot of general capability, enterprises frequently over-rely on these broad metrics, mistaking them for indicators of real-world applicability.
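To make that distinction concrete, consider the minimal Python sketch below. It is a hypothetical toy, not Databricks code: `toy_model` fakes a generative model with a weighted random choice, while `rules_engine` plays the role of traditional deterministic software.

```python
import random

# Traditional software: identical inputs always produce identical
# outputs, so a single test case is conclusive.
def rules_engine(claim: str) -> str:
    return "approve" if "low risk" in claim else "escalate"

# Stand-in for a generative model: the same input yields a *sampled*
# output, much as an LLM samples tokens from a probability distribution.
def toy_model(claim: str) -> str:
    return random.choices(
        ["approve", "deny", "escalate"], weights=[0.7, 0.2, 0.1]
    )[0]

# The only stable measurement of the probabilistic system is its
# aggregate behavior: a rate over many trials, not any single answer.
trials = [toy_model("low risk claim") for _ in range(1_000)]
print(f"Approval rate over 1,000 trials: {trials.count('approve') / 1_000:.1%}")
```

A single passing run of `toy_model` proves almost nothing; only the distribution of its outputs can be tested, which is why evaluation has to target behavior.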

Frankle asserts that what truly matters is rigorous, continuous evaluation against business-specific data. Such precise assessment is vital for measuring output quality, refining model behavior, and effectively guiding reinforcement learning strategies that allow AI to improve over time. He criticizes a common, informal approach to AI deployment: “Today, people often deploy agents by writing a prompt, trying a couple of inputs, checking their vibes, and deploying. We would never do that in software—and we shouldn’t do it in AI, either.” This casual methodology, he suggests, is a recipe for unreliable performance and a barrier to realizing AI’s full potential.
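As a rough illustration of the alternative, here is a minimal evaluation-harness sketch in Python. Every name in it is assumed for illustration and is not Databricks tooling: `run_agent` stands in for whatever agent is under test, and each `EvalCase` pairs a business-specific input with a programmatic check, so a release can be gated on a measured pass rate rather than a couple of spot checks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str                   # a real business-specific input
    check: Callable[[str], bool]  # programmatic acceptance test

def run_agent(prompt: str) -> str:
    # Placeholder: wire this to the agent under evaluation.
    return "Per the contract, payment terms are net 30 days."

def pass_rate(cases: list[EvalCase], trials_per_case: int = 20) -> float:
    # Repeat each case: probabilistic outputs call for a rate,
    # not a one-off spot check.
    results = [
        case.check(run_agent(case.prompt))
        for case in cases
        for _ in range(trials_per_case)
    ]
    return sum(results) / len(results)

cases = [
    EvalCase("What are the payment terms on invoice 4471?",
             lambda out: "net 30" in out.lower()),
    EvalCase("What is the refund window for damaged goods?",
             lambda out: "30 days" in out.lower()),
]

# Gate the release on a measured threshold instead of "vibes."
if pass_rate(cases) < 0.95:
    print("Agent is below the quality bar; do not deploy.")
```

The number this produces is the kind of repeatable, business-specific measurement Frankle argues should replace informal checks before an agent ships.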

Ultimately, the shift in focus that Databricks advocates represents a maturation of the AI industry. It moves beyond the allure of raw computational power and data volume toward a more pragmatic, performance-driven approach, in which an AI’s true intelligence is proven through its reliable, beneficial actions in the complex landscape of real-world operations.