AI Product Shipping: Practical Lessons from a Failed Startup
In a field where hype often outpaces practical application, Phil Calçado offers a sobering post-mortem of his failed AI startup, Outropy. Speaking at a recent InfoQ conference, Calçado, a veteran software engineer known for pioneering microservices at SoundCloud, shared candid lessons on the realities of shipping generative AI products beyond the buzz. His core message: the keys to successful AI development lie not in chasing futuristic visions, but in rigorously applying established software engineering principles.
Calçado began by acknowledging his own bias: three decades of experience building software, particularly distributed systems and microservices, with a strong inclination towards iterative, “get-stuff-done” agile development. This perspective, he admits, shapes his views on AI, which he believes should not be exempt from these foundational practices.
Outropy, Calçado’s venture, aimed to automate aspects of managerial and engineering workflows using generative AI, starting as a Slack chatbot and evolving into a Chrome extension. Despite being an early entrant in the generative AI space, attracting thousands of users, and even outperforming products from tech giants like Salesforce in quality (by his own benchmarks), the startup ultimately failed. Surprisingly, user feedback revealed that many users were less interested in the tool itself than in reverse-engineering how Outropy, built by “two guys and a dog,” had achieved such effective “agentic behavior” (autonomous, decision-making capabilities) while larger companies struggled. This paradox prompted Calçado to analyze why most AI products, especially in the productivity sector, fall short.
Calçado identifies three prevailing approaches to building AI today, each with its own pitfalls. The first is “Twitter-driven development,” characterized by an obsession with upcoming, yet-to-be-released models and a disregard for current technological limitations, often leading to flashy demos that secure funding but fail to deliver real-world value. The second treats AI development as a pure “data science project,” typically seen in larger companies. This method, often slow and research-focused, can take years to yield marginal improvements, a luxury unavailable when AI is on a product’s critical path. The third, and Calçado’s preferred approach, is to treat AI development as a traditional “engineering project,” embracing iterative development from the outset.
He then delved into the fundamental building blocks of generative AI systems: workflows and agents. Workflows, which he prefers to call “inference pipelines,” represent predefined sequences of steps to achieve an AI goal, such as summarizing an email. Agents, on the other hand, are semi-autonomous software components where Large Language Models (LLMs) dynamically direct their own processes, use tools, and collaborate, executing tasks to achieve a given goal.
For workflows, Calçado warns against the common pitfall of relying solely on Retrieval-Augmented Generation (RAG) vendors, which promise to feed context directly to LLMs. He found that LLMs often aren’t smart enough for this simplistic approach, necessitating additional steps to add structure and semantic meaning. Outropy’s success in daily briefings, for instance, came from breaking down complex tasks into smaller, structured transformations, much like data pipelines. This allows for the application of existing data pipeline tools and methodologies, grounding AI development in familiar engineering territory.
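The pipeline idea can be sketched in a few lines. This is a minimal illustration, not Outropy's actual code: the function names, the naive topic grouping, and the stubbed summarization step are all invented for the example. The point is the shape, a briefing built from small, structured, individually testable transformations rather than one opaque RAG-plus-prompt call.

```python
from dataclasses import dataclass

@dataclass
class Message:
    author: str
    text: str

def extract_topics(messages: list[Message]) -> dict[str, list[Message]]:
    """Step 1: group raw messages by a (deliberately naive) topic key."""
    topics: dict[str, list[Message]] = {}
    for m in messages:
        key = m.text.split()[0].lower() if m.text else "misc"
        topics.setdefault(key, []).append(m)
    return topics

def summarize_topic(topic: str, msgs: list[Message]) -> str:
    """Step 2: a real system would call an LLM here; we stub the output."""
    authors = ", ".join(sorted({m.author for m in msgs}))
    return f"{topic}: {len(msgs)} update(s) from {authors}"

def build_briefing(messages: list[Message]) -> str:
    """Step 3: compose the structured intermediate results into a briefing."""
    topics = extract_topics(messages)
    return "\n".join(summarize_topic(t, ms) for t, ms in sorted(topics.items()))
```

Because each stage takes and returns plain structured data, the usual data-pipeline tooling (unit tests, retries, observability per stage) applies directly, which is exactly the "familiar engineering territory" Calçado describes.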
When it comes to agents, Calçado makes a provocative assertion: “Agents are a lot like objects in object-oriented programming.” While acknowledging that traditional microservices are a poor fit for agents due to their statefulness, non-deterministic behavior, and data-intensive nature, he argues that the object-oriented paradigm—with concepts like memory (state), goal-orientation (encapsulation), dynamism (polymorphism), and collaboration (message passing)—provides a useful mental model for engineers.
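That mental model maps cleanly onto ordinary object-oriented code. The sketch below is hypothetical (the class names and stubbed behavior are invented), but it shows the correspondence Calçado draws: accumulated state as memory, an abstract method as the encapsulated goal, and subclassing as the polymorphic, dynamic behavior.

```python
from abc import ABC, abstractmethod

class Agent(ABC):
    """State (memory) + goal-directed behavior, collaborating via messages."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.memory: list[str] = []  # the agent's accumulated state

    def receive(self, message: str) -> str:
        self.memory.append(message)  # every message mutates internal state
        return self.act(message)

    @abstractmethod
    def act(self, message: str) -> str:
        """Subclasses decide how to pursue their goal (polymorphism)."""

class SummarizerAgent(Agent):
    def act(self, message: str) -> str:
        # A real agent would invoke an LLM here; we stub the behavior.
        return f"{self.name} summarized {len(self.memory)} message(s)"
```

The analogy is a thinking tool, not an implementation mandate: it suggests that decades of object-design intuition (small interfaces, tell-don't-ask, message passing over shared state) transfer to agent design.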
Architecturally, Calçado advises against point-to-point agent collaboration, which can lead to tight coupling and a reinvention of complex web service stacks from two decades ago. Instead, he advocates for “semantic events” on a message bus, like Redis or Kafka, where agents register interest in specific, well-defined events, promoting loose coupling and scalability. He also cautions against adopting emerging standards like Anthropic’s Model Context Protocol (MCP) for internal products, viewing them as reminiscent of early, over-engineered protocols like SOAP. For internal systems, he suggests sticking with empirically proven methods like RESTful architectures or gRPC.
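The pattern can be demonstrated with an in-memory stand-in for the bus; in production the transport would be Redis or Kafka, and the event name and payload below are invented for illustration. What matters is that agents register interest in well-defined event types and never reference one another directly.

```python
from collections import defaultdict
from typing import Callable

class SemanticEventBus:
    """Toy in-memory bus; Redis or Kafka would play this role in production."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        # Subscribers couple to the event contract, not to the publisher.
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = SemanticEventBus()
received: list[dict] = []
bus.subscribe("pull_request.merged", received.append)
bus.publish("pull_request.merged", {"repo": "example/app", "pr": 42})
```

Adding a new agent means adding a subscriber; no existing publisher changes, which is the loose coupling and scalability Calçado is after.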
Regarding “agentic memory,” the challenge of an agent retaining knowledge about a user, Calçado dismisses the common approach of storing all information in a long text document within a vector database. He argues that a faulty memory is worse than no memory. His recommended solution is “event sourcing,” where a stream of semantic events about a user is compacted into a structured representation, often stored in a graph database like Neo4j, allowing for a more robust and evolving understanding.
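A compact way to see the difference: instead of appending to a text blob, the system folds a stream of semantic events into a structured profile. The event kinds and profile fields below are hypothetical; a production system might persist the compacted result in a graph database such as Neo4j.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str   # e.g. "joined_team", "left_team", "mentioned_topic"
    value: str

@dataclass
class UserProfile:
    teams: set[str] = field(default_factory=set)
    topics: dict[str, int] = field(default_factory=dict)

def compact(events: list[Event]) -> UserProfile:
    """Fold the event stream into a structured, queryable representation."""
    profile = UserProfile()
    for e in events:
        if e.kind == "joined_team":
            profile.teams.add(e.value)
        elif e.kind == "left_team":
            profile.teams.discard(e.value)  # later events revise earlier ones
        elif e.kind == "mentioned_topic":
            profile.topics[e.value] = profile.topics.get(e.value, 0) + 1
    return profile
```

Because the raw events are retained, the compaction logic can be changed and replayed over history, so the memory evolves rather than accreting contradictions the way an ever-growing text document does.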
Finally, Calçado challenges the prevalent “monolithic pipeline” approach in data science projects, where an entire process from data ingestion to output is built as a single, highly coupled unit. He champions breaking down these workflows into smaller, independent components with well-defined interfaces, enabling flexibility and reusability—a concept familiar from domain-driven design and microservices.
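One lightweight way to express those well-defined interfaces in Python is structural typing with `typing.Protocol`. The stage names and stubbed data here are invented; the point is that the pipeline depends only on interfaces, so any stage can be swapped or reused independently.

```python
from typing import Protocol

class Ingester(Protocol):
    def ingest(self) -> list[str]: ...

class Transformer(Protocol):
    def transform(self, items: list[str]) -> list[str]: ...

class SlackIngester:
    def ingest(self) -> list[str]:
        return ["standup notes", "incident report"]  # stubbed data source

class UppercaseTransformer:
    def transform(self, items: list[str]) -> list[str]:
        # Stand-in for a real enrichment or inference step.
        return [item.upper() for item in items]

def run_pipeline(ingester: Ingester, transformer: Transformer) -> list[str]:
    # The orchestrator knows only the interfaces, never the concrete stages.
    return transformer.transform(ingester.ingest())
```

Swapping `SlackIngester` for, say, an email ingester requires no change to the pipeline or the other stages, which is the decoupling domain-driven design and microservices trained us to expect.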
He concludes by observing that despite the allure of “distributed objects,” the foundational principles of the “Twelve-Factor App” manifesto, which underpin modern cloud infrastructure, are often broken by agentic AI systems due to their inherent statefulness and non-deterministic nature. This necessitates a shift towards “durable workflows” (like those offered by Temporal), which handle resilience, retries, and checkpointing, preventing engineers from constantly reinventing these critical infrastructure components.
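Engines like Temporal provide durability, retries, and checkpointing as managed infrastructure; the toy sketch below only illustrates the core idea they implement, recording each completed step so a crashed workflow resumes where it left off instead of restarting. All names here are invented, and a real durable-workflow engine does far more (timers, signals, versioning).

```python
import json
import pathlib
from typing import Callable

def run_durable(steps: list[tuple[str, Callable[[], None]]],
                checkpoint_path: pathlib.Path) -> list[str]:
    """Run named steps in order, checkpointing progress after each one."""
    done: list[str] = []
    if checkpoint_path.exists():
        done = json.loads(checkpoint_path.read_text())  # resume prior progress
    for name, fn in steps:
        if name in done:
            continue            # this step finished before the crash; skip it
        fn()                    # may raise; the checkpoint preserves progress
        done.append(name)
        checkpoint_path.write_text(json.dumps(done))
    return done
```

If a step fails mid-run, rerunning the workflow skips the already-checkpointed steps, so expensive or non-idempotent work (such as an LLM call) is not repeated. Hand-rolling this per product is exactly the reinvention Calçado says durable-workflow platforms should absorb.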
Calçado’s ultimate takeaway is a powerful one: the complexity seen in current AI product architectures, such as Outropy’s, is often “overcomplicated” for the number of users served, highlighting a significant need for better platforms. Yet, he asserts that building successful AI products fundamentally boils down to applying time-tested software engineering wisdom. Engineers should leverage their existing knowledge and resist the urge to believe that AI, despite its hype, is fundamentally different from the challenges they’ve tackled before.