Ai2 Unveils MolmoAct: AI Model for 3D Spatial Robot Reasoning
The Allen Institute for AI (Ai2) has unveiled MolmoAct 7B, an embodied AI model designed to bridge the gap between powerful artificial intelligence and its practical application in the physical world. Unlike traditional AI systems that translate language instructions directly into movement, MolmoAct takes a fundamentally different approach: it visually perceives its environment, reasons about the relationships between space, motion, and time, and then plans its actions accordingly. It does this by transforming two-dimensional image inputs into three-dimensional spatial plans, giving robots greater understanding of, and control over, the physical world they navigate.
While spatial reasoning is not new to AI, most contemporary systems rely on proprietary, closed architectures trained on vast, often inaccessible datasets. Such models are typically difficult to reproduce, expensive to scale, and operate as opaque “black boxes.” MolmoAct, by contrast, offers a transparent and open alternative, having been trained entirely on publicly available data. Its design prioritizes real-world generalization and interpretability; its step-by-step visual reasoning traces allow users to preview a robot’s intended actions and intuitively guide its behavior in real time as conditions evolve.
“Embodied AI needs a new foundation that prioritizes reasoning, transparency, and openness,” stated Ali Farhadi, CEO of Ai2. “With MolmoAct, we’re not just releasing a model; we’re laying the groundwork for a new era of AI, bringing the intelligence of powerful AI models into the physical world. It’s a step toward AI that can reason and navigate the world in ways that are more aligned with how humans do — and collaborate with us safely and effectively.”
MolmoAct represents the inaugural release in a new class of models Ai2 terms Action Reasoning Models (ARMs). An ARM is designed to interpret high-level natural language instructions and logically sequence physical actions to execute them in the real world. Unlike conventional end-to-end robotics models that might treat a complex task as a single, undifferentiated command, ARMs break down high-level instructions into a transparent chain of spatially grounded decisions. This layered reasoning process involves three key stages: first, 3D-aware perception, which grounds the robot’s understanding of its environment using depth and spatial context; second, visual waypoint planning, outlining a step-by-step task trajectory within the image space; and finally, action decoding, which converts the visual plan into precise, robot-specific control commands. This sophisticated approach allows MolmoAct to interpret a command like “Sort this trash pile” not as a singular action, but as a structured series of sub-tasks: recognizing the scene, grouping objects by type, grasping them individually, and repeating the process.
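To make that three-stage flow concrete, the following Python sketch mirrors the perception, waypoint-planning, and action-decoding structure described above using placeholder logic. The function names, camera parameters, synthetic depth map, and left-to-right trajectory are assumptions for illustration only and do not reflect Ai2's actual MolmoAct implementation or API.

```python
# Conceptual sketch of the three-stage Action Reasoning Model (ARM) flow.
# All names, shapes, and camera parameters are illustrative placeholders.
import numpy as np

FX = FY = 600.0          # assumed pinhole focal lengths (pixels)
CX, CY = 320.0, 240.0    # assumed principal point for a 640x480 image

def perceive_3d(rgb):
    """Stage 1: 3D-aware perception -- attach depth/spatial context to the 2D image.
    A flat synthetic depth map stands in for a learned depth estimate."""
    h, w, _ = rgb.shape
    return np.full((h, w), 0.8)  # 0.8 m everywhere, purely for illustration

def plan_waypoints(instruction, depth):
    """Stage 2: visual waypoint planning -- a step-by-step trajectory in image space.
    A real ARM would condition on the instruction and scene; this just sweeps left to right."""
    v = depth.shape[0] // 2
    return [(u, v, float(depth[v, u])) for u in range(100, 600, 100)]

def decode_actions(waypoints):
    """Stage 3: action decoding -- convert image-space waypoints into 3D targets
    a robot-specific controller could consume (pinhole back-projection)."""
    targets = []
    for u, v, z in waypoints:
        x = (u - CX) * z / FX
        y = (v - CY) * z / FY
        targets.append((x, y, z))
    return targets

if __name__ == "__main__":
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)        # stand-in camera frame
    depth = perceive_3d(rgb)
    plan = plan_waypoints("sort this trash pile", depth)
    print(decode_actions(plan))                          # 3D targets in meters
```

The key design point the sketch tries to capture is that the plan lives in image space first, grounded by depth, and only the final stage commits to robot-specific control commands.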
MolmoAct 7B, the initial model in its family, was trained on a meticulously curated dataset comprising approximately 12,000 “robot episodes” captured from real-world environments such as kitchens and bedrooms. These demonstrations were transformed into robot-reasoning sequences, illustrating how complex instructions map to concrete, goal-directed actions. Ai2 researchers dedicated months to curating videos of robots performing diverse household tasks, from arranging pillows on a living room couch to putting away laundry in a bedroom.
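As a rough illustration of what a "robot episode" turned into a reasoning sequence might look like as data, here is a minimal schema sketch. The field names and structure are assumptions made for this example, not Ai2's published data format.

```python
# Hypothetical schema for a demonstration episode rewritten as a reasoning sequence.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EpisodeStep:
    image_path: str                   # camera frame captured during the demonstration
    waypoints: List[Tuple[int, int]]  # image-space trajectory for this step (assumed encoding)
    action: List[float]               # low-level command actually executed

@dataclass
class RobotEpisode:
    instruction: str                  # e.g. "put away the laundry"
    scene: str                        # e.g. "bedroom", "kitchen"
    steps: List[EpisodeStep]          # ordered, goal-directed sub-actions
```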
Remarkably, MolmoAct achieves this sophisticated performance with notable efficiency. The model was pre-trained on approximately 18 million samples over roughly 24 hours on 256 NVIDIA H100 GPUs, then fine-tuned in just two hours on 64 GPUs. This stands in stark contrast to many commercial models that demand hundreds of millions of samples and significantly greater computational resources. Despite its lean training, MolmoAct has demonstrated superior performance on key benchmarks, including a 71.9% success rate on SimPLER, underscoring that high-quality data and thoughtful design can surpass models trained with far more extensive data and compute.
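For a sense of scale, a back-of-the-envelope calculation from the figures quoted above (and nothing beyond them) puts the total compute budget in the low thousands of GPU-hours:

```python
# GPU-hour estimate from the quoted training figures.
pretrain_gpu_hours = 256 * 24   # 24 h on 256 H100s = 6,144 GPU-hours
finetune_gpu_hours = 64 * 2     # 2 h on 64 GPUs   =   128 GPU-hours
print(pretrain_gpu_hours + finetune_gpu_hours)  # 6272 GPU-hours in total
```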
In line with Ai2’s mission, MolmoAct is built for transparency, a critical departure from the opaque nature of most robotics models. Users can preview the model’s planned movements before execution, with motion trajectories overlaid on camera images. These plans can be adjusted using natural language commands or quick sketching corrections on a touchscreen, offering fine-grained control and enhancing safety in real-world applications within homes, hospitals, and warehouses. Furthermore, MolmoAct is fully open-source and reproducible; Ai2 is releasing all necessary components to build, run, and extend the model, including training pipelines, pre- and post-training datasets, model checkpoints, and evaluation benchmarks. Ai2 plans to expand testing across both simulated and real-world environments, aiming to set a new standard for embodied AI that is safe, interpretable, adaptable, and truly open, and to foster the development of more capable and collaborative AI systems.
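In the spirit of the trajectory-preview interface described above, the short sketch below overlays a planned image-space path on a camera frame so it can be reviewed before execution. The waypoints, colors, and OpenCV-based rendering are assumptions chosen for illustration, not MolmoAct's actual visualization code.

```python
# Illustrative preview: draw a planned image-space trajectory on a camera frame.
import numpy as np
import cv2

frame = np.zeros((480, 640, 3), dtype=np.uint8)               # stand-in camera image
waypoints = [(120, 300), (220, 280), (330, 260), (450, 250)]  # hypothetical planned path

for (u1, v1), (u2, v2) in zip(waypoints, waypoints[1:]):
    cv2.line(frame, (u1, v1), (u2, v2), (0, 255, 0), 2)       # planned motion segments
for u, v in waypoints:
    cv2.circle(frame, (u, v), 5, (0, 0, 255), -1)             # individual waypoints

cv2.imwrite("planned_trajectory_preview.png", frame)          # review before the robot moves
```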