Ai2's MolmoAct: 3D Reasoning AI Challenges Nvidia, Google in Robotics

VentureBeat

The rapidly evolving field of physical AI, where robotic systems integrate with advanced foundation models, is attracting significant investment and research from tech giants like Nvidia, Google, and Meta. Now, the Allen Institute for AI (Ai2) is challenging these industry leaders with the release of MolmoAct 7B, a new open-source model designed to empower robots with sophisticated spatial reasoning capabilities. Unlike many conventional vision-language-action (VLA) models that primarily process information in a two-dimensional context, MolmoAct is engineered to “reason in space,” effectively thinking in three dimensions.

Ai2 classifies MolmoAct as an Action Reasoning Model, a category where foundation models engage in spatial reasoning to understand and plan actions within a physical, three-dimensional environment. This means MolmoAct can leverage its reasoning capabilities to comprehend the physical world around it, determine how it occupies space, and subsequently execute appropriate actions.

This spatial understanding is achieved through a novel approach involving “spatially grounded perception tokens.” These tokens, which are pre-trained and extracted from visual inputs such as video using a vector-quantized variational autoencoder, differ fundamentally from the text-based inputs typically used by VLA models. By encoding geometric structure and estimating distances between objects, MolmoAct gains a comprehensive grasp of its physical surroundings. Once it has assessed these distances, the model predicts a sequence of “image-space” waypoints that map out a potential path. This spatial plan then translates into specific physical actions, such as adjusting a robotic arm by a few inches or stretching it out.
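To make that pipeline concrete, the sketch below mocks up the three stages described above: discretizing an image into perception tokens via a VQ-VAE-style codebook lookup, predicting image-space waypoints, and converting those waypoints into small end-effector displacements. Every function name, shape, and constant here is an illustrative assumption, not Ai2’s released code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a VQ-VAE codebook: 512 discrete "perception token" embeddings.
CODEBOOK = rng.normal(size=(512, 64))


def encode_perception_tokens(image: np.ndarray) -> np.ndarray:
    """Map an image to an 8x8 grid of discrete token ids via nearest-codebook lookup."""
    h, w, _ = image.shape
    # Pretend encoder: average-pool the image into an 8x8 grid of features,
    # then pad the channel dimension out to the codebook width (64).
    feats = image.reshape(8, h // 8, 8, w // 8, 3).mean(axis=(1, 3))
    feats = np.tile(feats, (1, 1, 22))[:, :, :64]
    dists = ((feats[:, :, None, :] - CODEBOOK) ** 2).sum(-1)  # distance to every code
    return dists.argmin(-1)  # (8, 8) array of token ids


def plan_waypoints(token_ids: np.ndarray, goal_px: tuple, steps: int = 5) -> list:
    """Predict a short sequence of image-space waypoints toward a goal pixel.

    The real model conditions this prediction on the perception tokens;
    this stub ignores them and simply interpolates from the image origin.
    """
    start, goal = np.zeros(2), np.asarray(goal_px, dtype=float)
    return [start + (goal - start) * t / steps for t in range(1, steps + 1)]


def waypoints_to_actions(waypoints: list, px_to_m: float = 0.002) -> list:
    """Convert image-space waypoints into incremental end-effector displacements."""
    actions, prev = [], np.zeros(2)
    for wp in waypoints:
        actions.append((wp - prev) * px_to_m)  # small metric step toward the waypoint
        prev = wp
    return actions


image = rng.random((256, 256, 3))                   # dummy RGB observation
tokens = encode_perception_tokens(image)            # discrete perception tokens
path = plan_waypoints(tokens, goal_px=(200, 120))   # image-space plan
for step in waypoints_to_actions(path):
    print("move end-effector by", np.round(step, 3), "m in (x, y)")
```

The point the sketch tries to capture is that the action head does not plan over raw pixels: it works from geometry-aware tokens and image-space waypoints, which are only converted into metric robot motions at the final step.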

Internal benchmarking conducted by Ai2 revealed that MolmoAct 7B achieved a task success rate of 72.1%, outperforming rival models from Google, Microsoft, and Nvidia. Remarkably, Ai2’s researchers noted that MolmoAct could adapt to diverse robotic embodiments, from mechanical arms to humanoid forms, with only minimal fine-tuning. Furthermore, the model is being released open-source under an Apache 2.0 license, with its training datasets made available under CC BY-4.0, a move praised by the wider AI community for fostering collaborative development.
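For readers who want to experiment with the open release, a minimal loading sketch follows. It assumes the checkpoint is published on the Hugging Face Hub and follows the standard transformers loading convention used by Ai2’s earlier Molmo models; the repository id shown is hypothetical, so consult Ai2’s release notes for the actual name and usage instructions.

```python
# Minimal sketch, assuming a standard Hugging Face transformers loading pattern.
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "allenai/MolmoAct-7B"  # hypothetical repository id; check Ai2's release page
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```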

While MolmoAct’s capabilities are broadly applicable wherever machines need to interact with physical environments, Ai2 envisions its primary impact in home settings. This environment, characterized by its inherent irregularity and constant change, presents the most significant challenges for robotics, making it an ideal proving ground for MolmoAct’s advanced spatial reasoning.

The pursuit of more intelligent and spatially aware robots has long been a foundational dream in computer science. Historically, developers faced the arduous task of explicitly coding every single robotic movement, leading to rigid and inflexible systems. The advent of large language models (LLMs) has revolutionized this paradigm, enabling robots to dynamically determine subsequent actions based on their interactions with objects. For instance, Google Research’s SayCan helps robots reason about tasks using an LLM, guiding them to determine the sequence of movements required to achieve a goal. Similarly, Meta and New York University’s OK-Robot utilizes visual language models for movement planning and object manipulation, while Nvidia has proclaimed physical AI to be the “next big trend,” releasing models like Cosmos-Transfer1 to accelerate robotic training.

Alan Fern, a professor at the Oregon State University College of Engineering, views Ai2’s research as a “natural progression in enhancing VLMs for robotics and physical reasoning.” While acknowledging it may not be “revolutionary,” he emphasized it as “an important step forward in the development of more capable 3D physical reasoning models.” Fern highlighted MolmoAct’s focus on “truly 3D scene understanding” as a significant positive shift from 2D reliance, though he cautioned that current benchmarks remain “relatively controlled and toyish,” not fully capturing real-world complexity. Despite this, he expressed eagerness to test the model on his own physical reasoning tasks. Daniel Maturana, co-founder of the startup Gather AI, lauded the open-source nature of the data, noting its value in reducing the high costs associated with developing and training such models, thereby providing a “strong foundation to build on” for academic labs and hobbyists alike.

Despite the current limitations in real-world demonstrations, growing interest in physical AI points to a burgeoning field. As general physical intelligence, which would remove the need to program each robot individually, becomes more attainable, the robotics landscape is poised for rapid and exciting advances.