GENEVA, SWITZERLAND – JULY 06: Ameca humanoid robot by British manufacturer Engineered Arts interacts with visitors on July 06, 2023 in Geneva, Switzerland. Some 3,000 global experts from major technology, education and international organizations will gather at a two-day summit in Geneva organized by the United Nations to discuss artificial intelligence in its potential to empower humanity. (Photo by Johannes Simon/Getty Images)
Robotics companies raised over $10 billion in 2025, yet the models that power their robots are trained on fewer than 5,000 hours of combined open-source, real-world interaction data. Language models consume trillions of tokens scraped from the web. Physical AI has no equivalent. Each training example must be collected physically, one robot manipulation at a time.
This asymmetry is now the most expensive problem in artificial intelligence.
The limitation is structural. Unlike text or images, robotic manipulation data cannot be scraped from the internet. It requires embodied hardware, human demonstrators, and annotators who understand task structure, failure modes, and semantic intent. Closing this gap is what makes data labeling for physical AI a distinct market from anything that has come before it.
The Venture Thesis
Investors have noticed. Robotics funding reached $8.5 billion in 2025 through September alone. But the dollars are concentrated almost entirely in foundation model developers, hardware makers and humanoid startups. The infrastructure layer that makes these models trainable, namely the physical-world data supply chain, remains underfunded relative to the size of the problem.
Bessemer Venture Partners made this explicit in its April 2026 robotics outlook, in which a former Waymo researcher wrote that the data problem in robotics is far from solved. Closing the gap between 99% and 99.9% confidence is a steep hill that takes longer than most investors realize.
Scale AI jumped on the opportunity early. The company launched its Physical AI Data Engine in September 2025, logging over 100,000 production hours in its San Francisco lab with customers including Physical Intelligence and Cobot. Meta's $14.3 billion acquisition of a 49% stake in Scale in June 2025, at a valuation of $29 billion, made the data infrastructure bet clear: whoever controls the ground truth for physical AI controls the training flywheel.
Market Map: Three Competing Approaches
Three different strategies are now competing to become the standard data stack for physical AI:
Data Labeling for Physical AI Market Map
Josipa Majic Predin
The real-world approach is based on a simple claim: robots learn skills by watching humans. Scale AI's collection infrastructure captures these demonstrations at industrial scale, combining them with semantic annotations that encode intent and failure modes. Physical Intelligence invested heavily in its own data collection, gathering proprietary interaction data on eight robot embodiments before releasing its pi-zero foundation model.
Emerging players are taking the approach further. Ground Truth Machine (groundtruthmachine.com) treats physiological signals as a calibration layer on top of behavioral demonstrations, capturing the gap between what a demonstrator intends and what their body actually does. This signal, absent from any major existing dataset, is what the company calls a "lack of authenticity": the measurable deviation between the explicit task instruction and the implicit physiological ground truth. For training robots to handle edge cases in real human environments, this discrepancy can be the most informative data point in the stack.
NVIDIA's synthetic bet is the largest in raw compute terms. Isaac Sim, combined with the Cosmos world foundation models, allows developers to create physics-accurate robot trajectories from a single image and a language instruction. The GR00T-Dreams project, announced at GTC in March 2026, generates synthetic motion datasets without requiring teleoperation data. Microsoft Azure and Nebius have incorporated NVIDIA's Physical AI Data Factory project, with FieldAI, Teradyne and Hexagon Robotics already running on it.
The open-source community is the wild card. Hugging Face's LeRobot library has become the community standard for lightweight recording and playback of robot data. NVIDIA's open physical AI datasets on Hugging Face have been downloaded 4.8 million times. These datasets lower the floor for academic labs and startups, but they don't solve the quality problem. Roboflow's active learning pipeline highlights the issue directly: inconsistent labels early in the pipeline produce inconsistent behavior during development, and the problem is difficult to resolve downstream.
Where Does the Money Go Next?
The real question for investors is not which single approach wins. Foundation models need both real and synthetic data at different stages of training: synthetic for variety and scale, real for skill and failure recovery. Goldman Sachs projects that cumulative investment in humanoids will exceed $50 billion by 2030. The proportion of this capital flowing into data infrastructure, currently a small fraction, will need to grow.
China is already on the move. Scale AI's Max Fenkell told a House subcommittee on cybersecurity in 2026 that the US is winning on the quality of AI models but losing on data and implementation, citing China's strategy of funding miles of warehouse facilities dedicated to collecting and labeling robot training data.
For founders building in this space, the structural advantage is provenance. Companies that maintain a tight chain of provenance for their data, covering who labeled what, under what conditions, and with what signal mix, have a moat that grows with each deployment. This is a harder advantage to replicate than any set of model weights. Companies building this infrastructure, from industrial-scale annotation engines to biosignal-augmented ground truth platforms such as Ground Truth Machine, are building what the physical AI stack cannot train without.
