
How One AI Model Creates a Physical Intuition of Its Environment

Source: https://www.quantamagazine.org/how-one-ai-model-creates-a-physical-intuition-of-its-environment-20251003/

Summary:

In a recent advance in artificial intelligence, Meta's Video Joint Embedding Predictive Architecture (V-JEPA) model learns the regularities of the physical world on its own by watching videos, and, much like a human infant, it shows "surprise" at events that defy them. The work opens a new path toward AI systems with a physical intuition of their environment.

Unlike conventional AI models that analyze video pixel by pixel, V-JEPA works with abstract "latent representations." This lets it filter out irrelevant details, such as leaves swaying in the wind, and concentrate on key information like the color of a traffic light or the positions of nearby cars. The researchers liken the ability to infant cognitive development: with no physical rules built in, the model comes to grasp basic notions such as object permanence and the effects of gravity purely through observation.

In tests of intuitive physics, V-JEPA judged whether the events in a video were physically plausible with nearly 98% accuracy, far better than conventional models. When a ball rolls behind an occluder and fails to reappear as expected, the system produces a sharp spike in prediction error, a reaction strikingly similar to the cognitive surprise seen in infants. Micha Heilbron, a cognitive scientist at the University of Amsterdam, says the study shows that intuitive physics can be learned and need not be innately built in.

Although V-JEPA has already been applied to simple robotic manipulation tasks, its short-term memory remains limited. Quentin Garrido, a research scientist on the project, admits that the model can only handle a few seconds of video at a time; its memory, he says, is reminiscent of a goldfish. Next-generation models will need to overcome this bottleneck in temporal understanding before they can show more human-like competence in complex physical scenes.

The advance marks a step toward AI that grasps how the physical world works, opening new possibilities for fields that depend on interacting with an environment, such as autonomous driving and robotics.

English source:

How One AI Model Creates a Physical Intuition of Its Environment
Introduction
Here’s a test for infants: Show them a glass of water on a desk. Hide it behind a wooden board. Now move the board toward the glass. If the board keeps going past the glass, as if it weren’t there, are they surprised? Many 6-month-olds are, and by a year, almost all children have an intuitive notion of an object’s permanence, learned through observation. Now some artificial intelligence models do too.
Researchers have developed an AI system that learns about the world via videos and demonstrates a notion of “surprise” when presented with information that goes against the knowledge it has gleaned.
The model, created by Meta and called Video Joint Embedding Predictive Architecture (V-JEPA), does not make any assumptions about the physics of the world contained in the videos. Nonetheless, it can begin to make sense of how the world works.
“Their claims are, a priori, very plausible, and the results are super interesting,” says Micha Heilbron, a cognitive scientist at the University of Amsterdam who studies how brains and artificial systems make sense of the world.
Higher Abstractions
As the engineers who build self-driving cars know, it can be hard to get an AI system to reliably make sense of what it sees. Most systems designed to “understand” videos in order to either classify their content (“a person playing tennis,” for example) or identify the contours of an object — say, a car up ahead — work in what’s called “pixel space.” The model essentially treats every pixel in a video as equal in importance.
But these pixel-space models come with limitations. Imagine trying to make sense of a suburban street. If the scene has cars, traffic lights and trees, the model might focus too much on irrelevant details such as the motion of the leaves. It might miss the color of the traffic light, or the positions of nearby cars. “When you go to images or video, you don’t want to work in [pixel] space because there are too many details you don’t want to model,” said Randall Balestriero, a computer scientist at Brown University.
The V-JEPA architecture, released in 2024, is designed to avoid these problems. While the specifics of the various artificial neural networks that comprise V-JEPA are complex, the basic concept is simple.
Ordinary pixel-space systems go through a training process that involves masking some pixels in the frames of a video and training neural networks to predict the values of those masked pixels. V-JEPA also masks portions of video frames. But it doesn’t predict what’s behind the masked regions at the level of individual pixels. Rather, it uses higher levels of abstractions, or “latent” representations, to model the content.
Latent representations capture only essential details about data. For example, given line drawings of various cylinders, a neural network called an encoder can learn to convert each image into numbers representing fundamental aspects of each cylinder, such as its height, width, orientation and location. By doing so, the information contained in hundreds or thousands of pixels is converted into a handful of numbers — the latent representations. A separate neural network called a decoder then learns to convert the cylinder’s essential details into an image of the cylinder.
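To make the idea concrete, here is a minimal encoder-decoder sketch in Python with PyTorch (an assumed toolkit, not one named in the article); the image size, layer widths and four-number latent are illustrative stand-ins rather than the networks the researchers used.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 4  # e.g. height, width, orientation, position (hypothetical)

# Encoder: thousands of pixel values -> a handful of latent numbers.
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64, 256), nn.ReLU(),
    nn.Linear(256, LATENT_DIM),
)
# Decoder: latent numbers -> a reconstructed image.
decoder = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 64 * 64),
    nn.Unflatten(1, (1, 64, 64)),
)

images = torch.rand(8, 1, 64, 64)            # a batch of stand-in line drawings
latents = encoder(images)                    # shape (8, 4): the latent representations
reconstructions = decoder(latents)           # shape (8, 1, 64, 64)
loss = F.mse_loss(reconstructions, images)   # train both nets to reconstruct the input
```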
V-JEPA focuses on creating and reproducing latent representations. At a high level, the architecture is split into three parts: encoder 1, encoder 2, and a predictor. First, the training algorithm takes a set of video frames, masks the same set of pixels in all frames, and feeds the frames into encoder 1. Sometimes, the final few frames of the video are fully masked. Encoder 1 converts the masked frames into latent representations. The algorithm also feeds the unmasked frames in their entirety into encoder 2, which converts them into another set of latent representations.
Now the predictor gets into the act. It uses the latent representations produced by encoder 1 to predict the output of encoder 2. In essence, it takes latent representations generated from masked frames and predicts the latent representations generated from the unmasked frames. By re-creating the relevant latent representations, and not the missing pixels of earlier systems, the model learns to see the cars on the road and not fuss about the leaves on the trees.
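A rough sketch of a single training step in this style, again assuming PyTorch: tiny linear layers stand in for the two encoders and the predictor, masking is reduced to zeroing out a fixed region of each frame, and the loss compares latent representations rather than pixels. The names and details are illustrative, not Meta's implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FRAME_DIM, LATENT_DIM = 1024, 64               # flattened frame size, latent size (made up)

encoder_1 = nn.Linear(FRAME_DIM, LATENT_DIM)   # sees the masked frames
encoder_2 = nn.Linear(FRAME_DIM, LATENT_DIM)   # sees the unmasked frames
predictor = nn.Linear(LATENT_DIM, LATENT_DIM)  # maps one set of latents to the other

frames = torch.rand(16, FRAME_DIM)             # 16 video frames, flattened

# Mask the same region of every frame (here: simply zero out the second half).
mask = torch.ones(FRAME_DIM)
mask[FRAME_DIM // 2:] = 0.0
masked_frames = frames * mask

context_latents = encoder_1(masked_frames)      # latents from the masked frames
target_latents = encoder_2(frames)              # latents from the full frames
predicted_latents = predictor(context_latents)  # predict encoder 2's output

# The loss lives in latent space, not pixel space. Treating the target branch
# as fixed (detach) is a simplifying assumption of this sketch.
loss = F.mse_loss(predicted_latents, target_latents.detach())
loss.backward()
```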
“This enables the model to discard unnecessary … information and focus on more important aspects of the video,” said Quentin Garrido, a research scientist at Meta. “Discarding unnecessary information is very important and something that V-JEPA aims at doing efficiently.”
Once this pretraining stage is complete, the next step is to tailor V-JEPA to accomplish specific tasks such as classifying images or identifying actions depicted in videos. This adaptation phase requires some human-labeled data. For example, videos have to be tagged with information about the actions contained in them. The adaptation for the final tasks requires much less labeled data than if the whole system had been trained end to end for specific downstream tasks. In addition, the same encoder and predictor networks can be adapted for different tasks.
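Under the usual transfer-learning recipe, the adaptation step might look roughly like the sketch below, in which the pretrained encoder is frozen and only a small task head is trained on the labeled clips; all module names and sizes here are placeholders.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, NUM_ACTIONS = 64, 10                  # illustrative sizes

pretrained_encoder = nn.Linear(1024, LATENT_DIM)  # stands in for the trained V-JEPA encoder
for p in pretrained_encoder.parameters():
    p.requires_grad = False                       # keep the pretrained weights fixed

action_head = nn.Linear(LATENT_DIM, NUM_ACTIONS)  # the only part that is trained
optimizer = torch.optim.Adam(action_head.parameters(), lr=1e-3)

clips = torch.rand(32, 1024)                      # stand-in features for labeled clips
labels = torch.randint(0, NUM_ACTIONS, (32,))     # human-provided action labels

with torch.no_grad():
    latents = pretrained_encoder(clips)           # reuse the learned representations
logits = action_head(latents)
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```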
Intuition Mimic
In February, the V-JEPA team reported how their systems did at understanding the intuitive physical properties of the real world — properties such as object permanence, the constancy of shape and color, and the effects of gravity and collisions. On a test called IntPhys, which requires AI models to identify if the actions happening in a video are physically plausible or implausible, V-JEPA was nearly 98% accurate. A well-known model that predicts in pixel space was only a little better than chance.
The V-JEPA team also explicitly quantified the “surprise” exhibited by their model when its prediction did not match observations. They took a V-JEPA model pretrained on natural videos, fed it new videos, then mathematically calculated the difference between what V-JEPA expected to see in future frames of the video and what actually happened. The team found that the prediction error shot up when the future frames contained physically impossible events. For example, if a ball rolled behind some occluding object and temporarily disappeared from view, the model generated an error when the ball didn’t reappear from behind the object in future frames. The reaction was akin to the intuitive response seen in infants. V-JEPA, one could say, was surprised.
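The surprise measurement can be pictured as a per-frame prediction error, as in the sketch below; the tensors are random stand-ins for the model's predicted and observed representations, and the spike threshold is an arbitrary choice.
```python
import torch

predicted = torch.rand(100, 64)   # latents the model expected for 100 future frames
observed = torch.rand(100, 64)    # latents computed from the frames that actually occurred

# Per-frame prediction error serves as a "surprise" signal over time.
surprise = ((predicted - observed) ** 2).mean(dim=1)

# Flag frames whose error is far above the video's typical level, such as
# the moment an occluded ball fails to reappear. The threshold is arbitrary.
threshold = surprise.mean() + 3 * surprise.std()
surprising_frames = torch.nonzero(surprise > threshold).flatten()
print(surprising_frames.tolist())
```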
Heilbron is impressed by V-JEPA’s ability. “We know from developmental literature that babies don’t need a lot of exposure to learn these types of intuitive physics,” he said. “It’s compelling that they show that it’s learnable in the first place, and you don’t have to come with all these innate priors.”
Karl Friston, a computational neuroscientist at University College London, thinks that V-JEPA is on the right track in terms of mimicking the “way our brains learn and model the world.” However, it still lacks some fundamental elements. “What is missing from [the] current proposal is a proper encoding of uncertainty,” he said. For example, if the information in the past frames isn’t enough to accurately predict the future frames, the prediction is uncertain, and V-JEPA doesn’t quantify this uncertainty.
In June, the V-JEPA team at Meta released their next-generation 1.2-billion-parameter model, V-JEPA 2, which was pretrained on 22 million videos. They also applied the model to robotics: They showed how to further fine-tune a new predictor network using only about 60 hours of robot data (including videos of the robot and information about its actions), then used the fine-tuned model to plan the robot’s next action. “Such a model can be used to solve simple robotic manipulation tasks and paves the way to future work in this direction,” Garrido said.
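One simple way to picture planning with such an action-conditioned predictor is to score candidate actions by how close each predicted next state lands to a goal state, as in the sketch below; this is an assumption about the mechanism, kept deliberately small, not a description of Meta's system.
```python
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM, N_CANDIDATES = 64, 7, 32

# Action-conditioned predictor: (current latent state, action) -> predicted next latent state.
predictor = nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)

current_latent = torch.rand(LATENT_DIM)           # encoding of the current camera view
goal_latent = torch.rand(LATENT_DIM)              # encoding of the desired outcome
candidate_actions = torch.rand(N_CANDIDATES, ACTION_DIM)

inputs = torch.cat([current_latent.repeat(N_CANDIDATES, 1), candidate_actions], dim=1)
predicted_next = predictor(inputs)                # one predicted next state per candidate action
distances = ((predicted_next - goal_latent) ** 2).mean(dim=1)
best_action = candidate_actions[distances.argmin()]   # action expected to land closest to the goal
```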
To push V-JEPA 2, the team designed a more difficult benchmark for intuitive physics understanding, called IntPhys 2. V-JEPA 2 and other models did only slightly better than chance on these tougher tests. One reason, Garrido said, is that V-JEPA 2 can handle only about a few seconds of video as input and predict a few seconds into the future. Anything longer is forgotten. You could make the comparison again to infants, but Garrido had a different creature in mind. “In a sense, the model’s memory is reminiscent of a goldfish,” he said.
