XR Blocks: Accelerating AI + XR innovation
Source: https://research.google/blog/xr-blocks-accelerating-ai-xr-innovation/
Summary:
Google releases the open-source framework XR Blocks to accelerate the convergence of AI and immersive reality
The Google XR team recently announced XR Blocks on its website, an open-source framework intended to break down the ecosystem barriers between artificial intelligence and extended reality (XR) and help developers quickly build immersive web experiences with environment perception and interaction capabilities.
AI already enjoys mature development frameworks and evaluation benchmarks, whereas XR application development still faces challenges such as complex low-level system integration and slow iteration. XR Blocks addresses this with a modular architecture that provides five core abstractions (user, world, interface, AI, and agents), substantially lowering the barrier to entry. The framework is built on mainstream technologies such as WebXR and three.js, runs across platforms, and offers open-source templates and live demos on GitHub.
The framework follows three design principles: Python-like, simple and readable syntax; a focus on the creator experience; and a pragmatic architecture. Its core innovation is the "Reality Model," which separates interaction logic from the underlying implementation so that developers can access perception inputs, AI computation, and visualization through high-level APIs.
In practice, the framework has powered Augmented Object Intelligence, which gives everyday objects dynamic virtual affordances, and it provides the core perception and interaction logic behind the assistive system Sensible Agent. The team describes this as a key step toward a vision in which the boundaries between programming, design, and natural conversation dissolve, and plans to keep expanding the tooling ecosystem to advance human-computer interaction research.
(Compiled from the technical blog post published by the Google XR team on October 9, 2025)
English source:
XR Blocks: Accelerating AI + XR innovation
October 9, 2025
Ruofei Du, Interactive Perception & Graphics Lead, and Benjamin Hersh, Product Manager, Google XR
XR Blocks is an open-source framework that helps you develop immersive experiences for the web, featuring XR realism, XR interaction, and AI + XR applications, with live demos at xrblocks.github.io.
The combination of artificial intelligence (AI) and extended reality (XR) has the potential to unlock a new paradigm of immersive intelligent computing. However, a significant gap exists between the ecosystems of these two fields today. AI research and development is accelerated by mature frameworks like JAX, PyTorch, TensorFlow, and benchmarks like ImageNet and LMArena. Meanwhile, prototyping novel AI-driven XR interactions remains a high-friction process, often requiring practitioners to manually integrate disparate, low-level systems for perception, rendering, and interaction.
To bridge this gap, we introduce XR Blocks (presented at ACM UIST 2025), a cross-platform framework designed to accelerate human-centered AI + XR innovation. This is a significant step beyond our prior research in Visual Blocks for ML, which targets non-XR use cases and streamlines prototyping of machine learning pipelines with visual programming. XR Blocks provides a modular architecture with plug-and-play components for the core abstractions in AI + XR: user, world, interface, AI, and agents. Crucially, it is designed with the mission of accelerating rapid prototyping of perceptive AI + XR apps. Built upon accessible technologies (WebXR, three.js, LiteRT, Gemini), our toolkit lowers the barrier to entry for XR creators. We demonstrate its utility through a set of open-source templates, live demos, and source code on GitHub, with the goal of empowering the community to quickly move from concept to interactive prototype. You can find an overview of these capabilities in our directional paper and teaser video.
Design principles
Our architectural and API design choices are guided by three principles:
- Embrace simplicity and readability: Inspired by Python's Zen, we prioritize clean, human-readable abstractions. A developer's script should read like a high-level description of the desired experience. Simple tasks should be simple to implement, and complex logic should remain explicit and understandable.
- Prioritize the creator experience: Our primary goal is to make authoring intelligent and perceptive XR applications as seamless as possible. We believe that creators should focus on the user experience, not on the low-level “plumbing” of sensor fusion, AI model integration, or cross-platform interaction logic.
- Pragmatism over completeness: We follow a design philosophy of pragmatism, since the fields of AI and XR are evolving quickly. A comprehensive, complex framework that attempts to be perfect will be obsolete upon release. We favor a simple, modular, and adaptable architecture that runs on both desktop and Android XR devices for a wide range of applications.
XR Blocks framework
Drawing inspiration from Visual Blocks for ML and InstructPipe, we designed the XR Blocks framework to provide a high-level, human-centered abstraction layer that separates the what of an interaction (denoted as Script, described more below) from the how of its low-level implementation.
Abstractions
We propose a new Reality Model composed of high-level abstractions to guide the implementation of the XR Blocks framework. Unlike a World Model designed for end-to-end unsupervised training, our Reality Model consists of replaceable modules for XR interaction. At the heart of our design is Script, the narrative and logical center of an application. Script operates on six first-class primitives, described below:
- User & the physical world: Our model is centered around the User, consisting of hands, gaze, and avatar. The physical world allows Script to query the perceived reality, such as depth (demo), estimated lighting conditions (demo), and objects (demo); see the sketch after this list.
- Virtual interfaces & context: The model augments the blended reality with virtual UI elements, from 2D panels (demo) to fully 3D assets (demo). The perception pipeline analyzes the context of environment, activities, and histories of interaction. An example application can be found in Sensible Agent (discussed more below).
- Intelligent & Social Entities: We treat AI-driven agents and remote human peers as primary entities within the model. This enables dynamic group conversations in hybrid human-AI conversations in DialogLab.
Implementation
This Reality Model is realized by XR Blocks’s modular Core engine, which provides high-level APIs that enable developers to harness the following subsystems without needing to master the implementation:
- Perception & input pipeline: The camera, depth, and sound modules continuously feed and update the Reality Model’s representation of physical reality. The input module normalizes user actions from various devices, providing the raw data for XR Blocks to interpret.
- AI as a core utility: The ai module acts as a central nervous system, providing simple yet powerful functions (.query, .runModel) that make large models an accessible utility.
- Experience & visualization toolkit: To enable rapid creation, the toolkit provides a library of common affordances. The ux module offers reusable interaction behaviors like .selectable and .draggable (demo), while the ui and effect modules handle the rendering of interfaces and complex visual effects like occlusion (demo). A sketch of how these subsystems might compose follows this list.
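The sketch below illustrates, under assumed signatures, how the ai and ux utilities named above could be combined: an object is made draggable and selectable, and selecting it triggers an ai.query call whose answer is shown on a panel. Only the function names (.query, .runModel, .selectable, .draggable) appear in the post; everything else is hypothetical.

```typescript
// Hypothetical composition of the ai and ux utilities (all signatures are assumptions).
interface AIModule {
  query(prompt: string): Promise<string>;                   // ask a large model
  runModel(name: string, input: unknown): Promise<unknown>; // run a smaller, named model
}

interface UXModule {
  selectable(target: object, onSelect: () => void): void;   // reusable "select" behavior
  draggable(target: object): void;                          // reusable "drag" behavior
}

interface Panel { setText(text: string): void; }

function attachDescribeBehavior(ai: AIModule, ux: UXModule, panel: Panel, cup: object): void {
  ux.draggable(cup);                // the object can now be moved around
  ux.selectable(cup, async () => {  // selecting it asks the model for a description
    const answer = await ai.query('Describe a coffee cup in one friendly sentence.');
    panel.setText(answer);          // the ui/effect layer would render this panel
  });
}
```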
By separating the abstract Reality Model from the concrete Core engine, XR Blocks enables a powerful new creative workflow. The goal is to allow creators to move from high-level, human-centric ideas to interactive prototypes much more quickly. We envision a future where any declarative prompt, such as “When the user pinches at an object, an agent should generate a poem about it”, could be directly translated to high-level instructions in XR Blocks:
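The original post illustrates this translation with a code figure that is not reproduced here; the snippet below is a speculative reconstruction of what such high-level instructions could look like. The module names (input, world, ai, ui) come from the post, while every event name and signature is an assumption.

```typescript
// Speculative reconstruction of the pinch-to-poem prompt as high-level instructions.
// Module names (input, world, ai, ui) are from the post; all signatures are assumptions.
interface PinchEvent { position: { x: number; y: number; z: number }; }
interface InputModule { on(event: 'pinch', handler: (e: PinchEvent) => void): void; }
interface WorldModule { objectAt(p: PinchEvent['position']): { label: string } | null; }
interface AIModule { query(prompt: string): Promise<string>; }
interface UIModule { showPanel(opts: { anchorLabel: string; text: string }): void; }

function registerPinchPoem(input: InputModule, world: WorldModule, ai: AIModule, ui: UIModule): void {
  input.on('pinch', async (pinch) => {
    const target = world.objectAt(pinch.position);  // what did the user pinch at?
    if (!target) return;
    const poem = await ai.query(`Write a short poem about ${target.label}.`);
    ui.showPanel({ anchorLabel: target.label, text: poem });  // present the agent's poem
  });
}
```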
Hence, the creator’s prompt is no longer pseudocode but a direct summary of the implementation logic. We envision this framework more seamlessly translating such user intent into a system-level execution flow, composing capabilities from the input, sound, ai, world, ui, and agent modules to generate emergent, intelligent behavior with user interaction.
Application scenarios
We provide a suite of interactive applications to demonstrate the expressive power and flexibility of the XR Blocks framework. These examples showcase how our framework enables the rapid prototyping of sophisticated experiences that were previously too complex and costly to build, facilitating the creation of realistic, interactive, and intelligent mixed-reality worlds:
The true power of the framework is realized when this Reality Model is deeply integrated with generative AI to create dynamic, personalized environments. We demonstrate this by building systems like Augmented Object Intelligence (XR-Objects), which imbues everyday physical objects with interactive digital affordances, such as dynamic virtual buttons. XR Blocks also serves as the foundation for Sensible Agent (presented at ACM UIST 2025), a system for proactive and unobtrusive AR assistance. Our architecture provides the agent’s core perception and interaction logic, exemplifying our primary goal: by providing robust, high-level tools, XR Blocks empowers human-computer interaction researchers to bypass low-level implementation and focus directly on higher-order challenges like the cognitive principles of human-agent collaboration.
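As a rough illustration of the XR-Objects idea, the sketch below attaches a virtual button to each physical object the world module reports. It reuses the same hypothetical interfaces as the earlier sketches; none of the method names are the framework's actual API.

```typescript
// Hypothetical sketch in the spirit of Augmented Object Intelligence (XR-Objects):
// every recognized everyday object receives a contextual virtual button.
interface DetectedObject { label: string; position: { x: number; y: number; z: number }; }
interface WorldModule { onObjectDetected(handler: (obj: DetectedObject) => void): void; }
interface UIModule {
  addButton(opts: { label: string; anchor: DetectedObject; onClick: () => void }): void;
}
interface AIModule { query(prompt: string): Promise<string>; }

function augmentEverydayObjects(world: WorldModule, ui: UIModule, ai: AIModule): void {
  world.onObjectDetected((obj) => {
    ui.addButton({
      label: `About this ${obj.label}`,
      anchor: obj,
      onClick: async () => {
        // The button's action is generated on demand by the AI utility.
        const tip = await ai.query(`Give one practical tip for using a ${obj.label}.`);
        console.log(tip); // a full app would render the tip next to the object
      },
    });
  });
}
```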
Conclusion and future directions
Creating intelligent XR experiences is currently too fragmented, placing a major barrier between a creator's vision and its realization. We presented XR Blocks, an architecture and toolkit that dissolves this complexity by providing a high-level abstraction layer that separates the what (the intent) from the how (the low-level implementation), dramatically accelerating the prototyping of context-aware applications. This is a foundational step toward a future where the boundaries between programming, design, and conversation disappear, enabling us to script realities as fluidly as we script stories. XR Blocks is far from perfect, and this work serves as an initial visionary document inviting more creators to join our journey, based on our belief that with the right set of tools, everyone can unleash their inner creativity with AI.
Acknowledgements
This work is a joint collaboration across multiple teams at Google. The following researchers and engineers contributed to this work: David Li and Ruofei Du (equal primary contributions); Nels Numan, Xun Qian, Yanhe Chen, and Zhongyi Zhou (equal secondary contributions, sorted alphabetically); as well as Evgenii Alekseev, Geonsun Lee, Alex Cooper, Min Xia, Scott Chung, Jeremy Nelson, Xiuxiu Yuan, Jolica Dias, Tim Bettridge, Benjamin Hersh, Michelle Huynh, Konrad Piascik, Ricardo Cabello, and David Kim. We would like to thank Mahdi Tayarani, Max Dzitsiuk, Patrick Hackett, Seeyam Qiu, Brian Collins, Steve Toh, Eric Gonzalez, Nicolás Peña Moreno, Yi-Fei Li, Ziyi Liu, and Jing Jin for their feedback and discussion on our early-stage proposal and WebXR experiments. We thank Max Spear, Adarsh Kowdle, and Guru Somadder for their directional contributions and thoughtful reviews.