Sensible Agent: A framework for unobtrusive interaction with proactive AR agents

Published by qimuai · First-hand compilation

Source: https://research.google/blog/sensible-agent-a-framework-for-unobtrusive-interaction-with-proactive-ar-agents/

Summary:

On September 18, 2025, the Google XR team published a research framework called "Sensible Agent," aimed at making human-computer interaction in augmented reality (AR) more natural. By sensing multimodal context in real time, including the user's gaze, hand availability, and ambient noise, the framework lets AR assistants proactively adapt how they interact, reducing demands on the user's attention.

Unlike conventional AR systems that depend on spoken commands, Sensible Agent delivers intelligent interaction through two core modules: it first uses vision-language models and audio analysis to understand what the user needs (such as translating a menu or recommending a route), then selects the interaction method best suited to the social setting (such as replacing speech with visual cues in noisy environments). A fully functional prototype has been built on Android XR and WebXR, integrating multimodal AI models for scene parsing and response generation.

In a comparative study with 10 participants, Sensible Agent significantly reduced cognitive load relative to a conventional voice assistant (the NASA-TLX mental demand score dropped from 65.0 to 21.1), and users rated their preference for it at 6.0 on a 7-point scale. Although interactions took slightly longer, participants praised its unobtrusiveness in social settings and the naturalness of the interaction.

The work points to a practical path for AR devices in everyday scenarios and could extend to smart homes and cross-device collaboration, with on-device computation protecting user data. The research was a multi-team effort at Google, and the paper was published at the UIST 2025 academic conference.

Full article:

Sensible Agent: A framework for unobtrusive interaction with proactive AR agents
September 18, 2025
Ruofei Du, Interactive Perception & Graphics Lead, and Geonsun Lee, Student Researcher, Google XR

Sensible Agent is a research prototype that enables AR agents to proactively adapt what they suggest and how they interact, using real-time context including gaze, hand availability, and environmental noise.

Recent innovations, such as Google's Project Astra, exemplify the potential of proactive agents embedded in augmented reality (AR) glasses to offer intelligent assistance that anticipates user needs and seamlessly integrates into everyday life. These agents promise remarkable convenience, from effortlessly navigating unfamiliar transit hubs to discreetly offering timely suggestions in crowded spaces. Yet, today's agents remain constrained by a significant limitation: they predominantly rely on explicit verbal commands from users. This requirement can be awkward or disruptive in social environments, cognitively taxing in time-sensitive scenarios, or simply impractical.

To address these challenges, we introduce Sensible Agent, published at UIST 2025, a framework designed for unobtrusive interaction with proactive AR agents. Sensible Agent is an advancement of our prior research in Human I/O and fundamentally reshapes this interaction by anticipating user intentions and determining the best approach to deliver assistance. It leverages real-time multimodal context sensing, subtle gestures, gaze input, and minimal visual cues to offer unobtrusive, contextually appropriate assistance. This marks a crucial step toward truly integrated, socially aware AR systems that respect user context, minimize cognitive disruption, and make proactive digital assistance practical for daily life.

Sensible Agent framework
At its core, Sensible Agent consists of two interconnected modules for (1) understanding "what" to assist with, and (2) determining "how" to provide assistance. First, Sensible Agent leverages multimodal sensing, using egocentric cameras and environmental context detection, to understand a user's current assistance needs. Whether you're navigating a crowded museum or rushing through a grocery store, the agent proactively decides the most helpful action, such as providing quick translations, suggesting popular dishes at a new restaurant, or quietly displaying a grocery list.
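To make the "what" module concrete, here is a minimal sketch in Python of how egocentric context might feed a proactive query. The `EgocentricContext` record, the `query_vlm` stub, and the prompt wording are all illustrative assumptions, not the prototype's actual code.

```python
# A minimal sketch of the "what to assist with" stage. The context record,
# the query_vlm stub, and the prompt are illustrative assumptions; the real
# prototype uses its own multimodal models and prompting.
from dataclasses import dataclass

@dataclass
class EgocentricContext:
    scene_labels: list[str]   # e.g. ["restaurant", "menu", "foreign text"]
    ambient_noise_db: float   # estimated from the microphone stream
    hands_busy: bool          # e.g. inferred from hand tracking

def query_vlm(prompt: str) -> str:
    """Placeholder for a call to a vision-language model."""
    return "translate menu"   # canned answer for the sketch

def propose_assistance(ctx: EgocentricContext) -> str | None:
    """Ask the model what help, if any, is worth offering right now."""
    prompt = (
        "Given this first-person scene, suggest ONE proactive action "
        "(e.g. 'translate menu', 'show grocery list') or 'none'. "
        f"Scene labels: {ctx.scene_labels}"
    )
    suggestion = query_vlm(prompt)
    return None if suggestion == "none" else suggestion
```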

Equally important, Sensible Agent intelligently chooses the least intrusive and most appropriate interaction method based on social context. For instance, if your hands are busy cooking, the agent might enable confirmation via a head nod. In a noisy environment, it might discreetly show visual icons instead of speaking out loud. This adaptive modality selection ensures assistance is always conveniently delivered while avoiding significant disruptions.
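As a sketch of this adaptive selection, the rules below reuse the `EgocentricContext` from the previous snippet; the noise threshold and modality names are invented for illustration and are not taken from the paper.

```python
# Illustrative rules for choosing "how" to interact; the threshold and
# modality names are assumptions, not values from the paper.
NOISY_DB = 70.0   # assumed cutoff above which speech is impractical

def choose_output_modality(ctx: EgocentricContext) -> str:
    # In loud places, show a visual icon instead of speaking aloud.
    return "visual_icon" if ctx.ambient_noise_db > NOISY_DB else "speech"

def choose_confirmation_input(ctx: EgocentricContext) -> str:
    if ctx.hands_busy:                    # e.g. hands occupied while cooking
        return "head_nod"
    if ctx.ambient_noise_db > NOISY_DB:   # speaking aloud would be awkward
        return "gaze_dwell"
    return "voice"
```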

Building the Sensible Agent prototype
To bring this concept to life, we implemented Sensible Agent as a fully functional prototype running on Android XR and WebXR, integrated with powerful multimodal AI models. The prototype includes four components: (1) a context parser that enables it to understand the scene, (2) a proactive query generator that determines what assistance is needed, (3) an interaction module that decides how to best offer assistance, and (4) a response generator that delivers the assistance.
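One way to picture how these four components fit together is the loop below, which wires the earlier sketches into a single update cycle; the class names and stub behavior are assumptions for illustration, not the actual Android XR / WebXR implementation.

```python
# Illustrative wiring of the four prototype components into one cycle;
# all names and stub behavior are assumptions for this sketch.
class ContextParser:
    def parse(self, frame, audio) -> EgocentricContext:
        # Real version: scene understanding on the egocentric frame plus
        # audio analysis; stubbed here with fixed values.
        return EgocentricContext(["museum", "painting"], 55.0, False)

class ProactiveQueryGenerator:
    def generate(self, ctx: EgocentricContext) -> str | None:
        return propose_assistance(ctx)            # decides "what"

class InteractionModule:
    def select(self, ctx: EgocentricContext) -> str:
        return choose_output_modality(ctx)        # decides "how"

class ResponseGenerator:
    def render(self, query: str, modality: str) -> None:
        print(f"[{modality}] {query}")            # stand-in for AR output

def agent_tick(frame, audio) -> None:
    """One cycle: sense -> decide what -> decide how -> deliver."""
    ctx = ContextParser().parse(frame, audio)
    query = ProactiveQueryGenerator().generate(ctx)
    if query is None:        # nothing worth interrupting the user for
        return
    modality = InteractionModule().select(ctx)
    ResponseGenerator().render(query, modality)
```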

User study
We conducted a structured user study comparing Sensible Agent with a conventional voice-controlled AR assistant based on Project Astra. Ten participants completed 12 real-world scenarios on Android XR devices, performing each scenario under both conditions.

Results
We measured cognitive load with the NASA Task Load Index (NASA-TLX), usability with the System Usability Scale (SUS), preference on a 7-point Likert scale, and interaction time. Compared with the baseline, Sensible Agent cut the NASA-TLX mental demand score from 65.0 to 21.1 and earned a preference rating of 6.0 out of 7, at the cost of slightly longer interactions.

Key finding: proactivity does more than reduce the burden of use; it reshapes the user's relationship with the agent. Participants felt that Sensible Agent acted more like a collaborative partner than a tool, and its nonverbal inputs, which mimic social cues, made interaction feel natural. In high-pressure or socially sensitive settings, how the agent interacts matters no less than what it offers.

Conclusion and future directions
This work demonstrates that by jointly reasoning about what to assist with and how to interact, proactive AR assistance can stay intelligent while remaining unobtrusive. By building multimodal sensing and real-time adaptation into both decision-making and interface design, the framework addresses a long-standing pain point in human-computer interaction.

Future work will focus on incorporating long-term history for personalization, extending support to multiple devices and environments, and exploring applications in smart homes and physical robots, while safeguarding user data through on-device inference. As AR becomes part of daily life, systems like Sensible Agent will lay the groundwork for efficient, considerate digital agents.

Acknowledgments
This project was a collaboration across multiple teams at Google. Core contributors: Geonsun Lee, Min Xia, Nels Numan, Xun Qian, David Li, Yanhe Chen, Achin Kulshrestha, Ishan Chatterjee, Yinda Zhang, Dinesh Manocha, David Kim, and Ruofei Du. We thank Zhongyi Zhou, Vikas Bahirwani, Jessica Bo, Zheng Xu, and Renhao Liu for feedback on early designs, and Alex Olwal, Adarsh Kowdle, and Guru Somadder for strategic guidance and review.

