StreetReaderAI: Towards making street view accessible via context-aware multimodal AI

Summary:
Google Research has released StreetReaderAI, an innovative prototype that uses multimodal AI to make street view navigation accessible to blind and low-vision users. The work was presented at the UIST'25 conference on October 29.
Mainstream mapping services already offer immersive street view features, but screen readers have never been able to interpret street view imagery. By combining dynamic geographic data with real-time image analysis, StreetReaderAI delivers intelligent spoken descriptions of streetscape environments for the first time. The system is built around two core modules: an AI describer that automatically generates scene descriptions, and an AI chat agent that supports real-time conversation. Users can move seamlessly between panoramas and adjust their viewpoint with voice commands or keyboard shortcuts, for an audio-first experience similar to a first-person game.
In a lab study, 11 blind participants explored more than 350 panoramic locations and issued over a thousand AI requests. The AI chat feature was used six times as often as the automatic describer, reflecting a strong preference for personalized interaction. The system answered participants' questions with an overall accuracy of 86.3%, though areas such as obstacle detection still need improvement.
The study also offers the first systematic analysis of what blind users focus on in street view imagery: spatial orientation (27%), object existence (26.5%), general scene description (18.4%), and object/place location (14.9%) form the four core needs. Future directions include agents that can explore street view autonomously, end-to-end accessible route planning, and immersive 3D audio environments.
Designed jointly by blind and sighted researchers, the prototype marks a key step toward making street view services accessible to all and offers a template for rethinking accessibility standards in digital navigation.
Translation:
StreetReaderAI: Towards making street view accessible via context-aware multimodal AI
October 29, 2025
Jon E. Froehlich, Visiting Faculty Researcher, and Shaun Kane, Research Scientist, Google Research
We introduce StreetReaderAI, a new accessible street view prototype built on context-aware, real-time AI and accessible navigation controls.
Interactive streetscape tools, now available in every major mapping service, have transformed how people virtually navigate and explore the world: from previewing routes and inspecting destinations to remotely visiting world-class tourist sites. But to date, screen readers cannot interpret street view imagery, and alternative text is not widely available. We now have an opportunity, with multimodal AI and image understanding, to redefine this immersive streetscape experience so that it includes everyone. In the future, a service like Google Street View, with more than 220 billion images spanning over 110 countries and territories, could offer blind and low-vision users an immersive visual experience and open new possibilities for exploration.
In "StreetReaderAI: Making Street View Accessible Using Context-Aware Multimodal AI", presented at UIST'25, we introduce this proof-of-concept prototype. StreetReaderAI was developed iteratively by a team of blind and sighted researchers, drawing on accessible first-person games and navigation tools such as Shades of Doom, BlindSquare, and SoundScape. Its key capabilities include:
- Real-time AI-generated descriptions of nearby roads, intersections, and places
- Dynamic conversation with a multimodal AI agent about scenes and local geography
- Seamless panning and movement between panoramic images using voice commands or keyboard shortcuts
Navigating in StreetReaderAI
StreetReaderAI offers an immersive, first-person exploration experience, much like a video game with audio as the primary interface. Users can navigate seamlessly by keyboard or voice: the left and right arrow keys shift the view, and the system announces the current heading in real time (e.g., "Now facing: North" or "Northeast"), whether the path ahead is passable, and whether a landmark is in view. Pressing the up arrow takes a "virtual step" forward; the down arrow moves backward. As the user moves, the system describes the distance traveled and key geographic information, such as nearby places. A "jump" or "teleport" feature lets users move quickly to new locations.
How StreetReaderAI serves as a virtual guide
The system is built on two Gemini-powered AI subsystems:
- AI Describer: combines dynamic geographic data about the user's virtual location with an analysis of the current street view image to generate real-time audio descriptions. It has two modes: a "default" mode focused on navigation and safety for blind pedestrians, and a "tour guide" mode that adds tourism information such as historic architecture. The system also predicts scene-related follow-up questions that blind or low-vision users are likely to ask.
- AI Chat: builds on the describer and lets users ask questions about the current view, past views, and nearby geography. The agent is based on Google's Multimodal Live API and retains session memory (a context window roughly equivalent to 4,000 input images). For example, after walking past a bus stop, a user can ask, "Where was that bus stop?", and the system can draw on its earlier location data to answer, "It is behind you, approximately 12 meters away."
Feedback from blind users
We conducted an in-person study with 11 blind screen reader users, who explored multiple locations with StreetReaderAI and evaluated walking routes. Overall usefulness was rated 6.4 on a 7-point scale (median 7), and participants particularly praised the interplay between virtual navigation and AI, the seamless conversational interface, and the value of the information provided. Qualitative feedback noted that the system markedly improves the accessibility of street view tools and that the AI chat feature makes exploring geography more engaging.
During the study, participants visited more than 350 panoramas and issued over 1,000 AI requests. Notably, AI Chat was used six times as often as AI Describer, showing a preference for personalized interaction. Remaining challenges include difficulty with spatial orientation, judging the accuracy of AI responses, and understanding the limits of the AI's capabilities.
Insights from the first study of an accessible street view system
We provide the first systematic analysis of the kinds of questions blind users ask about street view imagery. Across 917 AI Chat interactions annotated with 23 question types, the top four were:
- Spatial orientation (27.0%): e.g., "How far is the bus stop from me?"
- Object existence (26.5%): e.g., "Is there a crosswalk here?"
- General description (18.4%): e.g., "What's in front of me?"
- Object/place location (14.9%): e.g., "Where is the nearest intersection?"
AI accuracy
Of the 816 valid questions:
- 86.3% (703) were answered correctly
- 3.9% (32) were answered incorrectly; of these, 62.5% were false negatives (e.g., failing to identify a bike rack that was present) and 37.5% were misidentifications (e.g., a yellow speed bump interpreted as a crosswalk)
- The rest were partially correct (3.2%) or declined by the AI (6.6%)
What's next?
This work shows the potential of multimodal AI to answer blind users' questions about street view. Follow-up directions include:
- Geo-visual agents: develop AI that can explore on its own; for example, when a user asks, "Where is the next bus stop down this road?", the agent could automatically traverse street view, find the stop, and analyze its facilities.
- Route planning support: enable end-to-end route previews, such as having the AI "pre-walk" the entire street view route from a subway station to a library and generate obstacle-aware guidance and the location of the entrance.
- Immersive audio interfaces: beyond speech, explore spatialized audio and 3D soundscapes generated from the imagery.
Though still a proof-of-concept prototype, StreetReaderAI shows the broad promise of making immersive streetscape environments accessible.
Acknowledgements
Research team: Jon E. Froehlich, Alexander J. Fiannaca, Nimer Jaber, Victor Tsaran, Shaun K. Kane, and Philip Nelson. We thank Project Astra, the Google Geo teams, and all participants. Diagram icons are from the Noun Project (by Firdaus Faiz, Kawalan Icon, Didik Darmanto, and Funtasticon).
English source:
StreetReaderAI: Towards making street view accessible via context-aware multimodal AI
October 29, 2025
Jon E. Froehlich, Visiting Faculty Researcher, and Shaun Kane, Research Scientist, Google Research
We introduce StreetReaderAI, a new accessible street view prototype using context-aware, real-time AI and accessible navigation controls.
Interactive streetscape tools, available today in every major mapping service, have revolutionized how people virtually navigate and explore the world — from previewing routes and inspecting destinations to remotely visiting world-class tourist locations. But to date, screen readers have not been able to interpret street view imagery, and alt text is unavailable. We now have an opportunity to redefine this immersive streetscape experience to be inclusive for all with multimodal AI and image understanding. This could eventually allow a service like Google Street View, which has over 220 billion images spanning 110+ countries and territories, to be more accessible to people in the blind and low-vision community, offering an immersive visual experience and opening up new possibilities for exploration.
In “StreetReaderAI: Making Street View Accessible Using Context-Aware Multimodal AI”, presented at UIST’25, we introduce StreetReaderAI, a proof-of-concept accessible street view prototype that uses context-aware, real-time AI and accessible navigation controls. StreetReaderAI was designed iteratively by a team of blind and sighted accessibility researchers, drawing on previous work in accessible first-person gaming and navigation tools, such as Shades of Doom, BlindSquare, and SoundScape. Key capabilities include:
- Real-time AI-generated descriptions of nearby roads, intersections, and places.
- Dynamic conversation with a multimodal AI agent about scenes and local geography.
- Accessible panning and movement between panoramic images using voice commands or keyboard shortcuts.
Navigating in StreetReaderAI
StreetReaderAI offers an immersive, first-person exploration experience, much like a video game where audio is the primary interface.
StreetReaderAI provides seamless navigation through both keyboard and voice interaction. Users can explore their surroundings using the left and right arrow keys to shift their view. As the user pans, StreetReaderAI shares audio feedback, voicing the current heading as a cardinal or intercardinal direction (e.g., “Now facing: North” or “Northeast”). It also expresses whether the user can move forward and if they are currently facing a nearby landmark or place.
To move, the user can take “virtual steps” using the up arrow or move backward with the down arrow. As a user moves through the virtual streetscape, StreetReaderAI describes how far the user traveled and key geographic information, such as nearby places. Users can also use “jump” or “teleport” features to quickly move to new locations.
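To make this feedback loop concrete, here is a minimal sketch of how pan and step announcements could be composed. All names (heading_to_label, pan_feedback, step_feedback), the example place, and the exact wording are illustrative assumptions, not the prototype's actual implementation.

```python
# Minimal sketch of the pan/step audio feedback described above.
# All names and message wording are illustrative, not from the paper.

CARDINALS = ["North", "Northeast", "East", "Southeast",
             "South", "Southwest", "West", "Northwest"]

def heading_to_label(heading_deg: float) -> str:
    """Map a heading in degrees (0 = North, clockwise) to the nearest
    cardinal or intercardinal direction."""
    index = round((heading_deg % 360) / 45) % 8
    return CARDINALS[index]

def pan_feedback(heading_deg: float, can_move_forward: bool,
                 facing_place: str | None) -> str:
    """Compose the spoken message produced after a left/right arrow pan."""
    parts = [f"Now facing: {heading_to_label(heading_deg)}."]
    parts.append("Path ahead." if can_move_forward else "No path ahead.")
    if facing_place:
        parts.append(f"Facing {facing_place}.")
    return " ".join(parts)

def step_feedback(distance_m: float, nearby_places: list[str]) -> str:
    """Compose the spoken message produced after a virtual step."""
    msg = f"Moved {distance_m:.0f} meters."
    if nearby_places:
        msg += " Nearby: " + ", ".join(nearby_places) + "."
    return msg

# Example: panning to 45 degrees while facing a (made-up) cafe entrance.
print(pan_feedback(45, can_move_forward=True, facing_place="Cafe Java entrance"))
print(step_feedback(9.5, ["bus stop", "crosswalk"]))
```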
How StreetReaderAI serves as a virtual guide
The core of StreetReaderAI is its two underlying AI subsystems backed by Gemini: AI Describer and AI Chat. Both subsystems take in a static prompt and optional user profile as well as dynamic information about the user’s current location, such as nearby places, road information, and the current field-of-view image (i.e., what’s being shown in Street View).
AI Describer
AI Describer functions as a context-aware scene description tool that combines dynamic geographic information about the user’s virtual location along with an analysis of the current Street View image to generate a real-time audio description.
It has two modes: a “default” prompt emphasizing navigation and safety for blind pedestrians, and a “tour guide” prompt that provides additional tourism information (e.g., historic and architectural context). We also use Gemini to predict likely follow-up questions specific to the current scene and local geography that may be of interest to blind or low-vision travelers.
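As a rough illustration of how such a context-aware description request might be assembled, the sketch below bundles a mode-specific instruction, an optional user profile, the dynamic geographic context, and the current field-of-view image into one request. The field names, prompt wording, and the commented-out call_gemini stub are assumptions, not the prototype's actual schema.

```python
# Illustrative sketch of assembling a context-aware describer request.
# Prompt wording, field names, and call_gemini() are assumptions, not the
# prototype's actual implementation.
from dataclasses import dataclass, field

@dataclass
class GeoContext:
    """Dynamic information about the user's current virtual location."""
    lat: float
    lng: float
    heading_deg: float
    nearby_places: list[str] = field(default_factory=list)
    road_info: str = ""

PROMPTS = {
    "default": ("Describe this street scene for a blind pedestrian. "
                "Prioritize sidewalks, crossings, obstacles, and entrances."),
    "tour_guide": ("Describe this street scene for a blind visitor, adding "
                   "historic and architectural context where relevant."),
}

def build_describer_request(mode: str, ctx: GeoContext,
                            fov_image_jpeg: bytes,
                            user_profile: str = "") -> dict:
    """Bundle the static prompt, optional user profile, geographic context,
    and the current field-of-view image into one multimodal request."""
    geo_text = (f"Heading: {ctx.heading_deg:.0f} deg. Road: {ctx.road_info}. "
                f"Nearby places: {', '.join(ctx.nearby_places) or 'none'}.")
    return {
        "instructions": PROMPTS[mode],
        "user_profile": user_profile,
        "geo_context": geo_text,
        "image": fov_image_jpeg,   # what Street View is currently showing
    }

# Example usage with placeholder data:
ctx = GeoContext(47.6062, -122.3321, 90,
                 nearby_places=["bus stop", "library"],
                 road_info="4-lane arterial with marked crosswalk")
request = build_describer_request("default", ctx, fov_image_jpeg=b"<jpeg bytes>")
# response_text = call_gemini(request)  # hypothetical model call
```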
AI Chat
AI Chat builds on AI Describer but allows users to ask questions about their current view, past views, and nearby geography. The chat agent uses Google's Multimodal Live API, which supports real-time interaction, function calling, and temporarily retains memory of all interactions within a single session. We track and send each pan or movement interaction along with the user's current view and geographic context (e.g., nearby places, current heading).
What makes AI Chat so powerful is its ability to hold a temporary “memory” of the user's session — the context window is set to a maximum of 1,048,576 input tokens, which is roughly equivalent to over 4k input images. Because AI Chat receives the user's view and location with every virtual step, it collects information about the user’s location and context. A user can virtually walk past a bus stop, turn a corner, and then ask, “Wait, where was that bus stop?” The agent can recall its previous context, analyze the current geographic input, and answer, “The bus stop is behind you, approximately 12 meters away.”
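To make the bus stop example concrete, here is a hedged sketch of the kind of bookkeeping that this per-step context enables: logging each step's location and visible places, then answering from the log with a haversine distance and a coarse relative direction. The helper names, the toy coordinates, and the session log format are assumptions for illustration only.

```python
# Sketch of recalling a previously seen place from logged session context.
# The log format, coordinates, and geometry helpers are illustrative.
import math

EARTH_RADIUS_M = 6_371_000

def distance_m(lat1, lng1, lat2, lng2):
    """Haversine distance in meters between two lat/lng points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def bearing_deg(lat1, lng1, lat2, lng2):
    """Initial bearing from point 1 to point 2 (0 = North, clockwise)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lng2 - lng1)
    x = math.sin(dl) * math.cos(p2)
    y = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return math.degrees(math.atan2(x, y)) % 360

def relative_direction(user_heading, target_bearing):
    """Coarse relative direction of a target given the user's heading."""
    diff = (target_bearing - user_heading + 360) % 360
    if diff < 45 or diff > 315:
        return "ahead of you"
    if diff < 135:
        return "to your right"
    if diff < 225:
        return "behind you"
    return "to your left"

# Session log: each virtual step records location, heading, and visible places.
session_log = [
    {"lat": 47.60620, "lng": -122.33210, "heading": 0, "places": ["bus stop"]},
    {"lat": 47.60630, "lng": -122.33210, "heading": 0, "places": []},
]
current = {"lat": 47.60631, "lng": -122.33208, "heading": 0}

def recall_place(name):
    for entry in reversed(session_log):          # most recent sighting first
        if name in entry["places"]:
            d = distance_m(current["lat"], current["lng"], entry["lat"], entry["lng"])
            rel = relative_direction(current["heading"],
                                     bearing_deg(current["lat"], current["lng"],
                                                 entry["lat"], entry["lng"]))
            return f"The {name} is {rel}, approximately {d:.0f} meters away."
    return f"I have not seen a {name} in this session."

print(recall_place("bus stop"))
# -> "The bus stop is behind you, approximately 12 meters away."
```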
Testing StreetReaderAI with blind users
To evaluate StreetReaderAI, we conducted an in-person lab study with eleven blind screen reader users. During the sessions, participants learned about StreetReaderAI and used it to explore multiple locations and evaluate potential walking routes to destinations.
Overall, participants reacted positively to StreetReaderAI, rating the overall usefulness 6.4 (median=7; SD=0.9) on a Likert scale from 1–7 (where 1 was ‘not at all useful’ and 7 was ‘very useful’), emphasizing the interplay between virtual navigation and AI, the seamlessness of the interactive AI Chat interface, and the value of information provided. Qualitative feedback from participants consistently highlighted StreetReaderAI's significant accessibility advancement for navigation, noting that existing street view tools lack this level of accessibility. The interactive AI chat feature was also described as making conversations about streets and places both engaging and helpful.
During the study, participants visited over 350 panoramas and made over 1,000 AI requests. Interestingly, AI Chat was used six times more often than AI Describer, indicating a clear preference for personalized, conversational inquiries. While participants found value in StreetReaderAI and adeptly combined virtual world navigation with AI interactions, there is room for improvement: participants sometimes struggled with properly orienting themselves, distinguishing the veracity of AI responses, and determining the limits of AI knowledge.
Results
As the first study of an accessible street view system, our research also provides the first-ever analysis of the types of questions blind people ask about streetscape imagery. We analyzed all 917 AI Chat interactions and annotated each with up to three tags drawn from an emergent list of 23 question type categories. The four most common question types were:
- Spatial orientation (27.0%): the location and distance of objects relative to the user, e.g., "How far is the bus stop from where I'm standing?" and "Which side are the garbage cans next to the bench?"
- Object existence (26.5%): the presence of key features like sidewalks, obstacles, and doors, e.g., "Is there a crosswalk here?"
- General description (18.4%): a summary of the current view, often used to start an AI Chat session, e.g., "What's in front of me?"
- Object/place location (14.9%): where things are, e.g., "Where is the nearest intersection?" or "Can you help me find the door?"
StreetReaderAI accuracy
Because StreetReaderAI relies so heavily on AI, a critical challenge is response accuracy. Of the 816 questions that participants asked AI Chat:
- 703 (86.3%) were answered correctly.
- 32 (3.9%) were answered incorrectly.
- The remainder were either partially correct (26; 3.2%) or the AI refused to answer (54; 6.6%).
Of the 32 incorrect responses:
- 20 (62.5%) were false negatives, e.g., stating that a bike rack did not exist when it did.
- 12 (37.5%) were misidentifications (e.g., a yellow speed bump interpreted as a crosswalk) or other errors caused by AI Chat not yet having seen the target in street view.
More work is necessary to explore how StreetReaderAI performs in other contexts and beyond lab settings.
What’s next?
StreetReaderAI is a promising first step toward making streetscape tools accessible to all. Our study highlights what information blind users desire from and ask about streetscape imagery and the potential for multimodal AI to answer their questions.
There are several other opportunities to expand on this work:
- Towards Geo-visual Agents: We envision a more autonomous AI Chat agent that can explore on its own (see the sketch after this list). For example, a user could ask, "What's the next bus stop down this road?" and the agent could automatically navigate the Street View network, find the stop, analyze its features (benches, shelters), and report back.
- Supporting Route Planning: Similarly, StreetReaderAI does not yet support full origin-to-destination routing. Imagine asking, "What's the walk like from the nearest subway station to the library?" A future AI agent could "pre-walk" the route, analyzing every Street View image to generate a blind-friendly summary, noting potential obstacles, and identifying the exact location of the library's door.
- Richer Audio Interface: The primary output of StreetReaderAI is speech. We are also exploring richer, non-verbal feedback, including spatialized audio and fully immersive 3D audio soundscapes synthesized from the images themselves.
Though a “proof-of-concept” research prototype, StreetReaderAI helps demonstrate the potential of making immersive streetscape environments accessible.
Acknowledgements
This research was conducted by Jon E. Froehlich, Alexander J. Fiannaca, Nimer Jaber, Victor Tsaran, Shaun K. Kane, and Philip Nelson. We thank Project Astra and the Google Geo teams for their feedback as well as our participants. Diagram icons are from Noun Project, including: “prompt icon” by Firdaus Faiz, “command functions” by Kawalan Icon, “dynamic geo-context” by Didik Darmanto, and “MLLM icon” by Funtasticon.