
Teaching AI to Read a Map

Published by qimuai · First-hand translation



Source: https://research.google/blog/teaching-ai-to-read-a-map/

Summary:

Google researchers recently published work showing that synthetic data can train AI systems to understand and trace routes on maps. The technique, called "MapTrace", targets a key gap in the spatial reasoning of today's multimodal large language models: although these models can recognize what is in an image, they struggle with the geometric and topological relationships between objects, for example failing to plan a sensible route on a zoo map from the entrance to the reptile house.

The team traces the problem to a shortage of training data. Models lack the large-scale, precisely annotated maps needed to learn the rules of navigation; manual annotation is costly, and detailed maps of many commercial venues are hard to obtain. The team therefore built a fully automated, scalable synthetic-data pipeline that constructs high-quality training data in four steps: a large language model first generates diverse textual descriptions of maps, which a text-to-image model renders into images; a "Mask Critic" model then automatically identifies walkable regions; those regions are converted into a structured navigation graph; and finally a "Path Critic" model quality-checks the thousands of generated routes.

Using 2 million annotated examples produced by this pipeline, the team fine-tuned models including Gemma 3 27B and Gemini 2.5 Flash. The fine-tuned models improved markedly on a test set of real-world maps not seen during training: the path-tracing error metric dropped by roughly 30% on average, and the path-generation success rate rose by up to 6.4 percentage points, demonstrating that fine-grained spatial reasoning can be acquired through targeted training.

The technique could eventually support smarter navigation, autonomous robot mobility, and accessible wayfinding. The team has open-sourced the full synthetic dataset to spur progress in the field. The work was carried out by Google Student Researcher Artemis Panagopoulou, Senior Software Engineer Mohit Goyal, and colleagues.


English source:

Teaching AI to read a map
February 17, 2026
Artemis Panagopoulou, Student Researcher, and Mohit Goyal, Senior Software Engineer, Google
We propose a system for synthetic data generation to train AI systems to visually follow any route on any map, finally teaching language models to navigate our world.
Look at a map of a shopping mall or a theme park. Within seconds, your brain processes the visual information, identifies your location, and traces the optimal path to your destination. You instinctively understand which lines are walls and which are walkways. This fundamental skill — fine-grained spatial reasoning — is second nature.
For all their incredible advances, multimodal large language models (MLLMs) often struggle with this particular task. While MLLMs can identify a picture of a zoo and list the animals you might find there, they may have a difficult time tracing a valid path from the entrance to the reptile house. They might draw a line straight through an enclosure or a gift shop, failing to respect the basic constraints of the environment. This reveals a critical gap: today’s models are excellent at recognizing what’s in an image, but they falter when they need to understand the geometric and topological relationships between the objects.
To address this challenge, in "MapTrace: Scalable Data Generation for Route Tracing on Maps", we introduce a new task, dataset, and synthetic data generation pipeline designed to teach MLLMs the fundamental skill of tracing paths on maps. Our work shows that this complex spatial ability, largely absent from pre-trained models, can be explicitly taught through targeted, synthetically generated data. We also open-source 2M question-answer pairs generated with the proposed pipeline using the Gemini 2.5 Pro and Imagen 4 models, to encourage the research community to further explore this area.
The challenge: A lack of grounding in the physical world
Why is tracing a path on a map so hard for AI models? It boils down to data. MLLMs learn from vast datasets of images and text. They learn to associate the word "path" with images of sidewalks and trails. However, they rarely see data that explicitly teaches them the rules of navigation — that paths have connectivity, that you can't walk through walls, and that a route is an ordered sequence of connected points.
The most direct way to teach this would be to collect a massive dataset of maps with millions of paths traced by hand. But annotating a single path with pixel-level accuracy is a painstaking process, and scaling it to the level required for training a large model is practically impossible. Furthermore, many of the best examples of complex maps — like those for malls, museums, and theme parks — are proprietary and cannot be easily collected for research.
This data bottleneck has held back progress. Without sufficient training examples, models lack the "spatial grammar" to interpret a map correctly. They see a soup of pixels, not a structured, navigable space.
The solution: A scalable pipeline for synthetic data
To address this data gap, we designed a fully automated, scalable pipeline that leverages the generative capabilities of Gemini Models to produce diverse high-quality maps. This process allows fine-grained control over data diversity and complexity, generating annotated paths that adhere to intended routes and avoid non-traversable regions without the need for collecting large-scale real-world maps.
The pipeline works in four automated and scalable stages, using AI models as both creators and critics to ensure quality and produce pixel-level annotations.

  1. Generating diverse maps
    First, we use a large language model (LLM) to generate rich, descriptive prompts for different types of maps. The LLM generates everything from "a map of a zoo with interconnected habitats" to "a shopping mall with a central food court" or "a fantasy theme park with winding paths through different themed lands." These text prompts are then fed into a text-to-image model that renders them into complex map images.
  2. Identifying traversable paths with an AI "Mask Critic"
    Once we have a map image, we need to identify all the "walkable" areas. Our system does this by clustering the pixels by color to create candidate path masks — essentially, a black-and-white map of all the walkways.
    But not every shaded region is a valid path. So, we employ another MLLM as a "Mask Critic" to examine each candidate mask and judge whether it represents a realistic, connected network of paths by looking at both the map image and the mask candidate. If the MLLM identifies the candidate as containing mostly valid traversable regions (e.g., paved sidewalks, marked crosswalks, pedestrian-only paths), it labels the candidate as high quality; only these high-quality masks are passed to the next stage.
  3. Building a navigable graph
    With a clean mask of all traversable areas, we convert that 2D image into a more structured graph format. Think of this as creating a digital version of a road network, where intersections are nodes and the roads between them are edges. This "pixel-graph" captures the connectivity of the map, making it easy to calculate routes computationally.
  4. Generating perfect paths with an AI "Path Critic"
    Finally, we sample thousands of random start and end points on the graph for each map. We use a classic Dijkstra's algorithm to find the absolute shortest path between these points. Then, we use another MLLM as a "Path Critic" to perform a final quality check. This critic looks at the final generated path overlaid on the map image and gives it a thumbs-up or thumbs-down, ensuring the route is logical, stays within the lines, and looks like a path a human would take.
This pipeline enabled us to create a dataset of 2M annotated map images with valid paths. While the generated images occasionally exhibit typographic errors, this study focuses primarily on path fidelity. We anticipate that ongoing advancements in generative modeling will naturally mitigate these artifacts in future iterations.
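To make stages 2 through 4 concrete, here is a minimal, self-contained sketch, not the paper's implementation: it thresholds pixel colors against one known walkway color to form a candidate mask, treats each walkable pixel as a node in a 4-connected graph, and runs Dijkstra's algorithm between a start and goal. The toy map, color threshold, and helper names are all illustrative assumptions.

```python
import heapq
import math

def candidate_mask(image, path_color, tol=30.0):
    """Stage 2 (simplified): mark pixels whose RGB color is close to a known
    walkway color. The actual pipeline clusters colors automatically and has
    an MLLM "Mask Critic" vet each candidate; the threshold here is illustrative."""
    return [[math.dist(px, path_color) < tol for px in row] for row in image]

def dijkstra_on_mask(mask, start, goal):
    """Stages 3-4 (simplified): treat each walkable pixel as a graph node with
    unit-cost edges to its 4 neighbors, and find a shortest path with Dijkstra."""
    h, w = len(mask), len(mask[0])
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, node = heapq.heappop(pq)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and mask[nr][nc]:
                nd = d + 1.0
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(pq, (nd, (nr, nc)))
    if goal != start and goal not in prev:
        return None  # no traversable route
    # Reconstruct the ordered point sequence from goal back to start.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Toy 5x5 "map": white background with a gray L-shaped walkway.
WHITE, GRAY = (255, 255, 255), (128, 128, 128)
img = [[WHITE] * 5 for _ in range(5)]
for r, c in [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4)]:
    img[r][c] = GRAY

mask = candidate_mask(img, path_color=GRAY)
route = dijkstra_on_mask(mask, start=(0, 0), goal=(2, 4))
print(route)  # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4)]
```

On real renderings, the pipeline clusters colors automatically and uses MLLM critics to vet both the mask and the resulting path; this sketch replaces both critics with a fixed color threshold.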
The results: A clear path to better spatial reasoning
So, does training on this synthetic data actually work? To find out, we fine-tuned several MLLMs on a smaller subset of generated data (23,000 paths) taken from our dataset, including the open Gemma 3 27B and Gemini 2.5 Flash. We then evaluated their performance on MapBench, a popular benchmark composed of real-world maps the MLLMs had not seen during training.
We measured the models' path-tracing error using the normalized dynamic time warping (NDTW) metric, an extension of dynamic time warping that compares two sequences of coordinates which may vary in speed (or, in this case, in the number of predicted points). The DTW cost is normalized by the total path length to obtain the final metric, a distance between the two paths where a lower value indicates better performance. The figure below shows how the NDTW metric is computed, detailing the alignment process. Figure (a) shows the 2D alignment of the Reference Path (blue) and Query Path (red, shifted by Y=1.0 for visualization); gray lines connect matching points identified by DTW, demonstrating the handling of phase shift and sampling differences. Figures (b) and (c) show the accumulated cost matrix and the pairwise Euclidean distance matrix, illustrating how the optimal warping path (white) tracks the lowest-distance pairs to minimize the total alignment cost. Finally, figure (d) displays a 1D comparison of x-coordinates, highlighting the temporal alignment problem that DTW resolves: the signals have similar shapes but different sampling rates and temporal offsets.
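A minimal, stdlib-only sketch of a DTW cost normalized by path length, in the spirit of the NDTW metric described above (the paper's exact formulation and normalization may differ; the sequences and values here are illustrative):

```python
import math

def dtw_distance(ref, query):
    """Classic dynamic-time-warping cost between two 2D point sequences,
    using Euclidean distance between matched points."""
    n, m = len(ref), len(query)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance reference only
                                 cost[i][j - 1],      # advance query only
                                 cost[i - 1][j - 1])  # advance both (match)
    return cost[n][m]

def ndtw(ref, query):
    """DTW cost normalized by the reference path's arc length, so scores stay
    comparable across maps whose paths differ in scale. (Illustrative variant;
    the paper's normalization may differ in detail.)"""
    length = sum(math.dist(a, b) for a, b in zip(ref, ref[1:]))
    return dtw_distance(ref, query) / length if length else dtw_distance(ref, query)

ref = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
query = [(0.0, 0.1), (1.5, 0.1), (3.0, 0.1)]  # same shape, coarser sampling
print(round(ndtw(ref, query), 3))  # 0.407
```

Because DTW aligns points rather than comparing them index by index, the coarser-sampled query still scores well against the reference, which is exactly the property the metric needs for paths predicted with varying numbers of points.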
Fine-tuning on our dataset substantially improved the models' abilities across the board. The fine-tuned Gemini 2.5 Flash model, for example, saw its NDTW drop significantly (from 1.29 to 0.87), achieving the best overall performance.
More importantly, the models became far more reliable. The success rate, i.e., the percentage of the time the model produced a valid, parsable path, rose for all models. The fine-tuned Gemma model saw a 6.4-point increase in its success rate and an improved NDTW (1.29 to 1.13), a dramatic improvement that demonstrates a newfound robustness. This means that after training on our dataset, the models weren't just more accurate when they succeeded; they were far less likely to fail completely.
These gains confirm our central hypothesis: fine-grained spatial reasoning is not an innate property of MLLMs but an acquired skill. With the right kind of explicit supervision, even if it's synthetically generated, we can teach models to understand and navigate spatial layouts.
Evaluating the performance of AI critics
For the Path Critic, we manually reviewed 120 decisions across 56 randomly sampled maps, achieving 76% accuracy with an 8% false-positive rate (invalid paths labeled as "high quality"). Errors mainly arose from 1) misclassifying background regions as traversable when their colors resemble paths, and 2) missing thin valid paths within larger open regions. For the Mask Critic, we inspected 200 judgments over 20 maps, observing 83% accuracy and a 9% false-positive rate. Common errors included 1) background pixels included due to color similarity, 2) small non-path elements (e.g., text) absorbed into otherwise correct masks, and 3) thin valid paths labeled as invalid.
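The accuracy and false-positive figures above reduce to simple counts over the manually reviewed decisions. A small sketch, using hypothetical counts and one plausible reading of "false-positive rate" (invalid items accepted as high quality, as a share of all reviewed decisions):

```python
def critic_metrics(decisions):
    """Each decision pairs the critic's verdict with a human ground-truth label:
    (critic_accepts, actually_valid). Accuracy is the fraction of verdicts that
    match ground truth; the false-positive rate here is the fraction of all
    reviewed decisions where an invalid item was accepted (one plausible
    reading of how the study reports it)."""
    n = len(decisions)
    correct = sum(pred == truth for pred, truth in decisions)
    false_pos = sum(pred and not truth for pred, truth in decisions)
    return correct / n, false_pos / n

# Hypothetical review of 10 critic verdicts, not the study's raw tallies.
reviewed = [(True, True)] * 6 + [(False, False)] * 2 + [(True, False)] + [(False, True)]
acc, fpr = critic_metrics(reviewed)
print(acc, fpr)  # 0.8 0.1
```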
What's next
The ability to reason about paths and connectivity unlocks a host of future applications, including:
• More intuitive navigation tools: An AI model that can look at a satellite image or a complex subway map and give you truly intuitive, visually grounded directions.
• Smarter robotics and autonomous agents: Robots that can navigate complex indoor environments like warehouses, hospitals, or airports by simply looking at a floor plan.
• Enhanced accessibility: Tools that can describe a path through a building for a visually impaired person in a clear, step-by-step manner.
Acknowledgments
This research was conducted by Artemis Panagopoulou (while working as a Student Researcher at Google), Mohit Goyal, Soroosh Yazdani, Florian Dubost, Chen Chai, Achin Kulshrestha, and Aveek Purohit.
