Google DeepMind brings agentic vision to Gemini 3 Flash.

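Source: https://aibusiness.com/image-recognition/google-deepmind-agentic-vision-gemini-3-flash
Summary:
Google's DeepMind team has announced a new "agentic vision" capability for its lightweight Gemini 3 Flash model. It turns image analysis from passive recognition into active exploration: the model generates and runs Python code to zoom into, annotate and parse an image step by step.
Conventional multimodal models interpret an image once, as a whole, whereas the upgraded model draws up its own analysis plan in response to the user's query. When reading a street sign or a chip's serial number, for example, it runs a Think-Act-Observe loop: it plans the analysis steps, uses code to focus on the relevant details, then checks the results before producing its answer.
In a blog post, Google DeepMind product manager Rohan Doshi wrote that combining visual reasoning with code execution lets the model ground its answers in visual evidence, step by step. According to Google, the update improves quality on vision benchmarks by 5% to 10%, and visual plotting is said to reduce hallucinations in visual math tasks.
The capability can be tried in Google AI Studio, with behaviors including iterative zooming, direct image annotation and visual plotting. Google plans to add more implicit code-driven behaviors to the model, and expects to roll out web search, reverse image search and more model sizes in the future.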
English source:
The new capabilities combine visual reasoning with Python code to improve image analysis and enable active investigations.
Google DeepMind this week added agentic vision capabilities to its Gemini 3 Flash model, turning image analysis into an active rather than passive task.
Typical multimodal models process an image in a single "glance." By introducing agentic capabilities, Google lets its model actively study a picture and home in on specific details, such as a street sign or the serial number on a microchip.
The new feature works by generating and running Python code that zooms, manipulates and inspects images methodically.
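To make the idea of code-driven zooming concrete, here is a minimal sketch of a crop-and-enlarge step. It is not the code Gemini generates: the `Image` type (a grid of grayscale values) and the helpers `crop`, `zoom` and `zoom_into` are hypothetical names chosen for illustration, and only the Python standard library is used.

```python
from typing import List

# Hypothetical stand-in for an image: rows of grayscale pixel values.
Image = List[List[int]]


def crop(image: Image, left: int, top: int, right: int, bottom: int) -> Image:
    """Return the region covering columns [left, right) and rows [top, bottom)."""
    return [row[left:right] for row in image[top:bottom]]


def zoom(image: Image, factor: int) -> Image:
    """Enlarge an image by an integer factor using nearest-neighbor repetition."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    zoomed: Image = []
    for row in image:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide_row = [pixel for pixel in row for _ in range(factor)]
        zoomed.extend(list(wide_row) for _ in range(factor))
    return zoomed


def zoom_into(image: Image, left: int, top: int, right: int, bottom: int,
              factor: int) -> Image:
    """Crop a region of interest and enlarge it for closer inspection."""
    return zoom(crop(image, left, top, right, bottom), factor)


if __name__ == "__main__":
    picture = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    for row in zoom_into(picture, 1, 1, 3, 3, 2):
        print(row)
```

A real pipeline would more likely rely on an imaging library for cropping and resampling; the plain-list version is used here only to keep the sketch dependency-free.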
"By combining visual reasoning with code execution, one of the first tools supported by Agentic Vision, the model formulates plans to zoom in, inspect and manipulate images step-by-step, grounding answers in visual evidence," Rohan Doshi, product manager at Google DeepMind, wrote in a blog post about the announcement.
The feature uses a Think-Act-Observe loop: Gemini 3 Flash first studies the user query and image and formulates a plan, then uses Python code to carry out the image analysis, and finally inspects the results before generating its response.
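The control flow of such a loop can be sketched with a toy task: locating a single bright pixel by repeatedly zooming into the most promising quadrant. This is an illustrative assumption, not Google's implementation. In the real system a model does the planning and writes the code, whereas here a simple sum-of-pixels heuristic stands in for "thinking", and the names `think`, `act`, `observe` and `locate_bright_spot` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]          # hypothetical stand-in: rows of grayscale pixels
Box = Tuple[int, int, int, int]  # left, top, right, bottom (right/bottom exclusive)


def region_sum(image: Image, box: Box) -> int:
    left, top, right, bottom = box
    return sum(sum(row[left:right]) for row in image[top:bottom])


def think(image: Image, box: Box) -> Box:
    """Plan the next zoom: choose the quadrant of `box` with the largest pixel sum."""
    left, top, right, bottom = box
    mid_x = (left + right + 1) // 2
    mid_y = (top + bottom + 1) // 2
    quadrants = [
        (left, top, mid_x, mid_y), (mid_x, top, right, mid_y),
        (left, mid_y, mid_x, bottom), (mid_x, mid_y, right, bottom),
    ]
    # Drop empty quadrants, which appear once the box is one pixel wide or tall.
    quadrants = [q for q in quadrants if q[2] > q[0] and q[3] > q[1]]
    return max(quadrants, key=lambda q: region_sum(image, q))


def act(image: Image, box: Box) -> Image:
    """Execute the plan: crop the chosen region."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def observe(view: Image) -> bool:
    """Inspect the result: stop once the view has narrowed to a single pixel."""
    return len(view) == 1 and len(view[0]) == 1


def locate_bright_spot(image: Image, max_steps: int = 32):
    """Run Think-Act-Observe until one pixel remains; return (x, y, trace of boxes)."""
    box: Box = (0, 0, len(image[0]), len(image))
    trace: List[Box] = []
    for _ in range(max_steps):
        box = think(image, box)   # Think
        trace.append(box)
        view = act(image, box)    # Act
        if observe(view):         # Observe
            return box[0], box[1], trace
    raise RuntimeError("no answer within max_steps")


if __name__ == "__main__":
    grid = [[0] * 8 for _ in range(8)]
    grid[5][2] = 255
    print(locate_bright_spot(grid))
```

The quadrant heuristic only works reliably when one region clearly dominates, as in the single-bright-pixel example. The point of the sketch is the loop structure, in which each observation decides whether to answer or to plan another step.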
According to Google, the update delivers a quality improvement of 5% to 10% across vision benchmarks.
Google said a range of new agentic behaviors enabled by the update have already been demonstrated via Google AI Studio, including iterative zooming, direct image annotation and visual plotting. Visual plotting is said to reduce hallucinations, a common problem in visual math tasks.
Looking ahead, the company said it plans to add more implicit code-driven behaviors to the model, meaning certain capabilities that currently require a specific prompt will be triggered autonomously.
More features, such as web and reverse image search, as well as a greater range of model sizes, are also expected to be rolled out in the future.
Article link: https://qimuai.cn/?post=3087