DS-STAR: An advanced, versatile data science agent

Published by qimuai on 2025-11-07 08:00. First-hand compilation.

Source: https://research.google/blog/ds-star-a-state-of-the-art-versatile-data-science-agent/

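Summary:

Google Cloud research scientist Jinsung Yoon and student researcher Jaehyun Nam recently introduced DS-STAR, a new data science agent. It works across multiple heterogeneous data sources and sets new state-of-the-art results on benchmarks including DABStep.

Conventional data science pipelines are often limited to structured data and depend on hand-written code, which slows analysis. DS-STAR addresses this with three components. First, a data file analysis module automatically extracts context from formats such as JSON and unstructured text. Second, an LLM-based verification step judges at each round whether the current plan is sufficient. Third, an iterative planning process refines the plan based on that feedback.

In evaluation, DS-STAR outperformed existing methods on all three benchmarks: DABStep, KramaBench, and DA-Code. On DABStep, accuracy rose from 41.0% to 45.2%, a gain of 4.2 percentage points over the best alternative. It also stayed ahead on hard tasks that require combining multiple data files.

Ablation studies confirmed the value of the core modules. With the data file analyzer disabled, accuracy on hard DABStep tasks dropped to 26.98%. Without the Router, the agent could only append new steps instead of correcting flawed ones, and performance fell on both easy and hard tasks. The framework also works with different base models: the GPT-5 version did better on easy tasks, while the Gemini-2.5-Pro version did better on hard tasks.

By mimicking how an expert analyst works, DS-STAR automates the path from raw data files to final insight. This could lower the barrier to data analysis for individuals and organizations across many fields.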

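Translation (from the Chinese edition):

DS-STAR: A state-of-the-art versatile data science agent
November 6, 2025
Jinsung Yoon, Research Scientist, and Jaehyun Nam, Student Researcher, Google Cloud

DS-STAR is a state-of-the-art data science agent. Its versatility shows in its ability to automate tasks such as statistical analysis, visualization, and data wrangling across many data formats, and it ranks first on the DABStep benchmark.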

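Data science turns raw data into actionable insight and plays a key role in solving real-world problems. Businesses often rely on data-driven insight for major strategic decisions. The process, however, demands expertise in fields such as computer science and statistics, and it is complex and time-consuming, spanning everything from interpreting documents to complex data processing and statistical analysis.

To streamline this workflow, recent research has used off-the-shelf large language models (LLMs) to build autonomous data science agents that turn natural-language questions into executable code. Current agents still have clear limitations. They rely heavily on well-structured data, such as CSV files in relational databases, and ignore the heterogeneous formats common in practice, such as JSON, unstructured text, and Markdown. In addition, many data science problems are open-ended and lack ground-truth answers, which makes it hard to verify whether an agent's reasoning is correct.

To address this, we present DS-STAR, a new agent for solving data science problems. It introduces three innovations: (1) a data file analysis module that handles many formats, including unstructured ones; (2) an LLM-based verification stage that assesses the plan's sufficiency at each step; and (3) a sequential planning process that iteratively refines the plan based on feedback. This iterative refinement lets DS-STAR handle complex analyses that draw on multiple data sources. DS-STAR achieves state-of-the-art results on the DABStep, KramaBench, and DA-Code benchmarks, and it does especially well on tasks involving heterogeneous data files.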

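Framework
DS-STAR runs in two stages. First, it automatically scans all files in a directory and generates a textual summary that provides key context for the task. It then enters a core loop of planning, implementing, and verifying: the Planner drafts a high-level plan, the Coder turns it into a code script, and the Verifier evaluates whether the code solves the problem. When the Verifier judges the plan insufficient, the Router agent decides whether to modify a step or add a new one, and a new round begins. DS-STAR mimics how a data analyst builds a solution step by step in tools such as Google Colab, reviewing intermediate results along the way. The loop runs for at most 10 rounds or until the plan is judged sufficient.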

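Evaluation
DS-STAR was compared with existing state-of-the-art methods, AutoGen and DA-Agent, on three benchmarks: DABStep, KramaBench, and DA-Code. It outperformed both in every test scenario, raising accuracy from 41.0% to 45.2% on DABStep, from 39.8% to 44.7% on KramaBench, and from 37.0% to 38.5% on DA-Code. As of September 18, 2025, DS-STAR also ranked first on the public DABStep leaderboard. On both easy tasks (where the answer is in a single file) and hard tasks (which require multiple files), DS-STAR consistently led the baselines, showing its ability to work with heterogeneous data sources.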

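In-depth analysis
Ablation studies verified the value of the core components:
• Data File Analyzer: with this component removed, DS-STAR's accuracy on hard DABStep tasks dropped sharply to 26.98%, showing that rich data context is essential for planning and implementation.
• Router: with routing removed, the agent could only add steps sequentially, and performance fell on both easy and hard tasks. Correcting mistakes is more effective than continuing to add potentially flawed steps.
• Generalizability across LLMs: using GPT-5 as the base model gave promising results on DABStep. Interestingly, the GPT-5 version did better on easy tasks, while the Gemini-2.5-Pro version did better on hard tasks.

An analysis of the refinement process shows that on DABStep, hard tasks needed 5.6 rounds on average and easy tasks only 3.0, with over half of the easy tasks completed in a single round.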

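Conclusion and outlook
DS-STAR rests on two core innovations: automatic analysis of diverse file formats, and an iterative planning process with LLM-based verification. It sets a new state of the art on all three benchmarks. The framework could make data science more accessible and help drive innovation across many fields.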

Acknowledgements
We thank Jiefeng Chen, Jinwoo Shin, Raj Sinha, Mihir Parmar, George Lee, Vishy Tirumalashetty, Tomas Pfister, and Burak Gokturk for their valuable contributions to this work.

English source:

DS-STAR: A state-of-the-art versatile data science agent
November 6, 2025
Jinsung Yoon, Research Scientist, and Jaehyun Nam, Student Researcher, Google Cloud
DS-STAR is a state-of-the-art data science agent whose versatility is shown by its ability to automate a range of tasks — from statistical analysis to visualization and data wrangling — across various data types, culminating in a top-ranking performance on the famous DABStep benchmark.
Data science is a field dedicated to transforming raw data into meaningful, actionable insights, playing an essential role in solving real-world challenges. Businesses often depend on data-driven insights to make pivotal strategic decisions. However, the data science process is frequently complex, demanding a high level of expertise in fields like computer science and statistics. This workflow consists of many time-intensive activities, from interpreting various documents to performing complex data processing and statistical analysis.
To streamline this complex workflow, recent research has focused on using off-the-shelf large language models (LLMs) to create autonomous data science agents. The goal of these agents is to convert natural language questions into executable code for a desired task. But despite making significant progress, current data science agents have several limitations that hinder their practical use. A major issue is their heavy reliance on well-structured data, like CSV files in relational databases. This limited focus ignores the valuable information contained in the diverse and heterogeneous data formats, such as JSON, unstructured text, and markdown files, that are common in real-world applications. Another challenge is that many data science problems are open-ended and lack ground-truth labels, making it difficult to verify if an agent's reasoning is correct.
To that end, we present DS-STAR, a new agent designed to solve data science problems. DS-STAR introduces three key innovations: (1) a data file analysis module that automatically extracts context from varied data formats, including unstructured ones; (2) a verification stage where an LLM-based judge assesses the plan’s sufficiency at each step; and (3) a sequential planning process that iteratively refines the initial plan based on feedback. This iterative refinement allows DS-STAR to handle complex analyses that draw verifiable insights from multiple data sources. We demonstrate that DS-STAR achieves state-of-the-art performance on challenging benchmarks like DABStep, KramaBench, and DA-Code. It especially excels with tasks involving diverse, heterogeneous data files.
DS-STAR
The DS-STAR framework operates in two main stages. First, it automatically examines all files in a directory and creates a textual summary of their structure and contents. This summary becomes a vital source of context for tackling the task at hand.
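As a rough illustration of this first stage, the sketch below walks a directory and builds a plain-text description of each file. It is a rule-based stand-in under stated assumptions, not DS-STAR's actual analyzer: the function names `describe_file` and `summarize_directory`, the per-format rules, and the summary wording are all hypothetical.

```python
import csv
import json
from pathlib import Path


def describe_file(path: Path, preview_lines: int = 3) -> str:
    """Return a one-line description of a file (hypothetical helper)."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        header = rows[0] if rows else []
        n_data = max(len(rows) - 1, 0)
        return f"{path.name}: CSV, {n_data} data rows, columns={header}"
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, dict):
            shape = f"object with keys={sorted(data)}"
        elif isinstance(data, list):
            shape = f"array of {len(data)} items"
        else:
            shape = f"scalar {type(data).__name__}"
        return f"{path.name}: JSON, {shape}"
    # Fallback for Markdown and other text files: show the first few lines.
    text = path.read_text(encoding="utf-8", errors="replace")
    head = " | ".join(text.splitlines()[:preview_lines])
    return f"{path.name}: text, starts with: {head}"


def summarize_directory(directory: Path) -> str:
    """Join per-file descriptions into one context string for later prompts."""
    files = sorted(p for p in directory.rglob("*") if p.is_file())
    return "\n".join(describe_file(p) for p in files)
```

Calling `summarize_directory(Path("data"))` on a folder of mixed CSV, JSON, and Markdown files would return one line per file, which a planner prompt could then include as context.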
Second, DS-STAR engages in a primary loop of planning, implementing, and verifying. The Planner agent first creates a high-level plan, which the Coder agent then transforms into a code script. Subsequently, the Verifier agent evaluates the code's effectiveness in solving the problem. The Verifier agent is an LLM-based judge prompted to determine if the current plan is adequate. If the judge finds the plan insufficient, DS-STAR refines it by altering or adding steps (determined by the Router agent) and then repeats the cycle. Importantly, DS-STAR uses a method that mimics how an expert analyst uses tools like Google Colab to build a plan sequentially, reviewing intermediate results before proceeding. This iterative cycle continues until a plan is deemed satisfactory or the maximum number of rounds (10) is reached, at which point the final code is delivered as the solution.
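The control flow of this loop can be sketched as follows. This is a minimal outline under assumptions, not the authors' implementation: the four agents are passed in as plain callables standing in for LLM calls, and the `RouterDecision` type, the `solve` signature, and the choice to rewrite a flawed step in place are all hypothetical. Only the four roles and the 10-round cap come from the post.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_ROUNDS = 10  # the post states a maximum of 10 rounds


@dataclass
class RouterDecision:
    """Hypothetical router output: rewrite step `fix_index`, or add a step if None."""
    fix_index: Optional[int] = None


def solve(
    question: str,
    context: str,
    plan_step: Callable[[str, str, List[str]], str],      # Planner stand-in
    write_code: Callable[[List[str], str], str],          # Coder stand-in
    is_sufficient: Callable[[str, List[str], str], bool], # Verifier stand-in
    route: Callable[[str, List[str], str], RouterDecision],  # Router stand-in
    max_rounds: int = MAX_ROUNDS,
) -> str:
    plan: List[str] = []
    code = ""
    decision = RouterDecision()  # the first round always adds a step
    for _ in range(max_rounds):
        idx = decision.fix_index
        if idx is None:
            # Add a new step that follows the existing plan.
            plan.append(plan_step(question, context, list(plan)))
        else:
            # Rewrite the flawed step, given only the steps before it.
            plan[idx] = plan_step(question, context, plan[:idx])
        code = write_code(plan, context)
        if is_sufficient(question, plan, code):
            break
        decision = route(question, plan, code)
    # Return the latest code even if the round limit was hit.
    return code
```

Passing the agents as callables keeps the loop testable with simple stubs; in a real system each callable would wrap a prompt to the underlying model.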
Evaluation
To evaluate DS-STAR’s effectiveness, we compared its performance to existing state-of-the-art methods (AutoGen, DA-Agent) using a set of well-regarded data science benchmarks, DABStep, KramaBench, and DA-Code. These benchmarks evaluate performance on complex tasks like data wrangling, machine learning, and visualization that use multiple data sources and formats.
The results show that DS-STAR substantially outperforms AutoGen and DA-Agent in all test scenarios. Compared to the best alternative, DS-STAR raised the accuracy from 41.0% to 45.2% on DABStep, 39.8% to 44.7% on KramaBench, and 37.0% to 38.5% on DA-Code. Notably, DS-STAR also secured the top rank on the public leaderboard for the DABStep benchmark (as of 9/18/2025). On both easy tasks (where the answer is in a single file) and hard tasks (requiring multiple files), DS-STAR consistently surpasses competing baselines, demonstrating its superior ability to work with multiple, heterogeneous data sources.
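To put the three reported improvements on a common footing, the short calculation below derives the absolute gain (in percentage points) and the relative gain for each benchmark. The accuracy figures are the ones quoted above; the `gains` helper and the dictionary layout are illustrative only.

```python
# (best alternative, DS-STAR) accuracy in percent, as reported in the post
results = {
    "DABStep": (41.0, 45.2),
    "KramaBench": (39.8, 44.7),
    "DA-Code": (37.0, 38.5),
}


def gains(baseline: float, ds_star: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(ds_star - baseline, 1)
    relative = round(100 * (ds_star - baseline) / baseline, 1)
    return absolute, relative


for name, (baseline, ds_star) in results.items():
    absolute, relative = gains(baseline, ds_star)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

The absolute gains work out to 4.2, 4.9, and 1.5 percentage points respectively, so the margin on DA-Code is noticeably smaller than on the other two benchmarks.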
In-depth analysis of DS-STAR
Next, we conducted ablation studies to verify the effectiveness of DS-STAR’s individual components and analyze the impact of the number of refinement rounds, specifically by measuring the iterations required to generate a sufficient plan.
Data File Analyzer: This agent is essential for high performance. Without the descriptions it generates (Variant 1), DS-STAR's accuracy on difficult tasks within the DABStep benchmark sharply dropped to 26.98%, underscoring the importance of rich data context for effective planning and implementation.
Router: The Router agent’s ability to determine if a new step is needed or to fix an incorrect step is vital. When we removed it (Variant 2), DS-STAR only added new steps sequentially, leading to worse performance on both easy and hard tasks. This demonstrated that it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps.
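A toy example makes the difference between the two variants concrete. The `refine` function and the plan contents below are hypothetical; the sketch only contrasts replacing a flawed step with appending after it, and does not reproduce how DS-STAR decides which step is flawed.

```python
from typing import List, Optional


def refine(plan: List[str], flawed_index: Optional[int],
           new_step: str, use_router: bool) -> List[str]:
    """Return a refined copy of the plan.

    With the router, a flawed step is replaced in place.
    Without it (as in Variant 2), the new step can only be appended,
    so the flawed step stays in the plan.
    """
    if use_router and flawed_index is not None:
        return plan[:flawed_index] + [new_step] + plan[flawed_index + 1:]
    return plan + [new_step]


plan = ["load payments.csv", "filter by wrong column", "compute mean fee"]
with_router = refine(plan, 1, "filter by merchant", use_router=True)
without_router = refine(plan, 1, "filter by merchant", use_router=False)
```

Here `with_router` has three steps with the bad filter replaced, while `without_router` has four steps and still carries the bad filter, which later code must work around.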
Generalizability Across LLMs: We also tested DS-STAR's adaptability by using GPT-5 as the base model. This yielded promising results on the DABStep benchmark, indicating the framework's generalizability. Interestingly, DS-STAR with GPT-5 performed better on easy tasks, while the Gemini-2.5-Pro version performed better on hard tasks.
An analysis of the refinement process: The figure below shows that difficult tasks naturally require more iterations. On the DABStep benchmark, hard tasks needed an average of 5.6 rounds to solve, whereas easy tasks required only 3.0 rounds. Furthermore, over half of the easy tasks were completed in just a single round.
Conclusion
In this work, we introduced DS-STAR, a new agent that can autonomously solve data science problems. The framework is defined by two core innovations: the automatic analysis of diverse file formats and an iterative, sequential planning process that uses a novel LLM-based verification system. DS-STAR establishes a new state-of-the-art on the DABStep, KramaBench, and DA-Code benchmarks, outperforming the best alternative. By automating complex data science tasks, DS-STAR has the potential to make data science more accessible for individuals and organizations, helping to drive innovation across many different fields.
Acknowledgements
We would like to thank Jiefeng Chen, Jinwoo Shin, Raj Sinha, Mihir Parmar, George Lee, Vishy Tirumalashetty, Tomas Pfister and Burak Gokturk for their valuable contributions to this work.
