Deep researcher with test-time diffusion.
Source: https://research.google/blog/deep-researcher-with-test-time-diffusion/
Summary:
The Google Cloud team introduces TTD-DR, a new AI research assistant that advances long-form report generation and complex reasoning.
On September 19, 2025, Google Cloud research scientists Rujun Han, Chen-Yu Lee, and their team released the Test-Time Diffusion Deep Researcher (TTD-DR). By emulating the iterative revision process of human research, the framework tightly couples draft generation with information retrieval, achieving state-of-the-art performance on long-form research report writing and complex reasoning tasks.
Conventional AI research assistants can generate content using techniques such as chain-of-thought reasoning or multi-answer selection, but they lack the "plan-draft-research-feedback" iteration loop central to human research. TTD-DR instead frames report writing as a diffusion process: it first produces a preliminary draft as a "noisy" version, then "denoises" it by continually injecting newly retrieved information, while a self-evolution algorithm dynamically raises the quality of each component.
The system's core is a three-stage architecture: it first generates a structured research plan, then runs iterative search via two sub-agents (search question generation and answer searching), and finally synthesizes all gathered information into a complete report. Notably, its cross-over mechanism has a large language model evaluate multiple revised variants and merge the best into a single high-quality output.
TTD-DR achieves the best results on three benchmarks: DeepConsult (long-form report generation), Humanity's Last Exam, and GAIA (multi-hop reasoning). Compared to OpenAI Deep Research, it reaches a 74.5% win rate on long-form report generation and leads by 7.7% and 1.7% on the two short-answer research datasets. Ablation experiments further confirm that the self-evolution algorithm and the denoising-with-retrieval mechanism each contribute significant per-component gains.
The technology has been productized on Google Agentspace via the Google Cloud Agent Development Kit, offering academic researchers and professionals a new generation of AI research assistance. The team notes that the system's "draft-first" design keeps the research process focused and coherent, preventing important information from being lost during iteration.
English source:
Deep researcher with test-time diffusion
September 19, 2025
Rujun Han and Chen-Yu Lee, Research Scientists, Google Cloud
We introduce Test-Time Diffusion Deep Researcher (TTD-DR), a framework that uses a Deep Research agent to draft and revise its own drafts using high-quality retrieved information. This approach achieves new state-of-the-art results in writing long-form research reports and completing complex reasoning tasks.
The recent advances in large language models (LLMs) have fueled the emergence of deep research (DR) agents. These agents demonstrate remarkable capabilities, including the generation of novel ideas, efficient information retrieval, experimental execution, and the subsequent drafting of comprehensive reports and academic papers.
Currently, most public DR agents use a variety of clever techniques to improve their results, like performing reasoning via chain-of-thought or generating multiple answers and selecting the best one. While they've made impressive progress, they often bolt different tools together without considering the iterative nature of human research. They're missing the key process (i.e., planning, drafting, researching, and iterating based on feedback) on which people rely when writing a paper about a complex topic. A key part of that revision process is to do more research to find missing information or strengthen your arguments. This human pattern is surprisingly similar to the mechanism of retrieval-augmented diffusion models that start with a “noisy” or messy output and gradually refine it into a high-quality result. What if an AI agent's rough draft is the noisy version, and a search tool acts as the denoising step that cleans it up with new facts?
Today we introduce Test-Time Diffusion Deep Researcher (TTD-DR), a DR agent that imitates the way humans do research. To our knowledge, TTD-DR is the first research agent that models research report writing as a diffusion process, where a messy first draft is gradually polished into a high-quality final version. We introduce two new algorithms that work together to enable TTD-DR. First, component-wise optimization via self-evolution enhances the quality of each step in the research workflow. Then, report-level refinement via denoising with retrieval applies newly retrieved information to revise and improve the report draft. We demonstrate that TTD-DR achieves state-of-the-art results on long-form report writing and multi-hop reasoning tasks.
Test-Time Diffusion Deep Researcher
TTD-DR is designed to take a user query as input and then create a preliminary draft that serves as an evolving foundation to guide the research plan. This evolving draft is iteratively refined using a denoising with retrieval process (report-level refinement) that takes the information it finds and uses it to improve the draft at each step. This happens in a continuous loop that improves the report with each cycle. To top it all off, a self-evolution algorithm constantly enhances the entire process, from the initial plan to the final report. This powerful combination of refinement and self-improvement leads to a more coherent report writing process.
Backbone DR design
The backbone DR design consists of three stages that we outline below.
- Research plan generation: Produces a structured research plan upon receiving a user query. This plan outlines a list of key areas needed for the final report, serving as an initial guideline for the subsequent information-gathering process.
- Iterative search: Contains two sub-agents: Search Question Generation (stage 2a in the figure below) formulates a search query based on the research plan, the user query, and the context from previous search iterations (i.e., past questions and answers). Answer Searching (stage 2b) searches the available sources to find relevant documents and returns a summarized answer, similar to retrieval-augmented generation (RAG) systems.
- Final report generation: Produces a comprehensive and coherent final report by combining all the structured information gathered, that is, the plan and the series of question-answer pairs.
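The three-stage backbone above can be sketched as a simple loop. This is a minimal, hypothetical illustration: `generate_plan`, `generate_query`, `search_and_summarize`, and `write_report` are stand-in stubs for LLM and retrieval calls, not part of the actual system.

```python
def generate_plan(user_query):
    # Stage 1: a structured plan listing key areas for the final report (stub).
    return [f"{user_query}: background", f"{user_query}: recent findings"]

def generate_query(plan, user_query, history):
    # Stage 2a: next search query from the plan, user query, and past Q&A (stub).
    return f"search {len(history) + 1} for {user_query}"

def search_and_summarize(query):
    # Stage 2b: retrieve documents and return a summarized answer, RAG-style (stub).
    return f"summary of results for '{query}'"

def write_report(plan, history):
    # Stage 3: combine the plan and all question-answer pairs into one report.
    body = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    return "Plan: " + "; ".join(plan) + "\n" + body

def backbone_dr(user_query, num_search_steps=3):
    plan = generate_plan(user_query)
    history = []  # accumulated (question, answer) pairs
    for _ in range(num_search_steps):
        q = generate_query(plan, user_query, history)
        history.append((q, search_and_summarize(q)))
    return write_report(plan, history)

report = backbone_dr("test-time diffusion agents")
```

The key structural point is that stage 2a conditions on the full search history, so each new query can target what the gathered context still lacks.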
Component-wise self-evolution
We leverage a self-evolutionary algorithm to enhance the performance of each stage's agents in order to find and preserve high-quality context.
- Initial states: The leftmost blocks in the diagram below represent multiple diverse answer variants based on the output of previous stages, which are used to explore a larger search space. This ideally leads to the discovery of more valuable information.
- Environmental feedback: Each answer variant is assessed by an LLM-as-a-judge, utilizing auto-raters for metrics such as helpfulness and comprehensiveness. These raters not only provide fitness scores but also generate textual feedback that helps improve the answer.
- Revision: With the scores and feedback from the previous step, each variant undergoes a revision step to adapt toward better fitness scores. The environmental feedback and revision steps repeat until reaching some maximum number of iterations or until the agent determines no more revisions are needed.
- Cross-over: Finally, multiple revised variants are merged into a single, high-quality output. This merging process consolidates the best information from all evolutionary paths, producing superior context for the main report generation process.
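The four steps above can be sketched as one evolution loop. The sketch below is a toy illustration under stated assumptions: `judge` stands in for the LLM-as-a-judge auto-rater (here using answer length as a trivial fitness proxy), and `revise` and `cross_over` stand in for the revision and merging agents.

```python
def judge(variant):
    # Environmental feedback (stub): return a fitness score plus textual
    # feedback; length is a toy proxy for helpfulness/comprehensiveness.
    return len(variant), "add more detail"

def revise(variant, feedback):
    # Revision (stub): push the variant toward a better fitness score.
    return variant + " [revised: " + feedback + "]"

def cross_over(variants):
    # Cross-over (stub): merge all revised variants into one output.
    return " | ".join(variants)

def self_evolve(initial_variants, max_iters=2):
    # Initial states: diverse variants explore a larger search space.
    variants = list(initial_variants)
    for _ in range(max_iters):
        scored = [judge(v) for v in variants]
        variants = [revise(v, fb) for v, (_, fb) in zip(variants, scored)]
    return cross_over(variants)

out = self_evolve(["draft A", "draft B"])
```

In the real system the loop would also terminate early once the agent judges that no further revision is needed; the fixed `max_iters` here is a simplification.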
Report-level denoising with retrieval
Since a preliminary noisy draft is useless for complex topics without real research, TTD-DR uses a search tool that denoises and evolves the draft.
Specifically, we feed the current draft report into the Search Generation stage (Stage 2a) of the backbone DR workflow to inform the generation of the next search query. After obtaining a synthesized answer in the Answer Searching stage (Stage 2b), the new information is used to revise the report draft, either by adding new details or by verifying existing information. This process of feeding the denoised report back to generate the next search query is repeated. The draft is progressively denoised until the search process concludes, at which point a final agent writes the final report based on all historical search answers and revisions (Stage 3).
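The denoising cycle described above can be outlined as follows. This is a hypothetical sketch, not the actual implementation: `next_query`, `retrieve`, and `merge_into_draft` are stand-ins for stages 2a, 2b, and the draft-revision step.

```python
def next_query(draft, step):
    # Stage 2a (stub): derive the next search query from the current draft,
    # so retrieval targets the draft's remaining gaps.
    return f"gap {step} in: {draft[:20]}"

def retrieve(query):
    # Stage 2b (stub): return a synthesized answer for the query.
    return f"fact({query})"

def merge_into_draft(draft, answer):
    # Denoising step (stub): add new details or verify existing content.
    return draft + " + " + answer

def denoise_report(initial_draft, num_steps=3):
    draft = initial_draft
    answers = []  # history of retrieved answers for the final writer
    for step in range(num_steps):
        q = next_query(draft, step)
        a = retrieve(q)
        answers.append(a)
        draft = merge_into_draft(draft, a)  # feed denoised draft back in
    # Stage 3 would write the final report from all answers and revisions.
    return draft, answers

final_draft, history = denoise_report("noisy preliminary draft")
```

The essential feedback loop is that each revised (denoised) draft steers the next retrieval, so the search trajectory adapts as the report improves.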
Results
We evaluate TTD-DR's performance using benchmark datasets that focus on two broad tasks: 1) complex queries that require research agents to produce a long-form comprehensive report (DeepConsult), and 2) multi-hop queries that require extensive search and reasoning to answer (Humanity's Last Exam [HLE] and GAIA). We sub-sample 200 queries from HLE that need more search and reasoning (HLE-Search). Both categories fit our objective of building a general-purpose, real-world research companion. We compare our DR systems with OpenAI Deep Research.
TTD-DR consistently achieves better results across all benchmarks. Notably, when compared to OpenAI DR, TTD-DR achieves 74.5% win rate for the long-form research report generation tasks. Additionally, it outperforms OpenAI DR by 7.7% and 1.7% on the two extensive research datasets with short-form ground-truth answers.
Ablation study
For the ablation study, we incrementally add the three methods in the section above. Our DR agents use Gemini-2.5-pro as the base model; all other baseline agents use their default LLMs. The charts below show the ablation study for our DR agents. The backbone DR agent underperforms OpenAI DR. With the addition of the proposed self-evolution algorithm, our system outperforms OpenAI Deep Research on DeepConsult with a 59.8% win rate, and the correctness scores on the HLE-Search and GAIA datasets improve by 4.4% and 1.2%, respectively. Finally, incorporating diffusion with retrieval leads to substantial gains across all benchmarks.
The Pareto-frontier diagram below further shows the test-time scaling efficiency of TTD-DR compared with other DR agents. We found that TTD-DR is more efficient than OpenAI DR: at the same latency, it achieves higher quality as measured by win rate. See the paper for more details.
Conclusion
The Deep Researcher with Test-Time Diffusion (TTD-DR) is a new framework inspired by the iterative way humans do research. This agent addresses the limitations of existing DR agents by conceptualizing report generation as a diffusion process. The TTD-DR framework significantly outperforms existing DR agents across various benchmarks requiring intensive search and multi-hop reasoning. It demonstrates state-of-the-art performance in generating comprehensive long-form research reports and identifying concise answers for multi-hop search and reasoning tasks. We believe the reason it works so well is its "draft-first" design, which keeps the whole research process focused and coherent, preventing important information from getting lost along the way.
Availability on Google Cloud Platform
A product version of this work is available on Google Agentspace, implemented with Google Cloud Agent Development Kit.
Acknowledgements
This research was conducted by Rujun Han, Yanfei Chen, Guan Sun, Lesly Miculicich, Zoey CuiZhu, Yuanjun (Sophia) Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Xiaowei Li, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, and Chen-Yu Lee.