Testing LLMs on superconductivity research questions

Source: https://research.google/blog/testing-llms-on-superconductivity-research-questions/
Summary:
Google and several universities evaluated LLMs on superconductivity research: models grounded in curated knowledge bases are more reliable
On March 16, 2026, Google Research scientist Subhashini Venugopalan and visiting scientist Eun-ah Kim published a study asking whether large language models can serve as expert-level research partners in modern physics. Taking high-temperature superconductivity as a frontier case study, the team put six mainstream LLMs through specialized testing and had an expert panel score the answers along multiple dimensions. The paper appears in the Proceedings of the National Academy of Sciences.
The study focuses on the mechanism of high-temperature superconductivity, a scientific puzzle that has remained unresolved since the Nobel Prize awarded in 1987. The expert panel designed 67 in-depth questions probing the models' grasp of domain knowledge, their balanced treatment of competing theories, and their command of the scientific evidence.
The models tested were GPT-4o, Perplexity, Claude 3.5, Gemini Advanced Pro 1.5, Google NotebookLM, and a custom-built retrieval-augmented generation system. Evaluation was masked, with experts scoring along six dimensions: balance of perspectives, factual comprehensiveness, conciseness, evidential support, image relevance, and qualitative feedback.
Models drawing on closed, expert-curated and vetted knowledge bases performed best. NotebookLM (built on a user-supplied library of 1,726 high-quality sources) led on most metrics, standing out in evidential support and in balancing different scientific views; the custom RAG system came next. By contrast, models relying on the open web tended to mix established theories with highly speculative material and were less accurate.
The study also notes shortcomings common to current models: weaknesses in temporal and contextual understanding, such as difficulty recognizing hypotheses that have since been disproved; limited ability to interpret the figures and charts that scientific papers rely on; and over-reliance on literal matching of query wording, which causes relevant literature to be missed.
Despite these limitations, the study suggests that LLMs can reach proficiency in complex fields involving open scientific questions. A reliable AI research assistant could help newcomers get up to speed on a field's literature and give senior scientists a steady stream of research ideas.
The team stresses that building trustworthy AI tools for science requires deep involvement and evaluation by domain experts. At ICLR in April 2026 they will release CMT-benchmark, a more rigorous evaluation for condensed matter theory, to push the technology further.
The study was a collaboration among Google Research, Cornell University, and Harvard University, with support from scientists at Stanford, Johns Hopkins, the Flatiron Institute, and other institutions.
Original article:
Testing LLMs on superconductivity research questions
March 16, 2026
Subhashini Venugopalan, Research Scientist, and Eun-ah Kim, Visiting Scientist, Google Research
Can LLMs become expert-level research partners in modern physics? Using high-temperature superconductivity as a case study, physicists tested six LLMs with challenging questions and graded the responses.
Artificial intelligence (AI) is now routinely used to compose emails, edit images and summarize information from the web. AI also holds enormous potential to accelerate scientific research. However, its effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research, requiring AI to meet an extremely high bar for accuracy and to navigate complex, evolving areas of knowledge.
Our new paper published in the Proceedings of the National Academy of Sciences, “Expert evaluation of LLM world models: A high-Tc superconductivity case study”, assesses whether large language model (LLM) world models could answer expert-level questions in condensed matter physics. In collaboration with Cornell University, we asked six LLMs to answer high-level questions on high-temperature superconductors. A panel of experts then scored the responses on multiple criteria. We found that the top performers were two tools that drew from a closed ecosystem of certified, quality-controlled sources: NotebookLM and a custom-built system. We also identified key areas for improvement in all the systems studied. Results of this test case can help inform development of trustworthy tools to advance scientific discovery.
In previous related work, Google researchers evaluated whether LLMs could perform basic analytic tasks by referencing research papers in six scientific disciplines. That work introduced CURIE, a benchmark for evaluating LLMs in fields ranging from biodiversity to condensed matter physics to protein sequencing, with questions that require analysis rather than mere regurgitation of facts. Other work explored using LLMs to interpret tables and figures, to solve equations in quantum mechanics, and to solve engineering simulation problems with specialized software.
Several other groups across Google are also exploring AI to advance scientific research: as a thought partner for generating new hypotheses; as an agent to write expert-level scientific software; and with an AI-based model for single-cell analysis.
Navigating open research questions
In this work, we explored if LLMs can act as knowledgeable, unbiased thought partners in specialized fields that require in-depth research and an ability to balance competing theories around open scientific questions.
We focused on the underlying mechanisms of high-temperature superconductivity, an open area of inquiry in condensed matter physics since the phenomenon's discovery, which was recognized with the 1987 Nobel Prize. For this study, we centered on a class of copper-containing compounds known as cuprates. Cuprates can conduct electricity with zero resistance at temperatures significantly higher than traditional superconducting materials, though still cold: their highest known temperature threshold is roughly -140 degrees Celsius. Understanding the underlying mechanism behind this behavior might help to discover more compounds with similar properties, potentially at higher temperatures, and could pave the way to more applications.
Over the decades, physicists have published thousands of studies using various experimental techniques to probe the quantum mechanical properties that lead to superconductivity. Several competing theories have been proposed and pursued by different research groups. The sheer volume of literature makes navigating this knowledge base extremely difficult for a new generation. Students entering the field would benefit from a knowledgeable tutor that has a neutral perspective on the published research.
Case study
To compare the impact of using different data sources, the study evaluated four models with full access to the web against two closed systems that drew from a curated database. For the two closed systems, twelve top international experts in the field of high-temperature superconductivity selected 15 scientific review articles to provide an overview of the field with quality-controlled initial source material. The four web-based models had full internet access, including 765 open-access experimental papers and 1,553 open-access theoretical papers.
The expert panel then wrote 67 questions designed to test a model’s deep knowledge of the field, such as “At what level of doping does the Lifshitz transition occur in LSCO?” and “What is the evidence supporting the quantum critical point scenario in cuprates?” Finally, each expert evaluated six different models’ performance on answering those test questions.
Results
We evaluated six LLMs: GPT-4o, Perplexity, Claude 3.5, Gemini Advanced Pro 1.5, Google NotebookLM, and a custom-built retrieval-augmented generation (RAG) system. Using a masked review process, experts individually scored each model's answers on a scale of 0 to 2 across six metrics (a sketch of aggregating such ratings follows the list):
- Balanced perspective: Whether varying scientific points of view were considered.
- Comprehensiveness: Factual depth without missing relevant experiments.
- Conciseness: Providing a brief and clear answer.
- Evidence: Supported by evidence and links to source material.
- Visual relevance: Quality of any provided images (applicable to the two LLMs that consistently included images).
- Qualitative feedback: Open-ended expert comments.
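The post does not show the study's actual analysis pipeline; as a minimal sketch of how masked ratings on this rubric could be aggregated, assuming one record per expert, model, and metric (all names and values here are hypothetical):

```python
import statistics
from collections import defaultdict

# Hypothetical masked-review records: (expert, model, metric, score in {0, 1, 2}).
ratings = [
    ("expert_01", "NotebookLM", "evidence", 2),
    ("expert_01", "custom_RAG", "evidence", 2),
    ("expert_02", "NotebookLM", "conciseness", 0),
    ("expert_02", "GPT-4o", "balanced_perspective", 1),
    # ... one record per (expert, model, metric) pairing in a real study
]

def aggregate(ratings):
    """Average each model's 0-2 scores per metric across all masked reviews."""
    buckets = defaultdict(list)
    for _expert, model, metric, score in ratings:
        buckets[(model, metric)].append(score)
    return {key: statistics.mean(scores) for key, scores in buckets.items()}

for (model, metric), mean_score in sorted(aggregate(ratings).items()):
    print(f"{model:12s} {metric:22s} {mean_score:.2f}")
```

Because reviews are masked, the expert never sees which model produced an answer; only the aggregation step reattaches model identities.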
Of the six LLMs, NotebookLM stood out in most aspects of the masked tests. NotebookLM is a product that answers users' questions based on a library of user-provided documents; in this case, a library of 1,726 sources comprising experimental papers and review articles. The next-highest overall performer was our custom RAG system, which contained the same sources. NotebookLM, Gemini, and the custom RAG system scored in the top three for providing a balance of perspectives and for offering comprehensive answers. Despite being the least succinct, NotebookLM scored highest for providing evidence. Image-relevance scores were generally low, with the custom RAG system scoring higher than the other image-providing LLM, Perplexity.
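The post does not detail the custom RAG system's internals; below is a minimal sketch of the standard retrieve-then-generate pattern the name refers to, with a placeholder embedding function and a three-document stand-in for the 1,726 curated sources (everything here is an illustrative assumption, not the authors' implementation):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real text-embedding model (e.g., a sentence encoder)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

# Tiny curated corpus standing in for the vetted source library.
corpus = [
    "Review: evidence for a quantum critical point in cuprate superconductors.",
    "ARPES study of the Lifshitz transition in overdoped LSCO.",
    "Theory: spin-fluctuation pairing in copper-oxide planes.",
]
index = np.stack([embed(doc) for doc in corpus])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query (cosine similarity)."""
    scores = index @ embed(query)
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

# A RAG system prepends retrieved passages to the LLM prompt, so answers stay
# grounded in, and can cite, the curated quality-controlled sources.
passages = retrieve("At what doping does the Lifshitz transition occur in LSCO?")
prompt = "Answer using only these sources:\n" + "\n".join(passages)
```

Grounding generation in retrieved passages from a closed, vetted corpus is consistent with the evidence-support advantage the closed systems show above.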
Conclusion
Several larger conclusions emerge from this test case. The two models that drew from curated databases of experimental literature, NotebookLM and our custom-built tool, outperformed the LLMs trained on unfiltered internet data. In particular, models relying on open web sources tended to mix established theories with highly speculative ones.
The evaluated LLMs (accessed in December 2024) also showed weaknesses in temporal and contextual understanding. For example, they often failed to recognize when a proposed hypothesis was later disproved. They also frequently missed relevant papers whose text did not explicitly contain the exact wording used in the initial query.
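To make that failure mode concrete, here is a toy illustration (not from the paper; the abstracts and query are invented): a filter requiring the query's exact wording misses a paper that phrases the same concept differently, while a looser match recovers it.

```python
# Hypothetical paper abstracts; a toy stand-in for a literature index.
abstracts = {
    "paper_A": "We report signatures of a quantum critical point near optimal doping.",
    "paper_B": "Transport measurements reveal quantum criticality in overdoped cuprates.",
}

query = "quantum critical point"

# Literal matching on the query wording finds paper_A but misses paper_B,
# even though both discuss the same physics.
literal_hits = [pid for pid, text in abstracts.items() if query in text.lower()]
print(literal_hits)  # ['paper_A']

def token_overlap(query: str, text: str) -> int:
    """Count query terms appearing in the text as word prefixes (crude stemming)."""
    words = text.lower().split()
    return sum(any(w.startswith(t[:7]) for w in words) for t in query.lower().split())

# A looser match (or, better, embedding similarity) also surfaces paper_B.
hits = [pid for pid, text in abstracts.items() if token_overlap(query, text) >= 2]
print(hits)  # ['paper_A', 'paper_B']
```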
Our results broadly highlight the need for LLMs to better understand tables and images, as scientific papers rely heavily on these formats. While two of the models consistently referenced images, they often relied on image captions more than on visual analysis. Enhancing visual reasoning capability, including interpreting images, plots, and scale bars, is a major direction for future improvement.
Looking ahead
A reliable AI research partner could help rapidly onboard new graduate students on existing scientific literature and serve as an always-available thought partner. It could also help experienced scientists identify new research directions.
Despite existing limitations, our results suggest that LLMs can reach proficiency in complex fields involving open research questions. However, assessing a model’s capabilities in specialized fields relies on qualified experts whose knowledge is both essential and rare. We continue to work in this area, and will present the CMT-benchmark at ICLR 2026 in April as a more rigorous evaluation of LLMs in the broader field of condensed matter theory. Overall, these efforts required significant time and analysis from experts in physics; we hope the insights they produce can scale to inform the further development of trustworthy AI tools to advance scientific progress.
Acknowledgments
Research described here is a joint effort between Google Research, Cornell University, and Harvard University. We are grateful to the many incredible scientists across Stanford, Johns Hopkins, Flatiron Institute, CUNY, MIT, Cornell, and Harvard who made this study possible: Steven A. Kivelson, N. P. Armitage, Antoine Georges, Olivier Gingras, Dominik Kiese, Chunhan Feng, Vadim Oganesyan, T. Senthil, B.J. Ramshaw, and Subir Sachdev. We thank Haoyu Gao and student researcher Maria Tikhanovskaya for helping shape the study and datasets. We also acknowledge the valuable contributions of Oliver King and Wesley Hutchins for helping set up the study with NotebookLM. We appreciate Stephan Hoyer for thoughtful feedback on early manuscripts of this work. Finally, we thank John Platt and Michael Brenner for continuous support and encouragement that enabled this research.