AI models are using material from retracted scientific papers
Summary:
Recent studies have found that several mainstream AI chatbots cite flawed, retracted papers when answering scientific questions without alerting users to the papers' retraction status, a finding that has raised concerns in the research community about the reliability of AI research tools.
According to an MIT Technology Review investigation, AI tools including ChatGPT, Elicit, and Perplexity cited content from retracted papers in fields such as medical imaging without flagging the retractions. Weikuan Gu, a medical researcher at the University of Tennessee, notes that users who read only the answer and never check a paper's status can be misled.
The problem stems from several technical challenges. First, academic publishers lack a uniform standard for marking retractions, so labeling practices vary. Second, preprint servers and other channels scatter copies of retracted papers across the web, making them harder to identify. In addition, AI training data is updated with a lag and cannot reflect the latest retraction information in real time. Ivan Oransky, cofounder of Retraction Watch, concedes that building a comprehensive retraction database requires extensive manual curation and that no complete solution exists today.
Some companies have begun to take corrective measures. Consensus cut its citations of retracted papers from 18 to 5 by combining publisher data with third-party retraction databases, and Elicit says it has removed retracted papers flagged by OpenAlex. But experts stress that relying on databases alone is not enough; contextual information such as peer-review comments should also be folded into AI training material.
Yuanxi Fu, an information scientist at the University of Illinois Urbana-Champaign, stresses that public-facing AI tools must treat retraction as a key quality indicator. With the US National Science Foundation and other agencies funding AI for science, ensuring the accuracy of information has become an urgent issue for the field. Aaron Tay, a librarian at Singapore Management University, advises users to treat AI output with caution: "We are at the very, very early stages, and essentially you have to be skeptical."
English source:
AI models are using material from retracted scientific papers
Some companies are working to remedy the issue.
Some AI chatbots rely on flawed research from retracted scientific papers to answer questions, according to recent studies. The findings, confirmed by MIT Technology Review, raise questions about how reliable AI tools are at evaluating scientific research and could complicate efforts by countries and industries seeking to invest in AI tools for scientists.
AI search tools and chatbots are already known to fabricate links and references. But answers based on material from actual papers can mislead as well if those papers have been retracted. The chatbot is “using a real paper, real material, to tell you something,” says Weikuan Gu, a medical researcher at the University of Tennessee in Memphis and an author of one of the recent studies. But, he says, if people only look at the content of the answer and do not click through to the paper and see that it’s been retracted, that’s really a problem.
Gu and his team asked OpenAI’s ChatGPT, running on the GPT-4o model, questions based on information from 21 retracted papers on medical imaging. The chatbot’s answers referenced retracted papers in five cases but advised caution in only three. While it cited non-retracted papers for other questions, the authors note it may not have recognized the retraction status of the articles. In a study from August, a different group of researchers used ChatGPT-4o mini to evaluate the quality of 217 retracted and low-quality papers from different scientific fields; they found that none of the chatbot’s responses mentioned retractions or other concerns. (No similar studies have been released on GPT-5, which came out this August.)
The public uses AI chatbots to ask for medical advice and diagnose health conditions. Students and scientists increasingly use science-focused AI tools to review existing scientific literature and summarize papers. That kind of usage is likely to increase. The US National Science Foundation, for instance, invested $75 million in building AI models for science research this August.
“If [a tool is] facing the general public, then using retraction as a kind of quality indicator is very important,” says Yuanxi Fu, an information science researcher at the University of Illinois Urbana-Champaign. There’s “kind of an agreement that retracted papers have been struck off the record of science,” she says, “and the people who are outside of science—they should be warned that these are retracted papers.” OpenAI did not provide a response to a request for comment about the paper results.
The problem is not limited to ChatGPT. In June, MIT Technology Review tested AI tools specifically advertised for research work, such as Elicit, Ai2 ScholarQA (now part of the Allen Institute for Artificial Intelligence’s Asta tool), Perplexity, and Consensus, using questions based on the 21 retracted papers in Gu’s study. Elicit referenced five of the retracted papers in its answers, while Ai2 ScholarQA referenced 17, Perplexity 11, and Consensus 18—all without noting the retractions.
Some companies have since made moves to correct the issue. “Until recently, we didn’t have great retraction data in our search engine,” says Christian Salem, cofounder of Consensus. His company has now started using retraction data from a combination of sources, including publishers and data aggregators, independent web crawling, and Retraction Watch, which manually curates and maintains a database of retractions. In a test of the same papers in August, Consensus cited only five retracted papers.
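As a rough sketch of the aggregation Salem describes, a pipeline might normalize DOIs from each feed and merge them into a single lookup set. The file names and column headers below are hypothetical placeholders, not Consensus's actual data sources:

```python
# Minimal sketch of merging several retraction feeds into one DOI-keyed
# lookup set. File names and column headers are hypothetical placeholders.
import csv

def normalize_doi(doi: str) -> str:
    """Lowercase and strip URL/prefix variants so feeds agree on keys."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def load_feed(path: str, doi_column: str) -> set[str]:
    """Read one CSV export of retraction records into a set of DOIs."""
    with open(path, newline="", encoding="utf-8") as f:
        return {
            normalize_doi(row[doi_column])
            for row in csv.DictReader(f)
            if row.get(doi_column)
        }

# Union the feeds: publisher notices, crawler hits, and a curated database.
retracted_dois = (
    load_feed("publisher_notices.csv", "doi")
    | load_feed("crawler_hits.csv", "doi")
    | load_feed("curated_retractions.csv", "doi")
)

def is_retracted(doi: str) -> bool:
    return normalize_doi(doi) in retracted_dois
```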
Elicit told MIT Technology Review that it removes retracted papers flagged by the scholarly research catalogue OpenAlex from its database and is “still working on aggregating sources of retractions.” Ai2 told us that its tool does not automatically detect or remove retracted papers currently. Perplexity said that it “[does] not ever claim to be 100% accurate.”
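OpenAlex exposes retraction status through its public API: each Work record carries an is_retracted boolean. A minimal screening step along the lines Elicit describes might look like the sketch below; the flag is only as good as OpenAlex's own data, so it should be treated as one signal rather than ground truth.

```python
# Query OpenAlex's public API for a work's retraction flag. Work records
# carry an `is_retracted` boolean; an unindexed DOI returns HTTP 404.
import requests

def openalex_is_retracted(doi: str) -> bool | None:
    """Return OpenAlex's is_retracted flag, or None if the DOI is not indexed."""
    resp = requests.get(f"https://api.openalex.org/works/doi:{doi}", timeout=10)
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    return bool(resp.json().get("is_retracted", False))

# Example: screen the NumPy paper's DOI (a known, unretracted work).
print(openalex_is_retracted("10.1038/s41586-020-2649-2"))
```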
However, relying on retraction databases may not be enough. Ivan Oransky, the cofounder of Retraction Watch, is careful not to describe it as a comprehensive database, saying that creating one would require more resources than anyone has: “The reason it’s resource intensive is because someone has to do it all by hand if you want it to be accurate.”
Further complicating the matter is that publishers don’t share a uniform approach to retraction notices. “Where things are retracted, they can be marked as such in very different ways,” says Caitlin Bakker from University of Regina, Canada, an expert in research and discovery tools. “Correction,” “expression of concern,” “erratum,” and “retracted” are among some labels publishers may add to research papers—and these labels can be added for many reasons, including concerns about the content, methodology, and data or the presence of conflicts of interest.
Some researchers distribute their papers on preprint servers, paper repositories, and other websites, causing copies to be scattered around the web. Moreover, the data used to train AI models may not be up to date. If a paper is retracted after the model’s training cutoff date, its responses might not instantaneously reflect what's going on, says Fu. Most academic search engines don’t do a real-time check against retraction data, so you are at the mercy of how accurate their corpus is, says Aaron Tay, a librarian at Singapore Management University.
Oransky and other experts advocate making more context available for models to use when creating a response. This could mean publishing information that already exists, like peer reviews commissioned by journals and critiques from the review site PubPeer, alongside the published paper.
Many publishers, such as Nature and the BMJ, publish retraction notices as separate articles linked to the paper, outside paywalls. Fu says companies need to effectively make use of such information, as well as any news articles in a model’s training data that mention a paper’s retraction.
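At answer time, a retrieval-augmented tool could attach that context directly to the papers it cites. The sketch below is purely illustrative, not any vendor's pipeline; the NOTICES store and its sample entry are hypothetical stand-ins for whatever retraction data a tool has aggregated. Each retrieved paper is formatted for the prompt with any retraction notice inlined so the model can caveat its answer.

```python
# Illustrative only: inline a paper's retraction notice into the prompt
# context so the model can warn the user. NOTICES stands in for whatever
# retraction data a real tool has aggregated; the entry is hypothetical.
NOTICES = {
    "10.1234/example": "Retracted 2024: image duplication in Figs. 2 and 3.",
}

def build_context(papers: list[dict]) -> str:
    """Format retrieved papers for the prompt, flagging retracted ones."""
    blocks = []
    for p in papers:
        notice = NOTICES.get(p["doi"])
        status = f"RETRACTED. Notice: {notice}" if notice else "No retraction on record."
        blocks.append(
            f"Title: {p['title']}\nDOI: {p['doi']}\nStatus: {status}\nAbstract: {p['abstract']}"
        )
    return "\n\n".join(blocks)

print(build_context([{
    "doi": "10.1234/example",
    "title": "A hypothetical retracted study",
    "abstract": "...",
}]))
```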
The users and creators of AI tools need to do their due diligence. “We are at the very, very early stages, and essentially you have to be skeptical,” says Tay.
Ananya is a freelance science and technology journalist based in Bengaluru, India.