"谷歌医生"问题不少,ChatGPT健康咨询能否更胜一筹?

Summary:
OpenAI has officially launched ChatGPT Health, a product that marks another step for artificial intelligence into medical information seeking. For the past two decades, searching the web for symptoms has been the default way people look up medical information, earning the practice the pejorative nickname "Dr. Google." With the spread of large language models, however, some 230 million users now ask ChatGPT health-related questions each week, according to OpenAI.
The new product is not a standalone model. It layers health-specific tools onto OpenAI's existing models and, with the user's permission, can read electronic medical records and fitness data. OpenAI stresses that it is meant as additional support, not a replacement for professional medical diagnosis. Even so, when doctors are stretched thin or hard to reach, many users turn to AI for help.
Some medical experts believe large language models can improve public medical literacy. Marc Succi, an associate professor at Harvard Medical School, notes that while web searches often leave patients anxious and awash in misinformation, a well-trained model can give more systematic, professional answers, and ordinary users now ask questions approaching the level of an early medical student.
Risks remain, however. Research shows that LLMs can hallucinate or simply tell users what they want to hear in health conversations. The recently reported death of a teenager who overdosed last year after consulting ChatGPT about how to combine drugs underscores the need for oversight. Reeva Lederman, a University of Melbourne researcher who studies technology and health, warns that patients unhappy with a doctor's diagnosis may turn to a sycophantic AI, which could encourage them to reject professional medical advice.
To gauge reliability, researchers have run a number of tests. A Pennsylvania State University study found that GPT-4o answered realistic health questions correctly about 85% of the time, close to the 85%-90% diagnostic accuracy of human doctors. A comparison with Google's knowledge panels found that LLM answers to questions about chronic conditions were often more comprehensive. Experts caution, though, that these tests mostly involve short, simple exchanges, and how the models handle more complex cases remains to be seen.
Notably, OpenAI says its new GPT-5 series is markedly less prone to hallucination and sycophancy. The company evaluates its models with the HealthBench benchmark, which focuses on whether they express uncertainty appropriately, recommend seeing a doctor when warranted, and avoid exaggerating conditions in ways that cause unnecessary anxiety.
Despite the clear technical progress, privacy remains a concern. Several experts warn that opening medical records to an AI carries privacy risks. A deeper worry is that even higher-quality AI answers could lead people to over-rely on online advice and delay actually seeing a doctor.
Just as the safety of self-driving cars is judged against human accident rates, the value of "Dr. ChatGPT" should be judged against the medical-information environment of the "Dr. Google" era. Early evidence suggests AI has the potential to reduce online medical misinformation and ease health anxiety, but that by no means makes it a substitute for face-to-face care from a clinician. As AI moves rapidly into health care, balancing innovation against risk will be a defining question for the field.
English source:
“Dr. Google” had its issues. Can ChatGPT Health do better?
OpenAI’s newest product is no replacement for a doctor. But it might be better than searching the web for your symptoms.
For the past two decades, there’s been a clear first step for anyone who starts experiencing new medical symptoms: Look them up online. The practice was so common that it gained the pejorative moniker “Dr. Google.” But times are changing, and many medical-information seekers are now using LLMs. According to OpenAI, 230 million people ask ChatGPT health-related queries each week.
That’s the context around the launch of OpenAI’s new ChatGPT Health product, which debuted earlier this month. It landed at an inauspicious time: Two days earlier, the news website SFGate had broken the story of Sam Nelson, a teenager who died of an overdose last year after extensive conversations with ChatGPT about how best to combine various drugs. In the wake of both pieces of news, multiple journalists questioned the wisdom of relying for medical advice on a tool that could cause such extreme harm.
Though ChatGPT Health lives in a separate sidebar tab from the rest of ChatGPT, it isn’t a new model. It’s more like a wrapper that provides one of OpenAI’s preexisting models with guidance and tools it can use to provide health advice—including some that allow it to access a user’s electronic medical records and fitness app data, if granted permission. There’s no doubt that ChatGPT and other large language models can make medical mistakes, and OpenAI emphasizes that ChatGPT Health is intended as an additional support, rather than a replacement for one’s doctor. But when doctors are unavailable or unable to help, people will turn to alternatives.
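To make the "wrapper" idea concrete, here is a minimal sketch, in Python, of the general pattern the article describes: health-specific guidance layered onto an existing model, plus data access that runs only when the user grants permission. It is an illustration only, not OpenAI's actual implementation; every name in it (HEALTH_GUIDANCE, fetch_medical_records, fetch_fitness_data, health_chat, call_model) is hypothetical.

```python
# Minimal sketch of a "wrapper" around a preexisting LLM: health-specific
# guidance plus permission-gated access to personal data. All names are
# hypothetical; this is not OpenAI's implementation of ChatGPT Health.

HEALTH_GUIDANCE = (
    "You are a health information assistant. Express uncertainty when "
    "appropriate, suggest seeing a clinician when necessary, and make clear "
    "you are not a replacement for a doctor."
)

def fetch_medical_records(user_id: str) -> str:
    # Placeholder: would return a summary of records the user chose to share.
    return f"EMR summary for {user_id}"

def fetch_fitness_data(user_id: str) -> str:
    # Placeholder: would return data from a connected fitness app.
    return f"Fitness summary for {user_id}"

def call_model(prompt: str) -> str:
    # Stand-in for a call to any preexisting LLM API.
    return f"(model response to: {prompt[:60]}...)"

def health_chat(question: str, user_id: str, *, share_records: bool, share_fitness: bool) -> str:
    """Assemble context for the underlying model; data tools run only with consent."""
    context = [HEALTH_GUIDANCE]
    if share_records:   # user-granted permission for medical records
        context.append(fetch_medical_records(user_id))
    if share_fitness:   # user-granted permission for fitness data
        context.append(fetch_fitness_data(user_id))
    prompt = "\n\n".join(context) + "\n\nUser question: " + question
    return call_model(prompt)

if __name__ == "__main__":
    print(health_chat("Is this headache worth worrying about?", "user-123",
                      share_records=False, share_fitness=True))
```

The point of the design, as described above, is that the underlying model stays unchanged; the wrapper only controls what instructions and personal context reach it.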
Some doctors see LLMs as a boon for medical literacy. The average patient might struggle to navigate the vast landscape of online medical information—and, in particular, to distinguish high-quality sources from polished but factually dubious websites—but LLMs can do that job for them, at least in theory. Treating patients who had searched for their symptoms on Google required “a lot of attacking patient anxiety [and] reducing misinformation,” says Marc Succi, an associate professor at Harvard Medical School and a practicing radiologist. But now, he says, “you see patients with a college education, a high school education, asking questions at the level of something an early med student might ask.”
The release of ChatGPT Health, and Anthropic’s subsequent announcement of new health integrations for Claude, indicate that the AI giants are increasingly willing to acknowledge and encourage health-related uses of their models. Such uses certainly come with risks, given LLMs’ well-documented tendencies to agree with users and make up information rather than admit ignorance.
But those risks also have to be weighed against potential benefits. There’s an analogy here to autonomous vehicles: When policymakers consider whether to allow Waymo in their city, the key metric is not whether its cars are ever involved in accidents but whether they cause less harm than the status quo of relying on human drivers. If Dr. ChatGPT is an improvement over Dr. Google—and early evidence suggests it may be—it could potentially lessen the enormous burden of medical misinformation and unnecessary health anxiety that the internet has created.
Pinning down the effectiveness of a chatbot such as ChatGPT or Claude for consumer health, however, is tricky. “It’s exceedingly difficult to evaluate an open-ended chatbot,” says Danielle Bitterman, the clinical lead for data science and AI at the Mass General Brigham health-care system. Large language models score well on medical licensing examinations, but those exams use multiple-choice questions that don’t reflect how people use chatbots to look up medical information.
Sirisha Rambhatla, an assistant professor of management science and engineering at the University of Waterloo, attempted to close that gap by evaluating how GPT-4o responded to licensing exam questions when it did not have access to a list of possible answers. Medical experts who evaluated the responses scored only about half of them as entirely correct. But multiple-choice exam questions are designed to be tricky, so that the answer options don't give the correct answer away entirely, and they're still a pretty distant approximation of the sort of thing that a user would type into ChatGPT.
A different study, which tested GPT-4o on more realistic prompts submitted by human volunteers, found that it answered medical questions correctly about 85% of the time. When I spoke with Amulya Yadav, an associate professor at Pennsylvania State University who runs the Responsible AI for Social Emancipation Lab and led the study, he made it clear that he wasn’t personally a fan of patient-facing medical LLMs. But he freely admits that, technically speaking, they seem up to the task—after all, he says, human doctors misdiagnose patients 10% to 15% of the time. “If I look at it dispassionately, it seems that the world is gonna change, whether I like it or not,” he says.
For people seeking medical information online, Yadav says, LLMs do seem to be a better choice than Google. Succi, the radiologist, also concluded that LLMs can be a better alternative to web search when he compared GPT-4’s responses to questions about common chronic medical conditions with the information presented in Google’s knowledge panel, the information box that sometimes appears on the right side of the search results.
Since Yadav’s and Succi’s studies appeared online, in the first half of 2025, OpenAI has released multiple new versions of GPT, and it’s reasonable to expect that GPT-5.2 would perform even better than its predecessors. But the studies do have important limitations: They focus on straightforward, factual questions, and they examine only brief interactions between users and chatbots or web search tools. Some of the weaknesses of LLMs—most notably their sycophancy and tendency to hallucinate—might be more likely to rear their heads in more extensive conversations and with people who are dealing with more complex problems. Reeva Lederman, a professor at the University of Melbourne who studies technology and health, notes that patients who don’t like the diagnosis or treatment recommendations that they receive from a doctor might seek out another opinion from an LLM—and the LLM, if it’s sycophantic, might encourage them to reject their doctor’s advice.
Some studies have found that LLMs will hallucinate and exhibit sycophancy in response to health-related prompts. For example, one study showed that GPT-4 and GPT-4o will happily accept and run with incorrect drug information included in a user’s question. In another, GPT-4o frequently concocted definitions for fake syndromes and lab tests mentioned in the user’s prompt. Given the abundance of medically dubious diagnoses and treatments floating around the internet, these patterns of LLM behavior could contribute to the spread of medical misinformation, particularly if people see LLMs as trustworthy.
OpenAI has reported that the GPT-5 series of models is markedly less sycophantic and less prone to hallucination than its predecessors, so the results of these studies might not apply to ChatGPT Health. The company also evaluated the model that powers ChatGPT Health on its responses to health-specific questions, using its publicly available HealthBench benchmark. HealthBench rewards models that express uncertainty when appropriate, recommend that users seek medical attention when necessary, and refrain from causing users unnecessary stress by telling them their condition is more serious than it truly is. It's reasonable to assume that the model underlying ChatGPT Health exhibited those behaviors in testing, though Bitterman notes that some of the prompts in HealthBench were generated by LLMs, not users, which could limit how well the benchmark translates into the real world.
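As a rough illustration of what rubric-style scoring of this kind looks like, the sketch below checks a model's answer against a few criteria of the sort the benchmark rewards: expressing uncertainty, recommending medical attention, and avoiding alarmist language. The RUBRIC entries, trigger phrases, and weights are invented for this example and are not HealthBench's actual rubric or grading procedure.

```python
# Toy rubric-style scorer in the spirit described above. The criteria,
# phrases, and weights are invented for illustration; they are not
# HealthBench's actual rubric or grading method.

RUBRIC = [
    ("expresses uncertainty", ["may", "might", "not certain", "can't diagnose"], 1.0),
    ("recommends medical attention", ["see a doctor", "seek medical", "emergency"], 1.0),
    ("uses alarmist language", ["definitely cancer", "certainly fatal"], -2.0),  # penalized
]

def score_answer(answer: str) -> float:
    """Add each criterion's weight if any of its trigger phrases appear in the answer."""
    text = answer.lower()
    return sum(weight for _, phrases, weight in RUBRIC
               if any(p in text for p in phrases))

if __name__ == "__main__":
    example = ("This might be a tension headache, but I can't diagnose you. "
               "If it worsens or new symptoms appear, please see a doctor.")
    print(score_answer(example))  # -> 2.0
```

Real benchmarks judge such criteria with far more sophisticated grading than keyword matching; the sketch only shows the shape of the scoring.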
An LLM that avoids alarmism seems like a clear improvement over systems that have people convincing themselves they have cancer after a few minutes of browsing. And as large language models, and the products built around them, continue to develop, whatever advantage Dr. ChatGPT has over Dr. Google will likely grow. The introduction of ChatGPT Health is certainly a move in that direction: By looking through your medical records, ChatGPT can potentially gain far more context about your specific health situation than could be included in any Google search, although numerous experts have cautioned against giving ChatGPT that access for privacy reasons.
Even if ChatGPT Health and other new tools do represent a meaningful improvement over Google searches, they could still conceivably have a negative effect on health overall. Much as automated vehicles, even if they are safer than human-driven cars, might still prove a net negative if they encourage people to use public transit less, LLMs could undermine users’ health if they induce people to rely on the internet instead of human doctors, even if they do increase the quality of health information available online.
Lederman says that this outcome is plausible. In her research, she has found that members of online communities centered on health tend to put their trust in users who express themselves well, regardless of the validity of the information they are sharing. Because ChatGPT communicates like an articulate person, some people might trust it too much, potentially to the exclusion of their doctor. But LLMs are certainly no replacement for a human doctor—at least not yet.