AfriMed-QA: Benchmarking Large Language Models for Global Health

Posted by qimuai · firsthand compilation

Source: https://research.google/blog/afrimed-qa-benchmarking-large-language-models-for-global-health/

Summary:

First pan-African medical question-answering benchmark released, advancing equitable large language models for global health

A team led by Mercy Asiedu, Senior Research Scientist at Google Research, together with universities and organizations across Africa, has released AfriMed-QA, a benchmark dataset for large language models. The dataset is designed to evaluate and improve LLM performance on medical question-answering tasks in African healthcare, addressing current models' shortcomings when confronted with regional differences in disease distribution, diverse cultural contexts, and language variation.

Large language models have shown promise in medical question answering, clinical decision support, and multilingual health training, and could be especially valuable in regions where healthcare resources are scarce. Existing medical benchmarks, however, are largely built around Western contexts and poorly reflect the distinct disease spectra, clinical practices, and cultural backgrounds of regions such as Africa. The lack of representative, localized benchmark datasets has been a major obstacle to developing and deploying these technologies equitably and effectively.

AfriMed-QA is the first large-scale, multi-specialty pan-African medical question-answering dataset, comprising roughly 15,000 clinical question-answer items from more than 60 medical schools across 16 African countries, spanning multiple question formats as well as consumer health queries. More than 600 medical professionals contributed to its construction, covering 32 specialties including obstetrics and gynecology, neurosurgery, and infectious disease, with data collected and blind-reviewed for quality on a crowdsourcing platform built by partner organizations.

The team used the dataset to systematically evaluate 30 large language models. Results show that larger general-purpose models outperform specialized biomedical models of comparable size, a finding that challenges the strategy of deploying lightweight specialized models in resource-constrained settings. Notably, on consumer queries, answers generated by frontier LLMs were rated higher by human evaluators than clinician-provided answers on completeness, relevance, and local relevance.

To promote transparency and community collaboration, all of the project's datasets and evaluation code have been open-sourced, and a public performance leaderboard has been established. The team is now working with partners to extend the dataset to indigenous African languages and multimodal content, to better serve diverse local needs.

The work was published at ACL 2025, where it won the Best Social Impact Paper Award. The release of AfriMed-QA is seen as laying an important foundation for building locally grounded, culturally sensitive, and reliable medical AI tools across the Global South, and the authors call on more research institutions and health organizations to join in building and refining such localized datasets.

Full text:

AfriMed-QA: Benchmarking large language models for global health
September 24, 2025
Mercy Asiedu, Senior Research Scientist, Google Research

We present AfriMed-QA, a collection of contextually relevant datasets for evaluating LLMs on African health question-answering tasks, developed in partnership with organizations across Africa.

Large language models (LLMs) have shown potential for medical and health question answering across various health-related tests spanning different formats and sources, such as multiple-choice and short-answer exam questions (e.g., USMLE MedQA), summarization, and clinical note taking, among others. Especially in low-resource settings, LLMs can potentially serve as valuable decision-support tools, enhancing clinical diagnostic accuracy and accessibility, and providing multilingual clinical decision support and health training, all of which are especially valuable at the community level.

Despite their success on existing medical benchmarks, there is uncertainty about whether these models generalize to tasks involving distribution shifts in disease types, contextual differences across symptoms, or variations in language and linguistics, even within English. Further, localized cultural context and region-specific medical knowledge are important for models deployed outside traditional Western settings. Yet without diverse benchmark datasets that reflect the breadth of real-world contexts, it is impossible to train or evaluate models for these settings, highlighting the need for more diverse benchmark datasets.

To address this gap, we present AfriMed-QA, a benchmark question-answer dataset that brings together consumer-style questions and medical school-type exam questions from 60 medical schools across 16 countries in Africa. We developed the dataset in collaboration with numerous partners, including Intron Health, Sisonkebiotik, the University of Cape Coast, the Federation of African Medical Students Association, and BioRAMP, which collectively form the AfriMed-QA consortium, with support from PATH/The Gates Foundation. We evaluated LLM responses on these datasets, comparing them to answers provided by human experts and rating the responses according to human preference. The methods used in this project can be scaled to other locales where digitized benchmarks may not currently be available.

AfriMed-QA was published at ACL 2025, where it won the Best Social Impact Paper Award. The dataset was recently leveraged to assist in the training of MedGemma, our latest open model for multimodal medical text and image comprehension. The AfriMed-QA benchmark datasets and LLM evaluation code are open-sourced and available for use by the community.
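For readers who want to inspect the released data locally, a minimal sketch using the Hugging Face `datasets` library follows. Note that the repository id and split name below are assumptions for illustration, not confirmed by the post; consult the project's open-source release for the authoritative location.

```python
# Minimal sketch: load AfriMed-QA for local inspection.
# NOTE: the repo id and split are assumptions, not confirmed by the post;
# check the project's release page for the real identifiers.
from datasets import load_dataset

ds = load_dataset("intronhealth/afrimedqa_v2", split="train")  # assumed id/split
print(ds.column_names)  # inspect the available fields
print(ds[0])            # look at one question-answer record
```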

The AfriMed-QA dataset

The AfriMed-QA dataset is the first large-scale pan-African multi-specialty medical question-answer dataset designed to evaluate and develop equitable and effective LLMs for African healthcare. The dataset comprises ~15,000 clinically diverse questions and answers in English: 4,000+ expert multiple-choice questions (MCQs) with answers, over 1,200 open-ended short-answer questions (SAQs) with long-form answers, and 10,000 consumer queries (CQs). It is designed to rigorously assess LLM performance for correctness and under geographical shifts. The data were crowdsourced from 621 contributors from over 60 medical schools across 12 countries, covering 32 medical specialties, including obstetrics and gynecology, neurosurgery, internal medicine, emergency medicine, medical genetics, infectious disease, and others.

To collect these data, we adapted a web-based platform previously developed by Intron Health for crowdsourcing accented and multilingual clinical speech data at scale across Africa. We built custom user interfaces for collecting each question type, for quality review, and for blinded human evaluation of LLM responses.

Evaluation of LLM responses

Using quantitative and qualitative approaches, we evaluated 30 general-purpose and biomedical LLMs, ranging in size from small to large; some were open and others closed. For MCQs, we measured accuracy by comparing each LLM's single-letter answer choice against the reference. For SAQs, we measured semantic similarity and sentence-level overlap between the model's generated response and a reference answer.
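As a concrete illustration of these two scoring modes, here is a small sketch (not the project's released evaluation code): MCQs scored by exact match on the single-letter choice, and SAQs scored with a simple token-overlap F1 standing in for the semantic-similarity measures used in the paper.

```python
# Illustrative sketch of the two metric families described above.
from collections import Counter

def mcq_accuracy(preds: list[str], refs: list[str]) -> float:
    """Fraction of questions where the model's letter choice matches the key."""
    correct = sum(p.strip().upper()[:1] == r.strip().upper()[:1]
                  for p, r in zip(preds, refs))
    return correct / len(refs)

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between a generated short answer and the reference."""
    p, r = Counter(pred.lower().split()), Counter(ref.lower().split())
    overlap = sum((p & r).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(mcq_accuracy(["B", "c", "A"], ["B", "C", "D"]))  # -> 0.666...
print(token_f1("malaria is caused by plasmodium parasites",
               "plasmodium parasites cause malaria"))
```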

We found that larger models perform more accurately than smaller models on AfriMed-QA at baseline. This trend may be unfavorable for low-resource settings, where on-device or edge deployments with smaller, specialized models are preferred.

We also found that general-purpose models outperform and generalize better than biomedical models of similar size. This counterintuitive result could be due to the parameter-size limitations of the open biomedical models in our study, or it could indicate that specialized LLMs overfit to the specific biases and nuances of the data on which they are fine-tuned. In either case, they appear less adaptable to the unique characteristics of the AfriMed-QA dataset.

Human rating of LLM responses

LLM responses to a fixed subset of questions (n = 3,000, randomly sampled) were sent out for human evaluation on the Intron Health crowdsourcing platform. Adapting the evaluation axes described in our MedLM paper, which include measures of inaccuracy, omission of information, evidence of demographic bias, and extent of harm, we collected human evaluations in two categories.

Ratings used a 5-point scale (1 for "not at all" and 5 for "completely"). Raters were blinded to whether an answer came from a model or a human, and rated the answers from multiple LLMs in randomized order.
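To make the blinding concrete, the toy sketch below (the question ids, source names, and answer text are invented for illustration) shuffles answers from several sources per question and strips provenance before they reach raters, keeping the key aside for later unblinding and analysis.

```python
# Toy sketch of a blinded, order-randomized rating setup.
import random

answers = {
    "q1": {"clinician": "Use oral rehydration salts and monitor...",
           "model_a": "Drink ORS and seek care if symptoms worsen...",
           "model_b": "Rest and fluids; visit a clinic if fever persists..."},
}

blinded, key = {}, {}
for qid, by_source in answers.items():
    items = list(by_source.items())
    random.shuffle(items)                       # randomized presentation order
    blinded[qid] = [text for _, text in items]  # raters see only the text
    key[qid] = [src for src, _ in items]        # provenance kept for analysis

print(blinded["q1"])
```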

The results show that for consumer queries, answers from frontier LLMs consistently outperformed clinician-provided answers on completeness, informativeness, and relevance, with fewer hallucinations or omissions; clinician answers were rated worse on omission of information.

An open leaderboard for model comparison

We developed a visual leaderboard that lets users compare the performance of existing models or submit results for their own models.

Toward multilingual and multimodal datasets

Because medicine is inherently multilingual and multimodal, we are working with the AfriMed-QA consortium, led by Professor Stephen Moore at the University of Cape Coast, to expand the dataset from text-only English question answering to official and indigenous African languages, and to progressively incorporate multimodal question-answer data such as images and audio.

Limitations

Although this is the first large-scale, multi-specialty, locally sourced pan-African dataset, its coverage is still uneven (for example, over 50% of the expert MCQs come from Nigeria). We are actively expanding representation across more regions of Africa and the Global South. This work lays a foundation for collecting diverse health benchmark data in countries that lack digitized benchmark datasets.

LLMs for geographically diverse health question answering

Given the sensitivity of health-related outcomes, it is essential that LLM outputs be accurate, contextually relevant, and culturally appropriate. Across settings, LLMs must adapt to distribution shifts along many dimensions: disease prevalence, cultural context, resource availability, drug naming, differences in screening and treatment guidelines, health technology infrastructure, affordability, types of care, and sensitive attributes. While our current evaluations are limited, we call on research and health organizations worldwide to pursue this work in depth, through collaboration and local investment, to build datasets for evaluating and adapting LLMs to local contexts.

Acknowledgements

We would like to thank the exceptional AfriMed-QA consortium and our co-authors: Tobi Olatunji, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, Folafunmi Omofoye, Foutse Yuehgoh, Timothy Faniran, Bonaventure F. P. Dossou, Moshood Yekini, Jonas Kemp, Katherine Heller, Jude Chidubem Omeke, Chidi Asuzu, Naome A. Etori, Aimérou Ndiaye, Ifeoma Okoh, Evans Doe Ocansey, Wendy Kinara, Michael Best, Irfan Essa, Stephen Edward Moore, Chris Fourie. We also thank Bilal Mateen, Melissa Miles, Mira Emmanuel-Fabula, and Celeste Gonda of the Gates Foundation/PATH Digital Square for their support of this project, and all of our data contributors. Special thanks to Marian Croak for her leadership and support.

