ATLAS: Practical scaling laws for multilingual models

Source: https://research.google/blog/atlas-practical-scaling-laws-for-multilingual-models/
Summary:
Google publishes ATLAS, scaling laws for multilingual models, as a data-driven guide for non-English AI development
On January 27, 2026, Shayne Longpre, a Google Cloud Student Researcher, and Sayna Ebrahimi, a Research Scientist at Google DeepMind, released a study of scaling laws for multilingual models named ATLAS. The work targets a pronounced imbalance in today's AI landscape: although more than 50% of AI model users speak non-English languages, publicly available scaling-law research is overwhelmingly focused on English.
The study, to be presented at ICLR 2026, is based on the largest public multilingual pre-training analysis to date: 774 training runs spanning models from 10M to 8B parameters, data covering more than 400 languages, and evaluations in 48 languages. From these runs the team estimated the synergies between 1,400 pairs of languages and proposed the Adaptive Transfer Scaling Laws (ATLAS), which give developers practical guidance on balancing the language mix in training data against model size when building multilingual models.
ATLAS's core advance: principled planning from monolingual to multilingual
Unlike traditional scaling laws, which focus on monolingual settings, ATLAS is a simple, practical framework for determining the optimal model size, data volume, and language mixture for multilingual training. It achieves this through three components:
- A cross-lingual transfer matrix that identifies which language combinations train best together.
- A multilingual scaling law that guides how to efficiently grow model size and data volume as the number of supported languages increases.
- Training-strategy rules that help decide when to pre-train a model from scratch and when to fine-tune from an existing multilingual checkpoint.
ATLAS was derived from hundreds of multilingual experiments (more than 750 training runs on the MADLAD-400 corpus, covering 400+ languages). Its innovation is quantifying how much each of three data sources actually helps or hinders a target language: data in the target language itself, data in similar transfer languages (e.g., Catalan benefits from Latin-family languages such as Spanish and Portuguese), and data in all other languages.
Key findings and practical insights
- Decoding the "curse of multilinguality": the team quantified the well-known phenomenon in which a capacity-limited model loses performance as it supports more languages. The fitted law shows that adding languages carries only a mild performance tax, while positive cross-lingual transfer is strong. Concretely, to support twice as many languages (2·K), the recommendation is to scale model size by about 1.18x and total data by about 1.66x. Each language then gets about 83% of the data it had before, but the synergies of multilingual learning offset the capacity-driven degradation.
- Mapping language synergies: the study produced a large-scale matrix of cross-language synergy and interference. The strongest predictor of positive transfer is a shared script or language family (e.g., Latin script). For example, Norwegian benefits mainly from Swedish and German, and Malay from Indonesian. English, French, and Spanish are the most broadly helpful languages to train with, likely owing to the quality, diversity, and quantity of their text on the web.
- A clear threshold for pre-training vs. fine-tuning: for ten languages, the study compared two paths, pre-training from scratch and fine-tuning from a strong multilingual "Unimax" checkpoint. With limited compute, fine-tuning an existing checkpoint reaches better performance sooner; with enough compute for much longer training, pre-training from scratch eventually wins. For a 2B-parameter model, the crossover between the two paths typically falls between roughly 144B and 283B training tokens, depending on the language. This gives developers a clear, budget-aware decision rule.
What this means for developers
ATLAS offers model developers worldwide a roadmap beyond English-centric scaling. Its guidance can be applied directly:
- Planning a new model: consult the figures and tables in the paper for an initial view of the scaling behavior implied by vocabulary and training choices.
- Choosing a training language mix: consult the cross-lingual transfer matrix to pick source languages that empirically help the target language, especially those sharing a script or language family.
- Supporting more languages: see the relevant section of the paper for how to efficiently scale model and data size to counter the "curse of multilinguality".
- Compute-constrained: follow the guidance on whether to fine-tune an existing multilingual model or pre-train from scratch.
The work should help enable a new generation of multilingual models that better serve billions of non-English speakers worldwide.
English source:
ATLAS: Practical scaling laws for multilingual models
January 27, 2026
Shayne Longpre, Google Cloud Student Researcher, and Sayna Ebrahimi, Research Scientist, Google DeepMind
We introduce new scaling laws for massively multilingual language models. ATLAS provides guidance on how to mix data and train the most effective models to serve languages beyond English.
Over 50% of AI model users speak non-English languages, yet publicly accessible scaling laws are overwhelmingly focused on the English language. This imbalance creates a critical gap in public research, leaving model builders, tasked with serving billions of international and multilingual users, without data-driven guidance for key development decisions about efficiency, quality, and cost when building for non-English languages or with specific language mixtures.
In “ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality”, to be presented at ICLR 2026, we aim to address this gap. We present the largest public multilingual pre-training study to date, spanning 774 training runs across 10M–8B parameter models. It includes data spanning 400+ languages and evaluations in 48 languages. As a result of this study, we estimate the synergies between 1,400 pairs of languages, and introduce adaptive transfer scaling laws (ATLAS) for building multilingual models that enable practitioners to efficiently balance the mix of languages in training data with model size.
ATLAS: A single scaling law that adapts to multilingual mixtures
ATLAS is a simple, practical approach to determining optimal model size, data volume, and language mixtures for training. Unlike traditional scaling laws that focus on monolingual settings, ATLAS provides these recommendations for more complex, multilingual environments. It specifically optimizes performance on a target language (e.g., Catalan) by leveraging data from multiple different languages. ATLAS extends these traditional scaling law principles through three components:
- A cross-lingual transfer matrix used to identify which languages are best to train together
- A scaling law that provides guidance on efficiently expanding model size and data as the number of supported languages increases
- Rules for deciding when to pre-train a model from scratch versus fine-tuning from a multilingual checkpoint
ATLAS accomplishes this by training on hundreds of multilingual experiments (using the MADLAD-400 corpus with over 750 runs across 400+ languages) and accounting for three distinct data sources: 1) the target language, 2) similar transfer languages according to empirical analysis (e.g., Catalan might include Latin languages like Spanish, Portuguese, and Italian), and 3) all other languages. This novel approach enables the law to learn how much each source actually helps or hinders the target language, a capability prior laws did not support.
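The post does not spell out ATLAS's functional form, so the following is only a minimal sketch of the general idea: a parametric loss with separate terms for target-language, transfer-language, and other-language tokens, fit to a collection of training runs. The function `loss`, the discount weights `w_transfer` and `w_other`, and the synthetic runs are all illustrative assumptions, not the paper's actual law.
```python
# Illustrative only: the exact ATLAS functional form is not given in the post.
# This sketch fits a Chinchilla-style parametric loss in which transfer-language
# and other-language tokens contribute to an "effective" data term at a
# discounted rate. Everything here (form, names, synthetic runs) is an assumption.
import numpy as np
from scipy.optimize import curve_fit

def loss(X, E, A, alpha, B, beta, w_transfer, w_other):
    """Hypothetical multilingual loss: irreducible + capacity + data terms."""
    N, d_target, d_transfer, d_other = X
    d_eff = d_target + w_transfer * d_transfer + w_other * d_other
    return E + A / N**alpha + B / d_eff**beta

# Synthetic "runs" drawn from assumed parameters, only so the fit is runnable;
# in practice these would be losses measured in your own training runs.
rng = np.random.default_rng(0)
N = rng.uniform(1e7, 8e9, 200)          # model sizes (parameters)
d_t = rng.uniform(1e8, 1e11, 200)       # target-language tokens
d_tr = rng.uniform(1e8, 1e11, 200)      # transfer-language tokens
d_o = rng.uniform(1e8, 1e11, 200)       # all-other-language tokens
y = loss((N, d_t, d_tr, d_o), 1.7, 400.0, 0.34, 4e3, 0.29, 0.3, 0.05)
y += rng.normal(0.0, 0.01, 200)

popt, _ = curve_fit(loss, (N, d_t, d_tr, d_o), y,
                    p0=[2.0, 100.0, 0.3, 1e3, 0.3, 0.5, 0.1], maxfev=20000)
print(dict(zip(["E", "A", "alpha", "B", "beta", "w_transfer", "w_other"], popt)))
```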
Evaluation
We used the MADLAD-400 dataset to evaluate how well ATLAS predicts a model’s performance on new model sizes, varying amounts of training data, or new language mixtures. To do this, we measure performance using a vocabulary-insensitive loss across over 750 independent runs in monolingual, bilingual, and massively multilingual settings. Our evaluations show that ATLAS consistently outperforms prior work.
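The post does not define the vocabulary-insensitive loss it uses. One standard metric with that property is bits-per-byte, which renormalizes summed token log-loss by the UTF-8 byte count of the evaluated text so that models with different tokenizers remain comparable; the sketch below assumes that choice.
```python
# Sketch of one common vocabulary-insensitive metric (bits-per-byte); the paper
# may use a different normalization. Dividing by UTF-8 bytes rather than by
# token count removes the dependence on any particular tokenizer/vocabulary.
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed natural-log token loss over `text` into bits per byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (n_bytes * math.log(2))

# Example: suppose a model assigns a summed loss of 27.6 nats to this sentence.
sample = "Els models multilingües també han de servir el català."
print(round(bits_per_byte(27.6, sample), 3))
```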
For six languages — English (EN), French (FR), Russian (RU), Chinese (ZH), Hindi (HI), and Swahili (SW) — we analyzed how ATLAS predicted the optimal model size (N) and data size (D) should be scaled. When we compared these optimal scaling trajectories across languages, we made two observations. The curves look strikingly similar, but training with a multilingual vocabulary or fully multilingual data comes with a compute-efficiency tax — especially for English. Low-resource languages show upward bends as they run out of data, and the model struggles to learn from data repetition. ATLAS explicitly models these effects.
The cross-lingual transfer map
Next, we measured language-to-language synergies and interference at scale, producing a matrix that quantifies how much training on language A helps (or hurts) language B. The results are very intuitive: Norwegian is helped primarily by Swedish and German, Malay by Indonesian, and Arabic by Hebrew. English, French, and Spanish are the most widely helpful languages with which to train, likely due to the inherent quality, heterogeneity, and quantity of text in these languages found on the web.
The analysis shows that the biggest predictor of positive transfer is sharing a script and/or language family (e.g., Latin script), statistically significant with p < .001. English helps many, but not all, languages; and transfer isn’t always symmetric (A can help B more than B helps A). These measurements turn “hunches” into data-driven language mix choices.
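As a concrete illustration of how a developer might consume such a matrix when assembling a training mix, here is a tiny sketch. The scores are made-up placeholders (the measured values live in the paper's transfer matrix); only the qualitative ordering mirrors the findings above.
```python
# Hypothetical lookup over a cross-lingual transfer matrix. Scores are placeholders;
# transfer[src][tgt] is meant to read "how much training on src helps tgt".
transfer = {
    "sv": {"no": 0.42, "ms": 0.01},   # Swedish
    "de": {"no": 0.31, "ms": 0.02},   # German
    "id": {"no": 0.02, "ms": 0.55},   # Indonesian
    "en": {"no": 0.20, "ms": 0.18},   # English
}

def rank_sources(target: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Return the source languages that most help `target`, best first."""
    scores = [(src, helps[target]) for src, helps in transfer.items() if target in helps]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:top_k]

print(rank_sources("no"))  # Norwegian: Swedish and German rank ahead of English
print(rank_sources("ms"))  # Malay: Indonesian dominates
```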
Decoding the “curse of multilinguality” with clear scaling rules
The “curse of multilinguality” is a phenomenon where models trained on multiple languages see a decrease in performance with each new language due to limited model capacity. We formalize this problem with a scaling law that considers not just model size (N), and quantity of training data (D), but the number of languages in that data (K). Fitting this law to many experiments, we found that while adding languages brings a mild capacity tax, there is a high-degree of positive transfer. This means if we want to train a model to support twice as many languages (2·K) then we should increase model size by 1.18x, and total data by 1.66x. This equates to 83% of data in each of the 2K languages. Although there is less data per language, the positive synergies from learning on all of them means the capacity constraints that cause degradation to the performance are offset.
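To make the arithmetic in that rule explicit: scaling total data by 1.66x while doubling the language count leaves each language with 1.66 / 2 ≈ 0.83 of its previous token share. A minimal sketch applying the post's multipliers follows; the model size and token counts fed in below are arbitrary example inputs.
```python
# Applies the rule of thumb stated above: to support twice as many languages,
# grow model size ~1.18x and total data ~1.66x; per-language data then drops
# to 1.66 / 2 = 0.83 of its previous share. Inputs are arbitrary examples.
def scale_for_doubling_languages(n_params: float, total_tokens: float, n_langs: int):
    new_params = 1.18 * n_params
    new_tokens = 1.66 * total_tokens
    per_lang_ratio = (new_tokens / (2 * n_langs)) / (total_tokens / n_langs)
    return new_params, new_tokens, per_lang_ratio

print(scale_for_doubling_languages(n_params=2e9, total_tokens=4e11, n_langs=50))
# -> roughly 2.36e9 params, 6.64e11 tokens, 0.83 per-language data ratio
```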
When to pre-train vs. fine-tune a multilingual checkpoint
For ten languages, we compare two paths to get the best performing model: (a) pre-train from scratch on the target language or (b) fine-tune from a strong multilingual “Unimax” checkpoint. Option (b) is likely to have the best performance with minimal additional compute, as the model is already pretty strong across languages. However, if the model can be trained for much longer, then option (a) can often yield better long-term results. Our goal is to find the crossover point between the two training curves, based on how much compute the model builder has to spend.
Our results show that fine-tuning wins early, but pre-training overtakes once you can afford enough tokens. In our runs, the crossover typically occurs between ~144B and 283B tokens (language-dependent) for models with 2B parameters. Next, we plotted the crossover point as a function of model size. This gives a concrete, budget-aware rule of thumb: if your token and compute budget is below the crossover point for your model size, start from a multilingual checkpoint; otherwise, pre-training from scratch will usually finish ahead. Note that exact thresholds depend on the base model and mixture.
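The rule of thumb above can be phrased as a small decision helper. The crossover band used as the default below (~144B to 283B tokens) is what the post reports for 2B-parameter models; other model sizes and data mixtures shift the threshold, so treat these bounds as illustrative rather than general.
```python
# Budget-aware sketch of the pre-train vs. fine-tune rule of thumb. The default
# crossover band is the ~144B-283B token range reported for 2B-parameter models;
# other sizes and mixtures shift it, so these bounds are illustrative only.
def recommend_strategy(token_budget: float,
                       crossover_low: float = 144e9,
                       crossover_high: float = 283e9) -> str:
    if token_budget < crossover_low:
        return "fine-tune a strong multilingual checkpoint"
    if token_budget > crossover_high:
        return "pre-train from scratch"
    return "inside the reported crossover band: either can win, measure both if possible"

print(recommend_strategy(5e10))   # 50B tokens: fine-tune
print(recommend_strategy(4e11))   # 400B tokens: pre-train from scratch
```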
Try it yourself
By moving beyond English-centric scaling, ATLAS provides a roadmap for global model developers. It can be directly applied to scale language models beyond English by helping developers:
- Planning to train a new multilingual or non-English model? Use Figure 1 or Table C.1 from the paper to get a sense of the potential scaling laws based on vocabulary or training choices.
- Choosing a new training mix? Consult the transfer matrix (Figure 2) to pick source languages that empirically help your targets — especially those sharing the same script/family.
- Training a new model with more languages? Consult Section 5 to determine how to most efficiently expand your model size and data size to mitigate the effects of the curse of multilinguality.
- Compute-constrained? Consult Section 6 to decide if you should fine-tune a multilingual model or pre-train from scratch.
We hope this work enables a new generation of multilingual models, serving billions of non-English speakers.
Acknowledgements
We thank Luke Zettlemoyer, Catherine Arnett and Stella Biderman for helpful discussions on the paper. We thank Biao Zhang and Xavier Garcia for the technical discussions and feedback on early directions.