To Understand AI, Watch How It Evolves
Source: https://www.quantamagazine.org/to-understand-ai-watch-how-it-evolves-20250924/
Summary:
[Frontier Perspective] A Harvard researcher proposes a new lens for AI research: to understand artificial intelligence, watch how it "evolves"
Large language models such as ChatGPT have spread into nearly every industry, yet their inner workings remain as opaque as a black box. Naomi Saphra, a research fellow at Harvard University's Kempner Institute, puts it bluntly: "We don't know what makes a language model tick. If we have these models everywhere, we should understand what they're doing."
As an interpretability researcher, Saphra proposes studying AI through the lens of evolutionary biology. Citing the geneticist Theodosius Dobzhansky's famous line that "nothing makes sense in biology except in the light of evolution," she draws the analogy: "Nothing makes sense in AI except in the light of stochastic gradient descent." Stochastic gradient descent is the core algorithm of the training process; through trillions of tiny adjustments, it takes a model from randomly initialized connections to the ability to handle inputs it has never seen.
Unlike most interpretability research, which focuses on fully trained models, Saphra argues for studying the training process itself. "If you don't understand the origins of the model, then you don't understand why anything works," she stresses. Just as biologists understand an organism through its evolutionary history, AI researchers should pay attention to how models change during training.
Why the training process matters
While training neural networks on social media text as an undergraduate, Saphra found that having the model learn well-structured text (such as the Wall Street Journal) too early actually limited its ability to handle messier language later. "The model gets locked into learning only the simple solutions and then struggles to generalize to new inputs." Small differences early in training, she notes, can have a decisive effect on a model's eventual capabilities, yet much current research ignores the random variation between training runs, and that variation is precisely an important window onto how models generalize.
Breaking through an industry research bottleneck
The field's main obstacle is that commercial companies rarely release data about how their proprietary models were trained. "Even with some internal access, researchers can rarely get intermediate checkpoints from the training process, let alone compare multiple training runs." Saphra's team recently demonstrated the value of this dynamic approach by analyzing how models' internal structure relates to their generalization behavior across different training runs: "When models from many random initializations show the same pattern, we can be much more confident that the link is real."
A personal path that shaped an unusual research direction
During her Ph.D., Saphra developed a neurological disease and lost the ability to type, so she had to continue her research by dictating code. "I couldn't race other researchers to the scoop, so I ended up focusing on a topic nobody cared about at the time: the training dynamics of neural language models." That slower pace, she says, kept her out of academic hype cycles and let her dig deeper into fundamental questions.
Rethinking mainstream methods
Most interpretability research today asks how a model works; Saphra is more interested in why it works that way. She points to "neuron selectivity" in image-classification models: some neurons activate strongly only for a specific class (cats, say), yet experiments show that preventing this selectivity actually improves performance. "It may be a vestigial property formed early in training, like a vestigial organ in evolution, and it may even be holding the model back."
She also notes the limits of the popular practice of establishing causality only by intervening on neurons in a fully trained model: "If shutting off a neuron destroys the model's French ability, that does not prove the neuron is dedicated to French; it may simply have disrupted broader interactions within the model." In masked language models, her team found that the formation of a particular internal structure coincided with a sudden jump in the model's grammatical ability, and this kind of dynamic link is better evidence of real causation.
Outlook: making interpretability research itself interpretable
Whatever method one adopts, Saphra stresses, the key is to define terms precisely. "Interpretability research should be interpretable," and that may be a crucial step toward lifting the veil on AI.
Original article:
To Understand AI, Watch How It Evolves
Introduction
These days, large language models such as ChatGPT are omnipresent. Yet their inner workings remain deeply mysterious. To Naomi Saphra, that’s an unsatisfying state of affairs. “We don’t know what makes a language model tick,” she said. “If we have these models everywhere, we should understand what they’re doing.”
Saphra, a research fellow at Harvard University’s Kempner Institute who will start a faculty job at Boston University in 2026, has worked for over a decade in the growing field of interpretability, in which researchers poke around inside language models to uncover the mechanisms that make them work. While many of her fellow interpretability researchers draw inspiration from neuroscience, Saphra favors a different analogy. Interpretability, in her view, should take a cue from evolutionary biology.
“There’s this very famous quote by [the geneticist Theodosius] Dobzhansky: ‘Nothing makes sense in biology except in the light of evolution,’” she said. “Nothing makes sense in AI except in the light of stochastic gradient descent,” a classic algorithm that plays a central role in the training process through which large language models learn to generate coherent text.
Language models are based on neural networks, mathematical structures that process data using connections between artificial “neurons.” The strength of each connection is random at first, but during the training process the connections get tweaked as the model repeatedly attempts to predict the next word in sentences from a vast text dataset. Somehow, through trillions of tiny tweaks, the model develops internal structures that enable it to “generalize,” or respond fluently to unfamiliar inputs.
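To make those "tiny tweaks" concrete, here is a minimal sketch of stochastic gradient descent on a toy next-token predictor. The five-word vocabulary, the bigram parameterization, and the learning rate are invented stand-ins, not anything from the article; the only point is the loop of small, error-driven weight updates starting from random connections.

```python
# Minimal sketch: SGD on a toy bigram next-token model (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
corpus = [0, 1, 2, 3, 0, 4]              # token ids for "the cat sat on the mat"
V = len(vocab)
W = rng.normal(scale=0.1, size=(V, V))   # connection strengths, random at first

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.5
for step in range(2000):
    # pick one (current token, next token) pair at random: the "stochastic" part
    i = rng.integers(len(corpus) - 1)
    x, y = corpus[i], corpus[i + 1]
    probs = softmax(W[x])                # predicted next-token distribution
    grad = probs.copy()
    grad[y] -= 1.0                       # gradient of the cross-entropy loss w.r.t. the logits
    W[x] -= lr * grad                    # one tiny SGD tweak

print("after 'sat', the model predicts:", vocab[int(np.argmax(W[2]))])
```

Each iteration samples one training pair at random, measures how wrong the current prediction is, and nudges the relevant weights slightly; a real language model does the same thing with billions of parameters and trillions of updates.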
Most interpretability research focuses on understanding these structures in language models after the training process. Saphra is a prominent champion of an alternative approach that focuses on the training process itself. Just as biologists must understand an organism’s evolutionary history to fully understand the organism, she argues, interpretability researchers should pay more attention to what happens during training. “If you don’t understand the origins of the model, then you don’t understand why anything works,” she said.
Quanta spoke with Saphra about why it’s hard to understand language models, how an evolutionary perspective can help, and the challenges that shaped her own evolution as a researcher. The interview has been condensed and edited for clarity.
How did you get interested in the training process?
As an undergrad, I started training neural networks on social media text for a research project. I was running into issues due to the text being really informal and having a lot of variation. A natural approach in this situation is to start by training on something more structured, like the Wall Street Journal, and then switch to informal text once the model has learned that structure. But it turns out that having a simple task early in training is poison when you try to scale up.
Because the model gets locked in to only learning simple solutions?
Exactly. The model already wants to learn the easy thing. Your job is to keep it from learning the easy thing right away, so that it doesn’t just start memorizing exceptions. That might make it hard to generalize to new inputs in the future.
So that experience made you appreciate that what happens early on can matter a lot?
Sometimes it matters a lot; sometimes you expect it to matter a lot, and it really doesn’t matter. It made me realize that the training process is way more complicated than we might want it to be. I started digging into that, and I’ve been on that road ever since.
What makes this work difficult?
One of the biggest hurdles is that it’s hard to access the internals of proprietary models. Even the companies that give you some kind of internal access rarely give you access to intermediate checkpoints from the training process. And even rarer with large models is being able to look at more than one training run.
Why does that matter?
Initial conditions are really important. Little things can happen early in training that direct a model very strongly in ways that it can’t recover from. A lot of research acts as though random variation between training runs doesn’t exist. That’s an issue because that variation affects how models generalize, and also because random variation is a really useful tool.
How so?
In one recent paper, we used random variation between different training runs to find correlations between models’ internal structure and their generalization behavior. If structure and behavior are correlated across a bunch of random initializations when you control for everything else, it’s likely that they’re actually linked. You can make a much stronger claim about how models work than you could by just looking at one model at the end of training.
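A rough sketch of that methodology, with invented stand-ins rather than the paper's actual measurements: train the same small model under many random seeds while holding the data fixed, record one internal "structure" statistic and one generalization statistic per run, and check whether the two co-vary across seeds. The toy task, the alignment-based structure metric, and the held-out accuracy below are hypothetical choices made only so the loop runs.

```python
# Sketch: correlate an internal statistic with generalization across random seeds.
import numpy as np

dim = 20
data_rng = np.random.default_rng(0)                  # the data is fixed across runs
true_w = np.ones(dim) / np.sqrt(dim)                 # ground-truth direction for the toy task

def sample(rng, n):
    X = rng.normal(size=(n, dim))
    return X, (X @ true_w > 0).astype(float)

Xtr, ytr = sample(data_rng, 200)
Xte, yte = sample(data_rng, 500)

def run_once(seed, steps=300, lr=0.1):
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.5, size=dim)              # only the init and example order vary
    for _ in range(steps):                           # plain SGD on the logistic loss
        i = rng.integers(len(Xtr))
        p = 1 / (1 + np.exp(-Xtr[i] @ w))
        w -= lr * (p - ytr[i]) * Xtr[i]
    structure = abs(w @ true_w) / np.linalg.norm(w)  # hypothetical "internal structure" metric
    generalization = np.mean(((Xte @ w) > 0) == (yte > 0))  # held-out accuracy
    return structure, generalization

results = np.array([run_once(seed) for seed in range(20)])
r = np.corrcoef(results[:, 0], results[:, 1])[0, 1]
print(f"correlation across 20 random seeds: r = {r:.2f}")
```

The across-seed correlation is the quantity of interest; a single training run could not support the same kind of claim.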
Speaking of the effects of initial conditions, you faced some unusual challenges early in your career. How has that affected your research?
When I started my Ph.D., I developed a neurological disease that made me lose the ability to type or write by hand. That’s obviously a huge limitation for a Ph.D. student in computer science. I had to learn to dictate code, and I relied on accommodations like having a private office that I can dictate in.
There are lots of little things that it’s changed about my research. During my Ph.D., I knew I was never going to beat a person who could type in a race to the scoop. So I ended up focusing on this weird topic nobody was really interested in at the time: the training dynamics of neural language models. And yet that decision led me to a really fantastic research area.
There are benefits of working on a slower timescale. You don’t get caught up in hype waves. You can take weekends off and still publish something original.
Lots of people are interested in interpretability these days. How does your approach differ from what they do?
Most work is really trying to figure out how a model works, while I’m trying to figure out why it works that way. To answer that “how” question, people usually just look inside a model at the end of training. You try to uncover an efficient way of describing what’s going on inside the model, and then you impose your explanations on top of that. You might find that neuron number 3,000,004 activates when the model is about to produce French output. You might even be able to say that if the neuron’s activation is pushed a bit higher, it causes more French output. But that doesn’t tell you why the model works the way it does. And that’s a really important question if we want to predict how the model will behave in the future.
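The kind of single-neuron intervention described here can be sketched with a PyTorch forward hook that scales one hidden unit's activation and compares the output distribution before and after. The tiny randomly initialized model, the neuron index, and the scaling factor below are hypothetical; an actual study would hook a specific unit inside a large pretrained language model.

```python
# Sketch: push one neuron's activation higher and watch the output distribution shift.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden = 50, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, hidden),   # token ids -> vectors
    nn.Linear(hidden, hidden),
    nn.ReLU(),
    nn.Linear(hidden, vocab_size),      # next-token logits
)

NEURON = 7     # hypothetical unit suspected of being tied to some behavior
SCALE = 3.0    # how hard to push its activation upward

def amplify_neuron(module, inputs, output):
    # forward hook: return a modified copy of the layer's activations
    output = output.clone()
    output[..., NEURON] = output[..., NEURON] * SCALE
    return output

tokens = torch.tensor([1, 2, 3])        # placeholder input token ids
with torch.no_grad():
    baseline = model(tokens).softmax(dim=-1)
    handle = model[2].register_forward_hook(amplify_neuron)  # hook the hidden layer
    intervened = model(tokens).softmax(dim=-1)
    handle.remove()

# how much the predicted next-token distribution moved under the intervention
print((intervened - baseline).abs().max().item())
```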
What are some ways that the standard approach can lead you astray?
One example is neuron selectivity in neural networks for classifying images. This is a phenomenon where individual neurons activate very strongly only for images in a specific class, such as images of cats. You might look at that and say, “Well, clearly this is what the model needs to make good predictions.” But it turns out that if you intervene during training and prevent the model from developing these highly selective neurons, its performance actually improves.
So you might think that these models need to do a particular thing, because that’s what they happen to do. But it might be a vestigial property, something that developed early in training but isn’t actually important to how the model works in the end. It might even be holding the model back. You have to think like an evolutionary biologist and ask, “Is this actually causally linked?”
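For reference, one common way to quantify the selectivity described above is to compare each unit's mean activation on its preferred class against its mean activation on all other classes. The sketch below applies that measure to random placeholder activations and labels; in practice the activations would come from a trained image classifier evaluated on a labeled dataset.

```python
# Sketch: a per-unit class-selectivity score on placeholder activations.
import numpy as np

rng = np.random.default_rng(0)
n_images, n_classes, n_hidden = 1000, 10, 64

# stand-ins for hidden-layer activations and image labels
activations = rng.random(size=(n_images, n_hidden))
labels = rng.integers(n_classes, size=n_images)

def class_selectivity(acts, labels, n_classes):
    # mean activation of every unit for every class: shape (classes, units)
    class_means = np.stack([acts[labels == c].mean(axis=0) for c in range(n_classes)])
    mu_max = class_means.max(axis=0)                          # preferred class
    mu_rest = (class_means.sum(axis=0) - mu_max) / (n_classes - 1)
    return (mu_max - mu_rest) / (mu_max + mu_rest + 1e-9)     # 0 = unselective, 1 = one class only

sel = class_selectivity(activations, labels, n_classes)
print("most selective unit:", int(sel.argmax()), "selectivity:", round(float(sel.max()), 3))
```

A score near 1 means the unit fires almost exclusively for one class, which is the property the training-time intervention mentioned above suppresses.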
So let’s talk about causality. Many interpretability papers only examine models after training, but not just through isolated observations. They study the effects of editing neuron activations to establish causal relationships. Why isn’t that sufficient?
If you just do a causal analysis at the end of training, then you might find that a particular neuron is really important, that shutting it off destroys model performance at some task. You might say, “OK, the model becomes bad at French when I push this button.” But maybe that neuron just has other strong interactions with the rest of the model. Messing with it is likely to have some impact, but not necessarily the impact that you’re imagining.
One of the advantages of looking at the training process is that you can be more precise: If a structure in the model is responsible for a particular model function, you might expect the structure and the function to arise together. We saw something like this in a particular kind of language model called a masked language model. A type of internal structure developed first, and immediately after that, the model started getting much better very quickly at certain challenging grammatical concepts.
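A schematic version of that checkpoint-based analysis: take snapshots during training and report an internal "structure" statistic next to a behavioral score at each one, asking whether the two emerge together. The toy task and both metrics are stand-ins, not the masked-language-model structures or grammatical benchmarks from the actual study.

```python
# Sketch: track an internal statistic and a behavioral score across training checkpoints.
import numpy as np

rng = np.random.default_rng(1)
dim, steps, eval_every = 30, 3000, 300
true_w = np.ones(dim) / np.sqrt(dim)             # ground-truth direction for the toy task

def batch(n=64):
    X = rng.normal(size=(n, dim))
    return X, (X @ true_w > 0).astype(float)

Xte, yte = batch(2000)
w = rng.normal(scale=0.01, size=dim)
print(f"{'step':>6} {'structure':>10} {'behavior':>9}")
for step in range(1, steps + 1):
    X, y = batch()
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.05 * X.T @ (p - y) / len(y)           # one mini-batch SGD step
    if step % eval_every == 0:                   # a saved "checkpoint"
        structure = abs(w @ true_w) / np.linalg.norm(w)       # internal statistic
        behavior = np.mean(((Xte @ w) > 0) == (yte > 0))      # task performance
        print(f"{step:>6} {structure:>10.3f} {behavior:>9.3f}")
```

In a real analysis the snapshots would be the intermediate checkpoints discussed earlier, and the behavioral score would come from a targeted evaluation such as a grammar benchmark.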
Ultimately, whether you’re looking at training dynamics or any other way of describing a model’s behavior, the number one question is, “Can you be precise about exactly what the words you are using mean?” Interpretability research should be interpretable.