
In a First, AI Models Analyze Language As Well As a Human Expert

Published by qimuai · Reads: 12 · First-hand compilation



Source: https://www.quantamagazine.org/in-a-first-ai-models-analyze-language-as-well-as-a-human-expert-20251031/

Summary:

A recent study shows that, for the first time, artificial intelligence has matched human experts at linguistic analysis. A team led by Gašper Beguš, a linguist at the University of California, Berkeley, put several large language models through systematic linguistic tests; OpenAI's o1 model stood out on core abilities such as syntactic parsing, ambiguity resolution, and the analysis of recursive structures.

The study's four-part test, which included inferring the rules of invented languages, was designed to rule out the possibility that the models were simply regurgitating their training data. In the most demanding test, on center-embedded recursion, o1 correctly parsed sentences such as "The astronomy the ancients we revere studied was not separate from astrology" and could add further layers of recursion on its own. The model also precisely analyzed ambiguous sentences such as "Rowan fed his pet chicken," producing a separate syntactic tree for each reading.

In the phonology tests, the team invented 30 new mini-languages, and o1 correctly inferred rules such as "a vowel becomes breathy after a voiced obstruent," demonstrating that it can derive linguistic rules from scratch. David Mortensen, a computational linguist at Carnegie Mellon University, said the study reads as a rebuttal of the claim that large language models merely predict the next word.

Although no model has yet produced an original linguistic discovery, Beguš believes that if progress is simply a matter of scaling compute and data, language models will eventually surpass humans in language skills, and Mortensen agrees it is only a matter of time. In Beguš's view, the findings keep chipping away at the traditional notion that human language is unique, revealing AI's potential for deep linguistic analysis.


Original article:

In a First, AI Models Analyze Language As Well As a Human Expert
Introduction
Among the myriad abilities that humans possess, which ones are uniquely human? Language has been a top candidate at least since Aristotle, who wrote that humanity was “the animal that has language.” Even as large language models such as ChatGPT superficially replicate ordinary speech, researchers want to know if there are specific aspects of human language that simply have no parallels in the communication systems of other animals or artificially intelligent devices.
In particular, researchers have been exploring the extent to which language models can reason about language itself. For some in the linguistic community, language models not only don’t have reasoning abilities, they can’t. This view was summed up by Noam Chomsky, a prominent linguist, and two co-authors in 2023, when they wrote in The New York Times that “the correct explanations of language are complicated and cannot be learned just by marinating in big data.” AI models may be adept at using language, these researchers argued, but they’re not capable of analyzing language in a sophisticated way.
That view was challenged in a recent paper by Gašper Beguš, a linguist at the University of California, Berkeley; Maksymilian Dąbkowski, who recently received his doctorate in linguistics at Berkeley; and Ryan Rhodes of Rutgers University. The researchers put a number of large language models, or LLMs, through a gamut of linguistic tests — including, in one case, having the LLM generalize the rules of a made-up language. While most of the LLMs failed to parse linguistic rules in the way that humans are able to, one had impressive abilities that greatly exceeded expectations. It was able to analyze language in much the same way a graduate student in linguistics would — diagramming sentences, resolving multiple ambiguous meanings, and making use of complicated linguistic features such as recursion. This finding, Beguš said, “challenges our understanding of what AI can do.”
This new work is both timely and “very important,” said Tom McCoy, a computational linguist at Yale University who was not involved with the research. “As society becomes more dependent on this technology, it’s increasingly important to understand where it can succeed and where it can fail.” Linguistic analysis, he added, is the ideal test bed for evaluating the degree to which these language models can reason like humans.
Infinite Complexity
One challenge of giving language models a rigorous linguistic test is making sure they don’t already know the answers. These systems are typically trained on huge amounts of written information — not just the bulk of the internet, in dozens if not hundreds of languages, but also things like linguistics textbooks. The models could, in theory, simply memorize and regurgitate the information that they’ve been fed during training.
To avoid this, Beguš and his colleagues created a linguistic test in four parts. Three of the four parts involved asking the model to analyze specially crafted sentences using tree diagrams, which were first introduced in Chomsky’s landmark 1957 book, Syntactic Structures. These diagrams break sentences down into noun phrases and verb phrases and then further subdivide them into nouns, verbs, adjectives, adverbs, prepositions, conjunctions and so forth.
One part of the test focused on recursion — the ability to embed phrases within phrases. “The sky is blue” is a simple English sentence. “Jane said that the sky is blue” embeds the original sentence in a slightly more complex one. Importantly, this process of recursion can go on forever: “Maria wondered if Sam knew that Omar heard that Jane said that the sky is blue” is also a grammatically correct, if awkward, recursive sentence.
Recursion has been called one of the defining characteristics of human language by Chomsky and others — and indeed, perhaps a defining characteristic of the human mind. Linguists have argued that its limitless potential is what gives human languages their ability to generate an infinite number of possible sentences out of a finite vocabulary and a finite set of rules. So far, there’s no convincing evidence that other animals can use recursion in a sophisticated way.
Recursion can occur at the beginning or end of a sentence, but the form that is most challenging to master, called center embedding, takes place in the middle — for instance, going from “the cat died” to “the cat the dog bit died.”
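Center embedding is hard precisely because it behaves like a stack: each new clause splits a subject from its verb, and the verbs must then be emitted in the reverse order of their subjects. A minimal Python sketch (our illustration, not anything from the study) makes that mechanical structure explicit:

```python
def center_embed(nouns, verbs, depth):
    """Build a center-embedded sentence such as 'the cat the dog bit died'.

    Subjects pile up on the left; their verbs come back in reverse
    order, exactly the way a stack unwinds.
    """
    assert 1 <= depth <= min(len(nouns), len(verbs))
    subjects = " ".join(f"the {n}" for n in nouns[:depth])
    predicates = " ".join(reversed(verbs[:depth]))
    return f"{subjects} {predicates}"

print(center_embed(["cat"], ["died"], 1))
# -> the cat died
print(center_embed(["cat", "dog"], ["died", "bit"], 2))
# -> the cat the dog bit died
print(center_embed(["cat", "dog", "boy"], ["died", "bit", "saw"], 3))
# -> the cat the dog the boy saw bit died
```

Each added layer remains grammatical, but human readers typically lose track after two or three levels, which is what makes the construction such a demanding test.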
Beguš’ test fed the language models 30 original sentences that featured tricky examples of recursion. For example: “The astronomy the ancients we revere studied was not separate from astrology.” Using a syntactic tree, one of the language models — OpenAI’s o1 — was able to determine that the sentence was structured like so:
The astronomy [the ancients [we revere] studied] was not separate from astrology.
The model then went further and added another layer of recursion to the sentence:
The astronomy [the ancients [we revere [who lived in lands we cherish]] studied] was not separate from astrology.
Beguš, among others, didn’t anticipate that this study would come across an AI model with a higher-level “metalinguistic” capacity – “the ability not just to use a language but to think about language,” as he put it.
That is one of the “attention-getting” aspects of their paper, said David Mortensen, a computational linguist at Carnegie Mellon University who was not involved with the work. There has been debate about whether language models are just predicting the next word (or linguistic token) in a sentence, which is qualitatively different from the deep understanding of language that humans have. “Some people in linguistics have said that LLMs are not really doing language,” he said. “This looks like an invalidation of those claims.”
What Do You Mean?
McCoy was surprised by o1’s performance in general, particularly by its ability to recognize ambiguity, which is “famously a difficult thing for computational models of language to capture,” he said. Humans “have a lot of commonsense knowledge that enables us to rule out the ambiguity. But it’s difficult for computers to have that level of commonsense knowledge.”
A sentence such as “Rowan fed his pet chicken” could be describing the chicken that Rowan keeps as a pet, or it could be describing the meal of chicken meat that he gave to his (presumably more traditional) animal companion. The o1 model correctly produced two different syntactic trees, one that corresponds to the first interpretation of the sentence and one that corresponds to the latter.
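The two readings correspond to two different constituent structures: in one, "pet" modifies "chicken" inside a single noun phrase; in the other, "his pet" and "chicken" are separate objects of the verb. As a rough sketch (using NLTK's bracketed-tree notation; these are our illustrative trees, not the ones the model produced), the contrast looks like this:

```python
from nltk.tree import Tree

# Reading 1: the chicken is the pet that gets fed.
pet_chicken = Tree.fromstring(
    "(S (NP Rowan) (VP (V fed) (NP (Det his) (Adj pet) (N chicken))))")

# Reading 2: the (traditional) pet is fed chicken meat,
# a double-object construction.
fed_chicken = Tree.fromstring(
    "(S (NP Rowan) (VP (V fed) (NP (Det his) (N pet)) (NP (N chicken))))")

pet_chicken.pretty_print()
fed_chicken.pretty_print()
```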
The researchers also carried out experiments related to phonology — the study of the pattern of sounds and of the way the smallest units of sound, called phonemes, are organized. To speak fluently, like a native speaker, people follow phonological rules that they might have picked up through practice without ever having been explicitly taught. In English, for example, adding an “s” to a word that ends in a “g” creates a “z” sound, as in “dogs.” But an “s” added to a word ending in “t” sounds more like a standard “s,” as in “cats.”
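That voicing-assimilation rule is simple enough to state in a few lines of code. Here is a toy sketch (our simplification, ignoring cases like "buses" that insert an extra vowel) of the pattern the article describes:

```python
# Simplified inventory: a final voiced sound triggers [z], a voiceless one [s].
VOICED_FINALS = set("bdgvzmnlrw") | set("aeiouy")

def plural_sound(word: str) -> str:
    """Return a rough phonetic spelling of the plural's final sound."""
    suffix = "z" if word[-1] in VOICED_FINALS else "s"
    return word + suffix

print(plural_sound("dog"))  # "dogz": the written "s" in "dogs" sounds like [z]
print(plural_sound("cat"))  # "cats": stays a plain [s]
```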
In the phonology task, the group made up 30 new mini-languages, as Beguš called them, to find out whether the LLMs could correctly infer the phonological rules without any prior knowledge. Each language consisted of 40 made-up words. Here are some example words from one of the languages:
θalp
ʃebre
ði̤zṳ
ga̤rbo̤nda̤
ʒi̤zṳðe̤jo
They then asked the language models to analyze the phonological processes of each language. For this language, o1 correctly wrote that “a vowel becomes a breathy vowel when it is immediately preceded by a consonant that is both voiced and an obstruent” — a sound formed by restricting airflow, like the “t” in “top.”
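Because the inferred rule is purely local, looking only at the immediately preceding consonant, it can be checked mechanically against the word list. A minimal Python sketch (our reconstruction, assuming plain base forms of the sample words and U+0324, the combining diaeresis below, as the breathy-voice mark):

```python
VOICED_OBSTRUENTS = set("bdgzðʒ")  # voiced stops and fricatives in the sample
VOWELS = set("aeiou")
BREATHY = "\u0324"                 # combining diaeresis below: i -> i̤

def apply_breathy_rule(word):
    out = []
    prev = ""
    for ch in word:
        out.append(ch)
        if ch in VOWELS and prev in VOICED_OBSTRUENTS:
            out.append(BREATHY)    # vowel directly follows a voiced obstruent
        prev = ch
    return "".join(out)

for w in ["θalp", "ʃebre", "ðizu", "garbonda", "ʒizuðejo"]:
    print(w, "->", apply_breathy_rule(w))
# θalp     -> θalp        (θ is voiceless: no change)
# ðizu     -> ði̤zṳ        (vowels after ð and z become breathy)
# garbonda -> ga̤rbo̤nda̤
```

Running the rule on the unmarked forms reproduces exactly the breathy vowels shown in the list above.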
The languages were newly invented, so there’s no way that o1 could have been exposed to them during its training. “I was not expecting the results to be as strong or as impressive as they were,” Mortensen said.
Uniquely Human or Not?
How far can these language models go? Will they get better, without limit, simply by getting bigger — layering on more computing power, more complexity and more training data? Or are some of the characteristics of human language the result of an evolutionary process that is limited to our species?
The recent results show that these models can, in principle, do sophisticated linguistic analysis. But no model has yet come up with anything original, nor has it taught us something about language we didn’t know before.
If improvement is just a matter of increasing both computational power and the training data, then Beguš thinks that language models will eventually surpass us in language skills. Mortensen said that current models are somewhat limited. “They’re trained to do something very specific: given a history of tokens [or words], to predict the next token,” he said. “They have some trouble generalizing by virtue of the way they’re trained.”
But in view of recent progress, Mortensen said he doesn’t see why language models won’t eventually demonstrate an understanding of our language that’s better than our own. “It’s only a matter of time before we are able to build models that generalize better from less data in a way that is more creative.”
The new results show a steady “chipping away” at properties that had been regarded as the exclusive domain of human language, Beguš said. “It appears that we’re less unique than we previously thought we were.”

