The Hidden Ingredients Behind AI Creativity
Source: https://www.wired.com/story/researchers-uncover-hidden-ingredients-behind-ai-creativity/
Summary:
New research finds that the "creativity" of AI image generators stems from imperfections in their algorithms.
A long-standing puzzle in AI research: although diffusion models are designed to faithfully reproduce the images in their training data, they display surprising "creativity," blending elements to generate semantically coherent new images rather than mere jumbles of pixels. A new study now resolves this paradox.
In a paper presented at the 2025 International Conference on Machine Learning (ICML), Mason Kamb, a graduate student in applied physics at Stanford University, and his advisor Surya Ganguli argue that the "creativity" of diffusion models is not a mysterious emergent property but an inevitable consequence of technical imperfections in their denoising process. Through mathematical modeling, they show that two properties the models rely on when generating images, locality and translational equivariance, limit their grasp of global structure yet inadvertently give rise to their ability to produce novel combinations.
The researchers built a system of equations called the equivariant local score (ELS) machine, which predicts the composition of denoised images based solely on these two properties. In experiments, the ELS machine matched the outputs of trained diffusion models (including ResNets and UNets) with an average accuracy of 90 percent, a level of agreement described as unprecedented in machine learning.
Notably, anomalies common in diffusion-model images, such as hands with extra fingers, are a direct result of the models' hyperfocus on local patches of pixels at the expense of global context, a failure strikingly similar to the extra digits that can arise in biological morphogenesis when local signaling goes awry.
Although the study is the first to offer a mathematical explanation for the creativity of diffusion models, experts stress that it is only one step toward understanding AI creativity; the mechanisms behind the creativity of large language models and other AI systems remain unresolved. The finding may also offer a new perspective on human creativity: for humans and AI alike, innovation may spring from an incomplete understanding of the world and the occasional breakthrough that comes from filling in gaps in knowledge.
(This article originally appeared in Quanta Magazine, an editorially independent publication of the Simons Foundation dedicated to enhancing public understanding of mathematics, physics, and the life sciences.)
English source:
The original version of this story appeared in Quanta Magazine.
We were once promised self-driving cars and robot maids. Instead, we’ve seen the rise of artificial intelligence systems that can beat us in chess, analyze huge reams of text, and compose sonnets. This has been one of the great surprises of the modern era: physical tasks that are easy for humans turn out to be very difficult for robots, while algorithms are increasingly able to mimic our intellect.
Another surprise that has long perplexed researchers is those algorithms’ knack for their own, strange kind of creativity.
Diffusion models, the backbone of image-generating tools such as DALL·E, Imagen, and Stable Diffusion, are designed to generate carbon copies of the images on which they’ve been trained. In practice, however, they seem to improvise, blending elements within images to create something new—not just nonsensical blobs of color, but coherent images with semantic meaning. This is the “paradox” behind diffusion models, said Giulio Biroli, an AI researcher and physicist at the École Normale Supérieure in Paris: “If they worked perfectly, they should just memorize,” he said. “But they don’t—they’re actually able to produce new samples.”
To generate images, diffusion models use a process known as denoising. They convert an image into digital noise (an incoherent collection of pixels), then reassemble it. It’s like repeatedly putting a painting through a shredder until all you have left is a pile of fine dust, then patching the pieces back together. For years, researchers have wondered: If the models are just reassembling, then how does novelty come into the picture? It’s like reassembling your shredded painting into a completely new work of art.
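The shred-and-reassemble analogy can be made concrete with a toy sketch. The code below is purely illustrative, not any production diffusion model: the forward pass buries a small "image" in Gaussian noise step by step, and the reverse pass stands in for a learned denoiser with a crude neighborhood-averaging rule, just to show the iterative loop structure that real models learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": an 8x8 grid of pixel intensities in [0, 1].
image = rng.random((8, 8))

def add_noise(x, steps=10, scale=0.3):
    """Forward process: gradually drown the image in Gaussian noise."""
    trajectory = [x]
    for _ in range(steps):
        x = x + rng.normal(0.0, scale, x.shape)
        trajectory.append(x)
    return trajectory

def denoise_step(x):
    """Stand-in denoiser: average each pixel with its 4 neighbors.

    A real diffusion model learns this step from data; here it is a
    fixed local smoothing rule, just to show the reverse loop.
    """
    padded = np.pad(x, 1, mode="edge")
    return (padded[:-2, 1:-1] + padded[2:, 1:-1] +
            padded[1:-1, :-2] + padded[1:-1, 2:] +
            padded[1:-1, 1:-1]) / 5.0

trajectory = add_noise(image)
x = trajectory[-1]            # start from the fully noised image
for _ in range(10):
    x = denoise_step(x)       # iterate back toward a coherent image
```

Each reverse step removes a little noise; after enough steps the pixel values settle into something far smoother than the noise the loop started from.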
Now two physicists have made a startling claim: It’s the technical imperfections in the denoising process itself that lead to the creativity of diffusion models. In a paper presented at the International Conference on Machine Learning 2025, the duo developed a mathematical model of trained diffusion models to show that their so-called creativity is in fact a deterministic process—a direct, inevitable consequence of their architecture.
By illuminating the black box of diffusion models, the new research could have big implications for future AI research—and perhaps even for our understanding of human creativity. “The real strength of the paper is that it makes very accurate predictions of something very nontrivial,” said Luca Ambrogioni, a computer scientist at Radboud University in the Netherlands.
Bottoms Up
Mason Kamb, a graduate student studying applied physics at Stanford University and the lead author of the new paper, has long been fascinated by morphogenesis: the processes by which living systems self-assemble.
One way to understand the development of embryos in humans and other animals is through what’s known as a Turing pattern, named after the 20th-century mathematician Alan Turing. Turing patterns explain how groups of cells can organize themselves into distinct organs and limbs. Crucially, this coordination all takes place at a local level. There’s no CEO overseeing the trillions of cells to make sure they all conform to a final body plan. Individual cells, in other words, don’t have some finished blueprint of a body on which to base their work. They’re just taking action and making corrections in response to signals from their neighbors. This bottom-up system usually runs smoothly, but every now and then it goes awry—producing hands with extra fingers, for example.
When the first AI-generated images started cropping up online, many looked like surrealist paintings, depicting humans with extra fingers. These immediately made Kamb think of morphogenesis: “It smelled like a failure you’d expect from a [bottom-up] system,” he said.
AI researchers knew by that point that diffusion models take a couple of technical shortcuts when generating images. The first is known as locality: They only pay attention to a single group, or “patch,” of pixels at a time. The second is that they adhere to a strict rule when generating images: If you shift an input image by just a couple of pixels in any direction, for example, the system will automatically adjust to make the same change in the image it generates. This feature, called translational equivariance, is the model’s way of preserving coherent structure; without it, it’s much more difficult to create realistic images.
In part because of these features, diffusion models don’t pay any attention to where a particular patch will fit into the final image. They just focus on generating one patch at a time and then automatically fit them into place using a mathematical model known as a score function, which can be thought of as a digital Turing pattern.
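These two shortcuts are easy to demonstrate directly. The sketch below is an illustration with a hand-rolled 3x3 averaging filter (not code from the paper): the operation is local by construction, since each output pixel depends only on its immediate neighborhood, and the final check confirms translational equivariance, meaning that shifting the input shifts the output by exactly the same amount.

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((16, 16))

def local_op(x):
    """Local operation: each output pixel depends only on a 3x3 patch.

    Built from wrap-around rolls so shifts commute with it exactly.
    """
    out = np.zeros_like(x)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(x, (dy, dx), axis=(0, 1))
    return out / 9.0

# Shift the input by (2, 3) pixels, then apply the operation...
shifted_out = local_op(np.roll(image, (2, 3), axis=(0, 1)))

# ...and check it equals applying the operation, then shifting the
# output. That commutation is translational equivariance.
assert np.allclose(shifted_out, np.roll(local_op(image), (2, 3), axis=(0, 1)))
```

Convolutional architectures such as the UNets used in diffusion models have this same property baked in, which is why a shifted input yields a correspondingly shifted output.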
Researchers long regarded locality and equivariance as mere limitations of the denoising process, technical quirks that prevented diffusion models from creating perfect replicas of images. They didn’t associate them with creativity, which was seen as a higher-order phenomenon.
They were in for another surprise.
Made Locally
Kamb started his graduate work in 2022 in the lab of Surya Ganguli, a physicist at Stanford who also has appointments in neurobiology and electrical engineering. OpenAI released ChatGPT the same year, causing a surge of interest in the field now known as generative AI. As tech developers worked on building ever-more-powerful models, many academics remained fixated on understanding the inner workings of these systems.
To that end, Kamb eventually developed a hypothesis that locality and equivariance lead to creativity. That raised a tantalizing experimental possibility: If he could devise a system to do nothing but optimize for locality and equivariance, it should then behave like a diffusion model. This experiment was at the heart of his new paper, which he wrote with Ganguli as his coauthor.
Kamb and Ganguli call their system the equivariant local score (ELS) machine. It is not a trained diffusion model, but rather a set of equations which can analytically predict the composition of denoised images based solely on the mechanics of locality and equivariance. They then took a series of images that had been converted to digital noise and ran them through both the ELS machine and a number of powerful diffusion models, including ResNets and UNets.
The results were “shocking,” Ganguli said: Across the board, the ELS machine was able to identically match the outputs of the trained diffusion models with an average accuracy of 90 percent—a result that’s “unheard of in machine learning,” Ganguli said.
The results appear to support Kamb’s hypothesis. “As soon as you impose locality, [creativity] was automatic; it fell out of the dynamics completely naturally,” he said. The very mechanisms which constrained diffusion models’ window of attention during the denoising process—forcing them to focus on individual patches, regardless of where they’d ultimately fit into the final product—are the very same that enable their creativity, he found. The extra-fingers phenomenon seen in diffusion models was similarly a direct by-product of the model’s hyperfixation on generating local patches of pixels without any kind of broader context.
Experts interviewed for this story generally agreed that although Kamb and Ganguli’s paper illuminates the mechanisms behind creativity in diffusion models, much remains mysterious. For example, large language models and other AI systems also appear to display creativity, but they don’t harness locality and equivariance.
“I think this is a very important part of the story,” Biroli said, “[but] it’s not the whole story.”
Creating Creativity
For the first time, researchers have shown how the creativity of diffusion models can be thought of as a by-product of the denoising process itself, one that can be formalized mathematically and predicted with an unprecedentedly high degree of accuracy. It’s almost as if neuroscientists had put a group of human artists into an MRI machine and found a common neural mechanism behind their creativity that could be written down as a set of equations.
The comparison to neuroscience may go beyond mere metaphor: Kamb and Ganguli’s work could also provide insight into the black box of the human mind. “Human and AI creativity may not be so different,” said Benjamin Hoover, a machine learning researcher at the Georgia Institute of Technology and IBM Research who studies diffusion models. “We assemble things based on what we experience, what we’ve dreamed, what we’ve seen, heard, or desire. AI is also just assembling the building blocks from what it’s seen and what it’s asked to do.” Both human and artificial creativity, according to this view, could be fundamentally rooted in an incomplete understanding of the world: We’re all doing our best to fill in the gaps in our knowledge, and every now and then we generate something that’s both new and valuable. Perhaps this is what we call creativity.
Original story reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.