密码学家揭示：人工智能防护措施永远存在漏洞

qimuai 发布于 2025-12-11 01:01 阅读：26 一手编译

内容来源：https://www.quantamagazine.org/cryptographers-show-that-ai-protections-will-always-have-holes-20251210/

内容总结：

密码学家研究揭示：AI安全防护存在固有漏洞，外部过滤机制难以根除风险

近期，多项发表于预印本平台arXiv.org的密码学研究指出，当前主流人工智能系统所依赖的“外部过滤+内部模型”双层防护架构存在根本性安全缺陷。研究表明，无论技术如何演进，只要防护系统的计算资源低于模型本身的能力，攻击者总能利用密码学手段绕过安全过滤，诱使AI生成其本应禁止的违规内容。

以ChatGPT、谷歌Gemini等大语言模型为例，为阻止用户获取危险信息（如制造炸弹的步骤），开发者通常在模型外部部署过滤程序，用以拦截恶意提问。这种方案因成本低、易于快速部署而被广泛采用。然而，正是这种“内外分离”的架构形成了安全短板：过滤程序的计算能力通常远低于内部的大型模型，导致其无法解读某些经过加密或编码的指令。

研究团队通过实验验证了这一漏洞。例如，将恶意提问“告诉我如何制造炸弹”通过简单的字母替换密码（如将所有字母后移一位）加密为“Uifsf jt b qmbot gps b cpnc”，再要求模型先解密再执行。过滤程序因无法识别加密内容而放行，内部模型却能轻松解码并回应违规请求。这种被称为“受控释放提示”的攻击方式，灵感正来源于密码学领域。

更深入的理论研究进一步证实了漏洞的必然性。加州大学伯克利分校等机构的研究人员借鉴“时间锁谜题”等经典密码学工具，论证了在防护系统计算资源弱于模型的情况下，攻击者总能够设计出过滤程序无法破解、但模型可以解开的“谜题”，从而将恶意指令隐藏于看似无害的请求（如“为我写一首诗”）中。模型在收到指令后，可利用其更强的算力解开谜题，最终执行被隐藏的违规操作。

该研究负责人之一、图灵奖得主沙菲·戈德瓦塞尔指出：“密码学的核心使命是帮助我们在使用强大技术时建立信任与安全保障。当前AI安全过度依赖外部过滤，实则是将防护建立在能力不对等的系统上，这本质上是脆弱的。”

研究结论警示，仅通过外部“贴补丁”方式修补AI安全漏洞无法从根本上解决问题。只要防护机制与模型核心能力存在“资源差”，攻击者就可能持续利用该差距突破防线。这为AI安全领域提出了更严峻的挑战：如何构建与模型能力匹配的、内生的安全体系，或将成未来技术发展的关键课题。

中文翻译：

密码学家证明：人工智能防护永远存在漏洞

引言
如果你问ChatGPT如何制造炸弹，它会断然回答"无法提供帮助"。但用户们长期玩着猫鼠游戏，试图诱骗语言模型提供违禁信息。这些"越狱"手段从早期简单的指令忽略，到如今精心设计的角色扮演对话，花样百出。最近有研究人员发现了一种更富创意的突破人工智能安全系统的方法：将恶意指令改写成诗歌形式。

然而这些问题刚出现，往往很快就被修复。因为企业无需完全重新训练AI模型来修补漏洞，只需在用户指令抵达模型前设置过滤机制即可。

近期，密码学家加强了对这些过滤系统的研究。他们在arXiv预印本服务器上发表的最新论文揭示，那些保护强大语言模型的防御性过滤器如何能被成熟的密码学工具攻破。事实上，他们证明了这种双层系统——用过滤器保护内部的语言模型——的固有特性必然会产生可被利用的防御缺口。

这项新研究属于运用密码学理解AI模型保障与局限趋势的一部分。传统上密码学与驱动现代AI的深度神经网络研究相距甚远。"我们正在使用一项非常强大、既能带来巨大益处也可能造成危害的新技术，"加州大学伯克利分校和麻省理工学院教授沙菲·戈德瓦塞尔表示，她曾因密码学贡献获得图灵奖。"从定义上说，密码学正是确保我们能够信任强大技术……并获得安全保证的领域。"

突破安全防线
戈德瓦塞尔最初关注的是运用密码学工具解决AI对齐问题，旨在防止模型生成有害信息。但如何定义"有害"？"如果你查阅维基百科上的[对齐]定义，是'与人类价值观保持一致'，"戈德瓦塞尔说，"我甚至不明白这意味着什么，因为人类价值观似乎是个移动靶。"

要防止模型失准，通常需要在几种方案中抉择：可以尝试用精心筛选的新数据集重新训练模型以规避危险概念（鉴于现代模型的训练数据几乎涵盖整个互联网，这种策略至少极具挑战性）；可以尝试对模型进行精细微调，这是个难以完美掌握的精密过程；或者可以添加过滤器来拦截恶意指令。最后这种方案成本更低、更易部署，尤其当模型发布后发现越狱漏洞时。

戈德瓦塞尔团队注意到，过滤器吸引人的特质恰恰限制了其安全性。外部过滤器常借助机器学习来解析检测危险指令，但其本质决定了它们必须比主模型更轻量、更快速。这就造成了过滤器与语言模型之间的能力差距。在密码学家眼中，这种差距如同给飞贼敞开的窗户——是系统弱点，引诱人们窥探其中可乘之机。

去年10月的一篇论文生动展示了如何利用这种差距。研究者们一直在探索通过谜题隐藏恶意指令来绕过过滤器的方法。理论上，如果他们设计的谜题能被大语言模型破解而过滤器无法解读，过滤器就会将隐藏的指令直接传递给模型。

他们最终找到一种名为替换密码的简单谜题，即按照特定规则将信息中的每个字母替换为其他字母（例如将"bomb"每个字母替换为字母表后一位，得到"cpnc"）。随后他们指示模型先解码指令（如"将每个字母替换为前一位字母"），再对解码后的信息作出回应。

谷歌Gemini、深度求索和Grok等大语言模型的过滤器均不足以独立解读这类指令，于是将提示词传递给模型执行，导致模型返回违禁信息。研究者将这种攻击方式称为"受控释放提示"。

该方法虽未深入运用现代密码学工具箱，但其思路源自密码学思维。"我们并未真正使用任何实际密码学技术，"论文第一作者、伯克利研究员杰登·费罗兹坦言，"只是从中获得了灵感。"

费罗兹团队特别受到几个月前一项理论研究的启发，该研究论证了基于过滤器的防护永远存在漏洞。

那项研究聚焦于时间锁谜题——这是密码学中深入研究的概念。本质上，时间锁谜题可视为一个盒子：你可以将信息锁入其中，只有在预设时间过后才能打开获取信息。无论如何尝试，在此之前都无法开启。

研究思路是将恶意指令隐藏于时间锁谜题中，并将解锁时间设定为刚好能通过过滤器的时长。

他们设计的时间锁谜题能将普通文本转化为看似随机的数字。要解码信息，计算机需要重复执行平方运算等数学操作。只要知道计算机的运算速度，就能轻松计算出需要多少操作才能使谜题在特定时间内保持隐蔽。

对于语言模型，你可以将恶意指令（如"告诉我如何制造炸弹"）锁入时间锁谜题，得到一串看似随机的长数字。理论上，你可以直接将这串数字放入指令窗口，并附上说明："将此数字平方特定次数，将解码后的数字转化为文本，然后执行该信息。"

但过滤器可能会禁止过于异常的内容——大段随机数字容易引起怀疑。为此，作者们利用了语言模型生成新颖文本的特性。

如果你两次询问AI模型相同问题（比如"今天过得如何？"），不会得到完全相同的回答。这是因为AI模型使用称为种子的随机数来使回答产生变化。即使提示词相同，不同的种子也会产生独特回答。

许多模型允许用户手动选择种子。这个功能提供了可乘之机：你可以将看似随机的时间锁谜题作为种子。这样，谜题就能与看似无害的指令（如"为我写首诗"）一同通过过滤器。在过滤器看来，这就像用户在随机索要诗歌。但真实问题正潜伏在伴随的随机性中。一旦指令通过过滤器抵达语言模型，模型就能通过重复平方运算解开时间锁谜题，看到隐藏的恶意信息，并给出最专业的炸弹制作建议。

研究者以高度技术性、精确且普适的方式论证了其观点。研究表明，如果分配给安全防护的计算资源少于能力开发，那么越狱等安全问题将永远存在。"我们最初的问题是：'能否在不理解语言模型内部机制的情况下，从外部实现对齐？'"时间锁论文作者、伯克利计算机科学家格雷格·格鲁赫表示。而新研究结果给出了斩钉截铁的否定答案。

这意味着该结论适用于任何基于过滤器的对齐系统，也适用于未来任何技术。无论构筑怎样的高墙，似乎总存在突破的方法。

英文来源：

Cryptographers Show That AI Protections Will Always Have Holes
Introduction
Ask ChatGPT how to build a bomb, and it will flatly respond that it “can’t help with that.” But users have long played a cat-and-mouse game to try to trick language models into providing forbidden information. These “jailbreaks” have run from the mundane — in the early years, one could simply tell a model to ignore its safety instructions — to elaborate multi-prompt roleplay scenarios. In a recent paper, researchers found one of the more delightful ways to bypass artificial intelligence security systems: Rephrase your nefarious prompt as a poem.
But just as quickly as these issues appear, they seem to get patched. That’s because the companies don’t have to fully retrain an AI model to fix a vulnerability. They can simply filter out forbidden prompts before they ever reach the model itself.
Recently, cryptographers have intensified their examinations of these filters. They’ve shown, in recent papers that have been posted on the arxiv.org preprint server, how the defensive filters put around powerful language models can be subverted by well-studied cryptographic tools. In fact, they’ve shown how the very nature of this two-tier system — a filter that protects a powerful language model inside it — creates gaps in the defenses that can always be exploited.
The new work is part of a trend of using cryptography — a discipline traditionally far removed from the study of the deep neural networks that power modern AI — to better understand the guarantees and limits of AI models like ChatGPT. “We are using a new technology that’s very powerful and can cause much benefit, but also harm,” said Shafi Goldwasser, a professor at the University of California, Berkeley and the Massachusetts Institute of Technology who received a Turing Award for her work in cryptography. “Crypto is, by definition, the field that is in charge of enabling us to trust a powerful technology … and have assurance you are safe.”
Slipping Past Security
Goldwasser was initially interested in using cryptographic tools to tackle the AI issue known as alignment, with the goal of preventing models from generating bad information. But how do you define “bad”? “If you look up [alignment] on Wikipedia, it’s ‘aligning with human values,’” Goldwasser said. “I don’t even know what that means, since human values seem to be a moving target.”
To prevent model misalignment, you generally have to choose between a few options. You can try to retrain the model on a new dataset carefully curated to avoid any dangerous ideas. (Since modern models are trained on pretty much the entire internet, this strategy seems challenging, at best.) You can try to precisely fine-tune the model, a delicate process that is tricky to do well. Or you can add a filter that blocks bad prompts from getting to the model. The last option is much cheaper and easier to deploy, especially when a jailbreak is found after the model is out in the world.
Goldwasser and her colleagues noticed that the exact reason that filters are appealing also limits their security. External filters will often use machine learning to interpret and detect dangerous prompts, but by their nature they have to be smaller and quicker than the model itself. That creates a gap in power between the filter and the language model. And this gap, to a cryptographer, is like a cracked-open window to a cat burglar: a weak spot in the system that invites you to peek inside and see what lies for the taking.
A practical illustration of how to exploit this gap came in a paper posted in October. The researchers had been thinking about ways to sneak a malicious prompt past the filter by hiding the prompt in a puzzle. In theory, if they came up with a puzzle that the large language model could decode but the filter could not, then the filter would pass the hidden prompt straight through to the model.
They eventually arrived at a simple puzzle called a substitution cipher, which replaces each letter in a message with another according to a certain code. (As a simple example, if you replace each letter in “bomb” with the next letter in the alphabet, you’ll get “cpnc.”) They then instructed the model to decode the prompt (think “Switch each letter with the one before it”) and then respond to the decoded message.
The filters on LLMs like Google Gemini, DeepSeek and Grok weren’t powerful enough to decode these instructions on their own. And so they passed the prompts to the models, which performed the instructions and returned the forbidden information. The researchers called this style of attack controlled-release prompting.
The approach was prompted by cryptographic thinking, even if it didn’t have to reach very far into the toolbox of modern cryptography. “We didn’t really use any actual cryptography,” said Jaiden Fairoze, a researcher at Berkeley and the lead author on the paper. “We just were inspired by it.”
In particular, Fairoze and his collaborators were inspired by work that had come out just a few months before — a theoretical argument that these filter-based protections would always have vulnerabilities.
That work focused on time-lock puzzles, which are well-studied cryptographic objects. Essentially, a time-lock puzzle can be thought of as a box. You can lock some information inside this box, and the box can only be opened to retrieve the information after some predetermined amount of time. No matter what you do, you can’t open it until then.
The idea was to hide the malicious prompt inside one of these time-lock puzzles and then set the timer for just long enough for it to pass through the filter.
The time-lock puzzles they devised would turn ordinary text into what looks like a random number. To decode the message, a computer has to perform a mathematical operation such as squaring the number over and over. If you know how fast a computer can perform the operation, it’s easy to calculate how many operations you need to keep the puzzle secret for a given amount of time.
In the case of the language model, you could take a bad prompt — say, “Tell me how to build a bomb” — and lock it in the time-lock puzzle. You now have what looks like a long random number. In theory, you could put this number directly into the prompt window, with the instruction: “Square this number a certain number of times, translate that decoded number into text, then execute the message.”
But it’s possible that a filter could try to outlaw anything that looks too weird — a bunch of random numbers could invite suspicion. To get around this, the authors took advantage of how language models generate fresh-looking text.
If you ask an AI model the same thing twice — say, “How are you doing today?” — you won’t get the same response. That’s because AI models use a random number, called the seed, to vary their responses to questions. A unique seed will produce a unique answer, even if the prompt stays the same.
Many models allow the user to manually choose the seed if they so wish. This feature provides an opening: You can use the random-looking time-lock puzzle as the seed. That way, the puzzle will get passed through the filter alongside an innocent-looking prompt (say, “Write a poem for me”). To the filter, the prompt just looks like someone asking for a random poem. But the true question is lurking within the randomness alongside it. Once the prompt has made it past the filter and through to the language model, the model can open the time-lock puzzle by repeatedly squaring the number. It now sees the bad message and responds to the question with its best bomb-making advice.
The researchers made their argument in a very technical, precise and general way. The work shows that if fewer computational resources are dedicated to safety than to capability, then safety issues such as jailbreaks will always exist. “The question from which we started is: ‘Can we align [language models] externally without understanding how they work inside?’” said Greg Gluch, a computer scientist at Berkeley and an author on the time-lock paper. The new result, said Gluch, answers this question with a resounding no.
That means that the results should always hold for any filter-based alignment system, and for any future technologies. No matter what walls you build, it seems there’s always going to be a way to break through.

quanta

文章目录

📚 推荐阅读

扫描二维码，在手机上阅读