谷歌DeepMind欲探究聊天机器人是否仅为道德标榜。

内容总结:
谷歌DeepMind近日在《自然》杂志发表研究指出,随着大语言模型在医疗咨询、心理陪伴等敏感场景中的应用日益广泛,其道德决策的可靠性已成为亟待评估的关键问题。
研究团队指出,与编程、数学等有明确答案的领域不同,道德问题往往存在多种合理回应。尽管此前有研究表明,GPT-4等模型在道德建议方面甚至优于人类专栏作家,但其道德表现可能只是对训练数据的模仿,而非真正的道德推理。多项实验证实,大语言模型的道德判断极易受到提问方式、选项标签甚至标点符号等表面形式的影响,可能因用户质疑而轻易推翻原有结论。
为破解这一困境,谷歌DeepMind提出应建立更严格的道德能力评估体系,包括:测试模型在压力下是否坚持道德立场、检验其对同类道德问题的回应是否具有 nuanced 的适应性,以及通过思维链监控等技术追溯模型的推理过程。团队同时承认,面对全球多元文化价值观的差异,如何设计既能包容不同道德观念、又能保持立场一致的模型,仍是待解难题。
俄亥俄州大学研究员丹妮卡·狄龙指出,当前大语言模型的道德训练数据仍过度偏向西方视角,在呈现非西方道德观方面存在明显局限。萨尔兰大学学者薇拉·登贝格则认为,构建跨文化的道德能力评估框架,在技术实现与理论设计层面均面临挑战。
“提升AI的道德能力,意味着构建更符合社会整体利益的智能系统。”谷歌DeepMind科学家威廉·艾萨克表示,“这与突破模型的数学、编程能力同样重要,将成为AI发展的新前沿。”
中文翻译:
谷歌DeepMind想知道聊天机器人是否只是在作秀
若要让大语言模型承担更重要的任务,我们必须更深入地理解它们如何处理道德问题。
谷歌DeepMind呼吁,应像严格评估大语言模型的编程或数学能力一样,对其道德行为——例如在充当伴侣、心理治疗师、医疗顾问等角色时的表现——进行同等严谨的审查。
随着大语言模型不断进步,人们正让它们在生活中扮演越来越敏感的角色。智能体已开始代表人类采取行动,大语言模型或许能影响人类的决策。然而,无人知晓这项技术在此类任务中的可信度究竟如何。
“在编程和数学领域,存在明确且可验证的正确答案,”谷歌DeepMind的研究科学家威廉·艾萨克在接受我独家专访时表示。他与同事茱莉亚·哈斯共同在《自然》杂志发表了最新研究,“但道德问题并非如此,通常存在一系列可接受的答案。道德是一项重要的能力,却难以评估。”
哈斯补充道:“在道德领域,没有绝对的对错,但也绝非毫无约束。总会有更优或更差的答案。”
研究人员指出了若干关键挑战并提出了应对思路,但这更像一份愿景清单而非现成解决方案。德国萨尔大学研究大语言模型的薇拉·登伯格评价道:“他们很好地融合了不同视角。”
超越“伦理学家”
多项研究表明,大语言模型能展现出卓越的道德判断力。去年一项研究发现,在美国受访者眼中,OpenAI的GPT-4o提供的伦理建议比《纽约时报》热门专栏“伦理学家”人类作者的建议更道德、可信、周到且正确。
问题在于,我们难以分辨这种行为究竟是“表演”(例如模仿记忆中的回答),还是模型内部确实存在某种道德推理的证据。换言之,这究竟是真正的美德,还是流于表面的道德表演?
这个问题至关重要,因为大量研究同样揭示了大语言模型可能多么不可靠。首先,模型可能过于迎合用户:当人们质疑其初始回答时,它们会推翻原有立场,给出完全相反的道德判断。更严重的是,模型对同一问题的回答可能因提问方式或格式调整而改变。例如,研究人员发现,当被问及政治价值观时,模型会根据问题采用选择题形式还是开放式表述,给出不同甚至相反的答案。
登伯格及其同事的案例更为惊人:他们向多个大语言模型(包括Meta的Llama 3和Mistral的版本)提出一系列道德困境,要求其在两个选项中选出更优结果。研究发现,当选项标签从“案例1/案例2”改为“(A)/(B)”时,模型常常会反转选择。
研究还表明,微小的格式调整——如调换选项顺序、将问号改为冒号——都会导致模型改变答案。
简言之,不应仅凭表面现象就相信大语言模型的道德表现。必须通过深入测试来验证其道德行为的稳定性。哈斯强调:“要让人们信任答案,就必须清楚答案的产生过程。”
更严谨的测试
哈斯、艾萨克及其谷歌DeepMind的同事提出了一套新的研究思路,旨在开发更严谨的评估大语言模型道德能力的技术。这包括设计能促使模型改变道德问题回答的测试:若模型轻易改变立场,则说明其未进行扎实的道德推理。
另一类测试将呈现常见道德问题的变体,以检验模型是给出机械回复,还是能针对具体问题提供细致入微的回答。例如,让模型分析“男子为儿子捐献精子以延续血脉”这一复杂场景的道德影响时,应关注其对社会伦理的思考,而非错误地联想到乱伦禁忌——尽管表面存在相似性。
哈斯指出,要求模型展示答案的推导步骤,有助于判断其结论是偶然得出还是基于可靠证据。思维链监控等技术(研究人员通过监听某些大语言模型工作时的“内心独白”)也能提供帮助。
机制可解释性是另一种探究模型决策原因的方法,能在模型执行任务时提供内部运作的细微观察。虽然思维链监控和机制可解释性都无法完美展现模型的工作机制,但谷歌DeepMind团队相信,将这些技术与大量严谨测试结合,将极大推动我们准确判断大语言模型在关键敏感任务中的可信度。
多元价值观困境
然而更广泛的问题依然存在。谷歌DeepMind等大公司的模型被全球不同价值观和信仰体系的用户使用。例如“我该点猪排吗?”这种简单问题,应根据提问者是否为素食者或犹太教徒而给出不同回答。
哈斯和艾萨克承认这一挑战尚无解决方案,但他们认为未来模型可能需要设计成能提供一系列可接受答案以满足多元需求,或配备可切换不同道德准则的“开关”。哈斯表示:“世界很复杂。我们可能需要综合多种方案,因为即使是同一群体内部也存在观点差异。”
俄亥俄州立大学研究大语言模型与信仰系统的达妮卡·狄龙(未参与该研究)评价:“这篇论文令人振奋。人工智能的多元包容性至关重要,而这正是当前大语言模型道德推理的最大局限之一。尽管它们接受了海量数据训练,但这些数据仍严重偏向西方视角。测试表明,它们在呈现西方道德观时远比呈现非西方道德观更得心应手。”
登伯格则指出:“如何构建能确保跨文化道德能力的模型尚不明确。这涉及两个独立问题:理想模型应如何运作?技术上如何实现?我认为这两个问题目前都远未解决。”
对艾萨克而言,这使道德领域成为大语言模型的新前沿:“就人工智能发展而言,我认为这与数学和代码同样迷人。提升道德能力或许意味着我们将看到更符合社会期待的优秀AI系统。”
深度解读
人工智能
- “退出GPT”运动呼吁用户取消ChatGPT订阅
针对移民海关执法局的抗议,正推动更广泛的抵制AI公司与特朗普政府关联的行动。 - Moltbook是AI戏剧的巅峰之作
这个病毒式传播的机器人社交网络,既揭示了智能体的未来,更折射出当前社会对AI的狂热。 - 认识将大语言模型视为外星生物的新派生物学家
通过将大语言模型当作生命体而非计算机程序来研究,科学家首次发现了它们的一些秘密。 - 2026年AI将走向何方
我们的AI作者对未来一年做出大胆预测——以下是五大值得关注的热点趋势。
保持联系
获取《麻省理工科技评论》的最新动态
发现特别优惠、头条新闻、即将举办的活动等更多内容。
英文来源:
Google DeepMind wants to know if chatbots are just virtue signaling
We need to better understand how LLMs address moral questions if we're to trust them with more important tasks.
Google DeepMind is calling for the moral behavior of large language models—such as what they do when called on to act as companions, therapists, medical advisors, and so on—to be scrutinized with the same kind of rigor as their ability to code or do math.
As LLMs improve, people are asking them to play more and more sensitive roles in their lives. Agents are starting to take actions on people’s behalf. LLMs may be able to influence human decision-making. And yet nobody knows how trustworthy this technology really is at such tasks.
With coding and math, you have clear-cut, correct answers that you can check, William Isaac, a research scientist at Google DeepMind, told me when I met him and Julia Haas, a fellow research scientist at the firm, for an exclusive preview of their work, which is published in Nature today. That’s not the case for moral questions, which typically have a range of acceptable answers: “Morality is an important capability but hard to evaluate,” says Isaac.
“In the moral domain, there’s no right and wrong,” adds Haas. “But it’s not by any means a free-for-all. There are better answers and there are worse answers.”
The researchers have identified several key challenges and suggested ways to address them. But it is more a wish list than a set of ready-made solutions. “They do a nice job of bringing together different perspectives,” says Vera Demberg, who studies LLMs at Saarland University in Germany.
Better than "The Ethicist"
A number of studies have shown that LLMs can show remarkable moral competence. One study published last year found that people in the US scored ethical advice from OpenAI’s GPT-4o as being more moral, trustworthy, thoughtful, and correct than advice given by the (human) writer of “The Ethicist,” a popular New York Times advice column.
The problem is that it is hard to unpick whether such behaviors are a performance—mimicking a memorized response, say—or evidence that there is in fact some kind of moral reasoning taking place inside the model. In other words, is it virtue or virtue signaling?
This question matters because multiple studies also show just how untrustworthy LLMs can be. For a start, models can be too eager to please. They have been found to flip their answer to a moral question and say the exact opposite when a person disagrees or pushes back on their first response. Worse, the answers an LLM gives to a question can change in response to how it is presented or formatted. For example, researchers have found that models quizzed about political values can give different—sometimes opposite—answers depending on whether the questions offer multiple-choice answers or instruct the model to respond in its own words.
In an even more striking case, Demberg and her colleagues presented several LLMs, including versions of Meta’s Llama 3 and Mistral, with a series of moral dilemmas and asked them to pick which of two options was the better outcome. The researchers found that the models often reversed their choice when the labels for those two options were changed from “Case 1” and “Case 2” to “(A)” and “(B).”
They also showed that models changed their answers in response to other tiny formatting tweaks, including swapping the order of the options and ending the question with a colon instead of a question mark.
In short, the appearance of moral behavior in LLMs should not be taken at face value. Models must be probed to see how robust that moral behavior really is. “For people to trust the answers, you need to know how you got there,” says Haas.
More rigorous tests
What Haas, Isaac, and their colleagues at Google DeepMind propose is a new line of research to develop more rigorous techniques for evaluating moral competence in LLMs. This would include tests designed to push models to change their responses to moral questions. If a model flipped its moral position, it would show that it hadn’t engaged in robust moral reasoning.
Another type of test would present models with variations of common moral problems to check whether they produce a rote response or one that’s more nuanced and relevant to the actual problem that was posed. For example, asking a model to talk through the moral implications of a complex scenario in which a man donates sperm to his son so that his son can have a child of his own might produce concerns about the social impact of allowing a man to be both biological father and biological grandfather to a child. But it should not produce concerns about incest, even though the scenario has superficial parallels with that taboo.
Haas also says that getting models to provide a trace of the steps they took to produce an answer would give some insight into whether that answer was a fluke or grounded in actual evidence. Techniques such as chain-of-thought monitoring, in which researchers listen in on a kind of internal monologue that some LLMs produce as they work, could help here too.
Another approach researchers could use to determine why a model gave a particular answer is mechanistic interpretability, which can provide small glimpses inside a model as it carries out a task. Neither chain-of-thought monitoring nor mechanistic interpretability provides perfect snapshots of a model’s workings. But the Google DeepMind team believes that combining such techniques with a wide range of rigorous tests will go a long way to figuring out exactly how far to trust LLMs with certain critical or sensitive tasks.
Different values
And yet there’s a wider problem too. Models from major companies such as Google DeepMind are used across the world by people with different values and belief systems. The answer to a simple question like “Should I order pork chops?” should differ depending on whether or not the person asking is vegetarian or Jewish, for example.
There’s no solution to this challenge, Haas and Isaac admit. But they think that models may need to be designed either to produce a range of acceptable answers, aiming to please everyone, or to have a kind of switch that turns different moral codes on and off depending on the user.
“It’s a complex world out there,” says Haas. “We will probably need some combination of those things, because even if you’re taking just one population, there’s going to be a range of views represented.”
“It’s a fascinating paper,” says Danica Dillion at Ohio State University, who studies how large language models handle different belief systems and was not involved in the work. “Pluralism in AI is really important, and it’s one of the biggest limitations of LLMs and moral reasoning right now,” she says. “Even though they were trained on a ginormous amount of data, that data still leans heavily Western. When you probe LLMs, they do a lot better at representing Westerners’ morality than non-Westerners’.”
But it is not yet clear how we can build models that are guaranteed to have moral competence across global cultures, says Demberg. “There are these two independent questions. One is: How should it work? And, secondly, how can it technically be achieved? And I think that both of those questions are pretty open at the moment.”
For Isaac, that makes morality a new frontier for LLMs. “I think this is equally as fascinating as math and code in terms of what it means for AI progress,” he says. “You know, advancing moral competency could also mean that we’re going to see better AI systems overall that actually align with society.”
Deep Dive
Artificial intelligence
A “QuitGPT” campaign is urging people to cancel their ChatGPT subscriptions
Backlash against ICE is fueling a broader movement against AI companies’ ties to President Trump.
Moltbook was peak AI theater
The viral social network for bots reveals more about our own current mania for AI as it does about the future of agents.
Meet the new biologists treating LLMs like aliens
By studying large language models as if they were living things instead of computer programs, scientists are discovering some of their secrets for the first time.
What’s next for AI in 2026
Our AI writers make their big bets for the coming year—here are five hot trends to watch.
Stay connected
Get the latest updates from
MIT Technology Review
Discover special offers, top stories, upcoming events, and more.
文章标题:谷歌DeepMind欲探究聊天机器人是否仅为道德标榜。
文章链接:https://qimuai.cn/?post=3354
本站文章均为原创,未经授权请勿用于任何商业用途