«

从安全护栏到治理体系:CEO如何守护自主智能系统的安全指南

qimuai 发布于 阅读:3 一手编译


从安全护栏到治理体系:CEO如何守护自主智能系统的安全指南

内容来源:https://www.technologyreview.com/2026/02/04/1131014/from-guardrails-to-governance-a-ceos-guide-for-securing-agentic-systems/

内容总结:

企业如何为AI智能体系统筑牢安全防线?CEO需关注这八大关键步骤

随着AI智能体(agentic systems)在商业应用中的深入,其安全风险已成为企业董事会向CEO提出的核心关切。近期行业安全指南反复强调一个核心理念:应将AI智能体视为强大且半自主的“用户”,并在其接触身份、工具、数据和输出的边界实施严格管控。

基于这一理念,以下为企业提供一套可落地的八步安全蓝图,帮助管理层系统性地应对AI智能体风险:

一、约束能力:明确身份与权限边界

  1. 身份与范围界定:为每个AI智能体设立独立的非人类账号,并像管理员工一样严格限制其权限。确保其仅在相应用户身份、租户和地理范围内执行特定任务,禁止跨租户越权操作。高风险操作需设置明确的人工审批流程。
    CEO应问:我们现在能否清晰列出每个AI智能体的权限清单?

  2. 工具链管控:对AI智能体可调用的工具(如API、数据解析器)实施类似供应链的管理。固定工具版本、新增工具需审批、禁止未经策略允许的自动工具链组合。
    CEO应问:谁审批AI智能体新增工具或扩大权限?如何追溯?

  3. 权限设计原则:将访问凭证和权限范围绑定到具体工具和任务,而非模型本身。定期轮换凭证并确保操作可审计。
    CEO应问:能否在不重构整个系统的前提下,快速撤销某个AI智能体的特定权限?

二、控制数据与行为:把好输入输出关

  1. 输入与记忆管理:将所有外部输入内容(如网页、PDF)默认视为不可信来源。严格分离系统指令与用户内容,对检索增强生成(RAG)来源进行审核标记,并在不可信环境中禁用长期记忆功能。
    CEO应问:我们能否列出AI智能体所有外部学习来源及其审批责任人?

  2. 输出处理机制:在AI智能体输出结果执行前设立验证环节。任何可能产生副作用的输出(如代码执行、数据导出)均需经过安全校验,防止模型直接驱动高风险操作。
    CEO应问:在我们的架构中,AI智能体输出在生效或交付客户前经过哪些评估节点?

  3. 运行时数据隐私保护:采用“默认安全”设计,对敏感数据实施令牌化或掩码处理,仅在授权场景下按策略动态还原。即使AI智能体被完全攻破,其数据泄露范围也受策略限制。
    CEO应问:当AI智能体处理受监管数据时,保护措施是依靠架构强制实现,还是仅依赖承诺?

三、证明治理与韧性:持续验证与审计

  1. 持续评估体系:建立常态化的对抗性测试机制,通过深度监控、红队演练和完备日志,将每次故障转化为策略改进的依据。
    CEO应问:谁负责每周攻击测试我们的AI智能体?测试结果如何推动策略更新?

  2. 统一治理与审计:建立动态资产清单,集中记录AI智能体的平台部署、权限配置、审批流水和高风险操作,确保任何决策链可追溯。
    CEO应问:如果被问及某个AI智能体的具体决策依据,我们能否完整重现其逻辑链?

系统级威胁应对:企业需假设高级威胁(如国家背景的GTG-1002攻击者)已潜入内部,并参考MITRE ATLAS等框架进行系统级防御。AI智能体安全并非追求“绝对安全”,而是将其纳入企业既有的安全治理框架,像管理高权限用户或系统一样实施管控。

对董事会和CEO而言,关键问题已从“我们是否有AI护栏”转变为:我们能否用证据而非承诺,回答以上八个问题?


本文由Protegrity提供,MIT Technology Review编辑团队未参与撰写。

中文翻译:

赞助内容
从护栏到治理:CEO保障智能体系统安全指南
一份面向企业与CEO的实用蓝图,阐述如何通过从提示词微调转向对身份、工具和数据的硬性控制,来保障智能体系统的安全。
由Protegrity提供

本系列前一篇文章《规则在提示层面失效,在边界层面生效》聚焦于首例AI策划的间谍活动及提示层面控制的失败。本文则提供解决方案。如今每位CEO从董事会收到的核心问题无非是:我们该如何应对智能体风险?

纵观近期标准机构、监管方及主流供应商发布的AI安全指南,一个简单理念反复出现:将智能体视为强大且半自主的用户,并在其接触身份、工具、数据和输出的边界处执行规则。

以下是一份可操作的八步计划,可要求团队落实并汇报进展:

约束能力
这些步骤有助于定义身份并限制能力。

  1. 身份与范围:将智能体设定为职责明确的真实用户
    当前,智能体通常在模糊、权限过度的服务身份下运行。解决方案很直接:将每个智能体视为非人类主体,并采用与员工相同的管理规范。
    每个智能体应以请求用户的身份在正确的租户内运行,权限限制在该用户的角色和地理范围内。禁止跨租户的“代表”快捷方式。任何高影响操作都应要求明确的人工审批并记录理由。这正是谷歌安全AI框架(SAIF)和美国国家标准与技术研究院(NIST)AI访问控制指南在实践中应用的方式。
    CEO之问:当前我们能否列出一份智能体清单,并明确说明每个智能体被允许执行的操作?

  2. 工具控制:锁定、审批并限制智能体可使用的工具
    Anthropic间谍框架之所以得逞,是因为攻击者能通过模型上下文协议将Claude接入一套灵活的工具集(如扫描器、漏洞利用框架、数据解析器),而这些工具未被锁定或受策略管控。
    防御之道在于将工具链视同供应链:

    • 锁定远程工具服务器的版本。
    • 新增工具、范围或数据源需经审批。
    • 除非策略明确允许,否则禁止自动工具链式调用。
      这正是OWASP在“过度授权”类别中指出的风险及其建议防范之处。根据欧盟《人工智能法案》,为此类网络韧性和防滥用设计是第15条确保稳健性与网络安全义务的一部分。
      CEO之问:当智能体获得新工具或更广范围时,由谁审批?如何知晓?
  3. 权限设计:将工具绑定于任务,而非模型
    一种常见的反模式是向模型授予长期有效的凭证,并寄望于提示词使其保持合规。SAIF和NIST持相反观点:凭证和范围应绑定于工具和任务,定期轮换并可供审计。智能体随后通过这些工具请求严格限定的能力。
    实践中表现为:“财务运营智能体可在未经CFO批准时读取特定账目,但不得写入。”
    CEO之问:我们能否在不重构整个系统的情况下,撤销智能体的某项特定能力?

控制数据与行为
这些步骤管控输入、输出并约束行为。

  1. 输入、记忆与RAG:将外部内容默认视为恶意,除非证明无害
    多数智能体事件始于隐秘数据:被投毒的网页、PDF、电子邮件或存储库,将对抗性指令潜入系统。OWASP的提示注入防范指南与OpenAI的官方指引均强调,必须严格分离系统指令与用户内容,并将未经验证的检索源视为不可信。
    操作上,在内容进入检索或长期记忆前设闸:新来源需经审查、标记和登记;存在不可信上下文时禁用持久记忆;为每个数据块附加来源信息。
    CEO之问:我们能否列举智能体学习的所有外部内容来源,并说明由谁批准?

  2. 输出处理与渲染:绝不因“模型如此说”就执行任何操作
    在Anthropic案例中,AI生成的漏洞利用代码和凭证转储直接流入执行环节。任何可能产生副作用的输出都需在智能体与现实世界之间设置验证器。OWASP的“不安全输出处理”类别及围绕源边界的最佳浏览器安全实践均明确此点。
    CEO之问:在我们的架构中,智能体输出在运行或交付客户前于何处接受评估?

  3. 运行时数据隐私:先保护数据,再保护模型
    保护数据,使其默认状态下无可泄露的危险信息。NIST和SAIF均倾向于“默认安全”设计,即敏感值被令牌化或掩码处理,仅对授权用户和用例进行还原。
    在智能体系统中,这意味着在输出边界进行策略控制的解令牌化,并记录每次还原。若智能体完全被攻破,爆炸半径将受限于策略允许其访问的内容。
    此处AI技术栈不仅涉及欧盟《人工智能法案》,还关联GDPR及行业特定法规。该法案要求提供者和部署者管理AI特有风险;运行时令牌化与策略门控还原是积极在生产中控制这些风险的有力证据。
    CEO之问:当智能体处理受监管数据时,相关保护是靠架构强制实施,还是仅凭承诺?

证明治理与韧性
最后几步关键在于展示控制措施有效且持续有效。

  1. 持续评估:不依赖一次性测试,而构建测试框架
    Anthropic关于休眠智能体的研究应打破所有对单次测试的幻想,并彰显持续评估的至关重要性。这意味着为智能体配备深度可观测性,定期使用对抗性测试套件进行红队演练,并以健全的日志和证据支撑一切,使失败案例既成为回归测试,也推动可执行的政策更新。
    CEO之问:每周谁负责攻击我们的智能体?他们的发现如何改变策略?

  2. 治理、清单与审计:集中记录所有信息
    AI安全框架强调清单与证据:企业必须清楚拥有哪些模型、提示词、工具、数据集和向量库,其所有者是谁,以及针对风险做出了何种决策。
    对智能体而言,这意味着动态目录与统一日志:

    • 存在哪些智能体,位于哪些平台
    • 每个智能体被允许的范围、工具和数据
    • 每次审批、解令牌化及高影响操作,附带审批人及时间
      CEO之问:若被问及智能体如何做出特定决策,我们能否重建完整链条?

此外,勿忘系统级威胁模型:假设威胁行为者GTG-1002已潜入企业。为完善企业准备,需跳出细节,参考MITRE ATLAS产品——其存在正是因为对手攻击的是系统而非模型。Anthropic提供了国家背景威胁行为者(GTG-1002)利用智能体框架实施攻击的案例研究。

总之,这些控制措施不会让智能体神奇地变得安全。它们所做的是更熟悉且更可靠之事:将AI及其访问权限与行动重新纳入用于任何强大用户或系统的同一安全框架中。

对董事会和CEO而言,问题不再是“我们是否有良好的AI护栏?”,而是:我们能否用证据而非承诺回答上述CEO之问?

本文内容由Protegrity制作,并非《麻省理工科技评论》编辑团队撰写。

深度阅读
人工智能
认识将大语言模型视为外星生物的新生物学家
通过将大语言模型当作生命体而非计算机程序研究,科学家首次发现了它们的一些秘密。

Yann LeCun的新项目是对大语言模型的反向押注
在独家专访中,这位AI先驱分享了其巴黎新公司AMI Labs的计划。

2026年AI的下一步是什么
我们的AI作者对来年做出大胆预测——以下是五大值得关注的热点趋势。

保持联系
获取《麻省理工科技评论》最新动态
发现特别优惠、头条故事、即将举办的活动等更多内容。

英文来源:

Sponsored
From guardrails to governance: A CEO’s guide for securing agentic systems
A practical blueprint for companies and CEOs that shows how to secure agentic systems by shifting from prompt tinkering to hard controls on identity, tools, and data.
Provided byProtegrity
The previous article in this series, “Rules fail at the prompt, succeed at the boundary,” focused on the first AI-orchestrated espionage campaign and the failure of prompt-level control. This article is the prescription. The question every CEO is now getting from their board is some version of: What do we do about agent risk?
Across recent AI security guidance from standards bodies, regulators, and major providers, a simple idea keeps repeating: treat agents like powerful, semi-autonomous users, and enforce rules at the boundaries where they touch identity, tools, data, and outputs.
The following is an actionable eight-step plan one can ask teams to implement and report against:
Constrain capabilities
These steps help define identity and limit capabilities.

  1. Identity and scope: Make agents real users with narrow jobs
    Today, agents run under vague, over-privileged service identities. The fix is straightforward: treat each agent as a non-human principal with the same discipline applied to employees.
    Every agent should run as the requesting user in the correct tenant, with permissions constrained to that user’s role and geography. Prohibit cross-tenant on-behalf-of shortcuts. Anything high-impact should require explicit human approval with a recorded rationale. That is how Google’s Secure AI Framework (SAIF) and NIST AI’s access-control guidance are meant to be applied in practice.
    The CEO question: Can we show, today, a list of our agents and exactly what each is allowed to do?
  2. Tooling control: Pin, approve, and bound what agents can use
    The Anthropic espionage framework worked because the attackers could wire Claude into a flexible suite of tools (e.g., scanners, exploit frameworks, data parsers) through Model Context Protocol, and those tools weren’t pinned or policy-gated.
    The defense is to treat toolchains like a supply chain:
    • Pin versions of remote tool servers.
    • Require approvals for adding new tools, scopes, or data sources.
    • Forbid automatic tool-chaining unless a policy explicitly allows it.
      This is exactly what OWASP flags under excessive agency and what it recommends protecting against. Under the EU AI Act, designing for such cyber-resilience and misuse resistance is part of the Article 15 obligation to ensure robustness and cybersecurity.
      The CEO question: Who signs off when an agent gains a new tool or a broader scope? How does one know?
  3. Permissions by design: Bind tools to tasks, not to models
    A common anti-pattern is to give the model a long-lived credential and hope prompts keep it polite. SAIF and NIST argue the opposite: credentials and scopes should be bound to tools and tasks, rotated regularly, and auditable. Agents then request narrowly scoped capabilities through those tools.
    In practice, that looks like: “finance-ops-agent may read, but not write, certain ledgers without CFO approval.”
    The CEO question: Can we revoke a specific capability from an agent without re-architecting the whole system?
    Control data and behavior
    These steps gate inputs, outputs, and constrain behavior.
  4. Inputs, memory, and RAG: Treat external content as hostile until proven otherwise
    Most agent incidents start with sneaky data: a poisoned web page, PDF, email, or repository that smuggles adversarial instructions into the system. OWASP’s prompt-injection cheat sheet and OpenAI’s own guidance both insist on strict separation of system instructions from user content and on treating unvetted retrieval sources as untrusted.
    Operationally, gate before anything enters retrieval or long-term memory: new sources are reviewed, tagged, and onboarded; persistent memory is disabled when untrusted context is present; provenance is attached to each chunk.
    The CEO question: Can we enumerate every external content source our agents learn from, and who approved them?
  5. Output handling and rendering: Nothing executes “just because the model said so”
    In the Anthropic case, AI-generated exploit code and credential dumps flowed straight into action. Any output that can cause a side effect needs a validator between the agent and the real world. OWASP’s insecure output handling category is explicit on this point, as are browser security best practices around origin boundaries.
    The CEO question: Where, in our architecture, are agent outputs assessed before they run or ship to customers?
  6. Data privacy at runtime: Protect the data first, then the model
    Protect the data such that there is nothing dangerous to reveal by default. NIST and SAIF both lean toward “secure-by-default” designs where sensitive values are tokenized or masked and only re-hydrated for authorized users and use cases.
    In agentic systems, that means policy-controlled detokenization at the output boundary and logging every reveal. If an agent is fully compromised, the blast radius is bounded by what the policy lets it see.
    This is where the AI stack intersects not just with the EU AI Act but with GDPR and sector-specific regimes. The EU AI Act expects providers and deployers to manage AI-specific risk; runtime tokenization and policy-gated reveal are strong evidence that one is actively controlling those risks in production.
    The CEO question: When our agents touch regulated data, is that protection enforced by architecture or by promises?
    Prove governance and resilience
    For the final steps, it’s important to show controls work and keep working.
  7. Continuous evaluation: Don’t ship a one-time test, ship a test harness
    Anthropic’s research about sleeper agents should eliminate all fantasies about single test dreams and show how critical continuous evaluation is. This means instrumenting agents with deep observability, regularly red teaming with adversarial test suites, and backing everything with robust logging and evidence, so failures become both regression tests and enforceable policy updates.
    The CEO question: Who works to break our agents every week, and how do their findings change policy?
  8. Governance, inventory, and audit: Keep score in one place
    AI security frameworks emphasize inventory and evidence: enterprises must know which models, prompts, tools, datasets, and vector stores they have, who owns them, and what decisions were taken about risk.
    For agents, that means a living catalog and unified logs:
    • Which agents exist, on which platforms
    • What scopes, tools, and data each is allowed
    • Every approval, detokenization, and high-impact action, with who approved it and when
      The CEO question: If asked how an agent made a specific decision, could we reconstruct the chain?
      And don’t forget the system-level threat model: assume the threat actor GTG-1002 is already in your enterprise. To complete enterprise preparedness, zoom out and consider the MITRE ATLAS product, which exists precisely because adversaries attack systems, not models. Anthropic provides a case study of a state-based threat actor (GTG-1002) doing exactly that with an agentic framework.
      Taken together, these controls do not make agents magically safe. They do something more familiar and more reliable: they put AI, its access, and actions back inside the same security frame used for any powerful user or system.
      For boards and CEOs, the question is no longer “Do we have good AI guardrails?” It’s: Can we answer the CEO questions above with evidence, not assurances?
      This content was produced by Protegrity. It was not written by MIT Technology Review’s editorial staff.
      Deep Dive
      Artificial intelligence
      Meet the new biologists treating LLMs like aliens
      By studying large language models as if they were living things instead of computer programs, scientists are discovering some of their secrets for the first time.
      Yann LeCun’s new venture is a contrarian bet against large language models
      In an exclusive interview, the AI pioneer shares his plans for his new Paris-based company, AMI Labs.
      What’s next for AI in 2026
      Our AI writers make their big bets for the coming year—here are five hot trends to watch.
      Stay connected
      Get the latest updates from
      MIT Technology Review
      Discover special offers, top stories, upcoming events, and more.

MIT科技评论

文章目录


    扫描二维码,在手机上阅读