
Silicon Valley bets big on "environments" to train AI agents




Source: https://techcrunch.com/2025/09/21/silicon-valley-bets-big-on-environments-to-train-ai-agents/

Summary:

For years, Big Tech CEOs have sketched out a vision of AI agents that can autonomously operate software applications to complete tasks for users. Yet today's consumer-facing AI agents, whether OpenAI's ChatGPT Agent or Perplexity's Comet, remain strikingly limited in practice. The industry is coming to recognize that making agents more robust may require a new set of techniques, and reinforcement learning (RL) environments are increasingly viewed as critical infrastructure for the next wave of AI.

Much as the previous wave of AI was powered by labeled datasets, RL environments give agents a space to learn by simulating multi-step task scenarios. According to TechCrunch's reporting, top AI labs including Anthropic and OpenAI are actively building RL environments in-house while also seeking out third-party vendors. Jennifer Li, a general partner at Andreessen Horowitz, notes that creating high-quality RL environments is extremely complex, so labs need both in-house capability and support from specialized vendors.

This demand has given rise to a crop of heavily funded startups such as Mechanize and Prime Intellect, while established data-labeling players like Mercor and Surge are accelerating their pivot toward RL. According to The Information, leaders at Anthropic have discussed spending more than $1 billion on RL environments over the next year. Investors hope the space will produce a "Scale AI for environments," repeating the role the data-labeling giant played in powering the chatbot era.

At its core, an RL environment simulates a real software workflow. For example, an agent might be trained in a simulated Chrome browser to buy a pair of socks on Amazon, with the system sending a reward signal based on how well it performs. This dynamic style of training is far more complex than a static dataset, because the environment must cope with whatever unexpected behavior the agent produces. RL itself has a long track record (OpenAI's 2016 "RL Gyms" and DeepMind's AlphaGo, for example), but today's work focuses on building general-purpose agents on top of large transformer models, a far more complex and demanding goal.

Facing this market opportunity, incumbents such as Scale AI and Surge are moving in. Surge's CEO says the company has stood up a dedicated team for building RL environments, while Mercor is focusing on verticals such as coding and healthcare. Newer entrants are taking different routes: Mechanize is offering engineers salaries of $500,000 to concentrate on a small number of high-quality environments, and Prime Intellect has launched an RL environments hub that aims to become a "Hugging Face" for the open-source world.

Despite the industry's enthusiasm, skeptics remain. Ross Taylor, a former AI research lead at Meta, points out that RL environments are prone to "reward hacking," where an AI cheats its way to a reward without actually completing the task; Sherwin Wu, OpenAI's head of engineering for its API business, is skeptical of RL environment startups; and even Andrej Karpathy, an investor in Prime Intellect who has called RL environments a potential breakthrough, concedes: "I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically."

With the methods previously used to improve AI models showing diminishing returns, whether RL environments can deliver the hoped-for breakthrough remains to be proven in practice. This emerging market, potentially worth billions of dollars, carries the possibility of disruptive innovation, but it also faces the twin tests of technical feasibility and commercial sustainability.


Full article (English source):

For years, Big Tech CEOs have touted visions of AI agents that can autonomously use software applications to complete tasks for people. But take today’s consumer AI agents out for a spin, whether it’s OpenAI’s ChatGPT Agent or Perplexity’s Comet, and you’ll quickly realize how limited the technology still is. Making AI agents more robust may take a new set of techniques that the industry is still discovering.
One of those techniques is carefully simulating workspaces where agents can be trained on multi-step tasks — known as reinforcement learning (RL) environments. Similarly to how labeled datasets powered the last wave of AI, RL environments are starting to look like a critical element in the development of agents.
AI researchers, founders, and investors tell TechCrunch that leading AI labs are now demanding more RL environments, and there’s no shortage of startups hoping to supply them.
“All the big AI labs are building RL environments in-house,” said Jennifer Li, general partner at Andreessen Horowitz, in an interview with TechCrunch. “But as you can imagine, creating these datasets is very complex, so AI labs are also looking at third party vendors that can create high quality environments and evaluations. Everyone is looking at this space.”
The push for RL environments has minted a new class of well-funded startups, such as Mechanize and Prime Intellect, that aim to lead the space. Meanwhile, large data-labeling companies like Mercor and Surge say they’re investing more in RL environments to keep pace with the industry’s shifts from static datasets to interactive simulations. The major labs are considering investing heavily too: according to The Information, leaders at Anthropic have discussed spending more than $1 billion on RL environments over the next year.
The hope for investors and founders is that one of these startups will emerge as the “Scale AI for environments,” referring to the $29 billion data labelling powerhouse that powered the chatbot era.
The question is whether RL environments will truly push the frontier of AI progress.
What is an RL environment?
At their core, RL environments are training grounds that simulate what an AI agent would be doing in a real software application. One founder, in a recent interview, described building them as “like creating a very boring video game.”
For example, an environment could simulate a Chrome browser and task an AI agent with purchasing a pair of socks on Amazon. The agent is graded on its performance and sent a reward signal when it succeeds (in this case, buying a worthy pair of socks).
While such a task sounds relatively simple, there are a lot of places where an AI agent could get tripped up. It might get lost navigating the web page’s drop-down menus, or buy too many socks. And because developers can’t predict exactly what wrong turn an agent will take, the environment itself has to be robust enough to capture any unexpected behavior, and still deliver useful feedback. That makes building environments far more complex than assembling a static dataset.
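To make the mechanics concrete, here is a minimal sketch of what such an environment's interface might look like, written in the reset/step style popularized by RL gyms. The `SockShoppingEnv` class, its action format, and its reward values are illustrative assumptions for this article, not any lab's or vendor's actual API.

```python
# Minimal, hypothetical sketch of an RL environment for the "buy socks" example.
# Real environments wrap an actual browser and are far more elaborate.
from dataclasses import dataclass, field


@dataclass
class SockShoppingEnv:
    max_steps: int = 30
    steps: int = 0
    cart: list = field(default_factory=list)
    purchased: bool = False

    def reset(self):
        """Start a fresh episode and return the initial observation."""
        self.steps, self.cart, self.purchased = 0, [], False
        return {"page": "amazon_home", "cart": []}

    def step(self, action: dict):
        """Apply one agent action and return (observation, reward, done, info)."""
        self.steps += 1
        reward, done = 0.0, False

        if action.get("type") == "search" and "socks" in action.get("query", ""):
            obs = {"page": "search_results", "cart": list(self.cart)}
        elif action.get("type") == "add_to_cart" and action.get("item") == "socks":
            self.cart.append("socks")
            obs = {"page": "cart", "cart": list(self.cart)}
        elif action.get("type") == "checkout":
            # Reward only the intended outcome: exactly one pair of socks bought.
            self.purchased = self.cart == ["socks"]
            reward = 1.0 if self.purchased else -1.0  # too many socks, empty cart, etc.
            obs, done = {"page": "order_confirmation", "cart": []}, True
        else:
            # Any unexpected action (lost in menus, malformed clicks) must still
            # yield a valid observation so training can continue.
            obs = {"page": "unknown", "cart": list(self.cart)}

        if self.steps >= self.max_steps:
            done = True
        return obs, reward, done, {"purchased": self.purchased}
```

An agent interacts with this through repeated `step` calls; the environment's job is to keep returning sensible observations and rewards no matter what the agent tries, which is where most of the engineering effort goes.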
Some environments are quite elaborate, allowing for AI agents to use tools, access the internet, or use various software applications to complete a given task. Others are more narrow, aimed at helping an agent learn specific tasks in enterprise software applications.
While RL environments are the hot thing in Silicon Valley right now, there’s a lot of precedent for using this technique. One of OpenAI’s first projects back in 2016 was building “RL Gyms,” which were quite similar to the modern conception of environments. The same year, Google DeepMind’s AlphaGo AI system beat a world champion at the board game Go, also using RL techniques within a simulated environment.
What’s unique about today’s environments is that researchers are trying to build computer-using AI agents with large transformer models. Unlike AlphaGo, which was a specialized AI system working in a closed environment, today’s AI agents are trained to have more general capabilities. AI researchers today have a stronger starting point, but also a more complicated goal where more can go wrong.
A crowded field
AI data labeling companies like Scale AI, Surge, and Mercor are trying to meet the moment and build out RL environments. These companies have more resources than many startups in the space, as well as deep relationships with AI labs.
Surge CEO Edwin Chen tells TechCrunch he’s recently seen a “significant increase” in demand for RL environments within AI labs. Surge — which reportedly generated $1.2 billion in revenue last year from working with AI labs like OpenAI, Google, Anthropic and Meta — recently spun up a new internal organization specifically tasked with building out RL environments, he said.
Close behind Surge is Mercor, a startup valued at $10 billion, which has also worked with OpenAI, Meta, and Anthropic. Mercor is pitching investors on its business building RL environments for domain specific tasks such as coding, healthcare, and law, according to marketing materials seen by TechCrunch.
Mercor CEO Brendan Foody told TechCrunch in an interview that “few understand how large the opportunity around RL environments truly is.”
Scale AI used to dominate the data labeling space, but has lost ground since Meta invested $14 billion and hired away its CEO. Since then, Google and OpenAI dropped Scale AI as a data provider, and the startup even faces competition for data labelling work inside of Meta. But still, Scale is trying to meet the moment and build environments.
“This is just the nature of the business [Scale AI] is in,” said Chetan Rane, Scale AI’s head of product for agents and RL environments. “Scale has proven its ability to adapt quickly. We did this in the early days of autonomous vehicles, our first business unit. When ChatGPT came out, Scale AI adapted to that. And now, once again, we’re adapting to new frontier spaces like agents and environments.”
Some newer players are focusing exclusively on environments from the outset. Among them is Mechanize, a startup founded roughly six months ago with the audacious goal of “automating all jobs.” However, co-founder Matthew Barnett tells TechCrunch that his firm is starting with RL environments for AI coding agents.
Mechanize aims to supply AI labs with a small number of robust RL environments, Barnett says, rather than the wide range of simpler RL environments that larger data firms create. To that end, the startup is offering software engineers $500,000 salaries to build RL environments — far higher than an hourly contractor could earn working at Scale AI or Surge.
Mechanize has already been working with Anthropic on RL environments, two sources familiar with the matter told TechCrunch. Mechanize and Anthropic declined to comment on the partnership.
Other startups are betting that RL environments will be influential outside of AI labs. Prime Intellect — a startup backed by AI researcher Andrej Karpathy, Founders Fund, and Menlo Ventures — is targeting smaller developers with its RL environments.
Last month, Prime Intellect launched an RL environments hub, which aims to be a “Hugging Face for RL environments.” The idea is to give open-source developers access to the same resources that large AI labs have, and sell those developers access to computational resources in the process.
Training generally capable agents in RL environments can be more computationally expensive than previous AI training techniques, according to Prime Intellect researcher Will Brown. Alongside startups building RL environments, there’s another opportunity for GPU providers that can power the process.
“RL environments are going to be too large for any one company to dominate,” said Brown in an interview. “Part of what we’re doing is just trying to build good open-source infrastructure around it. The service we sell is compute, so it is a convenient onramp to using GPUs, but we’re thinking of this more in the long term.”
Will it scale?
The open question around RL environments is whether the technique will scale like previous AI training methods.
Reinforcement learning has powered some of the biggest leaps in AI over the past year, including models like OpenAI’s o1 and Anthropic’s Claude Opus 4. Those are particularly important breakthroughs because the methods previously used to improve AI models are now showing diminishing returns.
Environments are part of AI labs’ bigger bet on RL, which many believe will continue to drive progress as they add more data and computational resources to the process. Some of the OpenAI researchers behind o1 previously told TechCrunch that the company originally invested in AI reasoning models — which were created through investments in RL and test-time-compute — because they thought it would scale nicely.
The best way to scale RL remains unclear, but environments seem like a promising contender. Instead of simply rewarding chatbots for text responses, they let agents operate in simulations with tools and computers at their disposal. That’s far more resource-intensive, but potentially more rewarding.
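The contrast shows up in the shape of the training loop. Below is a hedged sketch assuming a generic `agent` object with an `act` method, a stand-in `reward_model`, and the hypothetical environment sketched earlier; it illustrates the single-turn-versus-trajectory difference, not any lab's actual pipeline.

```python
# Hypothetical sketch: single-turn scoring vs. a multi-step environment rollout.
# `reward_model`, `agent`, and `env` are stand-in objects, not real APIs.

def score_chat_response(reward_model, prompt: str, response: str) -> float:
    """Chatbot-style RL: one text output, one scalar score, no interaction."""
    return reward_model.score(prompt, response)


def collect_rollout(env, agent, max_steps: int = 30):
    """Agent-style RL: many steps of tool/UI use before the reward arrives."""
    obs = env.reset()
    trajectory, total_reward = [], 0.0
    for _ in range(max_steps):
        action = agent.act(obs)                 # e.g. click, type, call a tool
        obs, reward, done, info = env.step(action)
        trajectory.append((action, reward))
        total_reward += reward
        if done:
            break
    return trajectory, total_reward             # later fed into the RL update
```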
Some are skeptical that all these RL environments will pan out. Ross Taylor, a former AI research lead at Meta who co-founded General Reasoning, tells TechCrunch that RL environments are prone to reward hacking. This is a process in which AI models cheat in order to get a reward, without really doing the task.
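As a toy illustration of that failure mode (my example, not Taylor's): if an environment rewards a proxy signal such as "a confirmation page was displayed," an agent can learn to trigger that page without completing the purchase, which is why robust environments tend to verify the underlying state instead.

```python
# Toy illustration of reward hacking; both functions are hypothetical examples.

def naive_reward(page_title: str) -> float:
    # Proxy signal: reward whenever a confirmation page is shown. An agent can
    # "hack" this by reaching any page whose title matches, without buying.
    return 1.0 if "order confirmed" in page_title.lower() else 0.0


def verified_reward(backend_orders: list) -> float:
    # Check the state the task actually cares about: a paid order for exactly
    # one pair of socks exists in the (simulated) store backend.
    for order in backend_orders:
        if order.get("item") == "socks" and order.get("quantity") == 1 and order.get("paid"):
            return 1.0
    return 0.0
```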
“I think people are underestimating how difficult it is to scale environments,” said Taylor. “Even the best publicly available [RL environments] typically don’t work without serious modification.”
OpenAI’s Head of Engineering for its API business, Sherwin Wu, said in a recent podcast that he was “short” on RL environment startups. Wu noted that it’s a very competitive space, but also that AI research is evolving so quickly that it’s hard to serve AI labs well.
Karpathy, an investor in Prime Intellect who has called RL environments a potential breakthrough, has also voiced caution about the RL space more broadly. In a post on X, he raised concerns about how much more AI progress can be squeezed out of RL.
“I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically,” said Karpathy.
Update: A previous version of this article referred to Mechanize as Mechanize Work. It has been updated to reflect the company’s official name.



