Exploring a space-based, scalable AI infrastructure system design

Source: https://research.google/blog/exploring-a-space-based-scalable-ai-infrastructure-system-design/
Summary:
[Tech Frontier] Google launches "Project Suncatcher" to explore a new era of AI computing in space
Travis Beals, Senior Director of Google's Paradigms of Intelligence research team, has announced Project Suncatcher, a moonshot that aims to build space-based AI compute infrastructure, sidestepping the resource constraints of Earth, by deploying constellations of solar-powered satellites equipped with TPUs and free-space optical links.
Research shows that in the right low-Earth orbit, solar panels can be up to eight times more productive than on the ground and can generate power nearly continuously. The project envisions modular satellite clusters in a dawn-dusk sun-synchronous orbit that harvest the abundant solar energy available in space to drive AI compute. A newly released preprint, "Towards a future space-based, highly scalable AI infrastructure system design," lays out the technical roadmap.
The team reports progress on three core technical challenges:
First, multi-terabit-per-second inter-satellite data links: using dense wavelength-division multiplexing, a bench-scale demonstrator achieved 1.6 Tbps with a single transceiver pair.
Second, tightly clustered formation flight: orbital-dynamics models indicate that a satellite cluster at 650 km altitude can maintain stable inter-satellite distances oscillating between roughly 100 and 200 meters.
Third, TPU radiation tolerance well beyond expectations: testing showed that Google's sixth-generation Trillium chip withstood nearly three times the expected shielded five-year mission dose before its most sensitive component showed irregularities.
With commercial launch costs projected to fall below $200 per kilogram by the mid-2030s, the cost of operating a space-based data center could become roughly comparable to that of terrestrial facilities. The team plans to partner with Planet to launch two prototype satellites by early 2027 for on-orbit validation.
The effort extends Google's record of moonshots in areas such as quantum computing and autonomous driving, and marks a new stage in building AI infrastructure beyond Earth. Analysts note that, if successful, the project would lay the groundwork for gigawatt-scale compute constellations, opening a new dimension for scaling AI capacity.
(Compiled from the blog post and preprint published by Google's Paradigms of Intelligence team on November 4, 2025)
Original text:
Exploring a space-based, scalable AI infrastructure system design
November 4, 2025
Travis Beals, Senior Director, Paradigms of Intelligence
Project Suncatcher is a moonshot exploring a new frontier: equipping solar-powered satellite constellations with TPUs and free-space optical links to one day scale machine learning compute in space.
Artificial intelligence (AI) is a foundational technology that could reshape our world, driving new scientific discoveries and helping us tackle humanity's greatest challenges. Now, we're asking where we can go to unlock its fullest potential.
The Sun is the ultimate energy source in our solar system, emitting more power than 100 trillion times humanity’s total electricity production. In the right orbit, a solar panel can be up to 8 times more productive than on Earth, and produce power nearly continuously, reducing the need for batteries. In the future, space may be the best place to scale AI compute. Working backwards from there, our new research moonshot, Project Suncatcher, envisions compact constellations of solar-powered satellites, carrying Google TPUs and connected by free-space optical links. This approach would have tremendous potential for scale, and also minimizes impact on terrestrial resources.
We’re excited about this growing area of exploration. Our early research, shared today in the preprint paper “Towards a future space-based, highly scalable AI infrastructure system design,” describes our progress toward tackling the foundational challenges of this ambitious endeavor, including high-bandwidth communication between satellites, orbital dynamics, and radiation effects on computing. By focusing on a modular design of smaller, interconnected satellites, we are laying the groundwork for a highly scalable, future space-based AI infrastructure.
Project Suncatcher is part of Google’s long tradition of taking on moonshots that tackle tough scientific and engineering problems. Like all moonshots, there will be unknowns, but it’s in this spirit that we embarked on building a large-scale quantum computer a decade ago — before it was considered a realistic engineering goal — and envisioned an autonomous vehicle over 15 years ago, which eventually became Waymo and now serves millions of passenger trips around the globe.
System design and key challenges
The proposed system consists of a constellation of networked satellites, likely operating in a dawn–dusk sun-synchronous low-Earth orbit, where they would be exposed to near-constant sunlight. This orbital choice maximizes solar energy collection and reduces the need for heavy onboard batteries. For this system to be viable, several technical hurdles must be overcome:
- Achieving data center-scale inter-satellite links
Large-scale ML workloads require distributing tasks across numerous accelerators with high-bandwidth, low-latency connections. Delivering performance comparable to terrestrial data centers requires links between satellites that support tens of terabits per second. Our analysis indicates that this should be possible with multi-channel dense wavelength-division multiplexing (DWDM) transceivers and spatial multiplexing.
However, achieving this kind of bandwidth requires received power levels thousands of times higher than typical in conventional, long-range deployments. Since received power scales inversely with the square of the distance, we can overcome this challenge by flying the satellites in a very close formation (kilometers or less), thus closing the link budget (i.e., the accounting of the end-to-end signal power losses in the communications system). Our team has already begun validating this approach with a bench-scale demonstrator that successfully achieved 800 Gbps each-way transmission (1.6 Tbps total) using a single transceiver pair.
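To make the inverse-square point concrete, here is a minimal sketch of the geometry (the distances are illustrative assumptions, not figures from the paper):

```python
# Sketch: received power vs. link distance for a free-space optical link.
# Distances below are illustrative assumptions, not values from the paper.

def received_power_gain(d_far_m: float, d_near_m: float) -> float:
    """Gain in received power when a link shrinks from d_far to d_near.

    Free-space received power scales as 1/d^2, so the gain is (d_far/d_near)^2.
    """
    return (d_far_m / d_near_m) ** 2

# A conventional long-range crosslink might span ~1,000 km; a tight
# formation brings that down to ~1 km.
gain = received_power_gain(1_000e3, 1e3)
print(f"received-power gain from closing the formation: ~{gain:,.0f}x")
```

Under these assumed distances, geometry alone recovers a factor of about a million in received power, which is how close formation flight can close a link budget that would otherwise demand power levels thousands of times beyond conventional deployments.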
- Controlling large, tightly-clustered satellite formations
High-bandwidth inter-satellite links require our satellites to fly in a much more compact formation than any current system. We developed numerical and analytic physics models to analyze the orbital dynamics of such a constellation. We used an approximation starting from the Hill-Clohessy-Wiltshire equations (which describe the orbital motion of a satellite relative to a circular reference orbit in a Keplerian approximation) and a JAX-based differentiable model for the numerical refinement that accounts for further perturbations.
At the altitude of our planned constellation, the non-sphericity of Earth's gravitational field, and potentially atmospheric drag, are the dominant non-Keplerian effects impacting satellite orbital dynamics. In the figure below, we show trajectories (over one full orbit) for an illustrative 81-satellite constellation configuration in the orbital plane, at a mean cluster altitude of 650 km. The cluster radius is R=1 km, with the distance between next-nearest-neighbor satellites oscillating between ~100–200 m, under the influence of Earth’s gravity.
The models show that, with satellites positioned just hundreds of meters apart, we will likely only require modest station-keeping maneuvers to maintain stable constellations within our desired sun-synchronous orbit.
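To illustrate the Keplerian starting point of that analysis, the sketch below propagates the Hill-Clohessy-Wiltshire equations with an RK4 integrator written in JAX, so the trajectory is differentiable end to end; the initial offset and step count are illustrative assumptions, and the perturbations handled by the paper's refinement model (Earth's non-spherical gravity, drag) are deliberately omitted:

```python
# Minimal HCW relative-motion sketch in JAX. Illustrative only: omits the
# non-Keplerian perturbations (Earth oblateness, drag) that the paper's
# differentiable refinement model accounts for.
from functools import partial

import jax
import jax.numpy as jnp

MU = 3.986004418e14            # m^3/s^2, Earth's gravitational parameter
A_REF = 6_371e3 + 650e3        # m, circular reference orbit at ~650 km altitude
N = (MU / A_REF**3) ** 0.5     # rad/s, mean motion of the reference orbit

def hcw_rhs(state):
    """HCW equations: motion linearized about a circular reference orbit.

    state = (x, y, z, vx, vy, vz) in the radial/along-track/cross-track frame.
    """
    x, y, z, vx, vy, vz = state
    ax = 3.0 * N**2 * x + 2.0 * N * vy   # radial
    ay = -2.0 * N * vx                   # along-track
    az = -(N**2) * z                     # cross-track
    return jnp.array([vx, vy, vz, ax, ay, az])

@partial(jax.jit, static_argnames=("steps",))
def propagate(state0, dt=1.0, steps=5850):
    """RK4-integrate roughly one reference orbit (~5,850 s at 650 km)."""
    def rk4(state, _):
        k1 = hcw_rhs(state)
        k2 = hcw_rhs(state + 0.5 * dt * k1)
        k3 = hcw_rhs(state + 0.5 * dt * k2)
        k4 = hcw_rhs(state + dt * k3)
        nxt = state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        return nxt, nxt
    _, traj = jax.lax.scan(rk4, state0, None, length=steps)
    return traj

# A 200 m radial offset, with the along-track velocity that bounds the
# relative orbit (vy0 = -2*N*x0 cancels secular drift in the HCW model).
x0 = 200.0
traj = propagate(jnp.array([x0, 0.0, 0.0, 0.0, -2.0 * N * x0, 0.0]))
print("max separation over one orbit: "
      f"{float(jnp.max(jnp.linalg.norm(traj[:, :3], axis=1))):.0f} m")
```

Because the propagation is differentiable, quantities such as minimum inter-satellite distance can be optimized with jax.grad over the initial conditions, which is the style of numerical refinement the paper describes.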
- Radiation tolerance of TPUs
For ML accelerators to be effective in space, they must withstand the environment of low-Earth orbit. We tested Trillium, Google’s v6e Cloud TPU, in a 67 MeV proton beam to assess the impact of total ionizing dose (TID) and single event effects (SEEs).
The results were promising. While the High Bandwidth Memory (HBM) subsystems were the most sensitive component, they only began showing irregularities after a cumulative dose of 2 krad(Si), nearly three times the expected (shielded) five-year mission dose of 750 rad(Si). No hard failures were attributable to TID up to the maximum tested dose of 15 krad(Si) on a single chip, indicating that Trillium TPUs are surprisingly radiation-hard for space applications.
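The margin quoted above is easy to check against the stated numbers:

```python
# Check of the radiation margin quoted above (values taken from the text).
hbm_irregularity_dose_rad = 2_000   # 2 krad(Si): dose at first HBM irregularities
mission_dose_rad = 750              # expected shielded five-year mission dose
print(f"margin: {hbm_irregularity_dose_rad / mission_dose_rad:.1f}x")  # 2.7x
```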
- Economic feasibility and launch costs
Historically, high launch costs have been a primary barrier to large-scale space-based systems. However, our analysis of historical and projected launch pricing data suggests that with a sustained learning rate, prices may fall to less than $200/kg by the mid-2030s. At that price point, the cost of launching and operating a space-based data center could become roughly comparable to the reported energy costs of an equivalent terrestrial data center on a per-kilowatt/year basis.
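As a back-of-the-envelope sketch of that learning-curve argument (the starting price and learning rate are assumptions chosen for illustration, not the paper's fitted values): Wright's law says price falls by a fixed fraction with each doubling of cumulative launched mass, so the question becomes how many doublings separate today's prices from $200/kg.

```python
import math

# Sketch of Wright's-law launch pricing. The starting price and learning
# rate are illustrative assumptions, not the paper's fitted values.

def price_after(p0: float, doublings: float, learning_rate: float) -> float:
    """Price after a number of doublings of cumulative launched mass."""
    return p0 * (1.0 - learning_rate) ** doublings

def doublings_to(p0: float, target: float, learning_rate: float) -> float:
    """Doublings of cumulative launched mass needed to hit a target price."""
    return math.log(target / p0) / math.log(1.0 - learning_rate)

p0, lr = 1500.0, 0.20   # assume ~$1,500/kg today, 20% price drop per doubling
print(f"doublings to $200/kg: {doublings_to(p0, 200.0, lr):.1f}")      # ~9.0
print(f"price after 9 doublings: ${price_after(p0, 9, lr):,.0f}/kg")   # ~$201
```

Under these assumptions, roughly nine doublings of cumulative launched mass bring prices below $200/kg; whether they arrive by the mid-2030s depends on how quickly launch demand compounds, which is what the historical-pricing analysis estimates.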
Future directions
Our initial analysis shows that the core concepts of space-based ML compute are not precluded by fundamental physics or insurmountable economic barriers. However, significant engineering challenges remain, such as thermal management, high-bandwidth ground communications, and on-orbit system reliability.
To begin addressing these challenges, our next milestone is a learning mission in partnership with Planet, slated to launch two prototype satellites by early 2027. This experiment will test how our models and TPU hardware operate in space and validate the use of optical inter-satellite links for distributed ML tasks.
Eventually, gigawatt-scale constellations may benefit from a more radical satellite design; this may combine new compute architectures more naturally suited to the space environment with a mechanical design in which solar power collection, compute, and thermal management are tightly integrated. Just as the development of complex system-on-chip technology was motivated by and enabled by modern smartphones, scale and integration will advance what’s possible in space.
Acknowledgements
“Towards a future space-based, highly scalable AI infrastructure system design” was authored by Blaise Agüera y Arcas, Travis Beals, Maria Biggs, Jessica V. Bloom, Thomas Fischbacher, Konstantin Gromov, Urs Köster, Rishiraj Pravahan and James Manyika.
We thank Amaan Pirani for critical contributions to cost modeling and overall feasibility analysis, Marcin Kowalczyk for independent numerical validation calculations, Thomas Zurbuchen for his contributions to the systems and architecture concepts, and Kenny Vassigh and Jerry Chiu for technical input on system and thermal design.