AWS outage hits popular apps, renewing questions about cloud redundancy

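Summary:
Early Monday morning, Amazon Web Services suffered a major outage that took down sites and services including Facebook, Snapchat, and Coinbase, renewing industry concern about the internet's heavy reliance on a single cloud provider. The service is now showing “significant signs of recovery.”
The outage began shortly after midnight Pacific time in the Northern Virginia (US-EAST-1) region, AWS's oldest and largest. Amazon attributed it to a DNS resolution problem with its DynamoDB database service, which thousands of applications use to store and retrieve data.
Well-known services affected included OpenAI's ChatGPT, check-in kiosks at LaGuardia Airport, the financial apps Venmo and Robinhood, the gaming services Roblox and Fortnite, the messaging app Signal, and the workplace tool Slack. At 3:35 a.m., Amazon said the core DNS issue had been “fully mitigated” and that most services had recovered and were operating normally.
Counting the major outages in 2017, 2021, and 2023, this is the fourth time since 2017 that the Northern Virginia region has caused a large-scale disruption. As AWS's oldest and largest region, it has become a weak point for internet services worldwide.
The incident shows that many sites have not built adequate redundancy and cannot quickly fall back to another region or cloud provider when their primary provider fails. It has also renewed concern about how concentrated internet infrastructure has become.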
English source:
Amazon Web Services is showing “significant signs of recovery” after a major outage early Monday that impacted sites and services including Facebook, Snapchat, Coinbase and Amazon itself — reviving concerns about the internet’s heavy reliance on the cloud giant.
The problems began shortly after midnight Pacific in Amazon’s Northern Virginia (US-EAST-1) region. In an update shortly after 2 a.m., AWS blamed a DNS resolution issue with DynamoDB, meaning the internet’s phone book failed to find the correct address for a database service used by thousands of apps to store and find data.
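To make the DNS failure concrete, the sketch below shows what a client-side lookup looks like and how a failed resolution surfaces in code. It is illustrative only and says nothing about AWS's internals: `resolve_endpoint` and its `resolver` parameter are hypothetical names, and the injectable resolver is there only so the failure path can be exercised without a network.

```python
import socket


def resolve_endpoint(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted unique IP addresses for hostname, or [] if the lookup fails.

    `resolver` is a hypothetical injection point (defaulting to the standard
    library's socket.getaddrinfo) so the failure case can be simulated.
    """
    try:
        records = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The "phone book" lookup failed: the name could not be mapped to an
        # address, so the client has nowhere to send its request.
        return []
    # Each record is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address as a string.
    return sorted({record[4][0] for record in records})
```

Calling `resolve_endpoint("dynamodb.us-east-1.amazonaws.com")` on a healthy network would return that regional endpoint's current addresses. An empty list corresponds to the kind of failure described above: the service may be running, but clients cannot find it.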
OpenAI’s ChatGPT was among the sites impacted, The Verge reported. Check-in kiosks went down at LaGuardia Airport, with lines starting to form earlier this morning, the New York Times reported. DownDetector showed problems for financial apps like Venmo and Robinhood, gaming services such as Roblox and Fortnite, the Signal messaging app, and productivity tools including Slack and Canva.
In an update at 3:35 a.m., Amazon confirmed that the core DNS issue was “fully mitigated,” reporting that most services had recovered and were operating normally.
However, AWS said it was still working through a backlog of requests for Lambda, its serverless computing platform. It also warned that some customers would see increased error rates when trying to launch new instances in its core cloud computing service, EC2.
Update: As of 6 a.m., Amazon reported it was making progress on the remaining issues with EC2. The company said new instance launches were succeeding in some data centers, and it was applying fixes to the rest. AWS also confirmed it was successfully processing the data backlog for its EventBridge and CloudTrail services, with new events being delivered normally.
US-EAST-1 is AWS’s oldest and largest cloud region, a popular nerve center for online services, which has made it an Achilles heel for the internet over the years. Major outages originating from this same region also caused widespread disruptions in 2017, 2021, and 2023.
The latest outage suggests that many sites have not adequately implemented the redundancy needed to quickly fall back to other regions or cloud providers in the event of AWS outages.
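As a rough illustration of the kind of fallback described above, the following minimal sketch tries an operation against a list of regions in priority order and returns the first success. All names here (`call_with_region_fallback`, `AllRegionsFailed`, the `operation` callback) are hypothetical and not part of any AWS SDK. The sketch covers only the request path: a real deployment would also need the data and services replicated to the secondary region ahead of time.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when the operation failed in every configured region."""

    def __init__(self, errors: Dict[str, Exception]):
        super().__init__(f"all regions failed: {sorted(errors)}")
        self.errors = errors


def call_with_region_fallback(
    regions: Sequence[str], operation: Callable[[str], T]
) -> T:
    """Run operation(region) for each region in order; return the first success."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return operation(region)
        except Exception as exc:  # broad on purpose: this is only a sketch
            # Record the failure and move on to the next region in the list.
            errors[region] = exc
    raise AllRegionsFailed(errors)
```

A caller might pass `["us-east-1", "us-west-2"]` together with a function that builds a region-specific client and performs one read. Production code would likely catch narrower exception types and add timeouts and health checks so a degraded region is skipped quickly.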
Article link: https://qimuai.cn/?post=1657