Latam-GPT: Latin America's Free, Open Source, and Collaborative AI
Source: https://www.wired.com/story/latam-gpt-the-free-open-source-and-collaborative-ai-of-latin-america/
Summary:
Latam-GPT, Latin America's homegrown large language model, nears launch, advancing regional development through technological independence.
Latam-GPT, the first regional large language model for Latin America, led by the Chilean National Center for Artificial Intelligence (CENIA), will be officially released this year. Developed as an open source, collaborative project, it aims to break technological dependence and advance the region's technological autonomy by grounding the model in Latin America's own languages, cultures, and historical contexts.
To date, the project has gathered more than 8 TB of text (the equivalent of millions of books) from 33 institutions across Latin America and the Caribbean and has trained a 50-billion-parameter foundation model at a scale comparable to GPT-3.5. The model is specifically tuned to understand Latin American cultural traits, dialects, and historical background, with future applications in vertical domains such as education, healthcare, and agriculture.
CENIA director Álvaro Soto stresses that the model is not meant to compete with the international tech giants but to focus on Latin America's own needs. Asked for an education example, for instance, the model will surface Latin American historical figures rather than George Washington, putting the technology genuinely at the service of the region.
Compute for the project comes from the supercomputing center at the University of Tarapacá in Arica, Chile, equipped with 96 NVIDIA H200 GPUs at a total investment of $10 million, enabling large-scale local model training in Latin America for the first time. Future plans include multimodal capabilities such as image and video and the gradual incorporation of indigenous languages such as Mapuche and Rapanui.
Soto notes that the core challenge for Latin American tech development lies in upgrading infrastructure and education. Latam-GPT is meant to become a key tool for regional research, helping the younger generation move from technology consumers to developers and, ultimately, toward technological sovereignty and cultural autonomy.
(Compiled from a WIRED en Español interview)
Full text:
Latam-GPT is a new large language model being developed in and for Latin America. The project, led by the nonprofit Chilean National Center for Artificial Intelligence (CENIA), aims to help the region achieve technological independence by developing an open source AI model trained on Latin American languages and contexts.
“This work cannot be undertaken by just one group or one country in Latin America: It is a challenge that requires everyone’s participation,” says Álvaro Soto, director of CENIA, in an interview with WIRED en Español. “Latam-GPT is a project that seeks to create an open, free, and, above all, collaborative AI model. We’ve been working for two years with a very bottom-up process, bringing together citizens from different countries who want to collaborate. Recently, it has also seen some more top-down initiatives, with governments taking an interest and beginning to participate in the project.”
The project stands out for its collaborative spirit. “We’re not looking to compete with OpenAI, DeepSeek, or Google. We want a model specific to Latin America and the Caribbean, aware of the cultural requirements and challenges that this entails, such as understanding different dialects, the region’s history, and unique cultural aspects,” explains Soto.
Thanks to 33 strategic partnerships with institutions in Latin America and the Caribbean, the project has gathered a corpus of data exceeding eight terabytes of text, the equivalent of millions of books. This information base has enabled the development of a language model with 50 billion parameters, a scale that makes it comparable to GPT-3.5 and gives it a medium to high capacity to perform complex tasks such as reasoning, translation, and associations.
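As a rough sanity check on the "millions of books" comparison, here is a back-of-envelope estimate, assuming an average plain-text book runs about 1 MB:

```python
corpus_bytes = 8 * 10**12      # 8 TB of text, as reported
avg_book_bytes = 1 * 10**6     # assumption: ~1 MB of plain text per book
print(f"~{corpus_bytes // avg_book_bytes:,} books")  # ~8,000,000 books
```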
Latam-GPT is being trained on a regional database that compiles information from 20 Latin American countries and Spain, with an impressive total of 2,645,500 documents. The distribution of data shows a significant concentration in the largest countries in the region, with Brazil the leader with 685,000 documents, followed by Mexico with 385,000, Spain with 325,000, Colombia with 220,000, and Argentina with 210,000 documents. The numbers reflect the size of these markets, their digital development, and the availability of structured content.
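For illustration, a minimal sketch that derives each country's share of the corpus from the figures above; the "Others" row is simply the remainder spread across the rest of the countries:

```python
total = 2_645_500  # documents in the regional database
top = {
    "Brazil": 685_000,
    "Mexico": 385_000,
    "Spain": 325_000,
    "Colombia": 220_000,
    "Argentina": 210_000,
}
for country, docs in top.items():
    print(f"{country:<10} {docs:>8,}  {docs / total:5.1%}")  # Brazil: ~25.9%
print(f"{'Others':<10} {total - sum(top.values()):>8,}")     # 820,500 documents
```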
“Initially, we’ll launch a language model. We expect its performance in general tasks to be close to that of large commercial models, but with superior performance in topics specific to Latin America. The idea is that, if we ask it about topics relevant to our region, its knowledge will be much deeper,” Soto explains.
The first model is the starting point for developing a family of more advanced technologies in the future, including ones with image and video, and for scaling up to larger models. “As this is an open project, we want other institutions to be able to use it. A group in Colombia could adapt it for the school education system or one in Brazil could adapt it for the health sector. The idea is to open the door for different organizations to generate specific models for particular areas like agriculture, culture, and others,” explains the CENIA director.
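Because the model is meant to be open, the domain adaptations Soto describes would likely follow the standard open-weights workflow. Below is a minimal sketch using LoRA fine-tuning with the Hugging Face transformers and peft libraries; the checkpoint id and attention-module names are hypothetical, since Latam-GPT's distribution format has not been published:

```python
# Hypothetical domain adaptation of an open Latam-GPT checkpoint with LoRA.
# The repo id "CENIA/latam-gpt-50b" is illustrative, not a real model id.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "CENIA/latam-gpt-50b"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Train a small set of adapter weights instead of all 50B parameters.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumption: Llama-style attention naming
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# From here, fine-tune on domain text (e.g., school curricula or clinical
# notes) with the usual transformers Trainer loop.
```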
The supercomputing infrastructure at the University of Tarapacá (UTA) in Arica, Chile, is a fundamental pillar for Latam-GPT. With a projected investment of $10 million, the new center has a cluster of 12 nodes, each equipped with eight state-of-the-art NVIDIA H200 GPUs. This capacity, unprecedented in Chile and the region more broadly, not only enables large-scale model training in the country for the first time, it also encourages decentralization and energy efficiency.
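To put the cluster's capacity in perspective, a back-of-envelope calculation (the 16-bytes-per-parameter figure is a common rule of thumb for mixed-precision Adam training, not a published number for this project):

```python
nodes, gpus_per_node = 12, 8
hbm_per_gpu_gb = 141                  # NVIDIA H200: 141 GB of HBM3e per GPU
params = 50e9                         # 50-billion-parameter model

total_gpus = nodes * gpus_per_node    # 96 GPUs
total_hbm_gb = total_gpus * hbm_per_gpu_gb
train_state_gb = params * 16 / 1e9    # weights + gradients + Adam states

print(f"{total_gpus} GPUs, ~{total_hbm_gb:,.0f} GB aggregate HBM")   # ~13,536 GB
print(f"~{train_state_gb:,.0f} GB of training state before activations")  # ~800 GB
```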
The first version of Latam-GPT will be launched this year. The model will be refined and expanded as new strategic partners join the effort and more robust data sets are integrated into it.
The interview was edited for length and clarity.
WIRED: Tech giants such as Google, OpenAI, and Anthropic have invested billions in their models. What is the technical and strategic argument for the development of a separate model specifically for Latin America?
Álvaro Soto: Regardless of how powerful these other models may be, they are incapable of encompassing everything relevant to our reality. I feel that today they are too focused on the needs of other parts of the world. Imagine if we wanted to use them to modernize the education system in Latin America. If you ask one of these models for an example, it would probably tell you about George Washington.
We should be concerned about our own needs; we cannot wait for others to find the time to ask us what we need. Given that these are new and very disruptive technologies, there is room and a need for us, in our region, to take advantage of their benefits and understand their risks. Having this experience is essential to guiding the use of technology forward along the best path.
This also opens up possibilities for our researchers. Today, Latin American academics have few opportunities to interact in depth with these models. It is as if we wanted to study magnetic resonance imaging but didn’t have a resonator. Latam-GPT seeks to be that fundamental tool so that the scientific community can experiment and advance.
The key input is data. What is the status of the Latam-GPT corpus, and how are you addressing the challenge of including not only variants of Spanish and Portuguese, but also indigenous languages?
We have put a lot of emphasis on generating high-quality data. It’s not just about volume, but also composition. We analyze regional diversity to ensure that the data does not come disproportionately from just one country, but that there is a balanced representation. If we notice that Nicaragua is underrepresented in the data, for example, we’ll actively seek out collaborators there.
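The balance check Soto describes can be made concrete. A minimal sketch that flags countries whose corpus share falls below a chosen target; apart from the published Brazil and Mexico counts, all figures here are illustrative placeholders:

```python
# Flag countries whose share of the corpus falls below a target share.
def underrepresented(doc_counts: dict, target_share: dict, tolerance: float = 0.5):
    total = sum(doc_counts.values())
    return [
        country
        for country, target in target_share.items()
        if doc_counts.get(country, 0) / total < target * tolerance
    ]

counts = {"Brazil": 685_000, "Mexico": 385_000, "Nicaragua": 4_000}  # Nicaragua figure is made up
targets = {"Brazil": 0.30, "Mexico": 0.20, "Nicaragua": 0.01}        # e.g., population-weighted shares
print(underrepresented(counts, targets))  # -> ['Nicaragua']: seek new data partners there
```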
We also analyze the diversity of topics—politics, sports, art, and other areas—to have a balanced corpus. And, of course, there is cultural diversity. In this first version, we have focused on having cultural information about our ancestral peoples, such as the Aztecs and the Incas, rather than on the language itself. In the future, the idea is to also incorporate indigenous languages. At CENIA, we are already working on translators for Mapuche and Rapanui, and other groups in the region are doing the same with Guaraní. It is a clear example of something that we have to do ourselves, because no one else will.
Could you tell us more about CENIA and how this initiative was established in Chile?
Between 2017 and 2018, a group of experts, which included me as a member, developed Chile’s National Artificial Intelligence Policy. One of the conclusions of the group was that there was a need to create an institution that would oversee the development of a synergistic and healthy AI ecosystem that encompassed science, technology transfer to industry, and social responsibility. CENIA was created to be that institution.
Although it started in Chile, we have a regional vision and we believe that together we are stronger. We have promoted initiatives such as the Latin American Artificial Intelligence Index, a collaborative study that measures the progress of AI in countries across the region.
Your specialty is cognitive robotics. How does a regional language model relate to an autonomous agent’s ability to interact in a Latin American context?
In cognitive robotics, the cognitive part is intelligence. My career has focused on developing intelligence for physical machines. Today, language models and foundational models are at the forefront of AI. They are the most powerful tools we have, so my work is dedicated to understanding and contributing to the scientific and applied development of this type of technology.
Models face issues around geopolitics and power that have been covered by the media. What are the specific challenges in Latin America when it comes to these models?
We face many challenges, but we also have many strengths, such as our openness and our capacity for collaboration, which we have seen in the Latam-GPT project. That said, one of the key areas we need to focus on is education. These technologies are going to change the skills required of younger generations. Rote learning will be less critical; what will be important is knowing how to use the knowledge of AI. We must prepare our young people for this, while also promoting the social sciences and critical thinking. If I had to choose where to apply these technologies, it would be in education, because it addresses the root cause of many of our problems.
A project like this requires massive computing power. Is it realistic to think that our region can develop the necessary infrastructure? What implications does this have for the technological sovereignty of Latin America?
It’s essential. If you want to play football, you need a field and a ball. Here, computing power is the field. We need to develop it, whether in the cloud or in our own data centers. It’s a necessary infrastructure for this new technological era, just as telecommunications infrastructure was for the internet.
Looking ahead to 2030, what would be a successful scenario for a model like Latam-GPT? Will we be technology developers and not simply consumers?
Success would mean that Latam-GPT has played an important role in the development of artificial intelligence in this region. That different organizations can take this technology and apply it, for example, to education. That new generations of Latin Americans are better prepared because they had access to tools that spoke to them in their context, with their cultural references, with figures from our history, and not just using examples from other parts of the world. If we manage to give this technology a Latin American stamp and contribute to its development, the project will have been a great success.
This interview was first published by Wired en Español. It was translated by John Newton.