
Latin America Gets a New Open-Source AI Model: Latam-GPT

Published by qimuai · first-hand compilation



Source: https://aibusiness.com/generative-ai/the-new-open-source-ai-model-for-latin-america

Summary:

Chile Launches Latin America's First Localized Open-Source Large Language Model

Chile's National Center of Artificial Intelligence (CENIA), working with more than 60 institutions across 15 Latin American countries, has officially released Latam-GPT, the first open-source large language model focused on the languages and cultures of Latin America. The launch is seen as a major step in the region's pursuit of "technological sovereignty" and an attempt to break the current dominance of a handful of tech giants over the AI field.

The model, more than two years in development, is built on Meta's Llama 3.1 architecture with 70 billion parameters and was trained on more than 300 billion licensed Spanish- and Portuguese-language text tokens. The project is backed by Chile's Ministry of Science and Technology, AWS, and the Development Bank of Latin America and the Caribbean, with a total development budget of roughly $550,000.

Unlike general-purpose models trained mainly on English corpora, Latam-GPT is designed to capture the region's distinctive linguistic habits, cultural backgrounds, and historical and political contexts. Research shows that Spanish has so far accounted for only about 4% of the training data behind global large language models, and Portuguese for less than 2%, underscoring the urgency of developing localized AI models.

Chilean President Gabriel Boric congratulated the team on the release via social media, and Science Minister Aldo Valle stressed that "regional integration is the only realistic path to democratizing technology." The model is now open-sourced on Hugging Face and GitHub, giving Latin American developers a foundation for building localized AI applications.

Although its funding cannot compare with that of the tech giants, the model's debut marks the region's active, collaborative participation in the global transformation of AI technology. The project team says indigenous-language data will be added in the future to keep improving this regionally distinctive AI infrastructure.


English source:

The model is part of a move toward regional and national development of AI technology.
A new open source AI model trained on the languages and cultures of Latin America has been introduced by the Andean nation of Chile.
More than two years in the making, Latam-GPT was developed by scientists, researchers and professionals from more than 60 institutions across 15 different Latin American and Caribbean countries under a program coordinated by Chile’s National Center of Artificial Intelligence, CENIA.
Also participating in the effort were the Chilean Ministry of Science, Technology, Knowledge and Innovation, AWS and the Development Bank of Latin America and the Caribbean.
The model was built with language, data and context specific to Latin America and the Caribbean, amid growing unease globally about the current dominance of the AI sector by big U.S. tech vendors and the fast-developing sovereign AI movement.
“Unlike models trained primarily with information in English and cultural frameworks from the global north, Latam-GPT understands the cultural and linguistic nuances, as well as the historical and political contexts of Latin America," according to a CENIA release. CENIA launched the model at an event in Santiago on Feb. 10 at which CENIA director Alvaro Soto said Latam-GPT enables Latin America “to join the AI revolution as a key player”.
He was backed up by the country's science minister, Aldo Valle, who added: “This project stems from the conviction that regional integration is the only realistic path to achieving technological sovereignty with a democratic purpose.” Also in attendance was Chile’s president, Gabriel Boric, who welcomed the model's release with a post on X.
The need for a technology such as Latam-GPT appears evident, given that research has shown that data in Spanish, the language used by most of Latin America, has until now accounted for only about 4% of the data used to train language models. Portuguese, Brazil’s native tongue, has made up as little as 2% of training data.
Spanish and Portuguese were used for the main content of Latam-GPT, and the project aims to include indigenous languages as well.
The model was developed on a base architecture of Meta’s Llama 3.1 open model, with 70 billion parameters, and trained on officially approved texts obtained with permission.
In total, more than 300 billion plain-text tokens -- equivalent to around 230 billion words -- were collected under license and curated to provide what is claimed to be a “high quality dataset,” according to CENIA.
However, Latam-GPT's potential to make inroads into the AI market dominated by a few U.S. and Chinese companies could be limited. It was developed on a budget of only $550,000, the AP reported.
Still, Latam-GPT's availability on Hugging Face and GitHub indicates that some see it as useful foundational infrastructure for those looking to develop future regional applications.
