How AI trained on birds is surfacing underwater mysteries

Source: https://research.google/blog/how-ai-trained-on-birds-is-surfacing-underwater-mysteries/
Summary:
[Tech frontier] Google releases a new-generation bioacoustics model that uses "bird knowledge" to decode whale song in the deep sea
On February 9, 2026, Google Research data scientist Lauren Harrell and colleagues described results presented at a NeurIPS 2025 workshop: Perch 2.0, a bioacoustics foundation model trained on the calls of birds and other terrestrial animals, can identify whales and other marine species by sound despite never having been trained on underwater audio, opening a new path for exploring ocean acoustics.
The ocean soundscape is full of unsolved mysteries. The enigmatic "biotwang," which the U.S. National Oceanic and Atmospheric Administration (NOAA) recently attributed to Bryde's whales, illustrates how difficult it remains to match new song types to species. Google has long collaborated with research institutions on bioacoustic monitoring and released a multi-species whale model in 2024. With transfer learning, Perch 2.0 needs only a small number of labeled examples to build a custom classifier, reducing the development of a new classifier to a matter of hours.
The team tested the model on cetacean acoustic archives from NOAA's Pacific Islands Fisheries Science Center, a coral-reef soundscape dataset, and data from a marine acoustics challenge. Perch 2.0 excelled at distinguishing baleen whale species and killer whale subpopulations, ranking first or second in most tests, and it maintained high accuracy on reef-sound classification even with only four examples per class.
Notably, Perch 2.0 even outperformed some models trained specifically on underwater sound. The researchers suggest three reasons for this cross-domain transfer: strong generalization from large-scale training; the need to distinguish similar bird calls, which forces the model to learn fine-grained acoustic features; and evolutionary similarities in how birds and marine mammals produce sound.
Google has released a complete tutorial: researchers can use Google Colab to access the NOAA passive acoustic archive hosted on Google Cloud and quickly build their own whale-call classifiers with Perch 2.0. This lowers the computational barrier to marine acoustic research and provides a practical tool for whale conservation and ocean ecosystem monitoring worldwide.
(Based on the paper "Perch 2.0 transfers 'whale' to underwater tasks," presented by Google Research and Google DeepMind at a NeurIPS 2025 workshop.)
Original article:
How AI trained on birds is surfacing underwater mysteries
February 9, 2026
Lauren Harrell, Data Scientist, Google Research
We describe how Perch 2.0, Google DeepMind's bioacoustics foundation model, trained on birds and other terrestrial animal vocalizations, transfers 'whale' to underwater acoustics challenges with 'killer' performance.
Underwater sound is critical for understanding the unseeable patterns of marine species and their environment. The ocean soundscape is full of mysterious noises and discoveries yet to be made. For example, the mysterious “biotwang” sound, recently attributed to the elusive Bryde’s whale by the U.S. National Oceanic and Atmospheric Administration (NOAA), illustrates the ongoing challenge of attributing newly identified song types to species.
Google has a long history of collaborating with external scientists on using bioacoustics for monitoring and protecting whales, including our original research models for detecting and classifying humpback whale calls and the release of our multi-species whale model in 2024. To keep up with this pace of discovery, Google’s approach to AI for bioacoustics is evolving to enable more efficient connections from new discoveries to scientific insights at scale. In August 2025, Google DeepMind released the latest Perch foundational bioacoustics model, Perch 2.0, trained primarily on birds and other terrestrial vocalizing animals. Surprisingly, despite including no underwater audio in training, Perch 2.0 performed well as an embedding model for transfer learning in marine validation tasks.
In our latest paper, “Perch 2.0 transfers 'whale' to underwater tasks”, a collaboration between Google Research and Google DeepMind presented at the NeurIPS 2025 workshop on AI for Non-Human Animal Communications, we deep dive into these results. We show how this bioacoustics foundation model, trained mostly on bird data, can be used to enable and scale insights for underwater marine ecosystems, particularly for classifying whale vocalizations. We are also sharing an end-to-end tutorial in Google Colab for our agile modeling workflow, demonstrating how to use Perch 2.0 to create a custom classifier for whale vocalizations using the NOAA NCEI Passive Acoustic Data Archive through Google Cloud.
How bioacoustics classification works
If a pre-trained classification model, such as our multi-species whale model, already has the necessary labels and works well on a researcher’s dataset, it can be used directly to produce scores and labels for their audio data. However, to create a new custom classifier for newly discovered sounds or to improve accuracy on new data, we can leverage transfer learning instead of building a new model from scratch. This approach drastically reduces the amount of computation and experimentation needed to create a new custom classifier.
In bioacoustics transfer learning, the pre-trained model (such as Perch 2.0) is used to produce embeddings for each window of audio. These embeddings reduce the large audio data into a much smaller array of features that serve as input for a simple classifier. To create a new custom model for any set of labeled audio data, we apply the pre-trained model to the audio data to get the embeddings, which are used as the input features for a logistic regression classifier. Instead of learning all of the parameters for a deep neural network, we now only need to learn new parameters for the last step of logistic regression, which is much more efficient for both the researcher’s time and computational resources.
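As a concrete illustration of this recipe, the sketch below shows a frozen embedding model feeding a scikit-learn logistic-regression head. The `embed_windows` function is a hypothetical placeholder (a fixed random projection) standing in for the Perch 2.0 embedding model, and the embedding size is illustrative; everything else follows the transfer-learning steps described above.

```python
# Minimal sketch of bioacoustics transfer learning, assuming embeddings come
# from a frozen pre-trained model. `embed_windows` is a hypothetical stand-in
# (a fixed random projection) for the Perch 2.0 embedding model.
import numpy as np
from sklearn.linear_model import LogisticRegression

EMBED_DIM = 1536  # illustrative embedding size, not the model's actual value

def embed_windows(audio_windows: np.ndarray) -> np.ndarray:
    """Map (num_windows, num_samples) audio to (num_windows, EMBED_DIM) embeddings."""
    rng = np.random.default_rng(0)  # fixed seed so the "model" is deterministic
    projection = rng.normal(size=(audio_windows.shape[1], EMBED_DIM))
    return audio_windows @ projection  # placeholder for the real embedding model

def train_custom_classifier(audio_windows: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    # The foundation model stays frozen: we only compute embeddings, then
    # learn the parameters of a small logistic-regression head on top of them.
    embeddings = embed_windows(audio_windows)
    head = LogisticRegression(max_iter=1000)
    head.fit(embeddings, labels)
    return head

def score_new_audio(head: LogisticRegression, new_windows: np.ndarray) -> np.ndarray:
    # Re-use the same frozen embedding model at inference time.
    return head.predict_proba(embed_windows(new_windows))
```

Only the logistic-regression weights are learned; the foundation model's parameters stay fixed, which is why a new custom classifier can be built in hours rather than by retraining a deep network.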
Evaluation
We evaluated Perch 2.0 using a few-shot linear probe on marine tasks, such as distinguishing different baleen whale species or different killer whale subpopulations. Its performance was compared against pre-trained models that are supported in our Perch Hoplite repository for agile modeling and transfer learning. They include Perch 2.0, Perch 1.0, SurfPerch, and the multispecies whale model.
For underwater data evaluation, we used three datasets: NOAA PIPAN, ReefSet, and DCLDE.
- NOAA PIPAN: An annotated subset of the NOAA NCEI Passive Acoustic Data Archive from the NOAA Pacific Islands Fisheries Science Center recordings. It includes labels used in our prior whale models as well as new annotations for baleen species such as common minke whale, humpback whale, sei whale, blue whale, fin whale, and Bryde’s whale.
- ReefSet: Developed for SurfPerch model training, this dataset leverages data annotations from the Google Arts and Culture project: Calling in Our Corals. It includes a mix of biological reef noises (croaks, crackles, growls), specific species/genera classes (e.g., damselfish, dolphins, and groupers), and anthropogenic noise and wave classes.
- DCLDE: This dataset is evaluated using three different label sets:
- Species: For distinguishing between killer whales, humpbacks, abiotic sounds, and unknown underwater sounds (with some uncertainty in the killer whale and humpback labels).
- Species Known Bio: For certain labels of killer whales and humpbacks.
- Ecotype: For distinguishing between killer whale subpopulations (ecotypes), including Transient/Biggs, Northern Residents, Southern Residents, Southeastern Alaska killer whales, and offshore killer whales.
In this protocol, for a given target dataset with labeled data, we compute embeddings from each of the candidate models. We then select a fixed number of examples per class (4, 8, 16, or 32), and train a simple multi-class logistic regression model on top of the embeddings. We use the resulting classifier to compute the area under the receiver-operating characteristic curve (AUC_ROC), where values closer to 1 indicate a stronger ability to distinguish between classes. This process simulates using a given pre-trained embedding model to create a custom classifier from a small number of labelled examples.
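The snippet below sketches this protocol, assuming embeddings and labels for a target dataset have already been computed as NumPy arrays (and that every class has more than k examples); the per-class sampling, linear probe, and ROC-AUC computation follow the steps just described.

```python
# Sketch of the few-shot linear-probe protocol: sample k labelled examples per
# class, fit a multi-class logistic regression on frozen embeddings, and score
# the held-out examples with one-vs-rest ROC-AUC (closer to 1 is better).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def few_shot_probe(embeddings: np.ndarray, labels: np.ndarray, k: int, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    train_idx = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        # Use at most k examples per class to train the probe; assumes each
        # class has more than k examples so it also appears in the test split.
        train_idx.extend(rng.choice(cls_idx, size=min(k, cls_idx.size), replace=False))
    train_idx = np.asarray(train_idx)
    test_mask = np.ones(labels.size, dtype=bool)
    test_mask[train_idx] = False

    probe = LogisticRegression(max_iter=1000)
    probe.fit(embeddings[train_idx], labels[train_idx])
    scores = probe.predict_proba(embeddings[test_mask])
    return roc_auc_score(labels[test_mask], scores, multi_class="ovr")

# e.g., run the probe at each sample size used in the evaluation:
# for k in (4, 8, 16, 32):
#     print(k, few_shot_probe(perch2_embeddings, labels, k))
```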
Our results show that more examples per class improve performance across all the models, except on ReefSet data, where performance is high even with only four examples per class for all models, except the multispecies whale model. Notably, Perch 2.0 is consistently either the top or second-best performing model for each dataset and sample size.
We also compare Perch 2.0 to embeddings from AVES-bird and AVES-bio (Earth Species Project transformer bioacoustics models trained on birds and biological sounds, respectively) and BirdNet v2.3 from the K. Lisa Yang Center for Conservation Bioacoustics at the Cornell Lab of Ornithology. Perch 2.0 outperforms AVES-bird and AVES-bio on most underwater tasks, but other pre-trained models that were not trained on underwater audio also perform well.
How does Perch 2.0 work so 'whale'?
We offer a few possible reasons for this transfer performance from our model trained primarily on birds to underwater sounds. First, prior research shows that larger models with extensive training data generalize better, allowing our bioacoustics model to perform well even on downstream tasks classifying species and sounds not included in the training dataset. Additionally, the challenge of classifying similar bird calls (the “bittern lesson”) forces the model to learn detailed acoustic features that can then be informative for other bioacoustics tasks. For example, there are 14 species of doves in North America, each with its own subtly distinct “coo” sound. A model that extracts the features that can distinguish between each species-specific “coo” is likely to isolate features that can help separate other sound classes. Finally, feature transfer across different types of species could also be related to the sound production mechanism itself, where a variety of species — including birds and marine mammals — have evolved similar means of sound production.
A high performing model will have embeddings that are informative and linearly separable for the applied target classes. To visualize, we plot a summary of embeddings from each model using a procedure called tSNE, where different colors represent different classes. A highly informative model will show distinct clusters for each class, whereas the classes will be more intermixed in a less informative model (such as the Google multi-species whale model). While almost all models show some distinct clusters of points for Southern Resident killer whales (KW_SRKW) and Southern Alaskan Residents (KW_SAR), the resulting embeddings of sounds from Northern Resident killer whales (KW_NRKW), Transient killer whales (KW_TKW), and Offshore killer whales (KW_OKW) are intermingled in models such as AVES-bio, AVES-bird, and SurfPerch, but are more clearly distinguished in BirdNet v2.3 and Perch 2.0.
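A plot of this kind can be reproduced along the lines sketched below, assuming the same precomputed `embeddings` and `labels` arrays used in the probing sketch; the class names would be, for example, the killer whale ecotype labels.

```python
# Sketch of the embedding visualization: project embeddings to 2-D with t-SNE
# and color the points by class so that cluster separation can be inspected.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_embedding_clusters(embeddings: np.ndarray, labels: np.ndarray) -> None:
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    for cls in np.unique(labels):
        mask = labels == cls
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=str(cls))
    plt.legend(markerscale=3, fontsize="small")
    plt.title("t-SNE of audio embeddings, colored by class")
    plt.show()
```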
Looking ahead
The Google DeepMind Perch team, in collaboration with Google Research and external partners, has pioneered an agile modeling approach for bioacoustics that can create a custom classifier from a small number of labelled examples within a couple of hours. To support both Google Research partners and the broader cetacean acoustics community, we have created an end-to-end demo for working with the NOAA Passive Acoustic Archive data hosted on Google Cloud, and we have updated our prior tutorials to use the more efficient Perch Hoplite database for managing embeddings.
Acknowledgements
The Perch Team, which developed the Perch 2.0 model and is part of Google DeepMind, includes Tom Denton, Bart van Merriënboer, Vincent Dumoulin, Jenny Hamer, Isabelle Simpson, Andrea Burns, and Lauren Harrell (Google Research). Special thanks to Ann Allen (NOAA Pacific Islands Fisheries Center) and Megan Wood (Saltwater Inc. in support of NOAA) for providing additional annotations used in the NOAA PIPAN dataset, and to Dan Morris (Google Research) and Matt Harvey (Google DeepMind).