Sequential Attention: making AI models leaner and faster without sacrificing accuracy

Posted by qimuai on 2026-2-5 08:01. Views: 2. First-hand compilation.

Source: https://research.google/blog/sequential-attention-making-ai-models-leaner-and-faster-without-sacrificing-accuracy/

Summary:

Google Research introduces the Sequential Attention algorithm to make AI models substantially more efficient

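On February 4, 2026, Thomas Fu, Principal Engineer, and Kyriakos Axiotis, Senior Scientist, both at Google Research, published a post introducing an algorithm called Sequential Attention. It targets subset selection, a core difficulty in large-scale machine learning models, and aims to improve a model's efficiency and interpretability while preserving its accuracy, offering a new route to slimming down and speeding up AI models.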

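Tackling the core challenge of model efficiency
In deep learning, subset optimization tasks such as feature selection have long been computationally expensive and hard to do both well and quickly. In modern neural networks in particular, complex non-linear interactions between features make it very difficult to identify and keep the key components while removing redundant ones. The same challenge appears across many model optimization steps, including feature selection, weight pruning, and embedding dimension tuning.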

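How Sequential Attention works
Sequential Attention combines the attention mechanism with a greedy selection strategy. Instead of scoring all candidate components in one shot, it makes a sequence of decisions that builds up a high-quality subset step by step. At each step, the algorithm uses the components already selected as context and computes attention scores to evaluate and add the next most informative candidate. This captures high-order non-linear feature interactions and avoids the heavy cost of repeatedly retraining the model, which traditional greedy algorithms incur.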

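Gains in both efficiency and accuracy
According to the post, the algorithm achieves leading results on several benchmarks:

Feature selection: for neural network feature selection, the algorithm uses the model's internal attention weights directly as the scoring signal. This gives a fast, single-pass greedy selection with no need for expensive marginal gain calculations, improving efficiency considerably while maintaining quality.
Model pruning: the follow-up version, SequentialAttention++, combines the strengths of differentiable pruning and combinatorial optimization. It identifies and prunes unimportant blocks or channels of weights in a neural network, and on tasks such as ImageNet classification it delivers significant model compression and speedup without sacrificing accuracy.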

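Broad application prospects
The research team notes that the Sequential Attention framework has wide potential applications:

Recommender systems: it can make the very large feature embedding layers of large recommendation models more efficient.
Large language models (LLMs): it is suited to structured pruning of LLMs, removing attention heads, embedding dimensions, or entire Transformer blocks to reduce model size and inference latency.
Biological sciences: in drug discovery and genomics, it can help extract key genetic or chemical features from high-dimensional data efficiently, improving model interpretability and predictive accuracy.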

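Current research is shifting toward scaling the technique to larger datasets and more complex architectures, and toward putting its reliability in real-world applications on firmer mathematical footing.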

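Closing remarks
Sequential Attention offers an efficient, unified approach to a family of subset selection problems in deep learning, and the post describes it as a key technique for model architecture optimization. As techniques of this kind mature, they should help build AI models that are more capable, more efficient, and easier to use, keeping advanced AI both high-performing and widely accessible.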

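Full translation:

Sequential Attention: Making AI models leaner and faster without sacrificing accuracy

February 4, 2026
Thomas Fu, Principal Engineer, and Kyriakos Axiotis, Senior Scientist, Google Research

We introduce a subset selection algorithm for making large-scale machine learning models more efficient.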

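Feature selection is the process of identifying and retaining the most informative subset of input variables while discarding irrelevant or redundant noise. A fundamental challenge in both machine learning and deep learning, feature selection is NP-hard (that is, for large amounts of data it is mathematically "impossible" to solve both perfectly and quickly), so it remains a highly challenging area of research.

In modern deep neural networks, intricate non-linear feature interactions make feature selection harder still. A feature may look statistically insignificant on its own yet become critical when combined with other features inside the network's non-linear layers. Conversely, a feature's contribution may look significant in isolation yet become redundant once other features are taken into account. The core challenge is to identify the essential features to retain while effectively pruning redundancy within complex model architectures.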
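
The point about a feature that looks useless on its own can be made concrete with a small synthetic example. The sketch below is not from the post: the XOR labels, the `noise` column, and the helper `table_accuracy` are all illustrative assumptions. It measures how well the best lookup-table classifier can predict `y` from a given set of binary columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x1 = rng.integers(0, 2, size=n)      # informative only together with x2
x2 = rng.integers(0, 2, size=n)      # informative only together with x1
noise = rng.integers(0, 2, size=n)   # irrelevant feature
y = x1 ^ x2                          # label depends on the interaction alone


def table_accuracy(columns, labels):
    """Accuracy of the best lookup-table classifier on the given binary columns."""
    keys = np.zeros(len(labels), dtype=int)
    for col in columns:
        keys = keys * 2 + col        # encode each row's feature pattern as an integer
    correct = 0
    for key in np.unique(keys):
        group = labels[keys == key]
        ones = int(group.sum())
        correct += max(ones, len(group) - ones)   # predict the majority label
    return correct / len(labels)


acc_x1_alone = table_accuracy([x1], y)         # close to chance
acc_x2_alone = table_accuracy([x2], y)         # close to chance
acc_x1_noise = table_accuracy([x1, noise], y)  # still close to chance
acc_x1_x2 = table_accuracy([x1, x2], y)        # exact, since y = x1 XOR x2
```

A method that ranks each column in isolation would see `x1`, `x2`, and `noise` as roughly equally uninformative here, even though `x1` and `x2` together determine the label exactly.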

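More broadly, many machine learning optimization tasks can be cast as subset selection problems, of which feature selection is one special case. For example, embedding dimension tuning can be viewed as selecting a subset of embedding chunks, and weight pruning as selecting a subset of entries from the weight matrix. A general solution to subset selection that applies to modern deep learning tasks can therefore have a large impact on building the most efficient models.

Today, we explore our solution to the subset selection problem, called Sequential Attention. Sequential Attention uses a greedy selection mechanism to sequentially and adaptively select the best next component (such as a layer, block, or feature) to add to the model. Adaptive greedy algorithms are known to provide strong guarantees for various subset selection problems, such as submodular optimization, but applying them naively would increase the training cost by many orders of magnitude. To address this scalability problem, we integrate selection directly into model training, so that selection happens within a single training run. This allows Sequential Attention to be applied to large-scale machine learning models with minimal overhead, without sacrificing accuracy or adding complexity. Below, we analyze how Sequential Attention works and show how it is being used in real-world scenarios to optimize the structure of deep learning models.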

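How Sequential Attention works
Sequential Attention uses the weighting power of the attention mechanism to build a subset step by step. In contrast to standard "one-shot" attention, in which all candidates are weighted simultaneously, Sequential Attention addresses the NP-hard nature of subset selection by treating it as a sequential decision process. This is particularly effective for identifying high-order non-linear interactions that are often missed by "filter methods", the simplest way to pick a subset, which look only at the merits of each individual item.

The core idea is to maintain a set of selected candidates and use them as context to find the next most informative candidate. This rests on two ingredients. The first is greedy selection, which lets the model make a locally optimal decision about which element to include at each step. The second is importance evaluation, which uses "attention scores" (numerical values indicating the importance or relevance of different parts of the input) to quantify the importance of every candidate relative to the candidates already selected. Like the attention mechanism, Sequential Attention uses softmax to rank the importance of different components. Unlike the attention mechanism, it works sequentially rather than in one shot, which lets the selection algorithm adapt to earlier selections, a crucial property for high-quality importance ranking.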
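
To illustrate why adapting to earlier selections matters, here is a minimal sketch that contrasts a one-shot ranking with a sequential one. Everything in it is an assumption made for illustration: in the real method the scores come from trained attention weights, whereas here a hypothetical `toy_logits` function stands in for training by lowering a candidate's score when it is redundant with something already selected.

```python
import numpy as np

# Hypothetical stand-alone relevance of four candidates (not from the post).
relevance = np.array([2.0, 1.9, 0.5, 1.0])
# Hypothetical pairwise redundancy: candidates 0 and 1 carry nearly the same information.
redundancy = np.array([
    [0.0, 1.8, 0.0, 0.1],
    [1.8, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.0],
])


def toy_logits(selected):
    """Stand-in for learned scores: relevance minus redundancy with the selected set."""
    penalty = np.zeros(len(relevance))
    for j in selected:
        penalty += redundancy[:, j]
    return relevance - penalty


def attention_scores(logits, selected):
    """Softmax over the candidates that have not been selected yet."""
    unselected = np.ones(len(logits), dtype=bool)
    for j in selected:
        unselected[j] = False
    scores = np.zeros(len(logits))
    shifted = logits[unselected] - logits[unselected].max()   # numerical stability
    weights = np.exp(shifted)
    scores[unselected] = weights / weights.sum()
    return scores


def select_sequentially(k):
    """Greedy loop: re-score given the current selection, then add the top candidate."""
    selected = []
    for _ in range(k):
        scores = attention_scores(toy_logits(selected), selected)
        selected.append(int(np.argmax(scores)))
    return selected


one_shot = [int(j) for j in np.argsort(-relevance)[:2]]   # ranks once, ignores redundancy
sequential = select_sequentially(2)                        # re-ranks after each pick
```

In this toy setup the one-shot ranking picks candidates 0 and 1, which overlap heavily, while the sequential loop picks 0 and then 3, because candidate 1's score drops once 0 is in the selected set.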

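Sequential Attention benefits
The primary benefits of Sequential Attention are:

Efficiency and accuracy: once the attention scores are computed, the candidates can be processed in parallel, so they are evaluated faster than in traditional sequential selection.
Interpretability: the attention scores themselves are a powerful diagnostic tool. Researchers can inspect them to see exactly which parts of the input a model prioritized when making a specific decision or generating a specific token. This makes the model's internal reasoning more interpretable than that of a black-box model.
Scalability: the ability to handle a large number of candidates efficiently is crucial for large-scale feature selection in modern neural networks.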

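Sequential Attention in action
Feature selection
The standard feature selection method, greedy selection, is computationally expensive because it requires retraining or re-evaluating the model for every potential feature at every step. In "Sequential Attention for Feature Selection", we sought to replace this costly method with a much cheaper proxy: the model's internal attention weights.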
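
The cost of naive greedy selection can be estimated with simple counting. The helper below is an illustrative assumption rather than anything from the post: it assumes one model training or evaluation per remaining candidate at each step, so choosing `subset_size` features from `num_candidates` takes the sum of `num_candidates - step` over all steps.

```python
def naive_greedy_trainings(num_candidates, subset_size):
    """Model trainings or evaluations needed by naive greedy forward selection."""
    return sum(num_candidates - step for step in range(subset_size))


# Example sizes are arbitrary: pick 50 features out of 1000.
trainings_needed = naive_greedy_trainings(1000, 50)
```

For these example sizes the count is 48,775 runs, compared with the single training run that the post describes for Sequential Attention.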

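At each step, the Sequential Attention algorithm calculates attention weights for all remaining, unselected features and permanently adds the feature with the highest attention score (the one to which the model is "paying the most attention") to the subset. The algorithm then re-runs the forward pass (the process of feeding input data through a neural network, layer by layer, from input to output, to generate a prediction) and recalculates the attention weights for the remaining features. This recalculation naturally reflects the marginal gain (how much a feature contributes to performance, given the features already selected), allowing the model to identify and avoid adding redundant features.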
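
The loop described above can be sketched for the simplest possible model. The code below is a rough illustration under several assumptions that are not from the post: a linear model trained with plain gradient descent on mean squared error, a softmax mask over the unselected features, a fixed number of steps per selection phase, and synthetic data in which only features 0 and 3 matter. The function name `sequential_attention_select` and all hyperparameters are hypothetical.

```python
import numpy as np


def masked_softmax(logits, unselected):
    """Softmax over unselected features only; selected features get score 0."""
    scores = np.zeros_like(logits)
    shifted = logits[unselected] - logits[unselected].max()
    weights = np.exp(shifted)
    scores[unselected] = weights / weights.sum()
    return scores


def sequential_attention_select(X, y, k, steps_per_phase=300, lr=0.05):
    n, d = X.shape
    theta = np.zeros(d)              # linear model weights
    logits = np.zeros(d)             # trainable attention logits, one per feature
    chosen = np.zeros(d, dtype=bool)
    order = []
    for _ in range(k):
        for _ in range(steps_per_phase):
            attn = masked_softmax(logits, ~chosen)
            mask = np.where(chosen, 1.0, attn)       # selected features pass through unscaled
            residual = X @ (mask * theta) - y        # forward pass and prediction error
            grad_coef = X.T @ residual / n           # gradient w.r.t. mask * theta
            grad_theta = grad_coef * mask
            grad_mask = grad_coef * theta
            # Backpropagate through the softmax (attn is 0 on selected features).
            grad_logits = attn * (grad_mask - np.sum(attn * grad_mask))
            theta -= lr * grad_theta
            logits -= lr * grad_logits
        attn = masked_softmax(logits, ~chosen)
        best = int(np.argmax(attn))                  # feature the model attends to most
        chosen[best] = True
        order.append(best)
    return order


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=500)
picked = sequential_attention_select(X, y, k=2)
```

Training continues across phases rather than restarting, which mirrors the idea of selecting within a single training run. The published algorithm for neural networks differs in its details, so this should be read as a toy version of the idea only.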

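The Sequential Attention algorithm achieved state-of-the-art results across several neural network benchmarks. Notably, it greatly improved efficiency, enabling a fast, one-pass implementation of greedy selection without expensive, explicit marginal gain calculations. The study also showed that, when applied to a simple linear regression model, the Sequential Attention algorithm is mathematically equivalent to the established Orthogonal Matching Pursuit (OMP) algorithm. This equivalence matters because OMP comes with provable guarantees of reliability and performance.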
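
For readers unfamiliar with OMP, a textbook version is short enough to show. The sketch below is the standard algorithm, not code from the paper, and it does not demonstrate the equivalence itself. At each step it picks the column most correlated with the current residual, then refits least squares on all selected columns. Scaling correlations by column norms and the synthetic data are choices made for this example.

```python
import numpy as np


def orthogonal_matching_pursuit(X, y, k):
    """Select k columns of X greedily, refitting least squares after each pick."""
    column_norms = np.linalg.norm(X, axis=0)
    residual = y.astype(float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        correlations = np.abs(X.T @ residual) / column_norms
        if support:
            correlations[support] = -np.inf      # never pick the same column twice
        support.append(int(np.argmax(correlations)))
        # Refit on the whole support, so the residual is orthogonal to every chosen column.
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    return support, coef


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=500)
support, coef = orthogonal_matching_pursuit(X, y, k=2)
```

The refit step is what separates OMP from plain matching pursuit: because the residual is orthogonal to the chosen columns, a column that duplicates an already selected one scores near zero on the next step.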

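Block sparsification
Neural network pruning is essential for deploying large models efficiently because it reduces model size by removing unnecessary weights. Prior research pursued two largely separate paths: differentiable pruning, which uses trainable parameters as proxies for importance, and combinatorial optimization, which uses algorithms to search for the best sparse structure.

In "SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization", we sought to unite these two approaches in a coherent framework for structured neural network pruning, which removes entire blocks or channels of weights to achieve real-world speedups on hardware accelerators such as GPUs and TPUs.

The resulting algorithm, SequentialAttention++, provides a new way to discover the most important blocks of weight matrices and shows significant gains in model compression and efficiency without sacrificing accuracy on machine learning tasks such as ImageNet classification.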
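
To make "removing entire blocks of weights" concrete, the sketch below zeroes out all but the highest-scoring square blocks of a weight matrix. It deliberately swaps in a much simpler scoring rule, the Frobenius norm of each block, in place of the learned importance scores that SequentialAttention++ uses. The function names, block size, and example matrix are illustrative assumptions.

```python
import numpy as np


def block_scores(W, block):
    """Frobenius norm of each (block x block) tile of W."""
    rows, cols = W.shape
    if rows % block or cols % block:
        raise ValueError("matrix dimensions must be divisible by the block size")
    tiles = W.reshape(rows // block, block, cols // block, block)
    return np.sqrt((tiles ** 2).sum(axis=(1, 3)))


def prune_blocks(W, block, keep):
    """Keep the `keep` highest-scoring blocks and zero the rest (ties may keep more)."""
    scores = block_scores(W, block)
    threshold = np.sort(scores, axis=None)[-keep]
    block_mask = scores >= threshold
    # Expand the per-block mask back to the shape of W.
    full_mask = np.repeat(np.repeat(block_mask, block, axis=0), block, axis=1)
    return W * full_mask, block_mask


W = np.array([
    [5.0, 5.0, 0.1, 0.1],
    [5.0, 5.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 1.0],
    [0.1, 0.1, 1.0, 1.0],
])
pruned, block_mask = prune_blocks(W, block=2, keep=2)
```

Because whole tiles are zeroed, the surviving weights stay in dense blocks, which is the property that lets accelerators skip the pruned regions.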

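The future of Sequential Attention
As AI models become increasingly integrated into science, engineering, and business, model efficiency matters more than ever, and model structure optimization is crucial for building models that are both effective and efficient. We have identified subset selection as a fundamental challenge for model efficiency across a range of deep learning optimization tasks, and Sequential Attention has emerged as a key technique for addressing these problems. Going forward, we aim to extend the applications of subset selection to increasingly complex domains.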

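Feature engineering with real constraints
Sequential Attention has demonstrated significant quality gains and efficiency savings in optimizing the feature embedding layer of the large embedding models (LEMs) used in recommender systems. These models typically have a large number of heterogeneous features with large embedding tables, so tasks such as feature selection and pruning, feature cross search, and embedding dimension optimization have a large impact. In the future, we would like these feature engineering tasks to take real inference constraints into account, enabling fully automated, continual feature engineering.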

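Large language model (LLM) pruning
The SequentialAttention++ paradigm is a promising direction for LLM pruning. By applying this framework, we can enforce structured sparsity (for example, block sparsity) and prune redundant attention heads, embedding dimensions, or entire Transformer blocks, significantly reducing model footprint and inference latency while preserving predictive performance.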

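Drug discovery and genomics
Feature selection is vital in the biological sciences. Sequential Attention can be adapted to extract influential genetic or chemical features from high-dimensional datasets efficiently, improving both the interpretability and the accuracy of models in drug discovery and personalized medicine.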

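Current research focuses on scaling Sequential Attention to handle massive datasets and highly complex architectures more efficiently. In addition, ongoing efforts seek to identify better pruned model structures and to extend rigorous mathematical guarantees to real-world deep learning applications, strengthening the framework's reliability across industries.

Subset selection is a core problem in multiple optimization tasks in deep learning, and Sequential Attention is a key technique for solving it. In the future, we will explore more applications of subset selection to solve more challenging problems in broader domains.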

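Conclusion
Sequential Attention is an effective technique for multiple large-scale subset selection problems in deep learning and plays a key role in model architecture optimization. As these techniques evolve, they will strengthen the future of machine learning, helping powerful AI remain both accurate and accessible for years to come.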

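Acknowledgements
We would like to thank our research collaborators, Taisuke Yasuda, Lin Chen, Matthew Fahrbach, MohammadHossein Bateni, and Vahab Mirrokni, whose efforts have advanced the development of Sequential Attention. This work builds on fundamental research in differentiable subset selection and combinatorial optimization to create more efficient and accessible AI models.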

English source:

Sequential Attention: Making AI models leaner and faster without sacrificing accuracy
February 4, 2026
Thomas Fu, Principal Engineer, and Kyriakos Axiotis, Senior Scientist, Google Research
We introduce a subset selection algorithm for making large scale ML models more efficient.
Feature selection is the process of identifying and retaining the most informative subset of input variables while discarding irrelevant or redundant noise. A fundamental challenge in both machine learning and deep learning, feature selection is NP-hard (i.e., a problem that is mathematically "impossible" to solve perfectly and quickly for large groups of data), and as such, it remains a highly challenging area of research.
In modern deep neural networks, feature selection is further complicated by intricate non-linear feature interactions. A feature may appear statistically insignificant on its own but become critical when combined with others within the network's non-linear layers. Conversely, a feature’s contribution may appear significant in isolation, but made redundant when taking other features into account. The core challenge lies in identifying essential features for retention while effectively pruning redundancy within complex model architectures.
More broadly, many ML optimization tasks can be cast as subset selection problems, of which feature selection is a special case. For example, embedding dimension tuning can be viewed as selecting a subset of embedding chunks, and weight pruning as selecting a subset of entries from the weight matrix. Therefore devising a general solution for the subset selection problem that is applicable to modern deep learning tasks can be highly impactful for building the most efficient models.
Today, we explore our solution to the subset selection problem, called Sequential Attention. Sequential Attention uses a greedy selection mechanism to sequentially and adaptively select the best next component (like a layer, block, or feature) to add to the model. While adaptive greedy algorithms are known to provide strong guarantees for various subset selection problems, such as submodular optimization, naïvely applying such algorithms would increase the training cost by many orders of magnitude. To tackle this scalability issue, we integrate selection directly into the model training process by performing selection within a single model training. This ensures that Sequential Attention can be applied to large scale ML models with minimal overhead without sacrificing accuracy or complexity. Here we will analyze how Sequential Attention works and show how it’s being used in real-world scenarios to optimize the structure of deep learning models.
How Sequential Attention works
Sequential Attention leverages the weighting power of the attention mechanism to build a subset step-by-step. In contrast to standard "one-shot" attention, in which all candidates are weighted simultaneously, Sequential Attention addresses the NP-hard nature of subset selection by treating it as a sequential decision process. This is particularly effective for identifying high-order non-linear interactions often missed by "filter methods", which provide the simplest way to pick a subset by focusing only on the merits of each individual item.
The core idea is to maintain a set of selected candidates and use them as context to find the next most informative candidate to select. This is achieved through two primary methods: greedy selection, which allows the model to make a locally optimal decision around which element to include at each step, and importance evaluation, which uses “attention scores” (numerical values indicating the importance or relevance of different input parts) to quantify the importance of every candidate in addition to the currently selected candidates. Like the attention mechanism, Sequential Attention uses softmax as an importance ranking of different components. Yet unlike the attention mechanism, it works sequentially as opposed to one shot, allowing the selection algorithm to adapt to previous selections — a crucial property for high-quality importance ranking.
Sequential Attention benefits
The primary benefits of Sequential Attention are:

Efficiency and accuracy: By allowing parallel processing of the candidates (once the attention scores are calculated), they can be evaluated faster than in traditional sequential selection.
Interpretability: The attention scores themselves offer a powerful diagnostic tool. Researchers can inspect the attention scores to see exactly which parts of the input a model prioritized when making a specific decision or generating a specific token. This makes the model's internal reasoning more interpretable than that of a black-box model.
Scalability: The ability to efficiently handle a large number of candidates is crucial for large-scale feature selection for modern neural networks.
Sequential Attention in action
Feature selection
The standard feature selection method, i.e., greedy selection, is computationally expensive, as it requires re-training or re-evaluating the model for every potential feature at every step. In “Sequential Attention for Feature Selection”, we sought to replace this costly method with a much cheaper proxy: the model’s internal attention weights.
At each step, the Sequential Attention algorithm calculates attention weights for all remaining, unselected features, and permanently adds the feature with the highest attention score (the one to which the model is "paying the most attention") to the subset. The algorithm then re-runs the forward pass (the process of feeding input data through a neural network, layer by layer, from input to output, to generate a prediction) and re-calculates the attention weights for the remaining features. This recalculation naturally reflects the marginal gain (how much a feature contributes to performance, given the features already selected), allowing the model to effectively identify and avoid adding redundant features.
The Sequential Attention algorithm achieved state-of-the-art results across several neural network benchmarks. Notably, it drastically improved efficiency, enabling a fast, one-pass implementation of greedy selection without the need for expensive, explicit marginal gain calculations. The study also demonstrated that when applied to a simple linear regression model, the Sequential Attention algorithm is mathematically equivalent to the established Orthogonal Matching Pursuit (OMP) algorithm. This equivalence is critical because OMP comes with provable guarantees of reliability and performance.
Block sparsification
Neural network pruning is essential for deploying large models efficiently because it reduces the model size by removing unnecessary weights. Prior research pursued two largely separate paths: differentiable pruning, which uses trainable parameters as proxies for importance, and combinatorial optimization, which uses algorithms to search for the best sparse structure.
In “SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization”, we sought to unite these two approaches into a coherent framework for structured neural network pruning that removes entire blocks or channels of weights to achieve real-world improvements on hardware accelerators like GPUs and TPUs.
The resulting algorithm, SequentialAttention++, provides a new way to discover the most important blocks of weight matrices, and shows significant gains in model compression and efficiency without sacrificing accuracy in ML tasks, e.g., ImageNet classification.
The future of sequential attention
As the increasing integration of AI models in science, engineering and business makes model efficiency more relevant than ever, model structure optimization is crucial for building highly effective yet efficient models. We have identified subset selection as a fundamental challenge related to model efficiency across various deep learning optimization tasks, and Sequential Attention has emerged as a pivotal technique for addressing these problems. Moving forward, we aim to extend the applications of subset selection to increasingly complex domains.
Feature engineering with real constraints
Sequential Attention has demonstrated significant quality gains and efficiency savings in optimizing the feature embedding layer in large embedding models (LEMs) used in recommender systems. These models typically have a large number of heterogeneous features with large embedding tables, and so the tasks of feature selection/pruning, feature cross search and embedding dimension optimization are highly impactful. In the future, we would like to allow these feature engineering tasks to take real inference constraints into account, enabling fully automated, continual feature engineering.
Large language model (LLM) pruning
The SequentialAttention++ paradigm is a promising direction for LLM pruning. By applying this framework we can enforce structured sparsity (e.g., block sparsity), prune redundant attention heads, embedding dimensions or entire transformer blocks, and significantly reduce model footprint and inference latency while preserving predictive performance.
Drug discovery and genomics
Feature selection is vital in the biological sciences. Sequential Attention can be adapted to efficiently extract influential genetic or chemical features from high-dimensional datasets, enhancing both the interpretability and accuracy of models in drug discovery and personalized medicine.
Current research focuses on scaling Sequential Attention to handle massive datasets and highly complex architectures more efficiently. Furthermore, ongoing efforts seek to identify superior pruned model structures and extend rigorous mathematical guarantees to real-world deep learning applications, solidifying the framework’s reliability across industries.
Subset selection is a core problem central to multiple optimization tasks in deep learning, while Sequential Attention is a key technique to solve these problems. In the future, we will explore more applications of subset selection to solve more challenging problems in broader domains.
Conclusion
Sequential Attention is an effective technique for multiple large-scale subset selection problems in deep learning and plays a key role in model architecture optimization. As these techniques evolve, they will solidify the future of machine learning, guaranteeing that powerful AI remains both accurate and accessible for years to come.
Acknowledgements
We would like to express our gratitude to our research collaborators, Taisuke Yasuda, Lin Chen, Matthew Fahrbach, MohammadHossein Bateni, and Vahab Mirrokni, whose efforts have advanced the development of Sequential Attention. This work builds upon fundamental research in differentiable subset selection and combinatorial optimization to create more efficient and accessible AI models.
