Securing private data at scale with differentially private partition selection
Summary:
Google Research has released a privacy-preserving algorithm that efficiently processes datasets of hundreds of billions of entries while rigorously protecting user privacy.
August 20, 2025 — Justin Y Chen, a Student Researcher at Google Research, and Morteza Zadimoghaddam, a Research Scientist at Google Research, presented the paper "Scalable Private Partition Selection via Adaptive Weighting" at ICML 2025, introducing a new differentially private partition selection algorithm. While preserving strong user-privacy guarantees, the algorithm efficiently handles datasets of up to hundreds of billions of items, three orders of magnitude larger than those processed by prior sequential algorithms.
Large, user-based datasets are highly valuable for advancing AI and machine learning, but the risk of privacy leakage when sharing data has long been a pain point for the industry. Differentially private (DP) partition selection addresses this by adding controlled noise and keeping only the items that remain sufficiently frequent after the noise is applied, ensuring that an attacker cannot infer any individual user's contribution. The technique is a foundational step in text analysis, language modeling, data-stream analysis, and other key tasks.
The team's MaxAdaptiveDegree (MAD) algorithm moves beyond the limits of conventional non-adaptive weighting with an adaptive weight-reallocation mechanism: it automatically identifies popular items whose weight far exceeds the threshold and transfers the surplus weight to items sitting just below the threshold. This adaptive adjustment allows more low-frequency items to be released safely, significantly improving utility while retaining the same privacy guarantees and parallel efficiency.
Experiments show that MAD with just two iterations achieves strong results on the public Common Crawl dataset of nearly 800 billion entries: the output items cover 99.9% of the database records (each such record contains at least one output item) while satisfying a rigorous differential privacy guarantee. The algorithm has been open-sourced on GitHub to foster collaboration on privacy-preserving computation across the research community.
This advance offers a new paradigm for jointly optimizing privacy protection and data utility in big-data applications across healthcare, finance, public services, and beyond, marking a new stage for privacy-preserving processing of near-trillion-entry datasets.
English source:
Securing private data at scale with differentially private partition selection
August 20, 2025
Justin Y Chen, Student Researcher, and Morteza Zadimoghaddam, Research Scientist, Google Research
We present novel algorithms to preserve user privacy in data releases, improving the state of the art in differentially private partition selection.
Large, user-based datasets are invaluable for advancing AI and machine learning models. They drive innovation that directly benefits users through improved services, more accurate predictions, and personalized experiences. Collaborating on and sharing such datasets can accelerate research, foster new applications, and contribute to the broader scientific community. However, leveraging these powerful datasets also comes with potential data privacy risks.
The process of identifying a specific, meaningful subset of unique items that can be shared safely from a vast collection, based on how frequently or prominently they appear across many individual contributions (like finding all the common words used across a huge set of documents), is called “differentially private (DP) partition selection”. By applying differential privacy protections in partition selection, it’s possible to perform that selection in a way that prevents anyone from knowing whether any single individual's data contributed a specific item to the final list. This is done by adding controlled noise and only selecting items that are sufficiently common even after that noise is included, ensuring individual privacy. DP partition selection is the first step in many important data science and machine learning tasks, including extracting vocabulary (or n-grams) from a large private corpus (a necessary step in many textual analysis and language modeling applications), analyzing data streams in a privacy-preserving way, obtaining histograms over user data, and increasing efficiency in private model fine-tuning.
In the context of massive datasets like user queries, a parallel algorithm is crucial. Instead of processing data one piece at a time (like a sequential algorithm would), a parallel algorithm breaks the problem down into many smaller parts that can be computed simultaneously across multiple processors or machines. This practice isn't just for optimization; it's a fundamental necessity when dealing with the scale of modern data. Parallelization allows the processing of vast amounts of information all at once, enabling researchers to handle datasets with hundreds of billions of items. With this, it’s possible to achieve robust privacy guarantees without sacrificing the utility derived from large datasets.
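The split-and-merge pattern behind this can be sketched in a few lines. This is a toy illustration only: it uses Python threads as a stand-in for a distributed MapReduce-style system, and simple unit counts per distinct item rather than the paper's weighting scheme; the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def shard_weights(user_shard):
    """Weight one shard of users; each user adds at most 1 per distinct item."""
    counts = Counter()
    for items in user_shard:
        for item in set(items):  # de-duplicate within a user to bound contributions
            counts[item] += 1
    return counts

def parallel_weights(shards, workers=4):
    """Map: weight each shard independently. Reduce: merge the partial histograms."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(shard_weights, shards)
    total = Counter()
    for part in partials:
        total.update(part)
    return total
```

Because each user's contribution is bounded within its shard and the merge is a plain sum, the sensitivity analysis is unaffected by how the data is partitioned across workers.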
In our recent publication, “Scalable Private Partition Selection via Adaptive Weighting”, which appeared at ICML 2025, we introduce an efficient parallel algorithm that makes it possible to apply DP partition selection to various data releases. Our algorithm provides the best results across the board among parallel algorithms and scales to datasets with hundreds of billions of items, up to three orders of magnitude larger than those analyzed by prior sequential algorithms. To encourage collaboration and innovation by the research community, we are open-sourcing our DP partition selection code on GitHub.
How the algorithm works
The goal of DP partition selection is to maximize the number of unique items selected from a union of sets of data, while strictly preserving user-level DP. This means that very popular items, belonging to many users, can often be safely preserved for downstream computational tasks, whereas items belonging to only a single user would not be included. The algorithm designer must aim for an optimal privacy-utility trade-off in selecting items from the dataset while respecting the differential privacy requirement.
The standard paradigm: Weight, noise, and filter
The conventional approach to differentially private partition selection involves three core steps:
- Weight: For each item, a "weight" is computed, typically representing the item’s frequency or some aggregation across users. This forms a weight histogram. A crucial privacy requirement is "low sensitivity", meaning that the total weight contributed by any single user is bounded. In a standard non-adaptive setting, users assign weights to their items independently of what other users contribute (e.g., the Gaussian weighting baseline).
- Noise: To ensure DP, random noise (e.g., Gaussian noise with a specific standard deviation) is added to each item's computed weight. This obfuscates the exact counts, preventing an attacker from inferring an individual's presence or data.
- Filter: Finally, a threshold determined by the DP parameters is applied. Only items whose noisy weights exceed this threshold are included in the final output.
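The three steps above can be sketched in a few lines of Python. This is a toy, in-memory illustration, not the paper's implementation: the function name, the 1/√k per-user weighting, and the noise and threshold calibration are simplified stand-ins for properly calibrated mechanisms.

```python
import math
import random
from collections import defaultdict

def dp_partition_select(user_items, epsilon=1.0, delta=1e-5, max_items_per_user=10):
    """Toy weight-noise-filter partition selection (illustrative calibration)."""
    # Weight: each user spreads a bounded total weight over its items,
    # independently of other users (non-adaptive, Gaussian-style weighting).
    weights = defaultdict(float)
    for items in user_items:
        capped = list(items)[:max_items_per_user]  # bound each user's contribution
        if not capped:
            continue
        per_item = 1.0 / math.sqrt(len(capped))    # keeps each user's L2 norm at 1
        for item in capped:
            weights[item] += per_item

    # Noise: Gaussian noise scaled to the sensitivity bound above.
    sigma = math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    # Filter: keep only items whose noisy weight clears a delta-dependent threshold.
    threshold = 1.0 + sigma * math.sqrt(2.0 * math.log(1.0 / (2.0 * delta)))
    return {item for item, w in weights.items()
            if w + random.gauss(0.0, sigma) > threshold}
```

An item contributed by many users easily clears the threshold, while an item held by a single user almost never does, which is exactly the behavior the privacy guarantee requires.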
Improving utility with adaptive weighting
A limitation of the standard, non-adaptive approach is potential "wastage". Highly popular items might receive significantly more weight than necessary to cross the privacy threshold, effectively "over-allocating" weight. This excess weight could have been more effectively used to boost items that are just below the threshold, thereby increasing the overall number of items released and improving the utility of the output.
We introduce adaptivity into the weight assignment process to address this. Unlike non-adaptive methods where each user's contribution is independent, an adaptive design allows the weight contributed by a user to an item to consider contributions from other users. This is a delicate balance, as it must be achieved without compromising privacy or computational efficiency.
Our novel algorithm, MaxAdaptiveDegree (MAD), strategically reallocates weight. It identifies items with significant "excess weight" (far above the threshold) and reroutes some of that weight to "under-allocated" items (those just below the threshold). This adaptive reallocation ensures that more less-frequent items can cross the privacy threshold and be included in the output. Moreover, MAD maintains the same low-sensitivity bounds and efficiency as the baseline, meaning it offers the same strong privacy guarantees and scalability in parallel processing frameworks (such as MapReduce-style systems), but with strictly superior utility.
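The reallocation idea can be illustrated with a deliberately simplified sketch. Note that this toy version operates on an aggregate histogram for readability, whereas the actual MAD algorithm reroutes weight at the level of individual users' contributions so the per-user sensitivity bound is preserved; the function name and the `cap_fraction` knob are purely illustrative.

```python
def reallocate_excess(weights, threshold, cap_fraction=0.5):
    """Move a fraction of the weight above `threshold` from over-allocated
    items to the items closest below it (toy, aggregate-level sketch)."""
    # Collect the surplus we allow ourselves to move from items far above the threshold.
    surplus = sum((w - threshold) * cap_fraction
                  for w in weights.values() if w > threshold)
    # Trim the donor items accordingly.
    out = {item: threshold + (w - threshold) * (1.0 - cap_fraction) if w > threshold else w
           for item, w in weights.items()}
    # Top up under-allocated items, nearest to the threshold first.
    for item in sorted((i for i, w in weights.items() if w < threshold),
                       key=lambda i: threshold - weights[i]):
        give = min(threshold - out[item], surplus)
        out[item] += give
        surplus -= give
        if surplus <= 0.0:
            break
    return out
```

In this sketch, an item at twice the threshold donates half of its surplus, which is enough to lift items sitting just under the bar over it without changing the total weight in the histogram.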
Furthermore, we extend this concept to multi-round DP partition selection frameworks. We demonstrate how to safely release intermediate noisy weight vectors between rounds. This additional information allows for even greater adaptivity, as we can reduce future weight allocations to items that previously received too much weight (and were likely to be over-allocated again) or too little weight (and were unlikely to ever cross the threshold). This further refines the weight distribution, maximizing utility without sacrificing privacy, and further increases the number of items in the output.
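A toy sketch of how released round-one noisy weights might guide a second round: users stop backing items that are already far above the threshold or hopelessly far below it, and concentrate their remaining budget on items near the boundary. The band logic, the function name, and the `slack` parameter are illustrative assumptions, not the paper's actual rule.

```python
def second_round_targets(user_items, noisy_weights, threshold, slack=2.0):
    """For each user, keep only the items worth backing in round two:
    skip items whose round-one noisy weight is already far above the
    threshold, and items too far below it to plausibly cross."""
    def worth_backing(item):
        w = noisy_weights.get(item, 0.0)
        # Keep items within `slack` of the threshold on either side.
        return threshold - slack <= w <= threshold + slack
    return [[item for item in items if worth_backing(item)]
            for items in user_items]
```

Because the noisy weights are themselves released under DP, consulting them between rounds costs no additional privacy beyond what the release already accounted for.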
Experiments
We conducted extensive experiments comparing our MAD algorithm with one or multiple iterations against scalable baselines for private partition selection.
As shown in the table below, MAD with just two iterations (column MAD2R) achieves state-of-the-art results across many datasets — often outputting significantly more items than other methods (even those using more rounds) while retaining the same privacy guarantees.
In our paper, we present theoretical results showing that our single-round algorithm (MAD) should always outperform the single-round Gaussian weighting baseline mentioned above. Our experiments confirm this prediction. The strong performance of our new methods holds across a wide selection of privacy parameters and hyperparameter choices. An example in-memory implementation of our algorithm in Python is available in the open-source repo.
On the large-scale, publicly available Common Crawl dataset (comprising close to 800 billion entries), we obtained record-level DP by treating entries as “users” and the words in those entries as items. On this dataset, our two-iteration MAD algorithm output a set of items that covered 99.9% of the entries (each such entry has at least one of its items in the output) and 97% of the database entries corresponding to an item in the output, while satisfying the DP guarantee.
With just two iterations, our algorithm achieved state-of-the-art results in a wide range of parameter settings. As expected from our theoretical results, our algorithm always outperformed the baseline.
Conclusion
We introduce new methods to improve the utility-privacy trade-off in DP partition selection algorithms. Our algorithm achieves state-of-the-art results on datasets approaching a trillion entries. We hope this algorithm will help practitioners achieve higher utility across their workflows while strictly respecting user privacy.
Acknowledgements
We thank Alessandro Epasto and Vincent Cohen-Addad, part of the Algorithms and Optimization team in Google Research, who contributed to this work.