
Why There's No Single Best Way to Store Information

Published by qimuai · First-hand translation



Source: https://www.quantamagazine.org/why-theres-no-single-best-way-to-store-information-20260116/

Summary:

Data Storage Has No Master Key: Computer Science Reveals That Efficiency Is, at Its Core, a Trade-off

In an age of information overload, storing and retrieving data efficiently is not just a nuisance for ordinary users tidying their digital files; it is a core challenge in computer science. Recent research shows that, just as there is no single best way to organize a bookshelf, there is no one-size-fits-all solution for data storage: at its heart, the problem is a balancing act among speed, space, and convenience.

Hash Tables: Labeled Bins That Defuse the Insert-Versus-Lookup Conflict
The most common challenge is the tension between inserting new data quickly and finding old data quickly. It is like organizing a home library: shelve the books in strict alphabetical order and lookups are fast, but every new book forces its neighbors to shift; shelve them wherever there is space and insertion is easy, but finding anything becomes a chore.

The hash table, a structure computer scientists rely on constantly, draws on exactly this idea. It works like a set of bins labeled by letter: a mathematical rule (the hash function) computes, from some known property of each data item (its key), which bin the item belongs in. A well-designed hash function spreads data across the bins as evenly as possible, greatly speeding up insertion and retrieval alike. The scheme involves its own trade-off, though: adding bins shortens lookup time but consumes more storage, even if many bins sit empty. Recently, researchers have made breakthroughs on the space-time balance of hash tables, and an undergraduate even disproved a long-standing conjecture about their limiting efficiency.

Heaps: Tailor-Made for Prioritized Tasks
When data comes with clear priorities, for example when managing a work list that keeps receiving new tasks, a different structure called a heap is more efficient. It mimics a literal pile of stuff: the most urgent task always sits on top, ready to be grabbed, while the layers below are allowed to stay relatively disordered. New tasks enter at the bottom and bubble up by comparing priorities with the items above them, ensuring the most urgent tasks rise quickly to the top. Even with a thousand tasks in play, each insertion takes only a handful of adjustment steps. In 2024, researchers built on an innovative heap design to improve a classic network shortest-path algorithm to theoretically optimal performance.

The Takeaway: Accept Imperfection and Find the Best Fit for the Situation
Decades of exploration in computer science show that there is no ultimate solution in data storage. Hash tables give up some space in exchange for fast insertion and retrieval; heaps tolerate local disorder to keep priority management efficient. For ordinary users, the lesson may be this: do not chase perfectly uniform organization. Choose a strategy based on how often the data is used and how much it matters, and even tolerate a certain amount of mess. Finding the balance between efficiency and cost that best fits the situation at hand is the real wisdom of information management.


English source:

Why There’s No Single Best Way To Store Information
Introduction
Just as there’s no single best way to organize your bookshelf, there’s no one-size-fits-all solution to storing information.
Consider the simple situation where you create a new digital file. Your computer needs to rapidly find a place to put it. If you later want to delete it, the machine must quickly find the right bits to erase. Researchers aim to design storage systems, called data structures, that balance the amount of time it takes to add data, the time it takes to later remove it, and the total amount of memory the system needs.
To get a feel for these challenges, imagine you keep all your books in a row on one long shelf. If they’re organized alphabetically, you can quickly pick out any book. But whenever you acquire a new book, it’ll take time to find its proper spot. Conversely, if you place books wherever there’s space, you’ll save time now, but they’ll be hard to find later. This trade-off between insertion time and retrieval time might not be a problem for a single-shelf library, but you can see how it could get cumbersome with thousands of books.
Instead of a shelf, you could set up 26 alphabetically labeled bins and assign books to bins based on the first letter of the author’s last name. Whenever you get a new book, you can instantly tell which bin it goes in, and whenever you want to retrieve a book, you will immediately know where to look. In certain situations, both insertion and removal can be a lot faster than they would be if you stored items on one long shelf.
Of course, this bin system comes with its own problems. Retrieving books is only instantaneous if you have one book per bin; otherwise, you’ll have to root around to find the right one. In an extreme scenario where all your books are by Asimov, Atwood, and Austen, you’re back to the problem of one long shelf, plus you’ll have a bunch of empty bins cluttering up your living room.
Computer scientists often study data structures called hash tables that resemble more sophisticated versions of this simple bin system. Hash tables calculate a storage address for each item from a known property of that item, called the key. In our example, the key for each book is the first letter of the author’s last name. But that simple key makes it likely that some bins will be much fuller than others. (Few authors writing in English have a last name that starts with X, for example.) A better approach is to start with the author’s full name, replace each letter in the name with the number corresponding to its position in the alphabet, add up all these numbers, and divide the sum by 26. The remainder is some number between zero and 25. Use that number to assign the book to a bin.
This kind of mathematical rule for transforming a key into a storage address is called a hash function. A cleverly designed hash function ensures that items will usually end up distributed relatively evenly across bins, so you won’t need to spend as much time searching in each bin.
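As a concrete illustration, here is a minimal Python sketch of the letter-sum scheme described above. The names letter_sum_hash and Library are invented for this example, and real hash tables use far more robust functions, but the bin-assignment logic is the same:

```python
def letter_sum_hash(author: str, num_bins: int = 26) -> int:
    """Sum each letter's position in the alphabet, then take the
    remainder after dividing by the number of bins."""
    total = sum(ord(c) - ord("a") + 1 for c in author.lower() if c.isalpha())
    return total % num_bins

class Library:
    """A toy hash table: books go into bins chosen by the hash function."""
    def __init__(self, num_bins: int = 26):
        self.bins = [[] for _ in range(num_bins)]
        self.num_bins = num_bins

    def insert(self, author: str, title: str) -> None:
        # The storage address is computed directly from the key.
        self.bins[letter_sum_hash(author, self.num_bins)].append((author, title))

    def retrieve(self, author: str) -> list:
        # Only one bin needs searching, not the whole collection.
        bin_ = self.bins[letter_sum_hash(author, self.num_bins)]
        return [title for a, title in bin_ if a == author]

library = Library()
library.insert("Asimov", "Foundation")
library.insert("Atwood", "The Handmaid's Tale")
print(library.retrieve("Asimov"))  # ['Foundation']
```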
If you want to reduce retrieval time further, you can use more bins. But that leads to another trade-off: Those bins will take up space even if they end up empty.
This trade-off between space and time is an inherent feature of hash tables — it’s the price you pay for avoiding the tension between insertion and retrieval time that plagues simpler data structures. More than 70 years after hash tables were invented, computer scientists are still discovering new things about their fundamental properties. Recently, they finally devised a version that strikes an ideal balance between space and time. And last year, an undergraduate student disproved a long-standing conjecture about the minimum amount of time needed to find a specific item in a hash table that’s almost full.
A Heap of Priorities
Hash tables work well when you can’t anticipate which piece of data you’ll need to retrieve next. But that’s not always the case. Imagine you’re trying to complete tasks on a to-do list, but you’re constantly being assigned new tasks with different deadlines. You want to be able to quickly add new items to the to-do list, but you don’t care about retrieving items until they become your top priority.
In this case, your best bet is a type of data structure called a heap. As the name suggests, a heap is a somewhat haphazard approach to data storage. It’s basically a mathematical version of a pile of stuff: Some items are stored above others, and these higher items are easier to access. The highest-priority item is always at the top of the heap, where you can instantly pluck it off. Lower layers will be more disorganized, but you don’t need to worry about the relative positions of these low-priority items.
The simplest implementation of this basic idea uses a mathematical object called a binary tree, which is a network of nodes with a special shape: There’s a single node at the top, and each node is connected to two nodes directly below it.
Let’s imagine a binary tree that contains the items in a to-do list. Each node can store a single item, and each item is labeled with a number that represents its due date. High-priority items get smaller numbers.
Each new item is put into an empty slot in the current lowest layer.
Once the new item goes in, compare its due date to that of the item in the node directly above it. If the new task is due sooner, swap the items. Keep swapping until the new item ends up directly below an item that’s more urgent.
This procedure ensures that the highest-priority item will always rise to the top. What’s more, the procedure is extremely fast. Even in a nightmare scenario where you have 1,000 tasks on your to-do list and keep getting new assignments, storing them in a heap ensures that it takes no more than nine swaps to move each new item up to the appropriate position. Whenever you complete the most urgent task and remove it from the heap, you can quickly pull up your new top priority from the layer below.
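As a sketch of both operations, here is a minimal Python version, assuming the standard trick of storing the tree in a flat list where the parent of the node at index i sits at index (i - 1) // 2; the class and method names are invented for illustration. The nine-swap bound follows because a binary tree holding 1,000 items is only ⌊log₂ 1,000⌋ = 9 levels deep, and each swap moves an item up one level:

```python
class ToDoHeap:
    """A binary min-heap of (due_date, task) pairs kept in a flat list.
    The parent of the node at index i lives at index (i - 1) // 2."""

    def __init__(self):
        self.items = []

    def insert(self, due_date, task):
        # Place the new item in the first empty slot of the lowest layer...
        self.items.append((due_date, task))
        i = len(self.items) - 1
        # ...then swap it upward while it is due sooner than its parent.
        while i > 0 and self.items[i][0] < self.items[(i - 1) // 2][0]:
            parent = (i - 1) // 2
            self.items[i], self.items[parent] = self.items[parent], self.items[i]
            i = parent

    def pop_most_urgent(self):
        # Pluck the top item off, then refill the root with the last item
        # and sink it below any more-urgent child to restore the order.
        top = self.items[0]
        last = self.items.pop()
        if self.items:
            self.items[0] = last
            i = 0
            while True:
                left, right = 2 * i + 1, 2 * i + 2
                smallest = i
                if left < len(self.items) and self.items[left][0] < self.items[smallest][0]:
                    smallest = left
                if right < len(self.items) and self.items[right][0] < self.items[smallest][0]:
                    smallest = right
                if smallest == i:
                    break  # both children are due later; order is restored
                self.items[i], self.items[smallest] = self.items[smallest], self.items[i]
                i = smallest
        return top

heap = ToDoHeap()
for due, task in [(5, "report"), (2, "email"), (9, "filing"), (1, "demo")]:
    heap.insert(due, task)
print(heap.pop_most_urgent())  # (1, 'demo')
print(heap.pop_most_urgent())  # (2, 'email')
```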
Within computer science, heaps are widely used in algorithms for finding the shortest path from a given starting point in a network to every other point. In 2024, a team of researchers used an ingenious new heap design to transform a classic shortest-paths algorithm into one that is theoretically optimal for any network layout.
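The article doesn't name the algorithm, but the textbook example is Dijkstra's shortest-path algorithm, which repeatedly pulls the closest unsettled node off a heap. Here is a compact sketch using Python's built-in heapq module (an ordinary binary heap, not the specialized design behind the 2024 result):

```python
import heapq

def dijkstra(graph, start):
    """Shortest distance from start to every reachable node.
    graph maps each node to a list of (neighbor, edge_weight) pairs."""
    dist = {start: 0}
    heap = [(0, start)]  # the heap always exposes the closest frontier node
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale entry: a shorter route was already found
        for neighbor, weight in graph[node]:
            new_d = d + weight
            if new_d < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_d
                heapq.heappush(heap, (new_d, neighbor))
    return dist

network = {"A": [("B", 2), ("C", 5)], "B": [("C", 1)], "C": []}
print(dijkstra(network, "A"))  # {'A': 0, 'B': 2, 'C': 3}
```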
There’s no shortage of self-help books filled with contradictory advice about the best way to organize your belongings. If computer science offers any lesson, it’s that there is no perfect solution — every approach comes with trade-offs. But if some items are more important to you than others, don’t be afraid to leave a bit of a mess.
