Speculative cascades — A hybrid approach for smarter, faster LLM inference
Summary:
Google Research proposes "speculative cascades," a new LLM inference method that balances efficiency and quality
On September 11, 2025, Google Research scientists Hari Narasimhan and Aditya Menon published work introducing "speculative cascades," a new technique for optimizing large language model (LLM) inference. The method combines traditional cascades with speculative decoding to preserve output quality while significantly improving inference speed and reducing computational cost.
LLMs are now widely used for search, coding assistance, and other applications, but inference remains a pain point: it consumes substantial compute and can respond slowly. The team combines two existing optimization paths: traditional cascades control cost by letting a small model handle simple tasks and forwarding only complex queries to the large model, while speculative decoding improves throughput by having a small model draft text that the large model verifies in parallel.
Speculative cascades introduce a dynamic deferral mechanism that decides, at the token level, whether to accept the small model's output or hand off to the large model. This flexible criterion can be implemented with strategies such as confidence comparison, cost-benefit analysis, or checking against the large model's top-ranked tokens, avoiding both the sequential waiting bottleneck of traditional cascades and speculative decoding's rigid requirement that the output match the large model exactly.
In tests with Gemma and T5 models, the method showed better cost-quality trade-offs on summarization, translation, reasoning, coding, and question-answering tasks. At comparable output quality, speculative cascades achieved higher speed-ups and better quality metrics than either technique alone. The approach gives developers finer-grained control over the cost-quality balance and should help LLMs serve real applications more efficiently.
(Note: this summary is a news-style compilation of the technical blog post published by Google Research in September 2025. It preserves the core technical contributions and experimental conclusions and omits auxiliary material such as the worked examples and acknowledgements, which appear in the full English source below.)
English source:
Speculative cascades — A hybrid approach for smarter, faster LLM inference
September 11, 2025
Hari Narasimhan and Aditya Menon, Research Scientists, Google Research
We introduce “speculative cascades”, a new approach that improves LLM efficiency and computational costs by combining speculative decoding with standard cascades.
LLMs have transformed how we interact with technology, powering everything from advanced search capabilities to creative coding assistants. But this power comes at a cost: inference (the process of generating a response) can be slow and computationally expensive. As we deploy these models to more users, making them faster and less expensive without sacrificing quality is a critical challenge.
One way to accomplish this would be to use cascades, which aim to optimize LLM efficiency by strategically using smaller, faster models before engaging a larger, more expensive LLM. This approach involves a deferral rule where the smaller model decides if it can handle a query or if it needs to pass the task to a more capable, but costlier, large model. The goal is to process as much as possible cheaply and quickly, only incurring the high cost of the large LLM for complex tasks that truly require its advanced capabilities, potentially yielding favorable cost-quality trade-offs. Cascades prioritize computational cost reduction and efficient resource allocation, while allowing for some variability in quality.
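To make the deferral idea concrete, here is a minimal sketch of one possible cascade with a simple confidence threshold. The `small_model` and `large_model` functions and the threshold value are toy stand-ins for illustration, not the deferral rule studied in the paper.

```python
# A minimal sketch of a confidence-based cascade. The two "models" are
# hypothetical stand-ins that return an answer plus a self-reported confidence
# in [0, 1]; the fixed threshold is an assumption for this sketch.

def small_model(prompt: str) -> tuple[str, float]:
    # Toy drafter: cheap and fast, confident on simple factual prompts.
    return ("Paris.", 0.93)

def large_model(prompt: str) -> tuple[str, float]:
    # Toy expert: slower and costlier, assumed more reliable.
    return ("The capital of France is Paris.", 0.99)

def cascade(prompt: str, threshold: float = 0.8) -> str:
    answer, confidence = small_model(prompt)   # cheap first pass
    if confidence >= threshold:                # deferral rule: ship the cheap answer
        return answer
    answer, _ = large_model(prompt)            # otherwise pay for the large model
    return answer

print(cascade("What is the capital of France?"))
```

Note that the two calls are strictly sequential: when the small model is not confident, its work is discarded and the large model starts from scratch, which is the bottleneck revisited below.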
Another approach, speculative decoding, optimizes an LLM’s latency and throughput without altering the final result. It achieves this by employing a smaller, faster "drafter" model to predict a sequence of future tokens. These speculated tokens are then quickly verified in parallel by the larger “target” model. If the draft is accepted, the large model effectively generates multiple tokens in a single step, greatly accelerating the process while guaranteeing that the final output is identical to what the large model would have produced on its own. This approach prioritizes speed and latency reduction, potentially at the cost of increased memory usage and less computational savings, since the larger model still performs substantial work.
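The draft-then-verify loop can be sketched as follows with two scripted toy "models" over a tiny vocabulary. In a real system the target model scores all draft positions in a single batched forward pass, and acceptance may be probabilistic rather than the exact greedy match used in this sketch.

```python
# A minimal sketch of greedy speculative decoding. The scripted "models" below
# are illustrative stand-ins for a drafter / target pair.

def drafter(context):
    script = ["The", "cat", "sat", "on", "a", "mat", "<eos>"]
    return script[min(len(context), len(script) - 1)]

def target(context):
    script = ["The", "cat", "sat", "on", "the", "mat", "<eos>"]
    return script[min(len(context), len(script) - 1)]

def speculative_decode(max_len=16, block=4):
    out, passes = [], 0
    while len(out) < max_len and (not out or out[-1] != "<eos>"):
        draft = []
        for _ in range(block):                 # 1) drafter proposes a block (cheap)
            draft.append(drafter(out + draft))
        passes += 1                            # 2) one batched verification pass
        for i, tok in enumerate(draft):
            expected = target(out + draft[:i])
            if tok != expected:
                out.append(expected)           # first mismatch: take the target's token
                break                          # and discard the rest of the draft
            out.append(tok)                    # match: the drafted token comes for free
    if "<eos>" in out:
        out = out[:out.index("<eos>")]
    return out, passes

tokens, passes = speculative_decode()
print(" ".join(tokens), f"-> {len(tokens)} tokens in {passes} verification passes")
```

In this toy run the first block of four draft tokens is accepted in a single verification pass, while the block containing the mismatch ("a" vs. "the") is cut at the wrong token, so six tokens are emitted in three target passes instead of one token per pass for plain autoregressive decoding with the target.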
In “Faster Cascades via Speculative Decoding”, we introduce “speculative cascades”, a new approach that combines the best of both cascades and speculative decoding. It delivers better LLM output quality at a lower computational cost than either technique alone by sometimes deferring to the smaller LLM for the sake of efficiency. We tested new speculative cascading techniques against standard cascading and speculative decoding baselines using Gemma and T5 models on various language tasks, including summarization, translation, reasoning, coding, and question answering. The results show that the proposed speculative cascades achieve better cost-quality trade-offs, often yielding higher speed-ups and better quality metrics compared to the baselines.
A deeper look
To fully understand and appreciate the speculative cascades approach, we first compare cascades and speculative decoding with a simple example. Imagine you ask an LLM a straightforward question:
Prompt: "Who is Buzz Aldrin?"
Let's say we have two models available to answer this: a small, fast "drafter" model and a large, powerful "expert" model.
Here's how they might respond:
- Small Model: Buzz Aldrin is an American former astronaut, engineer, and fighter pilot, best known as the second person to walk on the Moon.
- Large Model: Edwin "Buzz" Aldrin, a pivotal figure in the history of space exploration, is an American former astronaut, engineer, and fighter pilot who is best known for being the second human to walk on the Moon.
Both models provide excellent, factually correct answers, but they interpret the user's intent slightly differently. The small model delivers a quick, factual summary, while the large model provides a more formal, encyclopedic-style entry. Depending on the user's need — be it a fast fact or a detailed overview — either response could be considered ideal. The key is that they represent two distinct, equally valid styles.
Now, let's see how the two main speed-up techniques handle this scenario.
With cascades, the small "drafter" model gets the prompt first. If it's confident in its answer, it replies. If not, it defers the entire task to the large "expert" model.
In our example:
- The small model generates its concise and correct answer.
- It checks its confidence and, finding it high, sends the response to the user.
This works! We get a great answer quickly. But the process is sequential. If the small model hadn't been confident, we would have wasted time waiting for it to finish, only to then start the large model from scratch. This sequential "wait-and-see" approach is a fundamental bottleneck.
With speculative decoding, the small model quickly drafts the first few tokens of the answer, and the large model verifies it in parallel, correcting the first mistake it finds.
In our example:
- The small model drafts the beginning of its answer: [Buzz, Aldrin, is, an, ...]
- The large model verifies this draft. Its own preferred first token is Edwin.
- Since Buzz ≠ Edwin, the very first token is a mismatch.
- The entire draft is rejected and the first token is replaced with Edwin. The process then repeats from this corrected point to generate the rest of the answer, but the initial speed advantage has been lost.
Even though the small model produced a good answer, the requirement to match the large model token-by-token forces a rejection. We lose the speed benefit and end up with an answer that is not necessarily superior. While the above example uses a simple token matching rejection rule, in the full paper, we also include the potential for a "probabilistic match" that provides greater flexibility in the token-by-token comparison.
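For reference, the standard "lossless" probabilistic acceptance rule from the speculative sampling literature is sketched below: a drafted token x is kept with probability min(1, p_target(x) / p_draft(x)), and on rejection a replacement is sampled from the renormalized residual distribution. The paper's probabilistic match is in this spirit, but its exact formulation is defined there, so treat this as background rather than the paper's rule.

```python
# Sketch of the standard probabilistic acceptance rule used in lossless
# speculative sampling; not necessarily the exact rule used in the paper.
import random

def accept_or_resample(x, p_draft, p_target):
    """x: drafted token; p_draft / p_target: dicts mapping token -> probability."""
    ratio = p_target.get(x, 0.0) / max(p_draft.get(x, 1e-9), 1e-9)
    if random.random() < min(1.0, ratio):
        return x, True                       # draft token accepted
    # Rejected: resample from the residual max(0, p_target - p_draft), renormalized,
    # which keeps the overall output distributed exactly as p_target.
    residual = {t: max(0.0, p_target[t] - p_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values()) or 1.0
    r, acc = random.random() * total, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t, False
    return max(p_target, key=p_target.get), False   # numerical edge case

# Example: the drafter is confident in "Buzz" while the target prefers "Edwin";
# "Buzz" is still kept with probability min(1, 0.3 / 0.8) = 0.375.
p_small = {"Buzz": 0.8, "Edwin": 0.2}
p_large = {"Edwin": 0.7, "Buzz": 0.3}
print(accept_or_resample("Buzz", p_small, p_large))
```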
Different goals, different trade-offs
The "Buzz Aldrin" example reveals a fundamental difference between these two techniques: cascades aim at cost savings and will happily ship the small model's answer when it is confident, but they run the models sequentially; speculative decoding aims at latency and runs the models in parallel, but it insists that the final output match the large model token by token.
Speculative cascades: Best of both worlds
Speculative cascades combine the idea of tiered processing from standard cascades with the speedup mechanism of speculative decoding. It involves a smaller model generating a "draft" output that a larger model then quickly verifies in parallel. The key innovation is replacing the strict verification of speculative decoding with a flexible “deferral rule”. This rule dynamically decides, on a token-by-token basis, whether to accept the small model's draft or defer to the large model. This avoids the sequential bottleneck of standard cascades while allowing the system to accept a good answer from the small model even if it doesn't exactly match the large model's preferred output.
In our example (a minimal code sketch of this loop follows the list):
- The small model drafts the beginning of its answer: [Buzz, Aldrin, is, an, ...]
- Simultaneously, the large model evaluates the draft, providing its own scores.
- The crucial step: A flexible deferral rule looks at both outputs and decides whether a deferral is warranted.
- If the system decides not to defer, it accepts the small model's draft tokens. The process then efficiently repeats from this new point, drafting and verifying the next chunk of text until the answer is complete.
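Here is a minimal sketch of this draft-score-decide loop, assuming toy scripted models and a simple comparative deferral rule; the real method's scoring and deferral criteria are defined in the paper, so everything below is illustrative.

```python
# Sketch of a speculative-cascade loop with a pluggable deferral rule.
from typing import Callable, Dict, List

Dist = Dict[str, float]   # token -> probability

def comparative_rule(q: Dist, p: Dist, drafted: str, margin: float = 0.2) -> bool:
    """Defer iff the large model is much more confident in a different token."""
    best_large = max(p, key=p.get)
    return best_large != drafted and p[best_large] - q.get(drafted, 0.0) > margin

def speculative_cascade(drafter: Callable[[List[str]], Dist],
                        target: Callable[[List[str]], Dist],
                        defer: Callable[[Dist, Dist, str], bool],
                        max_len: int = 20, block: int = 4) -> List[str]:
    out: List[str] = []
    while len(out) < max_len and (not out or out[-1] != "<eos>"):
        # 1) The small model drafts a block of tokens greedily (cheap, sequential).
        draft, q_list = [], []
        for _ in range(block):
            q = drafter(out + draft)
            q_list.append(q)
            draft.append(max(q, key=q.get))
        # 2) The large model scores each draft position; in a real system this is
        #    one batched forward pass rather than a Python loop.
        for i, tok in enumerate(draft):
            p = target(out + draft[:i])
            if defer(q_list[i], p, tok):
                out.append(max(p, key=p.get))   # defer: emit the large model's token
                break                           # and discard the rest of the draft
            out.append(tok)                     # accept the small model's token
            if tok == "<eos>":
                break
    return out

# Toy usage: the drafter opens with "Buzz", the target with "Edwin"; they agree
# on everything else. These scripted distributions are purely illustrative.
def toy_drafter(ctx: List[str]) -> Dist:
    script = ["Buzz", "Aldrin", "is", "an", "astronaut", "<eos>"]
    tok = script[min(len(ctx), len(script) - 1)]
    return {tok: 0.9, "<eos>": 0.1} if tok != "<eos>" else {"<eos>": 1.0}

def toy_target(ctx: List[str]) -> Dist:
    script = ["Edwin", "Aldrin", "is", "an", "astronaut", "<eos>"]
    tok = script[min(len(ctx), len(script) - 1)]
    return {tok: 0.8, "<eos>": 0.2} if tok != "<eos>" else {"<eos>": 1.0}

print(" ".join(t for t in speculative_cascade(toy_drafter, toy_target, comparative_rule)
               if t != "<eos>"))
```

With these toy distributions the run keeps the drafter's "Buzz ..." opening, because the large model's preference for "Edwin" is not sufficiently stronger; that is exactly the kind of good-but-different draft that strict speculative decoding would reject.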
The power of this method lies in its flexibility, as the deferral rule can be tailored to different needs.
For example, we could tell the system to defer based on (each rule is sketched in code after this list):
- A simple confidence check: Defer only if the small model isn't very confident in its own prediction.
- A comparative check: Defer if the large model is significantly more confident than the small model.
- A cost-benefit analysis: Defer only if the large model's confidence boost outweighs the "cost" of rejecting the small model's draft.
- A token-specific check: Given an "approved list" of the best next words according to the large model (its top-ranked tokens), we defer if the small model's drafted token is not on this list.
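One way to read the list above is as a family of interchangeable predicates over the two models' next-token distributions. The sketch below writes each rule as such a function; the thresholds (tau, margin, rejection_cost, k) and the exact cost term are assumptions for illustration, and the paper's formal definitions should be treated as authoritative.

```python
# Illustrative formalizations of the four deferral rules, as plug-in functions
# over the small model's distribution q, the large model's distribution p, and
# the drafted token. All thresholds are assumed values for this sketch.

def defer_low_confidence(q, p, tok, tau=0.5):
    # 1) Simple confidence check: defer if the small model itself is unsure.
    return q.get(tok, 0.0) < tau

def defer_comparative(q, p, tok, margin=0.2):
    # 2) Comparative check: defer if the large model is much more confident in a
    #    different token than the small model is in the drafted one.
    best = max(p, key=p.get)
    return best != tok and p[best] - q.get(tok, 0.0) > margin

def defer_cost_benefit(q, p, tok, rejection_cost=0.3):
    # 3) Cost-benefit: defer only if the probability gained by switching to the
    #    large model's top token outweighs a fixed "cost" of rejecting the draft.
    return max(p.values()) - p.get(tok, 0.0) > rejection_cost

def defer_top_k(q, p, tok, k=5):
    # 4) Token-specific check: defer if the drafted token is not on the large
    #    model's "approved list" of its k highest-probability next tokens.
    return tok not in sorted(p, key=p.get, reverse=True)[:k]

# Example: the drafter proposes "Buzz" where the target mildly prefers "Edwin".
q = {"Buzz": 0.85, "Edwin": 0.10}
p = {"Edwin": 0.55, "Buzz": 0.40}
for rule in (defer_low_confidence, defer_comparative, defer_cost_benefit, defer_top_k):
    print(rule.__name__, rule(q, p, "Buzz"))
```

With these toy numbers all four rules accept the drafted "Buzz", whereas strict verification would have rejected it; tightening any threshold flips the corresponding rule toward deferral.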
This ability to plug in different decision-making logic is what gives speculative cascades their unique blend of speed, quality, and adaptability.
To illustrate, consider the behaviour of speculative cascading versus speculative decoding on a prompt from the GSM8K dataset: "Mary has 30 sheep. She gets 1 kg of milk from half of them and 2 kg of milk from the other half every day. How much milk does she collect every day?" By carefully leveraging the small model's output on certain tokens, speculative cascading can reach a correct solution faster than regular speculative decoding.
Experiments
We tested speculative cascades on a range of benchmarks, including summarization, reasoning, and coding. The results show a clear advantage over speculative decoding. On a standard quality-versus-efficiency graph, speculative cascades consistently provide better trade-offs. This means for the same quality level as speculative decoding, our method is faster, i.e., generates more tokens per call to the larger model.
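As a concrete reading of the efficiency axis, one plausible bookkeeping is the average number of emitted tokens per call to the large model, sketched below. This is an illustrative metric only; the paper's reported speed-ups may be defined differently, for example against wall-clock latency.

```python
# One possible efficiency metric: average tokens emitted per large-model call.
# The accounting here is an assumption for illustration.

def tokens_per_target_call(block_lengths):
    """block_lengths[i]: tokens emitted after the i-th large-model verification call."""
    return sum(block_lengths) / len(block_lengths) if block_lengths else 0.0

# Example log: five verification calls emitted 4, 1, 3, 4 and 2 tokens.
blocks = [4, 1, 3, 4, 2]
print(f"{tokens_per_target_call(blocks):.2f} tokens per large-model call")
# Plain autoregressive decoding with the large model emits 1 token per call, so
# this run needs roughly 2.8x fewer large-model calls for the same output length.
```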
Towards faster and smarter AI with speculative cascades
As LLMs become more integrated into daily applications, optimizing their performance isn’t just a technical goal, it’s a practical necessity. By rethinking how cascades and speculative decoding can work together, speculative cascades provide a more powerful and flexible tool for developers. This hybrid approach allows for fine-grained control over the cost-quality balance, paving the way for applications that are both smarter and faster.
Acknowledgements
This work is a collaborative effort with Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta and Sanjiv Kumar. We are grateful to Ananda Theertha Suresh and Ziteng Sun for their insightful discussions, and Yale Cong, Mark Simborg, and Kimberly Schwede for their help in crafting this blog.