
LLMs contain a LOT of parameters. But what's a parameter?

Posted by qimuai · translated and compiled firsthand



Source: https://www.technologyreview.com/2026/01/07/1130795/what-even-is-a-parameter/

Summary:

[In depth] Demystifying LLM parameters: the mathematical code behind AI's "intelligence"

A large language model's (LLM's) "intelligence" is often measured by its parameter count, but what exactly are these "parameters"? How do they let AI understand and generate human language? MIT Technology Review recently dug into this core question.

What parameters are: AI's "dials and levers"
Parameters can be understood as the dials and levers that control a model's behavior. Picture a giant pinball machine: tiny adjustments to billions of paddles and bumpers completely change how the balls move. Likewise, the values of a model's parameters determine how it processes information. GPT-3, released in 2020, has 175 billion parameters, while Google DeepMind's latest model, Gemini 3, is estimated to have at least a trillion. With competition intensifying, leading companies no longer disclose the details of their models' architecture.

Three core types of parameters: embeddings, weights, and biases

  1. Embeddings: turn words into mathematical form. Each word (or word fragment) is encoded as a list of numbers (say, 4,096 values) representing its meaning across contexts. Through training, words with similar meanings (such as "table" and "chair") end up closer together in this numerical space.
  2. Weights: set the strength of the connections between different parts of the model. As the model processes a sentence, the weights adjust each word embedding's meaning in light of the surrounding context, letting the model grasp the sentence as a whole.
  3. Biases: complement the weights by fine-tuning the thresholds at which the model activates. Like turning up a faint voice in a noisy room, biases help the model catch subtle information that would otherwise slip by.

Training: massive computation forges the "intelligence"
Parameters start with random values and are gradually optimized through repeated training. The model compares its predictions against the correct answers, then works backward, adjusting every parameter to shrink the error. In GPT-3's case, each of its 175 billion parameters was updated tens of thousands of times during training, adding up to quadrillions of individual calculations and requiring thousands of high-performance computers running continuously for months.

Why can small models beat big ones?
Current research shows that parameter count is not the only thing that determines a model's performance:

  1. Data volume: a small model trained on far more text ("overtraining") can outperform a larger one trained on less, as Meta's 8-billion-parameter Llama 3 (about 15 trillion words) did against the 70-billion-parameter Llama 2 (about 2 trillion words).
  2. Distillation: a large model's outputs are used to train a smaller one, passing the big model's hard-won lessons down into the small model's parameters.
  3. Mixture of experts: even the largest models can switch on only the parts ("experts") needed for a given input, combining big-model ability with small-model speed and power draw.

The road ahead: from raw scale to efficiency
As the performance gains from adding parameters taper off, AI research is shifting toward using parameters more effectively. Model design is becoming more refined, squeezing more value out of a limited parameter budget through better algorithms and architectures. This marks a new phase for AI: the question is no longer just "how many parameters?" but "how well are they used?"


Full article:

LLMs contain a LOT of parameters. But what’s a parameter?
They’re the mysterious numbers that make your favorite AI models tick. What are they and what do they do?
MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.
I am writing this because one of my editors woke up in the middle of the night and scribbled on a bedside notepad: “What is a parameter?” Unlike a lot of thoughts that hit at 4 a.m., it’s a really good question—one that goes right to the heart of how large language models work. And I’m not just saying that because he’s my boss. (Hi, Boss!)
A large language model’s parameters are often said to be the dials and levers that control how it behaves. Think of a planet-size pinball machine that sends its balls pinging from one end to the other via billions of paddles and bumpers set just so. Tweak those settings and the balls will behave in a different way.
OpenAI’s GPT-3, released in 2020, had 175 billion parameters. Google DeepMind’s latest LLM, Gemini 3, may have at least a trillion—some think it’s probably more like 7 trillion—but the company isn’t saying. (With competition now fierce, AI firms no longer share information about how their models are built.)
But the basics of what parameters are and how they make LLMs do the remarkable things that they do are the same across different models. Ever wondered what makes an LLM really tick—what’s behind the colorful pinball-machine metaphors? Let’s dive in.
What is a parameter?
Think back to middle school algebra, like 2a + b. Those letters are parameters: Assign them values and you get a result. In math or coding, parameters are used to set limits or determine output. The parameters inside LLMs work in a similar way, just on a mind-boggling scale.
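To make that concrete, here is the same idea in a few lines of Python (a toy illustration, not anything from a real LLM): `a` and `b` are the parameters, and changing their values changes what the function does with the same input.
```python
# A toy "model" with two parameters, a and b.
def model(x, a=2.0, b=1.0):
    return a * x + b

print(model(3.0))                # 7.0 with one setting of the parameters
print(model(3.0, a=5.0, b=0.0))  # 15.0 with another: same input, new behavior
```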
How are they assigned their values?
Short answer: an algorithm. When a model is trained, each parameter is set to a random value. The training process then involves an iterative series of calculations (known as training steps) that update those values. In the early stages of training, a model will make errors. The training algorithm looks at each error and goes back through the model, tweaking the value of each of the model’s many parameters so that next time that error is smaller. This happens over and over again until the model behaves in the way its makers want it to. At that point, training stops and the values of the model’s parameters are fixed.
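As a rough sketch of that loop, here is the toy model above being trained with gradient descent to rediscover a "correct" rule from examples. This is a deliberate simplification (one tiny function, squared error), not the actual LLM training algorithm, but the rhythm is the same: predict, measure the error, nudge each parameter, repeat.
```python
import random

# Toy training data generated by the "true" rule y = 3x + 2,
# which training should rediscover.
data = [(x, 3.0 * x + 2.0) for x in range(10)]

a, b = random.random(), random.random()   # parameters start out random
lr = 0.005                                # size of each nudge

for step in range(1000):                  # many small training steps
    for x, y_true in data:
        error = (a * x + b) - y_true      # how wrong the model is here
        a -= lr * 2 * error * x           # nudge each parameter so that
        b -= lr * 2 * error               # this error shrinks next time

print(round(a, 2), round(b, 2))           # close to 3.0 and 2.0: now fixed
```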
Sounds straightforward …
In theory! In practice, because LLMs are trained on so much data and contain so many parameters, training them requires a huge number of steps and an eye-watering amount of computation. During training, the 175 billion parameters inside a medium-size LLM like GPT-3 will each get updated tens of thousands of times. In total, that adds up to quadrillions (a number with 15 zeros) of individual calculations. That’s why training an LLM takes so much energy. We’re talking about thousands of specialized high-speed computers running nonstop for months.
Oof. What are all these parameters for, exactly?
There are three different types of parameters inside an LLM that get their values assigned through training: embeddings, weights, and biases. Let’s take each of those in turn.
Okay! So, what are embeddings?
An embedding is the mathematical representation of a word (or part of a word, known as a token) in an LLM’s vocabulary. An LLM’s vocabulary, which might contain up to a few hundred thousand unique tokens, is set by its designers before training starts. But there’s no meaning attached to those words. That comes during training.
When a model is trained, each word in its vocabulary is assigned a numerical value that captures the meaning of that word in relation to all the other words, based on how the word appears in countless examples across the model’s training data.
Each word gets replaced by a kind of code?
Yeah. But there’s a bit more to it. The numerical value—the embedding—that represents each word is in fact a list of numbers, with each number in the list representing a different facet of meaning that the model has extracted from its training data. The length of this list of numbers is another thing that LLM designers can specify before an LLM is trained. A common size is 4,096.
Every word inside an LLM is represented by a list of 4,096 numbers?
Yup, that’s an embedding. And each of those numbers is tweaked during training. An LLM with embeddings that are 4,096 numbers long is said to have 4,096 dimensions.
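A minimal sketch of such a lookup table, shrunk to a toy vocabulary and 8 dimensions instead of 4,096 (the values here are random, as they would be before training; `numpy` is assumed):
```python
import numpy as np

vocab = ["table", "chair", "astronaut", "moon"]
dims = 8                                   # real models often use 4,096

# One row of `dims` numbers per word; random until training tunes them.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), dims))

def embed(word):
    return embeddings[vocab.index(word)]

print(embed("table"))   # the list of numbers that stands in for "table"
```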
Why 4,096?
It might look like a strange number. But LLMs (like anything that runs on a computer chip) work best with powers of two—2, 4, 8, 16, 32, 64, and so on. LLM engineers have found that 4,096 is a power of two that hits a sweet spot between capability and efficiency. Models with fewer dimensions are less capable; models with more dimensions are too expensive or slow to train and run.
Using more numbers allows the LLM to capture very fine-grained information about how a word is used in many different contexts, what subtle connotations it might have, how it relates to other words, and so on.
Back in February, OpenAI released GPT-4.5, the firm’s largest LLM yet (some estimates have put its parameter count at more than 10 trillion). Nick Ryder, a research scientist at OpenAI who worked on the model, told me at the time that bigger models can work with extra information, like emotional cues, such as when a speaker’s words signal hostility: “All of these subtle patterns that come through a human conversation—those are the bits that these larger and larger models will pick up on.”
The upshot is that all the words inside an LLM get encoded into a high-dimensional space. Picture thousands of words floating in the air around you. Words that are closer together have similar meanings. For example, “table” and “chair” will be closer to each other than they are to “astronaut,” which is close to “moon” and “Musk.” Way off in the distance you can see “prestidigitation.” It’s a little like that, but instead of being related to each other across three dimensions, the words inside an LLM are related across 4,096 dimensions.
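"Closeness" here can be measured, for instance, with cosine similarity between the lists of numbers. A tiny hand-made illustration (the three vectors below are invented for the example; real embeddings are learned, not set by hand):
```python
import numpy as np

# Invented 3-D stand-ins: "table" and "chair" point roughly the same way.
vectors = {
    "table":     np.array([0.9, 0.1, 0.0]),
    "chair":     np.array([0.8, 0.2, 0.1]),
    "astronaut": np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(vectors["table"], vectors["chair"]))      # high: near neighbors
print(cosine(vectors["table"], vectors["astronaut"]))  # low: far apart
```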
Yikes.
It’s dizzying stuff. In effect, an LLM compresses the entire internet into a single monumental mathematical structure that encodes an unfathomable amount of interconnected information. It’s both why LLMs can do astonishing things and why they’re impossible to fully understand.
Okay. So that’s embeddings. What about weights?
A weight is a parameter that represents the strength of a connection between different parts of a model—and one of the most common types of dial for tuning a model’s behavior. Weights are used when an LLM processes text.
When an LLM reads a sentence (or a book chapter), it first looks up the embeddings for all the words and then passes those embeddings through a series of neural networks, known as transformers, that are designed to process sequences of data (like text) all at once. Every word in the sentence gets processed in relation to every other word.
This is where weights come in. An embedding represents the meaning of a word without context. When a word appears in a specific sentence, transformers use weights to process the meaning of that word in that new context. (In practice, this involves multiplying each embedding by the weights for all other words.)
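Here is one way to sketch that step in code: a stripped-down version of the attention computation inside a transformer, with random numbers standing in for trained weights. Every word's vector is scored against every other word's, and the scores decide how the embeddings get blended into context-aware versions.
```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dims = 3, 8                      # a 3-word sentence, toy dimensions

X = rng.normal(size=(seq_len, dims))      # one embedding per word, no context
W = rng.normal(size=(dims, dims)) / dims  # trained weights (random stand-ins)

scores = (X @ W) @ X.T                    # every word scored against every word
mix = np.exp(scores)
mix /= mix.sum(axis=1, keepdims=True)     # normalize each word's scores

contextual = mix @ X                      # blend embeddings by those scores
print(contextual.shape)                   # (3, 8): one updated vector per word
```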
And biases?
Biases are another type of dial that complement the effects of the weights. Weights set the thresholds at which different parts of a model fire (and thus pass data on to the next part). Biases are used to adjust those thresholds so that an embedding can trigger activity even when its value is low. (Biases are values that are added to an embedding rather than multiplied with it.)
By shifting the thresholds at which parts of a model fire, biases allow the model to pick up information that might otherwise be missed. Imagine you’re trying to hear what somebody is saying in a noisy room. Weights would amplify the loudest voices the most; biases are like a knob on a listening device that pushes quieter voices up in the mix.
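A one-neuron sketch of the weights-plus-bias arithmetic (the ReLU activation used here is one common choice for the "fire or stay silent" step):
```python
import numpy as np

def neuron(x, weights, bias):
    # Weighted sum of inputs, shifted by the bias, then a fire-or-not
    # activation (ReLU): anything below zero stays silent.
    return max(0.0, float(weights @ x) + bias)

x = np.array([0.1, 0.2, 0.05])        # a quiet input signal
w = np.array([1.0, 1.0, 1.0])

print(neuron(x, w, bias=-0.5))        # 0.0: never clears the threshold
print(neuron(x, w, bias=0.2))         # 0.55: the bias lets it through
```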
Here’s the TL;DR: Weights and biases are two different ways that an LLM extracts as much information as it can out of the text it is given. And both types of parameters are adjusted over and over again during training to make sure they do this.
Okay. What about neurons? Are they a type of parameter too?
No, neurons are more a way to organize all this math—containers for the weights and biases, strung together by a web of pathways between them. It’s all very loosely inspired by biological neurons inside animal brains, with signals from one neuron triggering new signals from the next and so on.
Each neuron in a model holds a single bias and weights for every one of the model’s dimensions. In other words, if a model has 4,096 dimensions—and therefore its embeddings are lists of 4,096 numbers—then each of the neurons in that model will hold one bias and 4,096 weights.
Neurons are arranged in layers. In most LLMs, each neuron in one layer is connected to every neuron in the layer above. A 175-billion-parameter model like GPT-3 might have around 100 layers with a few tens of thousands of neurons in each layer. And each neuron is running tens of thousands of computations at a time.
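In code, a whole layer of such neurons is just a weight matrix (one row of weights per neuron) plus a vector of biases, applied to an embedding in one shot. A toy-size sketch:
```python
import numpy as np

rng = np.random.default_rng(0)
dims, n_neurons = 8, 4                  # toy sizes; real layers are far wider

W = rng.normal(size=(n_neurons, dims))  # each row: one neuron's 8 weights
b = rng.normal(size=n_neurons)          # one bias per neuron

def layer(x):
    return np.maximum(0.0, W @ x + b)   # all neurons fire (or not) at once

x = rng.normal(size=dims)               # an 8-number embedding
print(layer(x))                         # one output per neuron
```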
Dizzy again. That’s a lot of math.
That’s a lot of math.
And how does all of that fit together? How does an LLM take a bunch of words and decide what words to give back?
When an LLM processes a piece of text, the numerical representation of that text—the embedding—gets passed through multiple layers of the model. In each layer, the value of the embedding (that list of 4,096 numbers) gets updated many times by a series of computations involving the model’s weights and biases (attached to the neurons) until it gets to the final layer.
The idea is that all the meaning and nuance and context of that input text is captured by the final value of the embedding after it has gone through a mind-boggling series of computations. That value is then used to calculate the next word that the LLM should spit out.
It won’t be a surprise that this is more complicated than it sounds: The model in fact calculates, for every word in its vocabulary, how likely that word is to come next and ranks the results. It then picks the top word. (Kind of. See below …)
That word is appended to the previous block of text, and the whole process repeats until the LLM calculates that the most likely next word to spit out is one that signals the end of its output.
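A skeleton of that generation loop, with a dummy function standing in for the full stack of layers (everything here is a placeholder except the shape of the loop itself):
```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "<end>"]
dims = 8

W_out = rng.normal(size=(len(vocab), dims))  # maps a final embedding to one
                                             # score per word in the vocabulary

def run_layers(tokens):
    # Placeholder for the real model: returns a made-up final embedding.
    return rng.normal(size=dims)

tokens = ["the"]
for _ in range(10):
    scores = W_out @ run_layers(tokens)        # rank every word in the vocab
    next_word = vocab[int(np.argmax(scores))]  # pick the top-ranked word
    if next_word == "<end>":                   # a special token ends the turn
        break
    tokens.append(next_word)                   # append and go around again

print(tokens)
```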
That’s it?
Sure. Well …
Go on.
LLM designers can also specify a handful of other parameters, known as hyperparameters. The main ones are called temperature, top-p, and top-k.
You’re making this up.
Temperature is a parameter that acts as a kind of creativity dial. It influences the model’s choice of what word comes next. I just said that the model ranks the words in its vocabulary and picks the top one. But the temperature parameter can be used to push the model to choose the most probable next word, making its output more factual and relevant, or a less probable word, making the output more surprising and less robotic.
Top-p and top-k are two more dials that control the model’s choice of next words. They are settings that force the model to pick a word at random from a pool of most probable words instead of the top word. These parameters affect how the model comes across—quirky and creative versus trustworthy and dull.
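A sketch of how temperature and top-k enter the picture (top-p works analogously, keeping the smallest pool of words whose probabilities add up to p; the scores below are invented):
```python
import numpy as np

rng = np.random.default_rng(0)

def pick_next(scores, temperature=1.0, top_k=None):
    scores = np.asarray(scores, dtype=float) / temperature  # low T sharpens,
    probs = np.exp(scores - scores.max())                   # high T flattens
    probs /= probs.sum()
    if top_k is not None:                  # keep only the k likeliest words
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))   # sample from the pool

scores = [2.0, 1.5, 0.3, -1.0]                      # one score per candidate
print(pick_next(scores, temperature=0.2))           # nearly always word 0
print(pick_next(scores, temperature=2.0, top_k=3))  # more adventurous picks
```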
One last question! There has been a lot of buzz about small models that can outperform big models. How does a small model do more with fewer parameters?
That’s one of the hottest questions in AI right now. There are a lot of different ways it can happen. Researchers have found that the amount of training data makes a huge difference. First you need to make sure the model sees enough data: An LLM trained on too little text won’t make the most of all its parameters, and a smaller model trained on the same amount of data could outperform it.
Another trick researchers have hit on is overtraining. Showing models far more data than previously thought necessary seems to make them perform better. The result is that a small model trained on a lot of data can outperform a larger model trained on less data. Take Meta’s Llama LLMs. The 70-billion-parameter Llama 2 was trained on around 2 trillion words of text; the 8-billion-parameter Llama 3 was trained on around 15 trillion words of text. The far smaller Llama 3 is the better model.
A third technique, known as distillation, uses a larger model to train a smaller one. The smaller model is trained not only on the raw training data but also on the outputs of the larger model’s internal computations. The idea is that the hard-won lessons encoded in the parameters of the larger model trickle down into the parameters of the smaller model, giving it a boost.
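One common way distillation is set up in practice (a sketch, not necessarily how any particular lab does it) is to train the student to match the teacher's whole probability distribution over the vocabulary, not just the single correct word:
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Scores over a 4-word vocabulary for the same input, from each model:
teacher_logits = np.array([3.0, 1.0, 0.2, -1.0])  # big model's verdict
student_logits = np.array([1.0, 1.2, 0.1, 0.3])   # small model's verdict

# Cross-entropy between the two distributions: low when the student
# reproduces the teacher's confidence about *every* word.
loss = -np.sum(softmax(teacher_logits) * np.log(softmax(student_logits)))
print(loss)   # training adjusts the student's parameters to shrink this
```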
In fact, the days of single monolithic models may be over. Even the largest models on the market, like OpenAI’s GPT-5 and Google DeepMind’s Gemini 3, can be thought of as several small models in a trench coat. Using a technique called “mixture of experts,” large models can turn on just the parts of themselves (the “experts”) that are required to process a specific piece of text. This combines the abilities of a large model with the speed and lower power consumption of a small one.
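A toy sketch of that routing step (random matrices stand in for trained experts; the point is that only the chosen experts do any work):
```python
import numpy as np

rng = np.random.default_rng(0)
dims, n_experts, top = 8, 4, 2          # route each input to 2 of 4 experts

experts = [rng.normal(size=(dims, dims)) for _ in range(n_experts)]
router = rng.normal(size=(n_experts, dims))   # learned routing weights

def mixture_of_experts(x):
    scores = router @ x                       # how suited each expert is
    chosen = np.argsort(scores)[-top:]        # switch on only the best few
    gates = np.exp(scores[chosen])
    gates /= gates.sum()                      # blend the chosen experts
    return sum(g * (experts[i] @ x) for g, i in zip(gates, chosen))

print(mixture_of_experts(rng.normal(size=dims)).shape)  # (8,): same output
```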
But that’s not the end of it. Researchers are still figuring out ways to get the most out of a model’s parameters. As the gains from straight-up scaling tail off, jacking up the number of parameters no longer seems to make the difference it once did. It’s not so much how many you have, but what you do with them.
Can I see one?
You want to see a parameter? Knock yourself out: Here's an embedding.
