How do AI models generate videos?
Source: https://www.technologyreview.com/2025/09/12/1123562/how-do-ai-models-generate-videos/
Summary:
[AI video generation enters a breakout phase: from still images to moving pictures]
With high-end video generation tools such as OpenAI's Sora, Google DeepMind's Veo 3, and Runway's Gen-4 now open to the public, AI video generation is reaching ordinary users at an unprecedented pace. These tools can already produce clips that are nearly indistinguishable from real footage, and Netflix has used the technology for the first time for a visual effect in its mass-market show The Eternaut.
Although demo clips are carefully cherry-picked, paying subscribers can already try video generation through apps such as ChatGPT and Gemini. Type a text prompt (for example, "make a video of a unicorn eating spaghetti") and the system generates a result through multiple rounds of iteration with a latent diffusion transformer. The output is hit or miss, and it usually takes repeated prompt tweaks to get something close to what you want.
The technical core: three components working together
- Diffusion model: a neural network trained to reverse the process of spattering an image with random noise, gradually turning random static back into a coherent picture. It learns to do this by analyzing huge numbers of images at various stages of pixelation.
- Latent diffusion: to cut the energy cost, the system compresses video frames into a mathematical code that keeps only the essential features of the data. The process resembles the compress-then-decompress pipeline used for video streaming and makes the computation far more efficient.
- Transformer: keeps the generated frames consistent with one another, so objects and lighting do not change abruptly. OpenAI dices videos across space and time into cube-like chunks the model can process, which lets it handle formats ranging from vertical phone clips to wide-screen cinema (a high-level sketch of the whole pipeline follows this list).
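As referenced above, here is a deliberately tiny, self-contained sketch of how those three pieces fit together: encode the prompt, start from random static in a compressed latent space, repeatedly denoise it with a sequence model steered by the prompt, then decode the result back into pixels. Every component below is an untrained stand-in (random weight matrices and an ad hoc update rule), not any vendor's actual architecture.

```python
# Minimal toy sketch of the latent-diffusion-transformer recipe (NumPy only); every
# component is an untrained stand-in, used solely to show the shape of the pipeline.
import numpy as np

rng = np.random.default_rng(0)
LATENT, FRAMES, STEPS = 64, 8, 20      # compressed code size, frame count, denoising steps

prompt_embedding = rng.normal(size=LATENT)                           # stand-in text encoder output
denoiser_weights = rng.normal(size=(2 * LATENT, LATENT)) / LATENT    # stand-in "denoiser model"
decoder_weights = rng.normal(size=(LATENT, 3 * 64 * 64)) / LATENT    # stand-in latent-to-pixel decoder

latents = rng.normal(size=(FRAMES, LATENT))           # pure random static, in compressed form
for step in range(STEPS):
    cond = np.tile(prompt_embedding, (FRAMES, 1))     # the prompt steers every cleanup step
    predicted_noise = np.concatenate([latents, cond], axis=1) @ denoiser_weights
    latents = latents - predicted_noise / STEPS       # crude stand-in for a scheduler update

video = (latents @ decoder_weights).reshape(FRAMES, 3, 64, 64)   # decode to watchable pixels
print(video.shape)                                    # (8, 3, 64, 64)
```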
A major breakthrough: generating audio and video together
Google's Veo 3 is the first to generate synchronized sound and picture: by compressing audio and video data into a single stream, its diffusion model can process the visual and auditory elements together, marking the step from the "silent era" of AI video generation into the era of sound.
Concerns and challenges
The technology faces three main points of contention: creators must compete with a flood of AI-generated content; faked news footage is proliferating on social media; and generating a video consumes far more energy than generating text or images. Notably, although video generation is energy hungry, diffusion models themselves are actually more efficient than transformers, and Google DeepMind is experimenting with diffusion-based large language models, which could bring efficiency gains.
As the technology iterates rapidly, AI video generation is turning from a professional tool into a mass creative medium, but its energy consumption, copyright disputes, and questions of authenticity still need to be confronted by industry and society together.
(Compiled from the MIT Technology Review Explains series.)
English source:
How do AI models generate videos?
With powerful video generation tools now in the hands of more people than ever, let's take a look at how they work.
MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.
It’s been a big year for video generation. In the last nine months OpenAI made Sora public, Google DeepMind launched Veo 3, and the video startup Runway launched Gen-4. All can produce video clips that are (almost) impossible to distinguish from actual filmed footage or CGI animation. This year also saw Netflix debut an AI visual effect in its show The Eternaut, the first time video generation has been used to make mass-market TV.
Sure, the clips you see in demo reels are cherry-picked to showcase a company’s models at the top of their game. But with the technology in the hands of more users than ever before—Sora and Veo 3 are available in the ChatGPT and Gemini apps for paying subscribers—even the most casual filmmaker can now knock out something remarkable.
The downside is that creators are competing with AI slop, and social media feeds are filling up with faked news footage. Video generation also uses up a huge amount of energy, many times more than text or image generation.
With AI-generated videos everywhere, let's take a moment to talk about the tech that makes them work.
How do you generate a video?
Let’s assume you’re a casual user. There are now a range of high-end tools that allow pro video makers to insert video generation models into their workflows. But most people will use this technology in an app or via a website. You know the drill: “Hey, Gemini, make me a video of a unicorn eating spaghetti. Now make its horn take off like a rocket.” What you get back will be hit or miss, and you’ll typically need to ask the model to take another pass or 10 before you get more or less what you wanted.
So what’s going on under the hood? Why is it hit or miss—and why does it take so much energy? The latest wave of video generation models are what’s known as latent diffusion transformers. Yes, that’s quite a mouthful. Let’s unpack each part in turn, starting with diffusion.
What’s a diffusion model?
Imagine taking an image and adding a random spattering of pixels to it. Take that pixel-spattered image and spatter it again and then again. Do that enough times and you will have turned the initial image into a random mess of pixels, like static on an old TV set.
A diffusion model is a neural network trained to reverse that process, turning random static into images. During training, it gets shown millions of images in various stages of pixelation. It learns how those images change each time new pixels are thrown at them and, thus, how to undo those changes.
The upshot is that when you ask a diffusion model to generate an image, it will start off with a random mess of pixels and step by step turn that mess into an image that is more or less similar to images in its training set.
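To make the two halves of that process concrete, here is a minimal sketch in NumPy. The forward half (spattering noise) needs no model at all; the reverse half is where a trained network would come in, so an untrained placeholder function stands in for it, purely to show the shape of the loop.

```python
# Toy illustration of diffusion: noise an image until it is static, then show the
# structure of the reverse loop that generation follows. No real model is involved.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))                 # stand-in for one training image

# Forward (noising) process: after enough steps the image is unrecognizable static.
noisy = image.copy()
for step in range(1000):
    noisy = 0.99 * noisy + 0.01 * rng.normal(size=noisy.shape)

def predict_noise(x, step):
    """Placeholder for the trained neural network; a real model learns this mapping."""
    return np.zeros_like(x)

# Reverse (generation) process: start from fresh static and clean it up step by step.
sample = rng.normal(size=(64, 64))
for step in reversed(range(1000)):
    sample = sample - predict_noise(sample, step)    # each step removes a little noise
```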
But you don’t want any image—you want the image you specified, typically with a text prompt. And so the diffusion model is paired with a second model—such as a large language model (LLM) trained to match images with text descriptions—that guides each step of the cleanup process, pushing the diffusion model toward images that the large language model considers a good match to the prompt.
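One common way such steering is implemented is classifier-free guidance (an assumption on my part; the article does not say which mechanism any given product uses): at each cleanup step the model predicts the noise twice, once with the text embedding and once without, and the update is pushed in the direction the prompt favors. A generic sketch:

```python
# Generic sketch of classifier-free guidance; function and variable names are
# illustrative, not any specific model's API.
import numpy as np

def guided_noise(x, step, prompt_embedding, predict_noise, guidance_scale=7.5):
    """Blend an unconditional and a prompt-conditioned noise prediction."""
    uncond = predict_noise(x, step, None)               # ignore the prompt
    cond = predict_noise(x, step, prompt_embedding)     # follow the prompt
    return uncond + guidance_scale * (cond - uncond)    # push toward the prompt

# Dummy stand-ins so the function can be exercised end to end.
dummy_predict = lambda x, step, cond: np.zeros_like(x)
x = np.random.default_rng(0).normal(size=(64, 64))
x = x - guided_noise(x, step=999, prompt_embedding=np.ones(512), predict_noise=dummy_predict)
```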
An aside: This LLM isn’t pulling the links between text and images out of thin air. Most text-to-image and text-to-video models today are trained on large data sets that contain billions of pairings of text and images or text and video scraped from the internet (a practice many creators are very unhappy about). This means that what you get from such models is a distillation of the world as it’s represented online, distorted by prejudice (and pornography).
It's easiest to imagine diffusion models working with images. But the technique can be used with many kinds of data, including audio and video. To generate movie clips, a diffusion model must clean up sequences of images—the consecutive frames of a video—instead of just one image.
What’s a latent diffusion model?
All this takes a huge amount of compute (read: energy). That’s why most diffusion models used for video generation use a technique called latent diffusion. Instead of processing raw data—the millions of pixels in each video frame—the model works in what’s known as a latent space, in which the video frames (and text prompt) are compressed into a mathematical code that captures just the essential features of the data and throws out the rest.
A similar thing happens whenever you stream a video over the internet: A video is sent from a server to your screen in a compressed format to make it get to you faster, and when it arrives, your computer or TV will convert it back into a watchable video.
And so the final step is to decompress what the latent diffusion process has come up with. Once the compressed frames of random static have been turned into the compressed frames of a video that the LLM guide considers a good match for the user’s prompt, the compressed video gets converted into something you can watch.
With latent diffusion, the diffusion process works more or less the way it would for an image. The difference is that the pixelated video frames are now mathematical encodings of those frames rather than the frames themselves. This makes latent diffusion far more efficient than a typical diffusion model. (Even so, video generation still uses more energy than image or text generation. There’s just an eye-popping amount of computation involved.)
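Some toy numbers make the saving obvious. In the sketch below the learned encoder is replaced by simple 8×8 average pooling (an arbitrary stand-in), but the point survives: the denoising loop touches a few thousand values per frame instead of a few hundred thousand, and only the final result is decoded back into pixels.

```python
# Why latent diffusion is cheaper: compare the amount of data the denoiser touches
# per frame in pixel space versus in a crude stand-in latent space.
import numpy as np

rng = np.random.default_rng(0)
frame = rng.random((3, 256, 256))     # raw frame: 3 * 256 * 256 = 196,608 values

# Crude stand-in for the learned encoder: average-pool each 8x8 block of pixels.
latent = frame.reshape(3, 32, 8, 32, 8).mean(axis=(2, 4))   # -> (3, 32, 32) = 3,072 values

print(frame.size, "values per frame in pixel space")
print(latent.size, "values per frame in latent space")      # 64x fewer numbers to denoise
```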
What’s a latent diffusion transformer?
Still with me? There’s one more piece to the puzzle—and that’s how to make sure the diffusion process produces a sequence of frames that are consistent, maintaining objects and lighting and so on from one frame to the next. OpenAI did this with Sora by combining its diffusion model with another kind of model called a transformer. This has now become standard in generative video.
Transformers are great at processing long sequences of data, like words. That has made them the special sauce inside large language models such as OpenAI’s GPT-5 and Google DeepMind’s Gemini, which can generate long sequences of words that make sense, maintaining consistency across many dozens of sentences.
But videos are not made of words. Instead, videos get cut into chunks that can be treated as if they were. The approach that OpenAI came up with was to dice videos up across both space and time. “It’s like if you were to have a stack of all the video frames and you cut little cubes from it,” says Tim Brooks, a lead researcher on Sora.
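Here is what that dicing can look like in code, with arbitrary toy sizes. A short video is reshaped into little spacetime cubes, and each cube is flattened into one "token" the transformer can treat like a word in a sequence.

```python
# Cutting "little cubes" from a stack of frames: dice a video tensor into spacetime
# patches and flatten each one into a token. Patch sizes here are arbitrary.
import numpy as np

frames, height, width, channels = 16, 64, 64, 3
video = np.random.default_rng(0).random((frames, height, width, channels))

t, h, w = 4, 8, 8                                     # spacetime cube size
patches = video.reshape(frames // t, t, height // h, h, width // w, w, channels)
patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)      # group each cube's contents together
tokens = patches.reshape(-1, t * h * w * channels)    # one row per cube

print(tokens.shape)   # (256, 768): 256 cubes, each flattened into 768 values
```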
Using transformers alongside diffusion models brings several advantages. Because they are designed to process sequences of data, transformers also help the diffusion model maintain consistency across frames as it generates them. This makes it possible to produce videos in which objects don’t pop in and out of existence, for example.
And because the videos are diced up, their size and orientation do not matter. This means that the latest wave of video generation models can be trained on a wide range of example videos, from short vertical clips shot with a phone to wide-screen cinematic films. The greater variety of training data has made video generation far better than it was just two years ago. It also means that video generation models can now be asked to produce videos in a variety of formats.
What about the audio?
A big advance with Veo 3 is that it generates video with audio, from lip-synched dialogue to sound effects to background noise. That’s a first for video generation models. As Google DeepMind CEO Demis Hassabis put it at this year’s Google I/O: “We’re emerging from the silent era of video generation.”
The challenge was to find a way to line up video and audio data so that the diffusion process would work on both at the same time. Google DeepMind’s breakthrough was a new way to compress audio and video into a single piece of data inside the diffusion model. When Veo 3 generates a video, its diffusion model produces audio and video together in a lockstep process, ensuring that the sound and images are synched.
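Conceptually, that "single piece of data" can be pictured as one combined array per slice of time: the compressed video for that moment sits alongside the matching compressed audio, and the same denoising loop updates both together so they stay in sync. The sketch below is only a schematic of that idea, with made-up shapes and a placeholder denoiser; it is not Veo 3's actual design.

```python
# Schematic of joint audio-video diffusion: pack both modalities into one array and
# denoise them in lockstep. Shapes and the denoiser are placeholders.
import numpy as np

rng = np.random.default_rng(0)
video_latents = rng.normal(size=(16, 256))   # 16 time slices, 256 video-code values each
audio_latents = rng.normal(size=(16, 64))    # the matching 16 slices of audio code

joint = np.concatenate([video_latents, audio_latents], axis=1)   # one (16, 320) stream

def predict_noise(x, step):
    """Placeholder for the trained joint denoiser."""
    return np.zeros_like(x)

for step in reversed(range(50)):
    joint = joint - predict_noise(joint, step)        # picture and sound updated together

video_out, audio_out = joint[:, :256], joint[:, 256:] # split back apart after generation
```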
You said that diffusion models can generate different kinds of data. Is this how LLMs work too?
No—or at least not yet. Diffusion models are most often used to generate images, video, and audio. Large language models—which generate text (including computer code)—are built using transformers. But the lines are blurring. We’ve seen how transformers are now being combined with diffusion models to generate videos. And this summer Google DeepMind revealed that it was building an experimental large language model that used a diffusion model instead of a transformer to generate text.
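What "generating text with diffusion" can look like is easiest to see with a toy example. One family of approaches starts from a fully masked sentence and fills tokens in over a few parallel refinement rounds rather than writing one word at a time. The sketch below uses a random picker in place of a trained model and is not a description of Google DeepMind's experimental system.

```python
# Toy sketch of masked "text diffusion": begin with all tokens masked and progressively
# commit tokens over a few rounds. A random picker stands in for the trained model.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "unicorn", "eats", "spaghetti", "happily"]
length, rounds = 5, 4
sentence = ["[MASK]"] * length                 # the all-noise starting point

for r in range(rounds):
    for i in range(length):
        if sentence[i] == "[MASK]" and rng.random() < (r + 1) / rounds:
            sentence[i] = vocab[rng.integers(len(vocab))]   # a real model would pick the best fit
    print("round", r, ":", " ".join(sentence))
```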
Here’s where things start to get confusing: Though video generation (which uses diffusion models) consumes a lot of energy, diffusion models themselves are in fact more efficient than transformers. Thus, by using a diffusion model instead of a transformer to generate text, Google DeepMind’s new LLM could be a lot more efficient than existing LLMs. Expect to see more from diffusion models in the near future!