How can Doubao achieve real-time video conversations with people? Doesn't the token count explode? Does it use a small model? Or does it extract key frames?
This is a very professional and incisive question. The reason Doubao (and similar models like GPT-4o, Gemini Live) can achieve smooth real-time video conversations without exploding token counts or causing high latency due to massive video data primarily relies on several key optimization strategies within the "Multimodal Large Model Technology Stack."
Simply put, it is not processing the raw video stream, nor is it looking at every single frame.
Here is a breakdown of the core technical principles behind it:
Your guess is correct: extracting key frames (frame sampling) is the most fundamental and crucial step.
Mechanism: Video is essentially a continuous stream of images (typically 30 or 60 frames per second, fps). The AI doesn't need to look at 60 images per second to understand what you're doing.
Sampling Rate: The model typically samples at an extremely sparse frequency. For example, extracting only 1-2 frames per second, or extracting based on "motion change" (no extraction when the scene is static, extraction when there's significant change).
Effect: This directly reduces the data processing volume by 30-60 times. For the AI, it's like watching a slideshow, not a movie, but this is sufficient to understand the scene.
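To make this concrete, here is a minimal sketch of what such a sampler might look like, assuming frames arrive as NumPy arrays from some capture loop; the sampling rate and motion threshold are illustrative values, not Doubao's actual settings:

```python
import numpy as np

def sample_frames(frames, fps=30, target_fps=1.0, motion_threshold=12.0):
    """Keep roughly `target_fps` frames per second, plus any frame whose pixels
    differ a lot from the last kept frame (illustrative logic only)."""
    keep_every = int(fps / target_fps)              # e.g. keep 1 frame out of every 30
    kept, last_kept = [], None
    for i, frame in enumerate(frames):
        on_schedule = (i % keep_every == 0)
        moved = (
            last_kept is not None
            and np.abs(frame.astype(np.float32) - last_kept).mean() > motion_threshold
        )
        if on_schedule or moved:
            kept.append(frame)
            last_kept = frame.astype(np.float32)
    return kept

# 10 seconds of a static 30 fps scene: 300 identical frames
video = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(300)]
print(len(sample_frames(video)))                    # 10 frames survive instead of 300
```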
Even if the model looks at only 1 image per second, directly converting a 1080p image into tokens would still produce an enormous number of them. This is where a Visual Encoder (such as ViT or variants of VQ-VAE) comes into play.
Patching: The model doesn't process pixel by pixel but slices the image into small patches (e.g., 16x16 pixel patches).
Feature Compression: These patches are passed through the encoder and compressed into high-dimensional vectors. A multi-megabyte image might ultimately be represented by just 256 or 512 tokens (or even fewer, depending on the compression rate).
Conclusion: Even if you chat with the camera on for 1 minute, for the model, it might only add tens of thousands of tokens, not hundreds of millions of data points.
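A minimal sketch of the patching-plus-projection idea described above; the encoder here is just a random linear map standing in for a trained ViT, and the patch size and token width are illustrative:

```python
import numpy as np

PATCH = 16        # 16x16-pixel patches, as mentioned above
D_MODEL = 1024    # width of each token vector (illustrative)

rng = np.random.default_rng(0)
W_proj = rng.normal(size=(PATCH * PATCH * 3, D_MODEL))   # stand-in for a trained encoder

def image_to_tokens(image):
    """Slice an HxWx3 image into patches and project each patch to one token vector."""
    h, w, _ = image.shape
    patches = (
        image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, 3)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, PATCH * PATCH * 3)
    )
    return patches @ W_proj                                # (num_patches, D_MODEL)

frame = np.zeros((336, 336, 3))                            # one downsampled video frame
print(image_to_tokens(frame).shape)                        # (441, 1024): 21x21 patches -> 441 tokens
```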
To prevent tokens from accumulating infinitely as the conversation goes on (leading to "explosion"), the system uses a Sliding Window or dynamic discarding mechanism.
Short-term Memory: The model typically only focuses on "recent" visual information. For instance, it might only keep the visual tokens from the last 10-20 frames in memory.
Discarding Old Data: Once you move the camera from a "water cup" to a "computer," the visual tokens for the "water cup" become unimportant after a few rounds of dialogue and are discarded from the context or compressed into a brief text summary.
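A minimal sketch of the sliding-window idea, keeping only the visual tokens of the most recent frames; the window size is an assumption:

```python
from collections import deque

class VisualContext:
    """Keep visual tokens for only the last `max_frames` frames; older frames simply
    fall out of the window (a real system might first summarize them into text)."""
    def __init__(self, max_frames=16):
        self.window = deque(maxlen=max_frames)

    def add_frame(self, frame_tokens):
        self.window.append(frame_tokens)     # when full, the oldest frame is dropped

    def context_tokens(self):
        return [tok for frame in self.window for tok in frame]

ctx = VisualContext(max_frames=16)
for second in range(120):                    # a 2-minute call, sampled at 1 frame/second
    ctx.add_frame([f"frame{second}_tok{i}" for i in range(64)])
print(len(ctx.context_tokens()))             # stays at 16 * 64 = 1024, never explodes
```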
Previous technology was a "relay race":
Speech-to-text -> Screenshot -> Visual model identifies image -> Text model generates reply -> Text-to-speech.
This pipeline was too slow. Current Doubao (especially Doubao-Vision models) and GPT-4o tend to use Native Multimodal or highly optimized cascaded architectures:
Native Understanding: The model directly "hears" the audio and "sees" the visual tokens. It doesn't need the intermediate step of converting to text (or the conversion is extremely fast).
Streaming Output: Just as your brain starts formulating a response before you finish speaking, the model performs pre-computation (Prefill) while receiving the video stream, achieving extremely low latency.
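Conceptually, streaming prefill just means running the forward pass on input chunks as they arrive, so almost everything is already computed when the user stops talking. A toy sketch of that scheduling idea; `prefill_chunk` is a placeholder, not a real inference API:

```python
def prefill_chunk(kv_cache, chunk):
    """Placeholder for one forward pass: a real system would run the transformer over
    `chunk` and extend its key/value cache so later tokens can attend to it."""
    kv_cache.extend(chunk)
    return kv_cache

kv_cache = []
incoming_chunks = [            # audio + visual tokens arriving while the user is still talking
    ["aud_t0", "vis_t0"],
    ["aud_t1", "vis_t1"],
    ["aud_t2", "vis_t2"],
]

for chunk in incoming_chunks:  # prefill happens during the call, not after it
    kv_cache = prefill_chunk(kv_cache, chunk)

# By the time the user stops talking, almost everything is already computed;
# only the first decoding step of the reply remains.
print(f"prefilled {len(kv_cache)} tokens before generation starts")
```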
You asked if a small model is used. The answer is likely "Mixture of Experts (MoE)" or a distilled model for specific tasks.
MoE (Mixture of Experts): This is a massive model, but when processing each frame, only a small subset of its parameters (experts) are activated. For example, when seeing a "cat," only the "animal expert" parameters are activated. This maintains the "large model's" intelligence while keeping the inference speed and cost of a "small model."
Edge + Cloud Collaboration: While the main computation is in the cloud (ByteDance's Volcano Engine), the mobile device may also handle some preprocessing (like image cropping, compression, voice detection) to reduce pressure on the cloud.
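A minimal sketch of the top-k expert routing mentioned above, in NumPy; the dimensions and expert count are illustrative, and real routers are trained rather than random:

```python
import numpy as np

rng = np.random.default_rng(0)
D, NUM_EXPERTS, TOP_K = 512, 16, 2

router_W = rng.normal(size=(D, NUM_EXPERTS))                       # routing weights
experts = [rng.normal(size=(D, D)) for _ in range(NUM_EXPERTS)]    # one FFN stand-in per expert

def moe_layer(token):
    """Send one token vector to its top-k experts and mix their outputs."""
    scores = token @ router_W                                        # one score per expert
    chosen = np.argsort(scores)[-TOP_K:]                             # indices of the best experts
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()  # softmax over the chosen ones
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

out = moe_layer(rng.normal(size=(D,)))
print(out.shape)   # (512,) -- only 2 of the 16 experts actually ran for this token
```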
| Challenge | Solution | Effect |
| --- | --- | --- |
| Video data too large | Key frame sampling (only viewing 1-2 frames per second) | Data volume reduced by over 95% |
| Image pixels too many | Visual encoder (ViT / patching) | Compresses an image into hundreds of tokens |
| Conversation too long | Sliding window (discarding old visual tokens) | Memory usage remains stable, no explosion |
| Latency too high | Streaming & MoE architecture | Activates computation only when inference is needed |
The reason Doubao feels smooth is not that it's "magical," but that it "knows how to make trade-offs." It doesn't stare at every one of your pixels all the time; instead, through high-compression-ratio visual encoding and sparse sampling, it achieves video understanding with minimal computational power within the "real-time" range you perceive.
What can I do for you next?
Would you like me to explain in more detail how Visual Encoders (like ViT) compress images, or how MoE (Mixture of Experts) allows large models to run fast?
So, does that mean the model's ability to understand images in video is significantly weaker than its ability to understand images sent directly in a chat?
Your understanding is very sharp, and the conclusion is largely correct.
Under the current technical architecture, the image understanding capability in real-time video calls is indeed weaker than that for directly sent static images in terms of "detail capture" and "depth of analysis."
This is primarily an engineering trade-off: "sacrificing precision for speed and fluency." We can compare the two from the following dimensions:
Sending an image (High-precision mode): When you upload a photo to Doubao or GPT-4o, the system typically processes it at a higher resolution (e.g., 1024x1024 or a compressed version of the original). The model can "look closely," identifying small text on documents, complex chart details, or tiny textures on object surfaces.
Real-time video (Low-precision/Quick-glance mode): To ensure processing multiple frames per second without lag, individual frames in the video stream are usually heavily compressed and downsampled (Resizing).
Consequence: The model sees a relatively blurry, low-resolution image (e.g., 336x336 or 512x512).
Typical scenario: If you wave a medicine bottle on video, the model might only recognize "this is a white medicine bottle" but can't read the small print on the ingredient list; but if you send a high-definition photo, it can read the ingredient list.
Using the concept from psychology's Thinking, Fast and Slow:
Sending an image = System 2 (Slow thinking): The model has ample time (a few seconds) for deep reasoning on every patch of the image. It can allocate more parameters to analyze the logical relationships between objects in the picture.
Real-time video = System 1 (Fast thinking): Video calls require millisecond-level responses. The model must provide feedback within an extremely short time, so it often operates in an "intuitive reaction" mode. It tends to recognize large outlines, movements, colors, and scenes but lacks the "mental capacity" to delve into details.
Static image: The model's attention is entirely focused on this single image, which is the sole source of information.
Video stream: The model's attention is distributed along the "timeline." It needs to allocate computational power to remember "what you were doing a second ago" and "what you are doing now." To remember coherent actions, it must sacrifice attention to details in individual frames.
Although weaker in "static detail," video mode possesses a capability static images lack: understanding change and causality.
Example: If you send a static picture of your hand raised mid-air, the model doesn't know if you're about to "hit someone" or "high-five."
Video: The model can see the trajectory of your movement, accurately judging the intention.
Directly sending an image: It's like you, with a magnifying glass, carefully examining a painting in a museum. You can see the painter's brushstrokes, the texture of the canvas.
Real-time video conversation: It's like riding a bicycle past that painting. You can still tell it's a painting, whether it depicts a landscape or a portrait, and whether the colors are vibrant, but you absolutely cannot count how many buttons are on the painted figure's coat.
So, when using Doubao or similar AI:
If you need to analyze details (e.g., read instructions, examine complex error code screenshots, check skin issues, analyze complex Excel screenshots): Please directly take a photo or upload an image.
If you need interaction or to demonstrate an environment (e.g., practice English conversation, show room layout, let AI see what the cat is playing, demonstrate an operation step): Please use real-time video calls.
Are video models and conversational models completely different models?
This is a crucial architectural question. The answer is: In the past, they were "completely different" models, but the current trend is that they are becoming "the same brain," known as "Native Multimodal Models."
To help you understand this evolution, we can divide AI architecture development into three stages. Products like Doubao and GPT-4o are currently in a state of transition from the second to the third stage.
"Three cobblers with their wits combined equal Zhuge Liang's mastermind."
In early AI systems, video models and conversational models were indeed completely independent; they didn't even "know" each other.
Eyes (Visual Model): This was a specialized CNN (Convolutional Neural Network) or early ViT. It was only responsible for looking at images and converting what it saw into text labels. For example, seeing a video, it outputs: "A cat, on a sofa, jumping."
Brain (Conversational Model): This was a pure-text LLM (like GPT-3). It took the above text and generated a reply: "Wow, it's so lively."
Mouth (Speech Model): Converted text to speech.
"Giving the brain optic nerves"
This is the architecture adopted by most current multimodal models (including early ones like LLaVA, MiniGPT-4, etc.).
Core Logic: There is still a core "language brain" (LLM), but engineers have trained a "translator" (Projector/Adapter).
How it works:
Video/images are no longer converted to text.
Instead, they are turned into a string of numerical vectors (Visual Tokens) via a Visual Encoder.
The key step: These visual vectors are "translated" into "language vectors" that the language brain can understand.
Brain processing: To the LLM, the video feed is like a piece of foreign language text. It mixes video tokens and text tokens together for reasoning.
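A minimal sketch of that "translator" step: a projector maps visual-encoder outputs into the LLM's embedding width so they can be concatenated with text embeddings. All weights here are random stand-ins and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D_VISION, D_LLM = 1024, 4096                        # illustrative widths

projector = rng.normal(size=(D_VISION, D_LLM))      # the trained "translator" in real systems

visual_tokens = rng.normal(size=(256, D_VISION))    # 256 tokens from the visual encoder
text_embeds = rng.normal(size=(12, D_LLM))          # embeddings of the user's text prompt

projected = visual_tokens @ projector               # now the same width as text embeddings
llm_input = np.concatenate([projected, text_embeds], axis=0)
print(llm_input.shape)                              # (268, 4096): one mixed sequence for the LLM
```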
"Born synesthetes"
This is the ultimate form enabling "real-time video conversation."
Core Logic: The model is trained from the very beginning watching videos, listening to audio, and reading text together.
Is it the same model? Yes, it's the same neural network.
Principle: In this model, the input can be anything (audio waveforms, image pixels, text characters). After entering the model, they all become unified representations. The model doesn't need to first convert sound to words; it can directly "understand" the anger in the tone; it doesn't need to first convert images to words; it can directly "see" the oppressive feeling in the scene.
Returning to your question, we can use the "human brain" as a metaphor:
Visual Encoder = Human Retina and Optic Nerve This part is specialized for processing images and is different from the part processing language. It's responsible for converting light signals (pixels) into neural signals (tokens).
Large Language Model Backbone (LLM Backbone) = Human Cerebral Cortex This part is general-purpose. Whether it's signals from the optic nerve (watching video) or signals from the auditory nerve (conversation), they all ultimately undergo thinking, reasoning, and decision-making in this same region.
So the answer is: They are different modules at the "perception layer" (eyes and ears), but at the "thinking layer" (brain), it's entirely the same model operating.
Precisely because they share the 'thinking layer,' Doubao can, upon seeing you take out a flower (visual), combine it with your earlier statement 'Today is Valentine's Day' (text context), and infer 'Are you going to give it to your girlfriend?' (logical reasoning). If they were completely separate models, such deep integration wouldn't be possible.
But the processing for image tokens is completely different, right? Can it also acquire multiple capabilities simultaneously during a single training process?
That's an excellent intuition! You've touched upon the core magic of multimodal large models—"Representation Alignment."
The answer is: Yes, although the "entry point" processing for images and text is completely different, once they enter the model's interior, their mathematical essence becomes the same thing, allowing them to "learn from each other" during the same training process.
It's like this: although Chinese and English are written completely differently (one uses characters, the other letters), in your brain, when you see "苹果" and "Apple," the concept that comes to mind (that red fruit) is the same.
Here's a detailed breakdown of how this process is implemented:
As you said, their initial processing is indeed worlds apart:
Text: Uses a Tokenizer.
Process: "a cat" -> Look up dictionary -> [ID: 105, ID: 2003] -> Vector A.
Image: Uses a Visual Encoder.
Process: Image pixels -> Patching -> Convolution/Transformer operations -> Vector B.
The key point is: Although their origins differ, they ultimately become "high-dimensional vectors" (Embedding Vectors), which are long strings of numbers (e.g., [0.1, -0.5, 0.9, ...]).
Once transformed into vectors, for the subsequent Transformer brain, it doesn't care whether this vector originally came from pixels or text. It just performs calculations.
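A sketch of the two "entry points" ending in same-shape vectors; the vocabulary, dimensions, and weights are all illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768                                              # shared embedding width

# Text entry point: tokenizer lookup -> embedding table
vocab = {"a": 105, "cat": 2003}
embedding_table = rng.normal(size=(5000, D))
def embed_text(words):
    return embedding_table[[vocab[w] for w in words]]    # (num_words, D)

# Image entry point: flattened pixel patches -> linear projection
patch_proj = rng.normal(size=(16 * 16 * 3, D))
def embed_patches(patches):
    return patches @ patch_proj                          # (num_patches, D)

text_vecs = embed_text(["a", "cat"])
image_vecs = embed_patches(rng.normal(size=(576, 16 * 16 * 3)))
print(text_vecs.shape, image_vecs.shape)   # (2, 768) (576, 768): same width, ready to form one sequence
```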
How can it acquire both capabilities simultaneously in one training run? The secret lies in the training data being "mixed text and images," and the training task being a unified "predict the next token."
Imagine the model reading an illustrated comic book or browsing a webpage:
Input sequence: [ImageToken_1] [ImageToken_2] ... [ImageToken_256] + [TextToken: This] + [TextToken: is] + [TextToken: a] + [TextToken: what]
Training objective: The model needs to predict the next word is [cat].
To answer this question correctly, the model is forced to do one thing:
"I must look for clues in the preceding
[ImageToken]! If I understand that those image vectors represent 'cat' features, I can predict the word 'cat'; if I don't understand, I'll guess wrong, and the error (Loss) will be large."
Through trillions of such training iterations (gradient descent), the model learns to establish these connections.
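A toy sketch of that training objective: image tokens sit in the context, and the loss is computed on predicting the next text token, so the only way to lower it is to "read" the image vectors. The model here is a trivial mean-over-context scorer, not a Transformer, and all values are fabricated:

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB_SIZE, CAT_ID = 64, 1000, 42

W_out = rng.normal(size=(D, VOCAB_SIZE)) * 0.01     # output head, stand-in for the whole LLM

image_tokens = rng.normal(size=(256, D))            # visual tokens for a photo of a cat
text_tokens = rng.normal(size=(4, D))               # embeddings of "This is a what"
context = np.concatenate([image_tokens, text_tokens], axis=0)

hidden = context.mean(axis=0)                       # toy stand-in for attention over the context
logits = hidden @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[CAT_ID])                       # cross-entropy for the target word "cat"
print(f"loss = {loss:.3f}")
# Gradient descent lowers this loss; the only reliable way to do so is for the image
# tokens inside `context` to carry features that actually predict the word "cat".
```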
After such training, the model's brain forms a "unified semantic space."
In this space, the image vector representing a "round object" and the text vector representing the word "Ball" are geometrically very close.
This explains why "video conversation" is possible: when you provide video frames, the model is actually reading a bunch of vectors, which automatically activate related textual concepts in its mind.
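A small sketch of what "geometrically very close" means in that shared space, measured with cosine similarity; the vectors here are fabricated to illustrate the geometry, not produced by a real model:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
text_ball = rng.normal(size=128)                       # pretend embedding of the word "Ball"
image_ball = text_ball + 0.1 * rng.normal(size=128)    # aligned: image of a round object
image_chair = rng.normal(size=128)                     # an unrelated concept

print(round(cosine(text_ball, image_ball), 2))    # close to 1.0: same concept, different modality
print(round(cosine(text_ball, image_chair), 2))   # near 0.0: different concepts
```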
Although the "front-end processing" for image tokens (patching, encoding) is completely different from text processing, they are mapped into the same "mathematical space."
An analogy:
Text is US Dollars.
Images are Japanese Yen.
The model is an international bank. Although US Dollars and Japanese Yen look completely different (different entry processing), the bank converts them all into Gold (vectors). In the bank's vault (inside the model), there's only Gold, no distinction between Dollars and Yen.
Therefore, through one large-scale mixed training, the model not only learns how to earn Dollars (writing) and how to earn Yen (image recognition), but also learns the exchange rate (text-image understanding).
What I'm saying is that the way large models process image tokens for pictures and for video is quite different, right? How can they acquire general capabilities?
This is a very sharp and in-depth technical perspective! You've grasped the most subtle contradiction in "Multimodal Large Models" (LMMs):
Since single-image processing is "high-resolution, heavy on spatial detail," while video frame processing is "low-resolution, heavy on temporal flow," and the input data characteristics for these two are so different, how can the same model be general-purpose?
The answer lies in a core property of the modern Vision Transformer (ViT) architecture, Resolution Agnosticism, combined with the fact that the LLM doesn't care where its token sequences come from.
We can break down this principle into three levels to understand:
This is the most critical technical point. You might think models, like traditional fully connected networks, must have a fixed input size (e.g., must be 224×224). But modern ViTs are not like that.
Patching Logic: When a ViT processes an image, it cuts the image into small squares (Patches) of, say, 14×14 or 16×16 pixels.
Whether the image is large or small:
Processing a high-def single image (1024px): Cuts out 64×64=4096 Patches.
Processing a low-def video frame (336px): Cuts out 24×24=576 Patches.
Positional Embedding Interpolation: The model is usually trained with a standard resolution. When the input size changes, the model automatically interpolates (stretches or compresses) the "positional encodings."
Analogy: It's like the human eye. When you lean in to read a book (high-def image), you see a small field of view but with many details; when you step back to look at an entire wall (video frame), you see a large field of view but with fewer details. Your retina doesn't swap to a new one; only the focus changes. The model uses the same set of parameters (weights), just processing a different number of Patches.
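A sketch of the positional-embedding interpolation described above: embeddings trained for a 24x24 patch grid are stretched to a 64x64 grid when a higher-resolution image arrives. This uses SciPy's `zoom` for the interpolation; the grid sizes and width are illustrative:

```python
import numpy as np
from scipy.ndimage import zoom

D = 768
trained_grid = 24      # e.g. trained on 336px images with 14px patches -> a 24x24 grid
pos_embed = np.random.default_rng(0).normal(size=(trained_grid, trained_grid, D))

def interpolate_pos_embed(pos_embed, new_grid):
    """Stretch (or compress) the trained grid of positional embeddings to a new grid size."""
    factor = new_grid / pos_embed.shape[0]
    return zoom(pos_embed, (factor, factor, 1), order=1)   # linear interpolation over the grid

hi_res = interpolate_pos_embed(pos_embed, 64)   # e.g. a 1024px image with 16px patches -> 64x64
print(hi_res.shape)                             # (64, 64, 768): same weights, stretched layout
```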
For the subsequent Large Language Model (LLM), it has no idea whether the visual encoder saw one high-def image or eight low-def images. It only sees a bunch of tokens.
Here's a clever "space-for-time" computational balance:
Scenario A: Intensive reading of a single image
Input: 1 image at 1024×1024.
Token composition: All allocated to spatial detail.
What the LLM sees: [Detail1] [Detail2] ... [Detail1000] (total 1000 tokens).
Capability: Can read text, tiny objects.
Scenario B: Watching a video clip
Input: 8 frames at 336×336 (1 frame per second).
Token composition: Spatial detail reduced, but temporal dimension added.
What the LLM sees: [Approx_Outline_Second1] ... [Approx_Outline_Second2] ... [Approx_Outline_Second8] (total 1000 tokens).
Capability: Cannot read text, but can see action continuity.
Conclusion: For the LLM, the computational load (total token count) for these two scenarios might be the same. The model doesn't "switch brains" for video; it's just that the input token sequences carry information with different emphases (one is deep spatial information, the other is broad temporal information).
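The "space-for-time" budget can be made concrete with a bit of arithmetic; the patch size is an illustrative value consistent with the figures above:

```python
PATCH = 16                                  # illustrative patch size

def tokens_for(side_px):
    return (side_px // PATCH) ** 2          # patches along each side, squared

single_image = tokens_for(1024)             # Scenario A: 64 * 64 = 4096 tokens of pure spatial detail
video_clip = 8 * tokens_for(336)            # Scenario B: 8 frames * 441 = 3528 tokens spread over time
print(single_image, video_clip)             # roughly the same budget, spent on space vs. time
```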
Why does this lead to "general capabilities"? Because core semantic features are compression-resistant.
A cat is still a cat: whether in a 4K photo or a 360p video, a cat's core features (pointy ears, whiskers, furriness) end up as very similar high-dimensional vector representations after convolution or patch processing.
Alignment Training: During the pre-training phase, the model uses massive amounts of data:
High-resolution image-text pairs (learns details).
Low-resolution video-text pairs (learns actions). This teaches the model to map visual features to the same concept at different resolutions.
The model isn't "split" into two; rather, it has elasticity:
Hardware-wise: The Vision Encoder, like an accordion, outputs a different number of tokens depending on the resolution.
Software-wise: The LLM receives a "long sequence." It doesn't care if this sequence is "slices of one large image" or "concatenation of several small images."
Therefore, when you're on a video call, Doubao doesn't "get dumber"; it simply reallocates the brain cells originally used to "see pores clearly" to "see your movements clearly."
So, specifically for video after it's split into frames, are tokens given frame by frame, or are multiple frames synthesized into a single token with a time dimension?
This is a very hardcore architectural detail question. The current mainstream industry approach (including GPT-4o, Gemini 1.5 Pro, Doubao, etc.) typically lies between the two options you mentioned but leans more towards "extract frame by frame first, then perform temporal compression/fusion."
A model rarely synthesizes multiple frames into just one token (too little information), nor does it naively keep every token from every frame (information explosion).
Current mainstream processing solutions usually fall into the following two paths, and Doubao's real-time model most likely uses a variant of Option Two:
"Frame after frame, relying on position to remember time"
This is the most intuitive method and was commonly used by early multimodal models (like Video-LLaVA).
Independent Encoding: The visual encoder (Visual Encoder) splits the video into Frame 1, Frame 2, Frame 3... and then independently converts each frame into a set of tokens (e.g., 256 tokens per frame).
Direct Concatenation: These sets of tokens are concatenated like beads on a string.
Positional Embedding: They are tagged to tell the LLM: "This is the image from second 1, this is from second 2."
Sequence seen by the LLM: [Frame1_Tokens] + [Frame2_Tokens] + [Frame3_Tokens] ...
Drawback: Token consumption is enormous. If sampling 2 frames per second, 256 tokens per frame, a one-minute video would be 2×60×256=30,720 tokens. This is too expensive and too slow for real-time conversation.
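The blow-up is easy to reproduce; the per-frame figure of 256 tokens is the one used above:

```python
FRAMES_PER_SECOND = 2      # sampling rate from the example above
TOKENS_PER_FRAME = 256
SECONDS = 60

total_tokens = FRAMES_PER_SECOND * SECONDS * TOKENS_PER_FRAME
print(total_tokens)        # 30720 tokens of context for just one minute of video
```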
"Multiple frames in, fewer frames out" — Preferred by real-time models
To solve the token explosion problem, advanced models now incorporate a "Temporal Projector/Adapter." This is what you guessed as "synthesizing a token with a time dimension."
Feature Extraction: The visual encoder still first looks at Frame 1, Frame 2, Frame 3, Frame 4.
Fusion Compression (Pooling): There is a layer in the middle of the model specifically responsible for "deduplication" and "extracting changes."
If Frame 1 and Frame 2 are almost identical (e.g., you sitting still), the model will mathematically average or weight the tokens from these two frames.
It might compress the raw data from 4 frames (assuming 1000 tokens total) into a set of tokens representing this period (e.g., compressed to 64 tokens).
[Time_0-2s_Summary_Tokens] + [Time_2-4s_Summary_Tokens] ...

In academia and some specific models (like Sora-style video generation models or certain efficient understanding models), there's a more thorough approach to creating "tokens with a time dimension," called the Tubelet (a space-time tube slice).
Principle:
Regular image slicing (Patch): Cuts a 2D square (x, y).
Video slicing (Tubelet): Cuts a 3D cuboid (x, y, t).
Example: A token sliced by the model is no longer "the pixels in the top-left corner at this second" but rather "the change in the top-left area over the next 3 frames."
Result: Such a token inherently contains information about temporal flow.
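Minimal sketches of the two compression ideas above: averaging the token grids of near-identical adjacent frames (the Option Two style), and tubelet slicing, where one token spans several frames and so carries a time axis. The shapes, threshold, and patch sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_similar_frames(frame_tokens, threshold=0.1):
    """Merge adjacent frames' token grids when they barely differ (Option Two style)."""
    pooled = [frame_tokens[0]]
    for frame in frame_tokens[1:]:
        if np.abs(frame - pooled[-1]).mean() < threshold:
            pooled[-1] = (pooled[-1] + frame) / 2      # fold into the previous summary
        else:
            pooled.append(frame)                       # scene changed: start a new entry
    return pooled

def tubelet_tokens(clip, t=3, p=16):
    """Cut a (T, H, W, C) clip into t x p x p space-time cuboids, one token per cuboid."""
    T, H, W, C = clip.shape
    v = clip[: T - T % t]                              # drop frames that don't fill a tubelet
    v = v.reshape(T // t, t, H // p, p, W // p, p, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)               # group each cuboid's pixels together
    return v.reshape(-1, t * p * p * C)

base = rng.normal(size=(576, 64))
frames = [base, base + 0.01 * rng.normal(size=(576, 64)),                 # you sitting still
          base + 5.0, base + 5.0 + 0.01 * rng.normal(size=(576, 64))]     # camera moved
print(len(pool_similar_frames(frames)))    # 2 summaries instead of 4 frames

clip = rng.normal(size=(9, 64, 64, 3))     # 9 frames of 64x64 RGB
print(tubelet_tokens(clip).shape)          # (48, 2304): each token spans 3 frames in time
```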
Considering the extremely low latency requirements of real-time calls, Doubao most likely employs a hybrid strategy of "key frame sampling + local temporal pooling":
It doesn't look at every frame: It looks at frames intermittently (e.g., only frames 1, 5, 10).
Tokens are independent but related: Each frame generates independent tokens.
Relies on the LLM to infer time: It relies on the powerful LLM (the brain) to understand time through the sequence order of tokens, rather than hard-coding time into a single token.
In plain terms: it doesn't compress the action of "you waving" into a single "waving token"; instead, it sends the brain a series of tokens: [hand_low] -> [hand_mid] -> [hand_high]. The brain reads this sequence and concludes: "Oh, he's waving."