ByteDance Open-Sources Cola DLM: Breaking Token Boundaries with Continuous Semantic Space

Following Kaiming He's ELF, ByteDance is also challenging the "predict the next token" paradigm with Cola DLM (Continuous Latent Diffusion Language Model). Cola DLM's core concern is representation rather than diffusion: it splits generation into a semantic latent prior and a text decoder.

Do large language models really have to follow the "predict the next token" path? After Kaiming He, ByteDance gives the same answer: no.

Moreover, both teams have independently set their sights on the same direction: modeling language in a continuous semantic space. More importantly, ByteDance has gone all-in on open source this time, releasing the paper, code, model weights, and a Chinese-language blog post together.

Let's quickly recap. Just last week, Kaiming He's team released ELF, a diffusion language model that skips the token layer and completes the entire generation process in continuous embedding space. With only 105M parameters it outperforms mainstream diffusion language models, showing for the first time that the continuous route has real potential in language generation.

This time, ByteDance brings Cola DLM (Continuous Latent Diffusion Language Model), which further corroborates the trend. It too breaks free from discrete tokens and hands the generation process over to continuous space. The result: in strictly controlled experiments at roughly 2B parameters and approximately 2000 EFLOPs of compute, Cola DLM shows a more stable scaling trend than autoregressive models and mainstream discrete DLMs.

But just when you think this is another story of "moving image diffusion models into the language domain", ByteDance tells you: wrong. Cola DLM's motivation has never been diffusion.

Huh? It was never about diffusion, yet it ended up as a diffusion language model?

In fact, the real protagonist is hidden in the second half of this sentence: Cola DLM's motivation has never been diffusion, but representation.

In ByteDance's view, what really matters is representation. The token, a byproduct of tokenizer engineering and historical evolution, is just one form in which a representation is realized. The team also makes a bold claim: tokens are the surface carrier of the human language system, not the semantics itself.

A simple example makes this concrete: take several differently worded sentences that all express the same meaning.

The tokens are very different, but the semantics are the same.

Mainstream large models have usually treated such variants as separate expressions to be learned one by one. The semantics behind them are plainly the same, yet the model has to align them individually at the token surface.

So ByteDance's judgment is this: if the model holds a more stable and abstract "semantic state" internally, then sentences that mean the same thing but differ in wording need not be memorized separately and can instead converge to similar internal representations. In essence, Cola DLM's diffusion is not recovering tokens but transporting a latent prior.

How do you "transport a latent prior"? ByteDance chooses to separate semantics from its textual realization into two layers.

The methodology is laid out in section 3.1.1 of the paper. Put simply, Cola DLM's generative model has only two parts: a latent prior, responsible for generating the "latent semantics", and a decoder, responsible for translating those semantics into concrete text. In effect, "generating a sentence" is split into two relatively independent jobs.

Crucially, the entire diffusion/flow matching process happens in latent space, not token space. In other words, Cola DLM does not gradually denoise a pile of noisy tokens into clean ones. It first organizes a random, unstructured latent into a meaningful representation in continuous semantic space, step by step, and only then translates it into text in one go.

So in its generation path, there is actually no gradual token generation process at all. Tokens only appear in the last step, and what is learned before is all "how semantics are formed". This is also the biggest difference between Cola DLM and many diffusion language models.
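To make the two-stage path concrete, here is a toy Python sketch. Everything in it is a hypothetical stand-in, not Cola DLM's code: `PHRASES`, `refine`, `decode`, and `generate` are invented for illustration, the "prior" is a hand-written update that pulls a noisy vector toward a point chosen by the caller (standing in for what a trained prior would have learned), and the "decoder" is a nearest-neighbor lookup. The only thing it is meant to show is the ordering: all refinement happens on a continuous vector, and tokens appear exactly once, at the end.

```python
import random

# Hypothetical toy "semantic states" and the text each one decodes to.
PHRASES = {
    (1.0, 0.0): ["the", "meeting", "is", "postponed"],
    (0.0, 1.0): ["the", "meeting", "is", "cancelled"],
}


def refine(latent, target):
    """Stand-in for one prior step: move the latent halfway toward the target."""
    return [x + 0.5 * (g - x) for x, g in zip(latent, target)]


def decode(latent):
    """Stand-in decoder: return the text of the nearest known semantic state."""
    nearest = min(
        PHRASES,
        key=lambda state: sum((x - s) ** 2 for x, s in zip(latent, state)),
    )
    return PHRASES[nearest]


def generate(target, steps=8, seed=0):
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    for _ in range(steps):
        latent = refine(latent, target)  # continuous updates, no tokens yet
    return decode(latent)  # tokens appear only in this final step


print(generate((1.0, 0.0)))
```

Note that `generate` never touches a token until its last line; swapping in a different decoder would not change anything in the refinement loop.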

Many DLMs still essentially do "patching work" around tokens, such as recovering masked tokens and gradually restoring discrete text. Cola DLM instead moves diffusion from the "text layer" to the "semantic layer". Diffusion is no longer responsible for "generating tokens" but for "organizing semantics". In ByteDance's view, this is not a difference in packaging but a change in what diffusion actually does inside the model.

With the methodology clear, where exactly does Cola DLM "distance itself from traditional continuous DLMs"? The answer lies in several engineering-heavy but crucial design choices.

First is where the latent comes from. When many people hear "continuous language model", their first reaction is: isn't that just diffusion on word embeddings? Cola DLM deliberately does not do that. Instead, it builds a dedicated Text VAE.

What's the difference? A token embedding is still bound one-to-one to a token: one vector per token, so the result is essentially still a token sequence. The latent Cola DLM wants is a random variable that can vary continuously and be modeled probabilistically.

In this way, the object the model processes is no longer "next token", but "semantic state corresponding to the entire text segment".
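The contrast can be sketched in a few lines of Python. This assumes the standard Gaussian VAE formulation (an encoder that outputs a mean and a log-variance, sampled with the reparameterization trick); the article only says "Text VAE", so Cola DLM's actual encoder may differ. `EMBEDDING_TABLE`, the numbers, and the function names are made up for illustration.

```python
import math
import random

# Hypothetical embedding table: one fixed vector per token.
EMBEDDING_TABLE = {"cat": [0.2, -0.1], "sat": [0.7, 0.4]}


def embed(token):
    """Deterministic lookup: the same token always gives the same vector."""
    return EMBEDDING_TABLE[token]


def sample_latent(mean, log_var, rng):
    """Reparameterization trick: z = mean + std * eps, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, exp(log_var)) || N(0, I)), the usual VAE regularizer."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


rng = random.Random(0)
# Pretend an encoder mapped a whole text segment to these statistics.
segment_mean, segment_log_var = [0.5, -0.3], [-1.0, -1.0]
z_a = sample_latent(segment_mean, segment_log_var, rng)
z_b = sample_latent(segment_mean, segment_log_var, rng)

print(embed("cat") == embed("cat"))  # the lookup is repeatable
print(z_a != z_b)                    # latent draws vary around the mean
```

The embedding is a fixed point per token, while the latent is a distribution over a whole segment that a prior can then model probabilistically.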

Second, Cola DLM doesn't use the familiar "add noise → denoise" diffusion but a combination called block-causal DiT + Flow Matching.

It doesn't matter if the name means nothing to you. Just know what the combination does:

To put it bluntly, instead of relying on repeated denoising, it directly learns an "optimal path" to smoothly guide noise toward meaningful semantics.
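For readers who want to see what "learning a path" means, here is a minimal Python sketch of flow matching. It assumes the common straight-line interpolation between a noise sample and a data latent, which the article does not confirm is what Cola DLM uses. In place of a trained network, `oracle_velocity` is a hand-written velocity field for a single known target, so the example only demonstrates the training target and Euler sampling, not learning itself. All names are illustrative.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight line from noise x0 to latent x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted velocity and the target x1 - x0."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = [b - a for a, b in zip(x0, x1)]
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = [1.0, -2.0, 0.5]  # hypothetical "meaningful" latent


def oracle_velocity(x, t):
    """Stand-in for a trained model: exact velocity toward one fixed target."""
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]
loss = flow_matching_loss(oracle_velocity, noise, TARGET, t=0.3)
sample = euler_sample(oracle_velocity, noise, steps=8)
print(loss, sample)
```

With the oracle field the loss is zero up to rounding and eight Euler steps land on `TARGET`; a real model would replace `oracle_velocity` with a network trained to minimize `flow_matching_loss` over many noise and latent pairs.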

Even better, it also introduces a block structure on this semantic path—parallel within blocks to quickly organize local semantics, and causal order between blocks to ensure overall logic doesn't get messed up.
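One common way to get "parallel within blocks, causal between blocks" is an attention mask in which each position can see its own block and every earlier block. The Python sketch below builds such a mask; it is an assumption about how the idea could be realized, not the mask from the Cola DLM code, and `block_causal_mask` and its parameters are hypothetical names.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when position i may attend to position j.

    Positions in the same block see each other in both directions;
    a position never sees a later block.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is a query position: positions 0 and 1 see only each other, while positions 4 and 5 see the whole sequence. Setting `block_size` to 1 recovers an ordinary causal mask, and setting it to `seq_len` recovers fully bidirectional attention.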

Taken together, this amounts to rebuilding a generation framework at the semantic layer that is "fast locally, smooth globally", giving up neither.

Third, continuous diffusion language models share a common problem: the semantic latent is easily led astray by diffusion and degenerates into a "token in disguise". On the surface it is a continuous vector, but in essence it is still memorizing words, and no real semantic abstraction forms.

So Cola DLM's approach is to completely separate the two tasks: learning the semantic space and learning the prior over it.

During training, the encoder is essentially "frozen" throughout the diffusion phase.

Why not let it learn jointly? Because once the encoder is allowed to adapt to diffusion, it will cut corners to reduce the loss, quietly sliding the semantic representation toward "easy-to-predict token forms" and ending up back on the old path.

What ByteDance wants is a stable semantic space, not an intermediate layer polluted by the downstream task. So they do the opposite: the prior adapts to the semantic space, rather than the semantic space bending to please the prior.

In addition, they add a semantic constraint (a BERT-style mask loss) to keep the encoder from "semantic collapse" during reconstruction.

According to their experiments, without this constraint the latent does drift in order to reduce the loss.

If the first three points are more like engineering ingenuity, the fourth is Cola DLM's theoretical heavy lifting. ByteDance splits the training objective into three subtasks that can be inspected and diagnosed separately.

The advantage shows up by contrast: traditional autoregressive models lump everything into a single "predict the next word" loss function.

When generation quality is poor, you have no idea where the problem lies: whether the model misunderstood the input, did not retain enough, or went wrong along the generation path.

Cola DLM keeps its accounts clear: looking at the metrics separately tells you which part is not working.

This is also the underlying reason it can show a stable scaling trend: not blind guessing, but every link diagnosable and optimizable on its own.

Finally, for reasons of space, we only summarize ByteDance's Cola DLM results here; refer to the blog for details.

At this point, it's hard not to compare ByteDance's Cola DLM with ELF from Kaiming He's team.

Interestingly, the two works are almost concurrent, and both challenge an assumption taken for granted for twenty years: that language models must be built on discrete tokens.

Why is this assumption starting to be questioned? On the one hand, the bottlenecks of the "predict the next token" path in today's autoregressive large models are becoming more and more obvious: slow inference, weak long-range dependencies, and a structural gap between the training objective and real generation quality.

On the other hand, the success of diffusion models in image and video generation has prompted reflection: is the discrete token really the carrier that language intelligence must depend on, or just a habit chosen by history?

The exploration of diffusion language models over the past two years (LLaDA, Dream-7B, MDLM, etc.) has already put this question on the table, but most of that work stays in the "discrete school", still doing diffusion on tokens.

Then ELF and Cola DLM appeared and, almost simultaneously, gave the same answer: language modeling doesn't have to be tied to tokens.

They differ only in their specific solutions.

Simply put, ELF is like one person handling the job from start to finish, repeatedly mulling things over in the original-length embedding space and only putting pen to paper at the last step.

Cola DLM is like two departments dividing the labor: the semantic department first works out "what to express", and the text department then handles "how exactly to write it".

Although the two routes differ in method, their underlying concern is the same: let modeling happen in the representation space best suited to the nature of language, rather than being confined by the default framework of "token = semantics".

In essence, they are two answers to the same question. They also signal a trend: it's time to take a fresh look at continuous diffusion language models.

For the past two years, the diffusion language model stage has been occupied almost entirely by the "discrete school". ELF and Cola DLM, one after the other, have put the "continuous route" at center stage for the first time in a serious, comparable, and reproducible form.

Even more noteworthy, Cola DLM also points to something bigger: one of the core obstacles that has long held back "unified multimodality" is that text is discrete, while images, video, and audio are naturally more continuous.

If they are to truly enter the same "latent world", there must be an interface that maps text to a continuous semantic latent. Cola DLM plays exactly that role. And this, perhaps, is ByteDance's real ambition this time: not to add another player to the diffusion language model track, but to build a bridge that connects language models to the continuous multimodal world.

Of course, the Cola DLM team is itself quite restrained. As they write at the end of the blog: Cola DLM is just an early attempt on this road, but the road itself is worth continuing.

Finally, as usual, let's introduce the authors of this research.

The work is led by ByteDance's Seed team and brings together researchers from several universities, including HKU, Renmin University, Peking University, BUPT, and ANU, covering language modeling, diffusion models, and video generation.

First author Hongcan Guo is currently a senior undergraduate at the School of Artificial Intelligence, BUPT, and has been interning at ByteDance Seed since June 2025. His research focuses on the mathematical foundations and learning dynamics of generative and reasoning models, and he wrote the Cola DLM blog post himself.

Corresponding author Yan Zeng is a "big shot" inside ByteDance Seed: she leads R&D for ByteDance's hit Seedance series of video generation models. Reportedly, this Xi'an Jiaotong University graduate joined ByteDance as a campus recruit in 2021 and rose from algorithm engineer to level 4-2 in only five years.

Many of Cola DLM's "hierarchical latent variable + diffusion prior" ideas bear an obvious resemblance to the latent diffusion route long used in video generation.

There is also a very interesting "crossover player" on the team: Shen Nie. He is a leading researcher in Li Chongxuan's group at Renmin University's Gaoling School of Artificial Intelligence and the first author of the discrete diffusion language model LLaDA, which is precisely one of the discrete diffusion routes Cola DLM compares against most closely in the paper.

That in itself is telling: a leading figure of the discrete diffusion route also took part in research on the continuous latent route. To some extent, it shows that what Cola DLM really wants to discuss is not just "how diffusion generates text" but something more fundamental: on what kind of state space should text intelligence be built?

The other core authors also have strong backgrounds.

Hengshuang Zhao is an assistant professor in the Department of Computer Science at the University of Hong Kong. He did postdoctoral work at MIT CSAIL and Oxford's Torr Vision Group and has long been active in computer vision and generative modeling.

Qiushan Guo comes from Luo Ping's group at HKU MMLab and is also a key R&D member of ByteDance's Seedream image generation model.

The other listed authors are Qinyu Zhao, Yian Zhao, Rui Zhu, Feng Wang, Tao Yang, and Guoqiang Wei.

Looking at the author list as a whole reveals an interesting pattern: for this language model, ByteDance has, to some extent, brought in nearly the entire set of core ideas from video and visual generation. People who work on latent diffusion, video generation, image priors, and discrete DLMs have come together to rethink "how text should be modeled".

This is perhaps why Cola DLM as a whole looks so different from traditional language model routes.

From the beginning, its focus has been not just "how to generate text better" but putting language back into continuous semantic space, turning it into a modality that can align naturally with images, video, and audio.

And this, perhaps, is what deserves the most attention about Cola DLM: when text is no longer just a token sequence but a semantic state in the continuous world, what will multimodal intelligence look like?

Hugging Face address: https://huggingface.co/ByteDance-Seed/Cola-DLM
GitHub address: https://github.com/ByteDance-Seed/Cola-DLM
Paper: https://arxiv.org/abs/2605.06548
Blog: https://hongcanguo.github.io/posts/2026-cola-dlm-zh.html
