
Generating text with diffusion (and ROI with LLMs)

Two guests for the price of one! This episode has two interviews recorded at AWS re:Invent back in December.


February 3, 2026


In part 1, Ryan chats with the co-founder and CEO of Inception, Stefano Ermon, about diffusion language models and how their multiple token generation compares to traditional LLMs (spoiler: they’re faster and more accurate). In the second half of the episode, Ryan and the chairman of Roomie, Aldo Luevano, dive into Roomie’s purpose built models for both physical and software AI, and how their ROI-first approach helps companies track the impact of their robotics and AI implementation.

Inception researches and builds diffusion language models for faster and more efficient AI.

Roomie is a robotics and enterprise AI company with an ROI-first platform that tracks how well their AI solutions are actually working.

Connect with Stefano on LinkedIn.

Connect with Aldo on LinkedIn.

TRANSCRIPT

[Intro Music]

Ryan Donovan: Hello, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ryan Donovan, and today we have two interviews that I recorded on the floor at AWS re:Invent back in December. The first is with Stefano Ermon of Inception, and the second is with Aldo Luévano at Roomie. Enjoy, and we'll talk to you next time.

Ryan Donovan: I'm here talking to Stefano Ermon, CEO, and Co-founder of Inception, and today we're talking about Diffusion LLMs. Now, I've heard of 'diffusion models' for image generation. How do diffusion LLMs work?

Stefano Ermon: Yeah, that's a great question. So, diffusion models, as you said, are basically the best way right now to generate images, video, audio. So far, they've not really been able to work well on text and code generation, and what we're doing at Inception is we're pioneering the first large-scale, commercial-grade diffusion language model. Diffusion language models work very differently compared to traditional large language models that everybody else is building. So, everybody else is building an autoregressive model, and basically, the way they work is when you ask a question to ChatGPT or Gemini, they will provide an answer one token at a time, left to right. And that's pretty slow; it's a structural bottleneck that is very hard to accelerate. It's a very sequential kind of computation. A diffusion language model instead generates multiple tokens essentially in parallel. So, you start with a rough guess of what the answer should be, and then you iteratively refine it. The key difference is that we still use a big neural network under the hood, but each neural network evaluation is not just producing a single token; it's able to modify multiple tokens at the same time, and that's why diffusion language models are significantly faster than autoregressive models. We're looking at 5 to 10x faster compared to autoregressive models of similar quality.
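[Editor's note: a toy sketch of the contrast described above, autoregressive decoding (one model call per token) versus diffusion-style parallel refinement (a few refinement steps over all positions at once). All names and the "model" behavior here are illustrative stand-ins, not Inception's actual API or architecture.]

```python
# Toy contrast: autoregressive decoding vs. diffusion-style refinement.
# The "model" is faked with a fixed target sequence; only the call
# counts and control flow illustrate the point.
import random

random.seed(0)
TARGET = ["the", "cat", "sat", "on", "mat"]  # pretend "true" answer
NOISE = "<noise>"

def autoregressive_decode(length):
    """One token per model call, strictly left to right."""
    out = []
    for i in range(length):
        out.append(TARGET[i])  # stand-in for next-token prediction
    return out, length         # number of model calls == sequence length

def diffusion_decode(length, steps=2):
    """Every position can change on every step, in parallel."""
    out = [NOISE] * length     # start from pure noise
    for _ in range(steps):
        # Each step, the "model" resolves many noisy positions at once.
        for i in range(length):
            if out[i] == NOISE and random.random() < 0.7:
                out[i] = TARGET[i]
    # Final step forces any remaining noisy positions to resolve.
    out = [TARGET[i] if t == NOISE else t for i, t in enumerate(out)]
    return out, steps          # model calls == steps, not length
```

The speedup claim maps onto the call counts: the autoregressive loop needs one network evaluation per token, while the refinement loop needs only as many evaluations as there are denoising steps, regardless of sequence length.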

Ryan Donovan: For diffusion models I've seen for images, they start with a seed and noise. Do you also start with seeds and noise, and by reusing the same seed, would that make it a more deterministic process?

Stefano Ermon: Yeah, so the process is very similar. So, just like in a typical– you've probably seen videos showing how a diffusion model works. You start with pure noise and then you refine it until you get a crisp, nice image or video at the end. We do something very similar. We start with basically random tokens, and then as we go through this diffusion process, we gradually remove noise, which means that we are able to correctly guess what the token values should be. And so, you can see the process where we go through this refinement chain until at the end we get a high-quality answer that we can output and give to the user.

Ryan Donovan: So, in refining that noise, is there a sort of built-in eval process there? Is that part of the neural network? I know with image generation there's all sorts of convolutional, deconvolution, all this sort of stuff. Is it anything like that?

Stefano Ermon: So, even for image diffusion models these days, people use transformers mostly. I was one of the pioneers of diffusion models for image generation back in 2019. My lab at Stanford kinda invented the whole idea of using a diffusion model. Back then, yes, we were using basically ConvNets because that made sense. We were doing dense image prediction, and that was the best thing that existed back then. Since then, people have switched mostly to diffusion transformers. So, it's still a transformer-based neural network that is trained on the same kind of objective. Basically, you take an image, you add noise, and then you train the transformer to predict the noise or effectively remove the noise from the image. And we do something similar: under the hood, we're also using transformers, so our neural network is still a large transformer. We start with clean text or clean code. We intentionally add some mistakes, so we essentially destroy some of the structure in the data, and then we train the neural network to reconstruct the original clean signal. It's basically trained to correct mistakes. Instead of being trained to predict the next token, the neural network is actually trained to correct mistakes. And at inference time, the process is not 'predict one token at a time;' the process is, 'fix as many mistakes as you can as you go through this denoising process until the output is sufficiently clean,' and then we just output it and give it to the user.
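[Editor's note: a minimal sketch of the corruption-and-reconstruction training setup described above: take clean text, randomly destroy some tokens, and score a denoiser on recovering the original. There is no real model or training loop here; the function names and the masking scheme are illustrative assumptions, not Inception's actual objective.]

```python
# Toy corruption/reconstruction objective for text denoising.
import random

MASK = "<mask>"

def corrupt(tokens, rate, rng):
    """Destroy structure: replace each token with MASK with prob `rate`."""
    return [MASK if rng.random() < rate else t for t in tokens]

def reconstruction_loss(predicted, clean):
    """Fraction of positions the denoiser failed to restore."""
    wrong = sum(p != c for p, c in zip(predicted, clean))
    return wrong / len(clean)

rng = random.Random(42)
clean = "the quick brown fox jumps over the lazy dog".split()
noisy = corrupt(clean, rate=0.3, rng=rng)

# Training would minimize this loss; a perfect denoiser maps
# `noisy` back to `clean` exactly:
assert reconstruction_loss(clean, clean) == 0.0
```

At inference time the same machinery runs in reverse order of difficulty: start from a fully corrupted sequence and repeatedly apply the trained mistake-corrector until nothing looks like noise anymore.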

Ryan Donovan: That's interesting. So, it's almost like trying to reproduce the training data. How does that work for stuff that is peripheral or not exactly the training data?

Stefano Ermon: Yeah, so it's still a statistical model, just like an autoregressive model. We're trying to learn basically a probability distribution, the data distribution, the data-generating process, which depends on what you're training the model on. It could be code, it could be text, typically large data sets that we scrape from the internet. And we're learning a generative model in the sense that it's not just trying to remember all the training data; it needs to generalize. So, there's still a machine learning component where you need to be able to generate new content, otherwise it wouldn't be particularly useful. But under the hood, yes, it's a generative model. It's a statistical model. When you give it a new prompt that the model has perhaps not seen during training, it will try to generalize. It'll try to come up with a reasonable answer, based on the probability distribution that the model has learned during training.

Ryan Donovan: Is it still prone to hallucinations?

[...]

