[HN Gopher] Transfusion: Predict the next token and diffuse imag...
___________________________________________________________________
Transfusion: Predict the next token and diffuse images with one
multimodal model
Author : fzliu
Score : 43 points
Date : 2024-09-09 18:51 UTC (4 hours ago)
(HTM) web link (www.arxiv.org)
| valine wrote:
| This is such a natural extension to LLMs. I'm shocked it hasn't
| been tried before.
|
| When I ask a diffusion model to generate a chessboard, I'd expect
| the pieces to be placed randomly. We are getting closer to image
| generators that not only know what chess pieces look like but
| also where to place them.
| ilaksh wrote:
| Hmm. I wonder if this is similar to Diffusion Transformers?
| darknoon wrote:
| This is somewhat similar, but diffusion transformers typically
| use a pre-trained text model for the text conditioning, whereas
| in this case it's integrated and trained together multimodally.
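
A minimal sketch of what "trained together" could look like at the level of the training objective, assuming a single model is optimized on the sum of a next-token loss over text positions and a noise-prediction loss over image positions. The function names, the `lam` weight, and the toy inputs are illustrative assumptions, not the authors' code.

```python
import math

def lm_loss(logits, target):
    """Cross-entropy for one next-token prediction (a text position)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def diffusion_loss(pred_noise, true_noise):
    """Mean squared error between predicted and actual noise (an image position)."""
    return sum((p - t) ** 2 for p, t in zip(pred_noise, true_noise)) / len(true_noise)

def combined_loss(text_items, image_items, lam=1.0):
    """Average text loss plus lam times average image loss.

    text_items:  list of (logits, target_index) pairs
    image_items: list of (pred_noise, true_noise) pairs
    lam:         hypothetical balancing weight (an assumption in this sketch)
    """
    text = sum(lm_loss(l, t) for l, t in text_items) / len(text_items)
    image = sum(diffusion_loss(p, n) for p, n in image_items) / len(image_items)
    return text + lam * image

# Toy usage: one text position and one image patch in the same batch.
loss = combined_loss(
    text_items=[([0.0, 0.0], 0)],
    image_items=[([1.0, 2.0], [0.0, 0.0])],
    lam=2.0,
)
```

Because both terms feed one scalar, gradients from text and image positions update the same shared parameters, which is the contrast with conditioning a diffusion model on a frozen, separately pre-trained text encoder.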
| cosmicjedi wrote:
| You can talk to the authors directly on alphaXiv!
| https://www.alphaxiv.org/abs/2408.11039v1
| BaculumMeumEst wrote:
| Stupid question: is their 7B model available? Is there public
| inference code that we could run? Or do they not usually release
| them along with these kinds of papers?
___________________________________________________________________
(page generated 2024-09-09 23:00 UTC)