[HN Gopher] Transfusion: Predict the next token and diffuse imag...
       ___________________________________________________________________
        
       Transfusion: Predict the next token and diffuse images with one
       multimodal model
        
       Author : fzliu
       Score  : 43 points
       Date   : 2024-09-09 18:51 UTC (4 hours ago)
        
 (HTM) web link (www.arxiv.org)
        
       | valine wrote:
       | This is such a natural extension to LLMs. I'm shocked it hasn't
       | been tried before.
       | 
       | When I ask a diffusion model to generate a chessboard, I'd expect
       | the pieces to be placed randomly. We are getting closer to image
       | generators that not only know what chess pieces look like but
       | also where to place them.
        
       | ilaksh wrote:
       | Hmm. I wonder if this is similar to Diffusion Transformers?
        
         | darknoon wrote:
         | This is somewhat similar, but diffusion transformers typically
         | use a pre-trained text model for text conditioning, whereas in
         | this case it's integrated and trained together multimodally.
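
         A minimal sketch of what "trained together" could look like: one
         objective that sums a next-token loss over text positions and a
         noise-prediction loss over image patches. The function names, the
         toy inputs, and the balancing weight `lam` are illustrative
         assumptions, not the paper's code.

```python
import math

def cross_entropy(probs, target):
    """Next-token loss for one text position: -log p(target)."""
    return -math.log(probs[target])

def mse(pred, true):
    """Diffusion-style loss for one image patch: mean squared error
    between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def joint_loss(text_terms, image_terms, lam=1.0):
    """Average each modality's loss, then combine as lm + lam * diffusion.

    text_terms:  list of (predicted_probs, target_index) pairs
    image_terms: list of (predicted_noise, true_noise) pairs
    lam:         hypothetical weight balancing the two losses
    """
    lm = sum(cross_entropy(p, t) for p, t in text_terms) / len(text_terms)
    diff = sum(mse(p, t) for p, t in image_terms) / len(image_terms)
    return lm + lam * diff

# Toy batch: one text position and one two-value image patch.
loss = joint_loss(
    text_terms=[([0.5, 0.5], 0)],
    image_terms=[([1.0, 0.0], [0.0, 0.0])],
    lam=2.0,
)
print(loss)
```

         Because both terms are computed from the same model's outputs, a
         single backward pass would update shared weights for both
         modalities, in contrast to conditioning on a frozen text encoder.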
        
       | cosmicjedi wrote:
       | You can talk to the authors directly on alphaXiv!
       | https://www.alphaxiv.org/abs/2408.11039v1
        
       | BaculumMeumEst wrote:
       | Stupid question: is their 7B model available? Is there public
       | inference code that we could run? Or do they not usually release
       | them along with these kinds of papers?
        
       ___________________________________________________________________
       (page generated 2024-09-09 23:00 UTC)