[HN Gopher] SuperPrompt: Better Text to Image Prompts in 77M Par...
       ___________________________________________________________________
        
       SuperPrompt: Better Text to Image Prompts in 77M Parameters
        
       Author : roborovskis
       Score  : 72 points
       Date   : 2024-03-14 16:35 UTC (6 hours ago)
        
 (HTM) web link (brianfitzgerald.xyz)
 (TXT) w3m dump (brianfitzgerald.xyz)
        
       | ShamelessC wrote:
       | Nice. I've been using GPT-4-turbo with a custom system prompt for
       | this until now. Going to try this out.
        
       | pstorm wrote:
       | I'm surprised this isn't getting more love. I love the concept of
       | finetuned, hyper-specific, tiny LLMs. Of course, the data is the
       | most important part.
        
         | roborovskis wrote:
         | Thanks for the kind words! I started with the 780M param
         | flan-t5-large model, and kept trying smaller and smaller base
         | models - I was shocked at how good the output was at 77M. As
         | you go smaller, though, it's much easier to accidentally
         | overfit or collapse the model and produce gibberish. Had to be
         | very careful with hyperparams and sanitizing / filtering the
         | dataset.
        
       | vunderba wrote:
        | This is neat, and something (aka a text "expander") that I
        | imagine a lot of the commercial offerings (midjourney, etc.)
        | are using behind the scenes.
       | 
       | This seems to be targeting SDXL workflows, but in my experience a
       | lot of the custom checkpoints derived from SDXL can have widely
       | divergent recommended prompting styles ranging from natural
       | language to just a list of booru tags.
       | 
       | So I'm guessing this is really only optimized for base SDXL, but
       | I would be curious to see how well it worked on some of the more
       | SOTA SDXL checkpoints such as juggernaut and unstable.
        
         | roborovskis wrote:
          | I haven't tested extensively with non-SDXL-based checkpoints,
          | but there's nothing really SDXL-specific about the model; if
          | you're using a fine-tune that's trained on booru-style tags, it
          | will probably not work as well - but otherwise it should work
          | just fine. And in that case, just fork the project and tune it
          | on whatever prompt style your fine-tune responds to best :)
        
       | thorum wrote:
       | It's impressive how well the T5 family of models has aged, even
       | compared to newer LLM architectures.
        
         | htrp wrote:
          | encoder-decoder vs decoder-only
        
       | gregtc wrote:
       | Great work! I'd recommend including the "max_length=77" parameter
       | in your example, and it seems like the huggingface hosted
       | interface is broken because of the tokenizer. Also, I think your
       | website link on X is outdated.
        
         | roborovskis wrote:
         | will fix these, thanks for the heads up!
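
          For reference, a minimal inference sketch along the lines
          gregtc describes, with max_length=77 set explicitly on
          generate(); the checkpoint name and prompt prefix below are
          assumptions, not confirmed in the thread:

            # Minimal inference sketch with max_length=77 passed to
            # generate(). The checkpoint name and instruction prefix are
            # assumptions, not confirmed in the thread.
            from transformers import AutoTokenizer, T5ForConditionalGeneration

            repo = "roborovski/superprompt-v1"  # assumed checkpoint name
            tokenizer = AutoTokenizer.from_pretrained(repo)
            model = T5ForConditionalGeneration.from_pretrained(repo)

            prompt = ("Expand the following prompt to add more detail: "
                      "a rainbow penguin in a tuxedo")
            input_ids = tokenizer(prompt, return_tensors="pt").input_ids
            output_ids = model.generate(input_ids, max_length=77)
            print(tokenizer.decode(output_ids[0], skip_special_tokens=True))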
        
       | smcleod wrote:
       | Awesome work! I'd love to see how this could be integrated with
       | existing tools like InvokeAI.
        
         | roborovskis wrote:
         | As Invoke is open-source and already has transformers as a
         | dependency, it should be pretty easy to add.
        
       | lionkor wrote:
       | > Left: Drawbench prompt "A rainbow penguin in a tuxedo". Right:
       | SDXL output with SuperPrompt applied to the same input prompt.
       | 
       | Neither is wearing a tuxedo.
        
         | roborovskis wrote:
         | Yup, the model will still forget details sometimes. This is a
         | common issue with prompt upsampling methods, but I'm hoping to
         | improve this with the next version.
        
         | hanniabu wrote:
         | I wonder how much of that could be due to "tuxedo penguin"
         | being a thing
        
       | Lerc wrote:
       | Is the lack of training data the only thing preventing this
       | approach from being applied to both positive and negative prompts
       | together?
       | 
       | What size data set is actually needed? Does it need to be machine
       | generated or can you get away with something smaller, perhaps
       | crowdsourced?
        
         | roborovskis wrote:
         | You could definitely use this for upsampling negative prompts,
         | though I haven't tested that much. In theory, future T2I models
         | shouldn't need to be negatively prompted as much; I find it's
         | better to focus on really high quality positive prompts, as
         | that is closer to the captions the model was trained on.
         | 
         | You can take a look at the dataset here:
         | https://huggingface.co/datasets/roborovski/upsampled-prompts...
          | The smaller models needed roughly 5k samples at a minimum,
          | filtered down from the 95k total generated.
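
          A rough sketch of the kind of filtering pass described
          above, using the datasets library; the dataset ID, column
          names, and filter criteria are assumptions, not the author's
          actual pipeline:

            # Illustrative filtering pass over generated prompt pairs;
            # the dataset ID, column names, and criteria are placeholders,
            # not the author's actual pipeline.
            from datasets import load_dataset

            ds = load_dataset("your-upsampled-prompts-dataset", split="train")

            def keep(example):
                source = example["prompt"]
                expanded = example["upsampled"]
                # Drop degenerate generations: too short, or not actually
                # expanded relative to the source prompt.
                return len(expanded.split()) >= 20 and len(expanded) > len(source)

            filtered = ds.filter(keep)
            print(f"kept {len(filtered)} of {len(ds)} samples")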
        
       | ultrasaurus wrote:
       | I was reading a blog today[1] that was pretty confident that
       | "continual orders-of-magnitude increases in compute usage [by AI]
        | will utterly drown any changes in efficiency", but this is just
        | one of a million ways we can make AI more efficient. It doesn't
        | seem like a foregone conclusion that the costs will get orders
        | of magnitude more expensive on every axis.
       | 
       | 1: Paywalled: https://www.noahpinion.blog/p/three-threats-to-the-
       | age-of-en...
        
       ___________________________________________________________________
       (page generated 2024-03-14 23:00 UTC)