[HN Gopher] Expressive text-to-image generation with rich text
___________________________________________________________________
Expressive text-to-image generation with rich text
Author : plurby
Score : 39 points
Date : 2023-10-04 19:21 UTC (3 hours ago)
(HTM) web link (rich-text-to-image.github.io)
(TXT) w3m dump (rich-text-to-image.github.io)
| pugworthy wrote:
| I would love to experiment with the idea of font interpretation.
| People can and do anthropomorphize fonts, but fonts also have
| names with meanings which might or might not be useful.
|
| For example, I'm wondering if a prompt written in Comic Sans
| should be turned into a comic-style illustration, or does it come
| out as a simplistic and childish drawing? Is a gothic font meant
| to imply a style of architecture, old Germanic peoples, or goth
| music and style?
| gorenb wrote:
| My god, I think Midjourney and DALL-E should do this now.
| 90-00-09 wrote:
| I like this idea. It could be handy to be able to focus on
| individual descriptions in complex prompts. Is this then mostly a
| "UI" feature that is being translated to a traditional prompt?
|
| (As a side note: using decorative typefaces was an unconvincing
| example.)
| minimaxir wrote:
| A functionally similar approach is prompt-term weighting with
| libraries such as compel:
| https://github.com/damian0815/compel
|
| Prompt weighting alone can fix undesired aspects of an output,
| especially with SDXL and its dual text encoders.
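Compel-style weighting attaches multipliers to prompt fragments (e.g. `cat++` or `(blue hair)1.4`). The sketch below is a hypothetical parser for that kind of annotation, not compel's actual implementation; the 1.1x-per-`+` convention matches compel's documentation, but check the library for its exact semantics.

```python
import re

# Hypothetical tokenizer for compel-style weight annotations.
# First alternative: explicit weight, e.g. "(blue hair)1.4".
# Second alternative: a word with optional trailing +/- marks.
TOKEN_RE = re.compile(r"\(([^)]+)\)([\d.]+)|(\S+?)(\++|-+)?(?=\s|$)")

def parse_weights(prompt):
    """Return (fragment, weight) pairs from a weighted prompt."""
    out = []
    for m in TOKEN_RE.finditer(prompt):
        if m.group(1):                      # explicit: (blue hair)1.4
            out.append((m.group(1), float(m.group(2))))
        else:
            word, marks = m.group(3), m.group(4) or ""
            if marks.startswith("+"):
                weight = 1.1 ** len(marks)  # each '+' boosts by 1.1x
            elif marks.startswith("-"):
                weight = 0.9 ** len(marks)  # each '-' attenuates by 0.9x
            else:
                weight = 1.0
            out.append((word, round(weight, 4)))
    return out
```

The weights would then scale the corresponding token embeddings before they reach the text encoder's output, which is the part compel actually handles.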
| Der_Einzige wrote:
| I LOVE this.
|
| All of the techniques they're showing have already existed for a
| while in places like Automatic1111/ComfyUI or their extensions
| (e.g. regional prompting, attention weights). Having it connect
| so seamlessly with rich text is awesome and is a cool UI trick
| that might make normies notice it.
|
| Also, related, but NLP is extremely undertooled on the prompt
| engineering side. Most of the techniques here would work just
| fine on any LLM. If you don't believe me, read this:
| https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...
| LASR wrote:
| How well does this work with LLMs? Has anyone tried it? I'm most
| curious about the references-and-footnotes approach.
| simbolit wrote:
| I looked at this, thought about it, waited an hour, and looked
| at it again, and I can't help but think this is useless.
|
| We can already weight parts of prompts, and we can already
| specify colors or styles for parts of the images. And even if we
| could not, none of this needs rich text.
|
| To begin with, I think their comparisons are dishonest. They
| compare "plaintext" prompts with "rich text" prompts, but the
| rich text prompts contain more information. What? Like,
| seriously, who is surprised that the following two prompts give
| different images?
|
| (1) "A girl with long hair sitting in a cafe, by a table with
| coffee on it, best quality, ultra detailed, dynamic pose."
|
| (2) "A girl with long [Richtext:orange] hair sitting in a cafe,
| by a table with coffee on it, best quality, ultra detailed,
| dynamic pose. [Footnote:The ceramic coffee cup with intricate
| design, a dance of earthy browns and delicate gold accents. The
| dark, velvety latte is in it.]"
|
| The worst part is "Font style indicates the styles of local
| regions". In the comparison-with-other-methods section they
| actually have to specify in parentheses what each font means
| style-wise, because nobody knows and (let's be frank) nobody
| wants to learn.
|
| So why not just use these plaintext parentheses in the prompt?
|
| I really stopped myself from immediately posting my (rather
| negative) opinion, but after over an hour, it hasn't changed. As
| far as I can see, this isn't useful; rich-text prompts are a
| gimmick.
| aenvoker wrote:
| The rich text presentation is merely cute, but the underlying
| feature is very nice. Being able to focus details on a specific
| aspect of an image without worrying about it leaking into other
| aspects would be greatly appreciated.
|
| How about a plain-text interface like this?
|
| > A girl with [long hair](orange) sitting in a cafe, by a table
| with [coffee](^1) on it, best quality, ultra detailed, dynamic
| pose. [^1](Ceramic coffee cup with intricate design, a dance of
| earthy browns and delicate gold accents. The dark, velvety
| latte is in it.)
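The markdown-flavored syntax sketched above could be lowered to the same span/attribute pairs a rich-text interface would produce. A minimal illustrative parser (hypothetical function and names, not from the project):

```python
import re

# Matches [span](attribute); the attribute "^n" references a
# footnote, and a span "^n" defines that footnote's text.
SPAN_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def parse_rich_prompt(text):
    """Split an annotated prompt into a plain prompt plus
    (span, attribute) pairs, resolving ^n footnote references."""
    spans = []          # (span_text, attribute) pairs
    footnotes = {}      # "1" -> footnote text
    pending = []        # spans waiting on a footnote definition

    def record(match):
        span, attr = match.group(1), match.group(2)
        if span.startswith("^"):            # definition: [^1](...)
            footnotes[span[1:]] = attr
            return ""                       # drop from plain prompt
        if attr.startswith("^"):            # reference: [coffee](^1)
            pending.append((span, attr[1:]))
        else:                               # inline: [long hair](orange)
            spans.append((span, attr))
        return span                         # keep the span text itself

    plain = SPAN_RE.sub(record, text)
    for span, key in pending:
        spans.append((span, footnotes.get(key, "")))
    return re.sub(r"\s+", " ", plain).strip(), spans
```

The plain prompt would drive the base generation, while each (span, attribute) pair could feed whatever region-conditioning mechanism the backend supports.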
| phil-martin wrote:
| It feels like that is where the real value is. Imagine
| describing all the assets of a game, story, or something larger
| than a single image as mainly "what" descriptions that refer to
| broad styles of things, and then a second body of text that
| spells out those styles in detail.
|
| It could be a text description of a fighter or noble wearing
| coats or armour, and you could then substitute in different
| style descriptions of coats and armour depending on the family,
| class, race, or other attributes suitable for the world you're
| trying to generate.
| EL_Loco wrote:
| I had the same thought. The gothic church one, for example. Why
| wouldn't I just write "A pink gothic church in the sunset"
| instead of writing "A gothic church" and then having to do the
| extra steps to turn the word "church" into pink? Of course, I'm
| very ignorant of the uses of such tech, so there's probably
| some usefulness in this.
| 90-00-09 wrote:
| The value I see is in constructing more complex prompts.
| Agree with your example but could see myself using this
| feature for prompts with multiple objects/aspects that
| require specific details. Probably not much different from
| inlining all details, just a nice separation of concerns: you
| can describe the high level requirement first, and then add
| and tweak individual details.
___________________________________________________________________
(page generated 2023-10-04 23:00 UTC)