welcome to multimodal.art

A comprehensive guide to understanding the multimodal AI art scene and creating your own text-to-image (and other) pieces.

twitter, instagram, email me: [email protected]

This page is an ever-evolving meta-curation of multimodal AI art content.

A few examples of AI-generated images from a text prompt, by me (more here, and check out our portfolio for my participation in real-world AI art exhibitions):

- A ritual to bring a Tamagotchi back to life (CLIP Guided Diffusion)
- Neuromancer in Ukiyo-e style (VQGAN+CLIP)
- The inauguration of a wormhole between Shanghai and New York (VQGAN+CLIP)
- A mecha-robot in a favela by James Gurney (CLIP Guided Diffusion)

Index

- What is it (non technical, not in depth)
- What is it (in depth, curated content)
- I want to see examples of what it can do
- I want to play with it myself

What is it (non technical, not in depth)

Essentially, these are models that can generate an image from a text prompt. The most famous open-source ones are VQGAN+CLIP, CLIP Guided Diffusion, and Dall-E Mini.

The main ingredient for all the text-to-image AI models is a dataset of hundreds of millions of "image-and-text pairs": images with labels describing what they are.

[Example of a text-image pair]

Those hundreds of millions of image-and-text pairs are then used to train neural networks (such as CLIP or DALL-E) that "learn" features and the connections between the text and the images, without a human telling them what is what. For example, a network can learn that there's a dog, or a grass field, or a red ball, or even a dog's mouth, just by having enough examples of different things in the dataset and "learning" what those are.

Once these models are trained, they can be used - either directly (as in Dall-E-like models) or indirectly (as in CLIP-guided models) - as guidance to generate new images from a text. The idea is that the model is intertwined with other models trained to be good at image generation (such as VQ-VAE, VQGAN or Guided Diffusion): its learned measure of how well a text matches an image (called the loss function, or error function) then guides those generator models toward an image that satisfies the text. A minimal sketch of this scoring mechanism appears after the short history below.

Short history of how we got here

Check out the excellent article The Weird and Wonderful World of AI Art, which goes into way more detail, but in summary: OpenAI's Dall-E started the current trend in January 2021. They haven't released their pre-trained models, but they did release CLIP (a model that can say how well an image matches a text, as described above). With that, Ryan Murdock (@advadnoun) started the trend of hooking CLIP up to image generation models with The Big Sleep (BigGAN + CLIP). After that, Katherine Crowson (@rivershavewings) hooked CLIP up to the VQGAN image generation network, starting a text-to-image Cambrian explosion. A few months later she did the same with CLIP + a model called Guided Diffusion, which dramatically increased the quality of the generations.

In 2022 the process only accelerated, and the number of released models grew so fast that I created a newsletter to keep up with it. The highlights of 2022 so far are definitely the release of Latent Diffusion models, as well as Dall-E 2.
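To make the "how well does an image match a text" idea concrete, here is a minimal sketch that uses CLIP to score one image against two candidate captions. It assumes PyTorch, Pillow and OpenAI's clip package are installed; the file name dog.jpg and the prompts are placeholders, not anything from this site.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder file name and prompts -- swap in your own.
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a dog with a red ball", "a cat on a sofa"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # e.g. tensor([[0.98, 0.02]]) -- the first caption matches best
```

This similarity score is exactly what the guided methods above turn into a loss function: instead of merely reading it, they push the image to increase it.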
Vice/Motherboard featured this scene in a July 2021 piece called "AI-Generated Art Scene Explodes as Hackers Create Groundbreaking New Tools".

Other prominent people in this field: @jbusted1, @nshepperd1, @BoneAmputee, @dribnet, @danielrussruss, @bakztfuture. I also recommend the following Discords: our own Multimodal Art, EleutherAI and LAION.

What is it (in depth, curated content)

- Understanding Multimodal Art: The AI art open source course by Jonathan Whitaker
- Understanding Dall-E: Two Minute Papers video, Yannic Kilcher video, original blog post
- Understanding CLIP: Yannic Kilcher video, original blog post
- Understanding VQGAN+CLIP: Adafruit blog post, Bestiario del Hypogripho
- Understanding Diffusion models: Ari Seff video
- Understanding DALL-E 2: Two Minute Papers
- The next 10 years of Multimodal art by Bakz T. Future

I want to see examples of what it can do

A few examples created by me. For more, follow me on Twitter or Instagram:

- A giant insect protecting the city of Lagos (CLIP Guided Diffusion)
- A cute monster bathing in an acai bowl (CLIP Guided Diffusion)
- A pao de queijo food cart with a Japanese castle in the background by James Gurney (CLIP Guided Diffusion)
- A mecha robot celebrating Diwali by James Gurney (CLIP Guided Diffusion)
- A cute monster taking a shower in a bathtub trending on artstation (CLIP Guided Diffusion)
- Prison Shrimp Night Fight trending on Artstation (CLIP Guided Diffusion)
- A renaissance painting of eyeballs (CLIP Guided Diffusion)
- Two people's silhouettes looking at artificial intelligence art in a gallery (CLIP Guided Diffusion)
- A cute seahorse amigurumi (CLIP Guided Diffusion)
- A landscape resembling the Black Lotus Magic: The Gathering card (CLIP Guided Diffusion)
- A shakira chicken dancing (CLIP Guided Diffusion)
- A giant chicken in an Austrian supermarket by James Gurney (CLIP Guided Diffusion)
- A surrealist sculpture of a GameBoy (CLIP Guided Diffusion)
- The biggest baile funk party in Times Square (VQGAN+CLIP)
- Do not rinse raw chicken before cooking says the FDA (VQGAN+CLIP)
- Mark Zuckerberg regretting having created Facebook, oil on canvas (VQGAN+CLIP)
- Elon Musk saying his final words before his exile on a Jupiter moon, oil on canvas (VQGAN+CLIP)
- Jeff Bezos apologizes to former employees before going to jail, oil on canvas (VQGAN+CLIP)
- Drinking the Milky Way galaxy from a milk bottle
- A couple spending their first Universal Basic Income payment at a fully automated lab-grown meat restaurant
- The online advertisement bubble burst crisis; oil on canvas

Some prominent AI art/model creators: @rivershavewings, @advadnoun, @images_ai, @jbusted1, @nshepperd1, @BoneAmputee, @dribnet, @danielrussruss

I want to play with it myself

We just released MindsEye beta, a GUI for running multiple multimodal art models. Check it out here.

Besides that, check out these resources based on your use case (and this list of tools and other resources you can run on your own):

I have a powerful GPU, I know a bit of coding, and I want to run these models on my local machine
- Run VQGAN+CLIP locally
- Run CLIP Guided Diffusion locally
- Run Dall-E mini locally
- Run many models at once with Visions of Chaos (Windows only)

I know how to use a Google Colab (or I am willing to learn)
- VQGAN+CLIP: original notebook, with pooling trick, MSE regularized
- Guided Diffusion: 512x512px original, Disco Diffusion
- Latent Diffusion: notebook by us

I'm not willing to learn how to use a Colab; I just want a website where I can type the text and get the image out
- Check out MindsEye!

If you're curious what those notebooks and local scripts actually do, the sketch below shows the core CLIP-guidance loop in a few lines.
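The following is a minimal sketch of CLIP guidance in its rawest form: treat the image itself as a trainable tensor and nudge its pixels to raise the CLIP similarity with a prompt. The real tools above (VQGAN+CLIP, CLIP Guided Diffusion) steer a generator's latent codes or a diffusion process instead of raw pixels, and add CLIP's input normalization plus random augmentations; this sketch skips all of that. It assumes PyTorch and OpenAI's clip package; the prompt, learning rate and step count are arbitrary choices of mine.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 so gradients flow cleanly

# Start from noise; 224x224 is the input resolution of ViT-B/32.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)

text = clip.tokenize(["a mecha-robot in a favela"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

optimizer = torch.optim.Adam([image], lr=0.05)
for step in range(200):
    optimizer.zero_grad()
    # (A real pipeline would also apply CLIP's mean/std normalization
    # and random crops/augmentations here.)
    image_features = model.encode_image(image.clamp(0, 1))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    # The loss: negative cosine similarity between image and text embeddings.
    loss = -(image_features * text_features).sum()
    loss.backward()
    optimizer.step()
```

Run as-is, this produces something closer to CLIP-flavored noise than art, which is precisely why the generator models matter: VQGAN or a diffusion model constrains the search to plausible images, while CLIP's loss steers it toward the prompt.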