[HN Gopher] Exploring 12M of the 2.3B images used to train Stabl...
___________________________________________________________________
Exploring 12M of the 2.3B images used to train Stable Diffusion
Author : detaro
Score : 53 points
Date   : 2022-08-30 21:39 UTC (1 hour ago)
(HTM) web link (waxy.org)
(TXT) w3m dump (waxy.org)
| rektide wrote:
 | So excellent. Flipping the story we see all the time on its
 | head. AI's quasi-mystical powers are endless spectacle; taking a
 | look through the other side of the looking glass is vastly
 | overdue. Amazing work.
|
 | This is just starting to scratch the surface: 2% of the data
 | gathered, sources identified. We can now point to a handful of
 | sites as the primary sources powering this AI. The content
 | itself has barely been reviewed or dived into. We have so little
 | sense of & appreciation for what lurks beneath, but this is a
 | start.
| TaylorAlexander wrote:
| "The most frequent artist in the dataset? The Painter of Light
| himself, Thomas Kinkade, with 9,268 images."
|
| Oh that's why it is so good at generating Thomas Kinkade style
| paintings! I ran a bunch of those and they looked pretty good.
| Some kind of garden cottage prompt with Thomas Kinkade style
| works very well. Good image consistency with a high success rate,
| few weird artifacts.
| lmarcos wrote:
| I always had the crazy idea of "infinite entertainment": somehow
| we manage to "tap" into the multiverse and are able to watch TV
 | from countless planets/universes (I think Rick and Morty did
| something similar). So, in some channel at some time you may be
| able to see Brad Pitt fighting against Godzilla while the monster
 | is hacking into the Pentagon using ssh. Highly improbable, but in
| the multiverse TV everything is possible.
|
| Now I think we don't need the multiverse for that. Give this AI
| technology a few years and you'll have streaming services a la
| Netflix where you provide the prompt to create your own movie.
| What the hell, people will vote "best movie" among the millions
 | submitted by other people. We'll be movie producers the way
 | we're YouTubers nowadays. An overabundance of high-quality
 | material and so little time to take it all in. The same goes for
 | books, music and everything else that is digital (even
 | software?).
| temp_account_32 wrote:
| Isn't everything representable in a digital form? I think we're
| in the very early era of entertainment becoming commoditized to
| an even higher degree than it is now.
|
| I envision exactly the future as you describe: Feed a song to
| the AI, it spits out a completely new, whole discography from
| the artist complete with lyrics and album art that you can
| listen to infinitely.
|
| "Hey Siri, play me a series about chickens from outer space
| invading Earth": No problem, here's a 12 hour marathon,
| complete with a coherent storyline, plot twists, good acting
| and voice lines.
|
| The only thing that is currently limiting us is computing
| power, and given enough time, the barrier will be overcome.
|
| A human brain is just a series of inputs, a function that
| transforms them, and a series of outputs.
| gpm wrote:
| Huh, there's a ton of duplicates in the data set... I would have
| expected that it would be worthwhile to remove those. Maybe
| multiple descriptions of the same thing helps, but some of the
| duplicates have duplicated descriptions as well. Maybe
| deduplication happens after this step?
|
| http://laion-aesthetic.datasette.io/laion-aesthetic-6pls/ima...
| minimaxir wrote:
 | Per the project page:
 | https://laion.ai/blog/laion-400-open-dataset/
|
| > There is a certain degree of duplication because we used
| URL+text as deduplication criteria. The same image with the
| same caption may sit at different URLs, causing duplicates. The
| same image with other captions is not, however, considered
| duplicated.
|
| I am surprised that image-to-image dupes aren't removed,
| though, as the cosine similarity trick the page mentions would
| work for that too.
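
The cosine-similarity trick mentioned above can be sketched as follows. This is a minimal illustration, assuming each image already has an embedding vector (e.g. from a CLIP-style encoder); it is not the actual LAION pipeline, and the 0.98 threshold and brute-force pairwise loop are illustrative choices only (a 2B-image dataset would need approximate nearest-neighbor search instead):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def near_duplicates(embeddings, threshold=0.98):
    """Flag image pairs whose embeddings are nearly parallel.

    embeddings: dict mapping image id -> embedding vector.
    Brute force O(n^2); fine for a sketch, not for 2B images.
    """
    ids = list(embeddings)
    pairs = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            sim = cosine_similarity(embeddings[ids[i]], embeddings[ids[j]])
            if sim >= threshold:
                pairs.append((ids[i], ids[j]))
    return pairs
```

Two near-identical images produce embeddings with cosine similarity close to 1.0, so thresholding catches visual dupes even when their URLs and captions differ.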
| kaibee wrote:
| I assume having multiple captions for the same image is very
| helpful actually.
| minimaxir wrote:
| Scrolling through the sorted link from the GP, there are a
| few dupes with identical images and captions, so that
| doesn't always work either.
 | gchamonlive wrote:
 | Isn't it really expensive to dedupe images based on content,
 | since you'd have to compare every image to every other image
 | in the dataset?
 |
 | How could one go about deduping images? Maybe something
 | similar to the rsync protocol: a cheap hash first, then a more
 | expensive one, then a full comparison. Even so, with 2B+
 | images... and you're mostly talking about saving on storage
 | costs, which are quite cheap these days.
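
The staged approach described in the comment above (cheap check first, expensive comparison only on collisions) might look like this sketch. The bucket fingerprint and the SHA-256 stage are illustrative choices, and byte-level hashing only catches exact duplicates, not perceptually similar ones (that would need the embedding-similarity approach instead):

```python
import hashlib
from collections import defaultdict

def dedupe_images(images):
    """Two-stage exact dedup: cheap fingerprint, then full hash.

    images: dict mapping image id -> raw image bytes.
    Returns the ids to keep (first occurrence of each image wins).
    """
    # Stage 1: bucket by a cheap fingerprint (length + first/last
    # 16 bytes). Most non-duplicates never reach the expensive stage.
    buckets = defaultdict(list)
    for img_id, data in images.items():
        buckets[(len(data), data[:16], data[-16:])].append(img_id)

    keep = []
    for ids in buckets.values():
        # Stage 2: within a colliding bucket, compare full SHA-256
        # digests, which is effectively a full-content comparison.
        seen = set()
        for img_id in ids:
            digest = hashlib.sha256(images[img_id]).hexdigest()
            if digest not in seen:
                seen.add(digest)
                keep.append(img_id)
    return keep
```

The point of the staging is that the cheap fingerprint is computed once per image in O(1), so the expensive hash is only ever computed inside buckets that already look suspicious.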
___________________________________________________________________
(page generated 2022-08-30 23:00 UTC)