[HN Gopher] Search 5.8B images used to train popular AI art models
___________________________________________________________________
Search 5.8B images used to train popular AI art models
Author : homarp
Score : 37 points
  Date   : 2022-09-14 21:17 UTC (1 hour ago)
(HTM) web link (haveibeentrained.com)
(TXT) w3m dump (haveibeentrained.com)
| naet wrote:
| I put in Donald Trump to see what kind of celebrity images might
| be in there, and there are a TON of memes / photoshopped versions
| of him looking like a caricature or otherwise warped. I wonder if
| the AI will average these into a fair resemblance, or whether
| prompts using his name will end up more cartoonish than other
| names due to the source data...
| shantara wrote:
| Does this website search the same dataset as
| https://knn5.laion.ai?
| latchkey wrote:
  | I just uploaded a picture of my dog (a Bichon Frise) and it
  | showed me a bazillion nearly identical dogs.
|
| Why isn't this immediately being used as a missing persons (or
| pet) database service?
| at_a_remove wrote:
| Let's think of reasons!
|
| 1) Privacy.
|
| 2) Lack of geolocation data associated.
|
| 3) Privacy.
|
| 4) Lack of contact information attached.
|
| 5) Privacy.
|
| 6) Lack of case numbers for various missing persons cases being
| attached.
|
| 7) Privacy.
| latchkey wrote:
  | Certainly this could be used for evil (tm), but it seems like
  | something could also be built that enables this for good, in
  | ways that don't cause issues with Mr. Fibonacci.
| groby_b wrote:
| Because you're missing a person, not a picture of them?
| philipkglass wrote:
| I think that you answered your own question. You uploaded a
| picture of your non-missing dog and it found many very similar
| looking dogs.
| fimdomeio wrote:
  | Sometimes people are missing, and sometimes they just have
  | legitimate reasons to not want to be found.
| version_five wrote:
| I found this very similar to what you'd get with a google / bing
| / etc image search. Is that where this database comes from? I
| noticed there is a lot of "Shutterstock" watermarked stuff. And I
  | also checked a few "adult" terms (large breasts etc.) and found
  | there is a lot of nude content. I'm only curious because I've
  | seen that lots of the generative models have post-filtering for
  | nudity; why don't they just clean it out of the training data
  | if they're worried?
| mgraczyk wrote:
  | Half-page popup cookie banner with no opt-out option. I am
  | completely fine with cookies, but clearly something is wrong
  | with the state of things when I have to click through a
  | completely non-actionable popup.
| 0xrisk wrote:
| I'm one of the people building this. Hi HN, AMA :)
| telotortium wrote:
| What visual search engine are you using?
| simandl wrote:
  | This is CLIP matching the text or image you searched to the
  | images in the LAION-5B dataset.
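The retrieval described above (one embedding space shared by text and images, searched by nearest neighbor) can be sketched roughly as follows. This is a minimal illustration with random mock embeddings; a real deployment would use a CLIP model's encoders and an approximate-nearest-neighbor index over the LAION-5B embeddings, and all names here are hypothetical:

```python
# Sketch of CLIP-style retrieval over precomputed image embeddings.
# Mock data only: real systems encode images with CLIP's image
# encoder and queries with its text (or image) encoder.
import numpy as np

def normalize(v):
    """L2-normalize rows so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Pretend these rows came from CLIP's image encoder (one per image).
rng = np.random.default_rng(0)
image_embeddings = normalize(rng.normal(size=(1000, 512)))

def search(query_embedding, k=5):
    """Return indices of the k images most similar to the query.

    The query embedding can come from the text encoder (text search)
    or the image encoder (reverse-image search); both live in the
    same space, which is why either query type works.
    """
    sims = image_embeddings @ normalize(query_embedding)
    return np.argsort(-sims)[:k]

query = normalize(rng.normal(size=512))
top = search(query)
```

Because text and image embeddings share one space, the same index answers both "show me images like this photo" and "show me images matching this phrase".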
| tener wrote:
| Privacy policy for searched strings?
|
| "Rutkowski" returns a bunch of book covers, repeated a lot. Can
| you ensure images returned have diverse embeddings? I expected
| digital art, not detective stories.
|
| Do you use CLIP or just metadata?
|
  | What is the intended process that starts after you get an
  | artist's email (for either purpose)?
| simandl wrote:
  | It's using CLIP to match the text to the image, so you can
  | actually prompt it like you might an art generator. Here's
  | "in the style of greg rutkowski":
  | https://haveibeentrained.com/?search_text=in%20the%20style%2...
|
| In the next few weeks we'll be adding the ability to log in
| and flag or upload your works (if they aren't there). Those
| lists will have permissions assigned to them, starting with
| simple opt-in or opt-out.
| educaysean wrote:
  | One thing that sticks out to me is how many of the images in
  | the collection have really terrible labels. I uncovered a large
  | collection of pieces by an illustrator who was unsearchable by
  | name, only via image upload. The reason they were unsearchable:
  | the majority of this artist's images had labels in the format:
|
| {username}'s profile image
| constantlm wrote:
| I'd love to know how this works. I entered my own name for the
| lols, and it returned mostly paintings of the Cape Winelands in
| South Africa where I grew up, which is pretty creepy.
| 0xrisk wrote:
  | That is crazy. Perhaps your family name is common in that
  | region?
| prox wrote:
  | It surprises me how many meme images there are. Aren't they
  | low-quality content?
  |
  | I haven't tried it, but I haven't seen SD make any memes by
  | itself yet.
| simandl wrote:
  | Stable Diffusion used an aesthetic filter to train on a subset
  | of the English-language images from this full 5.8 billion
  | multi-language set. That filter probably caught a lot of what
  | you're finding.
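The subsetting described above amounts to keeping only records that pass a language check and an aesthetic-score threshold. A minimal sketch with made-up records and a hypothetical cutoff (not LAION's actual scores or threshold):

```python
# Sketch of carving a training subset out of a larger index by
# language and a predicted aesthetic score. Records, scores, and
# the threshold are all mock values for illustration.
records = [
    {"url": "a.jpg", "lang": "en", "aesthetic": 6.2},
    {"url": "b.jpg", "lang": "de", "aesthetic": 7.1},
    {"url": "c.jpg", "lang": "en", "aesthetic": 3.4},
    {"url": "d.jpg", "lang": "en", "aesthetic": 5.9},
]

THRESHOLD = 5.0  # hypothetical cutoff; real pipelines tune this

subset = [r for r in records
          if r["lang"] == "en" and r["aesthetic"] >= THRESHOLD]
# b.jpg is dropped for language, c.jpg for low score.
```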
| gauravphoenix wrote:
  | Interestingly, if you search for "127.0.0.1", the API response
  | is {"message":"Forbidden"} and the page says: "Sorry, there was
  | an error with your search. Please try a different request."
| jonas-w wrote:
| Same for "localhost" but "[::1]" works.
| gigel82 wrote:
| It's surprising how poorly labeled these images are; who is
| curating this collection?
|
  | Can't they crowd-source a proper labeling project? I wonder how
  | much better things like Stable Diffusion would be if their
  | training included correct, complete labels for the images. I'm
  | sure lots of folks would willingly spend a few minutes here and
  | there to help with the labeling if it means they get to enjoy
  | the model for free.
| orbital-decay wrote:
| If you aren't Google, manually doing that with 5+ billion
| images might prove difficult, to put it mildly. Large-scale
| labeling is typically bootstrapped with smaller models and
| whatever manual data you have. What's being curated is the
| bootstrapping process.
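The bootstrapping described above is often a pseudo-labeling loop: a small model trained on the little manual data you have labels the rest, and only its confident guesses are kept. A toy sketch, where the "small model" is a trivial nearest-centroid classifier over fake feature vectors (everything here is hypothetical mock data):

```python
# Sketch of bootstrapped (pseudo-)labeling with a tiny seed set.
import numpy as np

rng = np.random.default_rng(1)
# Small manually labeled set: 10 samples each of two classes,
# separated by an offset so the toy problem is learnable.
labeled_x = rng.normal(size=(20, 8)) + np.repeat([[0.0], [3.0]], 10, axis=0)
labeled_y = np.repeat([0, 1], 10)
# Large unlabeled pool drawn from the same two clusters.
unlabeled_x = rng.normal(size=(100, 8)) \
    + rng.integers(0, 2, 100)[:, None] * 3.0

# "Train" the small model: one centroid per class.
centroids = np.stack(
    [labeled_x[labeled_y == c].mean(axis=0) for c in (0, 1)])

# Pseudo-label: assign each unlabeled point to its nearest centroid,
# keeping only cases where the distance margin is comfortably large.
d = np.linalg.norm(unlabeled_x[:, None, :] - centroids[None, :, :],
                   axis=-1)
pred = d.argmin(axis=1)
margin = np.abs(d[:, 0] - d[:, 1])
confident = margin > 1.0
pseudo_x, pseudo_y = unlabeled_x[confident], pred[confident]
```

The confident pseudo-labels are then folded into the training set and the cycle repeats with a stronger model; low-margin samples are exactly the ones a human (or, per the comment above this one, the artist) would be asked to label.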
| 0xrisk wrote:
  | Part of our idea is that artists can help with labeling their
  | own work.
| simandl wrote:
| This is Laion-5B, you can read more about it here:
| https://laion.ai/blog/laion-5b/
|
| Imagen and Stable-Diffusion both used subsets of this full 5.8B
| image set.
| homarp wrote:
| tweet explaining:
| https://twitter.com/matdryhurst/status/1570143343157575680
|
| " Releasing our first Spawning tool to help artists see if they
| are present in popular AI Art training data, and register to use
| our tools to opt in and opt out of AI training
|
| I think we have created a way to make this work out well for
| everyone"
|
| and StableDiffusion's Emad Mostaque agreeing to support the
| initiative
| https://twitter.com/EMostaque/status/1570158985852121090
___________________________________________________________________
(page generated 2022-09-14 23:00 UTC)