[HN Gopher] A major AI training data set contains millions of ex...
___________________________________________________________________
A major AI training data set contains millions of examples of
personal data
Author : pera
Score : 89 points
Date : 2025-07-30 09:59 UTC (13 hours ago)
(HTM) web link (www.technologyreview.com)
(TXT) w3m dump (www.technologyreview.com)
| kristianp wrote:
| https://archive.is/k7DY3
| cheschire wrote:
| I hope future functionality of haveibeenpwned includes a tool to
| search LLM models and training data for PII based on the
| collected and hashed results of this sort of research.
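|
| Roughly what I have in mind, as a sketch in Python: hash the
| PII client-side and query a range API, the way HIBP's real
| k-anonymity range API for pwned passwords works today. The
| endpoint and response format below are hypothetical.
|
| # Hypothetical k-anonymity lookup for PII found in training
| # sets, modeled on the real HIBP Pwned Passwords range API.
| # The endpoint below does not exist; it is an assumption.
| import hashlib
| import urllib.request
|
| def pii_in_training_data(value: str) -> bool:
|     digest = hashlib.sha1(
|         value.strip().lower().encode("utf-8")).hexdigest().upper()
|     # Only the first 5 hex chars leave the client, so the
|     # service never sees the full hash (k-anonymity).
|     prefix, suffix = digest[:5], digest[5:]
|     url = f"https://api.example-pii.test/range/{prefix}"
|     with urllib.request.urlopen(url) as resp:
|         body = resp.read().decode("utf-8")
|     # Assumed response format, as in HIBP: "SUFFIX:COUNT" lines.
|     return any(line.split(":")[0] == suffix
|                for line in body.splitlines())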
| croes wrote:
| Hard to search in the model itself
| cheschire wrote:
| Yep, that's why at the end of my sentence I referred to the
| results of research efforts like this that do the hard work
| of extracting the information in the first place.
| pera wrote:
| Yesterday I asked if there is any LLM provider that is GDPR
| compliant: at the moment I believe the answer is no.
|
| https://news.ycombinator.com/item?id=44716006
| tonyhart7 wrote:
| so your best bet is an open-weight LLM then???
|
| but is that a breach of GDPR???
| atoav wrote:
| Only if it contains personal data you collected without
| explicit consent ("explicit" here means literally asking: "I
| want to use this data for that purpose, do you allow this?
| Y/N").
|
| Also people who have given their consent before need to be
| able to revoke it at any point.
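|
| In practice that means a controller has to keep purpose-bound,
| revocable consent records. A minimal sketch in Python (the
| schema is illustrative, not anything the GDPR prescribes):
|
| # Illustrative consent record: purpose-bound and revocable.
| from dataclasses import dataclass
| from datetime import datetime, timezone
|
| @dataclass
| class Consent:
|     subject_id: str   # the person who consented
|     purpose: str      # e.g. "train generative image model"
|     granted_at: datetime
|     revoked_at: datetime | None = None
|
|     def revoke(self) -> None:
|         # Must be possible at any time; later processing for
|         # this purpose then needs a different legal basis.
|         self.revoked_at = datetime.now(timezone.utc)
|
|     def may_process(self) -> bool:
|         return self.revoked_at is None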
| tonyhart7 wrote:
| so the EU basically locked itself out of the AI space????
|
| idk, but how can we do that while staying GDPR compliant etc???
| croes wrote:
| So basically EU citizens could sue all AI providers
| galangalalgol wrote:
| I don't think they have to? The bodies in charge can
| simply levy a fine of up to 4% of global turnover, I think. The weird case is if
| all the companies self hosting open weight models become
| liable as data controllers and processors. If that was
| aggressively penalized they could kill AI use within the
| EU and depending on the math, such companies might choose
| to pull operations from the EU instead of giving up use
| of AI.
|
| Edit: that last bit is probably catastrophic thinking.
| Enforcement has always been calibrated precisely enough to
| cause compliance rather than withdrawal from the market.
| croes wrote:
| I don't think killing AI use in the EU is the only thing
| that could happen.
|
| You can't steal something and avoid punishment just
| because you don't sell in the country where the theft
| happened.
| galangalalgol wrote:
| You absolutely can depending on the countries involved. A
| recent extreme example is hackers in NK stealing
| cryptocurrency. A more regular one is Chinese
| manufacturers stealing designs. If the countries where
| the thieves live and operate won't prosecute them, there
| is no recourse. The question for multinationals is whether
| continuing to operate in the EU is worth giving up their
| models, and if the countries they are headquartered in
| care or can be made to.
| croes wrote:
| If those countries still want to be able to enforce their
| own IP in the EU, I guess they will.
|
| Tit for tat.
|
| NK isn't really a business partner in the world.
| galangalalgol wrote:
| But China is, and Western countries including those in
| the EU have frequently ignored such things. Looking
| closer, this really only affects diffusion models, which
| are much cheaper to retrain. The exception is integrated
| models like Gemini and GPT-4V, where retraining might
| reasonably cost more than the fine. Behemoths like Google
| and OpenAI won't bail over a few hundred million, unless they see it
| is likely to happen repeatedly, in which case they would
| likely decouple the image models. But there is nothing to
| say some text database that is widely used isn't
| contaminated as well. Maybe only China will produce
| models in the future. They don't care if you enforce
| their IP.
|
| Edit: After more reading: Clearview AI did exactly this;
| they ignored all the EU rulings and the UK refused to
| enforce them. They were fined tens of millions and paid
| nothing. Stability is now also a UK company that used PII
| images for training; it seems quite likely they will try
| to walk that same path given their financial situation.
| Meta is facing so many fines and lawsuits who knows what
| it will do. Everyone else will call it cost of business
| while fighting it every step of the way.
| pera wrote:
| While my question was in relation to GDPR there are
| similar laws in the UK (DPA) and in California (CCPA).
|
| Also note that AI is not just generative models, and
| generative models don't need to be trained with personal
| data.
| jeroenhd wrote:
| I'm sure it's possible, but AI companies don't invest
| much money into complying with the law as it's not
| profitable.
|
| A normal industry would've figured out how to deal with
| this problem before going public, but AI people don't
| seem to be all that interested.
|
| I'm sure they'll all cry foul if one of them gets hit with
| a fine and an order to figure out how to fix the mess
| they've created, but this is what you get when you don't
| teach ethics to computer scientists.
| xxs wrote:
| > need to be able to revoke it at any point.
|
| They also have to be able to ask whether (and how much of)
| their data is being used, and how.
| pera wrote:
| There is currently no effective method for unlearning
| information - especially not when you don't have access to the
| original training datasets (as is the case with open weight
| models), see:
|
| _Rethinking Machine Unlearning for Large Language Models_
|
| https://arxiv.org/html/2402.08787v6
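|
| For a flavor of why: the gradient-ascent family of methods the
| paper surveys boils down to pushing the loss _up_ on a "forget
| set" while holding it down on retained data. A sketch (assumes
| a HuggingFace-style causal LM where calling the model with
| labels returns a loss; all names are placeholders):
|
| # Sketch of gradient-ascent unlearning: ascend on the forget
| # set, descend on retained data to limit collateral damage.
| # Illustrative only; this is exactly the kind of approximate
| # method the survey finds brittle.
| def unlearn_step(model, forget_batch, retain_batch, opt,
|                  alpha=1.0):
|     forget_loss = model(**forget_batch).loss
|     retain_loss = model(**retain_batch).loss
|     loss = -alpha * forget_loss + retain_loss  # negate = ascend
|     opt.zero_grad()
|     loss.backward()
|     opt.step()
|     return forget_loss.item(), retain_loss.item()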
| thrance wrote:
| Mistral's products are supposed to be, at least, since they
| are based in the EU.
| pera wrote:
| I am not sure if Mistral is: if you go to their GDPR page
| (https://help.mistral.ai/en/articles/347639-how-can-i-exercis...)
| and then to the erasure request section they just link to a
| "How can I delete my account?" page.
|
| Unfortunately they don't provide information regarding their
| training sets
| (https://help.mistral.ai/en/articles/347390-does-mistral-ai-c...),
| but I think it's safe to assume they include DataComp
| CommonPool.
| itsalotoffun wrote:
| I WISH this mattered. I wish data breaches actually carried
| consequences. I wish people cared about this. But people don't
| care. Right up until you're targeted for ID theft, fraud or
| whatever else. But by then the causality feels so diluted that
| it's "just one of those things" that happens randomly to good
| people, and there's "nothing you can do". Horseshit.
| atoav wrote:
| It doesn't now, but we _could_ collectively decide to introduce
| consequences of the kind that deter anybody from trying this
| again.
| jelvibe25 wrote:
| What's the right consequence in your opinion?
| passwordoops wrote:
| Criminal liability with a minimum of 2 years served for
| executives, and fines amounting to 110% of total global
| revenue for the company that allowed the breach, would see
| cybersecurity taken a lot more seriously in a hurry.
| lifestyleguru wrote:
| Would be nice to have executives finally responsible for
| something.
| bearl wrote:
| Internet commerce requires databases with pii that will be
| breached.
|
| Who is to blame for internet commerce?
|
| Our legislators. Maybe specifically we can blame Al Gore,
| the man who invented the internet. If we had put warning
| labels on the internet like we did with N.W.A and 2 Live
| Crew (Gore's second-best achievement), we wouldn't be a
| failed democracy right now.
| krageon wrote:
| A stolen identity destroys the life of the victim, and
| there's going to be more than one. They (every single
| involved CEO) should have all of their assets seized, to be
| put in a fund that is used to provide free legal support to
| the victims. Then they should go to a low-security prison and
| have mandatory community service for the rest of their lives.
|
| They probably can't be redeemed and we should recognise that,
| but that doesn't mean they can't spend the rest of their life
| being forced to be useful to society in a constructive way.
| Any sort of future offense (violence, theft, assault,
| anything really) should mean we give up on them. Then they
| should be humanely put down.
| rypskar wrote:
| We should also stop calling it ID theft. The identity is not
| stolen; the owner still has it. Calling it ID theft moves the
| responsibility from the party the fraud is committed against
| (often banks or other large entities) onto an innocent third
| party.
| herbturbo wrote:
| Yes, tricking a bank into thinking you are one of their
| customers is not the same as assuming someone else's
| identity.
| messagebus wrote:
| As always, Mitchell and Webb hit the nail precisely on the
| head.
|
| https://www.youtube.com/watch?v=CS9ptA3Ya9E
| JohnFen wrote:
| > Calling it ID theft moves the responsibility from the
| party the fraud is committed against (often banks or other
| large entities)
|
| The victim of ID theft is the person whose ID was stolen. The
| damage to banks or other large entities pales in comparison
| to the damage to those people.
| rypskar wrote:
| I probably didn't express myself well enough. By calling
| it ID theft you are blaming the person the ID belongs to,
| and that person has to prove they are innocent. By calling
| it by the correct words, bank fraud, the bank has to prove
| that the person the ID belongs to did it. No ID was stolen;
| it was only used by someone else to commit fraud. The banks
| don't have enough security to stop it because they have
| gotten away with calling it ID theft and putting the blame
| on the person the ID belongs to.
| laughingcurve wrote:
| It's not clear to me how this is a data breach at all. Did the
| researchers hack into some database and steal information? No?
|
| Because afaik everything they collected was public web. So now
| researchers are being lambasted for having data in their sets
| that others released.
|
| That said, masking obvious numbers like SSNs is low-hanging
| fruit. Trying to scrub every piece of public information
| about a person that can identify them is insane.
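|
| The low-hanging fruit really is cheap. A sketch of rule-based
| masking (real pipelines layer NER models on top of rules; the
| patterns here are illustrative):
|
| # Regex-mask obvious identifiers before publishing a dataset.
| import re
|
| PATTERNS = {
|     "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
|     "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
| }
|
| def mask_pii(text: str) -> str:
|     for label, pattern in PATTERNS.items():
|         text = pattern.sub(f"[{label}]", text)
|     return text
|
| print(mask_pii("Reach me at jane@example.com, SSN 123-45-6789."))
| # -> Reach me at [EMAIL], SSN [SSN].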
| imglorp wrote:
| Reader mode works on this site.
| satvikpendem wrote:
| This is all public data. People should not put personal
| data on public image hosts and sites like LinkedIn if they
| do not want it to be scraped. There is nothing private about
| the internet, and I wish people understood that.
| malfist wrote:
| What's important is that we blame the victims instead of the
| corporations that are abusing people's trust. The victims
| should have known better than to trust corporations
| blitzar wrote:
| > blame the victims
|
| If you post something publicly you can't complain that
| it is public.
| lewhoo wrote:
| But I can complain about what happens to said something. If
| my blog photo becomes deepfake porn, am I allowed to
| complain or not? What we have (with AI) is an entirely novel
| situation, worth at least a serious discussion.
| blitzar wrote:
| > But I can complain about what happens to said something
|
| no.
|
| > but ...
|
| no.
| YetAnotherNick wrote:
| > If my blog photo becomes deepfake porn
|
| Depends. In most cases this is forbidden by law
| and you can claim actual damages.
| kldg wrote:
| That's helpful if they live in the same country, if you can
| figure out who the 4chan poster was, if the police are
| interested (or you want to risk paying a lawyer), if you're
| willing to sink the time pursuing such action (and, if
| criminal, risk adversarial LEO interaction), and if you are
| satisfied knowing hundreds of others may be doing the same
| and won't be deterred. Of course, friends and co-workers
| are too close to you to post it publicly when they
| generate it. Thankfully, the Taylor Swift laws in the US
| have stopped generation of nonconsensual imagery and video
| of their namesake (they haven't).
|
| Daughter's school posted pictures of her online without
| an opt-out, but she's also on Facebook from family
| members, and it's just kind of... well beyond the point of
| trying to suppress. Probably just best to accept that people
| can imagine you naked, at any age, doing anything.
| What's your neighbor doing with the images saved from his
| Ring camera pointed at the sidewalk? :shrug:
| YetAnotherNick wrote:
| I am not talking about a 4chan poster. I am talking about
| when a company does it.
| dpoloncsak wrote:
| FWIW... I really don't think so. If you, say, posted your
| photo on a bulletin board in your local City Hall, can
| you prevent it from being defaced? Can you choose who
| gets to look at it? Maybe they take a picture of it and
| trace it... do you have any legal ground there? (Genuine
| question.) And even if so... it's illegal to draw angry
| eyebrows on every face on a billboard, but people still do
| it...
|
| IMO, it being posted online to a publicly accessible site
| is the same. Don't post anything you don't want right-
| click-saved.
| malfist wrote:
| Sure, and if I put out a local lending library box in my
| front yard I shouldn't be annoyed by the neighbor that
| takes every book out of it and throws it in the trash.
|
| Decorum and respect expectations don't disappear the moment
| it's technically feasible to be an asshole
| YetAnotherNick wrote:
| That's a bad analogy. Most people, including me, do expect
| that their "public" data is used for AI training. I mean,
| based on the ads everyone gets, most people know perfectly
| well that anything they post online will be used in AI.
| JohnFen wrote:
| > Most people, including me, do expect that their "public"
| data is used for AI training.
|
| Based on what ordinary people have been saying, I don't
| think this is true. Or, _maybe_ it's true now that the
| cat is out of the bag, but I don't think most people
| expected this before.
|
| Most tech-oriented people did, of course, but we're a
| small minority. And even amongst our subculture, a lot of
| people didn't see this abuse coming. I didn't, or I would
| have removed all of my websites from the public web years
| earlier than I did.
| YetAnotherNick wrote:
| > Most tech-oriented people did
|
| In fact it's the opposite. People who aren't into tech
| think Instagram is listening to them 24/7 to target their
| feed and ads. There was even a hoax in my area among
| elderly groups that WhatsApp was using profile photos for
| illegal activity, and at one point many people removed
| their photos.
|
| > I didn't, or I would have removed all of my websites
| from the public web years earlier than I did.
|
| Your comment is public information. In fact, posting
| anything on HN is a surefire way to give your content
| away for AI training.
| JohnFen wrote:
| > People who aren't into tech think Instagram is
| listening to them 24/7 to target their feed and ads
|
| True, but that's worlds away from thinking that
| your data will be used to train genAI.
|
| > In fact, posting anything on HN is a surefire way to
| give your content away for AI training.
|
| Indeed so, but HN seems to be a bad habit I just can't
| kick. However, my comments here are the entirety of what
| I put up on the open web and I intentionally keep them
| relatively shallow. I no longer do long-form blogging or
| make any of my code available on the open web.
|
| However, you're right. Leaving HN is something that I
| need to do.
| bearl wrote:
| No, the average person has no idea what "AI training"
| even is. Should the average person have an above-average
| IQ? Yes. Could they? No. Don't be average yourself.
| malfist wrote:
| Are you trying to argue that 10 years ago, when I uploaded
| my resume to LinkedIn, I should have known it'd be used
| for AI training?
|
| Or that a teenager who signed up for Facebook should know
| that the embarrassing things they're posting are going to
| train AI and are, as you called it, public?
|
| What about the blog I started 25 years ago and then took
| down, but which lives on in the GeoCities archive? Was I
| supposed to know it'd go to an AI overlord corporation
| when I was in middle school writing about dragon photos I
| found on Google?
|
| And we're not even getting into data breaches, or
| something that was uploaded as private and then sold when
| the corporation changed their privacy policy decades
| after it was uploaded.
|
| It's not a bad analogy when you don't give all the graces
| to corporations and none to the exploited.
| victorbjorklund wrote:
| Seriously, when YOU posted something on the internet 20
| years ago, did you expect it to be used by a corporation to
| train an AI 20 years later?
| nerdjon wrote:
| Right, both things can be wrong here.
|
| We need to better educate people on the risks of posting
| private information online.
|
| But that does not absolve these corporations of criticism of
| how they are handling data and "protecting" people's privacy.
|
| Especially not when those companies are using dark patterns
| to convince people to share more and more information with
| them.
| thinkingtoilet wrote:
| If this were 2010 I would agree. This is the world we live in.
| If you post a picture of yourself on a lamp post on a street
| in a busy city, you can't be surprised if someone takes it.
| It's the same on the internet and everyone knows it by now.
| squigz wrote:
| > The victims should have known better than to trust
| corporations
|
| Literally yes? Is this sarcasm? Are we in 2025 supposed to
| implicitly trust multi-billion dollar multi-national
| corporations that have decades' worth of abuses to look back
| on? As if we couldn't have seen this coming?
|
| It's been part of every social media platform's ToS for many
| years that they get a license to do whatever they want with
| what you upload. People have warned others about this for
| years and nothing happened. Those platforms have already
| used that data prior to this for image classification,
| identification and the like. But nothing happened. What's
| different now?
| keybored wrote:
| Modern companies: We aim to create or use human-like AI.
|
| Those same modern companies: Look, if our users inadvertently
| upload sensitive or private information then we can't really
| help them. The heuristics for detecting those kinds of things
| are just too difficult to implement.
| Workaccount2 wrote:
| I have negative sympathy for people who still aren't aware
| that if they aren't paying for something, they are the
| something to be sold. This has been the case for almost 30
| years now with the majority of services on the internet,
| _including this very website right here_.
| gishglish wrote:
| Tbh, even if they are paying for it, they're probably still
| the product. Unless maybe they're an enterprise customer
| who can pay orders of magnitude more for relative privacy.
| malfist wrote:
| That explains why ISPs sell DNS lookup history, or your
| utility company sells your habits. Or your TV tracks your
| viewership. I've paid for all of those, but somehow, I'm
| still the product.
| jeroenhd wrote:
| AI and scraping companies are why we can't have nice things.
|
| Of course privacy law doesn't necessarily agree with the idea
| that you can just scrape private data, but good luck getting
| that enforced anywhere.
| pera wrote:
| > _This is all public data_
|
| It's important to know that generally this distinction is not
| relevant when it comes to data subject rights like GDPR's right
| to erasure: If your company is processing any kind of personal
| data, including publicly available data, it must comply with
| data protection regulations.
| booder1 wrote:
| The law has in no way been able to keep up with AI. Just
| look at copyright. Internet data is public, and the
| government is incapable of changing this.
| Anonbrit wrote:
| A hidden camera can make your bedroom public. Don't do it if
| you don't want it to be on pay-per-view?
| satvikpendem wrote:
| That is indeed what Justin.tv did, to much success. But that
| was because Justin had consented to it, just as anyone
| posting anything online consents to it being seen by
| anyone.
| dpoloncsak wrote:
| Does this analogy really apply? Maybe I'm misunderstanding,
| but it seems like all of this data was publicly available
| already, and scraped from the web.
|
| In that case, it's not a 'hidden camera'... users uploaded
| this data and made it public, right? I'm sure some of it was
| due to misconfiguration or whatever (like we saw with Tea),
| but it seems like most of this was uploaded by the user to
| the clear web. I'm all for "don't blame the victims", but if
| you upload your credit card to Imgur I think you deserve to
| have to get a new card.
|
| Per the article "CommonPool ... draws on the same data
| source: web scraping done by the nonprofit Common Crawl
| between 2014 and 2022."
| dlivingston wrote:
| Your analogy doesn't hold. A 'hidden camera' would be either
| malware that does data exfiltration, or the company
| selling/training on your data outside of the bounds of its
| terms of service.
|
| A more apt analogy would be someone recording you in public,
| or an outside camera pointed at your wide-open bedroom
| window.
| djoldman wrote:
| Just to be clear, as with LAION, the data set doesn't _contain_
| personal data.
|
| It contains _links_ to personal data.
|
| The title is like saying that sending a magnet link to a
| copyrighted torrent is distributing copyrighted material.
| Folks can argue over whether that's true, but the discussion
| should at least be transparent.
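|
| Which also means anyone can check whether the set points at
| them: the distributed artifact is metadata shards (parquet
| files of URLs and captions). A sketch - the "url" column name
| and file name are assumptions; check the actual schema of the
| shard you download:
|
| # Scan a CommonPool/LAION-style metadata shard for your domain.
| import pandas as pd
|
| def rows_pointing_at(shard_path: str, domain: str):
|     df = pd.read_parquet(shard_path)
|     return df[df["url"].str.contains(domain, case=False,
|                                      na=False)]
|
| hits = rows_pointing_at("shard-00000.parquet",  # placeholder
|                         "myblog.example.org")
| print(len(hits), "links point at my domain")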
| yorwba wrote:
| I think the data set is generally considered to consist of the
| images, not the list of links for downloading the images.
|
| That the data set aggregator doesn't directly host the images
| themselves matters when you want to issue a takedown (targeting
| the original image host might be more effective) but for the
| question "Does that mean a model was trained on my images?"
| it's immaterial.
| bearl wrote:
| Links to PII are by far the worst sort of PII, yes.
|
| "It's not his actual money, it's just his bank account and
| routing number."
| djoldman wrote:
| A more accurate analogy is "it's not his actual money, it's a
| link to a webpage or image that has his bank account and
| routing number."
| bearl wrote:
| My contention is that links to PII are themselves PII.
|
| A name, Jon Smith, is technically PII but not very
| specific. If I have a link to a specific Jon Smith's
| Facebook page or his HN profile, it's even more personally
| identifiable than knowing his name is Jon Smith.
| 1vuio0pswjnm7 wrote:
| archive.is is (a) sometimes blocked, (b) serves CAPTCHAs in some
| instances and (c) includes a tracking pixel
|
| One alternative to archive.is for this website is to disable
| Javascript and CSS
|
| Another alternative is the website's RSS feed
|
| Works anywhere without CSS or Javascript, without CAPTCHAs,
| without tracking pixel
|
| For example:
|
| curl https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/ |
| (echo "<meta charset=utf-8>";grep -E "<pubDate>|<p>|<div") > 1.htm
| firefox ./1.htm
|
| To retrieve only the entry about DataComp CommonPool (note the
| slashes inside the sed addresses must be escaped for the
| command to run):
|
| curl https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/ |
| sed -n '/>1120522<\/post-id>/,/>1120466<\/post-id>/p' |
| (echo "<meta charset=utf-8>";grep -E "<pubDate>|<p>|<div") > 1.htm
| firefox ./1.htm
___________________________________________________________________
(page generated 2025-07-30 23:01 UTC)