[HN Gopher] A major AI training data set contains millions of ex...
       ___________________________________________________________________
        
       A major AI training data set contains millions of examples of
       personal data
        
       Author : pera
       Score  : 89 points
       Date   : 2025-07-30 09:59 UTC (13 hours ago)
        
 (HTM) web link (www.technologyreview.com)
 (TXT) w3m dump (www.technologyreview.com)
        
       | kristianp wrote:
       | https://archive.is/k7DY3
        
       | cheschire wrote:
       | I hope future functionality of haveibeenpwned includes a tool to
       | search LLM models and training data for PII based on the
       | collected and hashed results of this sort of research.
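        | 
        | A minimal sketch of what the lookup side could look like,
        | assuming the research results were published as a list of
        | SHA-256 hashes of lowercased PII strings (the file name and
        | hashing scheme here are hypothetical):
        | 
        |     # hash your own email the way the list was hashed, then
        |     # test membership against the published hash list
        |     printf 'alice@example.com' | tr 'A-Z' 'a-z' | sha256sum \
        |       | cut -d' ' -f1 \
        |       | grep -qFf - commonpool-pii-hashes.txt \
        |       && echo "found in dataset" || echo "not found"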
        
         | croes wrote:
         | Hard to search in the model itself
        
           | cheschire wrote:
           | Yep, that's why at the end of my sentence I referred to the
           | results of research efforts like this that do the hard work
           | of extracting the information in the first place.
        
       | pera wrote:
       | Yesterday I asked if there is any LLM provider that is GDPR
       | compliant: at the moment I believe the answer is no.
       | 
       | https://news.ycombinator.com/item?id=44716006
        
         | tonyhart7 wrote:
          | so your best bet is an open weight LLM then???
          | 
          | but is that a breach of GDPR???
        
           | atoav wrote:
            | Only if it contains personal data you collected without
            | explicit consent ("explicit" here means literally asking:
            | "I want to use this data for that purpose, do you allow
            | this? Y/N").
           | 
           | Also people who have given their consent before need to be
           | able to revoke it at any point.
        
             | tonyhart7 wrote:
              | so the EU basically locked itself out of the AI space????
              | 
              | idk, but how can we do that with GDPR compliance etc???
        
               | croes wrote:
               | So basically EU citizens could sue all AI providers
        
               | galangalalgol wrote:
                | I don't think they have to? The bodies in charge can
                | simply levy the fine, 4% of global turnover I think.
                | The weird case is if all the companies self-hosting
                | open weight models become liable as data controllers
                | and processors. If that were aggressively penalized it
                | could kill AI use within the EU, and depending on the
                | math, such companies might choose to pull operations
                | from the EU instead of giving up the use of AI.
               | 
               | Edit: that last bit is probably catastrophic thinking.
               | Enforcement has always been precisely enough to cause
               | compliance vs withdrawal from the market.
        
               | croes wrote:
                | I don't think killing the use of AI in the EU is the
                | only thing that could happen.
               | 
               | You can't steal something and avoid punishment just
               | because you don't sell in the country where the theft
               | happened.
        
               | galangalalgol wrote:
               | You absolutely can depending on the countries involved. A
               | recent extreme example is hackers in NK stealing
               | cryptocurrency. A more regular one is Chinese
               | manufacturers stealing designs. If the countries where
                | the thieves live and operate won't prosecute them, there
               | is no recourse. The question for multinationals is if
               | continuing to operate in the EU is worth giving up their
               | models, and if the countries they are headquartered in
               | care or can be made to.
        
               | croes wrote:
                | If those countries still want their IP enforced in the
                | EU, I guess they will.
               | 
               | Tit for tat.
               | 
               | NK isn't really a business partner in the world.
        
               | galangalalgol wrote:
               | But China is, and western countries including those in
               | the EU have frequently ignored such things. Looking
               | closer this really only affects diffusion models which
               | are much cheaper to retrain. The exception is integrated
               | models like Gemini and gpt-4v where retraining might
                | reasonably cost more than the fine. Behemoths like
                | Google and OpenAI won't bail over a few hundred million,
                | unless they see it
               | is likely to happen repeatedly, in which case they would
               | likely decouple the image models. But there is nothing to
               | say some text database that is widely used isn't
               | contaminated as well. Maybe only China will produce
               | models in the future. They don't care if you enforce
               | their IP.
               | 
               | Edit: After more reading. Clearview AI did exactly this,
               | they ignored all the EU rulings and the UK refused to
               | enforce them. They were fined tens of millions and paid
                | nothing. Stability is now also a UK company that used PII
               | images for training; it seems quite likely they will try
               | to walk that same path given their financial situation.
                | Meta is facing so many fines and lawsuits that who knows
                | what it will do. Everyone else will call it the cost of
                | business
               | while fighting it every step of the way.
        
               | pera wrote:
               | While my question was in relation to GDPR there are
               | similar laws in the UK (DPA) and in California (CCPA).
               | 
               | Also note that AI is not just generative models, and
               | generative models don't need to be trained with personal
               | data.
        
               | jeroenhd wrote:
               | I'm sure it's possible, but AI companies don't invest
               | much money into complying with the law as it's not
               | profitable.
               | 
               | A normal industry would've figured out how to deal with
               | this problem before going public, but AI people don't
               | seem to be all that interested.
               | 
                | I'm sure they'll all cry foul if one of them gets hit
                | with a fine and an order to figure out how to fix the
                | mess they've created, but this is what you get when
                | you don't teach ethics to computer scientists.
        
             | xxs wrote:
             | > need to be able to revoke it at any point.
             | 
              | They also have to be able to ask how much data (if any)
              | is being used, and how.
        
           | pera wrote:
            | There is currently no effective method for unlearning
            | information - especially not when you don't have access to
            | the original training datasets (as is the case with open
            | weight models), see:
           | 
           |  _Rethinking Machine Unlearning for Large Language Models_
           | 
           | https://arxiv.org/html/2402.08787v6
        
         | thrance wrote:
          | Mistral's products are supposed to be, at least, since the
          | company is based in the EU.
        
           | pera wrote:
           | I am not sure if Mistral is: if you go to their GDPR page
           | (https://help.mistral.ai/en/articles/347639-how-can-i-
           | exercis...) and then to the erasure request section they just
           | link to a "How can I delete my account?" page.
           | 
           | Unfortunately they don't provide information regarding their
           | training sets
           | (https://help.mistral.ai/en/articles/347390-does-mistral-
           | ai-c...) but I think it's safe to assume it includes DataComp
           | CommonPool.
        
       | itsalotoffun wrote:
       | I WISH this mattered. I wish data breaches actually carried
       | consequences. I wish people cared about this. But people don't
       | care. Right up until you're targeted for ID theft, fraud or
       | whatever else. But by then the causality feels so diluted that
       | it's "just one of those things" that happens randomly to good
       | people, and there's "nothing you can do". Horseshit.
        
         | atoav wrote:
          | It doesn't now, but we _could_ collectively decide to
          | introduce consequences severe enough to deter anybody from
          | trying this again.
        
         | jelvibe25 wrote:
         | What's the right consequence in your opinion?
        
           | passwordoops wrote:
            | Criminal liability with a minimum of 2 years served for
            | executives, and fines amounting to 110% of total global
            | revenue for the company that allowed the breach, would see
            | cybersecurity taken a lot more seriously in a hurry.
        
             | lifestyleguru wrote:
             | Would be nice to have executives finally responsible for
             | something.
        
             | bearl wrote:
              | Internet commerce requires databases with PII that will be
             | breached.
             | 
             | Who is to blame for internet commerce?
             | 
             | Our legislators. Maybe specifically we can blame Al Gore,
             | the man who invented the internet. If we had put warning
              | labels on the internet like we did with NWA and 2 Live
              | Crew, Gore's second-best achievement, we wouldn't be a
             | failed democracy right now.
        
           | krageon wrote:
           | A stolen identity destroys the life of the victim, and
           | there's going to be more than one. They (every single
           | involved CEO) should have all of their assets seized, to be
           | put in a fund that is used to provide free legal support to
           | the victims. Then they should go to a low-security prison and
           | have mandatory community service for the rest of their lives.
           | 
           | They probably can't be redeemed and we should recognise that,
           | but that doesn't mean they can't spend the rest of their life
           | being forced to be useful to society in a constructive way.
           | Any sort of future offense (violence, theft, assault,
           | anything really) should mean we give up on them. Then they
           | should be humanely put down.
        
         | rypskar wrote:
          | We should also stop calling it ID theft. The identity is not
          | stolen; the owner still has it. Calling it ID theft shifts
          | the responsibility from the party the fraud is committed
          | against (often banks or other large entities) onto an
          | innocent third party.
        
           | herbturbo wrote:
            | Yes, tricking a bank into thinking you are one of their
           | customers is not the same as assuming someone else's
           | identity.
        
             | messagebus wrote:
             | As always, Mitchell and Webb hit the nail precisely on the
             | head.
             | 
             | https://www.youtube.com/watch?v=CS9ptA3Ya9E
        
           | JohnFen wrote:
            | > Calling it ID theft shifts the responsibility from the
            | party the fraud is committed against (often banks or other
            | large entities)
           | 
           | The victim of ID theft is the person whose ID was stolen. The
           | damage to banks or other large entities pales in comparison
           | to the damage to those people.
        
             | rypskar wrote:
             | I did probably not formulate myself good enough. By calling
             | it ID theft you are blaming the person the ID belongs to
             | and that person have to prove they are innocent. By calling
             | it by the correct words, bank fraud, the bank have to prove
             | that the person the ID belongs to did it. No ID was stolen,
             | it was only used by someone else to commit fraud. The banks
             | don't have enough security to stop it because they have
             | gotten away with calling it ID theft and putting the blame
             | on the person the ID belongs to
        
         | laughingcurve wrote:
         | It's not clear to me how this is a data breach at all. Did the
         | researchers hack into some database and steal information? No?
         | 
          | Because afaik everything they collected was on the public
          | web. So now researchers are being lambasted for having data
          | in their sets that others released.
          | 
          | That said, masking obvious numbers like SSNs is low-hanging
          | fruit. Trying to scrub every piece of public information
          | about a person that can identify them is insane.
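          | 
          | For the SSN case, that masking really is a one-liner; a
          | rough sketch (the caption file name is an assumption, and a
          | real pass would need more care to avoid false positives):
          | 
          |     # redact anything shaped like an SSN (NNN-NN-NNNN)
          |     sed -E 's/\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b/XXX-XX-XXXX/g' \
          |       captions.txt > captions-masked.txt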
        
       | imglorp wrote:
       | Reader mode works on this site.
        
       | satvikpendem wrote:
        | This is all public data. People should not be putting personal
        | data on public image hosts and sites like LinkedIn if they do
        | not want it to be scraped. There is nothing private about the
        | internet and I wish people understood that.
        
         | malfist wrote:
         | What's important is that we blame the victims instead of the
         | corporations that are abusing people's trust. The victims
         | should have known better than to trust corporations
        
           | blitzar wrote:
           | > blame the victims
           | 
            | If you post something publicly you can't complain that it
            | is public.
        
             | lewhoo wrote:
              | But I can complain about what happens to said something.
              | If my blog photo becomes deepfake porn, am I allowed to
              | complain or not? What we have is an entirely novel
              | situation (with AI) worth at least a serious discussion.
        
               | blitzar wrote:
               | > But I can complain about what happens to said something
               | 
               | no.
               | 
               | > but ...
               | 
               | no.
        
               | YetAnotherNick wrote:
                | > If my blog photo becomes deepfake porn
               | 
                | Depends. In most cases, this is forbidden by law
               | and you can claim actual damages.
        
               | kldg wrote:
               | That's helpful if they live in the same country, can
               | figure out who the 4chan poster was, the police are
               | interested (or you want to risk paying a lawyer), you're
               | willing to sink the time pursuing such action (and if
               | criminal, risk adversarial LEO interaction), and are
               | satisfied knowing hundreds of others may be doing the
               | same and won't be deterred. Of course, friends and co-
               | workers are too close to you to post publicly when they
               | generate it. Thankfully, the Taylor Swift laws in the US
               | have stopped generation of nonconsensual imagery and
               | video of its namesake (it hasn't).
               | 
               | Daughter's school posted pictures of her online without
               | an opt-out, but she's also on Facebook from family
               | members and it's just kind of... well beyond the point of
               | trying to suppress. Probably just best to accept people
               | can imagine you naked, at any age, doing any thing.
               | What's your neighbor doing with the images saved from his
               | Ring camera pointed at the sidewalk? :shrug:
        
               | YetAnotherNick wrote:
                | I am not talking about a 4chan poster. I am talking
                | about a company doing it.
        
               | dpoloncsak wrote:
                | FWIW... I really don't think so. If you, say, posted your
               | photo on a bulletin board in your local City Hall, can
               | you prevent it from being defaced? Can you choose who
               | gets to look at it? Maybe they take a picture of it and
               | trace it...do you have any legal ground there? (Genuine
               | Question). And even if so...It's illegal to draw angry
               | eyebrows on every face on a billboard but people still do
               | it...
               | 
               | IMO, it being posted online to a publicly accessible site
               | is the same. Don't post anything you don't want right-
               | click-saved.
        
             | malfist wrote:
             | Sure, and if I put out a local lending library box in my
              | front yard I shouldn't be annoyed by the neighbor that
             | takes every book out of it and throws it in the trash.
             | 
             | Decorum and respect expectations don't disappear the moment
             | it's technically feasible to be an asshole
        
               | YetAnotherNick wrote:
                | That's a bad analogy. Most people, including me, do
                | expect that their "public" data is used for AI
                | training. I mean, based on the ads everyone gets, most
                | people know perfectly well that anything they post
                | online will be used in AI.
        
               | JohnFen wrote:
                | > Most people, including me, do expect that their
                | "public" data is used for AI training.
               | 
               | Based on what ordinary people have been saying, I don't
                | think this is true. Or, _maybe_ it's true now that the
               | cat is out of the bag, but I don't think most people
               | expected this before.
               | 
               | Most tech-oriented people did, of course, but we're a
               | small minority. And even amongst our subculture, a lot of
               | people didn't see this abuse coming. I didn't, or I would
               | have removed all of my websites from the public web years
               | earlier than I did.
        
               | YetAnotherNick wrote:
               | > Most tech-oriented people did
               | 
                | In fact it's the opposite. People who aren't into tech
                | think Instagram is listening to them 24/7 to target
                | their feed and ads. There was even a hoax in my area
                | among elderly groups that WhatsApp was using profile
                | photos for illegal activity, and many people removed
                | their photos at one point.
               | 
               | > I didn't, or I would have removed all of my websites
               | from the public web years earlier than I did.
               | 
                | Your comment is public information. In fact, posting
                | anything on HN is a surefire way to give your content
                | to AI training.
        
               | JohnFen wrote:
                | > People who aren't into tech think Instagram is
                | listening to them 24/7 to target their feed and ads
               | 
               | True, but that's a world different than thinking that
               | your data will be used to train genAI.
               | 
                | > In fact, posting anything on HN is a surefire way to
                | give your content to AI training.
               | 
               | Indeed so, but HN seems to be a bad habit I just can't
               | kick. However, my comments here are the entirety of what
               | I put up on the open web and I intentionally keep them
               | relatively shallow. I no longer do long-form blogging or
               | make any of my code available on the open web.
               | 
               | However, you're right. Leaving HN is something that I
               | need to do.
        
               | bearl wrote:
                | No, the average person has no idea what "AI training"
                | even is. Should the average person have an above
                | average IQ? Yes. Could they? No. Don't be average
                | yourself.
        
               | malfist wrote:
                | Are you trying to argue that 10 years ago, when I
                | uploaded my resume to LinkedIn, I should have known
                | it'd be used for AI training?
               | 
                | Or that the teenager who signed up for Facebook should
                | know that the embarrassing things they're posting are
                | going to train AI and are, as you called it, public?
               | 
                | What about the blog I started 25 years ago and then
                | took down, but which lives on in the GeoCities
                | archive? Was I supposed to know it'd go to an AI
                | overlord corporation when I was in middle school
                | writing about dragon photos I found on Google?
               | 
               | And we're not even getting into data breaches, or
               | something that was uploaded as private and then sold when
               | the corporation changed their privacy policy decades
               | after it was uploaded.
               | 
               | It's not a bad analogy when you don't give all the graces
               | to corporations and none to the exploited.
        
               | victorbjorklund wrote:
                | Seriously, when YOU posted something on the Internet
                | 20 years ago, did you expect it to be used by a
                | corporation to train an AI 20 years later?
        
           | nerdjon wrote:
           | Right, both things can be wrong here.
           | 
           | We need to better educate people on the risks of posting
           | private information online.
           | 
           | But that does not absolve these corporations of criticism of
           | how they are handling data and "protecting" people's privacy.
           | 
           | Especially not when those companies are using dark patterns
           | to convince people to share more and more information with
           | them.
        
           | thinkingtoilet wrote:
           | If this was 2010 I would agree. This is the world we live in.
           | If you post a picture of yourself on a lamp post on a street
           | in a busy city, you can't be surprised if someone takes it.
           | It's the same on the internet and everyone knows it by now.
        
           | squigz wrote:
           | > The victims should have known better than to trust
           | corporations
           | 
           | Literally yes? Is this sarcasm? Are we in 2025 supposed to
           | implicitly trust multi-billion dollar multi-national
           | corporations that have decades' worth of abuses to look back
           | on? As if we couldn't have seen this coming?
           | 
           | It's been part of every social media platform's ToS for many
           | years that they get a license to do whatever they want with
           | what you upload. People have warned others about this for
            | years and nothing happened. Those platforms have already
           | used that data prior to this for image classification,
           | identification and the like. But nothing happened. What's
           | different now?
        
           | keybored wrote:
           | Modern companies: We aim to create or use human-like AI.
           | 
           | Those same modern companies: Look, if our users inadvertently
           | upload sensitive or private information then we can't really
           | help them. The heuristics for detecting those kinds of things
           | are just too difficult to implement.
        
           | Workaccount2 wrote:
           | I have negative sympathy for people who still aren't aware
           | that if they aren't paying for something, they are the
           | something to be sold. This has been the case for almost 30
           | years now with the majority of services on the internet,
           | _including this very website right here_.
        
             | gishglish wrote:
              | Tbh, even if they are paying for it, they're probably
              | still the product. Unless maybe they're an enterprise
              | customer who can afford to pay orders of magnitude more
              | for relative privacy.
        
             | malfist wrote:
             | That explains why ISPs sell DNS lookup history, or your
             | utility company sells your habits. Or your TV tracks your
             | viewership. I've paid for all of those, but somehow, I'm
             | still the product.
        
         | jeroenhd wrote:
         | AI and scraping companies are why we can't have nice things.
         | 
         | Of course privacy law doesn't necessarily agree with the idea
         | that you can just scrape private data, but good luck getting
         | that enforced anywhere.
        
         | pera wrote:
         | > _This is all public data_
         | 
         | It's important to know that generally this distinction is not
         | relevant when it comes to data subject rights like GDPR's right
         | to erasure: If your company is processing any kind of personal
         | data, including publicly available data, it must comply with
         | data protection regulations.
        
           | booder1 wrote:
            | The law has in no way been able to keep up with AI. Just
            | look at copyright. Internet data is public and the
            | government is incapable of changing this.
        
         | Anonbrit wrote:
         | A hidden camera can make your bedroom public. Don't do it if
         | you don't want it to be on pay-per-view?
        
           | satvikpendem wrote:
            | That is indeed what Justin.tv did, to much success. But
            | that was because Justin had consented to it, just as
            | anyone who posts something online consents to it being
            | seen by anyone.
        
           | dpoloncsak wrote:
           | Does this analogy really apply? Maybe I'm misunderstanding,
           | but it seems like all of this data was publicly available
           | already, and scraped from the web.
           | 
            | In that case, it's not a 'hidden camera'... users uploaded this
           | data and made it public, right? I'm sure some were due to
           | misconfiguration or whatever (like we see with Tea), but it
           | seems like most of this was uploaded by the user to the clear
            | web. I'm all for "Don't blame the victims", but if you upload
           | your CC to Imgur I think you deserve to have to get a new
           | card.
           | 
           | Per the article "CommonPool ... draws on the same data
           | source: web scraping done by the nonprofit Common Crawl
           | between 2014 and 2022."
        
           | dlivingston wrote:
           | Your analogy doesn't hold. A 'hidden camera' would be either
           | malware that does data exfiltration, or the company
           | selling/training on your data outside of the bounds of its
           | terms of service.
           | 
           | A more apt analogy would be someone recording you in public,
           | or an outside camera pointed at your wide-open bedroom
           | window.
        
       | djoldman wrote:
       | Just to be clear, as with LAION, the data set doesn't _contain_
       | personal data.
       | 
       | It contains _links_ to personal data.
       | 
          | The title is like saying that sending a magnet link to a
          | copyrighted torrent is distributing copyrighted material.
       | Folks can argue if that's true but the discussion should at least
       | be transparent.
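        | 
        | To make the distinction concrete: a row in a LAION-style index
        | is just an image URL plus a caption, and whoever trains on it
        | downloads the images themselves. A rough sketch of that
        | client-side fetch (the index file name and tab-separated
        | layout are assumptions):
        | 
        |     # column 1 holds the image URL; fetch each linked image
        |     cut -f1 commonpool-index.tsv | while read -r url; do
        |       curl -sSLO "$url"
        |     done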
        
         | yorwba wrote:
         | I think the data set is generally considered to consist of the
         | images, not the list of links for downloading the images.
         | 
         | That the data set aggregator doesn't directly host the images
         | themselves matters when you want to issue a takedown (targeting
         | the original image host might be more effective) but for the
         | question "Does that mean a model was trained on my images?"
         | it's immaterial.
        
         | bearl wrote:
          | Links to PII are by far the worst sort of PII, yes.
         | 
         | "It's not his actual money, it's just his bank account and
         | routing number."
        
           | djoldman wrote:
           | A more accurate analogy is "it's not his actual money, it's a
           | link to a webpage or image that has his bank account and
           | routing number."
        
             | bearl wrote:
              | My contention is that links to PII are themselves PII.
             | 
             | A name, Jon Smith, is technically PII but not very
             | specific. If I have a link to a specific Jon Smith's
              | Facebook page or his HN profile, it's even more personally
             | identifiable than knowing his name is Jon Smith.
        
       | 1vuio0pswjnm7 wrote:
       | archive.is is (a) sometimes blocked, (b) serves CAPTCHAs in some
       | instances and (c) includes a tracking pixel
       | 
        | One alternative to archive.is for this website is to disable
        | JavaScript and CSS
        | 
        | Another alternative is the website's RSS feed, which works
        | anywhere without CSS or JavaScript, without CAPTCHAs and
        | without a tracking pixel
       | 
        | For example,
        | 
        |     url="https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/"
        |     curl "$url" \
        |       | (echo "<meta charset=utf-8>"; grep -E "<pubDate>|<p>|<div") > 1.htm
        |     firefox ./1.htm
        | 
        | To retrieve only the entry about DataComp CommonPool (note
        | that the slashes in the </post-id> tags must be escaped inside
        | the sed patterns):
        | 
        |     url="https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/"
        |     curl "$url" \
        |       | sed -n '/./{/>1120522<\/post-id>/,/>1120466<\/post-id>/p;}' \
        |       | (echo "<meta charset=utf-8>"; grep -E "<pubDate>|<p>|<div") > 1.htm
        |     firefox ./1.htm
        
       ___________________________________________________________________
       (page generated 2025-07-30 23:01 UTC)