[HN Gopher] The data that powers AI is disappearing fast
       ___________________________________________________________________
        
       The data that powers AI is disappearing fast
        
       Author : sgammon
       Score  : 48 points
       Date   : 2024-07-21 19:05 UTC (4 hours ago)
        
 (HTM) web link (www.nytimes.com)
 (TXT) w3m dump (www.nytimes.com)
        
       | blackeyeblitzar wrote:
       | We should make the data on large platforms like YouTube and
       | social media in general accessible to all companies for AI use
       | (with the actual creator's positive consent).
        
         | jsheard wrote:
         | How many creators do you think would actually go out of their
         | way to consent to their work being used for training? _Maybe_
         | if they get paid, but otherwise forget about it.
        
           | janice1999 wrote:
           | There's actually an interesting perverse incentive here. A
           | group of politically motivated actors (perhaps funded by a
           | government or private interests) could create content with
           | the sole purpose of opting in to all the AI data harvesting
           | feeds. That way they could bias the output to their cause
           | (assuming the AI company doing the training doesn't take active
           | measures to counter it).
        
             | jsheard wrote:
             | I would assume that's happening regardless, platforms like
             | Reddit are already selling their data firehoses to AI
             | companies, so anyone who manages to slip propaganda bots
             | under Reddit's radar will end up having their talking points
             | fed into future AI training sets.
        
           | im3w1l wrote:
           | Payment is out of the question - too much of a hassle. What
           | is likely is websites either requiring you to allow AI training
           | in exchange for hosting your content, or giving some minor
           | perk/incentive for it.
           | 
           | Consider this comment itself. How much do you think it is
           | worth to an AI company? Maybe 0.00001 dollars? How would you
           | handle the logistics of that little money?
        
           | oldkinglog wrote:
           | This comment neatly explains why the LLM bubble will burst as
           | soon as prosecutors remember that the DMCA doesn't have a
           | carve-out for AI.
        
         | BadBadJellyBean wrote:
         | I think it is all about consent. And I don't even think that
         | people would be so upset if the whole AI training wasn't about
         | profit. But the way it is, companies are training their models
         | on other people's work and trying to make money with the models.
        
         | add-sub-mul-div wrote:
         | Imagine if we'd been appropriately skeptical of the way social
         | media might worsen society in exchange for some short term fun
         | and convenience, rather than blindly conflate invention and
         | novelty with progress. We can forgive ourselves for that
         | naivete, but having seen that, what excuse do we have now?
        
       | GaggiX wrote:
       | >Those restrictions are set up through the Robots Exclusion
       | Protocol
       | 
       | Well so it's not really disappearing at all.
        
       | buildbot wrote:
       | Well, unless you exclude common crawl and block all robots...it's
       | still going to end up in a dataset someday. Or deleted and gone
       | forever!
        
       | janice1999 wrote:
       | > We're seeing a rapid decline in consent to use data across the
       | web
       | 
       | That is such a weird and misleading way to put it. There was no
       | consent in the first place. Take YouTube for example. Google did
       | not consent to the videos it hosts being used by OpenAI. The
       | uploader certainly did not ever consent to their face, voice and
       | content being used to train models either.
        
         | advael wrote:
         | I think this is an important thing to point out. This is the
         | first really compelling test of the legal theory that the
         | overbroad content-ownership boilerplate licenses big user-
         | generated-content hosts use could actually bind people to allow
         | arbitrary use of their content, likeness, etc. Before, the uses
         | of it were actually quite egregious but kind of abstract:
         | things like "build a profile on you to advertise products" have
         | incredibly deep and creepy implications, but for whatever
         | reason they're not as unifying an objection as this.
        
           | squigz wrote:
           | I doubt that this theory has not been put to the test,
           | repeatedly.
        
             | advael wrote:
             | Guess we just take legal matters on faith then? My lawyer's
             | gonna be devastated
        
         | add-sub-mul-div wrote:
         | Right, I think a better way to put it is that there was
         | complacency and neutrality (inattention) on the topic before.
         | But now that people see that their work is going to be used to
         | flood the culture with slop, the new default stance is to deny
         | consent.
        
         | linkjuice4all wrote:
         | I agree that these content creators were not contacted
         | individually and asked for specific consent to have their
         | materials ingested for training purposes.
         | 
         | That being said - the general crux of the third-party doctrine
         | (at least in the United States) is that information told to
         | another party has no expectation of privacy. That seems to
         | apply to a vast amount of 'user generated content' on the
         | internet. Someone decided to utilize someone else's megaphone
         | to post into the commons and now expects to retain some
         | ownership. It's hard to have it both ways "just because someone
         | else noticed."
         | 
         | Unfortunately this also extends to general artistic styles. If
         | you, as an artist, continue to create in a consistent style
         | and someone else notices, you don't have a lot of recourse when
         | some entity is able to recreate a similar style.
         | 
         | On a technical side this seems similar to your server
         | responding to any incoming request with a 200 status and some
         | content (hiQ v. LinkedIn). In this case a user-agent asked (was
         | it a person? scraper? AI training process?) and you gave it to
         | "them".
         | 
         | I guess we end up in the position where if you don't want to
         | train AI then don't post your content publicly.
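         | 
         | As a hedged illustration of the server-side point above, here is
         | a minimal Python sketch of a WSGI app that answers 403 instead of
         | 200 when the self-declared User-Agent contains a listed token.
         | The token list, the response bodies, and the helper status_for
         | are assumptions made for illustration. A client can send any
         | User-Agent it likes, so this is a courtesy filter, not
         | enforcement.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical list of crawler tokens to refuse (lowercase substrings).
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot")


def app(environ, start_response):
    """Return 403 to listed crawlers and 200 to everyone else."""
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    if any(token in agent for token in BLOCKED_AGENT_TOKENS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Automated collection is not permitted.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, reader.\n"]


def status_for(user_agent):
    """Call the app once with a fake request and return the status line."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []
    app(environ, lambda status, headers: seen.append(status))
    return seen[0]
```

         | Calling status_for("GPTBot/1.0") exercises the refusal branch,
         | while an ordinary browser string falls through to the 200 path.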
        
         | buildbot wrote:
         | You did when you uploaded it and consented to the ToS.
         | 
         | Google was training on all of YouTube back in 2012 -
         | https://slate.com/technology/2012/06/google-computers-learn-...
        
           | janice1999 wrote:
           | The article is about third parties, e.g. OpenAI scraping
           | YouTube and Reddit.
        
             | buildbot wrote:
             | Responding to this GP point: > The uploader certainly did
             | not ever consent to their face, voice and content being
             | used to train models either.
             | 
             | I'm also not saying I _support_ ToS
        
               | godelski wrote:
               | I'd argue that there was no consent here in the first
               | place. I want to remind everyone, there are plenty of
               | times and reasons a contract can be invalidated.
               | 
               | Consent requires the person knowing what they are
               | consenting to. How many people do you think know that
               | posting your face on Facebook consents to using your face
               | to generate photos? Your voice to create voice
               | generators?
               | 
               | What about when someone posts a group photo or video and
               | despite you not having an account that the same rules
               | apply? What about all the photos created by others and
               | uploaded by those with no ownership or rights to the
               | photos in the first place? They can't legally give away
               | the rights.
               | 
               | Then we also have to consider that the environment has
               | changed and the terms changed under people. You may have
               | signed up for Facebook and knew they were going to use
               | your data to sell ads and do some analysis. But you
               | didn't know AI was coming (let's be real, no one knew
               | even at the time of AlexNet. It didn't become clear in
               | tech groups until maybe 2016/2017). Sure, the agreements
               | might "be the same" but people always contextualize
               | things. In 2012 you might not care that Facebook "owns"
               | your voice because you know they have to host the file,
               | and contextually it is "just legalese". But in 2024,
               | now the context is that this trains a model that can
               | replicate your voice and this means something VERY
               | different.
               | 
               | I think there's this common attitude of "Oh, well you
               | agreed to the terms" as if this makes the terms fair,
               | reasonable, or okay in the first place. It doesn't
               | account for the reality of the situation. It just further
               | legitimizes this world where a person is legally culpable
               | for not having domain expertise. News flash, we live in a
               | specialized world and you can't be a domain expert in
               | everything. It doesn't account for the fact that there is
               | often no alternative and there is serious coercion at
               | this point. Unfortunately, not all the choices in your
               | life are up to you and many are social (and many of those
               | are idiotic and can force you into decisions nearly no
               | one in the group wants[0]) and not all the choices you
               | make ultimately come down to you. If it was up to me
               | everyone would be using Signal to communicate with me and
               | no one would have bought Apple products when they made
               | their devices unupgradable and charged more for
               | _increasing_ your disk space than it would cost to buy
               | 1.5 drives in the first place. But also, thank god that
               | everything isn't up to me because I'm also a fucking
               | idiot.
               | 
               | No, I don't think there was consent this whole time, and
               | as time has progressed the amount and capacity you have to
               | consent has decreased. And dismissing it (the fact
               | that consent is not binary) just makes the problem
               | exponentially worse. Not to mention all the dark patterns
               | which I can do a whole other rant about.
               | 
               | [0] Prime example: the American presidential race. The
               | vast majority of people would rather have neither Trump
               | nor Biden running. The vast majority of people do not
               | want a gerontocracy but will vote for one of the two
               | candidates because what other choice do they have? People
               | __could__ agree to vote for a third party, but doing
               | so gives such a strategic disadvantage that people are
               | rational in attempting to choose from the options
               | provided to them (see primaries).
        
           | mitthrowaway2 wrote:
           | There's consent, and then there's _informed_ consent. It's
           | not really possible for people to have given _informed_
           | consent to AI model training on their data in an era when AI
           | had much more limited capabilities relative to what it has
           | now. It's one thing to give consent for gmail to train on
           | your data to develop better spam filters or profile your
           | interests; it's another to give consent for Google to
           | convincingly impersonate your writing style, your facial
           | tics, and your speaking voice. I think few people on Earth in
           | 2012 really understood that their data could ever be used in
           | such a way, and so their consent was not informed.
        
             | squigz wrote:
             | > I think few people on Earth in 2012 really understood
             | that their data could ever be used in such a way, and so
             | their consent was not informed.
             | 
             | I think we've been able to see this coming for many, many
             | years.
        
               | godelski wrote:
               | Post hoc ergo propter hoc
               | 
               | You may think this now, but I'm willing to bet a lot of
               | money that you didn't believe this in 2012 and even 2015.
               | Be careful to not rewrite history. It's an easy thing to
               | do.
        
             | godelski wrote:
             | > I think few people on Earth in 2012 really understood
             | that their data could ever be used in such a way, and so
             | their consent was not informed.
             | 
             | Arguably this is just not consent, plain and simple. There
             | could be no reasonable expectation for a person to believe
             | they are agreeing to this. Whereas uninformed consent, I'd
             | argue, is more like how people were agreeing to giving away
             | their data but do not actually understand how the data is
             | taken and used (let's be real, this includes even a lot, if
             | not most, tech people).
        
         | ADeerAppeared wrote:
         | I disagree, consent is important here, and the "move fast and
         | break things" mentality is what's causing problems.
         | 
         | #1 It's not all that certain AI training is fair use. There
         | could be significant damages if it is found to not be fair use.
         | 
         | #2 "We're going to steal your shit even if you won't give us
         | consent to do that" is a bad idea. We're already seeing the
         | practical results: _People stop/reduce posting their works to
         | the clearnet_. What's your scraper bot going to do? Create an
         | account on every 'private' forum and immediately get sued? Pay
         | a subscription fee to every single Patreon page?
         | 
         | And the big one, #3: It destroys public support for your
         | technology, which is essential if you want to survive the
         | oncoming government regulation.
         | 
         | Look at the public response to various tech companies stopping
         | their AI rollout in Europe over EU regulations. Varying from
         | "Yes, this is exactly what we asked for" to "Good fucking
         | riddance".
         | 
         | And it will get only worse yet. Europe's privacy authorities
         | are already issuing quiet statements that the scraping of
         | social media posts, such as is done for AI data collection, is
         | not legal under the GDPR. (The law's pretty clear on this, it
         | doesn't matter that it was "posted publicly", you're not
         | allowed to use personal data like that.) There are already
         | whispers of going after the LLMs themselves, as they contain
         | and continue to process personal data as well.
         | 
         | AI _needs_ public support, and the lack of consent is slowly
         | bleeding it dry.
        
           | CharlieDigital wrote:
           | > What's your scraper bot going to do? Create an account on
           | every 'private' forum and immediately get sued?
           | 
           | More likely, that the owners of any repository of note will
           | simply broker that data directly (a la Reddit).
        
             | ADeerAppeared wrote:
             | Yes, though this scenario isn't all that interesting to
             | discuss:
             | 
             | * In the case of big IP holders (e.g. media companies, news
             | organizations) this is just obtaining consent. The only fun
             | quirk is that OpenAI's purchasing of this drastically
             | weakens the "AI training is fair use" claim by proving the
             | existence of a market.
             | 
             | * In the case of platforms like Reddit, it just kicks the
             | problem one layer down. The platform does obtain "consent"
             | through its ToS. (Beware that this consent is legally
             | weak, and won't protect your ass from anything outside
             | copyright) But users will still see it as "stealing" and
             | may flee the platform.
             | 
             | There's still a notable shift away from the clearnet.
        
           | godelski wrote:
           | > the "move fast and break things" mentality is what's
           | causing problems.
           | 
           | I want to add nuance. I don't think it is the "move fast and
           | break things" mentality that creates all the shit, but that
           | there is no "time to clean up, everybody do your share"
           | mentality to complement it. Doing things often creates a
           | mess, and doing hard things often creates a bigger mess. You
           | can't make a fancy meal without dirtying a bunch of dishes.
           | But are we seriously not hiring "dishwashers"? Creating a
           | mess is unavoidable, and certainly it shouldn't be too large
           | of a mess, but the dishes are piling up and we can't hide it
           | anymore. It isn't a linear problem because the mess compounds
           | and the mess itself generates more mess. We won't refactor.
           | We won't rewrite. So we just have patchwork on top of
           | patchwork. That's enshittification.
           | 
           | We also have a status quo that we sprint into a sprint and
           | try to move as fast as we can but only measure how fast we're
           | going by looking at how far we moved in a quarter. There's no
           | long term measurement because "that's too hard." But this is
           | like trying to circumnavigate the world and choosing to walk.
           | You'll make progress every day and more importantly,
           | __measurable__ (and easily measurable) progress. But you
           | could spend 11 months building a fucking Cessna and still beat
           | the person that could walk on water. You need to move fast
           | (like the Cessna), but to move fast requires also slowing
           | down. Who is willing to slow down?
           | 
           | I think most people don't have a problem with people using
           | public data to do research or similar activities. People
           | wouldn't be up in arms if OpenAI scraped all of YouTube,
           | trained on it, proved internally that they could do cool
           | stuff with it, AND THEN either started to generate their own
           | data for training or started to purchase data. Even though
           | this would still be costly to YouTube and be a weird ethical
           | ground (like getting a "free trial" (or theft) of the data
           | and pay only if it works).
           | 
           | As someone with anxiety, I can assure you, it is not a good
           | idea to constantly be rushing around chasing everything that
           | needs to be solved. You just make more messes because you
           | sloppily "fix" the issues, trying to move onto the next. The
           | trick is to fight your own mind, slow down, triage, and solve
           | anything that isn't a literal fucking fire with calm and
           | care, no matter how much your own mind wants to convince you
           | it is an emergency. But when everything is an emergency,
           | nothing is. And that's the problem. We created an economy
           | based on a business strategy that is functionally equivalent
           | to an untreated and severe anxiety disorder.
        
       | rectang wrote:
       | The NYT had another article a couple of days ago about Getty
       | leveraging the images it owns to go into AI.
       | 
       | https://www.nytimes.com/2024/07/19/technology/generative-ai-...
        
         | buildbot wrote:
         | Getty often makes false claims and generally is a super shitty
         | organization, for example (there are many):
         | https://www.latimes.com/business/hiltzik/la-fi-hiltzik-getty...
        
           | rectang wrote:
           | Getty will have to indemnify users against infringement
           | claims in the event that the images it generates turn out to
           | be based on unlicensed materials and judged as derivative by
           | a court. So will all other AI companies eventually, it's just
           | that Getty has _some_ content while other companies rely much
           | more on unlicensed content.
        
             | buildbot wrote:
             | To me, what Getty is doing is actual theft. How do you
             | "accidentally" issue copyright claims using images you
             | don't own? How did Getty start claiming this image as their
             | own?
             | 
             | In contrast, when training an AI model you download a piece of
             | info once and then don't resell that (exact) piece of info.
             | Nobody is claiming they own the image when they don't.
        
       | thatxliner wrote:
       | I would be fine if you use my data to train your AI models if you
       | let me use your models for free. If you can't do that, you can't
       | have my data.
        
       | thatxliner wrote:
       | I would be fine if you used my data to train your AI models as
       | long as I'm able to use your models for free in return. If not,
       | you can't have my data.
        
         | wnc3141 wrote:
         | I'm not even sure of that. The ability to summarize others is
         | hardly a consolation prize for devaluing my work almost
         | completely.
        
         | rectang wrote:
         | I would not be fine with that, and I hope that you don't
         | believe that it should be imposed on me.
        
       | jsheard wrote:
       | > Those restrictions are set up through the Robots Exclusion
       | Protocol
       | 
       | If anyone would like to join in, there's an actively maintained
       | robots.txt here:
       | 
       | https://github.com/ai-robots-txt/ai.robots.txt
       | 
       | Yes, I know this isn't legally binding and scrapers can ignore it
       | if they want to.
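       | 
       | For anyone who wants to check what such a file actually tells a
       | well-behaved crawler, here is a minimal sketch using Python's
       | standard urllib.robotparser. The rules text, the example URL,
       | and the helper name allowed are assumptions for illustration
       | and are not taken from the linked repository. It only reports
       | what a compliant crawler would conclude and does nothing
       | about crawlers that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: refuse two crawler agents, allow the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, allowed(ROBOTS_TXT, agent, "https://example.com/post/1"))
```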
        
       | thriftwy wrote:
       | I look forward to AI trained entirely on Wikipedia and classical
       | literature with no Twitter and no contemporary art in sight. It
       | would be sublime. Let's face it, the creators of the 21st century
       | way overestimate the importance of their stuff. It's mostly
       | deleterious to the culture.
        
         | to11mtm wrote:
         | > trained entirely on Wikipedia
         | 
         | does the CC used by wikipedia allow training?
         | 
         | > and no contemporary art in sight
         | 
         | will note per above, they may be able to scrape other CC if
         | they -can- scrape wikipedia.
         | 
         | > It's mostly deleterious to the culture.
         | 
         | Yep and based on how I already sometimes catch people treating
         | LLM hallucinations as fact, it's likely gonna get worse anyway.
        
       | elAhmo wrote:
       | NYT assumes that LLMs = AI, which is far from the truth.
       | 
       | This is just a recent hype which relies on getting insane amounts
       | of data to train, but we had and will have AI models that do not
       | rely on training using data without consent.
        
         | kredd wrote:
         | Technicalities make things harder to understand for general
         | masses. "The Data That Powers LLMs" means little to nothing to
         | an average user. My 70 year old parents use ChatGPT, and they
         | have no idea what an LLM is, but call it AI.
        
         | wkat4242 wrote:
         | It's not just LLMs but all generative models that rely on
         | extreme amounts of training data. Text but also images, video,
         | speech, music.
         | 
         | And it makes sense that LLMs were the ones to trigger the hype.
         | AI in general has been making steady progress but most use
         | cases are really hard to explain to a layman. Give them a chatbot and
         | they naturally understand what's happening.
         | 
         | The only thing I'm seeing is that people are using them wrong.
         | They're using them like all-seeing oracles, which is
         | exacerbated by the confidence with which LLMs provide their
         | answers, and the innate human idea that "it's a computer so it
         | must be right".
         | 
         | But knowledge isn't really where LLMs shine, at least not
         | without search engine integrations. It's rather generation,
         | summarisation, translation (with context sensitivity),
         | rewriting styles etc.
        
       | jll29 wrote:
       | "disappearing" = people becoming aware that their data has value,
       | and setting their robots.txt permissions accordingly?
        
       | wkat4242 wrote:
       | In the near future I see less real, quality human data that will
       | go behind paywalls, but also much more AI generated data feeding
       | into the next generation of AI. Because more and more people are
       | using it to publish stuff online. Which then gets scooped up by
       | AI training crawlers. And if it made sense to train an LLM on its
       | own output, it would be done already :)
       | 
       | I doubt they can continue the progress at the same speed they
       | have been at so far. Because the game is set to become more
       | difficult.
        
       | to11mtm wrote:
       | It honestly makes me hesitant to publish my special personal code
       | on gists or private repos vs at least 3-2 backup the really good
       | stuff...
       | 
       | TBH I would not feel any sadness if LLM models plateaued, greatly
       | decelerated development, or even -regressed- as they start to
       | ingest all the garbage already being spewed by LLMs.
        
       | hedora wrote:
       | I think the worst possible outcome is a licensing regime that
       | means that Disney or Paramount or Elsevier or whoever all get to
       | have a monopoly on training large models within their niche. My
       | guess is that any successful calls for regulation will have this
       | outcome, which means that individuals won't be able to legally
       | use AI-based tools except when creating works for hire, etc.
       | 
       | Currently, I think most of the training use cases can be covered
       | by the existing "you can't copyright a fact" carve out in the
       | law. That's probably better for society and creators than my
       | licensing regime scenario.
       | 
       | Anyway, I'm rooting for "no regulation" for now. The whole
       | industry is still being screwed over by market distortions
       | created by the DMCA, and this could easily be 10x worse.
        
         | DevX101 wrote:
         | It's already happening. Mega content aggregators will either
         | build their own LLMs, or license for $$$$. Twitter's doing both
         | and reddit did the latter.
        
         | janoc wrote:
         | It is good that you are rooting for the poor industry because
         | they are "being screwed".
         | 
         | Sad that you didn't consider the content creators, people whose
         | faces, voices, writings, art, personal information, etc. are
         | being used without consent and without any compensation as the
         | ones "being screwed" here.
        
         | oldkinglog wrote:
         | > I think the worst possible outcome is a licensing regime that
         | means that Disney or Paramount or Elsevier or whoever all get
         | to have a monopoly on training large models within their niche.
         | 
         | Why is this the worst possible outcome? Companies using AI
         | would be training with properties they own or have licensed
         | appropriately, rather than the existing scheme of ignoring
         | copyright law to extract $$$ from the creative works of
         | ordinary people.
        
       | doctorpangloss wrote:
       | "Tell HN: Stop Reading The New York Times"
        
       | greatpostman wrote:
       | AI negativity clickbait
        
       | neonate wrote:
       | https://archive.ph/EEQQB
        
       ___________________________________________________________________
       (page generated 2024-07-21 23:17 UTC)