[HN Gopher] The data that powers AI is disappearing fast
___________________________________________________________________
The data that powers AI is disappearing fast
Author : sgammon
Score : 48 points
Date : 2024-07-21 19:05 UTC (4 hours ago)
(HTM) web link (www.nytimes.com)
(TXT) w3m dump (www.nytimes.com)
| blackeyeblitzar wrote:
| We should make the data on large platforms like YouTube and
| social media in general accessible to all companies for AI use
| (with the actual creator's positive consent).
| jsheard wrote:
| How many creators do you think would actually go out of their
| way to consent to their work being used for training? _Maybe_
| if they get paid, but otherwise forget about it.
| janice1999 wrote:
| There's actually an interesting perverse incentive here. A
| group of politically motivated actors (perhaps funded by a
| government or private interests) could create content with
| the sole purpose of opting in to all the AI data harvesting
| feeds. That way they could bias the output to their cause
| (assuming the AI company doing the training does not take active
| measures to counter it).
| jsheard wrote:
| I would assume that's happening regardless, platforms like
| Reddit are already selling their data firehoses to AI
| companies, so anyone who manages to slip propaganda bots
| under Reddit's radar will end up having their talking points
| fed into future AI training sets.
| im3w1l wrote:
| Payment is out of the question - too much of a hassle. What
| is likely is websites either requiring you to allow AI training
| in exchange for hosting your content, or giving some minor
| perk/incentive for it.
|
| Consider this comment itself. How much do you think it is
| worth to an AI company? Maybe 0.00001 dollars? How would you
| handle the logistics of moving that little money?
| oldkinglog wrote:
| This comment neatly explains why the LLM bubble will burst as
| soon as prosecutors remember that the DMCA doesn't have a
| carve-out for AI.
| BadBadJellyBean wrote:
| I think it is all about consent. And I don't even think that
| people would be so upset if the whole AI training wasn't about
| profit. But the way it is, companies are training their models
| on other people's work and trying to make money with the models.
| add-sub-mul-div wrote:
| Imagine if we'd been appropriately skeptical of the way social
| media might worsen society in exchange for some short term fun
| and convenience, rather than blindly conflate invention and
| novelty with progress. We can forgive ourselves for that
| naivete, but having seen that, what excuse do we have now?
| GaggiX wrote:
| >Those restrictions are set up through the Robots Exclusion
| Protocol
|
| Well so it's not really disappearing at all.
| buildbot wrote:
| Well, unless you exclude Common Crawl and block all robots...it's
| still going to end up in a dataset someday. Or deleted and gone
| forever!
| janice1999 wrote:
| > We're seeing a rapid decline in consent to use data across the
| web
|
| That is such a weird and misleading way to put it. There was no
| consent in the first place. Take YouTube for example. Google did
| not consent to the videos it hosts being used by OpenAI. The
| uploader certainly did not ever consent to their face, voice and
| content being used to train models either.
| advael wrote:
| I think this is an important thing to point out. This is the
| first really compelling test of the legal theory that the
| overbroad content-ownership boilerplate licenses big user-
| generated-content hosts use could actually bind people to allow
| arbitrary use of their content, likeness, etc. Before, the uses
| of it were actually quite egregious but kind of abstract:
| things like "build a profile on you to advertise products" have
| incredibly deep and creepy implications, but they're for
| whatever reason not as unifying an objection as this.
| squigz wrote:
| I doubt that this theory has not been put to the test,
| repeatedly.
| advael wrote:
| Guess we just take legal matters on faith then? My lawyer's
| gonna be devastated
| add-sub-mul-div wrote:
| Right, I think a better way to put it is that there was
| complacency and neutrality (inattention) on the topic before.
| But now that people see that their work is going to be used to
| flood the culture with slop, the new default stance is to deny
| consent.
| linkjuice4all wrote:
| I agree that these content creators were not contacted
| individually and asked for specific consent to have their
| materials ingested for training purposes.
|
| That being said - the general crux of the third-party doctrine
| (at least in the United States) is that information told to
| another party has no expectation of privacy. That seems to
| apply to a vast amount of 'user generated content' on the
| internet. Someone decided to utilize someone else's megaphone
| to post into the commons and now expects to retain some
| ownership. It's hard to have it both ways "just because someone
| else noticed."
|
| Unfortunately this also extends to general artistic styles. If
| you, as an artist, continue to create in a consistent style
| and someone else notices, you don't have a lot of recourse when
| some entity is able to recreate a similar style.
|
| On a technical side this seems similar to your server
| responding to any incoming request with a 200 status and some
| content (LinkedIn v HiQ). In this case a user-agent asked (was
| it a person? scraper? AI training process?) and you gave it to
| "them".
|
| I guess we end up in the position where if you don't want to
| train AI then don't post your content publicly.
| buildbot wrote:
| You did when you uploaded it and consented to the ToS.
|
| Google was training on all of YouTube back in 2012 -
| https://slate.com/technology/2012/06/google-computers-learn-...
| janice1999 wrote:
| The article is about third parties, e.g. OpenAI scraping
| YouTube and Reddit.
| buildbot wrote:
| Responding to this GP point: > The uploader certainly did
| not ever consent to their face, voice and content being
| used to train models either.
|
| I'm also not saying I _support_ ToS
| godelski wrote:
| I'd argue that there was no consent here in the first
| place. I want to remind everyone, there are plenty of
| times and reasons a contract can be invalidated.
|
| Consent requires the person knowing what they are
| consenting to. How many people do you think know that
| posting your face on Facebook consents to using your face
| to generate photos? Your voice to create voice
| generators?
|
| What about when someone posts a group photo or video and,
| despite you not having an account, the same rules
| apply? What about all the photos created by others and
| uploaded by those with no ownership or rights to the
| photos in the first place? They can't legally give away
| the rights.
|
| Then we also have to consider that the environment has
| changed and the terms changed under people. You may have
| signed up for Facebook and knew they were going to use
| your data to sell ads and do some analysis. But you
| didn't know AI was coming (let's be real, no one knew
| even at the time of AlexNet. It didn't become clear in
| tech groups until maybe 2016/2017). Sure, the agreements
| might "be the same" but people always contextualize
| things. In 2012 you might not care that Facebook "owns"
| your voice because you know they have to host the file,
| and contextually it is "just legalese". But in 2024,
| now the context is that this trains a model that can
| replicate your voice and this means something VERY
| different.
|
| I think there's this common attitude of "Oh, well you
| agreed to the terms" as if this makes the terms fair,
| reasonable, or okay in the first place. It doesn't
| account for the reality of the situation. It just further
| legitimizes this world where a person is legally culpable
| for not having domain expertise. News flash, we live in a
| specialized world and you can't be a domain expert in
| everything. It doesn't account for the fact that there is
| often no alternative and there is serious coercion at
| this point. Unfortunately, not all the choices in your
| life are up to you and many are social (and many of those
| are idiotic and can force you into decisions nearly no
| one in the group wants[0]) and not all the choices you
| make ultimately come down to you. If it was up to me
| everyone would be using Signal to communicate with me and
| no one would have bought Apple products when they made
| their devices unupgradable and charged more for
| _increasing_ your disk space than it would cost to buy
| 1.5 drives in the first place. But also, thank god that
| everything isn't up to me because I'm also a fucking
| idiot.
|
| No, I don't think there was consent this whole time and
| as time progressed the amount and capacity you have had to
| consent has decreased. And dismissing it (the fact
| that consent is not binary) just makes the problem
| exponentially worse. Not to mention all the dark patterns
| which I can do a whole other rant about.
|
| [0] Prime example: the American presidential race. The
| vast majority of people would rather have neither Trump
| nor Biden running. The vast majority of people do not
| want a gerontocracy but will vote for one of the two
| candidates because what other choice do they have? People
| __could__ agree to vote for a third party, but doing
| so gives such a strategic disadvantage that people are
| rational in attempting to choose from the options
| provided to them (see primaries).
| mitthrowaway2 wrote:
| There's consent, and then there's _informed_ consent. It's
| not really possible for people to have given _informed_
| consent to AI model training on their data in an era when AI
| had much more limited capabilities relative to what it has
| now. It's one thing to give consent for gmail to train on
| your data to develop better spam filters or profile your
| interests; it's another to give consent for Google to
| convincingly impersonate your writing style, your facial
| tics, and your speaking voice. I think few people on Earth in
| 2012 really understood that their data could ever be used in
| such a way, and so their consent was not informed.
| squigz wrote:
| > I think few people on Earth in 2012 really understood
| that their data could ever be used in such a way, and so
| their consent was not informed.
|
| I think we've been able to see this coming for many, many
| years.
| godelski wrote:
| Post hoc ergo propter hoc
|
| You may think this now, but I'm willing to bet a lot of
| money that you didn't believe this in 2012 and even 2015.
| Be careful to not rewrite history. It's an easy thing to
| do.
| godelski wrote:
| > I think few people on Earth in 2012 really understood
| that their data could ever be used in such a way, and so
| their consent was not informed.
|
| Arguably this is just not consent, plain and simple. There
| could be no reasonable expectation for a person to believe
| they are agreeing to this. Whereas uninformed consent, I'd
| argue, is more like how people were agreeing to giving away
| their data but do not actually understand how the data is
| taken and used (let's be real, this includes even a lot, if
| not most, tech people).
| ADeerAppeared wrote:
| I disagree, consent is important here, and the "move fast and
| break things" mentality is what's causing problems.
|
| #1 It's not all that certain AI training is fair use. There
| could be significant damages if it is found to not be fair use.
|
| #2 "We're going to steal your shit even if you won't give us
| consent to do that" is a bad idea. We're already seeing the
| practical results: _People stop/reduce posting their works to
| the clearnet_. What's your scraper bot going to do? Create an
| account on every 'private' forum and immediately get sued? Pay
| a subscription fee to every single Patreon page?
|
| And the big one, #3: It destroys public support for your
| technology, which is essential if you want to survive the
| oncoming government regulation.
|
| Look at the public response to various tech companies stopping
| their AI rollout in Europe over EU regulations. Varying from
| "Yes, this is exactly what we asked for" to "Good fucking
| riddance".
|
| And it will only get worse. Europe's privacy authorities
| are already issuing quiet statements that the scraping of
| social media posts, such as is done for AI data collection, is
| not legal under the GDPR. (The law's pretty clear on this, it
| doesn't matter that it was "posted publicly", you're not
| allowed to use personal data like that) There are already
| whispers of going after the LLMs themselves, as they contain
| and continue to process personal data as well.
|
| AI _needs_ public support, and the lack of consent is slowly
| bleeding it dry.
| CharlieDigital wrote:
| > What's your scraper bot going to do? Create an account on
| every 'private' forum and immediately get sued?
|
| More likely, that the owners of any repository of note will
| simply broker that data directly (a la Reddit).
| ADeerAppeared wrote:
| Yes, though this scenario isn't all that interesting to
| discuss:
|
| * In the case of big IP holders (e.g. media companies, news
| organizations) this is just obtaining consent. The only fun
| quirk is that OpenAI's purchasing of this drastically
| weakens the "AI training is fair use" claim by proving the
| existence of a market.
|
| * In the case of platforms like Reddit, it just kicks the
| problem one layer down. The platform does obtain "consent"
| through its ToS. (Beware that this consent is legally
| weak, and won't protect your ass from anything outside
| copyright) But users will still see it as "stealing" and
| may flee the platform.
|
| There's still a notable shift away from the clearnet.
| godelski wrote:
| > the "move fast and break things" mentality is what's
| causing problems.
|
| I want to add nuance. I don't think it is the "move fast and
| break things" mentality that creates all the shit, but that
| there is no "time to clean up, everybody do your share"
| mentality to complement it. Doing things often creates a
| mess, and doing hard things often creates a bigger mess. You
| can't make a fancy meal without dirtying a bunch of dishes.
| But are we seriously not hiring "dishwashers"? Creating a
| mess is unavoidable, and certainly it shouldn't be too large
| of a mess, but the dishes are piling up and we can't hide it
| anymore. It isn't a linear problem because the mess compounds
| and the mess itself generates more mess. We won't refactor.
| We won't rewrite. So we just have patchwork on top of
| patchwork. That's enshittification.
|
| We also have a status quo that we sprint into a sprint and
| try to move as fast as we can but only measure how fast we're
| going by looking at how far we moved in a quarter. There's no
| long term measurement because "that's too hard." But this is
| like trying to circumnavigate the world and choosing to walk.
| You'll make progress every day and more importantly,
| __measurable__ (and easily measurable) progress. But you
| could spend 11 months building a fucking Cessna and still beat
| the person that could walk on water. You need to move fast
| (like the Cessna), but to move fast requires also slowing
| down. Who is willing to slow down?
|
| I think most people don't have a problem with people using
| public data to do research or similar activities. People
| wouldn't be up in arms if OpenAI scraped all of YouTube,
| trained on it, proved internally that they could do cool
| stuff with it, AND THEN either started to generate their own
| data for training or started to purchase data. Even though
| this would still be costly to YouTube and be a weird ethical
| ground (like getting a "free trial" (or theft) of the data
| and pay only if it works).
|
| As someone with anxiety, I can assure you, it is not a good
| idea to constantly be rushing around chasing everything that
| needs to be solved. You just make more messes because you
| sloppily "fix" the issues, trying to move onto the next. The
| trick is to fight your own mind, slow down, triage, and solve
| anything that isn't a literal fucking fire with calm and
| care, no matter how much your own mind wants to convince you
| it is an emergency. But when everything is an emergency,
| nothing is. And that's the problem. We created an economy
| based on a business strategy that is functionally equivalent
| to an untreated and severe anxiety disorder.
| rectang wrote:
| The NYT had another article a couple of days ago about Getty
| leveraging the images it owns to go into AI.
|
| https://www.nytimes.com/2024/07/19/technology/generative-ai-...
| buildbot wrote:
| Getty often makes false claims and generally is a super shitty
| organization, for example (there are many):
| https://www.latimes.com/business/hiltzik/la-fi-hiltzik-getty...
| rectang wrote:
| Getty will have to indemnify users against infringement
| claims in the event that the images it generates turn out to
| be based on unlicensed materials and judged as derivative by
| a court. So will all other AI companies eventually, it's just
| that Getty has _some_ content while other companies rely much
| more on unlicensed content.
| buildbot wrote:
| To me, what Getty is doing is actual theft. How do you
| "accidentally" issue copyright claims using images you
| don't own? How did Getty start claiming this image as their
| own?
|
| In contrast, when training an AI model you download a piece of
| info once and then don't resell that (exact) piece of info.
| Nobody is claiming they own the image when they don't.
| thatxliner wrote:
| I would be fine if you use my data to train your AI models if you
| let me use your models for free. If you can't do that, you can't
| have my data.
| thatxliner wrote:
| I would be fine if you used my data to train your AI models as
| long as I'm able to use your models for free in return. If not,
| you can't have my data.
| wnc3141 wrote:
| I'm not even sure of that. The ability to summarize others is
| hardly a consolation prize for devaluing my work almost
| completely.
| rectang wrote:
| I would not be fine with that, and I hope that you don't
| believe that it should be imposed on me.
| jsheard wrote:
| > Those restrictions are set up through the Robots Exclusion
| Protocol
|
| If anyone would like to join in, there's an actively maintained
| robots.txt here:
|
| https://github.com/ai-robots-txt/ai.robots.txt
|
| Yes, I know this isn't legally binding and scrapers can ignore it
| if they want to.
| thriftwy wrote:
| I look forward to AI trained entirely on Wikipedia and classical
| literature with no Twitter and no contemporary art in sight. It
| would be sublime. Let's face it, the creators of the 21st century way
| overestimate the importance of their stuff. It's mostly
| deleterious to the culture.
| to11mtm wrote:
| > trained entirely on Wikipedia
|
| does the CC license used by Wikipedia allow training?
|
| > and no contemporary art in sight
|
| will note per above, they may be able to scrape other CC if
| they -can- scrape wikipedia.
|
| > It's mostly deleterious to the culture.
|
| Yep and based on how I already sometimes catch people treating
| LLM hallucinations as fact, it's likely gonna get worse anyway.
| elAhmo wrote:
| NYT assumes that LLMs = AI, which is far from the truth.
|
| This is just a recent hype which relies on getting insane amounts
| of data to train, but we had and will have AI models that do not
| rely on training using data without consent.
| kredd wrote:
| Technicalities make things harder to understand for the general
| masses. "The Data That Powers LLMs" means little to nothing to
| an average user. My 70 year old parents use ChatGPT, and they
| have no idea what an LLM is, but call it AI.
| wkat4242 wrote:
| It's not just LLMs but all generative models that rely on
| extreme amounts of training data. Text but also images, video,
| speech, music.
|
| And it makes sense that LLMs were the ones to trigger the hype.
| AI in general has been making steady progress but most use cases
| are really hard to explain to a layman. Give them a chatbot and
| they naturally understand what's happening.
|
| The only thing I'm seeing is that people are using them wrong.
| They're using them like all-seeing oracles, which is
| exacerbated by the confidence with which LLMs provide their
| answers, and the innate human idea that "it's a computer so it
| must be right".
|
| But knowledge isn't really where LLMs shine, at least not
| without search engine integrations. It's rather generation,
| summarisation, translation (with context sensitivity),
| rewriting styles etc.
| jll29 wrote:
| "disappearing" = people becoming aware that their data has value,
| and setting their robots.txt permissions accordingly?
| wkat4242 wrote:
| In the near future I see less real, quality human data that will
| go behind paywalls, but also much more AI generated data feeding
| into the next generation of AI. Because more and more people are
| using it to publish stuff online. Which then gets scooped up by
| AI training crawlers. And if it made sense to train an LLM on its
| own output, it would be done already :)
|
| I doubt they can continue the progress at the same speed they
| have been at so far. Because the game is set to become more
| difficult.
| to11mtm wrote:
| It honestly makes me hesitant to publish my special personal code
| on gists or private repos vs at least 3-2 backup the really good
| stuff...
|
| TBH I would not feel any sadness if LLM models plateaued, greatly
| decelerated development, or even -regressed- as they start to
| ingest all the garbage already being spewed by LLMs.
| hedora wrote:
| I think the worst possible outcome is a licensing regime that
| means that Disney or Paramount or Elsevier or whoever all get to
| have a monopoly on training large models within their niche. My
| guess is that any successful calls for regulation will have this
| outcome, which means that individuals won't be able to legally
| use AI-based tools except when creating works for hire, etc.
|
| Currently, I think most of the training use cases can be covered
| by the existing "you can't copyright a fact" carve out in the
| law. That's probably better for society and creators than my
| licensing regime scenario.
|
| Anyway, I'm rooting for "no regulation" for now. The whole
| industry is still being screwed over by market distortions
| created by the DMCA, and this could easily be 10x worse.
| DevX101 wrote:
| It's already happening. Mega content aggregators will either
| build their own LLMs, or license for $$$$. Twitter's doing both
| and reddit did the latter.
| janoc wrote:
| It is good that you are rooting for the poor industry because
| they are "being screwed".
|
| Sad that you didn't consider the content creators, people whose
| faces, voices, writings, art, personal information, etc. are
| being used without consent and without any compensation as the
| ones "being screwed" here.
| oldkinglog wrote:
| > I think the worst possible outcome is a licensing regime that
| means that Disney or Paramount or Elsevier or whoever all get
| to have a monopoly on training large models within their niche.
|
| Why is this the worst possible outcome? Companies using AI
| would be training with properties they own or have licensed
| appropriately, rather than the existing scheme of ignoring
| copyright law to extract $$$ from the creative works of
| ordinary people.
| doctorpangloss wrote:
| "Tell HN: Stop Reading The New York Times"
| greatpostman wrote:
| AI negativity clickbait
| neonate wrote:
| https://archive.ph/EEQQB
___________________________________________________________________
(page generated 2024-07-21 23:17 UTC)