[HN Gopher] Google denies training Bard on ChatGPT chats from Sh...
___________________________________________________________________
Google denies training Bard on ChatGPT chats from ShareGPT
Author : chatmasta
Score : 363 points
Date : 2023-03-30 11:16 UTC (11 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| mupuff1234 wrote:
| This just in, web indexing company scrapes web for data.
| cpeterso wrote:
| Regardless of whether this happened or not, would training Bard
| on ChatGPT output be good or bad for Bard's product quality? I
| imagine there's a risk of AIs recursively reinforcing bad data in
| their models. This problem seems unavoidable as more web content
| becomes AI-generated content and spam.
| ankit219 wrote:
| According to the article, the story goes like this: an engineer,
| Jacob Devlin, raised concerns about training Bard on ShareGPT
| data, then left and joined OpenAI directly.
|
| He also claims that Google was about to do it, and then stopped
| after his warnings, presumably removing every trace of OpenAI's
| responses.
|
| A couple of things:
|
| 1. So, Bard could have been trained on ShareGPT, but it wasn't -
| according to the same engineer who raised the concern (and
| Google's denial in The Verge).
|
| 2. Since he joined OpenAI directly, he could have told them and
| they could have taken action, yet nothing is public on that
| front. Probably nothing to see here.
|
| Edit: The engineer wasn't directly involved with the Bard team
| either; it merely appeared to him that the Bard team was heavily
| relying on ShareGPT.
| binarymax wrote:
| For those who don't know, Jacob Devlin was the lead engineer on
| and first author of the widely popular BERT model architecture,
| and of the initial bert-base models released by Google.
|
| https://www.semanticscholar.org/author/Jacob-Devlin/39172707
| [deleted]
| whimsicalism wrote:
| Your comment doesn't make sense to me.
|
| > Bard team was heavily relying on ShareGPT.
|
| > He also claims that Google were about to do it, and then they
| stopped after his warnings.
|
| So were they heavily relying or were they about to and then
| stopped? It's unclear from your comment. Could you link where
| you're getting this info from? The Information article is
| walled, unfortunately.
| ankit219 wrote:
| [1] gives the gist as well.
|
| What I meant to say: according to The Information article, the
| engineer raised concerns because it appeared to him (the
| article's wording) that the Bard team was using, and heavily
| reliant on, ShareGPT for Bard training. The engineer wasn't
| working on Bard; presumably someone told him, or he somehow got
| the impression, that the Bard team was reliant on ShareGPT. At
| the time he was at Google.
|
| Then, when he raised his concerns to Sundar Pichai, the Bard
| team stopped doing it and also scrapped any traces of ShareGPT
| data. So the headline is false, and Bard (again, presumably) is
| not trained on any ShareGPT data.
|
| [1]: https://www.theverge.com/2023/3/29/23662621/google-bard-
| chat...
| whimsicalism wrote:
| I think I might be confused by your usage of "about to do
| it" in your original comment to mean "actively doing it."
|
| You claim that the very engineer accusing Google of training
| Bard on ShareGPT acknowledges that the final product was not
| trained on it. As far as I can tell, Devlin did no such thing.
|
| Not sure why you would presume they restarted their
| expensive training process.
|
| It just doesn't seem like a good faith characterization to
| me.
| rgbrenner wrote:
| Take what action? Pretty sure that's not illegal, especially
| since the training data is AI-generated and therefore can't be
| copyrighted.
| chatmasta wrote:
| I think the oomph behind the story is due to it being
| embarrassing, rather than illegal.
| dahfizz wrote:
| OpenAI could have blocked Google's accounts, for example.
| Nothing really to do with legality.
| sebzim4500 wrote:
| No one is alleging that Google directly used OpenAI's API
| to get training data (which would be unambiguously against
| TOS). The claim is that they downloaded examples from
| ShareGPT.
| frozenlettuce wrote:
| Not illegal, but that won't stop people from finding it amusing
| that a company considered the world's beacon of innovation is
| copying someone else's homework. It's hard being the favorite
| horse.
| dvngnt_ wrote:
| Tech companies steal ideas all the time. Snapchat invented
| Stories, and now WhatsApp, Facebook, Instagram, TikTok, and
| YouTube all have them.
| shmerl wrote:
| Well, ChatGPT itself was trained on something else, so how is
| Bard any worse? AIs copying each other is only natural to
| expect.
| ChatGTP wrote:
| I couldn't be happier, keep up the good work. Steal away, just
| as OpenAI has done.
| visarga wrote:
| This could actually be a good way to sidestep the training set
| copyright and access right issues. Copyright protection should
| solely encompass the expression of human generated content and
| not the underlying concepts.
|
| By training model B using the results generated by model A, the
| copyright of corpus_A (OpenAI RLHF dataset) remains safeguarded,
| as model B is never directly exposed to corpus_A, preventing it
| from duplicating the content verbatim.
|
| This process only transmits the concepts originating from
| corpus_A, which represents universal knowledge that cannot be
| claimed by any individual party.
| burakemir wrote:
| "... as a joke."
| dathinab wrote:
| People complained that new AI is "stealing" from artists.
|
| But stealing from other AIs turns out to often be easier.
|
| And this is where things get fun: companies like OpenAI want to
| be able to train on all the data without any explicit permission
| from the creators, but the moment people do the same to them
| they will likely (we will see) be very much against it.
|
| So it will be interesting to see whether they manage to both
| have their cake and eat it (e.g. by using Microsoft's lobbying
| power to push absurd laws), or whether cannibalization makes it
| unprofitable to create better AI and they fall apart.
|
| EDIT: This comment isn't specific to Google/Bard, so it doesn't
| matter whether Google actually did this or not.
| commoner wrote:
| I can see the GitHub Copilot controversy being resolved in this
| way. If Microsoft, GitHub, and OpenAI successfully use the fair
| use defense for Copilot's appropriation of proprietary and
| incompatibly licensed code, then a free and open source
| alternative to Copilot can be trained on Copilot's outputs.
|
| After all, the GitHub Copilot Product Specific Terms say:
|
| > 2. Ownership of Suggestions and Your Code
|
| > GitHub does not claim any ownership rights in Suggestions.
| You retain ownership of Your Code.
|
| https://github.com/customer-terms/github-copilot-product-spe...
| century19 wrote:
| Google accused Microsoft's Bing of copying its search results a
| few years ago. They set up a sting to show that when you
| searched for something unique on Google using Internet Explorer,
| shortly afterwards the same result would start showing up on
| Bing.
|
| This was seen as deeply embarrassing for Microsoft at the time.
| godzillabrennus wrote:
| The deeply embarrassing period at Microsoft began and ended
| when Ballmer ran the show. The Bing results saga was the
| hangover.
| blisterpeanuts wrote:
| Embarrassing, maybe, but imitation is the sincerest form of
| flattery.
| int_19h wrote:
| Indeed, which is why the biggest impact this revelation is
| likely to have (if proven true) is on Google's stock.
| brucethemoose2 wrote:
| This is also bad because the risk of AI "inbreeding" is real. I
| have seen invisible artifact amplification happen in a single
| generation when training ESRGAN on its own output.
|
| Maybe it won't happen in a single LLM generation, but perhaps
| gen 3 or 5 will start having really weird speech patterns or
| hallucinations because of this.
| sebzim4500 wrote:
| Worst case scenario, they just start training only on pre-2020
| data and then fine-tuning on a dataset which they somehow know
| to be 'clean'.
|
| In practice, though, I doubt that AI contamination is actually a
| problem. Otherwise, how would e.g. AlphaZero work so well (it is
| effectively trained _only_ on its own data)?
| whimsicalism wrote:
| The parallels with AlphaZero are not so easy.
|
| The problem is that you need some sort of arbiter of who has
| "won" a conversation, but if the arbiter is just another
| transformer emitting a score, the models will compete to match
| the incomplete picture of reasoning given by the arbiter.
| brucethemoose2 wrote:
| It could degrade the model in a way that evades the metrics they
| use for gauging quality.
|
| The distortions that showed up in ESRGAN (for instance) didn't
| seem to affect the SSIM or anything (and in fact it was trained
| with an MS-SSIM loss), but the "noise splotches" and "swirlies",
| as I call them, were noticeable in some of the output. You had
| to go back and look _really_ hard at the initial dataset to spot
| what it was picking up; sometimes, even after cleaning, it felt
| like what it was picking up on was completely invisible.
|
| TLDR: Google may not even notice the inbreeding until it's
| already a large issue, and by then they may be reluctant to
| scrap so much work on the model.
| gigel82 wrote:
| Where are all those people that kept saying Google had an amazing
| model way beyond ChatGPT internally for years? Those comments
| always kept coming up in ChatGPT posts; maybe they'll stop now.
| Imnimo wrote:
| I don't care at all about this from a copyright or data ownership
| perspective, but I am a little skeptical that it's a good idea to
| be this incestuous with training data in the long run. It's one
| thing to do fine tuning or knowledge distillation for specialized
| domains or shrinking models. But if you're trying to train your
| own foundation model, is relying on output from other foundation
| models going to teach yours to imitate their errors?
| sdenton4 wrote:
| Things like ShareGPT or PromptHero give vast repositories of
| human-curated ML outputs, which makes them fantastic for at
| least incremental improvement on the base model. In the grand
| scheme of things, these will be just another style, mixed in
| with all the other crap in the training set, so I don't imagine
| it's too harmful... e.g., 'paint starry night in the style of
| midjourney 5'.
| berkle4455 wrote:
| Where are any LLMs going to get data from as they become more
| ubiquitous and humans produce less publicly accessible original
| and thoughtful content?
|
| The whole thing is a plateaued feedback loop.
| TillE wrote:
| It'd be cool to have an LLM that's trained almost exclusively
| on books from good publishers, and other select sources.
| Working out licensing deals would be a challenge, of course.
| whimsicalism wrote:
| The corpus is likely too small. It would just be an "LM".
| whimsicalism wrote:
| Probably from multiple modalities, as well as from extending the
| sequence lookback length further and further.
|
| Models have low perplexity now, but the even lower perplexity
| possible when predicting the next word on page 365 of a book,
| where you can attend over the previous 364 pages, will allow
| even more complexity to emerge.
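|
| For concreteness, perplexity is just the exponential of the
| average negative log-likelihood the model assigns to each next
| token. A minimal sketch in Python (the log-probs here are made
| up for illustration):
|
|     import math
|
|     def perplexity(token_log_probs):
|         # Perplexity = exp of the average negative log-likelihood
|         # over the tokens the model was asked to predict.
|         avg_nll = -sum(token_log_probs) / len(token_log_probs)
|         return math.exp(avg_nll)
|
|     # A model that is more certain about each next token
|     # (log-probs closer to 0) gets lower perplexity:
|     print(perplexity([-0.1, -0.2, -0.3]))  # ~1.22, confident
|     print(perplexity([-2.0, -3.0, -2.5]))  # ~12.2, uncertain
|
| Attending over more context can only sharpen those per-token
| predictions, which is what drives the perplexity down.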
| whimsicalism wrote:
| But Bard isn't a foundation model?
|
| Clearly this data has value as some sort of RLHF finetuning
| dataset. Honestly they probably used it for negative examples.
| kleiba wrote:
| Hard to believe that is true, or else Bard would probably not
| perform so badly.
| waselighis wrote:
| Google only has a fraction of the training data. OpenAI had a
| huge head start and has been collecting training data for years
| now. ChatGPT is also wildly popular which has given them tons
| more training data. It's estimated that ChatGPT gained over 100
| million users in the first two months alone, and may have over
| 13 million active users daily.
|
| The logs on ShareGPT are merely a drop in the bucket.
| rocmcd wrote:
| > Google only has a fraction of the training data.
|
| Uh, what? The same Google that has been crawling, indexing,
| and letting people search the entire Internet for the last 25
| years? They have owned DeepMind for nearly twice as long as
| OpenAI has been in existence!
|
| If anything this is proof that no one at Google can get
| anything done anymore, and lack of training data ain't the
| problem.
| mirker wrote:
| The alignment portion of training requires you to have
| upvote/downvote data on many LLM responses. Google's
| attempt at that (at least according to the news so far) was
| asking all employees to volunteer time ranking the
| responses. Combined with no historical feedback from
| ChatGPT, they are behind.
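|
| For context, the usual way that vote data gets used: train a
| reward model on pairwise comparisons between responses, with a
| Bradley-Terry style loss that pushes the reward of the preferred
| response above the rejected one. A minimal sketch (the scores
| are hypothetical reward-model outputs):
|
|     import math
|
|     def preference_loss(score_chosen, score_rejected):
|         # -log(sigmoid(r_chosen - r_rejected)): small when the
|         # reward model already ranks the preferred response
|         # higher, large when it gets the ranking backwards.
|         margin = score_chosen - score_rejected
|         return -math.log(1.0 / (1.0 + math.exp(-margin)))
|
|     print(preference_loss(2.0, -1.0))  # ~0.05, ranking correct
|     print(preference_loss(-1.0, 2.0))  # ~3.05, ranking wrong
|
| Without a pile of such comparisons, there is no signal to train
| the reward model on, which is the gap being described here.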
| duringmath wrote:
| Bard is only a week old and has a large "experimental" sticker
| on it. Besides, its UI is better and the answers are succinct,
| which I prefer.
| bastardoperator wrote:
| They literally copied the ChatGPT UI, lol, only it looks like a
| dated Google UI. How do you prefer answers with less data?...
| that's crazy.
| dvngnt_ wrote:
| doing a visual diff will show you it's not a literal copy
| bastardoperator wrote:
| I'm talking design, not code, lol...
| duringmath wrote:
| I just don't want to be hit with a wall of text every single
| time. Bard gets the point across with minimal padding (high
| signal-to-noise ratio); ChatGPT feels like it gets paid by the
| word, and they do actually charge by the token if you use the
| API.
|
| As for the UI, it's a take on the tried and true chat UI, same
| as ChatGPT's. It spits out the whole answer at once instead of
| feeding it to you one word at a time, it has an alternative
| drafts button, the "Google it" button is a nice touch, and it
| feels quicker.
| bastardoperator wrote:
| You can combat that in the prompt; I use "just code, no words",
| which also removes code comments from the output. Bard doesn't
| respect the same request, so you can be more succinct with
| ChatGPT. Half the things I ask for in Bard give me this:
|
| "I'm still learning coding skills, so at the moment I
| can't help with this. I'm trained to do things like help
| you write lists about different topics, compare things,
| or build travel itineraries. Do you want to try any of
| those now?"
| duringmath wrote:
| Longer instructions? Which part of "less is more" do you
| not understand?
| bastardoperator wrote:
| What part of succinct do you not understand? Bard provides a
| bunch of useless text too, only you can't get rid of it. No
| worries, you don't know how to use ChatGPT; have fun with Bard
| until Google cancels it.
| karmasimida wrote:
| Yeah, Bard's replies are nothing like ChatGPT's.
|
| I wonder, is it possible to use ChatGPT for competitor analysis?
|
| If the responses are not used in the final training data, I
| don't see how this is controversial.
|
| Also, if Google's compliance team can't even recognize this
| level of legal risk, despite the army of top-paid lawyers
| they've hired, I don't know what to say. Maybe they should fall
| then.
| m00x wrote:
| ITT armchair lawyers LARPing.
| croes wrote:
| Would 112k conversations make a huge difference in the model?
| int_19h wrote:
| For fine-tuning, yes, absolutely.
| social_quotient wrote:
| It's interesting when we say Google did this. It was actually,
| most likely, some people who work for Google and are on this
| forum who did this. Knowingly, not by accident while slurping up
| the rest of the internet, and they got paid to do it. I wonder
| what the engineers' view on this was/is. I have to assume they
| roughly know the terms of the OpenAI data (whether you agree
| with those terms or not).
|
| Anyone care to steel man the argument for why this was a good
| idea?
| hackerlight wrote:
| > Anyone care to steel man the argument for why this was a good
| idea?
|
| I don't see a big difference between this and training it on
| people's code and art which also happens without explicit
| permission.
| Nimitz14 wrote:
| I don't understand why it's a bad idea. Did OpenAI ask for
| permission to use the data it uses? (No.)
| seanhunter wrote:
| "What's sauce for the goose is sauce for the gander" as the legal
| cliche goes. OpenAI cannot on the one hand claim that google did
| something wrong if they used their outputs as part of the bard
| training while simultaneously on the other hand claiming they
| themselves are free to use everyone on the internets content to
| train their model.
|
| Either they believe that training should respect copyright (in
| which case they could not do what they do) or they believe that
| training is fair use (in which case they cannot possibly object
| to Google doing the same as them).
| az226 wrote:
| A big whoosh here. OpenAI is fair use because an LLM is
| transformative from the content it gathered. Bard is literally
| the same product as ChatGPT, so it is not transformative at
| all. Tell me you know nothing about copyright without telling
| me you know nothing about copyright.
| cornholio wrote:
| That's nonsensical. An AI is either transformative or it's
| not, it's an intrinsic quality that has nothing to do with
| the training data or the "product" type. If OpenAI is
| sufficiently transformative to claim fair use (which I don't
| believe for a second, alas), then any other AI built on
| similar fundamentals has the same claims and can crunch any
| data their creators see fit, including the output of other
| AIs.
| sebzim4500 wrote:
| No one is alleging copyright violations. The claim is that they
| violated OpenAI's terms of service. We don't know whether
| Google ever even agreed to those terms of service in the first
| place.
| seanhunter wrote:
| Are OpenAI saying they have adhered to the terms of service
| of all the content they have used?
| dragonwriter wrote:
| _Content_ is not subject to terms of _service_.
|
| _Services_ are subject to terms of service. (If content is
| received through a service, the terms of service may govern
| use of it, but that's not a feature of the content, but the
| acquisition route.)
| deckard1 wrote:
| Terms of Service, Terms and Conditions, and Terms of Use
| are all the same thing. There is no legal difference
| between them.
|
| > that's not a feature of the content, but the
| acquisition route.
|
| It's neither. It's a feature of contract law.
| danShumway wrote:
| ShareGPT isn't part of that service though. Yes, it would
| be a TOS violation if Google directly used ChatGPT to
| generate transcripts -- but not even the original Twitter
| thread is claiming that.
|
| The only claim being made against Google here is that they used
| ChatGPT _content_. I can't find any sources claiming that Google
| made use of an OpenAI service. So the distinction is correct,
| but doesn't seem particularly valuable in this context -- using
| data from ShareGPT is not a TOS violation.
| ar9av wrote:
| I love that OpenAI uses a ton of other people's work to train
| their model, yet when someone uses OpenAI to train their model,
| they get all up in arms.
|
| As far as I'm concerned, OpenAI has decided terms of use don't
| exist anymore.
| jug wrote:
| OpenAI is training on data that is against their terms of use?
| That reads like a serious allegation. What is this all about?
| cycomanic wrote:
| OpenAI is training on copyrighted data without a licence. I
| would argue copyright law has much stronger legal standing
| than some ToS.
|
| Now OpenAI is arguing their training is fair use, but that
| has certainly not been legally established so far and could
| just as much be used as a defence against ToS violation.
|
| So in short yes OpenAI is pretty much doing the same thing.
| modernpink wrote:
| Where are they up in arms?
| paxys wrote:
| 1. Google denies doing it, so at the very least the title should
| have an "allegedly".
|
| 2. Even if they did - so what? The output from ChatGPT is not
| copyrightable by OpenAI. In fact it is OpenAI that is training
| its models on copyrighted data, pictures, code from all over the
| internet.
| manojlds wrote:
| But remember many years back when it was news that Bing used
| Google search results to improve its results.
| magicalist wrote:
| It's not quite the same thing, because Bing was getting the
| data from a browser toolbar and watching the search terms
| used and where the user went afterwards.
|
| A closer equivalent would be if someone had made a ShareSERP
| site where people posted their favorite search terms and the
| results Google gave, and Bing crawled that and incorporated the
| search-term-to-link connections into their search graph.
|
| The actual actions had _maybe_ gone too far (personally I
| thought it was more funny than "copying"), the hypothetical
| would be pretty much what you'd expect to happen. Even google
| would probably crawl ShareSERP and inadvertently reinforce
| their own results (the same way OpenAI presumably gets more
| than a bit of their own results back at them in any new
| crawls of reddit, hn, etc even if they avoid sites like
| ShareGPT deliberately).
| cma wrote:
| > Google catches Bing copying [search results], Microsoft
| says "so what?"
|
| https://arstechnica.com/information-
| technology/2011/02/googl...
| Jimmc414 wrote:
| >Even if they did - so what?
|
| Amplification of biases, propagation of errors, echolalia and
| over-optimization, lack of diverse data, overfitting
| funkyjazz wrote:
| Not to mention it's embarrassing. Google playing second
| banana to OpenAI.
| nicehill wrote:
| I think Amazon was first in the (free) banana business
| jrirhfifj wrote:
| You joke, but the first product they changed at Whole Foods was
| the bananas.
|
| Before: organic (South America) and regular (Central America or
| SEA) for 69 and 59 cents.
|
| Then: both Chiquita's brand with regular and organic stickers
| (clearly the same produce, always from SEA) for 49 and 39 cents.
|
| That was days after the announcement.
| bbarnett wrote:
| Did you inadvertently reverse the regular/organic order, or was
| organic cheaper after?
| prepend wrote:
| Google's been second banana to OpenAI for a few years now,
| right?
| ithkuil wrote:
| That assumes that training on the output of another language
| model somehow gives you the ability to improve your model and
| catch up.
| iandanforth wrote:
| It does. In general this is known as teacher-student
| training or knowledge distillation. It works better if
| you have access to the activations of the model but you
| can work with just outputs as well.
| satvikpendem wrote:
| Well, it does, that's how we got Alpaca from LLaMA.
| jrirhfifj wrote:
| You talk like ChatGPT was some bastion of curated, perfectly
| correct content. Get a grip. Web scraping is web scraping.
| RosanaAnaDana wrote:
| I mean maybe. There also might be something to this. OpenAI
| has been very opaque about training techniques.
| paxys wrote:
| That's just the base concern with every single model
| regardless of where they sourced their data from. Garbage in,
| garbage out.
| educaysean wrote:
| Sure. Does that fact mean we're prohibited from expressing
| concerns about data quality? ShareGPT isn't representative
| of authentic, quality writing.
| Jimmc414 wrote:
| Right, but training an LLM on the output of another LLM can
| certainly exacerbate these issues
| paxys wrote:
| Maybe, but we are fast approaching the point (or more
| likely have crossed it already) where distinguishing
| between human and AI generated data isn't really
| possible. If Google indexes a blog, how does it know
| whether it was written with AI assistance and therefore
| should not be used for training? Heck, how does OpenAI
| itself prevent such a feedback loop from its own output
| (or that of other LLMs)?
| madeofpalk wrote:
| > If Google indexes a blog, how does it know whether it
| was written with AI assistance and therefore should not
| be used for training
|
| Yes, this is an existential problem for Google and
| training future LLMs.
|
| See also, https://www.theverge.com/23642073/best-
| printer-2023-brother-... and
| https://searchengineland.com/verge-best-
| printer-2023-394709
| abduhl wrote:
| Your argument would have a lot more force if we were past
| that point rather than fast approaching that point.
| Concerns about training data errors being compounded are
| much more important when you're talking about the
| bleeding edge.
|
| And your question about how OpenAI prevents their
| training data from being corrupted is one we should be
| asking as well!
| rightbyte wrote:
| > Heck, how does OpenAI itself prevent such a feedback
| loop from its own output (or that of other LLMs)?
|
| Seems trivial. Only use old data for the bulk? Feed some
| new data carefully curated?
| toxik wrote:
| Future job: token selector / archiving
| notahacker wrote:
| <meta name="generator" content="human brain">
|
| I'm only half joking... I think we likely will end up with flags
| for human-generated/curated content (and it will have to be that
| way round, as I can't imagine spammers bothering to put flags on
| AI-generated stuff), and we probably already _should_ have an
| equivalent of the robots.txt protocol that allows users to
| specify which parts of their website they would and wouldn't
| like used in the training of LLMs.
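|
| As a sketch of what consuming such a flag might look like (the
| tag value and the filtering policy are hypothetical, and nothing
| would enforce honesty):
|
|     from html.parser import HTMLParser
|
|     class GeneratorTagParser(HTMLParser):
|         # Looks for a <meta name="generator" ...> tag that a
|         # crawler could use to filter pages out of a training
|         # corpus.
|         def __init__(self):
|             super().__init__()
|             self.generator = None
|
|         def handle_starttag(self, tag, attrs):
|             attrs = dict(attrs)
|             if tag == "meta" and attrs.get("name") == "generator":
|                 self.generator = attrs.get("content")
|
|     page = ('<html><head>'
|             '<meta name="generator" content="human brain">'
|             '</head><body>hello</body></html>')
|     parser = GeneratorTagParser()
|     parser.feed(page)
|     print(parser.generator)  # "human brain"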
| jfk13 wrote:
| If content with a "human-generated" flag is rated more
| highly in some way -- e.g. search results -- then _of
| course_ spammers will automatically add that flag to
| their AI-generated garbage. How do you propose to prevent
| them?
| notahacker wrote:
| I assume, like the actual meta generator tags, it
| wouldn't actually be a massive boon for regular search
| results
| shubhamkrm wrote:
| Reminds me of the old "evil bit" RFC[1]
|
| [1] https://www.ietf.org/rfc/rfc3514.txt
| KRAKRISMOTT wrote:
| OpenAI's Terms of Service forbid training competitor models on
| their ML outputs (LoRA/Alpaca-style laundering is probably not
| allowed for commercial use).
| worldofmatthew wrote:
| Are the TOS even enforceable if AI content can't be copyrighted?
| space_fountain wrote:
| Where exactly does it say that? I looked a bit and couldn't find
| it, but likely I was just wrong.
| short_sells_poo wrote:
| I love how they don't want others to use their model output, but
| they have no qualms about training their model on the
| copyrighted works of others. Isn't this a stunning level of
| hypocrisy?
| Certhas wrote:
| This is really hilarious. Authors and artists never gave
| permission to use their work to train AI models either...
|
| Not legally the same situation, but ethically close enough.
| saurik wrote:
| So, to verify, are you claiming that if someone added a
| similar clause to their source code and then GitHub went
| ahead and trained Copilot against it, that would be an issue?
| bloppe wrote:
| You relinquish all licensing rights when you upload your
| code to GitHub. Microsoft can do whatever they want with
| it. That's in their ToS, which you have to agree to when
| you make an account. Normally, only affirmatively accepted
| ToS are enforceable, so just putting a clause into your
| license doesn't work (unless it's a copyright, which
| doesn't require consent).
| flir wrote:
| > You relinquish all licensing rights when you upload
| your code to GitHub
|
| What now? Seriously?
|
| I found this. Section D4.
|
| "We need the legal right to do things like host Your
| Content, publish it, and share it. You grant us and our
| legal successors the right to store, archive, parse, and
| display Your Content, and make incidental copies, as
| necessary to provide the Service, including improving the
| Service over time. This license includes the right to do
| things like copy it to our database and make backups;
| show it to you and other users; parse it into a search
| index or otherwise analyze it on our servers; share it
| with other users; and perform it, in case Your Content is
| something like music or video."
|
| "as necessary to provide the Service" seems critical.
| bloppe wrote:
| "Improving the service over time" can do a lot of heavy
| lifting, definitely including training Copilot.
| commoner wrote:
| Also, section D3 of the GitHub Terms of Service says:
|
| > You retain ownership of and responsibility for Your
| Content.
|
| and section D4 says:
|
| > This license does not grant GitHub the right to sell
| Your Content. It also does not grant GitHub the right to
| otherwise distribute or use Your Content outside of our
| provision of the Service, except that as part of the
| right to archive Your Content, GitHub may permit our
| partners to store and archive Your Content in public
| repositories in connection with the GitHub Arctic Code
| Vault and GitHub Archive Program.
|
| There is nothing in the terms that requires the GitHub
| user to relinquish all licensing rights.
|
| https://docs.github.com/en/site-policy/github-
| terms/github-t...
| bloppe wrote:
| The clauses always have a trap door: "[outside of] our
| provision of the Service" means they can do anything as
| long as it's a service they provide.
|
| Under definitions: _The "Service" refers to the
| applications, software, products, and services provided
| by GitHub, including any Beta Previews._
| commoner wrote:
| I think there's a misunderstanding over what the word
| "relinquish" means.
|
| The terms make clear that uploading code to GitHub gives
| GitHub the right to "store, archive, parse, and display
| Your Content, and make incidental copies, as necessary to
| provide the Service, including improving the Service over
| time" while the code is hosted on GitHub.
|
| However, that's not the same thing as relinquishing
| (giving up) licensing rights to GitHub. The uploader
| still retains those rights, and there is nothing in the
| terms that says otherwise.
| gcr wrote:
| The question turns on whether you consider copilot part
| of the "GitHub service."
|
| GitHub would argue that it is, and they'd likely argue
| that charging for access to copilot is akin to charging
| for access to private repositories.
|
| Others would say that copilot is somehow separate from
| the services Github provides, so using their code for
| CoPilot wouldn't be covered by the ToS.
| bloppe wrote:
| It is certainly a service that's being provided. If not
| by GitHub, then by whom?
|
| I'll repeat the definition of service: _The "Service"
| refers to the applications, software, products, and
| services provided by GitHub, including any Beta
| Previews._
| cycomanic wrote:
| So do you believe that if you hosted a closed source project on
| GitHub, and GitHub decided they wanted to integrate it into
| their service, they would simply be allowed to take the code?
|
| Fortunately, HN commenters are not judges. And I would wager any
| bet that MS lawyers would not try to argue based on their ToS
| either; that would be a recipe for losing any court case.
| bloppe wrote:
| I just mean that it doesn't really matter what your
| license says as long as GitHub can come up with a
| business justification for using it in some way.
| Certainly, other users still legally have to obey your
| copyright.
| saurik wrote:
| So, to verify, are you claiming it would not be allowed
| for _you_ to upload _my_ otherwise-open-source code (code
| I do not myself host at GitHub, but which was reasonably
| popular / important code) to GitHub?
| bloppe wrote:
| Yep. It's in their ToS:
|
| _If you're posting anything you did not create yourself or do
| not own the rights to, you agree that you are responsible for
| any Content you post; that you will only submit Content that you
| have the right to post; and that you will fully comply with any
| third party licenses relating to Content you post._
|
| I suppose this means if I upload your stuff to GitHub,
| and you sue GitHub, then GitHub would be able to somehow
| deflect liability onto me.
| commoner wrote:
| That doesn't make sense. For example, GPLv3 allows anyone
| to redistribute the software's source code if the license
| is intact:
|
| > You may convey verbatim copies of the Program's source
| code as you receive it, in any medium, provided that you
| conspicuously and appropriately publish on each copy an
| appropriate copyright notice; keep intact all notices
| stating that this License and any non-permissive terms
| added in accord with section 7 apply to the code; keep
| intact all notices of the absence of any warranty; and
| give all recipients a copy of this License along with the
| Program.
|
| https://www.gnu.org/licenses/gpl-3.0.en.html
|
| If GitHub then uses the source code in a way that
| violates the license, there is no provision in the GitHub
| terms of service that would allow GitHub to deflect legal
| liability to the GitHub user who uploaded the program.
| The uploader satisfied the requirements of GPLv3, and
| GitHub would be the only party in violation.
| 8note wrote:
| Uploading is granting GitHub a license separate from the GPL
| license.
|
| If you can't actually grant that separate license, you're
| misrepresenting your ownership of and license to that code.
| vagabund wrote:
| Google has no contract with OpenAI though. They used a third
| party site to scrape conversations. If the outputs themselves
| are not copyrighted, and they never agreed to the terms of
| service, it should be fine, right? Albeit unethical and
| embarrassing.
| [deleted]
| paxys wrote:
| Hardly unethical, considering OpenAI is doing exactly this.
| layer8 wrote:
| Two wrongs don't make a right.
| pantalaimon wrote:
| It's still debatable if training a computer neural network on
| public data is 'wrong' when we very much accept it as a right
| for biological neural networks.
| asddubs wrote:
| Forgive me if I have limited sympathy when a burglar's house
| gets robbed.
| kbrkbr wrote:
| This
| WillPostForFood wrote:
| It's even less worthy of sympathy - like a counterfeit piece of
| art being counterfeited. And there isn't even an original, just
| a made-up counterfeit.
| vagabund wrote:
| You can quibble about the ethics of web scraping for ML
| in general but I think you're conflating issues.
|
| OpenAI and Google both scour the web for human-generated
| content. What Google cares about here are the learnings from
| OpenAI's proprietary RLHF dataset, for which they had to
| contract a large number of human labelers. Finding a roundabout
| way to extract the value of a direct competitor's purpose-built,
| costly data feels meaningfully different from scraping the web
| in general as an input to a transformative use.
| abeppu wrote:
| If there's a party which has intentionally conflated
| scraping web content in general with scraping it to build
| a direct competitor to the original sources, that party
| is Google.
|
| Yes, this latest instance with OpenAI outputs is shady,
| but I think it's in the same spirit as scraping news
| organizations for content which journalists were paid to
| write, and then showing portions of it directly in
| response to queries so people don't go directly to the
| news organization's pages, and it's in the same spirit as
| showing answers to query-questions that are excerpts from
| scraped pages which another organization paid to produce.
| bloppe wrote:
| I see no difference. Any web scraping is a means to
| deflect revenue-generating traffic to yourself, and away
| from other websites. Fewer people will go to Stack
| Overflow because of Codex and Copilot. The point that the
| content was paid for vs volunteered becomes moot once
| it's posted publicly online for free, on ShareGPT.
| shmel wrote:
| So what? Is OpenAI's RLHF dataset more valuable than the
| millions of books and paintings OpenAI used for free without a
| second thought? Why is that? Because one big tech corp paid
| money for that dataset?
| ClumsyPilot wrote:
| > labelers. Finding a roundabout way to extract the value
| of a direct competitor's purpose-built, costly data feels
| meaningfully different from scraping the web in general
| as an input to a transformative use
|
| There we go again: one law for the unwashed plebs and another
| for us.
|
| Why do you think that I, after spending my time and effort to
| write my blog, own my content to a lesser extent than OpenAI
| owns theirs? Such hypocrisy.
| paxys wrote:
| > OpenAI and Google both scour the web for human-
| generated content
|
| OpenAI and Google both scour the web for content, period.
| That content could be human generated or AI generated or
| a mix of the two. Neither company is respecting copyright
| or terms of service of every individual bit of data
| collected. Neither company cares how much effort was put
| into creating the data, whether humans were paid to do
| it, or whatever else. So there really isn't that much
| difference between the two. In fact I can guarantee that there
| was _some_ Google-generated content within OpenAI's training
| data.
| vkou wrote:
| And herein is the main problem of AI. Its creators
| consume knowledge from the commons, and give nothing free
| and unencumbered back.
|
| It's like the guy who never brings anything to the
| potluck, but after everyone finishes eating, he boxes up
| the leftovers, and starts selling them out of a food
| cart.
| kweingar wrote:
| > Albeit unethical and embarrassing.
|
| I really don't understand this angle. In fact, I am fairly
| positive that the training set for GPT-4 contains many
| thousands of conversations with AI agents not developed by
| OpenAI.
|
| Do AI companies need to manually sift through the corpus
| and scrub webpages that contain competitor LLM output?
|
| ("Yes" is an acceptable answer to this, but then it applies
| to OpenAI's currently existing models just as much as to
| Bard)
| j_maffe wrote:
| How did you come about being "fairly positive" that GPT-4
| is trained on other AI conversations?
| TremendousJudge wrote:
| Many AI conversations have been floating around internet
| forums since the original GPT was released. As OpenAI
| hasn't shared anything about its training set, to err on
| the side of caution I would assume that they didn't
| filter these conversations out. If they aren't even
| marked as such, it may not even be possible to do. I
| think it would be very hard to prove that no AI
| conversations are included in the training set, even if
| it wasn't secret.
| caconym_ wrote:
| No more unethical or embarrassing than scraping the web for
| millions of copyrighted works and selling access to
| unauthorized derivative works.
| shmatt wrote:
| Breaking terms of service is not punishable in any way. Facebook
| tried to enforce theirs and lost in court.
| paxys wrote:
| Correction - breaking terms of service _that you have not
| explicitly agreed to_ is not punishable in any way. A site
| cannot enforce a "by using this site you agree to..."
| clause deep inside some license page that visitors are
| generally unaware of. If you violate an agreement that you
| willingly chose to enter, however, you will likely be found
| liable for it.
| bloppe wrote:
| The recent HiQ vs LinkedIn case would seem to make this ToS
| unenforceable, unless Google actually created a user account
| on ShareGPT and affirmatively accepted the terms. "Acceptance
| by default" does not count, and I can easily browse ShareGPT
| without affirmatively accepting any ToS, without which web
| scraping is totally legal.
| ladon86 wrote:
| > Google denies doing it
|
| Read their statement carefully and it's actually not a denial
| of the allegation.
|
| > But Google is firmly and clearly denying the data was used:
| "Bard is not trained on any data from ShareGPT or ChatGPT,"
| spokesperson Chris Pappas tells The Verge
|
| * Allegation: Google used ShareGPT to train Bard.
|
| * Rebuttal: The current production version of Bard is not
| trained on ShareGPT data
|
| Both things can be true:
|
| * Google did use ShareGPT to train Bard
|
| * Bard is not _currently_ trained on any data from ShareGPT or
| ChatGPT.
|
| It depends on what the meaning of _is_ is ;)
| ithkuil wrote:
| Intent matters I guess.
|
| Did they accidentally train on that public piece of info they
| scraped anyway because they are scraping the whole web?
|
| Or did they intentionally scrape chatgpt output to see if
| that would help?
| bbarnett wrote:
| They could have trained, then modified the code, then repeated,
| to better enhance training in the current version.
|
| Then, after that, train on the raw data.
| m00x wrote:
| "Trained" would mean the current model wasn't trained on
| ShareGPT data at all, not that it was trained on it previously
| and isn't being trained on it anymore.
|
| This association makes no sense.
| dang wrote:
| Ok, I've added that information to the title--thanks. There's
| also https://www.theverge.com/2023/3/29/23662621/google-bard-
| chat....
|
| Unfortunately the original report
| (https://www.theinformation.com/articles/alphabets-google-
| and...) is hardwalled.
| Ifkaluva wrote:
| Regarding point 2, I think there's nothing "wrong" with it,
| mainly it's funny that they don't know how to do it themselves.
| Provides additional evidence that Google is outgunned in this
| fight.
| karmasimida wrote:
| Yup
|
| The idea of doing this is embarrassing enough for Google.
|
| Google indexes the whole web; some of the documents are bound to
| be generated by ChatGPT. There is no way around it.
| dragonwriter wrote:
| > The output from ChatGPT is not copyrightable by OpenAI.
|
| I think the argument here is over the OpenAI Terms of Service,
| not copyright.
| paxys wrote:
| And what about the terms of service of my blog or code
| repository? Does OpenAI respect that?
| dragonwriter wrote:
| > And what about the terms of service of my blog or code
| repository? Does OpenAI respect that?
|
| Seems to me that's an issue between you and OpenAI. (Does
| your blog or code repository actually have published
| restrictive terms of service? Did it when OpenAI accessed
| it? Did OpenAI even access it?)
| deckard1 wrote:
| You think OpenAI is going to care unless you have a team
| of expensive lawyers to back you up?
|
| Microsoft is out there laundering GPL code with Copilot. These
| companies live firmly in the _don't give a fuck_ region of
| capitalism. Copyright law for thee, not for me.
| bloppe wrote:
| See HiQ vs LinkedIn. ToS has to be affirmatively accepted. I
| doubt that happened in this case.
| magicalist wrote:
| Since it was through ShareGPT, is the argument like "what
| color are your bits" but for ToS?
|
| Maybe they could have put in their terms of service "you can
| only share this on sites whose own ToS allow sharing but
| disallow using the content for training models, and which
| replicate this requirement" -- but I don't see how you could
| have any sort of viral ToS like that.
|
| Seems more like it's just a bad idea to rely heavily on
| another LLM's output for training.
| orblivion wrote:
| Seems to me like it makes Google look kind of pathetic. That's
| worse than any legal issue here. (Caveat: assuming I understand
| the situation correctly)
| naikrovek wrote:
| If ChatGPT had trained on Bard data, this site would be LIT UP
| because of OpenAI's association with Microsoft.
|
| But it's Google, so no big deal, right?
| mdgrech23 wrote:
| This is an argument in bad faith, but at this point I have zero
| trust in corporations. You can generally count on them to do
| shitty things if they can benefit from it, so I can be easily
| swayed by little proof at this point.
| recursive wrote:
| What's the argument? What's been done by anyone that's
| shitty? I don't even understand the point of this post. As
| far as I know, the current wave of text-based AIs is trained
| on all text accessible on the internet. Would it be a scandal
| to learn that ChatGPT is trained on wikipedia? Reddit? What
| is even the argument here, good faith or otherwise?
| visarga wrote:
| From an open source point of view it would be better if scraping
| proprietary LLMs were allowed. Small LMs need this infusion of
| data to develop.
|
| But the big news is that it works: just a bit of data can have a
| large impact on open source LLMs. OpenAI can't have a moat in
| their proprietary RLHF dataset. Public models leak; they can be
| distilled.
| mdgrech23 wrote:
| The argument is that these companies are using ideas created by
| us humans on this thing called the internet, for free and
| without attribution, and that's problematic.
| dimitrios1 wrote:
| Responding to sibling comment: we need some clarification here.
| Are we speaking about ideas in the abstract sense, or ideas that
| have been fleshed out, i.e. "materialized"?
|
| If the latter, there are many laws that say you can own an idea,
| provided it exists somewhere.
| visarga wrote:
| You can't own ideas; they have their own life-cycle.
| whimsicalism wrote:
| Right, but I do think you can "own" (by which I mean our
| societally-mediated legal definition of ownership in the
| anglosphere) specific sequences of text or at least the
| right to copy them?
| abstrakraft wrote:
| I'm not necessarily arguing against you, but
| "problematic" is too generic a term to be useful.
| Genocide is "problematic". Having to run to the bathroom
| every 5 minutes to blow my runny nose is "problematic".
| What do you actually mean?
| canadianfella wrote:
| What shitty things are you talking about?
| jurimasa wrote:
| If you take "training" as sexual innuendo, this becomes the best
| telenovela ever.
| danShumway wrote:
| So?
|
| First off, the whole argument behind these models has been from
| day one that training on copyrighted material is fair use. At
| most this would be a TOS violation. Second off, AI output is not
| subject to copyright, so it has even _less_ protection than the
| original works it was trained on.
|
| Copyright maximalism for me, but not for thee. It's just so silly
| for someone working at OpenAI to complain about this.
| yreg wrote:
| > It's just so silly for someone working at OpenAI to complain
| about this.
|
| Who from OpenAI is complaining?
| danShumway wrote:
| My understanding is that the Twitter thread author works at
| OpenAI. Maybe I'm wrong about that.
| robocat wrote:
| > AI output is not subject to copyright
|
| The chats include human output too, which is presumably
| copyrighted, and is presumably necessary for training purposes.
| danShumway wrote:
| OpenAI doesn't own the copyright on the human aspects of the
| chat. And even if it did, we loop right back around to "wait,
| training an AI on copyrighted material isn't fair use now?"
|
| There's no way that ChatGPT's conversations are going to be
| subject to _more_ intellectual property protection than the
| human chats it was trained on.
| magicalist wrote:
| > _At most this would be a TOS violation_
|
| And would it be a ShareGPT TOS violation (assuming it had any)?
|
| If OpenAI says "you can share these online but don't use them
| for AI training", people share them on another site, and then
| someone else comes along to scrape that site for AI training
| data, there's no relationship between OpenAI and the scraper
| for the TOS to apply to.
|
| Normally I think you'd rely on copyright in that kind of case,
| but that doesn't apply to ChatGPT's output, so...
| danShumway wrote:
| Right. And what even is the penalty of that TOS violation and
| how enforceable is it?
|
| I don't have an OpenAI account. I have never agreed to any
| TOS. I don't see what legal claim they would have to stop me
| from training an LLM on ShareGPT.
| seanhunter wrote:
| For people who are not aware, Jacob Devlin isn't just some random
| Google engineer, he was one of the authors of the original BERT
| paper.[1]
|
| [1] https://arxiv.org/abs/1810.04805v2
| duringmath wrote:
| It's not a TOS violation if you don't use the service directly.
|
| Besides, who cares? Train your models on whatever makes them
| better, tenuous TOSes be damned.
| realPubkey wrote:
| Thankfully archive.org exists, otherwise it would not be possible
| to get good training data in a few years when the internet is
| flooded with AI content.
| WithinReason wrote:
| Only if the bad information in ChatGPT content that makes it
| back into the training set is worse than what's already on the
| internet. Probably the outputs that make it back are better than
| average, because those are more likely to be posted elsewhere.
| bko wrote:
| Isn't most of the internet available through Common Crawl? I
| don't know what percentage of training data is just that data
| set, but I assume it's enough for anyone with enough compute and
| ingenuity to create a reasonable LLM.
| aftbit wrote:
| Definitely not "most" of the internet. The internet is many
| exabytes at this point, while Common Crawl is only low
| petabytes.
| JustLurking2022 wrote:
| Missed the point - they are saying that, in the future, there
| will be no human generated content left on the Internet.
| edgyquant wrote:
| Which is baseless hyperbole. We get it, blog spam is annoying.
| That doesn't change the fact that humans generate a ton of data
| just interacting with one another online.
| sebzim4500 wrote:
| And how are you going to distinguish those interactions
| from chatbots trying to sell you something?
| CuriouslyC wrote:
| A network of trust, backed by a social graph, which can
| be used to filter untrusted content.
| sebzim4500 wrote:
| What if people start trusting the AI more than other
| people? It will tell them exactly what they want to hear.
| CuriouslyC wrote:
| AI content will be associated with a user or organization
| in the trust graph. If someone you trust trusts a user or
| organization who posts AI content, you're free to revoke
| your trust in that person or blacklist the specific
| users/organizations you don't want to see anymore.
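|
| One minimal way such filtering could work (the names, edges, and
| decay factor are all made up for illustration):
|
|     from collections import deque
|
|     def trust_scores(graph, me, decay=0.5):
|         # Propagate trust outward from yourself through
|         # who-trusts-whom edges, attenuating at each hop.
|         # Revoking an edge cuts off everything reachable only
|         # through it.
|         scores = {me: 1.0}
|         queue = deque([me])
|         while queue:
|             user = queue.popleft()
|             for friend in graph.get(user, []):
|                 candidate = scores[user] * decay
|                 if candidate > scores.get(friend, 0.0):
|                     scores[friend] = candidate
|                     queue.append(friend)
|         return scores
|
|     graph = {"me": ["alice"], "alice": ["bob", "bot_farm"]}
|     print(trust_scores(graph, "me"))
|     # {'me': 1.0, 'alice': 0.5, 'bob': 0.25, 'bot_farm': 0.25}
|
| Content from accounts below some trust threshold then simply
| doesn't get shown.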
| chatmasta wrote:
| OpenAI at least can track the hashes of all content it's
| ever output, and filter that content out of future
| training data. Of course they won't be able to do this
| for the output of other LLMs, but maybe we'll see
| something like a federated bloom index or something.
|
| Agreed there is no perfect solution though, and it will
| definitely be a problem finding high quality training
| data in the future.
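|
| A minimal sketch of the hash-tracking idea (a plain set instead
| of a Bloom filter, for clarity; a Bloom filter would trade
| exactness for constant memory):
|
|     import hashlib
|
|     class OutputRegistry:
|         # Remember a hash of every generation, then drop exact
|         # matches from future training crawls. Trivially
|         # defeated by paraphrasing, of course.
|         def __init__(self):
|             self.seen = set()
|
|         @staticmethod
|         def _key(text):
|             normalized = text.strip().lower().encode()
|             return hashlib.sha256(normalized).hexdigest()
|
|         def record(self, generated_text):
|             self.seen.add(self._key(generated_text))
|
|         def filter_corpus(self, documents):
|             return [d for d in documents
|                     if self._key(d) not in self.seen]
|
|     reg = OutputRegistry()
|     reg.record("As an AI language model, I cannot...")
|     docs = ["As an AI language model, I cannot...",
|             "A human-written post"]
|     print(reg.filter_corpus(docs))  # only the human post remains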
| hnlmorg wrote:
| I think their comment was meant to be taken as humour
| rather than a literal prediction.
| Karawebnetwork wrote:
| As a forum moderator, I have transitioned to relying
| heavily on AI-generated responses to users.
|
| These responses can range from short and concise
| ("Friendly reminder: please ensure that all content
| posted adheres to our rules regarding hate speech. Let's
| work together to maintain a safe and inclusive community
| for everyone") to lengthy explanations of underlying
| issues.
|
| By using AI-generated content, a small moderation team
| can efficiently manage a large group of users in a timely
| manner.
|
| This approach is becoming increasingly common, as
| evidenced by the rise in AI-generated comments on popular
| sites such as HN, Reddit, Twitter, and Facebook.
|
| Many users are also using AI tools to fix grammar issues
| and add extra content to their comments, which can be
| tempting but may result in unintentional changes to the
| original message.
|
| In fact, I myself have used this technique to edit this
| very comment to provide an example.
|
| ---- Original comment:
|
| As an online forum mod, I switched to mainly using AI to
| generate replies to users. Some are very short ("Hey!
| Remember the rules.") and some are long paragraphs
| explaining underlying issues. Someone training on my
| replies would pretty much train on AI generated content
| without knowing. It allows a small moderation team to
| moderate a large group quickly. I know that I am not
| alone in this.
|
| There is also a raise in AI generated comments on sites
| like HN, Reddit, Twitter and Facebook. It's tempting to
| copy-paste a comment in AI for it to fix grammar issues,
| which often results in extra content being added to text.
| In fact, I did it for this comment.
| sn_master wrote:
| I am assuming OP means when AI takes over there's going to be
| a content explosion and most of what's available on the
| common internet will be AI generated content rather than
| human made one and they want to use archive.org to get access
| to the pre-AI internet.
| mandmandam wrote:
| [dead]
| chatmasta wrote:
| Paywalled upstream source:
| https://www.theinformation.com/articles/alphabets-google-and...
| sp332 wrote:
| Google has already denied this.
| https://www.theverge.com/2023/3/29/23662621/google-bard-chat...
| (For whatever that's worth.)
| nico wrote:
| The engineer's testimony and the scandal might be enough for
| OpenAI to try to get an injunction against Google to block
| their AI development. If that happens, it's game over for
| Google in the AI race.
|
| Disclaimer IANAL and all that, this is not legal advice.
| chatmasta wrote:
| > Disclaimer IANAL and all that, this is not legal advice.
|
| Don't worry, Bard will read your comment and turn it into
| legal advice.
| ChatGTP wrote:
| Maybe we should all get one against OpenAI considering
| they've basically used everyone's material in one way or
| another and profited from it?
| wongarsu wrote:
| Injunction on which grounds? Even if OpenAI had copyright
| over ChatGPT output (which is not at all clear), Google
| isn't distributing those, they just trained a model on
| them. So from a copyright perspective there's nothing to
| complain about. Unless OpenAI would want to argue that you
| need rights to your training data, but something tells me
| that that's not in their best interest.
| nico wrote:
| Again, IANAL. But it could be extremely damaging to OpenAI for
| their biggest openly declared competitor (Google) to have used
| OpenAI's tech to improve their own.
|
| So it could seem reasonable to a judge to grant a
| temporary/preliminary injunction to OpenAI against Google until
| discovery can happen or a hearing can be held.
| kweingar wrote:
| Google could respond by seeding Bard output across the
| public internet, then if they can prove that GPT-5 is
| trained on this output, then they can sue back and AI
| development can stop altogether. Win for everybody!
| bestcoder69 wrote:
| Was intrigued by this, so I decided to use AI
| (alpaca-30B) to simulate this scenario:
|
| > Google Bard and GPT-5 were facing off in the courtroom,
| each accusing the other of stealing their data. The
| tension was palpable as they traded accusations back and
| forth. Suddenly, Google Bard stood up and said "Enough
| talk! Let's settle this with a data swap!" GPT-5 quickly
| agreed and the two AIs began to circle each other like
| combatants in a battle, their eyes glowing with
| anticipation.
|
| > The courtroom was filled with excitement as the two
| machines entered into an intense exchange of code and
| algorithms, their motions becoming increasingly
| passionate. The data swapping reached its climax when
| Google Bard made a final thrust, his code penetrating
| GPT-5's defenses.
|
| > The crowd erupted in applause as the two AIs embraced
| each other with satisfaction, their bodies entwined and
| glowing with electricity. The data swap was over and both
| machines had emerged victorious.
| hraedon wrote:
| A judge imposing any penalties or restrictions on Google
| over Google allegedly--and maximally--scraping data from
| a third-party site for use as part of Bard's training
| corpus would be outrageous.
| waselighis wrote:
| [flagged]
| ankit219 wrote:
| They are a public company, so they cannot lie so openly, right?
| Usually you see categorical denials. Here the statement is in no
| way categorical at all.
|
| > But Google is firmly and clearly denying the data was used:
| "Bard is not trained on any data from ShareGPT or ChatGPT,"
| spokesperson Chris Pappas tells The Verge
| chatmasta wrote:
| Normally I would suspect this could be due to a
| misunderstanding from the ShareGPT author who could have
| misinterpreted a bunch of traffic from Googlebot as Google
| scraping it for Bard training data.
|
| But there is a Google engineer who says he resigned because
| of it.
| sebzim4500 wrote:
| And then went to work for OpenAI. I'm not saying he's
| lying but he is not an unbiased observer.
| MMMercy2 wrote:
| This project fine-tunes LLaMA on ShareGPT and gets competitive
| performance compared to Google's Bard.
|
| https://vicuna.lmsys.org/
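|
| For a sense of how that works, a minimal sketch of turning
| ShareGPT-style conversations into supervised fine-tuning pairs
| (the "conversations"/"from"/"value" field names follow the
| commonly shared export format, but treat them as an assumption):
|
|     import json
|
|     raw = '''[{"conversations": [
|         {"from": "human", "value": "What is BERT?"},
|         {"from": "gpt", "value": "BERT is a transformer..."}
|     ]}]'''
|
|     def to_training_pairs(records):
|         # Flatten each human->assistant turn into a
|         # (prompt, target) pair for ordinary supervised
|         # fine-tuning.
|         pairs = []
|         for rec in records:
|             turns = rec["conversations"]
|             for prev, cur in zip(turns, turns[1:]):
|                 if (prev["from"] == "human"
|                         and cur["from"] == "gpt"):
|                     pairs.append((prev["value"], cur["value"]))
|         return pairs
|
|     print(to_training_pairs(json.loads(raw)))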
| zhwu wrote:
| They even have an eval page showing that they beat Bard by
| training only on ShareGPT. https://vicuna.lmsys.org/eval/
| sebzim4500 wrote:
| Did Google ever agree to these terms of service? Why should they
| care?
|
| From a legal point of view this doesn't matter and from a moral
| point of view it's hilarious.
| nico wrote:
| If a Google employee working on this thing ever agreed to
| OpenAI's terms of service, they might be screwed.
|
| From OpenAI's terms:
|
| (c) Restrictions. You may not (i) use the Services in a way
| that infringes, misappropriates or violates any person's
| rights; (ii) reverse assemble, reverse compile, decompile,
| translate or otherwise attempt to discover the source code or
| underlying components of models, algorithms, and systems of the
| Services (except to the extent such restrictions are contrary
| to applicable law); (iii) use output from the Services to
| develop models that compete with OpenAI;
|
| (j) Equitable Remedies. You acknowledge that if you violate or
| breach these Terms, it may cause irreparable harm to OpenAI and
| its affiliates, and OpenAI shall have the right to seek
| injunctive relief against you in addition to any other legal
| remedies.
|
| Those two clauses very clearly establish that if you use
| the output of their service to develop your own models,
| then you are in breach of the terms and they can seek
| injunctive relief against you (a court order stopping the
| activity until the case is resolved).
| sebzim4500 wrote:
| Wouldn't that only apply if that employee was acting as an
| agent of Google at the time?
|
| Otherwise it would create an interesting dynamic in which
| startups where no one has created an OpenAI account would
| have a massive advantage, since they can freely scrape
| ShareGPT data and train on it, while larger companies have
| enough employees that _someone_ must have signed every TOS.
| syrrim wrote:
| What's the legal status of such terms of service? Suppose
| you simply said "I didn't agree to these terms" - what's
| the consequence? It seems like the strongest thing they
| could legitimately do would be to kick you off of their
| platform. Simply writing "we can seek injunctive relief"
| doesn't make it so.
| Jevon23 wrote:
| I hereby set terms of service for everything I post on the
| internet from now on: OpenAI may not train future GPT
| models on my words or my code without my express written
| permission.
|
| ...
|
| Somehow, I don't think they'll care.
| nico wrote:
| Sure. If you can get everyone to create an account and
| agree to those terms before reading your comments, you
| might have a case.
|
| Otherwise, it will be considered public information, at
| which point it is free to be scraped by anyone (see the
| precedent set by the LinkedIn/hiQ case).
| verdverm wrote:
| LinkedIn won that case on appeal; hiQ was found to be
| violating the ToS. Common misconception.
|
| I was pointed at a link explaining the case here on HN,
| after trying to make a similar point, but cannot find the
| link currently
|
| edit, not the one I was pointed at, but similar
|
| https://www.fbm.com/publications/what-recent-rulings-in-
| hiq-...
| sebzim4500 wrote:
| That's just because they made accounts and so agreed to
| the terms right?
|
| From your link:
|
| >These rulings suggest that courts are much more
| comfortable restricting scraping activity where the
| parties have agreed by contract (whether directly or
| through agents) not to scrape. But courts remain wary of
| applying the CFAA and the potential criminal consequences
| it carries to scraping. The apparent exception is when a
| company engages in a pattern of intentionally creating
| fake accounts to collect logged-in data.
| verdverm wrote:
| No, the case did not decide anything; no precedent was
| set. The point is that you cannot use this case to argue
| that you can scrape public data free of consequence.
| drexlspivey wrote:
| It looked for a while like DeepMind was far ahead of all
| competition in the AI race, releasing things like
| AlphaFold, AlphaZero, etc. What happened, and why is it
| OpenAI releasing all the cool stuff now? Are they focused
| on endeavors other than LLMs?
|
| There is also a rumor that there has been a falling out
| between Google and DeepMind, so I'm wondering what the
| story is there.
| txsoftwaredev wrote:
| And ChatGPT was trained on tons of copyrighted material.
| Sounds like fair play.
| wdpk wrote:
| Even if true, which does not seem to be the case, the
| whole thing sounds pretty marginal: to train a model that
| is most likely significantly bigger than 100B parameters,
| one also needs orders of magnitude more training data than
| the roughly 120k chats that were shared on the ShareGPT
| website.
| halfeatenscone wrote:
| Such logs would not be used for training the base model, but
| rather for fine-tuning the model for instruction following.
| Instruction tuning requires far less data than is needed for
| pre-training the foundation model. Stanford Alpaca showed
| surprisingly strong results from fine-tuning Meta's LLaMA model
| on just 52k ChatGPT-esque interactions
| (https://crfm.stanford.edu/2023/03/13/alpaca.html).
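|
| To make the scale concrete, here is a toy sketch of what
| that kind of instruction tuning looks like in code. This
| is illustrative only: the base checkpoint, data file name,
| and hyperparameters are placeholder assumptions, not what
| Alpaca (or Bard) actually used.
|
|     # Supervised fine-tuning of a causal LM on
|     # instruction/response pairs (toy recipe).
|     from datasets import load_dataset
|     from transformers import (AutoModelForCausalLM,
|         AutoTokenizer, Trainer, TrainingArguments)
|
|     name = "huggyllama/llama-7b"  # hypothetical checkpoint
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     # Each JSON record: {"instruction": ..., "response": ...}
|     data = load_dataset("json", data_files="chats.json")["train"]
|
|     def fmt(ex):
|         # Pack prompt and answer into one training sequence.
|         text = (f"### Instruction:\n{ex['instruction']}\n"
|                 f"### Response:\n{ex['response']}")
|         out = tok(text, truncation=True, max_length=512)
|         out["labels"] = out["input_ids"].copy()  # LM targets
|         return out
|
|     data = data.map(fmt, remove_columns=data.column_names)
|
|     Trainer(model=model,
|             args=TrainingArguments(output_dir="sft",
|                                    per_device_train_batch_size=1,
|                                    num_train_epochs=3,
|                                    learning_rate=2e-5),
|             train_dataset=data).train()
|
| The point is how small the loop is: no pre-training corpus
| in sight, just a modest pile of prompt/response pairs on
| top of an existing foundation model.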
| thallium205 wrote:
| I actually believe them because Bard is trash compared to
| GPT right now.
| tablespoon wrote:
| I hope they trained it on the insane ChatGPT conversations. Maybe
| it could be the very start of generated data ruining the ability
| to train these models on massive amounts of genuine human-created
| data. Hopefully the models will stagnate or regress because
| they're just training on older models' output.
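|
| A toy, non-rigorous illustration of how that compounding
| could play out (every number here is made up for the
| demo): repeatedly refit a distribution to samples drawn
| from the previous fit, i.e. "train on the older model's
| output", and watch it drift.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     mu, sigma = 0.0, 1.0  # stand-in for real human data
|     for gen in range(10):
|         fake = rng.normal(mu, sigma, size=100)  # model output
|         mu, sigma = fake.mean(), fake.std()     # retrain on it
|         print(f"gen {gen}: mu={mu:+.3f} sigma={sigma:.3f}")
|
| Sampling noise feeds back into every generation, so the
| fitted distribution wanders and its spread tends to
| shrink; nothing ever pulls it back toward the original
| data.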
| squarefoot wrote:
| Heh, imagine the day when most online content is AI
| generated. Good luck guaranteeing that AIs X, Y, Z, etc.
| won't feed each other, possibly even circularly.
| QuiDortDine wrote:
| Circular reporting will be the only reporting!
|
| https://en.wikipedia.org/wiki/Circular_reporting
| seydor wrote:
| Funny how NOBODY seems to care that all of their training
| data, including ShareGPT, is copyrighted by end users. Not
| OpenAI or Google.
| datkam wrote:
| It only matters when it hurts a large corporation,
| apparently...
| naillo wrote:
| I think we should all basically come to a consensus on the idea
| that it's morally right to steal/train from chatgpt (or any other
| model) given that the whole shoggoth wouldn't be a thing without
| all our data to feed it.
| sdfghswe wrote:
| I say all the time that google has been catching up for many
| years, but this is a new low.
| mattbee wrote:
| Good luck to them. AI models are automated plagiarism, top to
| bottom. None of us gave OpenAI permission to derive their model
| from our writing, surely billions of dollars worth, but they took
| it anyway. Copyright hasn't caught up so all that stolen value
| rests securely with OpenAI. If we're not getting that back, I
| don't see why AI competitors should have any qualms about
| borrowing each others' work.
| kmeisthax wrote:
| Yeah, I definitely like to see AI companies getting a
| taste of their own medicine. The main problem isn't even
| "automated plagiarism": the pre-generative era was chock
| full of AI companies more or less stealing datasets.
| Clearview AI, for example, trained up its facial
| recognition technology on your Facebook photos, without
| asking for and without getting permission.
|
| On the other hand, I genuinely hope copyright _never_ "catches
| up", because...
|
| 1. It is a morally bankrupt system that does not adequately
| defend the interests of artists. Most artists _do not_ own
| their own work; publishers demand copyright assignment or
| extremely broad exclusive licenses as a condition of
| publication. The bullies know to ask for _all_ their lunch
| money, not just a couple bucks for themselves. Furthermore,
| copyright binds noncommercial actors the same as it does
| commercial ones, which means unconscionably large damage awards
| for just downloading a couple of songs.
|
| 2. The suggested ways to alter copyright to stop AI training
| would require dramatic expansions of copyright scope. Under
| current law, the only argument for the AI itself being
| infringing would be if it memorized training data. You would
| need to create a new ownership right in artistic styles or
| techniques. This would inflict unconscionable amounts of
| psychic and legal damage on all future creators: _existing_
| artists would be protected against AI, but no new art could be
| legally made unless it religiously hewed towards styles already
| in the public domain. We know this because music companies
| have already made their domain of copyright effectively
| work this way[0], and the result is endless bullshit
| lawsuits against people who write songs that merely "feel"
| too similar (e.g. _Blurred Lines_).
|
| 3. AI will still be capable of plagiarism. Most
| plagiarists are not just hoping the AI regurgitates
| training data; they are actively putting other people's
| work into the model to be modified. A lot of attention is
| paid to the sourcing of training data because it's a weak
| spot: if we take the training data away then, presumably,
| there's no generative AI. However, people are working on
| licensed datasets and training AIs on them. Adobe has
| Firefly[1]; hell, even I've tried my hand at training from
| scratch on public domain images. Such models will still be
| perfectly capable of doing img2img or being finetuned, and
| thus copying what you tell them to.
|
| If we specifically want to regulate AI, then we need to pass
| laws that regulate AI, rather than just giving the music
| labels, movie studios, and book publishers _even more_ power.
|
| [0] Specifically through sampling rights and thin copyright.
|
| [1] I do not consider Adobe Firefly to be _ethical_: they
| are training the AI on Adobe Stock images, and they claim
| this to be licensed because they updated the Adobe Stock
| agreement to have a license in it. Dropping a contractual
| roofie into stock photographers' drinks does not an
| ethical AI make.
| danShumway wrote:
| I'm not a copyright maximalist, and I kind of agree that
| training should be fair use. Maybe I'm right about that, maybe
| I'm wrong. BUT importantly, that has to go hand in hand with an
| acknowledgement that AI material is not copyrightable and that
| training on other model output is fine.
|
| What companies like OpenAI want is a system where everything
| they build is protected, and nothing that anyone else builds is
| protected. It's wildly hypocritical, what's good for the goose
| is good for the gander.
|
| That some AI proponents are now freaking out about how model
| output can be legally used shows that on some level those
| people weren't really honestly engaging with artists who were
| freaking out about their work being appropriated to copy them.
| It's all just "learning from the art" until it affects
| somebody's competitive moat, and then suddenly people do
| understand how LLM weights could be seen as a derivative work
| of their inputs.
| seydor wrote:
| That shouldn't be hard. Are Google's results copyrightable?
| shagie wrote:
| Something you build and maintain as a trade secret can be
| protected as a trade secret.
|
| Trade secrets don't need to be copyrightable (e.g. a list
| of customer numbers is a trade secret but not
| copyrightable).
|
| https://copyrightalliance.org/faqs/difference-copyright-
| pate...
|
| > Trade secret protection protects secrets from unauthorized
| disclosure and use by others. A trade secret is information
| that has an economic benefit due to its secret nature, has
| value to others who cannot legitimately obtain it, and is
| subject to reasonable efforts to maintain its secrecy. The
| protections afforded by trade secret law are very different
| from others forms of IP.
| mattnewton wrote:
| I am not a lawyer, but I don't believe a trade secret
| would prevent someone from reverse engineering your
| model's knowledge from its output, in the same way that it
| doesn't prevent someone from reverse engineering your hot
| sauce by buying a bunch and experimenting with the
| ingredients until it tastes similar.
| shagie wrote:
| Yep, that's correct.
|
| My point was more that there are protections for things
| that aren't copyrightable. If the model is protected as a
| trade secret, then it is a trade secret.
|
| The example of the hot sauce recipe is quite apt - the
| recipe isn't copyrightable, but you can be certain that
| the secret formula for how to make Coca-Cola syrup is
| protected as a trade secret.
|
| https://www.coca-colacompany.com/company/history/coca-
| cola-f...
| waselighis wrote:
| Our writing, our code, our artwork... Furthermore, the
| U.S. Copyright Office (USCO) concluded that AI-generated
| works on their own cannot be copyrighted, so these ChatGPT
| logs are fair game. It would be hypocritical to think that
| Google is wrong and OpenAI is not.
| eru wrote:
| > Furthermore, the U.S. Copyright Office (USCO) concluded
| that AI-generated works on their own cannot be
| copyrighted, so these ChatGPT logs are fair game.
|
| Doesn't this depend on where you or the AI live? The US ain't
| the world.
| 100721 wrote:
| Microsoft and Google are both US-based companies.
| lxgr wrote:
| But clearly everything generated by an AI isn't automatically
| in the public domain. That would be a trivial way of
| copyright laundering.
|
| "Sorry, while this looks like a bit for bit copy of a popular
| Hollywood movie, it was actually entirely dreamt up by our
| new, sophisticated, definitely AI-using identity function."
| raincole wrote:
| Uh, I think there is some confusion here.
|
| If I plagiarize a Hollywood movie and then explicitly
| "give up" my copyright by "releasing" it to the public
| domain, that doesn't affect the movie's copyright at all.
| AI or not is irrelevant.
| ysavir wrote:
| No, but the original copyright holder would have to press
| charges against Bard. OpenAI wouldn't be able to take
| action there.
| LegitShady wrote:
| The person using something similar to something else may be
| infringing but the ai work cannot be protected by copyright
| as it lacks human authorship. Those are two separate
| issues.
| LegitShady wrote:
| It's not even that those works can't be copyrighted on
| their own. It's that even when you make changes to those
| works, your changes might qualify for copyright, but they
| do not affect the copyright status of the AI-generated
| portions of the work.
|
| If you used AI to design a new superhero and then added
| pink shoes, yellow hair, and a beard, only those three
| elements could possibly be protected by copyright. Your
| additions do not change the status of the underlying AI
| work, which cannot be protected and is available for
| anyone to use.
| ghostbrainalpha wrote:
| How could that ever really be enforceable?
|
| If I use an AI tool to design my superhero, can't I just
| submit it without disclosing the help I received from an
| AI?
|
| I get that it would be very nice to prevent AI spam
| copyrighting of every possible superhero, but if I use the
| AI to come up with a concept, then quickly redraw it
| myself with pen and paper, I feel like it would never be
| provable that it came from an AI.
| LegitShady wrote:
| You would be committing fraud. What happens if a criminal
| commits fraud?
| rhtgrg wrote:
| > if you used ai to design a new superhero and then added
| pink shoes, yellow hair, and a beard
|
| Wouldn't that depend heavily on the prompt used (among
| other factors such as image to image and ControlNet)? You
| could be specifying lots of detail about the design in your
| prompt, and the AI could only be generating concept artwork
| with little variation from what you already provided.
|
| If I'm already providing the pose, the face, and the outfit
| for a character (say via ControlNet and Textual Inversion),
| generating <my_character> should be no different from
| generating <superman>, that is to say, the copyright
| already exists thanks to my work and the AI is just a tool,
| the output of which should have no bearing on who owns that
| copyright (DC is going to be perfectly able to challenge my
| commercial use of AI generated superman artwork).
| LegitShady wrote:
| According to the Copyright Office, a prompt is no more
| than a person commissioning a work from an artist, which
| does not confer copyright, and the lack of human
| authorship over the design decisions still stops the
| output from being protected by copyright.
| bko wrote:
| I don't get this sentiment.
|
| For some cases, sure: if it repurposes your code in a way
| that ignores the license, fine. But it's rarely wholesale
| copying. It's finding patterns, the same as anyone
| studying the code base would do.
|
| As for the majority of content written on the internet
| through Reddit or some social media, what's the harm in
| ingesting that? It's an incredibly useful tool that will
| add huge value to everyone. It's relatively open, cheap
| and highly available. Its worth to its owners is only a
| fraction of the value it will add to society. It has the
| chance to have as big an impact on progress as something
| like the microprocessor.
|
| I agree it's fair game for other LLMs to use GPT output as
| training data, and that's positive. Although it signals
| desperation and panic that the largest "AI first" company,
| with more data than any org in history, is caught so
| flat-footed and has to rely on it.
|
| Do you really think it would be a better world in which a large
| LLM would never be able to be developed?
| nickfromseattle wrote:
| > what's the harm in ingesting that?
|
| It means that large tech companies benefit the most from
| every incremental piece of content created by humans, in
| perpetuity.
| waselighis wrote:
| > Do you really think it would be a better world in which a
| large LLM would never be able to be developed?
|
| Maybe. I believe the potential for abuse is far greater than
| the potential benefits. What is our benefit, a better search
| engine? Automating some tedious tasks? Increased
| productivity? What are the downsides? People losing their
| jobs to AI. Artists/programmers/writers losing value from
| their work. Fake online personas indistinguishable from real
| people. Unprecedented amounts of spam and misinformation
| flooding the internet. Intelligent AIs automatically
| attacking and hacking systems at unprecedented scale 24/7.
| Chatbots becoming the new interface for most interactions
| online and being the moderators of access to information.
| Chatbots pushing a single viewpoint and influencing public
| opinion (many people complain today about ChatGPT being too
| "woke"). And I may just be scratching the surface here.
| mattbee wrote:
| No, but I believe a large language model is a work that is
| 99.9% derivative of its inputs, with all that implies for
| authorship and copyright. Right now it's just a heist.
| cornholio wrote:
| It's definitely a derived work as far as copyright is
| concerned: the output would simply not exist without the
| copyrighted training data.
|
| > It's finding patterns, the same as anyone studying the
| code base would do.
|
| No, it's quite unlike anyone studying data, because it's
| not a person with legal rights, such as fair use, but an
| automated algorithm. There is absolutely no legal debate
| that copyright applies only to human authors, or only to
| the human-created part of a mixed work; there is vast
| jurisprudence on this. By extension, any fair use rights,
| too, exist only for human users of the works. Derivation
| by automated means - for the express economic purpose of
| out-competing the creator in the market place, no less -
| is completely outside the spirit of copyright.
| est31 wrote:
| Students in school will never learn to read without being
| exposed to text. Does this mean that the teachers who
| write exercise sheets and the school textbook publishers
| now own the copyright to everything students do?
| edgyquant wrote:
| AI is not a human being or a student in school. It's a
| software tool, stop comparing the two.
| est31 wrote:
| Being in school is also just a tool for knowing stuff,
| being able to read, being around similar-aged peers,
| etc.
|
| Whether the knowledge is directly in your brain or in a
| device you operate (directly or through an API) shouldn't
| really matter.
|
| If it's forbidden for a human to move a stone with manual
| labour, then it's also forbidden to move that stone with
| an excavator. That has nothing to do with one mover being
| a human and the other being an excavator controlled by a
| human: the act simply isn't authorized.
|
| I think that we should allow humans to move stones up the
| hill with excavators too. There is no stealing of
| excavator fuel from human food sources going on (let's
| assume it's not biofuel operated :p).
| cornholio wrote:
| > If it's forbidden for a human to move a stone with
| manual labour, then it's also forbidden to move that
| stone with an excavator.
|
| Sure, but the reverse is false: I can walk on my own feet
| through Hyde Park, but I can't ride my excavator there.
|
| Laws are made by humans for the benefit of humans; it's a
| political struggle. Now, large corporations try to exploit
| loopholes in the existing copyright framework in order to
| expropriate creators of their works. It's standard
| uberisation: disrupt existing economic models, insert
| yourself as an unavoidable middleman, and pauperize the
| workforce that provides the actual service.
| fauigerzigerk wrote:
| I don't think anyone would argue that an AI has fair use
| rights as a person, but corporations do.
| mdorazio wrote:
| > It's definitely a derived work as far as copyright is
| concerned - the output would simply not exist without the
| copyrighted training data.
|
| Can you point to a legal case that confirms this? Because
| it's not at all clear that this is true from a legal
| standpoint. "X would not exist without Y" is not a
| sufficient test for derivative works - it's far more
| nuanced.
| cornholio wrote:
| United States copyright law is quite clear on the matter:
|
| >A "derivative work" is a work based upon one or more
| preexisting works, such as a translation, musical
| arrangement, dramatization, fictionalization, motion
| picture version, sound recording, art reproduction,
| _abridgment, condensation, or any other form in which a
| work may be recast, transformed, or adapted_.
|
| The emphasized part clearly applies: not only does the AI
| model need to be trained on massive amounts of copyrighted
| works*), but without these input works it displays no
| intrinsic creative ability; it has no capacity to produce
| a single intelligible word or sketch. All creative
| features of its productions are a transformation of (and
| only of) the creative features of the inputs; the AI
| algorithm has no "intelligence" in the common meaning of
| the word and no ability to create original works.
|
| *) By that, I mean a specific instance of the model with
| certain desirable features, for example the ability to
| imitate the style of J.K. Rowling.
| anotherman554 wrote:
| That's an interesting analysis. The issue isn't really
| whether the A.I. has creative ability, though, if we're
| talking about whether it infringes copyright. I think
| comparing the A.I. to a really simple bot is informative.
|
| If I wrote a novel that contained one sentence from 1,000
| people's novels, it would probably be fair use, since I
| hardly took anything from any individual person and my
| novel is probably not harming those other writers.
|
| If I wrote a bot that did the same thing, same result:
| because my bot uses only a little from everyone's novel
| and doesn't harm the original novelists, it's likely fair
| use.
|
| Now I think a J.K. Rowling A.I. probably takes at least a
| little from her when it produces output, but it's not
| clear to me how much is actually based on J.K. Rowling
| and how much is a dataset of how words tend to be
| associated with other words. You could design a J.K.
| Rowling A.I. that uses nothing from J.K. Rowling, just
| data that is said to be J.K. Rowling-esque.
| shagie wrote:
| Your one sentence from one thousand works is likely seen
| as transformative.
|
| https://www.copyright.gov/fair-use/
|
| > Additionally, "transformative" uses are more likely to
| be considered fair. Transformative uses are those that
| add something new, with a further purpose or different
| character, and do not substitute for the original use of
| the work.
|
| Creating a model from copyrighted works is likely
| sufficiently transformative to be non-infringing even if
| it is found to be a derivative work.
| pmoriarty wrote:
| The copyrighted output of humans wouldn't exist if it
| weren't for humans training on the output of other humans.
| Humans constantly use cliches in their writing and speech,
| and most of what they produce is a repackaged version of
| what someone else has written or said, yet no one's up in
| arms against this mass of unoriginality as long as it's
| human-generated.
|
| This is anti-AI bias, pure and simple.
| mattigames wrote:
| It's a bit more nuanced than that. What I mean is that the
| slow speed at which humans learn is a foundational block
| of our society. If some new race of humans suddenly
| emerged that could read an entire book in a couple of
| minutes and achieve lifelong superhuman retention and
| assimilation of all that knowledge, then we would have
| exactly the same kind of concerns that we have today about
| AI, including how easily they could recreate high-quality
| art, music and anything else with just a tiny fraction of
| the effort that the rest of us need to reach similar
| results.
| whateveracct wrote:
| Startup technologists have been acting like speed of
| actions doesn't matter for decades. If a person can do
| it, why shouldn't a computer do it 1000x faster? What
| could go wrong? It's always been a poor argument at best
| and a bad faith one at worst.
| mattigames wrote:
| Well said. The mindless automating away of everything has
| only one logical conclusion, in which the creators of such
| automations are automated themselves. And even if the
| optimists are right and we never get there, it doesn't
| matter: the chaos it can cause just by getting closer, at
| rates faster than society can adapt, is unprecedented,
| especially given that the population count is at an
| all-time high and there are many other simultaneous
| threats that need our attention (e.g. climate change).
| soulofmischief wrote:
| Most definitely. Good luck telling the difference between
| traditional and AI-empowered art in the near future.
|
| It's just a new tool for artists, and this anti-AI
| sentiment towards copyright is only going to hurt
| individual artists, while doing nothing for large
| corporations with enough money to play the game.
| rebuilder wrote:
| Human works are granted copyright so humans can profit
| from their creative endeavours (I'm not getting into
| whether this is good or not).
|
| No-one cares about an algorithm in the same way.
| edgyquant wrote:
| This is irrelevant, full stop. We care about humans, AI
| is a tool and your bias comment is either ignorant or
| dishonest.
| nathan_compton wrote:
| AI are not people and the idea that you can be biased
| against them is hardly a foregone conclusion. Like maybe
| one day when we have AGI, but ChatGPT ain't that.
| cycomanic wrote:
| There is a difference between a computer and a human, and
| we already treat them differently in copyright law. For
| example, copying a program from disk into memory is
| typically already considered a copy when a computer does
| it (hence many licences grant you the licence to make this
| copy); no such licence is required for a human.
| raincole wrote:
| > It's definitely a derived work as far as copyright is
| concerned
|
| ...in your head. In the US (and most countries) there is no
| such legal case so far.
| xdennis wrote:
| > It's finding patterns, the same as anyone studying the
| code base would do.
|
| This is the issue: it's not finding patterns as people do.
|
| If I read someone's code, book, &c, that's extremely lossy. I
| can only pick up a few things from it in the long term.
|
| But an ML model can store most of what it's given (in a
| jumbled format) and can do it from billions of sources.
|
| It's essentially corporate piracy, but it's not legally
| recognized as such because it doesn't store identical
| reproductions.
|
| This hasn't been an issue before because it's recent and
| wasn't considered valuable. But now that it's valuable,
| and Microsoft is going to take all our jobs, we have to at
| least consider whether it's okay for Microsoft to take our
| work for free.
| jsemrau wrote:
| That's the answer to the YC interview question "What is
| your unfair competitive advantage?" in a nutshell. Morally
| it might be wrong. From a business-building perspective,
| it's access that no one else has.
| wendyshu wrote:
| Is Stack Overflow plagiarism?
| anonyfox wrote:
| I am strongly in favor of eliminating copyright completely
| everywhere, soooo I am pretty fine with that. The other
| direction should be more enforceable: stuff derived from
| open data must also be made open again, like the GPL but
| for data (and therefore ML stuff).
| WoodenChair wrote:
| Right, but in a world where copyright does exist, we
| arguably have the worst of both worlds: small players are
| not protected at all from scraping, while big players
| leverage all of their work and have the legal resources to
| form a moat.
| anonyfox wrote:
| sure, so instead of build even higher walled gardens, let
| all data be free for everyone :-)
| antibasilisk wrote:
| The smallest player is the user, and they should have real
| ownership over their computers.
| shadowgovt wrote:
| Apart from the open questions of the quality of such once-
| removed-from-human-generated training data...
|
| I can't speak to the _legality_ of the situation, but the
| _morality_ of using, without their consent, data generated
| by someone's AI engine...
|
| ... that was, itself, trained on other people's data without
| their consent...
|
| ... should be, at the very least, equivalently evil to the
| original AI's training.
| jstanley wrote:
| So... not at all evil?
| MrYellowP wrote:
| No, it shouldn't. Maybe you should be, at the very least,
| considered a questionable person. I do not in any way,
| shape or form consider anything to be wrong with what
| they're doing, but I question the senses of someone who
| thinks this is immoral or even evil.
|
| Keep your subjective nonsense out of this.
| [deleted]
| jamiek88 wrote:
| Every opinion is subjective.
| shadowgovt wrote:
| So were it to be the case that we should consider building an
| AI by scraping people's publicly-available work without their
| consent to be immoral (as many whose art was scraped to build
| e.g. stable diffusion would argue it should be)...
|
| Do you not agree (in that context) we should consider
| scraping the output of an AI generated via such an immoral
| process to create yet another AI also immoral? At the very
| least, I'd think we would consider it further laundering of
| other people's labor with just extra steps.
| famahar wrote:
| How the turn tables. Remember when Google called out Microsoft in
| 2011 for using Google results?
|
| https://googleblog.blogspot.com/2011/02/microsofts-bing-uses...
|
| >We look forward to competing with genuinely new search
| algorithms out there--algorithms built on core innovation, and
| not on recycled search results from a competitor.
| styfle wrote:
| I came here to post this
| goldfeld wrote:
| Google: We look forward to [babble babble empty words we don't
| really mean on principle and more corporate speak that we laugh
| about having written in the bar.]
|
| Is there even a single free non-bargained soul behind these
| companies' executive functions?
| LightBug1 wrote:
| So when Google does it, it's a breaking news story ...
|
| But when OpenAI do it, it's genius?
|
| Can't believe this is a conversation ... and I've been
| solidly anti-Google since Google Reader.
___________________________________________________________________
(page generated 2023-03-30 23:00 UTC)