[HN Gopher] Core copyright violation moves ahead in The Intercep...
___________________________________________________________________
Core copyright violation moves ahead in The Intercept's lawsuit
against OpenAI
Author : giuliomagnifico
Score : 171 points
Date : 2024-11-29 13:48 UTC (9 hours ago)
(HTM) web link (www.niemanlab.org)
(TXT) w3m dump (www.niemanlab.org)
| philipwhiuk wrote:
| It's extremely lousy that you have to pre-register copyright.
|
| That would make the USCO a de facto clearinghouse for news.
| throw646577 wrote:
| You don't have to pre-register copyright in any Berne
| Convention countries. Your copyright exists from the moment you
| create something.
|
| (ETA: This paragraph below is diametrically wrong. Sorry.)
|
| AFAIK in the USA, registered copyright is necessary if you want
| to bring a lawsuit and get more than statutory damages, which
| are capped low enough that corporations do pre-register work.
|
| Not the case in all Berne countries; you don't need this in the
| UK for example, but then the payouts are typically a lot lower
| in the UK. Statutory copyright payouts in the USA can be enough
| to make a difference to an individual author/artist.
|
| As I understand it, OpenAI could still be on the hook for up to
| $150K per article if it can be demonstrated that the copyright
| violation is wilful. It's hard to see how they can argue with a
| straight face that it is accidental. But then OpenAI is, like
| several other tech unicorns, a bad faith manufacturing device.
| Loughla wrote:
| You seem to know more about this than me. I have a family
| member who "invented" some electronics things. He hasn't done
| anything with the inventions (I'm pretty sure they're
| quackery).
|
| But to secure his patent, he mailed himself a sealed copy of
| the plans. He claims the postage date stamp will hold up in
| court if he ever needs it.
|
| Is that a thing? Or is it just more tinfoil business? It's
| hard to tell with him.
| throw646577 wrote:
| Honestly I don't know whether that actually is a meaningful
| thing to do anymore; it may be with patents.
|
| It certainly used to be a legal device people used.
|
| Essentially it is low-budget notarisation. If your family
| member believes they have something which is timely and
| valuable, it might be better to seek out proper legal
| notarisation, though -- you'd consult a Notary Public:
|
| https://en.wikipedia.org/wiki/Notary_public
| WillAdams wrote:
| It won't hold up in court, and given that the post office
| will deliver unsealed letters (which may then be sealed
| after the fact), it will be viewed rather dimly.
|
| Buy your family member a copy of:
|
| https://www.goodreads.com/book/show/58734571-patent-it-
| yours...
| Y_Y wrote:
| Surely the NSA will retain a copy which can be checked
| Tuna-Fish wrote:
| Even if they did, it in fact cannot be checked. There is
| precedent that you cannot subpoena NSA for their
| intercepts, because exactly what has been intercepted and
| stored is privileged information.
| hiatus wrote:
| > There is precedent that you cannot subpoena NSA for
| their intercepts
|
| I know it's tangential to this thread but could you link
| to further reading?
| ysofunny wrote:
| but only in a real democracy
| cma wrote:
| The US moved to first-to-file years ago. Whoever files first
| gets the patent, except that if the inventor publishes
| publicly there is a 1-year grace period (which would not
| apply to a self-mail or private mail to other people).
|
| This is patent law, not copyright.
| Isamu wrote:
| Mailing yourself a copy using registered mail is a very old
| tactic to establish a date for your documents using an
| official government entity, so this can be meaningful in
| court. However, this may not provide the protection he
| needs. Copyright law differs from patent law, and he should
| seek legal advice.
| dataflow wrote:
| Even if the date is verifiable, what would it even prove?
| If it's not public then I don't believe it can count as
| prior art to begin with.
| blibble wrote:
| presumably the intention is to prove the existence of the
| specific plans at a specific time?
|
| I guess the modern version would be to sha256 the plans and
| shove it into a bitcoin transaction
|
| good luck explaining that to a judge
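| A minimal sketch of that hashing step (Python standard
| library only; anchoring the digest in a transaction or with a
| timestamping service is a separate step, not shown):

```python
import hashlib

def document_digest(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large plans don't need to
        # fit in memory at once.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

| Publishing that 64-character digest anywhere with a trusted
| timestamp shows the plans existed in that exact form at that
| time, without revealing their contents.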
| Isamu wrote:
| Right, you can register before you bring a lawsuit. Pre-
| registration makes your claim stronger, as does notice of
| copyright.
| dataflow wrote:
| That's what I thought too, but why does the article say:
|
| > Infringement suits require that relevant works were first
| registered with the U.S. Copyright Office (USCO).
| throw646577 wrote:
| OK so it turns out I am wrong here! Cool.
|
| I had it upside down/diametrically wrong, however you see
| fit. Right that the structures exist, exactly wrong on how
| they apply.
|
| It is registration that guarantees access to statutory
| damages:
|
| https://www.justia.com/intellectual-
| property/copyright/infri...
|
| Without registration you still have your natural copyright,
| but you would have to try to recover the profits made by
| the infringer.
|
| Which does sound like more of an uphill struggle for The
| Intercept, because OpenAI could maybe just say "anything we
| earn from this is de minimis considering how much errr
| similar material is errrr in the training set"
|
| Oh man it's going to take a long time for me to get my
| brain to accept this truth over what I'd always understood.
| pera wrote:
| > _It's hard to see how they can argue with a straight face
| that it is accidental_
|
| It's another instance of "move fast, break things" (i.e.
| "keep your eyes shut while breaking the law at scale")
| renewiltord wrote:
| Yes, because all progress depends upon the unreasonable
| man.
| 0xcde4c3db wrote:
| The claim that's being allowed to proceed is under 17 USC 1202,
| which is about stripping metadata like the title and author. Not
| exactly "core copyright violation". Am I missing something?
| anamexis wrote:
| I read the headline as the copyright violation claim being core
| to the lawsuit.
| H8crilA wrote:
| The plaintiffs focused on exactly this part - removal of
| metadata - probably because it's the most likely to hold up
| in court. One judge remarked on it pretty explicitly, saying
| that it's just a proxy topic for the real issue of the usage
| of copyrighted material in model training.
|
| I.e., it's some legalese trick, but "everyone knows" what's
| really at stake.
| CaptainFever wrote:
| Also, is there really any benefit to stripping author metadata?
| Was it basically a preprocessing step?
|
| It seems to me that it shouldn't really affect model quality
| all that much, should it?
|
| Also, in the amended complaint:
|
| > not to notify ChatGPT users when the responses they received
| were protected by journalists' copyrights
|
| Wasn't it already quite clear that as long as the articles
| weren't replicated, it wasn't protected? Or is that still being
| fought in this case?
|
| In the decision:
|
| > I agree with Defendants. Plaintiffs allege that ChatGPT has
| been trained on "a scrape of most of the internet," Compl.
| ¶ 29, which includes massive amounts of information from
| innumerable sources on almost any given subject. Plaintiffs
| have nowhere alleged that the information in their articles
| is copyrighted, nor could they do so. When a user inputs a
| question into ChatGPT, ChatGPT synthesizes the relevant
| information in its repository into an answer. Given the
| quantity of information contained in the repository, the
| likelihood that ChatGPT would output plagiarized content
| from one of Plaintiffs' articles seems remote. And while
| Plaintiffs provide third-party statistics indicating that an
| earlier version of ChatGPT generated responses containing
| significant amounts of plagiarized content, Compl. ¶ 5,
| Plaintiffs have not plausibly alleged that there is a
| "substantial risk" that the current version of ChatGPT will
| generate a response plagiarizing one of Plaintiffs' articles.
| freejazz wrote:
| >Also, is there really any benefit to stripping author
| metadata? Was it basically a preprocessing step?
|
| Have you read 1202? It's all about hiding your infringement.
| Kon-Peki wrote:
| Violations of 17 USC 1202 can be punished pretty severely. It's
| not about just money, either.
|
| If, _during the trial_, the judge thinks that OpenAI is
| going to be found to be in violation, he can order all of
| OpenAI's computer equipment to be impounded. If OpenAI is
| found to be in violation, he can then order permanent
| destruction of the models, and OpenAI would have to start
| over from scratch in a manner that doesn't violate the law.
|
| Whether you call that "core" or not, OpenAI cannot afford to
| lose the parts of this lawsuit that are left.
| zozbot234 wrote:
| > he can order all of OpenAI's computer equipment to be
| impounded.
|
| Arrrrr matey, this is going to be fun.
| Kon-Peki wrote:
| People have been complaining about the DMCA for 2+ decades
| now. I guess it's great if you are on the winning side. But
| boy does it suck to be on the losing side.
| immibis wrote:
| And normal people can't get on the winning side. I'm
| trying to get Github to DMCA my own repositories, since
| it blocked my account and therefore I decided it no
| longer has the right to host them. Same with Stack
| Exchange.
|
| GitHub has ignored me so far, and Stack Exchange explicitly
| said no (then I sent them an even broader legal request
| under GDPR).
| ralph84 wrote:
| When you uploaded your code to GitHub you granted them a
| license to host it. You can't use DMCA against someone
| who's operating within the parameters of the license you
| granted them.
| tremon wrote:
| Their stance is that GitHub revoked that license by
| blocking their account.
| immibis wrote:
| It won't happen. Judges only order that punishment for the
| little guys.
| nickpsecurity wrote:
| " If OpenAI is found to be in violation, he can then order
| permanent destruction of the models and OpenAI would have to
| start over from scratch in a manner that doesn't violate the
| law."
|
| That is exactly why I suggested companies train some models
| on public domain and licensed data. That risk disappears or
| is very minimal. They could also be used for code and
| synthetic data generation without legal issues on the
| outputs.
| 3pt14159 wrote:
| The problem is that you don't get the same quality of data
| if you go about it that way. I love ChatGPT and I
| understand that we're figuring out this new media landscape
| but I really hope it doesn't turn out to neuter the models.
| The models are really well done.
| nickpsecurity wrote:
| If I stole money, I could get way more done than I do now by
| earning it legally. Yet you won't see me regularly
| dismissing legitimate jobs by posting comparisons to what my
| numbers would look like if I were stealing I.P.
|
| We must start with moral and legal behavior. Within that,
| we look at what opportunities we have. Then, we pick the
| best ones. Those we can't have are a side effect of the
| tradeoffs we've made (or tolerated) in our system.
| tremon wrote:
| That is OpenAI's problem, not their victims'.
| jsheard wrote:
| That's what Adobe and Getty Images are doing with their
| image generation models, both are exclusively using their
| own licensed stock image libraries so they (and their
| users) are on pretty safe ground.
| nickpsecurity wrote:
| That's good. I hope more do. This list has those doing it
| under the Fairly Trained banner:
|
| https://www.fairlytrained.org/certified-models
| james_sulivan wrote:
| Meanwhile China is using everything available to train their AI
| models
| goatlover wrote:
| We don't want to be like China.
| tokioyoyo wrote:
| Fair. But I made a comment somewhere else that, if their
| models become better than ours, they'll be incorporated into
| products. Then we're back to being dependent on China for
| LLM development as well, on top of manufacturing.
| Realistically that'll be banned because of national security
| laws or something, but companies tend to choose the path of
| "best and cheapest" no matter what.
| zb3 wrote:
| Forecast: OpenAI and The Intercept will settle and OpenAI users
| will pay for it.
| jsheard wrote:
| Yep, the game plan is to keep settling out of court so that
| (they hope) no legal precedent is set that would effectively
| make their entire business model illegal. That works until they
| run out of money I guess, but they probably can't keep it up
| forever.
| echoangle wrote:
| Wouldn't the better method be to throw all your money at one
| suit you can make an example of and try to win? You can't
| effectively settle every single suit if you have no
| realistic chance of winning; otherwise every single
| publisher on the internet will come and try to get their
| money.
| lokar wrote:
| Too high risk. Every year you can delay you keep lining
| your pockets.
| gr3ml1n wrote:
| That's a good strategy, but you have to have the right
| case. One where OpenAI feels confident they can win and
| establish favorable precedent. If the facts of the case
| aren't advantageous, it's probably not worth the risk.
| tokioyoyo wrote:
| Side question: why don't other companies get the same
| attention? Anthropic, xAI and others have deep pockets, and
| scraped the same data, I'm assuming? It could be a gold mine
| for all these news agencies to keep settling out of court to
| make some buck.
| ysofunny wrote:
| the very idea of "this digital asset is exclusively mine" cannot
| die soon enough
|
| let real physically tangible assets keep the exclusivity
| _problem_
|
| let's not undo the advantages unlocked by the digital internet;
| let us prevent a few from locking down this grand boon of digital
| abundance such that the problem becomes saturation of data
|
| let us say no to digital scarcity
| cess11 wrote:
| I think you'll find that most people aren't comfortable with
| this in practice. They'd like e.g. the state to be able to keep
| secrets, such as personal information regarding citizens and
| the stuff foreign spies would like to copy.
| jMyles wrote:
| Obviously we're all impacted in these perceptions by our
| bubbles, but it would surprise me at this particular moment
| in the history of US politics to find that most people favor
| the existence of the state at all, let alone its ability to
| keep secret personal information regarding citizens.
| goatlover wrote:
| Most people aren't anarchists, and think the state is
| necessary for complex societies to function.
| jMyles wrote:
| My sense is that the constituency of people who prefer
| deprecation of the US state is much larger than just
| anarchists.
| cess11 wrote:
| Really? Are Food Not Bombs and the IWW that popular where
| you live?
| CaptainFever wrote:
| This is, in fact, the core value of the hacker ethos. _Hacker_
| News.
|
| > The belief that information-sharing is a powerful positive
| good, and that it is an ethical duty of hackers to share their
| expertise by writing open-source code and facilitating access
| to information and to computing resources wherever possible.
|
| > Most hackers subscribe to the hacker ethic in sense 1, and
| many act on it by writing and giving away open-source software.
| A few go further and assert that all information should be free
| and any proprietary control of it is bad; this is the
| philosophy behind the GNU project.
|
| http://www.catb.org/jargon/html/H/hacker-ethic.html
|
| Perhaps if the Internet didn't kill copyright, AI will.
| (Hyperbole)
|
| (Personally my belief is more nuanced than this; I'm fine with
| very limited copyright, but my belief is closer to yours than
| the current system we have.)
| ysofunny wrote:
| oh please, then, riddle me why my comment has -1 votes on
| "hacker" news
|
| which has indeed turned into "i-am-rich-cuz-i-own-tech-
| stock" news
| CaptainFever wrote:
| Yes, I have no idea either. I find it disappointing.
|
| I think people simply like it when data is liberated from
| corporations, but hate it when data is liberated from them.
| (Though this case is a corporation too so idk. Maybe just
| "AI bad"?)
| alwa wrote:
| I did not contribute a vote either way to your comment
| above, but I would point out that you get more of what you
| reward. Maybe the reward is monetary, like an author paid
| for spending their life writing books. Maybe it's smaller,
| more reputational or social--like people who generate
| thoughtful commentary here, or Wikipedia's editors, or
| hobbyists' forums.
|
| When you strip people's names from their words, as the
| specific count here charges; and you strip out any reason
| or even way for people to reward good work when they
| appreciate it; and you put the disembodied words in the
| mouth of a monolithic, anthropomorphized statistical model
| tuned to mimic a conversation partner... what type of
| thought is it that becomes abundant in this world you
| propose, of "data abundance"?
|
| In that world, the only people who still have incentive to
| create are the ones whose content has _negative_ value, who
| make things people otherwise wouldn't want to see:
| advertisers, spammers, propagandists, trolls... where's the
| upside of a world saturated with that?
| onetokeoverthe wrote:
| Creators freely sharing with attribution requested is
| different than creations being ruthlessly harvested and
| repurposed without permission.
|
| https://creativecommons.org/share-your-work/
| CaptainFever wrote:
| > A few go further and assert that all information should
| be free and any proprietary control of it is bad; this is
| the philosophy behind the GNU project.
|
| In this view, the ideal world is one where copyright is
| abolished (but not moral rights). So piracy is good, and
| datasets are also good.
|
| Asking creators to license their work freely is simply a
| compromise due to copyright unfortunately still existing.
| (Note that even if creators don't license their work
| freely, this view still permits you to pirate or mod it
| against their wishes.)
|
| (My view is not this extreme, but my point is that this
| view was, and hopefully is, still common amongst hackers.)
|
| I will ignore the moralizing words (eg "ruthless",
| "harvested" to mean "copied"). It's not productive to the
| conversation.
| onetokeoverthe wrote:
| If not respected, some Creators will strike, lay flat,
| not post, go underground.
|
| Ignoring moral rights of creators is the issue.
| CaptainFever wrote:
| Moral rights involve the attribution of works where
| reasonable and practical. Clearly doing so during
| inference is not reasonable or practical (you'll have to
| attribute all of humanity!) but attributing individual
| sources _is_ possible and _is_ already being done in
| cases like ChatGPT Search.
|
| So I don't think you actually mean moral rights, since
| it's not being ignored here.
|
| But the first sentence of your comment still stands
| regardless of what you meant by moral rights. To that,
| well... we're still commenting here, are we not? Despite
| it with almost 100% certainty being used to train AI.
| We're still here.
|
| And yes, funding is a thing, which I agree needs
| copyright for the most part unfortunately. But does
| training AI on, for example, a book really reduce the
| need to buy the book, if it is not reproduced?
|
| Remember, training is not just about facts, but about
| learning how humans talk, how _languages_ work, how books
| work, etc. Learning that won't reduce the book's economic
| value.
|
| And yes, summaries may reduce the value. But summaries
| already exist. Wikipedia, Cliff's Notes. I think the main
| defense is that you can't copyright facts.
| onetokeoverthe wrote:
| _we're still commenting here, are we not? Despite it with
| almost 100% certainty being used to train AI. We're still
| here_
|
| ?!?! Comparing and equating commenting to creative works.
| ?!?!
|
| These comments are NOT equivalent to the 17 full time
| months it took me to write a nonfiction book.
|
| Or an 8 year art project.
|
| When I give away _my_ work _I_ decide to whom and how.
| a57721 wrote:
| > freely sharing with attribution requested
|
| If I share my texts/sounds/images for free, harvesting and
| regurgitating them omits the requested attribution. Even
| the most permissive CC license (excluding CC0 public
| domain) still requires an attribution.
| AlienRobot wrote:
| I think an ethical hacker is someone who uses their expertise
| to help those without.
|
| How could an ethical hacker side with OpenAI, when OpenAI is
| using its technological expertise to exploit creators
| without?
| CaptainFever wrote:
| I won't necessarily argue against that moral view, but in
| this case it is two large corporations fighting. One has
| the power of tech, the other has the power of the state
| (copyright). So I don't think that applies in this case
| specifically.
| Xelynega wrote:
| Aren't you ignoring that common law is built on precedent?
| If they win this case, that makes it a lot easier for people
| whose copyright is being infringed on an individual level to
| get justice.
| CaptainFever wrote:
| You're correct, but I think many don't realize how many
| small model trainers and fine-tuners there are currently.
| For example, PonyXL, or the many models and fine-tunes on
| CivitAI made by hobbyists.
|
| So basically the reasoning is this:
|
| - NYT vs OpenAI: neither is disenfranchised
|
| - OpenAI vs individual creators: creators are disenfranchised
|
| - NYT vs individual model trainers: model trainers are
| disenfranchised
|
| - Individual model trainers vs individual creators: neither
| is disenfranchised
|
| And if only one can win, and since the view is that
| information should be free, it biases the argument
| towards the model trainers.
| AlienRobot wrote:
| What "information" are you talking about? It's a text and
| image generator.
|
| Your argument is that it's okay to scrape content when
| you are an individual. It doesn't change the fact that those
| individuals are people with technical expertise using it to
| exploit people without.
|
| If they wrote a bot to annoy people but published how
| many people got angry about it, would you say it's okay
| because that is information?
|
| You need to draw the line somewhere.
| Xelynega wrote:
| I don't understand what the "hacker ethos" could have to do
| with defending openai's blatant stealing of people's content
| for their own profit.
|
| OpenAI is not sharing their data (they're keeping it private
| to profit off of it), so how could it be anywhere near the
| "hacker ethos" to believe that everyone else needs to hand
| over their data to OpenAI for free?
| CaptainFever wrote:
| Following the "GNU-flavour hacker ethos" as described, one
| concludes that it is right for OpenAI to copy data without
| restriction, it is wrong for NYT to restrict others from
| using their data, and it is _also_ wrong for OpenAI to
| restrict the sharing of their model weights or outputs for
| training.
|
| Luckily, most people seem to ignore OpenAI's hypocritical
| TOS against using their outputs for training. I would go one
| step further and say that they should share the weights
| completely, but I understand there are practical issues with
| that.
|
| Luckily, we can kind of "exfiltrate" the weights by
| training on their output. Or wait for someone to leak it,
| like NovelAI did.
| whywhywhywhy wrote:
| It's so weird to me seeing journalists complaining about
| copyright and people taking something they did.
|
| The whole of journalism is taking the acts of others and
| repeating them, so why does a journalist claim to have
| rights over someone else's actions when someone simply looks
| at something they did and repeats it?
|
| If no one else ever did anything, the journalist would have
| nothing to report, it's inherently about replicating the work and
| acts of others.
| barapa wrote:
| This is terribly unpersuasive
| PittleyDunkin wrote:
| > The whole of journalism is taking the acts of others and
| repeating them
|
| Hilarious (and depressing) that this is what people think
| journalists do.
| SoftTalker wrote:
| What is a "journalist?" It sounds old-fashioned.
|
| They are "content creators" now.
| echoangle wrote:
| That's a pretty narrow view of journalism. If you look into
| newspapers, it's not just a list of events but also opinion
| pieces, original research, reports etc. The main infringement
| isn't with the basic reporting of facts but with the original
| part that's done by the writer.
| razakel wrote:
| Or you could just not do illegal and/or immoral things that are
| worthy of reporting.
| hydrolox wrote:
| I understand that regulations exist and that there can be
| copyright violations, but shouldn't we be concerned that
| other, more lenient governments (mainly China) that are
| opposed to the US will use this to get ahead if OpenAI is
| significantly set back?
| fny wrote:
| No. OpenAI is suspected to be worth over $150B. They can
| absolutely afford to pay people for data.
|
| Edit: People commenting need to understand that $150B is the
| _discounted value of future revenues._ So... yes, they can
| pay out... yes, they will be worth less... and yes, that's
| fair to the people who created the information.
|
| I can't believe there are so many apologists on HN for what
| amounts to vacuuming up peoples data for financial gain.
| suby wrote:
| OpenAI is not profitable, and to achieve what they have
| achieved they had to scrape basically the entire internet. I
| don't have a hard time believing that OpenAI could not exist
| if they had to respect copyright.
|
| https://www.cnbc.com/2024/09/27/openai-sees-5-billion-
| loss-t...
| jpalawaga wrote:
| technically OpenAI has respected copyright, except in the
| (few) instances where they produce non-fair-use amounts of
| copyrighted material.
|
| dmca does not cover scraping.
| jsheard wrote:
| The OpenAI that is assumed to keep being able to harvest
| every form of IP without compensation is valued at $150B, an
| OpenAI that has to pay for data would be worth significantly
| less. They're currently not even expecting to turn a profit
| until 2029, and that's _without_ paying for data.
|
| https://finance.yahoo.com/news/report-reveals-
| openais-44-bil...
| mrweasel wrote:
| That's not real money though. You need actual cash on hand
| to pay for stuff, and OpenAI only have the money they've
| been given by investors. I suspect that many of the
| investors wouldn't have been so keen if they knew that
| OpenAI would need an additional couple of billion a year to
| pay for data.
| nickpsecurity wrote:
| That doesn't mean they have $150B to hand over. What you can
| cite is the $10 billion they got from Microsoft.
|
| I'm sure they could use a chunk of that to buy competitive
| I.P. for both companies to use for training. They can also
| pay experts to create it. They could even sell that to others
| for use in smaller models to finance creating or buying even
| more I.P. for their models.
| dmead wrote:
| I'm more concerned that some people in the tech world are
| conflating Sam Altman's interest with the national interest.
| jMyles wrote:
| Am I jazzed about Sam Altman making billions? No.
|
| Am I even more concerned about the state having control over
| the future corpus of knowledge via this doomed-in-any-case
| vector of "intellectual property"? Yes.
|
| I think it will be easier to overcome the influence of
| billionaires when we drop the pretext that the state is a
| more primal force than the internet.
| dmead wrote:
| 100% disagree. "It'll be fine bro" is not a substitute for
| having a vote over policy decisions made by the government.
| What you're talking about has a name. It starts with F and
| was very popular in Italy in the early to mid 20th century.
| jMyles wrote:
| Rapidity of Godwin's law notwithstanding, I'm not
| disputing the importance of equity in decision-making.
| But this matter is more complex than that: it's obvious
| that the internet doesn't tolerate censorship even if it
| is dressed as intellectual property. I prefer an open and
| democratic internet to one policed by childish legacy
| states, the presence of which serves only (and only
| sometimes) to drive content into open secrecy.
|
| It seems particularly unfair to equate any questioning of
| the wisdom of copyright laws (even when applied in
| situations where we might not care for the defendant, as
| with this case) with fascism.
| dmead wrote:
| It's not Godwin's law when it's correct. Just because
| it's cool and on the Internet doesn't mean you get to
| throw out people's stake in how their lives are run.
| jMyles wrote:
| > throw out people's stake in how their lives are run
|
| FWIW, you're talking to a professional musician.
| Ostensibly, the IP complex is designed to protect me. I
| cannot fathom how you can regard it as the "people's
| stake in how their lives are run". Eliminating copyright
| will almost certainly give people more control over their
| digital lives, not less.
|
| > It's not Godwin's law when it's correct.
|
| Just to be clear, you are doubling down on the claim that
| sunsetting copyright laws is tantamount to nazism?
| worble wrote:
| Should we also be concerned that other governments use slave
| labor (among other human rights violations) and will use that
| to get ahead?
| logicchains wrote:
| It's hysterical to compare training an ML model with slave
| labour. It's perfectly fine and accepted for a human to read
| and learn from content online without paying anything to the
| author when that content has been made available online for
| free, it's absurd to assert that it somehow becomes a human
| rights violation when the learning is done by a non-
| biological brain instead.
| Kbelicius wrote:
| > It's hysterical to compare training an ML model with
| slave labour.
|
| Nobody did that.
|
| > It's perfectly fine and accepted for a human to read and
| learn from content online without paying anything to the
| author when that content has been made available online for
| free, it's absurd to assert that it somehow becomes a human
| rights violation when the learning is done by a non-
| biological brain instead.
|
| It makes sense. There is always scale to consider in these
| things.
| devsda wrote:
| Get ahead in terms of what? Do you believe that the material in
| public domain or legally available content that doesn't violate
| copyrights is not enough to research AI/LLMs or is the concern
| about purely commercial interests?
|
| China also supposedly has abusive labor practices. So, should
| other countries start relaxing their labor laws to avoid
| falling behind ?
| mu53 wrote:
| Isn't it a greater risk that creators lose their income and
| nobody is creating the content anymore?
|
| Take for instance what has happened with news because of the
| internet. Not exactly the same, but similar forces at work. It
| turned into a race to the bottom with everyone trying to
| generate content as cheaply as possible to get maximum
| engagement with tech companies siphoning revenue. Expensive,
| investigative pieces from educated journalists disappeared in
| favor of stuff that looks like spam. Pre-Internet news was
| higher quality.
|
| Imagine that same effect happening for all content: art,
| writing, academic pieces. It's a real risk that OpenAI has
| peaked in quality.
| CuriouslyC wrote:
| Lots of people create without getting paid to do it. A lot of
| music and art is unprofitable. In fact, you could argue that
| when the mainstream media companies got completely captured
| by suits with no interest in the things their companies
| invested in, that was when creativity died and we got
| consigned to genre-box superhero pop hell.
| eastbound wrote:
| I don't know. When I look at news from before, there never
| was investigative journalism. It was all opinion-swaying
| editorials, until alternate voices voiced their
| counternarratives. It's just not in newspapers, because they
| are too politically biased to produce the two sides of
| stories that we've always asked them to do. It's on other
| media.
|
| But investigative journalism has not disappeared. If
| anything, it has grown.
| immibis wrote:
| Absolutely: if copyright is slowing down innovation, we should
| abolish copyright.
|
| Not just turn a blind eye when it's the right people doing it.
| They don't even have a legal exemption passed by Congress -
| they're just straight-up breaking the law and getting away with
| it. Which is how America works, I suppose.
| JoshTriplett wrote:
| Exactly. They rushed to violate copyright on a massive scale
| _quickly_, and now are making the argument that it shouldn't
| apply to them and they couldn't possibly operate in
| compliance with it. As long as humans don't get to ignore
| copyright, AI shouldn't either.
| Filligree wrote:
| Humans do get to ignore copyright, when they do the same
| thing OpenAI has been doing.
| slyall wrote:
| Exactly.
|
| Should I be paying a proportion of my salary to all the
| copyright holders of the books, songs, TV shows and movies
| I consumed during my life?
|
| If a Hollywood writer says she "learnt a lot about
| writing by watching the Simpsons" will Fox have an
| additional claim on her earnings?
| __loam wrote:
| Yeah it turns out humans have more rights than computer
| programs and tech startups.
| bogwog wrote:
| This type of argument is ignorant, cowardly, shortsighted, and
| regressive. Both technology and society will progress when we
| find a formula that is sustainable and incentivizes everyone
| involved to maximize their contributions without it all blowing
| up in our faces someday. Copyright law is far from perfect, but
| it protects artists who want to try and make a living from
| their work, and it incentivizes creativity that places without
| such protections usually end up just imitating.
|
| When we find that sustainable framework for AI, China or
| <insert-boogeyman-here> will just end up imitating it. Idk what
| harms you're imagining might come from that ("get ahead" is too
| vague to mean anything), but I just want to point out that that
| isn't how you become a leader in anything. Even worse, if
| _they_ are the ones who find that formula first while we take
| shortcuts to "get ahead", then we will be the ones doing the
| imitation in the end.
| gaganyaan wrote:
| Copyright is a dead man walking and that's a good thing.
| Let's applaud the end of a temporary unnatural state of
| affairs.
| quarterdime wrote:
| Interesting. Two key quotes:
|
| > It is unclear if the Intercept ruling will embolden other
| publications to consider DMCA litigation; few publications have
| followed in their footsteps so far. As time goes on, there is
| concern that new suits against OpenAI would be vulnerable to
| statute of limitations restrictions, particularly if news
| publishers want to cite the training data sets underlying
| ChatGPT. But the ruling is one signal that Loevy & Loevy is
| narrowing in on a specific DMCA claim that can actually stand up
| in court.
|
| > Like The Intercept, Raw Story and AlterNet are asking for
| $2,500 in damages for each instance that OpenAI allegedly removed
| DMCA-protected information in its training data sets. If damages
| are calculated based on each individual article allegedly used to
| train ChatGPT, it could quickly balloon to tens of thousands of
| violations.
|
| Tens of thousands of violations at $2,500 each would amount to
| tens of millions of dollars in damages. I am not familiar with
| this field; does anyone have a sense of how the total cost of
| retraining (without these alleged DMCA violations) might compare
| to these damages?
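| As a back-of-the-envelope sketch (the $2,500 figure is from
| the complaint; the 20,000-article count is purely
| hypothetical):

```python
# Rough damages estimate for the DMCA claim described above.
# $2,500 per violation is the amount The Intercept, Raw Story and
# AlterNet are asking for; the article count is a made-up stand-in
# for "tens of thousands".
per_violation = 2_500      # dollars per alleged violation
violations = 20_000        # hypothetical number of articles

total = per_violation * violations
print(f"${total:,}")       # -> $50,000,000
```

| So even the low end of "tens of thousands" of articles lands
| in the tens of millions of dollars.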
| Xelynega wrote:
| If you're going to retrain your model because of this ruling,
| wouldn't it make sense to remove _all_ DMCA-protected content
| from your training data, instead of just the content you were
| most recently sued for (especially if it sets precedent)?
| jsheard wrote:
| It would make sense from a legal standpoint, but I don't
| think they could do that without massively regressing their
| models' performance, to the point that it would jeopardize
| their viability as a company.
| zozbot234 wrote:
| They might make it work by (1) having lots of public domain
| content, for the purpose of training their models on basic
| language use, and (2) preserving source/attribution
| metadata about what copyrighted content they do use, so
| that the models can surface this attribution to the user
| during inference. Even if the latter is not 100% foolproof,
| it might still be useful in most cases and show good faith
| intent.
| CaptainFever wrote:
| The latter one is possible with RAG solutions like
| ChatGPT Search, which do already provide sources! :)
|
| But for inference in general, I'm not sure it makes too
| much sense. Training data is not just about learning
| facts, but also (mainly?) about how language works, how
| people talk, etc. Which is kind of too fundamental to be
| attributed to, IMO. (Attribution: Humanity)
|
| But who knows. Maybe it _can_ be done for more fact-like
| stuff.
| TeMPOraL wrote:
| > _Training data is not just about learning facts, but
| also (mainly?) about how language works, how people talk,
| etc._
|
| All of that and more, all at the same time.
|
| Attribution at inference level is bound to work more or less
| the same way as humans attribute things during
| conversations: "As ${attribution} said, ${some quote}",
| or "I remember reading about it in ${attribution-1} -
| ${some statements}; ... or maybe it was in
| ${attribution-2}?...". Such attributions are often wrong,
| as people hallucinate^Wmisremember where they saw or
| heard something.
|
| RAG obviously can work for this, as well as other
| solutions involving retrieving, finding or confirming
| sources. That's just like when a human actually looks up
| the source when citing something - and has similar
| caveats and costs.
| Xelynega wrote:
| I agree, just want to make sure "they can't stop doing
| illegal things or they wouldn't be a success" is said out
| loud instead of left to subtext.
| CuriouslyC wrote:
| They can't stop doing things some people don't like
| (people who also won't stop doing things other people
| don't like). The legality of the claims is questionable
| which is why most are getting thrown out, but we'll see
| if this narrow approach works out.
|
| I'm sure there are also a number of easy technical ways
| to "include" the metadata while mostly ignoring it during
| training that would skirt the letter of the law if
| needed.
| Xelynega wrote:
| If we really want to be technical, in common law systems
| anything is legal as long as the highest court to
| challenge it decides it's legal.
|
| I guess I should have used the phrase "common sense
| stealing in any other context" to be more precise?
| asdff wrote:
| I wonder if they can say something like "we aren't scraping
| your protected content, we are merely scraping this old
| model we don't maintain anymore, and it happened to have
| protected content in it from before the ruling." Then you've
| essentially won all of humanity's output, as you can already
| scrape the new primary information (scientific articles and
| other datasets designed for researchers to freely access),
| and whatever junk the content mills output is just going to
| be poor summarizations of that primary information.
|
| Another factor that helps this effort of an old model plus
| new public-facing data being complete: other forms of media,
| like storytelling and music, have already converged onto
| certain prevailing patterns. For stories, we expect a certain
| style of plot development and complain when it's missing or
| not as we expect. For music, most anything being listened to
| is lyrics no one is reading deeply into, put over the same
| old chord progressions we've always had. For art, too few of
| us actually go out of our way to get familiar with novel art,
| versus the vast bulk of the world's present-day artistic
| effort, which goes toward product advertisement, which once
| again follows certain patterns people have been publishing in
| psychological journals for decades now.
|
| In a sense, we've already put out enough data and made enough
| of our world formulaic that I believe we've set up for a
| perfect singularity already, in terms of what can be
| generated for the average person who looks at a screen today.
| And because of that, I think even a lack of any new training
| on such content wouldn't hurt OpenAI at all.
| TeMPOraL wrote:
| Only half-serious, but: I wonder if they can dance with the
| publishers around this issue long enough for most of the
| contested text to become part of public court records, and
| then claim they're now training off that. <trollface>
| jprete wrote:
| Being part of a public court record doesn't seem like
| something that would invalidate copyright.
| A4ET8a8uTh0 wrote:
| Re-training can be done, but, and it is not a small but,
| models already do exist and can be used locally suggesting
| that the milk has been spilled for too long at this point.
| Separately, neutering them effectively lowers their value as
| opposed to their non-neutered counterparts.
| ashoeafoot wrote:
| What about bombing? You could always smuggle DMCA-protected
| content into training sets, hoping for a payout?
| Xelynega wrote:
| The onus is on the person collecting massive amounts of
| data and circumventing DMCA protections to ensure they're
| not doing anything illegal.
|
| "Well, someone snuck in some DMCA-protected content" when
| sharing family photos doesn't suddenly make it legal to share
| that DMCA-protected content with your photos...
| sandworm101 wrote:
| But all content is DMCA-protected. Avoiding copyrighted
| content means not having content, as all material is
| automatically copyrighted. One would be limited to licensed
| content, which is another minefield.
|
| The apparent loophole is between copyrighted work and
| copyrighted work that is _also_ registered. But registration
| can occur at any time, meaning there is little practical
| difference. Unless you have perfect licenses for all your
| training data, which nobody does, you have to accept the risk
| of copyright suits.
| Xelynega wrote:
| Yes, that's how every other industry that redistributes
| content works.
|
| You have to license content you want to use; you can't just
| use it for free because it's on the internet.
|
| Netflix doesn't just start hosting shows and hope they
| don't get a copyright suit...
| logicchains wrote:
| Eventually we're going to have embodied models capable of live
| learning and it'll be extremely apparent how absurd the ideas of
| the copyright extremists are. Because in their world, it'd be
| illegal for an intelligent robot to watch TV, read a book or
| browse the internet like a human can, because it could remember
| what it saw and potentially regurgitate it in future.
| luqtas wrote:
| problem is when a human company profits over their scrape...
| this isn't a non-profit running out of volunteers & a total
| distant reality from autonomous robots learning it way by
| itself
|
| we are discussing an emergent cause that has social &
| ecological consequences. servers are power hungry stuff that
| may or not run on a sustainable grid (that also has a bazinga
| of problems like leaking heavy chemicals on solar panels
| production, hydro-electric plants destroying their surroundings
| etc.) & the current state of producing hardware, be a sweatshop
| or conflict minerals. lets forget creators copyright violation
| that is written in the law code of almost every existing
| country and no artist is making billions out of the abuse of
| their creation right (often they are pretty chill on getting
| their stuff mentioned, remixed and whatever)
| openrisk wrote:
| Leaving aside the hypothetical "live learning AGI" of the
| future (given that money is made or lost _now_), would a human
| regurgitating content that is not theirs - but presented as if
| it is - be acceptable to you?
| CuriouslyC wrote:
| I don't know about you but my friends don't tell me that Joe
| Schmoe of Reuters published a report that said XYZ copyright
| XXXX. They say "XYZ happened."
| Karliss wrote:
| If humanity ever gets to the point where intelligent robots
| are capable of watching TV like a human can, having to adjust
| copyright laws seems like the least of our problems. How about
| having to adjust almost every law related to basic "human"
| rights, ownership, the ability to enter into a contract, being
| responsible for crimes, and endless other things?
|
| But for now your washing machine cannot own other things, and
| you owning a washing machine isn't considered slavery.
| JoshTriplett wrote:
| > copyright extremists
|
| It's not copyright "extremism" to expect a level playing field.
| As long as humans have to adhere to copyright, so should AI
| companies. If you want to abolish copyright, by all means do,
| but don't give AI a special exemption.
| IAmGraydon wrote:
| Except LLMs are in no way violating copyright in the true
| sense of the word. They aren't spitting out a copy of what
| they ingested.
| JoshTriplett wrote:
| Go make a movie using the same plot as a Disney movie, that
| doesn't copy any of the text or images of the original, and
| see how far "not spitting out a copy" gets you in court.
|
| AI's approach to copyright is very much "rules for thee but
| not for me".
| bdangubic wrote:
| 100% agree. but now a million$ question - how would you
| deal with AI when it comes to copyright? what rules could
| we possibly put in place?
| JoshTriplett wrote:
| The same rules we already have: follow the license of
| whatever you use. If something doesn't have a license,
| don't use it. And if someone says "but we can't build AI
| that way!", too bad, go fix it for everyone first.
| slyall wrote:
| You have a lot of opinions on AI for somebody who has
| only read stuff in the public domain
| rcxdude wrote:
| That might get you pretty far in court, actually. You'd
| have to be pretty close in terms of the sequence of
| events, character names, etc. Especially considering how
| many Disney movies are based on pre-existing stories, if
| you were to, say, make a movie featuring talking animals
| that more or less followed the plot of Hamlet, you would
| have a decent chance of prevailing in court, given the
| resources to fight their army of lawyers.
| CuriouslyC wrote:
| It's actually the opposite of what you're saying. I can 100%
| legally do all the things that they're suing OpenAI for.
| Their whole argument is that the rules should be different
| when a machine does it than a human.
| JoshTriplett wrote:
| Only because it would be unconscionable to apply copyright
| to actual human brains, so we don't. But, for instance, you
| _absolutely can_ commit copyright violation by reading
| something and then writing something very similar, which is
| one reason why reverse engineering commonly uses clean-room
| techniques. AI training is in no way a clean room.
| IAmGraydon wrote:
| Exactly. Also core to the copyright extremists' delusional
| train of thought is that they don't seem to understand (or
| admit) that ingesting, creating a model, and then outputting
| based on that model is exactly what people do when they
| observe others' works and are inspired to create.
| CuriouslyC wrote:
| You have to understand, the media companies don't give a shit
| about the logic, in fact I'm sure a lot of the people pushing
| the litigation probably see the absurdity of it. This is a
| business turf war, the stated litigation is whatever excuse
| they can find to try and go on the offensive against someone
| they see as a potential threat. The pro copyright group (big
| media) sees the writing on the wall, that they're about to get
| dunked on by big tech, and they're thrashing and screaming
| because $$$.
| tokioyoyo wrote:
| The problem is, we can't come up with a solution where both
| parties are happy, because in the end, consumers choose one
| (getting information from news agencies) or the other (getting
| information from ChatGPT). So, both are fighting for their lives.
| 3pt14159 wrote:
| Is there a way to figure out if OpenAI ingested my blog? If the
| settlements are $2,500 per article, then I'll take a free used
| car's worth of payments if it's available.
| jazzyjackson wrote:
| I suppose the cost of legal representation would cancel it out.
| I can just imagine a class action where anyone who posted on
| blogger.com between 2002 and 2012 eventually gets a check for
| 28 dollars.
|
| If I were more optimistic, I could imagine a UBI funded by
| lawsuits against AGI, some combination of lost wages and
| intellectual property infringement. Can't figure out exactly
| how much more impact an article on The Intercept had on
| shifting weights than your Hacker News comments; might as well
| just pay everyone equally, since we're all equally screwed.
| dwattttt wrote:
| Wouldn't the point of the class action be to dilute the
| cost of representation? If the damages per article are high
| and there are plenty of class members, I imagine the limit
| would be how much OpenAI has to pay out.
| SahAssar wrote:
| If you posted on blogger.com (or any platform with enough
| money to hire lawyers) you probably gave them a license that
| is irrevocable, non-exclusive and able to be sublicensed.
| bastloing wrote:
| Isn't this the same thing Google has been doing for years with
| their search engine? Only difference is Google keeps the data
| internal, whereas OpenAI spits it out to you. But it's still
| scraped and stored in both cases.
| jazzyjackson wrote:
| A component of fair use is to what degree the derivative work
| displaces the original. Google's argument has always been that
| they direct traffic to the original, whereas AI summaries
| (which Google of course is just as guilty of as openai)
| completely obsoletes the original publication. The argument now
| is that the derivative work (LLM model) is transformative, ie,
| different enough that it doesn't economically compete with the
| original. I think it's a losing argument but we'll see what the
| courts arrive at.
| CaptainFever wrote:
| Is this specific to AI or specific to summaries in general?
| Do summaries, like the ones found in Wikipedia or Cliffs
| Notes, not have the same effect of making it such that people
| no longer have to view the original work as much?
|
| Note: do you mean the _model_ is transformative, or the
| _summaries_ are transformative? I think your comment holds up
| either way, but I think it's better to be clear which one you
| mean.
| LinuxBender wrote:
| In my opinion, _not a lawyer_, Google at least references where
| they obtained the data and did not regurgitate it as if they
| were the creators who created something new. _Obfuscated
| plagiarism via LLM._ Some claim derivative works, but I have
| always seen that as quite a stretch. People here expect me to
| cite references, yet LLMs somehow escape this level of
| scrutiny.
| efitz wrote:
| I would trust AI a lot more if it gave answers more like:
|
| _"Source A on date 1 said XYX"_
|
| _"Source B ..."_
|
| _"Synthesizing these, it seems that the majority opinion is X
| but Y is also a commonly held opinion."_
|
| Instead of what it does now, which is make extremely confident,
| unsourced statements.
|
| It looks like the copyright lawsuits are rent-seeking as much as
| anything else; another reason I hate copyright in its current
| form.
| CaptainFever wrote:
| ChatGPT Search provides this, by the way, though it relies a
| lot on the quality of Bing search results. Consensus.app does
| this but for research papers, and has been very useful to me.
| maronato wrote:
| More often than not in my experience, clicking these sources
| takes me to pages that either don't exist, don't have the
| information ChatGPT is quoting, or ChatGPT completely
| misinterpreted the content.
| akira2501 wrote:
| > which is make extremely confident,
|
| One of the results the LLM has available to itself is a
| confidence value. It should, at the very least, provide this
| along with its answer. Perhaps if it did, people would stop
| calling it 'AI'.
| ashoeafoot wrote:
| Will we see human-washing, where AI art or works get a "made by
| man" final touch in some third-world mechanical turk den? Would
| that add another financially detracting layer to the AI winter?
| Retric wrote:
| The law generally takes a dim view of such attempts to get
| around things like that. AI biggest defense is claiming they
| are so beneficial to society that what they are doing is fine.
| gmueckl wrote:
| That argument stands on the mother of all slippery slopes!
| Just find a way to make your product impressive or ubiquitous
| and all of a sudden it doesn't matter how much you break the
| law along the way? That's so insane I don't even know where
| to start.
| ashoeafoot wrote:
| Worked for purdue
| Retric wrote:
| YouTube, AirBnB, Uber, and many _many_ others have all done
| stuff that's blatant against the law but gotten away with
| it due to utility.
| rcxdude wrote:
| Why not, considering copyright law specifically has fair
| use outlined for that kind of thing? It's not some
| overriding consequence of law, it's that copyright is a
| granting of a privilege to individuals and that that
| privilege is not absolute.
| gaganyaan wrote:
| That is not in any way the biggest defense
| Retric wrote:
| It's worked for many startups and court cases in the past.
| Copyright even has many explicit examples of the utility
| loophole; look at, say, https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....
| righthand wrote:
| That will probably happen to some extent if not already.
| However I think people will just stop publishing online if
| malicious corps like OpenAI are just going to harvest works for
| their own gain. People publish for personal gain, not to enrich
| the public or enrich private entities.
| Filligree wrote:
| However, I get my personal gain regardless of whether or not
| the text is also ingested into ChatGPT.
|
| In fact, since I use ChatGPT a lot, I get more gain if it is.
| CuriouslyC wrote:
| There's no point in having third world mechanical turk dens do
| finishing passes on AI output unless you're trying to make it
| worse.
|
| Artists are already using AI to photobash images, and writers
| are using AI to outline and create rough drafts. The point of
| having a human in the loop is to tell the AI what is worth
| creating, then recognize where the AI output can be improved.
| If we have algorithms telling the AI what to make and content
| mill hacks smearing shit on the output to make it look more
| human, that would be the worst of both worlds.
| ada1981 wrote:
| I'm still of the opinion that we should be allowed to train on
| any data a human can read online.
| cynicalsecurity wrote:
| Yeah, let's stop the progress because a few magazines no one
| cares about are unhappy.
| a57721 wrote:
| Maybe just don't use data from the unhappy magazines you don't
| care about in the first place?
| bastloing wrote:
| Who would be forever grateful if OpenAI removed all of The
| Intercept's content permanently and refused to crawl it in the
| future?
___________________________________________________________________
(page generated 2024-11-29 23:00 UTC)