[HN Gopher] GitHub Copilot is not infringing copyright
___________________________________________________________________
GitHub Copilot is not infringing copyright
Author : aarroyoc
Score : 256 points
Date : 2021-07-05 11:10 UTC (11 hours ago)
(HTM) web link (juliareda.eu)
(TXT) w3m dump (juliareda.eu)
| kube-system wrote:
| > If it were not possible to prohibit the use and modification of
| software code by means of copyright, then there would be no need
| for licences that prevent developers from making use of those
| prohibition rights (of course, free software licenses would still
| fulfil the important function of contractually requiring the
| publication of modified source code).
|
| The parenthetical backpedaling here is the _entire point_ of
| copyleft. If it wasn't, copyleft wouldn't exist -- people would
| just release their software as public domain.
|
| The opposite of "copyleft" isn't "copyright".
|
| The opposite of "copyleft" is "never published", in which case,
| copyright is irrelevant.
|
| There is plenty of commercial closed-source software based on
| software released under permissive licenses like BSD, MIT or
| Apache, because they are not copyleft.
| sombremesa wrote:
| I'd argue that copyright is still relevant when the source code
| isn't published. It's not too difficult to copy an algorithm
| from a binary even if you don't have the source.
| kube-system wrote:
| Fair. When I wrote that, I was thinking "not published" as in
| server software.
| malwrar wrote:
| Who cares if they're infringing copyright?
|
| Microsoft bought the place that has a lot of our code and now is
| going to try and sell us a tool that will regurgitate it back on
| demand. The entire software industry is already largely built on
| and advanced by the unpaid labor of open-source software project
| developers. GitHub, as a popular open source ally, could at least
| pretend to honor the gentleman's agreement to respect the
| open-source origins of a ton of its stack.
|
| If the tool was also open we probably wouldn't have nearly as big
| a problem, but I guess Microsoft has to recoup the cost of their
| completely unnecessary purchase.
| IshKebab wrote:
| Microsoft could easily do this even if they didn't own GitHub.
| Anyone can download all of the code on GitHub.
| chrisseaton wrote:
| But doesn't Copilot generate verbatim copies of entire copyrighted
| methods that implement non-trivial novel algorithms, including
| comments?
|
| The article doesn't seem to address this?
| creshal wrote:
| Yes. The author apparently did no research of her own and just
| assumed Github's FAQ was trustworthy.
| ramraj07 wrote:
| Yes and you did no further research on your own since GitHub
| already said it's going to fix that (and a competent engineer
| would know it's trivial to fix that as well).
| chrisseaton wrote:
| Data mined and regenerated it.
| rektide wrote:
| I've seen way too many screenshots of a dozen-line complete XHR
| wrappers being suggested[1] to complete a function to imagine
| Copilot as a generative machine. It's a somewhat fancy copy paste
| engine, with phenomenal search. But it's smuggled through enough
| complexity & machinery to obfuscate any legal obligations that
| might be attached to the original source material.
|
| The article does not set itself up to address this at all:
|
| > Since Copilot also uses the numerous GitHub repositories under
| copyleft licences such as the GPL as training material, some
| commentators accuse GitHub of copyright infringement, because
| Copilot itself is not released under a copyleft licence, but is
| to be offered as a paid service after a test phase.
|
| I'm all for discussion of whether Copilot itself has to be
| copyleft. But to me, the immediate concern is that Copilot seems
| like a way to take copyleft works and remove the copyleft license
| from those works.
|
| [1] https://mastodon.social/@cjd/106513694972486353
| hnfong wrote:
| Isn't that what GPL allows, and is what the AGPL is for if you
| don't want people to take your code and host it as an online
| service?
| rektide wrote:
| the service itself is a source-code-copier.
|
| GPL does not permit you to copy source code without
| attribution. this copier does not provide attribution.
|
| as i just said, i'm not so interested in debating the source-
| code-copier's licensing. i think it could go either way but i
| don't really care. the copied source code that the source
| code copier copies is interesting to me, and i feel like the
| stochastic parrot act bullshit they are pulling is massive
| massive sinfully evil bullshit without attribution. the
| stochastic parrot can't just ignore all the licensing of
| what it parrots out.
| hu3 wrote:
| Please copy my code. Reality is I'll be gone in 100 years tops
| and I'd be more than glad if my crappy code actually helps
| someone.
|
| As for attribution, we all learn by looking at code from all
| kinds of licenses. Between Stack Overflow, projects hosted in
| GitHub, libraries that sit on our vendor directories and even
| closed source projects there's a lot that is carried over to new
| projects without attribution.
|
| We're heading to a world where most projects are basically
| libraries glued together anyway. Standing on the shoulders of
| giants and all that.
|
| The dream of an omniscient pair programming buddy is slowly
| coming to fruition and I for one welcome it.
|
| Copilot is just a tool, a fancy search engine for the code that's
| available online. Projects should be judged by the way they use
| Copilot just like I'm judged if I misuse my car.
|
| I couldn't care less whether my name is shoved in some ever
| increasing CONTRIBUTOR.md file that no one but machines will
| read.
|
| I'm actually going to start documenting blocks of code more
| thoroughly so Copilot can better infer what each block does.
| COMMENT___ wrote:
| TLDR; GitHub will eventually add some kind of "data usage
| reporting" utility that could show which parts of your final code
| made with the help of this CuckPilot could potentially infringe
| copyright, with links to other known sources of these parts of
| code. Then they will tell you that it is your responsibility to
| ensure that your final code does not have copyright issues.
| [deleted]
| Syzygies wrote:
| Whatever the law, when does learning from what we read devolve
| into plagiarism?
|
| The poster child for this category would be those programs that
| generate nonsense English text that recognizably resembles a
| known author. They choose the next character at random,
| conditionally based on the previous characters. Too short a
| context, and the results are gibberish. Too long a context, and
| the results are plagiarism.
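|
| A minimal Python sketch of the kind of program described above,
| assuming a character-level model. The names `build_model` and
| `generate` are illustrative and not taken from any particular
| tool. With a small `order` the output is mostly gibberish; with a
| large one it tends to reproduce the source text verbatim.

```python
import random
from collections import defaultdict


def build_model(text, order):
    """Map each `order`-character context to the characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model


def generate(text, order, length, seed=0):
    """Emit up to `length` characters, each chosen at random
    given the previous `order` characters."""
    rng = random.Random(seed)
    model = build_model(text, order)
    out = text[:order]
    while len(out) < length:
        context = out[len(out) - order:]
        followers = model.get(context)
        if not followers:  # context only occurs at the very end of the text
            break
        out += rng.choice(followers)
    return out
```

| When every context of the chosen length occurs only once in the
| source, each step has exactly one possible continuation, so the
| "random" output is simply a copy of the input.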
| [deleted]
| phoe-krk wrote:
| _> On the other hand, the argument that the outputs of GitHub
| Copilot are derivative works of the training data is based on the
| assumption that a machine can produce works. This assumption is
| wrong and counterproductive. Copyright law has only ever applied
| to intellectual creations - where there is no creator, there is
| no work. This means that machine-generated code like that of
| GitHub Copilot is not a work under copyright law at all, so it is
| not a derivative work either. The output of a machine simply does
| not qualify for copyright protection - it is in the public
| domain. That is good news for the open movement and not something
| that needs fixing._
|
| This is very good news. This line of thought implies that we can
| legally feed all proprietary code into GitHub Copilot in order to
| teach it all the patented and secret tricks of the companies we
| can see (since data mining is not copyright infringement) in order
| to have it print those secrets back when we ask it to (so they
| become public domain).
|
| /s
| amelius wrote:
| > This line of thought implies that we can legally feed all
| proprietary code into GitHub Copilot
|
| And lines such as:
|
| char *base64_pirated_mp4_file = "YW55IGNhcm5hbCBwbGVhc3VyZS4...";
| shp0ngle wrote:
| Patent is not copyright.
|
| Secret tricks are usually a company secret, not really
| protected by copyright either.
|
| NDAs and similar are also not copyrights.
|
| Even trademark is not copyright.
| tgsovlerkhgsel wrote:
| "Patented and secret tricks" are not protected by copyright, if
| the output was an actual reimplementation of an idea instead of
| Copilot regurgitating existing code
| (https://news.ycombinator.com/item?id=27710287).
|
| The specific implementations are protected by copyright, and
| the ideas may be protected by patents. In the case of "secret"
| tricks, they may be protected by trade secret laws, but not if
| it's in a public GitHub repo.
| jakobdabo wrote:
| > The output of a machine simply does not qualify for copyright
| protection
|
| Good, does the `cp` or `cat` command qualify for the "output of
| a machine"? Now I can uncopyright everything, hooray. What
| about converting a video or an image to another format? Again,
| it's just output of a machine.
|
| Added:
|
| Really, I would've been happy if this was the situation, as
| I'm, in general, against patents and copyright (in the form
| that they are now being used).
| chalst wrote:
| > Copyright law has only ever applied to intellectual creations
| - where there is no creator, there is no work
|
| This is, at best, an oversimplification. Code compiled from
| copyrighted source code is derived work inheriting that
| copyright according to long-established law. This is exactly
| why the legal issues around machine learning applied to
| copyrighted corpora have been contentious.
| b3morales wrote:
| Wow, that's quite a strawman/bait-and-switch from the article;
| thanks for highlighting it.
|
| If Copilot is just a machine -- a glorified typewriter -- then
| the machine's _operator_ is responsible for its output.
|
| Or does the author seriously want to claim that any code added
| via Copilot to a proprietary codebase would not be proprietary
| as well? If that were true, Copilot's userbase is going to
| be...limited.
| syshum wrote:
| So then we come full circle on infringement: if the operator
| is responsible for the code produced by co-pilot, then the
| article's claim that it is not infringement because it is
| machine-created fails, as the operator is responsible.
|
| The argument in the article is that all code made by co-pilot
| is not infringement because there is no copyright attached.
| You seem to imply that all code made by co-pilot is
| copyrighted by the operator of co-pilot, and thus would fall
| under copyright law and could be infringing.
| b3morales wrote:
| > The argument in the article is that all code made by co-
| pilot is not infringement because there is no copyright
| attached.
|
| Right; I don't see how that's even remotely a tenable
| argument. I think the article is trying to eat its cake and
| have it too.
| bsza wrote:
| Except Copilot itself is not open source, so your only way to
| feed that proprietary code into it would be to upload it to
| github, which would make _you_ an infringer.
| heavyset_go wrote:
| As you type you send the code you're writing to Copilot's
| backend.
| bsza wrote:
| From the copilot telemetry docs [0]:
|
| > _The GitHub Copilot collects activity from the user's
| Visual Studio Code editor, tied to a timestamp, and
| metadata._
|
| [...]
|
| > _This data will only be used by GitHub for:_
|
| [...]
|
| > - _Improving the underlying code generation models, e.g.
| by providing positive and negative examples (but always so
| that your private code is not used as input to suggest code
| for other users of GitHub Copilot)_
|
| I'm inclined to believe this. After all, why would they
| taint the training data with code from a random guy who is
| _asking_ for help when they have more than a hundred
| thousand repos with 100+ stars?
|
| [0] [https://docs.github.com/en/github/copilot/about-
| github-copil...]
| TeMPOraL wrote:
| > _your only way to feed that proprietary code into it would
| be to upload it to github, which would make you an infringer_
|
| _Somebody_ has to do it. Someone ready to take one for the
| team (or better yet, already skilled in software piracy, so
| that it's not a big deal for them). Then, if this argument
| holds, _everyone_ gets the result in public domain.
|
| As I see it, the argument presented in this post essentially
| makes Copilot a universal copyright laundering
| machine. Not just for code, but for anything that can be
| represented digitally.
|
| Obviously this won't stand. While I can see Github ending up
| protected from all liability, the only way for this to not
| kill copyright is for the _users_ of Copilot to become at
| risk of copyright infringement. Which kills the whole value
| proposition of Copilot.
| dTal wrote:
| This seems like a good time to ask what the heck _is_ the
| value proposition of a thing like this. Are people really
| going to use the output of this blindly? And if they're
| going to audit every line - is that really easier than just
| writing the code yourself? Honestly, _at best_ it feels
| like a machine for introducing pernicious bugs that "look
| right" but are semantically wrong. (Which reminds me - were
| any of the Underhanded C Contest entries in the training
| data?)
| TeMPOraL wrote:
| That's a very good question. On the surface, the idea
| _seems_ to be helping people write code faster - but as
| you observe, properly auditing generated code is _more_
| work than actually writing it from scratch.
|
| Best I can think of in terms of real value delivered, is
| helping people with first drafts, breaking through the
| "staring at an empty page" problem. But even with this, I
| feel it's too risky compared to doing a StackOverflow
| search, where you can at least see some explanations,
| discussions, and other relevant context.
|
| It's definitely an interesting vision demonstrator -
| despite not being quite there, it lets us see that a tool
| like this that _actually worked well_ (in terms of
| generating correct, explainable, license-respecting code)
| would be very useful.
| visarga wrote:
| Assume GitHub designs a filter to detect similarities to
| the training set and displays an attribution link with
| the result, as a comment. It's no different from using a
| search engine to find the code and putting it in your
| project, especially since the code is public already and
| visible to multiple search engines. You are ultimately
| responsible, just like you are every day with Google.
|
| But the model has on average just 1 regurgitation in 10
| weeks per user, so you can just discard all of them.
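|
| A rough Python sketch of the kind of similarity filter described
| above, assuming whitespace tokenization and n-gram overlap. This
| is a hypothetical illustration, not GitHub's actual mechanism;
| `token_ngrams`, `find_matches`, the corpus layout and the default
| thresholds are all made up for the example.

```python
def token_ngrams(code, n):
    """Return the set of n-token windows in whitespace-tokenized code."""
    tokens = code.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_matches(suggestion, corpus, n=6, threshold=0.5):
    """Return (source, overlap) pairs for corpus entries that share at
    least `threshold` of the suggestion's n-grams, best match first.

    `corpus` maps a source identifier (e.g. a repository path) to its code.
    """
    grams = token_ngrams(suggestion, n)
    if not grams:  # suggestion shorter than n tokens: nothing to compare
        return []
    matches = []
    for source, code in corpus.items():
        overlap = len(grams & token_ngrams(code, n)) / len(grams)
        if overlap >= threshold:
            matches.append((source, overlap))
    return sorted(matches, key=lambda match: match[1], reverse=True)
```

| Each returned source could then be shown as an attribution link
| next to the suggestion, or the suggestion could be discarded.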
| visarga wrote:
| Almost all the output of Copilot is not an exact copy of
| any code in the training set. You discard the 0.001% of
| generations that are similar to the training data and use
| the rest.
| captaincaveman wrote:
| If I understand what is being stated correctly; even if I assert
| a prohibition in my licence for my creative work (code) not to be
| used by Copilot (or any other machine learning model as training
| data), it wouldn't matter as it's not covered by Copyright?
| yakubin wrote:
| _> The output of a machine simply does not qualify for copyright
| protection - it is in the public domain._
|
| Does it mean that compiler output does not qualify for copyright
| protection and I may legally share copies of MS Word via torrent?
| dragonwriter wrote:
| > Copyleft does not benefit from tighter copyright laws
|
| Of course it does, at least for the goal copyleft serves for
| RMS-style Free Software ideologues. While copyleft may be
| motivated by an ideology that prefers _no_ copyright protections,
| at least for software, it relies on copyright maximalism to avoid
| nonfree derivatives. From the advocates' viewpoint, the worst
| situation is a copyright regime that is strong enough to allow
| nonfree software to exist but too weak to support an iron wall
| that keeps software built by ideological opponents of nonfree
| software from being used to advance nonfree software.
| creshal wrote:
| > On the other hand, the argument that the outputs of GitHub
| Copilot are derivative works of the training data is based on the
| assumption that a machine can produce works. This assumption is
| wrong and counterproductive. Copyright law has only ever applied
| to intellectual creations - where there is no creator, there is
| no work.
|
| Cool. I'll just train my new AI on 20 different copies of the
| same Disney movie and have it generate a new movie. Checkmate,
| lawyers!
| alpaca128 wrote:
| "The model might be slightly overfitted, but no creator, no
| work"
| nkrisc wrote:
| I don't think the judge will care _how_ you arrived at the
| copyright violation, only that you did. But hey, I'd love to
| see that court case anyhow, maybe I'm wrong.
| TeMPOraL wrote:
| Yes, and this principle should apply equally well to Copilot
| - in particular, to anyone _using_ code provided by Copilot
| in their projects.
| arcturus17 wrote:
| You do understand that's not how laws work in general, right?
|
| The court of law would probably have you unveil and tear apart
| your process and find that you were trying to plagiarize in a
| roundabout way.
| xdennis wrote:
| > you were trying to plagiarize in a roundabout way
|
| So using machine learning with one movie is illegal, but
| using it with a million isn't?
| mthoms wrote:
| Generally speaking - probably yes. That's because the
| output would be (a) transformative and (b) not likely to
| affect the profitability of the original work (at least not
| directly).
|
| Note: I'm not really trying to comment specifically about
| the code/movie examples - just the general notion that the
| more input there is (from different sources), the
| likelihood that the use will be considered "Fair Use"
| increases.
|
| https://en.wikipedia.org/wiki/Fair_use
| Sr_developer wrote:
| Do you think this Copilot is some sort of advanced AGI who
| just became a genius programmer? Almost every piece of code
| that it "generates" you will find it almost verbatim in one
| or several public repos.
| creshal wrote:
| > The court of law would probably have you unveil and tear
| apart your process and find that you were trying to
| plagiarize in a roundabout way.
|
| Well, yes, that's the point.
| arcturus17 wrote:
| I'm not sure what the point is. That the argument of
| copying Disney movies is not a good analogy, or that courts
| will find plagiarism in software copyright cases involving
| Copilot?
| TeMPOraL wrote:
| The point is that Copilot is effectively doing the Disney
| movie thing, just with code, and yet this article argues
| this is all fine. As it is, the article turns Copilot
| into a universal copyright laundering machine.
| phoe-krk wrote:
| > The court of law would probably have you unveil and tear
| apart your process and find that you were trying to
| plagiarize in a roundabout way.
|
| This is the whole idea. Copilot is spitting out considerable
| chunks of code that is licensed under GPL and it will be up
| to GitHub to prove that Copilot is _not_ trying to plagiarize
| this code in a roundabout way.
|
| At the very least, Copilot should have separate data stores
| for different groups of licenses: public domain, attribution-
| only, copyleft, etc. That would already make it much more
| usable than the current "here's some code, it came from I
| don't know where, don't ask me" that literally looks like
| black market deals except they are GitHub-branded.
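|
| A small Python sketch of the per-license grouping suggested
| above. The `LICENSE_GROUPS` table and `partition_by_license` are
| hypothetical, and the three groups are a deliberate
| simplification (Apache-2.0, for instance, carries conditions
| beyond attribution).

```python
from collections import defaultdict

# Hypothetical, simplified mapping from SPDX-style license identifiers
# to the coarse groups named in the comment above.
LICENSE_GROUPS = {
    "CC0-1.0": "public-domain",
    "Unlicense": "public-domain",
    "MIT": "attribution-only",
    "BSD-3-Clause": "attribution-only",
    "Apache-2.0": "attribution-only",
    "GPL-3.0-only": "copyleft",
    "AGPL-3.0-only": "copyleft",
}


def partition_by_license(repos):
    """Group (repo_name, license_id) pairs into one store per license group.

    Repositories with an unrecognized or missing license go to "unknown".
    """
    stores = defaultdict(list)
    for name, license_id in repos:
        stores[LICENSE_GROUPS.get(license_id, "unknown")].append(name)
    return dict(stores)
```

| A model trained on only one store would then have a known license
| group for everything it could reproduce.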
| visarga wrote:
| > Copilot is spitting out considerable chunks of code that
| is licensed under GPL
|
| One regurgitation in 10 weeks per user. Not considerable.
| Could be just skipped with a simple search by Github.
| hmfrh wrote:
| > You do understand that's not how laws work in general,
| right?
|
| The only reason the law doesn't work this way for Microsoft
| Copilot is because the copyright holders are individuals who
| do not have the capital or expertise to file suit.
|
| If Microsoft instead released a video editor addon that was
| trained on Disney movies and which would sometimes insert
| scenes of _any_ Disney movie you can bet your ass we wouldn't
| be having the same discussion.
| visarga wrote:
| Comparing code to movies - in code even a single char
| difference can change the meaning of everything, in movies
| - you can skip whole scenes and still get the meaning. I
| don't think the two are compatible, they are judged by
| different standards.
| CyberRabbi wrote:
| > Works licensed under copyleft may be copied, modified and
| distributed by all, as long as any copies or derivative works may
| in turn be re-used under the same license conditions. This
| creates a virtuous circle, thanks to which more and more
| innovations are open to the general public.
|
| She claims that Copilot advances the goals of copyleft, but
| Copilot does not create a "virtuous circle" of generating more
| public IP. The customers of Copilot use it to extract public
| work for themselves and are not compelled to contribute back.
|
| Copilot is anti-FOSS plain and simple.
| oolonthegreat wrote:
| Such a weird argument: "Copyleft people should not argue for
| better copyright". What does that even mean?
| pessimizer wrote:
| She's arguing that copyleft people are arguing for an effective
| extension of copyright into places IP lobbyists are currently
| fighting for. It's not a good framing. She's saying that we
| shouldn't argue for copyright to be consistent if we're against
| copyright - arguing that we should make a moral argument
| against a legal situation.
|
| It's as if we couldn't argue against drug companies being
| allowed to sell heroin if we were anti-drug war and drugs would
| remain illegal. It's a strategy argument that leads nowhere. If
| making machine-written works also subject to copyright results
| in all possible songs being copyrighted by a machine, _that's a
| good outcome._ It's obviously absurd and weakens the entire
| concept.
|
| We should demand consistency.
|
| If this is fine, we might as well stop enforcing the GPL, too.
| It's a trick of copyright to further the cause of anti-
| copyright. I'm sure somebody can write an "auto-fork" that will
| digest GPL'd code and rearrange and rephrase it in order to
| spit out a clone.
| codesections wrote:
| Julia Reda's analysis depends on the factual claim in this key
| passage:
|
| > In a few cases, Copilot also reproduces short snippets from the
| training datasets, according to GitHub's FAQ.
|
| > This line of reasoning is dangerous in two respects: On the one
| hand, it suggests that even reproducing the smallest excerpts of
| protected works constitutes copyright infringement. This is not
| the case. Such use is only relevant under copyright law if the
| excerpt used is in turn original and unique enough to reach the
| threshold of originality.
|
| That analysis may have been reasonable when the post was first
| written, but subsequent examples seem to show Copilot reproducing
| far more than the "smallest excerpts" of existing code. For
| example, the excerpt from the Quake source code[0] appears to
| easily meet the standard of originality.
|
| [0]: https://news.ycombinator.com/item?id=27710287
| joe_the_user wrote:
| [Ianal]
|
| The thing about the situation is that "copying code you found
| on the Internet" certainly isn't automatically, always legal.
| That you engaged in copying X from the Internet doesn't make
| it illegal either. Your source for the source code you
| incorporate into a product doesn't matter; what matters is
| whether that code is copyrighted and what the license terms (if
| any) are (and people saying "copyright doesn't apply to
| machines" are wildly misinterpreting things imo).
|
| Given what's come out, it seems plausible that you could coax
| the source of whatever smallish open source project you wished
| out of copilot. Claiming copyright on that code wouldn't be
| legal regardless of Copilot.
|
| Whether Microsoft/Github would be liable is another question as
| far as I can tell. I mean, youtube-dl can be used to violate
| copyright but it isn't liable for those violations. The only
| way Copilot is different from youtube-dl is that it tells its
| users everything is OK and "they told me it was OK" is
| generally not a legal defense (IE, I don't know for sure but
| I'd be shocked if the app shielded its users from liability).
| All the open source code is certainly "free to look at" and
| Copilot putting that on a programmer's screen isn't doing more
| than letting the programmer look at it until the programmer
| does something (incorporating it into a released work they
| claim as their own would be such an act).
|
| The question is how easily a programmer could accidentally come
| up with a large enough piece of a copyrighted work using
| Copilot. That question seems to be open.
|
| TL;DR; My entirely amateur legal opinion is that Copilot can't
| violate copyright but that its users certainly can.
| riedel wrote:
| I would love to try a session of clean room reverse engineering
| using copilot. I would bet you get reasonably far for very
| common libraries with not much effort. The question would be if
| such compression/decompression would infringe copyright.
| make3 wrote:
| It can, that does not mean that it will, in any case other than
| people actively probing it for that.
| creshal wrote:
| I'm not sure if making an "analysis" without doing any research
| whatsoever is reasonable.
| codesections wrote:
| I'm not sure either --which is why I said "may have been
| reasonable" instead of "was reasonable" :)
|
| I can see an argument for doing your own research, but I can
| also see an argument for basing an analysis on what GitHub
| said in the FAQ -- I'm honestly a bit surprised that
| Microsoft's lawyers let them say that with a product that can
| reproduce such large blocks of verbatim code.
| creshal wrote:
| My guess is that their lawyers weren't consulted, and that
| the Github people just shipped it on their own.
| hnfong wrote:
| It's not obvious that _Microsoft_ is violating copyright
| yet. The main concern is whether the product makes others
| liable.
|
| So it could be that the executives really wanted to do
| it, and the lawyers thought "OK, _technically_ we're not
| violating anything...."
| yunohn wrote:
| That literally cannot happen in FAANG/MS, esp not when
| the CEO announces the product in a public blog post.
| swiftcoder wrote:
| Yep. Individuals in a FAANG don't have the ability to
| launch a product without review. Just drafting a press
| release for a new product involves Comms oversight and
| VP-level approval.
| denton-scratch wrote:
| > appears to easily meet the standard of originality
|
| It's an algorithm. In the olden days, you couldn't copyright an
| algorithm, even an original one. There's only so many ways you
| can express an algorithm; the best ways are using code. So is
| it the intention that rewriting in Python an algorithm
| previously expressed in C would be infringing? Suppose the
| algorithm is re-expressed in English?
|
| Allowing copyrights on algorithms is tantamount to allowing
| copyrights on thought-processes.
|
| Here come the thought-police. Take cover.
| geofft wrote:
| Copyright is (and has been, since the earliest days) about
| protecting the _creative expression_ of an idea.
|
| You can't copyright an algorithm, but you certainly can
| copyright the expression of an algorithm in Python. You
| cannot copyright the words of the English language and their
| meanings, but Noah Webster absolutely did copyright his
| dictionary, which was a creative expression of their
| definitions (and lobbied for the first increase to US
| copyright law). Webster wasn't the "thought police" for
| trying to copyright people's understanding of words in
| English, because he didn't and couldn't copyright them; he
| copyrighted his expression of what words meant.
|
| If you read the creative expression of an algorithm in Python
| and then re-express it in English, then sure, copyright
| protection doesn't extend to that re-expression. But Copilot
| isn't doing that, it's quite clearly reproducing parts of the
| original creative expression of an algorithm, not the
| algorithm itself.
|
| Here's an easy way to demonstrate it: open up a source file
| in any language _other_ than C and try to get Copilot to spit
| out an implementation of Quake's fast-inverse-square-root
| algorithm. You will very quickly discover that Copilot
| doesn't "know" the algorithm; it only "knows" the specific
| creative expression of it (comments included).
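|
| To make the idea/expression distinction concrete, here is an
| independent Python sketch of the widely documented fast inverse
| square root technique: a bit-level initial guess followed by one
| Newton-Raphson step. It is written from the published description
| of the algorithm, not from the Quake source, and assumes a
| positive input that fits in a 32-bit float. `fast_inverse_sqrt`
| is an illustrative name.

```python
import struct


def fast_inverse_sqrt(x):
    """Approximate 1 / sqrt(x) for a positive 32-bit float value x."""
    # Reinterpret the float's 32 bits as an unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Halving the bit pattern and subtracting it from a tuned constant
    # gives a rough first guess of the inverse square root.
    bits = 0x5F3759DF - (bits >> 1)
    y = struct.unpack("<f", struct.pack("<I", bits))[0]
    # One Newton-Raphson iteration refines the guess.
    return y * (1.5 - 0.5 * x * y * y)
```

| The result is an approximation, typically within a fraction of a
| percent of the exact value after the single refinement step.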
| eesmith wrote:
| There are no thought police here.
|
| In the US, copyright may include the choice of variable
| names, the organization of the code into modules and
| functions, and other aspects where there are creative
| choices that may be protected under copyright law.
|
| The relevant process is described at
| https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari... ,
| which comes from the court case at
| https://en.wikipedia.org/wiki/Computer_Associates_Internatio....
| nearly 30 years ago:
|
| > the court presented a three-step test to determine
| substantial similarity, abstraction-filtration-comparison.
| This process is based on other previously established
| copyright principles of merger, scenes a faire, and the
| public domain.[1] In this test, the court must first
| determine the allegedly infringed program's constituent
| structural parts. Then, the parts are filtered to extract any
| non-protected elements. Non-protected elements include:
| elements made for efficiency (i.e. elements with a limited
| number of ways it can be expressed and thus incidental to the
| idea), elements dictated by external factors (i.e. standard
| techniques), and design elements taken from the public
| domain. _Any of these non-protected elements are thrown out
| and the remaining elements are compared with the allegedly
| infringing program's elements to determine substantial
| similarity._
|
| Emphasis mine. This specifically highlights that your example
| ('only so many ways you can express an algorithm') is _not_
| protected under US copyright law.
|
| The originality requirement only applies to other aspects of
| the generated code, which in this case would include the
| comments that Copilot generated, and which clearly are not
| required for the algorithm to work.
|
| For thought police like you describe, look to patent law.
| jen20 wrote:
| > So is it the intention that rewriting in Python an
| algorithm previously expressed in C would be infringing?
|
| Yes, a port from language X to Y is widely considered a
| derived work. Whether it is infringing is a separate
| question.
| dang wrote:
| Please omit flamebait from your HN comments. It tends to
| produce flamewars, which are tedious and nasty. Your comment
| would be fine without the last two sentences.
|
| https://news.ycombinator.com/newsguidelines.html
| denton-scratch wrote:
| I didn't realise I was perpetrating flamebait; my last two
| sentences were meant as rhetorical hyperbole (and I wasn't
| targeting anyone here!)
|
| At any rate, I like it here; so I'll try to figure out how
| what I said was flamebait, and try not to say such things
| again.
|
| Sorry.
|
| [Edited upon re-reading]
| glitchc wrote:
| Clarification: One cannot patent an algorithm, but an
| implementation in source code can certainly be copyrighted.
| Wowfunhappy wrote:
| Huh, that's interesting. While I'm hesitant to suggest that
| what the world needs is even more patents, this doesn't
| make immediate sense to me.
|
| Let's say someone comes up with a new sorting algorithm,
| which completes in fewer cycles than was previously believed
| possible. Sure, it's math, but isn't that a new, creative
| expression? Don't we want to encourage them to publish
| their algorithm (one of the key purposes of patents--this
| way, anyone can use it after 20 years), as opposed to
| keeping it hidden from the world?
|
| It makes more sense to me than most software patents
| (admittedly, a low bar to clear). And if the patent office
| is doing its job (big if), the patents should only be
| granted for algorithms which are sufficiently novel.
| denton-scratch wrote:
| A new super-fast sorting algorithm (not just a few
| cycles, but something that actually changes the O-number)
| would obviously be a fantastic boon - I would want the
| inventor to benefit from his cleverness.
|
| But nowadays I think patent law isn't the right way to do
| that; trade secrets should be enough. I don't think that
| what is disclosed to the public in patent applications is
| of enough value to justify a long monopoly. It's not
| necessarily a problem with the written law; patents are
| horrible because of the way courts apply them.
| gruez wrote:
| >One cannot patent an algorithm
|
| software patents?
| tovej wrote:
| An algorithm is maths, you can't patent maths. Patent
| lawyers and business people have however somehow managed
| to convince courts/patent authorities that configurations
| of computer systems are patentable (or some similar
| argument), which then makes software patentable (IANAL
| but I think it's something like this).
|
| Either way, the copyright of source code is separate from
| that. Copyright is for the text of a program (the source
| code), that might e.g. implement an algorithm. The
| algorithm itself cannot be patented or otherwise legally
| protected.
| ModernMech wrote:
| An algorithm is maths, but a lot of code isn't
| algorithmic. Algorithms provably halt, and most software
| doesn't halt, let alone provably. Operating systems,
| browsers, games, etc. are non-algorithmic. It's hard to
| claim that something like a browser is just math and
| therefore deserves no IP protections.
| denton-scratch wrote:
| An algorithm is a reasoning procedure. A program (e.g. a
| browser) embodies many algorithms.
|
| I've not come across your stipulation that for a thing to
| count as an algorithm it must provably halt, but I can go
| along with that. So I'd argue that in most cases, any
| function or subroutine provably terminates, even if the
| program embodying it is not supposed to terminate.
|
| I also don't agree that an algorithm is "just maths". At
| least, not if you then pivot to saying that a browser
| isn't "just maths". Any operation performed by a computer
| is "just maths", because what a CPU does is basically
| arithmetic and branching.
|
| I don't think it's a question of what does and doesn't
| "deserve" IP protection. The source code of a browser is
| clearly an original work, and entitled to protection. But
| the ideas and procedures it embodies are not "works", and
| copyright isn't supposed to apply to ideas and
| procedures.
|
| I'm against the very idea of "intellectual property". It
| must have seemed a good idea at the time, but I think
| patents and copyrights have become monsters that inhibit,
| rather than encourage, innovation and creativity.
| ModernMech wrote:
| > I also don't agree that an algorithm is "just maths".
| At least, not if you then pivot to saying that a browser
| isn't "just maths".
|
| Algorithms are distinguished by their proofs of
| correctness. This elevates them above simple procedures.
| The halting problem tells us that there is no easy and
| automatic way to determine whether or not a program
| terminates. So when we find one, it's like discovering a
| mathematical law. The proof of an algorithm's correctness
| is expressed independently of any programming language or
| platform. What else could they be other than math?
|
| Things like browsers, games, operating systems, e-mail
| clients, music players etc. are not treated this way.
| They are not formally specified. They are implemented in
| the context of a machine and an actual running
| environment. The source code of the program usually
| doubles as its specification. It's very different
| compared to an algorithm.
|
| I agree IP as a concept is bad, but this is the way of
| the world at least for now. Given where we are, for me it
| makes sense to draw a line between algorithms and
| software in the context of copyright.
| erk__ wrote:
| On top of the sister comment, software patents are not a
| thing in general in Europe, which I would imagine is the
| author's area of expertise.
| mytailorisrich wrote:
| It's not the algorithm that's copyrighted. It's the source
| code that implements it.
| Joeri wrote:
| But that fast inverse square root example is particularly
| interesting because it is also a derivative work. Carmack did
| not invent it, and several variations of it had been passed
| around over time.
|
| Algorithms should not be subject to copyright, that way lies
| madness. It would prevent new generations from building on top
| of the work of their predecessors, because copyright lasts a
| very long time. The amounts of code that github copilot
| reproduces fall squarely into the "shouldn't be subject to
| copyright" domain for me, even if they pass the bar for
| originality.
| klodolph wrote:
| Something which is a "derivative work" is still copyrighted.
| In fact, by definition, a "derivative work" is copyrightable.
| It's the minimum threshold at which something, based on
| something else, gets its own, new copyright.
|
| The algorithm is not copyrighted, but the source code of the
| function _is_ copyrighted. You could learn how the algorithm
| works by reading the function, and then write your own
| function that implements the same algorithm. Algorithms are
| not copyrightable, they are not subject to copyright. Source
| code is copyrightable.
|
| Copilot is not reproducing just the algorithm, it is spitting
| out large chunks of the copyrighted source code, verbatim.
| modeless wrote:
| The funny thing about the Quake function is, id Software is
| almost certainly not the origin of the code. They copied it
| from somewhere else, possibly added profane comments, then
| slapped GPLv2 on it. Did they even have the right to do that?
| From an IP absolutist standpoint, probably not.
|
| https://www.beyond3d.com/content/articles/8/
| fridif wrote:
| They did not copy the implementation, they copied the general
| idea of what the algorithm should do.
|
| Do not go down this line of reasoning, otherwise we will be
| copyrighting the concept of for loops.
| dahart wrote:
| > they copied the general idea of what the algorithms
| should do. Do not go down this line of reasoning
|
| Too late, patents pick up where copyright ends, to protect
| general algorithmic ideas, not just implementations. And we
| have lots of patents on things that seem trivial now,
| including for-loops (just see how many patents depend on "a
| multiplicity"). Look - here's a helpful lawyer's template
| for including for-loops as a claim in your own patents:
| https://www.natlawreview.com/article/recursive-and-
| iterative...
|
| Another example is the famous XOR patent
| https://patents.google.com/patent/US4197590/en
|
| EFF keeps a blog on stupid patents
| https://www.eff.org/issues/stupid-patent-month
| fridif wrote:
| Anyone who believes in a free and open society should do
| away with all copyrights and patents.
|
| Anyone who thinks that licensing will have an effect on
| what is happening in reality is severely misguided.
| dahart wrote:
| > Anyone who believes in a free and open society should
| do away with all copyrights and patents.
|
| Free and open sound good to me! What do they mean
| exactly? I guess it's a non-debatable fact that
| copyrights and patents are abused by many big companies
| and patent trolls, but doing away with the system does
| seem extreme, it has also protected deserving individuals
| on occasion, no? You are saying that it should _always_
| be legal to copy someone else's code / inventions
| without giving them any credit or compensation?
|
| > Anyone who thinks that licensing will have an effect on
| what is happening in reality is severely misguided.
|
| I'm not sure I understand what you mean; lots of
| licensing activity does have a measurable effect on
| reality. This article is only a small example, but people
| get sued all the time over taking code and using it
| without licensing it.
| modeless wrote:
| > They did not copy the implementation, they copied the
| general idea of what the algorithm should do
|
| [Citation needed]
| fridif wrote:
| If you wrote an algorithm in the early 80s that did x+y+z
|
| And then I saw your source code and in the late 80s I
| changed the variable names, function name, and logic to
| be x+y+z+0.1
|
| And then I told my friend John that there's a super cool
| algorithm that adds numbers together, and he made some
| more changes to it and compiled it for a different
| platform...
|
| Has anybody broken the law in your mind?
|
| EDIT: because it would seem that the original authors
| (among them Cleve Moler) don't have any issue with what
| transpired
| hnfong wrote:
| The GP's argument is that you don't have evidence that
| they didn't copy the whole function verbatim.
|
| Is there a source that said they changed variable and
| function names and modified the logic?
|
| > because it would seem that the original authors (among
| them Cleve Moler) don't have any issue with what
| transpired
|
| Yet. Without an explicit license there is no basis to
| release it under the GPL (if the code was copied verbatim
| or had insufficient re-writing). What if the heirs of the
| copyright owner wanted to assert their rights? Is there a
| doctrine that if you don't assert your rights you lose
| them? (Presumably applies to trademarks, but I don't
| think this is the case for copyrights)
| fridif wrote:
| The source code in question is over 40 years old and most
| likely doesn't exist anymore in its original form.
|
| What do we do then? The burden of proof for infringement
| is on original authors, and they haven't done so for 40
| years.
|
| In the late 1700s and early 1800s, Britain had to take
| measures to prevent visiting Americans and others from
| memorizing the designs of their new high tech machinery
| like the steam engine and the power loom.
|
| Where do we draw the line? Shut down the internet until
| we create a massive copyright detection firewall?
|
| No, we live with the copying and constantly evolve and
| adapt our business. Death to all patent trolls.
| hnfong wrote:
| > Where do we draw the line?
|
| I won't even claim that people must necessarily follow
| the law. Copyright law is inconsistent at best, and
| notoriously hard to follow to the letter (and often
| ridiculous). In practice lawyers assess the legal risk
| and weigh the outcomes.
|
| I never intended to discuss what we should do, and I
| definitely did not propose shutting down the internet...
|
| The original discussion was such:
|
| > > They did not copy the implementation, they copied the
| general idea of what the algorithm should do
|
| > [Citation needed]
|
| You said the original authors did not complain, which is
| neither here nor there, as I pointed out. There is still
| some theoretical legal risk if you copy with the owner's
| knowledge but not express consent. The fact that the
| burden of proof is on the authors is true but that they
| have not brought a claim does not mean they cannot prove
| infringement.
|
| And in case I haven't made it clear, I don't think it's a
| bad idea to assume the function is under GPL, I just
| don't think there's a basis for claiming what you
| originally claimed, and there is still _some_ level of
| (probably acceptable) risk if you take the purported
| license of source code as-is.
| robertlagrant wrote:
| It's not the actual copying of the idea, but the verbatim
| reproduction of the function, comments and all. I think
| people somehow thought that copilot could write code, and so
| verbatim reproduction was surprising to them.
| jonas21 wrote:
| A quick search shows that this snippet, including comments,
| is included in thousands of Github repos [1], so it's not
| surprising that the model learned to reproduce it verbatim.
|
| It's such a famous snippet that it's even included in full
| on Wikipedia [2].
|
| I wouldn't be surprised if the next version of Copilot
| filtered these out.
|
| [1] https://github.com/search?q=0x5f3759df+what+the+fuck&ty
| pe=co...
|
| [2] https://en.wikipedia.org/wiki/Fast_inverse_square_root#
| Overv...
| eterevsky wrote:
| The excerpt from Quake code is literally one of the most famous
| functions out there. There is no wonder that it was reproduced
| verbatim. The share of such code, according to GitHub, is really
| small.
|
| It would be quite straightforward to write an additional filter
| that would check the generated code against the training corpus
| to exclude exact copies.
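A filter along these lines could look roughly like the following. This is a minimal sketch, not GitHub's implementation: the function names (`build_index`, `is_verbatim_copy`), the 8-token window, and the 0.5 threshold are all assumptions chosen for illustration. It hashes every run of `n` whitespace-separated tokens in the training corpus, then flags a suggestion when a large share of its own runs appear in that index.

```python
import hashlib


def normalize(code):
    """Split code into whitespace-separated tokens so formatting differences are ignored."""
    return code.split()


def shingles(tokens, n):
    """Yield a hash for every run of n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        yield hashlib.sha1(window.encode("utf-8")).hexdigest()


def build_index(corpus, n=8):
    """Collect the shingle hashes of every document in the (hypothetical) training corpus."""
    index = set()
    for document in corpus:
        index.update(shingles(normalize(document), n))
    return index


def is_verbatim_copy(candidate, index, n=8, threshold=0.5):
    """Return True when at least `threshold` of the candidate's shingles occur in the index."""
    hashes = list(shingles(normalize(candidate), n))
    if not hashes:
        # Fewer than n tokens: too short to judge with this window size.
        return False
    hits = sum(1 for h in hashes if h in index)
    return hits / len(hashes) >= threshold
```

Because the comparison is done on whitespace-separated tokens, a reindented copy still matches. A copy with renamed variables would not, so a real filter would probably need a stronger normalization step.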
| bogwog wrote:
| But the fact that it did that at all should be proof that
| Copilot is, in fact, copy and pasting rather than actually
| learning and producing new things using intelligence.
|
| This is a code search engine with the ability to integrate
| search results into your language syntax and program
| structure. The database is just stored in the neural network.
|
| It's definitely an impressive and interesting project with
| useful applications, but it's not an excuse to violate
| people's rights.
| throwaway984393 wrote:
| > actually learning and producing new things using
| intelligence
|
| People have been trying to accomplish that for 65 years.
| We're not even close. It's the software equivalent of cold
| fusion (with less scientific rigor)
| eterevsky wrote:
| A big part of the work of almost any software engineer is finding
| similar already written parts of code and adapting them.
| How is this different?
| cartoonworld wrote:
| It also shows that copilot knows nothing about copyright,
| and is incapable of considering copyright as such.
|
| I'm not sure if I would characterize it as a "database stored
| in a neural net", but that is definitely something to
| deeply consider.
| [deleted]
| TaupeRanger wrote:
| This is all just computational statistics. Why in the world
| would you invoke ill-defined anthropocentric terminology
| like "intelligence"? Of course a statistics program isn't
| "using intelligence".
|
| But it's also not exactly just a database. It contains
| contextual relationships as seen with things like GPT that
| are beyond what a typical database implementation would be
| capable of.
| sdfzug wrote:
| Define intelligence
| solipsism wrote:
| Also define "computational statistics". It'll be fun to
| try and fail to draw a clear line between the two.
| TaupeRanger wrote:
| A common tech-bro fallacy. We understand exactly what is
| happening at the base level of a statistics package. We
| can point to the specific instructions it is undertaking.
| We haven't the slightest understanding of what
| "intelligence" is in the human sense, because it's
| wrapped up with totally mysterious and unsolved problems
| about the nature of thought and experience more
| generally.
| randallsquared wrote:
| To be fair, they themselves referred to intelligence as
| "ill-defined"...
| bogwog wrote:
| > But it's also not exactly just a database. It contains
| contextual relationships as seen with things like GPT
| that are beyond what a typical database implementation
| would be capable of.
|
| You mean in the same way that google.com isn't "just a
| database"?
|
| If Copilot isn't intelligent, then what makes it more
| special than a search engine? How is Copilot not just
| Limewire but for code?
|
| I could understand the argument that, if Copilot really
| is intelligent or sentient or something like that, then
| what it is producing is as original as what a human can
| produce (although, humans still have to respect copyright
| laws). However, I haven't seen anyone even attempt to
| make a serious argument like that.
| freeone3000 wrote:
| It _can_ produce code snippets that were never seen by
| generating fragments from various sources and combining
| them in a new way. This makes it different from a search
| engine, which only returns existing items.
| bogwog wrote:
| Is it _producing_ code (by which I mean creating
| /inventing new code by itself), or is it just combining
| existing code? Because to me it seems like the latter is
| a more appropriate description.
|
| * AI searches for code in its neural-net-encoded database
| using your search terms (ex: "fast inverse square root")
|
| * AI parses and generates AST from the snippet it found
|
| * AI parses and generates AST from your existing codebase
|
| * AI merges the ASTs in a way that compiles (it inserts
| snippet at your cursor, renames variables/function/class
| names to match existing ones in your program, etc)
|
| * AI converts AST back into source code
|
| Is AI intelligently producing new code in that example?
| Because I don't think it is.
|
| What would be an interesting test of whether it can
| actually generate code is if it were tasked with
| implementing a new algorithm that isn't in the training
| set at all, and could not possibly be implemented by
| simply merging existing code snippets together. Maybe by
| describing a detailed imaginary protocol that does
| nothing useful, but requires some complicated logic,
| abstract concepts, and math.
|
| A person can implement an algorithm they've never seen
| before by applying critical thinking and creativity (and
| maybe domain knowledge). If an AI can't do that, then you
| cannot credibly say that it's writing original code,
| because the only thing it has ever read, and the only
| thing it will ever write, is other people's code.
| LightMachine wrote:
| That isn't even necessary. I've been exploring GPT-3 for
| a while and it is completely incapable of any reasoning.
| If you enter short unique logical sentences like "Bob had
| 5 apples, gave 2 to Mary, then ate the same amount. How
| many apples Bob has left?" No matter how many previous
| examples you give it (to be sure it gets the question),
| it gets it wrong. It is simply incapable of reasoning
| about what is going on.
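For reference, the arithmetic the prompt asks for can be written out as below. This is only a sketch of the intended reading, assuming "ate the same amount" means Bob ate as many apples as he gave to Mary; the variable names are made up for illustration.

```python
apples = 5
given_to_mary = 2
eaten = given_to_mary  # assumption: "the same amount" refers to the 2 apples given away

remaining = apples - given_to_mary - eaten
print(remaining)  # 5 - 2 - 2
```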
| sjy wrote:
| Perhaps it's not so different from a search engine like
| Google. The article cites Google's successful defence,
| under US copyright law, of its practice of displaying
| 'snippets' from copyrighted books in search results.
| There is a clear difference between this and the
| distribution of complete copies on LimeWire.
| eterevsky wrote:
| If you look at it this way, your brain is also "just"
| computational statistics. (Or to be precise, it might be,
| since we don't yet know in all the details how it works).
| tovej wrote:
| Would it? What would the threshold be? Twenty lines copied
| verbatim? Ten lines copied verbatim? What about boilerplate
| like ten #include statements at the beginning of a file? Or
| licenses in comments? What if someone has a one-liner that's
| unique enough to be protected by copyright?
| hobs wrote:
| https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,
| _...
|
| I think that was a big part of the Google Vs Oracle case -
| how much copying constitutes an infringement?
|
| It looks like they made a fairly complex rubric to apply in
| the future, it appears it would be on a case by case basis.
| e3bc54b2 wrote:
| > The excerpt from Quake code is literally one of the most
| famous functions out there. There is no wonder that it was
| reproduced verbatim.
|
| The question this raises is that this case was found because it
| is so famous, but what if it is repeating Joe Schmoe's weekend
| library project, and we will never know because it's not
| famous?
| 0-_-0 wrote:
| Because someone already checked, and it doesn't:
|
| https://docs.github.com/en/github/copilot/research-
| recitatio...
|
| Every literally quoted part that could infringe appears at
| least 10 times in the training data
| Zababa wrote:
| Someone that works at GitHub already checked. I think
| asking for a more independent study is fair.
| b3morales wrote:
| This doesn't stand on its own as a defense: perhaps the
| 10 inputs were legitimate copies of a single source. They
| could be forked repos that were properly following the
| original's license, for example.
| leereeves wrote:
| Or 10 different GPL projects that legitimately share code
| that remains copyrighted and protected by the GPL. Or 10
| obscure projects that illegitimately copied code but
| haven't been caught.
|
| Clearly, "10 other people did it" is no defense at all.
| duskwuff wrote:
| It might not even be "10 other people". For projects
| which originated outside Github, it's common for multiple
| users to have independently uploaded copies of the
| project. There's probably at least 10 users who have
| pushed copies of the GCC codebase to Github, for example.
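The distinction raised in the last few comments, between how often a snippet appears and how many independent sources it has, can be illustrated with a small sketch. All of the data below is hypothetical: the repository names, the `origin` field, and the snippet ids are invented, and nothing here reflects how GitHub's own analysis was done.

```python
from collections import defaultdict

# Hypothetical sightings: (repository, upstream project it was copied from, snippet id).
sightings = [
    ("alice/project-mirror", "upstream/project", "snippet-a"),
    ("bob/project-fork", "upstream/project", "snippet-a"),
    ("carol/project-import", "upstream/project", "snippet-a"),
    ("dave/utils", "dave/utils", "snippet-b"),
    ("erin/helpers", "erin/helpers", "snippet-b"),
]


def count_occurrences(records):
    """Count how many repositories contain each snippet."""
    counts = defaultdict(int)
    for _repo, _origin, snippet in records:
        counts[snippet] += 1
    return dict(counts)


def count_independent_origins(records):
    """Count how many distinct upstream projects each snippet traces back to."""
    origins = defaultdict(set)
    for _repo, origin, snippet in records:
        origins[snippet].add(origin)
    return {snippet: len(found) for snippet, found in origins.items()}


print(count_occurrences(sightings))
print(count_independent_origins(sightings))
```

In this made-up data, `snippet-a` shows up in three repositories but has a single origin, which is the situation the comments describe: a raw occurrence count alone does not show that several people wrote the same code independently.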
| happymellon wrote:
| Pretty sure if someone trained a code suggestion tool with
| Windows source, Microsoft would claim that a single similar
| character being the same is grounds for copyright
| infringement.
|
| They are putting GPL code in non-gpled codebases. Is it okay
| to take sections of other people's source code and use it on
| yours, if you just got it as a suggestion?
| dboreham wrote:
| The true test will be whether MS indemnifies me against
| claims of copyright (and patent) infringement due to use of
| their tool.
| happymellon wrote:
| This will be interesting to watch.
| onion2k wrote:
| The example you linked to is talking about a 16 line function
| from the Quake source. The Quake source is 167,594 lines in
| total (counting the C code only). Does that _really_ fail to
| meet the standard for "smallest excerpt"?
| stefan_ wrote:
| That excerpt has its own Wikipedia page, of course it meets
| the threshold of originality. In any case, once you are
| discussing this, you have entered the area of _fair use_;
| that is an admission of copyright violation.
| coldacid wrote:
| Fair use is not a violation of copyright but a specified
| (and since 1976 statutory) _exception_ to it. You are
| clearly impugning the doctrine with your comment.
| mod50ack wrote:
| Having a WP page isn't proof of threshold-passing.
| https://en.wikipedia.org/wiki/BACH_motif
|
| But there is also actually an issue about laundering and
| what constitutes "use". But there is also de minimis to
| consider.
|
| And EVERYTHING will depend on jurisdiction of course.
|
| IANAL
| emodendroket wrote:
| Not only that, but it is clearly someone going out of their
| way to make it do that. I'm not sure that that is a
| reasonable test of how the program typically behaves.
| hmfrh wrote:
| > I'm not sure that that is a reasonable test of how the
| program typically behaves.
|
| That's not what people care about, people care about their
| copyright being blatantly violated by a massive corporation
| _without any consequences_.
| emodendroket wrote:
| Ok, but is "I can go out of my way to make it misbehave"
| adequate proof that the copyright is being violated?
| ghoward wrote:
| Not GP.
|
| Yes, it is, because that means that the algorithm will
| produce that copyrighted code regardless of the intent of
| the person who makes it misbehave. People could both
| accidentally and "accidentally" make it reproduce
| copyrighted code. In the first case, it's unintentional.
| In the second, how could you prove it's intentional?
|
| Because of this whole mess, I am actually adding clauses
| to FOSS licenses that I am writing, just to ensure that
| my copyright on my code is not infringed by code
| laundering.
| emodendroket wrote:
| To be clear, my suspicion is that this is so unlikely to
| happen unintentionally that it does not represent a real
| risk. If the issue is that I can force it to generate
| infringing output if I really want to, it is an argument
| against the Web browser too, since I could just as easily
| use the copyright-unsafe "copy" feature.
| ghoward wrote:
| I don't entirely agree.
|
| Whereas using the browser's copy feature requires the
| user to have intent to use it, getting Copilot to produce
| exact code does not. And proving that intent is not easy.
|
| I think companies will see that such code _can_ be
| exactly reproduced and decide to stay away from Copilot.
| I hope they do. In fact, I am less willing to take
| outside contributions for my own code, even for bug
| fixes, just because of the risk that that code came from
| Copilot.
| breakfastduck wrote:
| How long does it have to be for you to consider it
| copyrighted code?
|
| For example, a book could be copyrighted, but they
| certainly cannot sue me because a book I wrote contains a
| sentence that is the same.
| ghoward wrote:
| The answer to your first question is for the courts to
| decide, unfortunately.
|
| However, for my purposes, using a new license with
| particular terms would only be to make companies like
| GitHub pause and think before using my code as "training"
| to an "algorithm" like Copilot.
| b3morales wrote:
| I'm not at all in favor of the "code laundering" (which
| is a brilliant term, thank you). But I don't understand
| how you expect a new license to help.
|
| 1. A license applied to source code is effective _because
| of_ your copyright
|
| 2. The claim of Copilot's maintainers is that it
| _bypasses_ copyright
|
| Therefore, they will assert that they can ignore the new
| license saying "you may not launder my code" just as
| surely as they can ignore the previous license.
| ghoward wrote:
| First, I did not come up with the term "code laundering."
| I cannot claim credit for that; I saw it first on HN on
| https://news.ycombinator.com/item?id=27729209 somewhere.
|
| Second, you are correct that Copilot's maintainers claim
| that it bypasses copyright, but if it does while
| producing exact copies of code, then copyright is dead,
| and there are a lot of big companies out there with deep
| pockets that will ensure that doesn't happen.
|
| They may claim that because their algorithm is a black
| box, that whatever it produces has no copyright, but my
| licenses will push back directly on that claim by saying
| that if source code under the license is used as all or
| part of the inputs to an algorithm, whether all of the
| source code or partially, then the license terms must be
| attached to the output. After all, that's what we do with
| GPL and binary code. The binary code is the output of an
| algorithm (the compiler) whose input was the source code.
|
| I hope by tying it together like that, the terms can
| close the loophole they are claiming. But of course, I am
| going to get a lawyer to help me with those licenses.
| d110af5ccf wrote:
| > ... if source code under the license is used as all or
| part of the inputs to an algorithm, whether all of the
| source code or partially, then the license terms must be
| attached to the output.
|
| You're not getting it. If Copilot isn't currently
| infringing copyright then adding such a clause _won't
| matter_. Such a clause would only hold weight _when
| copyright applies_. On the other hand, if copyright
| _does_ apply, then you don't need such a clause because
| the activity is already a violation of the vast majority
| of licenses. (It even violates extremely permissive ones
| because it effectively strips out the license notice.)
|
| The GPL works specifically because copyright applies to
| the usecase in question. It simply specifies various
| requirements that you must meet in order to license the
| code _given that copyright applies_.
|
| In short, you can't just put a clause into a license
| saying, effectively, "and also, this license confers
| superpowers which make it so that my copyright applies in
| additional situations where it otherwise wouldn't!".
| ghoward wrote:
| Ah, I see.
|
| I argue that, even if _training_ a dataset is fair use,
| _distributing the result_ is copyright infringement. I
| would want my license to make that part clearer.
| hnfong wrote:
| I think the GP's "license" would still be effective,
| although it would not be "open source" per the OSI
| definition.
|
| Imagine this simplified scenario first: if I published a
| source file publicly without any licensing or explanation
| except a standard copyright notice - "Copyright (C) 2021
| MY NAME, all rights reserved", do you think a random
| person/company can take that code and integrate it into a
| commercial product?
|
| I would argue not (in general). Copyright law, as it is,
| does not permit a user who has access to a copy to do
| whatever they want with that copy (esp. if it involves
| more copying). OSS licenses do give you much freedom as
| long as you don't modify it, and that's why we have the
| impression that we can do whatever with publicized source
| code. However, if we think about other types of
| copyrighted work, say movies for example, streaming
| services can "rent" you a movie multiple times even
| though you've paid to download the content previously.
| What are you paying for the second time you rent? Another
| example - some photographers may allow you to freely
| browse their works, but they can still make you pay money
| if you want to use their photo in your commercial
| product.
|
| So why wouldn't copyright restrict usage of source code
| in similar situations? The GP only needs to add a
| condition to the license to restrict how users can use
| it. It will no longer be OSS, but as long as it's his
| work, I don't see why in principle it shouldn't work.
|
| (In practice, I don't think it will make much difference
| -- I think your argument is still somewhat compelling,
| and some people will probably take your position.
| Conservative corporate lawyers aimed at reducing legal
| risk would disagree, so it's basically a matter of how
| much legal risk one is ready to take. Also, for an author
| trying to do this, note that suing Microsoft in these
| cases would be expensive, since they will likely fight
| back given that they spent so much money trying to do
| this, and the outcome will be uncertain. If really tested
| in court, given the result of the Oracle v Google case,
| if the US Supreme Court is impressed by the
| social/economic benefits that Android brings, I'm pretty
| sure the justices will be even more impressed by this
| intelligent code generation thingy, and might just grant
| this thing a fair use.)
| b3morales wrote:
| Your summary is generally correct, and I certainly agree
| with the other commenter's position on their work. But I
| think you're still missing the point. Copyright is the
| mechanism that allows you to prevent copying, but
| GitHub's claim is that copyright is _irrelevant_ to
| Copilot's input.
|
| I have a nice strong lock on my door. GitHub (asserts
| that it) can enter my home through the window.
|
| Adding another deadbolt to the door does not help.
| hnfong wrote:
| I don't think I missed that point. I'm trying to argue
| that copyright _is_ relevant to Copilot's input if not
| allowed by an OSS license.
|
| Maybe I'm missing something (just not the thing you
| said), but has Github made any legal claims so far? The
| original article is written by a politician in EU...
|
| Even if you're a lawyer defending Github in this case,
| there are still a couple of things that need to be clarified
| before you can make the case: (maybe the info is out
| there but I'm too lazy to research)
|
| - Is Github only using code/repos that are explicitly
| under OSS licenses? (because if that's the case, then the
| discussion might be justified in presuming OSS terms, and
| it may be the case that more restrictive non-OSS licenses
| would require a different analysis)
|
| - As somebody pointed out in another thread, the Github
| terms of service agreement seems to grant Github
| additional rights when dealing with user uploaded
| content. Is that a legal basis for the use?
| b3morales wrote:
| > I'm trying to argue that copyright _is_ relevant to
| Copilot's input if not allowed by an OSS license.
|
| And I tend to agree with you (and the other commenter)
| here. But GitHub doesn't.
|
| > has Github made any legal claims so far?
|
| I'm not sure how actively, but the CEO was here in the
| announcement thread the other day saying that they think
| the ingestion of the inputs is a "fair use". They also
| have some material defending the output side:
| https://docs.github.com/en/github/copilot/research-
| recitatio...
|
| > Is Github only using code/repos that are explicitly
| under OSS licenses?
|
| I don't think we know exactly what code they used as
| inputs, no.
| [deleted]
| ipaddr wrote:
| Can you add fines?
| ghoward wrote:
| I wish. I just want users to know what rights _they_
| have. Ultimately, I want my software to serve end users,
| not companies. If companies add value for users with my
| software, that's exactly what I want.
|
| But stripping licenses away so that users can't know what
| rights they have with my code is not that.
| rndgermandude wrote:
| >I am actually adding clauses to FOSS licenses that I am
| writing
|
| Doesn't this make your new licenses incompatible with a lot
| of existing licenses?
| ghoward wrote:
| wizzwizz4 is correct. Also, I have explicit clauses
| saying that GPL/AGPL dominate.
|
| But yes, my licenses may be incompatible (one-way) with
| permissive licenses. I say "one-way" because code with
| permissive licenses can still be used in code under my
| licenses, but maybe not necessarily the other way around.
|
| I'm okay with that.
| rndgermandude wrote:
| That does not really ring true to me. AGPL broadens the
| scope of violations as well, and you cannot use AGPL code
| in GPL-only code bases without turning the end product
| AGPL (but you can use GPL-only code in AGPL code bases).
|
| If you're just adding something along the lines of
| "copying passages extensive enough to reach originality
| is a violation of this license" then that's indeed
| already covered by the GPL, and there is really no need
| to add such a passage other than to be more explicit -
| and confuse people at least at first about why your
| license is not actually the GPL. So there isn't much of a
| point to do it in the first place, in my humble opinion.
|
| If you add text that says something along the lines of
| "you may not use this code as training data", then you
| created an incompatible license, and your code cannot be
| used in GPL code bases, and even worse, since it
| restricts what you can do with the code more than the
| GPL, it might even mean you stop being reverse-compatible
| and may not use GPL'ed code yourself in your own custom-
| license code base.
|
| The AGPL does not further restrict code uses, just
| broadens the scope of when you have to make available the
| code, so it's fine there. However, the original BSD
| license with the advertising clause is considered
| incompatible with the GPL.
|
| I am not a lawyer, and these are just my quick layman
| concerns. I fully recognize you're entitled to use
| whatever license you find suitable for your code and I am
| absolutely not entitled to your code and work whatsoever.
|
| But that said, I wouldn't touch your code if I saw a
| "potentially problematic" custom license, and I wouldn't
| consider contributing to your projects either.
| ghoward wrote:
| I understand your concerns.
|
| Honestly, with this whole debacle, I am not going to be
| accepting outside contributions anyway.
|
| I also understand the concern with a problematic license.
| However, I don't plan to make a specific exemption about
| machine learning, but rather tie up an ambiguity.
|
| What I think I'll do is that the license will require
| that when the licensed source code is used, partially or
| fully, as an input to an algorithm, the license terms
| must be distributed with the output of that algorithm.
|
| I don't think this is a violation of the GPL at all
| because the GPL requires you to distribute the license
| with the binary code of GPL'ed code, and such binary code
| is the output of an algorithm (the compiler) whose input
| was the source code.
|
| But what it would do is put the onus on GitHub: if they
| used my code as training data and distribute the results
| (as they are doing), they must distribute my license terms
| as well and tell users that some of the results are under
| those terms.
| d110af5ccf wrote:
| > something along the lines of "you may not use this code
| as training data"
|
| Would such a term be legally binding under present
| copyright law? Other than disallowing inclusion in a
| redistributed dataset specifically intended for training
| ML models, it's not clear to me that it would actually
| prevent such use if you already had a copy on hand for
| some other purpose. (Specifically, note that GitHub
| indeed already has a copy on hand for their authorized
| primary purpose of publicly distributing it.)
|
| More generally, the manner in which copyright law applies
| to machine learning algorithms _in general_ hasn't been
| worked out by either the courts or legislature yet. Hence
| the current article ...
| wizzwizz4 wrote:
| Not necessarily. If you do it right, you've got a
| perfectly GPL-compatible license (because such laundering
| is, technically, a violation of the GPL... probably) -
| it's just a license that's more explicit about what's a
| license violation.
|
| Law isn't code.
| hnfong wrote:
| GPL explicitly forbids re-licensing under more
| restrictive terms.
|
| So either the added terms are not more restrictive, which
| basically means they are unnecessary and have no real
| effect; or they are more restrictive, which is
| incompatible with the GPL.
|
| You can't have things go both ways. It seems that your
| argument is "we're not adding restrictions, we're just
| saying what we think Copyright law / the GPL should
| actually be like." But unfortunately you can't "clarify"
| Copyright Law or "clarify" the GPL by adding terms.
| Ultimately courts decide that.
|
| (Of course, if somehow your "clarification" happens to
| align with a court decision, then maybe it will work
| after all. But in theory your "clarification" is still
| not necessary and has no additional effect....)
| heavyset_go wrote:
| GPL code and its derivatives can't be distributed with
| additional restrictions.
| formerly_proven wrote:
| Double standards ensue.
|
| Tool that could be used to violate copyright := Gets
| prosecuted by MPAA and friends, legislation is passed to
| make use / development / distribution of such tools
| illegal
|
| Bigcorp ships the ML equivalent of ALLCODE.tgz, but you
| actually gotta look in the
| no/dont/open/this/folder/gplviolations/quake.c folder :=
| Is this adequate proof that copyright is being violated?
| emodendroket wrote:
| Since I do not work for the MPAA, I don't see why you
| expect me to answer for them. Half of the article's
| argument is that any argument you could use to shut down
| Copilot would also give a lot of power to such entities
| if it were accepted.
| TeMPOraL wrote:
| Honestly, I feel most people don't care about that. What
| they do care about, is the risk of Copilot making the
| _user_ liable for copyright infringement. Even a
| possibility of it spewing out non-public-domain code
| should be considered a showstopper for any use of
| Copilot-generated code in a commercial project.
|
| Can Copilot produce licensed code verbatim, in enough
| quantities to matter, with a license your business would
| be infringing? Yes. Can you easily tell by looking at the
| output? No. Could someone end up suing you over it?
| Maybe, if they cared enough to find out. Can you honestly
| tell your investors, or a company you seek to be acquired
| by, that nobody else can have valid copyright claim
| against your code? No.
| emodendroket wrote:
| > Can Copilot produce licensed code verbatim, in enough
| quantities to matter, with a license your business would
| be infringing? Yes. Can you easily tell by looking at the
| output? No. Could someone end up suing you over it?
| Maybe, if they cared enough to find out. Can you honestly
| tell your investors, or a company you seek to be acquired
| by, that nobody else can have valid copyright claim
| against your code? No.
|
| Well aren't all your assertions exactly the point of
| contention?
| TeMPOraL wrote:
| Well, the "enough quantities to matter" part wasn't
| tested in courts yet, but I fail to see a way to rule for
| "No" here in a way that wouldn't gift us a universal way
| to turn any code into public domain, destroying source
| code licensing as a concept. Other than this part, the
| first two claims have already been demonstrated, and the
| rest follow from them.
| emodendroket wrote:
| But that is in fact the most fundamental question here.
| And I'm not fully sold on the idea either that this is
| going to happen in real-world usage or that a single
| function in a massive program constitutes a large enough
| portion to be infringing.
| TeMPOraL wrote:
| Quake's square root function wasn't the only, or the
| largest, example of code Copilot reproduces verbatim.
| Among others I've seen to date is someone generating a
| real "About" page with PII information of some random
| software developer.
|
| How much code is enough to infringe is a tricky question,
| though. It's not only a function of size, but also of
| importance/uniqueness - and we know that Copilot doesn't
| understand these concepts.
| shakna wrote:
| > ... or that a single function in a massive program
| constitutes a large enough portion to be infringing.
|
| As part of the sequence of rulings in Google vs Oracle,
| the 9-line rangeCheck function, in the entirety of the
| Android codebase, was found to be infringing.
| [deleted]
| noobermin wrote:
| I've said this before, but I hope the issue isn't infringement
| per se, but that the produced code isn't automatically GPL'ed.
| The author argues that machine generated code isn't copyrighted
| and this is good because it essentially fits the "data wants to
| be free" mentality, but I'd say tell that to the people who use
| it. Will they, after using something derived from open source,
| have to open source their code? No, they won't. If anything, this
| finally provides closed source developers with what they've
| always wanted, a means to rip open source code without having to
| return contributions.
|
| Julia Reda hints at that last bit as being an issue but only in a
| parenthetical. To the author, that literally is _the whole
| point_. Do people not remember the Free Software vs. Open Source
| debate? Or GPL vs BSD? The requirement that derived works also be
| free is literally the important bit in Free Software. This only
| fits the mentality of "data wanting to be free" if your model of
| that idea includes the permissive sensibility and doesn't care
| about actually changing the state of things, which is making free
| software more widely used in the world over proprietary software.
| jordigh wrote:
| > The output of a machine simply does not qualify for copyright
| protection
|
| Wolfram disagrees, and he's got lawyers and money too. Whom do we
| believe?
|
| http://www.groklaw.net/article.php?story=20090518204959409
| Mindwipe wrote:
| Wolfram is discussing American law, Reda European.
|
| (I'm still not sure I agree with Reda, but the point is at
| least arguable under European law and depends on the
| circumstances).
| jordigh wrote:
| But there's a Berne convention that kind of unifies copyright
| around the world, right? It's not like something can be
| copyrighted in one country but not another.
| sascha_sl wrote:
| I frankly think that the "free culture" label and extremely
| permissive licenses of many open source projects are nothing but a
| redistribution of wealth upwards. Those with existing capital can
| make profitable unfree derivative works without any benefit to
| original authors. This relationship must go both ways if you want
| actual free culture. Stop producing MIT/BSD code in your non-work
| time.
|
| This is not a research project, this is a commercial work that
| produces verbatim copies of code without disclosing its license
| (or having a license grant in many cases). It doesn't matter how
| it manages to reproduce it either. It does.
| FranchuFranchu wrote:
| > without any benefit to original authors
|
| I don't really expect any benefit when writing some pieces of
| code. These days I just release certain types of code into the
| public domain, because if it's MIT then BigCorp inc. will just
| make my name one in a really long list of contributors, and I
| won't get any benefit either.
| indigochill wrote:
| > extremely permissive licenses of many open source project are
| nothing but a redistribution of wealth upwards
|
| Although I license all my amateur work GPLv3, I dispute the
| assertion that more permissive licenses are "nothing but" a
| redistribution of wealth upwards.
|
| Permissive licenses commoditize their features. This is of
| benefit to everyone, but organizations with more capital are
| better positioned to leverage that commodity, and typically
| when they do, they do so selfishly. This further centralizes
| value with them, but I believe this is still a better outcome
| than a closed license because of the
| educational/cultural/technical benefits to everyone of the open
| license. The capital leverage problem is orthogonal to that.
|
| Copyleft licenses kind of do the same commoditization thing,
| but explicitly only for share-alike uses, which is why they're
| the only way to grow open culture relative to proprietary
| culture: they're designed to deny "freeloaders" on open culture
| by requiring all derivative works to also be open.
|
| > without any benefit to original authors
|
| Benefiting the original authors is not (directly) the point of
| open source.
|
| MIT: "Here's some code, go nuts."
|
| GPL: "Here's some code, go nuts, but share the nuts if you do."
|
| There's zero value capture for the original author built into
| these licenses. If the original author wants to ensure they
| capture the value from their work, I recommend using a closed,
| proprietary license. Just be aware they're applying friction to
| the overall technological development of humanity by doing so.
| Kiro wrote:
| I don't expect any benefits when releasing stuff under MIT,
| regardless of who uses it. That's the whole point or I wouldn't
| use that license.
| wccrawford wrote:
| This is the closest anyone has ever come to convincing me to
| use GPL instead of MIT license.
|
| But I still want to support _small_ developers with anything I
| produce for fun, and I'm not willing to give that up to spite
| the big developers.
|
| For instance, I wrote a small class to load OBJ files in Unity
| because I needed it for an idea. I went ahead and put it on
| Github for others that need it, too. I could easily see someone
| having an idea similar to mine that needed that and couldn't
| find it out there. (I think there are more libraries like that
| now, though.) I wanted them to feel comfortable using it, even
| if they eventually make money with their game.
|
| If a big corp uses that code, too, that sucks. But there's no
| good way to draw that line in a license, so I didn't.
|
| Having said that, in the future I could see releasing some
| software that I don't think anyone should profit from, and in
| that case I'd GPL it. Previously, I'd have just defaulted to
| the same MIT license. I'm just not sure what that'd be yet.
| dgb23 wrote:
| Just to be sure: GPL doesn't prevent anyone from selling.
| wccrawford wrote:
| No, but it does force them to GPL their own code if they
| do. And that's a no-go for most companies.
| enriquto wrote:
| > No, but it does force them to GPL their own code if
| they do. And that's a no-go for most companies.
|
| There are many companies that release source code. I don't
| know enough to say if that makes the word "most" in your
| sentence false or not.
|
| Anyways, a company using GPL code need not release all
| their own code. Just their modifications to that
| particular GPL program. And then again, only if they do
| intend to distribute the modified program.
| enriquto wrote:
| > This is the closest anyone has ever come to convincing me
| to use GPL instead of MIT license.
|
| > But I still want to support small developers with anything
| I produce for fun, and I'm not willing to give that up to
| spite the big developers.
|
| Another line of thought that may help you choose a license:
| Do not think only about other developers. Think about the
| final users of your code that will be running your algorithms
| on their computers. The GPL protects the right of these users
| to see and modify the code they run (your code). So-called
| permissive licenses, on the other hand, let middlemen strip
| this right from your users.
|
| Users of your code are in fact freer thanks to copyleft
| licenses.
| zxcb1 wrote:
| Open source developers deserve the same rights as corporations.
|
| As a side note, in a not so distant future there may be
| decompilers enhanced by artificial intelligence.
| ksec wrote:
| So maybe it is best to have a separate license for machine
| learning? Let's call it a Copilot licence. (Maybe it is better
| to call it an exemption?)
|
| You would need AGPL / GPL / LGPL / MIT / Apache / BSD + Copilot
| licences before the code can be used for training, knowing there
| is a very small possibility that some code snippet will appear
| in the output?
|
| I mean, we could debate this endlessly with no resolution unless
| it is tested in court.
| Sr_developer wrote:
| This is a supposedly progressive politician, young, in an
| advanced country, her personal platform runs almost entirely on
| copyright issues and yet she gets almost everything wrong, what
| can you expect from your usual dinosaurs?
| marcosdumay wrote:
| Depends on who is funding the dinosaurs.
|
| She had to work really hard to get to that conclusion she
| stated.
| marcodiego wrote:
| Simple way to fix this mess: allow the user to choose the
| licenses of the training data samples.
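A minimal sketch of what such a choice could look like on the training side, assuming every sample already carries a known SPDX license identifier. The `Sample` type, its field names, and `select_samples` are hypothetical illustrations and not anything GitHub has described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One training sample; all fields are hypothetical for this sketch."""
    repo: str
    license: str  # SPDX identifier, assumed to be known for every sample
    code: str


def select_samples(samples, allowed_licenses):
    """Keep only the samples whose license the user has opted into."""
    allowed = {lic.lower() for lic in allowed_licenses}
    return [s for s in samples if s.license.lower() in allowed]


samples = [
    Sample("example/a", "MIT", "def f(): return 1"),
    Sample("example/b", "GPL-3.0-only", "def g(): return 2"),
    Sample("example/c", "Unlicense", "def h(): return 3"),
]

# A user who only wants suggestions learned from MIT and Unlicense code.
chosen = select_samples(samples, {"MIT", "Unlicense"})
chosen_repos = [s.repo for s in chosen]
```

This only narrows the input set. It does not by itself attach license text or attribution to whatever the model later outputs.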
| stabbles wrote:
| And then what license do you choose? Many licenses require you
| to copy the original license verbatim, which may include the
| author's name and the date.
| [deleted]
| denton-scratch wrote:
| > The output of a machine simply does not qualify for copyright
| protection
|
| "Simply"? If it were that simple, surely that would mean that the
| output of the Unix "cp" program would not qualify? What about a
| DVD copier?
|
| I'm OK with copyright as it used to be, back when I was a
| teenager; the right expired with the author's life. Corporations
| couldn't own copyrights. There was no burden on the author to
| register their rights. And copyright was a civil matter; you sued
| for actual damages. Infringement wasn't a crime.
|
| I'm not OK with modern copyright law, with criminal penalties,
| rights that can be transferred to entities that are essentially
| immortal, and copyright terms that keep getting extended, just
| before Mickey Mouse and Elvis Presley become public domain.
| tzs wrote:
| >> The output of a machine simply does not qualify for
| copyright protection
|
| > "Simply"? If it were that simple, surely that would mean that
| the output of the Unix "cp" program would not qualify? What
| about a DVD copier?
|
| It means that the output of "cp" does not qualify for copyright
| protection _as a derivative work_. The output is still a copy
| of the input and would be subject to the same copyright as that
| input.
|
| Roughly, a derivative work is a _new_ work that incorporates
| some copyrightable elements from a previous work. The
| derivative work gets its own copyright separate from the
| copyrights of those incorporated elements.
| chalst wrote:
| Not necessarily. The object code produced by a compiler might
| best be regarded as a different form of the source code, even
| though it is not a copy or a new work, according to
|
| http://digital-law-online.info/lpdi1.0/treatise26.html
| adriancr wrote:
| > Roughly, a derivative work is a new work that incorporates
| some copyrightable elements from a previous work.
|
| By this logic:
|
| - someone could go and copy functions and/or entire files
| from GPL code bases and use them with a different license.
|
| - someone could use copilot or similar to learn from all
| available GPL code. Is resulting code GPL?
|
| - someone could use copilot or similar to learn from open
| source code of their competitors that license doesn't allow
| them to use. Are the results legal?
| mthoms wrote:
| The definition of a "derivative work" as stated is correct.
|
| The _copyright status_ of a derivative work is a separate
| issue: A derivative work can be considered infringing, and
| a derivative work can be considered non-infringing (ie. due
| to Fair Use).
| mthoms wrote:
| I think they meant "when the output of the machine is
| significantly original/transformed (i.e. is a new creative
| work)."
|
| I'm not arguing for/against: I just think a straight copy was
| not what the author intended to be included by that statement.
| b3morales wrote:
| > I think they meant "when the output of the machine is
| significantly original/transformed (i.e. is a new creative
| work)."
|
| No; the author spends some time asserting that the output of
| a machine is inherently _not_ a creative work:
|
| > Machine-generated code is not a derivative work
|
| > the argument that the outputs of GitHub Copilot are
| derivative works of the training data is based on the
| assumption that a machine can produce works. This assumption
| is wrong and counterproductive.
|
| > This means that machine-generated code like that of GitHub
| Copilot is not a work under copyright law at all
| mthoms wrote:
| The machine is performing some sort of transformation.
| Whether you want to call that a "(creative) work" or not is
| irrelevant to the point I was making: The machine is doing
| _something_ to the input, and that makes the 'cp' comment
| I replied to kind of silly.
|
| I think we can assume the author doesn't believe that
| piping bytes through the 'cp' command automatically removes
| copyright (as the person I replied to suggested).
| Cort3z wrote:
| I wonder how long it will take for the licenses to start
| explicitly disallowing this sort of usage. It is clearly
| something that many open source writers dislike, and in my
| opinion, rightly so.
| vharuck wrote:
| >What would then stop a music label from training an AI with its
| music catalogue to automatically generate every tune imaginable
| and prohibit its use by third parties? What would stop publishers
| from generating millions of sentences and privatising language in
| the process?
|
| The existing barrier we have is that, unless the music label can
| prove a human artist has listened to the specific song matching
| the artist's, there's no copyright violation. A copyright
| protects creators from having their work _copied_. It doesn't
| give them ownership over matching works. I'm sure there are
| plenty of pairs of novels with the same first sentence despite
| each author never having read the other's work.
| rjmunro wrote:
| Note that Patents and Trademarks are not like this. You can
| innocently recreate an invention or a similar trademark and you
| are still infringing.
|
| This often causes confusion - people apply the rules of one
| type of IP to the others, but they have almost nothing in
| common.
| uCantCauseUCant wrote:
| I felt a great disturbance in the AI-community, as if millions of
| voices suddenly cried out in terror, of GPL Code in their output,
| and were suddenly silenced. I fear something terrible has
| happened.
| ClumsyPilot wrote:
| Julia is one of the few MEPs that properly engages with issues of
| copyright and is active in IT. I really appreciate it, even if I
| don't always agree with her.
| toyg wrote:
| This is actually why I was so disappointed by her analysis
| having some very glaring errors. With friends like these...
| CyberRabbi wrote:
| Politicians are very rarely trustworthy. Who funds her?
| ocdtrekkie wrote:
| She meets with tech company lobbyists pretty regularly
| according to her meeting log.
| elcapitan wrote:
| Former MEP, btw.
| chrisseaton wrote:
| But she doesn't seem to have engaged - she seems ignorant of
| basic facts of what the technology is doing in practice if you
| read the other comments here which give specific examples.
| mabbo wrote:
| > Copyright law has only ever applied to intellectual creations -
| where there is no creator, there is no work. This means that
| machine-generated code like that of GitHub Copilot is not a work
| under copyright law at all, so it is not a derivative work
| either. The output of a machine simply does not qualify for
| copyright protection - it is in the public domain
|
| This is fantastic news.
|
| I'm going to create a bot that crawls sites like GitHub searching
| for popular libraries. Then it will copy them- sans any license-
| to its own website where it will sell these libraries under a
| new name.
|
| Since there is no creator here, just a piece of software, then
| there is no copyright violation. My system simply is "inspired"
| by the original source code using a proprietary algorithm that I
| call "Copy and paste".
|
| I'm open to accepting venture capital for this project.
| JorgeGT wrote:
| > The output of a machine simply does not qualify for copyright
| protection
|
| Does this include my Xerox machine? If so, is anyone looking
| for very very cheap textbooks?
| progval wrote:
| Already tried in court: https://en.wikipedia.org/wiki/Rameshw
| ari_Photocopy_Service_s...
| chrismorgan wrote:
| That looks to be as much about exceptions for education as
| about photocopying. Purely from a quick read of the
| Wikipedia article's summary, it looks like a lot of it
| hinges on the interpretation of Section 52(1)(i), https://c
| opyright.gov.in/Documents/CopyrightRules1957.pdf#pa...,
| "the reproduction of any work-- (i) by a teacher or a pupil
| in the course of instruction; or (ii) as part of the
| question to be answered in an examination; or (iii) in
| answers to such questions".
| Arnt wrote:
| It's not news, it is fantastic, and you'd do well to understand
| it.
|
| The output of copilot is machine-generated and is not subject
| to copyright. Microsoft cannot claim copyright on what it
| generates. That does not affect my rights, or anyone else's. I
| can claim copyright on what _I_ write, and neither copilot nor
| your stupidity diminish my rights.
|
| MS may argue that what copilot copies is small enough that I
| have no copyright on that, and win in court. You may put
| forward the same argument but I think your fate in court would
| be different.
| epicide wrote:
| > The output of copilot is machine-generated and is not
| subject to copyright. Microsoft cannot claim copyright on
| what it generates. That does not affect my rights, or anyone
| else's.
|
| What about when I, a developer working on a proprietary
| codebase, blindly commit the output code into our product?
| Have _I_ created a derivative work or, worse, plagiarized?
| user5994461 wrote:
| >>> Have I created a derivative work or, worse,
| plagiarized?
|
| If you reproduced code that's copyrightable and under
| another license, then yes you are in violation.
|
| It will take a decade for the case to proceed to court and
| determine exactly what claims can be made against your
| company and GitHub.
|
| In practical terms, you should turn off Copilot this very
| minute.
| epicide wrote:
| Right, which is why it feels like a bad-faith argument in
| this case to say "a machine can't produce copyrightable
| code, etc."
|
| We aren't talking about a server sitting in a Microsoft
| data center shuffling code to another server without
| human intervention. We are talking about a tool that
| helps _developers_ create code -- code that is
| "copyrightable and under another license", and thus in
| violation.
| verelo wrote:
| I'd expect that, since that's your intention as described above,
| it is copyright infringement and you would be the actor. In the
| GitHub case, it isn't the primary intent but rather a byproduct
| of the goal of helping another developer? Not a lawyer, but I've
| spent enough time with lawyers to know that what you're describing
| won't fly. I don't even know whether what GitHub is doing will
| fly; maybe they're hoping it gets tested.
| mabbo wrote:
| I can't tell if you are being doubly-sarcastic to my sarcasm,
| or if you missed my point.
| verelo wrote:
| Ah, honestly I missed your sarcasm. Yeah, so I guess we're
| on the same page.
| jdright wrote:
| You know what, I love this idea! We can do the same with music
| with very few adaptations to the algorithm. This idea is worth
| gold!
| [deleted]
| jfmc wrote:
| Modern AI seems more like machine-assisted collage (of pictures,
| code, text, etc.) than anything else. Someone (or some other
| algorithm) needs to be added to ensure that the whole thing makes
| sense. The big problem here is that when an artist creates a
| collage he/she knows the sources. Here provenance is lost.
|
| [1] Collage (/kəˈlɑːʒ/, from the French: coller, "to glue" or "to
| stick together") is a technique of art creation, primarily
| used in the visual arts, but in music too, by which art results
| from an assemblage of different forms, thus creating a new whole.
| SXX wrote:
| I think it's time for someone to train AI on leaked proprietary
| code and source-available code like Unreal Engine. It's cool that
| we have so much of it right now.
|
| Then we'll see how fast Microsoft and others will shut it down.
| flazx wrote:
| "This is a slightly modified version of my original German-
| language article first published on heise.de under a CC-by 4.0
| license."
|
| Heise appears to be quite $bigcorp friendly recently.
| detaro wrote:
| ... because they let a regular author publish her opinion?
| [deleted]
| creshal wrote:
| If by "recently" you mean in the past 10~15 years or so, yeah.
| bennyp101 wrote:
| Countdown to Oracle lawsuit in 3, 2 ...
| varispeed wrote:
| Mass processing, repackaging and then selling the data is an
| exploitative business these multi-billion companies run without
| paying anything to the people who produced the data.
|
| This is wrong and should be stamped out.
| emrah wrote:
| Copilot itself may not be infringing copyright or GPL, but its
| users will be if they incorporate its suggestions into their
| commercial products.
| swiley wrote:
| So copyright is dead then?
|
| Can we merge all the leaked driver source into Linux and have
| decent OSes on handhelds yet?
|
| If I train an "ML autocomplete" on the "OpenNT" source can I
| share it legally?
| temac wrote:
| > What is astonishing about the current debate is that the calls
| for the broadest possible interpretation of copyright are now
| coming from within the Free Software community.
|
| It is not astonishing at all given:
|
| * proprietary codebases have not been indexed by Copilot (at least
| not by the public version of it)
|
| * arguably derived code will be used in proprietary programs
| dleslie wrote:
| Yah, not sure what is astonishing about outrage in response to
| what appears to be a method for laundering GPL'd software.
|
| Copilot ought only to have indexed public domain, WTFPL, and
| other wide-open licensed software. They should remove all GPL'd
| software from their model, even if that means retraining from
| scratch.
| TeMPOraL wrote:
| It's not just GPL, they arguably should remove MIT, BSD and
| most other Open Source software too, as it's hard to tell
| when any given snippet crosses a threshold where the original
| license demands attribution or other things. People seem to
| forget that even MIT license has actual conditions in it.
| dento wrote:
| Not just GPL; even works under MIT/Apache/BSD licenses require
| attribution.
| mrh0057 wrote:
| Why is everyone ignoring what neural networks actually do? Here one
| is being used as a context-aware pattern matcher that predicts what
| you will write next. Of course it's going to return copyrighted works
| based on what you write.
|
| It's a pattern-matching algorithm; what exactly did they think it
| was going to do?
| maweki wrote:
| The output of Copilot may not be a derivative work, but the
| trained model surely is, right?
| betwixthewires wrote:
| > ...some commentators accuse GitHub of copyright infringement,
| because Copilot itself is not released under a copyleft
| licence...
|
| This is not why. The issue at hand, as I understand it, is that
| people using Copilot will potentially have code snippets in their
| work that are already licensed, whose license they do not know, and
| that they will not license properly as a result.
|
| That's in the first paragraph. If you enter this discussion with
| an incorrect presumption from the outset I don't see how you can
| form a valid defense.
|
| > However, by doing so, the copyleft scene is essentially
| demanding an extension of copyright to actions that have for good
| reason not been covered by copyright.
|
| No. Nobody is asking for an extension of copyright protection, we
| are asking for the existing reach of copyright to be respected.
| We built our licenses based on a ruleset that we were told is
| fair. You don't get to violate rules _you_ made and then claim
| that copyleft people only made their licenses as a workaround to
| copyright and so are being hypocrites.
|
| > Others focus on Copilot's ability to generate outputs based on
| the training data. One may find both ethically reprehensible, but
| copyright is not violated in the process.
|
| The arguments I've heard are not that Microsoft is using publicly
| available information to train its AI. The argument is that
| people are potentially (and in some current cases demonstrably)
| getting _copy pasted code snippets from licensed software._ If
| you can't see the plainly obvious problem here it's because
| you're trying not to.
|
| Also a point made in the article, that machine generated things
| cannot be copyrighted because copyright requires a creator, brings
| up an interesting question as to whether works by people who used
| Copilot can be licensed at all.
| boleary-gl wrote:
| I'd agree with this conclusion if it wasn't clear that it is very
| possible - if not common - for Copilot to just completely copy
| code. That isn't fair use - that's a clear violation of copyright
| regardless of license.
| orthoxerox wrote:
| Whether Copilot itself violates GPL or not is one issue.
|
| Whether the code produced by Copilot violates GPL or not is a
| whole different independent issue.
|
| If I am walking down the street, find a piece of paper with code
| on it, pick it up and add the code to my program and this code
| turns out to be licensed under the GPL then my program becomes a
| derivative work. It doesn't matter who wrote it on that piece of
| paper, whether it's a 100% correct copy of the GPLed code or not
| or if there are mistakes in it.
| alfiedotwtf wrote:
| Has anyone tried dumping the debugging symbols from a Microsoft
| binary, e.g. explorer.exe, and tried to autocomplete^Wcopilot its
| functions? It would be interesting to see how far Microsoft could be
| pushed before they ate their own hat.
| alkonaut wrote:
| Whether Copilot infringes copyright is a muddy area. I personally
| would like to think that the world where machines can be trained
| on any data is easier to live in than one where trained machines
| are tainted by the license of their input.
|
| The interesting question however isn't whether Copilot infringes
| copyrights, but whether those that _use_ copilot do.
| Rapzid wrote:
| One of the points being made is that in the worst-case scenario
| of getting Copilot to repeat back verbatim chunks of code from
| projects, something that's not its primary use case, it would
| be a situation similar to a copy machine.
|
| You can copy a page out of a book, or the whole book, and be
| covered under fair use. But you can't sell your copy on Amazon.
| And if you did, neither the copy machine nor Xerox would have run
| afoul of copyright law.
|
| You could also use a copy machine to copy fragments of the
| Linux kernel source out of a book about the Linux source and
| use them to construct an entirely original work that's not
| considered derivative.
|
| The devil's in the details, but GitHub talks at some length
| about the plagiarization issue and their plans to detect and
| link back to where verbatim chunks exist in the training data
| to let the operator decide what to do soo.. IDK.
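|
| A minimal sketch of how verbatim overlap between a suggestion and
| known sources might be flagged, assuming a simple index of token
| n-grams. This is not GitHub's actual method; the function names,
| the whitespace tokenizer, and the window size of 6 are all
| hypothetical choices made for illustration.

```python
from collections import defaultdict


def tokenize(code):
    """Split source text into whitespace-separated tokens (a crude assumption)."""
    return code.split()


def build_index(corpus, n=6):
    """Map every n-token window in the corpus to the sources containing it.

    `corpus` is a dict of {source_name: source_text}.
    """
    index = defaultdict(set)
    for name, text in corpus.items():
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(name)
    return index


def find_verbatim(suggestion, index, n=6):
    """Return the sources that share at least one n-token window with the suggestion."""
    tokens = tokenize(suggestion)
    hits = set()
    for i in range(len(tokens) - n + 1):
        hits |= index.get(tuple(tokens[i:i + n]), set())
    return hits


# Hypothetical two-file corpus, pre-tokenized with spaces for simplicity.
corpus = {
    "project_a/util.py": "def add ( a , b ) : return a + b",
    "project_b/loop.py": "for item in items : print ( item )",
}
index = build_index(corpus)

copied = find_verbatim("def add ( a , b ) : return a + b", index)
short = find_verbatim("x = 1", index)
print(copied, short)
```

| A match only says where the same token run appears; deciding
| whether the matched run is boilerplate or protected expression is
| still left to the operator.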
| scotty79 wrote:
| Don't you think that our world would be a way more relaxed and
| flourishing place if lawyers kept their noses out of software like
| they are keeping them out of math?
| glitchc wrote:
| I disagree with this article. GitHub Copilot is indeed infringing
| copyright and not only in a grey zone, but in a very clear black
| and white fashion that our corporate taskmasters (Microsoft
| included) have defended as infringement.
|
| The legal debate around copyright infringement has always
| centered around the rights granted by the owner vs the rights
| appropriated by the user, with the owner's wants superseding user
| needs/wants. Any open-source code available on Github is
| controlled by the copyright notice of the owner granting specific
| rights to users. Copilot is a commercial product, therefore,
| Github can only use code that the owners make available for
| commercial use. Every other instance of code used is a case of
| copyright infringement, a clear case by Microsoft's own
| definition of copyright infringement [1][2].
|
| Github (and by extension Microsoft) is gambling on the fact that
| their license agreement granting them a license to the code in
| exchange for access to the platform supersedes the individual
| copyright notices attached to each repo. This is a fine line to
| walk and will likely not survive in a court of law. They are
| betting on deep lawyer pockets to see them through this, but are
| more likely than not to lose this battle. I suspect we will see
| how this plays out in the coming months.
|
| [1] https://www.microsoft.com/info/Cloud.html
|
| [2] https://github.com/contact/dmca
| lacker wrote:
| _Github (and by extension Microsoft) is gambling on the fact
| that their license agreement granting them a license to the
| code_
|
| This is incorrect. First of all, GitHub isn't even the people
| building the model. It's built by OpenAI, which has none of
| these licenses. Secondly, the model is not built purely from
| GitHub data. OpenAI is relying on fair use, not on a specific
| license.
| rlpb wrote:
| > Github (and by extension Microsoft) is gambling on the fact
| that their license agreement granting them a license to the
| code in exchange for access to the platform supersedes the
| individual copyright notices attached to each repo.
|
| The person who has the account on Github and uploads code to
| them rarely owns the copyright on all of the code, and
| therefore doesn't have the right to delegate to Github any
| further licensing permission.
| lubujackson wrote:
| "Copilot is a commercial product, therefore, Github can only
| use code that the owners make available for commercial use."
|
| IANAL, but this doesn't sound quite right. There is a
| difference between "using" code (running it in a commercial
| product) and manipulating it as arbitrary data within a
| commercial product.
|
| It definitely can be a gray area, but let's say I use Amazon's
| service where I email a PDF to my Kindle - is it Amazon's
| responsibility to know the copyright status of the PDF, or
| mine? In both cases a commercial product is manipulating
| copyrighted data for the benefit of a user.
| emrah wrote:
| Even if it's legal for Copilot to do what it does, does it
| not violate GPL to take pieces of GPL'ed code and use them in
| a commercial product?
| aj3 wrote:
| There are plenty of SAAS that use GPL'd code on the
| backend. That's fine.
| dminor wrote:
| The basis of the GPL is copyright, so what you're really
| asking is whether you can use part of a copyrighted work in
| another work without infringing.
|
| And the answer as always is "it depends".
| danudey wrote:
| If I use Copilot and it suggests a large block of GPL2'ed
| code for my project, which I then include, then that is a
| GPL2 license violation.
|
| Whether the GPL2 will hold up in court, or whether the
| courts will uphold this specific case (e.g. can you prove
| intent? Do you need to?), is a separate issue entirely.
|
| The next question is, can I use GPL'ed code in my product
| and then claim that it was injected by Copilot to avoid
| repercussions of my actions if caught?
| electroly wrote:
| The claim (which I'm not qualified to judge) is that this
| use falls under fair use. The point of fair use is to allow
| some use of copyrighted works even if the copyright owner
| does not license it to you and even if the owner is
| explicitly hostile towards your usage. If it is indeed fair
| use, then the license doesn't matter because that's not the
| thing that's allowing you to use the work.
| moralestapia wrote:
| Yes.
| lelandbatey wrote:
| Your example doesn't quite match what's happening in real
| life though. You're not "using copilot as a mechanism to
| ferry around code". Co-pilot is making recommendations for
| what code to use and then also giving that exact code (the
| text) to you. A more apt example would be if Amazon had some
| UI which said "What kind of book do you want to read on your
| kindle?", you click the button labeled "biography", and then
| Amazon sends your Kindle an AI generated book which is the
| biography of a famous person, and it _just so happens_ that
| the "generated" book being sent to you is an exact copy of
| someone else's book (or incorporates exact copies of
| chapters/paragraphs of someone else's book), legal disclaimers
| and all.
| nxpnsv wrote:
| The proprietary model is a representation of lots of
| harvested open source code snippets. Without the model
| copilot is nothing. Arguably, the code snippets are part of
| the product....
| to11mtm wrote:
| Maybe you're right, maybe you're wrong.
|
| I'll give the best example, the one task, off the top of my head,
| that I would like some AI help with.
|
| I would really like to replicate the functionality of Java's
| SSLEngine, but for C#.
|
| If I used Co-Pilot to help, at best, I would need to pay for
| a legal team to do some form of 'clean room' review of
| whatever was generated to make sure it did not infringe on
| the OpenJDK code that is out there. At worst, I would be
| having to defend myself from Oracle's legal team -anyway-.
|
| And yeah, I'm assuming in this case that Copilot would be
| 'smart' enough to be able to make the right inferences of
| that Java code and put it into a workable C# construct.
| Stepping back, though, one could still ask the question:
| what's the risk of a _Java_ developer accidentally reproducing
| some OpenJDK code a little too closely? There's an order of
| magnitude difference between even a smaller AGPL developer
| and Oracle.
|
| If Microsoft/GH was willing to go to bat and agree to pay for
| the defense of users of Copilot, I would be far less
| concerned with the implications of all of this.
| onion2k wrote:
| If Copilot is infringing copyright by reproducing small samples
| of the training data, and if we agree that that isn't
| acceptable, doesn't that effectively spell the end of the road
| for any and all AI generated content unless the developers
| explicitly stop their product reproducing data that matches the
| data it was trained on? That seems like it would have far
| reaching consequences for AI as an industry.
| leereeves wrote:
| Doesn't everyone who uploads code to a public repo give
| Microsoft/GitHub a license to (strike ~redistribute~) reproduce
| that code?
|
| If they didn't, GitHub itself would be violating copyright
| every time someone browsed the repo.
|
| And copilot appears to be a part of GitHub.
|
| https://copilot.github.com/
|
| So why wouldn't copilot itself be covered by that license?
|
| (Certainly people using copilot would not. Let the user
| beware.)
|
| Edit: downvoted to death but the top reply shows that it's
| true. An inconvenient truth, I suppose.
| krono wrote:
| From the GH TOS:
|
| > _4. License Grant to Us_
|
| > _This license does not grant GitHub the right to sell Your
| Content. It also does not grant GitHub the right to otherwise
| distribute or use Your Content outside of our provision of
| the Service_
|
| https://docs.github.com/en/github/site-policy/github-terms-o...
|
| > _5. License Grant to Other Users_
|
| > _If you set your pages and repositories to be viewed
| publicly, you grant each User of GitHub a nonexclusive,
| worldwide license to use, display, and perform Your Content
| through the GitHub Service and to reproduce Your Content
| solely on GitHub as permitted through GitHub's functionality
| (for example, through forking)._
|
| > _You may grant further rights if you adopt a license._
|
| https://docs.github.com/en/github/site-policy/github-terms-o...
|
| So yes, but only within GitHub.
|
| Edit:
|
| > _A. Definitions_
|
| > _The "Service" refers to the applications, software,
| products, and services provided by GitHub, including any Beta
| Previews._
|
| https://docs.github.com/en/github/site-policy/github-terms-o...
|
| Sneaky bastards.
|
| Edit: Formatting
| ben0x539 wrote:
| Not everything on Github was uploaded by the copyright
| holders. Often enough, it's uploaded by people who only have
| access to it under an open source license, so Github cannot
| in general squeeze additional license terms out of the
| uploader at that point.
| leereeves wrote:
| That's a good point; what is their obligation/liability in
| that case?
| glitchc wrote:
| There's more than redistribution happening here. Co-pilot is
| providing a value-add service where the open-source code is
| an input and the output is a service. As it happens, the
| service is actually regurgitating the code at this point, but
| it's important to consider that even if it didn't regurgitate
| the code verbatim, the fact that the service is making use of
| that code to provide a value-add means the code is a crucial
| input to the value proposition. Would Co-pilot be able to
| provide the value-add without the source? Likely not.
|
| Couple that with the fact, that presumably at some point in
| the future, Co-pilot will come attached with a subscription
| model (otherwise why do it in the first place?), and we have
| the makings of a product that is commercially infringing on
| copyright left, right and center.
| emrah wrote:
| I'm thinking it's not so much what is legal for Copilot to do
| with code chunks from GPL'ed code, but what it means for end
| users (i.e. developers at for-profit companies) to
| incorporate those chunks into commercial products
| moralestapia wrote:
| No.
|
| Edit: Sorry downvoters, whether you like it or not, you don't
| understand the terminology. You're confusing _reproduction_
| with _redistribution_.
| leereeves wrote:
| I'm not a lawyer so it's entirely possible I used the wrong
| term. Thank you for clarifying below.
|
| Using the terms as you explained them below, I meant that
| Microsoft/GitHub has permission to _reproduce_ the code so
| why wouldn't that extend to copilot?
| blooalien wrote:
| Are they displaying the _license_ under which said code
| is licensed when they display a chunk of licensed code?
| If not, then they're violating the terms of most
| licenses (except pure public domain, or other similar
| licenses which don't have any such requirements
| attached).
|
| The use of licensed code in other projects must be done
| under the terms of that license or you aren't legally
| (under copyright law) allowed to use the code.
| leereeves wrote:
| As I said, I'm not a lawyer, but I believe they're
| displaying it under the terms of the GitHub ToS, using
| rights granted to them when the project is uploaded to
| GitHub, not under the terms of the license the project
| uses for everyone else.
| moralestapia wrote:
| _Reproduction_ is enough to cover the first part of your
| use case. This is mentioned in Github's TOS.
|
| For the latter you would need _redistribution_ as it is
| going into a different product, for which you claim
| ownership, and with possible modifications/adaptations
| (this would depend on the rights granted by the license).
| Nowhere on Github's TOS is the word or concept of
| _redistribution_ referenced.
|
| So, the answer to your original question is "no".
|
| Edit: leereeves modified their comment after I wrote this,
| so it may not make much sense but you can figure out the
| point. Best!
| dahart wrote:
| I'm not sure this is a completely fair take, I think the
| original question is legitimate and relevant. Github's
| TOS does in fact ask the contributor to grant a license
| for GH to host and serve their code from GH servers. That
| is both reproduction and distribution as defined by
| copyright law, and copyright covers both of those at the
| same time https://www.copyright.gov/what-is-copyright/
|
| (Edit and BTW GH calls out their 'distribution' in
| section D.4 of their TOS explicitly, but without using
| the word "distribute". They say you grant them the right
| to "publish" and "share" code you upload, which means
| "distribute" under copyright law. They also imply that by
| spelling out the terms under which they do not
| "distribute", which is anytime the content is used
| outside of GitHub's services.)
|
| I don't think you're correct that the term
| "redistribution" means going into another product,
| or that it implies a claim of ownership. Putting works
| into another product is sometimes known as making a
| _derivative_ work, while "redistributing" is quite
| commonly used to mean copy-and-distribute as-is.
| Redistribution can happen via license as well, it
| requires permission by the copyright owner, but does not
| imply the redistributor is (or is claiming to be) the
| copyright owner.
| moralestapia wrote:
| >I think the original question is legitimate and relevant
|
| You didn't see the original question, it was edited, so
| we cannot discuss that further.
|
| "[...] which means "distribute" under copyright law" <--
| Citation needed please, because I don't think that's
| correct.
|
| From the site you linked:
|
| "Distribute copies or phonorecords of the work to the
| public by sale or other transfer of ownership or by
| rental, lease, or lending."
|
| What I seem to grasp about the difference between
| _reproducing_ and _redistributing_ is that it has to do
| with the concept of "transfer of ownership". Also
| _derivative work_ and _redistribution_ are not mutually
| exclusive.
|
| The moment you create a new thing and start
| _distributing_ it (even if you do not modify it), you
| become the de facto owner of that new product, and
| copyright law is trying to limit the extent of the rights
| that apply there. So, in the case of music, it's a
| different thing to play (_reproduce_) a song than to
| create a new album with your favorite artists that
| happens to include that particular song
| (_redistribution_).
| dahart wrote:
| > "Distribute copies or phonorecords of the work to the
| public by sale or other transfer of ownership or by
| rental, lease, or lending."
|
| > What I seem to grasp about the difference between
| reproducing and redistributing is that it has to do with
| the concept of "transfer of ownership". Also derivative
| work and redistribution are not mutually exclusive.
|
| What you've misunderstood is it is the _copies_ that are
| sold, not the copyrights.
|
| * edit
|
| > create a new album with your favorite artists that
| happens to include that particular song (redistribution).
|
| This is not what redistribution means. You seem confused
| about this word.
| moralestapia wrote:
| >What you've misunderstood is it is the copies that are
| sold, not the copyrights.
|
| Sorry, I'm not following you anymore. I don't even know
| what you mean by that sentence.
|
| Edit:
|
| >This is not what redistribution means. You seem confused
| about this word.
|
| But, that's exactly what redistribution entails ...
| dahart wrote:
| > Sorry, I'm not following you anymore. I don't even know
| what you mean by that sentence.
|
| The transfer of ownership you referred to is a transfer
| of ownership of a copy, it is not a transfer of ownership
| of the original work itself. You misunderstood the
| passage you quoted to mean that redistribution is
| transferring ownership of the work itself, as in
| copyright ownership of the work. But the text you quoted
| is only talking about transferring ownership of the
| copies. The text you chose makes more sense in the
| context of physical copies of books or "phonorecords".
| [deleted]
| dahart wrote:
| > You're confusing reproduction with redistribution.
|
| It seems like you're confused; GitHub's terms require users
| to grant both of those. Copyright law also covers both.
| moralestapia wrote:
| >GitHub's terms require users to grant both of those
|
| Last time I checked (about an hour ago), that wasn't
| true. Feel free to provide evidence to support your
| argument.
| dahart wrote:
| > Last time I checked (about an hour ago), that wasn't
| true. Feel free to provide evidence to support your
| argument.
|
| https://docs.github.com/en/github/site-policy/github-terms-o...
|
| "publish" and "share" mean redistribution. "Store" and
| "copy" mean reproduce.
| moralestapia wrote:
| >"publish" and "share" mean redistribution
|
| No. That's something you believe, but it's not
| necessarily true.
|
| Check here: https://copyrightalliance.org/faqs/what-rights-copyright-own...
|
| Again, distribution has to do with a transfer of
| ownership. In layman's terms, Github can _show_ your code
| to others but it cannot _give_ (as in ownership) your
| code to them. It's a bit tricky here since on the web
| showing something literally means making a copy at some
| point, but try to view things under the light of "who
| owns what" and it's a bit easier to grasp.
|
| If you browse through someone's repository, it's pretty
| clear who the owner of that code is, if a program gives
| you a chunk of code that it "got from somewhere" there's
| definitely some sort of change of ownership operation
| going on; which in this case is interesting, as it went
| from _attributed to someone_ to _missing_.
| maximilianroos wrote:
| > GitHub Copilot is indeed infringing copyright and not only in
| a grey zone, but in a very clear black and white fashion
|
| You seem to be confusing what you'd like the law to be with
| what the law is.
| drran wrote:
| Here is an explanation of the law:
| https://www.copyright.gov/fair-use/more-info.html#:~:text=Fa....
|
| Effect of the use upon the potential market for or value of
| the copyrighted work: Here, courts review whether, and to
| what extent, the unlicensed use harms the existing or future
| market for the copyright owner's original work. In assessing
| this factor, courts consider whether the use is hurting the
| current market for the original work (for example, by
| displacing sales of the original) and/or whether the use
| could cause substantial harm if it were to become widespread.
| progval wrote:
| > Any open-source code available on Github is controlled by the
| copyright notice of the owner granting specific rights to
| users.
|
| and by the GitHub ToS:
|
| > You grant us and our legal successors the right to store,
| archive, parse, and display Your Content, and make incidental
| copies, as necessary to provide the Service, including
| improving the Service over time. This license includes the
| right to do things like copy it to our database and make
| backups; show it to you and other users; parse it into a search
| index or otherwise analyze it on our servers; share it with
| other users; and perform it, in case Your Content is something
| like music or video.
|
| https://docs.github.com/en/github/site-policy/github-terms-o...
| TomVDB wrote:
| Does the GitHub ToS matter when I upload code that was
| written by somebody who doesn't use GitHub?
| progval wrote:
| Then you would be the one infringing their copyright, and
| they could probably sue you.
|
| Although I'm curious about what GitHub would do if the
| original author asked them to remove the work from Copilot.
| Retrain from scratch every month or so, to remove last
| month's DMCAed content?
| GoOnThenDoTell wrote:
| Not everyone whose code ends up on GitHub has agreed to this
| set of terms
| progval wrote:
| See my answer to
| https://news.ycombinator.com/item?id=27741709
| eCa wrote:
| > > as necessary to provide the Service
|
| I would consider Copilot to not be part of "the Service"[1],
| but at least currently[2] the definition of "the Service" is
| so vague as to include anything that Github does.
|
| Maybe they consider Copilot to be a "search index" and the
| suggestions "[sharing] [Your Content] with other users".
|
| [1] Since, as I understand it, it will require separate
| payment.
|
| [2] The ToS is currently last edited 2020-11-16, and does not
| contain the word "Copilot"
| j4yav wrote:
| The part that feels really obvious to me is that, if I made an
| AI that could generate music by looking through the entire
| (copyrighted) back catalog of the Beatles for example, and it
| would output music that I could control to be very much or even
| exactly like the original recordings, or I could accidentally
| do it, that it wouldn't really be a way to launder the original
| licenses/copyright into the public domain.
|
| Or maybe it is, but if so it essentially means the end of
| licensing because it would be trivial to make an AI that can
| take an input and produce the same output. Or maybe even cp is
| good enough to strip the source of its original license in that
| case.
|
| Open source licenses are worth protecting or you break the
| cycle that helps more software be open.
| Jarwain wrote:
| Wouldn't the parallel be closer to having an ai remix a bunch
| of songs together?
| aj3 wrote:
| You're giving AI too much credit, it's just a tool, it does
| not have its own intentions.
|
| I.e. if you buy a piano or a guitar, you could play and
| record copyrighted music on it. That's not the piano's or
| the guitar's fault though, it's yours.
| alex_c wrote:
| Funny you should say that, as there is a direct line
| connecting player pianos in the 19th century to copyright
| law in the 21st:
|
| https://en.m.wikipedia.org/wiki/Mechanical_license
| [deleted]
| sillysaurusx wrote:
| It sounds like you might be the one giving it too much
| credit. AI is a glorified Markov chain, which is
| essentially a compression algorithm. I agree that it can be
| an instrument (I've done it:
| https://soundcloud.com/theshawwn/sets/ai-generated-videogame...)
| but it's almost trivial to train a model that memorizes by rote.
|
| Suppose a model was trained solely on a single Beatles
| album. It could only spit out that album. That would be
| clear infringement, wouldn't it?
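|
| A minimal sketch of memorizing by rote, using a toy order-2 Markov
| chain as a stand-in (an assumption made for brevity; it is not how
| Copilot or GPT-3 is built). Trained on a single sequence in which
| every two-token context is unique, the chain has exactly one
| continuation at each step, so it can only replay its training data.
| The names `train` and `generate` and the sample text are
| hypothetical.

```python
import random
from collections import defaultdict


def train(tokens, order=2):
    """Record which token follows each `order`-length context."""
    table = defaultdict(list)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        table[context].append(tokens[i + order])
    return table


def generate(table, seed, length, order=2):
    """Extend `seed` by sampling a recorded successor of the latest context."""
    out = list(seed)
    while len(out) < length:
        successors = table.get(tuple(out[-order:]))
        if not successors:
            break  # context never seen in training: nothing to emit
        out.append(random.choice(successors))
    return out


# The entire "training set" is one short sequence.
tokens = "the quick brown fox jumps over the lazy dog".split()
table = train(tokens)

# Seeding with the first two tokens replays the training sequence.
replay = generate(table, tokens[:2], len(tokens))
print(" ".join(replay))
```

| With more training data and repeated contexts the sampling step
| starts to matter; the single-source case is the degenerate one the
| comment describes.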
| toxik wrote:
| It's funny that people say it's a glorified Markov chain.
|
| No. It's not. A Markov chain has some very specific
| properties that are absolutely not fulfilled by GPT-3
| models.
|
| Just say "stochastic" if you want a buzzword. Stop
| appropriating Markov chains.
| sillysaurusx wrote:
| "It's a stochastic" doesn't flow, though I guess I could
| use "stochastic random walk."
|
| What properties does a gpt-3 model have that a Markov
| chain doesn't? (Other than effectiveness.)
| exdsq wrote:
| ANNs aren't Markov chains
| alkonaut wrote:
| A typewriter and a machine that recites paragraphs of
| Shakespeare are two different things.
| j4yav wrote:
| Neither of them unbinds the content from the original
| license, though.
| koonsolo wrote:
| When I press 1 key and it plays copyrighted music, that is
| the piano's fault.
| aj3 wrote:
| Even if hypothetically there was such a strange bug in
| your piano and you decided to exploit it by recording
| copyrighted music and redistributing it, you would be
| accountable for it, not the piano.
|
| This analogy train went too far, don't you think? All
| examples that I've seen on Twitter require quite an
| intentional manipulation by a human for Copilot to produce
| something copyrighted. It does not recite Linux code by
| pressing 1 key.
| hnmullany wrote:
| If you have an electronic piano that requires a complex
| series of button pushes to produce copyrighted music -
| that's still a copyright violation. Copyright law has no
| notion that the difficulty of reproducing copyrighted
| content affects the fact of a violation.
| akersten wrote:
| > an electronic piano that requires a complex series of
| button pushes to produce copyrighted music
|
| Surely a judge presented with the "complex series of
| button pushes," otherwise known as playing an instrument,
| would hold the player accountable for any infringement
| and not the piano?
|
| These analogies have gone so far off the rails that I
| can't tell which side this thread is arguing for by now
| ;)
| b3morales wrote:
| I think the whole swirling discussion is a little
| confused because there are potentially two "ends" where
| infringement could happen, and different people are
| talking about each. And the article covers both.
|
| One end is GitHub's, at the input: Copilot's "database"
| was initialized from code that GitHub does not have
| copyright to. The contention at this end is that they are
| ignoring the licenses that would grant them the right to
| use that code.* The article, GitHub, and others assert
| that there's no copyright issue for creating a database
| of this kind (a machine learning model).
|
| The other end is the developer taking Copilot's
| output. The article seems to take the (absurd IMO)
| position that there are also no copyright implications
| here, because the output _is not copyrightable_ at all.
|
| *And personally this is the side that concerns me most.
| [deleted]
| [deleted]
| dahart wrote:
| If you have a piano that plays copyrighted music when you
| press a single key, isn't that the piano _maker's_ fault?
|
| Edit - googling, the history of player pianos vs
| copyright is interesting
|
| https://slate.com/technology/2014/05/white-smith-music-case-...
|
| https://www.techdirt.com/articles/20100712/18325210185.shtml
| freshhawk wrote:
| "Or maybe it is, but if so it essentially means the end of
| licensing because it would be trivial to make an AI that can
| take an input and produce the same output."
|
| Yes, this is what is pretty interesting to me. I said in a
| previous comment that I have a really good OS generating AI.
| It asks you your favorite color and outputs a disk image you
| can use as an installer.
|
| Right now it just happens to output a cracked version of
| Windows if you answer "blue". Who can know how that happened?
| It's a black box after all. Seems useful though, since
| Microsoft is loudly saying that if I distributed this it
| would have no license problems at all.
| remus wrote:
| I think the main point that the article makes is that for
| copyright to work you need some notion of a creative work,
| and so far it's generally accepted that snippets like
| i = i + 1
|
| aren't creative enough to be covered by copyright. The
| interesting point is where you draw the line between what's
| boilerplate and what's creative, and legally it will
| presumably come down to showing that copilot crosses that
| line egregiously enough for someone to think they've got a
| successful chance at legal action.
| b3morales wrote:
| Sure, but GitHub's own promotional pages (pretty much any
| of the gifs on https://copilot.github.com/ as well as other
| articles, e.g.
| https://docs.github.com/en/github/copilot/research-recitatio...)
| show it producing much more elaborate segments than that.
|
| In fact, that's a crucial selling point for the product.
| j4yav wrote:
| Since that article was written people have shown it will
| generate quite long coherent sections. It will even
| generate someone's private about me page:
| https://twitter.com/kylpeacock/status/1410749018183933952?s=...
| nanna wrote:
| But that about me page is the very definition of
| boilerplate text, so really it only gives weight to the
| argument that it's _not_ producing original work.
| dmurray wrote:
| You got downvoted, but I kind of like this argument.
| There are a million "about me" pages, but Copilot did a
| good job of picking one for "generic software engineer".
| If it could just have changed a word or two to a synonym,
| it would be great.
| phire wrote:
| Bullshit.
|
| That's not an existing about me page. You can go to
| davidcelis' website and verify that it's completely
| different.
|
| Copilot just picked a random person and linked to their
| social media accounts. You can search any large quote
| within that about me on Google and not find a match, it
| is unique.
|
| The only two examples of generating large sections of
| copyrighted work are the quake floating point hack and
| the zen of python. Both those examples are commonly known
| and copied and talked about, to the point that they have
| wikipedia pages.
| thayne wrote:
| But as I understand it, copilot can generate much longer
| snippets, even entire functions.
|
| I think the big question is, if copilot ends up copying
| significant portions of a GPL work, not just tiny snippets, is
| the resulting work infringing, and if so, who is liable?
| bdowling wrote:
| > and it would output music that I could control to be very
| much or even exactly like the original recordings, or I could
| accidentally do it, that it wouldn't really be a way to
| launder the original licenses/copyright into the public
| domain.
|
| The test for non-literal copyright infringement is
| "substantial similarity." If, after filtering out irrelevant
| and non-copyrightable elements, the allegedly-infringing work
| is substantially the same as the original work, then it
| infringes. If it infringes, then two common defenses are
| independent creation and fair use.
|
| In your hypothetical, the AI-generated work would infringe
| the original because you stated it would be substantially the
| same as the copyrighted work. You can't claim independent
| creation because the algorithm was dependent on the original
| work and you controlled the output of the algorithm to be
| exactly like the original work. Fair use is pretty much a
| non-starter, so I'll skip that analysis.
|
| So, no, you couldn't use an AI to launder copyrighted works
| into the public domain.
| Aeolun wrote:
| Unless you are Github, in which case having your AI copy
| code verbatim is ok?
| DarkmSparks wrote:
| There were two parts to the argument which seem to hold water.
|
| 1. Any code generated by Copilot is likely to be AGPL.
|
| 2. Since the authors of Copilot used the Copilot beta to make the
| Copilot release, Copilot is very likely using AGPL-licensed code
| and is therefore in breach of the AGPL licence.
|
| so yep, article looks flawed.
| makecheck wrote:
| Of _course_ derivative works are being produced!! Whether you
| blame Copilot or the developer using it, the result is something
| that required the original developer of the code in order to be
| constructed.
|
| Have we reached the point where every "class X" must become
| "class X_GPL2_CopyrightJohnQSmith_AllRightsReserved" in every
| code base out there? Do we need to go from header comments at the
| top of a file to reminder comments at the end of every line?
| truffdog wrote:
| If Microsoft is confident that Copilot is not a parrot, they
| should include their proprietary codebases in the training
| database.
| BuildTheRobots wrote:
| Does anyone know which codebases got included? I get the
| impression copilot scraped github - but as it's an internal
| tool, did it only scrape public repos or have private repos also
| been slurped?
| xdennis wrote:
| There are torrents of leaked Windows source code. Someone
| with access to Microsoft Copilot could try to see if it
| reproduces the code there.
| dleslie wrote:
| If they truly believe copilot does not produce derivative
| works, then there is no downside to indexing their own code in
| its entirety; it would probably improve copilot's behaviour.
|
| Well, Microsoft, show us you believe your own arguments!
| hnfong wrote:
| A counter argument that Microsoft can use is: "The code we
| write at Microsoft is so bad that it will decrease the
| quality of the generated output". :)
| tyingq wrote:
| Guess round 2 will have Copilot dumping to AST, changing function
| and variable names, then dumping back to source.
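|
| A minimal Python sketch of that round trip, assuming Python 3.9+
| for ast.unparse. The helper rename_source and the rename mapping
| are hypothetical names for illustration only; nothing here
| reflects how Copilot actually works.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename functions, parameters, and variables using a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        # Rename the function itself, then descend into its args and body.
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        # Function parameters are ast.arg nodes, not ast.Name nodes.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        # Every other read or write of an identifier.
        node.id = self.mapping.get(node.id, node.id)
        return node


def rename_source(source, mapping):
    """Parse source to an AST, rename identifiers, and dump back to source."""
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    return ast.unparse(tree)


original = "def area(width, height):\n    return width * height  # w * h\n"
print(rename_source(original, {"area": "f", "width": "a", "height": "b"}))
```

| The ast module drops comments and original formatting on the way
| through, but the logic of the function is otherwise unchanged.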
| sprafa wrote:
| Amazing how this was never an issue when other "AI" systems use
| other people's data to learn how to drive cars/write text. But
| man you start messing with developer data and suddenly there are
| ethical issues! Amazing turnaround.
|
| Face it - AI as we currently call it is just a very sophisticated
| data sorting algo in most cases (let's ignore the AlphaZero non
| supervised learning type). Everyone was celebrating when
| Common Man was destroyed by devs commoditising their knowledge
| through data capture. But now suddenly it's a problem! Mess with
| a man's pocket.
| ben0x539 wrote:
| How was this never an issue? People including devs have been
| upset by AI and data mining for a long time.
| ramraj07 wrote:
| Show me a 700 comment HN thread about people worried about
| gpt violating copyright.
|
| Then show me 20 such threads because that's what's been spewed
| here.
| sprafa wrote:
| not in this forum, they were rubbing their hands in glee at
| insane AI valuations.
| Rapzid wrote:
| > GPT-3 trained on the ENTIRE INTERNET
|
| Red carpet.
|
| > Copilot trained on publicly available source code
|
| Pitch forks.
| dj_mc_merlin wrote:
| I think a good deal of engineers here should familiarize
| themselves with Julia Reda and her work and ask themselves if
| they have the legal knowledge to debate on this matter. Common
| knowledge is not acceptable to determine truth.
|
| Would you really respect the opinion of some dude who's only used
| Excel about your profession?
| josefx wrote:
| She cites GitHub's smallest excerpt claim for her reasoning when
| we already know that the tool happily reproduces entire
| functions with comments verbatim.
|
| Also her claims about machine generated code have a really
| funny interaction with the cp command. Clearly cp
| MicrosoftWindows11Source.zip FreeWindows.zip is not a creative
| process, cp is a command executed by a machine hence the
| contents of FreeWindows.zip are now entirely public domain. Man,
| where was she when people were sued over creating entire
| libraries of public domain movies using BitTorrent?
| ramraj07 wrote:
| Just like you accuse her of being out of date with recent
| findings, y'all seem conveniently out of date with GitHub's
| assurance that they will be adding checks to not regurgitate
| full chunks of code. So what exactly is your point then?
| ramraj07 wrote:
| Won't work. This tool is attacking them like the presence of a
| vegan attacks some hardcore meat eaters. They might realize
| deep down that this is not an argument they can win but it
| offends their core existence in some ways so they can't help
| but die defending their incoherent arguments.
|
| Ethical or not, it's clear Microsoft isn't going to get into
| real legal trouble due to this, and if the tool is genuinely
| useful, it's going to "allow the laundering of GPL code" into
| companies, whatever that means.
|
| If that offends people then they better learn the lesson and
| not produce open source any more. I'm not happy, but if that's
| the direction the natural progression of things takes, whatever;
| let's see where that goes.
| Causality1 wrote:
| _The output of a machine simply does not qualify for copyright
| protection - it is in the public domain._
|
| Is it just me or is that a patently ridiculous statement? The
| output of a machine belongs to the person owning/using the
| machine. If I use a digital camera to take a picture of a
| copyrighted image I'm still committing copyright infringement
| despite the output being created by a machine and a bunch of
| image processing software.
| dr_kiszonka wrote:
| My less lofty personal gripe with Copilot is as follows. I worked
| hard to produce quality code. GitHub will make money off my code.
| Copilot users will make money using my code. I - the creator -
| will make nothing.
|
| At the very least, I should have been asked whether my code can
| be used by Copilot and I should get at least a share of the profit
| Copilot generates every month, where the share equals my code
| / all training code used by Copilot. The latter part could be
| gamed by other developers in the future, but it's the best I
| could come up with.
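|
| As a rough sketch of that formula in Python: the function name,
| the use of line counts as the unit of "code", and every number
| below are made-up assumptions for illustration, not anything
| GitHub has published.

```python
def pro_rata_share(my_lines: int, total_lines: int, monthly_profit: float) -> float:
    """Return monthly_profit * (my code / all training code)."""
    if total_lines <= 0 or not 0 <= my_lines <= total_lines:
        raise ValueError("need 0 <= my_lines <= total_lines and total_lines > 0")
    return monthly_profit * my_lines / total_lines


# Hypothetical figures: 5,000 of 10,000,000 training lines, $1,000,000 profit.
print(pro_rata_share(5_000, 10_000_000, 1_000_000.0))
```

| Counting lines is also exactly what makes the scheme gameable:
| padding a repository raises the numerator without adding value.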
| CyberRabbi wrote:
| A determination of fair use does take commercialization into
| account so this is a fully valid concern. GitHub is explicitly
| looking to profit from the work of others.
| FeepingCreature wrote:
| If you didn't want your code to be reused or even
| commercialized by others, you really shouldn't have made it
| opensource.
| heavyset_go wrote:
| I'm okay with reuse and commercialization as long as the
| licensing terms of my code are adhered to. That means proper
| attribution, distribution of copyright notice and license,
| and making modified code available to users. Copilot does
| none of that.
| ben0x539 wrote:
| My understanding is that Github's argument is that their use
| of the code to train Copilot is fair use. As such, whether
| the code in question has been released as open source only
| matters to the extent that it makes it more convenient for
| Github to access it, but the argument would work as well for
| a proprietary codebase.
|
| Edit: I just skimmed the copilot blurb again, they seem to
| refer to "publicly available" sources and not open source
| code as their input.
| drran wrote:
| Yes, this is M$ argument, but it's not backed by law. See
| [0].
|
| "Publicly available" is not "in public domain". Many
| commercial songs are "publicly available" via a radio
| station.
|
| [0]: https://www.copyright.gov/fair-use/more-info.html#:~:text=Fa....
| [deleted]
| swiftcoder wrote:
| If my code isn't released under a permissive license, then I
| might have the expectation that those wishing to use my code
| for commercial purposes will contact me and pay for a
| commercial license.
|
| This is sort of the whole point of non-commercial licensing
| (and often, of the GPL itself, since many potential licensors
| don't wish to deal with GPL restrictions).
| IshKebab wrote:
| Sure but did you have the expectation that people wouldn't
| read your code and learn from it? I think even non-
| commercial licensing can't prevent that.
|
| If your code is so super-special that you don't want people
| to read it and go "ah that's a neat linked list reversal
| algorithm" or whatever then your only options are software
| patents or keeping it entirely closed source.
|
| Maybe trade secrets, but they tend to apply in very very
| limited circumstances. I doubt any software would qualify.
| drran wrote:
| Yes, people can read my open-sourced code and learn from
| it, like they can do with paintings, movies, sculptures, and
| books.
|
| No, I don't allow my code to be copied freely.
|
| Can you explain what point you are trying to defend
| here?
| visarga wrote:
| > I - the creator - will make nothing.
|
| Did you benefit from reading code in your education? Pass it
| forward! You will benefit many people, don't cut the rope under
| you. And in turn you will also get the same benefit, and
| adapted to your needs.
| yunohn wrote:
| Here we go again, a legal expert weighs in with a long and
| detailed post about Copilot;
|
| And HN rallies to criticize it because Copilot can reproduce some
| snippets when forced to.
| joshuaissac wrote:
| > a legal expert weighs in [...] And HN rallies to criticize it
|
| That is an appeal to authority. Being a legal expert does not
| excuse one's writing from critical analysis. In this case, the
| post does not address Copilot reproducing large segments of
| copyrighted code verbatim. That is valid criticism.
| yunohn wrote:
| It is not an appeal to authority. I'm saying the expert is
| providing a legal explanation, and HN is throwing anecdotes
| around.
|
| There is no logical fallacy since HN refuses to even have a
| logical discussion about Copilot.
| IncRnd wrote:
| The GP is correct. It is a logical fallacy that by
| definition is an appeal to authority, and this is the
| logical discussion.
| floatingatoll wrote:
| This discussion is heavily biased and prioritizes
| people's emotional need to be credited and/or paid for
| their work over a discussion of the legal and ethical
| concerns at play here. It disregards the comments of an
| expert in the field and focuses instead on demands that
| may well be unsupported by copyright law.
|
| For example, GitHub license section D.4 specifically
| grants GitHub the right to display your content, analyze
| your content, and reproduce it in full to other users of
| the service. Yet no one seems particularly interested in
| discussing that here today, because it isn't compatible
| with the outrage that people are prioritizing on HN when
| discussing Copilot.
|
| I would have expected HN to be better than Reddit in this
| regard, but I'm not seeing it yet. I don't know if the
| expert is right or wrong here, but nothing in today's
| comments suggests anything new or curious that hasn't
| already been ranted about in every prior thread about
| this topic. I specifically care about copyright law and
| it's disappointing to see HN having a group tantrum
| instead of a discussion.
|
| https://docs.github.com/en/github/site-policy/github-terms-o...
| ghaff wrote:
| The legal commentary I'm seeing from people who really
| know this stuff is pretty much unanimously in favor of
| this being legal in at least most of the world based on
| caselaw--while acknowledging why some might have ethical
| concerns.
|
| I'm actually sort of curious as to the vigor of the
| backlash. Because Microsoft? Because of concerns about
| perceived further undermining of the GPL in particular?
| Because of people anxious to get their credit?
| Because...?
| floatingatoll wrote:
| Because they're not getting a share of GitHub's future
| revenues from their works or from derivations of their
| work.
|
| (Why do they care so much about revenue? Open source
| coders and 'starving artists', not to mention Covid
| economic wreckage, the US approach to medical insurance,
| and the total absence of Universal Basic Income in
| virtually all countries permitted to access GitHub.)
| ghaff wrote:
| So don't open source it and/or put it on GitHub?
| floatingatoll wrote:
| The latter, it turns out, is more important than the
| former.
| IncRnd wrote:
| > I'm actually sort of curious as to the vigor of the
| backlash. Because Microsoft? Because of concerns about
| perceived further undermining of the GPL in particular?
| Because of people anxious to get their credit?
| Because...?
|
| Because, this is really against the understanding of what
| was possible for copyrighted works. So, now that this is
| possible for anyone, copyright will start to get examined
| and hopefully updated to be useful in today's
| environment.
|
| There are about a million problems with this.
|
| This can even be used to intentionally launder source
| codes from a competitor. Apparently, all it will take
| will be to steal code (or just fork it), then create more
| than 10 copies on Github. At that point, copilot will
| start to emit the code during use. With all the legal
| commentary saying this isn't infringement, imagine how
| companies will be able to use this product.
|
| Similarly, the training set can be intentionally
| polluted, so your competitor finds the output of Copilot
| worthless.
| IncRnd wrote:
| > For example, GitHub license section D.4 specifically
| grants GitHub the right to display your content, analyze
| your content, and reproduce it in full to other users of
| the service. Yet no one seems particularly interested in
| discussing that here today, because it isn't compatible
| with the outrage that people are prioritizing on HN when
| discussing Copilot.
|
| Well, Copilot isn't really an analysis and display of
| the source code within the original meaning that people
| held. That was meant more to run codeql, github actions,
| and other analysis while presenting the results in a
| repository to people. People never anticipated that
| github would strip their licenses from files and present
| their source code inside of VSCode for people to use
| freely. It may be legal, but what we are seeing now is an
| abuse of the sentences you just quoted that goes outside
| what they were originally understood to mean.
| floatingatoll wrote:
| Is it fair use to remix two musical albums into a new
| derivative work, that cannot plausibly be judged to
| replace demand for either original work?
|
| Is it fair use to autogenerate GIFs from movies, perhaps
| the most protected digital works on the Internet today,
| in order to use them as reaction memes on Imgur?
|
| Is it fair use to autoextract code fragments from a code
| base, in order to use them as suggestions on GitHub?
|
| The Internet, and I imagine HN, was in an uproar when the
| music industry attempted to kill the White Album, because
| it infringed on their freedom to remix and derive.
|
| The Internet, and I imagine HN, was in an uproar when MLB
| attempted to kill unauthorized baseball GIFs and replace
| them with official curated ones, because it infringed on
| their freedom to remix and derive.
|
| How, precisely, is remixing and deriving from code
| 'abusive', in contrast to the past ten or twenty years of
| pressure on the Internet to the contrary when remixing
| and deriving from music or movies?
|
| This is a core point of the original post linked above,
| where the author is shocked by our demands for more
| prohibitive copyright interpretations, and I want to call
| this out more bluntly and less politely than they did:
|
| Fair use of a work is _almost always_ perceived as
| abusive and unfair by the creator of a work. Creators
| ignore the cognitive dissonance between their demand to
| have fair use rights granted _more_ easily to the
| protected works of others, and their demand to have fair
| use rights granted _less_ easily to their own protected
| works.
|
| I see that dissonance go unaddressed in every top-level
| comment in today's discussion. I see that desire to deny
| fair use rights driving hundreds of emotional me-too
| posts, without considering the framing of _whether_ it is
| fair use in alignment with every prior copyright outrage
| we've discussed over the years.
|
| My theory is that permitting discussion of fair use would
| weaken their efforts to groundswell a pitchfork mob, and
| no one wants to confront their own biases or emotional
| investment or inability to profit from their code.
|
| Whatever the motivations, HN deserves better than this.
| hnfong wrote:
| Good point. Here's my 2 cents -
|
| > D.4 specifically grants GitHub the right to display
| your content, analyze your content, and reproduce it in
| full to other users of the service
|
| If you read the section carefully, this covers the right
| of GitHub to do those things to your content "as
| necessary to provide the Service". "It also does not
| grant GitHub the right to otherwise distribute or use
| Your Content outside of our provision of the Service".
|
| So, does "Service" only cover the type of Github's
| service at the time of the agreement, or does it allow
| Github to invent all kinds of unrelated services and use
| the code as such? If Github can provide a "Copilot"
| service that arguably "learns" the code, can it also
| provide a service that blatantly "copies" large pieces of
| source code for the user (without complying with OSS
| license terms)?
|
| It's not very clear what the answer would be, but if what
| I described is allowed, the consequences of this term
| being so broad would imply that if you're not the
| copyright owner of code you uploaded to Github, you've
| probably violated some OSS license by agreeing to
| Github's terms.
| floatingatoll wrote:
| Which OSS licenses are potentially incompatible with
| GitHub? Are they also incompatible with GitLab? How can
| one or the other be judged to have exceeded the bounds of
| what is permissible as a user-generated content provider,
| and/or fair use rights, in the legal jurisdiction of
| each?
| ben0x539 wrote:
| > For example, GitHub license section D.4 specifically
| grants GitHub the right to display your content, analyze
| your content, and reproduce it in full to other users of
| the service. Yet no one seems particularly interested in
| discussing that here today, because it isn't compatible
| with the outrage that people are prioritizing on HN when
| discussing Copilot.
|
| How applicable is the Github license when a lot of code
| on Github (let's say eg. the Linux kernel) was posted
| there by people other than the individual copyright
| holders? I'd assume they can only rely on the open source
| license of the code in question, and not really on
| additional license terms. As far as I can tell, Github
| claims fair use rather than citing their license.
| floatingatoll wrote:
| That's perhaps the most important question of this entire
| debate, and it's the one that no one is considering
| seriously here in the comments. I personally think that
| it's because no one at HN is both competent enough at
| copyright and licensing law to debate it _and_ willing to
| spend time debating it with Internet commenters for a $0
| /hour wage.
| treffer wrote:
| Well, I have a hard time drawing a line between GitHub Copilot
| and a compression algorithm.
|
| If you can reproduce a verbatim copy of the Quake source code after
| having taken that source code as input, then that's compression.
| A really fancy one, but still.
|
| And given that it reproduces the source code: it has to hold that
| somewhere.
|
| It would be very interesting if someone could reproduce the Quake
| example with AGPL code, then request the whole model + code
| because it clearly contains the AGPL code in some encoded form.
| abriosi wrote:
| Some purists may say learning is compressing
| Syzygies wrote:
| Yes! In every form, lossy compression is distilling
| meaningful information from noise.
|
| This is a great legal question as it concerns our use of
| machine agents. We can learn from copyrighted literature or
| code that we read. Why can't our agents?
| Zababa wrote:
| > We can learn from copyrighted literature or code that we
| read.
|
| Not everywhere. Emulators communities often prohibit people
| from contributing if they've read the original code to
| protect themselves from copyright claims.
| AlotOfReading wrote:
| Because the process is different. You and any computer
| agent are allowed to learn the functional, non-
| copyrightable elements of fast inverse sqrt. When you need
| that functionality, you can write code that implements your
| understanding of those non-copyrightable elements and gain
| copyright over the resulting creative expression.
|
| What you _can't_ do is copy all of the creative expression
| in the original (such as comments) without complying with
| the terms of the license. Moreover, reproducing the magic
| constants is a strong indication that your process didn't
| independently derive your code because the constants used
| in the original are unique and non-optimal.
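|
| For reference, a sketch of the technique in Python. 0x5F3759DF is
| the magic constant from the widely circulated Quake III routine;
| the function name, the magic parameter, and the use of struct to
| reinterpret bits are my own choices for illustration. Passing a
| nearby constant still gives a usable approximation, which is why
| the exact value works as a fingerprint rather than a necessity.

```python
import struct


def fast_inv_sqrt(x: float, magic: int = 0x5F3759DF) -> float:
    """Approximate 1/sqrt(x) for positive x via the bit-level trick."""
    # Reinterpret the 32-bit float's bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Integer shift and subtraction give a first guess at the result's bits.
    i = (magic - (i >> 1)) & 0xFFFFFFFF
    # Reinterpret the integer's bits as a 32-bit float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the guess.
    return y * (1.5 - 0.5 * x * y * y)


for value in (0.25, 1.0, 4.0, 100.0):
    print(value, fast_inv_sqrt(value), fast_inv_sqrt(value, magic=0x5F375A86))
```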
| anticensor wrote:
| I should include a term in my licenses that licensees
| explicitly waive their rights to fair use and/or fair
| dealing.
| burnte wrote:
| If your model can't reproduce the Quake source without my
| input, you haven't really compressed it, especially if the
| dataset to recreate it is larger than the original. If I have
| to tell the program exactly what I want in detail to get the
| Quake source, that's more of a storage database. If I have to
| guide it intently to get it to output the Quake source, I'm
| heavily guiding it.
| dleslie wrote:
| All decompression requires input: the compressed artifact. In
| this case, the compressed artifact is the semantic cues
| necessary to extract the Quake inverse square root function.
| swiftcoder wrote:
| > especially if the dataset to recreate it is larger than the
| original
|
| Many types of compression produce a compressed file larger
| than the original for input data that is not easily
| compressed. Just because a compressor is bad at compressing
| (some) inputs, doesn't exclude it from being a compression
| algorithm.
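|
| A quick way to see this with Python's standard zlib module. The
| helper name compressed_size and the two sample inputs are made up
| for illustration; seeded pseudo-random bytes stand in for data
| that is not easily compressed.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of data after zlib compression at level 9."""
    return len(zlib.compress(data, 9))


# Seeded pseudo-random bytes: essentially no redundancy to exploit.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(1024))

# Highly repetitive input: compresses very well.
repetitive = b"i = i + 1\n" * 100

print("noise:     ", len(noise), "->", compressed_size(noise))
print("repetitive:", len(repetitive), "->", compressed_size(repetitive))
```

| The noise input comes out slightly larger because of container
| and block overhead, yet zlib.decompress still restores it exactly.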
| vharuck wrote:
| A compressed file containing Quake's source code would be
| covered by the copyright on Quake's source code. The
| compression algorithm would not. The algorithm cannot produce
| the plain-text copyrighted material without the compressed
| copyrighted material.
|
| Copilot has the ability to produce Quake's source code nearly
| by itself. And it's a work (not a person), so it can be seen as
| a derived work. Like a compression algorithm that sometimes
| tacks on the first paragraph of "50 Shades of Grey" at the end
| of files.
|
| I'm not a lawyer, but that's my opinion (admittedly, my opinion
| is softening each day). Plus, the purpose of the tool is to
| create code for inclusion in projects somebody will hold a
| copyright over, and they likely won't be the original authors.
| So its output should be held to a higher standard than a
| compression algorithm or keyboard.
| madsbuch wrote:
| > A compressed file containing Quake's source code would be
| covered by the copyright on Quake's source code. The
| compression algorithm would not.
|
| What? Where does the distinction between data and algorithm
| go with compression algorithms?
|
| In its most abstract form a compression algorithm is a function
| `{0, 1}^n -> {0, 1}^m` such that n < m and the output string
| is the result of something previously encoded.
|
| Why can't the input string be the seed used to make the
| machine learnt model generate the Quake source code?
| leereeves wrote:
| > Copilot has the ability to produce Quake's source code
| nearly by itself.
|
| Was it fed the Quake source code while training? Then it's
| not producing that code, it's just reproducing it, like a
| fancy (but imperfect) copy machine.
|
| I'm not sure it's accurate to say that the training source
| code is "compressed" in the parameters of the model, but
| certainly some approximation of the training source code is
| stored in the parameters.
| treffer wrote:
| It is probably a stretch, but I think less of a stretch
| than saying "it's just a machine that learned to code and
| randomly reproduced these 10+ lines of code". That has IMHO
| a probability of 0.
|
| So if I rule that out, where does it end up? What if we put
| this as the grown up ML brother of the chain of LZW, PPM,
| dictionary assisted compression (e.g. zstd) and various
| attempts at using neural networks for compression?
|
| I would not want to judge this - that's why I put up the
| AGPL idea. Or even unlicensed code. It would be a very
| interesting case to watch.
| dleslie wrote:
| This is an interesting perspective; it does, indeed, seem like
| Copilot is a lossy compression algorithm wrapped in a semantic
| search interface.
| Spivak wrote:
| I mean that's essentially what all ML is if you want to think
| about it that way.
|
| Training is the process of creating a space where searching
| for the right thing within it gives you the answer to some
| problem you have.
| elliekelly wrote:
| I know absolutely nothing about IP and even less about
| compression but aren't compression algorithms usually run on
| copyright protected material with the consent of the rights
| holder or authorized licensee?
| belorn wrote:
| You would not need to produce a perfect copy. A fansub of a
| movie is considered a derivative of the movie, while being a
| far cry from being an actual copy of the movie.
|
| As a subtitle is to a movie, the quake "output" might be much
| smaller than quake itself.
| chx wrote:
| > The short code snippets that Copilot reproduces from training
| data are unlikely to reach the threshold of originality.
|
| I can only repeat myself: In light of Google v. Oracle going as
| far as the Supreme Court I find your confidence in this quite
| astonishing.
| dominicjj wrote:
| "(of course, free software licenses would still fulfil the
| important function of contractually requiring the publication of
| modified source code)"
|
| No no no. Licenses are NOT contracts. Someone who copies or makes
| derivative works of copylefted software which they then
| distribute is obliged to remain within the bounds of the license
| not because they voluntarily promised, but because they don't
| have any right to act at all except as the license permits.
|
| https://www.gnu.org/philosophy/enforcing-gpl.en.html
| hnfong wrote:
| OSS licenses, so far as they are permissive and require nothing
| in return, are not contracts. This is often the case for simply
| _using_ the OSS software. The user has no obligations
| whatsoever.
|
| If, on the other hand, the licensor and licensee both have some
| obligations (in OSS, this is usually when you modify or
| redistribute the source or compiled product), then it's
| basically a contract, no matter what RMS claims.
|
| I mean, with all due respect to the guy, he makes controversial
| claims even in the field of software engineering (and also free
| software evangelism), his supposed professional field. Why
| would you trust what he says about contract law, a field where
| he has no professional training whatsoever?
|
| (That said, GPLv2 is still an ingenious work for many reasons,
| albeit lawyers probably won't draft it that way)
| luhn wrote:
| That article isn't written by RMS, and the author has some
| relevant credentials.
|
| > Eben Moglen is professor of law and legal history at
| Columbia University Law School.
| hnfong wrote:
| You're right. I was mistaken -- I thought he was referring
| to those RMS claims that in general the GPL is not a
| contract.
|
| In Moglen's article about _enforcement_, I think he's
| right that where there's a breach of GPL there is no
| contract. In fact that's what I said also in my follow up
| reply.
| hnfong wrote:
| PS: There's still a nuance that might require
| clarification (or am I adding confusion?) in your original
| quote though:
|
| Quote: "(of course, free software licenses would still fulfil
| the important function of contractually requiring the
| publication of modified source code)"
|
| Even though as many others have pointed out, OSS licenses can
| be contracts, I'm actually not sure this sentence is correct.
|
| When somebody uses the source code in compliance with the
| license terms, a contract might be formed to allow both
| parties to enjoy rights. However, if one party never complied
| with those terms and breaches them (eg. distributing source
| without retaining copyright notices), then arguably no
| contract was ever formed, and the act is a simple matter of
| _copyright violation_ and not a "breach of contract".
|
| Hope I'm not splitting hairs.
|
| Disclaimer: learned English common law a bit, not a lawyer.
| goodpoint wrote:
| This is false on many levels.
| hnfong wrote:
| Pray tell? (preferably citing legal authority?)
| dominicjj wrote:
| "This is often the case for simply using the OSS software.
| The user has no obligations whatsoever."
|
| This is a category error when it comes to copyleft licenses
| like the GPL. It has nothing to say about usage.
|
| "If, on the other hand, the licensor and licensee both have
| some obligations (in OSS, this is usually when you modify or
| redistribute the source or compiled product), then it's
| basically a contract, no matter what RMS claims."
|
| No it's not. There are no pre-agreed terms, penalties for
| violation, expected compensation for services provided or
| anything like that. GPLed software is copyrighted. Copyright
| law says you have no rights to copy it or make derivative
| works of it whatsoever. The license permits you to do so.
|
| "Why would you trust what he says about contract law, a field
| where he has no professional training whatsoever?"
|
| Because, surprise surprise, he has advice from people who ARE
| trained in the law.
| hnfong wrote:
| > The GPL ... has nothing to say about usage.
|
| The GPLv2 text: "The act of running the Program is not
| restricted, and the output from the Program is covered only
| if its contents constitute a work based on the Program
| (independent of having been made by running the Program)."
|
| Of course this sentence contradicts the previous sentence
| in the text, which claims that normal usage "is not covered
| by this License". I presume you'd argue to support your
| claim, but seriously, this is bad drafting.
|
| > No it's not. There are no pre-agreed terms, penalties for
| violation, expected compensation for services provided or
| anything like that.
|
| "Penalties for violation" and "expected compensation" are not
| necessary requirements for the formation of a contract. The
| pre-agreed terms are clearly stated in the license text, or
| at least they are clear insofar as they don't contradict each
| other. By the way, "pre-agreed terms" are not necessary for
| the formation of a contract either.
|
| > Because, surprise surprise, he has advice from people who
| ARE trained in the law.
|
| Are you trained in the law? Because if you think the
| Internet should pay regard to somebody trained in the law
| (even though they may not have learned the law properly) as
| opposed to somebody who hasn't, then I don't see why you
| think you have standing to speak as though you're an
| authoritative source on the matter.
| fredgrott wrote:
| and what pre-tell makes it a NON contract?
|
| Licenses all by themselves are forms of contracts.
|
| In fact the bill of rights is one
| roywiggins wrote:
| You need to actively assent to a contract. Some software has
| contracts ("EULAs") but you are bound by the license whether
| you agree or not.
|
| https://en.wikipedia.org/wiki/Meeting_of_the_minds?wprov=sfl...
| adrusi wrote:
| A license isn't a contract that binds the licensee, it's a
| contract that only binds the rightsholder. Since you, the
| licensee, are not relinquishing any rights in the contract,
| there's no need for you to agree to anything. The only
| rights being relinquished are the rightsholder's right to
| pursue legal retribution for some uses of their work that
| would otherwise be violations of copyright.
|
| You don't have to call it a contract, but it is a legal
| document in which one or more parties legally bind
| themselves, which seems like an adequate definition of a
| contract to me, and has more etymological fidelity to the
| word "contract" than other possible definitions that would
| exclude licenses. A _contract_ is a legal instrument by
| which the breadth of your rights _contract_ -- as in become
| smaller.
| robbedpeter wrote:
| Not trying to be snarky or rude, just letting you know the
| phrase is "pray tell", for your future reference.
| wizzwizz4 wrote:
| They're sort of contracts. If you do this, you get a copyright
| exemption; otherwise, you don't have the legal right to do
| anything.
| dominicjj wrote:
| They have nothing to do with contracts and there's a simple
| test for it. When contracts are violated, then if there's
| litigation the parties consult the relevant contract law for
| how to proceed. When a license is violated, the parties
| consult whatever law the license was permitting an exception
| to. If you copy software without a license, you can be sued
| for copyright infringement. If you fish without a license,
| you can be sued for trespassing.
| user5994461 wrote:
| >>> No no no. Licenses are NOT contracts.
|
| Yes yes yes, licenses are contracts.
|
| That just got set in stone by the French appeal court and
| backed by a decision from the CJEU (the Court of Justice of
| the European Union) a few months before.
|
| Case 19 March 2021
| https://www.legalis.net/jurisprudences/cour-dappel-de-paris-...
| dominicjj wrote:
| Enjoy. I'm sure I'm not alone in completely ignoring the
| opinion of French judges and the European Court of Justice.
| stale2002 wrote:
| If you don't care about what the courts say, I am not sure
| why you are making legal claims here.
|
| When it comes to legal matters, the only thing that matters
| is the opinion of the court system.
| ben0x539 wrote:
| Just for context, the author of the article you're
| commenting on is Julia Reda, an EU copyright activist and a
| former member of the EU parliament. While I likewise don't
| have too much use for the legal opinion of French courts, I
| think we can afford to cut her some slack for focusing on
| legal interpretations in her native jurisdiction.
| dominicjj wrote:
| Fair enough. She is correct about licenses in her
| jurisdiction.
| ghoward wrote:
| You are correct that they are not _necessarily_ contracts, but
| they can be. (See
| https://writing.kemitchell.com/2020/12/27/War-on-License-Not...
| and search for "Blue Oak avoids this theoretical complexity.")
| [deleted]
| detaro wrote:
| That's the US perspective on the matter, not globally
| applicable.
| visarga wrote:
| Copilot is the moment when simple functions have been
| commoditized, you can have as many as you like almost for free,
| and adapted to any project. Just spend a moment to admire the
| transition, it's a new stage of post-scarcity.
|
| AI can recreate photos, paintings, sounds, voice, music, human
| faces, text, dialogue, math, proteins, and now code. It does all
| this while allowing humans to control and direct the whole
| process, and create original combinations. They all have no
| economic value to own and are free to use now, like words in a
| language. Enjoy!
|
| Remember Karpathy's Char-RNN? How far we've come.
|
| http://karpathy.github.io/2015/05/21/rnn-effectiveness/
| turtletontine wrote:
| The idea that the debate actually does a disservice to copyleft
| by relying on the strictest interpretations of copyright is an
| interesting perspective to me, but the rest of this seems pretty
| weak. (Caveat that I'm no lawyer.) Copilot can regurgitate
| verbatim chunks of other codebases: it seems absurd to me that
| that wouldn't count as derivative work.
| moralestapia wrote:
| >it suggests that even reproducing the smallest excerpts of
| protected works constitutes copyright infringement
|
| Actually, it is. It has to do with whether the small excerpt is
| copying what could be called "the heart of the work"; which in
| the case of code I would argue is almost always what you are
| after. No one's gonna copy the indentation style, boilerplate
| around functions/blocks, punctuation, etc. You always go for the
| "functional" part of the code, which is definitely "the heart of
| the work".
|
| The heart of Carmack's fast inverse square root lies in its
| selection of a particular set of constants and operations that
| happen (i.e. were designed) to approximate the inverse square root
| without taking an expensive path. Copyright law would look at
| this novelty; I don't think it would argue around "the use of
| subtraction and multiplication in a computer program", as that
| would be plainly stupid.
|
| I am surprised that someone who is supposedly an expert in
| copyright law does not know (or pretends not to know) about this
| and, on top of that, actually suggests the opposite. This is
| copyright 101, come on.
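
For readers unfamiliar with the function discussed in the comment above, here is a minimal Python sketch of the fast inverse square root idea. It is an illustrative translation, not the original C code: the function name `fast_inverse_sqrt` is hypothetical, and the constant `0x5F3759DF` is the one commonly cited from the Quake III Arena source. It shows that the "heart" of the routine is one magic constant, one shift, and one Newton-Raphson step.

```python
import struct


def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for a positive 32-bit float value x."""
    # Reinterpret the bits of the 32-bit float as an unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Subtracting half the bit pattern from the magic constant yields
    # a rough first guess for 1/sqrt(x).
    bits = 0x5F3759DF - (bits >> 1)
    # Reinterpret the adjusted integer bits as a float again.
    y = struct.unpack("<f", struct.pack("<I", bits))[0]
    # One Newton-Raphson iteration refines the guess.
    return y * (1.5 - 0.5 * x * y * y)


if __name__ == "__main__":
    for value in (1.0, 4.0, 100.0):
        print(value, fast_inverse_sqrt(value), value ** -0.5)
```

The result is only approximate (typically within a fraction of a percent after one refinement step), which is why the script prints the exact value next to it for comparison.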
| softwaredoug wrote:
| This is kind of beside the point. Something can still be
| unethical and perfectly legal. The issue is that machine learning
| can whitewash a developer's intended license.
|
| Or put differently, as a GitHub customer, are you comfortable
| with your code being used this way? Instead of a passive host,
| your code is now being used to create tremendous value for GitHub
| and Microsoft. Do you feel your trust has been violated?
| (regardless of legality).
| reilly3000 wrote:
| How does one address the fact that 95% of software is based on
| the same basic tropes? At a certain level of density, all code
| trying to achieve a similar function to legally-protected code
| will converge on an implementation that is almost
| indistinguishable. With LOC accreting exponentially, only time
| will determine when we reach that threshold. The Copilots of the
| world serve to accelerate and monetize this reality.
| [deleted]
| MattIPv4 wrote:
| This seems to completely ignore the fact that we've seen Copilot
| regurgitating exact copies of existing code, and even with the
| incorrect license attached when it was asked for it. [0]
|
| [0] https://twitter.com/mitsuhiko/status/1410886329924194309
| sfletcher wrote:
| The Google Books case cited here allowed Google to show exact
| snippets (extracts) from the copyrighted books, hard to see how
| this is any different.
| creshal wrote:
| It also has no relevance for the discussion at hand. Yes,
| Github can _display_ all of its content - that's kind of the
| point of it.
|
| But Copilot doesn't exist to show you random code snippets
| for the sole purpose of showing them.
|
| _Using_ this copyrighted material to create _derivative
| works_ is a completely different use case, and not covered at
| all by the Google Books ruling, or any other I'm aware of.
| mjburgess wrote:
| You wouldn't be allowed to make derivative works of those
| books; i.e. copy/paste into your own work.
|
| Google isn't making new books, or enabling people to make
| derivative copies; it is merely previewing a book.
|
| Github search is a _preview_. Copilot is a copy/paste.
| duckmysick wrote:
| What about Google Books Ngram Viewer? Isn't that a
| derivative work based on copyrighted content? It's more
| than just a search or preview - it contains both novel
| information and snippets of existing content. Is linguistic
| corpus a special case?
|
| https://books.google.com/ngrams
| pessimizer wrote:
| https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
|
| The research value actually turned activity that would be
| infringing into activity that was not infringing. _Take
| away_ things like the ngram viewer and Google Books
| infringes.
| bootlooped wrote:
| That function exists in hundreds, if not thousands, of GitHub
| repositories. The function is so well known it has its own
| Wikipedia page. If there is a more famous function in computer
| science, I don't know what it is. The fact that a machine
| trained on GitHub repositories might reproduce such common code
| is not alarming or surprising to me. I think people are using
| this as an example and implying it's happening all over the
| place, but I've yet to see another example like it.
| user5994461 wrote:
| >>> If there is a more famous function in computer science
|
| Maybe fizzbuzz?
|
| Let's try to auto generate some fizzbuzz code, see what we
| get :D
| timdaub wrote:
| Agree. I've written a comment on her blog about this. Hoping
| she'll enable it. I've published an opinion piece on the
| subject matter myself:
| https://rugpullindex.com/blog#BuiltonStolenData
|
| Edit: My comment was enabled:
| https://juliareda.eu/2021/07/github-copilot-is-not-infringin...
| kalium-xyz wrote:
| "If it looks like a duck, swims like a duck, and quacks like a
| duck, then it probably is a duck." I don't see my license
| respected for code I wrote that it regurgitates; there is nothing
| more to this.
___________________________________________________________________
(page generated 2021-07-05 23:00 UTC)