[HN Gopher] Analyzing the legal implications of GitHub Copilot
___________________________________________________________________
Analyzing the legal implications of GitHub Copilot
Author : HNCommenterAD
Score : 149 points
Date : 2021-07-15 16:08 UTC (6 hours ago)
(HTM) web link (fossa.com)
(TXT) w3m dump (fossa.com)
| darau1 wrote:
| I honestly thought the great-gitlab-exodus indicated that people
| saw this coming a hundred miles away.
| heavyset_go wrote:
| > _"If you look at the GitHub Terms of Service, no matter what
| license you use, you give GitHub the right to host your code and
| to use your code to improve their products and features," Downing
| says. "So with respect to code that's already on GitHub, I think
| the answer to the question of copyright infringement is fairly
| straightforward."_
|
| GitHub's Terms of Service doesn't override licensing terms.
| invokestatic wrote:
| Actually, it does. When you upload code to Github, you are
| effectively "dual licensing" the code to them under Github's
| terms. Github is not bound to any other licenses you may have
| applied to your code, because it did not agree to those
| terms. It only agreed to the terms spelled out in the Terms of
| Service. Of course, there are edge cases in which you could
| upload code to Github that you do not own, for which I do not
| know the answer.
| lakecresva wrote:
| It doesn't have to, the rights you grant Github when you agree
| to the ToS and upload your work exist independent of any rights
| you might grant as part of the repo's license.
| sampo wrote:
| > Downing thinks there's a strong case that Copilot uses said
| code in a transformative manner, which would support a fair use
| argument that there is no copyright infringement.
|
| Fair use seems to be a legal concept that mostly only exists in
| the anglosphere. How will this be in the many other countries,
| then?
|
| https://en.wikipedia.org/wiki/Fair_use
| Rolpa wrote:
| Here's an inquiry for those more knowledgeable about IP law than
| myself: what's the state of the law regarding training an AI on
| copyrighted material besides code? I was debating this with
| someone in relation to the high definition texture packs for old
| games people have been making using models such as ESRGAN - do
| these infringe the copyright of the rights holders of the
| original assets? Or are they considered sufficiently
| transformative to be considered an original work?
| mdasen wrote:
| The problem with GitHub Copilot is that you never quite know
| where the suggestion comes from.
|
| As the article notes, longer and more complex blocks of code are
| most likely copyrightable.
|
| > GitHub reports that Copilot is mostly producing brand-new
| material, only regurgitating copies of learned code 0.1% of the
| time.
|
| For me, the issue is one of risk. Let's say that you have 100
| developers at your company making software for you and they
| decide that Copilot is great. 1 in 1,000 suggestions is
| regurgitated code verbatim. Let's say that only 1 in 10 of those
| suggestions is long and complex enough that it
| warrants copyright protection. Within a week, you'd have to
| assume that you have dozens of copyrighted pieces of code in your
| codebase. The big issue is that you now don't know where the code
| came from and which pieces might be direct copies. It opens up a
| bit of a can of worms for a company looking to avoid risk.
|
| I think one of the pieces that might get overlooked is someone
| trying to weaponize Copilot. For example, Wikipedia has seen
| people upload creative-commons licensed media to Wikipedia and
| then become very litigious against people who might be slightly
| off in the attribution requirements. Attribution requirements are
| often more complicated than just "provide whatever attribution
| you think makes sense." The images are legitimately creative-
| commons licensed, but if someone doesn't provide the correct
| attribution, they sue them. This attribution can include the
| documentation of the modifications made, author, link, link to
| the license (which I think a lot of people forget), copyright
| notice, etc.
|
| https://news.ycombinator.com/item?id=27606035
|
| I don't think most people are looking to be copyright trolls.
| However, Copilot offers a neat little way to potentially inject
| your code into other people's programs. Will people start
| searching for uses of their code and use it as a form of
| copyright trolling? I don't think most people will, but we've
| seen it happen with patents and images.
|
| If you have a hundred engineers creating dozens of
| Copilot-suggested blocks per day, we're talking around a million
| blocks in a year. I don't think the odds of any individual suggested
| block being a problem are high. The issue is when you start
| scaling that up. If we're talking about a large company, the risk
| can start getting large. You don't know where the code came from
| and it starts getting likely that verbatim pieces of someone
| else's code are finding their way into your codebase.
|
| Does Copilot offer enough value to offset this risk? Will future
| versions of Copilot make sure that the suggestions are
| sufficiently different from the training source? Heck, there can
| even be chicken-and-egg problems where someone claims copyright
| on a block of code that was generated by Copilot and you then
| have to prove that your identical code generated by Copilot isn't
| an infringement. Can you prove "yes, Copilot would have generated
| that code block before you pushed your code into Github" when
| they claim "Copilot only generated that code block for you
| because of our code on Github"? It might not even be a company
| that's evil doing this. Large companies often have no idea what
| different parts of the company are doing - especially several
| years later.
|
| One thing I want to make clear is that this isn't just about
| cases that would win on their merits. One of the big parts of the
| Wikipedia discussion on overly-litigious uploaders is that it can
| cost a lot more to fight infringement claims than they're asking.
| If someone slaps your startup with a $250 "you stole my
| copyrighted code" claim, do you hire a lawyer at an hourly rate
| that might cost more, risk a trial costing tens of thousands, and
| risk a judgement against you? Or do you pay them off with a small
| amount of money to make it go away? I'm not saying this is a good
| situation. I'm just noting that it definitely exists and trolls
| can try and come after you at the worst times like when you're
| trying to raise funding. Do you decide to fight it when you're
| trying to IPO? Do you let the IPO price sink by a few percent and
| lose you lots of money when they're just looking for $5,000 to go
| away?
|
| It just seems like adding a lot of risk.
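|
| The arithmetic in the comment above can be sketched as follows.
| This is only an illustration under stated assumptions: 40 blocks
| per engineer per day stands in for "dozens", a year is taken as
| 250 working days (50 weeks), and the function and parameter names
| are hypothetical. With these figures the estimate is about a
| million blocks and roughly 100 risky blocks per year, or about 2
| per week.

```python
def expected_risky_blocks(engineers, blocks_per_day, working_days,
                          verbatim_one_in=1000, copyrightable_one_in=10):
    """Estimate suggestions that are both verbatim copies and copyrightable.

    All inputs are assumptions taken from (or standing in for) the
    figures quoted in the comment, not measured data.
    """
    # Total accepted suggestions across the whole team for the period.
    total_blocks = engineers * blocks_per_day * working_days
    # 1 in 1,000 is verbatim; 1 in 10 of those is assumed copyrightable.
    risky = total_blocks / (verbatim_one_in * copyrightable_one_in)
    return total_blocks, risky


total, risky = expected_risky_blocks(engineers=100, blocks_per_day=40,
                                     working_days=250)
per_week = risky / 50  # 250 working days is 50 five-day weeks

print(f"blocks per year: {total}")
print(f"risky blocks per year: {risky}")
print(f"risky blocks per week: {per_week}")
```

| Changing blocks_per_day or the two rates shows how quickly the
| estimate scales with team size and usage.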
| tvirosi wrote:
| If this really counts as fair use it turns into a giant loophole
| to steal any IP you want. Just create a website with a github-
| like TOS, upload some disney copyrighted pictures to it, train a
| GAN super overfitted on the images, and then claim mickey mouse
| as your own.
| invokestatic wrote:
| The legal system is generally pretty nuanced, considering
| things such as intent and purpose. In this particular case, it
| doesn't really matter how the new work was generated or
| created. I don't really think that would be very relevant. The
| most important factors would be how similar the new work was to
| the original work, the intent, and how the new work affects the
| value of the original.
|
| Your proposal is just so substantively different from Copilot
| that I don't see how the arguments for Copilot would apply.
| 6gvONxR4sf7o wrote:
| You can't claim mickey mouse as your own, but you can exploit
| all the labor that went into creating all the work you're
| training on. Point a generative model at someone else's labor
| and now it'll do that for you. It seems like the person whose
| labor is being used should be somehow compensated, or at least
| have some say in its use.
| sobellian wrote:
| The lawyer's argument is that Copilot's query system is
| transformative. If you assemble Copilot's output to replicate a
| copyrighted work, then even if Copilot isn't infringing, you
| are by taking that work out of context. The burden is on the
| owner to ensure they don't infringe.
| erhk wrote:
| You can't just upload copyrighted photos. You don't own them.
| tompazourek wrote:
| This is interesting.
|
| But I think you'd violate Disney's copyright by uploading their
| pictures to the website.
|
| To make it work, Disney would have to upload the pictures
| themselves and agree to the TOS.
| heavyset_go wrote:
| Is GitHub making sure that license terms are being met when
| they train Copilot on hosted code? Because anyone can rehost
| code that they don't have the rights to, and it seems like
| GitHub will still train Copilot on it.
| tompazourek wrote:
| If someone is rehosting code that they don't have copyright
| to, it's like if someone would upload a pirated movie to
| YouTube.
|
| YouTube will still make money from it for some time
| (selling ads, luring customers in, ...), then the copyright
| holder asks YouTube to take it down, and then they take it
| down.
|
| The difference is that open source authors don't care that
| much about that. But maybe now they will when they see what
| GitHub is doing...
| heavyset_go wrote:
| > _YouTube will still make money from it for some time
| (selling ads, luring customers in, ...), then the
| copyright holder asks YouTube to take it down, and then
| they take it down._
|
| YouTube isn't publishing derivative work from the videos
| it hosts, though, like Microsoft is doing with Copilot
| and GitHub.
|
| If Copilot was trained on material it doesn't have the
| license to, it can potentially output that unlicensed
| code it was trained on, like in this example[1].
|
| Copilot could serve up copyrighted work in the same way
| YouTube does, but the analogy isn't complete, because
| YouTube itself isn't a derivative work in the same way
| the Copilot's model is a derivative of the data it was
| trained on.
|
| [1]
| https://twitter.com/mitsuhiko/status/1410886329924194309
| 6gvONxR4sf7o wrote:
| Regardless of whether it's fair use, copilot wouldn't be possible
| without the enormous amount of person-hours of work that has gone
| into writing the code it was trained on. There should be some
| kind of compensation for the content creators when their work is
| used to train models. The fair use argument is that "I could see
| it" is enough to justify no compensation and no say in how their
| work is used.
|
| Legal? Probably. Should we do better? Probably.
|
| At the very least, it should be opt-in. We'll probably need new
| IP law to make this kind of thing opt-in.
| dekhn wrote:
| These agree with my conclusions: it's fair use or permitted by
| license, but it remains untested by law (as the GPL does in a
| larger sense).
|
| I guess in about 5 years we'll see Softbank v. GitHub CoPilot in
| the supreme court deciding whether ML can make transformative
| work.
| shadilay wrote:
| Oftentimes legal issues are more a question of finding a
| plaintiff with enough money to sue rather than the letter of
| the law.
| kyrra wrote:
| Google v Oracle took 11 years to reach conclusion.
| wolverine876 wrote:
| Why do we accept that the courts are so slow? I don't
| understand why there isn't a drive to reform courts by
| accelerating outcomes by an order of magnitude, and by making
| outcomes not depend on wealth.
| sokoloff wrote:
| A court needs a case. A case requires at least one litigant
| who is willing to go all the way to the extent of forcing a
| court hearing (rejecting settlements along the way and
| risking that a court will decide against them). I don't see
| the courts as being the rate-limiting factor when we're
| contemplating licenses that are 30 years old last month (GPL
| v2)
| [deleted]
| gnopgnip wrote:
| The courts being slow to change and respecting precedent is a
| feature. It should be on Congress to change the law and
| amend the Copyright Act.
| pc86 wrote:
| You're assuming that the speed of a court case has something
| to do with the wealth of the participants?
| AdamJacobMuller wrote:
| "no matter what license you use, you give GitHub the right to
| host your code and to use your code to improve their products and
| features"
|
| I contribute my code to X project outside of github (say on a
| mailing list) under the explicit understanding that my code is under
| GPL (say GPLv3 to be specific). If someone later uploads my code
| to github and github uses my code to train their ML model in
| violation of GPLv3, isn't the point that the person who uploaded
| my code to github is in violation of GPL by giving it to someone
| else under less restrictive terms?
|
| Does this mean that the github terms of service are perhaps
| fundamentally incompatible with uploading copyleft-style (or
| perhaps specifically only GPLv3 level) restrictive licenses?
|
| And, if so, probably they always were but nobody cared until now.
| tyingq wrote:
| _"If you look at the GitHub Terms of Service, no matter what
| license you use, you give GitHub the right to host your code and
| to use your code to improve their products and features," Downing
| says. "So with respect to code that's already on GitHub, I think
| the answer to the question of copyright infringement is fairly
| straightforward."_
|
| I don't know if it's really that straightforward. The TOS
| includes snippets like this in that area:
|
| _"This license does not grant GitHub the right to sell Your
| Content. It also does not grant GitHub the right to otherwise
| distribute or use Your Content outside of our provision of the
| Service"_
|
| I'm omitting other language, but if you read that area of the
| TOS, they seem to have purposefully scoped down their license-to-
| use for hosting, backups, etc.
| lindenksv85 wrote:
| They technically don't distribute any code. They show it to you
| in a hosted environment. It's the user that causes a
| distribution. They "provide it as part of the service."
| tyingq wrote:
| I'm not sure how sending it over the wire to Visual Studio
| doesn't count as distribution. Distribution without
| attribution, reference to where it came from, how it's
| licensed, etc.
| dheera wrote:
| Conceptually it's not particularly any different from
| distributing it to your web browser. They basically just
| turned Visual Studio into a fancy Github browser that has
| some editing features.
| inlined wrote:
| I disagree. They're the API host, so they're the
| distributor as well as the user agent
| zufallsheld wrote:
| In the same vein, streaming sites don't distribute movies;
| they show them to you in a hosted environment. And this
| argument did not hold up in court.
| lindenksv85 wrote:
| OSS license obligations mostly kick in upon distribution,
| hence this is a pivotal concept in this context. It's also
| important because of the language in the TOS that says the
| code won't be used outside the service. The stuff related
| to streaming is kind of unrelated here because movies
| aren't under copyleft licenses and so the question of
| whether or not there was distribution there is not
| relevant; the question is whether or not the copyright
| holder's monopoly rights were violated and those include the
| right of public performance, public display, as well as
| distribution. They would have violated other copyright
| rights even without a finding of distribution.
| [deleted]
| tomrod wrote:
| Further, what if I branched the code from something hosted
| outside of Github -- and failed to follow proper attribution?
|
| This is a huge legal mess and it's not being used to IMPROVE
| Github products and ops, it IS the Github product.
| kmeisthax wrote:
| It's actually fairly difficult to remove attribution from a
| Git repository. It's embedded in each commit. You'd have to
| rewrite the entire project history - something far different
| from just "branching" a repo.
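|
| A minimal sketch of why that is, assuming a SHA-1 repository: a
| commit ID is the hash of the commit object, which includes the
| author line and the parent IDs, so changing an author changes
| that commit's ID and the ID of every descendant. The helper name
| git_commit_id and the sample identities are hypothetical; the
| tree ID used is Git's well-known empty tree.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # Git's empty tree


def git_commit_id(tree, author, committer, message, parents=()):
    """Hash a commit object the way Git does for SHA-1 repositories."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    # Git prefixes the object with its type and byte length, then a NUL.
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical identities, purely for illustration.
alice = "Alice <alice@example.com> 1626364800 +0000"
mallory = "Mallory <mallory@example.com> 1626364800 +0000"

original = git_commit_id(EMPTY_TREE, alice, alice, "Initial commit\n")
rewritten = git_commit_id(EMPTY_TREE, mallory, mallory, "Initial commit\n")

# A child commit records its parent's ID, so the change propagates.
child_of_original = git_commit_id(EMPTY_TREE, alice, alice, "Second\n",
                                  parents=[original])
child_of_rewritten = git_commit_id(EMPTY_TREE, alice, alice, "Second\n",
                                   parents=[rewritten])

print(original != rewritten)
print(child_of_original != child_of_rewritten)
```

| Copying files out of a repository into a fresh one, of course,
| drops this history entirely; the sketch only covers rewriting it.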
| matmann2001 wrote:
| Technically, they aren't distributing "your" code. It's
| laundered through their machine learning algorithm first.
| hansvm wrote:
| Is that actually different legally from an "ML" algorithm
| that xors the code with the same garbage 1M times or
| otherwise does something expensive to implement a noop?
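|
| A sketch of the no-op being described, under the assumption that
| the "garbage" is a fixed key: XOR with the same key is its own
| inverse, so any even number of rounds returns the input
| byte-for-byte. The names xor_bytes and launder are hypothetical,
| and 1,000 rounds stand in for the 1M in the comment to keep the
| example quick.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the key, repeating the key as needed."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def launder(data: bytes, key: bytes, rounds: int) -> bytes:
    """Apply the same XOR 'transformation' a given number of times."""
    for _ in range(rounds):
        data = xor_bytes(data, key)
    return data


source = b"int main(void) { return 0; }"
key = b"garbage"

even_rounds = launder(source, key, 1000)  # even count: identical to input
odd_rounds = launder(source, key, 999)    # odd count: same as one XOR pass

print(even_rounds == source)
print(odd_rounds == xor_bytes(source, key))
```

| However much computation is spent, the output after an even
| number of rounds is an exact copy of the input.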
| grawprog wrote:
| > to use your code to improve their products and features
|
| >It also does not grant GitHub the right to otherwise
| distribute or use Your Content
|
| I'm curious. Copilot isn't actually part of github. It's a
| plugin for Visual Studio. Wouldn't that mean copilot is
| distributing code hosted on github, outside of github? You
| can't use copilot without visual studio.
|
| How is this not Microsoft just parasitizing all the code hosted
| on github to make visual studio better? Which, as far as I know,
| despite being owned by Microsoft, is not actually part of
| github.
| LeifCarrotson wrote:
| I think it all depends on who 'they' are and what 'their
| products' are. The definition says:
|
| > _'GitHub,' 'We,' and 'Us' refer to GitHub, Inc., as well
| as our affiliates, directors, subsidiaries, contractors,
| licensors, officers, agents, and employees._
|
| Does that also include OpenAI? Does it include the Visual
| Studio team? All of Microsoft?
|
| The license granted by users to Github is:
|
| > _We need the legal right to do things like host Your
| Content, publish it, and share it. You grant us and our legal
| successors the right to store, archive, parse, and display
| Your Content, and make incidental copies, as necessary to
| provide the Service, including improving the Service over
| time. This license includes the right to do things like copy
| it to our database and make backups; show it to you and other
| users; parse it into a search index or otherwise analyze it
| on our servers; share it with other users; and perform it, in
| case Your Content is something like music or video._
|
| > _This license does not grant GitHub the right to sell Your
| Content. It also does not grant GitHub the right to otherwise
| distribute or use Your Content outside of our provision of
| the Service..._
|
| IANAL, but naively, Github appears to be a code hosting
| platform. If they need to analyze my code to make it work
| with Git and with their code hosting features, that makes
| sense. For example, they might have a feature to prevent
| inadvertent commits of private keys, and would need to parse
| my code to do so. Maybe my code contains stuff that doesn't
| work with their generic private-key-finding parser, and they
| need to specifically run a subset of my code on their
| platform through their parser in a debugger to fix the
| feature. That's a sensible license to grant to a code hosting
| platform, they're not a no-knowledge encrypted storage
| provider.
|
| They don't appear to be a software vendor that sells code to
| other private parties for use in closed-source applications.
| Their license appears to specifically deny them the right to
| sell snippets of my code to others.
|
| I suspect, however, that this isn't a black-and-white factual
| issue, rather, one for a court to decide. One could probably
| hire an attorney to argue any possible angle on the legality
| of Copilot. And by a similar mechanism to the "Winner's
| curse", the company who developed a tool like Copilot would
| always have been one where their internal counsel advised
| them that what they were doing was totally legal.
| lindenksv85 wrote:
| The "services" is that which is provided by GitHub. "GitHub"
| is defined to include all of its affiliates, including
| Microsoft.
| nomoreplease wrote:
| And "and to use your code to improve their products and
| features," does not explicitly include "or to create new
| products and features".
|
| CoPilot is a NEW product, not an existing product (Github
| itself) that the ToS gives permission to improve.
| matmann2001 wrote:
| Technically, they aren't distributing Your Content. It's
| laundered through their machine learning model first.
| tyingq wrote:
| With degrees of laundered varying from _"copied verbatim"_
| to _"minor things like symbol names changed"_ to _"actually
| transformed significantly"_.
| [deleted]
| BeefWellington wrote:
| An interesting thought experiment around this whole topic: If I
| were to take all the scripts of profitable films rated G or PG
| and train an AI on it, generate a bunch of scripts, then made
| movies out of those scripts, would I lose in court?
|
| Tangibly, how is this AI method substantially different from
| non-clean-room implementations?
|
| In terms of business use, it seems incredibly risky to me to
| even just *use* GitHub since their license agreement/ToS permit
| them to use my code to improve their tools which now apparently
| includes tooling where it may copy your code wholesale as
| someone else's suggestion.
| kmeisthax wrote:
| No thought experiment needed. If I watch a bunch of movies
| and then make my own movie, whether or not I lose in court
| depends on if the movie I made is at least "substantially
| similar" to any movie I happened to watch - or, in other
| words, had "access" to. That's a fact-intensive thing that
| juries usually decide on a case-by-case basis.
|
| The difference between that and having an AI do it is
| probably low. My gut instinct is that using an AI constitutes
| "access" to the AI's training corpus, so if it spits out
| something at least substantially similar to that corpus, then
| I'm infringing if I use that output. If it _doesn't_
| constitute access, then a copyright owner would have to prove
| "striking similarity", which would really only cover things
| like using Copilot to spit out fragments of old Quake code
| verbatim.
|
| Clean-room is a way of arguing down the level of access that
| you have to something that you want to make a non-infringing
| copy of. It usually requires having actual attorneys review
| everything the clean-room engineers get to see, and stripping
| out the parts that are actually copyrightable. Merely
| training an ML system on input as a way to only have access
| to the uncopyrightable parts of that input probably wouldn't
| work.
|
| Pretty much every Internet service is going to have similar
| clauses to GitHub's; because anything else would basically be
| a "click here to make me liable for copyright infringement"
| button. In fact, I wouldn't be surprised that merely running
| something like GitHub but without a ToS would still give you
| similar levels of implied license over whatever people push
| to your server.
| kzrdude wrote:
| What about open source projects where the uploader and github
| users are not the only copyright holders? As a user I can't
| grant github any random license for the code, if I maintain for
| example Linux or python or any other old project there.
|
| The ONLY available terms are those given by the license,
| surely?
| lindenksv85 wrote:
| If you are putting up code on GitHub to which you don't have
| all the rights you're actually in violation of their TOS and
| you are violating the rights of other copyright holders. I
| understand this is common and may not violate community norms
| or expectations but it is technically a license violation on
| multiple fronts. Contributors who add to existing GitHub
| projects are providing the same license to GitHub as the
| project maintainer though per the TOS.
| rightbyte wrote:
| I guess the Github Copilot authors did not handpick
| projects they checked were legitimately put on Github. So
| they are accomplices in that case.
| filoleg wrote:
| YouTube doesn't really handpick things that get put on
| their platform either, beyond very basics and whatever
| automated tools they have to cover that.
|
| Beyond that, that's what DMCA takedown requests are for.
| Github would only be an accomplice in that case, if they
| got a legitimate DMCA takedown request and chose to
| completely ignore it.
| jolmg wrote:
| > If you are putting up code on GitHub to which you don't
| have all the rights you're actually in violation of their
| TOS and you are violating the rights of other copyright
| holders.
|
| I can't find where in the TOS it says that you must "have
| all the rights [to the code]". It just says that you must
| not violate copyright nor other laws.[1] FOSS licenses by
| definition permit redistribution, so uploading to GitHub
| seems to be in-line with the license granted by the
| copyright holders.
|
| What are the violations you mention?
|
| > Contributors who add to existing GitHub projects are
| providing the same license to GitHub as the project
| maintainer though per the TOS.
|
| Sure, but that's not the only way. If you contribute to a
| FOSS project elsewhere, those changes go under the same
| license of the project. Whoever you pass those changes to
| has liberty to redistribute per the terms of the license.
| The TOS is unneeded to legally redistribute FOSS-licensed
| projects with GitHub.
|
| The TOS saying that you must grant GitHub these permissions
| is only to protect GitHub in cases where people upload
| projects without licenses.
|
| [1] in addition to content restrictions, like no porn.
| sudosysgen wrote:
| But legally, they can't provide such a license. So GitHub
| can't have that license, surely, because they never had the
| legal authority to bestow it upon Github.
| lindenksv85 wrote:
| That was a problem before copilot though. And copyright
| holders have and will continue to have the right to send
| DMCA take-down notices if they like.
| jacoblambda wrote:
| But the thing to note is that a user can have a right to
| distribute (as with GPL) but does not necessarily have
| the rights to the license.
|
| So if the user uploads the source to GitHub, they agree
| to the terms (which they may not actually have the rights
| to) but that isn't equivalent to the rights owner giving
| GitHub the rights to distribute the source under a
| different license.
|
| The TOS can only modify those distribution terms (if it
| even can be found to be legally binding) if the user
| uploading the source is the rights owner which in so many
| cases is not the case.
| josephh wrote:
| I think the bigger question is whether GitHub will be
| able to honor DMCA requests that pertain to copyrighted
| materials showing up in Copilot's suggestions.
| Animats wrote:
| A third party who finds their GPL code on Github but is
| not themselves a user of Github has a right of action.
| They're not bound by Microsoft's terms.
| lakecresva wrote:
| I'm not sure that someone who published their work under
| the GPL hasn't thereby given consumers the right to put
| the repo on github. If the rights Github asks for in
| their ToS can be construed as a subset of the rights
| granted by the GPL, Github is just another GPL licensee.
| Unless they violate the conditions of the license,
| they're just utilizing their GPL rights.
| eitland wrote:
| > Github is just another GPL licensee. Unless they
| violate the conditions of the license, they're just
| utilizing their GPL rights.
|
| And here is exactly the problem.
|
| GitHub seems to be copying copyrighted code left and
| right _and pretending they made it!_
|
| No attribution, no license.
|
| They are of course allowed to let their AI study the
| code, but as "employer" of that AI GitHub/Microsoft has a
| responsibility if that AI breaks copyright right and left
| and they as a company pretend the code is theirs to give
| away.
| AdamJacobMuller wrote:
| > is not themselves a user of Github
|
| Is it that widely scoped? Can't we narrow it to "A third
| party who finds their GPL code on Github but has not
| uploaded that specific code to Github themselves has a
| right of action limited to that specific code."
|
| Just because I created a github account once and agreed
| to the TOS doesn't mean that I agree to let others upload
| my code to github, where would that scope end. Could
| someone steal code off my computer which I've never
| published and put it on Github and that was OK because I
| once signed up for a github account, clearly a contrived
| example but.
| kzrdude wrote:
| Today is the first time I've considered that, but it's
| certainly something we should think about. If big projects
| moved on this, I think github would take notice and "issue
| a clarification".
| btilly wrote:
| _I don't know if it's really that straightforward._
|
| It gets worse. To the extent that it is that straightforward,
| the correct takeaway is that you do not have permission to
| include someone else's GPLed code in your Github repository.
|
| And that to the extent that GitHub relies on that permission in
| using the code that they host, they are liable for potential
| copyright claims from copyright owners that they have no
| relationship with, who never gave GitHub permission to use that
| code.
|
| I therefore think that GitHub should do some careful thinking
| about how much they can rely on a ToS to do as they want with
| the copyleft code that they host. And I further think that
| people who host GPLed projects should ask whether GitHub is
| where they should be hosting those projects.
|
| (Insert the mandatory, "I am not a lawyer and this is not legal
| advice.")
| [deleted]
| phkahler wrote:
| Yeah, I don't think bettering their products includes verbatim
| incorporation of code into those products.
|
| Also, for the part about small snippets being non
| copyrightable. I would suggest looking at the Google/Oracle
| case. Google was found guilty of infringement for a very small
| number of lines, but the award to Oracle was IIRC rather a joke
| (something like one dollar, indicating it was infringing but
| largely irrelevant).
| zja wrote:
| The Supreme Court found Google's use to be fair, not
| infringement.
| jcelerier wrote:
| the supreme court did not reconsider the previous judgment
| on the 9 lines of sorting algorithm being copied, which was
| _not_ considered fair use
| MrStonedOne wrote:
| >"If you look at the GitHub Terms of Service, no matter what
| license you use, you give GitHub the right to host your code and
| to use your code to improve their products and features," Downing
| says. "So with respect to code that's already on GitHub, I think
| the answer to the question of copyright infringement is fairly
| straightforward."
|
| Not as straightforward as they think, though.
|
| If a code project used (a)gpl code found elsewhere on the
| internet in their repo, and another user took the project and
| hosted it on github, the tos cannot give github a license to use
| the code outside of the license given by (a)gpl, even if github
| thinks they have one, that won't shield them from legal
| liability, nor will it shield co-pilot users from being legally
| compelled to (a)gpl their code if a court case was won on those
| grounds.
|
| The github tos is basically a non-factor in this case.
| SethTro wrote:
| I find the first argument, that if your project is on GitHub
| then they have the right to train on it, weak. Plenty of projects
| are hosted elsewhere but have been mirrored by random users (i.e.
| not the copyright holder) to GitHub.
| kbenson wrote:
| I think they have a right to train on it, but not to present
| portions verbatim. Do you have a right to look at a bunch of
| open source code and come to conclusions about good programming
| practices? Are you prevented from knowing that a specific
| library in a language is good/common for a specific task
| because you see others using it?
|
| That's analogous to training, where there are associations
| between things, in my mind. I don't think that means they can
| provide licensed code verbatim though, just as you should not
| copy GPL code directly out of a Github repo and paste into your
| own private commercial code base.
| heavyset_go wrote:
| You're taking the machine "learning" metaphor literally. A
| human being learning something is not analogous to training
| an ML model. Training models is more analogous to compilation
| or lossy encoding or compression.
| michaelpb wrote:
| The biggest mistake of the ML field is its metaphorical
| naming. So many people seem to be taking Artificial
| Intelligence, Machine Learning, Neural Networks etc
| literally. They don't do this for other concepts in coding
| (eg for an absurd example, no one is arguing we ride a CPU
| "bus" to work), but with ML algos it's a free-for-all.
| Grandiose naming conventions might be good for extracting
| VC money but it's also seriously confusing people.
| kbenson wrote:
| I'm thinking more "association" than "learning", and in
| both cases.
|
| If an algorithm of some sort scans a bunch of repos
| regarding video encoding and decoding and sees a lot of
| ffmpeg use, it might associate ffmpeg with video encoding
| and decoding, and decide to present some info about ffmpeg
| and a _generic_ snippet to include ffmpeg as a library and
| initialize it if it associates the current project with
| that.
|
| If I have perused a few encoding or decoding repos at some
| point and I think of the current project as having to do
| with encoding or decoding of video, I might immediately
| think ffmpeg even if I've never used it in a project as a
| library because I remembered seeing it in projects that
| used it, and look for some initialization code.
|
| In what ways are these materially different? What makes the
| random conceptual associations in my head from what I've
| seen previously different than an algorithm that collects
| the same?
|
| > Training models is more analogous to compilation or lossy
| encoding or compression.
|
| And learning in people isn't? Isn't all knowledge
| transference in people analogous to lossy encoding and
| compression?
|
| I don't know about you, but in college I don't remember
| regurgitating sections of "Advanced Programming in the UNIX
| Environment" to complete assignments, I remember studying
| it, internalizing parts of it on a conceptual level (as
| well as remembering specific fairly small chunks almost
| exactly), and using that to solve problems or answer
| questions or make associations.
|
| I'm not saying ML and learning in humans are the same. I
| do think for the very specific case presented here in how
| it's used, there are some parallels. Feel free to disabuse
| me of that notion if you have evidence that contradicts it
| though. I'm not wedded to that position, but I would want
| to see arguments to the contrary before abandoning it.
| gdsdfe wrote:
| well a lot of people in here don't like these conclusions
| legerdemain wrote:
| Somewhat tangentially, Kate Downing is also the person who
| somewhat recently campaigned to raise awareness of the crisis in
| affordable housing in the Bay Area and Palo Alto in particular,
| and wrote a viral editorial after giving up and moving to the
| more affordable Santa Cruz.[1]
|
| [1] https://news.ycombinator.com/item?id=12288306
| guitarbill wrote:
| It would be nice if we moved from a copyright discussion to an
| ethical one, since it could be years until the law is even
| tested.
|
| Is it ethical to do this, when some licenses are clearly chosen
| because of e.g. attribution or sharing improvements? Did
| Microsoft/GitHub consider the ethical implications, for example a
| chilling effect on code being open sourced in future (i.e. people
| choosing not to open source stuff so it doesn't get gobbled up by
| Copilot et al.)?
| dmitrygr wrote:
| I wonder if one could enforce a license's "this code may not be
| used to train any ML model of any sort for any reason without
| prior permission".
| progbits wrote:
| I've been wondering the same though found basically no
| discussion on this topic.
|
| Let's ignore whether GPL or whatever license allows GitHub to
| do this - let the lawyers sort this out. Instead we should
| focus on whether it is possible to legally prevent such
| behavior via license.
|
| In other words, where is my GPLv4 with anti-ml clause?
| EamonnMR wrote:
| FSF and SFC and OSI if you're listening, this would be very
| nice.
| lindenksv85 wrote:
| I think it's important to recognize that most ML models will
| not be built on top of copyleft material. It will mostly use
| data that we as users have voluntarily provided to someone at
| some point and to which that platform now claims ownership. So
| we need to think long and hard about whether or not we believe
| any of these models should receive any copyright protection at
| all and in a much broader context. I think if we insist on
| claiming that copilot is copyrightable itself and should be
| under GPL then we have totally capitulated with respect to all
| other use cases in a way that actually further protects
| incumbent advantage for large companies and which deprives
| everyone else of any benefit or remuneration for their own
| data. You're basically saying it's ok for companies to
| privatize the collective knowledge of all of humanity. I'm not
| on board with that.
| guitarbill wrote:
| I don't know if by "You're basically saying" you mean me
| specifically, but if you do, you're dead wrong. I'm not ok
| with this at all. However, I'm not so stupid as to think that I,
| as a non-IP-lawyer, can make sense of the current legal situation
| (which is what copyright is; law) or even propose new laws.
|
| However, as a dev I can think about it and say "to me, this
| is immoral and unethical", and refuse to use Copilot, not
| work for any company that uses Copilot, not use
| GitHub/Microsoft products, pull code from GitHub (if I had
| any), and decide not to open source stuff in future. Ethics
| has always been underemphasised in software compared to other
| engineering disciplines.
|
| Generally, non-technical people are (more) impacted by ML,
| but in this case it's us as developers and our open source
| communities. So I hope devs will give it some thought this
| time. And if this leads to devs thinking about ML more
| carefully in general, great. Things don't have to be illegal
| to be unethical.
| lindenksv85 wrote:
| I didn't mean you specifically. I think the ethical
| conversation is more interesting but I also think that
| people will feel different if, say, the Linux Foundation
| releases its own version of copilot and it's not just one
| company reaping the rewards of all that code. And I'd like
| to make it easy for other competitors to do exactly that.
| It will be harder for them to do that if we think that the
| models themselves are copyrightable. I don't think
| something like copilot is going to make anyone think twice
| 5 yrs from now any more than we think twice about something
| like google autocomplete or google search thumbnail images.
| I think stuff like copilot if properly tuned won't be
| providing a substitute for whole GPL projects. I don't
| think OSS communities will be damaged by this in any way.
| In fact those same oss communities are going to be some of
| the biggest users of these sorts of tools just like they
| use stackoverflow today.
| erhk wrote:
| Github is not required to open source your work
| erhk wrote:
| You can open source with Git without using github. You can self
| host.
| xdennis wrote:
| If it's publicly available, there's no guarantee that
| Microsoft won't gobble up your code.
| kube-system wrote:
| The ethical discussion certainly has its merits, but the legal
| discussion is very relevant for those of us who do not want to
| be part of the legal test case.
| jefftk wrote:
| _> If you look at the GitHub Terms of Service, no matter what
| license you use, you give GitHub the right to host your code and
| to use your code to improve their products and features. So with
| respect to code that's already on GitHub, I think the answer to
| the question of copyright infringement is fairly
| straightforward._
|
| This doesn't sound right. Alice writes code, and releases it
| under some restrictive license (GPL, something source-available,
| etc). Bob uploads it to GitHub, correctly labeled as GPL.
| Regardless of GitHub's TOS, Bob isn't able to give GitHub any
| additional rights to the code beyond what Alice gave him.
|
| I think the later discussion about whether this falls under Fair
| Use is the important question.
| pc86 wrote:
| If Bob is unable to give GitHub the rights that GitHub demands,
| then it means Bob was unable to lawfully upload the code to
| GitHub in the first place. You're making an argument that Bob
| violated GitHub's terms, not that GitHub is violating Alice's
| (though that may also be true).
| mikeryan wrote:
| You're right that Bob's the infringer here and not GitHub.
| But I'm not sure where that would place the derived work
| that's eventually used.
|
| Which is the point of the article a bit. GitHub is likely not
| infringing but it's also not absolving the end user of any
| infringement. Neat trick.
|
| That being said I think the risk is minimal enough that I'd
| be pretty comfortable with using it.
| eitland wrote:
| I learned here on HN that contracts are supposed to be a
| "meeting of minds".
|
| And in the EU, as a consumer, you can pretty much ignore most
| EULAs because they aren't valid if they break EU consumer
| protections.
|
| Now if your interpretation is correct the idea of a meeting
| of minds falls completely on its face.
|
| And, as a lot of individuals also upload their projects to
| GitHub, GitHub is on shaky ground there as well.
|
| I think most EULAs have clauses like these in them but we are
| always told that it is because of crazy American lawyers and
| nothing to worry about.
|
| If Microsoft decides to prove once and for all that we should
| worry about ridiculously broad claims in EULAs I think it
| will be hard for GitHub to continue to operate in more sane
| jurisdictions.
| kuratkull wrote:
| Violating the TOS is not illegal, Bob would be in breach and
| GitHub could take measures against Bob's account. Github
| would be violating Alice's copyright license though, and that is
| legally enforceable.
| lindenksv85 wrote:
| Sort of. DMCA protects service providers against copyright
| infringement claims related to stuff uploaded to their
| services by third parties. So long as they adhere to DMCA
| requests, they're not violating copyright law themselves.
| bigwavedave wrote:
| > Sort of. DMCA protects service providers against
| copyright infringement claims related to stuff uploaded
| to their services by third parties. So long as they
| adhere to DMCA requests, they're not violating copyright
| law themselves.
|
| This is probably an extremely stupid question as I'm
| neither a lawyer nor an ML dev (merely a humble backend
| developer), but let's say that the above situation
| applies and that Github has taken down Bob's repo as per
| Alice's DMCA request. However, let's say that in between
| Bob uploading the offending code and Alice submitting the
| DMCA request, Github used Bob's repo as part of a
| training set for Copilot. Now that they've complied with
| the takedown request, does Github have to restore Copilot
| to an earlier state that hadn't yet been trained by Bob's
| repo? Does this question even make sense since I only
| know the absolute barest bones of ML?
| rented_mule wrote:
| Also not a lawyer, but I've been around ML for a while.
| The question makes perfect sense to me!
|
| It takes some amount of time to comply with a takedown
| notice. For example, time passes between receiving
| Alice's notice and taking down Bob's repo.
|
| I would expect Copilot's model(s) to be retrained
| periodically in order to remain relevant. The next
| retraining could exclude Alice's code. That might be a
| longer window than the case of the repo takedown, but as
| long as it doesn't take _too_ long they might be okay?
|
| There are incremental training approaches that evolve
| models over time rather than completely retraining them.
| In my experience, complete retraining is a far more
| common approach because the highly path dependent nature
| of incremental training can lead to outcomes that are
| hard to manage. For example, what if you discover bad
| training data like repos that collect anti-patterns? Or
| Alice's takedown notice? You typically want your models
| to be able to "unsee" things and that's hard with purely
| incremental training. Even when incremental approaches
| are used, there is often an occasional complete
| retraining to overcome such issues.
|
| To be clear, I have no idea what training approach is
| used for Copilot.
| jefftk wrote:
| GitHub's primary business is hosting open source software.
| There's no way they are going to claim that every user who
| uploaded code without owning the full rights is violating
| their TOS.
| dogleash wrote:
| >There's no way they are going to claim that every user who
| uploaded code without owning the full rights is violating
| their TOS.
|
| If they're taken to court and that part of the TOS is
| relevant to the issue, then yes, they can and will argue
| exactly that.
| eitland wrote:
| Well then let's hope the judges apply the same standards
| as when criminals claim they weren't aware that the money
| they got was being laundered through them.
| jefftk wrote:
| That would destroy their business.
| ipaddr wrote:
| That's moving the goalposts after the fact.
|
| What about all of the code that existed before Microsoft
| purchased them and before new license language was
| introduced?
| mikeryan wrote:
| TOS language is usually not grandfathered in.
| lindenksv85 wrote:
| All of these terms of service have an assignment provision
| that allows the provider to assign the agreement to an
| acquirer. So the license you gave to GitHub moves to
| Microsoft (though here the license likely remains with
| GitHub because they are an independent subsidiary). All of
| these agreements also say they can unilaterally change
| terms whenever. The terms are generally always broad enough
| to cover these circumstances.
| fhajm wrote:
| This lawyer does not understand GitHub. Half of the code is
| uploaded by third parties who do not hold any copyright.
|
| These people either think that for ideological reasons everything
| should be on GitHub or they want Google links to their companies.
|
| Furthermore "improve their services" reasonably only applies to
| their core service that was present _when people agreed to the
| TOS_ and not to some new code laundering AI.
|
| It is frightening that this matter could be decided by such
| lawyers in the US. People should just all leave GitHub, then
| Microsoft can play with its own AI and enjoy the silence.
| rhdunn wrote:
| Then there's the issue of any project that uses CoPilot. For
| example, if a developer of proprietary software uses this and
| it is later found that the code matches GPL code, they would be
| liable. Likewise if an open source project uses code from a different
| license or proprietary code via this.
|
| Looking at the source code or the function and variable names
| in binaries, you cannot tell if CoPilot is used or not, so
| there isn't a functional difference between someone copying
| that code or CoPilot copying it.
| coding123 wrote:
| I just keep flagging this. It's getting over analyzed into the
| ground. Sue them if you want but it's just a waste of time to
| keep talking about this.
| wolverine876 wrote:
| > As we mentioned, GitHub trains Copilot on numerous pieces of
| public code, many of which are covered by strong copyleft
| licenses (i.e. GPL v2, GPL v3). Copyleft licenses require that
| derivative works (of the copyleft-licensed code) must carry the
| same license as the original code.
|
| Even when no GPL v2/3 code is quoted by Copilot, is using the
| code for training a non-free product allowed under the license?
| Under the license, is Copilot therefore now licensed GPL v2/3?
| The GPL code was certainly used to create a critical, integral
| part of Copilot, and to create its output.
|
| If I understand correctly, GPL v2/3 were designed to prevent non-
| free products from being parasites on FOSS code, taking and not
| giving. If that's the spirit of GPL, Copilot seems to clearly
| violate it.
| invokestatic wrote:
| When you upload code to Github, you agree to license it to them
| under Github's terms, and not whatever license the software is
| typically distributed under. You are effectively "dual
| licensing" software by uploading it to Github, whether you
| realize it or not. Of course, there are edge cases in which you
| don't have the rights to license the software to Github, but in
| those cases, I don't have the answer.
| jcelerier wrote:
| > "To the extent you see a piece of suggested code that's very
| clearly regurgitated from another source -- perhaps it still has
| comments attached to it, for example -- use your common sense and
| don't use those kinds of suggestions."
|
| How is "use common sense" even remotely a meaningful thing?
| rjzzleep wrote:
| Especially coming from a supposed lawyer.
| scintill76 wrote:
| I thought this was weird too. Why are comments the dividing
| line? Because they sound like a human? How do we know Copilot
| won't regurgitate an exact copy of human code that doesn't have
| comments?
|
| It's kinda surprising Copilot even reads and outputs comments.
| pedrocr wrote:
| Given this fair use argument that the work is probably
| transformative enough here's what I'll be doing next. I'll take
| the Windows and Office source code, run it through a decompiler
| and then train a neural network on that output. This sequence of
| steps should be at least as transformative of Microsoft's
| copyright as what Copilot is doing with the open-source corpus,
| probably much more so. I will then use that neural network to
| write patches for ReactOS and WINE. Since those projects are very
| wary of interaction with Microsoft copyrighted works, could
| Microsoft Legal please publicly state their assurance that all
| this is perfectly legal use of their copyrights? Maybe that would
| help convince people.
| EamonnMR wrote:
| Might be faster to generate verbatim copies of Disney IP.
| invokestatic wrote:
| I've heard something similar in response to Copilot in another
| thread (something like offering a sum of money to Github if
| they train their model exclusively on the Windows NT source
| code). But I think the legal theory here is that Copilot is
| trained on many thousands of sources. If Copilot was trained on
| a single source, or even a small handful of sources, the
| derivative work claim becomes much stronger. When trained on
| many sources, it becomes much harder to claim that it's a
| derivative of another work.
|
| Take for example a human. If I studied a bunch of different
| open-source projects, learned techniques from them, and
| implemented them in my own projects, is that a derivative work?
| Probably not. But if I were to reverse engineer Windows and
| implement the techniques I saw in ReactOS, that's where it
| seems issues start to arise.
| pedrocr wrote:
| So I just need to decompile Oracle's database and a few other
| commercial products as well and I'm good? Is Microsoft legal
| happy if I do Windows+Office+OpenSourceCorpus? I'd take that
| statement as well. Or even if they just do that themselves
| and train Copilot on their internal source code just as they
| do with the public open-source corpus. That would be a strong
| statement as well.
| breischl wrote:
| >If I studied a bunch of different open-source projects,
| learned techniques from them, and implemented them in my own
| projects, is that a derivative work? Probably not.
|
| That's pretty unclear actually. If it's quite close to the
| original work, it is derivative. Even though you have
| probably been "trained" on quite a few different codebases
| over the years. Hence the existence of clean-room
| implementations, wherein the people building a new
| implementation have never seen the original.
|
| Also, given that code that has been passed through a
| biological network (ie, brain) can constitute infringement,
| it seems obvious that code passed through a mechanical one
| could too. Maybe not in every case, but it certainly seems
| plausible.
| erhk wrote:
| Well, I could certainly construct a data set where Windows NT
| is an atomic outlier and muddy the water with arbitrary
| inputs to satisfy your irrelevant requirement. Perhaps I'll jam
| some pictures of cows in, or any number of animal photos.
| Maybe even some classical literature. Hell, maybe I even just
| jam a shit ton of JavaScript in. That's code, right?
| heavyset_go wrote:
| You're taking the machine learning metaphor literally.
| Training an ML model is not the same thing as a human being
| learning off of material.
|
| A human being can understand abstract concepts and reason
| about them based on material they learn from. An ML model is
| a statistical model that is closer to compilation or lossy
| encoding or compression.
|
| Often, ML models can encode their training data verbatim in
| the model itself, which is exactly what happened with Copilot
| and this example[1].
|
| [1] https://twitter.com/mitsuhiko/status/1410886329924194309
| klyrs wrote:
| > When trained on many sources, it becomes much harder to
| claim that it's a derivative of another work.
|
| Sounds good in theory, until it starts producing snippets
| verbatim from uniquely-identifiable sources.
| ghoward wrote:
| I am doing something very similar:
| https://twitter.com/GavinDHoward/status/1415380847537135620 .
| We'll see if they answer.
| modeless wrote:
| > If you look at the GitHub Terms of Service, no matter what
| license you use, you give GitHub the right to host your code and
| to use your code to improve their products and features
|
| Sure, that's fine if the author of the code chooses to upload it
| to GitHub. But what if they don't, and then someone else does? If
| I take an AGPL project that someone else wrote and upload it to
| GitHub, does that grant GitHub the right to use the code "to
| improve their products and features" which are closed source? I
| don't have the right to relicense the code, and neither does
| GitHub, so clearly not.
| _ph_ wrote:
| I think the discussions miss a bit an important point. IANAL, but
| I think if a young programmer reads a lot of source code on
| GitHub, and based on this reading becomes a better programmer,
| this is a fair use of copyrighted material and pretty much
| independent of the license. If I read any book and learn the
| corresponding language, this isn't a copyright violation of the
| book either. A violation starts when I begin to quote from that book or
| the programmer takes snippets from the programs that got read.
|
| The problem is, I don't think you can really claim that Copilot
| learned to program. While some of the output seems to be
| something new, most of the times it looks more like a
| recomposition of learned fragments if not even longer pieces of
| verbatim code taken from copyrighted material. We have seen
| examples of this. And in this moment, it becomes a copyright
| discussion, probably determined by the volume of copyrighted
| material reproduced. Which by the way is always the risk if a
| human uses certain training material. The better one is at
| memorizing things, the more there is the risk.
|
| Or put it the other way around: if Copilot would use its
| "knowledge" of programs to advise the programmer like pointing
| out potential errors without reproducing anything it used for
| learning, it should be fine. But that is not how it works.
| jozvolskyef wrote:
| If a person who's never seen a goat looks at a million
| copyrighted images of goats and draws a goat, are they
| committing copyright infringement?
|
| What if an algorithm does the same? The result is 'a
| recomposition of learned fragments' in either case.
| aaron695 wrote:
| Do we have any proof that Copilot works?
|
| I assume it's a pile of rubbish that's currently fooling Youtube
| hype based programmers and followers. Has any ok but real
| programmer used it solid for a week yet and wants to keep going?
|
| This is tied to the legal argument.
|
| If Copilot works (Which I cannot believe it would) it changes
| many legal points. Garbage spewing out copyright code is
| different to something that 'understands' copyright code.
|
| And who cares about copyright if it's like all other hype based
| AI currently, unusable in the real world. All the current HN
| seems to be bike shedding around legal. Does no one program
| anymore?
| Zambyte wrote:
| Google Books is a great parallel to make with Microsoft's
| Copilot. The key differences between the two are
|
| A) Google Books produces verbatim results 100% of the time, while
| Microsoft's Copilot produces verbatim results some N > 0% of the
| time (with some % of results greater than N that would be
| considered a derivative if a human wrote it), and
|
| B) Google Books doesn't make the claim that you own the copyright
| to any greater than 0% of the search results, while Microsoft's
| Copilot makes the claim that you own the copyright to 100% of the
| results.
| rjzzleep wrote:
| If you copy a quote from Google Books you still have to
| attribute the original author. It's not magically your text
| just because it was hosted on Google Books. Why do they even
| compare these two?
|
| You can compare Github itself with Google Books, but not
| copilot.
| ghaff wrote:
| >If you copy a quote from Google Books you still have to
| attribute the original author.
|
| Mostly because if you don't do so, that's a plagiarism issue
| which the law mostly doesn't concern itself with except
| insofar as an attributed quote, unless the length is truly
| excessive, is likely to be seen as Fair Use while an
| unattributed quote, especially if it's more than a minimal
| snip, is not. (IANAL, etc.)
| Animats wrote:
| If you want to kill GitHub CoPilot, start posting useful snippets
| of code which contain security backdoors, and wait for CoPilot to
| put them in something.
| swhalen wrote:
| Does the fair use exemption (or an equivalent) exist in all
| countries?
| xdennis wrote:
| Obviously not. Neither does DMCA, but that doesn't stop the USA
| from enforcing it worldwide.
___________________________________________________________________
(page generated 2021-07-15 23:02 UTC)