[HN Gopher] All public GitHub code was used in training Copilot
___________________________________________________________________
All public GitHub code was used in training Copilot
Author : fredley
Score : 724 points
Date : 2021-07-08 08:18 UTC (14 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| iseethroughbs wrote:
| I treat Copilot as literally a programmer in pair programming.
| Which means that if it's trained, i.e. it has "seen" GPL code,
| then it's tainted, and we should treat the resulting code as
| GPL code.
|
| Replace "GPL" with the most restrictive license that's on GitHub,
| but you get the point.
|
| They're kinda shooting themselves in the foot, because this
| reduces the commercial potential of the tool to almost nothing.
| ksec wrote:
| All I know is this Copilot opens a whole can of worms. And it
| doesn't, and possibly never will, have a right answer without a
| court settling it.
|
| Obviously most (I think) lawyers seem to be siding with
| Microsoft on fair use. But most owners of the code seem to think
| Microsoft is infringing on their work.
|
| Then there is the international issue, because one court can't
| decide for everyone else.
|
| I think the issue is important enough that I wonder if we could
| somehow crowdfund a court trial or something.
| rambambram wrote:
| I'm a programmer and also studied law for some time. These
| stories make me - once more - realize the old adage: "Possession
| is nine tenths of the law." Don't host that code in the cloud (or
| a better term, someone else's dirty bucket). What happened to
| developers hosting stuff on their own website!?
| jefftk wrote:
| It doesn't matter where the code is hosted, just that it is
| publicly accessible. If developers hosted code on their own
| sites, someone could still scrape them and use that to train
| models.
|
| (The question of whether this is sufficiently transformative to
| count as fair use is still wide open)
| BuildTheRobots wrote:
| > It doesn't matter where the code is hosted, just that it is
| publicly accessible. If developers hosted code on their own
| sites, someone could still scrape them and use that to train
| models.
|
| I'd suggest it makes it more interesting. If it's self-
| hosted, then the host can choose to impose restrictions on
| server access, including no automated scraping, rather than
| trying to impose licensing on the code itself.
| ghoward wrote:
| This is why I have now moved my code off of GitHub.
| [deleted]
| elliekelly wrote:
| Anecdata: I'm a lawyer and programmer and my clients (large
| financial institutions) are increasingly insisting on hosting
| as much on-site as possible. It costs more, it can make it
| difficult to select vendors/service providers, and it's not
| without business continuity risks which they take steps to
| mitigate.
|
| But I think more and more companies, particularly those in
| highly regulated industries, are deciding that the benefit of
| controlling the data -- access, security, privacy, and
| understanding _who_ , exactly, it's being shared with --
| outweighs the risks of someone else having that control.
| radiator wrote:
| So some professionals will have the chance to migrate systems
| back to on-premises, after having migrated them from on-
| premises to the cloud? Interesting.
| ralph84 wrote:
| GitHub's argument isn't that you hosted your code on GitHub and
| therefore gave them a license to use it to train their model.
| GitHub's argument is they don't need a license to train their
| model because it's fair use. Hosting your code somewhere else
| doesn't prevent fair use. If you don't want your code used to
| train ML models, don't host it anywhere.
| rambambram wrote:
| I get it, but that's already a legal argument. I was trying
| to zoom out from the unavoidable legal argumentative
| deadlock: if GH does not have your code hosted on their
| servers, it becomes way harder for 'them' to grab it and rape
| it. Your own domain is - of course - also out in the open,
| but at least you can have more control.
| tannhaeuser wrote:
| > _What happened to developers hosting stuff on their own
| website!?_
|
| Devs were hoping for stars and network effects rather than
| listening to those of us who felt uncomfortable sending all
| traffic to GH. Something like Copilot or even a coding bot was
| predicted two years ago already.
| dariusj18 wrote:
| Wouldn't this question already have been asked and answered when
| AIs were trained on books and articles?
| zinekeller wrote:
| As far as I know, there isn't a formal copyright-related US
| court ruling (yet anyway) on training ML/AIs on any media
| (except for copying the code of an ML). So everything is
| actually on thin ice, much like the infamous "GIFs _[from
| snippets of shows etc.]_ are widely believed to be fair use",
| which in reality is still untested. Let's not forget other
| countries, with much stricter copyright rules (especially moral
| rights).
| _Understated_ wrote:
| Ok, my curiosity has been fired here...
|
| I have conjured up two scenarios here:
|
| Let's say I use copilot to generate a bunch of code for an app,
| something substantial, and it regurgitates a load of bits and
| pieces from many sources it got from GitHub, I'd assume there
| won't be any attribution in it... it will be as if Copilot made
| the code itself (I know it sort of does, but let's not split
| hairs!). I'm guessing the prevailing theory (from GitHub anyway)
| is that I'm legitimately allowed to do this.
|
| Now, let's say I generated all that code by manually copying and
| pasting chunks of code from a whole bunch of repos, whether they
| are open source, unlicensed, whatever. Would I not be ripe for
| legal issues? I could potentially find all the code that copilot
| generated and just copy and paste it from each of the sources and
| not mention that in my license. What if I told everyone "yeah, I
| just copied and pasted this from loads of Github repos and didn't
| put any attribution in my code". I'd assume that (morality aside)
| I'd be asking for trouble!
|
| Am I missing something? Am I misunderstanding the situation, or
| the capabilities of copilot?
| [deleted]
| BlueTemplar wrote:
| Copilot is just a tool; legally it cannot "make code". You're
| the one making it.
|
| See also : Napster, including how it was condemned for
| facilitating copyright infringement (what Microsoft is risking
| here, though the offense is likely to be much milder, of
| course).
| schneidmaster wrote:
| There's a decent bit of caselaw indicating that computers
| reading and using a copyrighted work simply "don't count" in
| terms of copyright infringement -- only humans can infringe
| copyright. This article[0] does a pretty good job of
| summarizing the rationale that the courts have provided. My
| (non-lawyer) take is that GitHub is pushing this just half a
| step farther -- if computers can consume copyrighted material,
| and use it to answer questions like "was this essay
| plagiarized", then in GitHub's view they can also use it to
| train an AI model (even if it occasionally spits back out
| snippets of the copyrighted training data). Microsoft has
| enough lawyers on staff that I'm sure they have analyzed this
| in depth and believe they at least have a defensible position.
|
| [0]: https://slate.com/technology/2016/08/in-copyright-law-
| comput...
| mysterydip wrote:
| Makes me wonder what would happen if a similar thing was done
| with books. If I train an AI on all the texts of Tom Clancy,
| or Stephen King, or every Star Wars novel, and the books it
| generates every so often produce paragraphs verbatim from one
| of those sources, would copyright owners be up in arms? What
| would the distinction be between the code case and the text
| case?
| xsmasher wrote:
| This will surely happen within the next few years; but if
| the "new work" contains a full paragraph from an existing
| novel the copyright hammer would come down hard.
|
| Maybe it needs to be paired with another network / hunk of
| code that checks for verbatim copying?
| shagie wrote:
| I am not a lawyer. I do photography and have a more than
| passing interest in copyright as it applies to the
| photographs I take and the material I photograph.
|
| Copyright on art gets more interesting / fuzzier. The key
| part is substantial similarity -
| https://en.wikipedia.org/wiki/Substantial_similarity and
| https://www.photoattorney.com/copyright-infringement-for-
| sub...
|
| Rather than text, my AI copyright hypothetical... consider
| a model created based on sunset photographs. You take a
| regular photograph, pass it through the model, and it
| transforms it into a sunset. The model was trained on
| copyrighted works but the _model_ is considered fair use.
|
| Now, I go and take a photograph from some location during
| the day and then pass it through the transformer and get a
| sunset. Yay me! Unbeknownst to me, that location is a
| favorite location for photographers and there were sunsets
| from that location used in the training data. My
| photograph, transformed to look like a sunset is now
| similar to one of them in the training data.
|
| Is my transformed photograph a derivative work of the one
| in the training data to which it bears similarity? How
| would a judge feel about it? How does the photographer
| whose photograph was used in the training data feel?
| TaylorAlexander wrote:
| What would be interesting in that case would be how the
| transformed image would look if photos from that location
| were removed from the training set. That would help
| reveal whether it was just copying what it had seen or it
| actually remembered what sunsets looked like and
| transformed the image using its memory of sunsets in
| general.
| [deleted]
| _Understated_ wrote:
| I don't doubt that an army of lawyers has pored over this,
| but they have size on their side: the cost of litigation vs
| potential revenue will be a massive factor.
|
| Edit: > There's a decent bit of caselaw indicating that
| computers reading and using a copyrighted work simply "don't
| count" in terms of copyright infringement.
|
| That means their computer can read any code it wants, do
| whatever it wants with the code, then they can monetise that
| by giving YOU the code. Would they then be indemnified by
| saying "no Microsoft human read or used this code"?
|
| However, if you then use the code and look at it, does that
| make you liable?
| schneidmaster wrote:
| Again, not a lawyer, just a guy who likes reading this
| stuff. The devil is usually in the details of copyright
| cases. The Turnitin case hinged substantially on whether
| Turnitin's use of copyrighted essays was "fair use". There
| are four factors[0] which determine fair use; the two more
| relevant factors here are "the purpose and character of
| your use" and "the effect of the use upon the potential
| market". The court found that Turnitin's use was highly
| "transformative" (meaning they didn't just e.g. republish
| essays; they transformed the copyrighted material into a
| black-box plagiarism detection service) and also found that
| Turnitin's use had minimal effect on the market (this is
| where "computers don't count" comes in -- computers reading
| copyrighted material don't affect the market much because a
| computer wasn't ever going to buy an essay).
|
| I would be shocked if GitHub's lawyers didn't argue that
| using copyrighted material as training data for an AI model
| is highly transformative. There may be snippets available
| from the original but they are completely divorced from
| their original context and virtually unrecognizable unless
| they happen to be famous like the Quake inverse square root
| algorithm. And I think GitHub's lawyers would also argue
| that Copilot's use does not affect the _original_ market --
| e.g. it does not hurt Quake's sales if their algorithm is
| anonymously used in a probably totally unrelated codebase.
|
| Your counterexample would probably fail both tests -- it's
| not transformative use if your software hands out complete
| pieces of copyrighted software, and it would definitely
| affect the market if Copilot gave me the entire source code
| of Quake for my own game.
|
| [0]: https://fairuse.stanford.edu/overview/fair-use/four-
| factors
| _Understated_ wrote:
| I thought I understood fair use but turns out I was
| wrong...
|
| That being said, creating a transformative work from
| something else is considered fair use. So, for example,
| if I read a whole bunch of books and then, heavily
| influenced by them, create my own, similar book, that
| would be fair use I suppose... that makes sense.
|
| But, where does the derivative works come in? Where do
| you draw the line?
|
| If I am heavily influenced by billions of lines of other
| people's GPL code (a la Copilot!), then I create my own
| tool from it and keep my code hidden, does that not mean
| I am abusing the GPL license?
| schneidmaster wrote:
| That's what I meant by the devil being in the details --
| these gray area questions hinge on the specific facts.
| Lawyers on both sides will argue which factors apply
| based on past caselaw and available evidence, and the
| court renders a decision. For example, from the Stanford
| webpage I previously linked: "the creation of a Harry
| Potter encyclopedia was determined to be "slightly
| transformative" (because it made the Harry Potter terms
| and lexicons available in one volume), but this
| transformative quality was not enough to justify a fair
| use defense in light of the extensive verbatim use of
| text from the Harry Potter books". So you _might_ be okay
| creating a Harry Potter encyclopedia in general, but not
| if your definitions are copy/pasted from the books, but
| you might still be okay quoting key lines from the books
| if the quotes are a small portion of your encyclopedia.
| The caselaw just doesn't lend itself to firm lines in the
| sand.
| LocalH wrote:
| That's funny, because the bedrock of copyright - insofar as
| software is concerned - is entirely predicated on the idea
| that a computer copying code into RAM to execute it is indeed
| a copyright violation outside of a license to do so.
| blendergeek wrote:
| > There's a decent bit of caselaw indicating that computers
| reading and using a copyrighted work simply "don't count" in
| terms of copyright infringement -- only humans can infringe
| copyright.
|
| I have read variations of "computers don't commit copyright"
| more times than I can count in the past few days.
|
| How is Copilot different from a compiler? (Please give me the
| legal answer, not the technical answer. I know the difference
| between Copilot and a compiler, technically.)
|
| Isn't a compiler a computer program? How is its output
| covered by copyright?
|
| Am I fundamentally misunderstanding something here?
| someone7x wrote:
| You just blew my mind with that analogy. I can only imagine
| some hair-splitting logic to rationalize a distinction.
| ghoward wrote:
| The analogy goes even further if you consider compiler
| optimizations: https://gavinhoward.com/2021/07/poisoning-
| github-copilot-and... .
| cormacrelf wrote:
| "Computers don't commit copyright" is a complete misreading
| or misunderstanding of another proposition, that "computers
| cannot author a work".
|
| Authoring is the act that causes a work to be
| copyrightable. In most jurisdictions, authoring a work
| _automatically_ causes copyright to subsist in the work to
| some degree. The purpose of the copyright system is to
| encourage people to author new, original works, by
| rewarding those who do with exclusive rights. It is well-
| known that only humans can author a work. Computers simply
| cannot do it. If your computer (by some kind of integer
| overflow UB miracle) accidentally prints out a beautiful
| artwork, NOBODY has exclusive copyright over it, and anyone
| may reproduce it without limitation. Same goes for that
| monkey who took a selfie.
|
| What a compiler does, on the other hand, is adapt a work.
| Adapting a work is not authoring it. Sometimes when you
| adapt a work, you also author some original work yourself,
| like when you translate a book into another language. When
| a compiler (not a linker) transforms source code, it
| absolutely, 100% definitely does NOT add any original work;
| the executable or .so/.a/.dylib/.dll file is simply an
| adaptation of the original work. The copyright-holder of
| the source code is the copyright-holder of the machine
| code. An adaptation is also known as a "derivative work".
|
| (Side note; copyleft licenses boil down to some variation
| of "if you adapt this, you have to share everything in the
| derivative work, not just the bits you copied.")
|
| Adaptation is a form of reproduction. It's copying.
| "Distribution" also often involves copying, at least on the
| internet. (Selling or giving away a book you have purchased
| does not constitute copying.) Copying is one of the
| exclusive rights you have when you own the copyright in a
| work, that you may then license out.
|
| It gets more complicated when the computer uses fancy ML
| methods to produce images/text out of things it has
| seen/read. You can't simplify the law around that to a
| simple adage digestible enough to share memetically on HN
| and Twitter. One thing is certain: if the computer did it,
| by itself, then no original work was authored in the
| process. That poses a problem for people who write the name
| of a function and get CoPilot to write the rest; if you do
| that, you are not the author of that part of the program.
| If you use it more interactively that's a different story.
|
| There is, however, always a question of whether the
| copyright in the original works the computer used _still
| subsists_ in the output.
|
| My rough framing of the licensing issues around CoPilot is
| therefore as follows:
|
| 1. The source code to CoPilot is an original work, and the
| copyright is owned by GitHub.
|
| 2. When GH trained CoPilot's models on other people's
| works, was that copying? (This one is partially answered.
| It can spit out verbatim fragments, so it must be copying
| to some extent, rather than e.g. actually learning how to
| code from first principles by reading.) If it was not all
| copying, how much of it was copying and how much of it was
| something else? What else was it?
|
| 3. If GH adapted the originals, what is the derivative
| work? (I.e. where does the copyright subsist now? Is it a
| blob of random fragments of code with some weights in a
| neural network?)
|
| 4. Which works is it an adaptation of? You might think "all
| of them, and for each one, all of the code" but I'm not so
| sure. For example, imagine the ML blob contains many
| fragments, but some are shorter than others. If your
| program has "int x;" in it, and CoPilot can name a variable
| "x", you can hardly claim that as your own. I'm most
| interested in whether the mere fact of CoPilot having
| digested ALL of it, having fed this into the mix and
| producing a ML blob based on all that information, means
| that the ML blob is a derivative work of all of them. Or
| whether there is some question of degree.
|
| 5. Fair use. Was it fair use to train the model? Is it,
| separately or not, fair use to create a commercial product
| from the model and sell it? Fair use cares about commercial
| use, nature of the copied work, amount of copying in
| relation to the whole, and the effect on the market for /
| value of the copied work. Massive question.
|
| 6. If not fair use, then GH is subject to the licenses and
| how they regulate use of the works. What license conditions
| must GH comply with when they deal with the derivative
| work, and how? Many will be tempted to jump straight to
| this question and say GH must release the source code to
| CoPilot. I'm not yet convinced that e.g. GPL would require
| this. I can't believe I'm writing this, but is the ML blob
| statically or dynamically linked? Lol.
|
| 7. Final question: is there some way to separate out works
| which were copied with fair use (or not copied at all),
| from works which were copied with no fair use? People are
| worried about code laundering, e.g. typing the preamble to
| a kernel function and reproducing it in full. In that
| situation, it is fairly obvious that the end user has
| ultimately copied code from the kernel and needs to abide
| by GPL 2.0; moreover if they're using CoPilot to write out
| large swathes of text they will naturally be alert to this
| possibility and wary of using its output. But think of the
| converse: if there is no way to get CoPilot to reproduce
| something you wrote, what's the substance of your
| complaint? Is CoPilot's model really a derivative of your
| work, any more than me, having read your code, being better
| at coding now? Strategically, if you wanted to get GH to
| distribute the model in full, you might only need one
| copyleft-licensed, verbatim-reproducible work's owner to
| complain. But then they would just remove the complainant's
| code. You might be looking at forcing them to have a "do
| not use in CoPilot" button or something.
| jeremyjh wrote:
| I think this is more cogent analysis than anything else
| I've seen yet on this topic. You should consider
| submitting a blog post so this can become a top-level
| topic.
|
| Also, I loved this quote:
|
| > Copying is one of the exclusive rights you have when
| you own the copyright in a work, that you may then
| license out.
|
| I've been paying attention to software copyright topics
| for more than twenty years and never thought of it in
| exactly these terms. It's right there in the name - the
| right to copy it - and to determine the terms under which
| others can copy it is exactly what a copyright is!
| [deleted]
| jeremyjh wrote:
| What if I made a few tweaks to Copilot so that it is very
| likely to reproduce large chunks of verbatim code that I
| would like to use without attribution, such as the Linux
| kernel? Do you really think you can write a computer
| program that magically "launders" IP?
|
| A compiler is run on original sources. I don't see any
| analogy here at all.
| ghoward wrote:
| * They both process source code as input.
|
| * They both produce software as output.
|
| * They both transform their input.
|
| * They both can combine different works to create a
| derivative work of each work. (Compilers do this with
| optimizations, especially inlining with link-time
| optimization; see the toy sketch below.)
|
| They really do the same things, and yet, we say that the
| output of compilers is still under the license that the
| source code had. Why not Copilot?
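|
| A toy sketch of that combining effect (hypothetical files,
| not from any real project): with link-time optimization, the
| compiler can fuse two differently-licensed works into one
| piece of machine code.
|
|     /* a.c -- imagine this file came from a GPL project */
|     int gpl_helper(int x) { return x * 2 + 1; }
|
|     /* b.c -- imagine this file came from an MIT project */
|     #include <stdio.h>
|     int gpl_helper(int x);
|     int main(void) {
|         /* built with `cc -O2 -flto a.c b.c`, the optimizer
|            may inline gpl_helper's body here, so the emitted
|            machine code derives from both works at once */
|         printf("%d\n", gpl_helper(7) - 3);
|         return 0;
|     }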
| jeremyjh wrote:
| > Why not Copilot?
|
| Because the sources used for input do not belong to the
| person operating the tool.
|
| If you say that doesn't matter, then you are saying open
| source licenses don't matter because the same thing
| applies - I could just run a tool (compiler) on someone
| else's code, and ignore the terms of their license when I
| redistribute the binary.
| ghoward wrote:
| I think you have what I am saying backwards. I am saying
| that the licenses _should_ apply to the output of
| Copilot, like they apply to the output of compilers.
| jeremyjh wrote:
| Oh sorry, my mistake! Thank you.
| hedora wrote:
| No, I think that's the point.
|
| If I take some code I don't have a license for, feed it
| to a compiler (perhaps with some -O4 option that uses
| deep learning because buzzwords), then is the resulting
| binary covered under fair use, and therefore free of all
| license restrictions?
|
| If not, then how is what Copilot is doing any different?
| jeremyjh wrote:
| > If I take some code I don't have a license for, feed it
| to a compiler (perhaps with some -O4 option that uses
| deep learning because buzzwords), then is the resulting
| binary covered under fair use
|
| No, the binary is not free of license restrictions. Read
| any open source license - there are terms under which you
| can redistribute a binary made from the code. For GPL you
| have to make all your sources available under the same
| terms for example. For MIT you have to include
| attribution. For Apache you have to attribute and agree
| not to file any patents on the work in the Apache-licensed
| project you use. This has been upheld in many court cases
| - though it is not always easy to find litigants who can
| fund the cases, the licenses are sound.
| lukeplato wrote:
| Copilot is a commercial paid service that generates money for
| Microsoft
| _Understated_ wrote:
| Yeah, that bit I realise but the point I was getting at is
| this: if I take someone else's code, use chunks of it in my
| app, say that it's mine and make money from it is that not
| illegal? Or, at least in violation of the license?
|
| Superficially at least, Copilot (from my understanding) is
| "copying" code, letting me use it in my app, and making money
| from it.
|
| I'm just trying to wrap my head around it.
|
| Let's be clear, I am not a lawyer, but it seems... strange!
| stonemetal12 wrote:
| Both the copy machine and the VCR were found to be legal
| because they had substantial non-infringing uses. As is, I
| don't see how Copilot does. It could, if trained on public-
| domain or attribution-free code only; unfortunately, there
| probably isn't enough code out there to train the model
| adequately under such rules.
| Diggsey wrote:
| Also NAL, but I think there's far more of a case that users
| of Copilot might violate copyright rather than Copilot
| itself:
|
| - Only a _very_ small proportion of Copilot-generated code
| is reproduced verbatim, so if you specifically built a
| product just from copied-verbatim code, your act of
| selecting and combining those pieces of copyrighted code
| would be creating a derivative work.
|
| - GitHub is not selling the copyrighted code, they are
| selling the tool itself. Google is literally the same
| thing: you could theoretically create a product by googling
| for prefixes of copyrighted code and then copying the
| remainder straight out of the search results. It's you who
| would be violating copyright, not Google.
| ghoward wrote:
| I think there is an argument to be made that Copilot is
| producing derivative code, though. It may produce copies
| verbatim, and that's a violation, but far more often, it
| produces a mixture of things it was trained on, most of
| which probably have some sort of license requiring
| attribution at the very least.
| [deleted]
| klntsky wrote:
| Does copilot seem strange, or maybe the concept of
| intellectual property does?
| _Understated_ wrote:
| Copilot isn't strange from a technical perspective.
|
| The strange bit is how they are allowed to use other
| people's code to create derivative works (this is how I
| see it from my non-legal perspective anyway).
|
| Even if it's legal (to the letter of the law, not the
| spirit) it leaves a sour taste.
| kstrauser wrote:
| Suppose Copilot was Composer and it generated personalized
| songs for you after being trained on Spotify's library. If you
| started performing the resulting song and it contained
| recognizable clips of others, I guarantee you'd have lawyers
| coming after you.
|
| I don't see this as fundamentally different. It's unlikely that
| the Free Software Foundation is going to track you down for
| including some GNU code in your single-user repo. If you used
| their stuff in a popular commercial project and they got wind
| of it, you might expect to receive a cease and desist at best.
| wpietri wrote:
| I think you're right. Especially given that Copilot can
| reproduce significant blocks of code:
| https://twitter.com/mitsuhiko/status/1410886329924194309
|
| Famous code:
| https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...
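|
| For reference, the widely-circulated snippet from the Quake III
| source (as shown on that Wikipedia page) reads roughly as
| follows, original comments included (lightly sanitized):
|
|     float Q_rsqrt( float number )
|     {
|         long i;
|         float x2, y;
|         const float threehalfs = 1.5F;
|
|         x2 = number * 0.5F;
|         y  = number;
|         i  = * ( long * ) &y;  // evil floating point bit level hacking
|         i  = 0x5f3759df - ( i >> 1 );  // what the...?
|         y  = * ( float * ) &i;
|         y  = y * ( threehalfs - ( x2 * y * y ) );  // 1st iteration
|
|         return y;
|     }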
| treesprite82 wrote:
| I see this held up as an example a lot, but the fast inverse
| square root algorithm didn't originate from Quake and is in
| hundreds of repositories - many with permissive licenses like
| WTFPL and many including the same comments.
|
| GitHub claims they didn't find any "recitations" that
| appeared fewer than 10 times in the training data. That
| doesn't mean it's a completely solved issue (some code may be
| repeated in many repositories but always GPL, and there are
| limitations to how they detect recitations), but from rare
| cases of generating already-common solutions people seem to
| be concluding that all it does is copy-paste.
| wpietri wrote:
| That may be true, although even GitHub doesn't know for
| sure. But the problem remains: they're reproducing other
| people's code without regard to license status.
| rgbrenner wrote:
| _" I'm guessing the prevailing theory (from GiitHub anyway) is
| that I'm legitimately allowed to do this."_
|
| No. Copilot is a technical preview. In the final release, if it
| reproduces code verbatim, it'll tell you and present the
| correct license.
| ghoward wrote:
| It doesn't matter that it's a technical preview; people are
| using it now, and GitHub has _already_ used it internally. So if
| it infringes now, there is already code out there being used
| that does infringe.
| rgbrenner wrote:
| GitHub appears to be tracking every snippet that they're
| generating during their trials:
|
| https://docs.github.com/en/github/copilot/research-
| recitatio...
|
| Are you doing that? If not, then I wouldn't use GitHub's
| use as justification to engage in copyright infringement.
| ghoward wrote:
| Oh, I am not using Copilot. But other people not part of
| GitHub are. And those are still violations.
| x4e wrote:
| How will it find the "correct" license?
|
| Will it check the LICENSE file? Simply having a LICENSE file
| is not a declaration that all the code in that repo is under
| that LICENSE.
|
| What if specific lines/files are specified to be under
| different licenses?
|
| What if the publisher of the repo is publishing it under an
| incorrect license in bad faith?
|
| Will github be responsible if it tells me the wrong license?
| spywaregorilla wrote:
| So playing devil's advocate. What if the courts just don't care,
| and rule that copying code verbatim is not a crime because you
| didn't copy it, and copilot is not a human so it can't commit
| crimes. What's the net effect of a system that draws upon all
| public code repos? It sounds... net beneficial to society?
|
| On the plus side, a large body of work effectively becomes public
| domain. On the negative side, copyleft licenses lose their teeth.
| You probably see more power shift to those with big budgets. You
| probably see fewer things made source available, because you
| either have the public license or the private license now. This
| feels like a bad path but I'm not convinced the end result isn't
| better still.
| downrightmike wrote:
| The really nice thing is that this basically creates a library
| of industry methods and practices. It'd be really nice to be
| able to destroy copyright trolls because what their patent
| "covers" is already a known and established industry method, or
| a prior art.
| mook wrote:
| Would that mean I can start sampling songs if they get fed
| through a neural network? It'll be fine if I train it on
| whatever is playing on the radio, right? Doing the same for
| poems?
| spywaregorilla wrote:
| I would expect the legal argument to get into the intentions
| of the user and their relationship to the tool. I would also
| expect perspectives of art and code to diverge.
| agilob wrote:
| >copilot is not a human so it can't commit crimes
|
| I can set up my drone to detect me and attempt to crash into me.
| The AI would be quite poor; it would probably attempt to crash
| into any human. Would it be my fault it didn't crash into me and
| someone lost an eye?
|
| Can I set up a torrent box that automatically downloads and seeds
| all detected links from public trackers? Would I be responsible
| for it?
| spywaregorilla wrote:
| Both of these examples include you creating something and
| then using it. I don't know how Copilot works, but using the
| second example, if you wrote a script to download and seed
| torrents, and someone else used it, I don't think you would
| be held under any liability, especially if you don't profit
| off of it.
|
| Not a lawyer or even particularly well informed
|
| edit: I am reminded of the monkey selfie, in which it was
| ruled that a non-human cannot create copyrightable works.
| https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
| rcfox wrote:
| It sounds like you're arguing that Github isn't liable for
| people using copyrighted code through Copilot.
|
| I think most people are more concerned about whether the
| user of Copilot would be liable for using copyrighted code
| generated by Copilot.
| cool_dude85 wrote:
| Did copilot spring from the aether? Or was it built and
| trained on licensed code by github? Someone did something.
| spywaregorilla wrote:
| It's not a violation of copyright to train a model. There
| are three questions at play though:
|
| 1) Can you be liable for violating copyright if you have
| never seen the work?
|
| 2) Can a non-human be held accountable for violating
| copyright?
|
| 3) Can github be held liable for an end user using their
| tool to violate copyright?
|
| https://en.wikipedia.org/wiki/Substantial_similarity
|
| wikipedia states: Generally, copying cannot be proven
| without some evidence of access; however, in the seminal
| case on striking similarity, Arnstein v. Porter, the
| Second Circuit stated that even absent a finding of
| access, copying can be established when the similarities
| between two works are "so striking as to preclude the
| possibility that the plaintiff and defendant
| independently arrived at the same result."
|
| This is a different situation in which exact replication
| can reasonably occur without access to the
| original.
|
| Secondly, can you actually claim Github has violated
| copyright if it doesn't have any claims to the work in
| question?
|
| I think it's totally plausible that they win this in the
| long run.
| stonemetal12 wrote:
| 1) So you are saying if I get a disk duplication machine
| I can freely copy and distribute Blu-ray discs as long as
| I don't watch the movie on the disc?
|
| 2,3) Seems pretty settled at this point, look at the
| cases around the VCR and copy machine. In general the one
| using the machine is liable. The creator of the machine
| can be held liable if there aren't substantial non
| infringing uses.
| formerly_proven wrote:
| > It's not a violation of copyright to train a model.
|
| Many people on HN assert this based on the Authors Guild
| vs. Google case, but it's quite important to keep in mind
| that that case was about Google creating a search
| algorithm, which is _not_ generating "new" output.
|
| We are talking about a very different kind of system here
| and in many other cases. Claiming the Authors Guild case
| sets precedent for these very different systems seems
| unfounded to me.
| sangnoir wrote:
| > It's not a violation of copyright to train a model.
|
| This is a very bold assumption, one that I assume will
| not hold in a court of law in all cases. I think the
| nuanced question is: to train a model _that does what,
| exactly_.
|
| Let's say distributing meth recipes is illegal[1]. Can
| one legally side-step that by training a model that spits
| out the meth recipe instead? No court will bother with
| the distinction; causation is well-trod ground.
|
| 1. As an example - not sure if it's illegal. You may
| replace it with classified nuclear weapon schematics if you
| like.
| rvz wrote:
| It's been admitted again. This contraption by GitHub is really
| causing chaos in the open source world and has been trained upon
| all public GitHub code; essentially, those who have their code
| hosted there publicly gave them permission to train Copilot on
| their code. Now they are complaining about it after all these
| problems [0].
|
| I warned against hosting source code on GitHub and going all in
| on GitHub Actions, mainly because they have been unreliable for
| the past year [1] (they go down every month). Now Copilot has
| been trained on every single public repo on GitHub, as admitted
| right in this post, regardless of copyright.
|
| Maybe, for organisations with serious projects, now's the
| time to leave GitHub and self-host somewhere else?
|
| [0] https://news.ycombinator.com/item?id=27726088
|
| [1] https://news.ycombinator.com/item?id=27366397
| MayeulC wrote:
| Well, if you write software under a free license, you can't
| really prevent someone from uploading the source on GitHub...
| monkeydust wrote:
| Bit confused. If I have code on GitHub with the most restrictive
| licence possible (no commercial reuse, no derived works), then how
| did GitHub's legal team get comfortable with this approach? What
| am I missing?
| Tenoke wrote:
| There's an assumption that public repos can be read by humans
| and machines both which hasn't been questioned legally.
| monkeydust wrote:
| But the repos are provided under licence terms, no? Which can
| vary depending on the publisher's choice. Put another way, is
| there a licence that would prohibit reuse in this manner?
| Tenoke wrote:
| You can likely write/find one but if you don't want your
| code seen perhaps it'd be simpler to use a private repo.
| oauea wrote:
| You uploaded your code to their service and agreed to their
| TOS.
| mariusor wrote:
| By using github you have acceded to their terms of use[1]:
|
| > Short version: You own content you create, but you allow us
| certain rights to it, so that we can display and share the
| content you post. You still have control over your content, and
| responsibility for it, and the rights you grant us are limited
| to those we need to provide the service. We have the right to
| remove content or close Accounts if we need to.
|
| [1] https://docs.github.com/en/github/site-policy/github-
| terms-o...
| Tenoke wrote:
| The outrage-bait approach in this thread detracts from it. Yes,
| they trained it on everything. No, it's not clear if that's legal
| or not (probably is) or if that is much of a problem.
| arp242 wrote:
| Indeed; the question is if copyright should apply _at all_.
| Harping on about licenses, GPL, and whatnot is a distraction
| from the actual issue at hand.
|
| Also, given that the author of this tweet called me a
| "bootlicker" last year in response to a somewhat lengthy
| nuanced post about GitHub, I'm gonna go out on a limb and say
| that they're not all that interested in a meaningful
| conversation on this in the first place but are rather on a
| quest to "prove" GitHub is evil.
| lifthrasiir wrote:
| The possibility of GPL violation does show (one of the)
| enormous ramifications of the question, though. I think it's
| not a distraction as long as the question itself is also
| mentioned.
| arp242 wrote:
| There isn't any of this here though: it just operates on
| the assumption that the GPL applies.
| deviledeggs wrote:
| It's not outrage bait. The thing reproduces GPL licensed code
| verbatim.
| Tenoke wrote:
| I'm talking about how it's presented. It starts with
|
| >oh my gods. they literally have no shame about this.
|
| Then continues with
|
| >it's official, obeying copyright is only for the plebs and
| proles, rich people and big companies can do whatever they
| want
|
| and
|
| > GitHub, and by extension @Microsoft , knows that copyright
| is essentially worthless for individuals and small community
| projects. THAT is why they're all buddy-buddy with free
| software types; they never intended to respect our rights in
| the first place
|
| At any rate, it's not even clear to me if me publishing code
| written with copilot (or even with a random tool that will
| wget from github) puts the blame on the toolmaker or on me.
| This post, however, doesn't attempt to look at that but uses
| language that paints GH/MS as doing something illegal (and
| evil) that others wouldn't even get away with but not caring
| about it.
| belorn wrote:
| It seems that github did make a legal consideration when
| choosing to include public projects but exclude private
| ones, with many big companies having private projects for
| proprietary code bases. Users of public repositories are
| less likely to be able to fight github on the issue.
| deviledeggs wrote:
| Is that not true? Google and Oracle had a 10-year, multi-
| billion-dollar legal fight over ~20 lines of code identical
| between Android and the JVM.
|
| A non rich individual has basically zero chance of
| challenging GitHub on these blatant violations, and they
| know it.
|
| > At any rate, it's not even clear to me if me publishing
| code written with copilot (or even with a random tool that
| will wget from github) puts the blame on the toolmaker or
| on me.
|
| It really depends on the license, which GitHub apparently
| doesn't care about at all.
| cortexio wrote:
| Maybe you are correct, but I would agree that it's
| formulated in a childish/evil-spirited way. It smells of
| outrage hype and cancel culture.
| benhurmarcel wrote:
| But is that reproduced code "substantial"?
|
| I'm sure there's a "for i in range(0, n):" somewhere in a GPL
| repo, and yet having that in my code doesn't make it GPL.
| hu3 wrote:
| Just a reminder: reproducing GPL-licensed code verbatim is
| not illegal per se.
|
| The legality lies in what the user does with the code.
| xbar wrote:
| Microsoft LicenseLaunderer.
| sleavey wrote:
| To be fair, this could just be a mistaken interpretation from the
| support staffer that answered the question - they didn't sound
| sure ("apparently"). It certainly needs an official response from
| GitHub senior management but I wouldn't call the foul yet (not
| that it's even clear that it is a foul).
| underyx wrote:
| OP's rhetoric, and most discussion I see, asserts that training a
| model on copyrighted data is a copyright violation. Personally I
| don't find this to be so obviously the case. Think back to when
| we were listening to AI generated pop music, for instance. I
| don't recall any concern in HN comments about the copyright
| holders' music being used for learning.
| thinkingemote wrote:
| Did you miss the bit where Copilot reproduced a function
| exactly, including the comments? That's not some mashup or
| reinterpretation or inspiration; it meets the definition of
| plagiarism in universities and is just copying.
| underyx wrote:
| I didn't miss that, this still doesn't make the answer
| obvious to me. I'm pretty sure I've unknowningly replicated
| licensed code as well during my time as an engineer, and I've
| written way less code over my 8 years than Copilot has.
| fragileone wrote:
| Then if you were discovered using it in a commercial
| project, you could fairly be sued for it. Unless you're trying
| to argue that you should for some reason get an exemption?
| underyx wrote:
| Would I be found guilty if I could prove that I didn't
| explicitly copy that code but rather just happened to
| write the same code by arriving at the same solution as
| the original one I had seen years before?
| adn wrote:
| Nobody can answer this because it depends on the code and
| the resources of the entity suing you, but in general
| yes. This is why clean room design is a well-defined
| strategy: depending on the code and company, you would
| indeed not be allowed to work on the project because of
| the fact you'd seen a competitor's solution previously.
| contravariant wrote:
| I'd be surprised if nobody brought up those 'what-if' scenarios
| at the time.
| Zababa wrote:
| > Think back to when we were listening to AI generated pop
| music, for instance. I don't recall any concern in HN comments
| about the copyright holders' music being used for learning.
|
| Were those products sold to help people write commercial pop
| music faster? If not, I don't think your point is valid.
| belorn wrote:
| You mean like https://www.theburnin.com/technology/artificial-
| intelligence... ?
|
| If one of the three largest record labels uses their own
| catalog to train an AI, copyright seems less important to
| discuss. I suspect the discussion would be a bit different if a
| company scraped youtube and used that as a training set for AI
| music and successfully sold it.
| encryptluks2 wrote:
| Really hoping to see a mass exodus from GitHub after this.
| Microsoft is back to their old tactics, like we all knew they
| would be.
| Tenoke wrote:
| If you have public repos anywhere people can train on them just
| as much.
| ostenning wrote:
| That's also my general sentiment. I assume anyone can do
| virtually anything with my public repos with little recourse
| from me. I wouldn't even know if they are indeed breaking my
| license agreements. Doesn't really help the situation though.
| encryptluks2 wrote:
| GitHub only recently allowed non-paid private repos.
| Previously these were reserved for paid plans. Also,
| GitHub has a specific section for license files. GitHub
| actually believes these license files mean something, and
| states that they must be included with the repo so they are
| downloaded with it. Just because you can teach a script to
| ignore a LICENSE file doesn't mean that the license doesn't
| still apply. That is like saying that because you can teach a
| robot to ignore restricted airspace, it is allowed to
| fly around an airport.
| ostenning wrote:
| Any suggestions for an alternative? One thing I like about
| GitHub is that it 'seems' to be a de facto standard for
| portfolios & public works. It also has excellent integration
| with AWS and the like.
| selfhoster11 wrote:
| GitLab is a fairly good one. Lots of people self-host their
| own GitLab/Gitea instance too.
| goodpoint wrote:
| SourceHut or Codeberg
| aloisdg wrote:
| GitLab is the best alternative feature-wise.
| https://sourcehut.org/ is great too if you are into this kind
| of thing.
| tobyhinloopen wrote:
| A lot of hate for a cool piece of tech. Can't we just be happy
| this tool exists?
| mullikine wrote:
| Public-facing open-source code & media is going to be learned by
| language models because they're exposed to them. That's the
| simple truth. Nothing can stop that, not unless all public repos
| are made private. Everyone has access to the ability to create
| their own GPT, thanks to open-source. OpenAI is not actually very
| far ahead of open source anymore.
|
| The US seems well enough informed. As mentioned in the following
| report "AI tools are diffusing broadly and rapidly" and "AI is
| the quintessential "dual use" technology--it can be used for
| civilian and military purposes.".
|
| https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report...
|
| I'm fully expecting that if I begin a story and put it on my blog
| or on github, and if I go away for a couple years, I'll see it
| completed for me when I return. I can use foresight to my
| advantage or I can pretend like it's still the 1990s as if
| placing some text at the top of the code I exposed publicly is
| going to prevent people from training on it.
|
| One thing for sure though, I don't think a large company such as
| Microsoft should be profiting from training their language model
| on open-source code.
|
| The best way to release Copilot in my opinion would be to make
| the entire thing open source and have separate models, even a
| private paid-for model, so long as it's trained on their own code.
|
| An open source model trained on code under specific licenses
| sounds fine, but then the model should follow the same license as
| the code it was trained on.
|
| There's just something deeply unsettling about having a computer
| complete your thoughts for you without being able to question how
| or why.
| xgulfie wrote:
| I'm really hoping some big corp whose codebase is source-
| available and on GitHub but still under copyright, takes the piss
| out of them for this
| jensensbutton wrote:
| Wouldn't it be the people publishing code written with Copilot
| that (potentially) violate any licenses? It doesn't seem to me
| that the tool violates anything, though it may put the _user_ at
| risk of violating something.
|
| Like, don't use it if you're worried about violating licenses,
| but I don't see how Microsoft could get in trouble for the tool.
| It doesn't write and publish code by itself.
| cool_dude85 wrote:
| Sorry, we built this tool for you that auto violates licenses.
| Sure, we're owned by a huge megacorp with billions of dollars,
| but it's your responsibility to confirm - and yes, we recognize
| it's impossible to confirm - that what you release using our
| tool isn't violating the license.
|
| In short, github gets to make the license violator bot and push
| the violations off onto the small fry who actually use it? No
| thanks.
| Naga wrote:
| Isn't that sort of the justification behind BitTorrent and
| trackers?
| vharuck wrote:
| I see the difference as BitTorrent being an ignorant tool
| that just processes the data it receives. If you point
| BitTorrent at copyrighted data, it emits copyrighted data. The
| fault is on the users. Copilot was built with and
| "contains" copyrighted data, which it can reproduce from non-
| copyrighted input.
| fragileone wrote:
| Microsoft are violating the licenses already when they
| initially show you the generated code without attribution and
| ignoring other license restrictions. How you use it yourself is
| separate from that.
| russdpale wrote:
| Who cares? Seriously? Copilot has ripped away the absurd charade
| around licensing and code.
|
| It isn't any kind of copyright infringement. The AI is not
| copying and pasting code that it has found; it is rewriting the
| code from scratch on its own.
|
| We keep trying to take old ways and meld them to the internet,
| and it's just not appropriate and it doesn't work.
| geoffhill wrote:
| > not copying and pasting code
|
| https://news.ycombinator.com/item?id=27710287
| nrb wrote:
| Huh? There are already many examples of "copied and pasted"
| code that it has found.
|
| https://twitter.com/kylpeacock/status/1410749018183933952
|
| https://twitter.com/mitsuhiko/status/1410886329924194309
|
| https://twitter.com/pkell7/status/1411058236321681414/photo/...
| neonihil wrote:
| I think it's pretty easy to defeat MS in court.
|
| We just need to bring the music industry into this!
|
| For example: Let's train a network on Beatles music to generate
| new Beatles songs. I'm pretty sure music lawyers will find a way
| to prove that the trained network is violating the label's
| copyright, as they always manage to do that.
|
| And then we just need to use the precedent and argue that music
| is the same thing as code.
| alkonaut wrote:
| > For example: Let's train a network on Beatles music to
| generate new Beatles songs. I'm pretty sure music lawyers will
| find a way to prove that the trained network is violating the
| label's copyright, as they always manage to do that.
|
| The people making the machine that learned (and recites)
| Beatles songs aren't infringing, though (most likely). It's
| those that _use_ the machine to create and distribute the new
| works who are.
|
| Same here. No one will be able to say that Copilot _itself_ is
| a "derived work" or somehow uses the code in a way similar to
| a computer program (although such claims have already been made
| - I highly doubt that's the case). But those that produce a
| whole file full of GPL code verbatim (which will be rare, but
| WILL happen) are at risk of violating the license terms if
| they distribute it under the wrong license.
| stingraycharles wrote:
| Wouldn't a more accurate metaphor be "let's train a network on
| all music, to generate new music", which includes Beatles, and
| may generate songs that contain the same chords as the Beatles
| used?
| tylersmith wrote:
| Or contain new chords that it synthesized from its knowledge
| of the ones the Beatles used.
| dec0dedab0de wrote:
| Yes, but it may also use the same chord progressions, lyrics, or
| melodies. You could even say it contains snippets of the actual
| recordings, depending on how you look at it.
| stingraycharles wrote:
| Sure but then it'll definitely be harder to prove it's
| actual copyright infringement, especially when only a very
| small part of the song may have some snippets of the
| Beatles. Could it then, perhaps, be considered fair use?
| jeremyjh wrote:
| Yes you'd have exactly the same kind of lawsuits and
| arguments that already exist today around fair use. It
| doesn't matter if a tool creates the new work or a person
| creates it (without tools?!!?) because ultimately it is a
| person who claims the new work as their own and
| distributes it, and that is the person who will get sued
| by the record companies if their work is too derivative.
| Establishing the line for "too derivative" in any
| particular case is a very lucrative field already I'm
| sure.
| bionhoward wrote:
| Potentially dumb question from a guy who isn't a lawyer:
|
| Does Copilot infringe Google's patent(s) on the Transformer
| architecture? If so, then Google could potentially sue them for
| royalties, at least.
|
| Further, couldn't this Copilot thing backfire for Github
| because customer trust is more valuable than AI training data
| right now? If folks don't feel they can trust Github, seems
| like they could move their work to other version control
| systems like Gitlab or Bitbucket...
| anonydsfsfs wrote:
| Doesn't really matter, because if Google sued Microsoft,
| Microsoft would immediately hit back with a countersuit,
| since they would have little trouble finding something in
| their 90,000+ patent warchest that Google is infringing on.
| Software patents have become a matter of mutually-assured
| destruction for the big players. The only winning move is not
| to play.
| [deleted]
| Florin_Andrei wrote:
| In ancient Rome they didn't have a police force. What they had
| was essentially muscle for hire, mercenary bands paid by rich
| and powerful folks to do their bidding. As a regular person,
| the only thing that could have protected you from one of these
| groups was another such group.
|
| Same today with the licensing system.
| LudwigNagasena wrote:
| They had cohortes vigilum and cohortes urbanae in Ancient
| Rome. Why don't they count as police?
| BoxOfRain wrote:
| There is an absolutely enormous archive of fan-taped Grateful
| Dead shows out there, someone with much more time and money
| than me _needs_ to train a network on that!
| data_ders wrote:
| username checks out lol
| partiallypro wrote:
| That already exists though? SongSmith and other similar tools
| are used by musicians a lot.
| nomel wrote:
| At what point is it not a derivative work?
| gizmodo59 wrote:
| I hope they don't shut down the project with all the legal
| nightmares.
| kazinator wrote:
| If you learn a word or phrase from a copyrighted public
| broadcast, does that mean you cannot speak it to others?
| 0x0 wrote:
| To the people arguing it's "fair use" to use this for training an
| ML network. Where do you draw the line? What if you train an "ML
| network" with one or two inputs... so that they almost always
| "generate" exact copies of the inputs? Five inputs..? Ten? A
| thousand? A million?
| saint_abroad wrote:
| > Where do you draw the line?
|
| My simplistic view is that the following are legally
| equivalent:
|
| input -> AI network -> output
|
| input -> Huffman coding -> output
|
| So, whilst:
|
| * compressing and decompressing a copyrighted work is
| permissible; and
|
| * the output and weights are deterministic transformations of
| the inputs;
|
| it follows that the outputs are:
|
| * not eligible for copyright (lacking creativity); and
|
| * derivative works of the inputs.
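|
| A minimal sketch of the round-trip point, using zlib (link
| with -lz; the "work" string here is made up for illustration):
|
|     #include <stdio.h>
|     #include <string.h>
|     #include <zlib.h>
|
|     int main(void) {
|         /* stand-in for a copyrighted work */
|         const unsigned char work[] = "some copyrighted text";
|         unsigned char packed[128], unpacked[128];
|         uLongf plen = sizeof packed, ulen = sizeof unpacked;
|
|         /* deterministic transformation there and back */
|         compress(packed, &plen, work, sizeof work);
|         uncompress(unpacked, &ulen, packed, plen);
|
|         /* the round trip reproduces the work byte for byte;
|            nobody would say the compressor authored it */
|         puts(memcmp(work, unpacked, sizeof work)
|              ? "differs" : "identical");
|         return 0;
|     }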
| ghoward wrote:
| But at the same time, a compiler does a deterministic
| transformation of its inputs, and we still count its output
| as under copyright and license.
|
| copyrighted input -> compiler -> copyrighted output
| kevincox wrote:
| > output and weights are deterministic transformations of the
| inputs;
|
| That may be true but I fail to see how any process that
| produces the same content that was input into it somehow
| strips the license. If the generated code is novel, then
| there is no copyright and it is just the output of the tool.
| If the code is a copy but non-creative (for example, a trivial
| function), then it isn't covered by copyright in the source
| anyway, so the output is not protected by copyright either.
| However if the output is a copy and creative I don't think it
| matters how complicated your copying process was. What
| matters is that the code was copied and you need to obey
| copyright.
|
| Again, I don't think that novel code generated from being
| trained on copyrighted code is the problem. I think it is
| just the verbatim (or minimally transformed) copying that is
| the issue.
| Tenoke wrote:
| I can imagine a requirement of the sort 'generated code may
| match at most X% of any snippet in the training data, as
| measured over Y amount of sampling', but I am not sure you can
| get a much better requirement than that.
|
| Forbidding the training of AI on public code would definitely
| be a step too far though.
|
| Edit: I'd also like it if they provided a tool for checking
| whether your code matches copyrighted code too closely, so you
| can confirm whether you are violating anything when you use
| copilot.
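|
| (A minimal sketch of what such a check could look like, in C:
| slide a fixed-size window over the generated code and count how
| many windows occur verbatim in a corpus string. The function
| name, window size and strings are all made up for illustration:)
|
|     #include <stdio.h>
|     #include <string.h>
|
|     /* fraction of length-n windows of `generated` found verbatim
|        in `corpus`; n must be < 64 for the local buffer */
|     double overlap_ratio(const char *generated, const char *corpus,
|                          size_t n) {
|         size_t len = strlen(generated), hits = 0, windows = 0;
|         for (size_t i = 0; i + n <= len; i++, windows++) {
|             char window[64];
|             memcpy(window, generated + i, n);
|             window[n] = '\0';
|             if (strstr(corpus, window)) hits++;
|         }
|         return windows ? (double)hits / (double)windows : 0.0;
|     }
|
|     int main(void) {
|         const char *corpus = "y = y * (threehalfs - (x2 * y * y));";
|         const char *gen = "y = y * (threehalfs - (x2 * y * y)); return y;";
|         printf("%.0f%% of 16-char windows match\n",
|                100.0 * overlap_ratio(gen, corpus, 16));
|         return 0;
|     }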
| crazygringo wrote:
| The line is exactly the same line that's always been drawn in
| fair use cases.
|
| There's absolutely nothing different whether the creator is ML
| or a human.
|
| Generally, if you train an ML network to generate an almost
| exact copy of a thousand lines, it's _obviously_ not fair use.
| If it 's five simple lines, it obviously _is_ fair use. If it
| 's somewhere in between, there are a lot of different factors
| that need to be weighed in a fair use decision, which you can
| easily look up.
| diffeomorphism wrote:
| There obviously is no sharp line (e.g. it is 37. Immediate
| question: why not 36?), but that does not matter at all.
|
| We already have the same fuzzy line for writing. Am I forbidden
| from ever reading other authors' books because I might
| accidentally "generate exact copies" of some of the sentences?
| Clearly not, that is how people learn a language. Does that
| mean I am allowed to copy the whole book? Also clearly not.
|
| Where do you draw the line? Somewhere.
| criddell wrote:
| And _somewhere_ is determined for your particular case in
| court. And tomorrow, a similar case may be determined
| differently.
| diffeomorphism wrote:
| Not really, no.
| agilob wrote:
| Does this mean that all the illegally leaked code from Apple,
| CDPR, Intel, NSA and Microsoft is also in the models? iBoot,
| Witcher 3? Gwent? NSA backdoors?
|
| Does the copilot still learn from new repos? Can I post github
| enterprise code publicly to let it learn from it?
|
| Serious answers only please
| tedunangst wrote:
| Why would you think letting copilot scan the code would absolve
| you of liability for posting it?
| agilob wrote:
| I'm not asking about legality of posting the code, but reuse
| of this by the AI and users of the AI. "All public
| repositories" is a wide net full of surprises.
| BlueTemplar wrote:
| I assume the answer is yes to most of those?
|
| Of course using code generated by Copilot from those would
| still be illegal.
|
| See also : Napster (and other p2p), the bitcoin blockchain
| allegedly containing illegal numbers...
| TingPing wrote:
| So copyright doesn't apply unless copyright applies.
| swiley wrote:
| Has Microsoft just killed source code copyright? That would
| definitely be a win.
| blibble wrote:
| it would be a win for Microsoft, who don't distribute their
| source code
|
| whereas for open source it's a disaster
| glogla wrote:
| Which seems very much aligned with what Microsoft has been
| trying to do for decades now.
| formerly_proven wrote:
| Interesting idea, considering Microsoft's copyright
| dependence has reached an all-time low since they moved as
| much as they can into their SaaS and PaaS offerings. Nothing
| left to copy, except for employees, but you don't need
| copyright to bash their heads in, legally speaking.
| nh2 wrote:
| It would be quite impressive if this was a long-time
| planned "Embrace, extend, extinguish" move against
| Copyleft, with a casual acquisition of Github to make it
| work.
|
| Finally, it beat the "cancer that attaches itself in an
| intellectual property sense to everything it touches"
| after all those years, with its own tools!
|
| Now it's safe to touch.
| swiley wrote:
| There's nothing to stop the employees from distributing it
| at that point and even with copyright it gets distributed
| anyway, it's just not allowed to be used for anything
| serious.
| echelon wrote:
| But who says the code has to be available to anyone but
| Microsoft?
|
| Remember that Amazon won off the back of open source. Now all
| the open source servers and databases are Amazon products.
| TuringNYC wrote:
| IANAL, but the serious answer -- I think -- is that you always
| use things at your own risk, even purchased tools, and are
| protected via indemnity agreements. If there is no indemnity
| agreement (as is the case here), you assume the risk.
|
| That said, if enough people are bitten by this, I'm not sure
| what happens -- does anyone know of a relevant case? One
| somewhat relevant case that caused mass pain was the SCO-Linux
| dispute:
|
| https://en.wikipedia.org/wiki/SCO%E2%80%93Linux_disputes
| user5994461 wrote:
| If you're thinking about the liability waivers found in many
| licenses, contracts, EULAs and the like, they are often void,
| depending on the jurisdiction.
|
| The official answer from Github that they take all input on
| purpose doesn't play in their favor.
| Voloskaya wrote:
| No, it's not trained on all public code as the title suggests,
| it's trained on all GitHub public code (so public repos hosted
| on GH), none of the things you enumerate are hosted on GH.
| agilob wrote:
| >it's trained on all GitHub public code (so public repos
| hosted on GH)
|
| This is exactly what I meant.
|
| >none of the things you enumerate are hosted on GH.
|
| Plenty of them on GH, if not src then magnet links
| akerro wrote:
| Just found Intel leaks and Gwent on github without any
| effort. Intel has a few repositories in different formats,
| plain copy of .svn directory or converted to git. TF2/Portal
| leak is there as well. All but 2 I found were made by
| throwaway accounts.
| e12e wrote:
| Now, was the leaked nt kernel source ever published on
| github?
| sp332 wrote:
| https://github.com/PubDom/Windows-Server-2003 Second
| result on DDG.
| e12e wrote:
| I wonder if co-pilot will cough up stuff like these
| useful macros? Seems like a reasonable hack...
|
| https://github.com/PubDom/Windows-Server-2003/blob/master/co...
|
|     #ifdef _MAC
|     # include <string.h>
|     # pragma segment ClipBrd
|     // On the Macintosh, the clipboard is always open. We define a macro for
|     // OpenClipboard that returns TRUE. When this is used for error checking,
|     // the compiler should optimize away any code that depends on testing
|     // this, since it is a constant.
|     # define OpenClipboard(x) TRUE
|     // On the Macintosh, the clipboard is not closed. To make all code behave
|     // as if everything is OK, we define a macro for CloseClipboard that
|     // returns TRUE. When this is used for error checking, the compiler
|     // should optimize away any code that depends on testing this, since it
|     // is a constant.
|     # define CloseClipboard() TRUE
|     #endif // _MAC
|
| Just the kind of trick co-pilot should help us with?
| jeremyjh wrote:
| There have been leaks of copyrighted code that were hosted on
| Github before they were taken down. There is also a lot of
| public code on Github without any license at all, which is
| not public domain but actually unlicensed for all purposes.
| Causality1 wrote:
| Suppose you had some kind of AI Deepfake program operating off a
| large database of copyrighted photos and you asked it to "make a
| picture of a handsome man on a horse" and the man's head was an
| exact duplicate of George Clooney's head from a specific magazine
| cover, would that be infringement? Would selling the services of
| an AI that took copyrighted photos of celebrities and edited them
| into porn movies be infringement? I don't know the answers to
| those questions but I find it very weird that people think large
| blocks of typed text are less worthy of copyright protection than
| other forms of media.
| dogma1138 wrote:
| That would potentially be an infringement of the copyright of
| the photographer but in any case it's an infringement of the
| personality rights of George Clooney.
|
| You aren't allowed to sell someone's likeness without their
| permission. You don't need an AI for this if you create a
| portrait of Clooney and sell it or make any use that isn't
| covered by fair use he can sue you.
|
| Depending on the composition of the picture for example if
| Clooney is naked and say Putin is riding in the "bitch seat" of
| the saddle, then you are also quite likely to be open to a
| libel suit as well.
| hoppyhoppy2 wrote:
| Satire does not usually fall under libel/defamation, though,
| right?
|
| >For example, in Hustler Magazine v. Falwell (1988), Chief
| Justice William H. Rehnquist, writing for a unanimous court,
| stated that a parody depicting the Reverend Jerry Falwell as
| a drunken, incestuous son could not be defamation since it
| was an obvious parody, not intended as a statement of fact.
| To find otherwise, the Court said, was to endanger First
| Amendment protection for every artist, political cartoonist,
| and comedian who used satire to criticize public figures.
|
| https://www.mtsu.edu/first-amendment/article/1015/satire
| dogma1138 wrote:
| Depends on the legal system in question and the intent and
| usage.
|
| The US system isn't the only one on the planet you know,
| the UK still has political cartoonists despite a very
| different definition for what defamation is which the
| example above can fall under.
| belorn wrote:
| So why did GitHub choose to exclude private repositories? Why not
| include everything, including the code for windows?
| jefftk wrote:
| In training on publicly accessible repositories, GitHub did
| something anybody could have done. If they also used private
| repositories, though, I would see that as abusing their
| position.
| jefftk wrote:
| Additionally, if they had trained on private repositories
| then they risk leaking code, and accidentally making it
| public. Even if that was within fair use it would still be a
| violation of the trust people put in them.
| rambambram wrote:
| Besides, as a programmer you should not excuse yourself with
| "IANAL" or otherwise defer all judgment to lawyers. Lawyers are
| just that: lawyers. They don't hold the truth either. One lawyer
| says this, another lawyer says that. F*k 'm. If anything, say
| "IANAJ" (I Am Not A Judge). Trias politica, you gotta love it.
| pjfin123 wrote:
| Of all the potential issues with training large AI models,
| incidental copyright infringement seems pretty mild.
| sergiogjr wrote:
| The way I see it, MS has probably invested as much in
| researching the legalities of releasing this product, via its
| highly paid legal army, as in developing it. To expect a
| multi-billion-dollar company not to do its due diligence seems
| naive.
|
| Maybe this could spark a discussion to change the current rules
| that allow them to do this, but questioning the current legality
| to me is a waste of time.
| tpmx wrote:
| It's a gamble. Worst case they have to reduce the quality by
| removing GPL code from the training data. And/or pay off a few
| lawsuits, which is routine stuff for them. Cost of doing
| business.
| pjfin123 wrote:
| Of all the concerns over training large AI models, incidental
| copyright infringement doesn't seem that important.
| zinekeller wrote:
| https://news.ycombinator.com/item?id=27771742
|
| The above thread is a dupe of this discussion but with
| interesting discussions already in place before being marked as a
| dupe.
| floor_ wrote:
| Road to hell paved with good intentions.
| Luker88 wrote:
| Like it or not, it seems like:
|
| * most people here are unhappy
|
| most lawyers will say it's fine (it very probably passed MS
| ones)
|
| I can understand that. Copyright was not created with AI/ML in
| mind, even as a random stray thought. Those were not even words
| at the time.
|
| So the question is: if we change the law and require trained
| algorithms to only work on licenses that permit this, and to
| output the "minimum common license" somehow, what are the
| repercussions on other applications of copyright?
|
| Because the consensus here seems to be that this looks a lot like
| a de-licensor with extra steps
| nonameiguess wrote:
| Standard caveat that I'm not a lawyer by any stretch, but this
| seems settled by the existence of text-generation assistants
| trained on the full corpus of human writing ever digitized,
| much of which is also copyrighted or licensed in some way. That
| is clearly fine, as training text generation programs on
| existing text has been standard for decades. Selling a product
| based on GPT-3 is fine and the law has not come after anyone
| trying to do that.
|
| The more questionable line is if someone happens to
| inadvertently reproduce entire paragraphs of Twilight: Breaking
| Dawn word-for-word using GPT-3 and then sells it, that might be
| a violation even if they didn't realize they were doing it.
|
| Copilot is the same thing. Creating a product that makes
| suggestions that it learned from reading other people's work is
| fine. Now if you write code using Copilot and happen to
| reproduce some part of glibc down to the variable names, and
| don't release it under GPL, you might be in trouble. But
| Copilot won't be.
| dlisboa wrote:
| I don't know if even copying small pieces of code verbatim
| should mean anything.
|
| Another example is the photo generation ML algorithms that
| exist. They generate photos of random "people" (imaginary AI-
| generated people) by using actual photos of real people. If
| one eye or nose is verbatim copied from the actual photo to
| the generated photo, is the entire output now illegal or
| plagiarism? One might argue it's just an eye, the rest of the
| picture is completely different, the original photographer
| doesn't need to grant permission for that use.
|
| Any analogies we make with this, be it text generation, image
| generation, even video generation, seems like it falls under
| the same conclusion: so far we've thought all of this was
| perfectly fine. I don't see why code-generation is any
| different. A function is just a tiny part of a project. It's
| not necessarily more important than the composition of a
| photograph, or a phrase in a book. We as programmers assign
| meaning to it, we know it takes time to craft it and it might
| be unique, but likewise a novelist may have spent weeks on a
| specific 10 word phrase that was reproduced verbatim, in a
| text of 500 pages.
|
| The more I look at this the more it seems copyright, and IP
| law in general, is the main problem. Copyleft and OS licenses
| wouldn't be needed if it wasn't for the aggressive nature of
| IP law. I don't see the need to defend far more strict
| interpretations of it because it has now touched our field.
| mrh0057 wrote:
| There is nothing intelligent about this. What they did is a
| context-aware search, and they are trying to claim that's not
| what this is. If it were just used as a search engine, and
| people either weren't using the results or were following the
| license of the original source, then it would be fine. There
| has been so much hype around machine learning that people
| likely have a false impression of what it is.
| Ajedi32 wrote:
| I've seen this claim that Copilot is "just a search engine"
| repeated in multiple places now. It's wrong, as anyone
| familiar with any of the GPT variants or other similar
| autoregressive language models can attest.
|
| Copilot isn't a search engine any more than any other
| language model is. It _can_ sometimes output data from the
| training set verbatim as most AI models do from time to time,
| but that is the exception not the rule.
|
| Whether modern autoregressive language models can be called
| "inteligent" is debatable, but they're certainly far beyond
| what you'd get from a simple search engine.
| rgbrenner wrote:
| There are a lot of posts here debating if licenses still apply
| when copilot generates verbatim code. The answer is yes.
|
| Copilot is currently a technical preview. Github has already said
| they intend to detect verbatim code and notify the user and
| present the correct license. That'll be in the final release.
|
| Don't use the technical preview for anything except demoing a
| cool concept. It's not ready for that yet because it will
| reproduce licensed code and not tell you.
| Voloskaya wrote:
| > If licensing still apply when copilot generates the code. The
| answer is yes.
|
| Please provide a source for this.
| rubyist5eva wrote:
| Open source developers need a new kind of license with a ML model
| training clause, so there is no more ambiguity if they don't want
| their code to be used in this way.
| Hamuko wrote:
| People have been suggesting this ever since Copilot was
| announced, and it doesn't work on any level. They're using
| _all_ code on GitHub, even code with no license that you can't
| use for any purpose, and the reasoning is that they see it as
| fair use - which supersedes any licenses and copyrights in the
| US.
| ghoward wrote:
| They only claimed that _training_ the model was fair use.
| What about its output? I argue that its output is still
| affected by the copyright of its inputs, the same way the
| output of a compiler is affected by the copyright of its
| inputs.
| ghoward wrote:
| I made new licenses [1] [2] that attempt this. The problem with
| adding a clause against ML training is that that is
| (supposedly) fair use. What my licenses do is concede that but
| claim that the _output_ of those algorithms is still under the
| copyright and license.
|
| I hope that even if it wouldn't work, it puts enough doubt in
| companies' minds that they wouldn't want to use a model trained
| by code under those licenses.
|
| [1]: https://gavinhoward.com/2021/07/poisoning-github-copilot-
| and...
|
| [2]: https://yzena.com/licenses/
| jefftk wrote:
| That doesn't work: your suggestion applies at too late a stage
| in the flowchart. It looks like:
|
| 1. Do you need a license to use materials for training, or to
| use the output model?
|
| 2. If so, does the code's license allow this?
|
| GitHub is claiming 'no' for #1, that they do not need any sort
| of license to the training materials. This is reasonably
| standard in ML; it's also how GPT-3 etc were trained.
|
| Now, whether a court will agree with their interpretation is an
| interesting question, but if they are correct then #2 doesn't
| come into play.
| rubyist5eva wrote:
| If the answer is 'no' for #1 then the GPL might as well not
| exist, because now we can just launder it through co-pilot and
| close it off - a rather distorted interpretation of "fair use"
| if you ask me.
|
| "Dear copilot, I'm writing a Unix-like operating system...."
| mehdix wrote:
| GitHub's Copilot looks like a "code laundering" machine to me.
| slownews45 wrote:
| Developers have lost the plot here. The number of people
| browsing stack exchange and copying code is huge. The number of
| people who have read GPL'ed code to learn from (from the kernel
| to others) is huge. The number of people who learned from code
| they had to maintain -> huge.
|
| This idea that a snippet of a code is a work seems crazy to me.
| I thought we went through this with SCO already.
| loeg wrote:
| Stack exchange code is explicitly permissively licensed.
| slownews45 wrote:
| It is, but it has a GPL-style condition:
|
| ShareAlike -- If you remix, transform, or build upon the
| material, you must distribute your contributions under the
| same license as the original.
|
| The idea that programmers taking snippets from stackexchange
| or co-pilot etc. are thereby creating a derivative work seems
| like total insanity.
| e12e wrote:
| Unfortunately it's wrongly licensed for its perceived use-case
| - it'd be better if it used MIT or BSD :/
|
| https://stackoverflow.com/help/licensing
|
| I'm guessing most uses of stack overflow snippets are
| violating the license (no attribution, no share alike of
| the "remix" - which would probably be the entire program).
| lucasyvas wrote:
| There's a phrase I never wanted to hear. But that sounds
| exactly like what it is.
| qayxc wrote:
| Why and how? I'm honestly interested in an answer here.
|
| What exactly is the difference between a machine learning
| patterns and techniques from looking at code and people doing
| it?
|
| Is every programmer who ever gazed at GPL'ed code guilty of
| plagiarism and licensing violations because everything they
| write has to be considered derivative work now?
| mehdix wrote:
| I can think of certain things here. As human beings we have
| limitations. We get tired of gazing at code, GPL'ed or not.
| GitHub's clusters don't. It puts fair use of copyrighted
| content under question. The next concern I have, is what
| happens when Copilot produces certain code verbatim? I saw
| the other day on HN that it produced some Quake code
| verbatim. See https://news.ycombinator.com/item?id=27710287
| qayxc wrote:
| > As human beings we have limitations.
|
| That's a fair point, and ML models don't seem to memorise all
| the code they've seen either. Plus, while the argument
| of human limitations applies to the vast majority of
| people, what about those with eidetic memory?
|
| > what happens when Copilot produces certain code verbatim?
|
| There are several options: suppress the result, annotate it
| with a proper reference, or mark the snippet as GPL'ed.
|
| There are technical solutions to this question, but it's
| also important to ask to which degree this is necessary.
|
| Is a search engine that returns code snippets regardless of
| license also a tool that needs to be discussed the same
| way? After all, code samples from StackOverflow or
| RosettaCode are copied on a regular basis and not every
| example provides a proper reference as to where it's been
| taken from.
|
| So maybe a hint like "may contain results based on GPL'ed
| code" suffices? I don't know, but that's a question best
| deferred to software copyright law experts.
| ralph84 wrote:
| > I've reached out to @fsf and @EFF's legal teams regarding this.
| Please also reach out if you would be interested in participating
| in a class action.
|
| I think she's barking up the wrong tree here. If she's looking
| for organizations interested in eliminating fair use, RIAA, MPA,
| and AAP are more likely allies.
| EVa5I7bHFq9mnYK wrote:
| Open source is about love, sharing, helping out the fellow coder.
| Coderz of the past hated all this licensing and copyright BS.
| Your code, used to train this NN, is making the world a better
| place, I'd be content with that.
| luffapi wrote:
| Nothing about further enriching Microsoft and continuing the
| network effects behind a closed-source "social network" is
| making the world a better place. Quite the opposite, really.
| danbruc wrote:
| At least to some first approximation irrelevant because reading
| code is not subject to any license. What if a human reads some
| restrictively licensed code and years later uses some idea he
| noticed in that code, maybe even no longer being aware from where
| this idea comes?
|
| But what if the system memorizes entire functions? What if a
| human does so? What if you change all the variable names? What if
| you rearrange the control flow a bit? What if you just change the
| spacing? What if two humans write the exact same code
| independently? Is every for loop with i from 0 to n a license
| violation?
|
| I am not picking any side, but the problem is certainly much
| more nuanced than either side of the argument wants to paint it.
| selfhoster11 wrote:
| The problem is that humans are limited in retention and rate of
| learning. An AI/ML is not, which makes (or should make) a
| difference.
| danbruc wrote:
| Sure, it might certainly be the case that different rules
| should be applied to humans and machines, but this makes the
| discussion only even more nuanced. But I don't think this
| could reasonably be used to ban machines from ingesting code
| with certain licenses even though it might restrict what they
| can do with this information.
| jonfw wrote:
| I agree that it's nuanced and it's difficult to draw the line.
| but where copilot sits is way over on the plagiarizing side of
| the spectrum. Wherever we agree to draw the line, copilot
| should definitely fall on the wrong side of it
|
| Copilot will replicate entire functions, including comments,
| from licensed code
| kevincox wrote:
| > but where copilot sits is way over on the plagiarizing side
| of the spectrum
|
| I think it is important to point out that not all Copilot
| output is on the plagiarizing side of the spectrum. However
| it does on occasion produce plagiarized code. And most
| importantly there is no indication when this occurs.
| kevincox wrote:
| > What if a human reads some restrictively licensed code and
| years later uses some idea he noticed in that code, maybe even
| no longer being aware from where this idea comes?
|
| In general using the idea is fine, whether it is AI or human
| written. I think the major concern here is when the code is
| copied verbatim, or near verbatim. (AKA the produced code is
| not "transformative" upon the original)
|
| > But what if the system memorizes entire functions? What if a
| human does so?
|
| In both of these cases I believe it would be a copyright
| concern. It is not strictly defined, and it depends on the
| complexity of the function. If you memorized (|a| a + 1) I
| doubt any court would call that copying a creative work. But if
| you memorized the quake fast inverse square root it is likely
| protected under copyright, even if you changed the variable
| names and formatting.
|
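| (For concreteness, a paraphrase of that function from memory,
| with renamed identifiers and fresh formatting - the magic
| constant and the bit trick are what make it recognizable, not
| the variable names:)
|
|     #include <stdint.h>
|     #include <string.h>
|
|     float inv_sqrt(float v) {
|         float half = 0.5f * v;
|         uint32_t bits;
|         memcpy(&bits, &v, sizeof bits);   /* reinterpret float bits */
|         bits = 0x5f3759df - (bits >> 1);  /* the famous magic constant */
|         memcpy(&v, &bits, sizeof v);
|         return v * (1.5f - half * v * v); /* one Newton-Raphson step */
|     }
|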
| It seems clear to me that GitHub Copilot is capable of
| producing code that is copyrighted and needs to be used
| according to the copyright owner's license. Worse still, it
| doesn't appear capable of knowing when it is doing that, or
| what the source is.
| surfingdino wrote:
| I am not surprised given who the owner of GitHub is. Now, let's
| assume for a while that a private repo is left marked as public
| by mistake and Copilot regurgitates it... Lawyers are going to
| have fun with that one.
| Hamuko wrote:
| The worst scenario for GitHub is when a leak is published on
| GitHub. It's not like it hasn't happened before.
|
| https://www.theverge.com/2018/2/8/16992626/apple-github-dmca...
| SXX wrote:
| There are actually tons of unlicensed and wrongly licensed
| code on GitHub right now that was accidentally leaked by
| employees of many companies.
| Yoric wrote:
| Out of curiosity, how do we define license violation in that
| case? I, as a human being, have trained by reading code, much of
| which is covered by licenses that are somehow not compatible with
| code I'm writing. Am I violating licenses?
|
| Asking seriously. It's really unclear to me where law and/or
| ethics put the boundaries. Also, I'd guess it's probably country
| dependent.
| bananapub wrote:
| https://en.wikipedia.org/wiki/Clean_room_design
|
| sometimes? it's enough of an issue that companies explicitly
| avoid it by having two teams.
| Spivak wrote:
| Clean room design is a technique to avoid the appearance of
| copyright infringement. If the courts were omniscient and
| could see into your mind that you didn't copy then there
| would be no need. This is relevant because we can see into the
| mind of copilot. Whether what it does is considered
| infringement will, I think, come out in the details.
|
| If the ML model essentially is just a very sophisticated
| search and helps you choose what to copy and helps you modify
| it to fit your code then it's 100% infringement. If it is
| actually writing code then maybe not.
| danbruc wrote:
| That is exactly what needs some careful consideration. As a
| start, two people can write the exact same code independently,
| therefore having identical code is not sufficient. On the other
| hand I can copy some code and slightly modify it, maybe only
| the spacing or maybe changing some variable names, and it could
| reasonably be a license violation, therefore having identical
| code is also not necessary.
|
| Does the code even matter at all? If I start with a copy of
| some existing code, how much do I have to change it to no
| longer constitute a license violation? Can I ever reach this
| point or would the violation already be in the fact that I
| started with a copy no matter what happens later? Does
| intention matter? Can I unintentionally violate a license?
|
| But I think we don't have to do all the work; I am pretty sure
| this has already been considered at length by philosophers and
| jurists.
| [deleted]
| emerged wrote:
| None of our laws were created under the assumption that
| computers would do so much of our jobs and affect so much of
| our lives. From robotic automation to social media to now
| computer programming. I think it's really a mistake to ask what
| the letter of the law _currently_ means in the evolving
| context. Laws should serve us and need to be adapted.
| kadoban wrote:
| Who is "us" that are being served?
|
| I'm not the biggest fan of copyright law as currently
| written, but I wouldn't say that MS's desire to file off the
| serial numbers on every piece of public code for their own
| profit is a good impetus to rewrite the law.
| einarfd wrote:
| > Out of curiosity, how do we define license violation in that
| case? I, as a human being, have trained by reading code, much
| of which is covered by licenses that are somehow not compatible
| with code I'm writing. Am I violating licenses?
|
| That depends: if you end up writing copies of the code you've
| studied, then yes, you are on thin ice. Plagiarism is
| definitely something you can do with computer code. There have
| been several high-profile cases around this in the arts. As
| far as I can see, it usually ends up being a question of how
| much of the work is similar, how similar it is, and how unique
| the similar part is. An added wrinkle in programming is that
| some things can be done in only one way, or at least any
| reasonable programmer will do it in only one way. So for
| example a swap(var1, var2) function can usually only be done
| in one way, and therefore you would not get in trouble if your
| and someone else's swap functions are the same.
|
| I've been following the discussion about Copilot, and one issue
| that comes up again and again is that people seem to think that
| since Copilot is new, the law will treat it, and the code it
| writes, differently than it would treat you or a copy machine.
| I think that is naive; my impression is that courts care more
| about what you did than how you did it, and if you think
| Copilot can be used to do an end run around the law, prepare to
| be disappointed.
|
| So if Copilot memorizes code and spits out copies of that
| code, then it is at best skating on thin ice, and at worst
| committing a license violation. If the code it is copying is
| unique, then it is definitely heading into problematic
| territory. I'm fairly sure someone in legal at Github is very
| unhappy about the quake fast inverse square root function.
| MayeulC wrote:
| > swap(var1, var2)
|
| Well, there's also the xor way, to be pedantic :)
|
|     var1 = var1 ^ var2
|     var2 = var2 ^ var1
|     var1 = var1 ^ var2
|
| But yeah, not too much wiggle room there.
| johndough wrote:
| Another variation (assuming no overflows):
|
|     var1 += var2;
|     var2 = var1 - var2;
|     var1 -= var2;
|
| And another:
|
|     var1 ^= var2 ^= var1 ^= var2;
|
| Assembly even has an instruction for it:
|
|     xchg eax, ecx
| dvfjsdhgfv wrote:
| My guess is that many people will use it on the backend where
| a copyright violation is hard to spot and even more difficult
| to prove.
|
| As for frontend/open source etc... sure, if you don't care
| about copyright and licensing, use it.
| contravariant wrote:
| If you were to write large swaths of copyrighted code from
| memory then yes you'd be committing a copyright violation.
|
| Most humans don't do so unintentionally though.
| FeepingCreature wrote:
| Just as an example, this is very widespread in music though.
| contravariant wrote:
| If the whole 'Dark Horse' debacle proved anything it would
| be that that can still be considered a copyright
| infringement. Sure that particular example was (rightly
| IMHO) deemed to not be a copyright violation, but they
| still had to show their version was original enough, they
| couldn't just claim such copying wasn't ever an
| infringement.
| elliekelly wrote:
| I'm not so sure Copilot is doing so "unintentionally"
| either...
| Bdbvd wrote:
| Doesn't matter how you define it. What you have to understand
| is the personality trait spectrum of the chimp troupe.
|
| Some chimps can't sleep at night if what comes out of their 6
| inch chimp head is not acknowledged in some way by the entire
| troupe. These bastards then spend their whole lives finding
| each other and reinforcing each other's self importance,
| calling the stories laws, ethics and all kinds of bullshit. It
| all doesn't matter, cause once what comes out of the 6 inch
| head has been digitized it is now property of the rest of the
| universe.
| belorn wrote:
| The boundaries are not set in stone, so the answer is the old
| theme of "it depends". To provide a slightly different
| situation which was discussed a few years ago: can you train
| an AI on pictures of human faces without getting permission?
| Human painters have created images of faces for a very long
| time, so is it any different in terms of law and/or ethics if
| an AI does it?
|
| Yes, a bit? It depends. Using such things for advertisement
| would likely cause anger if people start to recognize images
| from the training set the AI was trained on.
| b3morales wrote:
| My opinion would be that if the training set for the face
| generator was made up of photos whose creators had asked you
| to credit them if you re-used their work, then, yes, the
| generator is ethically in the wrong if it's skipping that
| attribution. Regardless of copyright. (And I feel the same
| way about Copilot.)
| habosa wrote:
| I am not a lawyer but I am sure that any legal standard for ML
| has to be different than "isn't it just doing what humans do,
| but faster?"
|
| GitHub scanning billions of code files to build commercial
| software is different than you learning at human pace, even if
| they're both "learning" and in the end they both produce
| commercial software.
| leereeves wrote:
| > isn't it just doing what humans do, but faster?
|
| The human activity most like training an ML system is
| memorizing a text by reciting from memory, checking against
| the original, adjusting, and repeating until there are
| acceptably few mistakes.
|
| And if a human did so for thousands of texts then publicly
| repeated those texts, they would be violating copyright too.
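|
| (A toy C version of that recite/check/adjust loop, to make the
| analogy concrete - the text and thresholds are made up. Note
| that the end state is, by construction, a copy of the
| original:)
|
|     #include <stdio.h>
|     #include <string.h>
|
|     int main(void) {
|         const char *original = "No man is an island, entire of itself;";
|         char memory[64] = {0};
|         size_t len = strlen(original);
|         memset(memory, '_', len);            /* knows nothing yet */
|
|         for (int pass = 1; ; pass++) {
|             size_t mistakes = 0;             /* "recite" and check */
|             for (size_t i = 0; i < len; i++)
|                 if (memory[i] != original[i]) mistakes++;
|             printf("pass %2d: %2zu mistakes\n", pass, mistakes);
|             if (mistakes <= 2) break;        /* "acceptably few" */
|             size_t fixed = 0;                /* adjust a few errors */
|             for (size_t i = 0; i < len && fixed < 5; i++)
|                 if (memory[i] != original[i]) {
|                     memory[i] = original[i];
|                     fixed++;
|                 }
|         }
|         return 0;
|     }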
| danbruc wrote:
| It does not have to be different but it certainly can be
| different, a difference in quantity can certainly be a
| difference in quality. People watching other people walk by,
| and a camera - maybe with face detection - doing the same,
| differ not only in quantity but also in quality.
| skinkestek wrote:
| > I, as a human being, have trained by reading code, much of
| which is covered by licenses that are somehow not compatible
| with code I'm writing. Am I violating licenses?
|
| As someone who has taught students in ICT, a quick rule of
| thumb was that I picked a piece of text that I suspected,
| wrapped it in double quotes and put it into a search engine.
|
| 9/10 times - possibly more - when I had that feeling, it was
| true. 17 year olds don't write like seasoned reporters most of
| the time.
|
| Obviously there needs to be some independent thought in there
| as well, but for teenagers I put the line at not copying
| verbatim, and at citing sources.
|
| As we've seen demonstrated again and again copilot breaks both
| my minimum standard rules for teenagers: it copies verbatim and
| it doesn't cite sources.
|
| I say that is pretty bad.
|
| If the system had actually learned the structure and applied
| what it had learned to recreate the same it would be a whole
| different story.
|
| But in this case it is obvious that the AI isn't writing the
| code - at least not all the time, it is instead choosing what
| to copy - verbatim.
| qayxc wrote:
| > But in this case it is obvious that the AI isn't writing
| the code - at least not all the time, it is instead choosing
| what to copy - verbatim.
|
| I still don't see any problem with that. If it's larger
| sections (e.g. entire NON-TRIVIAL function bodies), those can
| be filtered or correctly attributed after inference. So
| that's just a technicality.
|
| Smaller snippets and trivial or mechanical implementations
| (generated code, API calls, API access patterns) aren't
| subject to any kind of protection anyway.
|
|     int main(int argc, char* argv[]) {
|
| Lines like that hold no intellectual value and can be found
| in GPL'ed code. It can be argued that that's a verbatim
| reproduction, yet it's not a violation of any kind in any
| reasonable context.
|
| Where do you draw the line and how would you be able to -
| automatically even! - decide what does and does not represent
| a significant verbatim reproduction?
| jcelerier wrote:
| what about lines such as
|
|     Idxs[i] += (Imm >> ((i * HalfLaneElts) % 8)) & ((1 << HalfLaneElts) - 1);
|     double r2 = fma(u*v, fma(v, fma(v, fma(v, ca_4, ca_3), ca_2), ca_1), -correction);
|     seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
|     qint32 val = d + (((fromX << 8) + 0xff - lx) * dd >> 8);
|
| even if it's one line, it likely took some non-negligible
| thinking time from the programmer
| wvenable wrote:
| What about E = mc^2 ?
|
| Mathematics and physics equations are not copyrightable.
| jcelerier wrote:
| but those aren't only mathematics. There's the choice of
| variable names, the order in which things are called
| (maybe to optimize the performance on some CPU, we don't
| know), etc
| wvenable wrote:
| Your original argument is based on the false premise that the
| amount of time or effort matters -- it doesn't. Not all human
| activity can or should be subject to copyright -- this is the
| dangerous slippery slope of "intellectual property" -- and we
| are dangling over the edge these days.
| skinkestek wrote:
| >I still don't see any problem with that. If it's larger
| sections (e.g. entire NON-TRIVIAL function bodies), those
| can be filtered or correctly attributed after inference. So
| that's just a technicality.
|
| Today copilot does what it does.
|
| I've never heard Microsoft defend anyone running afoul of
| some of their licensing details with "they can fix it
| later, it is just a technicality".
|
| I think this should go both ways? No?
|
| > Smaller snippets and trivial or mechanical
| implementations (generated code, API calls, API access
| patterns) aren't subject to any kind of protection anyway.
| int main(int argc, char* argv[]) {
|
| > Lines like that hold no intellectual value and can be
| found in GPL'ed code. It can be argued that that's a
| verbatim reproduction, yet it's not a violation of any kind
| in any reasonable context.
|
| Totally agree. Edit: otherwise we'd all be in serious
| trouble.
|
| > Where do you draw the line and how would you be able to -
| automatically even! - decide what does and does not
| represent a significant verbatim reproduction?
|
| I am not a lawyer but I guess many can agree that somewhere
| before copying functions verbatim, comments literally
| copied as well for good measure, somewhere before that
| point there is a line.
|
| On the other hand: if there was significant evidence that
| the AI was doing creative work, not just (or partially
| just) copying then I think I would say it was OK even if it
| arrived at that knowledge by reading copyrighted works.
|
| Edit: how could we know if it was doing creative work? First,
| because it wouldn't be literally the same. Literal copying is
| literal copying regardless of whether it is done using Xerox,
| paid writers, infinite monkeys on infinite typewriters, "AI"
| or actual strong AI.
|
| After that it becomes a bit more fuzzy as more
| possibilities open up:
|
| - for student works I look at how well adapted it is to the
| question at hand: a good answer from Stackoverflow,
| attributed properly and adapted to the coding style of the
| code base? Absolutely OK. Copying together a bunch of stuff
| from examples in the frameworks website? Fine. Reading
| through all the docs and look at how a number of high
| profile projects have done it in their open source
| solution, updating the README.md with info on why this
| solution was chosen? Now you are looking for a top grade in
| my class.
|
| (of course IBM will probably not want you to work on their
| compiler if you admit that you've studied OpenJDK's, or so I
| have heard.)
| qayxc wrote:
| > Today copilot does what it does.
|
| It's also not a commercially released product yet, but a
| technical preview, so uncovering and addressing issues
| like that is exactly what pre-release versions are for.
|
| I'd say it succeeded greatly in sparking a discussion
| about these issues.
| CyberRabbi wrote:
| It's not AI it is ML. GPT-3 is a very large ML model. It does
| not reason. It's a statistical machine.
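|
| (A toy illustration of "statistical machine", sketched in C: a
| character-level bigram model whose "training" merely counts
| which character follows which, and whose "generation" greedily
| emits the most frequent successor. It prints "int a a a a..."
| - locally plausible statistics, no reasoning about code at
| all. Real language models are vastly larger, but the principle
| is the same:)
|
|     #include <stdio.h>
|
|     int main(void) {
|         const char *corpus = "int add(int a, int b) { return a + b; }";
|         static unsigned counts[256][256];
|
|         /* "train": count successor frequencies */
|         for (int i = 0; corpus[i + 1] != '\0'; i++)
|             counts[(unsigned char)corpus[i]]
|                   [(unsigned char)corpus[i + 1]]++;
|
|         /* "generate": always emit the most frequent successor */
|         unsigned char c = (unsigned char)corpus[0];
|         for (int step = 0; step < 20; step++) {
|             putchar(c);
|             unsigned best = 0, next = 0;
|             for (int j = 0; j < 256; j++)
|                 if (counts[c][j] > best) { best = counts[c][j]; next = j; }
|             if (best == 0) break;
|             c = (unsigned char)next;
|         }
|         putchar('\n');
|         return 0;
|     }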
| tsimionescu wrote:
| ML is a subset of AI, in any definition that I've seen. And
| both are needlessly anthropomorphizing what are currently
| simple statistical or rule-based deduction engines.
|
| GPT-3 is no more 'intelligent' in the human sense than it
| is 'learning' in the human sense.
| cnity wrote:
| By this logic there is no such thing as AI.
| Eikon wrote:
| There's no such thing as AI.
| kzrdude wrote:
| The training question seems much more difficult.
|
| The main problem that has been the topic is a simpler one -
| about the produced work. If you exactly reproduce someone's
| existing code (doesn't matter if you copy by flipping bits one
| by one or which technology you use), isn't it a copyright
| violation?
|
| I'm kind of imagining a Rube Goldberg machine that spells out
| the quake invsqrt function in the sand, now...
| wongarsu wrote:
| Yes, if you play a video from Netflix while recording your
| screen, transcode that video to MPEG2 and use a red laser to
| write a complex encoding of that MPEG2 bitstream onto a
| plastic disk, then send that by mail to your friend, a court
| won't care about the complexity of that Rube Goldberg
| machine. They will just say it's a clear copyright violation
| since you distributed a Netflix movie by DVD.
|
| With programming, there's the further complication of what
| constitutes a work. But quake's invsqrt certainly qualifies,
| just like that one function from the Oracle vs Google case.
| [deleted]
| tsimionescu wrote:
| > I, as a human being, have trained by reading code, much of
| which is covered by licenses that are somehow not compatible
| with code I'm writing. Am I violating licenses?
|
| There are many good answers from the legal side. I would also
| attack this side: the way human beings learn is entirely
| different from the way ML models are trained. We don't do
| gradient descent to find the slope of data points and find the
| most likely next bit of code.
|
| We humans create rational models of the code and of the world,
| and use deduction from those models to create code. This is
| extremely visible in the way we can explain the reason behind
| our code, and in the way we are aware of the difference between
| copying code we've seen before vs writing new code. It's also
| visible in that we can be told rules and produce code that
| obeys those rules that doesn't resemble any code ever written
| before.
|
| The difference is also easily quantifiable: humans learn to
| program after seeing vastly fewer code examples than Co-pilot
| needed, and we are much better at it.
|
| One day, we will design an AI that does learn more similarly to
| how humans learn, and that day your question will be far more
| interesting. But we are far from such problems.
| FeepingCreature wrote:
| I'm not sure this is actually true. We can explain code, but
| the fact that we can explain code is not necessarily related
| to the way we actually end up writing it. Have you ever
| written a function "on autopilot"? Your brain has selected
| what you wanted it to do, and now you're just typing without
| thought? I don't think we're as dissimilar to this model as
| we'd like.
| b3morales wrote:
| The feeling of being "on autopilot" when doing a task has
| to do with your, let's call it, _supervisory process_ being
| otherwise occupied. It doesn't suggest that the other
| mental processes which are responsible for figuring out the
| actions have changed their character or mode of operation.
|
| "You" are just not paying attention to it in that moment.
| carlgreene wrote:
| I'm not trying to lessen the implications of something like
| this, but didn't we all agree to them being able to do this
| when we agreed to their TOS?
| MiddleEndian wrote:
| Let's assume that is enforceable through the TOS (which I
| doubt), would that make hosting GPL'd code on Github a
| violation of the GPL? If programmer X releases GPL'd code on
| his website and programmer Y copies it to Github, then it could
| presumably be considered a bypass of the copyright.
| lucasyvas wrote:
| I don't believe ToS are ever legally binding.
| unanswered wrote:
| Yeah, but I don't think you're allowed to interrupt the
| circlejerk by pointing that out. Every piece of code pushed to
| GitHub comes with: an implied licence for GitHub and its users,
| which is an alternative to any explicit license in the
| code(!!!) and also a representation that you're authorized to
| grant such a license. Of course one imagines that in many cases
| uploaders are not actually authorized to grant such a license,
| such as if they're uploading something they themselves have
| received under GPL license, but IANAL.
| vel0city wrote:
| I can't wait for machine learning models that given the right
| input nearly perfectly reproduce feature length movies or music.
| It's not copyright infringement, it was generated by a computer!
| cortexio wrote:
| I don't really care; if the code is public, it's public. If
| you can copy it, why can't a bot? But it would be a useful
| feature. Or does it also look at private repos? That would be
| scary.
| SavageBeast wrote:
| When I first read about the new copilot tool I immediately
| thought it would just be a matter of time before some group
| started poisoning the AI. Garbage in, garbage out right?
|
| So now we know its ALL public repos ... how long until all the
| opponents of this tool have a giant repo full of syntactically
| correct code that employs terrible design patterns and is
| thoroughly obfuscated? I'm not going to waste my time on this
| personally but there are certainly those who will. Someone will
| invent a tool that perverts perfectly good code in the process
| and probably have a good laugh.
|
| Personally, while I recognize some people might find it useful,
| I don't much care for it. No, I haven't tried it yet either.
| I've never sampled escargot either, and I know I don't care for
| it all the same. Maybe it's wonderful, I'll never know - but I
| do know that I simply don't like the idea of it. Call it an
| objection on general principle if you like.
|
| So remember kids: if you're not PAYING then you are the product.
|
| Bottom line - private repos are cheap and you should use them
| rather than freebie public stuff.
| timdaub wrote:
| Built on Stolen Data:
| https://rugpullindex.com/blog#BuiltonStolenData
| jand wrote:
| Not a github user (*lab), also not a lawyer, so please excuse my
| ignorance.
|
| As this boils down to legal arguments, are there any clauses
| (maybe disputed) in the ToS allowing github/MS usage of public
| repos for such purpose?
|
| Would it even be legally possible to override a software license
| as a repo provider like "by using this service, you agree to..."?
| Luker88 wrote:
| So...
|
| Putting the (imho) big licensing problems aside, what about the
| software patents?
|
| Apache and GPL have patent protection clauses.
|
| Does this mean that anyone using copilot might somehow end up
| with code that implements something patented but covered by
| those clauses, without having received the patent grant that
| the Apache/GPL license would have given them?
|
| ...I kind of hate myself for saying this, but... Patent trolls to
| the rescue?
| aliasEli wrote:
| A direct confirmation from GitHub itself. This is problematic
| because Copilot sometimes outputs code that was present in its
| training set.
|
| https://fossbytes.com/github-copilot-generating-functional-a...
| deviledeggs wrote:
| This is the crux of copilot. When I saw it copying RSA keys I
| knew it's overtrained.
|
| Most of the comments are waxing philosophical about the
| possibilities of copilot copying GPL code.
|
| The reality in this case is clear: it's copy-pasting thousands
| of characters of GPL code with no modifications. Copyright
| violation, clear as day.
| sandruso wrote:
| I don't know how to approach this. As a human I can read all
| public code, learn from it regardless of the license, and come
| up with new solutions. A machine can read everything too, but
| can't create new ideas or approaches. How is copilot defined,
| then? Should it be only a smart system for general code
| snippets?
| raspyberr wrote:
| Well you can read public code all you like, but you can't just
| take chunks of code and republish them under different
| licenses, as copilot has been shown doing.
| sandruso wrote:
| If you grab a chunk of licensed code and put it into a private
| repo, what prevents you from doing that? How much licensed
| code is scattered across private projects? I'm curious how
| these license violations are detected.
| Spivak wrote:
| I mean, if you text and drive while a police officer isn't
| around to see it, you still broke the law. Just because piracy
| is huuuuge and largely unpunished doesn't mean that copyright
| doesn't have to be respected in a huge, publicly visible,
| trying-to-be-above-board project.
| kevincox wrote:
| Copyright law "prevents" you from doing that. To be more
| specific copyright law specifies that you must comply with
| the license of the copyright holder in cases such as the
| one you have described.
|
| > How much of licensed code is scattered across private
| projects?
|
| Whether or not copyright violations regularly occur is not
| (directly) relevant to whether or not it is illegal. People
| download copyrighted movies without licenses all the time
| and it still isn't legal.
| [deleted]
| maxrev17 wrote:
| How do people think developers learn?! Many probably recite
| copyrighted code almost verbatim on the reg. Storm in a tea cup.
| codehawke wrote:
| Most likely they are using the private stuff too.
| greatgib wrote:
| As I said in another thread, in my opinion there is no issue
| with them training on whatever public data they like.
|
| And in the end, the output in itself is not really an issue. It
| is just a machine outputting random lines it encountered on
| the internet.
|
| The problem is from the user side: Ok, you got random lines from
| random places. If you do nothing about it, then no issue. But if
| you try to use, publish, sell the code, then you are in deep
| shit. But somehow it's your fault.
|
| For GitHub, the problem is more to be sued by "customers" that
| assumed that the generated code was safe to use when it is not
| the case.
|
| And, as a general comment, I think that this case is very
| illustrative about the misconceptions about AI and machine
| learning for the general public:
|
| Here you can see that you don't really have an intelligent
| system that can learn and then create something new and
| innovative from scratch. It is just a machine that copies code
| it has already seen, based on correlations with your current
| code.
| Andrex wrote:
| Hmm, no. I'll be (finally) moving to GitLab or similar.
| ssivark wrote:
| Calling it "public" code feels like doublespeak. It's most
| definitely NOT public domain code -- it only happens to be hosted
| on GitHub and browsable ( _but not copyable_ ) by people. "Source
| available for viewing" is very different from "public property"
| as the phrase is commonly understood:
| https://en.m.wikipedia.org/wiki/Public_property
| BlueTemplar wrote:
| Interestingly, it is copyable... but only on GitHub!
| ("forkable")
|
| That's some nasty walled-garden terms... I wonder how much
| these kinds of ToS are actually legal?
| swiley wrote:
| Pressing that "fork" button might be illegal. It's certainly
| illegal to push after pressing it in many cases.
| prettygood wrote:
| Not copyable by people, but we can go through the code, learn
| from it and then use that knowledge to improve our coding
| skills.
|
| Isn't that what Copilot is doing here? The system is merely
| learning how to code, and then applying its learnings to other
| programming problems. It's not like it's writing software to
| specifically compete with other programs.
| jspaetzel wrote:
| There's a really philosophical question here about whether
| Copilot is learning or imitating.
|
| For instance, a parrot doesn't learn to speak, it learns to
| imitate speech.
| hellcow wrote:
| Not when it outputs large sections of unique code verbatim,
| as it's been shown to do.
| qayxc wrote:
| If it's large sections, that can be fixed by either licence
| attribution or result filtering.
|
| That's at best a technical issue. What way too many people
| claim, however, is that the machine isn't even allowed to
| _look at_ GPL'ed code for some reason, while humans are.
|
| I'd like to learn the reasoning behind that.
| ssivark wrote:
| I think result-filtering (based on license of search
| results) is gnarly enough, and likely computationally
| intensive, so as to break the whole feature. But it would
| be interesting to see if that can be crafted to fix the
| shortcomings of the ML model.
| thomasahle wrote:
| > What way too many people claim, however, is that the
| machine isn't even allowed to look at GPL'ed code for
| some reason, while humans are.
|
| Why would those be the same thing? It's a matter of
| scale. Just like how people are allowed to read websites,
| but scraping is often disallowed.
| fragileone wrote:
| The word they're actually referring to here is "source
| available", and trying to use "public" is just to confuse
| people into thinking they're referring to public domain only.
| ghoward wrote:
| Maybe they mean code that is in public (versus private)
| repos? And then use the word to make it seem like it's stuff
| in the public domain?
| CyberRabbi wrote:
| Movies are "public" too. That does not mean you are allowed to
| use them for any purpose. The term "Public" does not have
| specific legal consequences in copyright law outside of
| something being "public domain" as you say.
| slownews45 wrote:
| You are allowed to watch them. Many movies take ideas from
| other movies, which took ideas from myths and earlier stories.
| In fact, I find modern movies highly derivative.
| planb wrote:
| The question is: are you allowed to train a neural network on
| movies (e.g. For an automated color grading algorithm) and
| then sell that as a service?
| Hamuko wrote:
| > _it only happens to be hosted on GitHub and browsable (but
| not copyable) by people._
|
| So would you say that it's _publicly_ visible?
| sp332 wrote:
| Publicly visible, yes. Publicly available, yes. Public code,
| no.
| frumper wrote:
| If you post your code to the public, I wouldn't be shocked
| if people copy it verbatim without regard to license. I'm
| not suggesting that is a proper thing to do, just accepting
| that it can happen when I post code.
| Hamuko wrote:
| "Public code" is not a defined term. It's not short for
| "public domain code".
| blibble wrote:
| I guess leaked copies of the NT kernel source on github are now
| "public" in the eyes of MS?
| will4274 wrote:
| Public and public domain are not the same thing. This code is
| public in the same way that Google indexes publicly available
| information on the internet.
| mensetmanusman wrote:
| Does GPT-3 have to attribute mankind for reading all of the
| internet?
|
| What about deep learning-artwork trained on google searches?
|
| We enter a new era...
| habitue wrote:
| I am really confused by HN's response to copilot. It seems like
| before the twitter thread on it went viral, the only people who
| cared about programmers copying (verbatim!) short snippets of
| code like this would be lawyers and executives. Suddenly everyone
| is coming out of the woodwork as copyright maximalists?
|
| I know HN loves a good "well actually" and Microsoft is always
| suspect, but let's leave the idea of code laundering to the
| Oracle lawyers. Let hackers continue to play and solve
| interesting problems.
|
| Copilot should be inspiring people to figure out how to do better
| than it, not making hackers get up in arms trying to slap it
| down.
| joe_the_user wrote:
| _I am really confused by HN's response to copilot._
|
| If you're asking about the moral reaction here, I think it
| depends on how one views Copilot. Does Copilot create basically
| original code that just happens to include a few small
| snippets? Or does Copilot actually generate a large portion of
| lightly changed code when it's not spitting out verbatim copies
| of the code? I mean, if you tell Copilot, "make me a QT-
| compatible, cross-platform windowing library" and it spits out
| a slightly modified version of the QT source code, and someone
| started distributing that with a very cheap commercial license,
| that would be a problem for the QT company, which licenses
| their code commercially or under the GPL (and as QT is a
| library, the QT GPL forces users to also release their own code
| under the GPL if they release it, so it's a big restriction).
| So in the worst-case scenario, you have something ethically
| dubious as well as legally dubious.
|
| _Copilot should be inspiring people to figure out how to do
| better than it, not making hackers get up in arms trying to
| slap it down._
|
| Why can't we do both? I mean, I am quite interested in AI and
| its progress, and I also think it's important to note the way
| that AI "launders" a lot of things (it launders bias, launders
| source code, etc.). AI scanning of job applications has all
| sorts of unfortunate effects, etc. But my critique of the
| applications doesn't make me uninterested in the theory;
| they're two different things.
| fragmede wrote:
| A naive developer thinks that they are the source code they
| write (you're not), and their source code leaking to the
| world makes them worthless. (Which isn't true, but being
| _that_ invalidated explains a lot of the fear. Which, welcome
| to the club, programmers. Automation's here for your job
| too.)
|
| Still, some of the moral outrage here has to do with it
| coming from Github, and thus Microsoft. Software startup Kite
| has largely gone under the radar so far, but they launched
| this back in 2016. Github's late to the game. But look at the
| difference (and similarities) in responses to their product
| launch posts here.
|
| https://news.ycombinator.com/item?id=11497111 and
| https://news.ycombinator.com/item?id=19018037
| papito wrote:
| In addition, this is extremely hard to enforce. I think the
| amount of code running in closed systems that does not exactly
| respect the original license is shocking. What was the last
| case you know where this was a "scandal"?
|
| It only happens at boss level when tech giants litigate IP
| issues.
| 63 wrote:
| Firstly it's important to remember that HN is not a single
| person with a single opinion, but many people with conflicting
| opinions. Personally I'm just interested in the copyright
| discussion for the sake of it because I find it interesting.
| Though, I imagine there are also some feelings of
| unfairness.
| bpodgursky wrote:
| Hacker News hates everything, especially if it seems to work.
| Don't read into it.
| [deleted]
| dang wrote:
| " _Please don't sneer, including at the rest of the
| community._"
|
| https://news.ycombinator.com/newsguidelines.html
| mupuff1234 wrote:
| I think the real issue is less about the "copying short
| snippets" and more about how it was done: zero transparency,
| default opt-in without any regard to licensing (with no way to
| opt out??), and last but not least - planning to charge money
| for it.
| muyuu wrote:
| idk, I don't quite enjoy the idea of having my code stolen
| without any respect for its licence or even attribution
|
| but then again I migrated away from github as soon as MS bought
| it
|
| still, it's a matter of principle
| moyix wrote:
| Programmers _love_ to pretend that they're lawyers, especially
| when it comes to copyright law. Something about the law really
| appeals to hackers!
| paxys wrote:
| It's a pretty standard "big company releases new thing"
| reaction. HN is usually negative on everything.
| sp332 wrote:
| You don't have to be a copyright maximalist to worry about a
| company taking snippets of code that used to be under an open
| license and using them in a closed-source app.
| carom wrote:
| It is a large corporation eroding the integrity of open source
| licenses. It is perfectly reasonable to be pissed off about
| this.
| stusmall wrote:
| I've always cared but never talked about it. Someone copy and
| pasting code from a source that is clearly forbidden (free
| software, reverse engineered code, leaked source code, etc)
| isn't an interesting thing to talk about. It's obviously wrong.
|
| Also people rarely do it; I've caught maybe a couple instances
| of it in my career and I never really thought too much about
| them again. This tool helps make it a lot easier and more
| common. I have a feeling other people chiming in are also in
| the camp of "Oh, this is going to be a thing now, huh?"
|
| I also can't help but think that my negative opinion of it
| isn't solely based on this provenance issue. While it's cool,
| it seems questionable how practical it is. If the value were
| clearer I think I could stomach the risk a bit better.
| isaac21259 wrote:
| If Copilot were open source I wouldn't have an issue with it.
| However, it is closed source, and a later version is intended
| to be sold.
| corobo wrote:
| The difference between copilot and copy pasting from
| stackoverflow is consent
| 6gvONxR4sf7o wrote:
| On many ML posts, you get arguments about IP, and there's a
| long history of IP wars on this forum, especially when
| licensing comes up. Then you add the popular Big Tech Is Evil
| arguments you see. I think it's a variety of factors coming
| together for people to be upset about someone else profiting
| from their own work in ways they didn't mean to allow.
|
| I expect that we'll need new copyright law to protect creators
| from this kind of thing (specifically, to give creators an
| _option_ to make their work public without allowing arbitrary
| ML to be trained on it). Otherwise the formula for ML based
| fair use is "$$$ + my things = your things" which is always a
| recipe for tension.
| leventov wrote:
| Perhaps people on HN start sensing that successors of Github
| Copilot will take their programming job. Rightly so.
|
| Personally, I think that in the age of AI programming any
| notions of code licensing should be abolished. There is no
| copyright for genes in nature or memes in culture; similarly,
| there shouldn't be copyright for code.
| croes wrote:
| >There is no copyright for genes in nature
|
| Since when are humans not a part of nature?
| superfrank wrote:
| > Perhaps people on HN start sensing that successors of
| Github Copilot will take their programming job. Rightly so.
|
| I still think we're a long way from that. Copilot will help
| write code quicker, but it's not doing anything you couldn't
| do with a Google search and copy/paste. Once developers move
| beyond the jr. level, writing code tends to become the least
| of their worries.
|
| Writing the code is easy, understanding how that code will
| affect the rest of the system is hard.
| fragmede wrote:
| Depends on your definition of "a long way". Some of the
| GPT3 based code generation demos (which, explicitly, are
| just that - demos - we aren't shown the limitations of the
| system during the demo) say that's closer than I think.
|
| https://analyticsindiamag.com/open-ai-gpt-3-code-
| generator-a... has a bunch of videos of this in action.
| tylersmith wrote:
| Based on the responses I've seen, people have it in their
| heads that Copilot is a system where you describe what kind
| of software you want and it finds it on Github and slaps
| your own license on it.
|
| It's just a smarter tab-completion.
| LouisSayers wrote:
| > Perhaps people on HN start sensing that successors of
| Github Copilot will take their programming job. Rightly so.
|
| I feel like this comment misunderstands what a software
| developer is doing. Copilot isn't going to understand the
| underlying problem to be solved. It's not going to know about
| the specific domain and what makes sense and what doesn't.
|
| We're not going to see developers replaced in our lifetime.
| For that you need actual intelligence - which is very
| different from the monkey see monkey do AI of today.
| luffapi wrote:
| > _Copilot should be inspiring people to figure out how to do
| better than it, not making hackers get up in arms trying to
| slap it down._
|
| One of the (many) problems is that GitHub/Microsoft already
| benefit from runaway network effects so it's difficult to "do
| better". Where will you get all of that training code if not
| off GitHub?
|
| The real answer to this is to yank your projects from GitHub
| now while you search for alternatives.
| Narishma wrote:
| Even if you do that, what's to stop them from using open
| source software from all over the web and not just what's on
| GitHub? The only way to stop them then is to go closed
| source.
| luffapi wrote:
| I mean stop them at a larger level by threatening their
| success as an organization. If developers stop publishing
| to GitHub they have bigger problems than training ML
| models.
|
| Whether or not this move is "legal", it should serve as a
| wake up call that GH is not actually a service we should be
| empowering. This incident is just one example of why that's
| a bad idea.
| adamdusty wrote:
| I'm more surprised that people don't care about the telemetry
| aspect. It's an extension that sends your code to an MS
| service, and MS promises access is on a need-to-know basis.
|
| I don't care if MS copies my hobby projects exactly, but I'm
| not sure my employer (defense contractor) would even be allowed
| to use a tool like this.
|
| I think it looks cool though. I will probably try it out if it
| is ever available for free and works for the languages I use.
| gradys wrote:
| It's quite possible to do this on-prem and even on-device.
| TabNine, a very similar system with a smaller model (based on
| GPT-2 rather than 3), has existed for years and works on-
| device.
| ageyfman wrote:
| Try doing any type of deal (fundraising, M&A) where you can't
| point to the provenance of your application's code. This isn't
| good for programmers, programmers WANT clean and knowable
| copyrights. This is good for lawyers, who'll now have another
| way to extract thousands of $$ from companies to launder their
| code.
| jspaetzel wrote:
| Copyleft licenses are generally liked by developers; this flies
| directly against that, since it suggests circumvention of
| those types of licenses.
| nickvincent wrote:
| Regardless of how the (potentially very impactful) debate about
| licensing and copyright plays out, I think many here would agree
| this constitutes an "exploitation" of labor, at least in a mild
| sense.
|
| Optimistically, Copilot could be a wake up call for thinking more
| deeply about how the winnings of data-dependent technologies
| (ultimately, dependent on the labor of people who do things like
| write open source code) are concentrated--or shared more broadly.
|
| This longer blog post goes into more of a labor framing on the
| topic: https://www.psagroup.org/blogposts/101
|
| (For the record, I certainly think Copilot could be very good for
| programmers in general and am not arguing against its existence
| -- just arguing that this is a high profile case study, useful
| for thinking about data-dependent tech in general)
| leventov wrote:
| There will be just a short transition period. In 10 years, AI
| will be writing most of code, and in 20 years - nearly all
| code. People will do only architecture/business analysis.
|
| No more "exploitation" of labor.
| hpcjoe wrote:
| "All your code are belong to us" ... :(
| 1970-01-01 wrote:
| What about the other half of the law: If your copilot code takes
| from public sources but produces something that is patented, can
| you be sued by a patent troll? (Yes.)
| mikehearn wrote:
| ML novice question: is this atypical when training models? Wasn't
| GPT-3 trained on a lot of copyrighted data? My gut instinct,
| which is based on very little information, is that it would be
| pretty hard to train models if you could only use open-licensed
| material.
| CyberRabbi wrote:
| Yes training data is very valuable. Producing quality training
| data is an industry in itself. GitHub is trying to get it for
| free; it doesn't work that way.
| SCLeo wrote:
| I am not a lawyer but I do believe GPT-3 as a commercial
| product trained using copyrighted data constitutes
| infringement. I also think GPT-2 does not because it is for
| research purposes, which made it fair use.
| Tenoke wrote:
| Yes, it would stifle NLP research immensely, and we likely
| wouldn't see anything better than GPT-3 for years if such
| restrictions are put in place.
| CyberRabbi wrote:
| You're free to privately research with this data but
| commercializing other people's work using ML is theft.
|
| Edit: commercializing of the derived work is one explicit
| consideration used by US law in making a fair use
| determination. That said, even if it weren't commercialized
| it may still be infringement and I believe it is.
| Spivak wrote:
| Commercializing isn't really the issue, it's still
| copyright infringement even if you release it for free
| (i.e. piracy) -- it's unauthorized redistribution (i.e.
| copying).
| Tenoke wrote:
| Even if we accept that (which many wouldn't, as most licenses
| say little about research), the research would never be
| very useful if you can never make a comparable dataset to
| use in the real world.
| whimsicalism wrote:
| I get that the problem is commercializing, but the theories
| around copyright that are being deployed here would prevent
| even free, open-source NLP research from becoming a
| reality.
| lumost wrote:
| There is a difference between a model that achieves "fair
| use" of copyrighted work and one that regurgitates
| copyrighted work without attribution.
| ghaff wrote:
| You're basically seeing how some people would have had open
| source play out. You can look at and use the code but not to
| make money or in any other way that I personally disapprove
| of. This is a world where open source would have ended up
| being pretty much irrelevant.
| La1n wrote:
| Are we not now seeing why people would want to do
| that? A multi-billion dollar company is using people's work to
| make more profits without paying them.
|
| I definitely understand why people pick a license that
| disallows uses they don't agree with. Imagine baking
| cookies for your friends, and one of them reselling them.
| The material effect is the same to you - you gave away your
| cookies - but sometimes you make/do something for a certain
| group of people and not for others to make a profit off your
| work.
| ghaff wrote:
| People can do whatever they want with their work,
| including not sharing it at all.
|
| But a great deal of the value that's come from open
| source generally has been that open source licenses
| _haven't_ imposed the sort of usage-based restrictions
| (e.g. free for educational use only) that were fairly
| common in the PC world.
|
| And, to your example, in the case of software the
| incremental copy that your friend sold cost you
| absolutely nothing. So it comes down to a purely
| emotional response to someone else making money off
| something you made.
| La1n wrote:
| >So it comes down to a purely emotional response to
| someone else making money off something you made.
|
| Exactly, as I said, the material situation is the same.
| But we all are emotional beings, you would do certain
| things for your family you wouldn't for strangers. I
| don't think this case is any different.
|
| I personally don't work for free for a company, but I do
| charity work for free. Working for a company in the time
| I work for a charity would "cost me absolutely nothing"
| if I already spend the time anyway, but everyone
| understands the difference.
| jonfw wrote:
| It would be pretty concerning if people used GPT-3 while they
| were writing a novel, and it assisted them in plagiarizing a
| Stephen King novel.
|
| We already have examples of copilot blatantly plagiarizing code
| mikehearn wrote:
| Right, but that sounds like the bigger issue here is that the
| model might spit out copyrighted material, not just that it
| scrapes it. The former seems like a technology problem that
| Microsoft can solve.
| Spivak wrote:
| The issue is that not only might the model spit out
| copyrighted material verbatim (which it is) but that it
| might also spit out non-obvious derivative works that will
| get you in legal hot water years down the road.
| undecisive wrote:
| I see a lot of people trying to compare its "machine learning" to
| human learning.
|
| Let's use this thought experiment: Imagine that Github's Copilot
| was just a massive array of all the lines of code from every
| github project, with some (magical automated whatever) tagging
| and indexing on each function, and a search engine on top of
| that.
|
| Now imagine that copilot simply finds the closest search result,
| and then when you press a button, it inserts the line from the
| array, and press it again and you get the next line, etc.
|
| Now hopefully nobody here thinks such a system would fulfil
| either the spirit or the law of any half-restrictive license. Yet
| that is a perfectly valid implementation of Copilot's aim - and
| it sounds like it's not that far from what actually happens,
| maybe with a bit of variable name munging.
|
| So my question is this: imagine a line drawn between the
| system I describe above and the system of human learning, where
| a human learns the patterns and can genuinely produce novel
| structures and patterns and even programming languages that it
| has never seen before.
|
| At what point along that line would you say that Copilot is close
| enough to human to not be violating licenses that require
| attribution?
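|
| To make the retrieval end of that line concrete, here is a toy
| sketch (hypothetical Python) of a "completer" that only looks
| things up and never generalizes:
|     from difflib import SequenceMatcher
|
|     # A toy corpus standing in for "all the lines of code from
|     # every github project".
|     CORPUS = [
|         "def add(a, b):",
|         "    return a + b",
|         "for item in items:",
|         "    print(item)",
|     ]
|
|     def closest_line(prefix):
|         # Pure retrieval: return the single most similar stored
|         # line. No abstraction, no generalization, just lookup.
|         return max(CORPUS, key=lambda line:
|                    SequenceMatcher(None, prefix, line).ratio())
|
|     closest_line("def add")  # -> "def add(a, b):"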
| leereeves wrote:
| I don't think it matters where Copilot is on that line. A
| skilled human programmer at the far end of that line, fully
| capable of producing novel programs that they haven't seen
| before, would still be violating copyright if they reproduced a
| program they have seen before.
| Spivak wrote:
| I mean it answers the question pretty quickly if your agent
| isn't sophisticated enough to actually produce novel programs
| in the first place.
| taytus wrote:
| Microsoft has spent a lot of money and energy in earning
| developers' trust over the last 15 years.
|
| They have done an excellent job and succeeded in their goal.
|
| Now, with copilot they are about to lose it all.
| [deleted]
| albertzeyer wrote:
| So, when a human reads public code on the Internet (no matter the
| licence), and gains knowledge, learns (updates the synaptic
| weights of the brain), and then makes (indirectly) use of that
| gained knowledge for further work, how is this different to this
| case?
| [deleted]
| nowherebeen wrote:
| The difference is intent. When Github reads public code, their
| only intent is to profit from it. Depending on the license,
| that's a violation.
| albertzeyer wrote:
| A human also often intends to make profit (by using the
| gained knowledge).
| nowherebeen wrote:
| No, they intend to learn from it or find a solution to
| their problem. It's much harder to argue human intent in
| court than GitHub blatantly doing so.
| SXX wrote:
| It's no different, but if a human reads copyrighted proprietary
| code and then reproduces part of it exactly, they have a good
| chance of getting into huge legal trouble.
|
| On the other hand, said AI has no idea who the code belongs to,
| and it's able to reproduce it perfectly.
| djoldman wrote:
| One question I haven't really seen talked about is: when you get a
| suggestion through copilot and save the document, who is the
| author of the document?
|
| I think this may be the crux of this whole kerfuffle.
|
| If you're the author isn't it on you if you infringe?
|
| If not then perhaps you and GitHub/Microsoft share
| authorship/culpability?
|
| Who has the copyright to a piece of text generated by a tool? Or
| art generated by a model?
| cameldrv wrote:
| My question is what GitHub is going to do when people start
| sending them DMCA takedown notices over their code being
| distributed through this system.
|
| Currently, if you claim to be a copyright owner GitHub can
| respond to a DMCA takedown by removing the repository. This might
| require them to retrain the entire model.
|
| One option for GitHub might be to maintain a blocklist of various
| code snippets, and if there is a substring match, just don't make
| the suggestion.
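|
| A crude sketch of that last option (hypothetical Python; real
| matching would want tokenization rather than raw strings):
|     import re
|
|     # Snippets received via takedown notices (invented examples).
|     BLOCKLIST = {
|         "float q_rsqrt( float number )",
|         "const float threehalfs = 1.5F;",
|     }
|
|     def normalized(s):
|         # Ignore whitespace differences when comparing.
|         return re.sub(r"\s+", " ", s).strip()
|
|     def allowed(suggestion):
|         # Suppress any suggestion containing a blocked snippet.
|         text = normalized(suggestion)
|         return not any(normalized(b) in text for b in BLOCKLIST)
| This avoids retraining, but it is whack-a-mole: renaming one
| identifier defeats a literal substring match.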
| thinkingemote wrote:
| The answer is simple: Github needs to make a tool which can scan
| all your code to see if it contains public code. It's
| what universities around the world do for students' work.
|
| Of course, there's a huge irony in that Github is also making
| the tool that enables the widespread plagiarism....
| prepend wrote:
| It's not copyright violation to train ML on content. So the
| license doesn't matter unless there's some "can't use this for ML
| training" license that I don't know about (and doesn't seem to be
| legal).
| WoodenChair wrote:
| > It's not copyright violation to train ML on content.
|
| The training is not a copyright violation. That seems to be
| settled case law. Whether the verbatim copying as a result of
| that training is a copyright violation I think is less tested.
|
| Let's flip the domains. Say we had an ML algorithm that could
| auto generate news stories and it at some point (not all the
| time) copied verbatim a Wall Street Journal article and posted
| it to a blog. Copyright violation?
|
| With copilot, we're sometimes seeing "paragraphs" of source
| lines copied verbatim, so this analogy is not such a stretch.
|
| I think we need to think about how much our sharing culture in
| programming has tinted our view of the legality of this
| enterprise.
| vharuck wrote:
| >It's not copyright violation to train ML on content.
|
| I agree. It'd be a nice gesture to reach out to the creators of
| the training data, as is usual with web scrapers. But
| collecting and analyzing data publicly available on the web is
| ok.
|
| >So the license doesn't matter unless there's some "can't use
| this for ML training" license that I don't know about (and
| doesn't seem to be legal).
|
| I disagree. While Copilot is, at heart, an ML model, the
| copyright trouble comes from its usage. It consumes copyrighted
| code (ok), analyzes copyrighted code (still ok), and then
| produces code which sometimes is a copy of copyrighted code (not
| ok). The only way it'd be ok is if Copilot followed all
| licensing requirements when it produced copies of other works.
|
| Personally, I won't touch it for work until either Copilot
| abides by the licenses or there's robust case law.
| prepend wrote:
| > It'd be a nice gesture to reach out to the creators of the
| training data, like is usual with web scrapers.
|
| I don't think this is practical. And who notifies people of
| scraping content? I would've been annoyed if I got spam from
| sites that scraped my content.
| vharuck wrote:
| I've contacted websites about scraping when it'd be a
| repeat thing and they didn't have a robots.txt file
| available. Also if their stance on enforcing copyright was
| hazy (e.g. medical coding created by a non-profit).
| Sometimes, they pointed me toward an API I didn't know
| about.
|
| >I don't think this is practical.
|
| I don't like people ignoring things just because they're
| impractical for ML. That leads to crap like automated
| account banning without the possibility of talking to a living
| customer service representative.
| kune wrote:
| Guys please read the Terms of Use of Github section D.4.
|
| We need the legal right to do things like host Your Content,
| publish it, and share it. You grant us and our legal successors
| the right to store, archive, parse, and display Your Content, and
| make incidental copies, as necessary to provide the Service,
| including improving the Service over time. This license includes
| the right to do things like copy it to our database and make
| backups; show it to you and other users; parse it into a search
| index or otherwise analyze it on our servers; share it with other
| users; and perform it, in case Your Content is something like
| music or video.
|
| This license does not grant GitHub the right to sell Your
| Content. It also does not grant GitHub the right to otherwise
| distribute or use Your Content outside of our provision of the
| Service, except that as part of the right to archive Your
| Content, GitHub may permit our partners to store and archive Your
| Content in public repositories in connection with the GitHub
| Arctic Code Vault and GitHub Archive Program.
| bruce343434 wrote:
| Still depends on how they defined "the Service". Can't be
| bothered to read the full license myself because I don't use
| github - but I can't imagine "the Service" is defined as
| including an AI copy paster.
| mumblemumble wrote:
| > The "Service" refers to the applications, software,
| products, and services provided by GitHub, including any Beta
| Previews.
|
| So it wouldn't include just any AI copy pasters. Only the
| ones that are provided by GitHub.
| remram wrote:
| If I upload somebody else's GPL code to GitHub, I also can't
| grant to GitHub the (implicit) legal rights to use that code in
| Copilot, because they are not mine to give.
|
| I could previously mirror GPL code, because the GPL granted me
| the rights I need to grant GitHub as part of their ToS; but if
| they change their ToS, or if the meaning is changed by them
| adding vastly different features to their Service, this becomes
| a problem.
| eCa wrote:
| If Copilot requires separate payment or signup I[1] fail to see
| how it can be part of "the Service" as defined therein, and
| since the rights to do "things" to the provided code only go as
| far "as necessary to provide the Service" the ToS can't[2] be
| used to argue that it gives explicit permission to use provided
| code for this purpose. Or am I misinterpreting something?
|
| [1] I'm not a lawyer.
|
| [2] Still not a lawyer.
| baby wrote:
| The licensing game is really awful imo. It should be that
| releasing your code on github = fair game. Licenses are seriously
| hindering development. You either take part in open source or you
| don't. I get anxious every time someone asks me to add a license
| to one of my projects because I don't know which license to use
| and wonder if it'll prevent some people from using the software
| down the line. Once I tried writing my own license that basically
| said: I don't care, do whatever. Yet someone complained with "yet
| another license".
| dylan604 wrote:
| yeah, no. Licensing is really awful, yes.
|
| "You either take part in open source or you don't." I disagree.
| You can allow your software to be used and post the source
| code, but it is yours so you get some say in your intentions.
| Forking is what you're looking for. However, once you fork it,
| you still owe credit to those that did the heavy lifting before
| making whatever tweak it is you made and want to call it your
| own. There's nothing wrong with the original developers getting
| credit for the work they did. There's nothing wrong with the
| original devs willing to let other people use their work as
| long as it is used in the same spirit it was provided (FOSS).
| That also does not mean the original devs are wrong for being
| restrictive toward evilCorps that want their free software to
| be included/distributed in packages they sell and profit
| from.
| turndown wrote:
| It seems that now is finally the time I must apologize for some
| of the java code I put up on github.
| keonix wrote:
| If the training set contains _verbatim_ (A)GPL code, does this
| mean that Copilot also should be _distributed_ by Microsoft under
| GPL? Because without it Copilot (as it is _distributed_ by
| Microsoft) couldn't be built, wouldn't that make it a _derivative
| work_ of GPL'd code (and obviously every other license)?
|
| I see a lot of people comparing human learning to machine
| learning in the comments, but there is a huge difference - we
| don't _distribute_ copies of humans
| rapind wrote:
| This is pretty interesting for AI in general. Should you be
| able to train with material you don't own? Can your training
| benefit from material that has specific usage licenses attached
| to it? What about stuff like GameGAN?
| zoomablemind wrote:
| > ...Should you be able to train with material you don't own?
|
| If relating this to how humans learn, books and other sources
| are used to inform understanding and human knowledge. One can
| purchase or borrow a book without actually owning the
| copyright to it. Indeed, a given passage may be later quoted
| verbatim, provided it is accompanied with a reference to its
| source.
|
| Otherwise, a verbatim use without attribution in authored
| context is considered plagiarism.
|
| So, sure one can use a multitude of material for the
| training. Yet, once it gets to the use of the acquired
| "knowledge" - proper attribution is due for any "authentic
| enough" pieces.
|
| What is authentic enough in this case is not easy to define,
| however.
| rapind wrote:
| "If relating this to how humans learn" seems like a big IF
| though right? Are we going to treat computer neural nets as
| human from a legal standpoint?
|
| At some point Neural Nets like GameGAN might be good enough
| to duplicate (and optimize) a commercial game. Can you then
| release your version of the game? Do you just need to make
| a few tweaks? Are we going to get a double standard because
| commercial interests are opposed depending on the use case?
|
| It would be pretty funny if Microsoft as a game publisher
| lobbies to prevent their IP being used w/ something like
| GameGAN, but then takes the opposing stand point for
| something like their CoPilot! Although I'm sure it'll be
| spun as "These things are completely different!".
| paulryanrogers wrote:
| This is the key question. In school I was taught to be
| careful to always cite even paraphrased works. If Copilot
| regurgitates copyrighted fragments without citation or
| informing acceptors of licenses involved then it's
| facilitating infringement.
| ghoward wrote:
| This is a great argument.
| AaronFriel wrote:
| No, see Authors Guild v. Google. Even without a license or
| permission, fair use permits the mass scanning of books, the
| storage of the content of those books, and rendering verbatim
| snippets of those books. The Google Books site is not a
| derivative work of the millions of authors they copied from,
| and if they did copy any coincidentally GPL, AGPL, or creative
| commons copyleft work, the fair use exception applies before we
| reach the question of whether Google is obligated to provide
| anything beyond what it is doing.
|
| By comparison, Copilot is even more obviously fair use.
|
| I've had this conversation quite a few times lately, and the
| non-obvious thing for many developers is that fair use is an
| exception to copyright itself.
|
| A license is a grant of permission (with some terms) to use a
| copyrighted work.
|
| This snippet from the Linux kernel doesn't make my comment here
| or the website Hacker News a GPL derivative work:
| ret = vmbus_sendpacket(dev->channel, init_pkt,
|                        sizeof(struct nvsp_message),
|                        (unsigned long)init_pkt,
|                        VM_PKT_DATA_INBAND,
|                        VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
|
| This snippet from an AGPL licensed project, Bitwarden, does not
| compel dang or pg to release the Hacker News source code:
| await _sendRepository.ReplaceAsync(send);
| await _pushService.PushSyncSendUpdateAsync(send);
| return (await _sendFileStorageService
|     .GetSendFileDownloadUrlAsync(send, fileId), false, false);
|
| Fair use is an exception to copyright itself. A license cannot
| remove your right to fair use.
|
| The Free Software Foundation agrees
| (https://www.gnu.org/licenses/gpl-faq.en.html#GPLFairUse)
|
| > Yes, you do. "Fair use" is use that is allowed without any
| special permission. Since you don't need the developers'
| permission for such use, you can do it regardless of what the
| developers said about it--in the license or elsewhere, whether
| that license be the GNU GPL or any other free software license.
|
| > Note, however, that there is no world-wide principle of fair
| use; what kinds of use are considered "fair" varies from
| country to country.
|
| (And even this verbatim copying from FSF.org for the purpose of
| education is... Fair use!)
| RandomBK wrote:
| For any discussion on copyright and fair use, we should
| distinguish between the implications to Copilot the software
| itself and the implications to users of Copilot.
|
| For Copilot itself, I do see the case for fair use, though it
| gets fuzzy should Microsoft ever start commercializing the
| feature. Nevertheless, it remains to be seen whether ML
| training serves the same public policy benefits that public
| libraries and free debate leverage to enable the fair use
| defense.
|
| For Copilot users, I don't see an easy defense. In your
| hypothetical, this would be akin to me going on Google books
| and copying snippets of copyrighted works for my own book. In
| the case of Google books, they explicitly call out the limits
| on how the material they publish can be used. In contrast,
| Copilot seems to be designed to encourage such copying,
| making it more worrisome in comparison.
| swiley wrote:
| Just for reference, the hackernews source is public.
| e12e wrote:
| Not the current version? AFAIK there's some security-by-
| obscurity in the measures against spam, voting rings, etc.?
| jollybean wrote:
| Thanks for this, but can you answer the question:
|
| Would it be 'fair use' for developers to simply copy code
| from those repos - even just 10 lines, and claim 'fair use' -
| i.e. circumventing Copilot?
|
| Even if Copilot is 'fair use' ... does that mean the results
| are 'fair use' on the part of Copilot users?
|
| And a bigger question: is your interpretation of those
| statutes and case law enough to make the answer unambiguous?
|
| I don't have legal background, but I do have an operating
| background with lawyers and tech ... and my 'gut' says that
| anyone using Copilot is opening themselves up to lawsuits.
|
| If the code you put in your software comes via Copilot, and
| that code is verbatim from some kind of GPL'd (or worse,
| proprietary) source ... there's a good chance you could get
| sued if someone gets the inclination.
|
| Maybe it's because of my personal experience, but I can just
| see corporate lawyers banning Copilot straight up, as the
| risks are simply not worth the upside. That's not what we
| like to hear in the classically liberal sense, i.e. 'share and
| innovate' ... but gosh it doesn't feel like a happy legal
| situation to me.
|
| Looking forward to people with more insight sharing on this
| important topic.
| AaronFriel wrote:
| > Would it be 'fair use' for developers to simply copy
| code from those repos - even just 10 lines, and claim 'fair
| use' - i.e. circumventing Copilot?
|
| Only a lawyer (and truly, only a court) could answer that
| question.
|
| If you copy 100 lines of code that amounts to no more than
| a trivial implementation in a popular language of how to
| invert a binary tree, it's likely fair use.
|
| If you copy 10 lines of code that are highly novel, have
| never been written before, and solve a problem no one
| outside the authors has solved... It may not be fair use
| to copy that.
|
| Other people who have replied have mentioned "the heart" of
| a work. The US Supreme Court has held that even de minimis
| - "minimal", to be brief - copying can sometimes be
| infringement if you copied the "heart" of a work.
| syshum wrote:
| While I agree you are correct that (in the US anyway) fair
| use is an exemption from copyright, and thus supersedes
| licensing,
|
| I disagree that Copilot is "more obviously fair use." Some
| parts might be, but we have seen clear examples (i.e. verbatim
| code reproduction) that would not be.
|
| I don't believe the question of "is this fair use" is as clear
| as you believe it to be.
| indigochill wrote:
| Next up, Copilot for college papers! Who needs to pay a
| professional paper-writer (ahem, I mean write the paper) when
| you can have an AI write your paper for you! It's fair use,
| so you're entitled to claim ownership to it, right?
| jrochkind1 wrote:
| I think you are confusing legal protections for
| intellectual property with plagiarism. (At least that's
| what I think you're doing if I read your comment as sarcasm
| and guess what you're trying to say non-sarcastically?) But
| they are entirely different things.
|
| You can be violating copyright without plagiarizing: you
| cite your source, but you copy a copyright-protected work
| in a way the law does not allow.
|
| And you can be plagiarizing without violating copyright, if
| you have the permission of the copyright holder to use
| their content, or if the content is in the public domain
| and not protected by copyright, or if it's legal under fair
| use -- but you pass it off as your own work.
|
| Two entirely separate things. You can get expelled from
| school for plagiarism without violating anyone's copyright,
| or prosecuted for copyright infringement without committing
| any academic dishonesty.
|
| You can indeed have the legal right to make use of content,
| under fair use or anything else, but it can still be
| plagiarism. That you have a fair use right does not mean
| "Oh so that means you are allowed to turn it in to your
| professor and get an A and the law says you must be allowed
| to do this and nobody can say otherwise!" -- no.
| indigochill wrote:
| Yeah, I was being sarcastic. But you make a good point
| about the legality of plagiarism.
| tyre wrote:
| Copilot is not doing what your example does.
|
| If Github had a service that automatically mirrored public
| repositories on Gitlab, that would be equivalent to the
| example you gave.
|
| But Github is taking content under specific licenses to build
| something new for commercial use.
|
| I'm not sure if what Github does falls under Fair Use, but I
| don't know that it matters. I can read fifty books and then
| write my own, which would certainly rely--consciously or not
| --on what I had read. Is that a copyright violation? It
| doesn't seem like it is but maybe it is and until now has
| been impossible to prosecute?
| nojito wrote:
| GitHub isn't building anything.
|
| The end user is.
|
| By this logic any and all neural nets that draw pictures
| are copyright infringing as well.
| saati wrote:
| If they create exact copies of copyrighted pictures, then
| yes, they do.
| jkaplowitz wrote:
| The world is global. That's a US court ruling from one court
| of appeals. Most countries have narrower fair use rights than
| the US. Even if Copilot would fall within that legal
| precedent (far from guaranteed), a legal challenge in any
| jurisdiction worldwide outside the US states covered by that
| particular court of appeals, or which reaches the US Supreme
| Court, or which goes through the Federal Circuit Court of
| Appeals due to the initial complaint including a patent
| claim, would not be bound by that result and (especially in a
| different country) could very plausibly find otherwise.
|
| What's more, if any of the code implements a patent, fair use
| does not cover patent law, and relying on fair use rather
| than a copyright license does not benefit from any patent use
| grant that may be included in the copyright license. If a
| codebase infringes a patent due to Copilot automatically
| adding the code, I can easily imagine GitHub being attributed
| shared contributory liability for the infringement by a
| court.
|
| Not a lawyer, just a former law student and interested layman
| who has paid attention to these subjects.
| mirekrusin wrote:
| Is it fair use to memorise whole source code byte-by-byte,
| storing it as, say, some non-100%-lossless compression for
| subsequent retrieval of arbitrary-size snippets?
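|
| The technical half of that question is easy to demonstrate:
| even a trivial statistical model "compresses" its training
| text in a way that allows verbatim retrieval. A toy sketch
| (hypothetical Python):
|     from collections import defaultdict
|
|     def train(text, order=8):
|         # Map each `order`-character context to the characters
|         # observed after it.
|         model = defaultdict(list)
|         for i in range(len(text) - order):
|             model[text[i:i + order]].append(text[i + order])
|         return model
|
|     def generate(model, prompt, length, order=8):
|         out = prompt
|         for _ in range(length):
|             successors = model.get(out[-order:])
|             if not successors:
|                 break
|             out += successors[0]  # first observed continuation
|         return out
|
|     code = "static float Q_rsqrt(float number) { /* ... */ }"
|     model = train(code)
|     # Given a short prompt from the training data, the model
|     # emits the rest verbatim:
|     generate(model, code[:8], len(code)) == code  # -> True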
| shakna wrote:
| > No, see Authors Guild v. Google.
|
| That case required that the output be transformative, in that
| "words in books are being used in a way they have not been
| used before".
|
| Copilot only fits the transformative aspect if it is not
| directly reciting code that already exists in the form that
| it is redistributing. So long as it does so, it fails to meet
| the criteria.
| kmeisthax wrote:
| I think you might be considering two different acts here:
|
| 1. The act of training Copilot on public code
|
| 2. The resulting use of Copilot to generate presumably new
| code
|
| #1 is arguably close to the Authors Guild v. Google case.
| You are literally transforming the input code into an
| entirely new thing: a series of statistical parameters
| determining what functioning code "looks like". You can use
| this information to generate a whole bunch of novel and
| useful code sequences, not _just_ by feeding it parts of it
| 's training data and acting shocked that it remembered what
| it saw. That smells like fair use to me.
|
| #2 is where things get more dicey - _just because_ it's
| legal to train an ML system on copyrighted data wouldn't
| mean that its resulting output is non-infringing. The
| network itself is fair use, but the code it generates would
| be used in an ordinary commercial context, so you wouldn't
| be able to make a fair use argument here. This is the
| difference between scanning a bunch of books into a search
| engine, versus copying a paragraph out of the search engine
| and into your own work.
|
| (More generally: Fair use is non-transitive. Each reuse
| triggers a new fair use analysis of every prior work in the
| chain, because each _fair_ reuse creates a new copyright
| around what you added, but the original copyright also
| still remains.)
| tedunangst wrote:
| It's not possible to get copilot to output a transformed
| version of the input?
| shakna wrote:
| Transformed output _may_ fall under fair use.
|
| However - Copilot directly recites code. That is _very
| unlikely_ to fall under fair use.
|
| Redistributing the exact same code, in the same form, for
| the same purpose, probably means that Copilot, and thus
| the people responsible for it, are infringing.
| ThrowawayR2 wrote:
| > " _However - Copilot directly recites code._ "
|
| Sounds like that wouldn't be difficult to fix? Transform
| the code to an intermediate representation
| (https://en.wikipedia.org/wiki/Intermediate_representation) as
| a pre-processing stage, which ditches any non-essential
| structure of the code and eliminates comments, variable
| names, etc., before running the learning algorithms on
| it. _Et voila,_ much like a human learning something and
| reimplementing it, only essential code is generated
| without any possibility of accidentally regurgitating
| verbatim snippets of the source data.
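|
| Even a much shallower pass than a real IR removes most
| verbatim "fingerprints": drop comments and canonicalize
| identifier names. A sketch using Python's tokenize module
| (illustrative only, and only a stand-in for a true IR):
|     import io
|     import keyword
|     import tokenize
|
|     def canonical_tokens(source):
|         # Reduce Python source to a canonical token stream:
|         # comments dropped, every identifier renamed to v0, v1,
|         # ... in order of first appearance. The result is a
|         # fingerprint for training/matching, not runnable code.
|         names = {}
|         out = []
|         skip = (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
|                 tokenize.INDENT, tokenize.DEDENT,
|                 tokenize.ENDMARKER)
|         readline = io.StringIO(source).readline
|         for tok in tokenize.generate_tokens(readline):
|             if tok.type in skip:
|                 continue
|             if (tok.type == tokenize.NAME
|                     and not keyword.iskeyword(tok.string)):
|                 out.append(
|                     names.setdefault(tok.string, f"v{len(names)}"))
|             else:
|                 out.append(tok.string)
|         return " ".join(out)
|
|     # Two differently-written versions of the same logic
|     # normalize to the identical stream:
|     a = "def fast_inv_sqrt(x):  # magic\n    return 1/(x**0.5)\n"
|     b = "def recip_root(n):\n    return 1/(n**0.5)\n"
|     assert canonical_tokens(a) == canonical_tokens(b)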
| salawat wrote:
| At that point, can we all just agree IP is the stupidest
| concept to ever be layered on top of math (which
| programming is) and move on with non-copyrightable code?
| jcheng wrote:
| Only if you agree that copyleft licenses are also stupid;
| without copyright, there's no way to prevent companies
| from making closed-source forks of code you wrote and
| intended to stay open.
| oefnak wrote:
| Yes, sure. Without copyright there's no need for copyleft
| left, right?
| nonfamous wrote:
| > However - Copilot directly recites code.
|
| You make that statement as an absolute, but in the
| interests of clarity, all evidence so far shows that it
| directly recites code very rarely indeed. Even the Quake
| example had to be prompted by the specific variable names
| used in the original code.
|
| In practice, the output code is heavily influenced by
| your own context -- the comments you include, the
| variable names you use, even the name of the file you are
| editing -- and with use it's obvious that the code is
| almost certainly not a direct recitation of any existing
| code.
| svaha1728 wrote:
| So if a foreign company pilfers the source code to
| Windows, can they add it to a training set and then
| 'prompt' the machine learning algorithm to spit out a new
| 'copyright free' Windows, just by transforming the
| variable names?
| nonfamous wrote:
| Well no, because only GitHub has access to the training
| set. But more importantly this misunderstands how Copilot
| even works -- even if Windows was in the training set,
| you couldn't get Copilot to reproduce it. It only
| generates a few lines of code at a time, and even then
| it's almost certainly entirely novel code.
|
| Now, if you knew the code you wanted Copilot to generate
| you could certainly type it character by character and
| you might save yourself a few keystrokes with the TAB
| key, but it's going to be much MUCH easier to simply copy
| the whole codebase as files, and now you're right back
| where you started.
| rkeene2 wrote:
| I think that's my question regarding this whole thing:
|
| If it's so fair use, why not train it on all Microsoft
| code, regardless of license (in addition to GitHub.com) ?
| Would Microsoft employees be fine with Copilot re-
| creating "from memory" portions of Windows to use in WINE
| ?
| shakna wrote:
| > all evidence so far shows that it directly recites code
| very rarely indeed.
|
| _Once_ is enough for it to be infringing. The law is not
| very forgiving when you try and handwave it away.
| mthoms wrote:
| You sound quite sure that the outlying instances of
| direct copying wouldn't be covered by the Fair Use
| copyright exemption. Any particular reason for that?
|
| I tend to think it would be covered (provided they were
| relatively small snippets and not entire functions).
| AaronFriel wrote:
| Is there any evidence of Copilot producing substantial
| (100s of lines) verbatim copies of copyrighted works?
|
| Absent this, I don't think there's a case. The courts have
| given extraordinarily wide latitude to fair use and ML
| algorithms are routinely trained on copyrighted works,
| photos, etc. without a license.
|
| I understand that this feels more personal because it
| involves our field, but artists and authors have expressed
| the same sentiment when neural nets began making pictures
| and sentences.
|
| The question here is no different than "Is GPT-3 an
| unlicensed, unlawfully created derivative work of millions,
| if not billions of people?"
|
| No, I'm quite confident it is not.
| b3morales wrote:
| The point of Copilot -- its entire value as a product --
| is to produce code that matches the _intent_ and
| _semantics_ of code that was in the input. In other
| words, very deliberately not transformative in purpose.
| shakna wrote:
| > Is there any evidence of Copilot producing substantial
| (100s of lines) verbatim copies of copyrighted works?
|
| It doesn't need to be substantial. In Google v. Oracle a
| 9-line function was found to be infringing.
| AaronFriel wrote:
| If I recall correctly, the nine line question wasn't
| decided by the supreme court, but the API question was.
|
| The Supreme Court did hold that the 11,500 lines of API
| code copied verbatim constituted fair use.
|
| https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.p
| df
| shakna wrote:
| > The Supreme Court did hold that the 11,500 lines of API
| code copied verbatim constituted fair use.
|
| Yes, because it was _transformative_, in a clear way.
| Because an API is only an interface. Which makes that
| part of that decision largely irrelevant to the topic at
| hand.
|
| > Google's limited copying of the API is a transformative
| use. Google copied only what was needed to allow
| programmers to work in a different computing environment
| without discarding a portion of a familiar programming
| language. Google's purpose was to create a different
| task-related system for a different computing environment
| (smartphones) and to create a platform--the Android
| platform--that would help achieve and popularize that
| objective.
|
| > If I recall correctly, the nine line question wasn't
| decided by the supreme court, but the API question was.
|
| It was already decided earlier, and Google did not
| contest it, choosing instead to negotiate a zero payment
| settlement with Oracle over the rangeCheck function.
| There was no need for the Supreme Court to hear it.
| AaronFriel wrote:
| A $0 settlement means there is no binding precedent and
| signals to me that Oracle's attorneys felt they didn't
| have a strong argument and a potential for more.
|
| If they felt the nine line function made Google's entire
| library an unlicensed derivative work, they would have
| pressed their case.
| shakna wrote:
| > A $0 settlement means there is no binding precedent and
| signals to me that Oracle's attorneys felt they didn't
| have a strong argument and a potential for more.
|
| That's not the case. It wasn't an out-of-court-
| settlement, but an agreement about the damages being
| sought, the court had already found it to be infringing,
| and that was part of the ruling.
|
| But none of that changes that 9-lines is substantial
| enough to be infringing. It isn't necessary to be a large
| body of work.
|
| > If they felt the nine line function made Google's
| entire library an unlicensed derivative work, they would
| have pressed their case.
|
| No... It means the rangeCheck function was infringing.
| The implication you seem to have drawn here wouldn't
| be drawn in any kind of plagiarism case.
| AaronFriel wrote:
| I think we agree then, and appreciate the correction on
| the lower court settlement.
|
| If Copilot is infringing, I suspect it's correctable (by
| GitHub) by adding a bloom filter or something like it to
| filter out verbatim snippets of GPL or other copyleft
| code. (And this actually sounds like something corporate
| users would want even if it was entirely fair use because
| of their intense aversion to the GPL, anyhow.)
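|
| For the curious, a minimal sketch of such a filter
| (hypothetical Python; the sizing is invented, not tuned - a
| real one would be dimensioned from the corpus):
|     import hashlib
|
|     class BloomFilter:
|         # Probabilistic set: false positives possible, false
|         # negatives impossible. Acceptable here, since a false
|         # positive merely suppresses one suggestion.
|         def __init__(self, size_bits=1 << 24, n_hashes=5):
|             self.size = size_bits
|             self.n_hashes = n_hashes
|             self.bits = bytearray(size_bits // 8)
|
|         def _positions(self, item):
|             for i in range(self.n_hashes):
|                 h = hashlib.sha256(f"{i}:{item}".encode()).digest()
|                 yield int.from_bytes(h[:8], "big") % self.size
|
|         def add(self, item):
|             for p in self._positions(item):
|                 self.bits[p // 8] |= 1 << (p % 8)
|
|         def __contains__(self, item):
|             return all(self.bits[p // 8] & (1 << (p % 8))
|                        for p in self._positions(item))
| Populate it with normalized windows of copyleft code at
| training time; at inference time, drop any suggestion whose
| windows hit the filter.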
| shakna wrote:
| It may be correctable... It doesn't change that Copilot
| is probably infringing today, which may mean that damages
| against GitHub may be sought.
| infogulch wrote:
| Why did you choose the standard of "substantial" = "100s
| of lines"? Especially since we've already seen examples
| of verbatim output in the dozens of lines range, that
| choice of standard is rather conveniently just outside
| what exists so far. If we find a case with 200 lines of
| verbatim output will you say the only reasonable standard
| is 1000s of lines?
|
| I don't think your argument is as strong as you're making
| it out to be.
| AaronFriel wrote:
| Just a fairly arbitrary number. It's easy to produce a
| few lines from memory, up to 10s of lines, and that's
| "obviously" fair use. I would be surprised if many of us
| haven't inadvertently "copied" some GPL code in this way!
|
| This goes to the "substantial" test for fair use. Clips
| from a film can contain core plot points, quotes from a
| book can contain vital passages to understanding a
| character, screen captures and scrapes of a website can
| contain huge amounts of textual detail, but depending on
| the four factors for fair use, still be fair use. (There
| have been exceptions though.)
|
| The reaction on Hacker News to a machine producing code
| trained on their works is no different than the reactions
| artists and writers have had to other ML models. I
| suspect many of us are biased because it strikes at what
| we do and we think that our copyrights (because we have
| so many neat licenses) are special. They are not.
|
| I think it would need to get to that level of "Copilot
| will emit a kernel module" before it's not obviously fair
| use.
|
| After all, Google Books will happily convey to me whole
| pages from copyrighted works, page after page after page.
|
| https://www.google.com/books/edition/Capital_in_the_Twent
| y_F...
| jcelerier wrote:
| > Just a fairly arbitrary number. It's easy to produce a
| few lines from memory, up to 10s of lines and that's
| "obviously" fair use.
|
| it's anything but obvious.
| https://www.copyright.gov/fair-use/
|
| > there is no formula to ensure that a predetermined
| percentage or amount of a work--or specific number of
| words, lines, pages, copies--may be used without
| permission.
|
| 9 lines of very run-of-the-mill code in Oracle / Google
| weren't considered fair use.
| [deleted]
| _Understated_ wrote:
| > By comparison, Copilot is even more obviously fair use.
|
| Not sure I see it that way.
|
| If I take your hard work that you clearly marked with a GPL
| license and then make money from it, not quite directly, but
| very closely, how is that fair use? Or legal?
|
| Copying and storing a book isn't recreating another book from
| it. Copilot is creating new stuff from the contents of the
| "books" in this case.
|
| Edit: I misunderstood fair use as it turns out...
| nonfamous wrote:
| For your specific case, "take your hard work that you
| clearly marked with a GPL license and then make money from
| it", you don't even need to rely on fair use. As long as
| you comply with the terms of the GPL, making money with the
| code is perfectly acceptable, and the FSF even endorses the
| practice. [1] Red Hat is but one billion-dollar example.
|
| [1] https://www.gnu.org/licenses/gpl-
| faq.en.html#DoesTheGPLAllow...
| b3morales wrote:
| But the person making money from the GPL code has to
| follow the terms of the license. Attribution, sharing
| modifications, etc.
| nonfamous wrote:
| Correct. That's why I said "As long as you comply with
| the terms of the GPL".
| AaronFriel wrote:
| I've edited my comment with examples and a clarification.
|
| Fair use is an exception to copyright and, by definition,
| copyright licenses.
| Arelius wrote:
| I don't think that's an accurate description...
|
| Fair use is a defense for cases of copyright
| infringement, which means you're starting off from a case
| of copyright infringement, which sort of mucks up the
| whole "innocent until proven guilty" thing. And
| considering it's a weighted test, it's hardly very cut-
| and-dried at that.
| _Understated_ wrote:
| I understand the concept of fair use (I think) but I
| can't see how it applies to Copilot.
|
| Google didn't create new books from the contents of
| existing ones (whether you agree that they should have
| been allowed to store the books or not) but Copilot is
| creating new code/apps from existing ones.
|
| Edit: I guess my understanding of fair use was wrong. I
| stand corrected.
| unanswered wrote:
| That you apparently think fair use is something you just
| think about real hard in order to see how you feel about
| it in a given situation demonstrates that you do not
| understand the concept of fair use. There are rules.
| elliekelly wrote:
| I don't disagree with your point but was it necessary to
| make it in such a snarky way?
| unanswered wrote:
| Have you found a better way to defeat Dunning-Kruger?
|
| Your comment is entirely unconstructive as it does not
| suggest an alternative course of action. Criticism on HN
| should strive to be constructive, or failing that to be
| left unsaid.
| _Understated_ wrote:
| Yeah, I realise that now.
|
| However, where does one draw the line between fair use
| and derivative works?
|
| Creating something based on other stuff (Google creating
| AI books from the existing ones for example) would
| possibly be fair use I think, but would it not also be a
| derivative work?
| thrashh wrote:
| There's no clear line and there can never be because the
| world is too complex. We leave determination up to the
| court system.
|
| Google Books is considered fair use because they got sued
| and successfully used fair use as a defense. Until
| someone sues over Copilot, everyone is an armchair
| lawyer.
| AaronFriel wrote:
| If Google Books were creating new books, that would only
| _help_ their argument. Transformativeness is one of the
| four parts of the fair use test.
|
| Copilot producing new, novel works (which may contain
| short verbatim snippets of GPL works) is a strong
| argument for transformativeness.
| FemmeAndroid wrote:
| It would help the transformativeness, but it would
| substantially change the effect upon the market. By
| creating competing products with the copyrighted
| material, there is a higher degree of transformativeness,
| but you also end up disrupting the marketplace.
|
| I don't know how a court would decide this, but I do
| think the facts in future GPT-3 cases are sufficiently
| different from Author's Guild that I could see it going
| any way. Plus, I think the prevalence of GPT-3 and the
| ramifications of the ruling one way or another could lead
| some future case to be heard by the Supreme Court. A
| similar case could come up in California, or another
| state where the 2nd Circuit Artist Guild case isn't
| precedent.
| peteradio wrote:
| > short verbatim snippets of GPL works
|
| Define short
| shadowgovt wrote:
| > If I take your hard work that you clearly marked with a
| GPL license and then make money from it, not quite
| directly, but very closely, how is that fair use? Or legal?
|
| If I'm Google, and I scan your code and return a link to it
| when people ask to find code like that (but show an ad next
| to that link for someone else's code that might solve their
| problem too), that's fair use and legal. My search engine
| has probably stored your code in a partial format, and
| that's fine.
| Hamuko wrote:
| > _If I take your hard work that you clearly marked with a
| GPL license and then make money from it, not quite
| directly, but very closely, how is that fair use? Or
| legal?_
|
| You can wipe your ass with the GPL license if your use of
| the product falls within Fair Use.
|
| You can actually take snippets from commercial movies and
| post them onto YouTube if your YouTube video is
| transformative enough for your usage to be considered fair
| use. Well, theoretically at least - in reality YouTube
| might automatically copyright strike it.
|
| > _Copying and storing a book isn 't recreating another
| book from it._
|
| That doesn't mean that GitHub has to redistribute Copilot
| under GPL. However, the end user could potentially have to
| if they use Copilot to generate new code that happens to
| copy GPL code verbatim.
| _Understated_ wrote:
| > You can wipe your ass with the GPL license if your use
| of the product falls within Fair Use.
|
| Is Copilot fair use? It's reading code, generating other
| code (some verbatim) and making money from it all while
| not having to release its source code to the world?
|
| > That doesn't mean that GitHub has to redistribute
| Copilot under GPL
|
| I wasn't saying that was the case: some of the code that
| Copilot used may not allow redistribution under GPL.
|
| But let's say that all of the code it scanned was GPL for
| the sake of argument. Why would they not have to
| distribute their Copilot source, yet if I use it to
| generate some code, I'd have to distribute mine?
|
| My spidey-sense is tingling at that one!
| tshaddox wrote:
| > Is Copilot fair use? It's reading code, generating
| other code (some verbatim) and making money from it all
| while not having to release its source code to the world?
|
| Again, fair use is an _exception_ to copyright
| protection. If something is fair use, the license does
| not apply. The fact that Copilot does not release its
| source code is related only to a specific term of a
| specific license, which does not apply if Copilot is
| indeed fair use.
| 8note wrote:
| Making money is irrelevant to fair use
| zelphirkalt wrote:
| Irrelevant to GPL maybe.
| ghoward wrote:
| Totally relevant: https://en.wikipedia.org/wiki/Fair_use#
| 1._Purpose_and_charac... .
| IgorPartola wrote:
| If you view GPL code with your browser would that mean that
| your browser now has to be GPL as well? In the sense that
| copilot is not much different than a browser for Stack
| Overflow with some automation, why would it need to be
| GPLed? Your own code on the other hand...
| atq2119 wrote:
| > If you view GPL code with your browser would that mean
| that your browser now has to be GPL as well?
|
| Some good responses in sibling comments already, but I
| don't see the narrow answer here, which is: No, because
| no distribution _of the browser_ took place.
|
| If you created a weird version of the browser in which a
| specific URL is hardcoded to show the GPL'd code instead
| of the result of an HTTP request, and you then
| distributed that browser to others, then I believe that
| yes, you'd have to do so under the GPL. (You might get
| away with it under fair use if the amount of GPL'd code
| is small, etc.)
| keonix wrote:
| To build a browser you don't need verbatim GPL code, so
| it's not a derivative work in the same sense copilot is.
|
| Stackoverflow on the other hand is a much trickier
| question...
| IgorPartola wrote:
| SO clearly doesn't need GPL code to be useful. The wider
| SE network is evidence of that.
| dtech wrote:
| If you use your browser to copy some GPL code into your
| project, your project must now be GPL as well.
|
| So following your own argument, even if Copilot is
| allowed, using it still risks your code falling under the
| GPL.
| IgorPartola wrote:
| My point exactly. Copilot is innocent in that case just
| like the browser.
| dpe82 wrote:
| Or if you simply read GPL code and learn something from
| it - or bits of the code are retained verbatim in your
| memory, are you (as a person) now GPL'd? Obviously not.
| jcelerier wrote:
| > Or if you simply read GPL code and learn something from
| it - or bits of the code are retained verbatim in your
| memory, are you (as a person) now GPL'd? Obviously not.
|
| I do not find that to be obvious at all.
| [deleted]
| zelphirkalt wrote:
| That probably depends on how large and how significant
| the bits you remember are. Otherwise one could take a
| person with photographic memory and circumvent all GPL
| licenses easily, by making that person type what they
| remember.
| johndough wrote:
| For the sake of discussion, it would be clearer to split
| copilot code (not derived from GPL'd works) and the
| actual weights of the neural network at the heart of
| copilot (derived from GPL'd works via algorithmic means).
|
| For your browser analogy, that would mean that the
| "browser" is the copilot code, while the weights would be
| some data derived from GPL'd works, perhaps a screenshot
| of the browser showing the code.
|
| I'd think that the weights/screenshot in this analogy
| would have to abide by the GPL license. In a vacuum, I
| would not think that the copilot code had to be licensed
| under GPL, but it might be different in this case since
| the copilot code is necessary to make use of the weights.
|
| But then again, the weights are sitting on some server,
| so GPL might not apply anyway. Not sure about AGPL and
| other licenses though. There is likely some
| incompatibility between the licenses in there.
| IgorPartola wrote:
| As I understand it, the thing copilot tries to do is
| automate the loop of "Google your problem, find a Stack
| Overflow answer, paste in the code from there into my
| editor". In that sense, the burden of checking the license
| of the code being copy-pasted is on the person who
| answered the SO question and on me. If this literally was
| what copilot did, nobody would bat an eye that some code
| it produced was GPL or any other license because it
| wouldn't be copilot's problem.
|
| Now let's substitute a different database for the code
| that isn't SO. It doesn't really matter if that database
| is a literal RDBMS, a giant git repo or is encoded as a
| neural net. All copilot is going to do is perform a
| search in that database, find a result and paste it in.
| The burden of licensing is still on me to not use GPL
| code and possibly on the person hosting the database.
|
| The gotcha here is that copilot's database is a neural
| network. If you take GPL code and feed it as training
| data to a neural network to create essentially a lookup
| table along with non-GPL code, did you just create a
| derived work? It is unclear to me whether you did or not.
| In particular, can the neural network itself be
| considered "source code"?
| arianvanp wrote:
| Google did not scan those books and use it to build new
| books with different titles. The comparison doesn't hold up
| at all.
| _Understated_ wrote:
| > Google did not scan those books and use it to build new
| books with different titles. The comparison doesn't hold
| up at all.
|
| Not sure if you meant to reply to me but I agree with
| you: you can't compare what Google did to what Copilot
| does.
| PieUser wrote:
| Copilot just _suggests_ code.
| kroltan wrote:
| And someone accepts it. Even if suggesting derivatives of
| licensed code is not a license infringement, then Copilot
| sure is a vector for mass license infringement by the
| people clicking "Accept suggestion". And those people are
| unable to know (without doing extensive investigation
| that completely nullifies the point of the tool) whether
| that suggestion is potentially a verbatim copy of some
| existing work under an incompatible license.
| MadcapJake wrote:
| If I suggest whole lines of dialogue to you, the
| screenwriter, did I write those lines or you? If you
| change names in those lines of dialogue to fit your
| story, do you now gain credit for writing those lines?
|
| Suggesting code is generating code.
| jollybean wrote:
| There are situations where the question is are the
| mishmashes from Copilot 'fair use'.
|
| But the other, more direct question is ... what about the
| instances where Copilot doesn't come up with a learned
| mishmash result? What happens when Copilot just gives you
| a straight up answer from its learning data, verbatim?
|
| Then you, as a dev, end up with a bunch of code that is
| effectively copied, via a 'copying tool', which is GPL'd?
|
| It's that specific case that to me sticks out as the
| 'most concerning part'.
|
| Please correct me if I'm wrong.
| echelon wrote:
| > Even without a license or permission, fair use permits the
| mass scanning of books, the storage of the content of those
| books, and rendering verbatim snippets of those books.
|
| For commercial use and derivative works?
|
| Authors won't incorporate snippets of books into new works
| unless they're reviews. Copilot is different.
| AaronFriel wrote:
| Google Books is a commercial site which incorporated the
| snippets of millions of copyrighted works. And of course,
| sitting in thousands of Google servers/databases are full
| copies of each of those books, photos of each page, the
| OCRed text of each page, and indexes to search them. Even
| that egregious copying without a license or permission was
| considered fair use.
|
| If anything, the ways in which Copilot is different aid
| Microsoft/GitHub's argument for fair use. Because Copilot
| creates novel new works, that gives them a strong argument
| their system is more transformative than Google Books,
| which just presents verbatim copies of books.
| extra88 wrote:
| > Authors won't incorporate snippets of books into new
| works
|
| Of course they do, previous works are quoted all the time.
| e12e wrote:
| But that's another thing - co-pilot doesn't _quote_, it
| encourages something more akin to _plagiarism_, doesn't
| it?
| extra88 wrote:
| Plagiarism, pretending you made a work entirely yourself
| when you didn't, is rarely a matter for a court to decide
| and the standards for what constitutes plagiarism can
| vary a lot. When I turn in projects for a course, I cite
| sources in the comments a lot, even if what I turn in is
| substantially modified. An employer generally doesn't
| care if you copied and pasted code from StackOverflow or
| wherever, so long as you don't expose them to a suit and
| you don't lie if asked "Did you write this 100%
| yourself?"
|
| Citing your source is not a get-out-of-jail-free card for
| copyright infringement, so it doesn't really matter.
| e12e wrote:
| > Citing your source is not a get-out-of-jail-free card
|
| No, but it's a requirement of the license
| stackoverflow.com uses, which is unfortunate, for code
| (as opposed to text, where a quote can be easily
| attributed).
| b3morales wrote:
| ... _with attribution_.
| AaronFriel wrote:
| And without. Attribution isn't a "copyright escape
| clause"; copying a work without permission is still
| infringement - unless it's fair use.
|
| Plagiarism is not the same as infringement.
| jrm4 wrote:
| You're strongly and incorrectly implying that "Fair Use" is a
| clear (and relatively immutable) concept within copyright
| law, which couldn't be further from the truth. Even if this
| or that particular case sets out what appears to be solid
| grounds, one shouldn't take that as gospel by any means.
|
| This mostly has to do with the wishy-washy nature of the
| 4-part Fair Use test, which, unlike decent
| legal tests, doesn't actually have _discrete_ answers. The
| judge looks at the 4 questions, talks about them while waving
| her hands, and makes a decision.
|
| Compare to, e.g., patent law, where you actually do have yes-
| or-no questions. Clean Booleans. Is it Novel? Is it Non-
| Obvious? Is it Useful? If any of the above is "No", then no
| patent for you.
|
| As for the execution of Fair Use, while I haven't gone too
| deep into Software, I can assure you that for music, the
| thing is just a silly holy-hell mess; confirmed most
| recently by the
| "Blurred Lines" case, where NO DIRECT COPYING (e.g. sampling
| or melody taking) was alleged, merely that the song sounded
| really similar to "Got to give it up" and that was enough.
|
| So then, I'd say everything either is, or should be, up in
| the air, when it comes to Fair Use and software.
| extra88 wrote:
| > Is it Novel? Is it Non-Obvious?
|
| Those questions for patents are barely more clear-cut than
| copyright fair use tests; there is lots of room for
| disagreement.
|
| It's definitely true that a fair use defense against
| copyright infringement varies a lot by the field of work
| and norms can develop which are relevant to court cases.
| The music field is a mess, the "Blurred Lines" judgement
| was total bullshit. But the software field is not without
| its own copyright history and norms so there's no reason to
| expect everything to go to hell.
| jrm4 wrote:
| But there's no reason _not_ to either - I suppose my
| point is, don't take too much as gospel and think about
| everybody's best "end-goals" and push or pull with or
| against the law as needed.
| TheSpiceIsLife wrote:
| There's also an aspect of this that varies by size,
| budget, political clout, etc etc, of the individual or
| organisation.
|
| The big guns like Microsoft, Google, Oracle, do this sort
| of thing as a matter of course in their business
| activities, they have the lawyers, the money, and the ear
| of members of parliaments, senators etc.
|
| Whereas an individual or small business probably wants to
| conduct themselves within a more narrow set of
| adherences.
| [deleted]
| [deleted]
| nerdponx wrote:
| Unanswered question, as far as I know: is a trained model a
| derivative work? If the model accidentally retains a copy
| of the work, is that an unauthorized copy?
| MAGZine wrote:
| I think it would be pretty easy to stake out opinions on
| those "boolean questions."
|
| Is (was?) a swipe gesture novel? Is it non-obvious?
| jrm4 wrote:
| Oh, absolutely. Kind of furthers my point. Patent is a
| silly mess in a lot of ways, but _at least_ there's
| something like Booleans in it. "Fair use" doesn't even
| have THAT.
| Arelius wrote:
| I think what the parent is stating is that even though
| the patent questions can be debated, once you settle the
| question "Is it Novel" as yes or no, you can determine if
| the item is patentable... whereas for fair use, the
| questions themselves aren't yes/no questions, and
| further, they are just used as balancing factors, so even
| if everyone agrees on "the effect of the use upon the
| potential market for or value of the copyrighted work"
| it's only weighed as a factor for how fair the use is,
| and broadly left up to the hand-waving of the particular
| judge.
| [deleted]
| cogman10 wrote:
| Most law is wishy-washy. There are very few cut-and-dried
| answers in the law (If there were, we wouldn't need lawyers
| and a court system based on deciphering the law).
|
| All that said, the one thing I'd add about fair use is that
| it isn't permission to use anything you like, but rather a
| defense in a legal proceeding about copyright. It's pretty
| much all about being able to reference copyrighted material
| with the law later coming in and making final decisions on
| whether or not that reference went too far. (I.e., copying
| all of a Disney movie and saying "What's up with this!" vs
| copying one scene and saying "This is totally messed up and
| here's why".)
|
| That was a big part of the Google v. Oracle lawsuit.
| extra88 wrote:
| Yes to all this.
|
| I think the factor most at risk in a fair use test with
| Copilot is whether it ever suggests, verbatim, code that could
| be considered the "heart" of the original work. The John
| Carmack example that's popped up here at least gets closer to
| this question, it was a relatively small amount but it was
| doing something very clever and important.
|
| One can imagine a project that has thousands of lines of code
| to create a GUI, handle error conditions, etc. that's built
| around a relatively small function; if Copilot spat out that
| function in my code, it might not be fair use because it's
| the "heart" of the original work. Additionally, its inclusion
| in another project could affect the potential market for the
| original, another fair use test.
|
| But Copilot suggesting a "heart" is unlikely, something that
| would have to be ruled on in a case-by-case basis and not a
| reason to shut it down entirely. Companies that are risk-
| averse could forbid developers from using Copilot.
| mthoms wrote:
| This is an excellent comment because it captures some
| important nuance missing from other analysis on HN.
|
| I agree with you that the relative importance of the copied
| code to the end product would be (or should be) the crux of
| the issue for the courts in determining infringement.
|
| This overall interpretation most closely adheres to the
| _spirit_ and _intent_ of Fair Use as I understand it.
| otterley wrote:
| If this issue is eventually litigated, we will see. The law
| in the Second Circuit (where the final judgment was rendered
| before the case was eventually settled) may well be different
| than the law in a different circuit. If there is a split in
| the circuit courts, then the Supreme Court may have to weigh
| in on this issue.
|
| When fair use is an issue, the courts look at the facts in
| context each time. These are obviously different facts than
| scanning books for populating a search index and rendering
| previews; and each side is going to argue that the facts are
| similar or that they are dissimilar. How the court sees it is
| going to be the key question.
| tyre wrote:
| This could either be:
|
| 1. a fascinating Supreme Court opinion.
|
| 2. a frustrating ruling because SCOTUS doesn't understand
| software and code.
|
| 3. the type of anticlimactically(?) narrow ruling
| typical of the Roberts court.
|
| While our Congresspersons can't seem to wrap their minds
| around technology/social media, I think SCOTUS would
| understand this one enough to avoid (2).
| otterley wrote:
| Fair use cases tend to produce narrowly-written law
| because the outcomes hinge on how the court judges the
| facts against the list of factors codified in the
| Copyright Act (17 U.S.C. section 107). The courts don't
| really have breathing room to use a different test. I
| don't recall any cases in which the courts have set
| binding guidelines for interpretation of these factors.
| dtech wrote:
| The Google v. Oracle case showed that SCOTUS can handle
| technical topics.
| ElFitz wrote:
| > Note, however, that there is no world-wide principle of
| fair use; what kinds of use are considered "fair" varies from
| country to country.
|
| Exactly the point I came to make.
|
| The Authors' Guild is a US entity, and so is Google, so only
| US law applies. And thus, we have the Fair Use exception.
|
| But developers sharing code on GitHub come from and live all
| over the world.
|
| Now, Github's ToS do include the usual provision stating that
| US & California law applies, et caetera, et caetera [1],
| but - and even they acknowledge this may be the case -
| such provisions usually aren't considered legal outside of
| the US.
|
| So... developers from outside the US, in countries with less
| lenient exceptions to copyright, definitely could sue them.
|
| Identifying these countries and finding those developers,
| however, is a different matter altogether.
|
| [1]: https://docs.github.com/en/github/site-policy/github-
| terms-o...
| croes wrote:
| Can you still apply Fair Use if they make Copilot a paid
| service?
| cmsonger wrote:
| This is a thoughtful and insightful reply. Thank you.
| IgorPartola wrote:
| I think this is the correct answer. IANAL but the copilot
| code vs the copilot training data are different things and
| licensing for one shouldn't affect the other, right? And the
| fact that training data happens to also be code is
| incidental.
| 8note wrote:
| One view would be that copilot the app distributes GPL'd
| code, in a weird encoding. Training the model is a
| compilation step to that encoding
| keonix wrote:
| I assume the model is a derivative work of the training
| data, because given different data the model (the neuron
| weights) would also be different.
| IgorPartola wrote:
| If I read a GPL implementation of a linked list and then
| write my own linked list implementation, was my neural
| network in my brain a derivative work of the GPL code?
| keonix wrote:
| Sure it is; your brain is not software, though.
| IgorPartola wrote:
| So as long as I read GPL code, then rewrite it from
| memory and feed it to copilot to train it, I can unGPL
| anything?
| logifail wrote:
| > Fair use is an exception to copyright itself. A license
| cannot remove your right to fair use.
|
| ...and if you're outside the USA?
| ska wrote:
| > By comparison, Copilot is even more obviously fair use.
|
| You are correct about the (US-specific) fair use exception,
| but it is in no way as clear as you suggest that what copilot
| is doing entirely falls under fair use. Fair use is always
| constrained.
|
| I suspect some variant of this sort of thing will have to be
| tested in court before the arguments are really clear.
| random314 wrote:
| If copilot was trained using the entirety of the Linux
| kernel, wouldn't the neural network itself need to be GPLed,
| if not its output?
| feanaro wrote:
| When the recent Github v. youtube-dl fiasco happened, I
| remember reading similarly strongly-worded but _dismissive_
| comments regarding fair use, stating how it is quite obvious
| that youtube-dl's test code could _never_ be fair use and
| how fair use itself is a vague, shaky, underspecified
| provision of the copyright law which cannot ever be relied
| on.
|
| To me, seeing youtube-dl's case as fair use is so much easier
| than using _hundreds of thousands_ source code files without
| permission in order to build a _proprietary product_.
| 0-_-0 wrote:
| How would you feel about a paid-for search engine using
| _hundreds of millions_ of web pages without permission in
| order to build a _proprietary product_?
| sbelskie wrote:
| You mean like a search engine?
| indymike wrote:
| Books (mostly) are not distributed under the GPL.
| e12e wrote:
| True. But Pretty Good Privacy might be worth considering in
| this context - it was at one point published as a book
| after all...
|
| https://philzimmermann.com/EN/essays/BookPreface.html
| tedunangst wrote:
| Does the GPL forbid fair use? Why don't book publishers use
| a license that forbids fair use?
| AaronFriel wrote:
| Because fair use is an exception to copyright itself. A
| copyright license can't take away your legal right to
| fair use.
| jefftk wrote:
| The GPL only gives you additional permissions relative to
| what you would have by default. The books included in that
| suit were more strongly restricted, since there was no
| license at all.
| indymike wrote:
| There are certainly some interesting additional
| conditions the GPL creates by taking the license away if
| you violate certain clauses. Regardless, the interesting
| part of this is that this looks different from the user's
| point of view and Microsoft's. Sure, 5 lines out of
| 10,000 is probably fair use. For Microsoft, their system
| is using the whole code base and copying it a few lines
| at a time to different people, eventually adding up to
| potentially lots more than fair use.
|
| The question on this one will be about the difference
| between Microsoft/Github's product and a programmer using
| copilot's code:
|
| "If I feed the entire code base to a machine, and it
| copies small snippets to different people, do we add the
| copies up, or just look at the final product?"
| wilde wrote:
| I buy the argument about copilot itself and this comment. But
| when someone goes to release software that uses the output of
| Copilot, I fail to see how it wouldn't be a GPL derivative
| work if enough source was used. Copilot is essentially really
| fancy copy/paste in that context.
| Causality1 wrote:
| Read the Authors Guild v Google dismissal. The court
| considered it fair use because Google's project was built
| explicitly to let users find and purchase books, giving
| revenue to the copyright holders. Copilot does not do that.
| adverbly wrote:
| This was a good point. Really enjoying this discussion.
| Interesting stuff.
|
| I'm really out of my depth in giving my own opinion here, but
| I'm not sure that either the "distribution != derivative"
| characterization, or that "parsing GPL => derivative of GPL"
| really locks this thing down. The bit that I can't follow
| with the "distribution != derivative" argument is that the
| copilot is actually performing distribution rather than
| "design". I would have said that copilot's core function is
| generating implementations, which to me does not seem like
| distribution. This isn't a "search" product, and it's not
| trying to be one. It is attempting to do design work, and I
| could see a case where that distinction matters.
| jrochkind1 wrote:
| If Github hosts AGPL code, does that mean that github's own
| code must be AGPL? Obviously not. What's the difference?
|
| There's no point to copilot without training data, some but not
| all of the training data was (A)GPL. There's no point to github
| without hosting code, some but not all of the code it hosts is
| A(GPL).
|
| The code in either case is _data_ or content; it has not
| actually been incorporated into the copilot or github product.
| mhh__ wrote:
| Code isn't to GitHub what training data is to this model, or
| at least even if you could argue that it is within a current
| framework it shouldn't be.
| b3morales wrote:
| > If Github hosts AGPL code, does that mean that github's own
| code must be AGPL? Obviously not. What's the difference?
|
| GitHub's TOS include granting them a _separate_ license
| (i.e., not the GPL) to reproduce the GPL code in limited ways
| that are necessary for providing the hosting service. This
| means commonsense things like displaying the source text on a
| webpage, copying the data between servers, and so on.
| jdavis703 wrote:
| Assuming that copilot is a violation of copyright on GPL works,
| it would also be a violation of non-GPL copyrighted works,
| including public, but fully copyrighted, works. Therefore
| relicensing others' source code under GPL would violate even
| more copyright.
| zelphirkalt wrote:
| So in that case, of course copilot would have to give license
| info for every. single. snippet. Case solved. Only, they
| will probably not do that.
| andy_ppp wrote:
| Probably they get away with it, but it definitely seems against
| the spirit of the GPL, just as closed-source GitHub existing
| because of open source software seems quite hypocritical.
| fny wrote:
| I think the bigger issue is that use of Copilot puts the end
| user at risk of using copyrighted code without knowing it.
|
| Sure, one could argue that Copilot learned in the way a human
| does. There is nothing that prevents one from learning from
| copyrighted work, but snippets delivered verbatim from such
| works are surely a copyright violation.
| coding123 wrote:
| copilot isn't distributing copies of itself either.
| hedora wrote:
| More interestingly, if we can trick it into regurgitating a
| leaked copy of the Windows source code, Microsoft apparently
| says that's fair use.
| fchu wrote:
| If a company built a tool like Copilot to help students write
| essays, is that considered plagiarism? Probably yes, and the
| reason is that regurgitating blobs of text without actually
| thinking like a human and writing them anew doesn't feel like
| actual work, just direct re-use.
|
| Same thinking probably applies to GitHub Copilot and copyright.
| taywrobel wrote:
| It's already fairly commonplace for news agencies to generate
| articles using ML solutions such as https://ai-writer.com/
|
| So by your logic ABC, CBS, Fox, and NBC have all been
| plagiarizing and violating copyright for doing so? I'm not sure
| if there's been a legal challenge/precedent set in that case
| yet, but that seems like a more apples to apples comparison
| than the Google Books metaphor being used.
|
| Disclosure: I work at GitHub but am not involved in CoPilot
| yccs27 wrote:
| The big question here is: On what data was the model trained?
| Presumably the news stations trained theirs on public-domain
| works and their own backlog of news articles, so even with
| manual copying there would be no infringement. In contrast,
| Copilot was trained on other people's code with active
| copyright.
| taywrobel wrote:
| That's quite a big presumption IMO. Training sets need to
| be quite large in order to produce reasonable output. My
| understanding is that these companies provide the model
| themselves, which seems like it'd be trained on more than
| one company's publications. But I get your point, and
| understand both sides of the argument here.
|
| I think this will end up with a large class action lawsuit
| for sure, tho I really think it's a toss up as to who would
| win it. This conversation was bound to happen eventually
| and we're in uncharted territory here.
|
| I think it's going to hinge on whether machine learning is
| considered equivalent in abstraction to human learning,
| which will be quite an interesting legal, technological,
| and philosophical precedent to set if it goes that way.
| neom wrote:
| Curious what the consensus is on how GH should have approached
| this to avoid such blowback.
|
| Best case scenario: they explain in advance on the GH blog that
| they're going to be doing some work on ML and coding, and they'd
| like people to opt in to their profile being read via a flag
| setting, or by putting a file in the repo that gives permission,
| like robots.txt (see the sketch below). Second best case
| scenario: same as the first, but opt-out vs opt-in. Least ideal:
| not doing the first two, but, when they announced it, explaining
| in detail how the model was trained, what was used, why, and
| when - kinda thing?
|
| Is that generally about right, or..?
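|
| As a sketch of the opt-in file idea: a training crawler
| could honor a robots.txt-style marker in the repo root.
| The ".ai-training" file name and its semantics below are
| invented for illustration - no such GitHub convention
| exists:
|
|     # Hypothetical opt-in check for a training crawler.
|     from pathlib import Path
|
|     def may_train_on(repo_root):
|         marker = Path(repo_root) / ".ai-training"
|         if not marker.exists():
|             return False  # opt-in: absence means "no"
|         return marker.read_text().strip().lower() == "allow"
|
| The opt-out variant would just flip the default to True
| when the marker is absent.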
| tazjin wrote:
| > Second best case scenario
|
| Not really, consider for example repositories mirrored to
| Github.
|
| It seems unclear who has the rights to grant this permission
| anyways (with free software licenses). Probably the copyright
| holder? Who that is might also be complicated.
| iimblack wrote:
| In that hypothetical I wouldn't think GitHub is responsible
| for determining if a repository is mirrored and what the
| implications of that are. They just need to look at what
| license is on the repo in GitHub.
| neom wrote:
| Good point, I would have thought GH requires you to agree in
| some TOS that you have permission to put the code on GH (but
| I don't know)? If so, could that point be put aside? (I'm not
| a software engineer so sorry if that made no sense. Super
| curious about the whole Copilot thing from a business and
| community perspective)
| tazjin wrote:
| > that you have permission to put the code on GH
|
| This is the complicated bit: All open-source licenses grant
| you permission to redistribute the code (usually with
| stipulations like having to include the license), so you
| are almost always allowed to upload the code to Github.
|
| What it doesn't mean however is that you're the copyright
| holder of that code, you're merely redistributing work that
| somebody else has ownership of.
|
| So who gets to decide what Github is allowed to do with it?
|
| I expect this will end up in courts and we won't get a
| definite answer before that.
| neom wrote:
| If you'll entertain me on a hypothetical for a moment.
| Suppose then the copious amount of intelligent folks over
| at GH _know_ this will eventually end up in the courts,
| and expected that from the start. Would you suggest they
| messaged/rolled it out any differently? Did they do
| exactly what they needed to do so that it _did_ end up in
| the courts? Should they have done anything differently to
| not piss folks off so much? Sorry for the million
| questions, you seem to know/have thought a bit about
| this. Thanks! :)
| lukeplato wrote:
| They should have only used code from projects that included a
| license that allows for commercial use, or made their model
| openly available and/or free to use
| whimsicalism wrote:
| How does attribution work then?
| BlueTemplar wrote:
| Code (co)created with Copilot has to follow all the licenses of
| the source (heh) code. This generally means at the very least
| automatically including in projects getting help from Copilot a
| copy of all the licenses involved, and attribution for all the
| people the code of which Copilot has been trained on.
|
| (Not sure about the cases where there is no license and
| therefore normal copyright applies, but AFAIK this isn't the
| case for any code on Github, which automatically gets an open
| source licence?)
|
| EDIT: Code in public repositories seems to be "forkable" on
| Github itself but not copyable (to elsewhere). That's some
| nasty walled garden stuff right there; I wonder how legal that
| ToS is? I could see how this could incentivize people to stop
| using other licenses on Github, to not have to deal with this
| license mess... EEE yet again?
| neom wrote:
| So I guess then, the first thing they should have done, is
| trained it to understand licenses, and used that as a first
| principle for how they built the system?
| BlueTemplar wrote:
| Seems to be too much effort (is it even possible to link
| the source to the end result?), and might not be
| admissible, so just include a database with all of the
| relevant licenses and authors?
| whimsicalism wrote:
| Is it a derivative work of GPL licensed work if it is
| trained on the license? Is the GPL license text under GPL?
| [deleted]
| BlueTemplar wrote:
| > GNU GENERAL PUBLIC LICENSE
|
| > Version 3, 29 June 2007
|
| > Copyright (c) 2007 Free Software Foundation, Inc.
| <https://fsf.org/>
|
| > Everyone is permitted to copy and distribute verbatim
| copies of this license document, but changing it is not
| allowed.
| rwmj wrote:
| So would a way to do this be to train multiple models on each
| different code license (perhaps allowing compatible licenses to
| cohabit) and then have Copilot identify the license of the target
| project and use the appropriate model?
|
| It might have an interesting feedback effect that some licenses
| which are more popular would presumably have better Copilot
| recommendations, which would produce better and thus more popular
| code for those licenses. Although maybe this happens already.
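|
| A minimal sketch of what that routing could look like (the
| model registry and license identifiers here are made up;
| no such Copilot feature exists):
|
|     # Route completions to a model trained only on code
|     # under licenses compatible with the target project.
|     MODELS = {
|         "MIT": "model-permissive",      # MIT/BSD/Apache
|         "Apache-2.0": "model-permissive",
|         "GPL-3.0": "model-gpl",  # permissive + GPL inputs
|     }
|
|     def pick_model(project_license):
|         try:
|             return MODELS[project_license]
|         except KeyError:
|             raise ValueError(
|                 "no model trained for " + project_license)
|
|     print(pick_model("GPL-3.0"))  # -> model-gpl
|
| The feedback effect would then show up as the per-license
| training corpora diverging in size and quality.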
| fleddr wrote:
| To me, the particular use case and whether it is fair use or not,
| is of minor interest. A far more pressing matter is at hand: AI
| centralization and monopolization.
|
| Take Google as an example, running Google Photos for free for
| several years. And now that this has sucked in a trillion photos,
| the AI job is done, and they likely have the best image
| recognition AI in existence.
|
| Which is of course still peanuts compared to training a super AI
| on the entire web.
|
| My point here is that only companies the size of Google and
| Microsoft have the resources to do this type of planetary scale
| AI. They can afford the super expensive AI engineers, have the
| computing power and own the data or will forcefully get access to
| it. We will even freely give it to them.
|
| Any "lesser" AI produced from smaller companies trying to compete
| are obsolete, and the better one accelerates away. There is no
| second-best in AI, only winners.
|
| If we predict that ultimately AI will change virtually every
| aspect of society, these companies will become omnipresent,
| "everything companies". God companies.
|
| As per usual, it will be packaged as an extra convenience for
| you. And you will embrace it and actively help realize this
| scenario.
| ncr100 wrote:
| I think it's an innate quality of technology.
|
| Yes, sophisticated AI tech concentrates power for those who
| already have power.
|
| And the technology we all (presumably readers of HN) create can
| enhance the impact of the user. And this can result in unfair
| circumstances, in reality.
|
| Law and force can prevent disproportionate use of power. Of
| course one must define the law, which may be done AFTER the
| offense has been committed. Further, if those who make the laws
| are corrupted by those with e.g. this AI tech power, then no
| effective law may be enacted and the hypothetical abuse will
| continue.
| mikewarot wrote:
| I have about 300,000 photos that haven't been scanned by AI
| (unless someone at Backblaze did it without permission). I'm
| sure there are lots of other photographers out there who miss
| Picasa, which Google killed off to push everyone's data to
| their service. (It did really well in matching faces, even
| across ages, but the last version has a bug when there are
| multiple faces in a picture; sometimes it swaps the labels.)
|
| If there were offline image recognition we could train on our
| own data privately, _could_ the results of those trainings be
| merged to come up with better recognition on average than any
| one person could do themselves with their own photos?
|
| In other words, would it be possible for us to share the
| results of training, and build better models, without sharing
| the photos themselves?
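|
| Something like federated averaging seems like the obvious
| candidate here. A minimal sketch (glossing over the real
| privacy and convergence subtleties; the numbers are made
| up):
|
|     # Each person trains locally and shares only their
|     # weights plus a sample count; the merge is the
|     # sample-weighted mean of the weights.
|     import numpy as np
|
|     def fed_avg(weights, sizes):
|         total = sum(sizes)
|         merged = np.zeros_like(weights[0])
|         for w, n in zip(weights, sizes):
|             merged += (n / total) * w
|         return merged
|
|     w = [np.array([0.2, 1.0]), np.array([0.4, 0.8])]
|     n = [300_000, 50_000]  # photos per person
|     print(fed_avg(w, n))
|
| Whether the merged model can still leak anything about
| individual photos is its own research question.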
| hexa22 wrote:
| There is a second best though. Apple offers image AI which is
| worse than Google's but wins because it works offline.
| mtrn wrote:
| The final step is to break down these monopolies. The
| government can do that and has done it before.
| jspaetzel wrote:
| If Google makes an amazing model that no one can beat, it will
| only be dominant as long as others get access to it freely. But
| if there are restrictions on access or if it's too expensive,
| other options will appear and even if they're not as perfect,
| they'll still be very usable. Imagine a coalition of companies
| all feeding data, that could compete just as well.
| est31 wrote:
| Google has all the data of all the users though. I'd wager
| that they won't just let AI companies scrape it.
| hartator wrote:
| GitHub CoPilot is clearly fair use. Having a ruling that says
| it isn't would be a regression. Please don't.
| pizza wrote:
| Please don't get mad at me but my question is genuinely: so what?
| Why does it matter? Can't you violate licenses in a tedious
| manner just by Googling + copy pasting blindly already? Genuinely
| looking to understand the consensus here.
| ghoward wrote:
| This is why I relicensed my code [1] yesterday to a license I
| wrote [2], which is designed to poison the well for machine
| learning.
|
| [1]: https://gavinhoward.com/2021/07/poisoning-github-copilot-
| and...
|
| [2]: https://yzena.com/yzena-network-license/
| speedgoose wrote:
| Since you allow new versions by default, can't someone just
| release a new version of your license allowing everything they
| want?
| ghoward wrote:
| That is a good point, but easily fixed. Will do that now.
|
| Edit: done. They are under the CC-BY-ND license now.
| jkaplowitz wrote:
| Do the GitHub Terms of Service give them the necessary
| permissions for Copilot, independently of the license? (I
| honestly don't know the answer; this is a straight question.)
| ghoward wrote:
| I don't know. Not knowing is why I pulled all of my
| code (except for a permissively-licensed project that people
| actually depend on the GitHub link for) off of GitHub.
| shakna wrote:
| > The licenses you grant to us will end when you remove Your
| Content from our servers, unless other Users have forked it.
| [0]
|
| I don't see how they can keep this clause, and then have a
| service that recites/redistributes code, based on a model
| that has already ingested said code.
|
| > This license does not grant GitHub the right to sell Your
| Content. It also does not grant GitHub the right to otherwise
| distribute or use Your Content outside of our provision of
| the Service, except that as part of the right to archive Your
| Content, GitHub may permit our partners to store and archive
| Your Content in public repositories in connection with the
| GitHub Arctic Code Vault and GitHub Archive Program. [1]
|
| Copilot is distributing verbatim code when it regurgitates,
| which seems a pretty clear violation of this clause. (If it
| wasn't regurgitating, they'd have caselaw for fair use.
| But... It is.)
|
| [0] https://docs.github.com/en/github/site-policy/github-
| terms-o...
|
| [1] https://docs.github.com/en/github/site-policy/github-
| terms-o...
| gjs278 wrote:
| good luck enforcing this. you are nobody and no court will hear
| your case.
| p0ckets wrote:
| I think to actually poison the well, we should add code to
| existing repos with dead code clearly labelled as "the way that
| things shouldn't be done" that are wrong in subtle ways. So
| every time we fix a security issue, we keep the version with
| the bug with some comments indicating what's wrong with it. Of
| course, this only works until the AI is trained to weigh the
| code based on how often the code is called.
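|
| For instance, a repo could keep something like this around
| (a deliberately broken variant, clearly labelled; whether
| a model would learn the label along with the bug is
| exactly the open question):
|
|     # DO NOT USE - kept to show the wrong way. Vulnerable
|     # to SQL injection via string formatting.
|     def get_user_WRONG(conn, name):
|         return conn.execute(
|             "SELECT * FROM users WHERE name = '%s'" % name)
|
|     # Fixed version: parameterized query (sqlite3-style).
|     def get_user(conn, name):
|         return conn.execute(
|             "SELECT * FROM users WHERE name = ?", (name,))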
| ghoward wrote:
| That is a funny idea. Personally, too much work for me, and
| Copilot probably generates subtly wrong code already.
| nradov wrote:
| The notion of intentionally polluting and overcomplicating
| your code base just to "poison the well" is bizarre. Talk
| about cutting off your nose to spite your face.
|
| If you don't want others to use your code then the solution
| is very simple. Keep it on a secure private server and don't
| publicly release it.
| ghoward wrote:
| Keeping it private is one option, but I really want my end
| users to have the freedom to modify the code for
| themselves.
| CyberRabbi wrote:
| That's safe but it's probably not necessary to be protected
| from what GitHub, OpenAI, and Microsoft are doing. When these
| licenses were crafted there was no reasonable expectation that
| companies could use ML applications as a loophole in existing
| copyright licenses, so just because there is no explicit clause
| denying it doesn't mean they are in the clear for using
| copyright-protected code that way. Licenses give permission,
| they don't revoke it.
|
| Copyright is broad, licenses are minimal. This must be the case;
| otherwise they would not be very effective at protecting the
| work of creators. There is no explicit allowance for what
| GitHub is doing in most licenses so they do not have general
| permission to do so.
| ghoward wrote:
| I agree; my blog post says so.
|
| What my licenses are supposed to do is sow even more doubt in
| companies' minds about models trained on my code.
| richardwhiuk wrote:
| If it's allowed by fair use, your license is irrelevant. If
| it's not, your license doesn't matter.
| ghoward wrote:
| In my blog post, I talk about how training is fair use, but
| we don't know about distributing the _output_. These
| licenses, even if they don't work, are designed to poison
| the well by putting enough doubt into companies' minds that
| they would not want to use Copilot if it has been trained
| with my relicensed code.
| laurowyn wrote:
| Could this be the beginning of the true test of open source
| licenses? My understanding is that there has never been a ruling
| by a court to give precedence to the validity or scope of any
| open source license. I can see a class action suit coming on
| behalf of all GPL licensed code authors.
| jefftk wrote:
| GitHub used code that wasn't under any license at all, just
| publicly visible. Their claim is not that the license allows
| what they're doing, but that they do not need a license.
| laurowyn wrote:
| Which is a different issue to my point, but still very valid.
| What terms are implied if no license is specified? I would
| argue attribution should be expected if used, but I also
| wouldn't go near any code without a specific license attached,
| as there's no express permission given - just because a
| license isn't disclosed doesn't mean it isn't there.
|
| You can't go copying anything and everything just because
| nobody has told you that you can't. And I feel that's part of
| the purpose behind the GPL: force a license on derivative code
| so that at least there are clear rights moving forwards.
| jefftk wrote:
| It's stronger than that: if GitHub is correct that they
| don't need a license then they are allowed to train on
| publicly visible code even if it is labeled with "no one
| has any permission to use this for anything at all,
| especially training models".
| laurowyn wrote:
| Which is why I think this could be a big turning point.
| IMO, GitHub is breaking licenses. If an ML algorithm
| ingests a viral licensed block of code, its outputs
| should be tainted with that license as it's a derived
| work. Otherwise I can make a program reproduce whole
| repositories license free, so long as I can claim "well,
| the AI did it, not me!" It's produced something based on
| the original work, therefore it should follow the license
| of the original. And that issue is exacerbated by the
| mixture of licenses available - they will all apply at
| the same time, and not all are compatible.
|
| I would hope GitHub (and Microsoft) did the legal work to
| cover this, and not just ploughed ahead with the plan to
| drown any legal challenges. From my perspective, they're
| doing the latter.
| jcelerier wrote:
| What ? There have been plenty of GPL cases defended in court.
|
| https://en.m.wikipedia.org/wiki/Open_source_license_litigati...
| laurowyn wrote:
| All of the copyright cases were settled, so no precedent is
| set. Open source as a contract has been ruled legal, and
| licensors can sue for breach of contract - which is not the
| same as copyright infringement.
|
| I think my point still stands.
| crazygringo wrote:
| I mean, if it's considered "fair use" legally (which is surely
| their position), then why wouldn't they?
|
| Why would they distinguish between licenses if there's no legal
| need to?
|
| Licenses are only restrictions _on top_ of fair use. Licenses can
| 't restrict fair use.
|
| It would be interesting if someone takes them to court and a
| judge definitively rules on fair use in this particular case. Or
| I don't know if there's enough precedent here that the case would
| never even make it to trial. But with a team of top-paid
| Microsoft lawyers that gave this the green light, I'm pretty sure
| they're quite confident of the legality of it.
| 6gvONxR4sf7o wrote:
| I've figured out why ML-based fair use arguments for generative
| models feel dirty to me.
|
| Imagine a scenario where you'd love to have access to a large
| number of my digital widgets, but they're expensive to make or
| buy, and a large number of them is _really_ expensive. So you
| train an ML model on my things you can't afford to buy. It's
| still expensive, but that's a one-time cost. Spend $5M training
| GPT-3, it's fine. Now you can _sample from the space of my
| digital widgets_. You have gotten a large number of widgets, just
| by throwing money at AWS. With money, you have converted my
| widgets into your widgets, and I'll never see a cent of it.
|
| That's the issue. Content is expensive and it's still needed.
| Traditionally, I make content and if you want to benefit from my
| labor, you pay me. In the future, if you want to benefit from my
| labor, you pay AWS instead.
|
| tl;dr The most significant equation for generative models is "$$$
| + my stuff = your stuff"
| remram wrote:
| In addition, the model is going to spit out widgets that are
| combinations of the existing ones, if it doesn't outright copy.
| This is different from a human who is going to put their own
| creativity into it (and will be accused of plagiarism if they
| don't): the model has no creativity to offer on top of the
| unlicensed input.
___________________________________________________________________
(page generated 2021-07-08 23:00 UTC)