[HN Gopher] All public GitHub code was used in training Copilot
___________________________________________________________________
All public GitHub code was used in training Copilot
Author : fredley
Score : 724 points
Date : 2021-07-08 08:18 UTC (14 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| iseethroughbs wrote:
| I treat Copilot as literally a programmer in pair programming.
| Which means that if it's trained, i.e. it has "seen" GPL code,
| then it's tainted, and we should treat the resulting code as
| GPL code.
|
| Replace "GPL" with the most restrictive license that's on GitHub,
| but you get the point.
|
| They're kinda shooting themselves in the foot, because this
| reduces the commercial potential of the tool to almost nothing.
| ksec wrote:
| All I know is this Copilot opens a whole can of worms. And it
| doesn't, and possibly never will, have a right answer without a
| court settling it.
|
| Obviously most (I think) lawyers seem to be siding with
| Microsoft on fair use. But most owners of the code seem to think
| Microsoft is infringing on their work.
|
| Then there is the international issue, because one court can't
| decide for everyone else.
|
| I think the issue is important enough that I wonder if we could
| somehow crowdfund a court trial or something.
| rambambram wrote:
| I'm a programmer and also studied law for some time. These
| stories make me - once more - realize the old adage: "Possession
| is nine tenths of the law." Don't host that code in the cloud (or
| a better term, someone else's dirty bucket). What happened to
| developers hosting stuff on their own website!?
| jefftk wrote:
| It doesn't matter where the code is hosted, just that it is
| publicly accessible. If developers hosted code on their own
| sites, someone could still scrape them and use that to train
| models.
|
| (The question of whether this is sufficiently transformative to
| count as fair use is still wide open)
| BuildTheRobots wrote:
| > It doesn't matter where the code is hosted, just that it is
| publicly accessible. If developers hosted code on their own
| sites, someone could still scrape them and use that to train
| models.
|
| I'd suggest it makes it more interesting. If it's self-
| hosted, then the host can choose to impose restrictions on
| server access, including no automated scraping, rather than
| trying to impose licensing on the code itself.
| ghoward wrote:
| This is why I have now moved my code off of GitHub.
| [deleted]
| elliekelly wrote:
| Anecdata: I'm a lawyer and programmer and my clients (large
| financial institutions) are increasingly insisting on hosting
| as much on-site as possible. It costs more, it can make it
| difficult to select vendors/service providers, and it's not
| without business continuity risks which they take steps to
| mitigate.
|
| But I think more and more companies, particularly those in
| highly regulated industries, are deciding that the benefit of
| controlling the data -- access, security, privacy, and
| understanding _who_ , exactly, it's being shared with --
| outweighs the risks of someone else having that control.
| radiator wrote:
| So some professionals will have the chance to migrate systems
| back to on-premises, after having migrated them from on-
| premises to the cloud? Interesting.
| ralph84 wrote:
| GitHub's argument isn't that you hosted your code on GitHub and
| therefore gave them a license to use it to train their model.
| GitHub's argument is they don't need a license to train their
| model because it's fair use. Hosting your code somewhere else
| doesn't prevent fair use. If you don't want your code used to
| train ML models, don't host it anywhere.
| rambambram wrote:
| I get it, but that's already a legal argument. I was trying
| to zoom out from the unavoidable legal argumentative
| deadlock: if GH does not have your code hosted on their
| servers, it becomes way harder for 'them' to grab it and rape
| it. Your own domain is - of course - also out in the open,
| but at least you can have more control.
| tannhaeuser wrote:
| > _What happened to developers hosting stuff on their own
| website!?_
|
| Devs were hoping for stars and network effects rather than
| listening to those of us who felt uncomfortable sending all
| traffic to GH. Something like Copilot or even a coding bot was
| predicted two years ago already.
| dariusj18 wrote:
| Wouldn't this question already have been asked and answered when
| AIs were trained on books and articles?
| zinekeller wrote:
| As far as I know, there isn't a formal copyright-related US
| court ruling (yet anyway) on training ML/AIs on any media
| (except for copying the code of an ML). So everything is
| actually on thin ice, much like the infamous "GIFs _[from
| snippets of shows etc.]_ are widely believed to be fair use",
| which in reality is still untested. Let's not forget other
| countries, with much stricter copyright rules (especially moral
| rights).
| _Understated_ wrote:
| Ok, my curiosity has been fired here...
|
| I have conjured up two scenarios here:
|
| Let's say I use copilot to generate a bunch of code for an app,
| something substantial, and it regurgitates a load of bits and
| pieces from many sources it got from GitHub, I'd assume there
| won't be any attribution in it... it will be as if Copilot made
| the code itself (I know it sort of does, but let's not split
| hairs!). I'm guessing the prevailing theory (from GitHub anyway)
| is that I'm legitimately allowed to do this.
|
| Now, let's say I generated all that code by manually copying and
| pasting chunks of code from a whole bunch of repos, whether they
| are open source, unlicensed, whatever. Would I not be ripe for
| legal issues? I could potentially find all the code that copilot
| generated and just copy and paste it from each of the sources and
| not mention that in my license. What if I told everyone "yeah, I
| just copied and pasted this from loads of Github repos and didn't
| put any attribution in my code". I'd assume that (morality aside)
| I'd be asking for trouble!
|
| Am I missing something? Am I misunderstanding the situation, or
| the capabilities of copilot?
| [deleted]
| BlueTemplar wrote:
| Copilot is just a tool; legally it cannot "make code". You're
| the one making it.
|
| See also : Napster, including how it was condemned for
| facilitating copyright infringement (what Microsoft is risking
| here, though the offense is likely to be much milder, of
| course).
| schneidmaster wrote:
| There's a decent bit of caselaw indicating that computers
| reading and using a copyrighted work simply "don't count" in
| terms of copyright infringement -- only humans can infringe
| copyright. This article[0] does a pretty good job of
| summarizing the rationale that the courts have provided. My
| (non-lawyer) take is that GitHub is pushing this just half a
| step farther -- if computers can consume copyrighted material,
| and use it to answer questions like "was this essay
| plagiarized", then in GitHub's view they can also use it to
| train an AI model (even if it occasionally spits back out
| snippets of the copyrighted training data). Microsoft has
| enough lawyers on staff that I'm sure they have analyzed this
| in depth and believe they at least have a defensible position.
|
| [0]: https://slate.com/technology/2016/08/in-copyright-law-
| comput...
| mysterydip wrote:
| Makes me wonder what would happen if a similar thing was done
| with books. If I train an AI on all the texts of Tom Clancy,
| or Stephen King, or every Star Wars novel, and the books it
| generates every so often produce paragraphs verbatim from one
| of those sources, would copyright owners be up in arms? What
| would the distinction be between the code case and the text
| case?
| xsmasher wrote:
| This will surely happen within the next few years; but if
| the "new work" contains a full paragraph from an existing
| novel the copyright hammer would come down hard.
|
| Maybe it needs to be paired with another network / hunk of
| code that checks for verbatim copying?
| shagie wrote:
| I am not a lawyer. I do photography and have a more than
| passing interest in copyright as it applies to the
| photographs I take and the material I photograph.
|
| Copyright on art gets more interesting / fuzzier. The key
| part is substantial similarity -
| https://en.wikipedia.org/wiki/Substantial_similarity and
| https://www.photoattorney.com/copyright-infringement-for-
| sub...
|
| Rather than text, my AI copyright hypothetical... consider
| a model created based on sunset photographs. You take a
| regular photograph, pass it through the model, and it
| transforms it into a sunset. The model was trained on
| copyrighted works but the _model_ is considered fair use.
|
| Now, I go and take a photograph from some location during
| the day and then pass it through the transformer and get a
| sunset. Yay me! Unbeknownst to me, that location is a
| favorite location for photographers and there were sunsets
| from that location used in the training data. My
| photograph, transformed to look like a sunset is now
| similar to one of them in the training data.
|
| Is my transformed photograph a derivative work of the one
| in the training data to which it bears similarity? How
| would a judge feel about it? How does the photographer
| whose photograph was used in the training data feel?
| TaylorAlexander wrote:
| What would be interesting in that case would be how the
| transformed image would look if photos from that location
| were removed from the training set. That would help
| reveal whether it was just copying what it had seen or it
| actually remembered what sunsets looked like and
| transformed the image using its memory of sunsets in
| general.
| [deleted]
| _Understated_ wrote:
| I don't doubt that an army of lawyers has pored over this,
| but they have size on their side: the cost of litigation vs
| potential revenue will be a massive factor.
|
| Edit: > There's a decent bit of caselaw indicating that
| computers reading and using a copyrighted work simply "don't
| count" in terms of copyright infringement.
|
| That means their computer can read any code it wants, do
| whatever it wants with the code, then they can monetise that
| by giving YOU the code. Would they then be indemnified by
| saying "no Microsoft human read or used this code"?
|
| However, if you then use the code and look at it, does that
| make you liable?
| schneidmaster wrote:
| Again, not a lawyer, just a guy who likes reading this
| stuff. The devil is usually in the details of copyright
| cases. The Turnitin case hinged substantially on whether
| Turnitin's use of copyrighted essays was "fair use". There
| are four factors[0] which determine fair use; the two more
| relevant factors here are "the purpose and character of
| your use" and "the effect of the use upon the potential
| market". The court found that Turnitin's use was highly
| "transformative" (meaning they didn't just e.g. republish
| essays; they transformed the copyrighted material into a
| black-box plagiarism detection service) and also found that
| Turnitin's use had minimal effect on the market (this is
| where "computers don't count" comes in -- computers reading
| copyrighted material don't affect the market much because a
| computer wasn't ever going to buy an essay).
|
| I would be shocked if GitHub's lawyers didn't argue that
| using copyrighted material as training data for an AI model
| is highly transformative. There may be snippets available
| from the original but they are completely divorced from
| their original context and virtually unrecognizable unless
| they happen to be famous like the Quake inverse square root
| algorithm. And I think GitHub's lawyers would also argue
| that Copilot's use does not affect the _original_ market --
| e.g. it does not hurt Quake's sales if their algorithm is
| anonymously used in a probably totally unrelated codebase.
|
| Your counterexample would probably fail both tests -- it's
| not transformative use if your software hands out complete
| pieces of copyrighted software, and it would definitely
| affect the market if Copilot gave me the entire source code
| of Quake for my own game.
|
| [0]: https://fairuse.stanford.edu/overview/fair-use/four-
| factors
| _Understated_ wrote:
| I thought I understood fair use but turns out I was
| wrong...
|
| That being said, creating a transformative work from
| something else is considered fair use. So, for example,
| if I read a whole bunch of books and then, heavily
| influenced by them, create my own, similar book, that
| would be fair use I suppose... that makes sense.
|
| But, where does the derivative works come in? Where do
| you draw the line?
|
| If I am heavily influenced by billions of lines of other
| people's GPL code (a la Copilot!), then I create my own
| tool from it and keep my code hidden, does that not mean
| I am abusing the GPL license?
| schneidmaster wrote:
| That's what I meant by the devil being in the details --
| these gray area questions hinge on the specific facts.
| Lawyers on both sides will argue which factors apply
| based on past caselaw and available evidence, and the
| court renders a decision. For example, from the Stanford
| webpage I previously linked: "the creation of a Harry
| Potter encyclopedia was determined to be "slightly
| transformative" (because it made the Harry Potter terms
| and lexicons available in one volume), but this
| transformative quality was not enough to justify a fair
| use defense in light of the extensive verbatim use of
| text from the Harry Potter books". So you _might_ be okay
| creating a Harry Potter encyclopedia in general, but not
| if your definitions are copy/pasted from the books, but
| you might still be okay quoting key lines from the books
| if the quotes are a small portion of your encyclopedia.
| The caselaw just doesn't lend itself to firm lines in the
| sand.
| LocalH wrote:
| That's funny, because the bedrock of copyright - insofar as
| software is concerned - is entirely predicated on the idea
| that a computer copying code into RAM to execute it is indeed
| a copyright violation outside of a license to do so.
| blendergeek wrote:
| > There's a decent bit of caselaw indicating that computers
| reading and using a copyrighted work simply "don't count" in
| terms of copyright infringement -- only humans can infringe
| copyright.
|
| I have read variations of "computers don't commit copyright"
| more times than I can count in the past few days.
|
| How is Copilot different from a compiler? (Please give me the
| legal answer, not the technical answer. I know the difference
| between Copilot and a compiler, technically.)
|
| Isn't a compiler a computer program? How is its output
| covered by copyright?
|
| Am I fundamentally misunderstanding something here?
| someone7x wrote:
| You just blew my mind with that analogy. I can only imagine
| some hair-splitting logic to rationalize a distinction.
| ghoward wrote:
| The analogy goes even further if you consider compiler
| optimizations: https://gavinhoward.com/2021/07/poisoning-
| github-copilot-and... .
| cormacrelf wrote:
| "Computers don't commit copyright" is a complete misreading
| or misunderstanding of another proposition, that "computers
| cannot author a work".
|
| Authoring is the act that causes a work to be
| copyrightable. In most jurisdictions, authoring a work
| _automatically_ causes copyright to subsist in the work to
| some degree. The purpose of the copyright system is to
| encourage people to author new, original works, by
| rewarding those who do with exclusive rights. It is well-
| known that only humans can author a work. Computers simply
| cannot do it. If your computer (by some kind of integer
| overflow UB miracle) accidentally prints out a beautiful
| artwork, NOBODY has exclusive copyright over it, and anyone
| may reproduce it without limitation. Same goes for that
| monkey who took a selfie.
|
| What a compiler does, on the other hand, is adapt a work.
| Adapting a work is not authoring it. Sometimes when you
| adapt a work, you also author some original work yourself,
| like when you translate a book into another language. When
| a compiler (not a linker) transforms source code, it
| absolutely, 100% definitely does NOT add any original work;
| the executable or .so/.a/.dylib/.dll file is simply an
| adaptation of the original work. The copyright-holder of
| the source code is the copyright-holder of the machine
| code. An adaptation is also known as a "derivative work".
|
| (Side note; copyleft licenses boil down to some variation
| of "if you adapt this, you have to share everything in the
| derivative work, not just the bits you copied.")
|
| Adaptation is a form of reproduction. It's copying.
| "Distribution" also often involves copying, at least on the
| internet. (Selling or giving away a book you have purchased
| does not constitute copying.) Copying is one of the
| exclusive rights you have when you own the copyright in a
| work, that you may then license out.
|
| It gets more complicated when the computer uses fancy ML
| methods to produce images/text out of things it has
| seen/read. You can't simplify the law around that to a
| simple adage digestible enough to share memetically on HN
| and Twitter. One thing is certain: if the computer did it,
| by itself, then no original work was authored in the
| process. That poses a problem for people who write the name
| of a function and get CoPilot to write the rest; if you do
| that, you are not the author of that part of the program.
| If you use it more interactively that's a different story.
|
| There is, however, always a question of whether the
| copyright in the original works the computer used _still
| subsists_ in the output.
|
| My rough framing of the licensing issues around CoPilot is
| therefore as follows:
|
| 1. The source code to CoPilot is an original work, and the
| copyright is owned by GitHub.
|
| 2. When GH trained CoPilot's models on other people's
| works, was that copying? (This one is partially answered.
| It can spit out verbatim fragments, so it must be copying
| to some extent, rather than e.g. actually learning how to
| code from first principles by reading.) If it was not all
| copying, how much of it was copying and how much of it was
| something else? What else was it?
|
| 3. If GH adapted the originals, what is the derivative
| work? (I.e. where does the copyright subsist now? Is it a
| blob of random fragments of code with some weights in a
| neural network?)
|
| 4. Which works is it an adaptation of? You might think "all
| of them, and for each one, all of the code" but I'm not so
| sure. For example, imagine the ML blob contains many
| fragments, but some are shorter than others. If your
| program has "int x;" in it, and CoPilot can name a variable
| "x", you can hardly claim that as your own. I'm most
| interested in whether the mere fact of CoPilot having
| digested ALL of it, having fed this into the mix and
| producing a ML blob based on all that information, means
| that the ML blob is a derivative work of all of them. Or
| whether there is some question of degree.
|
| 5. Fair use. Was it fair use to train the model? Is it,
| separately or not, fair use to create a commercial product
| from the model and sell it? Fair use cares about commercial
| use, nature of the copied work, amount of copying in
| relation to the whole, and the effect on the market for /
| value of the copied work. Massive question.
|
| 6. If not fair use, then GH is subject to the licenses and
| how they regulate use of the works. What license conditions
| must GH comply with when they deal with the derivative
| work, and how? Many will be tempted to jump straight to
| this question and say GH must release the source code to
| CoPilot. I'm not yet convinced that e.g. GPL would require
| this. I can't believe I'm writing this, but is the ML blob
| statically or dynamically linked? Lol.
|
| 7. Final question: is there some way to separate out works
| which were copied with fair use (or not copied at all),
| from works which were copied with no fair use? People are
| worried about code laundering, e.g. typing the preamble to
| a kernel function and reproducing it in full. In that
| situation, it is fairly obvious that the end user has
| ultimately copied code from the kernel and needs to abide
| by GPL 2.0; moreover if they're using CoPilot to write out
| large swathes of text they will naturally be alert to this
| possibility and wary of using its output. But think of the
| converse: if there is no way to get CoPilot to reproduce
| something you wrote, what's the substance of your
| complaint? Is CoPilot's model really a derivative of your
| work, any more than me, having read your code, being better
| at coding now? Strategically, if you wanted to get GH to
| distribute the model in full, you might only need one
| copyleft-licensed, verbatim-reproducible work's owner to
| complain. But then they would just remove the complainant's
| code. You might be looking at forcing them to have a "do
| not use in CoPilot" button or something.
| jeremyjh wrote:
| I think this is more cogent analysis than anything else
| I've seen yet on this topic. You should consider
| submitting a blog post so this can become a top-level
| topic.
|
| Also, I loved this quote:
|
| > Copying is one of the exclusive rights you have when
| you own the copyright in a work, that you may then
| license out.
|
| I've been paying attention to software copyright topics
| for more than twenty years and never thought of it in
| exactly these terms. It's right there in the name - the
| right to copy it - and to determine the terms under which
| others can copy it is exactly what a copyright is!
| [deleted]
| jeremyjh wrote:
| What if I made a few tweaks to Copilot so that it is very
| likely to reproduce large chunks of verbatim code that I
| would like to use without attribution, such as the Linux
| kernel? Do you really think you can write a computer
| program that magically "launders" IP?
|
| A compiler is run on original sources. I don't see any
| analogy here at all.
| ghoward wrote:
| * They both process source code as input.
|
| * They both produce software as output.
|
| * They both transform their input.
|
| * They both can combine different works to create a
| derivative work of each work. (Compilers do this with
| optimizations, especially inlining with link-time
| optimization; see the toy sketch below.)
|
| They really do the same things, and yet, we say that the
| output of compilers is still under the license that the
| source code had. Why not Copilot?
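|
| A toy sketch of that combining effect (hypothetical files,
| not from any real project): with link-time optimization, the
| compiler can fuse two differently-licensed works into one
| piece of machine code.
|
|     /* a.c -- imagine this file came from a GPL project */
|     int gpl_helper(int x) { return x * 2 + 1; }
|
|     /* b.c -- imagine this file came from an MIT project */
|     #include <stdio.h>
|     int gpl_helper(int x);
|     int main(void) {
|         /* built with `cc -O2 -flto a.c b.c`, the optimizer
|            may inline gpl_helper's body here, so the emitted
|            machine code derives from both works at once */
|         printf("%d\n", gpl_helper(7) - 3);
|         return 0;
|     }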
| jeremyjh wrote:
| > Why not Copilot?
|
| Because the sources used for input do not belong to the
| person operating the tool.
|
| If you say that doesn't matter, then you are saying open
| source licenses don't matter because the same thing
| applies - I could just run a tool (compiler) on someone
| else's code, and ignore the terms of their license when I
| redistribute the binary.
| ghoward wrote:
| I think you have what I am saying backwards. I am saying
| that the licenses _should_ apply to the output of
| Copilot, like they apply to the output of compilers.
| jeremyjh wrote:
| Oh sorry, my mistake! Thank you.
| hedora wrote:
| No, I think that's the point.
|
| If I take some code I don't have a license for, feed it
| to a compiler (perhaps with some -O4 option that uses
| deep learning because buzzwords), then is the resulting
| binary covered under fair use, and therefore free of all
| license restrictions?
|
| If not, then how is what Copilot is doing any different?
| jeremyjh wrote:
| > If I take some code I don't have a license for, feed it
| to a compiler (perhaps with some -O4 option that uses
| deep learning because buzzwords), then is the resulting
| binary covered under fair use
|
| No, the binary is not free of license restrictions. Read
| any open source license - there are terms under which you
| can redistribute a binary made from the code. For GPL you
| have to make all your sources available under the same
| terms for example. For MIT you have to include
| attribution. For Apache you have to attribute and agree
| not to file any patents on the work in the Apache-licensed
| project you use. This has been upheld in many court cases
| - though it is not always easy to find litigants who can
| fund the cases, the licenses are sound.
| lukeplato wrote:
| Copilot is a commercial paid service that generates money for
| Microsoft
| _Understated_ wrote:
| Yeah, that bit I realise but the point I was getting at is
| this: if I take someone else's code, use chunks of it in my
| app, say that it's mine and make money from it is that not
| illegal? Or, at least in violation of the license?
|
| Superficially at least, Copilot (from my understanding) is
| "copying" code, letting me use it in my app, and making money
| from it.
|
| I'm just trying to wrap my head around it.
|
| Let's be clear, I am not a lawyer, but it seems... strange!
| stonemetal12 wrote:
| Both the copy machine and the VCR were found to be legal
| because they had substantial non-infringing uses. As is, I
| don't see how Copilot does. It could, if trained on public-
| domain or attribution-free code only; unfortunately, there
| probably isn't enough code out there to train the model
| adequately under such rules.
| Diggsey wrote:
| Also NAL, but I think there's far more of a case that users
| of Copilot might violate copyright rather than Copilot
| itself:
|
| - Only a _very_ small proportion of Copilot-generated code
| is reproduced verbatim, so if you specifically built a
| product just from copied-verbatim code, your act of
| selecting and combining those pieces of copyrighted code
| would be creating a derivative work.
|
| - GitHub is not selling the copyrighted code, they are
| selling the tool itself. Google is literally the same
| thing: you could theoretically create a product by googling
| for prefixes of copyrighted code and then copying the
| remainder straight out of the search results. It's you who
| would be violating copyright, not Google.
| ghoward wrote:
| I think there is an argument to be made that Copilot is
| producing derivative code, though. It may produce copies
| verbatim, and that's a violation, but far more often, it
| produces a mixture of things it was trained on, most of
| which probably have some sort of license requiring
| attribution at the very least.
| [deleted]
| klntsky wrote:
| Does copilot seem strange, or maybe the concept of
| intellectual property does?
| _Understated_ wrote:
| Copilot isn't strange from a technical perspective.
|
| The strange bit is how they are allowed to use other
| people's code to create derivative works (this is how I
| see it from my non-legal perspective anyway).
|
| Even if it's legal (to the letter of the law, not the
| spirit) it leaves a sour taste.
| kstrauser wrote:
| Suppose Copilot was Composer and it generated personalized
| songs for you after being trained on Spotify's library. If you
| started performing the resulting song and it contained
| recognizable clips of others, I guarantee you'd have lawyers
| coming after you.
|
| I don't see this as fundamentally different. It's unlikely that
| the Free Software Foundation is going to track you down for
| including some GNU code in your single-user repo. If you used
| their stuff in a popular commercial project and they got wind
| of it, you might expect to receive a cease and desist at best.
| wpietri wrote:
| I think you're right. Especially given that Copilot can
| reproduce significant blocks of code:
| https://twitter.com/mitsuhiko/status/1410886329924194309
|
| Famous code:
| https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...
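|
| For reference, the widely-circulated snippet from the Quake III
| source (as shown on that Wikipedia page) reads roughly as
| follows, original comments included (lightly sanitized):
|
|     float Q_rsqrt( float number )
|     {
|         long i;
|         float x2, y;
|         const float threehalfs = 1.5F;
|
|         x2 = number * 0.5F;
|         y  = number;
|         i  = * ( long * ) &y;  // evil floating point bit level hacking
|         i  = 0x5f3759df - ( i >> 1 );  // what the...?
|         y  = * ( float * ) &i;
|         y  = y * ( threehalfs - ( x2 * y * y ) );  // 1st iteration
|
|         return y;
|     }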
| treesprite82 wrote:
| I see this held up as an example a lot, but the fast inverse
| square root algorithm didn't originate from Quake and is in
| hundreds of repositories - many with permissive licenses like
| WTFPL and many including the same comments.
|
| GitHub claims they didn't find any "recitations" that
| appeared fewer than 10 times in the training data. That
| doesn't mean it's a completely solved issue (some code may be
| repeated in many repositories but always GPL, and there are
| limitations to how they detect recitations), but from rare
| cases of generating already-common solutions people seem to
| be concluding that all it does is copy-paste.
| wpietri wrote:
| That may be true, although even GitHub doesn't know for
| sure. But the problem remains: they're reproducing other
| people's code without regard to license status.
| rgbrenner wrote:
| _" I'm guessing the prevailing theory (from GiitHub anyway) is
| that I'm legitimately allowed to do this."_
|
| No. Copilot is a technical preview. In the final release, if it
| reproduces code verbatim, it'll tell you and present the
| correct license.
| ghoward wrote:
| It doesn't matter that it's a technical preview; people are
| using it now, and GitHub has _already_ used it internally. So if
| it infringes now, there is already code out there being used
| that does infringe.
| rgbrenner wrote:
| GitHub appears to be tracking every snippet that they're
| generating during their trials:
|
| https://docs.github.com/en/github/copilot/research-
| recitatio...
|
| Are you doing that? If not, then I wouldn't use GitHub's
| use as justification to engage in copyright infringement.
| ghoward wrote:
| Oh, I am not using Copilot. But other people not part of
| GitHub are. And those are still violations.
| x4e wrote:
| How will it find the "correct" license?
|
| Will it check the LICENSE file? Simply having a LICENSE file
| is not a declaration that all the code in that repo is under
| that LICENSE.
|
| What if specific lines/files are specified to be under
| different licenses?
|
| What if the publisher of the repo is publishing it under an
| incorrect license in bad faith?
|
| Will github be responsible if it tells me the wrong license?
| spywaregorilla wrote:
| So playing devil's advocate. What if the courts just don't care,
| and rule that copying code verbatim is not a crime because you
| didn't copy it, and copilot is not a human so it can't commit
| crimes. What's the net effect of a system that draws upon all
| public code repos? It sounds... net beneficial to society?
|
| On the plus side, a large body of work effectively becomes public
| domain. On the negative side, copyleft licenses lose their teeth.
| You probably see more power shift to those with big budgets. You
| probably see fewer things made source available, because you
| either have the public license or the private license now. This
| feels like a bad path but I'm not convinced the end result isn't
| better still.
| downrightmike wrote:
| The really nice thing is that this basically creates a library
| of industry methods and practices. It'd be really nice to be
| able to destroy copyright trolls because what their patent
| "covers" is already a known and established industry method, or
| a prior art.
| mook wrote:
| Would that mean I can start sampling songs if they get fed
| through a neural network? It'll be fine if I train it on
| whatever is playing on the radio, right? Doing the same for
| poems?
| spywaregorilla wrote:
| I would expect the legal argument to get into the intentions
| of the user and their relationship to the tool. I would also
| expect perspectives of art and code to diverge.
| agilob wrote:
| >copilot is not a human so it can't commit crimes
|
| I can set up my drone to detect me and attempt to crash into me.
| The AI would be quite poor; it would probably attempt to crash
| into any human. Would it be my fault it didn't crash into me and
| someone lost an eye?
|
| Can I set up a torrent box that automatically downloads and seeds
| all detected links from public trackers? Would I be responsible
| for it?
| spywaregorilla wrote:
| Both of these examples include you creating something and
| then using it. I don't know how Copilot works, but using the
| second example, if you wrote a script to download and seed
| torrents, and someone else used it, I don't think you would
| be held under any liability, especially if you don't profit
| off of it.
|
| Not a lawyer or even particularly well informed
|
| edit: I am reminded of the monkey selfie, in which it was
| ruled that a non-human cannot create copyrightable works.
| https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
| rcfox wrote:
| It sounds like you're arguing that Github isn't liable for
| people using copyrighted code through Copilot.
|
| I think most people are more concerned about whether the
| user of Copilot would be liable for using copyrighted code
| generated by Copilot.
| cool_dude85 wrote:
| Did copilot spring from the aether? Or was it built and
| trained on licensed code by github? Someone did something.
| spywaregorilla wrote:
| It's not a violation of copyright to train a model. There
| are three questions at play though:
|
| 1) Can you be liable for violating copyright if you have
| never seen the work?
|
| 2) Can a non-human be held accountable for violating
| copyright?
|
| 3) Can github be held liable for an end user using their
| tool to violate copyright?
|
| https://en.wikipedia.org/wiki/Substantial_similarity
|
| wikipedia states: Generally, copying cannot be proven
| without some evidence of access; however, in the seminal
| case on striking similarity, Arnstein v. Porter, the
| Second Circuit stated that even absent a finding of
| access, copying can be established when the similarities
| between two works are "so striking as to preclude the
| possibility that the plaintiff and defendant
| independently arrived at the same result."
|
| This is a different situation in which exact replication
| can reasonably occur without access to the
| original.
|
| Secondly, can you actually claim Github has violated
| copyright if it doesn't have any claims to the work in
| question?
|
| I think it's totally plausible that they win this in the
| long run.
| stonemetal12 wrote:
| 1) So you are saying if I get a disk duplication machine
| I can freely copy and distribute Blu-ray discs as long as
| I don't watch the movie on the disc?
|
| 2,3) Seems pretty settled at this point, look at the
| cases around the VCR and copy machine. In general the one
| using the machine is liable. The creator of the machine
| can be held liable if there aren't substantial non
| infringing uses.
| formerly_proven wrote:
| > It's not a violation of copyright to train a model.
|
| Many people on HN assert this based on the Authors Guild
| vs. Google case, but it's quite important to keep in mind
| that that case was about Google creating a search
| algorithm, which is _not_ generating "new" output.
|
| We are talking about a very different kind of system here
| and in many other cases. Claiming the Authors Guild case
| sets precedent for these very different systems seems
| unfounded to me.
| sangnoir wrote:
| > It's not a violation of copyright to train a model.
|
| This is a very bold assumption, one that I assume will
| not hold in a court of law in all cases. I think the
| nuanced question is: to train a model _that does what,
| exactly_.
|
| Let's say distributing meth recipes is illegal[1]. Can
| one legally side-step that by training a model that spits
| out the meth recipe instead? No court will bother with
| the distinction; causation is well-trod ground.
|
| 1. As an example - not sure if it's illegal. You may
| replace it with classified nuclear weapon schematics if you
| like.
| rvz wrote:
| It's been admitted again. This contraption by GitHub is really
| causing chaos in the open source world and has been trained upon
| all public GitHub code; essentially, those who have their code
| hosted there publicly gave them permission to train Copilot on
| their code. Now they are complaining about it after all these
| problems [0].
|
| I warned against hosting source code on GitHub and going all in
| on GitHub Actions, mainly because they have been unreliable for
| the past year [1] (they go down every month). Now Copilot has
| been trained on every single public repo on GitHub, as admitted
| right in this post, regardless of copyright.
|
| Maybe, for organisations with serious projects, now's the
| time to leave GitHub and self-host somewhere else?
|
| [0] https://news.ycombinator.com/item?id=27726088
|
| [1] https://news.ycombinator.com/item?id=27366397
| MayeulC wrote:
| Well, if you write software under a free license, you can't
| really prevent someone from uploading the source on GitHub...
| monkeydust wrote:
| Bit confused. If I have code on GitHub with the most restrictive
| licence possible (no commercial reuse, no derived works), then how
| did GitHub's legal team get comfortable with this approach? What
| am I missing?
| Tenoke wrote:
| There's an assumption that public repos can be read by humans
| and machines both which hasn't been questioned legally.
| monkeydust wrote:
| But the repos are provided under licence terms, no? Which can
| vary depending on the publisher's choice. Put another way, is
| there a licence that would prohibit reuse in this manner?
| Tenoke wrote:
| You can likely write/find one but if you don't want your
| code seen perhaps it'd be simpler to use a private repo.
| oauea wrote:
| You uploaded your code to their service and agreed to their
| TOS.
| mariusor wrote:
| By using github you have acceded to their terms of use[1]:
|
| > Short version: You own content you create, but you allow us
| certain rights to it, so that we can display and share the
| content you post. You still have control over your content, and
| responsibility for it, and the rights you grant us are limited
| to those we need to provide the service. We have the right to
| remove content or close Accounts if we need to.
|
| [1] https://docs.github.com/en/github/site-policy/github-
| terms-o...
| Tenoke wrote:
| The outrage-bait approach in this thread detracts from it. Yes,
| they trained it on everything. No, it's not clear if that's legal
| or not (probably is) or if that is much of a problem.
| arp242 wrote:
| Indeed; the question is if copyright should apply _at all_.
| Harping on about licenses, GPL, and whatnot is a distraction
| from the actual issue at hand.
|
| Also, given that the author of this tweet called me a
| "bootlicker" last year in response to a somewhat lengthy
| nuanced post about GitHub, I'm gonna go out on a limb and say
| that they're not all that interested in a meaningful
| conversation on this in the first place but are rather on a
| quest to "prove" GitHub is evil.
| lifthrasiir wrote:
| The possibility of GPL violation does show (one of the)
| enormous ramifications of the question, though. I think it's
| not a distraction as long as the question itself is also
| mentioned.
| arp242 wrote:
| There isn't any of this here though: it just operates on
| the assumption that the GPL applies.
| deviledeggs wrote:
| It's not outrage bait. The thing reproduces GPL licensed code
| verbatim.
| Tenoke wrote:
| I'm talking about how it's presented. It starts with
|
| >oh my gods. they literally have no shame about this.
|
| Then continues with
|
| >it's official, obeying copyright is only for the plebs and
| proles, rich people and big companies can do whatever they
| want
|
| and
|
| > GitHub, and by extension @Microsoft , knows that copyright
| is essentially worthless for individuals and small community
| projects. THAT is why they're all buddy-buddy with free
| software types; they never intended to respect our rights in
| the first place
|
| At any rate, it's not even clear to me if me publishing code
| written with copilot (or even with a random tool that will
| wget from github) puts the blame on the toolmaker or on me.
| This post, however, doesn't attempt to look at that but uses
| language that paints GH/MS as doing something illegal (and
| evil) that others wouldn't even get away with but not caring
| about it.
| belorn wrote:
| It seems that github did make a legal consideration when
| choosing to include public projects but exclude private
| ones, with many big companies having private projects for
| proprietary code bases. Users of public repositories are
| less likely to be able to fight github on the issue.
| deviledeggs wrote:
| Is that not true? Google and Oracle had a 10-year, multi-
| billion-dollar legal fight over ~20 lines of code identical
| between Android and the JVM.
|
| A non rich individual has basically zero chance of
| challenging GitHub on these blatant violations, and they
| know it.
|
| > At any rate, it's not even clear to me if me publishing
| code written with copilot (or even with a random tool that
| will wget from github) puts the blame on the toolmaker or
| on me.
|
| It really depends on the license, which GitHub apparently
| doesn't care about at all.
| cortexio wrote:
| Maybe you are correct, but I would agree that it's
| formulated in a childish/evil-spirited way. It smells of
| outrage hype and cancel culture.
| benhurmarcel wrote:
| But is that reproduced code "substantial"?
|
| I'm sure there's a "for i in range(0, n):" somewhere in a GPL
| repo, and yet having that in my code doesn't make it GPL.
| hu3 wrote:
| Just a reminder: reproducing GPL-licensed code verbatim is
| not illegal per se.
|
| The legality lies in what the user does with the code.
| xbar wrote:
| Microsoft LicenseLaunderer.
| sleavey wrote:
| To be fair, this could just be a mistaken interpretation from the
| support staffer that answered the question - they didn't sound
| sure ("apparently"). It certainly needs an official response from
| GitHub senior management but I wouldn't call the foul yet (not
| that it's even clear that it is a foul).
| underyx wrote:
| OP's rhetoric, and most discussion I see, asserts that training a
| model on copyrighted data is a copyright violation. Personally I
| don't find this to be so obviously the case. Think back to when
| we were listening to AI generated pop music, for instance. I
| don't recall any concern in HN comments about the copyright
| holders' music being used for learning.
| thinkingemote wrote:
| Did you miss the bit where Copilot reproduced a function
| exactly, including the comments? That's not some mashup or
| reinterpretation or inspiration; it meets the definition of
| plagiarism in universities and is just copying.
| underyx wrote:
| I didn't miss that, this still doesn't make the answer
| obvious to me. I'm pretty sure I've unknowningly replicated
| licensed code as well during my time as an engineer, and I've
| written way less code over my 8 years than Copilot has.
| fragileone wrote:
| Then if you were discovered using it in a commercial
| project, you could fairly be sued for it. Unless you're trying
| to argue that you should for some reason get an exemption?
| underyx wrote:
| Would I be found guilty if I could prove that I didn't
| explicitly copy that code but rather just happened to
| write the same code by arriving at the same solution as
| the original one I had seen years before?
| adn wrote:
| Nobody can answer this because it depends on the code and
| the resources of the entity suing you, but in general
| yes. This is why clean room design is a well-defined
| strategy: depending on the code and company, you would
| indeed not be allowed to work on the project because of
| the fact you'd seen a competitor's solution previously.
| contravariant wrote:
| I'd be surprised if nobody brought up those 'what-if' scenarios
| at the time.
| Zababa wrote:
| > Think back to when we were listening to AI generated pop
| music, for instance. I don't recall any concern in HN comments
| about the copyright holders' music being used for learning.
|
| Were those products sold to help people write commercial pop
| music faster? If not, I don't think your point is valid.
| belorn wrote:
| You mean like https://www.theburnin.com/technology/artificial-
| intelligence... ?
|
| If one of the three largest record labels uses their own
| catalog to train an AI, copyright seems less important to
| discuss. I suspect the discussion would be a bit different if a
| company scraped youtube and used that as a training set for AI
| music and successfully sold it.
| encryptluks2 wrote:
| Really hoping to see a mass exodus from GitHub after this.
| Microsoft is back to their old tactics, like we all knew they
| would be.
| Tenoke wrote:
| If you have public repos anywhere people can train on them just
| as much.
| ostenning wrote:
| That's also my general sentiment. I assume anyone can do
| virtually anything with my public repos with little recourse
| from me. I wouldn't even know if they are indeed breaking my
| license agreements. Doesn't really help the situation though.
| encryptluks2 wrote:
| GitHub only recently allowed non-paid private repos.
| Previously these were reserved for paid plans. Also,
| GitHub has a specific section for license files. GitHub
| actually believes these license files mean something, and
| states that they must be included with the repo so they are
| downloaded with it. Just because you can teach a script to
| ignore a LICENSE file doesn't mean that the license doesn't
| still apply. That is like saying that because you can teach a
| robot to ignore restricted airspace, it is allowed to
| fly around an airport.
| ostenning wrote:
| Any suggestions for an alternative? One thing I like about
| GitHub is that it 'seems' to be a de facto standard for
| portfolios & public works. It also has excellent integration
| with AWS and the like.
| selfhoster11 wrote:
| GitLab is a fairly good one. Lots of people self-host their
| own GitLab/Gitea instance too.
| goodpoint wrote:
| SourceHut or Codeberg
| aloisdg wrote:
| GitLab is the best alternative feature-wise.
| https://sourcehut.org/ is great too if you are into this kind
| of thing.
| tobyhinloopen wrote:
| A lot of hate for a cool piece of tech. Can't we just be happy
| this tool exists?
| mullikine wrote:
| Public-facing open-source code & media is going to be learned by
| language models because they're exposed to them. That's the
| simple truth. Nothing can stop that, not unless all public repos
| are made private. Everyone has access to the ability to create
| their own GPT, thanks to open-source. OpenAI is not actually very
| far ahead of open source anymore.
|
| The US seems well enough informed. As mentioned in the following
| report "AI tools are diffusing broadly and rapidly" and "AI is
| the quintessential "dual use" technology--it can be used for
| civilian and military purposes.".
|
| https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report...
|
| I'm fully expecting that if I begin a story and put it on my blog
| or on github, and if I go away for a couple years, I'll see it
| completed for me when I return. I can use foresight to my
| advantage or I can pretend like it's still the 1990s as if
| placing some text at the top of the code I exposed publicly is
| going to prevent people from training on it.
|
| One thing for sure though, I don't think a large company such as
| Microsoft should be profiting from training their language model
| on open-source code.
|
| The best way to release Copilot in my opinion would be to make
| the entire thing open source and have separate models, even a
| private paid-for model, so long as it's trained on their own code.
|
| An open source model trained on code under specific licenses
| sounds fine, but then the model should follow the same license as
| the code it was trained on.
|
| There's just something deeply unsettling about having a computer
| complete your thoughts for you without being able to question how
| or why.
| xgulfie wrote:
| I'm really hoping some big corp whose codebase is source-
| available and on GitHub but still under copyright, takes the piss
| out of them for this
| jensensbutton wrote:
| Wouldn't it be the people publishing code written with Copilot
| that (potentially) violate any licenses? It doesn't seem to me
| that the tool violates anything, though it may put the _user_ at
| risk of violating something.
|
| Like, don't use it if you're worried about violating licenses,
| but I don't see how Microsoft could get in trouble for the tool.
| It doesn't write and publish code by itself.
| cool_dude85 wrote:
| Sorry, we built this tool for you that auto violates licenses.
| Sure, we're owned by a huge megacorp with billions of dollars,
| but it's your responsibility to confirm - and yes, we recognize
| it's impossible to confirm - that what you release using our
| tool isn't violating the license.
|
| In short, github gets to make the license violator bot and push
| the violations off onto the small fry who actually use it? No
| thanks.
| Naga wrote:
| Isn't that sort of the justification behind BitTorrent and
| trackers?
| vharuck wrote:
| I see the difference as BitTorrent being an ignorant tool
| that just processes the data it receives. If you point
| BitTorrent at copyrighted data, it emits copyrighted data. The
| fault is on the users. Copilot was built with and
| "contains" copyrighted data, which it can reproduce from non-
| copyrighted input.
| fragileone wrote:
| Microsoft are violating the licenses already when they
| initially show you the generated code without attribution and
| ignoring other license restrictions. How you use it yourself is
| separate from that.
| russdpale wrote:
| Who cares? Seriously? Copilot has ripped away the absurd charade
| around licensing and code.
|
| It isn't any kind of copyright infringement. The AI is not
| copying and pasting code that it has found; it is rewriting the
| code from scratch on its own.
|
| We keep trying to take old ways and meld them to the internet,
| and it's just not appropriate and it doesn't work.
| geoffhill wrote:
| > not copying and pasting code
|
| https://news.ycombinator.com/item?id=27710287
| nrb wrote:
| Huh? There are already many examples of "copied and pasted"
| code that it has found.
|
| https://twitter.com/kylpeacock/status/1410749018183933952
|
| https://twitter.com/mitsuhiko/status/1410886329924194309
|
| https://twitter.com/pkell7/status/1411058236321681414/photo/...
| neonihil wrote:
| I think it's pretty easy to defeat MS in court.
|
| We just need to bring the music industry into this!
|
| For example: Let's train a network on Beatles music to generate
| new Beatles songs. I'm pretty sure music lawyers will find a way
| to prove that the trained network is violating the label's
| copyright, as they always manage to do that.
|
| And then we just need to use the precedent and argue that music
| is the same thing as code.
| alkonaut wrote:
| > For example: Let's train a network on Beatles music to
| generate new Beatles songs. I'm pretty sure music lawyers will
| find a way to prove that the trained network is violating the
| label's copyright, as they always manage to do that.
|
| The people making the machine that learned (and recites)
| Beatles songs aren't infringing, though (most likely). It's
| those that _use_ the machine to create and distribute the new
| works who are.
|
| Same here. No one will be able to say that Copilot _itself_ is
| a "derived work" or somehow uses the code in a way similar to
| a computer program (although such claims have already been made
| - I highly doubt that's the case). But those that produce a
| whole file full of GPL code verbatim (which will be rare, but
| WILL happen) are at risk of violating the license terms if
| they distribute it under the wrong license.
| stingraycharles wrote:
| Wouldn't a more accurate metaphor be "let's train a network on
| all music, to generate new music", which includes Beatles, and
| may generate songs that contain the same chords as the Beatles
| used?
| tylersmith wrote:
| Or contain new chords that it synthesized from its knowledge
| of the ones the Beatles used.
| dec0dedab0de wrote:
| Yes, but it may also use the same chord progressions, lyrics, or
| melodies. You could even say it contains snippets of the actual
| recordings, depending on how you look at it.
| stingraycharles wrote:
| Sure but then it'll definitely be harder to prove it's
| actual copyright infringement, especially when only a very
| small part of the song may have some snippets of the
| Beatles. Could it then, perhaps, be considered fair use?
| jeremyjh wrote:
| Yes you'd have exactly the same kind of lawsuits and
| arguments that already exist today around fair use. It
| doesn't matter if a tool creates the new work or a person
| creates it (without tools?!!?) because ultimately it is a
| person who claims the new work as their own and
| distributes it, and that is the person who will get sued
| by the record companies if their work is too derivative.
| Establishing the line for "too derivative" in any
| particular case is a very lucrative field already I'm
| sure.
| bionhoward wrote:
| Potentially dumb question from a guy who isn't a lawyer:
|
| Does Copilot infringe Google's patent(s) on the Transformer
| architecture? If so, then Google could potentially sue them for
| royalties, at least.
|
| Further, couldn't this Copilot thing backfire for Github
| because customer trust is more valuable than AI training data
| right now? If folks don't feel they can trust Github, seems
| like they could move their work to other version control
| systems like Gitlab or Bitbucket...
| anonydsfsfs wrote:
| Doesn't really matter, because if Google sued Microsoft,
| Microsoft would immediately hit back with a countersuit,
| since they would have little trouble finding something in
| their 90,000+ patent warchest that Google is infringing on.
| Software patents have become a matter of mutually-assured
| destruction for the big players. The only winning move is not
| to play.
| [deleted]
| Florin_Andrei wrote:
| In ancient Rome they didn't have a police force. What they had
| was essentially muscle for hire, mercenary bands paid by rich
| and powerful folks to do their bidding. As a regular person,
| the only thing that could have protected you from one of these
| groups was another such group.
|
| Same today with the licensing system.
| LudwigNagasena wrote:
| They had cohortes vigilum and cohortes urbanae in Ancient
| Rome. Why don't they count as police?
| BoxOfRain wrote:
| There is an absolutely enormous archive of fan-taped Grateful
| Dead shows out there, someone with much more time and money
| than me _needs_ to train a network on that!
| data_ders wrote:
| username checks out lol
| partiallypro wrote:
| That already exists though? SongSmith and other similar tools
| are used by musicians a lot.
| nomel wrote:
| At what point is it not a derivative work?
| gizmodo59 wrote:
| I hope they don't shut down the project with all the legal
| nightmares.
| kazinator wrote:
| If you learn a word or phrase from a copyrighted public
| broadcast, does that mean you cannot speak it to others?
| 0x0 wrote:
| To the people arguing it's "fair use" to use this for training an
| ML network. Where do you draw the line? What if you train an "ML
| network" with one or two inputs... so that they almost always
| "generate" exact copies of the inputs? Five inputs..? Ten? A
| thousand? A million?
| saint_abroad wrote:
| > Where do you draw the line?
|
| My simplistic view is that the following are legally
| equivalent:
|
| input -> AI network -> output
|
| input -> Huffman coding -> output
|
| So, whilst:
|
| * compressing and decompressing a copyrighted work is
| permissible; and
|
| * the output and weights are deterministic transformations of
| the inputs;
|
| it follows that the outputs are:
|
| * not eligible for copyright (lacking creativity); and
|
| * derivative works of the inputs.
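|
| A minimal sketch of the round-trip point, using zlib (link
| with -lz; the "work" string here is made up for illustration):
|
|     #include <stdio.h>
|     #include <string.h>
|     #include <zlib.h>
|
|     int main(void) {
|         /* stand-in for a copyrighted work */
|         const unsigned char work[] = "some copyrighted text";
|         unsigned char packed[128], unpacked[128];
|         uLongf plen = sizeof packed, ulen = sizeof unpacked;
|
|         /* deterministic transformation there and back */
|         compress(packed, &plen, work, sizeof work);
|         uncompress(unpacked, &ulen, packed, plen);
|
|         /* the round trip reproduces the work byte for byte;
|            nobody would say the compressor authored it */
|         puts(memcmp(work, unpacked, sizeof work)
|              ? "differs" : "identical");
|         return 0;
|     }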
| ghoward wrote:
| But at the same time, a compiler does a deterministic
| transformation of its inputs, and we still count its output
| as under copyright and license.
|
| copyrighted input -> compiler -> copyrighted output
| kevincox wrote:
| > output and weights are deterministic transformations of the
| inputs;
|
| That may be true but I fail to see how any process that
| produces the same content that was input into it somehow
| strips the license. If the generated code is novel, then
| there is no copyright and it is just the output of the tool.
| If the code is a copy but non-creative (for example, a trivial
| function), then it isn't covered by copyright in the source
| anyway, so the output is not protected by copyright either.
| However if the output is a copy and creative I don't think it
| matters how complicated your copying process was. What
| matters is that the code was copied and you need to obey
| copyright.
|
| Again, I don't think that novel code generated from being
| trained on copyrighted code is the problem. I think it is
| just the verbatim (or minimally transformed) copying that is
| the issue.
| Tenoke wrote:
| I can imagine a requirement of the sort 'generated code may
| match at most X% of any snippet in the training data, as
| measured over Y amount of sampling', but I am not sure you can
| get a much better requirement than that.
|
| Forbidding the training of AI on public code would definitely
| be a step too far though.
|
| Edit: I'd also like it if they provided a tool for checking
| whether your code matches copyrighted code too closely, so you
| can confirm whether you are violating anything when you use
| copilot.
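|
| (A minimal sketch of what such a check could look like, in C:
| slide a fixed-size window over the generated code and count how
| many windows occur verbatim in a corpus string. The function
| name, window size and strings are all made up for illustration:)
|
|     #include <stdio.h>
|     #include <string.h>
|
|     /* fraction of length-n windows of `generated` found verbatim
|        in `corpus`; n must be < 64 for the local buffer */
|     double overlap_ratio(const char *generated, const char *corpus,
|                          size_t n) {
|         size_t len = strlen(generated), hits = 0, windows = 0;
|         for (size_t i = 0; i + n <= len; i++, windows++) {
|             char window[64];
|             memcpy(window, generated + i, n);
|             window[n] = '\0';
|             if (strstr(corpus, window)) hits++;
|         }
|         return windows ? (double)hits / (double)windows : 0.0;
|     }
|
|     int main(void) {
|         const char *corpus = "y = y * (threehalfs - (x2 * y * y));";
|         const char *gen = "y = y * (threehalfs - (x2 * y * y)); return y;";
|         printf("%.0f%% of 16-char windows match\n",
|                100.0 * overlap_ratio(gen, corpus, 16));
|         return 0;
|     }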
| crazygringo wrote:
| The line is exactly the same line that's always been drawn in
| fair use cases.
|
| There's absolutely nothing different whether the creator is ML
| or a human.
|
| Generally, if you train an ML network to generate an almost
| exact copy of a thousand lines, it's _obviously_ not fair use.
| If it 's five simple lines, it obviously _is_ fair use. If it
| 's somewhere in between, there are a lot of different factors
| that need to be weighed in a fair use decision, which you can
| easily look up.
| diffeomorphism wrote:
| There obviously is no sharp line (e.g. it is 37. Immediate
| question: why not 36?), but that does not matter at all.
|
| We already have the same fuzzy line for writing. Am I forbidden
| from ever reading other authors' books because I might
| accidentally "generate exact copies" of some of the sentences?
| Clearly not, that is how people learn a language. Does that
| mean I am allowed to copy the whole book? Also clearly not.
|
| Where do you draw the line? Somewhere.
| criddell wrote:
| And _somewhere_ is determined for your particular case in
| court. And tomorrow, a similar case may be determined
| differently.
| diffeomorphism wrote:
| Not really, no.
| agilob wrote:
| Does this mean that all the illegally leaked code from Apple,
| CDPR, Intel, NSA and Microsoft is also in the models? iBoot,
| Witcher 3? Gwent? NSA backdoors?
|
| Does the copilot still learn from new repos? Can I post github
| enterprise code publicly to let it learn from it?
|
| Serious answers only please
| tedunangst wrote:
| Why would you think letting copilot scan the code would absolve
| you of liability for posting it?
| agilob wrote:
| I'm not asking about legality of posting the code, but reuse
| of this by the AI and users of the AI. "All public
| repositories" is a wide net full of surprises.
| BlueTemplar wrote:
| I assume the answer is yes to most of those?
|
| Of course using code generated by Copilot from those would
| still be illegal.
|
| See also : Napster (and other p2p), the bitcoin blockchain
| allegedly containing illegal numbers...
| TingPing wrote:
| So copyright doesn't apply unless copyright applies.
| swiley wrote:
| Has Microsoft just killed source code copyright? That would
| definitely be a win.
| blibble wrote:
| it would be a win for Microsoft, who don't distribute their
| source code
|
| whereas for open source it's a disaster
| glogla wrote:
| Which seems very much aligned with what Microsoft has been
| trying to do for decades now.
| formerly_proven wrote:
| Interesting idea, considering Microsoft's copyright
| dependence has reached an all-time low since they moved as
| much as they can into their SaaS and PaaS offerings. Nothing
| left to copy, except for employees, but you don't need
| copyright to bash their heads in, legally speaking.
| nh2 wrote:
| It would be quite impressive if this was a long-time
| planned "Embrace, extend, extinguish" move against
| Copyleft, with a casual acquisition of Github to make it
| work.
|
| Finally, it beat the "cancer that attaches itself in an
| intellectual property sense to everything it touches"
| after all those years, with its own tools!
|
| Now it's safe to touch.
| swiley wrote:
| There's nothing to stop the employees from distributing it
| at that point and even with copyright it gets distributed
| anyway, it's just not allowed to be used for anything
| serious.
| echelon wrote:
| But who says the code has to be available to anyone but
| Microsoft?
|
| Remember that Amazon won off the back of open source. Now all
| the open source servers and databases are Amazon products.
| TuringNYC wrote:
| IANAL, but the serious answer -- I think -- is that you always
| use things at your own risk, even purchased tools, and are
| protected via indemnity agreements. If there is no indemnity
| agreement (as is the case here), you assume the risk.
|
| That said, if enough people are bitten by this, I'm not sure
| what happens -- does anyone know of a relevant case? One
| somewhat relevant case that caused mass pain was the SCO-Linux
| dispute:
|
| https://en.wikipedia.org/wiki/SCO%E2%80%93Linux_disputes
| user5994461 wrote:
| If you're thinking about the liability waivers found in many
| licenses, contracts, EULAs and the like, they are often void,
| depending on the jurisdiction.
|
| The official answer from Github that they take all input on
| purpose doesn't play in their favor.
| Voloskaya wrote:
| No, it's not trained on all public code as the title suggests,
| it's trained on all GitHub public code (so public repos hosted
| on GH), none of the things you enumerate are hosted on GH.
| agilob wrote:
| >it's trained on all GitHub public code (so public repos
| hosted on GH)
|
| This is exactly what I meant.
|
| >none of the things you enumerate are hosted on GH.
|
| Plenty of them on GH, if not src then magnet links
| akerro wrote:
| Just found Intel leaks and Gwent on github without any
| effort. Intel has a few repositories in different formats,
| plain copy of .svn directory or converted to git. TF2/Portal
| leak is there as well. All but 2 I found were made by
| throwaway accounts.
| e12e wrote:
| Now, was the leaked nt kernel source ever published on
| github?
| sp332 wrote:
| https://github.com/PubDom/Windows-Server-2003 Second
| result on DDG.
| e12e wrote:
| I wonder if co-pilot will cough up stuff like these
| useful macros? Seems like a reasonable hack...
|
| https://github.com/PubDom/Windows-Server-2003/blob/master/co...
|
|     #ifdef _MAC
|     # include <string.h>
|     # pragma segment ClipBrd
|     // On the Macintosh, the clipboard is always open. We define a macro for
|     // OpenClipboard that returns TRUE. When this is used for error checking,
|     // the compiler should optimize away any code that depends on testing
|     // this, since it is a constant.
|     # define OpenClipboard(x) TRUE
|     // On the Macintosh, the clipboard is not closed. To make all code behave
|     // as if everything is OK, we define a macro for CloseClipboard that
|     // returns TRUE. When this is used for error checking, the compiler
|     // should optimize away any code that depends on testing this, since it
|     // is a constant.
|     # define CloseClipboard() TRUE
|     #endif // _MAC
|
| Just the kind of trick co-pilot should help us with?
| jeremyjh wrote:
| There have been leaks of copyrighted code that were hosted on
| Github before they were taken down. There is also a lot of
| public code on Github without any license at all, which is
| not public domain but actually unlicensed for all purposes.
| Causality1 wrote:
| Suppose you had some kind of AI Deepfake program operating off a
| large database of copyrighted photos and you asked it to "make a
| picture of a handsome man on a horse" and the man's head was an
| exact duplicate of George Clooney's head from a specific magazine
| cover, would that be infringement? Would selling the services of
| an AI that took copyrighted photos of celebrities and edited them
| into porn movies be infringement? I don't know the answers to
| those questions but I find it very weird that people think large
| blocks of typed text are less worthy of copyright protection than
| other forms of media.
| dogma1138 wrote:
| That would potentially be an infringement of the copyright of
| the photographer but in any case it's an infringement of the
| personality rights of George Clooney.
|
| You aren't allowed to sell someone's likeness without their
| permission. You don't need an AI for this if you create a
| portrait of Clooney and sell it or make any use that isn't
| covered by fair use he can sue you.
|
| Depending on the composition of the picture for example if
| Clooney is naked and say Putin is riding in the "bitch seat" of
| the saddle, then you are also quite likely to be open to a
| libel suit as well.
| hoppyhoppy2 wrote:
| Satire does not usually fall under libel/defamation, though,
| right?
|
| >For example, in Hustler Magazine v. Falwell (1988), Chief
| Justice William H. Rehnquist, writing for a unanimous court,
| stated that a parody depicting the Reverend Jerry Falwell as
| a drunken, incestuous son could not be defamation since it
| was an obvious parody, not intended as a statement of fact.
| To find otherwise, the Court said, was to endanger First
| Amendment protection for every artist, political cartoonist,
| and comedian who used satire to criticize public figures.
|
| https://www.mtsu.edu/first-amendment/article/1015/satire
| dogma1138 wrote:
| Depends on the legal system in question and the intent and
| usage.
|
| The US system isn't the only one on the planet you know,
| the UK still has political cartoonists despite a very
| different definition for what defamation is which the
| example above can fall under.
| belorn wrote:
| So why did GitHub choose to exclude private repositories? Why not
| include everything, including the code for windows?
| jefftk wrote:
| In training on publicly accessible repositories, GitHub did
| something anybody could have done. If they also used private
| repositories, though, I would see that as abusing their
| position.
| jefftk wrote:
| Additionally, if they had trained on private repositories
| then they risk leaking code, and accidentally making it
| public. Even if that was within fair use it would still be a
| violation of the trust people put in them.
| rambambram wrote:
| Besides, as a programmer you should not excuse yourself with
| "IANAL" or otherwise defer all judgment to lawyers. Lawyers are
| just that: lawyers. They don't hold the truth either. One lawyer
| says this, another lawyer says that. F*k 'm. If anything, say
| "IANAJ" (I Am Not A Judge). Trias politica, you gotta love it.
| pjfin123 wrote:
| Of all the potential issues with training large AI models,
| incidental copyright infringement seems pretty mild.
| sergiogjr wrote:
| The way I see it, MS has probably invested as much in
| researching the legalities of releasing this product, via its
| highly paid legal army, as in developing it. To expect a
| multi-billion-dollar company not to do its due diligence seems
| naive.
|
| Maybe this could spark a discussion to change the current rules
| that allow them to do this, but questioning the current legality
| to me is a waste of time.
| tpmx wrote:
| It's a gamble. Worst case they have to reduce the quality by
| removing GPL code from the training data. And/or pay off a few
| lawsuits, which is routine stuff for them. Cost of doing
| business.
| pjfin123 wrote:
| Of all the concerns over training large AI models, incidental
| copyright infringement doesn't seem that important.
| zinekeller wrote:
| https://news.ycombinator.com/item?id=27771742
|
| The above thread is a dupe of this discussion but with
| interesting discussions already in place before being marked as a
| dupe.
| floor_ wrote:
| Road to hell paved with good intentions.
| Luker88 wrote:
| Like it or not, it seems like:
|
| * most people here are unhappy
|
| most lawyers will say it's fine (it very probably passed MS
| ones)
|
| I can understand that. Copyright was not created with AI/ML in
| mind, even as a random stray thought. Those were not even words
| at the time.
|
| So the question is: if we change the law and require trained
| algorithms to only work on licenses that permit this, and to
| output the "minimum common license" somehow, what are the
| repercussions on other applications of copyright?
|
| Because the consensus here seems to be that this looks a lot like
| a de-licensor with extra steps
| nonameiguess wrote:
| Standard caveat that I'm not a lawyer by any stretch, but this
| seems settled by the existence of text-generation assistants
| trained on the full corpus of human writing ever digitized,
| much of which is also copyrighted or licensed in some way. That
| is clearly fine, as training text generation programs on
| existing text has been standard for decades. Selling a product
| based on GPT-3 is fine and the law has not come after anyone
| trying to do that.
|
| The more questionable line is if someone happens to
| inadvertently reproduce entire paragraphs of Twilight: Breaking
| Dawn word-for-word using GPT-3 and then sells it, that might be
| a violation even if they didn't realize they were doing it.
|
| Copilot is the same thing. Creating a product that makes
| suggestions that it learned from reading other people's work is
| fine. Now if you write code using Copilot and happen to
| reproduce some part of glibc down to the variable names, and
| don't release it under GPL, you might be in trouble. But
| Copilot won't be.
| dlisboa wrote:
| I don't know if even copying small pieces of code verbatim
| should mean anything.
|
| Another example is the photo generation ML algorithms that
| exist. They generate photos of random "people" (imaginary AI-
| generated people) by using actual photos of real people. If
| one eye or nose is verbatim copied from the actual photo to
| the generated photo, is the entire output now illegal or
| plagiarism? One might argue it's just an eye, the rest of the
| picture is completely different, the original photographer
| doesn't need to grant permission for that use.
|
| Any analogies we make with this, be it text generation, image
| generation, even video generation, seems like it falls under
| the same conclusion: so far we've thought all of this was
| perfectly fine. I don't see why code-generation is any
| different. A function is just a tiny part of a project. It's
| not necessarily more important than the composition of a
| photograph, or a phrase in a book. We as programmers assign
| meaning to it, we know it takes time to craft it and it might
| be unique, but likewise a novelist may have spent weeks on a
| specific 10 word phrase that was reproduced verbatim, in a
| text of 500 pages.
|
| The more I look at this the more it seems copyright, and IP
| law in general, is the main problem. Copyleft and OS licenses
| wouldn't be needed if it wasn't for the aggressive nature of
| IP law. I don't see the need to defend far more strict
| interpretations of it because it has now touched our field.
| mrh0057 wrote:
| There is nothing intelligent about this. What they did is a
| context-aware search, and they are trying to claim that's not
| what this is. If it were just used as a search engine, and
| people either weren't using the results or were following the
| license of the original source, then it would be fine. There
| has been so much hype around machine learning that people
| likely have a false impression of what it is.
| Ajedi32 wrote:
| I've seen this claim that Copilot is "just a search engine"
| repeated in multiple places now. It's wrong, as anyone
| familiar with any of the GPT variants or other similar
| autoregressive language models can attest.
|
| Copilot isn't a search engine any more than any other
| language model is. It _can_ sometimes output data from the
| training set verbatim as most AI models do from time to time,
| but that is the exception not the rule.
|
| Whether modern autoregressive language models can be called
| "inteligent" is debatable, but they're certainly far beyond
| what you'd get from a simple search engine.
| rgbrenner wrote:
| There are a lot of posts here debating if licenses still apply
| when copilot generates verbatim code. The answer is yes.
|
| Copilot is currently a technical preview. Github has already said
| they intend to detect verbatim code and notify the user and
| present the correct license. That'll be in the final release.
|
| Don't use the technical preview for anything except demoing a
| cool concept. It's not ready for that yet because it will
| reproduce licensed code and not tell you.
| Voloskaya wrote:
| > If licensing still apply when copilot generates the code. The
| answer is yes.
|
| Please provide a source for this.
| rubyist5eva wrote:
| Open source developers need a new kind of license with a ML model
| training clause, so there is no more ambiguity if they don't want
| their code to be used in this way.
| Hamuko wrote:
| People have been suggesting this ever since Copilot was
| announced, and it doesn't work on any level. They're using
| _all_ code on GitHub, even code with no license that you can't
| use for any purpose, and the reasoning is that they see it as
| fair use - which supersedes any licenses and copyrights in the
| US.
| ghoward wrote:
| They only claimed that _training_ the model was fair use.
| What about its output? I argue that its output is still
| affected by the copyright of its inputs, the same way the
| output of a compiler is affected by the copyright of its
| inputs.
| ghoward wrote:
| I made new licenses [1] [2] that attempt this. The problem with
| adding a clause against ML training is that that is
| (supposedly) fair use. What my licenses do is concede that but
| claim that the _output_ of those algorithms is still under the
| copyright and license.
|
| I hope that even if it wouldn't work, it puts enough doubt in
| companies' minds that they wouldn't want to use a model trained
| by code under those licenses.
|
| [1]: https://gavinhoward.com/2021/07/poisoning-github-copilot-
| and...
|
| [2]: https://yzena.com/licenses/
| jefftk wrote:
| That doesn't work: your suggestion applies at too late a stage
| in the flowchart. It looks like:
|
| 1. Do you need a license to use materials for training, or to
| use the output model?
|
| 2. If so, does the code's license allow this?
|
| GitHub is claiming 'no' for #1, that they do not need any sort
| of license to the training materials. This is reasonably
| standard in ML; it's also how GPT-3 etc were trained.
|
| Now, whether a court will agree with their interpretation is an
| interesting question, but if they are correct then #2 doesn't
| come into play.
| rubyist5eva wrote:
| If the answer is 'no' for #1 then the GPL might as well not
| exist, because now we can just launder it through co-pilot and
| close it off - a rather distorted interpretation of "fair use"
| if you ask me.
|
| "Dear copilot, I'm writing a Unix-like operating system...."
| mehdix wrote:
| GitHub's Copilot looks like a "code laundering" machine to me.
| slownews45 wrote:
| Developers have lost the plot here. The number of people
| browsing stack exchange and copying code is huge. The number of
| people who have read GPL'ed code to learn from (from the kernel
| to others) is huge. The number of people who learned from code
| they had to maintain -> huge.
|
| This idea that a snippet of a code is a work seems crazy to me.
| I thought we went through this with SCO already.
| loeg wrote:
| Stack exchange code is explicitly permissively licensed.
| slownews45 wrote:
| It is, but it has a GPL-style condition:
|
| ShareAlike -- If you remix, transform, or build upon the
| material, you must distribute your contributions under the
| same license as the original.
|
| The idea that programmers taking snippets from stackexchange
| or co-pilot etc. are thereby creating a derivative work seems
| like total insanity.
| e12e wrote:
| Unfortunately it's wrongly licensed for its perceived use-case
| - it'd be better if it used MIT or BSD :/
|
| https://stackoverflow.com/help/licensing
|
| I'm guessing most uses of stack overflow snippets are
| violating the license (no attribution, no share alike of
| the "remix" - which would probably be the entire program).
| lucasyvas wrote:
| There's a phrase I never wanted to hear. But that sounds
| exactly like what it is.
| qayxc wrote:
| Why and how? I'm honestly interested in an answer here.
|
| What exactly is the difference between a machine learning
| patterns and techniques from looking at code and people doing
| it?
|
| Is every programmer who ever gazed at GPL'ed code guilty of
| plagiarism and licensing violations because everything they
| write has to be considered derivative work now?
| mehdix wrote:
| I can think of certain things here. As human beings we have
| limitations. We get tired of gazing at code, GPL'ed or not.
| GitHub's clusters don't. It puts fair use of copyrighted
| content under question. The next concern I have, is what
| happens when Copilot produces certain code verbatim? I saw
| the other day on HN that it produced some Quake code
| verbatim. See https://news.ycombinator.com/item?id=27710287
| qayxc wrote:
| > As human beings we have limitations.
|
| That's a fair point, and ML models don't seem to memorise all
| the code they've seen either. Plus, while the argument
| of human limitations applies to the vast majority of
| people, what about those with eidetic memory?
|
| > what happens when Copilot produces certain code verbatim?
|
| There are several options: suppress the result, annotate it
| with a proper reference, or mark the snippet as GPL'ed.
|
| There are technical solutions to this question, but it's
| also important to ask to which degree this is necessary.
|
| Is a search engine that returns code snippets regardless of
| license also a tool that needs to be discussed the same
| way? After all, code samples from StackOverflow or
| RosettaCode are copied on a regular basis and not every
| example provides a proper reference as to where it's been
| taken from.
|
| So maybe a hint like "may contain results based on GPL'ed
| code" suffices? I don't know, but that's a question best
| deferred to software copyright law experts.
| ralph84 wrote:
| > I've reached out to @fsf and @EFF's legal teams regarding this.
| Please also reach out if you would be interested in participating
| in a class action.
|
| I think she's barking up the wrong tree here. If she's looking
| for organizations interested in eliminating fair use, RIAA, MPA,
| and AAP are more likely allies.
| EVa5I7bHFq9mnYK wrote:
| Open source is about love, sharing, helping out the fellow coder.
| Coderz of the past hated all this licensing and copyright BS.
| Your code, used to train this NN, is making the world a better
| place, I'd be content with that.
| luffapi wrote:
| Nothing about further enriching Microsoft and continuing the
| network effects behind a closed-source "social network" is
| making the world a better place. Quite the opposite, really.
| danbruc wrote:
| At least to some first approximation irrelevant because reading
| code is not subject to any license. What if a human reads some
| restrictively licensed code and years later uses some idea he
| noticed in that code, maybe even no longer being aware from where
| this idea comes?
|
| But what if the system memorizes entire functions? What if a
| human does so? What if you change all the variable names? What if
| you rearrange the control flow a bit? What if you just change the
| spacing? What if two humans write the exact same code
| independently? Is every for loop with i from 0 to n a license
| violation?
|
| I am not picking any side, but the problem is certainly much
| more nuanced than either side of the argument wants to paint it.
| selfhoster11 wrote:
| The problem is that humans are limited in retention and rate of
| learning. An AI/ML is not, which makes (or should make) a
| difference.
| danbruc wrote:
| Sure, it might certainly be the case that different rules
| should be applied to humans and machines, but this makes the
| discussion only even more nuanced. But I don't think this
| could reasonably be used to ban machines from ingesting code
| with certain licenses even though it might restrict what they
| can do with this information.
| jonfw wrote:
| I agree that it's nuanced and it's difficult to draw the line.
| but where copilot sits is way over on the plagiarizing side of
| the spectrum. Wherever we agree to draw the line, copilot
| should definitely fall on the wrong side of it
|
| Copilot will replicate entire functions, including comments,
| from licensed code
| kevincox wrote:
| > but where copilot sits is way over on the plagiarizing side
| of the spectrum
|
| I think it is important to point out that not all Copilot
| output is on the plagiarizing side of the spectrum. However
| it does on occasion produce plagiarized code. And most
| importantly there is no indication when this occurs.
| kevincox wrote:
| > What if a human reads some restrictively licensed code and
| years later uses some idea he noticed in that code, maybe even
| no longer being aware from where this idea comes?
|
| In general using the idea is fine, whether it is AI or human
| written. I think the major concern here is when the code is
| copied verbatim, or near verbatim. (AKA the produced code is
| not "transformative" upon the original)
|
| > But what if the system memorizes entire functions? What if a
| human does so?
|
| In both of these cases I believe it would be a copyright
| concern. It is not strictly defined, and it depends on the
| complexity of the function. If you memorized (|a| a + 1) I
| doubt any court would call that copying a creative work. But if
| you memorized the quake fast inverse square root it is likely
| protected under copyright, even if you changed the variable
| names and formatting.
|
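| (For concreteness, a paraphrase of that function from memory,
| with renamed identifiers and fresh formatting - the magic
| constant and the bit trick are what make it recognizable, not
| the variable names:)
|
|     #include <stdint.h>
|     #include <string.h>
|
|     float inv_sqrt(float v) {
|         float half = 0.5f * v;
|         uint32_t bits;
|         memcpy(&bits, &v, sizeof bits);   /* reinterpret float bits */
|         bits = 0x5f3759df - (bits >> 1);  /* the famous magic constant */
|         memcpy(&v, &bits, sizeof v);
|         return v * (1.5f - half * v * v); /* one Newton-Raphson step */
|     }
|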
| It seems clear to me that GitHub Copilot is capable of
| producing code that is copyrighted and needs to be used
| according to the copyright owner's license. Worse still, it
| doesn't appear capable of knowing when it is doing that, or
| what the source is.
| surfingdino wrote:
| I am not surprised given who the owner of GitHub is. Now, let's
| assume for a while that a private repo is left marked as public
| by mistake and Copilot regurgitates it... Lawyers are going to
| have fun with that one.
| Hamuko wrote:
| The worst scenario for GitHub is when a leak is published on
| GitHub. It's not like it hasn't happened before.
|
| https://www.theverge.com/2018/2/8/16992626/apple-github-dmca...
| SXX wrote:
| There are actually tons of unlicensed and wrongly licensed
| code on GitHub right now that was accidentally leaked by
| employees of many companies.
| Yoric wrote:
| Out of curiosity, how do we define license violation in that
| case? I, as a human being, have trained by reading code, much of
| which is covered by licenses that are somehow not compatible with
| code I'm writing. Am I violating licenses?
|
| Asking seriously. It's really unclear to me where law and/or
| ethics put the boundaries. Also, I'd guess it's probably country
| dependent.
| bananapub wrote:
| https://en.wikipedia.org/wiki/Clean_room_design
|
| sometimes? it's enough of an issue that companies explicitly
| avoid it by having two teams.
| Spivak wrote:
| Clean room design is a technique to avoid the appearance of
| copyright infringement. If the courts were omniscient and
| could see into your mind that you didn't copy then there
| would be no need. This is relevant because we can see into the
| mind of copilot. Whether what it does is considered
| infringement will, I think, come out in the details.
|
| If the ML model essentially is just a very sophisticated
| search and helps you choose what to copy and helps you modify
| it to fit your code then it's 100% infringement. If it is
| actually writing code then maybe not.
| danbruc wrote:
| That is exactly what needs some careful consideration. As a
| start, two people can write the exact same code independently,
| therefore having identical code is not sufficient. On the other
| hand I can copy some code and slightly modify it, maybe only
| the spacing or maybe changing some variable names, and it could
| reasonably be a license violation, therefore having identical
| code is also not necessary.
|
| Does the code even matter at all? If I start with a copy of
| some existing code, how much do I have to change it to no
| longer constitute a license violation? Can I ever reach this
| point or would the violation already be in the fact that I
| started with a copy no matter what happens later? Does
| intention matter? Can I unintentionally violate a license?
|
| But I think we don't have to do all the work; I am pretty sure
| this has already been considered at length by philosophers and
| jurists.
| [deleted]
| emerged wrote:
| None of our laws were created under the assumption that
| computers would do so much of our jobs and affect so much of
| our lives. From robotic automation to social media to now
| computer programming. I think it's really a mistake to ask what
| the letter of the law _currently_ means in the evolving
| context. Laws should serve us and need to be adapted.
| kadoban wrote:
| Who is "us" that are being served?
|
| I'm not the biggest fan of copyright law as currently
| written, but I wouldn't say that MS's desire to file off the
| serial numbers on every piece of public code for their own
| profit is a good impetus to rewrite the law.
| einarfd wrote:
| > Out of curiosity, how do we define license violation in that
| case? I, as a human being, have trained by reading code, much
| of which is covered by licenses that are somehow not compatible
| with code I'm writing. Am I violating licenses?
|
| That depends: if you end up writing copies of the code you've
| studied, then yes, you are on thin ice. Plagiarism is
| definitely something you can do with computer code. There have
| been several high-profile cases around this in the arts. As
| far as I can see, it usually ends up being a question of how
| much of the work is similar, how similar it is, and how unique
| the similar part is. An added wrinkle in programming is that
| some things can be done in only one way, or at least any
| reasonable programmer will do it in only one way. So for
| example a swap(var1, var2) function can usually only be done
| in one way, and therefore you would not get in trouble if your
| and someone else's swap functions are the same.
|
| I've been following the discussion about Copilot, and one issue
| that comes up again and again is that people seem to think that
| since Copilot is new, the law will treat it, and the code it
| writes, differently than it would treat you or a copy machine.
| I think that is naive; my impression is that courts care more
| about what you did than how you did it, and if you think
| Copilot can be used to do an end run around the law, prepare to
| be disappointed.
|
| So if Copilot memorizes code and spits out copies of that
| code, then it is at best skating on thin ice, and at worst
| committing a license violation. If the code it is copying is
| unique, then it is definitely heading into problematic
| territory. I'm fairly sure someone in legal at Github is very
| unhappy about the quake fast inverse square root function.
| MayeulC wrote:
| > swap(var1, var2)
|
| Well, there's also the xor way, to be pedantic :)
|
|     var1 = var1 ^ var2
|     var2 = var2 ^ var1
|     var1 = var1 ^ var2
|
| But yeah, not too much wiggle room there.
| johndough wrote:
| Another variation (assuming no overflows):
|
|     var1 += var2;
|     var2 = var1 - var2;
|     var1 -= var2;
|
| And another:
|
|     var1 ^= var2 ^= var1 ^= var2;
|
| Assembly even has an instruction for it:
|
|     xchg eax, ecx
| dvfjsdhgfv wrote:
| My guess is that many people will use it on the backend where
| a copyright violation is hard to spot and even more difficult
| to prove.
|
| As for frontend/open source etc... sure, if you don't care
| about copyright and licensing, use it.
| contravariant wrote:
| If you were to write large swaths of copyrighted code from
| memory then yes you'd be committing a copyright violation.
|
| Most humans don't do so unintentionally though.
| FeepingCreature wrote:
| Just as an example, this is very widespread in music though.
| contravariant wrote:
| If the whole 'Dark Horse' debacle proved anything it would
| be that that can still be considered a copyright
| infringement. Sure that particular example was (rightly
| IMHO) deemed to not be a copyright violation, but they
| still had to show their version was original enough, they
| couldn't just claim such copying wasn't ever an
| infringement.
| elliekelly wrote:
| I'm not so sure Copilot is doing so "unintentionally"
| either...
| Bdbvd wrote:
| Doesn't matter how you define it. What you have to understand
| is the personality trait spectrum of the chimp troupe.
|
| Some chimps can't sleep at night if what comes out of their 6
| inch chimp head is not acknowledged in some way by the entire
| troupe. These bastards then spend their whole lives finding
| each other and reinforcing each other's self importance,
| calling the stories laws, ethics and all kinds of bullshit. It
| all doesn't matter, cause once what comes out of the 6 inch
| head has been digitized it is now property of the rest of the
| universe.
| belorn wrote:
| The boundaries are not set in stone, so the answer is the old
| theme of "it depends". To provide a slightly different
| situation which was discussed a few years ago: can you train
| an AI on pictures of human faces without getting permission?
| Human painters have created images of faces for a very long
| time, so is it any different in terms of law and/or ethics if
| an AI does it?
|
| Yes, a bit? It depends. Using such things for advertisement
| would likely cause anger if people start to recognize images
| from the training set the AI was trained on.
| b3morales wrote:
| My opinion would be that if the training set for the face
| generator was made up of photos whose creators had asked you
| to credit them if you re-used their work, then, yes, the
| generator is ethically in the wrong if it's skipping that
| attribution. Regardless of copyright. (And I feel the same
| way about Copilot.)
| habosa wrote:
| I am not a lawyer but I am sure that any legal standard for ML
| has to be different than "isn't it just doing what humans do,
| but faster?"
|
| GitHub scanning billions of code files to build commercial
| software is different than you learning at human pace, even if
| they're both "learning" and in the end they both produce
| commercial software.
| leereeves wrote:
| > isn't it just doing what humans do, but faster?
|
| The human activity most like training an ML system is
| memorizing a text by reciting from memory, checking against
| the original, adjusting, and repeating until there are
| acceptably few mistakes.
|
| And if a human did so for thousands of texts then publicly
| repeated those texts, they would be violating copyright too.
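|
| (A toy C version of that recite/check/adjust loop, to make the
| analogy concrete - the text and thresholds are made up. Note
| that the end state is, by construction, a copy of the
| original:)
|
|     #include <stdio.h>
|     #include <string.h>
|
|     int main(void) {
|         const char *original = "No man is an island, entire of itself;";
|         char memory[64] = {0};
|         size_t len = strlen(original);
|         memset(memory, '_', len);            /* knows nothing yet */
|
|         for (int pass = 1; ; pass++) {
|             size_t mistakes = 0;             /* "recite" and check */
|             for (size_t i = 0; i < len; i++)
|                 if (memory[i] != original[i]) mistakes++;
|             printf("pass %2d: %2zu mistakes\n", pass, mistakes);
|             if (mistakes <= 2) break;        /* "acceptably few" */
|             size_t fixed = 0;                /* adjust a few errors */
|             for (size_t i = 0; i < len && fixed < 5; i++)
|                 if (memory[i] != original[i]) {
|                     memory[i] = original[i];
|                     fixed++;
|                 }
|         }
|         return 0;
|     }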
| danbruc wrote:
| It does not have to be different but it certainly can be
| different, a difference in quantity can certainly be a
| difference in quality. People watching other people walk by,
| and a camera - maybe with face detection - doing the same,
| differ not only in quantity but also in quality.
| skinkestek wrote:
| > I, as a human being, have trained by reading code, much of
| which is covered by licenses that are somehow not compatible
| with code I'm writing. Am I violating licenses?
|
| As someone who has taught students in ICT, a quick rule of
| thumb was that I picked a piece of text that I suspected,
| wrapped it in double quotes and put it into a search engine.
|
| 9/10 times - possibly more - when I had that feeling, it was
| true. 17 year olds don't write like seasoned reporters most of
| the time.
|
| Obviously there needs to be some independent thought in there
| as well, but for teenagers I put the line at not copying
| verbatim, and at citing sources.
|
| As we've seen demonstrated again and again copilot breaks both
| my minimum standard rules for teenagers: it copies verbatim and
| it doesn't cite sources.
|
| I say that is pretty bad.
|
| If the system had actually learned the structure and applied
| what it had learned to recreate the same it would be a whole
| different story.
|
| But in this case it is obvious that the AI isn't writing the
| code - at least not all the time, it is instead choosing what
| to copy - verbatim.
| qayxc wrote:
| > But in this case it is obvious that the AI isn't writing
| the code - at least not all the time, it is instead choosing
| what to copy - verbatim.
|
| I still don't see any problem with that. If it's larger
| sections (e.g. entire NON-TRIVIAL function bodies), those can
| be filtered or correctly attributed after inference. So
| that's just a technicality.
|
| Smaller snippets and trivial or mechanical implementations
| (generated code, API calls, API access patterns) aren't
| subject to any kind of protection anyway.
|
|     int main(int argc, char* argv[]) {
|
| Lines like that hold no intellectual value and can be found
| in GPL'ed code. It can be argued that that's a verbatim
| reproduction, yet it's not a violation of any kind in any
| reasonable context.
|
| Where do you draw the line and how would you be able to -
| automatically even! - decide what does and does not represent
| a significant verbatim reproduction?
| jcelerier wrote:
| what about lines such as
|
|     Idxs[i] += (Imm >> ((i * HalfLaneElts) % 8)) & ((1 << HalfLaneElts) - 1);
|     double r2 = fma(u*v, fma(v, fma(v, fma(v, ca_4, ca_3), ca_2), ca_1), -correction);
|     seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
|     qint32 val = d + (((fromX << 8) + 0xff - lx) * dd >> 8);
|
| even if it's one line, it likely took some non-negligible
| thinking time from the programmer
| wvenable wrote:
| What about E = mc^2 ?
|
| Mathematics and physics equations are not copyrightable.
| jcelerier wrote:
| but those aren't only mathematics. There's the choice of
| variable names, the order in which things are called
| (maybe to optimize the performance on some CPU, we don't
| know), etc
| wvenable wrote:
| Your original argument is based on the false premise that the
| amount of time or effort matters -- it doesn't. Not all human
| activity can or should be subject to copyright -- this is the
| dangerous slippery slope of "intellectual property" -- and we
| are dangling over the edge these days.
| skinkestek wrote:
| >I still don't see any problem with that. If it's larger
| sections (e.g. entire NON-TRIVIAL function bodies), those
| can be filtered or correctly attributed after inference. So
| that's just a technicality.
|
| Today copilot does what it does.
|
| I've never heard Microsoft defend anyone running afoul of
| some of their licensing details with "they can fix it
| later, it is just a technicality".
|
| I think this should go both ways? No?
|
| > Smaller snippets and trivial or mechanical
| implementations (generated code, API calls, API access
| patterns) aren't subject to any kind of protection anyway.
| int main(int argc, char* argv[]) {
|
| > Lines like that hold no intellectual value and can be
| found in GPL'ed code. It can be argued that that's a
| verbatim reproduction, yet it's not a violation of any kind
| in any reasonable context.
|
| Totally agree. Edit: otherwise we'd all be in serious
| trouble.
|
| > Where do you draw the line and how would you be able to -
| automatically even! - decide what does and does not
| represent a significant verbatim reproduction?
|
| I am not a lawyer but I guess many can agree that somewhere
| before copying functions verbatim, comments literally
| copied as well for good measure, somewhere before that
| point there is a line.
|
| On the other hand: if there was significant evidence that
| the AI was doing creative work, not just (or partially
| just) copying then I think I would say it was OK even if it
| arrived at that knowledge by reading copyrighted works.
|
| Edit: how could we know if it was doing creative work? First,
| because it wouldn't be literally the same. Literal copying is
| literal copying regardless of whether it is done using Xerox,
| paid writers, infinite monkeys on infinite typewriters, "AI"
| or actual strong AI.
|
| After that it becomes a bit more fuzzy as more
| possibilities open up:
|
| - for student works I look at how well adapted it is to the
| question at hand: a good answer from Stackoverflow,
| attributed properly and adapted to the coding style of the
| code base? Absolutely OK. Copying together a bunch of stuff
| from examples in the frameworks website? Fine. Reading
| through all the docs and look at how a number of high
| profile projects have done it in their open source
| solution, updating the README.md with info on why this
| solution was chosen? Now you are looking for a top grade in
| my class.
|
| (of course IBM will probably not want you to work on their
| compiler if you admit that you've studied OpenJDK's, or so I
| have heard.)
| qayxc wrote:
| > Today copilot does what it does.
|
| It's also not a commercially released product yet, but a
| technical preview, so uncovering and addressing issues
| like that is exactly what pre-release versions are for.
|
| I'd say it succeeded greatly in sparking a discussion
| about these issues.
| CyberRabbi wrote:
| It's not AI it is ML. GPT-3 is a very large ML model. It does
| not reason. It's a statistical machine.
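|
| (A toy illustration of "statistical machine", sketched in C: a
| character-level bigram model whose "training" merely counts
| which character follows which, and whose "generation" greedily
| emits the most frequent successor. It prints "int a a a a..."
| - locally plausible statistics, no reasoning about code at
| all. Real language models are vastly larger, but the principle
| is the same:)
|
|     #include <stdio.h>
|
|     int main(void) {
|         const char *corpus = "int add(int a, int b) { return a + b; }";
|         static unsigned counts[256][256];
|
|         /* "train": count successor frequencies */
|         for (int i = 0; corpus[i + 1] != '\0'; i++)
|             counts[(unsigned char)corpus[i]]
|                   [(unsigned char)corpus[i + 1]]++;
|
|         /* "generate": always emit the most frequent successor */
|         unsigned char c = (unsigned char)corpus[0];
|         for (int step = 0; step < 20; step++) {
|             putchar(c);
|             unsigned best = 0, next = 0;
|             for (int j = 0; j < 256; j++)
|                 if (counts[c][j] > best) { best = counts[c][j]; next = j; }
|             if (best == 0) break;
|             c = (unsigned char)next;
|         }
|         putchar('\n');
|         return 0;
|     }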
| tsimionescu wrote:
| ML is a subset of AI, in any definition that I've seen. And
| both are needlessly anthropomorphizing what are currently
| simple statistical or rule-based deduction engines.
|
| GPT-3 is no more 'intelligent' in the human sense than it
| is 'learning' in the human sense.
| cnity wrote:
| By this logic there is no such thing as AI.
| Eikon wrote:
| There's no such thing as AI.
| kzrdude wrote:
| The training question seems much more difficult.
|
| The main problem that has been the topic is a simpler one -
| about the produced work. If you exactly reproduce someone's
| existing code (doesn't matter if you copy by flipping bits one
| by one or which technology you use), isn't it a copyright
| violation?
|
| I'm kind of imagining a Rube Goldberg machine that spells out
| the quake invsqrt function in the sand, now...
| wongarsu wrote:
| Yes, if you play a video from Netflix while recording your
| screen, transcode that video to MPEG2 and use a red laser to
| write a complex encoding of that MPEG2 bitstream onto a
| plastic disk, then send that by mail to your friend, a court
| won't care about the complexity of that Rube Goldberg
| machine. They will just say it's a clear copyright violation
| since you distributed a Netflix movie by DVD.
|
| With programming, there's the further complication of what
| constitutes a work. But quake's invsqrt certainly qualifies,
| just like that one function from the Oracle vs Google case.
| [deleted]
| tsimionescu wrote:
| > I, as a human being, have trained by reading code, much of
| which is covered by licenses that are somehow not compatible
| with code I'm writing. Am I violating licenses?
|
| There are many good answers from the legal side. I would also
| attack this side: the way human beings learn is entirely
| different from the way ML models are trained. We don't do
| gradient descent to find the slope of data points and find the
| most likely next bit of code.
|
| We humans create rational models of the code and of the world,
| and use deduction from those models to create code. This is
| extremely visible in the way we can explain the reason behind
| our code, and in the way we are aware of the difference between
| copying code we've seen before vs writing new code. It's also
| visible in that we can be told rules and produce code that
| obeys those rules that doesn't resemble any code ever written
| before.
|
| The difference is also easily quantifiable: humans learn to
| program after seeing vastly fewer code examples than Co-pilot
| needed, and we are much better at it.
|
| One day, we will design an AI that does learn more similarly to
| how humans learn, and that day your question will be far more
| interesting. But we are far from such problems.
| FeepingCreature wrote:
| I'm not sure this is actually true. We can explain code, but
| the fact that we can explain code is not necessarily related
| to the way we actually end up writing it. Have you ever
| written a function "on autopilot"? Your brain has selected
| what you wanted it to do, and now you're just typing without
| thought? I don't think we're as dissimilar to this model as
| we'd like.
| b3morales wrote:
| The feeling of being "on autopilot" when doing a task has
| to do with your, let's call it, _supervisory process_ being
| otherwise occupied. It doesn't suggest that the other
| mental processes which are responsible for figuring out the
| actions have changed their character or mode of operation.
|
| "You" are just not paying attention to it in that moment.
| carlgreene wrote:
| I'm not trying to lessen the implications of something like
| this, but didn't we all agree to them being able to do this
| when we agreed to their TOS?
| MiddleEndian wrote:
| Let's assume that is enforceable through the TOS (which I
| doubt), would that make hosting GPL'd code on Github a
| violation of the GPL? If programmer X releases GPL'd code on
| his website and programmer Y copies it to Github, then it could
| presumably be considered a bypass of the copyright.
| lucasyvas wrote:
| I don't believe ToS are ever legally binding.
| unanswered wrote:
| Yeah, but I don't think you're allowed to interrupt the
| circlejerk by pointing that out. Every piece of code pushed to
| GitHub comes with: an implied licence for GitHub and its users,
| which is an alternative to any explicit license in the
| code(!!!) and also a representation that you're authorized to
| grant such a license. Of course one imagines that in many cases
| uploaders are not actually authorized to grant such a license,
| such as if they're uploading something they themselves have
| received under GPL license, but IANAL.
| vel0city wrote:
| I can't wait for machine learning models that given the right
| input nearly perfectly reproduce feature length movies or music.
| It's not copyright infringement, it was generated by a computer!
| cortexio wrote:
| I don't really care; if the code is public, it's public. If
| you can copy it, why can't a bot? But it would be a useful
| feature. Or does it also look at private repos? That would be
| scary.
| SavageBeast wrote:
| When I first read about the new copilot tool I immediately
| thought it would just be a matter of time before some group
| started poisoning the AI. Garbage in, garbage out right?
|
| So now we know its ALL public repos ... how long until all the
| opponents of this tool have a giant repo full of syntactically
| correct code that employs terrible design patterns and is
| thoroughly obfuscated? I'm not going to waste my time on this
| personally but there are certainly those who will. Someone will
| invent a tool that perverts perfectly good code in the process
| and probably have a good laugh.
|
| Personally, while I recognize some people might find it useful,
| I don't much care for it. No, I haven't tried it yet either.
| I've never sampled escargot either, and I know I don't care for
| it all the same. Maybe it's wonderful, I'll never know - but I
| do know that I simply don't like the idea of it. Call it an
| objection on general principle if you like.
|
| So remember kids: if you're not PAYING then you are the product.
|
| Bottom line - private repos are cheap and you should use them
| rather than freebie public stuff.
| timdaub wrote:
| Built on Stolen Data:
| https://rugpullindex.com/blog#BuiltonStolenData
| jand wrote:
| Not a github user (*lab), also not a lawyer, so please excuse my
| ignorance.
|
| As this boils down to legal arguments, are there any clauses
| (maybe disputed) in the ToS allowing github/MS usage of public
| repos for such purpose?
|
| Would it even be legally possible to override a software license
| as a repo provider like "by using this service, you agree to..."?
| Luker88 wrote:
| So...
|
| Putting the (imho) big licensing problems aside, what about the
| software patents?
|
| Apache and GPL have patent protection clauses.
|
| Does this mean that anyone using copilot might somehow end up
| with code that implements something patented but covered by
| those clauses, without having received the patent grant that
| the Apache/GPL license would have given them?
|
| ...I kind of hate myself for saying this, but... Patent trolls to
| the rescue?
| aliasEli wrote:
| A direct confirmation from GitHub itself. This is problematic
| because Copilot sometimes outputs code that was present in its
| training set.
|
| https://fossbytes.com/github-copilot-generating-functional-a...
| deviledeggs wrote:
| This is the crux of copilot. When I saw it copying RSA keys I
| knew it's overtrained.
|
| Most of the comments are waxing philosophical about the
| possibilities of copilot copying GPL code.
|
| The reality in this case is clear: it's copy-pasting thousands
| of characters of GPL code with no modifications. Copyright
| violation, clear as day.
| sandruso wrote:
| I don't know how to approach this. As a human I can read all
| public code, learn from it regardless of the license, and come
| up with new solutions. A machine can read everything too, but
| can't create new ideas or approaches. How is copilot defined,
| then? Should it be only a smart system for general code
| snippets?
| raspyberr wrote:
| Well you can read public code all you like, but you can't just
| take chunks of code and republish them under different
| licenses, as copilot has been shown doing.
| sandruso wrote:
| If you grab a chunk of licensed code and put it into a private
| repo, what prevents you from doing that? How much licensed
| code is scattered across private projects? I'm curious how
| these license violations are detected.
| Spivak wrote:
| I mean, if you text and drive while a police officer isn't
| around to see it, you still broke the law. Just because piracy
| is huuuuge and largely unpunished doesn't mean that copyright
| doesn't have to be respected in a huge, publicly visible,
| trying-to-be-above-board project.
| kevincox wrote:
| Copyright law "prevents" you from doing that. To be more
| specific copyright law specifies that you must comply with
| the license of the copyright holder in cases such as the
| one you have described.
|
| > How much of licensed code is scattered across private
| projects?
|
| Whether or not copyright violations regularly occur is not
| (directly) relevant to whether or not it is illegal. People
| download copyrighted movies without licenses all the time
| and it still isn't legal.
| [deleted]
| maxrev17 wrote:
| How do people think developers learn?! Many probably recite
| copyrighted code almost verbatim on the reg. Storm in a tea cup.
| codehawke wrote:
| Most likely they are using the private stuff too.
| greatgib wrote:
| As I said in another thread, in my opinion there is no issue
| with them training on whatever public data they like.
|
| And in the end, the output in itself is not really an issue. It
| is just a machine outputting random lines it encountered on
| the internet.
|
| The problem is from the user side: Ok, you got random lines from
| random places. If you do nothing about it, then no issue. But if
| you try to use, publish, sell the code, then you are in deep
| shit. But somehow it's your fault.
|
| For GitHub, the problem is more to be sued by "customers" that
| assumed that the generated code was safe to use when it is not
| the case.
|
| And, as a general comment, I think that this case is very
| illustrative about the misconceptions about AI and machine
| learning for the general public:
|
| Here you can see that you don't really have an intelligent
| system that can learn and then create something new and
| innovative from scratch. It is just a machine that copies code
| it has already seen, based on correlations with your current
| code.
| Andrex wrote:
| Hmm, no. I'll be (finally) moving to GitLab or similar.
| ssivark wrote:
| Calling it "public" code feels like doublespeak. It's most
| definitely NOT public domain code -- it only happens to be hosted
| on GitHub and browsable ( _but not copyable_ ) by people. "Source
| available for viewing" is very different from "public property"
| as the phrase is commonly understood:
| https://en.m.wikipedia.org/wiki/Public_property
| BlueTemplar wrote:
| Interestingly, it is copyable... but only on GitHub!
| ("forkable")
|
| That's some nasty walled-garden terms... I wonder how much
| these kinds of ToS are actually legal?
| swiley wrote:
| Pressing that "fork" button might be illegal. It's certainly
| illegal to push after pressing it in many cases.
| prettygood wrote:
| Not copyable by people, but we can go through the code, learn
| from it and then use that knowledge to improve our coding
| skills.
|
| Isn't that what Copilot is doing here? The system is merely
| learning how to code, and then applying its learnings to other
| programming problems. It's not like it's writing software to
| specifically compete with other programs.
| jspaetzel wrote:
| There's a really philosophical question here about whether
| Copilot is learning or imitating.
|
| For instance, a parrot doesn't learn to speak, it learns to
| imitate speech.
| hellcow wrote:
| Not when it outputs large sections of unique code verbatim,
| as it's been shown to do.
| qayxc wrote:
| If it's large sections, that can be fixed by either licence
| attribution or result filtering.
|
| That's at best a technical issue. What way too many people
| claim, however, is that the machine isn't even allowed to
| _look at_ GPL'ed code for some reason, while humans are.
|
| I'd like to learn the reasoning behind that.
| ssivark wrote:
| I think result-filtering (based on license of search
| results) is gnarly enough, and likely computationally
| intensive, so as to break the whole feature. But it would
| be interesting to see if that can be crafted to fix the
| shortcomings of the ML model.
| thomasahle wrote:
| > What way too many people claim, however, is that the
| machine isn't even allowed to look at GPL'ed code for
| some reason, while humans are.
|
| Why would those be the same thing? It's a matter of
| scale. Just like how people are allowed to read websites,
| but scraping is often disallowed.
| fragileone wrote:
| The word they're actually referring to here is "source
| available", and trying to use "public" is just to confuse
| people into thinking they're referring to public domain only.
| ghoward wrote:
| Maybe they mean code that is in public (versus private)
| repos? And then use the word to make it seem like it's stuff
| in the public domain?
| CyberRabbi wrote:
| Movies are "public" too. That does not mean you are allowed to
| use them for any purpose. The term "Public" does not have
| specific legal consequences in copyright law outside of
| something being "public domain" as you say.
| slownews45 wrote:
| You are allowed to watch them. Many movies take ideas from
| other movies, which took ideas from myths and earlier stories.
| In fact, I find modern movies highly derivative.
| planb wrote:
| The question is: are you allowed to train a neural network on
| movies (e.g. For an automated color grading algorithm) and
| then sell that as a service?
| Hamuko wrote:
| > _it only happens to be hosted on GitHub and browsable (but
| not copyable) by people._
|
| So would you say that it's _publicly_ visible?
| sp332 wrote:
| Publicly visible, yes. Publicly available, yes. Public code,
| no.
| frumper wrote:
| If you post your code to the public, I wouldn't be shocked
| if people copy it verbatim without regard to license. I'm
| not suggesting that is a proper thing to do, just accepting
| that it can happen when I post code.
| Hamuko wrote:
| "Public code" is not a defined term. It's not short for
| "public domain code".
| blibble wrote:
| I guess leaked copies of the NT kernel source on github are now
| "public" in the eyes of MS?
| will4274 wrote:
| Public and public domain are not the same thing. This code is
| public in the same way that Google indexes publicly available
| information on the internet.
| mensetmanusman wrote:
| Does GPT-3 have to attribute mankind for reading all of the
| internet?
|
| What about deep learning-artwork trained on google searches?
|
| We enter a new era...
| habitue wrote:
| I am really confused by HN's response to copilot. It seems like
| before the twitter thread on it went viral, the only people who
| cared about programmers copying (verbatim!) short snippets of
| code like this would be lawyers and executives. Suddenly everyone
| is coming out of the woodwork as copyright maximalists?
|
| I know HN loves a good "well actually" and Microsoft is always
| suspect, but let's leave the idea of code laundering to the
| Oracle lawyers. Let hackers continue to play and solve
| interesting problems.
|
| Copilot should be inspiring people to figure out how to do better
| than it, not making hackers get up in arms trying to slap it
| down.
| joe_the_user wrote:
| _I am really confused by HN's response to copilot._
|
| If you're asking about the moral reaction here, I think it
| depends on how one views Copilot. Does Copilot create basically
| original code that just happens to include a few small
| snippets? Or does Copilot actually generate a large portion of
| lightly changed code when it's not spitting out verbatim copies
| of the code? I mean, if you tell Copilot, "make me a QT-
| compatible, cross-platform windowing library" and it spits out
| a slightly modified version of the QT source code, and someone
| started distributing that with a very cheap commercial license,
| that would be a problem for the QT company, which licenses
| their code commercially or under the GPL (and as QT is a
| library, the QT GPL forces users to also release their own code
| under the GPL if they release it, so it's a big restriction).
| So in the worst-case scenario, you have something ethically
| dubious as well as legally dubious.
|
| _Copilot should be inspiring people to figure out how to do
| better than it, not making hackers get up in arms trying to
| slap it down._
|
| Why can't we do both? I mean, I am quite interested in AI and
| its progress, and I also think it's important to note the way
| that AI "launders" a lot of things (it launders bias, launders
| source code, etc.). AI scanning of job applications has all
| sorts of unfortunate effects, etc. But my critique of the
| applications doesn't make me uninterested in the theory;
| they're two different things.
| fragmede wrote:
| A naive developer thinks that they are the source code they
| write (you're not), and their source code leaking to the
| world makes them worthless. (Which isn't true, but being
| _that_ invalidated explains a lot of the fear. Which, welcome
| to the club, programmers. Automation's here for your job
| too.)
|
| Still, some of the moral outrage here has to do with it
| coming from Github, and thus Microsoft. Software startup Kite
| has largely gone under the radar so far, but they launched
| this back in 2016. Github's late to the game. But look at the
| difference (and similarities) in responses to their product
| launch posts here.
|
| https://news.ycombinator.com/item?id=11497111 and
| https://news.ycombinator.com/item?id=19018037
| papito wrote:
| In addition, this is extremely hard to enforce. I think the
| amount of code running in closed systems that does not exactly
| respect the original license is shocking. What was the last
| case you know where this was a "scandal"?
|
| It only happens at boss level when tech giants litigate IP
| issues.
| 63 wrote:
| Firstly it's important to remember that HN is not a single
| person with a single opinion, but many people with conflicting
| opinions. Personally I'm just interested in the copyright
| discussion for the sake of it because I find it interesting.
| Though, I imagine there are also some feelings of
| unfairness.
| bpodgursky wrote:
| Hacker News hates everything, especially if it seems to work.
| Don't read into it.
| [deleted]
| dang wrote:
| " _Please don't sneer, including at the rest of the
| community._"
|
| https://news.ycombinator.com/newsguidelines.html
| mupuff1234 wrote:
| I think the real issue is less about the "copying short
| snippets" and more about how it was done: zero transparency,
| default opt-in without any regard to licensing (with no way to
| opt out??), and last but not least - planning to charge money
| for it.
| muyuu wrote:
| idk, I don't quite enjoy the idea of having my code stolen
| without any respect for its licence or even attribution
|
| but then again I migrated away from github as soon as MS bought
| it
|
| still, it's a matter of principle
| moyix wrote:
| Programmers _love_ to pretend that they're lawyers, especially
| when it comes to copyright law. Something about the law really
| appeals to hackers!
| paxys wrote:
| It's a pretty standard "big company releases new thing"
| reaction. HN is usually negative on everything.
| sp332 wrote:
| You don't have to be a copyright maximalist to worry about a
| company taking snippets of code that used to be under an open
| license and using them in a closed-source app.
| carom wrote:
| It is a large corporation eroding the integrity of open source
| licenses. It is perfectly reasonable to be pissed off about
| this.
| stusmall wrote:
| I've always cared but never talked about it. Someone copy and
| pasting code from a source that is clearly forbidden (free
| software, reverse engineered code, leaked source code, etc)
| isn't an interesting thing to talk about. It's obviously wrong.
|
| Also people rarely do it; I've caught maybe a couple instances
| of it in my career and I never really thought too much about
| them again. This tool helps make it a lot easier and more
| common. I have a feeling other people chiming in are also in
| the camp of "Oh, this is going to be a thing now, huh?"
|
| I also can't help but think that my negative opinion of it
| isn't solely based on this provenance issue. While it's cool,
| it seems questionable how practical it is. If the value were
| clearer I think I could stomach the risk a bit better.
| isaac21259 wrote:
| If Copilot were open source I wouldn't have an issue with it.
| However, it is closed source, and a later version is intended
| to be sold.
| corobo wrote:
| The difference between copilot and copy pasting from
| stackoverflow is consent
| 6gvONxR4sf7o wrote:
| On many ML posts, you get arguments about IP, and there's a
| long history of IP wars on this forum, especially when
| licensing comes up. Then you add the popular Big Tech Is Evil
| arguments you see. I think it's a variety of factors coming
| together for people to be upset about someone else profiting
| from their own work in ways they didn't mean to allow.
|
| I expect that we'll need new copyright law to protect creators
| from this kind of thing (specifically, to give creators an
| _option_ to make their work public without allowing arbitrary
| ML to be trained on it). Otherwise the formula for ML based
| fair use is "$$$ + my things = your things" which is always a
| recipe for tension.
| leventov wrote:
| Perhaps people on HN start sensing that successors of Github
| Copilot will take their programming job. Rightly so.
|
| Personally, I think that in the age of AI programming any
| notions of code licensing should be abolished. There is no
| copyright for genes in nature or memes in culture; similarly,
| there shouldn't be copyright for code.
| croes wrote:
| >There is no copyright for genes in nature
|
| Since when are humans not a part of nature?
| superfrank wrote:
| > Perhaps people on HN start sensing that successors of
| Github Copilot will take their programming job. Rightly so.
|
| I still think we're a long way from that. Copilot will help
| write code quicker, but it's not doing anything you couldn't
| do with a Google search and copy/paste. Once developers move
| beyond the jr. level, writing code tends to become the least
| of their worries.
|
| Writing the code is easy, understanding how that code will
| affect the rest of the system is hard.
| fragmede wrote:
| Depends on your definition of "a long way". Some of the
| GPT3 based code generation demos (which, explicitly, are
| just that - demos - we aren't shown the limitations of the
| system during the demo) say that's closer than I think.
|
| https://analyticsindiamag.com/open-ai-gpt-3-code-
| generator-a... has a bunch of videos of this in action.
| tylersmith wrote:
| Based on the responses I've seen, people have it in their
| heads that Copilot is a system where you describe what kind
| of software you want and it finds it on Github and slaps
| your own license on it.
|
| It's just a smarter tab-completion.
| LouisSayers wrote:
| > Perhaps people on HN start sensing that successors of
| Github Copilot will take their programming job. Rightly so.
|
| I feel like this comment misunderstands what a software
| developer is doing. Copilot isn't going to understand the
| underlying problem to be solved. It's not going to know about
| the specific domain and what makes sense and what doesn't.
|
| We're not going to see developers replaced in our lifetime.
| For that you need actual intelligence - which is very
| different from the monkey see monkey do AI of today.
| luffapi wrote:
| > _Copilot should be inspiring people to figure out how to do
| better than it, not making hackers get up in arms trying to
| slap it down._
|
| One of the (many) problems is that GitHub/Microsoft already
| benefit from runaway network effects so it's difficult to "do
| better". Where will you get all of that training code if not
| off GitHub?
|
| The real answer to this is to yank your projects from GitHub
| now while you search for alternatives.
| Narishma wrote:
| Even if you do that, what's to stop them from using open
| source software from all over the web and not just what's on
| GitHub? The only way to stop them then is to go closed
| source.
| luffapi wrote:
| I mean stop them at a larger level by threatening their
| success as an organization. If developers stop publishing
| to GitHub they have bigger problems than training ML
| models.
|
| Whether or not this move is "legal", it should serve as a
| wake up call that GH is not actually a service we should be
| empowering. This incident is just one example of why that's
| a bad idea.
| adamdusty wrote:
| I'm more surprised that people don't care about the telemetry
| aspect. It's an extension that sends your code to an MS
| service, and MS promises access is on a need-to-know basis.
|
| I don't care if MS copies my hobby projects exactly, but I'm
| not sure my employer (defense contractor) would even be allowed
| to use a tool like this.
|
| I think it looks cool though. I will probably try it out if it
| is ever available for free and works for the languages I use.
| gradys wrote:
| It's quite possible to do this on-prem and even on-device.
| TabNine, a very similar system with a smaller model (based on
| GPT-2 rather than 3), has existed for years and works on-
| device.
| ageyfman wrote:
| Try doing any type of deal (fundraising, M&A) where you can't
| point to the provenance of your application's code. This isn't
| good for programmers, programmers WANT clean and knowable
| copyrights. This is good for lawyers, who'll now have another
| way to extract thousands of $$ from companies to launder their
| code.
| jspaetzel wrote:
| Copyleft licenses are generally liked by developers; this flies
| directly against that, since it suggests circumvention of
| those types of licenses.
| nickvincent wrote:
| Regardless of how the (potentially very impactful) debate about
| licensing and copyright plays out, I think many here would agree
| this constitutes an "exploitation" of labor, at least in a mild
| sense.
|
| Optimistically, Copilot could be a wake up call for thinking more
| deeply about how the winnings of data-dependent technologies
| (ultimately, dependent on the labor of people who do things like
| write open source code) are concentrated--or shared more broadly.
|
| This longer blog post goes into more of a labor framing on the
| topic: https://www.psagroup.org/blogposts/101
|
| (For the record, I certainly think Copilot could be very good for
| programmers in general and am not arguing against its existence
| -- just arguing that this is a high profile case study, useful
| for thinking about data-dependent tech in general)
| leventov wrote:
| There will be just a short transition period. In 10 years, AI
| will be writing most of code, and in 20 years - nearly all
| code. People will do only architecture/business analysis.
|
| No more "exploitation" of labor.
| hpcjoe wrote:
| "All your code are belong to us" ... :(
| 1970-01-01 wrote:
| What about the other half of the law: If your copilot code takes
| from public sources but produces something that is patented, can
| you be sued by a patent troll? (Yes.)
| mikehearn wrote:
| ML novice question: is this atypical when training models? Wasn't
| GPT-3 trained on a lot of copyrighted data? My gut instinct,
| which is based on very little information, is that it would be
| pretty hard to train models if you could only use open-licensed
| material.
| CyberRabbi wrote:
| Yes training data is very valuable. Producing quality training
| data is an industry in itself. GitHub is trying to get it for
| free; it doesn't work that way.
| SCLeo wrote:
| I am not a lawyer but I do believe GPT-3 as a commercial
| product trained using copyrighted data constitutes
| infringement. I also think GPT-2 does not because it is for
| research purposes, which made it fair use.
| Tenoke wrote:
| Yes, it would stifle NLP research immensely, and we likely
| wouldn't see anything better than GPT-3 for years if such
| restrictions are put in place.
| CyberRabbi wrote:
| You're free to privately research with this data but
| commercializing other people's work using ML is theft.
|
| Edit: commercializing of the derived work is one explicit
| consideration used by US law in making a fair use
| determination. That said, even if it weren't commercialized
| it may still be infringement and I believe it is.
| Spivak wrote:
| Commercializing isn't really the issue, it's still
| copyright infringement even if you release it for free
| (i.e. piracy) -- it's unauthorized redistribution (i.e.
| copying).
| Tenoke wrote:
| Even if we accept that (which many wouldn't, as most licenses
| say little about research), the research would never be
| very useful if you can never make a comparable dataset to
| use in the real world.
| whimsicalism wrote:
| I get that the problem is commercializing, but the theories
| around copyright that are being deployed here would prevent
| even free, open-source NLP research from becoming a
| reality.
| lumost wrote:
| There is a difference between a model that achieves "fair
| use" of copyrighted work and one that regurgitates
| copyrighted work without attribution.
| ghaff wrote:
| You're basically seeing how some people would have had open
| source play out. You can look at and use the code but not to
| make money or in any other way that I personally disapprove
| of. This is a world where open source would have ended up
| being pretty much irrelevant.
| La1n wrote:
| Are we not now seeing why people would want to do
| that? A multi-billion dollar company is using people's work to
| make more profits without paying them.
|
| I definitely understand why people pick a license that
| disallows uses they don't agree with. Imagine baking
| cookies for your friends, and one of them reselling them.
| The material effect is the same to you - you gave away your
| cookies - but sometimes you make/do something for a certain
| group of people and not for others to make a profit off your
| work.
| ghaff wrote:
| People can do whatever they want with their work,
| including not sharing it at all.
|
| But a great deal of the value that's come from open
| source generally has been that open source licenses
| _haven't_ imposed the sort of usage-based restrictions
| (e.g. free for educational use only) that were fairly
| common in the PC world.
|
| And, to your example, in the case of software the
| incremental copy that your friend sold cost you
| absolutely nothing. So it comes down to a purely
| emotional response to someone else making money off
| something you made.
| La1n wrote:
| >So it comes down to a purely emotional response to
| someone else making money off something you made.
|
| Exactly, as I said, the material situation is the same.
| But we all are emotional beings, you would do certain
| things for your family you wouldn't for strangers. I
| don't think this case is any different.
|
| I personally don't work for free for a company, but I do
| charity work for free. Working for a company in the time
| I work for a charity would "cost me absolutely nothing"
| if I already spend the time anyway, but everyone
| understands the difference.
| jonfw wrote:
| It would be pretty concerning if people used GPT-3 while they
| were writing a novel, and it assisted them in plagiarizing a
| Stephen King novel.
|
| We already have examples of copilot blatantly plagiarizing code
| mikehearn wrote:
| Right, but that sounds like the bigger issue here is that the
| model might spit out copyrighted material, not just that it
| scrapes it. The former seems like a technology problem that
| Microsoft can solve.
| Spivak wrote:
| The issue is that not only might the model spit out
| copyrighted material verbatim (which it is) but that it
| might also spit out non-obvious derivative works that will
| get you in legal hot water years down the road.
| undecisive wrote:
| I see a lot of people trying to compare its "machine learning" to
| human learning.
|
| Let's use this thought experiment: Imagine that Github's Copilot
| was just a massive array of all the lines of code from every
| github project, with some (magical automated whatever) tagging
| and indexing on each function, and a search engine on top of
| that.
|
| Now imagine that copilot simply finds the closest search result,
| and then when you press a button, it inserts the line from the
| array, and press it again and you get the next line, etc.
|
| Now hopefully nobody here thinks such a system would fulfil
| either the spirit or the law of any half-restrictive license. Yet
| that is a perfectly valid implementation of Copilot's aim - and
| it sounds like it's not that far from what actually happens,
| maybe with a bit of variable name munging.
|
| So my question is this: imagine a line drawn between the
| system I describe above and the system of human learning, where
| a human learns the patterns and can genuinely produce novel
| structures and patterns and even programming languages that it
| has never seen before.
|
| At what point along that line would you say that Copilot is close
| enough to human to not be violating licenses that require
| attribution?
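|
| To make the retrieval end of that line concrete, here is a toy
| sketch (hypothetical Python) of a "completer" that only looks
| things up and never generalizes:
|     from difflib import SequenceMatcher
|
|     # A toy corpus standing in for "all the lines of code from
|     # every github project".
|     CORPUS = [
|         "def add(a, b):",
|         "    return a + b",
|         "for item in items:",
|         "    print(item)",
|     ]
|
|     def closest_line(prefix):
|         # Pure retrieval: return the single most similar stored
|         # line. No abstraction, no generalization, just lookup.
|         return max(CORPUS, key=lambda line:
|                    SequenceMatcher(None, prefix, line).ratio())
|
|     closest_line("def add")  # -> "def add(a, b):"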
| leereeves wrote:
| I don't think it matters where Copilot is on that line. A
| skilled human programmer at the far end of that line, fully
| capable of producing novel programs that they haven't seen
| before, would still be violating copyright if they reproduced a
| program they have seen before.
| Spivak wrote:
| I mean it answers the question pretty quickly if your agent
| isn't sophisticated enough to actually produce novel programs
| in the first place.
| taytus wrote:
| Microsoft has spent a lot of money and energy in earning
| developers' trust over the last 15 years.
|
| They have done an excellent job and succeeded in their goal.
|
| Now, with copilot they are about to lose it all.
| [deleted]
| albertzeyer wrote:
| So, when a human reads public code on the Internet (no matter the
| licence), and gains knowledge, learns (updates the synaptic
| weights of the brain), and then makes (indirectly) use of that
| gained knowledge for further work, how is this different to this
| case?
| [deleted]
| nowherebeen wrote:
| The difference is intent. When Github reads public code, their
| only intent is to profit from it. Depending on the license,
| that's a violation.
| albertzeyer wrote:
| A human also often intends to make profit (by using the
| gained knowledge).
| nowherebeen wrote:
| No, they intend to learn from it or find a solution to
| their problem. It's much harder to argue human intent in
| court than GitHub blatantly doing so.
| SXX wrote:
| It's no different, but if a human reads copyrighted proprietary
| code and then reproduces part of it exactly, they have a good
| chance of getting into huge legal trouble.
|
| On the other hand, said AI has no idea who the code belongs to,
| and it's able to reproduce it perfectly.
| djoldman wrote:
| One question I haven't really seen talked about is: when you get a
| suggestion through copilot and save the document, who is the
| author of the document?
|
| I think this may be the crux of this whole kerfuffle.
|
| If you're the author isn't it on you if you infringe?
|
| If not then perhaps you and GitHub/Microsoft share
| authorship/culpability?
|
| Who has the copyright to a piece of text generated by a tool? Or
| art generated by a model?
| cameldrv wrote:
| My question is what GitHub is going to do when people start
| sending them DMCA takedown notices over their code being
| distributed through this system.
|
| Currently, if you claim to be a copyright owner GitHub can
| respond to a DMCA takedown by removing the repository. This might
| require them to retrain the entire model.
|
| One option for GitHub might be to maintain a blocklist of various
| code snippets, and if there is a substring match, just don't make
| the suggestion.
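|
| A crude sketch of that last option (hypothetical Python; real
| matching would want tokenization rather than raw strings):
|     import re
|
|     # Snippets received via takedown notices (invented examples).
|     BLOCKLIST = {
|         "float q_rsqrt( float number )",
|         "const float threehalfs = 1.5F;",
|     }
|
|     def normalized(s):
|         # Ignore whitespace differences when comparing.
|         return re.sub(r"\s+", " ", s).strip()
|
|     def allowed(suggestion):
|         # Suppress any suggestion containing a blocked snippet.
|         text = normalized(suggestion)
|         return not any(normalized(b) in text for b in BLOCKLIST)
| This avoids retraining, but it is whack-a-mole: renaming one
| identifier defeats a literal substring match.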
| thinkingemote wrote:
| The answer is simple: Github needs to make a tool which can scan
| all your code to see if it contains public code. It's
| what universities around the world do for students' work.
|
| Of course, there's a huge irony in that Github is also making
| the tool that enables the widespread plagiarism....
| prepend wrote:
| It's not copyright violation to train ML on content. So the
| license doesn't matter unless there's some "can't use this for ML
| training" license that I don't know about (and doesn't seem to be
| legal).
| WoodenChair wrote:
| > It's not copyright violation to train ML on content.
|
| The training is not a copyright violation. That seems to be
| settled case law. Whether the verbatim copying as a result of
| that training is a copyright violation I think is less tested.
|
| Let's flip the domains. Say we had an ML algorithm that could
| auto generate news stories and it at some point (not all the
| time) copied verbatim a Wall Street Journal article and posted
| it to a blog. Copyright violation?
|
| With copilot, we're sometimes seeing "paragraphs" of source
| lines copied verbatim, so this analogy is not such a stretch.
|
| I think we need to think about how much our sharing culture in
| programming has tinted our view of the legality of this
| enterprise.
| vharuck wrote:
| >It's not copyright violation to train ML on content.
|
| I agree. It'd be a nice gesture to reach out to the creators of
| the training data, as is usual with web scrapers. But
| collecting and analyzing data publicly available on the web is
| ok.
|
| >So the license doesn't matter unless there's some "can't use
| this for ML training" license that I don't know about (and
| doesn't seem to be legal).
|
| I disagree. While Copilot is, at heart, an ML model, the
| copyright trouble comes from its usage. It consumes copyrighted
| code (ok), analyzes copyrighted code (still ok), and then
| produces code which sometimes is a copy of copyrighted code (not
| ok). The only way it'd be ok is if Copilot followed all
| licensing requirements when it produced copies of other works.
|
| Personally, I won't touch it for work until either Copilot
| abides by the licenses or there's robust case law.
| prepend wrote:
| > It'd be a nice gesture to reach out to the creators of the
| training data, like is usual with web scrapers.
|
| I don't think this is practical. And who notifies people of
| scraping content? I would've been annoyed if I got spam from
| sites that scraped my content.
| vharuck wrote:
| I've contacted websites about scraping when it'd be a
| repeat thing and they didn't have a robots.txt file
| available. Also if their stance on enforcing copyright was
| hazy (e.g. medical coding created by a non-profit).
| Sometimes, they pointed me toward an API I didn't know
| about.
|
| >I don't think this is practical.
|
| I don't like people ignoring things just because they're
| impractical for ML. That leads to crap like automated
| account banning without the possibility of talking to a living
| customer service representative.
| kune wrote:
| Guys please read the Terms of Use of Github section D.4.
|
| We need the legal right to do things like host Your Content,
| publish it, and share it. You grant us and our legal successors
| the right to store, archive, parse, and display Your Content, and
| make incidental copies, as necessary to provide the Service,
| including improving the Service over time. This license includes
| the right to do things like copy it to our database and make
| backups; show it to you and other users; parse it into a search
| index or otherwise analyze it on our servers; share it with other
| users; and perform it, in case Your Content is something like
| music or video.
|
| This license does not grant GitHub the right to sell Your
| Content. It also does not grant GitHub the right to otherwise
| distribute or use Your Content outside of our provision of the
| Service, except that as part of the right to archive Your
| Content, GitHub may permit our partners to store and archive Your
| Content in public repositories in connection with the GitHub
| Arctic Code Vault and GitHub Archive Program.
| bruce343434 wrote:
| Still depends on how they defined "the Service". Can't be
| bothered to read the full license myself because I don't use
| github - but I can't imagine "the Service" is defined as
| including an AI copy paster.
| mumblemumble wrote:
| > The "Service" refers to the applications, software,
| products, and services provided by GitHub, including any Beta
| Previews.
|
| So it wouldn't include just any AI copy pasters. Only the
| ones that are provided by GitHub.
| remram wrote:
| If I upload somebody else's GPL code to GitHub, I also can't
| grant to GitHub the (implicit) legal rights to use that code in
| Copilot, because they are not mine to give.
|
| I could previously mirror GPL code, because the GPL granted me
| the rights I need to grant GitHub as part of their ToS; but if
| they change their ToS, or if the meaning is changed by them
| adding vastly different features to their Service, this becomes
| a problem.
| eCa wrote:
| If Copilot requires separate payment or signup I[1] fail to see
| how it can be part of "the Service" as defined therein, and
| since the rights to do "things" to the provided code only go as
| far "as necessary to provide the Service" the ToS can't[2] be
| used to argue that it gives explicit permission to use provided
| code for this purpose. Or am I misinterpreting something?
|
| [1] I'm not a lawyer.
|
| [2] Still not a lawyer.
| baby wrote:
| The licensing game is really awful imo. It should be that
| releasing your code on github = fair game. Licenses are seriously
| hindering development. You either take part in open source or you
| don't. I get anxious every time someone asks me to add a license
| to one of my projects because I don't know which license to use
| and wonder if it'll prevent some people from using the software
| down the line. Once I tried writing my own license that basically
| said: I don't care, do whatever. Yet someone complained with "yet
| another license".
| dylan604 wrote:
| yeah, no. Licensing is really awful, yes.
|
| "You either take part in open source or you don't." I disagree.
| You can allow your software to be used and post the source
| code, but it is yours so you get some say in your intentions.
| Forking is what you're looking for. However, once you fork it,
| you still owe credit to those that did the heavy lifting before
| making whatever tweak it is you made and want to call it your
| own. There's nothing wrong with the original developers getting
| credit for the work they did. There's nothing wrong with the
| original devs willing to let other people use their work as
| long as it is used in the same spirit it was provided (FOSS).
| That also does not mean the original devs are wrong for being
| restrictive toward evilCorps that want their free software to
| be included/distributed in packages they sell and profit
| from.
| turndown wrote:
| It seems that now is finally the time I must apologize for some
| of the java code I put up on github.
| keonix wrote:
| If the training set contains _verbatim_ (A)GPL code, does this
| mean that Copilot also should be _distributed_ by Microsoft under
| GPL? Because without it Copilot (as it is _distributed_ by
| Microsoft) couldn't be built, wouldn't that make it a _derivative
| work_ of GPL'd code (and obviously every other license)?
|
| I see a lot of people comparing human learning to machine
| learning in the comments, but there is a huge difference - we
| don't _distribute_ copies of humans
| rapind wrote:
| This is pretty interesting for AI in general. Should you be
| able to train with material you don't own? Can your training
| benefit from material that has specific usage licenses attached
| to it? What about stuff like GameGAN?
| zoomablemind wrote:
| > ...Should you be able to train with material you don't own?
|
| If relating this to how humans learn, books and other sources
| are used to inform understanding and human knowledge. One can
| purchase or borrow a book without actually owning the
| copyright to it. Indeed, a given passage may be later quoted
| verbatim, provided it is accompanied with a reference to its
| source.
|
| Otherwise, a verbatim use without attribution in authored
| context is considered plagiarism.
|
| So, sure one can use a multitude of material for the
| training. Yet, once it gets to the use of the acquired
| "knowledge" - proper attribution is due for any "authentic
| enough" pieces.
|
| What is authentic enough in this case is not easy to define,
| however.
| rapind wrote:
| "If relating this to how humans learn" seems like a big IF
| though right? Are we going to treat computer neural nets as
| human from a legal standpoint?
|
| At some point Neural Nets like GameGAN might be good enough
| to duplicate (and optimize) a commercial game. Can you then
| release your version of the game? Do you just need to make
| a few tweaks? Are we going to get a double standard because
| commercial interests are opposed depending on the use case?
|
| It would be pretty funny if Microsoft as a game publisher
| lobbies to prevent their IP being used w/ something like
| GameGAN, but then takes the opposing stand point for
| something like their CoPilot! Although I'm sure it'll be
| spun as "These things are completely different!".
| paulryanrogers wrote:
| This is the key question. In school I was taught to be
| careful to always cite even paraphrased works. If Copilot
| regurgitates copyrighted fragments without citation or
| informing acceptors of licenses involved then it's
| facilitating infringement.
| ghoward wrote:
| This is a great argument.
| AaronFriel wrote:
| No, see Authors Guild v. Google. Even without a license or
| permission, fair use permits the mass scanning of books, the
| storage of the content of those books, and rendering verbatim
| snippets of those books. The Google Books site is not a
| derivative work of the millions of authors they copied from,
| and if they did copy any coincidentally GPL, AGPL, or creative
| commons copyleft work, the fair use exception applies before we
| reach the question of whether Google is obligated to provide
| anything beyond what it is doing.
|
| By comparison, Copilot is even more obviously fair use.
|
| I've had this conversation quite a few times lately, and the
| non-obvious thing for many developers is that fair use is an
| exception to copyright itself.
|
| A license is a grant of permission (with some terms) to use a
| copyrighted work.
|
| This snippet from the Linux kernel doesn't make my comment here
| or the website Hacker News a GPL derivative work:
| ret = vmbus_sendpacket(dev->channel, init_pkt,
|                        sizeof(struct nvsp_message),
|                        (unsigned long)init_pkt,
|                        VM_PKT_DATA_INBAND,
|                        VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
|
| This snippet from an AGPL licensed project, Bitwarden, does not
| compel dang or pg to release the Hacker News source code:
| await _sendRepository.ReplaceAsync(send);
| await _pushService.PushSyncSendUpdateAsync(send);
| return (await _sendFileStorageService
|     .GetSendFileDownloadUrlAsync(send, fileId), false, false);
|
| Fair use is an exception to copyright itself. A license cannot
| remove your right to fair use.
|
| The Free Software Foundation agrees
| (https://www.gnu.org/licenses/gpl-faq.en.html#GPLFairUse)
|
| > Yes, you do. "Fair use" is use that is allowed without any
| special permission. Since you don't need the developers'
| permission for such use, you can do it regardless of what the
| developers said about it--in the license or elsewhere, whether
| that license be the GNU GPL or any other free software license.
|
| > Note, however, that there is no world-wide principle of fair
| use; what kinds of use are considered "fair" varies from
| country to country.
|
| (And even this verbatim copying from FSF.org for the purpose of
| education is... Fair use!)
| RandomBK wrote:
| For any discussion on copyright and fair use, we should
| distinguish between the implications to Copilot the software
| itself and the implications to users of Copilot.
|
| For Copilot itself, I do see the case for fair use, though it
| gets fuzzy should Microsoft ever start commercializing the
| feature. Nevertheless, it remains to be seen whether ML
| training serves the same public policy benefits that public
| libraries and free debate leverage to enable the fair use
| defense.
|
| For Copilot users, I don't see an easy defense. In your
| hypothetical, this would be akin to me going on Google books
| and copying snippets of copyrighted works for my own book. In
| the case of Google books, they explicitly call out the limits
| on how the material they publish can be used. In contrast,
| Copilot seems to be designed to encourage such copying,
| making it more worrisome in comparison.
| swiley wrote:
| Just for reference, the hackernews source is public.
| e12e wrote:
| Not the current version? AFAIK there's some security-by-
| obscurity in the measures against spam, voting rings, etc.?
| jollybean wrote:
| Thanks for this, but can you answer the question:
|
| Would it be 'fair use' for developers to simply copy code
| from those repos - even just 10 lines, and claim 'fair use' -
| i.e. circumventing Copilot?
|
| Even if Copilot is 'fair use' ... does that mean the results
| are 'fair use' on the part of Copilot users?
|
| And a bigger question: is your interpretation of those
| statutes and case law enough to make the answer unambiguous?
|
| I don't have legal background, but I do have an operating
| background with lawyers and tech ... and my 'gut' says that
| anyone using Copilot is opening themselves up to lawsuits.
|
| If the code you put in your software comes via Copilot, and
| that code is verbatim from some kind of GPL'd (or worse,
| proprietary) source ... there's a good chance you could get
| sued if someone gets the inclination.
|
| Maybe it's because of my personal experience, but I can just
| see corporate lawyers banning Copilot straight up, as the
| risks are simply not worth the upside. That's not what we
| like to hear in the classically liberal sense, i.e. 'share and
| innovate' ... but gosh it doesn't feel like a happy legal
| situation to me.
|
| Looking forward to people with more insight sharing on this
| important topic.
| AaronFriel wrote:
| > Would it be 'fair use' for developers to simply copy
| code from those repos - even just 10 lines, and claim 'fair
| use' - i.e. circumventing Copilot?
|
| Only a lawyer (and truly, only a court) could answer that
| question.
|
| If you copy 100 lines of code that amounts to no more than
| a trivial implementation in a popular language of how to
| invert a binary tree, it's likely fair use.
|
| If you copy 10 lines of code that are highly novel, have
| never been written before, and solve a problem no one
| outside the authors has solved... It may not be fair use
| to copy that.
|
| Other people who have replied have mentioned "the heart" of
| a work. The US Supreme Court has held that even de minimis
| - "minimal", to be brief - copying can sometimes be
| infringement if you copied the "heart" of a work.
| syshum wrote:
| While I agree you are correct that (in the US anyway) fair
| use is an exemption from copyright, and thus supersedes
| licensing,
|
| I disagree that Copilot is "more obviously fair use." Some
| parts might be, but we have seen clear examples (i.e. verbatim
| code reproduction) that would not be.
|
| I don't believe the question of "is this fair use" is as clear
| as you believe it to be.
| indigochill wrote:
| Next up, Copilot for college papers! Who needs to pay a
| professional paper-writer (ahem, I mean write the paper) when
| you can have an AI write your paper for you! It's fair use,
| so you're entitled to claim ownership to it, right?
| jrochkind1 wrote:
| I think you are confusing legal protections for
| intellectual property with plagiarism. (At least that's
| what I think you're doing if I read your comment as sarcasm
| and guess what you're trying to say non-sarcastically?) But
| they are entirely different things.
|
| You can be violating copyright without plagiarizing: you
| cite your source, but you copy a copyright-protected work
| in a way the law does not allow.
|
| And you can be plagiarizing without violating copyright, if
| you have the permission of the copyright holder to use
| their content, or if the content is in the public domain
| and not protected by copyright, or if it's legal under fair
| use -- but you pass it off as your own work.
|
| Two entirely separate things. You can get expelled from
| school for plagiarism without violating anyone's copyright,
| or prosecuted for copyright infringement without committing
| any academic dishonesty.
|
| You can indeed have the legal right to make use of content,
| under fair use or anything else, but it can still be
| plagiarism. That you have a fair use right does not mean
| "Oh so that means you are allowed to turn it in to your
| professor and get an A and the law says you must be allowed
| to do this and nobody can say otherwise!" -- no.
| indigochill wrote:
| Yeah, I was being sarcastic. But you make a good point
| about the legality of plagiarism.
| tyre wrote:
| Copilot is not doing what your example does.
|
| If Github had a service that automatically mirrored public
| repositories on Gitlab, that would be equivalent to the
| example you gave.
|
| But Github is taking content under specific licenses to build
| something new for commercial use.
|
| I'm not sure if what Github does falls under Fair Use, but I
| don't know that it matters. I can read fifty books and then
| write my own, which would certainly rely--consciously or not
| --on what I had read. Is that a copyright violation? It
| doesn't seem like it is but maybe it is and until now has
| been impossible to prosecute?
| nojito wrote:
| GitHub isn't building anything.
|
| The end user is.
|
| By this logic any and all neural nets that draw pictures
| are copyright infringing as well.
| saati wrote:
| If they create exact copies of copyrighted pictures, then
| yes, they do.
| jkaplowitz wrote:
| The world is global. That's a US court ruling from one court
| of appeals. Most countries have narrower fair use rights than
| the US. Even if Copilot would fall within that legal
| precedent (far from guaranteed), a legal challenge in any
| jurisdiction worldwide outside the US states covered by that
| particular court of appeals, or which reaches the US Supreme
| Court, or which goes through the Federal Circuit Court of
| Appeals due to the initial complaint including a patent
| claim, would not be bound by that result and (especially in a
| different country) could very plausibly find otherwise.
|
| What's more, if any of the code implements a patent, fair use
| does not cover patent law, and relying on fair use rather
| than a copyright license does not benefit from any patent use
| grant that may be included in the copyright license. If a
| codebase infringes a patent due to Copilot automatically
| adding the code, I can easily imagine GitHub being attributed
| shared contributory liability for the infringement by a
| court.
|
| Not a lawyer, just a former law student and interested layman
| who has paid attention to these subjects.
| mirekrusin wrote:
| Is it fair use to memorise whole source code byte-by-byte,
| storing it as, say, some non-100%-lossless compression for
| subsequent retrieval of arbitrary-size snippets?
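|
| The technical half of that question is easy to demonstrate:
| even a trivial statistical model "compresses" its training
| text in a way that allows verbatim retrieval. A toy sketch
| (hypothetical Python):
|     from collections import defaultdict
|
|     def train(text, order=8):
|         # Map each `order`-character context to the characters
|         # observed after it.
|         model = defaultdict(list)
|         for i in range(len(text) - order):
|             model[text[i:i + order]].append(text[i + order])
|         return model
|
|     def generate(model, prompt, length, order=8):
|         out = prompt
|         for _ in range(length):
|             successors = model.get(out[-order:])
|             if not successors:
|                 break
|             out += successors[0]  # first observed continuation
|         return out
|
|     code = "static float Q_rsqrt(float number) { /* ... */ }"
|     model = train(code)
|     # Given a short prompt from the training data, the model
|     # emits the rest verbatim:
|     generate(model, code[:8], len(code)) == code  # -> True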
| shakna wrote:
| > No, see Authors Guild v. Google.
|
| That case required that the output be transformative, in that
| "words in books are being used in a way they have not been
| used before".
|
| Copilot only fits the transformative aspect if it is not
| directly reciting code that already exists in the form that
| it is redistributing. So long as it does so, it fails to meet
| the criteria.
| kmeisthax wrote:
| I think you might be considering two different acts here:
|
| 1. The act of training Copilot on public code
|
| 2. The resulting use of Copilot to generate presumably new
| code
|
| #1 is arguably close to the Authors Guild v. Google case.
| You are literally transforming the input code into an
| entirely new thing: a series of statistical parameters
| determining what functioning code "looks like". You can use
| this information to generate a whole bunch of novel and
| useful code sequences, not _just_ by feeding it parts of it
| 's training data and acting shocked that it remembered what
| it saw. That smells like fair use to me.
|
| #2 is where things get more dicey - _just because_ it's
| legal to train an ML system on copyrighted data wouldn't
| mean that its resulting output is non-infringing. The
| network itself is fair use, but the code it generates would
| be used in an ordinary commercial context, so you wouldn't
| be able to make a fair use argument here. This is the
| difference between scanning a bunch of books into a search
| engine, versus copying a paragraph out of the search engine
| and into your own work.
|
| (More generally: Fair use is non-transitive. Each reuse
| triggers a new fair use analysis of every prior work in the
| chain, because each _fair_ reuse creates a new copyright
| around what you added, but the original copyright also
| still remains.)
| tedunangst wrote:
| It's not possible to get copilot to output a transformed
| version of the input?
| shakna wrote:
| Transformed output _may_ fall under fair use.
|
| However - Copilot directly recites code. That is _very
| unlikely_ to fall under fair use.
|
| Redistributing the exact same code, in the same form, for
| the same purpose, probably means that Copilot, and thus
| the people responsible for it, are infringing.
| ThrowawayR2 wrote:
| > " _However - Copilot directly recites code._ "
|
| Sounds like that wouldn't be difficult to fix? Transform
| the code to an intermediate representation
| (https://en.wikipedia.org/wiki/Intermediate_representation) as
| a pre-processing stage, which ditches any non-essential
| structure of the code and eliminates comments, variable
| names, etc., before running the learning algorithms on
| it. _Et voila,_ much like a human learning something and
| reimplementing it, only essential code is generated
| without any possibility of accidentally regurgitating
| verbatim snippets of the source data.
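|
| Even a much shallower pass than a real IR removes most
| verbatim "fingerprints": drop comments and canonicalize
| identifier names. A sketch using Python's tokenize module
| (illustrative only, and only a stand-in for a true IR):
|     import io
|     import keyword
|     import tokenize
|
|     def canonical_tokens(source):
|         # Reduce Python source to a canonical token stream:
|         # comments dropped, every identifier renamed to v0, v1,
|         # ... in order of first appearance. The result is a
|         # fingerprint for training/matching, not runnable code.
|         names = {}
|         out = []
|         skip = (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
|                 tokenize.INDENT, tokenize.DEDENT,
|                 tokenize.ENDMARKER)
|         readline = io.StringIO(source).readline
|         for tok in tokenize.generate_tokens(readline):
|             if tok.type in skip:
|                 continue
|             if (tok.type == tokenize.NAME
|                     and not keyword.iskeyword(tok.string)):
|                 out.append(
|                     names.setdefault(tok.string, f"v{len(names)}"))
|             else:
|                 out.append(tok.string)
|         return " ".join(out)
|
|     # Two differently-written versions of the same logic
|     # normalize to the identical stream:
|     a = "def fast_inv_sqrt(x):  # magic\n    return 1/(x**0.5)\n"
|     b = "def recip_root(n):\n    return 1/(n**0.5)\n"
|     assert canonical_tokens(a) == canonical_tokens(b)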
| salawat wrote:
| At that point, can we all just agree IP is the stupidest
| concept to ever be layered on top of math (which
| programming is) and move on with non-copyrightable code?
| jcheng wrote:
| Only if you agree that copyleft licenses are also stupid;
| without copyright, there's no way to prevent companies
| from making closed-source forks of code you wrote and
| intended to stay open.
| oefnak wrote:
| Yes, sure. Without copyright there's no need for copyleft
| left, right?
| nonfamous wrote:
| > However - Copilot directly recites code.
|
| You make that statement as an absolute, but in the
| interests of clarity, all evidence so far shows that it
| directly recites code very rarely indeed. Even the Quake
| example had to be prompted by the specific variable names
| used in the original code.
|
| In practice, the output code is heavily influenced by
| your own context -- the comments you include, the
| variable names you use, even the name of the file you are
| editing -- and with use it's obvious that the code is
| almost certainly not a direct recitation of any existing
| code.
| svaha1728 wrote:
| So if a foreign company pilfers the source code to
| Windows, can they add it to a training set and then
| 'prompt' the machine learning algorithm to spit out a new
| 'copyright free' Windows, just by transforming the
| variable names?
| nonfamous wrote:
| Well no, because only GitHub has access to the training
| set. But more importantly this misunderstands how Copilot
| even works -- even if Windows was in the training set,
| you couldn't get Copilot to reproduce it. It only
| generates a few lines of code at a time, and even then
| it's almost certainly entirely novel code.
|
| Now, if you knew the code you wanted Copilot to generate
| you could certainly type it character by character and
| you might save yourself a few keystrokes with the TAB
| key, but it's going to be much MUCH easier to simply copy
| the whole codebase as files, and now you're right back
| where you started.
| rkeene2 wrote:
| I think that's my question regarding this whole thing:
|
| If it's so fair use, why not train it on all Microsoft
| code, regardless of license (in addition to GitHub.com) ?
| Would Microsoft employees be fine with Copilot re-
| creating "from memory" portions of Windows to use in WINE
| ?
| shakna wrote:
| > all evidence so far shows that it directly recites code
| very rarely indeed.
|
| _Once_ is enough for it to be infringing. The law is not
| very forgiving when you try and handwave it away.
| mthoms wrote:
| You sound quite sure that the outlying instances of
| direct copying wouldn't be covered by the Fair Use
| copyright exemption. Any particular reason for that?
|
| I tend to think it would be covered (provided they were
| relatively small snippets and not entire functions).
| AaronFriel wrote:
| Is there any evidence of Copilot producing substantial
| (100s of lines) verbatim copies of copyrighted works?
|
| Absent this, I don't think there's a case. The courts have
| given extraordinarily wide latitude to fair use and ML
| algorithms are routinely trained on copyrighted works,
| photos, etc. without a license.
|
| I understand that this feels more personal because it
| involves our field, but artists and authors have expressed
| the same sentiment when neural nets began making pictures
| and sentences.
|
| The question here is no different than "Is GPT-3 an
| unlicensed, unlawfully created derivative work of millions,
| if not billions of people?"
|
| No, I'm quite confident it is not.
| b3morales wrote:
| The point of Copilot -- its entire value as a product --
| is to produce code that matches the _intent_ and
| _semantics_ of code that was in the input. In other
| words, very deliberately not transformative in purpose.
| shakna wrote:
| > Is there any evidence of Copilot producing substantial
| (100s of lines) verbatim copies of copyrighted works?
|
| It doesn't need to be substantial. In Google v. Oracle a
| 9-line function was found to be infringing.
| AaronFriel wrote:
| If I recall correctly, the nine line question wasn't
| decided by the supreme court, but the API question was.
|
| The Supreme Court did hold that the 11,500 lines of API
| code copied verbatim constituted fair use.
|
| https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.p
| df
| shakna wrote:
| > The Supreme Court did hold that the 11,500 lines of API
| code copied verbatim constituted fair use.
|
| Yes, because it was _transformative_, in a clear way.
| Because an API is only an interface. Which makes that
| part of that decision largely irrelevant to the topic at
| hand.
|
| > Google's limited copying of the API is a transformative
| use. Google copied only what was needed to allow
| programmers to work in a different computing environment
| without discarding a portion of a familiar programming
| language. Google's purpose was to create a different
| task-related system for a different computing environment
| (smartphones) and to create a platform--the Android
| platform--that would help achieve and popularize that
| objective.
|
| > If I recall correctly, the nine line question wasn't
| decided by the supreme court, but the API question was.
|
| It was already decided earlier, and Google did not
| contest it, choosing instead to negotiate a zero payment
| settlement with Oracle over the rangeCheck function.
| There was no need for the Supreme Court to hear it.
| AaronFriel wrote:
| A $0 settlement means there is no binding precedent and
| signals to me that Oracle's attorneys felt they didn't
| have a strong argument and a potential for more.
|
| If they felt the nine line function made Google's entire
| library an unlicensed derivative work, they would have
| pressed their case.
| shakna wrote:
| > A $0 settlement means there is no binding precedent and
| signals to me that Oracle's attorneys felt they didn't
| have a strong argument and a potential for more.
|
| That's not the case. It wasn't an out-of-court-
| settlement, but an agreement about the damages being
| sought, the court had already found it to be infringing,
| and that was part of the ruling.
|
| But none of that changes that 9-lines is substantial
| enough to be infringing. It isn't necessary to be a large
| body of work.
|
| > If they felt the nine line function made Google's
| entire library an unlicensed derivative work, they would
| have pressed their case.
|
| No... It means the rangeCheck function was infringing.
| The implication you seem to have drawn here wouldn't
| be drawn in any kind of plagiarism case.
| AaronFriel wrote:
| I think we agree then, and appreciate the correction on
| the lower court settlement.
|
| If Copilot is infringing, I suspect it's correctable (by
| GitHub) by adding a bloom filter or something like it to
| filter out verbatim snippets of GPL or other copyleft
| code. (And this actually sounds like something corporate
| users would want even if it was entirely fair use because
| of their intense aversion to the GPL, anyhow.)
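|
| For the curious, a minimal sketch of such a filter
| (hypothetical Python; the sizing is invented, not tuned - a
| real one would be dimensioned from the corpus):
|     import hashlib
|
|     class BloomFilter:
|         # Probabilistic set: false positives possible, false
|         # negatives impossible. Acceptable here, since a false
|         # positive merely suppresses one suggestion.
|         def __init__(self, size_bits=1 << 24, n_hashes=5):
|             self.size = size_bits
|             self.n_hashes = n_hashes
|             self.bits = bytearray(size_bits // 8)
|
|         def _positions(self, item):
|             for i in range(self.n_hashes):
|                 h = hashlib.sha256(f"{i}:{item}".encode()).digest()
|                 yield int.from_bytes(h[:8], "big") % self.size
|
|         def add(self, item):
|             for p in self._positions(item):
|                 self.bits[p // 8] |= 1 << (p % 8)
|
|         def __contains__(self, item):
|             return all(self.bits[p // 8] & (1 << (p % 8))
|                        for p in self._positions(item))
| Populate it with normalized windows of copyleft code at
| training time; at inference time, drop any suggestion whose
| windows hit the filter.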
| shakna wrote:
| It may be correctable... It doesn't change that Copilot
| is probably infringing today, which may mean that damages
| against GitHub may be sought.
| infogulch wrote:
| Why did you choose the standard of "substantial" = "100s
| of lines"? Especially since we've already seen examples
| of verbatim output in the dozens of lines range, that
| choice of standard is rather conveniently just outside
| what exists so far. If we find a case with 200 lines of
| verbatim output will you say the only reasonable standard
| is 1000s of lines?
|
| I don't think your argument is as strong as you're making
| it out to be.
| AaronFriel wrote:
| Just a fairly arbitrary number. It's easy to produce a
| few lines from memory, up to 10s of lines, and that's
| "obviously" fair use. I would be surprised if many of us
| haven't inadvertently "copied" some GPL code in this way!
|
| This goes to the "substantial" test for fair use. Clips
| from a film can contain core plot points, quotes from a
| book can contain vital passages to understanding a
| character, screen captures and scrapes of a website can
| contain huge amounts of textual detail, but depending on
| the four factors for fair use, still be fair use. (There
| have been exceptions though.)
|
| The reaction on Hacker News to a machine producing code
| trained on their works is no different than the reactions
| artists and writers have had to other ML models. I
| suspect many of us are biased because it strikes at what
| we do and we think that our copyrights (because we have
| so many neat licenses) are special. They are not.
|
| I think it would need to get to that level of "Copilot
| will emit a kernel module" before it's not obviously fair
| use.
|
| After all, Google Books will happily convey to me whole
| pages from copyrighted works, page after page after page.
|
| https://www.google.com/books/edition/Capital_in_the_Twent
| y_F...
| jcelerier wrote:
| > Just a fairly arbitrary number. It's easy to produce a
| few lines from memory, up to 10s of lines and that's
| "obviously" fair use.
|
| it's anything but obvious.
| https://www.copyright.gov/fair-use/
|
| > there is no formula to ensure that a predetermined
| percentage or amount of a work--or specific number of
| words, lines, pages, copies--may be used without
| permission.
|
| 9 lines of very run-of-the-mill code in Oracle / Google
| weren't considered fair use.
| [deleted]
| _Understated_ wrote:
| > By comparison, Copilot is even more obviously fair use.
|
| Not sure I see it that way.
|
| If I take your hard work that you clearly marked with a GPL
| license and then make money from it, not quite directly, but
| very closely, how is that fair use? Or legal?
|
| Copying and storing a book isn't recreating another book from
| it. Copilot is creating new stuff from the contents of the
| "books" in this case.
|
| Edit: I misunderstood fair use as it turns out...
| nonfamous wrote:
| For your specific case, "take your hard work that you
| clearly marked with a GPL license and then make money from
| it", you don't even need to rely on fair use. As long as
| you comply with the terms of the GPL, making money with the
| code is perfectly acceptable, and the FSF even endorses the
| practice. [1] Red Hat is but one billion-dollar example.
|
| [1] https://www.gnu.org/licenses/gpl-
| faq.en.html#DoesTheGPLAllow...
| b3morales wrote:
| But the person making money from the GPL code has to
| follow the terms of the license. Attribution, sharing
| modifications, etc.
| nonfamous wrote:
| Correct. That's why I said "As long as you comply with
| the terms of the GPL".
| AaronFriel wrote:
| I've edited my comment with examples and a clarification.
|
| Fair use is an exception to copyright and, by definition,
| copyright licenses.
| Arelius wrote:
| I don't think that's an accurate description...
|
| Fair use is a defense for cases of copyright
| infringement, which means you're starting off from a case
| of copyright infringement, which sort of mucks up the
| whole "innocent until proven guilty" thing. And
| considering it's a weighted test, it's hardly very cut-
| and-dried at that.
| _Understated_ wrote:
| I understand the concept of fair use (I think) but I
| can't see how it applies to Copilot.
|
| Google didn't create new books from the contents of
| existing ones (whether you agree that they should have
| been allowed to store the books or not) but Copilot is
| creating new code/apps from existing ones.
|
| Edit: I guess my understanding of fair use was wrong. I
| stand corrected.
| unanswered wrote:
| That you apparently think fair use is something you just
| think about real hard in order to see how you feel about
| it in a given situation demonstrates that you do not
| understand the concept of fair use. There are rules.
| elliekelly wrote:
| I don't disagree with your point but was it necessary to
| make it in such a snarky way?
| unanswered wrote:
| Have you found a better way to defeat Dunning-Kruger?
|
| Your comment is entirely unconstructive as it does not
| suggest an alternative course of action. Criticism on HN
| should strive to be constructive, or failing that to be
| left unsaid.
| _Understated_ wrote:
| Yeah, I realise that now.
|
| However, where does one draw the line between fair use
| and derivative works?
|
| Creating something based on other stuff (Google creating
| AI books from the existing ones for example) would
| possibly be fair use I think, but would it not also be a
| derivative work?
| thrashh wrote:
| There's no clear line and there can never be because the
| world is too complex. We leave determination up to the
| court system.
|
| Google Books is considered fair use because they got sued
| and successfully used fair use as a defense. Until
| someone sues over Copilot, everyone is an armchair
| lawyer.
| AaronFriel wrote:
| If Google Books were creating new books, that would only
| _help_ their argument. Transformativeness is one of the
| four parts of the fair use test.
|
| Copilot producing new, novel works (which may contain
| short verbatim snippets of GPL works) is a strong
| argument for transformativeness.
| FemmeAndroid wrote:
| It would help the transformativeness, but it would
| substantially change the effect upon the market. By
| creating competing products with the copyrighted
| material, there is a higher degree of transformativeness,
| but you also end up disrupting the marketplace.
|
| I don't know how a court would decide this, but I do
| think the facts in future GPT-3 cases are sufficiently
| different from Author's Guild that I could see it going
| any way. Plus, I think the prevalence of GPT-3 and the
| ramifications of the ruling one way or another could lead
| some future case to be heard by the Supreme Court. A
| similar case could come up in California, or another
| state where the 2nd Circuit Artist Guild case isn't
| precedent.
| peteradio wrote:
| > short verbatim snippets of GPL works
|
| Define short
| shadowgovt wrote:
| > If I take your hard work that you clearly marked with a
| GPL license and then make money from it, not quite
| directly, but very closely, how is that fair use? Or legal?
|
| If I'm Google, and I scan your code and return a link to it
| when people ask to find code like that (but show an ad next
| to that link for someone else's code that might solve their
| problem too), that's fair use and legal. My search engine
| has probably stored your code in a partial format, and
| that's fine.
| Hamuko wrote:
| > _If I take your hard work that you clearly marked with a
| GPL license and then make money from it, not quite
| directly, but very closely, how is that fair use? Or
| legal?_
|
| You can wipe your ass with the GPL license if your use of
| the product falls within Fair Use.
|
| You can actually take snippets from commercial movies and
| post them onto YouTube if your YouTube video is
| transformative enough for your usage to be considered fair
| use. Well, theoretically at least - in reality YouTube
| might automatically copyright strike it.
|
| > _Copying and storing a book isn 't recreating another
| book from it._
|
| That doesn't mean that GitHub has to redistribute Copilot
| under GPL. However, the end user could potentially have to
| if they use Copilot to generate new code that happens to
| copy GPL code verbatim.
| _Understated_ wrote:
| > You can wipe your ass with the GPL license if your use
| of the product falls within Fair Use.
|
| Is Copilot fair use? It's reading code, generating other
| code (some verbatim) and making money from it all while
| not having to release its source code to the world?
|
| > That doesn't mean that GitHub has to redistribute
| Copilot under GPL
|
| I wasn't saying that was the case: some of the code that
| Copilot used may not allow redistribution under GPL.
|
| But let's say that all of the code it scanned was GPL for
| the sake of argument. Why would they not have to
| distribute their Copilot source, yet if I use it to
| generate some code, I'd have to distribute mine?
|
| My spidey-sense is tingling at that one!
| tshaddox wrote:
| > Is Copilot fair use? It's reading code, generating
| other code (some verbatim) and making money from it all
| while not having to release its source code to the world?
|
| Again, fair use is an _exception_ to copyright
| protection. If something is fair use, the license does
| not apply. The fact that Copilot does not release its
| source code is related only to a specific term of a
| specific license, which does not apply if Copilot is
| indeed fair use.
| 8note wrote:
| Making money is irrelevant to fair use
| zelphirkalt wrote:
| Irrelevant to GPL maybe.
| ghoward wrote:
| Totally relevant: https://en.wikipedia.org/wiki/Fair_use#
| 1._Purpose_and_charac... .
| IgorPartola wrote:
| If you view GPL code with your browser would that mean that
| your browser now has to be GPL as well? In the sense that
| copilot is not much different than a browser for Stack
| Overflow with some automation, why would it need to be
| GPLed? Your own code on the other hand...
| atq2119 wrote:
| > If you view GPL code with your browser would that mean
| that your browser now has to be GPL as well?
|
| Some good responses in sibling comments already, but I
| don't see the narrow answer here, which is: No, because
| no distribution _of the browser_ took place.
|
| If you created a weird version of the browser in which a
| specific URL is hardcoded to show the GPL'd code instead
| of the result of an HTTP request, and you then
| distributed that browser to others, then I believe that
| yes, you'd have to do so under the GPL. (You might get
| away with it under fair use if the amount of GPL'd code
| is small, etc.)
| keonix wrote:
| To build a browser you don't need verbatim GPL code, so
| it's not a derivative work in the same sense copilot is.
|
| Stackoverflow on the other hand is a much trickier
| question...
| IgorPartola wrote:
| SO clearly doesn't need GPL code to be useful. The wider
| SE network is evidence of that.
| dtech wrote:
| If you use your browser to copy some GPL code into your
| project, your project must now be GPL as well.
|
| So following your own argument, even if Copilot is
| allowed, using it still risks your code falling under the
| GPL.
| IgorPartola wrote:
| My point exactly. Copilot is innocent in that case just
| like the browser.
| dpe82 wrote:
| Or if you simply read GPL code and learn something from
| it - or bits of the code are retained verbatim in your
| memory, are you (as a person) now GPL'd? Obviously not.
| jcelerier wrote:
| > Or if you simply read GPL code and learn something from
| it - or bits of the code are retained verbatim in your
| memory, are you (as a person) now GPL'd? Obviously not.
|
| I do not find that to be obvious at all.
| [deleted]
| zelphirkalt wrote:
| That probably depends on how large and how significant
| the bits you remember are. Otherwise one could take a
| person with photographic memory and circumvent all GPL
| licenses easily, by making that person type what they
| remember.
| johndough wrote:
| For the sake of discussion, it would be clearer to split
| copilot code (not derived from GPL'd works) and the
| actual weights of the neural network at the heart of
| copilot (derived from GPL'd works via algorithmic means).
|
| For your browser analogy, that would mean that the
| "browser" is the copilot code, while the weights would be
| some data derived from GPL'd works, perhaps a screenshot
| of the browser showing the code.
|
| I'd think that the weights/screenshot in this analogy
| would have to abide by the GPL license. In a vacuum, I
| would not think that the copilot code had to be licensed
| under GPL, but it might be different in this case since
| the copilot code is necessary to make use of the weights.
|
| But then again, the weights are sitting on some server,
| so GPL might not apply anyway. Not sure about AGPL and
| other licenses though. There is likely some
| incompatibility between the licenses in there.
| IgorPartola wrote:
| As I understand it, the thing copilot tries to do is
| automate the loop of "Google your problem, find a Stack
| Overflow answer, paste in the code from there into my
| editor". In that sense, the burden of checking the license
| of the code being copy-pasted is on the person who
| answered the SO question and on me. If this literally was
| what copilot did, nobody would bat an eye that some code
| it produced was GPL or any other license because it
| wouldn't be copilot's problem.
|
| Now let's substitute a different database for the code
| that isn't SO. It doesn't really matter if that database
| is a literal RDBMS, a giant git repo or is encoded as a
| neural net. All copilot is going to do is perform a
| search in that database, find a result and paste it in.
| The burden of licensing is still on me to not use GPL
| code and possibly on the person hosting the database.
|
| The gotcha here is that copilot's database is a neural
| network. If you take GPL code and feed it as training
| data to a neural network to create essentially a lookup
| table along with non-GPL code, did you just create a
| derived work? It is unclear to me whether you did or not.
| In particular, can the neural network itself be
| considered "source code"?
| arianvanp wrote:
| Google did not scan those books and use it to build new
| books with different titles. The comparison doesn't hold up
| at all.
| _Understated_ wrote:
| > Google did not scan those books and use it to build new
| books with different titles. The comparison doesn't hold
| up at all.
|
| Not sure if you meant to reply to me but I agree with
| you: you can't compare what Google did to what Copilot
| does.
| PieUser wrote:
| Copilot just _suggests_ code.
| kroltan wrote:
| And someone accepts it. Even if suggesting derivatives of
| licensed code is not a license infringement, then Copilot
| sure is a vector for mass license infringement by the
| people clicking "Accept suggestion". And those people are
| unable to know (without doing extensive investigation
| that completely nullifies the point of the tool) whether
| that suggestion is potentially a verbatim copy of some
| existing work under an incompatible license.
| MadcapJake wrote:
| If I suggest whole lines of dialogue to you, the
| screenwriter, did I write those lines or you? If you
| change names in those lines of dialogue to fit your
| story, do you now gain credit for writing those lines?
|
| Suggesting code is generating code.
| jollybean wrote:
| There are situations where the question is are the
| mishmashes from Copilot 'fair use'.
|
| But the other, more direct question is ... what about the
| instances where Copilot doesn't come up with a learned
| mishmash result? What happens when Copilot just gives you
| a straight up answer from its learning data, verbatim?
|
| Then you, as a dev, end up with a bunch of code that is
| effectively copied, via a 'copying tool', which is GPL'd?
|
| It's that specific case that to me sticks out as the
| 'most concerning part'.
|
| Please correct me if I'm wrong.
| echelon wrote:
| > Even without a license or permission, fair use permits the
| mass scanning of books, the storage of the content of those
| books, and rendering verbatim snippets of those books.
|
| For commercial use and derivative works?
|
| Authors won't incorporate snippets of books into new works
| unless they're reviews. Copilot is different.
| AaronFriel wrote:
| Google Books is a commercial site which incorporated the
| snippets of millions of copyrighted works. And of course,
| sitting in thousands of Google servers/databases are full
| copies of each of those books, photos of each page, the
| OCRed text of each page, and indexes to search them. Even
| that egregious copying without a license or permission was
| considered fair use.
|
| If anything, the ways in which Copilot is different aid
| Microsoft/GitHub's argument for fair use. Because Copilot
| creates novel new works, that gives them a strong argument
| their system is more transformative than Google Books,
| which just presents verbatim copies of books.
| extra88 wrote:
| > Authors won't incorporate snippets of books into new
| works
|
| Of course they do, previous works are quoted all the time.
| e12e wrote:
| But that's another thing - co-pilot doesn't _quote_, it
| encourages something more akin to _plagiarism_, doesn't
| it?
| extra88 wrote:
| Plagiarism, pretending you made a work entirely yourself
| when you didn't, is rarely a matter for a court to decide
| and the standards for what constitutes plagiarism can
| vary a lot. When I turn in projects for a course, I cite
| sources in the comments a lot, even if what I turn in is
| substantially modified. An employer generally doesn't
| care if you copied and pasted code from StackOverflow or
| wherever, so long as you don't expose them to a suit and
| you don't lie if asked "Did you write this 100%
| yourself?"
|
| Citing your source is not a get-out-of-jail-free card for
| copyright infringement, so it doesn't really matter.
| e12e wrote:
| > Citing your source is not a get-out-of-jail-free card
|
| No, but it's a requirement of the license
| stackoverflow.com uses, which is unfortunate, for code
| (as opposed to text, where a quote can be easily
| attributed).
| b3morales wrote:
| ... _with attribution_.
| AaronFriel wrote:
| And without. Attribution isn't a "copyright escape
| clause"; copying a work without permission is still
| infringement - unless it's fair use.
|
| Plagiarism is not the same as infringement.
| jrm4 wrote:
| You're strongly and incorrectly implying that "Fair Use" is a
| clear (and relatively immutable) concept within copyright
| law, which couldn't be further from the truth. Even if this
| or that particular case sets out what appears to be solid
| grounds, one shouldn't take that as gospel by any means.
|
| This mostly has to do with the wishy-washy nature of the
| 4-part Fair Use test, which, unlike decent
| legal tests, doesn't actually have _discrete_ answers. The
| judge looks at the 4 questions, talks about them while waving
| her hands, and makes a decision.
|
| Compare to, e.g., patent law, where you actually do have yes-
| or-no questions. Clean Booleans. Is it Novel? Is it Non-
| Obvious? Is it Useful? If any of the above is "No", then no
| patent for you.
|
| As for the execution of Fair Use, while I haven't gone too
| deep into Software, I can assure you that for music, the
| thing is just a silly holy-hell mess; confirmed most
| recently by the
| "Blurred Lines" case, where NO DIRECT COPYING (e.g. sampling
| or melody taking) was alleged, merely that the song sounded
| really similar to "Got to give it up" and that was enough.
|
| So then, I'd say everything either is, or should be, up in
| the air, when it comes to Fair Use and software.
| extra88 wrote:
| > Is it Novel? Is it Non-Obvious?
|
| Those questions for patents are barely more clear-cut than
| copyright fair use tests; there is lots of room for
| disagreement.
|
| It's definitely true that a fair use defense against
| copyright infringement varies a lot by the field of work
| and norms can develop which are relevant to court cases.
| The music field is a mess, the "Blurred Lines" judgement
| was total bullshit. But the software field is not without
| its own copyright history and norms so there's no reason to
| expect everything to go to hell.
| jrm4 wrote:
| But there's no reason _not_ to either - I suppose my
| point is, don't take too much as gospel and think about
| everybody's best "end-goals" and push or pull with or
| against the law as needed.
| TheSpiceIsLife wrote:
| There's also an aspect of this that varies by size,
| budget, political clout, etc etc, of the individual or
| organisation.
|
| The big guns like Microsoft, Google, Oracle, do this sort
| of thing as a matter of course in their business
| activities, they have the lawyers, the money, and the ear
| of members of parliaments, senators etc.
|
| Whereas an individual or small business probably wants to
| conduct themselves within a more narrow set of
| adherences.
| [deleted]
| [deleted]
| nerdponx wrote:
| Unanswered question, as far as I know: is a trained model a
| derivative work? If the model accidentally retains a copy
| of the work, is that an unauthorized copy?
| MAGZine wrote:
| I think it would be pretty easy to stake out opinions on
| those "boolean questions."
|
| Is (was?) a swipe gesture novel? Is it non-obvious?
| jrm4 wrote:
| Oh, absolutely. Kind of furthers my point. Patent is a
| silly mess in a lot of ways, but _at least_ there's
| something like Booleans in it. "Fair use" doesn't even
| have THAT.
| Arelius wrote:
| I think what the parent is stating is that even though
| the patent questions can be debated, once you settle the
| question "Is it Novel" as yes or no, you can determine if
| the item is patentable... whereas for fair use, the
| questions themselves aren't yes/no questions, and
| further, they are just used as balancing factors, so even
| if everyone agrees on "the effect of the use upon the
| potential market for or value of the copyrighted work"
| it's only weighed as a factor for how fair the use is,
| and broadly left up to the hand-waving of the particular
| judge.
| [deleted]
| cogman10 wrote:
| Most law is wishy-washy. There are very few cut-and-dried
| answers in the law (If there were, we wouldn't need lawyers
| and a court system based on deciphering the law).
|
| All that said, the one thing I'd add about fair use is that
| it isn't permission to use anything you like, but rather a
| defense in a legal proceeding about copyright. It's pretty
| much all about being able to reference copyrighted material
| with the law later coming in and making final decisions on
| whether or not that reference went too far. (I.e., copying
| all of a Disney movie and saying "What's up with this!" vs
| copying one scene and saying "This is totally messed up and
| here's why".)
|
| That was a big part of the Google v. Oracle lawsuit.
| extra88 wrote:
| Yes to all this.
|
| I think the factor most at risk in a fair use test with
| Copilot is whether it ever suggests, verbatim, code that could
| be considered the "heart" of the original work. The John
| Carmack example that's popped up here at least gets closer to
| this question, it was a relatively small amount but it was
| doing something very clever and important.
|
| One can imagine a project that has thousands of lines of code
| to create a GUI, handle error conditions, etc. that's built
| around a relatively small function; if Copilot spat out that
| function in my code, it might not be fair use because it's
| the "heart" of the original work. Additionally, its inclusion
| in another project could affect the potential market for the
| original, another fair use test.
|
| But Copilot suggesting a "heart" is unlikely, something that
| would have to be ruled on in a case-by-case basis and not a
| reason to shut it down entirely. Companies that are risk-
| averse could forbid developers from using Copilot.
| mthoms wrote:
| This is an excellent comment because it captures some
| important nuance missing from other analysis on HN.
|
| I agree with you that the relative importance of the copied
| code to the end product would be (or should be) the crux of
| the issue for the courts in determining infringement.
|
| This overall interpretation most closely adheres to the
| _spirit_ and _intent_ of Fair Use as I understand it.
| otterley wrote:
| If this issue is eventually litigated, we will see. The law
| in the Second Circuit (where the final judgment was rendered
| before the case was eventually settled) may well be different
| than the law in a different circuit. If there is a split in
| the circuit courts, then the Supreme Court may have to weigh
| in on this issue.
|
| When fair use is an issue, the courts look at the facts in
| context each time. These are obviously different facts than
| scanning books for populating a search index and rendering
| previews; and each side is going to argue that the facts are
| similar or that they are dissimilar. How the court sees it is
| going to be the key question.
| tyre wrote:
| This could either be:
|
| 1. a fascinating Supreme Court opinion.
|
| 2. a frustrating ruling because SCOTUS doesn't understand
| software and code.
|
| 3. the type of anticlimactically(?) narrow ruling
| typical of the Roberts court.
|
| While our Congresspersons can't seem to wrap their minds
| around technology/social media, I think SCOTUS would
| understand this one enough to avoid (2).
| otterley wrote:
| Fair use cases tend to produce narrowly-written law
| because the outcomes hinge on how the court judges the
| facts against the list of factors codified in the
| Copyright Act (17 U.S.C. section 107). The courts don't
| really have breathing room to use a different test. I
| don't recall any cases in which the courts have set
| binding guidelines for interpretation of these factors.
| dtech wrote:
| The Google v. Oracle case showed that SCOTUS can handle
| technical topics.
| ElFitz wrote:
| > Note, however, that there is no world-wide principle of
| fair use; what kinds of use are considered "fair" varies from
| country to country.
|
| Exactly the point I came to make.
|
| The Authors' Guild is a US entity, and so is Google, so only
| US law applies. And thus, we have the Fair Use exception.
|
| But developers sharing code on GitHub come from and live all
| over the world.
|
| Now, Github's ToS do include the usual provision stating that
| US & California law applies, et caetera, et caetera [1],
| but - and even they acknowledge this may be the case -
| such provisions usually aren't considered legal outside of
| the US.
|
| So... developers from outside the US, in countries with less
| lenient exceptions to copyright, definitely could sue them.
|
| Identifying these countries and finding those developers,
| however, is a different matter altogether.
|
| [1]: https://docs.github.com/en/github/site-policy/github-
| terms-o...
| croes wrote:
| Can you still apply Fair Use if they make Copilot a paid
| service?
| cmsonger wrote:
| This is a thoughtful and insightful reply. Thank you.
| IgorPartola wrote:
| I think this is the correct answer. IANAL but the copilot
| code vs the copilot training data are different things and
| licensing for one shouldn't affect the other, right? And the
| fact that training data happens to also be code is
| incidental.
| 8note wrote:
| One view would be that copilot the app distributes GPL'd
| code, in a weird encoding. Training the model is a
| compilation step to that encoding
| keonix wrote:
| I assume the model is a derivative work of the training
| data, because given different data the model (the neuron
| weights) would also be different.
| IgorPartola wrote:
| If I read a GPL implementation of a linked list and then
| write my own linked list implementation, was my neural
| network in my brain a derivative work of the GPL code?
| keonix wrote:
| Sure it is; your brain is not software, though.
| IgorPartola wrote:
| So as long as I read GPL code, then rewrite it from
| memory and feed it to copilot to train it, I can unGPL
| anything?
| logifail wrote:
| > Fair use is an exception to copyright itself. A license
| cannot remove your right to fair use.
|
| ...and if you're outside the USA?
| ska wrote:
| > By comparison, Copilot is even more obviously fair use.
|
| You are correct about the (US-specific) fair use exception,
| but it is in no way as clear as you suggest that what copilot
| is doing entirely falls under fair use. Fair use is always
| constrained.
|
| I suspect some variant of this sort of thing will have to be
| tested in court before the arguments are really clear.
| random314 wrote:
| If copilot was trained using the entirety of the Linux
| kernel, wouldn't the neural network itself need to be GPLed,
| if not its output?
| feanaro wrote:
| When the recent Github v. youtube-dl fiasco happened, I
| remember reading similarly strongly-worded but _dismissive_
| comments regarding fair use, stating how it is quite obvious
| that youtube-dl's test code could _never_ be fair use and
| how fair use itself is a vague, shaky, underspecified
| provision of the copyright law which cannot ever be relied
| on.
|
| To me, seeing youtube-dl's case as fair use is so much easier
| than using _hundreds of thousands_ source code files without
| permission in order to build a _proprietary product_.
| 0-_-0 wrote:
| How would you feel about a paid-for search engine using
| _hundreds of millions_ of web pages without permission in
| order to build a _proprietary product_?
| sbelskie wrote:
| You mean like a search engine?
| indymike wrote:
| Books (mostly) are not distributed under the GPL.
| e12e wrote:
| True. But Pretty Good Privacy might be worth considering in
| this context - it was at one point published as a book
| after all...
|
| https://philzimmermann.com/EN/essays/BookPreface.html
| tedunangst wrote:
| Does the GPL forbid fair use? Why don't book publishers use
| a license that forbids fair use?
| AaronFriel wrote:
| Because fair use is an exception to copyright itself. A
| copyright license can't take away your legal right to
| fair use.
| jefftk wrote:
| The GPL only gives you additional permissions relative to
| what you would have by default. The books included in that
| suit were more strongly restricted, since there was no
| license at all.
| indymike wrote:
| There are certainly some interesting additional
| conditions the GPL creates by taking the license away if
| you violate certain clauses. Regardless, the interesting
| part of this is that this looks different from the user's
| point of view and Microsoft's. Sure, 5 lines out of
| 10,000 is probably fair use. For Microsoft, their system
| is using the whole code base and copying it a few lines
| at a time to different people, eventually adding up to
| potentially lots more than fair use.
|
| The question on this one will be about the difference
| between Microsoft/Github's product and a programmer using
| copilot's code:
|
| "If I feed the entire code base to a machine, and it
| copies small snippets to different people, do we add the
| copies up, or just look at the final product?"
| wilde wrote:
| I buy the argument about copilot itself and this comment. But
| when someone goes to release software that uses the output of
| Copilot, I fail to see how it wouldn't be a GPL derivative
| work if enough source was used. Copilot is essentially really
| fancy copy/paste in that context.
| Causality1 wrote:
| Read the Authors Guild v Google dismissal. The court
| considered it fair use because Google's project was built
| explicitly to let users find and purchase books, giving
| revenue to the copyright holders. Copilot does not do that.
| adverbly wrote:
| This was a good point. Really enjoying this discussion.
| Interesting stuff.
|
| I'm really out of my depth in giving my own opinion here, but
| I'm not sure that either the "distribution != derivative"
| characterization, or that "parsing GPL => derivative of GPL"
| really locks this thing down. The bit that I can't follow
| with the "distribution != derivative" argument is that the
| copilot is actually performing distribution rather than
| "design". I would have said that copilot's core function is
| generating implementations, which to me does not seem like
| distribution. This isn't a "search" product, and it's not
| trying to be one. It is attempting to do design work, and I
| could see a case where that distinction matters.
| jrochkind1 wrote:
| If Github hosts AGPL code, does that mean that github's own
| code must be AGPL? Obviously not. What's the difference?
|
| There's no point to copilot without training data, some but not
| all of the training data was (A)GPL. There's no point to github
| without hosting code, some but not all of the code it hosts is
| A(GPL).
|
| The code in either case is _data_ or content; it has not
| actually been incorporated into the copilot or github product.
| mhh__ wrote:
| Code isn't to GitHub what training data is to this model, or
| at least even if you could argue that it is within a current
| framework it shouldn't be.
| b3morales wrote:
| > If Github hosts AGPL code, does that mean that github's own
| code must be AGPL? Obviously not. What's the difference?
|
| GitHub's TOS include granting them a _separate_ license
| (i.e., not the GPL) to reproduce the GPL code in limited ways
| that are necessary for providing the hosting service. This
| means commonsense things like displaying the source text on a
| webpage, copying the data between servers, and so on.
| jdavis703 wrote:
| Assuming that copilot is a violation of copyright on GPL works,
| it would also be a violation of non-GPL copyrighted works,
| including public, but fully copyrighted, works. Therefore
| relicensing others' source code under GPL would violate even
| more copyright.
| zelphirkalt wrote:
| So in that case, of course copilot would have to give license
| info for every. single. snippet. Case solved. Only, they
| will probably not do that.
| andy_ppp wrote:
| Probably they get away with it, but it definitely seems against
| the spirit of the GPL, just as closed-source GitHub existing
| because of open source software seems quite hypocritical.
| fny wrote:
| I think the bigger issue is that use of Copilot puts the end
| user at risk of using copyrighted code without knowing it.
|
| Sure, one could argue that Copilot learned in the way a human
| does. There is nothing that prevents one from learning from
| copyrighted work, but snippets delivered verbatim from such
| works are surely a copyright violation.
| coding123 wrote:
| copilot isn't distributing copies of itself either.
| hedora wrote:
| More interestingly, if we can trick it into regurgitating a
| leaked copy of the Windows source code, Microsoft apparently
| says that's fair use.
| fchu wrote:
| If a company built a tool like Copilot to help students write
| essays, is that considered plagiarism? Probably yes, and the
| reason is that regurgitating blobs of text without actually
| thinking like a human and writing them anew doesn't feel like
| actual work, just direct re-use.
|
| Same thinking probably applies to GitHub Copilot and copyright.
| taywrobel wrote:
| It's already fairly commonplace for news agencies to generate
| articles using ML solutions such as https://ai-writer.com/
|
| So by your logic ABC, CBS, Fox, and NBC have all been
| plagiarizing and violating copyright for doing so? I'm not sure
| if there's been a legal challenge/precedent set in that case
| yet, but that seems like a more apples to apples comparison
| than the Google Books metaphor being used.
|
| Disclosure: I work at GitHub but am not involved in CoPilot
| yccs27 wrote:
| The big question here is: On what data was the model trained?
| Presumably the news stations trained theirs on public-domain
| works and their own backlog of news articles, so even with
| manual copying there would be no infringement. In contrast,
| Copilot was trained on other people's code with active
| copyright.
| taywrobel wrote:
| That's quite a big presumption IMO. Training sets need to
| be quite large in order to produce reasonable output. My
| understanding is that these companies provide the model
| themselves, which seems like it'd be trained on more than
| one company's publications. But I get your point, and
| understand both sides of the argument here.
|
| I think this will end up with a large class action lawsuit
| for sure, tho I really think it's a toss up as to who would
| win it. This conversation was bound to happen eventually
| and we're in uncharted territory here.
|
| I think it's going to hinge on whether machine learning is
| considered equivalent in abstraction to human learning,
| which will be quite an interesting legal, technological,
| and philosophical precedent to set if it goes that way.
| neom wrote:
| Curious what the consensus is on how GH should have approached
| this to avoid such blowback.
|
| Best case scenario: they explain in advance on the GH blog that
| they're going to be doing some work on ML and coding, and they'd
| like people to opt in to their profile being read via a flag
| setting, or by putting a file in the repo that gives permission,
| like robots.txt (see the sketch below). Second best case
| scenario: same as the first, but opt-out vs opt-in. Least ideal:
| not doing the first two, but, when they announced it, explaining
| in detail how the model was trained, what was used, why, and
| when - kinda thing?
|
| Is that generally about right, or..?
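|
| As a sketch of the opt-in file idea: a training crawler
| could honor a robots.txt-style marker in the repo root.
| The ".ai-training" file name and its semantics below are
| invented for illustration - no such GitHub convention
| exists:
|
|     # Hypothetical opt-in check for a training crawler.
|     from pathlib import Path
|
|     def may_train_on(repo_root):
|         marker = Path(repo_root) / ".ai-training"
|         if not marker.exists():
|             return False  # opt-in: absence means "no"
|         return marker.read_text().strip().lower() == "allow"
|
| The opt-out variant would just flip the default to True
| when the marker is absent.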
| tazjin wrote:
| > Second best case scenario
|
| Not really, consider for example repositories mirrored to
| Github.
|
| It seems unclear who has the rights to grant this permission
| anyways (with free software licenses). Probably the copyright
| holder? Who that is might also be complicated.
| iimblack wrote:
| In that hypothetical I wouldn't think GitHub is responsible
| for determining if a repository is mirrored and what the
| implications of that are. They just need to look at what
| license is on the repo in GitHub.
| neom wrote:
| Good point, I would have thought GH requires you to agree in
| some TOS that you have permission to put the code on GH (but
| I don't know)? If so, could that point be put aside? (I'm not
| a software engineer so sorry if that made no sense. Super
| curious about the whole Copilot thing from a business and
| community perspective)
| tazjin wrote:
| > that you have permission to put the code on GH
|
| This is the complicated bit: All open-source licenses grant
| you permission to redistribute the code (usually with
| stipulations like having to include the license), so you
| are almost always allowed to upload the code to Github.
|
| What it doesn't mean however is that you're the copyright
| holder of that code, you're merely redistributing work that
| somebody else has ownership of.
|
| So who gets to decide what Github is allowed to do with it?
|
| I expect this will end up in courts and we won't get a
| definite answer before that.
| neom wrote:
| If you'll entertain me on a hypothetical for a moment.
| Suppose then the copious amount of intelligent folks over
| at GH _know_ this will eventually end up in the courts,
| and expected that from the start. Would you suggest they
| messaged/rolled it out any differently? Did they do
| exactly what they needed to do so that it _did_ end up in
| the courts? Should they have done anything differently to
| not piss folks off so much? Sorry for the million
| questions, you seem to know/have thought a bit about
| this. Thanks! :)
| lukeplato wrote:
| They should have only used code from projects that included a
| license that allows for commercial use, or made their model
| openly available and/or free to use
| whimsicalism wrote:
| How does attribution work then?
| BlueTemplar wrote:
| Code (co)created with Copilot has to follow all the licenses of
| the source (heh) code. This generally means at the very least
| automatically including in projects getting help from Copilot a
| copy of all the licenses involved, and attribution for all the
| people the code of which Copilot has been trained on.
|
| (Not sure about the cases where there is no license and
| therefore normal copyright applies, but AFAIK this isn't the
| case for any code on Github, which automatically gets an open
| source licence?)
|
| EDIT: Code in public repositories seems to be "forkable" on
| Github itself but not copyable (to elsewhere). That's some
| nasty walled garden stuff right there; I wonder how legal that
| ToS is? I could see how this could incentivize people to stop
| using other licenses on Github, to not have to deal with this
| license mess... EEE yet again?
| neom wrote:
| So I guess then, the first thing they should have done, is
| trained it to understand licenses, and used that as a first
| principle for how they built the system?
| BlueTemplar wrote:
| Seems to be too much effort (is it even possible to link
| the source to the end result?), and might not be
| admissible, so just include a database with all of the
| relevant licenses and authors?
| whimsicalism wrote:
| Is it a derivative work of GPL licensed work if it is
| trained on the license? Is the GPL license text under GPL?
| [deleted]
| BlueTemplar wrote:
| > GNU GENERAL PUBLIC LICENSE
|
| > Version 3, 29 June 2007
|
| > Copyright (c) 2007 Free Software Foundation, Inc.
| <https://fsf.org/>
|
| > Everyone is permitted to copy and distribute verbatim
| copies of this license document, but changing it is not
| allowed.
| rwmj wrote:
| So would a way to do this be to train multiple models on each
| different code license (perhaps allowing compatible licenses to
| cohabit) and then have Copilot identify the license of the target
| project and use the appropriate model?
|
| It might have an interesting feedback effect that some licenses
| which are more popular would presumably have better Copilot
| recommendations, which would produce better and thus more popular
| code for those licenses. Although maybe this happens already.
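|
| A minimal sketch of what that routing could look like (the
| model registry and license identifiers here are made up;
| no such Copilot feature exists):
|
|     # Route completions to a model trained only on code
|     # under licenses compatible with the target project.
|     MODELS = {
|         "MIT": "model-permissive",      # MIT/BSD/Apache
|         "Apache-2.0": "model-permissive",
|         "GPL-3.0": "model-gpl",  # permissive + GPL inputs
|     }
|
|     def pick_model(project_license):
|         try:
|             return MODELS[project_license]
|         except KeyError:
|             raise ValueError(
|                 "no model trained for " + project_license)
|
|     print(pick_model("GPL-3.0"))  # -> model-gpl
|
| The feedback effect would then show up as the per-license
| training corpora diverging in size and quality.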
| fleddr wrote:
| To me, the particular use case and whether it is fair use or not,
| is of minor interest. A far more pressing matter is at hand: AI
| centralization and monopolization.
|
| Take Google as an example, running Google Photos for free for
| several years. And now that this has sucked in a trillion photos,
| the AI job is done, and they likely have the best image
| recognition AI in existence.
|
| Which is of course still peanuts compared to training a super AI
| on the entire web.
|
| My point here is that only companies the size of Google and
| Microsoft have the resources to do this type of planetary scale
| AI. They can afford the super expensive AI engineers, have the
| computing power and own the data or will forcefully get access to
| it. We will even freely give it to them.
|
| Any "lesser" AI produced from smaller companies trying to compete
| are obsolete, and the better one accelerates away. There is no
| second-best in AI, only winners.
|
| If we predict that ultimately AI will change virtually every
| aspect of society, these companies will become omnipresent,
| "everything companies". God companies.
|
| As per usual, it will be packaged as an extra convenience for
| you. And you will embrace it and actively help realize this
| scenario.
| ncr100 wrote:
| I think it's an innate quality of technology.
|
| Yes, sophisticated AI tech concentrates power for those who
| already have power.
|
| And the technology we all (presumably readers of HN) create can
| enhance the impact of the user. And this can result in unfair
| circumstances, in reality.
|
| Law and force can prevent disproportionate use of power. Of
| course one must define the law, which may be done AFTER the
| offense has been committed. Further, if those who make the laws
| are corrupted by those with e.g. this AI tech power, then no
| effective law may be enacted and the hypothetical abuse will
| continue.
| mikewarot wrote:
| I have about 300,000 photos that haven't been scanned by AI
| (unless someone at Backblaze did it without permission). I'm
| sure there are lots of other photographers out there who miss
| Picasa, which Google killed off to push everyone's data to
| their service. (It did really well in matching faces, even
| across ages, but the last version has a bug when there are
| multiple faces in a picture; sometimes it swaps the labels.)
|
| If there were offline image recognition we could train on our
| own data privately, _could_ the results of those trainings be
| merged to come up with better recognition on average than any
| one person could do themselves with their own photos?
|
| In other words, would it be possible for us to share the
| results of training, and build better models, without sharing
| the photos themselves?
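|
| Something like federated averaging seems like the obvious
| candidate here. A minimal sketch (glossing over the real
| privacy and convergence subtleties; the numbers are made
| up):
|
|     # Each person trains locally and shares only their
|     # weights plus a sample count; the merge is the
|     # sample-weighted mean of the weights.
|     import numpy as np
|
|     def fed_avg(weights, sizes):
|         total = sum(sizes)
|         merged = np.zeros_like(weights[0])
|         for w, n in zip(weights, sizes):
|             merged += (n / total) * w
|         return merged
|
|     w = [np.array([0.2, 1.0]), np.array([0.4, 0.8])]
|     n = [300_000, 50_000]  # photos per person
|     print(fed_avg(w, n))
|
| Whether the merged model can still leak anything about
| individual photos is its own research question.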
| hexa22 wrote:
| There is a second best though. Apple offers image AI which is
| worse than Google's but wins because it works offline.
| mtrn wrote:
| The final step is to break down these monopolies. The
| government can do that and has done it before.
| jspaetzel wrote:
| If Google makes an amazing model that no one can beat, it will
| only be dominant as long as others get access to it freely. But
| if there are restrictions on access or if it's too expensive,
| other options will appear and even if they're not as perfect,
| they'll still be very usable. Imagine a coalition of companies
| all feeding data, that could compete just as well.
| est31 wrote:
| Google has all the data of all the users though. I'd wager
| that they won't just let AI companies scrape it.
| hartator wrote:
| GitHub CoPilot is clearly fair use. Having a ruling that says
| it isn't would be a regression. Please don't.
| pizza wrote:
| Please don't get mad at me but my question is genuinely: so what?
| Why does it matter? Can't you violate licenses in a tedious
| manner just by Googling + copy pasting blindly already? Genuinely
| looking to understand the consensus here.
| ghoward wrote:
| This is why I relicensed my code [1] yesterday to a license I
| wrote [2], which is designed to poison the well for machine
| learning.
|
| [1]: https://gavinhoward.com/2021/07/poisoning-github-copilot-
| and...
|
| [2]: https://yzena.com/yzena-network-license/
| speedgoose wrote:
| Since you allow new versions by default, can't someone just
| release a new version of your license allowing everything they
| want?
| ghoward wrote:
| That is a good point, but easily fixed. Will do that now.
|
| Edit: done. They are under the CC-BY-ND license now.
| jkaplowitz wrote:
| Do the GitHub Terms of Service give them the necessary
| permissions for Copilot, independently of the license? (I
| honestly don't know the answer; this is a straight question.)
| ghoward wrote:
| I don't know. Not knowing is why I pulled all of my
| code (except for a permissively-licensed project that people
| actually depend on the GitHub link for) off of GitHub.
| shakna wrote:
| > The licenses you grant to us will end when you remove Your
| Content from our servers, unless other Users have forked it.
| [0]
|
| I don't see how they can keep this clause, and then have a
| service that recites/redistributes code, based on a model
| that has already ingested said code.
|
| > This license does not grant GitHub the right to sell Your
| Content. It also does not grant GitHub the right to otherwise
| distribute or use Your Content outside of our provision of
| the Service, except that as part of the right to archive Your
| Content, GitHub may permit our partners to store and archive
| Your Content in public repositories in connection with the
| GitHub Arctic Code Vault and GitHub Archive Program. [1]
|
| Copilot is distributing verbatim code when it regurgitates,
| which seems a pretty clear violation of this clause. (If it
| wasn't regurgitating, they'd have caselaw for fair use.
| But... It is.)
|
| [0] https://docs.github.com/en/github/site-policy/github-
| terms-o...
|
| [1] https://docs.github.com/en/github/site-policy/github-
| terms-o...
| gjs278 wrote:
| good luck enforcing this. you are nobody and no court will hear
| your case.
| p0ckets wrote:
| I think to actually poison the well, we should add code to
| existing repos with dead code clearly labelled as "the way that
| things shouldn't be done" that are wrong in subtle ways. So
| every time we fix a security issue, we keep the version with
| the bug with some comments indicating what's wrong with it. Of
| course, this only works until the AI is trained to weigh the
| code based on how often the code is called.
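|
| For instance, a repo could keep something like this around
| (a deliberately broken variant, clearly labelled; whether
| a model would learn the label along with the bug is
| exactly the open question):
|
|     # DO NOT USE - kept to show the wrong way. Vulnerable
|     # to SQL injection via string formatting.
|     def get_user_WRONG(conn, name):
|         return conn.execute(
|             "SELECT * FROM users WHERE name = '%s'" % name)
|
|     # Fixed version: parameterized query (sqlite3-style).
|     def get_user(conn, name):
|         return conn.execute(
|             "SELECT * FROM users WHERE name = ?", (name,))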
| ghoward wrote:
| That is a funny idea. Personally, too much work for me, and
| Copilot probably generates subtly wrong code already.
| nradov wrote:
| The notion of intentionally polluting and overcomplicating
| your code base just to "poison the well" is bizarre. Talk
| about cutting off your nose to spite your face.
|
| If you don't want others to use your code then the solution
| is very simple. Keep it on a secure private server and don't
| publicly release it.
| ghoward wrote:
| Keeping it private is one option, but I really want my end
| users to have the freedom to modify the code for
| themselves.
| CyberRabbi wrote:
| That's safe but it's probably not necessary to be protected
| from what GitHub, OpenAI, and Microsoft are doing. When these
| licenses were crafted there was no reasonable expectation that
| companies could use ML applications as a loophole in existing
| copyright licenses, so just because there is no explicit clause
| denying it doesn't mean they are in the clear for using
| copyright-protected code that way. Licenses give permission,
| they don't revoke it.
|
| Copyright is broad, licenses are minimal. This must be the case;
| otherwise they would not be very effective at protecting the
| work of creators. There is no explicit allowance for what
| GitHub is doing in most licenses so they do not have general
| permission to do so.
| ghoward wrote:
| I agree; my blog post says so.
|
| What my licenses are supposed to do is sow even more doubt in
| companies' minds about models trained on my code.
| richardwhiuk wrote:
| If it's allowed by fair use, your license is irrelevant. If
| it's not, your license doesn't matter.
| ghoward wrote:
| In my blog post, I talk about how training is fair use, but
| we don't know about distributing the _output_. These
| licenses, even if they don't work, are designed to poison
| the well by putting enough doubt into companies' minds that
| they would not want to use Copilot if it has been trained
| with my relicensed code.
| laurowyn wrote:
| Could this be the beginning of the true test of open source
| licenses? My understanding is that there has never been a ruling
| by a court to give precedence to the validity or scope of any
| open source license. I can see a class action suit coming on
| behalf of all GPL licensed code authors.
| jefftk wrote:
| GitHub used code that wasn't under any license at all, just
| publicly visible. Their claim is not that the license allows
| what they're doing, but that they do not need a license.
| laurowyn wrote:
| Which is a different issue to my point, but still very valid.
| What terms are implied if no license is specified? I would
| argue attribution should be expected if used, but I also
| wouldn't go near any code without a specific license attached,
| as there's no express permission given - just because a
| license isn't disclosed doesn't mean it isn't there.
|
| You can't go copying anything and everything just because
| nobody has told you that you can't. And I feel that's part of
| the purpose behind the GPL: force a license on derivative code
| so that at least there are clear rights moving forwards.
| jefftk wrote:
| It's stronger than that: if GitHub is correct that they
| don't need a license then they are allowed to train on
| publicly visible code even if it is labeled with "no one
| has any permission to use this for anything at all,
| especially training models".
| laurowyn wrote:
| Which is why I think this could be a big turning point.
| IMO, GitHub is breaking licenses. If an ML algorithm
| ingests a viral licensed block of code, its outputs
| should be tainted with that license as it's a derived
| work. Otherwise I can make a program reproduce whole
| repositories license free, so long as I can claim "well,
| the AI did it, not me!" It's produced something based on
| the original work, therefore it should follow the license
| of the original. And that issue is exacerbated by the
| mixture of licenses available - they will all apply at
| the same time, and not all are compatible.
|
| I would hope GitHub (and Microsoft) did the legal work to
| cover this, and not just ploughed ahead with the plan to
| drown any legal challenges. From my perspective, they're
| doing the latter.
| jcelerier wrote:
| What ? There have been plenty of GPL cases defended in court.
|
| https://en.m.wikipedia.org/wiki/Open_source_license_litigati...
| laurowyn wrote:
| All of the copyright cases were settled, so no precedent is
| set. Open source as a contract has been ruled legal, and
| licensors can sue for breach of contract - which is not the
| same as copyright infringement.
|
| I think my point still stands.
| crazygringo wrote:
| I mean, if it's considered "fair use" legally (which is surely
| their position), then why wouldn't they?
|
| Why would they distinguish between licenses if there's no legal
| need to?
|
| Licenses are only restrictions _on top_ of fair use. Licenses can
| 't restrict fair use.
|
| It would be interesting if someone takes them to court and a
| judge definitively rules on fair use in this particular case. Or
| I don't know if there's enough precedent here that the case would
| never even make it to trial. But with a team of top-paid
| Microsoft lawyers that gave this the green light, I'm pretty sure
| they're quite confident of the legality of it.
| 6gvONxR4sf7o wrote:
| I've figured out why ML-based fair use arguments for generative
| models feel dirty to me.
|
| Imagine a scenario where you'd love to have access to a large
| number of my digital widgets, but they're expensive to make or
| buy, and a large number of them is _really_ expensive. So you
| train an ML model on my things you can't afford to buy. It's
| still expensive, but that's a one-time cost. Spend $5M training
| GPT-3, it's fine. Now you can _sample from the space of my
| digital widgets_. You have gotten a large number of widgets, just
| by throwing money at AWS. With money, you have converted my
| widgets into your widgets, and I'll never see a cent of it.
|
| That's the issue. Content is expensive and it's still needed.
| Traditionally, I make content and if you want to benefit from my
| labor, you pay me. In the future, if you want to benefit from my
| labor, you pay AWS instead.
|
| tl;dr The most significant equation for generative models is "$$$
| + my stuff = your stuff"
| remram wrote:
| In addition, the model is going to spit out widgets that are
| combinations of the existing ones, if it doesn't outright copy.
| This is different from a human who is going to put their own
| creativity into it (and will be accused of plagiarism if they
| don't): the model has no creativity to offer on top of the
| unlicensed input.
___________________________________________________________________
(page generated 2021-07-08 23:00 UTC)