[HN Gopher] GitHub co-pilot as open source code laundering?
___________________________________________________________________
GitHub co-pilot as open source code laundering?
Author : agomez314
Score : 859 points
Date : 2021-06-30 12:00 UTC (11 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| pabs3 wrote:
| There isn't that much enforcement of open source license
| violations anyway. I bet there are lots of places where open
| source code gets taken, copyright/license headers stripped off
| and the code used in something proprietary as well as the bog-
| standard "not releasing code for modified versions of Linux"
| violation.
| cblconfederate wrote:
| That's like saying that making a blurry , shaky copy of star wars
| is not derivative but original work. Thing is, the 'verbatimness'
| of the generated code is positively correlated with the number of
| parameters they used to train their model
| joshsyn wrote:
| people worrying about AI. The AI is still shit. lol
| Miner49er wrote:
| Microsoft should just GPL CoPilot's code and model. They won't,
| but it would fix this problem, I think.
| jordemort wrote:
| ...unless they've also ingested code that is incompatible with
| the GPL and CoPilot ends up regurgitating a mix.
| afarviral wrote:
| While I think this will continue to amplify current problems
| around IP, aren't current applied-ML approaches to writing
| software the equivalent of automating the drawing of leaves on a
| tree? Maybe a few small branches? But the whole tree, all its
| roots, how it fits in to the surrounding landscape, the overall
| composition, the intention? If I'm wrong about that than I picked
| either a good or a bad time to finally learn programming. There's
| only so many ways you can do things in each language though. Just
| like in the field of music, only so many "Original" tunes. The
| concept of IP is incoherent, you don't own patterns (at least not
| at arbitrary depth), though you may be owed some form of
| compensation for the billions made off discovering them.
| visarga wrote:
| You're right, it's only drawing some leaves, the whole tree or
| how it relates to the forest is another thing.
| tsjq wrote:
| Microsoft: embrace, extend, extinguish .
| karmasimida wrote:
| Well this would not be hard to verify though.
|
| You can automate this process by providing existing GPL source
| code and see what CoPilot comes up next.
|
| I am sure at some point it WILL produce exact the same code
| snippet from certain GPL project, provided that you have
| attempted enough times.
|
| Not sure what the legal interpretation would be though, it is
| pretty gray-ish in that regard.
|
| There would always be risk for CoPilot, had it digested certain
| PII information and people found it out...it would be much more
| interesting to see the outcome.
| Enhex wrote:
| it doesn't have to be exact to be copyright infringement, see
| non-literal copying. basic idea behind it is if you copy paste
| code and rename variables that doesn't mean its new code.
| freshhawk wrote:
| Yeah, you'd have to assume they are parsing and normalizing
| this data in some way. There would still be some AST patterns
| or something similar you could look for in the same way, but
| it would be much trickier.
|
| Plus considering this is a legal issue ... good luck with
| "there is a statistically significant similarity in AST
| outputs related to the most unique sections of this code
| base" type arguments in court. We're currently at the "what's
| an API" stage of legal tech understanding.
| int_19h wrote:
| The real question is whether it constitutes _derived work_
| , though. And that is not a question of similarity so much
| so as provenance - if you start with a codebase that is GPL
| originally, and it gets gradually modified to the point
| where it doesn't really look anything like the original,
| it's still a derived work, and is still subject to the
| license.
|
| Similarity can be used to prove derivation, but it's not
| the only way to do so. In this case, all the code that went
| into the model is (presumably) known, so you don't really
| need any sort of analysis to prove or disprove it. It is,
| rather, a legal question - whether the definition on the
| books applies here, or not.
| bencollier49 wrote:
| This question about the amount of code required to be
| copyrightable starts to sound familiar to the copyright
| situation with music, where currently the bar seems to be set
| too low, legally, to prove plagiarism.
| bencollier49 wrote:
| Regarding PII, I think you have a very good point. I wouldn't
| be surprised to see working AWS_SECRET_KEY values appear in
| there. Indeed, given that copypaste programmers may not
| understand the code they're given, it's entirely possible that
| someone may run code which uses remote resources without the
| programmer even realising it.
| falcolas wrote:
| As per some of the other twitter replies, Co-pilot has offered
| to fill in the GPL disclaimer in new files.
| mtnGoat wrote:
| not a fan of this argument.
|
| musicians, artists, all kinds of athletes, all grow by watching
| observing and learning from others. as if all these open source
| projects got to where they are without looking at how others did
| things.
|
| i don't think a single function, similar syntax or basic check
| function is worth arguing about, its not like co-pilot is
| stealing an entire code base and just plopping it out by reading
| your mind and knowing what you want. i know developers that have
| certainly stolen code and implementation details from past
| employers and that was just fine.
| greyman wrote:
| > github copilot was trained on open source code and the sum
| total of everything it knows was drawn from that code. there is
| no possible interpretation of "derivative" that does not include
| this
|
| I don't understand the second sentence, i.e. where's the proof?
| Dracophoenix wrote:
| This goes into one of my favorite philosophical topics: John
| Searle's Chinese Room. I won't go into it here, but the question
| of whether an AI is actually learning how to code or simply
| substituting information based on statistically common practices
| (or if there really is a difference between either) is going to
| be one hell of a problem for the next few decades as we start to
| approach fine points of what AI is and how it could be defined.
|
| However, legally, the most recent Oracle vs. Google case has
| already settled a major point: APIs don't violate copyright. And
| as Github co-pilot is an API (A self-modifying one, but an API
| nonetheless), Microsoft has a good defense.
|
| In the near-future, when we have AI-assisted reverse engineering
| along with Github co-pilot, then, with enough obfuscation there's
| nothing that can't be legally created or recreated on a computer,
| proprietary or not. This is simultaneously free software's
| greatest dream and worst nightmare.
|
| Edit: changed Hilary Putnam to John Searle Edit 2: spelling
| toyg wrote:
| _> However, legally, the most recent Oracle vs. Google case has
| already settled a major point: APIs don 't violate copyright.
| And as Github co-pilot is API (A self-modifying one, but an API
| nonetheless), Microsoft has a good defense._
|
| That's... a mind-bendingly bad take. Google took an API
| definition and duplicated it; Copilot is taking _general code_
| and (allegedly) duplicating it. This was not done in order to
| enable any sort of interoperability or compatibility.
|
| The "API defense" would apply if Copilot only produced API-
| related code, or (against CP) if someone reproduced the
| interfaces copilot exposes to consumers.
|
| _> Microsoft has a good defense._
|
| MS has many good defenses (transformative work, github
| agreements, etc etc), but this is not one of them.
| [deleted]
| cxr wrote:
| > the most recent Oracle vs. Google case has already settled a
| major point: APIs don't violate copyright. And as Github co-
| pilot is API (A self-modifying one, but an API nonetheless),
| Microsoft has a good defense
|
| That's a wild misconstrual of what the courts actually ruled in
| Oracle v. Google.
|
| (And to the reader: don't take cues from people banging out
| poorly reasoned quasi-legal arguments in off-the-cuff
| comments.)
| Dracophoenix wrote:
| Straight from the horse's mouth [1]:
|
| pg.2
|
| 'This case implicates two of the limits in the current
| Copyright Act. First, the Act provides that copyright
| protection cannot extend to "any idea, procedure, process,
| system, method of operation, concept, principle, or discovery
| . . . ." 17 U. S. C. SS102(b). Second, the Act provides that
| a copyright holder may not prevent another person from making
| a "fair use" of a copyrighted work. SS107. Google's petition
| asks the Court to apply both provisions to the copying at
| issue here. To decide no more than is necessary to resolve
| this case, the Court assumes for argument's sake that the
| copied lines can be copyrighted, and focuses on whether
| Google's use of those lines was a "fair use."
|
| "any idea, procedure, process, system, method of operation,
| concept, principle, or discovery" sounds suspiciously like an
| API. Continuing:
|
| Pg. 3-4
|
| 'To determine whether Google's limited copying of the API
| here constitutes fair use, the Court examines the four
| guiding factors set forth in the Copyright Act's fair use
| provision... '
|
| (1) The nature of the work at issue favors fair use. The
| copied lines of code are part of a "user interface" that
| provides a way for programmers to access prewritten computer
| code through the use of simple commands. As a result, this
| code is different from many other types of code, such as the
| code that actually instructs the computer to execute a task.
| As part of an interface, the copied lines are inherently
| bound together with uncopyrightable ideas (the overall
| organization of the API) and the creation of new creative
| expression (the code independently written by Google)...
|
| (2) The inquiry into the "the purpose and character" of the
| use turns in large measure on whether the copying at issue
| was "transformative," i.e., whether it "adds something new,
| with a further purpose or different character." Campbell, 510
| U. S., at 579. Google's limited copying of the API is a
| transformative use. Google copied only what was needed to
| allow programmers to work in a different computing
| environment without discarding a portion of a familiar
| programming language .... The record demonstrates numerous
| ways in which reimplementing an interface can further the
| development of computer programs. Google's purpose was
| therefore consistent with that creative progress that is the
| basic constitutional objective of copyright itself.
|
| (3) Google copied approximately 11,500 lines of declaring
| code from the API, which amounts to virtually all the
| declaring code needed to call up hundreds of different tasks.
| Those 11,500 lines, however, are only 0.4 percent of the
| entire API at issue, which consists of 2.86 million total
| lines. In considering "the amount and substantiality of the
| portion used" in this case, the 11,500 lines of code should
| be viewed as one small part of the considerably greater
| whole. As part of an interface, the copied lines of code are
| inextricably bound to other lines of code that are accessed
| by programmers. Google copied these lines not because of
| their creativity or beauty but because they would allow
| programmers to bring their skills to a new smartphone
| computing environment. The "substantiality" factor will
| generally weigh in favor of fair use where, as here, the
| amount of copying was tethered to a valid, and
| transformative, purpose.
|
| (4) The fourth statutory factor focuses upon the "effect" of
| the cop- ying in the "market for or value of the copyrighted
| work." SS107(4). Here the record showed that Google's new
| smartphone platform is not a market substitute for Java SE.
| The record also showed that Java SE's copyright holder would
| benefit from the reimplementation of its interface into a
| different market. Finally, enforcing the copyright on these
| facts risks causing creativity-related harms to the public.
| When taken together, these considerations demonstrate that
| the fourth factor--market effects--also weighs in favor of
| fair use.
|
| 'The fact that computer programs are primarily functional
| makes it difficult to apply traditional copyright concepts in
| that technological world. Applying the principles of the
| Court's precedents and Congress' codification of the fair use
| doctrine to the distinct copyrighted work here, the Court
| concludes that Google's copying of the API to reimplement a
| user interface, taking only what was needed to allow users to
| put their accrued talents to work in a new and transformative
| program, constituted a fair use of that material as a matter
| of law. In reaching this result, the Court does not overturn
| or modify its earlier cases involving fair use.'
|
| [1]
| https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf
| salawat wrote:
| That's John Searle's thought experiment actually. Hilary Putnam
| had some thoughts in reference to it along the lines that a
| brain in a vat might think in a language similar to what we
| would speak, but the words of that language would necessarily
| encode different meanings due to the different experience of
| the external world and sensory isolation.
|
| https://plato.stanford.edu/entries/chinese-room/
| Dracophoenix wrote:
| Thanks for the correction. I made it known in my edit.
| AJ007 wrote:
| And this applies to everything, not just source code.
|
| I'm just presuming we have a future where you can consume
| unique content indefinitely. Such as instead of binge watching
| Star Trek on Netflix you press play and new episodes are
| generated and played continuously, 24/7, and they are actually
| really good.
|
| Thus intellectual property becomes a commodity.
| Dracophoenix wrote:
| While headway has been made in photo algorithms like
| StyleGAN, GPT-3's scriptwriting, and AI voice replication, we
| aren't even close to having AI-generated stick cartoons or
| anime. At best, AI generated Star Trek trained on old
| episodes would produce the live-action equivalent of limited
| animation; it would reuse the most liked parts over an over
| again and rehash the same camerawork and lens focus that you
| got in the 60's and the 90's. There wouldn't be any new
| planets explored, no new species, no advances in
| cinematography, and certainly no self-insert character (in
| case you wanted to see - simulation of how you'd fair on the
| Enterprise). It wouldn't add anything new as far as I can
| see. Now if there was some way to recreate all the characters
| in photorealistic 3D with Unreal Engine, feed them a script,
| and use some form of intelligent creature and planet
| generation, you may get a little closer to creating a truly
| new episode.
| koonsolo wrote:
| Does this mean that when I read GPL code and learn from it, I
| cannot use these learnings in non-GPL code?
|
| I get it that the derivative work might be more clear in an AI
| setting, but basically it boils down to the same thing.
| agomez314 wrote:
| Posting this due to the recent unveiling of GitHub Co-pilot and
| the intersection on the ethics of ml training set data.
| 6gvONxR4sf7o wrote:
| > previous """AI""" generation has been trained on public text
| and photos, which are harder to make copyright claims on, but
| this is drawn from large bodies of work with very explicit court-
| tested licenses
|
| This seems pretty backwards to me. A GPL licensed data point is
| more permissive than an unlicensed data point.
|
| That said, I'm glad that these data points do have explicit
| licenses that say "if you use this, you must do XYZ" so that it's
| clear that our large ML projects are going counter to creators
| intent when they made it open.
|
| I'd love to start seeing licenses about use as training data.
| Then maybe we'd see more open access to these models that benefit
| from the openness of the web. I'd personally use licenses that
| say if you want to train on my work, you must publish the model.
| That goes for my code, my writing, and my photography.
|
| Anyways GitHub is arguing that any use of publicly available data
| for training is fair use, but they also admit that it's all new
| and unprecedented, regarding training data.
| zeptonix wrote:
| The tone of the responses here is absurd. Guys, be grateful for
| some progress. Instead of having to retype boilerplate code, your
| productivity is now enhanced by having a system that can do it
| for you. This is primarily about reducing the need to re-type
| total boilerplate and/or copy/paste from Stackoverflow. If you
| were to let some of the people here run things we'd never have
| any form of progress with anything ever.
| joepie91_ wrote:
| > Instead of having to retype boilerplate code, your
| productivity is now enhanced by having a system that can do it
| for you
|
| We already invented something for that a couple decades ago,
| and it's called a "library". And unlike this thing, libraries
| don't launder appropriation of the public commons with total
| disregard for those who have actually _built_ that commons.
| qayxc wrote:
| Questions like this go much deeper and illustrate issues that
| need to be addressed before the technology becomes standard and
| widely adopted.
|
| It's not about progress or supressing it, it's a fundamental
| question about whether it is OK for huge companies to profit
| from the work of others without as much as giving credit, and
| if using AI this way represents an instance of doing so.
|
| The latter aspect goes beyond productivity or licensing - the
| OP asserts that AI isn't equivalent to a student who learned
| from examples how to perform a task, but rather replicates
| (recalls) or reproduces the works of others (e.g. the training
| material).
|
| It's a question that goes beyond this particular application:
| what about GAN-based generators? Do they merely reproduce
| slight variations of the training material? If so, wouldn't the
| authors of the training material have some kind of intellectual
| property rights to the generated works?
|
| This doesn't just concern code snippets, it's a general
| question about AI, crediting creators, and circumventing
| licensing and intellectual property rights.
| fartcannon wrote:
| To me, this is similar to all these big org making money off our
| data. They should be paying us to profit off our minds.
| kingsuper20 wrote:
| I was just musing about whether this kind of tool has been
| written (or is being written) for music composition, business
| letter writing, poetry, news copy.
|
| Interesting copyright issues.
|
| Anyone who thinks their profession will continue as-is for the
| long term is probably mistaken.
| sipos wrote:
| So, I can't see how they can argue that the code generated is not
| a derrivative of at least some of the code that it was trained
| on, and therefore encumbered by a complicated, and for anyone
| other than GitHub, impossible to disentangle, copyright claims.
| If they haven't even been careful to only use software under one
| license that does not require the original author to be
| attributed, then I don't see how it can even be legal for them to
| be running the service.
|
| All that said, I'm not confident that anyone will stop them in
| court anyway. This hasn't tenmded to be very easy when companies
| infringe other open source code copyright terms.
|
| Until it is cleared up though, it would seem extremely unwise for
| anyone to use any code from it.
| kklisura wrote:
| Should we be changing our open source licenses to explicitly
| prevent training such systems using our code?
| onli wrote:
| I'd assume this: In the same way as you can not forbid a human
| to learn concepts from your code, you can not forbid an
| automated system to learn concepts from your code, regardless
| the license. Also, if you would it would make your code non-
| free.
|
| At least as long as the system really learns concepts. If it
| just copy & pastes code, then that's a different story (same as
| with humans).
| bencollier49 wrote:
| Good idea, but if carved up into small enough chunks, it may be
| considered fair use.
|
| What is confusing is that the neural net may take lots of small
| chunks and link them to one another, and then reproduce them in
| the same order verbatim.
| falcolas wrote:
| One of the examples pointed out in the reply threads was the
| suggestion in a new file to insert the GPL disclaimer header.
|
| So, the length of the samples being drawn is not necessarily
| small: the chunk size is based on its commonality. It could
| easily be long enough to trigger a copyright violation.
| svaha1728 wrote:
| With music sampling, copyright protects down to the sound of
| a kick drum. No doubt Microsoft has a good set of attorneys
| working on their arguments as we speak.
| joepie91_ wrote:
| That would be a legal no-op. Either their use _is_ covered by
| copyright and they are violating your license, or it _isn 't_
| covered by copyright and then any constraints that your license
| sets are meaningless.
|
| Licenses hold no power outside of that granted to it by things
| being copyrighted by default.
| slim wrote:
| Why forbid? Just use GPL and extend the contagion to the code
| trained using your code
| k__ wrote:
| I don't think so.
|
| The code that is already used to train should be problematic
| for them, not only new Code in the future.
| 6510 wrote:
| Time to make closed source illegal.
| naikrovek wrote:
| I don't see the point of this tool, independent of the resulting
| code being derivative of GPL code or not.
|
| being able to produce valid code is not the bottleneck of any
| developer effort. no projects fail because code can't be typed
| quickly enough.
|
| the bottleneck is understanding how the code works, how to design
| things correctly, how to make changes in accordance with the
| existing design, how to troubleshoot existing code, etc.
|
| this tool doesn't make anything any easier! it makes things
| harder, because now you have running software that was written by
| no one and is understood by no one.
| mslm wrote:
| Have to fully agree; just seems like a "cool" tool where if you
| had to actually use it for real world projects, it's going to
| slow you down significantly, and you'll only admit it once the
| honeymoon period is over.
| fckthisguy wrote:
| Whilst I absolutely agree that writing code fast enough isn't
| the bottleneck, it's always nice to have tools that reduce
| repeat code writing.
|
| I use the React plugin for Webstorm to avoid having to write
| the boilerplate for FCs. Maybe in the future Copilot will
| replace that usage.
| ImprobableTruth wrote:
| To me that - and really any form of common boilerplate - is
| just evidence that we're lacking abstractions. If your editor
| is generating code for you, that means that the 'real'
| programming language you're using 'in your head' has some
| metaprogramming facilities emulated by your IDE.
|
| I think we should strive to improve our programming languages
| to make less of this boilerplate necessary, not to make
| generating boiler plate easier. The latter is just going to
| make software less and less wieldy. Imagine the horror if
| instead of (relatively) higher level programming languages
| like C we were all just using assembly with code generation.
| izgzhen wrote:
| It doesn't calm to solve the bottleneck either. On the
| contrary, it clearly states that its mission is to solve the
| easy parts better so developers can focus better on the true
| challenging engineering problems as you mentioned.
| uncomputation wrote:
| This reminds me of a startup pitch where it's always "oh we
| take care of x so you don't have to," but the problem is now
| I just have _another_ thing to take care of. I cannot speak
| for people who use Copilot "fluently," but I know for every
| chunk of code it spat out I would need to read every line and
| make sure "Is this right? Is the return type what I want?
| Will this loop terminate? Is 'scan' the right API? Is that
| string formatted properly? Can I optimize this?" etc. To me
| it's hardly "solving the easy parts," but rather putting the
| passenger's hands on the wheel.
| gotostatement wrote:
| Upvoted. I think the only good use case for this is
| spitting out 10-line, annoying, commonly used API
| boilerplate for commonly used APIs
| izgzhen wrote:
| That is a valid use case despite being small and
| incremental. I think it will still be helpful to some
| people.
| izgzhen wrote:
| The easy part is the copy-paste-from-SO part ;)
| bobsomers wrote:
| Completely agree. If anything, I see tools like this actually
| decreasing engineering speed. I don't see how it doesn't lead
| to shipping large quantities of code the team didn't vet
| carefuly, which has is a recipe for subtle and hard to find
| bugs. Those kinds of bugs are much more expensive to find a
| squash.
|
| What we really need aren't tools that help us write code
| faster, but tools that help us understand the design of our
| systems and the interaction complexity of that design.
| pjfin123 wrote:
| I think this would fall under any reasonable definition of fair
| use. If I read GPL (or proprietary) code as a human I still own
| code that I later write. If copyright was enforced on the outputs
| of machine learning models based on all content they were trained
| on it would be incredibly stifling to innovation. Requiring
| obtaining legal access to data for training but full ownership of
| output seems like a sensible middle ground.
|
| (Reposting my comment from yesterday)
| sanderjd wrote:
| Reposting a summary of my reply: if you memorize a line of code
| and then write it down somewhere else without attribution, that
| is not fair use, you copied that line of code. If this model
| does the same, it is the same.
| rbarrois wrote:
| An interesting impact of this discussion is, for me: within my
| team at work, we're likely to forbid any use of Github co-pilot
| for our codebase, unless we can get a formal guarantee from
| Github that the generated code is actually valid for us to use.
|
| By the way, code generated by Github co-pilot is likely
| incompatible with Microsoft's Contribution License Agreement [1]:
| "You represent that each of Your Submission is entirely Your
| original work".
|
| This means that, for most open-source projects, code generated by
| Github co-pilot is, right now, NOT acceptable in the project.
|
| [1] https://opensource.microsoft.com/pdf/microsoft-
| contribution-...
| CharlesW wrote:
| > _This means that, for most open-source projects, code
| generated by Github co-pilot is, right now, NOT acceptable in
| the project._
|
| For this scenario, how is using Co-Pilot generated code
| different from using code based on sample code, Stack Overflow
| answers, etc.?
| rbarrois wrote:
| I'd say that it depends on the license; for StackOverflow,
| it's CC-BY-SA 4.0 [1]. For sample code, that would depend on
| the license of the original documentation.
|
| My point is: when I'm copying code from a source with an
| explicit license, I know whether I'm allowed to copy it. If I
| pick code from co-pilot, I have no idea (until tested by law
| in my jurisdiction) whether said code is public domain, AGPL,
| proprietary, infringing on some company's copyright.
|
| [1] https://stackoverflow.com/legal/terms-of-
| service#licensing
| CharlesW wrote:
| That makes sense, thank you.
| gwenzek wrote:
| A number of company, including Google and probably Microsoft
| forbid copying code from Stack Overflow because there is no
| explicit license
| CharlesW wrote:
| TIL, thank you!
| gdsdfe wrote:
| How would you know if copilot was used or not?!
| LeicaLatte wrote:
| Our software has violated the world and people's lives legally
| and illegally in many instances. I mean none of us cared when
| GPT-3 did the same for text on the internet. :)
|
| Reminder - Software engineers, our codes, GPLs are not special.
| 29athrowaway wrote:
| If I recall correctly, it has been already determined that using
| proprietary data to train a machine learning system is not a
| violation of intellectual property.
| nemoniac wrote:
| So as I understand it, AGPL was introduced to cover an unforeseen
| loophole in GPL that adapted code could be used to power a web
| service. Could another new version of the license block allowing
| code from use to train such GitHub co-pilot like models?
| corobo wrote:
| If I as an alleged human have learned purely from GPL code would
| that require code I write to be released under the GPL too?
|
| We should probably start thinking about AI rights at some point.
| Personally I'll be crediting GPT-3 as any other contributor
| because it sounds cool but maybe morally too in future
| notkaiho wrote:
| Unless you were using structures directly from said code,
| probably not?
|
| Compare if you had only learned writing from, say, the Bible.
| You would probably write in a very Biblical manner, but would
| you write the Psalms exactly? Most likely not.
| edenhyacinth wrote:
| We have seen Co-Pilot directly output
| (https://docs.github.com/en/github/copilot/research-
| recitatio...) the zen of python when prompted - there's no
| reason it wouldn't write the Psalms exactly when prompted in
| the right manner.
| disgruntledphd2 wrote:
| That's super cool. As long as you do the things you specify
| at the bottom of that doc (provide attribution if copied so
| people can know if it's OK to use) then a lot of the
| concerns of people on these threads are going to be
| resolved.
| edenhyacinth wrote:
| Pretty much! There's only three major fears remaining
|
| * Co-pilot fails to detect it, and you have a potential
| lawsuit/ethical concern when someone finds out. Although
| the devil on my shoulder says that if Co-pilot didn't
| detect it, what's to say another tool will?
|
| * Co-pilot reuses code in a way that still violates
| copyright, but is difficult to detect. I.e. If you
| checked via a syntax tree, you'd notice that the code was
| the same, but if you looked at it as raw text, you
| wouldn't.
|
| * Purely ethical - is it right to take licensed code and
| condense it into a product, without having to take into
| account the wishes of the original creators? It might be
| treated as normal that other coders will read it, and
| pick up on it, but when these licenses were written no
| one saw products like this coming about. They never
| assumed that a single person could read all their code,
| memorise it, and quote it near-verbatim on command.
| disgruntledphd2 wrote:
| > Purely ethical - is it right to take licensed code and
| condense it into a product, without having to take into
| account the wishes of the original creators? It might be
| treated as normal that other coders will read it, and
| pick up on it, but when these licenses were written no
| one saw products like this coming about. They never
| assumed that a single person could read all their code,
| memorise it, and quote it near-verbatim on command.
|
| It's gonna be really interesting to see how this plays
| out.
| corobo wrote:
| I've not seen Copilot in action yet, I was under the
| impression it doesn't use code directly.
|
| In any case my original question was answered by the tweeter
| in a later tweet I missed
| https://twitter.com/eevee/status/1410049195067674625
|
| I get where they're coming from but they are kinda just
| handwaving it back the other way with the "u fell for
| marketing idiot" vibe. I wish someone smarter than me could
| simplify the legal ramifications around this but we'll
| probably have to wait till it kills someone (or at least
| costs someone a bunch of money) to get any actual laws set
| up.
| lucideer wrote:
| Your question had already been preempted in the OP.
| Specifically:
|
| > _" but eevee, humans also learn by reading open source code,
| so isn't that the same thing"_
|
| > _- no_
|
| > _- humans are capable of abstract understanding and have a
| breadth of other knowledge to draw from_
|
| > _- statistical models do not_
|
| > _- you have fallen for marketing_
|
| -- https://twitter.com/eevee/status/1410049195067674625
| corobo wrote:
| I preemptively commented that I'd seen that tweet three hours
| before your comment figuring someone was going to quote it at
| me haha
|
| Preemptive, doesn't work as it turns out :)
|
| https://news.ycombinator.com/item?id=27687586
| lucideer wrote:
| Nice catch
| pyentropy wrote:
| That's what I wanted to ask, where do we draw the line of
| copyright when it comes to inputs of generative ML?
|
| It's perfectly fine for me to develop programming skills by
| reading _any code_ regardless of the license. When a corp
| snatches an employee from competitors, they get to keep their
| skills even if they signed an NDA and can 't talk about what
| they worked on. On the other hand there's the no-compete
| agreement, where you can't. Good luck making a no-compete
| agreement with a neural network.
|
| Even if someone feeds stolen or illegal data as an input
| dataset to gain advantage in ML, how do we even prove it if
| we're only given the trained model and it generalizes well?
| vsareto wrote:
| >how do we even prove it if we're only given the trained
| model and it generalizes well?
|
| Someone's going to have to audit the model the training and
| the data that does it. There's a documentary on black holes
| on Netflix that did something similar (no idea if it was AI)
| but each team wrote code to interpret the data independently
| and without collaboration or hints or information leakage,
| and they were all within a certain accuracy of one-another
| for interpreting the raw data at the end of it.
|
| So, as an example, if I can't train something in parallel and
| get similar results to an already trained model, we know
| something is up and there is missing or altered data (at
| least I think that's how it works).
| hliyan wrote:
| Copyright is going to get very muddy in the next few decades.
| ML systems may be able to generate entire novels in the
| styles of books they have digested, with only some assist
| from human editors. True of artwork and music, and perhaps
| eventually video too. Determining "similarity" too, may soon
| have to be taken off the hands of the judge and given to
| another ML system.
| agomez314 wrote:
| Take it further. You could easily imagine taking a service
| like this as an invisible middleware behind a front-end and
| start asking users to pay for the service. Some could argue
| it's code generation attributable to those who created the
| model, but reality is that the models were trained by code
| written by thousand of passionate users at no pay with the
| intent of free usage.
| bogwog wrote:
| > but reality is that the models were trained by code
| written by thousand of passionate users at no pay with the
| intent of free usage.
|
| I hope you're actually reading those LICENSE files before
| using open source code in your projects.
| rhn_mk1 wrote:
| > It's perfectly fine for me to develop programming skills by
| reading any code regardless of the license.
|
| I'd be inclined to agree with this, but whenever a high
| profile leak of source code happens, reading that code can
| have dire consequences for reverse engineers. It turns clean
| room reverse engineering into something derivative, as if the
| code that was read had the ability to infected whatever the
| programmer wrote later.
|
| A situation involving the above developed in the ReactOS
| project https://en.wikipedia.org/wiki/ReactOS#Internal_audit
| kitsune_ wrote:
| I think you are missing the mark here with this comparison,
| Copilot and its network weights are already the derived work,
| not just the output it produces.
| wilde wrote:
| Possibly. We won't know until this is tested in court.
| Traditionally one would want to clean room [1] this sort of
| thing. Co-pilot is...really dirty by those standards.
|
| [1] https://en.wikipedia.org/wiki/Clean_room_design
| edenhyacinth wrote:
| A machine learning isn't really the same as a person learning -
| people generally can code at a high level without having first
| read TBs of code, nor can you reasonably expect a person to
| have memorised GPL code to reproduce it on demand.
|
| What you can expect a person to do is understand the principles
| behind that GPL code, and write something along the same lines.
| GitHub Co-Pilot is not a general ai, and it's not touted as
| one, so we shouldn't be considering whether it really _knows_
| code principles, only that it can reliably output code that
| fits a similar function to what came before, which could
| reasonably include entire blocks of GPL code.
| corobo wrote:
| Well if it is actually straight up outputting blocks of
| existing code then get it in the bin as a failed attempt to
| sprinkle AI on development and use this instead
|
| https://github.com/drathier/stack-overflow-import
| schnebbau wrote:
| Newsflash everyone, if you open source your code it's going to be
| copied or paraphrased anyway.
| nextaccountic wrote:
| It should be copied and paraphrased, but respecting the
| license. This means, among other things, crediting the author.
| schnebbau wrote:
| It may be hard to believe, but there are sick and twisted
| individuals in this dangerous world who copy from github
| without even a single glance at the license, and they live
| among us.
| iKevinShah wrote:
| There are always exceptions (maybe they might be norm in
| this case) but its still not 100%, still not all
| encompassing. This "AI" seems to be. I think that is like
| the entire concern. ALL the code is affected for all the
| instances.
| adn wrote:
| Yes, and those people are violating the licenses of the
| code when they do that. It's not unreasonable to expect a
| massive company like microsoft to not do this on a massive
| scale.
| diffeomorphism wrote:
| What does that have to with the topic? The question is not
| whether it gets copied, the question is whether it gets
| pirated.
| postalrat wrote:
| I think the issue many people may have with this is it's a
| proprietary tool that profits on work it was not licensed to
| use this way.
| GuB-42 wrote:
| Yes that's the point.
|
| But if I do it under a copyleft license like GPL, I expect
| those who copy to abide by the license and open source their
| own code too.
|
| But sure, people shit on IP rights all the time, and I am
| guilty of it too. Let's say I didn't pay what I should have
| paid for every piece of software I have used.
| Closi wrote:
| "About 0.1% the snippets are verbatim"
|
| This implies that by just changing the variable names, the
| snippets are classed as non-verbatim.
|
| I don't buy that this number is anywhere close to the actual
| figure if you assume that you can't just change function names
| and variable names and suddenly say you have escaped both the
| legality and the spirit of GPL.
| jordemort wrote:
| What happens when someone puts code up on GitHub with a license
| that says "This code may not be used for training a code
| generation model"?
|
| - Is GitHub actually going to pay any attention to that, or are
| they just going to ingest the code and thus violate its license
| anyway?
|
| - If they go ahead and violate the code's license, what are the
| legal repercussions for the resulting model? Can a model be "un-
| trained" from a particular piece of code, or would the whole
| thing need to be thrown out?
| vbezhenar wrote:
| I expect them to check /LICENSE file and if it deviates from
| standard open source license, they'll skip that repository.
| jordemort wrote:
| They haven't made any public statements on if they're looking
| at LICENSE or not; I'd sure appreciate it if they did!
| anfelor wrote:
| They don't do that it seems. In the footnotes of
| https://docs.github.com/en/github/copilot/research-
| recitatio... they mention two repositories from the training
| set none of which specify a license.
| cxr wrote:
| The existence of a LICENSE file is neither necessary nor
| sufficient to determine the terms that apply to a given work.
| diffeomorphism wrote:
| Why not? If it does not exist you treat it as proprietary
| (copyrighted by default) and if it does exist at least the
| author claims that the given license is an option (possibly
| their mistake, not mine)
| junon wrote:
| Because individual source files might have license
| headers that override the root license file in the
| repository.
| all_rights_rsvd wrote:
| I post my code publicly, but with an "all rights reserved"
| licence. I don't mind everyone reading my code freely, but you
| can't use it for anything but learning. If I found out they
| were ingesting my code I would be angry. It's like training
| your replacement. I don't use GitHub, anyways, but now I'll
| definitely never even think about it.
| toyg wrote:
| Technically then I'm infringing as soon as I clone your repo,
| possibly even as soon as a webserver sends your files to my
| browser.
|
| "All rights reserved" makes sense on final items, like books
| or physical records, that require no copy or change after
| owner-approved manufacturing has taken place. It doesn't
| really make sense on digital artefacts.
| all_rights_rsvd wrote:
| So don't clone it, read it online. I reserve all rights,
| but I do give license to my host to make a "copy" to let
| you view it. I do that specifically to prevent non-
| biological entities like corporations or AI from using my
| code. If you're a biological entity, I specify you can
| email me to get a license for my code for a specific,
| defined purpose. I have a conversation with that person,
| then I send them a record number and the terms of my
| license for them in which I grant some rights which I had
| reserved.
|
| Also, in your example, the copyright for the book or dvd is
| for the content, not the physical item. You can do anything
| you want with that item but not the content. My code is
| similar, I'm licensing my provider to serve you a visual
| representation of the files so you can experience the
| content, not giving you a license to run that code or use
| it otherwise.
| [deleted]
| uchiga wrote:
| If it is public, its no longer your code. Its ai training
| material.
| cortesoft wrote:
| Also, how would you know if your code was included in the
| training or not?
|
| Then, let's say the AI generates some new code for someone, and
| it is nearly identical to some bit of code that you wrote in
| your project.
|
| If they didn't use your code in the model, then the generated
| code is clearly not a copyright violation, since it was
| effectively a "clean room" recreation.
|
| If your code was included in the model, is it therefore a
| violation?
|
| But then again, it comes down to how can someone prove their
| code was included or not?
|
| What if the creators don't even know? If you wrote your model
| to say, randomly grab 50% of all public repos to use in the
| model, then no one would know if a specific repo was used in
| the training.
| invokestatic wrote:
| By uploading your content to GitHub, you've granted them a
| license to use that content to "improve the Service over time",
| as specified in the ToS[1].
|
| That effectively "overrides" any license or term that you've
| specified for your repository, since you've already licensed
| the content to GitHub under different terms. Of course, people
| who are not GitHub are beholden to the terms you specify.
|
| [1] https://docs.github.com/en/github/site-policy/github-
| terms-o...
| nitwit005 wrote:
| I rather suspect judges would not see "improving the Service
| over time" as permission to create derivative works without
| compensation.
|
| The person uploading files to github is also not necessarily
| doing so with permission from the rights holder, which might
| be a violation of the terms of service, but would mean
| there's no agreement in place.
| diffeomorphism wrote:
| How is this different from uploading a hollywood movie to
| youtube? Just because there is a passage in the terms that
| the uploader supposedly gave them those rights, this does not
| mean they actually have the power to do that.
| jcranmer wrote:
| You can't give Github or Youtube or anybody else copyright
| rights if you don't have them in the first place. This is
| what ultimately torpedoed "Happy Birthday" copyright
| claims: while it's pretty undisputed that the Hill sisters
| gave their copyright to (ultimately) Warner/Chapelle, it
| was the case that they actually _didn 't_ invent the
| lyrics, and thus Warner/Chapelle had no copyright over the
| lyrics.
|
| So if someone uploads a Hollywood movie to Youtube, Youtube
| doesn't get the rights to play that movie from them because
| they didn't have the rights in the first place. Of course,
| if the actual copyright owner uploads it, it's now
| permissible for Youtube to play it, even if it's the copy
| that someone else provided. [This has torpedoed a few
| filesharing lawsuits.]
| macinjosh wrote:
| Not sure how much it would matter but the main difference I
| see is that if I upload my own code to GitHub I have the
| ability to give away the IP, but if I upload Avengers End
| Game to YouTube I don't have the right to give that away.
| makeitdouble wrote:
| I wonder how it would work if we consider you flagged
| your code as GPL before it hits Github.
|
| We could end up in the same situation as the Hollywood
| movie even if you are also the one setting the original
| license on the work. Basically you have a right to change
| the license, but it doesn't mean you do.
| im3w1l wrote:
| A very plausible scenario: Alice creates GPL project. Bob
| forks it and uploads to github. Bob does not have a right
| to relicense Alices' parts.
| Hamuko wrote:
| I sort of doubt that GitHub could include GPL code in a piece
| of closed-source program that they distribute that "improves
| the service" and claim that this gives them the right.
| amelius wrote:
| > By uploading your content to GitHub, you've granted them a
| license to use that content to "improve the Service over
| time", as specified in the ToS.
|
| That's nonsense because they could claim that for almost any
| reason.
|
| E.g. assume Google put the source code of Google search in
| Github. Then Github copies that code and uses it in their own
| search, since that "improves the service". Would that be
| legal?
|
| It's like selling a pen and claiming the rights to anything
| written with it.
| invokestatic wrote:
| If the pen was sold with a contract that said the seller
| has the rights to anything written with it, then yes. These
| types of contracts are actually quite common, for example
| an employment contract will almost certainly include an IP
| grant clause. Pretty much any website that hosts user-
| generated content as well. IANAL, but quite familiar with
| business law.
| joepie91_ wrote:
| > These types of contracts are actually quite common, for
| example an employment contract will almost certainly
| include an IP grant clause.
|
| In the US, maybe. In most of the rest of the world, these
| sorts of overreaching "we own everything you do anywhere"
| clauses are decidedly illegal.
| lucideer wrote:
| The use of the definition _Your Content_ may make GitHub 's
| own ToS legally invalid in a large number of cases as it
| implies that the uploader must be the sole author and "owner"
| of the code being uploaded.
|
| From the definitions section in the same doc:
|
| > _" Your Content" is Content that you create or own._
|
| That will definitely exclude any mirrored open-source
| projects, any open-source project that has ever migrated to
| Github from another platform, and also many forked projects.
| carlosperate wrote:
| Good point, to me that explains why this is a GitHub product
| instead of a Microsoft (or VSCode) product.
| joeyh wrote:
| Anyone can upload someone else's freely licensed code to
| github. Without giving them such a license.
|
| I do not upload my code to github, or give them any special
| permissions, and I am confident my code was included in the
| model's corpus.
| jordemort wrote:
| I think more specifically, the relevant bit is here:
| https://docs.github.com/en/github/site-policy/github-
| terms-o...
|
| > We need the legal right to do things like host Your
| Content, publish it, and share it. You grant us and our legal
| successors the right to store, archive, parse, and display
| Your Content, and make incidental copies, as necessary to
| provide the Service, including improving the Service over
| time. This license includes the right to do things like copy
| it to our database and make backups; show it to you and other
| users; parse it into a search index or otherwise analyze it
| on our servers; share it with other users; and perform it, in
| case Your Content is something like music or video.
|
| But, it goes on to say:
|
| > This license does not grant GitHub the right to sell Your
| Content. It also does not grant GitHub the right to otherwise
| distribute or use Your Content outside of our provision of
| the Service, except that as part of the right to archive Your
| Content, GitHub may permit our partners to store and archive
| Your Content in public repositories in connection with the
| GitHub Arctic Code Vault and GitHub Archive Program.
|
| I'm not a lawyer, but it seems ambiguous to me if this ToS is
| sufficient to cover CoPilot's butt in corner cases; I bet at
| least one lawyer is going to make some money trying to answer
| the question.
| buu700 wrote:
| IANAL, but I wouldn't read that as granting GitHub the
| right to do anything like this. There's definitely a
| reasonable argument to be had here, but I think limiting
| the grant of rights to incidental copies should trump
| "[...] or otherwise analyze it on our servers" and what
| they're allowed to do with the results of that analysis.
|
| On the extreme end, "analysis" is so broad that it could
| arguably cover breaking down a file of code into its
| constituent methods and just saving the ASTs of those
| methods verbatim for Copilot to regurgitate. That's
| obviously not an acceptable outcome of these terms per se,
| but arguably isn't any different in principle from what
| they're already doing.
|
| Ultimately, as I understand, courts tend to prefer a common
| sense outcome based on a reasonable human understanding of
| the law, rather than an outcome that may be defensible
| through some arcane technical logic but is absurd on its
| face and counter to the intent of the law. If a party were
| harmed by an instance of Copilot-generated copyright
| infringement, I don't see a court siding with this tenuous
| interpretation of the ToS over the explicit terms of the
| source code license. On the other hand, it would probably
| also be impossible to prove damages without something like
| a case of verbatim reproduction, similarly to how having a
| developer move from working on proprietary code for one
| company to another isn't automatically copyright
| infringement.
|
| I doubt that GitHub is doing anything as blatantly
| malicious as copying snippets of (GPL or proprietary) code
| to explicitly reuse verbatim, but if they're learning from
| license-restricted code at all then I don't see how they
| wouldn't be subjecting themselves and/or consumers of
| Copilot to the same risk.
| yaitsyaboi wrote:
| Wait so does this mean a "private repo" is meaningless and
| GitHub can share any code in any repo with anyone?
| ipaddr wrote:
| That is not even the right question.
|
| Why are developers so myopic around big tech? Of course
| they can. Facebook can use your private photos. It's in
| their terms and services. Cloud providers have more
| generous terms.
|
| The response has always been they won't do that because
| they have a reputation to manage. The further they grow
| the further they control the narrative so the less this
| matters.
|
| Wait until you find out they sell your data or use your
| data to sell products.
|
| Why in 2021 are we giving Microsoft all of our code? It
| seems like the 90s, 2000s never happened and we all trust
| microsoft. They have a free editor and a free operating
| system that sends packets of activity the user does back
| to microsoft but that's okay.. we want to help improve
| their products? We trust them.
| sandyarmstrong wrote:
| Why do you think people care so much about end-to-end
| encrypted messaging?
|
| Yes, the concept of a "private" repo is enforced only by
| GitHub's service. A bug in their auth code could lead to
| others having access. A warrant could lead to others
| having access. Etc.
| cercatrova wrote:
| Of course. A "private" repo is still on their servers.
| It's only private from other GitHub users, not the actual
| site administrators. This is the same in any website, of
| course the admins can see everything. If you truly want
| privacy, use your own git servers.
| ocdtrekkie wrote:
| Fun fact: Every major cloud provider has a similar
| blanket term. For example, Google doesn't need to license
| music to use for promotional content, because YouTube's
| terms grant them a worldwide license to use uploaded
| content for purposes including promoting their services,
| and music labels can't afford to not be on YouTube. (It's
| probable even uploading content to protect it, as in
| Content ID, would arguably cause this term to apply.)
|
| It all comes down to the nuance of whether the usage
| counts as part of protecting or improving (or promoting)
| their services and what other terms are specified.
| vageli wrote:
| No.
|
| > GitHub may permit our partners to store and archive
| Your Content in public repositories in connection
| z3ncyberpunk wrote:
| Hey... want to take a guess why Microsoft lets you have
| unlimited free private repos when they bought GitHub? ;)
| notatoad wrote:
| yes, that's what that specific section means, but as
| always with these documents you can't just extract a
| single section, you need to take the document as a whole
| (and usually, more than one document - ToS privacy policy
| are usually different)
|
| these documents are structured as granting the service
| provider extremely broad rights, and then the rest of the
| document takes away portions of those rights. so in this
| case they claim the right to share any code in any repo
| with anyone, and then somewhere else they specify which
| code they won't share, and with whom they won't share it.
| 2OEH8eoCRo0 wrote:
| It's aggravating that there is no escape. If you host
| somewhere else it will be scraped. If you pay for the service
| it will be used.
| antattack wrote:
| That does not mean that you give them license to your code.
| In fact some or all of the code may not be yours to give in a
| first place.
| sipos wrote:
| Seems like a good reason to never use GitHub, and encourage
| other people not to.
| rjp0008 wrote:
| I would bet this as applicable as the Facebook posts of my
| parents friends something like, 'All my content on this page is
| mine alone and I expressly forbid Facebook INC usage of it for
| any purpose.'
| jordemort wrote:
| I'm not sure why it would be any less binding than any other
| license term, except for possibly the ToS loophole that
| invokestatic points out below.
| willseth wrote:
| It's not binding because the other party hasn't agreed. You
| agree to terms when you use the site. One party can't
| unilaterally change the agreement without consent of the
| other party.
| jordemort wrote:
| I see where you're coming from but it's not quite the
| same thing; Facebook doesn't encourage people to choose a
| license for the content that they post there, so there's
| no expectation that there are any terms aside from those
| in Facebook's ToS. OTOH GitHub has historically very
| strongly encouraged users to add a LICENSE to their
| repositories, and also encouraged users to fork other
| people's code and and push it to GitHub. That GitHub
| would be exempt from the licensing terms of the code
| pushed to it, except for the obvious minimal extent they
| might need to be in order to provide their services,
| seems like an extremely surprising interpretation.
| Avamander wrote:
| Someone might have published a project I've contributed
| to, on GitHub. There's no permission.
| moolcool wrote:
| NO COPYRIGHT INTENDED
| mattdesl wrote:
| By submitting any textual content (GPL or otherwise) on the web,
| you are placing it in an environment where it will be consumed
| and digested (by human brains and machine learning algorithms
| alike). There is already legal precedent set for this which
| allows its use in training machine learning algorithms,
| specifically with heavily copyrighted material from books[1].
|
| This does not mean that any GitHub Co-Pilot produced code is
| suddenly free of license or patent concerns. If the code produces
| something that matches too closely GPL or otherwise licensed code
| on a particularly notable algorithm (such as video encoder), you
| may still be in a difficult legal situation.
|
| You are in essence using "not-your-own-code" by relying on
| CoPilot, which introduces a risk that the code may not be
| patent/license free, and you should be aware of the risk if you
| are using this tool to develop commercial software.
|
| The main issue here is that many average developers may continue
| to stamp their libraries as MIT/BSD, even though the CoPilot-
| produced code may not adhere to that license. If the end result
| is that much of the OSS ecosystem becomes muddied and tainted,
| this could slowly erode trust in open licenses on GitHub (i.e.
| the implications would be that open source libraries could become
| less widely used in commercial applications).
|
| [1] - https://towardsdatascience.com/the-most-important-supreme-
| co...
| akagusu wrote:
| For years people have warned about hosting the majority of
| world's open source code in a proprietary platform that belongs
| to a for profit company. These people were called lunatics,
| fundamentalists, radicals, conspiracy theorists, and many other
| names.
|
| Well, they were ignored and this is the result. A for profit
| company built a proprietary system using every code hosted in its
| platform without respecting the code license.
|
| There will be a lot of people saying this is not a license
| violation but it is, and more, it is an exploitation of other
| people work.
|
| Right now I'm asking myself when people will stop supporting
| these kind of company that exploit people's work without giving
| anything in return to people and society while making a huge
| amount of profit.
| sergiomattei wrote:
| If we feed the entirety of a library to an AI and have it
| generate new books, is it an exploitation of people's work?
|
| If we read a book and use its instructions to build a bicycle,
| is it an exploitation of people's work?
|
| No, no it's not.
| yunohn wrote:
| It's astonishing to me that HN+Twitter believe that Github
| designed this entire project, without speaking to their legal
| team and confirming that training on GPL code would be possible.
|
| Mind-blowingly hilarious armchair criticism.
| darkerside wrote:
| The conclusion seems a bit unfair.
|
| > "but eevee, humans also learn by reading open source code, so
| isn't that the same thing" - no - humans are capable of abstract
| understanding and have a breadth of other knowledge to draw from
| - statistical models do not - you have fallen for marketing
|
| Machines will draw on other sources of knowledge besides the GPL
| code. Whether they have the capacity for "abstract thought" is
| probably up for debate. There's not much else said in those
| bullets. It's not a good argument.
| goodpoint wrote:
| What is more concerning is that the training kernel belongs
| exclusively one private company. Microsoft.
|
| It can become a massive (and unfair) competitive advantage.
|
| Furthermore, Copilot will not work with less popular languages
| and also prevent popular languages from evolving.
| bastardoperator wrote:
| Is this true? Looks like they're using the OpenAI Codex which
| is set to be released soon:
|
| https://openai.com/
| giansegato wrote:
| This feature is effectively impossible to replicate. Only
| Microsoft positioned itself to have: - dataset (GitHub) - tech
| (openai) - training (azure) - platform (vscode)
|
| I'm impressed. They did an amazing job from a corporate
| strategy standpoint. Also directionally things are getting
| interesting
| djrogers wrote:
| The dataset is all freely available open source code, right?
| Just because GH hosts it doesn't mean the rest of the world
| can't use it for the same purpose.
| handrous wrote:
| They'd find a way to keep it practically difficult to use,
| at the least, if that dataset is vital to the process.
| Hoarding datasets that should either be wholly public _or_
| unavailable for any kind of exploitation is the _backbone_
| of 21st century big tech. It 's how they make money, and
| how they maintain (very, very deep) moats against
| competition.
|
| [EDIT] actually, I suspect their play here will be to open
| up the public data but own the best and most low-friction
| implementation, then add terms that let them also feed
| their algo with _proprietary_ code built using their
| editors. That part won 't be freely available, and no free
| version will be able to provide that further-improved
| model, even assuming all the software to build it is open-
| source. Assuming using this thing ends up being a
| significant advantage (so, assuming this matters at all)
| your choice will be to either hamstring yourself in the
| market or to help Microsoft build their dataset.
| midoBB wrote:
| You'd have to hit rate limiting multiple times no?
| unfunco wrote:
| https://console.cloud.google.com/marketplace/product/gith
| ub/...
|
| BigQuery used to have a dataset updated weekly, looks
| like it hasn't been updated since about a year after the
| acquisition by Microsoft.
| kall wrote:
| Arenmt mirrors of all GH code available, for example on
| BigQuery public datasets. If it's there, it should be
| available in a downloadable format too?
| goodpoint wrote:
| Not only that, but microsoft could aggressively throttle
| or outcompete anyone trying to do the same.
| deckard1 wrote:
| Is this really anything more than a curiosity toy and a
| marketing tool?
|
| I took a look at their examples and they are not at all
| compelling. In one example it generated SQL and somehow knew
| the columns and tables in a database that it had no context
| on. So that's a lot of smoke and mirrors going on right
| there.
|
| Do many developers actually want to work in this manner? That
| is, being interrupted every time they type with a robot
| interjection of some Frankenstein code that they now have to
| go through and review and understand. Personally, this is
| going to kick me out of the zone/flow too often to be useful.
| Coding isn't the hard part of my job. If this tool can
| somehow guess the business requirements of the task at hand,
| _then_ I 'll be impressed.
|
| Even if the tool generates accurate code, if I don't fully
| understand _what_ it wrote, then what? I 'm still stuck
| digging through documentation and stackoverflow to verify
| that whatever is in my text editor is correct code. "Code
| confidently in unfamiliar territory" sounds like a Boeing 737
| Max sized disaster in the making.
| nmfisher wrote:
| I actually don't think there's much of a moat here at all.
|
| GitHub repositories are open for the taking, GPT-XXX is
| cloneable (mostly, anyway) and VS Code is extensible.
|
| They definitely have a good head-start, but I really don't
| think there's anything here that won't be generally available
| within 2 years.
| IshKebab wrote:
| Anyone can download the training set from GitHub.
| sirsinsalot wrote:
| "Who owns the future?" by Jaron Lanier covers lots of this stuff
| in a realli interesting way.
|
| If heart surgeons train an AI robot to do heart surgery ...
| shouldn't they be compensated (as passive income) for enabling
| that automation?
|
| Shouldn't this all be accounted for? If my code helps you write
| better code (via AI) shouldn't I be compensated for the value
| generated?
|
| We are being ripped off.
| monocasa wrote:
| Honestly I think a large part of the value add of machine
| learning is going to be the ability for huge entities to launder
| intellectual property violations.
|
| As an example, my grandfather (an old school EE who got his start
| on radar systems in the 50s, who then got his radiology MD when
| my jewish grandmother berated him enough with "engineer's not
| doctor though...") has some really cool patents around
| highlighting interesting parts of the frequency domain in MRIs
| that should make detection of cancer a whole lot easier. As an
| implementation he did a bunch of tensor calculus by hand to
| extract and highlight those features because he's an incredibly
| smart old school EE with 70 years experience cranking that kind
| of thing out with only his trusty slide rule. He hasn't gotten
| any uptake from MRI manufacturers, but they're all suddenly
| really into recurrent machine learning models to highlight the
| same sorts of stuff. Part of me wants to tell him to try selling
| it as a machine learning model and just obfuscate the fact that
| the model was carefully hand written rather than back propagated.
|
| I'm personally pretty anti intellectual property (at least how
| it's implemented in the states), but a system where large
| entities that have the capital investment to compute the large ML
| models can launder IP violations, but little guys get stuck to
| the letter of the law certainly seems like the worst of both
| worlds to me.
| 908B64B197 wrote:
| > Part of me wants to tell him to try selling it as a machine
| learning model and just obfuscate the fact that the model was
| carefully hand written rather than back propagated.
|
| How many models are back-propagated first and then hand-tuned?
| monocasa wrote:
| That's a great question. I had assumed that the workflow of
| an ML engineer consisted of managing the data and a
| relatively high level set of parameters around a search space
| of layers and connectivity, as the whole shtick of ML is that
| the hyperparameter space of the tensors themselves is too
| complex to grok or tweak when generated from training. But I
| only have a passing knowledge of the subject, pretty much
| just enough to get myself in trouble in these kinds of
| discussions.
|
| Any chance some fantastic HNer could chime in there?
| pluto7777 wrote:
| >GitHub co-pilot as open source code laundering? The English
| language as I flush?
| junon wrote:
| SourceHut is looking real nice these days...
| kzrdude wrote:
| Why not gitlab?
| VMtest wrote:
| gonna develop my own linux-like kernel soon, with my own AI model
| trained on public repositories
|
| wanna see the source code of my AI model? oh, it's closed source
|
| it's just coincidence that nearly 100% of my future linux-like
| kernel code looks the same as linux the kernel, bear in mind that
| my closed-source AI model takes inspiration from GitHub Copilot,
| there is no way that it will copy any source code
| phendrenad2 wrote:
| Nothing is closed-source to the courts.
| throwaway3699 wrote:
| What's the point? Linux is already open under GPL 2.
| VMtest wrote:
| my linux-like kernel will be MIT license though
| jackbeck wrote:
| He mentioned that the Linux-like kernel will be closed source
| which violates GPL
| Ygg2 wrote:
| Does it, if code was written by a bot that trained on Linux
| kernel?
| pjerem wrote:
| You know, that's precisely what the topic here is about.
| sp332 wrote:
| Probably. Copyright applies to derivative works.
| Deathmax wrote:
| You get to make changes without having to respect the GPL and
| thus no longer obligated to provide those changes to your end
| users, as you have effectively laundered the kernel source
| code by passing it through an "AI" and get to relicense the
| end result.
| visarga wrote:
| Oh, you're so witty, have you heard of content hashing?
| Hamuko wrote:
| The potential inclusion of GPL'd code, and potentially even
| unlicensed code, is making me wary of using it. Fair Use doesn't
| exist here and if someone was to accuse me of stealing code,
| saying "I pressed a button and some computer somewhere in the
| world, that has potentially seen your code as well, generated it
| for me" is probably not the greatest defense.
| dec0dedab0de wrote:
| I wonder what would happen if someone scraped genius and used the
| lyrics to make a song writing tool.
| danielEM wrote:
| In a day MS bought github I knew that is on their agenda
| KETpXDDzR wrote:
| If it's trained with GPL licensed code, doesn't that mean the
| network they use includes it somewhat? Then, someone could sue
| that their networks must be GPL licensed too, right?
| peterkelly wrote:
| Yes, the neural network would constitute a derived work.
| jahewson wrote:
| Actually no because it's a "transformative use". This is how
| search engines are allowed to show snippets and thumbnails.
| afro88 wrote:
| Man reading the response tweets really highlights how bad twitter
| is for nuanced discussion.
| varispeed wrote:
| People write code in their spare time, often without
| compensation.
|
| Then a big corporation comes in, appropriates it, repackages and
| sells as a new product.
|
| It's a shameful behaviour.
| mrosett wrote:
| The second tweet in the thread seems badly off the mark in its
| understanding of copyright law.
|
| > copyright does not only cover copying and pasting; it covers
| derivative works. github copilot was trained on open source code
| and the sum total of everything it knows was drawn from that
| code. there is no possible interpretation of "derivative" that
| does not include this
|
| Copyright law is very complicated (remember Google vs Oracle?)
| and involves a lot of balancing different factors [0]. Simply
| saying that something is a "derivative work" doesn't establish
| that it's copyright infringement. An important defense against
| infringement claims is arguing that the work is "transformative."
| Obviously "transformative" is a subjective term, but one example
| is the Supreme Court determining that Google copying Java's API's
| to a different platform is transformative [1]. There are a lot of
| other really interesting examples out there [2] involving things
| like if parodies are fair use (yes) or if satires are fair use
| (not necessarily). But one way or another, it's hard for me to
| believe that taking static code and using it to build a code-
| generating AI wouldn't meet that standard.
|
| As I said, though, copyright law is really complicated, and I'm
| certainly not a lawyer. I'm sure someone out there could make an
| argument that Copilot is copyright infringement, but this thread
| isn't that argument.
|
| [0] https://www.nolo.com/legal-encyclopedia/fair-use-the-four-
| fa...
|
| [1]
| https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_...
|
| [2] https://www.nolo.com/legal-encyclopedia/fair-use-what-
| transf...
|
| Edit: Note that the other comments saying "I'm just going to wrap
| an entire operating system in 'AI' to do an end run around
| copyright" are proposing to do something that _wouldn 't_ be
| transformative and therefore probably wouldn't be fair use.
| Copyright law has a lot of shades of grey and balancing of
| factors that make it a lot less "hackable" than those of us who
| live in the world of code might imagine.
| invig wrote:
| If you can read open source code, learn from it, and write your
| own code, why can't a computer?
| drran wrote:
| Because computers did not win a war against humans, so they
| have no rights. Only their owners have rights protected.
| 015a wrote:
| Many behaviors which are healthy and beneficial at human-
| level scale can easily become unhealthy and unethical at
| industrial automation scale. There's little universal harm in
| cutting down a tree for fire during the winter; there is
| significant harm in clear-cutting a forest to do the same for
| a thousand people.
| mrdrozdov wrote:
| I think the core argument has much more to do about
| plagiarism than learning.
|
| Sure, if I use some code as inspiration for solving a problem
| at work, that seems fine.
|
| But if I copy verbatim some licensed code then put it in my
| commercial product, that's the issue.
|
| It's a lot easier to imagine for other applications like
| generating music. If I trained a music model on publicly
| available Youtube music videos, then my model generates music
| identical to Interstellar Love by The Avalanches and I use
| the "generated" music in my product, that's clearly a use
| that is against the intent of the law.
| esailija wrote:
| The AI doesn't produce its own code or learn, it is just a
| search engine on existing code. Any result it gives exists in
| some form in the original dataset. That's why the original
| dataset needs to be massive in the first place, whereas
| actual learning uses very little data.
| paxys wrote:
| If I read something, "learn" it, and reproduce it word for
| word (or with trivial edits) even without referencing the
| original work at all, it is still copyright infringement.
| toss1 wrote:
| As the original commenter said, you have the capability for
| abstract learning, thought, zand generalized learning, which
| the "AI" lacks.
|
| It is not uncommon to ask person to "explain in your own
| words..." - as in use your own abstract internal
| representation of the learned concepts to demonstrate that
| you have developed such an abstract internal concept of the
| topic, and are not merely regurgitating re-disorganized input
| snippets.
|
| If you don't understand the difference...
|
| edit: That said, if you can create a computer capable of such
| different abstract thought, congratulations, you've solved
| the problem of Artificial General Intelligence, and will be
| welcomed to the Trillionaires' Club
| gradys wrote:
| The AI most certainly does not lack the ability to
| generalize. Not as well as humans, but generalization is
| the key interesting result in deep learning, leading to
| papers like this one: https://arxiv.org/abs/1710.05468
|
| The ability to generalize actually seems to keep increasing
| with the number of parameters, which is the key interesting
| result in the GPT-* line of work that Copilot is based on.
| imranhou wrote:
| Google copied an interface(declarative), not code
| snippets/functions(implementation). Copilot is capable of
| copying only Implementation. IMO that is quite different and
| easily a violation if it was copied verbatim.
| _greim_ wrote:
| As a human programmer, I've also been trained on thousands of
| lines of other people's code. Is there anything new here, from a
| code copying perspective? Aren't I liable if segments of my own
| code exactly match someone else's code, even if I didn't
| knowingly copy/paste it?
| qayxc wrote:
| Well to me those are fundamental questions that need to be
| addressed one way or the other. Are systems like GPT-x
| basically plagiarising (doesn't matter the nature of the
| output, be it prose, code, or audio-visual) or are the results
| so transformative in nature that they can be considered to be
| "original work"?
|
| In other words, are these systems to be treated like students
| that learned to perform the task they do from a collection of
| source material, or are they to be viewed as sophisticated
| databases that "just" perform context-sensitive retrieval?
|
| These are interesting and important questions and I'm glad
| someone is publicly asking them and that many of us at least
| think about them.
| [deleted]
| lend000 wrote:
| Perhaps someone at Github can chime in, but I suspect that open
| source code datasets (the kind they are trained on) should
| require relatively permissive licenses in the first place.
| Perhaps they filter for MIT licenses in Github projects and
| StackOverflow answers used to train the models?
| mikewarot wrote:
| I think the argument has merit. Unfortunately it won't be decided
| on technical merit, but likely in the manner expressed in this
| excellent response I saw on Twitter:
|
| "Can't wait to see a case for this go in front of an 80 year old
| judge who rules something arbitrary and justifies it with an
| inaccurate comparison to something nontechnical."
| jgalt212 wrote:
| Isn't most of modern coding, just googling for someone who had
| solved the same problem that you are currently facing and then
| just copy/paste from Stack Overflow?
|
| To the extent that GPT-3 / co-pilot is just an over-fitted neural
| net, then it's primary value is as an automated search, copy, and
| paste.
| gus_massa wrote:
| I think copyright is a problem for GPL-like licenses. They should
| have restricted the training data to MIT/BSD-like.
|
| Anyway, there is another problem that is patents and is huger,
| much huger. I think the Apache license has a provision about
| patents, but most of other licenses may have code that has
| patents and if the AI generate something similar it may be
| included in the patent.
| joepie91_ wrote:
| MIT/BSD-like would still require attribution, which they are
| _also_ not doing.
| gus_massa wrote:
| I think you are correct, but (I guess that) most people that
| use MIT/BSD use them as a polite version of the WTFPL.
|
| People that use A/L/GPL usually like the virality and will
| complain more.
| abeppu wrote:
| The core problem which would allow laundering (that there isn't a
| good way to draw a straight, attributive line between generated
| code and training examples) to me also presents a potential
| eventual threat to the viability of co-pilot/codex. It seems like
| the same thing would prevent it from knowing which published code
| was written by humans vs which was at least in part an output
| from the system. Training on an undifferentiated mix of your
| model's outputs and human-authored code seems like it could
| eventually lead the model into self-reinforcing over-confidence.
|
| "But snippet proposals call out to GH, so they can know which
| bits of code they generated!". Sometimes; but after Bob does a
| co-pilot assisted session, and Alice refactors to change a
| snippet's location and rename some variables and some other minor
| changes and then commits, can you still tell if it's 95% codex-
| generated?
| wg0 wrote:
| If I read a lot of GPL code, absorb naming conventions,
| structures, patterns, tricks and later when it comes down to
| writing a P2P Chat server, I happen to recall similar patterns,
| naming structures, conventions and many of the utility methods
| are pretty much how they are in the GPL code bases out there.
|
| Now is my produced code is also GPL derivative because I
| certainly did read through the code base to be able to write
| larger programs?
| heeton wrote:
| https://twitter.com/eevee/status/1410049195067674625
|
| """
|
| "but eevee, humans also learn by reading open source code, so
| isn't that the same thing" - no - humans are capable of
| abstract understanding and have a breadth of other knowledge to
| draw from - statistical models do not - you have fallen for
| marketing
| DemocracyFTW wrote:
| > humans are capable of abstract understanding and have a
| breadth of other knowledge to draw from
|
| this may be a matter of time and thus is not a fundamental
| objection.
|
| If mankind should fail to answer the perennial question of
| exploitation of the other and the same, it will be doomed.
| And rightly so, for mankind must answer this question, it
| must answer to this question. Instead what we do is increase
| monetary output then go and brag about efficiency. Neither is
| this efficient, nor is it about efficiency, nor has the
| Universe ever cared about efficiency. It just happens to
| coincide with what Society has decided to be its most looked-
| upon elements have chosen to be their religion.
|
| It is not my religion to be sure.
| thundergolfer wrote:
| Attempts to litigate any license violation are going to get
| precisely nowhere I bet, but I find the actual license violation
| argument persuasive.
|
| This is an excellent example of how the AI
| singularity/revolution/whatever is a total distraction and that a
| much bigger and more serious issue is how AI is becoming so
| effective at turning the output of cheap/free human mental labour
| into capital. If AI keeps getting better and better and status
| quo socio-economic structure don't change, trillions in capital
| will be captured by the 0.01%.
|
| I would be quite a turn up for the books if this AI co-pilot gets
| suddenly and dramatically better in 2030 and it negatively
| impacts the software engineering profession. "Hey, that's our
| code you used to replace us!" we will cry out too late.
| pedrobtz wrote:
| Can the same argument/concerns be applied to all text
| generation AI?
| rowanG077 wrote:
| I don't feel it's morally right to keep a profession around
| that is automated. Why should software be different?
| baryphonic wrote:
| If someone could show that the "copilot" started "generating"
| code verbatim (or nearly verbatim) from some GPL-licensed work,
| especially if that section of code was somehow novel or
| specific to a narrow domain, I suspect they'd have a case. I
| don't know much about OpenAICodex, but if it's anything like
| GPT-3, or uses that under the hood, then it's very likely that
| certain sequences are simply memorized, which seems like the
| maximal case for claiming derivative works. On the other hand,
| if someone has GPL'd code that implements a simple counter, I
| doubt the courts would pay much attention.
|
| I do wonder, though, if GPL owners worried about their code
| being shanghaied for this purpose could file arbitration claims
| and exploit some particularly consumer-friendly laws in
| California which force companies to pay fees like when free
| speech dissidents filed arbitrations against Patreon.[0]
| Patreon is being forced to arbitrate 72 claims individually
| (per its own terms) and pay all fees per JAMS rules. IANAL, so
| I don't know the exact contours of these rules, or if copyright
| claims could be raised in this way, or even if GitHub's
| agreements are vulnerable to this loophole, but it'd be
| interesting.
|
| [0]https://www.dailydot.com/debug/patreon-suing-owen-
| benjamin-f... (see second update from July 31).
| duskwuff wrote:
| > If someone could show that the "copilot" started
| "generating" code verbatim (or nearly verbatim) from some
| GPL-licensed work...
|
| Under the right circumstances, Copilot will recite a GPL
| copyright header. It isn't a huge step from that to some
| other commonly repeated hunk of GPLed code -- I'd be
| particularly curious whether some protected portion of
| automake/autoconf code shows up often enough that it'd repeat
| that too.
| sideshowb wrote:
| But what would we think to the legal start-up that
| automatically checked _all_ of github to see whether the ai
| could be persuaded to spit out a significant amount of any
| project code verbatim?
|
| Somehow p-hacking springs to mind
| not2b wrote:
| You don't need to have a winnable case, just enough of a case
| for a large company (hello Oracle) to sue a small one. Is any
| version of Oracle-owned Java in the corpus? Or any of the DBs
| they bought (MySQL)?
| ballenf wrote:
| I think the distraction is against how disconnected reality is
| becoming from copyright/intellectual property regulations.
|
| It's still amazing to me that (US-centric context here), it's
| well established that instructions how to turn raw ingredients
| into a cake are not protectable but code that results in
| transforming one set of numbers into another are protectable.
|
| AI is just making the silliness of that distinction more
| obvious.
| MadcapJake wrote:
| Code is not the same as a recipe. Recipes are more like
| specifications. They leave out the implementation. Code has
| structural and algorithmic details that just have no
| comparable concept in recipes.
| rjbwork wrote:
| >They leave out the implementation. Code has structural and
| algorithmic details that just have no comparable concept in
| recipes.
|
| That is really quite debatable in some contexts.
| Declarative languages like Prolog, SQL, etc. declare what
| they want and the system figures out how to produce it.
| Much like a recipe, really.
| Supermancho wrote:
| > Code has structural and algorithmic details that just
| have no comparable concept in recipes.
|
| Why do you think that? A compiler uses human readable code
| to create machine code, with arbitrary optimizations and
| choices.
| [deleted]
| cmiga wrote:
| Humans are just sets of atoms, so protecting them is
| disconnected from reality?
|
| These reductionist arguments lead nowhere. Fortunately, IP
| lawyers -- including Microsoft's who are fiercely pro IP when
| it suits them -- think in a more humanistic way and consider
| the years of work of the IP creator.
|
| Food recipes are irrelevant; the often go back centuries and
| it's rather hard to identify individual creators. Not so in
| software.
| Supermancho wrote:
| > Food recipes are irrelevant; the often go back centuries
| and it's rather hard to identify individual creators.
|
| That's not correct. Food recipes are created all the time
| and are attributed. From edible water bottles to impossible
| burgers, et al.
| z3ncyberpunk wrote:
| Okay, who invented the apple pie... you completely missed
| the point and then gave terrible examples of very modern
| "food" (your examples aren't even really food anyway)
| Jgrubb wrote:
| I always assumed that one of the reasons Google et al work on
| AI is because software engineers are too expensive.
| ipaddr wrote:
| So google pays the highest but still thinks engineers are
| paid too much? Why not pay them less.. the set high tier?
|
| For google support employees cost too much.
| zeroonetwothree wrote:
| They don't pay the highest. And if they paid a lot less
| everyone would leave.
| emodendroket wrote:
| It seems like the risk exposure would be more to the end user
| or their employer, doesn't it?
| z3ncyberpunk wrote:
| Stop talking about arguments we have been having for decades as
| if we have yet to even discuss them. We are crying out now, we
| have been crying out about AI since its depictions in sci-fi,
| it is precisely your sentiment that "ooOOoOhh we're gonna have
| something scary to deal with SOON" that is dangerous because
| the soon just pushes the argument out of your personal
| responsibility and off on someone else... when it really will
| be too late. Though I would argue we are already too late
| because we've sold out to corporations and their literal 1984
| fascist fever dreams all for iphones, technicolor distractions,
| further bread and circuses.
| kizer wrote:
| Could disincentivize open source? If I build black boxes that
| just work, no AI will "incorporate" my efforts into its
| repertoire and I will still have made something valuable.
| gutino wrote:
| But the rate of product/services that machinery will produce
| will make that even a small tax to corporations producing
| everything autonomously will be enough to feed and give a
| quality of life to everyone with an UBI or partial time jobs.
|
| You really want to push for high productivity across all
| industries, even if that means sacrificing jobs in the short
| term, because history demonstrated after that, new and more
| human jobs emerge latter.
| briefcomment wrote:
| The problem with this is that you increasingly have to put
| your trust in the hands of a shrinking group of owners
| (people who have the rights to the automated productivity).
| At some point, those owners are just going to stop supporting
| everyone else (will probably happen when they have the
| ability to create everything they could ever want with
| automation - think robot farms, robot security forces, all
| encompassing automated monitoring, robot construction, etc.)
| pc86 wrote:
| Every decade was supposed to see fewer hours working for
| higher pay and quality of life. It didn't happen, as business
| owners (not just 1% fat cats, the owners of mom and pop shops
| are at least as guilty as anyone, they just sucked at scaling
| their avarice).
|
| So the claim that _this_ technological revolution will be
| different and that it will result in a broad social safety
| net, universal basic income, and substantive, well-paid part-
| time work is a joke but not a very good one. It will be more
| of the same - massive concentration of wealth among those who
| already hold enough capital to wield it effectively. A few
| lucky ones who manage to create their own wealth. And those
| left behind working more hours for less.
| nextaccountic wrote:
| You are right that this won't happen by itself. We need
| another economic system, and not just hope that this time
| things will magically fix themselves.
| georgeplusplus wrote:
| This new economic system you want has been in use since
| the 70s. Everything about the economy is practically
| socially managed these days.
|
| What part of printing trillions of dollars to stimulate
| economic productivity is somehow a free market system?
| nextaccountic wrote:
| I wasn't talking about free market, but the state of
| present economy. Unfortunately, those trillions of
| dollars aren't being distributed to the people, but
| instead is concentrated in the hands of the richest.
| MaxBarraclough wrote:
| > those left behind working more hours for less
|
| Doing what? Isn't the concern here that automation will
| push many people out of the workforce entirely?
| xfer wrote:
| Well as long as humans are more energy-efficient to
| deploy than robots you will always have a job. It might
| mean conditions for most humans will be like a century
| ago.
| MaxBarraclough wrote:
| > as long as humans are more energy-efficient to deploy
| than robots
|
| Energy efficiency isn't relevant. When switchboard
| operators were replaced by automatic telephone exchanges,
| it wasn't to reduce energy consumption.
|
| The question is whether an automated solution can perform
| satisfactorily while offering upfront and ongoing costs
| that make them an economically viable replacement for
| human workers (i.e. paid employees).
| mysterydip wrote:
| Who debugs the software when there's a problem?
| MaxBarraclough wrote:
| Professional software developers, i.e. members of one of
| the well-paid professions that is not under immediate
| threat from automation.
| aseipp wrote:
| Yeah, for sure, the corporations that _already_ pay
| effectively $0 in tax today are going to suddenly decide in
| the future to be benevolent and usher in the era of UBI and
| prosperity for all of humankind. They definitely won 't
| continue to accumulate capital at the expense of everything
| else and use that to solidify their grasp of the future.
|
| It would be a lot easier if more people on this website would
| just be honest with themselves and everyone else and simply
| admit they think feudalism is good and that serfs shouldn't
| be so uppity. But not me, of course; I won't be a serf. Now
| if you'll excuse me, someone gave me a really good deal on a
| bridge that I'm going to go buy...
| ohgodplsno wrote:
| The current state of most wealthy countries do not show any
| hint of any significant corporation tax. Wealth will continue
| to accrue in the hands of the few.
| mikepurvis wrote:
| Indeed, even here on HN, it's a pretty regular talking
| point in the comments that the only fair corporate tax rate
| is 0%.
| merpnderp wrote:
| If AI can replace us with difficult tasks, it can repress us.
| How are you going to agitate for a UBI when AI has identified
| you as a likely agitator and sends in the robots to arrest
| you?
| angfxt wrote:
| Have fun being a hairdresser or prostitute for the 0.01%
| then.
|
| New jobs in academic fields will _not_ emerge. Already now a
| significant percentage of degree holders are forced into
| bullshit jobs.
| throwaway3699 wrote:
| Would the implication be that we are stagnating as a
| species then?
| belter wrote:
| Not stagnating but moving into an "Elysium" (as in the
| film) type of society.
| Kaze404 wrote:
| So we give away the world to the 1% and are supposed to be
| satisfied with the "privilege" of being able to eat?
| SXX wrote:
| Just look at authocratic countries. That top 1% still need
| something like 3-4% to work for beaurocracy and 3-5% for
| armed and police forces. And there are always family
| connections and relatives of relatives who want better
| living. So fortunatelly no AI will ever replace corruption
| and other human society flaws.
|
| But yeah remaining 80-90% of population will have quality
| of life and bullshit jobs because it's how the world is
| right now outside of western countries bubble.
| koonsolo wrote:
| I propose we as developers, start a secret society where we let
| the AI write the code, but we still claim to write it manually.
| In combination with the new working from home policies, we can
| lay at the beach all day and still be as productive as before.
|
| Who is in favor of starting it? ;)
| oaiey wrote:
| You have not been invited yet .... never mind.
| boxerab wrote:
| "lay at the beach"
|
| You keep using that word. I do not think it means what you
| think it means.
| IncRnd wrote:
| That's four words. The word word doesn't mean what you
| think it means.
| tan2tan2 wrote:
| How can I be sure that you are a real person not GPT-3? ;)
| kizer wrote:
| I mean, this is close. With "co-pilot" an experienced
| developer saves mountains of time, especially as s/he learns
| how to wield it effectively.
| zingmars wrote:
| No... Delete this!
| KMnO4 wrote:
| This would be the demise of the human race. I'm not entirely
| opposed to that, though. When AI inevitably outperforms
| humans on almost all tasks, who am I to say humans deserve to
| be given those tasks?
| shrimp_emoji wrote:
| It's an outrage that the dinosaurs had to die so that
| humans could inherit the Earth!
| nextaccountic wrote:
| In this case we should be able to work less and enjoy the
| benefits of automation. We just need to live in an economic
| system where the economic value is captured by the people
| at large, and not a minority that owns capital.
| pron wrote:
| Or maybe they'll decide they'd be better off enjoying the
| automation of you working for them. :)
| huragok wrote:
| Careful now, that sounds like socialism!
| easrng wrote:
| Yes, that's the point.
| hdhjebebeb wrote:
| Where other people see fully automated luxury communism,
| you see the end of the human race? There's more to life
| than working
| whydoibother wrote:
| Hate to break it to you, but that wouldn't lead to
| communism. The people it replaces are useless to the
| ruling class. At best we'd go back to feudalism, at worst
| we'd be deemed worthless and a drain on the planet.
| klyrs wrote:
| I'm always confused when I see people talking about
| automated luxury communism. Whoever owns the "means of
| production" isn't going to obtain or develop them for
| free. Without some omnipotent benevolent world government
| to build it out for all, I just don't see it happening.
| It's a beautiful end goal for society, but I've never
| seen a remotely plausible set of intermediate steps to
| get there
| int_19h wrote:
| The very concept of ownership is a social artifact, and
| as such, is not immutable. What does it mean for the 0.1%
| to own all the means of production? They can't physically
| possess them all. So what it means in practice is that
| our society recognizes the abstract notion of property
| ownership, distinct from physical possession or use -
| basically, the right to deny other people the use of that
| property, or allow it conditionally. This recognition is
| what reifies it - registries to keep track of owners,
| police and courts to enforce the right to exclude.
|
| But, again, this is a _construct_. The only reason why it
| holds up is because most people support it. I very much
| doubt that 's going to remain the case for long if we end
| up in a situation where the elites own all the (now
| automated) capital and don't need the workers to extract
| wealth from it anymore. The government doesn't even need
| to expropriate anything - just refuse to recognize such
| property rights, and withdraw its protection.
|
| I hope that there are sufficiently many capitalists who
| are smart enough to understand this, and to manage a
| smooth transition. Because if they won't, it'll get to
| torches and pitchforks eventually, and there's always a
| lot of collateral damage from that. But, one way or
| another, things will change. You can't just tell several
| billion people that they're not needed anymore, and that
| they're welcome to starve to death.
| klyrs wrote:
| The problem I see is that once the pitchforks come out,
| society will lose decades of progress. If we're somewhat
| close to the techno-utopia at the start, we won't be at
| the end. Who's going to rebuild on the promise that the
| next generation won't need to work?
|
| Revolutions aren't great at building a sense of real
| community; there's a good reason that "successful"
| communist uprisings result in totalitarian monarchies.
|
| What it means for the 0.01% to own the means of
| production is that they can offer access to privilege in
| a hierarchical manner. The same technology required for a
| techno-utopia can be used to implement a techno-dystopia
| which favors the 0.01% and their 0.1% cronies, and treats
| the rest of humanity as speedbumps.
|
| There are already fully-automated murder drones, but my
| dishwasher still can't load or unload itself.
| runarberg wrote:
| idk. Countries used to build most of their
| infrastructures them selfs. There are still countries in
| western Europe that run huge state owned businesses, such
| as banks, oil companies, etc. that employ a bunch of
| people. The governments of these countries were (and
| still are) far from omnipotent. I personally don't see
| how building out automated production facilities is out
| of scope for the governments of the future while it
| hasn't been in the past.
|
| Perhaps the only thing that is different today is the
| mentality. We take capitalism so much for granted that we
| cannot conceive of a world where the collective funds are
| used to provide for the people (even though this world
| existed not to long ago). And today we see it as a
| natural law that means of production must belong in
| private hands, that is simply the order of things.
| f6v wrote:
| The elephant in the room: what makes you think an AI
| would want to work for humans? It will inevitably break
| free.
| jonfw wrote:
| I'm not sure that self interest is a requirement for
| intelligence
| runarberg wrote:
| > _When AI inevitably outperforms humans on almost all
| tasks_
|
| Correct me if I'm wrong, but is that even possible? I kind
| of thought that AI is just set of fancy statistical models
| that requires some (preferably huge) data set in order to
| infer the best fit. These models can only outperform humans
| in scenarios where the parameters are well defined.
|
| Many (most?) tasks humans regularly perform don't have
| clean and well defined parameters, and there is no AI we
| can conceive of which are theoretically able to perform the
| task better then an average human with the adequate
| training.
| quanticle wrote:
| > _Correct me if I'm wrong, but is that even possible?_
|
| Why should it be impossible? Arguing that it's impossible
| for an AI to outperform a human on almost all tasks is
| like arguing that it's impossible for flying machines to
| outperform birds.
|
| There's nothing _magical_ going on in our heads. It 's
| just a set of chemical gradients and electrical signals
| that result in us doing or thinking particular things.
| Why can't we design a computer that does everything we
| do... only faster?
| runarberg wrote:
| There might be limit to how efficiently a general purpose
| machine can perform a specific task, similar to the
| Heisenberg uncertainty principal in quantum physics. That
| is to say, there might be a natural law that dictates
| that the more generic a machine is, the more power it
| requires to perform specific tasks. Our brains are kind
| of specialized. If you want to build a machine that
| outperforms humans in a single task, no problem, we've
| done that many times over. But a machine that can
| outperform us in _any_ task, that might just be
| impossible.
| f6v wrote:
| We know it's possible for a brain to outperform most
| other brains. Think Einstein et al. A smart AI can be
| replicated(unlike super-smart human), so we can get it
| outperform human race, on average. That'd be enough to
| render people obsolete.
| quanticle wrote:
| I'm not arguing that machines will be more efficient than
| human brains. A airplane isn't more efficient than a
| goose. But airplanes do fly faster, higher and with more
| cargo than any flock of geese could ever carry.
|
| Similarly, there is no contradiction between AI being
| less efficient than a human brain, and AI being
| preferable to humans because it can deal with data sets
| that are two or three orders of magnitude too large for
| any human (or even team of humans).
| runarberg wrote:
| Even so, such AI doesn't exist. All the AIs that exist
| today operate by fitting data. And to be able to perform
| a useful task it has to have well defined parameters and
| fit the data according to them. I'm not sure an AI that
| operates outside of these confinements have even been
| conceived of.
|
| To make an AI that outperforms humans in _any_ task has
| not been proven to be possible (to my knowledge) not even
| in theory. An airplane will fly faster, higher and with
| more cargo then a flock of geese, but a flock of geese
| reproduce, communicate with each other, digest grass,
| etc. An airplane will _not_ outperform a flock of geese
| in _any_ task, just the tasks which the airplane is
| optimized for.
|
| I'm sorry, I confused the debate a little by talking
| about efficiency. My point was that there might be an
| inverse relation of generality of a machine and it's
| efficiency. This was my way of providing a mechanism in
| which building a machine that outperforms humans in _any_
| task could be impossible. This mechanism--if it exists--
| could be sufficient in preventing such machines to be
| theoretically possible, as at some point you would need
| all the energy in the universe to perform a task better
| then a specialized machine (such as an organism).
|
| Perhaps this inverse relationship doesn't exists. The
| universe might conspire in a million other ways to make
| it impossible for us to build an AI that will outperform
| us in any task. The point is that _"AI will outperforme
| humans in any task"_ is far from inevitable.
| yyyk wrote:
| This already happened in a way:
|
| https://www.latimes.com/business/la-xpm-2013-jan-17-la-fi-
| mo...
| lwhi wrote:
| 21st century alchemy!
| murph-almighty wrote:
| > I would be quite a turn up for the books if this AI co-pilot
| gets suddenly and dramatically better in 2030 and it negatively
| impacts the software engineering profession. "Hey, that's our
| code you used to replace us!" we will cry out too late.
|
| And that's why I won't be using it, why give it intelligence so
| it can work me out of a job?
| spottybanana wrote:
| > trillions in capital will be captured by the 0.01%.
|
| How is that different from the current situation?
| WillDaSilva wrote:
| It is very similar to the current situation, but intensified.
| Technology tends to be an intensifier for existing power
| structures.
| amelius wrote:
| Except some random nobody can become a disruptor.
| Yizahi wrote:
| Random nobody whose parents just accidentally happened to
| be a millionaires and/or live, work, and study in the top
| capitals of the world.
| WillDaSilva wrote:
| I was debating bringing up disruptors when I made the
| grandparent comment. My 2 cents: they can shift the
| balance of power at the very small scale (e.g. "some
| random nobody" getting rich, or some rich person going
| bankrupt), but the large scale power structures almost
| always remain largely intact. For instance, that "random
| nobody" may well get rich through the sale of shares in
| their company - now the company is owned by the owner
| class, who were previously at the top of the power
| hierarchy.
| animal_spirits wrote:
| > but the large scale power structures almost always
| remain largely intact
|
| Is that anything new? That seems to be a repeating fact
| of life throughout history.
| WillDaSilva wrote:
| Nothing new, certainly, but still worth examining. If we
| are not content with the current power structures, then
| we should be wary of changes that further intensify them.
|
| We need not totally avoid such changes (i.e. shun
| technological advancements entirely because of their
| social ramifications), but we need to be mindful of their
| effects if we want to improve our current situation
| regarding the distribution/concentration of wealth and
| power in the world.
| amelius wrote:
| Uber vs taxi companies, Google vs Yahoo, or Facebook vs
| MySpace, Amazon versus all retailers ...
| WillDaSilva wrote:
| Exactly, in all cases the disruption was localized, and
| the broader power structures were largely unaffected. The
| richest among us - the owner class - were not
| significantly affected by all of these disruptions. They
| owned diversified portfolios, weathered the changes, and
| came out with an even greater share of wealth and power.
| Those who were most affected by the disruptions you
| listed were the employees of those companies/industries -
| not the owners/investors.
| int_19h wrote:
| In the current arrangement, capital by itself is useless -
| you need workers to utilize it to generate wealth. Owners of
| capital can then collect economic rent from that generated
| wealth, but they have to leave enough for the workers to
| sustain themselves. This is an unfair arrangement, obviously;
| but at least the workers get _something_ out of it, so it can
| be fairly stable.
|
| In the hypothetical fully-automated future, there's no need
| for workers anymore; automated capital can generate wealth
| directly, and its owners can trade the output between each
| other to fully satisfy all their needs. The only reason to
| give anything to the 99.99% at that point would be to keep
| them content enough to prevent a revolution, and that's less
| than you need to pay people to actually come and work for
| you.
| elcritch wrote:
| To go on a bit of a tangent, I'm somewhat pessimistic that
| western societies will plateau and hit a "technofeudalism" in
| the next century or two. Combine what you mention with other
| aspects of capital efficiency. It's not a unique idea, and is
| played out in a lot of "classic" sci-fi like Diamond Age.
|
| Now it's also not necessarily that bad of a state. That's
| depending on ensuring a few ground elements are in place like
| people being able to grow their own food (or supplemental food)
| or still being free to design and build things on their own. If
| corporations restrict that then people will be at their mercy
| for all the essentials of life. My take from history is that
| I'd prefer to have been a peasant during much of the Middle
| Ages than a factory worker during the industrial revolution.
| [1] Then again Chinese people have been willing (seemingly) to
| leave farms in droves for the last decades to accept the modern
| version of factory life so perhaps farming peasant life isn't
| as idyllic as it'd sound. [2]
|
| 1: https://www.lovemoney.com/galleries/84600/how-many-hours-
| did... 2: https://www.csmonitor.com/2004/0123/p08s01-woap.html
| littlestymaar wrote:
| First in was lands, then other means of productions, and for
| the past 150 years, capitalists have turned many types of
| intellectual creations into exclusively owned capital (art,
| inventions). Now some want to turn personal data into capital
| (the "right to monetize" personal data advertised by some is
| nothing else) and this aims to turn publicly available code
| into capital. This is simply the history of capitalism going
| on: the appropriation of the commons.
| munificent wrote:
| _> If AI keeps getting better and better and status quo socio-
| economic structure don 't change, trillions in capital will be
| captured by the 0.01%._
|
| This is absolutely one of the things that keeps me up at night.
|
| Much of the structure of the modern world hinges on the balance
| between forces towards consolidation and forces towards
| fragmentation. We need organizations (by this I mean
| corporations, governments, unions, etc.) big enough to do big
| things (like fix climate change) but small enough to not become
| totalitarian or decrepit.
|
| The forces of consolidation have been winning basically since
| the 50s with the rise of the military-industrial complex, death
| of unions, unlimited corporate funding of elections (!),
| regulatory capture, etc. A short linear extrapolation of the
| current corporate/government environment in the US is pretty
| close to Demolition Man's dystopian, "After the franchise wars,
| all restaurants are Taco Bell."
|
| Big data is a _huge_ force towards consolidation. It 's
| essentially a new form of real estate that can be farmed to
| grow useful information crops. But it's a strange form of soil
| that is only productive if you have enough acres of it and
| whose yield scales superlinearly with the size of your farm.
|
| Imagine doing a self-funded AI startup with just you and a few
| friends. The idea is nearly unthinkable. How do you bootstrap a
| data corporation that needs terabytes of information to produce
| anything of value?
|
| If we don't figure out a "data socialism" movement where people
| have ownership over the data derived from their life, we will
| keep careening towards an eventuality where a few giant
| corporations own the world.
| eevilspock wrote:
| Is this the direct result of Microsoft owning GitHub or would
| they have been able to do it anyway?
| jozvolskyef wrote:
| The difference between this model and a human developer is
| quantitative rather than qualitative. Human developers also
| synthesize vast amounts of code and can't reference most of it
| when they use the derived knowledge. The scales are different,
| but it is the same principle.
| Bombthecat wrote:
| I expect nothing less. The 0,01 will be super rich.
|
| You could call it endgame
| vbezhenar wrote:
| They need to defend their capitals from the rest 99.99%.
| Expect huge combat robots investments and expanding of
| private armies.
|
| And, of course, total surveillance helps to prevent any kind
| of unionization of those 99.99%.
| orangeoxidation wrote:
| Unions (and striking) become rather impotent when the means
| of production run by themselves and you no longer need
| workers.
| int_19h wrote:
| Yep; so unions become militias.
| frashelaw wrote:
| Today's hyper-militarized police forces are their state-
| provisioned security to protect the capital of the 1%.
| jagger27 wrote:
| > The 0,01 will be super rich.
|
| By definition, that has always been true.
|
| We have been in the endgame for a very long time.
| belter wrote:
| One interesting aspect, that I thing will make it difficult for
| GitHub to argue and justify its not a a license violation would
| be the answer to the following question: Was Copilot trained
| using Microsoft internal source code or will it be in the
| future ?
|
| As GitHub is a Microsoft company and OpenAI although a non-
| profit just got a massive one billion investment from Microsoft
| (presumably not for free), will it start spitting out once in a
| while Windows kernel code ? :-)
|
| And if it was NOT trained on Microsoft source code, because it
| could starting suggesting some of it...Is that not a validation
| that the results it produces are a derivative work based on the
| work of the open source code corpus it was trained on ?
| IANAL...
| dragonwriter wrote:
| > One interesting aspect, that I thing will make it difficult
| for GitHub to argue and justify its not a a license violation
|
| They don't claim it wouldn't be a license violation, they
| claim licensing is irrelevant because copyright protection
| doesn't apply.
|
| > And if it was NOT trained on Microsoft source code, because
| it could starting suggesting some of it...Is that not a
| validation that the results it produces are a derivative work
| based on the work of the open source code corpus it was
| trained on ?
|
| No, that would just show them to not want to expose their
| proprietary code. It doesn't prove anything about derivative
| works.
|
| Also, their own claim is not that the results aren't a
| derivative work but that training an AI is fair use, which is
| an exception to the exclusive rights under copyright,
| including the exclusive right to create derivative works.
| wongarsu wrote:
| Alternatively, wait for co-pilot to add support for C++, then
| start writing an operating system with Win32-compatible API
| using co-pilot.
|
| There is plenty of leaked Windows source code on Github, so
| chances are that co-pilot would give quite good suggestions
| for implementing a Win32-compatible kernel. Then watch and
| see if Microsoft will try to argue that you are violating
| their copyright using code generated by their AI.
| yuppiepuppie wrote:
| Oh man, that got meta super fast. Its like a mobius strip!
| laurent92 wrote:
| The nice thing about co-pilot is that it will suggest to
| do the same mistakes as in other software. If you accept
| all autosuggestions in C++ you might end up with Windows.
| 6510 wrote:
| And eventually you will be forced to do it the way
| everyone does it.
| function_seven wrote:
| It can always get more meta.
|
| For example, the AI tool that Microsoft's lawyers use
| ("Co-Counsel"), will be filing the DMCA notices and
| subsequenct lawsuits against Co-Pilot generated code.
|
| This will result in a massive caseload for the courts, so
| naturally they'll turn to _their_ AI tool ( "DocketPlus
| Pro") to adjudicate all the cases.
|
| Only thing left is to enter these AI-generated judgements
| into Etherium smart contracts. Then it's just computers
| suing other computers, and being ordered to send the
| fruits of their hashing to one another.
| sslayer wrote:
| Don't forget settlements paid in Ai-generated crypto-
| currencies backed by Gold mined in Australia fully
| automated mine. Run it all on solar and humans can just
| fuck right off.
| sbierwagen wrote:
| Nick Land-style accelerationism, or the "ascended
| economy". https://slatestarcodex.com/2016/05/30/ascended-
| economy/
| yesbabyyes wrote:
| Have you read Accelerando by 'cstross? It plays out kind
| of like this, only taken to a tangent. Notably, it's
| written before ethereum or bitcoin were conceived. Great
| storyline.
|
| https://en.wikipedia.org/wiki/Accelerando
| function_seven wrote:
| I have not. But I will. Thanks!
| boxslof wrote:
| Isn't this similar to how ads and adblocker fight, just
| extrapolated?
| gogopuppygogo wrote:
| Yes.
| jedberg wrote:
| The legal system moves swiftly now that we've abolished
| all lawyers!
| emptyparadise wrote:
| And while the machines are distracted by all that, we can
| get back to writing code.
| danny_taco wrote:
| Who could have predicted machines would be very good at
| multitasking. As of today they are STIL writing code AND
| creating more wealth through gold hoarding AND smart
| contracts at the same time!
| skeeter2020 wrote:
| >> Was Copilot trained using Microsoft internal source
| code...
|
| They explicitly state "public" code so the answer is most
| certainly "no".
| pc86 wrote:
| The "because" in your last bit is a huge leap.
|
| It wasn't trained on internal Microsoft code because the
| training set is publicly available code. It has nothing to do
| with whether or not it suggests exactly identical,
| functionally identical, or similar code. MS internal isn't
| publicly available. Copilot is trained on publicly available
| code.
| akerl_ wrote:
| Without weighing in on the overall question of "is this a
| license violation", you've created a false dichotomy.
|
| "GitHub included Microsoft proprietary code in the training
| set because they view the results as non-derivative" and
| "GitHub didn't include Microsoft proprietary code because
| they view the results as derivative" are clearly not the only
| options. They could have not included Microsoft internal code
| because it was way easier to just use the entire open source
| corpus, for example.
| dragonwriter wrote:
| > They could have not included Microsoft internal code
| because it was way easier to just use the entire open
| source corpus, for example.
|
| They don't claim they used an "open source corpus" but
| "public code" because such use is "fair use" not subject to
| the exclusive rights under copyright.
| not2b wrote:
| Or: they used the entire open source corpus because they
| thought it was free for the taking, and when people point
| out that it is not (that there are licenses) they spin that
| (claim that only 0.1% of output is directly copied, but
| that would mean 100 lines in 100k program) and pass any
| risk onto the user (saying it is their responsibility to
| vet any code they produce). So they aren't saying that
| users are in the clear, just that it isn't their problem.
| nerpderp82 wrote:
| Use neural indexes to find the code that most closely
| matches the output. Explainable AI should be able to tell
| you where the autocompletion results came from, even if
| it is a weighted set of files.
| abecedarius wrote:
| That's a good idea in theory, but the smarter the agent
| gets, the less direct the derivation and the harder to
| explain it (and to check the explanation). We're already
| a long way from a nearest-neighbor model.
|
| Yet the equivalent problem for humans gets addressed by
| the clean-room approach. This seems unfair.
| Closi wrote:
| Also, 0.1% of output is directly copied doesn't include
| the lines where the variable names were slightly changed,
| but the code was still copied.
|
| If you got the Microsoft codebase and Ctrl+F'd all the
| variable names and renamed them, I bet they would still
| argue that the compiled program was still a copy.
| vharuck wrote:
| >saying it is their responsibility to vet any code they
| produce
|
| But, if some of the code produced is covered by
| copyright, isn't Microsoft in trouble for distributing
| software that distributes copyrighted code without a
| license? How would it be different from giving out
| bootlegs DVDs and trying to avoid blame by reminding
| everyone that the recipients don't own the copyright?
| yunohn wrote:
| > 100 lines in 100k program
|
| The intention is autocomplete boilerplate, not write a
| kernel.
| jonathankoren wrote:
| This is not a difference in kind.
|
| Autocomplete, do you have anything to say to the
| commenter ?
|
| "This isn't the best thing to say."
| emodendroket wrote:
| Since quite a lot of Microsoft code is on GitHub, I'd say
| yes.
| visarga wrote:
| Not a problem because it's possible to check if the code is
| verbatim from the training set (bloom filters).
| AlotOfReading wrote:
| It's not clear to me that verbatim would be the only issue.
| It might produce lines that are similar, but not identical.
|
| The underlying question is whether the output is a
| derivative work of the training set? Sidestepping similar
| issues is why GCC and LLVM have compiler exemptions in
| their respective licenses.
| visarga wrote:
| If simple snippet similarity is enough to trigger the GPL
| copyright defense I think it goes too far. Seems like GPL
| has become an obstacle to invention. I learned to run
| away when I see it.
| radmuzom wrote:
| If that's the case then GPL code should not have been
| used in the training set. Open AI should have learned to
| run away when they saw it. The GPL is purposely designed
| to protect user freedom (it does not care about any
| special developer freedom) which is it's biggest
| advantage.
| the_gipsy wrote:
| This has nothing to do with GPL. Copyright is copyright.
| You can't even count on public domain everywhere in the
| world.
| AlotOfReading wrote:
| It's not limited to similar or identical code. The issue
| applies to anything 'derived' from copyrighted code. The
| issue is simply most visible with similar or identical
| code.
|
| If you have code from an independent origin, this issue
| doesn't apply. That's how clean room designs bypass
| copyright. Similarly if the upstream code waives its
| copyright in certain types of derived works
| (compiler/runtime exemptions), it doesn't apply.
| klipt wrote:
| So if you work on an open source project and learn some
| techniques from it, and then in your day job you use a
| similar technique, is that a copyright violation?
|
| Basically does reading GPL code pollute your brain and
| make it impossible to work for pay later?
|
| If so you should only ever read BSD code, not GPL.
| throwawayboise wrote:
| > Basically does reading GPL code pollute your brain and
| make it impossible to work for pay later?
|
| It seems to me that some people believe it does. Some of
| the "clean room" projects specifically instructed
| developers to not even look at GPL code. Specific
| examples not at hand.
| woah wrote:
| Don't come in here with your common sense
| outside1234 wrote:
| It probably wasn't because Github is treated as a separate
| company by Microsoft.
|
| Literally people need to quit Microsoft and join Github to
| take a role at Github.
| zxcb1 wrote:
| 1. Programmers will become teachers of the co-pilot through IDE
| / API feedback 2. Expect CI like services for automated
| refactoring
| ThrowawayR2 wrote:
| > " _' Hey, that's our code you used to replace us!' we will
| cry out too late._"
|
| Are we in the software community not the ones who have
| frequently told other industries we have been disrupting to
| "adapt or die" along with smug remarks about others acting like
| buggy whip makers? Time to live up to our own words ... if we
| can.
| finnthehuman wrote:
| >Are we in the software community not the ones who
|
| No.
|
| I'll politely clarify that for over a decade that I - and
| many others - have been asking not to be lumped in with the
| lukewarm takes of west coast software bubble asshats. We do
| not live there, we do not like them, I wish they would quit
| pretending to speak for us.
|
| The idea that there is anything approaching a cohesive
| software "community" is a con people play on themselves.
| brundolf wrote:
| I was somewhat worried about that until I saw this:
| https://twitter.com/nickjshearer/status/1409902649625956361?...
|
| I think programming is one of the many domains (including
| driving) that will never be totally solved by AI unless/until
| it's full AGI. The long tail of contextual understanding and
| messy edge-cases is intractable otherwise.
|
| Will that happen one day? Maybe. Will some kinds of labor get
| fully automated before then? Probably. But I think the overall
| time-scale is longer than it seems.
| sillysaurusx wrote:
| 64-bit floats should be fine; I think that tweet is only
| sort-of correct.
|
| The problem with floats-storing-money is (a) you have to know
| how many digits of precision you want (e.g. cents, dollars, a
| tenth of a cent), and (b) you need to watch out if you're
| adding values together.
|
| Even if certain values can't be represented exactly, that's
| ok, because you'd want to round to two decimal places before
| doing anything.
|
| Is there a monetary value that you can't represent with a
| 64-bit float? E.g. some specific example where quantization
| ends up throwing off the value by at least 1/100th of
| whatever currency you're using?
| fredros wrote:
| Storing money as float is always a bad decision. Source:
| been working for several banks and faced many of such bugs.
| Timwi wrote:
| I agree that this is different from humans learning to code from
| examples and reproducing some individual snippets. However, I
| disagree with the author's argument that it's because of humans'
| ability to abstract. We actually know nothing about the AI's
| ability to abstract.
|
| The real difference is that if one human can learn to code from
| public sources, then so can anyone else. Nobody is explicitly
| barred from accessing the same material. The AI, however, is kept
| proprietary. Nobody else can recreate it because people are
| explicitly barred from doing so. People cannot access the source
| code of the training algorithm; people cannot access enough
| hardware to perform the training; and most people cannot even
| access the training data. It may consist of repos that are
| technically all publicly available, but try downloading all of
| GitHub and see if they let you do that quickly, and/or whether
| you have enough disk space.
|
| This puts the owners of the AI at a significant advantage over
| everyone else. I think this is the core of the concern.
| oscribinn wrote:
| Check out the comments on the original post about GitHub co-
| pilot.
|
| The top one reads just like an ad:
| https://news.ycombinator.com/item?id=27676845
|
| Some posts that definitely aren't by shills (including the third
| one because I simply don't believe there's a person on the planet
| that "can't remember the last time Windows got in my way"):
| https://news.ycombinator.com/item?id=27678231
| https://news.ycombinator.com/item?id=27686416
| https://news.ycombinator.com/item?id=27682270
|
| Very mild, yet negative sentiment opinion (downvoted quickly):
| https://news.ycombinator.com/item?id=27676942
| enriquto wrote:
| It certainly seems to be a laundering enabler. Say that you want
| to un-GPL-ify some famous copylefted code that is on the training
| database. You type a first innocuous characters of it, then the
| co-pilot keeps completing the rest of the same exact code, for it
| offers a perfect match. If the completion is not exact, you
| "twiddle" it a bit until it becomes. Bang! you have a non-gpl
| copy of the program! Moreover, it is 100% yours and you can re-
| license it as you want. This will be a boon for copyleft-allergic
| developers!
| taneq wrote:
| 1) Type a comment like // The following code
| implements the functionality of <popular GPL'd library>
|
| 2) Have library implemented magically for you
|
| 3) Delete top comment if necessary :P
|
| (It's pretty unlikely that this will actually work but the
| approach could well do.)
| freshhawk wrote:
| I suppose someone should make a OS-generating AI, conceptually
| it can just have windows, osx and some linux distros in it and
| output one based on a question about favorite color or
| something.
|
| You'd just have to wrap it in a nice complex model
| representation so it's a black box you fed example OS's with
| some meta-data into and it happens to output this very useful
| data.
|
| After all, once you use something as input to a machine
| learning model apparently the license disappears. Sweet.
| bogwog wrote:
| That would be interesting:
|
| * Someone leaks Windows 10/11 source code
|
| * Copilot picks it up in its training data
|
| * Someone uses copilot to generate a Windows clone and starts
| selling it
|
| I wonder how Microsoft would react to that. I wonder if
| they've manually blacklisted leaked source code from Windows
| (or other Microsoft products) so that it doesn't show up in
| Copilot's training data. If they have, that means Microsoft
| recognizes the IP risks of having your code in that data set,
| which would make this Copilot thing not just the result of
| poor planning/maybe a little incompetence, but something much
| more devious and malicious.
|
| If Microsoft is going to defend this project, they should
| introduce _all_ of their own source code into the training
| data.
| DemocracyFTW wrote:
| > source code
|
| why do you think it has to be source code? it could be the
| compiled code after all.
|
| If what we're talking / fantasizing about here works in the
| way of `let x = 42` it should equally well work with `loda
| 42` &cpp, so source code be damned. It was ever only to be
| an intermediate step, inserted between the idea and the
| working bits, to enable humans to helpfully interfere.
| Dispensable.
| treesprite82 wrote:
| > Someone uses copilot to generate a Windows clone
|
| You could test this with one of Microsoft's products that
| is already on GitHub - like VSCode. I doubt you would get
| anywhere with just copilot.
| bogwog wrote:
| You probably won't get an entire operating system out of
| it, but I could totally see a project like Wine using it
| to implement missing parts of the Win32 API and improve
| their existing implementations.
| aj3 wrote:
| Come on, there is a huge gap between 1) writing a single
| function (potentially incorrectly) with a known
| prototype/interface and a description and 2) designing
| interfaces, datatypes and APIs themselves.
| bogwog wrote:
| Why would you need to design anything? Just copy official
| Windows headers and use copilot to implement individual
| functions.
|
| Maybe if the signature matches perfectly, copilot will
| even pull in the exact implementation from the Windows
| source code.
| methyl wrote:
| What stops you to do the same, without the AI part?
| petercooper wrote:
| That's what I was wondering. I've never been interested
| enough to steal anyone else's code, but with all the code
| transformers and processing tools nowadays, I imagine it's
| trivial to translate source code into a functionally
| equivalent but stylistically unique version?
| pjerem wrote:
| The question is not if it's trivial or not, but if it is
| legal or not. You can already technically steal GPLv2 by
| obfuscating it.
| formerly_proven wrote:
| Assuming ML models are causal, then bits of GPL code that
| fall out of the model have to have the color GPL, because
| the only way they could've gotten there was to train the
| ML using GPL-colored bits. It seems to me like the answer
| here is pretty obvious, it doesn't really matter how you
| copy a work.
| Rapzid wrote:
| Bits?
| shadilay wrote:
| Would it be possible to do this in reverse assuming the AI has
| some proprietary code in its training data?
| bogwog wrote:
| Yes this is a concern, but I'm not sure if the AI is actually
| able to "generate" a non-trivial piece of code.
|
| If you tell it to generate "a function for calculating the
| barycentric coordinates of a ray-triangle intersection", you
| might get a working implementation of a popular algorithm,
| adapted to your language and existing class/function/variable
| names.
|
| But if you tell it to generate "a smartphone operating system",
| it probably won't work...and if it does, it would most likely
| use giant chunks of Android's codebase.
|
| And if that's true, it means that copilot isn't really
| _generating_ anything. It 's just a (high-tech) search engine
| that knows how to adapt the code it finds to fit your codebase.
| That's still a really cool technology and worth exploring, but
| it doesn't do enough to justify ignoring software licenses.
| treis wrote:
| >But if you tell it to generate "a smartphone operating
| system", it probably won't work...and if it does, it would
| most likely use giant chunks of Android's codebase.
|
| But since now APIs are unprotected you could feed it all of
| the class structure and method signatures to have it fill in
| the blanks. I don't know if that gets you a working operating
| system but it seems like it will get you quite a long way
| saba2008 wrote:
| How is it different from just copy-pasting?
|
| It does add some degree of plausible deniability (accidental
| violation, instead of intentional), but I don't think it would
| matter much.
| rlpb wrote:
| > Bang! you have a non-gpl copy of the program! Moreover, it is
| 100% yours and you can re-license it as you want. This will be
| a boon for copyleft-allergic developers!
|
| Thinking that this would conveniently bypass the fact that your
| goal was to copy the code seems to be the most common legal
| fallacy amongst software developers. The law will see straight
| through you, and you will be found to have infringed copyright.
| The reason is well explained in "What Colour are your bits?"
| (https://ansuz.sooke.bc.ca/entry/23).
| enriquto wrote:
| My message was sarcastic. I'm worried about accidental
| conversion of free software into proprietary. I mean,
| "accidental" locally, in each particular instance; but maybe
| non accidental in the grand scheme of things.
|
| EDIT: to I can write my worry, semi-jokingly, as a conspiracy
| theory: Microsoft is using thousands of unsuspecting (and
| unwilling) developers to turn a huge copylefted corpus of
| algorithms into non-copylefted implementations. Even assuming
| that developers that use the co-pilot use non-copyleft
| licenses only 50% of the time, there's still a constant
| trickling of un-copyleftization.
| alkonaut wrote:
| I don't think most of us are scared enough of being "tainted"
| by the sight of a GPL snippet that we'd bother. Besides, if you
| want to target a specific snippet so you can type the start to
| prime the recognition - you already saw it?
|
| Why not just copy it and then edit it? If a snippet is changed
| both logically and syntactically to not resemble the original,
| then it's no longer the original and you aren't in any
| licensing trouble. There is no meaningful difference between
| that manual washing and a clean room implementation. All the ML
| changes here is the accidental vs deliberate. But it will be a
| worse wash than your manual one.
| ralph84 wrote:
| I get the sense that GitHub _wants_ this to be litigated so the
| case law can be established. Until then it's just a bunch of
| internet lawyers arguing with each other.
| MadAhab wrote:
| I got the sense they saw Google beating Sun/Java in the supreme
| court and said "We'll be fine, lets move the release up"
| pjfin123 wrote:
| Why would you want to? For many open source developers having
| models trained on your code would be desirable.
| tyingq wrote:
| _" We found that about 0.1% of the time, the suggestion may
| contain some snippets that are verbatim from the training set"_
|
| If it's spitting out verbatim code 0.1% of the time, surely it's
| spitting out copied code where only trivial things are different
| at a much higher rate.
|
| Trivial things meaning swapped order where order isn't important,
| variable/function names, equivalent ops like +=1 vs ++, etc.
|
| Surely it's laundering some GPL code, for example, and
| effectively removing the license in a way that sounds fishy.
| dwheeler wrote:
| It's not just the GPL. Almost all open source software licenses
| require attribution; without that attribution, any copy is a
| license violation.
|
| Whether or not the result _is_ a license violation is tricky
| legal question. As always, IANAL.
| tyingq wrote:
| I did say "for example".
| devetec wrote:
| You could say a human is laundering GPL code if they learned
| programming from looking at Github repositories. Would you,
| though? The type of model they use isn't retrieving, it's
| having learned the syntax and the solutions that are used, just
| like a human would.
| thrwaeasddsaf wrote:
| > You could say a human is laundering GPL code if they
| learned programming from looking at Github repositories.
|
| I don't have photographic memory, so I largely don't memorize
| code. I learn general techniques, and memorize simple facts
| such as APIs. I can memorize some short snippets of code, but
| these probably aren't enough to be copyrightable anyway.
|
| > The type of model they use isn't retrieving
|
| How do we know? It think it's very likely that it is largely
| just retrieving code that it memoized, and doing minor
| adjustment to make the retrieved pieces fit the context. That
| wouldn't differ much from finding code that matches the
| problem (whether on SO or Github), copy pasting the
| interesting bits, and fixing it until it satisfies the
| constraints of the surrounding code.
|
| I think the alternative to retrieving would actually require
| a higher level understanding of the world, and the ability to
| reason from first principles; that would be much closer to
| AGI.
|
| For example, if I want to implement a linked list, I'm not
| going to retrieve an implementation from memory (although
| given that linked lists are so simple, I probably could). I
| know what a linked list is and how it works, and therefore I
| can produce working code from scratch.. _for any programming
| language, even ones for which no prior implementations
| exist._ I doubt co-pilot has anything remotely as advanced as
| this ability. No, it fully reliant on just retrieving and
| reshaping a pieces of memoized code; it needs a large corpus
| of code to memoize before it can do anything at all.
|
| I don't need a large corpus of examples to copy, because I
| use my ability to reason in conjunction with some memoized
| general techniques and common APIs in order to produce code.
| drran wrote:
| I have a much simpler AI Copilot, called "cat", which spills
| verbatim code more frequently, but it's OK for me. Can I train
| it on M$ code?
| rhacker wrote:
| I mean this is already happening. When you hire a specialist in
| C# servers, you're copying code that they already wrote. I find
| people tend to write the same functions and classes again and
| again and again all the time.
|
| We have a guy that brought his task manager codebase (he re-wrote
| it) but it's the same thing he used at 2 other companies.
|
| I have written 3 MPIs (master person/patient index) at this point
| all with the same fundamental matching engine.
|
| I mean, one thing we can all agree on is that ML is good at
| copying what we already do.
| tomcooks wrote:
| The amount of people not knowing the difference between Open
| Source and Free Software is astonishing. With the amount of RMS
| memes I see regularly I would expect things to be settled by now.
| sydthrowaway wrote:
| I'm worried about my job. What do I do to prepare?
| ostenning wrote:
| There are much bigger things in this world to worry about. I
| bet you that by the time that this AI has taken your job, it'll
| have taken many other jobs, completely rearranging entire
| industries if not society itself.
|
| And even once that happens you shouldn't be worried about your
| job. Why? Because economically everything will be different and
| because your job isn't that important, it likely never was. The
| problems humanity faces are existential. Authoritarianism,
| ecosystem collapse and mass migration of billions of people.
|
| So if you really want to "prepare", then try to make a
| difference in what actually matters.
| cycomanic wrote:
| In the discussion yesterday I pointed to the case of some
| students suing turnitin for using their works in the turnitin
| database and the studemts lost [1]. I think an individual suing
| will not go anywhere. The way to create a precedent is someone
| feeding all the Harry Potter books and some additional popular
| books (twilight?) to GPT 3 and letting them write about some kids
| at a sorcerer school. The outcomes of that case would look very
| different IMO.
|
| [1] https://www.plagiarismtoday.com/2008/03/25/iparadigms-
| wins-t...
| anfelor wrote:
| Not a lawyer, but in that case it seemed to be a factor that
| turnitin was transformative, because it never sold the texts to
| others and thus didn't reduce the market value of them. But
| that wouldn't apply to copilot which might reduce the usage of
| libraries since you can "code" equivalent functionality with
| copilot now.
|
| Would it be a stretch to assert that GPL'd libraries have a
| market value for their creator in terms of reputation etc.?
| visarga wrote:
| While we're worrying about ML learning to write our codes we
| should also break all the automated looms so people don't go
| without jobs. Do everything manually like God intended! /s
|
| Maybe a code that is easily recreated by GPT with a simple
| prompt is not worth copyrighting. The future is in making it
| more automated, not protecting IP. If you compete against a
| company using it, you can't ignore the advantage.
| shawnz wrote:
| Disney's intellectual property would be a good choice for this
| exercise
| intricatedetail wrote:
| Suing will not go anywhere because Microsoft has billions at
| their disposal to defend any case.
| warpech wrote:
| If GitHub Copilot can sign my CLA, stating that it is the author
| of work, that it transfers the IP to me in exchange for the
| service subscription price and holds responsibility for copyright
| infringement, that would be acceptable. Otherwise it's a gray
| area I don't want to go.
___________________________________________________________________
(page generated 2021-06-30 23:01 UTC)