[HN Gopher] What Copilot means for open source
       ___________________________________________________________________
        
       What Copilot means for open source
        
       Author : zacwest
       Score  : 30 points
       Date   : 2022-06-25 20:04 UTC (2 hours ago)
        
 (HTM) web link (matthewbutterick.com)
 (TXT) w3m dump (matthewbutterick.com)
        
       | mrfusion wrote:
        | How about any company that's worried about its code being
        | copied submits a copy for Copilot to check against? If Copilot
        | generates any of that code, it just doesn't output it.
        
         | ntoskrnl wrote:
         | "Enter your social security number on our site to check if it's
         | been leaked!"
        
       | chrischen wrote:
        | I've been using Copilot during the beta and it's been absolutely
        | amazing. That being said, I mainly rely on it to autocomplete
        | the rest of the _line_ only, and it works great as a fancier
        | autocomplete. I can't imagine it being a copyright issue for
        | this use case, because the completions are what I would have
        | written anyways. I'd probably never trust it to write a whole
        | block of code implementing something, and I think they should
        | definitely add a feature to disable that, because autocompleting
        | single lines of code is just as useful.
       | 
        | It works even better in languages like Haskell or OCaml, where
        | there are often only a couple of ways (often just one way) valid
        | code can be written once you type part of it out, and if it does
        | spit out invalid code you get an instant compiler warning.
        
         | throwntoday wrote:
          | Personally I hate autocomplete. I've had peers who were
          | shocked that I prefer to type everything out manually, and my
          | IDE is essentially a text editor with syntax highlighting. To
          | each their own of course, but I find the productivity boost
          | very quickly becomes a crutch that replaces actually
          | remembering things.
        
           | chrischen wrote:
           | There are 3 ways autocomplete can be used:
           | 
           | 1) As automatic documentation in a well-typed application.
           | Autocomplete shows you the public properties or methods
           | available based on what data you have started typing, and
           | it's not guessed but guaranteed to be valid assuming your
           | types are correct.
           | 
            | 2) As dumb autocomplete, where it just tries to guess based
            | on the symbols in the document, and saves you some
            | keystrokes/remembering.
            | 
            | 3) As the Copilot-style autocomplete, where it will finish
            | the entire line, and not just the next token--with the
            | downside that it's really just guessed.
            | 
            | I can see how 2) and 3) can be avoided, but 1) is really
            | invaluable. I remember the days of writing PHP, where I had
            | to google for function names every time because of
            | inconsistent naming and the lack of typing. But now that I
            | write exclusively in typed languages, autocomplete in the
            | form of 1) has been indispensable, not for remembering the
            | names of things but for seeing which functions/methods are
            | valid and can be used. No need to look up documentation
            | manually anymore.
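            | 
            | To make 1) concrete, here's a tiny Python sketch
            | (hypothetical names; any typed language works the same way):
            | 
            |     from dataclasses import dataclass
            | 
            |     @dataclass
            |     class User:
            |         name: str
            |         email: str
            |         def display_name(self) -> str:
            |             return self.name + " <" + self.email + ">"
            | 
            |     u = User("Ada", "ada@example.com")
            |     # After typing "u." a type-aware editor can offer
            |     # exactly name, email, and display_name -- derived
            |     # from the type, not guessed.
            |     print(u.display_name())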
        
           | coredog64 wrote:
            | Autocomplete has actually turned into a negative for me, at
            | least in VSCode. It keeps autocompleting things that don't
            | make any sense, it's overly aggressive about triggering
            | (after as few as 3 characters), and is generally a PITA. I
            | now have to undo the completion and retype, hitting Escape
            | quickly.
        
             | chrischen wrote:
              | I think this depends on how you've configured
              | autocomplete. In my Neovim setup autocomplete is opt-in
              | (you have to press Tab to complete). It sounds like you
              | have the default autocomplete setup for some reason.
        
           | Barrin92 wrote:
            | Turning off autocomplete forced me to actually learn and
            | memorize again, not just libraries but even my own code. I
            | didn't notice how bad it had gotten until I went from an IDE
            | to plain text editing; it felt like my brain had atrophied.
            | Like when you hear people say they can't navigate anymore
            | without an app.
        
           | omegalulw wrote:
            | You are just wasting time. You should upskill yourself to
            | the point where you can write code in bare vim. Beyond that,
            | use an IDE with good features to save time; that's what IDEs
            | are for.
        
       | [deleted]
        
       | justin-tm wrote:
       | > Why? Because as a matter of basic legal hygiene, I expect that
       | organizations that create software assets will have to forbid the
       | use of Copilot and other AI-assisted tools
       | 
        | I feel like this understates the wild-west nature of software
        | at non-tech Fortune 500 companies.
        
         | KMag wrote:
         | I worked for over a decade at a Fortune 500 investment bank. I
         | suspect they'll be very wary of Copilot, as well. I wouldn't
         | lump all non-tech companies together.
        
       | mcbuilder wrote:
       | > Rather than conceal the licenses of the underlying open-source
       | code it relies on, it could in principle keep this information
       | attached to each chunk of code as it wends its way through the
       | model.
       | 
        | I don't know if the author understands how these transformer
        | models work, but this would be an impossible task of Byzantine
        | complexity. The way these models work is by outputting a
        | probability distribution over likely "token" embeddings given an
        | input prompt. The output involves basically inverting a
        | word2vec-style embedding (probably beefed up with programming-
        | language keywords and other bells and whistles, plus search
        | techniques I don't have access to the details of).
       | 
        | This model was of course trained on real code, but you can't
        | attach any meaningful provenance to the output from the gradient
        | contributed by any one sample. It's a very messy computation
        | even to write down (the percentage by which one training example
        | affected one particular output), much less to interpret simply.
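        | 
        | As a toy illustration of "outputting a probability distribution"
        | (not Copilot's actual pipeline), note that the sampled token
        | carries no record of which training examples shaped the
        | probabilities:
        | 
        |     import numpy as np
        | 
        |     vocab = ["def", "return", "for", "import", "("]
        |     # hypothetical model output (logits) for some prompt
        |     logits = np.array([2.1, 0.3, 1.2, 0.5, 1.8])
        | 
        |     probs = np.exp(logits) / np.exp(logits).sum()  # softmax
        |     print(np.random.choice(vocab, p=probs))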
        
         | gus_massa wrote:
         | There is a problem because some licenses require attribution,
         | but ignoring that...
         | 
         | You can make a model that is trained only with BSD and MIT
         | code, and IIRC/IIUC the result can be used with any license,
         | including proprietary code.
         | 
          | You can make a second model that is trained only with BSD,
          | MIT and GPL2 code (and perhaps GPL2+), and IIRC/IIUC the
          | result can only be used with GPL2 code.
          | 
          | You can make a third model that is trained only with BSD, MIT,
          | GPL2+ and GPL3 code, and IIRC/IIUC the result can only be used
          | with GPL3 code.
         | 
         | AGPL, Apache, WTFPL, ... Just add them to the correct model or
         | create a new model for them.
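          | 
          | A hypothetical sketch of that scheme (not an existing
          | feature): one training corpus per output-license bucket, each
          | containing only compatible licenses.
          | 
          |     TRAINING_SETS = {
          |         "any-license": {"BSD", "MIT"},
          |         "gpl2-output": {"BSD", "MIT", "GPL-2.0"},
          |         "gpl3-output": {"BSD", "MIT", "GPL-2.0-or-later",
          |                         "GPL-3.0"},
          |     }
          | 
          |     def usable_models(target_license: str) -> list[str]:
          |         """Models whose whole corpus is compatible with
          |         the target license."""
          |         allowed = {
          |             "proprietary": ["any-license"],
          |             "GPL-2.0": ["any-license", "gpl2-output"],
          |             "GPL-3.0": ["any-license", "gpl3-output"],
          |         }
          |         return allowed.get(target_license, ["any-license"])
          | 
          |     print(usable_models("GPL-2.0"))
          |     # ['any-license', 'gpl2-output']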
        
           | q-big wrote:
           | > You can make a model that is trained only with BSD and MIT
           | code, and IIRC/IIUC the result can be used with any license,
           | including proprietary code.
           | 
            | The BSD and MIT licenses still require attribution of the
            | used source code (there is an MIT No Attribution license,
            | though: https://en.wikipedia.org/w/index.php?title=MIT_Licen
            | se&oldid...).
        
           | winety wrote:
           | > There is a problem because some licenses require
           | attribution, but ignoring that...
           | 
            | Surely the solution would be to give credit to every author
            | in the training corpus. I am looking forward to the 10,000
            | lines of copyright notices in every header. :P
           | 
           | If Microsoft had trained it on its own code, there would be
           | no such problems. Surely a company as large as Microsoft has
           | produced enough code over the years to create a large enough
           | training dataset.
        
         | Brian_K_White wrote:
          | As much as I hate the entire concept, I would have a hard time
          | articulating a substantive difference between this description
          | of the AI mixing together bits of stuff it saw and what I do
          | myself when I'm writing something that I fully describe as
          | mine.
        
         | omegalulw wrote:
          | There are much simpler solutions than that. Simply invest in
          | or reuse plagiarism-detection models and tag output snippets
          | with their source repositories. You can then either exclude
          | that code or let users know it's licensed code.
        
         | withinboredom wrote:
          | Unless the model is truly coming up with something novel,
          | there is _something_ in its training set that is very similar,
          | if not precisely the output. It should say so when that is the
          | case, and should also say whether or not the output is novel.
          | I'm sure GitHub could provide an API for searching code
          | snippets, if they don't already.
        
           | moyix wrote:
           | They're currently doing this automatically for catching exact
           | matches, and you can turn on a setting that will suppress any
           | suggestion that was found verbatim in the training data.
           | 
           | But of course this wouldn't catch copies where variable names
           | have been changed, etc. One thing that I think would be
           | really interesting (but hideously computationally expensive)
           | is to compute an embedding of each chunk of the training data
           | and then query for the k nearest neighbors when Copilot
           | generates some code, so you can see what the closest snippets
           | in the training data are and evaluate for yourself if they're
           | too similar.
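            | 
            | A rough sketch of that idea (embed() below is a toy stand-in
            | for a real code-embedding model):
            | 
            |     import numpy as np
            | 
            |     def embed(code: str) -> np.ndarray:
            |         v = np.zeros(64)
            |         for i, ch in enumerate(code):  # toy hashing
            |             v[(i + ord(ch)) % 64] += 1.0
            |         n = np.linalg.norm(v)
            |         return v / n if n else v
            | 
            |     chunks = ["for i in range(n):", "return a + b"]
            |     index = np.stack([embed(c) for c in chunks])
            | 
            |     def nearest(snippet: str, k: int = 2):
            |         sims = index @ embed(snippet)  # cosine similarity
            |         order = np.argsort(sims)[::-1][:k]
            |         return [(chunks[i], float(sims[i])) for i in order]
            | 
            |     print(nearest("for j in range(m):"))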
        
           | mcbuilder wrote:
            | Yes, but Co-Pilot is more like a system that has spent a lot
            | of time learning from open source code, yet has the logical
            | reasoning ability of the 12-year-old the author mentioned,
            | who just learned about the problem yesterday.
           | 
            | But will it be capable of generating a copyright
            | infringement? Let's imagine I want to implement an algorithm
            | to find the shortest path between nodes in a graph. Well, I
            | remember from my MSc that you probably want Dijkstra's
            | algorithm, and I've implemented it a few times over the
            | years in katas and on leetcode. Now let's say I'm
            | programming in Rust, whose syntax I don't really know, so I
            | go read some GPL code while browsing the net for syntax.
            | 
            | Do I need to state that my implementation of Dijkstra's
            | algorithm is 10% GPL because I spent a few minutes reading
            | the syntax of a GPL file? What if someone else's copyrighted
            | code looks 90% similar to mine (because, hell, there are
            | only so many ways to implement it)? Can I get sued because
            | there is 90% overlap and I might have looked at that code?
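            | 
            | For reference, here's the textbook shape of that algorithm
            | (a Python sketch rather than Rust; the point is that
            | independent implementations converge on nearly identical
            | code):
            | 
            |     import heapq
            | 
            |     def dijkstra(graph, start):
            |         # graph: node -> list of (neighbor, weight)
            |         dist = {start: 0}
            |         heap = [(0, start)]
            |         while heap:
            |             d, u = heapq.heappop(heap)
            |             if d > dist.get(u, float("inf")):
            |                 continue  # stale queue entry
            |             for v, w in graph.get(u, []):
            |                 nd = d + w
            |                 if nd < dist.get(v, float("inf")):
            |                     dist[v] = nd
            |                     heapq.heappush(heap, (nd, v))
            |         return dist
            | 
            |     print(dijkstra({"a": [("b", 1), ("c", 4)],
            |                     "b": [("c", 1)]}, "a"))
            |     # {'a': 0, 'b': 1, 'c': 2}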
           | 
            | These systems aren't even capable yet of generating more
            | than a few functions, much less a coherent library. And as
            | they generalize more and gain more expressive power, they
            | will be even less likely to copy training data into the
            | output, a possibility I consider quite remote even for a
            | relatively primitive Co-Pilot.
        
       | eterevsky wrote:
        | I've seen estimates that verbatim copies of training code
        | constitute around 0.1% of the code produced by Copilot. It
        | would be relatively straightforward to implement a verification
        | step that removes this code. I would be surprised if it hasn't
        | been done already.
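        | 
        | Such a filter could be as simple as this sketch (illustrative
        | only, not GitHub's implementation): index overlapping token
        | n-grams of the training corpus and reject any suggestion that
        | contains an indexed n-gram.
        | 
        |     N = 6  # n-gram length in tokens
        | 
        |     def ngrams(code: str):
        |         toks = code.split()
        |         return {tuple(toks[i:i + N])
        |                 for i in range(len(toks) - N + 1)}
        | 
        |     training_corpus = [
        |         "def add(a, b): return a + b  # from an indexed repo",
        |     ]
        |     training_index = set()
        |     for chunk in training_corpus:
        |         training_index |= ngrams(chunk)
        | 
        |     def is_verbatim(suggestion: str) -> bool:
        |         return bool(ngrams(suggestion) & training_index)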
       | 
       | With verbatim code copies out of the way, I don't see any basis
       | to consider code produced by Copilot copyrighted. So unless you
       | have some problem with the fact that portions of your code will
       | be public domain, I don't see any reason not to use it.
       | 
       | And one more thought. The author gives an example of using
       | Copilot to list prime numbers. That's not a good use for it.
       | Copilot and similar systems are primarily useful for writing
       | boring boilerplate code, saving your time for more involved
       | parts.
        
         | keithnz wrote:
          | I agree, it needs to ensure no verbatim code. I also noticed
          | that there is an option not to use public code. But I actually
          | like the idea of using open source code. In one breath we
          | (software devs) recommend reading open source software to
          | learn coding; then, when someone tries to automate that using
          | AI so that it contextually helps you write code, some people
          | seem upset that we can automate extracting code knowledge /
          | patterns because we didn't do it with our wetware. I think
          | people should be highly encouraged to extract knowledge from
          | open code, even if at the moment the tooling is far from
          | perfect. The boilerplate stuff is great right now, and from
          | time to time it writes some good contextually aware stuff as
          | well. So, in summary: if humans are allowed to learn from open
          | code, then I think AI should be allowed to learn from open
          | code too.
        
         | axg11 wrote:
          | I believe this is already a feature that you can activate in
          | GitHub Copilot.
        
         | withinboredom wrote:
          | I vaguely remember that generated code can't be copyrighted
          | either. So you end up with some code you can't even own and
          | some you can? How does that work?
        
       | readthenotes1 wrote:
        | Never fear. Copilot will hire a passel of lawyers not long after
        | it's declared sentient (by some lawyers).
        
       | Pulcinella wrote:
       | _Rather than conceal the licenses of the underlying open-source
       | code it relies on, it could in principle keep this information
       | attached to each chunk of code as it wends its way through the
       | model. Then, on the output side, it would be possible for a user
       | to inspect the generated code and see where every part came from
       | and what license is attached to it._
       | 
        | This is a feature, not a bug: the benefits of open source code,
        | now with fewer potential legal risks (assuming it actually
        | works). After all, with Copilot you write all the code
        | "yourself." Yeah, with assistance from the Copilot autocomplete
        | tool from the Microsoft corporation, but IDEs and text editors
        | have had autocomplete for decades, and that doesn't mean you
        | didn't write that code.
        | 
        | (Note that I don't agree with this; I'm just explaining that
        | laundering other people's code is intentional.)
        
       | [deleted]
        
       | withinboredom wrote:
        | I don't know why the title is editorialized here ... but the
        | actual title of the article, "THIS COPILOT IS STUPID AND WANTS
        | TO KILL ME", reminded me of a Tesla holding the left lane of the
        | Autobahn and only going 100 km/h. As I passed it on the right (a
        | totally illegal maneuver on my part), honking my horn, I saw the
        | driver reading a book and wondered how long that "driver" was
        | going to live before being rear-ended by someone doing 150-200
        | km/h.
       | 
       | Maybe all AI secretly wants to kill us.
        
         | gus_massa wrote:
          | > _I don't know why the title is editorialized here ..._
         | 
          | I saw the post some minutes ago and it was posted with the
          | original title. I guess the mods changed it. The part about
          | wanting to kill the writer is an exaggeration. It's usual to
          | change a title to the subtitle or a relevant sentence, but
          | without too much cherry-picking (and the last part of this one
          | is not very clear).
          | 
          | Someone made a tracker for these changes:
          | https://hackernewstitles.netlify.app/ (HN discussion:
          | https://news.ycombinator.com/item?id=21617016 ; 366 points |
          | Nov 23, 2019 | 94 comments). Most of the changes make sense
          | (if you forget the infamous asteroid/rocket case).
        
       | zacwest wrote:
       | > In the large, I don't think the problems open-source authors
       | have with AI training are that different from the problems
       | everyone will have. We're just encountering them sooner.
       | 
        | Call me a pessimist, but it seems unlikely that AI itself will
        | cause problems so much as continually increasing automation
        | will. AI makes terrible qualitative decisions, and extending an
        | existing AI into a general solution pretty much always fails.
        
       ___________________________________________________________________
       (page generated 2022-06-25 23:00 UTC)