[HN Gopher] What Copilot means for open source
___________________________________________________________________
What Copilot means for open source
Author : zacwest
Score : 30 points
Date : 2022-06-25 20:04 UTC (2 hours ago)
(HTM) web link (matthewbutterick.com)
(TXT) w3m dump (matthewbutterick.com)
| mrfusion wrote:
| How about this: any company that's worried about its code being
| copied submits a copy for Copilot to check against. If Copilot
| generates any of that code, it just doesn't output it.
| ntoskrnl wrote:
| "Enter your social security number on our site to check if it's
| been leaked!"
| chrischen wrote:
| I've been using Copilot during the beta and it's been absolutely
| amazing. That said, I mainly rely on it to autocomplete the rest
| of the _line_ only, and it works great as a fancier autocomplete.
| I can't imagine it being a copyright issue for this use case
| because the completions are what I would have written anyway. I'd
| probably never trust it to write a whole block of code
| implementing something, and I think they should definitely add a
| feature to disable that, because autocompleting only lines of
| code is just as useful.
|
| It works even better in languages like Haskell or OCaml, where
| there are often only a couple of ways (sometimes only one way)
| valid code can be written once you type part of it out, and if it
| does spit out invalid code you get an instant compiler error.
| throwntoday wrote:
| Personally I hate autocomplete. I've had peers who were shocked
| that I prefer to type everything out manually, and my IDE is
| essentially a text editor with syntax highlighting. To each
| their own of course, but I find the increase in productivity
| very quickly becomes a crutch that replaces actually remembering
| things.
| chrischen wrote:
| There are 3 ways autocomplete can be used:
|
| 1) As automatic documentation in a well-typed application.
| Autocomplete shows you the public properties or methods
| available based on what data you have started typing, and
| it's not guessed but guaranteed to be valid assuming your
| types are correct.
|
| 2) As dumb autocomplete, where it just tries to guess based
| on the symbols in the document, and save you some
| keystrokes/remembering.
|
| 3) As the Copilot style autocomplete where it will finish the
| entire line, and not just the next token--with the downside
| that it's really just guessed.
|
| I can see how 2) and 3) can be avoided, but 1) is really
| invaluable. I remember the days of writing PHP where I had to
| google for function names every time because of the
| inconsistent naming and lack of typing. But now that I write
| exclusively in typed languages, autocomplete in the form of 1)
| has been invaluable, not for remembering the names of things
| but for being able to see which functions/methods are valid and
| can be used. No need to look up documentation manually anymore.
| coredog64 wrote:
| Autocomplete has actually turned into a negative for me, at
| least in VSCode. It keeps autocompleting things that don't make
| any sense, it's overly aggressive about triggering (after as few
| as 3 characters), and it's generally a PITA. I now have to undo
| the completion and retype, hitting the escape key quickly.
| chrischen wrote:
| I think this depends on how you've configured autocomplete. In
| my Neovim setup autocomplete is opt-in (you have to press tab
| to complete). It sounds like you have the default autocomplete
| setup for some reason.
| Barrin92 wrote:
| Turning off autocomplete forced me to actually learn and
| memorize again: not just libraries, but even my own code. I
| didn't even notice how bad it had gotten until I went from an
| IDE to plain text editing; it felt like my brain had atrophied.
| Like when you hear people say they can't navigate anymore
| without an app.
| omegalulw wrote:
| You are just wasting time. You should upskill yourself to the
| point where you can write code in bare vim. Beyond that, use an
| IDE with good features to save time; that's what IDEs are for.
| [deleted]
| justin-tm wrote:
| > Why? Because as a matter of basic legal hygiene, I expect that
| organizations that create software assets will have to forbid the
| use of Copilot and other AI-assisted tools
|
| I feel like this understates the wild-west nature of software at
| non-tech Fortune 500 companies.
| KMag wrote:
| I worked for over a decade at a Fortune 500 investment bank. I
| suspect they'll be very wary of Copilot, as well. I wouldn't
| lump all non-tech companies together.
| mcbuilder wrote:
| > Rather than conceal the licenses of the underlying open-source
| code it relies on, it could in principle keep this information
| attached to each chunk of code as it wends its way through the
| model.
|
| I don't know if the author understands how these transformer
| models work, but this would be an impossible task of Byzantine
| complexity. These models work by outputting a probability
| distribution over likely "token" embeddings given an input
| prompt. The output basically involves inverting a word2vec
| (probably beefed up with programming-language keywords and other
| bells and whistles, plus search techniques whose details I don't
| have access to).
|
| This model was of course trained on real code, but you can't
| attach to the output any meaningful information from the
| gradient you get from a sample. It's a very messy computation to
| even write down (attributing a percentage by which one training
| example affected one particular output), much less to interpret
| simply.
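|
| (To make that concrete, here's a minimal sketch in Python of one
| generation step, with a made-up vocabulary and made-up logits;
| note there is nowhere in it for per-example license metadata to
| live:)
|
|   import numpy as np
|
|   def softmax(logits):
|       # Turn raw model scores into a probability distribution.
|       exps = np.exp(logits - np.max(logits))
|       return exps / exps.sum()
|
|   # Hypothetical vocabulary and logits for one generation step.
|   # The weights blend *all* of the training data at once; no
|   # single output token traces back to one repository.
|   vocab = ["def", "return", "(", ")", "x", ":"]
|   logits = np.array([2.1, 0.3, 1.7, 0.2, 1.1, 0.9])
|
|   probs = softmax(logits)
|   next_token = np.random.choice(vocab, p=probs)
|   print(next_token)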
| gus_massa wrote:
| There is a problem because some licenses require attribution,
| but ignoring that...
|
| You can make a model that is trained only with BSD and MIT
| code, and IIRC/IIUC the result can be used with any license,
| including proprietary code.
|
| You can make a second model that is trained only with BSD, MIT
| and GPL2 code (and perhaps GPL2+), and IIRC/IIUC the result can
| only be used with GPL2 code.
|
| You can make a third model that is trained only with BSD, MIT,
| GPL2+ and GPL3 code, and IIRC/IIUC the result can only be used
| with GPL3 code.
|
| AGPL, Apache, WTFPL, ... Just add them to the correct model or
| create a new model for them.
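|
| A minimal sketch of that routing (hypothetical model names; the
| compatibility rules are heavily simplified):
|
|   # Each model is trained only on code whose licenses are mutually
|   # compatible; its output inherits the strictest license in the set.
|   MODELS = {
|       "model-permissive": {
|           "trained_on": {"BSD", "MIT"},
|           "output_license": "any, including proprietary",
|       },
|       "model-gpl2": {
|           "trained_on": {"BSD", "MIT", "GPL-2.0", "GPL-2.0-or-later"},
|           "output_license": "GPL-2.0",
|       },
|       "model-gpl3": {
|           "trained_on": {"BSD", "MIT", "GPL-2.0-or-later", "GPL-3.0"},
|           "output_license": "GPL-3.0",
|       },
|   }
|
|   def model_for(project_license: str) -> str:
|       # Pick a model whose output can be used under the license of
|       # the project being worked on.
|       for name, info in MODELS.items():
|           lic = info["output_license"]
|           if lic == project_license or lic.startswith("any"):
|               return name
|       raise ValueError(f"no model compatible with {project_license}")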
| q-big wrote:
| > You can make a model that is trained only with BSD and MIT
| code, and IIRC/IIUC the result can be used with any license,
| including proprietary code.
|
| The BSD and MIT licenses still require attribution of the used
| source code (there is an MIT No Attribution License, though:
| https://en.wikipedia.org/w/index.php?title=MIT_License&oldid...).
| winety wrote:
| > There is a problem because some licenses require
| attribution, but ignoring that...
|
| Surely the solution would be to give credit to every author
| from the training corpus. I am looking forward to the 10 000
| lines of copyrights in every header. :P
|
| If Microsoft had trained it on its own code, there would be
| no such problems. Surely a company as large as Microsoft has
| produced enough code over the years to create a large enough
| training dataset.
| Brian_K_White wrote:
| As much as I hate the entire concept, I would have a hard time
| articulating a substantive difference between this description
| of the AI mixing together bits of stuff it saw and what I do
| myself when I'm writing something that I fully describe as
| mine.
| omegalulw wrote:
| There are much simpler solutions than that. Simply invest in or
| reuse plagiarism-detection models and tag output snippets with
| their source repositories. You can then either exclude this code
| or let users know it's licensed code.
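|
| A rough sketch of what that tagging could look like (hashed
| n-gram fingerprints, as in classic plagiarism detectors; the
| index is hypothetical and would be built offline over the
| training corpus):
|
|   import hashlib
|
|   def fingerprints(code: str, n: int = 5) -> set[str]:
|       # Hash every sliding window of n tokens; whitespace changes
|       # won't affect the fingerprint, variable renames still would.
|       tokens = code.split()
|       return {
|           hashlib.sha1(" ".join(tokens[i:i + n]).encode()).hexdigest()
|           for i in range(max(0, len(tokens) - n + 1))
|       }
|
|   # fingerprint -> (repo_url, license), populated from the corpus.
|   INDEX: dict[str, tuple[str, str]] = {}
|
|   def tag_suggestion(snippet: str) -> list[tuple[str, str]]:
|       # Report which indexed repos a generated snippet overlaps,
|       # so the user can be warned it may be licensed code.
|       hits = fingerprints(snippet) & INDEX.keys()
|       return [INDEX[h] for h in hits]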
| withinboredom wrote:
| Unless the model is truly coming up with something novel, there
| is _something_ in its training set that is very similar, if not
| precisely the output. It should say so when that is the case,
| and should also say whether or not the output is novel. I'm
| sure GitHub could provide an API for searching code snippets,
| if they don't already have one.
| moyix wrote:
| They're currently doing this automatically for catching exact
| matches, and you can turn on a setting that will suppress any
| suggestion that was found verbatim in the training data.
|
| But of course this wouldn't catch copies where variable names
| have been changed, etc. One thing that I think would be
| really interesting (but hideously computationally expensive)
| is to compute an embedding of each chunk of the training data
| and then query for the k nearest neighbors when Copilot
| generates some code, so you can see what the closest snippets
| in the training data are and evaluate for yourself if they're
| too similar.
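|
| A sketch of that pipeline (`embed` here is a stand-in for a
| real code-embedding model, and the chunking is simplified to
| whole snippets):
|
|   import numpy as np
|   from sklearn.neighbors import NearestNeighbors
|
|   def embed(chunk: str) -> np.ndarray:
|       # Placeholder: deterministic random vectors instead of a
|       # real code-embedding model.
|       rng = np.random.default_rng(abs(hash(chunk)) % (2**32))
|       return rng.standard_normal(256)
|
|   # Offline, the hideously expensive part: embed every chunk of
|   # the training data and build the neighbor index.
|   chunks = ["def add(a, b): return a + b",
|             "def mul(a, b): return a * b"]
|   nn = NearestNeighbors(n_neighbors=2, metric="cosine")
|   nn.fit(np.stack([embed(c) for c in chunks]))
|
|   # Online: when Copilot emits a suggestion, show its nearest
|   # training neighbors and let the user judge the similarity.
|   query = embed("def add(x, y): return x + y").reshape(1, -1)
|   dists, idxs = nn.kneighbors(query)
|   for d, i in zip(dists[0], idxs[0]):
|       print(f"distance={d:.3f}  {chunks[i]!r}")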
| mcbuilder wrote:
| Yes, but Copilot is more like a system that has spent a lot of
| time learning from open source code, yet has the logical
| reasoning ability of the 12-year-old the author mentioned, who
| just learned about the problem yesterday.
|
| But will it be capable of generating a copyright infringement?
| Let's imagine I want to implement an algorithm to find the
| shortest path between nodes in a graph. Well, I remember from
| my MSc that you probably want Dijkstra's algorithm, and I've
| implemented it a few times over the years in katas and leetcode
| problems. Now let's say I'm programming in Rust, which I don't
| really know the syntax for, so I go read some GPL code while
| browsing the net for syntax.
|
| Do I need to state that my implementation of Dijkstra's
| algorithm is 10% GPL because I spent a few minutes reading the
| syntax of a GPL file? What if someone else's copyrighted code
| looks 90% similar to mine (because, hell, there are only so
| many ways to implement it)? Can I get sued because there is 90%
| overlap and I might have looked at that code?
|
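| (For reference, the canonical heap-based Dijkstra is short
| enough that independent implementations will inevitably look
| alike; a quick Python sketch:)
|
|   import heapq
|
|   def dijkstra(graph, start):
|       # graph: {node: [(neighbor, weight), ...]}
|       dist = {start: 0}
|       heap = [(0, start)]
|       while heap:
|           d, node = heapq.heappop(heap)
|           if d > dist.get(node, float("inf")):
|               continue  # stale heap entry
|           for nbr, w in graph.get(node, []):
|               nd = d + w
|               if nd < dist.get(nbr, float("inf")):
|                   dist[nbr] = nd
|                   heapq.heappush(heap, (nd, nbr))
|       return dist
|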
| These systems aren't even capable yet of generating more than a
| few functions, much less a coherent library. And as they
| generalize more and gain more expressive power, they will
| become even less likely to copy training data into the output,
| a possibility I already consider quite remote even given a
| relatively primitive Copilot.
| eterevsky wrote:
| I've seen estimates that verbatim copies of training code
| constitute around 0.1% of the code produced by Copilot. It would
| be relatively straightforward to implement a verification
| step that removes this code. I would be surprised if it hasn't
| been done already.
|
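| (The check itself could be as simple as an exact-match lookup
| against an index of training-file slices; a hypothetical
| sketch:)
|
|   def build_index(training_files, window=60):
|       # Index every `window`-character slice of the corpus.
|       seen = set()
|       for text in training_files:
|           for i in range(len(text) - window + 1):
|               seen.add(text[i:i + window])
|       return seen
|
|   def is_verbatim(suggestion, index, window=60):
|       # Flag a suggestion if any slice of it appears verbatim
|       # in the training data, so it can be suppressed.
|       return any(
|           suggestion[i:i + window] in index
|           for i in range(len(suggestion) - window + 1)
|       )
|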
| With verbatim code copies out of the way, I don't see any basis
| to consider code produced by Copilot copyrighted. So unless you
| have some problem with the fact that portions of your code will
| be public domain, I don't see any reason not to use it.
|
| And one more thought. The author gives an example of using
| Copilot to list prime numbers. That's not a good use for it.
| Copilot and similar systems are primarily useful for writing
| boring boilerplate code, saving your time for more involved
| parts.
| keithnz wrote:
| I agree, it needs to ensure no verbatim code. I also noticed
| that there is an option not to use public code. But I actually
| like the idea of using open source code. In one breath we
| (software devs) have recommended reading open source software
| to learn coding; then, when someone tries to automate that with
| AI so that it contextually helps you write code, some people
| seem upset that we can automate extracting code knowledge and
| patterns because we didn't do it with our wetware. I think
| people should be highly encouraged to extract knowledge from
| open code, even if at the moment it is far from perfect. The
| boilerplate stuff is great right now, and from time to time it
| writes some good contextually aware stuff as well. So, in
| summary, if humans are allowed to learn from open code, then I
| think AI should be allowed to learn from open code too.
| axg11 wrote:
| I believe this is already a feature that you can activate for
| Github Copilot.
| withinboredom wrote:
| I vaguely remember that generated code can't be copyrighted
| either. So you end up with some code that you can't even own
| and some that you can? How does that work?
| readthenotes1 wrote:
| Never fear. Copilot will hire a passel of lawyers not long after
| it's declared sentient (by some lawyers)
| Pulcinella wrote:
| _Rather than conceal the licenses of the underlying open-source
| code it relies on, it could in principle keep this information
| attached to each chunk of code as it wends its way through the
| model. Then, on the output side, it would be possible for a user
| to inspect the generated code and see where every part came from
| and what license is attached to it._
|
| This is a feature, not a bug: the benefits of open source code,
| now with fewer potential legal risks (assuming it actually
| works). After all, with Copilot you write all the code
| "yourself." Sure, with assistance from the Copilot autocomplete
| tool from the Microsoft Corporation, but IDEs and text editors
| have had autocomplete for decades, and that doesn't mean you
| didn't write that code.
|
| (Note that I don't agree with this; I'm just pointing out that
| laundering other people's code is intentional.)
| [deleted]
| withinboredom wrote:
| I don't know why the title is editorialized here ... but the
| actual title of the article, "THIS COPILOT IS STUPID AND WANTS
| TO KILL ME", reminded me of a Tesla holding the left lane of
| the Autobahn and only going 100 km/h. As I passed it on the
| right (a totally illegal maneuver on my part), honking my horn,
| I saw the driver reading a book and wondered how long that
| "driver" was going to live before being rear-ended by someone
| doing 150-200 km/h.
|
| Maybe all AI secretly wants to kill us.
| gus_massa wrote:
| > _I don't know why the title is editorialized here ..._
|
| I saw the post some minutes ago and it was posted with the
| original title. I guess the mods changed it. The part about
| wanting to kill the writer is an exaggeration. It's usual to
| change a title to the subtitle or a relevant sentence, but
| without too much cherry-picking (the last part is not very
| clear).
|
| Someone made a tracker for these changes:
| https://hackernewstitles.netlify.app/ HN discussion:
| https://news.ycombinator.com/item?id=21617016 (366 points | Nov
| 23, 2019 | 94 comments). Most of the changes make sense (if you
| forget the infamous case of the asteroid/rocket part).
| zacwest wrote:
| > In the large, I don't think the problems open-source authors
| have with AI training are that different from the problems
| everyone will have. We're just encountering them sooner.
|
| Call me a pessimist, but it seems really unlikely that AI
| itself will cause problems so much as continually increasing
| automation will. AI makes such terrible choices around
| qualitative decisions, and scaling existing AI into a general
| solution pretty much always fails.
___________________________________________________________________
(page generated 2022-06-25 23:00 UTC)