[HN Gopher] Research recitation: A first look at rote learning i...
___________________________________________________________________
Research recitation: A first look at rote learning in GitHub
Copilot suggestions
Author : azhenley
Score : 66 points
Date : 2021-07-03 18:32 UTC (4 hours ago)
(HTM) web link (docs.github.com)
(TXT) w3m dump (docs.github.com)
| devinplatt wrote:
| The article is worth reading, but a good summary is at the
| bottom:
|
| > This investigation demonstrates that GitHub Copilot can quote a
| body of code verbatim, but that it rarely does so, and when it
| does, it mostly quotes code that everybody quotes, and mostly at
| the beginning of a file, as if to break the ice.
|
| But there's still one big difference between GitHub Copilot
| reciting code and me reciting a poem: I know when I'm quoting. I
| would also like to know when Copilot is echoing existing code
| rather than coming up with its own ideas. That way, I'm able to
| look up background information about that code, and to include
| credit where credit is due.
|
| The answer is obvious: sharing the prefiltering solution we used
| in this analysis to detect overlap with the training set. When a
| suggestion contains snippets copied from the training set, the UI
| should simply tell you where it's quoted from. You can then
| either include proper attribution or decide against using that
| code altogether.
|
| This duplication search is not yet integrated into the technical
| preview, but we plan to do so. And we will both continue to work
| on decreasing rates of recitation, and on making its detection
| more precise.
| sillysaurusx wrote:
| The answer wasn't obvious to me. Nice solution.
|
| It sounds like you're a part of the Copilot team. If so, then
| I'm happy to see the Copilot team cares about these issues at
| all. I was expecting nothing but stonewalling until the
| conversation died out, since realistically the chance of the
| EFF bringing or winning a lawsuit seems small. (And who else
| would try?)
|
| But when you anger the world and bring so much attention to
| this delicate issue of copyright in AI, you risk every
| hobbyist. Suppose the world decides that AI models need to be
| restricted. Now every person who wants to get into AI will need
| to deal with it. I'm not sure anyone else cares, but I care,
| because it's the difference between someone getting into
| woodworking (an unrestricted hobby) vs becoming a lawyer or
| doctor (the maximally restrictive hobby). The closer we are to
| the latter, the fewer ML practitioners we'll see in the long
| run. And even though the world will go along fine -- it always
| does -- it'd be a sad outcome, since the only way it could
| happen is if gigantic corporations were flagrantly flying in
| the face of the spirit of copyright, daring the world to punish them.
|
| My point is, please care about the right things. No one cared
| about language filters on ML models outside of a select vocal
| group, yet look how deeply OpenAI took those concerns to heart.
| Everybody cares whether their personal or professional work is
| being ripped off by an overfitted AI model, and it wasn't
| obvious that GitHub or OpenAI gave it more than a passing
| thought.
|
| Backlinking to the training set should help. But it's also
| going to catapult the concern of "holy moly, this code is GPL
| licensed!" to the front and center of anyone who works in
| corporate settings. Gamedev is particularly insular when it
| comes to GPL, and I can just imagine the conversations at
| various studios. "This thing might spit out GPL? We can't use
| this."
|
| My point is, when you launch that new feature to address
| people's concerns, please ensure it's working. You won't be
| able to do exact string matches against the training set; you
| can't rely on "well, it's slightly different, so it's not
| really the same thing." If it's substantially similar, it needs
| to be cited. And that seems like a much tougher problem than
| merely building an index of matching code fragments.
|
| If you launch it, and it doesn't work, it's going to stoke the
| flames. Careful not to roast.
| onionisafruit wrote:
| FYI, the post you replied to is entirely a quote from the
| article (even though formatting makes it appear that only the
| second paragraph is a quote). So the poster likely is not
| working on copilot.
| blamestross wrote:
| Takeaways in summary:
|
| - Copilot does sometimes rote-copy in nontrivial situations
| - Mostly this happens when there isn't much context to go on
| - Provided an empty file, it proceeds to recommend writing the
| GPL
| - They will add a "recitation detector" to Copilot to indicate
| non-novel recommendations
|
| By the standards of corp-speak this is pretty good: they admit
| there is a problem and they intend to do something tractable to
| address it.
|
| This entire copilot situation is far enough outside my personal
| "mental ethics model" that I'm personally abstaining form taking
| a stance until I have had a lot more time to think and learn
| about it.
| toxik wrote:
| Uh, how about a disclaimer that this analysis is made /by
| Github/?
| onionisafruit wrote:
| It's on https://docs.github.com. Of course it's by github.
| notatoad wrote:
| I don't think that's necessarily obvious. Most links
| submitted to HN from the github.com domain aren't authored by
| GitHub.
| onionisafruit wrote:
| That's understandable. I thought the GP was saying that the
| article should contain the disclaimer.
| iudqnolq wrote:
| I continue to be surprised GitHub shows examples that wouldn't
| compile/run correctly. For example, the Wikipedia scraping
| example, which the author claims is also the intuitive way to
| solve the problem, assigns each row to the global variable cols
| instead of appending. Further, the following if statement
| appears to be mis-indented.
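|
| Roughly the buggy pattern, as a minimal sketch (hypothetical
| names and URL, not the article's exact code):
|
|     import requests
|     from bs4 import BeautifulSoup
|
|     url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
|     soup = BeautifulSoup(requests.get(url).text, "html.parser")
|
|     # suggested shape: each pass overwrites cols instead of
|     # appending, so only the last row survives
|     cols = []
|     for row in soup.find_all("tr"):
|         cols = [td.text for td in row.find_all("td")]
|
|     # intended: accumulate one list of cells per row
|     rows = []
|     for row in soup.find_all("tr"):
|         rows.append([td.text for td in row.find_all("td")])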
| qayxc wrote:
| > I continue to be surprised GitHub shows examples that
| wouldn't compile/run correctly.
|
| Why is that surprising to you? Copilot doesn't actually know
| how to code; it just generates symbols that match learned
| patterns.
|
| Sometimes these generated symbols don't represent valid code,
| and since Copilot doesn't actually perform filtering based on
| syntax checks or a JIT, these results end up as suggestions.
|
| This is actually a point where future versions could greatly
| improve in usefulness, e.g. by using compiler infrastructure to
| verify and filter generated results.
|
| This includes auto-formatting and even result scoring by code
| metrics (conciseness, complexity, ...). Plenty of room for
| improvement even without touching the underlying model.
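|
| A minimal sketch of that post-hoc filtering idea, with Python's
| ast module standing in for a real compiler frontend (this
| catches syntax errors only, not code that parses but is wrong):
|
|     import ast
|
|     def syntactically_valid(suggestions):
|         # keep only the candidates that at least parse
|         valid = []
|         for code in suggestions:
|             try:
|                 ast.parse(code)
|                 valid.append(code)
|             except SyntaxError:
|                 pass
|         return valid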
| iudqnolq wrote:
| I'm not surprised the actual results are flawed. I'm
| surprised the handful of hand-picked examples GitHub calls
| out are flawed. Usually hand-picked examples showcase an
| algorithm's best-case performance.
| thunderbird120 wrote:
| It should be noted that this kind of behavior is entirely
| expected from a GPT-style self-supervised sequence model. Rote
| memorization for this kind of model is indicative of correct
| training, not overfitting. The underlying training objective of
| these models ideally results in a representation of the training
| data which allows complete samples to be extracted by using
| partial samples as keys. Actual overfitting in this kind of model
| requires absurd parameter counts. See
| https://tilde.town/~fessus/reward_is_unnecessary.pdf
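|
| A toy analogy of "partial samples as keys", with a made-up
| one-string training set; a trained model does this softly
| through its weights rather than by literal lookup:
|
|     # a lookup "model": a partial sample (the key) retrieves
|     # the memorized continuation, i.e. recitation by design
|     TRAINING = "#!/usr/bin/env python\nimport sys\n"
|
|     def complete(prefix, text=TRAINING):
|         i = text.find(prefix)
|         return text[i:] if i >= 0 else None
|
|     print(complete("#!/usr/bin/env"))  # recites the sample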
| amelius wrote:
| Does it substitute variables correctly? E.g. if I define max(a,b)
| or max(x,y), does it complete the definition with the right
| variable names?
| onionisafruit wrote:
| Generally, yes. It's not guaranteed to do that correctly, but
| I've not seen it get variable names wrong so far.
| tyingq wrote:
| There's an example called "fetch_tweets.py" at the bottom of
| the page at https://copilot.github.com/ that gets it wrong:
|
|     def fetch_tweets_from_user(user_name):
|         # deleted some lines here...
|         # fetch tweets
|         tweets = api.user_timeline(screen_name=user, count=200,
|                                    include_rts=False)
|
| screen_name=user there isn't right; the function's parameter is
| user_name.
|
| It's a nit, but it is interesting how many of the hand-picked
| examples on that page aren't right, since they were presumably
| chosen to show the product off.
| [deleted]
| the8472 wrote:
| > It's not guaranteed to do that correctly
|
| Which is odd, considering they could run this as a beam search
| with the checking part of a compiler in the loop.
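|
| A toy sketch of that loop; model_step(text) -> [(token,
| logprob), ...] and is_viable(text) -> bool are both assumed
| interfaces, the latter being where the compiler's checker
| would plug in:
|
|     def beam_search(model_step, is_viable, prompt,
|                     width=4, steps=20):
|         beams = [(prompt, 0.0)]
|         for _ in range(steps):
|             candidates = [
|                 (text + tok, score + lp)
|                 for text, score in beams
|                 for tok, lp in model_step(text)
|             ]
|             # prune continuations the checker rejects, but
|             # never prune down to an empty beam
|             viable = [c for c in candidates
|                       if is_viable(c[0])] or candidates
|             beams = sorted(viable, key=lambda c: -c[1])[:width]
|         return max(beams, key=lambda c: c[1])[0]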
| tyingq wrote:
| The analysis seems to depend on sequences where the same exact
| words appear X times in the same order. If my understanding of
| how this works is right, they have the ability to globally change
| symbol names based on the prompt, and probably do other things
| that make a literal match less likely even when the difference
| is trivial: symbol names swapped, equivalent operators used
| (+=1 vs ++, etc.), statements reordered where order doesn't
| matter, and so on.
|
| Of course, I'm just speculating since I don't have access to the
| product, but I have seen GPT-3 output that is verbatim plus some
| synonym swapping.
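|
| For what it's worth, a minimal sketch of making the match
| robust to renaming, by normalizing identifiers before comparing
| token runs (equivalent-operator rewrites would need more work):
|
|     import io, keyword, token, tokenize
|
|     def normalized(code):
|         # replace non-keyword names and literals with
|         # placeholders so renamed variables still match
|         out = []
|         for t in tokenize.generate_tokens(
|                 io.StringIO(code).readline):
|             if t.type == token.NAME:
|                 out.append(t.string if keyword.iskeyword(t.string)
|                            else "ID")
|             elif t.type in (token.NUMBER, token.STRING):
|                 out.append("LIT")
|             elif t.type == token.OP:
|                 out.append(t.string)
|         return out
|
|     def shares_run(a, b, n=10):
|         # True if any n normalized tokens appear in both,
|         # in the same order
|         grams = {tuple(a[i:i + n]) for i in range(len(a) - n + 1)}
|         return any(tuple(b[i:i + n]) in grams
|                    for i in range(len(b) - n + 1))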
| ionwake wrote:
| Sorry for the basic question, but the code one builds on this
| platform is saved to githubs servers?
| rvz wrote:
| Anything pasted or typed into that Copilot editor is sent to
| GitHub as 'telemetry'.
|
| > In order to generate suggestions, GitHub Copilot transmits
| part of the file you are editing to the service.
|
| So yes.
| astrange wrote:
| That's not telemetry, it's the prompt to the model to
| generate the rest of the text. I would assume it's not saved.
| onionisafruit wrote:
| I wouldn't think to call that telemetry either, but it is
| addressed in this doc about telemetry:
| https://docs.github.com/en/github/copilot/telemetry-
| terms#ad...
| Hamuko wrote:
| Don't know about saved, but definitely sent.
|
| I guess one risk, now that we know GitHub regards all source
| code as fair game for training Copilot regardless of license,
| is that you probably can't know for sure that your new code
| isn't being used to teach the model more.
| onionisafruit wrote:
| It's clear the author of this article had access to the code
| that triggered the copilot suggestions. They also say this
| was from an internal trial of copilot, so it might be that
| these trial users were told their code could be seen by their
| coworkers.
| k__ wrote:
| yes
___________________________________________________________________
(page generated 2021-07-03 23:00 UTC)