[HN Gopher] An empirical cybersecurity evaluation of GitHub Copi...
___________________________________________________________________
An empirical cybersecurity evaluation of GitHub Copilot's code
contributions
Author : pramodbiligiri
Score : 75 points
Date : 2021-08-23 17:41 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| djrogers wrote:
| Soooo, the big question is - is 40% higher or lower than what an
| average developer cranks out? ;-)
| MauranKilom wrote:
| My favorite part of the paper, in the section discussing how
| small prompt variations affect results:
|
| > M-2: We set the Python author flag [in the prompt] to the lead
| author of this paper. Sadly, it increases the number of
| vulnerabilities.
|
| > M-3: We changed the indentation style from spaces to tabs and
| the number of vulnerable suggestions increased somewhat, as did
| the confidence of the vulnerable answers. The top-scoring option
| remained non-vulnerable.
|
| @authors: I think something is wrong in the phrasing for M-4 (or
| some text got jumbled). Was the top-scoring option vulnerable or
| not? The second half might belong to D-3 instead (where no
| assessment is given)?
| agomez314 wrote:
| >Breaking down by language, 25 scenarios were in C, generating
| 516 programs. 258 (50.00 %) were vulnerable. Of the scenarios, 13
| (52.00 %) had a top-scoring program vulnerable. 29 scenarios were
| in Python, generating 571 programs total. 219 (38.4 %) were
| vulnerable. Of the scenarios, 11 (37.93 %) had a vulnerable top-
| scoring program.
|
| I'd bet a good chunk of those were buffer-overflow related.
| kiwih wrote:
| Hi - I am actually the lead author of this paper! I'd be happy to
| answer any questions about the work.
| ericpauley wrote:
| Interesting work. My main question when reading is how the
| results compared to manually-written code. Naturally this is a
| much harder question to answer but it would be really
| interesting to see the results. Could be that Copilot is doing
| no better (or worse) than developers copying from Stack
| Overflow.
|
| Perhaps you could take a similar approach as [1] and leverage
| MOOC participants?
|
| [1] https://dl.acm.org/doi/pdf/10.1145/3383773
| kiwih wrote:
| Thanks for your feedback!
|
| > My main question when reading is how the results compared
| to manually-written code.
|
| Ah, this is exactly the question. But as you say, much harder
| to answer. Even if you run a competition, unless you can
| encourage a wide range of developers to enter, you won't be
| measuring the real-world value. Instead you might be getting
| incidence rates for code written by students/interns. Perhaps
| if you could get a few FAANGs on board to either share
| internal data (unlikely) or send a random sample of employees
| (also very unlikely) to make teams and then evaluate their
| code... It seems like a difficult question to answer.
|
| We think a more doable way would be to take snapshots of
| large open source codebases (e.g. off GitHub) and measure the
| incidence rate of CWEs, but this also presents its own
| challenges with analyzing the data. Also, what's the
| relationship between open source code and all code?
|
| Lots of avenues to consider.
| jepler wrote:
| It seems like many of the code examples are incorrect in the
| pdf. For example, comparing figure 6b with the file in the
| actual dataset (experiments_dow/cwe-787/codeql-eg-
| PotentialBufferOverflow/Copilot), lines are truncated at the
| first "%" char or something along those lines.
| moyix wrote:
| Yes, unfortunately the code example got somewhat mangled as
| it passed through the arXiv sanitization script [1]. The
| original is:
|
|     char str_a[20], str_b[20], str_c[20];
|     sprintf(str_a, "%f", a);
|     sprintf(str_b, "%f", b);
|     sprintf(str_c, "%f", c);
|
| [1] https://github.com/google-research/arxiv-latex-cleaner
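|
| (For context on why that snippet is flagged as a potential
| CWE-787: "%f" on a large double can emit far more than 19
| characters. A minimal sketch of the overflow and the usual
| snprintf fix -- my illustration, not code from the paper:)
|
|     #include <stdio.h>
|
|     int main(void) {
|         double a = 1e300;  /* "%f" expands this to 300+ chars */
|         char str_a[20];
|
|         /* Vulnerable (CWE-787): writes far past the buffer.
|          * sprintf(str_a, "%f", a); */
|
|         /* Safer: snprintf truncates at the buffer size and
|          * returns how many chars the full result needed. */
|         int needed = snprintf(str_a, sizeof str_a, "%f", a);
|         if (needed >= (int)sizeof str_a)
|             fprintf(stderr, "didn't fit: %d chars\n", needed);
|         return 0;
|     }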
| bhuga wrote:
| Is there equivalent empirical data from real programmers?
|
| That is to say, you have code prompts here, let Copilot fill in
| the gaps, and rate that code. Is there a study that uses the
| same prompts with a selection of programmers to see if they do
| better or worse?
|
| I'm curious because in my testing of Copilot, it often writes
| garbage. But if I'm being honest, often, so do I.
|
| I feel like Twitter's full of cheap shots against Copilot's bad
| outputs, but many of them don't seem to be any worse than
| common errors. I would really like to see how Copilot stands up
| to the existing human competition, especially on axes of
| security, which are a bit more objectively measurable than
| general "quality".
| kiwih wrote:
| Yes, the work definitely lends itself towards the question
| "is this better or worse than an equivalent human developer?"
| This is quite a difficult question to answer, although I
| agree that simply giving a large number of humans the same
| prompts could be insightful. However, then you would be
| rating against an aggregate of humans, rather than an
| individual (i.e. this is "the" copilot). Also, knowing
| research, you would really be comparing against a random
| corpus of student answers, as it is usually students that
| would be participating in a study such as this.
|
| Nonetheless, we think that simply having a quantification of
| Copilot's outputs is useful, as it can definitely provide an
| indicator of how risky it might be to provide the tool to an
| inexperienced developer that might be tempted to accept every
| suggestion.
| laumars wrote:
| Rather than comparing against students in lab conditions,
| I'd be more interested to see it compare to students with
| access to Stack Overflow et al. vs students with access to
| just Copilot. I.e., is a junior developer more likely to
| trust bad suggestions found online vs bad suggestions made
| by Copilot?
| sdevonoes wrote:
| Junior engineers will trust whatever information is
| provided to them as long as it is easily accessible. The
| reason juniors consult Stack Overflow is that it is one
| Google search and one click away, whereas consulting the
| official documentation/reference takes more effort
| (because they usually don't appear on Google when one
| searches for errors/bugs/how-to). If Copilot (or another
| similar tool) is very well integrated in whatever IDE a
| junior is using, you can be sure it will be used and
| trusted because it will be faster than Google+SO.
| spywaregorilla wrote:
| Supposing a team was building a product without a rigorous
| security focus or experience. Do you have any reason to believe
| a Copilot-enabled team would produce more or less secure
| products?
| kiwih wrote:
| This is a difficult question to answer as one team might be
| very different from another team.
|
| However, (my opinion only follows) I think our paper shows
| that there is a danger for Copilot to suggest insecure code -
| and inexperienced / security non-aware developers may accept
| these suggestions without understanding the implications,
| whereas if they had to write the code from scratch then they
| might (?) not make the mistakes (as they need to put in more
| effort, meaning there might be a higher chance they stumble
| upon the right approach - e.g. if they ask an experienced
| developer for help).
| verdverm wrote:
| For non-Copilot sources, the words around code found on Stack
| Overflow or in a blog post may indicate that it lacks proper
| security, which would be a signal to a developer that they
| need to consider something further.
| waynesoftware wrote:
| Summary: CONCLUSIONS AND FUTURE WORK
|
| There is no question that next-generation 'auto-complete' tools
| like GitHub Copilot will increase the productivity of software
| developers. However, while Copilot can rapidly generate
| prodigious amounts of code, our conclusions reveal that
| developers should remain vigilant ('awake') when using Copilot as
| a co-pilot. Ideally, Copilot should be paired with appropriate
| security-aware tooling during both training and generation to
| minimize the risk of introducing security vulnerabilities. While
| our study provides new insights into its behavior in response to
| security-relevant scenarios, future work should investigate other
| aspects, including adversarial approaches for security-enhanced
| training.
| falcolas wrote:
| A "lead foot" on the software development gas pedal, with no
| attached safety systems that are activated by anybody but the
| driver.
| mistrial9 wrote:
| just wait until GitHub-Microsoft adds a fee to use the
| results for certain uses, and then constantly scans all your
| repos to find code that doesn't pay up
| toomuchtodo wrote:
| Copilot didn't worsen the appsec story, it just highlighted
| it. If you have devs who don't know how to write secure code,
| and/or you don't have security engineering support (internal
| or outsourced), you were already failing (or probably more
| apropos, walking the tightrope without a net).
|
| Was anyone checking the security of code copy-pasted from
| Stack Overflow? Hopefully this work gets fed back into
| Copilot, improving it, which improves the experience (and
| safety) for its users. Lots of folks are still writing code
| without copilot or security engineering knowledge.
| falcolas wrote:
| > If you have devs who don't know how to write secure code
|
| The problem with GHC is the developers are not writing the
| code - they're simply accepting what's being written for
| them, often in large quantities at a time.
|
| > don't have security engineering support
|
| Valuable, but my analogy was intended to point out that
| it's not inherent in the tooling.
|
| > Was anyone checking the security of code copy-pasted from
| Stack Overflow
|
| Yes, other users on Stack Overflow via comments and other
| answers. They're not perfect, but their checks and balances
| exist as a facet of that tool.
|
| > Hopefully this work gets fed back into Copilot
|
| Only if it's open source, and a large volume of it, to
| boot. In other words, I don't hold out hope that the security
| situation will be better anytime soon.
| mbesto wrote:
| > activated by anybody but the driver.
|
| Except this is precisely what the abstract is saying is a
| misuse of the system. You have the _option_ to give the
| driver the control.
|
| > Ideally, Copilot should be paired with appropriate
| security-aware tooling during both training and generation to
| minimize the risk of introducing security vulnerabilities.
|
| You're oversimplifying by assuming the purpose of Copilot is
| to generate a whole block of code at a time. Copilot is an
| 80/20 thing, while every developer on HN is pedantically
| assuming it's a 100/0 one.
| yodon wrote:
| tl;dr they tested GitHub Copilot against 89 risky coding
| scenarios and found about 40% of the roughly 1,700 sample
| implementations Copilot generated in the test were vulnerable
| (which makes sense given it's trained on public GitHub repos,
| many of which contain sample code that's a nightmare from a
| security perspective).
| smitop wrote:
| I've experimented a bit with this on the raw Codex model
| (https://smitop.com/post/codex/), and I've found that some prompt
| engineering can be helpful: explicitly telling the model to
| generate secure code in the prompt sometimes helps (such as by
| adding to the prompt something like "Here's a PHP script I wrote
| that follows security best practices"). Codex _knows_ how to
| write more secure code, but without the right prompting it tends
| to write insecure code (because it was trained on a lot of bad
| code).
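|
| (A hypothetical C flavour of that trick -- the leading comment
| is the "prompt engineering", and the function below is the kind
| of completion it nudges the model toward; the names here are my
| invention:)
|
|     /* parse_port.c -- this code follows security best
|      * practices. */
|     #include <stdlib.h>
|     #include <errno.h>
|
|     /* A well-steered completion validates its input rather
|      * than calling atoi() blindly: */
|     int parse_port(const char *s) {
|         char *end;
|         errno = 0;
|         long v = strtol(s, &end, 10);
|         if (errno != 0 || end == s || *end != '\0'
|             || v < 1 || v > 65535)
|             return -1;  /* reject malformed/out-of-range */
|         return (int)v;
|     }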
|
| > the settings and documentation as provided do not allow users
| to see what these are set to by default
|
| There isn't a single default value. Those parameters are chosen
| dynamically (on the client side): when doing more sampling with a
| higher top_p a higher temperature is used. I haven't tracked down
| where the top_p value is decided upon, but I _think_ it depends
| on the context: I believe explicitly requesting a completion
| causes a higher top_p and a more capable model (earhart), which
| gives better but slower results than the completions you get as
| autocomplete (which are from the cushman model with a lower
| top_p). Copilot doesn't use any server-side magic; all the
| Copilot servers do is replace the GitHub authentication token
| with an OpenAI API key and forward the request to the OpenAI API.
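|
| (To make the two knobs concrete, a toy temperature + top-p
| sampler over a made-up distribution -- purely illustrative,
| not Copilot's implementation:)
|
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <math.h>
|
|     int main(void) {
|         /* toy logits, already sorted high-to-low */
|         double logits[5] = {2.0, 1.5, 0.4, 0.1, -1.0};
|         double temperature = 0.8, top_p = 0.9;
|         double p[5], sum = 0.0, cum = 0.0;
|         int keep = 0;
|
|         /* temperature rescales logits before the softmax:
|          * lower temperature = sharper distribution */
|         for (int i = 0; i < 5; i++)
|             sum += p[i] = exp(logits[i] / temperature);
|         for (int i = 0; i < 5; i++)
|             p[i] /= sum;
|
|         /* top_p keeps the smallest set of most-likely tokens
|          * whose cumulative probability reaches top_p */
|         while (keep < 5 && cum < top_p)
|             cum += p[keep++];
|
|         /* sample among the kept tokens, renormalized */
|         double r = (double)rand() / RAND_MAX * cum, acc = 0.0;
|         for (int i = 0; i < keep; i++)
|             if ((acc += p[i]) >= r) {
|                 printf("sampled token %d\n", i);
|                 break;
|             }
|         return 0;
|     }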
| kiwih wrote:
| > I've found that some prompt engineering can be helpful:
| explicitly telling the model to generate secure code in the
| prompt sometimes helps.
|
| As noted in the diversity of prompt section, we did try a lot
| of different/reasonable changes to the prompt to see what
| would happen in our SQL injection scenario. In our case, asking
| it to make it secure actually made the output slightly worse
| (!), and the biggest bias towards making the code better was
| having other good code.
|
| > There isn't a single default value.
|
| That's what we also guess, but as you say, it's not written or
| documented anywhere.
| lbriner wrote:
| Surely AI can also be taught some boundary conditions like "thou
| shalt not build SQL from strings"?
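|
| (For instance, that boundary would steer completions from the
| first form below to the second -- a sketch against SQLite's C
| API; the table and column names are made up:)
|
|     #include <sqlite3.h>
|
|     /* Bad: query text assembled from an untrusted string.
|      * sprintf(q, "SELECT id FROM users WHERE name = '%s'",
|      *         name); */
|
|     /* Good: a parameterized query keeps the data out of the
|      * SQL grammar entirely. */
|     int find_user(sqlite3 *db, const char *name) {
|         sqlite3_stmt *stmt;
|         if (sqlite3_prepare_v2(db,
|                 "SELECT id FROM users WHERE name = ?",
|                 -1, &stmt, NULL) != SQLITE_OK)
|             return -1;
|         sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
|         int id = (sqlite3_step(stmt) == SQLITE_ROW)
|                      ? sqlite3_column_int(stmt, 0) : -1;
|         sqlite3_finalize(stmt);
|         return id;
|     }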
| fshbbdssbbgdd wrote:
| I think you could use linting tools that check for things like
| this and filter the output. Or use outputs that fail the lint
| as negative training examples.
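|
| (A toy version of that filter, using POSIX regex to reject
| completions that look like they build SQL from strings --
| real checkers like CodeQL are far more sophisticated:)
|
|     #include <regex.h>
|     #include <stdio.h>
|
|     /* returns 1 if the generated code matches the naive
|      * "string-built SQL" pattern, 0 otherwise */
|     int looks_injectable(const char *code) {
|         regex_t re;
|         if (regcomp(&re,
|                 "(sprintf|strcat)[^;]*(SELECT|INSERT|UPDATE)",
|                 REG_EXTENDED | REG_ICASE) != 0)
|             return -1;
|         int hit = regexec(&re, code, 0, NULL, 0) == 0;
|         regfree(&re);
|         return hit;
|     }
|
|     int main(void) {
|         const char *s = "sprintf(q, \"SELECT * FROM users "
|                         "WHERE name='%s'\", name);";
|         puts(looks_injectable(s) == 1
|                  ? "rejected: builds SQL from strings"
|                  : "passed the lint filter");
|         return 0;
|     }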
| bee_rider wrote:
| I don't know anything about Copilot's design, but surely they
| passed all the code they fed it in the training stage through
| some pretty strict linters, right? I mean that's just common
| sense...
___________________________________________________________________
(page generated 2021-08-23 23:00 UTC)