https://cacm.acm.org/research/measuring-github-copilots-impact-on-productivity/

Measuring GitHub Copilot's Impact on Productivity

A case study asks Copilot users about the tool's impact on their productivity, and seeks to find their perceptions mirrored in user data.

By Albert Ziegler, Eirini Kalliamvakou, X. Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian

Posted Feb 15 2024

Code-completion systems offering suggestions to a developer in their integrated development environment (IDE) have become the most frequently used kind of programmer assistance.^1 When generating whole snippets of code, they typically use a large language model (LLM) to predict what the user might type next (the completion) from the context of what they are working on at the moment (the prompt).^2 This system allows for completions at any position in the code, often spanning multiple lines at once.

Key Insights

* AI pair-programming tools such as GitHub Copilot have a big impact on developer productivity. This holds for developers of all skill levels, with junior developers seeing the largest gains.
* The reported benefits of receiving AI suggestions while coding span the full range of typically investigated aspects of productivity, such as task time, product quality, cognitive load, enjoyment, and learning.
* Perceived productivity gains are reflected in objective measurements of developer activity.
* While suggestion correctness is important, the driving factor for these improvements appears to be not correctness as such, but whether the suggestions are useful as a starting point for further development.

Potential benefits of generating large sections of code automatically are huge, but evaluating these systems is challenging.
Offline evaluation, where the system is shown a partial snippet of code and then asked to complete it, is difficult not least because for longer completions there are many acceptable alternatives and no straightforward mechanism for labeling them automatically.^5 An additional step taken by some researchers^3,21,29 is to use online evaluation and track the frequency of real users accepting suggestions, assuming that the more contributions a system makes to the developer's code, the higher its benefit. The validity of this assumption is not obvious when considering issues such as whether two short completions are more valuable than one long one, or whether reviewing suggestions can be detrimental to programming flow.

Code completion in IDEs using language models was first proposed in Hindle et al.,^9 and today neural synthesis tools such as GitHub Copilot, CodeWhisperer, and TabNine suggest code snippets within an IDE with the explicitly stated intention of increasing a user's productivity. Developer productivity has many aspects, and a recent study has shown that tools like these are helpful in ways that are only partially reflected by measures such as completion times for standardized tasks.^23,a Alternatively, we can leverage the developers themselves as expert assessors of their own productivity. This meshes well with current thinking in software engineering research, which suggests measuring productivity on multiple dimensions and using self-reported data.^6 Thus we focus on studying perceived productivity.

Here, we investigate whether usage measurements of developer interactions with GitHub Copilot can predict perceived productivity as reported by developers. We analyze 2,631 survey responses from developers using GitHub Copilot and match their responses to measurements collected from the IDE. We consider acceptance counts and more detailed measures of contribution, such as the amount of code contributed by GitHub Copilot and the persistence of accepted completions in the code. We find that the acceptance rate of shown suggestions is a better predictor of perceived productivity than the alternative measures. We also find that acceptance rate varies significantly over our developer population as well as over time, and present a deeper dive into some of these variations. Our results support the principle that acceptance rate can be used for coarse-grained monitoring of the performance of a neural code synthesis system: this ratio of shown suggestions that are accepted correlates with perceived productivity better than more detailed measures of contribution. However, other approaches remain necessary for fine-grained investigation due to the many human factors involved.

Background

Offline evaluation of code completion can have shortcomings even in tractable circumstances where completions can be labeled for correctness. For example, a study of 15,000 completions by 66 developers in Visual Studio found significant differences between synthetic benchmarks used for model evaluation and real-world usage.^7 The evaluation of context-aware API completion for Visual Studio IntelliCode considered Recall@5, the proportion of completions for which the correct method call was in the top five suggestions. This metric fell from 90% in offline evaluation to 70% when used online.^21

Figure 1. GitHub Copilot's code completion funnel.
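To make the offline metric mentioned above concrete, Recall@5 reduces to a simple counting exercise. The sketch below uses invented data and function names (nothing here comes from the IntelliCode evaluation or from this study); it only illustrates how such a metric is computed.

```python
# Hypothetical illustration of Recall@k: the fraction of examples whose ground-truth
# completion appears among the model's top-k ranked suggestions.

def recall_at_k(examples, k=5):
    """examples: list of (ground_truth, ranked_suggestions) pairs."""
    if not examples:
        return 0.0
    hits = sum(1 for truth, suggestions in examples if truth in suggestions[:k])
    return hits / len(examples)

# Toy benchmark: two of the three ground-truth method calls appear in the top five.
examples = [
    ("append", ["append", "extend", "insert", "add", "pop"]),
    ("split",  ["strip", "lower", "replace", "find", "join"]),
    ("items",  ["keys", "values", "items", "get", "update"]),
]
print(recall_at_k(examples))  # 0.666...
```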
Due to the diversity of potential solutions to a multi-line completion task, researchers have used software testing to evaluate the behavior of completions. Competitive programming sites have been used as a source of such data,^8,11 as well as handwritten programming problems.^5 Yet, it is unclear how well performance on programming competition data generalizes to interactive development in an IDE.

In this work, we define acceptance rate as the fraction of completions shown to the developer that are subsequently accepted for inclusion in the source file. The IntelliCode Compose system uses the term click-through rate (CTR) for this and reports a value of 10% in online trials.^20 An alternative measure is that of daily completions accepted per user (DCPU), for which a value of around 20 has been reported.^3,29 Comparing DCPU figures requires, of course, normalizing by the time spent coding each day. For context, in our study GitHub Copilot has an acceptance rate of 27% and a mean DCPU in excess of 312 (see Figure 1).^b These differences are presumably due to differences in the kinds of completion offered, or perhaps to user-interface choices. We discuss later how developer objectives, choice of programming language, and even time of day seem to affect our data. Such discrepancies highlight the difficulty in using acceptance rate to understand the value of a system.

There is some evidence that acceptance rate (and indeed correctness) might not tell the whole story. One survey of developers considered the use of AI to support translation between programming languages and found indications that developers tolerated, and in some cases valued, erroneous suggestions from the model.^26

Measuring developer productivity through activity counts over time (a typical definition of productivity borrowed from economics) disregards the complexity of software development, since such counts capture only a subset of developer outputs. A more holistic picture is formed by measuring perceived productivity through self-reported data across various dimensions^6 and supplementing it with automatically measured data.^4 We used the SPACE framework^6 to design a survey that captures self-reported productivity and paired the self-reported data with usage telemetry. To the best of our knowledge, this is the first study of code suggestion tools establishing a clear link between usage measurements and developer productivity or happiness. A previous study comparing GitHub Copilot against IntelliCode with 25 participants found no significant correlation between task completion times and survey responses.^22

Figure 2. Demographic composition of survey respondents.

Data and Methodology

Usage measurements. GitHub Copilot provides code completions using OpenAI language models. It runs within the IDE and at appropriate points sends a completion request to a cloud-hosted instance of the neural model. GitHub Copilot can generate completions at arbitrary points in code rather than, for example, only being triggered when a developer types a period for invoking a method on an object.
A variety of rules determine when to request a completion, when to abandon requests if the developer has moved on before the model is ready with a completion, and how much of the response from the model to surface as a completion. As stated in our terms of usage,^c the GitHub Copilot IDE extension records the events shown in Table 1 for all users. We make usage measurements for each developer by counting those events.

Table 1. Developer usage events collected by GitHub Copilot.

* opportunity: A heuristic-based determination by the IDE and the plug-in that a completion might be appropriate at this point in the code (for example, the cursor is not in the middle of a word)
* shown: Completion shown to the developer
* accepted: Completion accepted by the developer for inclusion in the source file
* accepted_char: The number of characters in an accepted completion
* mostly_unchanged_X: Completion persisting in source code with limited modifications (Levenshtein distance less than 33%) after X seconds, where we consider durations of 30, 120, 300, and 600 seconds
* unchanged_X: Completion persisting in source code unmodified after X seconds
* (active) hour: An hour during which the developer was using their IDE with the plug-in active

Our measures of persistence go further than existing work, which stops at acceptance. The intuition here is that a completion which is accepted into the source file but subsequently turns out to be incorrect can be considered to have wasted developer time both in reviewing it and in having to go back and delete it. We also record mostly unchanged completions: A large completion requiring only a few edits might still be a positive contribution. It is not clear how long after acceptance one should confirm persistence, so we consider a range of options. The events pertaining to completions form a funnel, which we show quantitatively in Figure 1. We include a summary of all data in Appendix A.^d (All appendices for this article can be found online at https://dl.acm.org/doi/10.1145/3633453.)

We normalize these measures against each other and write X_per_Y to indicate we have normalized metric X by metric Y. For example, accepted_per_hour is calculated as the total number of accepted events divided by the total number of (active) hour events. Table 2 defines the core set of metrics which we feel have a natural interpretation in this context. We note that there are other alternatives, and we incorporate these in our discussion where relevant.

Table 2. The core set of measurements considered in this paper.

* Shown rate: Ratio of completion opportunities that resulted in a completion being shown to the user
* Acceptance rate: Ratio of shown completions accepted by the user
* Persistence rate: Ratio of accepted completions unchanged after 30, 120, 300, and 600 seconds
* Fuzzy persistence rate: Ratio of accepted completions mostly unchanged after 30, 120, 300, and 600 seconds
* Efficiency: Ratio of completion opportunities that resulted in a completion accepted and unchanged after 30, 120, 300, and 600 seconds
* Contribution speed: Number of characters in accepted completions per distinct, active hour
* Acceptance frequency: Number of accepted completions per distinct, active hour
* Persistence frequency: Number of unchanged completions per distinct, active hour
* Total volume: Total number of completions shown to the user
* Loquaciousness: Number of shown completions per distinct, active hour
* Eagerness: Number of shown completions per opportunity
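To make the normalization concrete, here is a minimal sketch (not the authors' telemetry pipeline) of how the Table 2 ratios fall out of per-developer event counts named after the Table 1 events; the dataclass, fields, and function are illustrative assumptions only.

```python
# A minimal sketch, assuming per-developer totals of the Table 1 events.

from dataclasses import dataclass

@dataclass
class UsageCounts:
    opportunity: int    # "a completion might be appropriate here" determinations
    shown: int          # completions shown to the developer
    accepted: int       # completions accepted into the source file
    accepted_char: int  # total characters across accepted completions
    unchanged_30: int   # accepted completions still unmodified after 30 seconds
    active_hours: int   # distinct hours with the IDE plug-in active

def core_metrics(c: UsageCounts) -> dict[str, float]:
    return {
        "shown_rate":           c.shown / c.opportunity,         # shown per opportunity
        "acceptance_rate":      c.accepted / c.shown,            # accepted_per_shown
        "persistence_rate_30":  c.unchanged_30 / c.accepted,     # unchanged after 30s per accepted
        "efficiency_30":        c.unchanged_30 / c.opportunity,  # accepted and unchanged per opportunity
        "contribution_speed":   c.accepted_char / c.active_hours,
        "acceptance_frequency": c.accepted / c.active_hours,     # a DCPU-like rate, normalized by active hours
        "loquaciousness":       c.shown / c.active_hours,
    }
```

The fuzzy variants would use the mostly_unchanged_X counts in place of unchanged_X, and the other time windows (120, 300, 600 seconds) follow the same pattern.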
Productivity survey. To understand users' experience with GitHub Copilot, we emailed a link to an online survey to 17,420 users. These were participants of the unpaid technical preview using GitHub Copilot with their everyday programming tasks. The only selection criterion was having previously opted in to receive communications. A vast majority of survey users (more than 80%) filled out the survey within the first two days, on or before February 12, 2022. We therefore focus on data from the four-week period leading up to this point ("the study period"). We received a total of 2,047 responses we could match to usage data from the study period, the earliest on Feb. 10, 2022 and the latest on Mar. 6, 2022.

The survey contained multiple-choice questions regarding demographic information (see Figure 2) and Likert-style questions about different aspects of productivity, which were randomized in their order of appearance to the user. Figure 2 shows the demographic composition of our respondents. We note the significant proportion of professional programmers who responded.

The SPACE framework^6 defines five dimensions of productivity: Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. We use four of these (S, P, C, E), since self-reporting on (A) is generally considered inferior to direct measurement. We included 11 statements covering these four dimensions in addition to a single statement: "I am more productive when using GitHub Copilot." For each self-reported productivity measure, we encoded its five ordinal response values to numeric labels (1 = Strongly Disagree, ..., 5 = Strongly Agree). We include the full list of questions and their coding to the SPACE framework in Appendix C. For more information on the SPACE framework and how the empirical software engineering community has been discussing developer productivity, please see the following section.

Early in our analysis, we found that the usage metrics we describe in the Usage Measurements section corresponded similarly to each of the measured dimensions of productivity, and in turn these dimensions were highly correlated to each other (Figure 3). We therefore added an aggregate productivity score calculated as the mean of all 12 individual measures (excluding skipped questions). This serves as a rough proxy for the much more complex concept of productivity, facilitating recognition of overall trends, which may be less discernible on individual variables due to higher statistical variation.

Figure 3. Correlation between metrics. Metrics are ordered by similarity based on distance in the correlation matrix, except for manually fixing the aggregate productivity and acceptance rate at the end for visibility.

The full dataset of these aggregate productivity scores together with the usage measurements considered in this article is available at https://github.com/wunderalbert/prod-neural-materials.
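A minimal sketch of that coding step follows, with illustrative column names and an assumed label for the neutral Likert option (the article specifies only the endpoints of the scale); this is not the published analysis code.

```python
# Coding Likert responses 1-5 and averaging them into an aggregate productivity score,
# skipping unanswered questions. Column names and the "Neutral" label are assumptions.

import pandas as pd

LIKERT = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,   # assumed label for the middle option
    "Agree": 4,
    "Strongly Agree": 5,
}

def aggregate_productivity(responses: pd.DataFrame, question_cols: list[str]) -> pd.Series:
    """Mean of the coded Likert items per respondent; skipped questions (NaN) are excluded."""
    coded = responses[question_cols].apply(lambda col: col.map(LIKERT))
    return coded.mean(axis=1, skipna=True)
```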
Given that it has proved impossible to produce a unified definition or metric for developer productivity, there have been attempts to synthesize the factors that affect productivity, describe it holistically, and treat developer productivity as a composite measure.^17,19,24 In addition, organizations often use their own multidimensional frameworks to operationalize productivity, reflecting their engineering goals--for example, Google uses the QUANTS framework, with five components of productivity.^27 In this article, we use the SPACE framework,^6 which builds on a synthesis of extensive and diverse literature by expert researchers and practitioners in the area of developer productivity. SPACE is an acronym of the five dimensions of productivity:

* S (Satisfaction and well-being): This dimension is meant to reflect developers' fulfillment with the work they do and the tools they use, as well as how healthy and happy they are with the work they do. This dimension reflects some of the easy-to-overlook trade-offs involved when looking exclusively at velocity acceleration (for example, when we target faster turnaround of code reviews without considering workload impact or burnout for developers).
* P (Performance): This dimension aims to quantify outcomes rather than output. Example metrics that capture performance relate to quality and reliability, as well as further-removed metrics such as customer adoption or satisfaction.
* A (Activity): This is the count of outputs--for example, the number of pull requests closed by a developer. As a result, this is a dimension that is best quantified via system data. Given the variety of developers' activities as part of their work, it is important that the activity dimension accounts for more than coding activity--for instance, writing documentation, creating design specs, and so on.
* C (Communication and collaboration): This dimension aims to capture that modern software development happens in teams and is, therefore, impacted by the discoverability of documentation, the speed of answering questions, and the onboarding time and process of new team members.
* E (Efficiency and flow): This dimension reflects the ability to complete work or make progress with little interruption or delay. It is important to note that delays and interruptions can be caused either by systems or humans, and it is best to monitor both self-reported and observed measurements--for example, self-reports of the ability to do uninterrupted work, as well as measured wait times in engineering systems.

Developer Productivity and the SPACE Framework

Developer productivity has been a controversial topic in software engineering research over the years. We point readers to excellent presentations of the existing discourse in the community in Meyer et al.^12 and Murphy-Hill et al.^15; however, we summarize the key points of discussion below:

* Inspired by economics definitions of productivity as output per unit of input, some research has defined developer productivity in the same terms--for example, numbers of lines of code per day, function points per sprint, and so on. However, such measures are not connected to goals (for instance, it is not the goal of a developer to write the most lines of code), they may motivate developers to game the system, they do not account for the quality of the output, and they are in tension with other metrics (for example, a higher number of commits or PRs will create a higher need for code reviews).
* Observational studies of developers reveal that developers spend more than half their working day on activities other than coding.^13 Given this, the view of developer productivity as inputs and outputs, or using metrics that strictly focus on coding, ignores the reality of the work developers do.
* In addition, developers' perspective on what affects their productivity^12 and what metrics might reflect it^14 differs from the inputs/outputs view. When asked when they are productive and how they measure productivity, developers do not cite lines of code or function points per sprint, but rather completing tasks, being free of interruptions, usefulness of their work, success of the feature they worked on, and more.
* To sum up, after many studies and many definitions, measurements, and approaches to productivity, the empirical software engineering research community has concluded that developer productivity is a multidimensional topic that cannot be summarized by a single metric.^10 Both objective and subjective approaches to measurement have been tried, leading to the conclusion that both have advantages and disadvantages.

What Drives Perceived Productivity?

To examine the relationship between objective measurements of user behavior and self-reported perceptions of productivity, we used our set of core usage measurements (Table 2). We then calculated Pearson's r correlation coefficient and the corresponding p-value of the F-statistic between each pair of usage measurement and perceived-productivity metric. We also computed a PLS regression from all usage measurements jointly. We summarize these results in Figure 3, showing the correlation coefficients between all measures and survey questions. The full table of all results is included in Appendix B, available online.

We find that acceptance rate (accepted_per_shown) is the strongest positive predictor of users' perception of productivity, although, given the confounding and human factors involved, there is still notable unexplained variance. Of all usage measurements, acceptance rate correlates best with aggregate productivity (r = 0.24, P < 0.0001). This measurement is also the best performing for at least one survey question in each of the SPACE dimensions. While this correlation is established with high confidence, it leaves considerable unexplained variance. Later, we explore improvements from combining multiple usage measurements together.

Looking at the more detailed metrics around persistence, we see that correlation is generally stronger over shorter time periods than over longer ones. This is intuitive in the sense that shorter periods move the measure closer to acceptance rate. We also expect that at some point after acceptance the completion simply becomes part of the code, so any changes (or not) after that point will not be attributed to GitHub Copilot. All persistence measures were less well correlated than acceptance rate.

To assess the different metrics in a single model, we ran a regression using projection on latent structures (PLS). We chose PLS, which captures the common variation of these variables insofar as it is linearly connected to aggregate productivity,^28 because of the high collinearity of the individual metrics. The first component, to which every metric under consideration contributes positively, explains 43.2% of the variance. The second component captures the acceptance rate/change rate dichotomy; it explains a further 13.1%. Both draw most strongly from acceptance rate.
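For readers who want to reproduce this kind of analysis on the published dataset, a hedged sketch of the two steps just described follows; the column names are assumptions about the data, and this is not the authors' code, just the standard SciPy and scikit-learn calls.

```python
# Pearson correlation of each usage metric with the aggregate productivity score,
# plus a two-component PLS regression over all metrics jointly.

import pandas as pd
from scipy.stats import pearsonr
from sklearn.cross_decomposition import PLSRegression

def correlations_and_pls(df: pd.DataFrame, metric_cols: list[str],
                         target: str = "aggregate_productivity"):
    # Pearson's r (and p-value) between each usage metric and aggregate productivity.
    per_metric = {m: pearsonr(df[m], df[target]) for m in metric_cols}
    # PLS handles the high collinearity among the usage metrics.
    pls = PLSRegression(n_components=2)
    pls.fit(df[metric_cols], df[target])
    return per_metric, pls
```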
This strongly points to acceptance rate being the most immediate indicator of perceived productivity, although it is beneficial to combine it with others to get a fuller picture.

Experience

To understand how different types of developers interact with Copilot, our survey asked respondents to self-report their level of experience in two ways:

* "Think of the language you have used the most with Copilot. How proficient are you in that language?" with options "Beginner", "Intermediate", and "Advanced".
* "Which best describes your programming experience?" with options starting with "Student" and ranging from "0-2 years" to "16+ years" in two-year intervals.

We compute correlations with productivity metrics for both experience variables and include these two variables as covariates in a multivariate regression analysis. We find that both are negatively correlated with our aggregate productivity measure (proficiency: r = -0.095, P = 0.0001; years of experience: r = -0.161, P < 0.0001). However, in multivariate regressions predicting productivity from usage metrics while controlling for demographics, proficiency had a non-significant positive effect (coeff = 0.021, P = 0.213), while years of experience had a non-significant negative effect (coeff = -0.032, P = 0.122).

Looking further at individual measures of productivity (Table 3), we find that both language proficiency and years of experience negatively predict developers agreeing that Copilot helps them write better code. However, proficiency positively predicts developers agreeing that Copilot helps them stay in the flow, focus on more satisfying work, spend less effort on repetitive tasks, and perform repetitive tasks faster. Years of experience negatively predicts developers feeling less frustrated in coding sessions and performing repetitive tasks faster while using Copilot, but positively predicts developers making progress faster when working in an unfamiliar language. These findings suggest that experienced developers who are already highly skilled are less likely to write better code with Copilot, but Copilot can assist their productivity in other ways, particularly when engaging with new areas and automating routine work.

Table 3. Effects of experience on facets of productivity where the result of a linear regression was a statistically significant covariate (*: p < 0.05, **: p < 0.01, ***: p < 0.001).

* proficiency, better_code: -0.061 *
* proficiency, stay_in_flow: 0.069 *
* proficiency, focus_satisfying: 0.067 *
* proficiency, less_effort_repetitive: 0.072 **
* proficiency, repetitive_faster: 0.055 ***
* years, better_code: -0.087 *
* years, less_frustrated: -0.103 **
* years, repetitive_faster: -0.054 *
* years, unfamiliar_progress: 0.081 *

Table 4. Correlations of acceptance rate with aggregate productivity broken down by subgroup (significance levels as in Table 3).

* none: coeff 0.135 *, n = 344
* <= 2 years: coeff 0.178 **, n = 451
* 3-5 years: coeff 0.255 ***, n = 358
* 6-10 years: coeff 0.265 ***, n = 251
* 11-15 years: coeff 0.171 *, n = 162
* >= 16 years: coeff 0.153 *, n = 214
* JavaScript: coeff 0.227 ***, n = 1184
* TypeScript: coeff 0.165 ***, n = 654
* Python: coeff 0.172 ***, n = 716
* other: coeff 0.178 ***, n = 1829

Junior developers not only report higher productivity gains; they also tend to accept more suggestions.
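The subgroup breakdown in Table 4 and Figure 5 is, at its core, a per-group correlation. A rough sketch under assumed column names (not the authors' code) might look like this:

```python
# Correlate acceptance rate with aggregate productivity within each experience or
# language subgroup. Column names are assumptions, not the dataset's actual schema.

import pandas as pd
from scipy.stats import pearsonr

def subgroup_correlations(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    rows = []
    for group, sub in df.groupby(group_col):
        r, p = pearsonr(sub["acceptance_rate"], sub["aggregate_productivity"])
        rows.append({"subgroup": group, "r": r, "p": p, "n": len(sub)})
    return pd.DataFrame(rows)
```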
However, the connection observed in the section "What Drives Perceived Productivity?" is not solely due to differing experience levels. In fact, the connection persists in every single experience group, as shown in Figure 5.

Figure 4. Different metrics clustering in latent structures predicting perceived productivity. We color the following groups: flawless suggestions (counting the number of unchanged suggestions), persistence rate (ratio of accepted suggestions that are unchanged), and fuzzy persistence rate (accepted suggestions that are mostly unchanged).

Figure 5. Linear regressions between acceptance rate and aggregate productivity by subgroup, defined through years of professional experience or programming language use. Dashed lines denote averages. The x-axis is clipped at (0, 0.5); 95% of respondents fall into that range.

Variation over Time

Its connection to perceived productivity motivates a closer look at acceptance rate and the factors that influence it. Acceptance rate typically increases across the board when the model or the underlying prompt-crafting techniques are improved. But even when these conditions are held constant (the study period did not see changes to either), finer-grained temporal patterns emerge. For consistency in the cultural connotations of time of day and weekday, all data in this section was restricted to users from the U.S. (whether they took the survey or not). We used the same time frame as for the investigation in the previous section. In the absence of more fine-grained geolocation, we used a single time zone (Pacific Standard Time) to interpret timestamps and day boundaries, recognizing that this will introduce some noise due to the inhomogeneity of U.S. time zones.

Nevertheless, we observe strong regular patterns in overall acceptance rate (Figure 6).

Figure 6. Average acceptance rate during the week. Each point represents the average for a one-hour period, whereas the shaded ribbon shows the min-max variation during the observed four-week period.

These lead us to distinguish three different time regimes, all of which are statistically significantly distinct at p < 0.001% (using bootstrap resampling):

* The weekend: Saturdays and Sundays, where the average acceptance rate is comparatively high at 23.5%.
* Typical non-working hours during the week: evenings after 4:00 pm PST until 7:00 am PST the next morning, where the average acceptance rate is also rather high at 23%.
* Typical working hours during the week, from 7:00 am PST to 4:00 pm PST, where the average acceptance rate is much lower at 21.2%.
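A hedged sketch of the kind of bootstrap comparison alluded to above, checking whether the acceptance-rate gap between two time regimes survives resampling; the variable names and the decision rule are illustrative assumptions, not the authors' procedure.

```python
# Resample per-completion accept/reject flags within each regime and check how often
# the acceptance-rate gap between the regimes changes sign.

import numpy as np

def bootstrap_gap(accepts_a, accepts_b, n_boot=10_000, seed=0):
    """accepts_a, accepts_b: arrays of 0/1 flags, one per shown completion, for two regimes."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(accepts_a), np.asarray(accepts_b)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        gaps[i] = (rng.choice(a, size=a.size, replace=True).mean()
                   - rng.choice(b, size=b.size, replace=True).mean())
    observed = a.mean() - b.mean()
    # Fraction of resampled gaps whose sign disagrees with the observed gap: a rough
    # stability check on the difference between the two regimes.
    return observed, float((np.sign(gaps) != np.sign(observed)).mean())
```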
Conclusions

When we set out to connect the productivity benefit of GitHub Copilot to usage measurements from developer activity, we collected measurements about acceptance of completions in line with prior work, but also developed persistence metrics, which arguably capture sustained and direct impact on the resulting code. We were surprised to find acceptance rate (number of acceptances normalized by the number of shown completions) to be better correlated with reported productivity than our measures of persistence.

In hindsight, this makes sense. Coding is not typing, and GitHub Copilot's central value lies not in being the way the user enters most of their code. Instead, it lies in helping the user make the best progress toward their goals. A suggestion that serves as a useful template to tinker with may be as good as or better than a perfectly correct (but obvious) line of code that only saves the user a few keystrokes.

This suggests that a narrow focus on the correctness of suggestions would not tell the whole story for these kinds of tooling. Instead, one could view code suggestions inside an IDE as more akin to a conversation. While chatbots such as ChatGPT are already used for programming tasks, they are explicitly structured as conversations. Here, we hypothesize that interactions with Copilot, which is not a chatbot, share many characteristics with natural-language conversations. We see anecdotal evidence of this in comments posted about GitHub Copilot online (see Appendix E for examples) in which users talk about sequences of interactions. A conversation turn in this context consists of the prompt in the completion request and the reply as the completion itself. The developer's response to the completion arises from the subsequent changes, which are incorporated in the next prompt to the model. There are clear programming parallels to factors such as specificity and repetition that have been identified as affecting human judgments of conversation quality.^18 Researchers have already investigated the benefits of natural-language feedback to guide program synthesis,^2 so the conversational framing of coding completions is not a radical proposal. But neither is it one we have seen followed yet.

References

1. Amann, S., Proksch, S., Nadi, S., and Mezini, M. A study of Visual Studio usage in practice. In IEEE 23rd Intern. Conf. on Software Analysis, Evolution, and Reengineering 1. IEEE Computer Society, (March 2016), 124-134; 10.1109/SANER.2016.39
2. Austin, J. et al. Program synthesis with large language models. CoRR abs/2108.07732 (2021); https://arxiv.org/abs/2108.07732
3. Ari Aye, G., Kim, S., and Li, H. Learning autocompletion from real-world datasets. In Proceedings of the 43rd IEEE/ACM Intern. Conf. on Software Engineering: Software Engineering in Practice, (May 2021), 131-139; 10.1109/ICSE-SEIP52600.2021.00022
4. Beller, M., Orgovan, V., Buja, S., and Zimmermann, T. Mind the gap: On the relationship between automatically measured and self-reported productivity. IEEE Software 38, 5 (2020), 24-31.
5. Chen, M. et al. Evaluating large language models trained on code. CoRR abs/2107.03374 (2021); https://arxiv.org/abs/2107.03374
6. Forsgren, N. et al. The SPACE of developer productivity: There's more to it than you think. Queue 19, 1 (2021), 20-48.
7. Hellendoorn, V.J., Proksch, S., Gall, H.C., and Bacchelli, A. When code completion fails: A case study on real-world completions. In Proceedings of the 41st Intern. Conf. on Software Engineering, J.M. Atlee, T. Bultan, and J. Whittle (eds). IEEE/ACM, (May 2019), 960-970; 10.1109/ICSE.2019.00101
8. Hendrycks, D. et al. Measuring coding challenge competence with APPS. CoRR abs/2105.09938 (2021); https://arxiv.org/abs/2105.09938
9. Hindle, A. et al. On the naturalness of software. In 34th Intern. Conf. on Software Engineering, M. Glinz, G.C. Murphy, and M. Pezze (eds). IEEE Computer Society, (June 2012), 837-847; 10.1109/ICSE.2012.6227135
10. Jaspan, C. and Sadowski, C. No single metric captures productivity. Rethinking Productivity in Software Engineering, (2019), 13-20.
11. Kulal, S. et al. SPoC: Search-based pseudocode to code. In Proceedings of Advances in Neural Information Processing Systems 32, H.M. Wallach et al. (eds), (Dec. 2019), 11883-11894; https://bit.ly/3H7YLtF
12. Meyer, A.N., Barr, E.T., Bird, C., and Zimmermann, T. Today was a good day: The daily life of software developers. IEEE Transactions on Software Engineering 47, 5 (2019), 863-880.
13. Meyer, A.N. et al. The work life of developers: Activities, switches and perceived productivity. IEEE Transactions on Software Engineering 43, 12 (2017), 1178-1193.
14. Meyer, A.N., Fritz, T., Murphy, G.C., and Zimmermann, T. Software developers' perceptions of productivity. In Proceedings of the 22nd ACM SIGSOFT Intern. Symp. on Foundations of Software Engineering (2014), 19-29.
15. Murphy-Hill, E. et al. What predicts software developers' productivity? IEEE Transactions on Software Engineering 47, 3 (2019), 582-594.
16. Peng, S., Kalliamvakou, E., Cihon, P., and Demirer, M. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv:2302.06590 [cs.SE] (2023).
17. Ramirez, Y.W. and Nembhard, D.A. Measuring knowledge worker productivity: A taxonomy. J. of Intellectual Capital 5, 4 (2004), 602-628.
18. See, A., Roller, S., Kiela, D., and Weston, J. What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies 1, J. Burstein, C. Doran, and T. Solorio (eds). Assoc. for Computational Linguistics, (June 2019), 1702-1723; 10.18653/v1/n19-1170
19. Storey, M. et al. Towards a theory of software developer job satisfaction and perceived productivity. IEEE Transactions on Software Engineering 47, 10 (2019), 2125-2142.
20. Svyatkovskiy, A., Deng, S.K., Fu, S., and Sundaresan, N. IntelliCode Compose: Code generation using transformer. In Proceedings of the 28th ACM Joint European Software Eng. Conf. and Symp. on the Foundations of Software Eng., P. Devanbu, M.B. Cohen, and T. Zimmermann (eds). ACM, (Nov. 2020), 1433-1443; 10.1145/3368089.3417058
21. Svyatkovskiy, A. et al. Fast and memory-efficient neural code completion. In Proceedings of the 18th IEEE/ACM Intern. Conf. on Mining Software Repositories, (May 2021), 329-340; 10.1109/MSR52588.2021.00045
22. Vaithilingam, P., Zhang, T., and Glassman, E. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Proceedings of the 2022 Conf. on Human Factors in Computing Systems.
23. Vaithilingam, P., Zhang, T., and Glassman, E.L. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Proceedings of the CHI Conf. on Human Factors in Computing Systems, Association for Computing Machinery (2022), Article 332, 7 pages; 10.1145/3491101.3519665
24. Wagner, S. and Ruhe, M. A systematic review of productivity factors in software development. arXiv preprint arXiv:1801.06475 (2018).
25. Wang, D. et al. From human-human collaboration to human-AI collaboration: Designing AI systems that can work together with people. In Proceedings of the 2020 CHI Conf. on Human Factors in Computing Systems (2020), 1-6.
26. Weisz, J.D. et al. Perfection not required? Human-AI partnerships in code translation. In Proceedings of the 26th Intern. Conf. on Intelligent User Interfaces, T. Hammond et al. (eds). ACM, (April 2021), 402-412; 10.1145/3397481.3450656
27. Winters, T., Manshreck, T., and Wright, H. Software Engineering at Google: Lessons Learned from Programming Over Time. O'Reilly Media (2020).
28. Wold, S., Sjöström, M., and Eriksson, L. PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58, 2 (2001), 109-130; 10.1016/S0169-7439(01)00155-1
29. Zhou, W., Kim, S., Murali, V., and Ari Aye, G. Improving code autocompletion with transfer learning. CoRR abs/2105.05991 (2021); https://arxiv.org/abs/2105.05991

Footnotes

a. Nevertheless, such completion times are greatly reduced in many settings, often by more than half.^16
b. Note that these values are specific to the time period in our study (early 2022). Both figures have since increased because of model and prompting improvements.
c. See https://bit.ly/3S7oqZV.
d. Appendices can be found in the arXiv version: https://arxiv.org/pdf/2205.06537.pdf.

About the Authors

Albert Ziegler is a principal researcher at GitHub, Inc., San Francisco, USA.
Eirini Kalliamvakou is a staff researcher at GitHub, Inc., San Francisco, USA.
X. Alice Li is a staff researcher for Machine Learning at GitHub, San Francisco, USA.
Andrew Rice is a principal researcher at GitHub, Inc., San Francisco, USA.
Devon Rifkin is a principal research engineer at GitHub, Inc., San Francisco, USA.
Shawn Simister is a staff software engineer at GitHub, Inc., San Francisco, USA.
Ganesh Sittampalam is a principal software engineer at GitHub, Inc., San Francisco, USA.
Edward Aftandilian is a principal researcher at GitHub, Inc., San Francisco, USA.

This work is licensed under a Creative Commons Attribution International 4.0 License. DOI 10.1145/3633453. Communications of the ACM, March 2024, Vol. 67 No. 3, Pages 54-63.