https://docs.github.com/en/github/copilot/research-recitation
Research recitation

A first look at rote learning in GitHub Copilot suggestions.

By: Albert Ziegler (@wunderalbert)

GitHub Copilot: Parrot or Crow?

Introduction

GitHub Copilot is trained on billions of lines of public code.
The suggestions it makes to you are adapted to your code, but the processing behind it is ultimately informed by code written by others. How direct is the relationship between the suggested code and the code that informed it?

In a recent thought-provoking paper^1, Bender, Gebru et al. coined the phrase "stochastic parrots" for artificial intelligence systems like the ones that power GitHub Copilot. Or as a fellow machine learning engineer at GitHub^2 remarked during a water cooler chat: these systems can feel like "a toddler with a photographic memory."

These are deliberate oversimplifications. Many GitHub Copilot suggestions feel pretty specifically tailored to the particular code base the user is working on. Often, it looks less like a parrot and more like a crow building novel tools out of small blocks^3. But there's no denying that GitHub Copilot has an impressive memory:

[Figure: A movie demonstration of Copilot]

Here, I intentionally directed^4 GitHub Copilot to recite a well-known text it obviously knows by heart. I, too, know a couple of texts by heart. For example, I still remember some poems I learnt in school. Yet no matter the topic, not once have I been tempted to derail a conversation by falling into iambic tetrameter and waxing about daffodils.

So is that (or rather the coding equivalent of it) something GitHub Copilot is prone to doing? How many of its suggestions are unique, and how often does it just parrot some likely-looking code it has seen during training?

The Experiment

During GitHub Copilot's early development, nearly 300 employees used it in their daily work as part of an internal trial. This trial provided a good dataset to test for recitation. I wanted to find out how often GitHub Copilot gave them a suggestion that was quoted from something it had seen before.

I limited the investigation to Python suggestions with a cutoff on May 7, 2021 (the day we started extracting that data). That left 453,780 suggestions spread out over 396 "user weeks", i.e.
calendar weeks during which a user actively used GitHub Copilot on Python code.

Automatic Filtering

453,780 suggestions are a lot, but many of them can be dismissed immediately. To get to the interesting cases, consider sequences of "words" that occur in the suggestion in the same order as in the code GitHub Copilot has been trained on. In this context, punctuation, brackets, or other special characters all count as "words", while tabs, spaces or even line breaks are ignored completely. After all, a quote is still a quote, whether it's indented by 1 tab or 8 spaces.

For example, one of GitHub Copilot's suggestions was the following regex for numbers separated by whitespace:

r'^\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+'

This would be exactly 100 "words" in the sense above, but it's a particularly dense example: the average non-empty line of code has only 10 "words." I've restricted this investigation to cases where the overlap with the code GitHub Copilot was trained on contains at least 60 such "words". We have to set the cut somewhere, and I think it's rather rare that shorter sequences would be of great interest. In fact, most of the interesting cases identified later are well clear of that threshold of 60.

If the overlap extends to what the user has already written, that also counts for the length. After all, the user may have written that context with the help of GitHub Copilot as well!

In the following example, the user has started writing a very common snippet. GitHub Copilot completes it. Even though the completion itself is rather short, together with the already existing code it clears the threshold and is retained.

[Figure: Example code]

This procedure is permissive enough to let many relatively "boring" examples through, like the two above. But it's still effective at focusing the human analysis on the interesting cases, filtering out over 99% of Copilot suggestions.
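The filtering rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the study: the tokenizer regex, the function names (to_words, longest_shared_run, passes_filter) and the idea of comparing against a single known_code string are all assumptions made for the example. Only the definition of "words" and the threshold of 60 come from the text.

```python
import re

MIN_OVERLAP_WORDS = 60  # threshold stated in the article


def to_words(code):
    """Split code into "words" (assumed tokenizer, not the study's own).

    A run of letters, digits or underscores is one word, every other
    non-whitespace character is a word by itself, whitespace is dropped.
    """
    return re.findall(r"\w+|[^\w\s]", code)


def longest_shared_run(context, suggestion, known_code):
    """Length of the longest word run shared with known_code.

    The run must end inside the suggestion, but it may start in the
    context the user has already typed, so the context counts for length.
    """
    ctx, sug, ref = to_words(context), to_words(suggestion), to_words(known_code)
    combined = ctx + sug
    best = 0
    prev = [0] * (len(ref) + 1)  # common-suffix lengths for the previous word
    for i, word in enumerate(combined, start=1):
        cur = [0] * (len(ref) + 1)
        for j, ref_word in enumerate(ref, start=1):
            if word == ref_word:
                cur[j] = prev[j - 1] + 1
                # Only runs whose last word lies in the suggestion count.
                if i > len(ctx) and cur[j] > best:
                    best = cur[j]
        prev = cur
    return best


def passes_filter(context, suggestion, known_code):
    """True if the overlap is long enough to merit manual review."""
    return longest_shared_run(context, suggestion, known_code) >= MIN_OVERLAP_WORDS


# The regex from the text, rebuilt from its repeating unit for counting.
dense_regex = "r'^" + r"\s+\d+" * 16 + "'"
print(len(to_words(dense_regex)))
```

A real filter would have to search an index over the whole training set rather than compare against one string, which this quadratic comparison does not attempt.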
Manual Bucketing

After filtering, there were 473 suggestions left. But they came in very different forms:

1. Some were basically just repeats of another case that passed filtering. For example, sometimes GitHub Copilot makes a suggestion, the developer types a comment line, and GitHub Copilot offers a very similar suggestion again. I removed these cases from the analysis as duplicates.

2. Some were long, repetitive sequences. Like the following example, where the repeated blocks of '<p>' are of course found somewhere in the training set:

[Figure: Example repetitions]

Such suggestions can be helpful (test cases, regexes) or not helpful (like this case, I suspect). But in any case, they do not fit the idea of rote learning I had in mind when I started this investigation.

3. Some were standard inventories, like the natural numbers, or the prime numbers, or stock market tickers, or the Greek alphabet:

[Figure: Example of Greek alphabet]

4. Some were common, straightforward ways, perhaps even universal ways, of doing things with very few natural degrees of freedom. For example, the middle part of the following strikes me as very much the standard way of using the BeautifulSoup package to parse a Wikipedia list. In fact, the best matching snippet found in GitHub Copilot's training data^5 uses such code to parse a different article and goes on to do different things with the results.

[Figure: Example of Beautiful Soup]

This doesn't fit my idea of a quote either. It's a bit like when someone says "I'm taking out the trash; I'll be back soon": that's a matter-of-fact statement, not a quote, even though that particular phrase has been uttered many times before.

5. And then there are all other cases. Those with at least some specific overlap in either code or comments. These are what interest me most, and what I'm going to concentrate on from now on.

This bucketing necessarily has some edge cases^6, and your mileage may vary in how you think they should be classified. Maybe you even disagree with the whole set of buckets in the first place.

That's why we've open-sourced that dataset^7. So if you feel a bit differently about the bucketing, or if you're interested in other aspects of GitHub Copilot parroting its training set, you're very welcome to just ignore my next section and draw your own conclusions.

Results

[Figure: Overview Plot]

For most of GitHub Copilot's suggestions, our automatic filter didn't find any significant overlap with the code used for training.
But it did bring 473 cases to our attention. Removing the first bucket (cases that look very similar to other cases) left me with 185 suggestions. Of these, 144 were sorted into buckets 2-4. This left 41 cases in the last bucket, the "recitations", in the meaning of the term I have in mind.

That corresponds to 1 recitation event every 10 user weeks (95% confidence interval: 7-13 weeks, using a Poisson test).

Of course, this was measured on the GitHub and Microsoft developers who tried out GitHub Copilot. If your coding behaviour is very different from theirs, your results might differ. In particular, some of these developers work only part-time on Python projects. I could not distinguish that, and so counted everyone who writes some Python in a given week as a user.

1 event in 10 weeks doesn't sound like a lot, but it's not 0 either. And I found three things that struck me.

GitHub Copilot quotes when it lacks specific context

If I want to learn the lyrics to a song, I have to listen to it many times. GitHub Copilot is no different: to learn a snippet of code by heart, it must see that snippet a lot. Each file is only shown to GitHub Copilot once, so the snippet needs to exist in many different files in public code.

Of the 41 main cases we singled out during manual labelling, none appear in fewer than 10 different files. Most (35 cases) appear over a hundred times. Once, GitHub Copilot suggested starting an empty file with something it had seen more than a whopping 700,000 different times during training: the GNU General Public License.

The plot below shows the number of matched files of the results in bucket 5 (one red mark on the bottom for each result) versus buckets 2-4. I left out bucket 1, which is really just a mix of duplicates of bucket 2-4 cases and duplicates of bucket 5 cases. The inferred distribution is displayed as a red line; it peaks between 100 and 1000 matches.
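As a worked check of the recitation rate reported earlier in this section (41 events in 396 user weeks), here is a minimal sketch that computes an exact Poisson confidence interval using only the standard library. The text says only that "a Poisson test" was used, so the exact (Garwood-style) construction below is an assumption, and the helper names (poisson_cdf, solve_decreasing, exact_poisson_interval) are made up for this example.

```python
import math

EVENTS = 41       # recitations in bucket 5, from the article
USER_WEEKS = 396  # observation time, from the article


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), by direct summation."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        total += term
    return min(total, 1.0)


def solve_decreasing(f, target, lo, hi, iters=100):
    """Bisection for f(x) == target, where f is decreasing on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


def exact_poisson_interval(count, alpha=0.05):
    """Two-sided exact interval for the Poisson mean, given one count."""
    bracket = 4 * count + 20  # generous upper bound for the search
    if count == 0:
        lower = 0.0
    else:
        # Smallest mean for which seeing >= count events is not too unlikely.
        lower = solve_decreasing(
            lambda lam: poisson_cdf(count - 1, lam), 1 - alpha / 2, 0.0, bracket
        )
    # Largest mean for which seeing <= count events is not too unlikely.
    upper = solve_decreasing(
        lambda lam: poisson_cdf(count, lam), alpha / 2, 0.0, bracket
    )
    return lower, upper


low, high = exact_poisson_interval(EVENTS)
print("weeks per event:", USER_WEEKS / EVENTS)
print("95% interval in weeks:", USER_WEEKS / high, "to", USER_WEEKS / low)
```

Dividing the observation time by the bounds on the expected count turns the interval on events into an interval on weeks per event, which is the form quoted in the text.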
[Figure: Number of Matches Plot]

GitHub Copilot mostly quotes in generic contexts

As time goes on, each file becomes unique. But GitHub Copilot doesn't wait for that^8: it will offer its solutions while your file is still extremely generic. And in the absence of anything specific to go on, it's much more likely to quote from somewhere else than it would be otherwise.

[Figure: Context Length Plot]

Of course, software developers spend most of their time deep inside the files, where the context is unique enough that GitHub Copilot will offer unique suggestions. In contrast, the suggestions at the beginning are rather hit-and-miss, since GitHub Copilot cannot know what the program will be. But sometimes, especially in toy projects or standalone scripts, a modest amount of context can be enough to hazard a reasonable guess of what the user wanted to do. And sometimes it's still generic enough that GitHub Copilot thinks one of the solutions it knows by heart looks promising:

[Figure: Example code]

This is pretty much directly taken from coursework for a robotics class uploaded in different variations^9.

Detection is only as good as the tool that does the detecting

In its current form, the filter will turn up a good number of uninteresting cases when applied broadly. But it should still not produce too much noise. For the internal users in the experiment, it would have been a bit more than one find per week on average (albeit likely in bursts!). Of these, about 17% (95% confidence interval using a binomial test: 14%-21%) would be in the fifth bucket.

And of course nothing is ever foolproof, so this too can be tricked. Some cases are rather hard to detect by the tool we're building, but still have an obvious source.
To return to the Zen of Python:

[Figure: Zen Variation]

Conclusion and Next Steps

This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.

But there's still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I'm quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I'm able to look up background information about that code, and to include credit where credit is due.

The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it's quoted from. You can then either include proper attribution or decide against using that code altogether.

This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will continue to work both on decreasing rates of recitation and on making its detection more precise.

Footnotes

1: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

2: Tiferet Gazit

3: see von Bayern et al. about the creative wisdom of crows: Compound tool construction by New Caledonian crows

4: see Carlini et al. about deliberately triggering the recall of training data: Extracting Training Data from Large Language Models

5: jaeteekae: DelayedTwitter

6: Probably not too many though. I've asked some developers to help me label the cases, and everyone was prompted to flag up any uncertainty with their judgement. That happened in only 34 cases, i.e. less than 10%.

7: In the public dataset, I list the part of Copilot's suggestion that was also found in the training set, how often it was found, and a link to an example where it occurs in public code.
For privacy reasons, I don't include the unmatched part of the completion or the code context the user had typed (only an indication of its length).

8: In fact, since this experiment was conducted, GitHub Copilot has changed to require a minimum file content. So some of the suggestions flagged here would not have been shown by the current version.

9: For example jenevans33: CS8803-1

(c) 2021 GitHub, Inc.