This copilot is stupid and wants to kill me

25 June 2022

This week, Microsoft released an AI-based tool for writing software called GitHub Copilot. As a lawyer and 20+ year participant in the world of open-source software, I agree with those who consider Copilot to be primarily an engine for violating open-source licenses.

Still, I'm not worried about its effects on open source. Why? Because as a matter of basic legal hygiene, I expect that organizations that create software assets will have to forbid the use of Copilot and other AI-assisted tools, lest they unwittingly contaminate those software assets with license violations and intellectual-property infringements.

(Before we go further: I am not your lawyer, nor anyone's lawyer, and you should not take anything on this page as legal advice.)

It's licenses all the way down

Those versed in open-source history might recognize my argument as similar to the one Microsoft pushed for many years to deter organizations from adopting open source at all. "How can you trust that the code doesn't contain IP violations?", they asked. This was often derided as pure FUD (= the marketing tactic of spreading "fear, uncertainty, and doubt" about a competitor). But as a legal matter, it's a fair question to ask of any new technology that by design contains portions of other people's work.

As applied to open source, what made the question unfair was its implication that the open-source world is some kind of sloppy mosh pit of IP rights, carelessly remixed. On the contrary, the growth of open source over 20+ years has only been possible because of its fastidious insistence on explicit, enforceable software licenses. For instance, as an author of open-source software, I don't just dump my code on some website and invite others to guess what they can do with it. Rather, everything I make is accompanied by a simple but explicit license--I've always preferred the one known as the MIT license--that invites certain kinds of uses, forbids others, and sets the conditions.

Likewise, though open-source licenses allow redistribution and modification of code, I still have to honor the specific terms of other open-source software that I use in my projects. Again, it's not a mosh pit. There are many open-source licenses in use, and not all of them are compatible. For instance, the GNU General Public License (GPL) is much more restrictive than the MIT license. Thus, I can't embed GPL-licensed software within my MIT-licensed projects, because doing so would amount to relicensing that software under the MIT license, which would violate the GPL. This shouldn't seem arcane: as a logical and legal matter, I can't pass along to others permissions I never had in the first place.

GitHub Copilot is different

GitHub Copilot is an AI-based system that has been trained, according to Microsoft, on "billions of lines of public code". In use, it acts as a fancy autocomplete. As you type into a code editor, Copilot reads what you've written--say, the name of a function--and suggests code that implements it. For instance, I started a Copilot session to write a JavaScript function for calculating the nth prime number.
So I typed this:

    function calculateNthPrime(n) {

    }

After a pause, Copilot filled in the rest:

    function calculateNthPrime(n) {
      var primes = [2];
      var i = 3;
      while (primes.length < n) {
        if (isPrime(i)) {
          primes.push(i);
        }
        i += 2;
      }
      return primes[n - 1];
    }

This code doesn't work, however, because it relies on a function called isPrime that doesn't yet exist. Being a good sport, I forgave Copilot and typed that wish next:

    function isPrime(n) {

    }

Once again, Copilot produced the rest:

    function isPrime(n) {
      for (var i = 2; i < n; i++) {
        if (n % i === 0) {
          return false;
        }
      }
      return true;
    }

This is the code I would expect from a talented 12-year-old who learned about JavaScript yesterday and prime numbers today. Does it work? Uh--maybe? Notably, Microsoft doesn't claim that any of the code Copilot produces is correct. That's still your problem. Thus, Copilot essentially tasks you with correcting a 12-year-old's homework, over and over. (I have no idea how this is preferable to just doing the homework yourself.)
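As it happens, you can grade this particular homework directly. Here's a quick sanity test--mine, not Copilot's--that pastes both suggestions into one script and spot-checks the first few primes:

    // Copilot's two suggestions, combined verbatim.
    function calculateNthPrime(n) {
      var primes = [2];
      var i = 3;
      while (primes.length < n) {
        if (isPrime(i)) {
          primes.push(i);
        }
        i += 2;
      }
      return primes[n - 1];
    }

    function isPrime(n) {
      for (var i = 2; i < n; i++) {
        if (n % i === 0) {
          return false;
        }
      }
      return true;
    }

    // Spot-check against the first six primes.
    console.log([1, 2, 3, 4, 5, 6].map(calculateNthPrime));
    // prints [ 2, 3, 5, 7, 11, 13 ]

So it passes this check, though the trial-division isPrime makes it far slower than it needs to be for large n. But correctness was never the vexing part.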
The big nowhere

But this generated code raises an even more vexing question: if Copilot was trained on software code that was subject to an open-source license, what license might apply to the code produced by Copilot? MIT? GPL? Something else? No license--in the sense of public domain? No license--in the sense that the underlying pieces are under incompatible licenses and there's no way to combine them?

Microsoft makes no claims about this either. Rather, it explicitly passes the risk to users, who must carry the entire burden of license compliance (emphasis added below):

    We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn't write yourself. These precautions include rigorous testing, *IP scanning* ...

By IP scanning I assume Microsoft means intellectual-property scanning--the process of verifying that the code doesn't contain IP violations. (Unfortunately, the phrase IP scanning is also commonly used to mean IP-address scanning in the network sense.)

On the one hand, we can't expect Microsoft to offer legal advice to its zillions of users, or a blanket indemnification. On the other hand, Microsoft isn't sharing any of the information users would need to make these determinations. On the contrary--Copilot completely severs the connection between its inputs (= code under various open-source licenses) and its outputs (= code algorithmically produced by Copilot). Thus, after 20+ years, Microsoft has finally produced the very thing it falsely accused open source of being: a black hole of IP rights.

Copilot is malware

CTOs and general counsels of organizations that generate software IP assets now have an urgent problem: how to prevent the contamination of those assets with code generated by Copilot (and similar AI tools that will certainly emerge).

Let's be very clear--this has not been a practical problem for open-source software over the last 20+ years. Why? Because open source was designed around license-based accountability. Have there been instances where open-source software has violated IP rights? Sure. Just as there have been instances where proprietary software has done so. The point of open source was never to create a regime of software licensing that was impervious to IP litigation. Rather, it was to show that sharing and modification of source code could become part of the software industry without collapsing the existing regime.

Open-source software has successfully coexisted with proprietary software because it plays by the same legal rules. Copilot does not. Whereas open source strives for clarity around licensing, Copilot creates nothing but fog. Microsoft has imposed on users the responsibility for determining the IP status of the code Copilot emits, but provides none of the data they would need to do so. The task, therefore, is impossible. For this reason, one must further conclude that any code generated by Copilot may contain lurking license or IP violations.

In that case, the only prudent position is to reject Copilot--and other AI assistants trained on external code--entirely. I imagine this will quickly become official policy at software organizations. Because what other position could be defensible? "We put our enterprise codebase at risk to spare our highly paid programmers the indignity of writing a program to calculate the nth prime number"?

Still, I'm sure some organizations will try to find a middle path with Copilot on the (misguided) principle of developer productivity and general AI maximalism. Before too long, someone at these organizations will find a giant license violation in some Copilot-generated code, and the experiment will quietly end. More broadly, it's still unclear how the chaotic nature of AI can be squared with the virtue of predictability that is foundational to many business organizations.

(Another troublesome aspect of Copilot is that it operates as a keylogger within your code editor, sending everything you type back to Microsoft for processing. Sure, you can switch it on and off. But it still represents a risk to privacy, IP, and trade secrets that's difficult to control. As above, the only prudent policy will be to keep it away from developer machines entirely.)

Can Copilot be fixed?

Maybe--if instead of fog, Copilot were to offer sunshine. Rather than conceal the licenses of the underlying open-source code it relies on, it could in principle keep this information attached to each chunk of code as it wends its way through the model. Then, on the output side, a user could inspect the generated code and see where every part came from and what license is attached to it.

Keeping license terms attached to code would also allow users to shape the output of Copilot by license--for instance, "generate an nth-prime function using only MIT-licensed source material". As the end user, I'd still be responsible for verifying those terms. But at least I'd have the information I'd need to do so. As it stands, the task is hopeless.

In the law, this concept is critical, and known as chain of custody: the idea that the reliability of certain material depends on verifying where it came from. For instance, without a recorded chain of custody, you could never introduce documents into evidence at trial, because you'd have no way of confirming that they were authentic and trustworthy.
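To make that concrete, here is a minimal sketch--entirely hypothetical; no such interface exists in Copilot today--of what license-annotated output might look like. Every name here (the suggestion shape, the allowedBy function, the example repository URL) is my own invention for illustration. Each generated chunk carries its sources and their licenses, so filtering by license becomes a mechanical policy check:

    // Hypothetical shape of a provenance-annotated suggestion. Every
    // chunk of generated code keeps a link back to the training
    // material it was derived from, preserving a chain of custody.
    var suggestion = {
      code: "function calculateNthPrime(n) { /* ... */ }",
      chunks: [
        {
          text: "var primes = [2];",
          sources: [
            {
              repository: "https://example.com/someone/primes", // hypothetical
              file: "primes.js",
              license: "MIT"
            }
          ]
        }
        // ...one entry per chunk of the generated code
      ]
    };

    // With that metadata attached, "use only MIT-licensed source
    // material" becomes a mechanical check instead of a forensic one.
    function allowedBy(suggestion, acceptableLicenses) {
      return suggestion.chunks.every(function (chunk) {
        return chunk.sources.every(function (src) {
          return acceptableLicenses.indexOf(src.license) !== -1;
        });
      });
    }

    console.log(allowedBy(suggestion, ["MIT"])); // true for this example

None of this would be trivial to implement inside a language model. But the data shape shows what sunshine would minimally require: per-chunk provenance, not a blanket disclaimer.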
What Copilot means for open source

If Copilot is vigorously violating open-source licenses, what should open-source authors do about it? In the large, I don't think the problems open-source authors have with AI training are that different from the problems everyone will have. We're just encountering them sooner.

Most importantly, I don't think we should let the arrival of a new obstacle compromise the spirit of open source. For instance, some have suggested creating an open-source license that forbids AI training. But this kind of usage-based restriction has never been part of the open-source ethic. Furthermore, it's overinclusive: we can imagine (as I have above) AI systems that behave more responsibly and ethically than this first generation does. It would be self-defeating for open-source authors to set themselves athwart technological progress, since enabling that progress is one of the main reasons for open-sourcing code in the first place.

By the same token, it doesn't make sense to hold AI systems to a lower standard than we would hold human users. Widespread open-source license violations shouldn't be shrugged off as an unavoidable cost of progress.

Suppose we accept that AI training falls under the US copyright notion of fair use (though the question is far from settled). If so, then the fair-use exception would supersede the license terms. But even if the input to an AI system qualifies as fair use, the output of that system may not. Microsoft has not made this claim about GitHub Copilot--and never will, because no one can guarantee the behavior of a nondeterministic system.

We are at the beginning of the era of practical, widespread AI systems. It's inevitable that there will be litigation and regulation about the behavior of these systems. It's also inevitable that their nondeterminism will be offered as a defense of their misbehavior--"we don't really know how it works either, so we all just have to accept it". I think regulations mandating the auditability of AI systems--showing the connection between inputs and outputs, akin to a chain of custody--are very likely, probably in the EU before the US. This is the only way to ensure that AI systems are not being used to launder materials that are otherwise unethical or illegal. In the US, I think it's possible AI may end up provoking an amendment to the US constitution--but that's a topic for another day.

In the interim, I think the most important thing open-source authors can do is continue bringing attention to the facts about Copilot that Microsoft would prefer to leave buried in the fine print. For now, Copilot's greatest enemy is itself.

Further reading

* "If Software is My Copilot, Who Programmed My Software?", Bradley Kuhn, Software Freedom Conservancy