This copilot is stupid and wants to kill me

25 June 2022

This week, Microsoft released an AI-based tool for writing software called GitHub Copilot. As a lawyer and 20+ year participant in the world of open-source software, I agree with those who consider Copilot to be primarily an engine for violating open-source licenses.

Still, I'm not worried about its effects on open source. Why? Because as a matter of basic legal hygiene, I expect that organizations that create software assets will have to forbid the use of Copilot and other AI-assisted tools, lest they unwittingly contaminate those software assets with license violations and intellectual-property infringements.

(Before we go further: I am not your lawyer, nor anyone's lawyer, and you should not take anything on this page as legal advice.)

It's licenses all the way down

Those versed in open-source history might recognize my argument as similar to the one Microsoft pushed for many years to deter organizations from adopting open source at all. "How can you trust that the code doesn't contain IP violations?", they asked. This was often derided as pure FUD (= the marketing tactic of spreading "fear, uncertainty, and doubt" about a competitor). But as a legal matter, it's a fair question to ask of any new technology that by design contains portions of other people's work.

As applied to open source, what made the question unfair was its implication that the open-source world is some kind of sloppy mosh pit of IP rights, carelessly remixed. On the contrary, the growth of open source over 20+ years has only been possible because of its fastidious insistence on explicit, enforceable software licenses. For instance, as an author of open-source software, I don't just dump my code on some website and invite others to guess what they can do with it. Rather, everything I make is accompanied by a simple but explicit license--I've always preferred the one known as the MIT license--that invites certain kinds of uses, forbids others, and sets the conditions.

Likewise, though open-source licenses allow redistribution and modification of code, I still have to honor the specific terms of other open-source software that I use in my projects. Again, it's not a mosh pit. There are many open-source licenses in use, and not all of them are compatible. For instance, the GNU General Public License (GPL) is much more restrictive than the MIT license. Thus, I can't embed GPL-licensed software within my MIT-licensed projects, because doing so would amount to relicensing that software under the MIT license, which would violate the GPL. This shouldn't seem arcane: as a logical and legal matter, I can't pass along to others permissions I never had in the first place.

GitHub Copilot is different

GitHub Copilot is an AI-based system that has been trained, according to Microsoft, on "billions of lines of public code". In use, it acts as a fancy autocomplete. As you type into a code editor, Copilot reads what you've written--say, the name of a function--and suggests code that implements it. For instance, I started a Copilot session to write a JavaScript function for calculating the nth prime number.
So I typed this:

    function calculateNthPrime(n) {

    }

After a pause, Copilot filled in the rest:

    function calculateNthPrime(n) {
      var primes = [2];
      var i = 3;
      while (primes.length < n) {
        if (isPrime(i)) {
          primes.push(i);
        }
        i += 2;
      }
      return primes[n - 1];
    }

This code doesn't work, however, because it relies on a function called isPrime that doesn't yet exist. Being a good sport, I forgave Copilot and typed that wish next:

    function isPrime(n) {

    }

Once again, Copilot produced the rest:

    function isPrime(n) {
      for (var i = 2; i < n; i++) {
        if (n % i === 0) {
          return false;
        }
      }
      return true;
    }

This is the code I would expect from a talented 12-year-old who learned about JavaScript yesterday and prime numbers today. Does it work? Uh--maybe? Notably, Microsoft doesn't claim that any of the code Copilot produces is correct. That's still your problem. Thus, Copilot essentially tasks you with correcting a 12-year-old's homework, over and over. (I have no idea how this is preferable to just doing the homework yourself.)
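As it happens, you can grade this particular homework directly. Here's a quick sanity test--mine, not Copilot's--that pastes both suggestions into one script and spot-checks the first few primes:

    // Copilot's two suggestions, combined verbatim.
    function calculateNthPrime(n) {
      var primes = [2];
      var i = 3;
      while (primes.length < n) {
        if (isPrime(i)) {
          primes.push(i);
        }
        i += 2;
      }
      return primes[n - 1];
    }

    function isPrime(n) {
      for (var i = 2; i < n; i++) {
        if (n % i === 0) {
          return false;
        }
      }
      return true;
    }

    // Spot-check against the first six primes.
    console.log([1, 2, 3, 4, 5, 6].map(calculateNthPrime));
    // prints [ 2, 3, 5, 7, 11, 13 ]

So it passes this check, though the trial-division isPrime makes it far slower than it needs to be for large n. But correctness was never the vexing part.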
The big nowhere

But this generated code raises an even more vexing question: if Copilot was trained on software code that was subject to an open-source license, what license might apply to the code produced by Copilot? MIT? GPL? Something else? No license--in the sense of public domain? No license--in the sense that the underlying pieces are under incompatible licenses and there's no way to combine them?

Microsoft makes no claims about this either. Rather, it explicitly passes the risk to users, who must carry the entire burden of license compliance (emphasis added below):

    We recommend you take the same precautions when using code generated by GitHub Copilot that you would when using any code you didn't write yourself. These precautions include rigorous testing, *IP scanning* ...

By IP scanning I assume Microsoft means intellectual-property scanning--the process of verifying that the code doesn't contain IP violations. (Unfortunately, the phrase IP scanning is also commonly used to mean IP-address scanning in the network sense.)

On the one hand, we can't expect Microsoft to offer legal advice to its zillions of users, or a blanket indemnification. On the other hand, Microsoft isn't sharing any of the information users would need to make these determinations. On the contrary--Copilot completely severs the connection between its inputs (= code under various open-source licenses) and its outputs (= code algorithmically produced by Copilot). Thus, after 20+ years, Microsoft has finally produced the very thing it falsely accused open source of being: a black hole of IP rights.

Copilot is malware

CTOs and general counsels of organizations that generate software IP assets now have an urgent problem: how to prevent the contamination of those assets with code generated by Copilot (and similar AI tools that will certainly emerge).

Let's be very clear--this has not been a practical problem for open-source software over the last 20+ years. Why? Because open source was designed around license-based accountability. Have there been instances where open-source software has violated IP rights? Sure. Just as there have been instances where proprietary software has done so. The point of open source was never to create a regime of software licensing that was impervious to IP litigation. Rather, it was to show that sharing and modification of source code could become part of the software industry without collapsing the existing regime.

Open-source software has successfully coexisted with proprietary software because it plays by the same legal rules. Copilot does not. Whereas open source strives for clarity around licensing, Copilot creates nothing but fog. Microsoft has imposed on users the responsibility for determining the IP status of the code Copilot emits, but provides none of the data they would need to do so. The task, therefore, is impossible. For this reason, one must further conclude that any code generated by Copilot may contain lurking license or IP violations.

In that case, the only prudent position is to reject Copilot--and other AI assistants trained on external code--entirely. I imagine this will quickly become official policy at software organizations. Because what other position could be defensible? "We put our enterprise codebase at risk to spare our highly paid programmers the indignity of writing a program to calculate the nth prime number"?

Still, I'm sure some organizations will try to find a middle path with Copilot on the (misguided) principle of developer productivity and general AI maximalism. Before too long, someone at these organizations will find a giant license violation in some Copilot-generated code, and the experiment will quietly end. More broadly, it's still unclear how the chaotic nature of AI can be squared with the virtue of predictability that is foundational to many business organizations.

(Another troublesome aspect of Copilot is that it operates as a keylogger within your code editor, sending everything you type back to Microsoft for processing. Sure, you can switch it on and off. But it still represents a risk to privacy, IP, and trade secrets that's difficult to control. As above, the only prudent policy will be to keep it away from developer machines entirely.)

Can Copilot be fixed?

Maybe--if instead of fog, Copilot were to offer sunshine. Rather than conceal the licenses of the underlying open-source code it relies on, it could in principle keep this information attached to each chunk of code as it wends its way through the model. Then, on the output side, a user could inspect the generated code and see where every part came from and what license is attached to it.

Keeping license terms attached to code would also allow users to shape the output of Copilot by license--for instance, "generate an nth-prime function using only MIT-licensed source material". As the end user, I'd still be responsible for verifying those terms. But at least I'd have the information I'd need to do so. As it stands, the task is hopeless.

In the law, this concept is critical, and known as chain of custody: the idea that the reliability of certain material depends on verifying where it came from. For instance, without a recorded chain of custody, you could never introduce documents into evidence at trial, because you'd have no way of confirming that they were authentic and trustworthy.
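To make that concrete, here is a minimal sketch--entirely hypothetical; no such interface exists in Copilot today--of what license-annotated output might look like. Every name here (the suggestion shape, the allowedBy function, the example repository URL) is my own invention for illustration. Each generated chunk carries its sources and their licenses, so filtering by license becomes a mechanical policy check:

    // Hypothetical shape of a provenance-annotated suggestion. Every
    // chunk of generated code keeps a link back to the training
    // material it was derived from, preserving a chain of custody.
    var suggestion = {
      code: "function calculateNthPrime(n) { /* ... */ }",
      chunks: [
        {
          text: "var primes = [2];",
          sources: [
            {
              repository: "https://example.com/someone/primes", // hypothetical
              file: "primes.js",
              license: "MIT"
            }
          ]
        }
        // ...one entry per chunk of the generated code
      ]
    };

    // With that metadata attached, "use only MIT-licensed source
    // material" becomes a mechanical check instead of a forensic one.
    function allowedBy(suggestion, acceptableLicenses) {
      return suggestion.chunks.every(function (chunk) {
        return chunk.sources.every(function (src) {
          return acceptableLicenses.indexOf(src.license) !== -1;
        });
      });
    }

    console.log(allowedBy(suggestion, ["MIT"])); // true for this example

None of this would be trivial to implement inside a language model. But the data shape shows what sunshine would minimally require: per-chunk provenance, not a blanket disclaimer.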
What Copilot means for open source

If Copilot is vigorously violating open-source licenses, what should open-source authors do about it? In the large, I don't think the problems open-source authors have with AI training are that different from the problems everyone will have. We're just encountering them sooner.

Most importantly, I don't think we should let the arrival of a new obstacle compromise the spirit of open source. For instance, some have suggested creating an open-source license that forbids AI training. But this kind of usage-based restriction has never been part of the open-source ethic. Furthermore, it's overinclusive: we can imagine (as I have above) AI systems that behave more responsibly and ethically than this first generation does. It would be self-defeating for open-source authors to set themselves athwart technological progress, since enabling that progress is one of the main reasons for open-sourcing code in the first place.

By the same token, it doesn't make sense to hold AI systems to a lower standard than we would hold human users. Widespread open-source license violations shouldn't be shrugged off as an unavoidable cost of progress.

Suppose we accept that AI training falls under the US copyright notion of fair use (though the question is far from settled). If so, then the fair-use exception would supersede the license terms. But even if the input to an AI system qualifies as fair use, the output of that system may not. Microsoft has not made this claim about GitHub Copilot--and never will, because no one can guarantee the behavior of a nondeterministic system.

We are at the beginning of the era of practical, widespread AI systems. It's inevitable that there will be litigation and regulation about the behavior of these systems. It's also inevitable that their nondeterminism will be offered as a defense of their misbehavior--"we don't really know how it works either, so we all just have to accept it". I think regulations mandating the auditability of AI systems--showing the connection between inputs and outputs, akin to a chain of custody--are very likely, probably in the EU before the US. This is the only way to ensure that AI systems are not being used to launder materials that are otherwise unethical or illegal. In the US, I think it's possible AI may end up provoking an amendment to the US constitution--but that's a topic for another day.

In the interim, I think the most important thing open-source authors can do is continue bringing attention to the facts about Copilot that Microsoft would prefer to leave buried in the fine print. For now, Copilot's greatest enemy is itself.

Further reading

* "If Software is My Copilot, Who Programmed My Software?", Bradley Kuhn, Software Freedom Conservancy