[HN Gopher] GitHub accused of varying Copilot output to avoid co...
___________________________________________________________________
GitHub accused of varying Copilot output to avoid copyright
allegations
Author : belter
Score : 105 points
Date : 2023-06-10 13:46 UTC (9 hours ago)
(HTM) web link (www.theregister.com)
(TXT) w3m dump (www.theregister.com)
| WalterBright wrote:
| One of the specific complaints is:
|
| https://devclass.com/2022/10/17/github-copilot-under-fire-as...
|
| It's a 25 or so line function that looks like a pedestrian
| implementation of a sparse matrix transpose algorithm. The author
| should have patented it to protect it, not copyrighted it.
| belter wrote:
| https://storage.courtlistener.com/recap/gov.uscourts.cand.40...
| jimnotgym wrote:
| Taking code off github, changing it a bit and passing it off as
| one's own crosses a line. Now we really can't tell the AI from the
| humans!
| unkulunkulu wrote:
| oh come on, which code? writing imports? or iterating over
| collections? or am I underusing copilot? :)
|
| I basically use it as stackoverflow on steroids. it is not even
| close to gpt-4 in terms of reproducing some original idea I
| could not find in a search engine
| missingdays wrote:
| Why would you ever write imports? IDEs autocomplete them for
| you
| unkulunkulu wrote:
| Copilot understands some conventions when there's more than
| one way. I used it extensively with react bootstrap where I
| decided to go with the recommended way of importing each
| component, like import Tab from 'react-bootstrap/Tab'. It
| also knows which components are used in the file.
| sureglymop wrote:
| But that's pretty much what copilot is... It's just
| Intellisense 2.0 and I would say even only marginally more
| useful. You can't even really instruct it except with some
| comments which may not work.
| rolph wrote:
| [The judge overseeing the case has permitted the plaintiffs to
| remain anonymous in court filings because of credible threats of
| violence [PDF] directed at their attorney. The Register
| understands that the plaintiffs are known to the defendants.]
|
| https://storage.courtlistener.com/recap/gov.uscourts.cand.40...
| formerly_proven wrote:
| > go f** _g cry about github you f**_ g piece of s*t n**r, I
| hope your throat gets cut open and every single family member
| of you is burnt to death
|
| Are github users gamers? Really puts the "git" into "github"
| there.
| arp242 wrote:
| Some more here:
|
| https://storage.courtlistener.com/recap/gov.uscourts.cand.40...
|
| https://storage.courtlistener.com/recap/gov.uscourts.cand.40...
|
| https://storage.courtlistener.com/recap/gov.uscourts.cand.40...
|
| Friendly people.
|
| I've received emails like that too over the years. What
| hugely controversial thing do I do? I have a website where I
| sometimes write about $stuff and I post on HN. Keeping the
| basic info private is probably a good thing especially if
| they're based in the US, because "SWATting" etc, but beyond
| that it doesn't seem "credible" in the sense that it's very
| likely someone will show up at their door with a gun.
|
| Since the first two are redacted, I wonder if they sent them
| with their real names.
| z3t4 wrote:
| It can be explained by the normal curve. The bigger your
| audience is the weirder the outliers will be.
| arp242 wrote:
| Pretty much, yeah. There's about 26.8 million developers
| in the world. Assuming 5 million read this story (not
| everyone speaks English) and 1% of people are a bit
| unhinged then you've got 50,000 unhinged people, and only
| about 0.006% of those 50,000 (or 0.00006% of total) need
| to be unhinged enough to actually shoot off an email.
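The arithmetic in this estimate can be checked in a few lines. This is only a sketch: the readership and the 50,000 figure are the commenter's, while the count of three email senders is an assumption chosen to match the quoted 0.006%. Note that 50,000 out of 5 million readers corresponds to 1% of readers.

```python
# Back-of-the-envelope check of the audience estimate in the comment above.
readers = 5_000_000   # assumed readership, taken from the comment
unhinged = 50_000     # the comment's count of "a bit unhinged" readers
senders = 3           # hypothetical: the handful who actually send an email


def share(part: float, whole: float) -> float:
    """Return part/whole as a plain fraction (0.01 means 1%)."""
    return part / whole


unhinged_share = share(unhinged, readers)          # fraction of all readers
sender_share_of_unhinged = share(senders, unhinged)
sender_share_of_total = share(senders, readers)

if __name__ == "__main__":
    print(f"unhinged readers:        {unhinged_share:.4%}")
    print(f"senders among unhinged:  {sender_share_of_unhinged:.4%}")
    print(f"senders among everyone:  {sender_share_of_total:.6%}")
```

The point of the comment survives the check: with an audience in the millions, even a vanishingly small fraction is enough to produce a few hostile emails.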
| web3-is-a-scam wrote:
| Considering how large GitHub is (in the industry) it's like
| asking "are Facebook users gamers?"
| notjoemama wrote:
| That's what struck me about it too. Isn't it the case in a
| large enough population you can always find representation
| of something you dislike or hate? I've seen lists of
| "Republicans" (meaning anyone in, near, or related to a
| Republican politician) showing those people being caught or
| convicted of various moral, economic, and social "crimes".
| Ok. But if I sat down and looked using the same criteria,
| couldn't I just as easily create a long list for the
| Democrat party? Having made that statement on Reddit, the
| response I got was, "well, there are MORE republicans".
| That struck me as odd too. Are you trying to say of the two
| horrible things, one is worse, and so I have a moral
| imperative to choose the less horrible one? I'm fairly sure
| I get to abandon both in search of a better option. lol
| bandyaboot wrote:
| > I've seen lists of "Republicans" (meaning anyone in,
| near, or related to a Republican politician) showing
| those people being caught or convicted of various moral,
| economic, and social "crimes". Ok.
|
| I'm intrigued. I'd like to see the subset of the list
| that are people who were _in_ Republican politicians.
| indrora wrote:
| An impressive number of 4chan's /g/ users are on github. Some
| even actively contribute to Linux (though usually to Arch,
| Gentoo, and now more and more Nix).
|
| I wrote a paper during college that I should release some
| time about when /g/ threw an absolute shitfit over Linus
| going "so, I've been a kinda shit human being to people and
| I'm going to step back and get some help", going as far as to
| blame his daughter/"the woke mob"/multiple named core kernel
| contributors for killing their god.
|
| At one point, I attended a GitHub event that wasn't directly
| sponsored by github but encouraged a lot of github users to
| show up. While there I met several people who, outside the
| venue, were talking animatedly about Terry Davis. Listening
| in on the conversation revealed that they more or less just
| approved of his extensive use of racist language and
| epithets.
|
| I haven't checked, but I would suspect that Linus' recent
| "trans rights" by proxy post has caused at least one or two
| aneurisms in the /g/ user group.
| StrauXX wrote:
| I would love to read that paper if you do decide to
| publicise it! 4chan mob dynamics never fail to make
| interesting (albeit often nasty) stories.
| Dma54rhs wrote:
| /g/ is pretty mainstream among the zoomers; they browse it
| publicly. Also 4chan is one of the most popular
| websites on the internet so it doesn't come as a
| surprise.
| edgyquant wrote:
| A large number of people you meet, from all walks of
| life, will admit that 4chan is a guilty pleasure. At
| least I've met a ton and none of them were right leaning
| to say the least.
| waboremo wrote:
| That would be my general theory as well, you're far more
| likely to meet someone who is left leaning who admits to
| having posted on 4chan (or still does) than you are
| otherwise. Maybe it has to do with perceived biases, in
| that a right leaning person/group is probably then going
| to feel they are aligned with the seediest aspects of
| 4chan, whether they actually do or not and their
| perceived social impact/failings for using 4chan.
| edgyquant wrote:
| Yeah moderate to right wingers would probably not admit
| it except to people close to them due to that perception.
| But using the site you can tell there are a lot of very
| intellectual liberals and fiscal conservatives. The
| racist and sexist stuff is just their equivalent of the
| dumb Reddit memes that encompass 80% of its content.
| pxc wrote:
| > An impressive number of 4chan's /g/ users are on github.
| Some even actively contribute to Linux (though usually to
| Arch, Gentoo, and now more and more Nix).
|
| An aside about this from a moderately longtime Nix user and
| very occasional Nixpkgs contributor:
|
| I used to occasionally post about Nix on /g/ before
| virtually anyone there knew what it was just to gauge
| reactions, and boy were people shitty and dismissive about
| it. It was all hot takes, broad strokes, and very little
| curiosity about the technical details. And even though Nix
| is 'cool' on /g/ now, all of those things are still true
| about the way /g/ treats NixOS and other distros.
|
| The interest that 90% of /g/ users have in Linux distros
| like NixOS is as a bullshit status symbol, a token in some
| consumerist identity game. The presence of that shallow,
| status-obsessed, needlessly edgy type of person in the Nix
| community is definitely more visible in the Nix(OS)
| community now than it was a few years ago, but it still
| sticks out like a sore thumb against the backdrop of
| longtime Nix users and the culture they've evolved
| together.
|
| For that reason, I strongly recommend engaging with the Nix
| community in community-owned channels, like
| discourse.nixos.org or the community Matrix channels,
| rather than message boards like 4chan or mainstream social
| media platforms like Reddit. If you do that, you'll find
| kinder, more knowledgeable people (and perhaps in some
| cases, kinder more knowledgeable personas for the _same_
| people).
|
| If you're reading this and you've unfortunately encountered
| Nix 'evangelists' with those shitty attitudes online,
| please understand that those influences are external to the
| community, and as far as most participants in the community
| are concerned, quite unwelcome.
| zer0tonin wrote:
| No, but I assume a lot of AI bros are
| obiefernandez wrote:
| ffs, can we just not make this a thing?
| faangsticle wrote:
| Too late, the AI bros already did.
| edgyquant wrote:
| Too late, a demographic of people who could just barely
| scrape together a script making REST requests are now
| selling themselves as "AI specialists" or "prompt
| engineers" to the corporate class. These are this cycle's
| cryptobros, who were mostly not engineers but people
| riding a hype wave.
|
| The age of the AI bro is here, and as someone who has been
| in the space for a while, genuinely interested in the
| models and working with them from time to time, I'm
| giving a lot of eye rolls in meetings when these people
| start talking about the underlying tech.
| hooomil wrote:
| [dead]
| foobarbazetc wrote:
| "Prompt engineers"... sigh.
| rkagerer wrote:
| The plaintiffs were granted anonymity due to credible threats
| against their attorney. Is there any mechanism other than
| publication ban that ensures the protection? Can't someone just
| attend the day of the hearing to see who the attorneys are?
|
| EDIT: Apparently the lawyers are attending via Zoom.
| coryrc wrote:
| The plaintiffs, not the plaintiffs' lawyers.
| JVillella wrote:
| They say crypto is "regulatory arbitrage", I say this AI co-pilot
| stuff is "copyright arbitrage".
|
| Being a bit hand-wavy with it: It's akin to torrenting
| music/movies. The torrented files are lossy compressed
| representations of the original waveform from the music producer.
| Limewire, or Pirate Bay, or whatever provide interface to
| retrieve them (download or stream). The model weights are a form
| of lossy compression, and inference is like a document retrieval.
|
| One may say, "it's like an employee working at company X, then
| going to work at company Y, they retain their knowledge and
| experience." I would say it's more like, employee going from X to
| Y, but retaining audio and video recordings of all interactions
| he had, notes, documents, and other proprietary info and bringing
| it to company Y.
| az226 wrote:
| Yea but only if you get to download a few seconds of the movie
| and not more.
| mjburgess wrote:
| I call it copyright laundering
| soultrees wrote:
| What would you say the basis of all knowledge you know is? You
| are a collection of everything you have consumed and the stuff
| you create is all influenced by that.
|
| Personally this whole llm debate about copyright is quite
| funny. As someone who very much has skin in the game(my art is
| trained on midjourney.), and who runs in a circle of artists,
| it's interesting to see people's egos come into play here. The
| ones who are excited about these as tools are the ones who are
| openly inspired and want to inspire however the ones who claim
| copyright infringement seem to come off as insecure, almost
| like they are afraid that this idea of theirs will be the last
| great idea they have. There's already a separation happening in
| the art world of people who are exploding in creative output vs
| the people who are so defensive and cling to the old way of
| doing things.
|
| If I had my way, I'd see copyright laws abolished completely. A
| complete free for all in innovation. And people who claim that
| without patents and copyright then there's no incentive to make
| money seriously underestimate humans and their ego to
| continually innovate.
| saurik wrote:
| > What would you say the basis of all knowledge you know is?
| You are a collection of everything you have consumed and the
| stuff you create is all influenced by that.
|
| FWIW, humans certainly can infringe other people's copyrights
| and can do so even if they aren't actively intending to do
| so. There is some boundary across which you are no longer
| just learning something and you are now copying, and it isn't
| clear at all that these generative AI techniques are actively
| considering the latter the way a human is required to.
|
| But, sure: if you are against the idea of copyright entirely
| then it is hard to consider the idea inconsistent, though I
| would think a world without copyright would be a particularly
| hard one for an artist to make money at all...
| JVillella wrote:
| >What would you say the basis of all knowledge you know is?
| You are a collection of everything you have consumed and the
| stuff you create is all influenced by that.
|
| Surely you're not suggesting that there's no such thing as
| "original work". The production of which may have very high
| capital and labour costs - which if not protected from theft
| - would remove the incentives of producing original work.
|
| >As someone who very much has skin in the game(my art is
| trained on midjourney)
|
| I don't know your specific situation, but there's obviously
| different scales of importance here. What if your art was
| your sole source of income, and people were reproducing it
| under their own name? or if you had a product where you
| poured millions into developing some novel IP/methods, and
| some employee brought it with them when they went to work at
| your competitors?
| WalterBright wrote:
| Over here at the D Language Foundation, we _encourage_
| people to download it for free and do whatever they want to
| with it. It's all Boost licensed.
|
| > some employee brought it with them when they went to work
| at your competitors?
|
| Other programming languages have copied lots of D features.
| We at the DLF don't mind at all. Though often they copy
| them and kinda miss the mark.
|
| (Yes, we sometimes copy features from other languages, too,
| and try to improve on them.)
| zzzzzzzza wrote:
| some things like drug discovery could probably be done with a
| bounty system rather than intellectual property, and could
| probably get much better results for a fraction of the cost
| for maintaining the intellectual property component of the
| court system
| ldehaan wrote:
| [dead]
| blibble wrote:
| so not only is it a shitty boilerplate generator, now it also
| introduces deliberate random changes (i.e. bugs)
| cmrdporcupine wrote:
| Copilot is to license violations (esp of copyleft licenses) what
| cryptocurrency mixers are for money laundering.
|
| My employer (IMHO smartly) forbids use of LLMs in company IP and
| company laptops, etc. Many others I'm sure are doing the same,
| and many others will follow.
| theRealMe wrote:
| Nobody uses copilot intentionally to violate copyright law.
| People do use crypto mixers intentionally to violate money
| laundering laws.
| SpicyLemonZest wrote:
| Nobody affirmatively says "yes, my goal is to violate
| copyright law, and Copilot is the best tool I've found". But
| it doesn't seem impossible to me that the value of Copilot
| comes partially from the fact that it can copy paste code
| from copyrighted repositories in ways which would be illegal
| for you or me to do. I'm not sure it's proven yet but I
| wouldn't be shocked if it is in the future.
| shagie wrote:
| It provides the same value as someone who copies and pastes
| code from Stack Overflow or any of the predecessors without
| concerning themselves with the license.
|
| I am certain that I can find code from Linux or gcc or
| emacs on Stack Overflow that is under a GPL license and not
| compatible with the CC license Stack Overflow uses... and
| yet it's there. What's more, people will copy that code
| into their own ignoring the CC license too.
|
| How is that really any different than using Copilot if the
| original license and attribution are something to respect?
|
| Note that I _do_ think that the original license is
| something to respect which is why for any of the code that
| I write that has copyright that matters on it (toy program
| for home? meh. Hobby project repo that I'm working on that
| I'll publish? yep. Employer's code for work? absolutely.) I
| either don't touch questionable sources or run a license
| check on it when using it.
|
| The key thing is that I don't consider the use of Copilot
| to be any more controversial than copying from Stack
| Overflow - which has been done by countless programmers for
| a decade before Copilot existed and no one got up in arms
| about it then.
| fooster wrote:
| Sorry your employer forbids the use of tooling that makes your
| life better and reduces drudgery. Perhaps you should vote with
| your feet and find a less Luddite employer.
| reaperducer wrote:
| _Sorry your employer forbids the use of tooling that makes
| your life better and reduces drudgery. Perhaps you should
| vote with your feet and find a less Luddite employer._
|
| Does your company allow you to outsource your work to people
| in a poorer nation for a fraction of the cost that you are
| paid? Why not? Perhaps you should vote with your feet and
| find a less Luddite employer.
| Dylan16807 wrote:
| If you have the skills for that, hell yes find an employer
| that will let you do it, either explicitly or implicitly.
| indrora wrote:
| My company forbids the use of LLMs that aren't validated (and
| we make one).
|
| Our managers get emails if we make calls to known LLMs, and
| there's guidance on locally running LLMs and using their
| output ("it's okay for small things maybe, but be careful").
| Why?
|
| Because legal's job is to protect the company from legal
| threats. Sometimes that means making some awkward choices,
| like handwringing over the use of GPL licensed software in
| publicly exposed example code (such as sample apps) purely
| because some aspects of the GPL haven't been tested in
| American courts, much less international ones.
|
| So the use cases for LLMs there are mostly source-to-source
| transformative ("Turn this function and documentation into
| javadoc format please") or similar -- stuff where you can
| show that the LLM isn't introducing anything that might maybe
| possibly have any hint of externally licensed software.
| renewiltord wrote:
| Wild. I suppose it's good that people who like these
| conditions can find employers like this and people like me
| who don't can find employers not like this.
|
| I could never countenance operating under these conditions.
| bushbaba wrote:
| Once the ip rules are figured out it'll open the door to a lot
| of usecases. This reminds me more of p2p file sharing being
| precursor to paid streaming services.
| taneq wrote:
| Isn't "rewrite the example code in your own style" accepted best
| practice for human coders, when working from an example that does
| what you need?
|
| I'm not sure what would be acceptable output for a code
| generation tool if rewriting the examples isn't ok and
| reimplementing something that performs the same function still
| isn't ok. Are we automatically granting de-facto code patents on
| all published code now?
| jazzyjackson wrote:
| "Isn't "rewrite the example code in your own style" accepted
| best practice [...]?"
|
| Why would it be? If a function performs the data transform I
| need you better believe I'm copy pasting that sucker with a
| hyperlink to where I found it
|
| But then again, I'm not trying to win in court.
| rolph wrote:
| what would happen without that hyperlink? the overall issue
| seems to be a lack of attribution to the originator.
| patmcc wrote:
| That depends a lot on the license - some require
| attribution, some don't, some care not a bit (in that they
| don't permit copying).
| waboremo wrote:
| I can't recall a single time that's been common advice given to
| programmers. It's usually either don't reinvent the wheel
| (therefore use the source while adhering to license), or come
| up with your own solution.
|
| Don't know how you would even write code in your own style. As
| soon as you start altering it, the result is different. It's
| more/less efficient.
| williamcotton wrote:
| I interpreted the comment you are responding to as "make sure
| it uses the same style conventions as the rest of this file",
| which is something that Copilot does very well!
| njharman wrote:
| Depending on the language there are a ton of style choices.
| Style guides are examples of the trivial ones.
|
| Non trivial include names, comments, logging, error checking,
| structure, ordering of operations that aren't sequential.
| waboremo wrote:
| Yes, but all of those have an impact on the actual function
| and performance of the proposed solution. By doing so, you
| are changing the solution.
|
| Look at FizzBuzz. If you were to set strict requirements on
| performance (and allow for reiterative testing), the
| results from different groups of people would be identical.
| They would reach the same conclusion because that's how
| code works, it's far more aligned to math than it is
| creative writing.
|
| So you cannot take an existing code solution and translate
| it to your own style. You are altering the program, the
| efficiency, and therefore the solution itself. Even when
| you do something like changing 1 single variable name!
| mistrial9 wrote:
| this comment really hits hard for me -- it's like there is a
| place to buy food where every menu item is clearly shown,
| with a large color picture and a printed price.. and the
| person talking has only ever purchased food in that way.. as
| if there are no alternatives that "really exist"
|
| there really are a lot of other scenarios that involve
| writing software, to make software. It's not possible to list
| them all.. the list changes while I type
| l__l wrote:
| The point here is that this isn't some example from a textbook
| or even stack overflow, but licensed pieces of work with all
| the legal complications that come with that. This is about the
| potential use of this code in proprietary code (or code
| otherwise incompatible with the original licenses), and I
| really don't think anyone would say it is "accepted best
| practice" to copy out someone else's work you find online,
| licenses be damned, in a professional setting.
| 542458 wrote:
| > this isn't some example from a textbook or even stack
| overflow, but licensed pieces of work with all the legal
| complications that come with that
|
| I understand why these might _feel_ different to you, but
| textbooks and stack overflow are also proprietary, licensed
| pieces of work. I don't see why there would be much of a
| legal distinction.
| salawat wrote:
| No, you're missing the point.
|
| There are two worlds.
|
| In one, every time someone publishes code with a license
| attached, they've taken a chunk out of the set of valid lines
| of software capable of being permissibly written without
| license encumbrance. This is the world the poster you are
| replying to is imagining we're headed toward, and this case
| basically does a fantastic job of laying a test
| case/precedent for.
|
| The other world, is one where everyone accepts all
| programming code is math, and copyrighting things is like
| erecting artificial barriers to facilitate information
| asymmetry. I.e. trying to own 2 + 2. In this second
| hypothetical world, we summarily reject IP as a thing.
|
| The 2nd world is what I'd rather live in, as the first truly
| feels more and more like hell to me. However, given the first
| one is the world we're in, I'd like to see the mental
| gymnastics employed to undermine Microsoft's original
| software philosophy.
|
| EDIT: Voir dire will be a hoot. Any wagers on how many
| software people make it onto the jury if any?
| harles wrote:
| > In one, every time someone publishes code with a license
| attached, they've taken a chunk out of the set of valid
| lines of software capable of being permissibly written
| without license encumbrance.
|
| If this were true of copyright, we would've run out of
| permissible novels a long time ago. There's plenty to
| complain about with how software IP works, but copyright
| seems pretty sane. The alternative of protecting IP via
| trade secret is not a world I want to live in. That seems
| bad for open source.
| mitthrowaway2 wrote:
| Code is a more restrictive space than prose. Prose has to
| be grammatical and meaningful, but code has to compile
| and efficiently serve a useful specification.
|
| The central idea of programming languages is that the
| grammar is very restrictive compared to natural
| languages. It's quite likely that, with the exception of
| variable names and whitespace, some function you wrote to
| implement a circular buffer is coincidentally identical
| to code that exists in Sony's or Lockheed Martin's
| codebases.
|
| Plus there's the birthday problem -- coincidences can
| happen way more than you expect. And even with prose,
| constraints like non-fiction can narrow things down
| quickly. If everyone on HN had to write a three-sentence
| summary of, say, how a bicycle works, there would
| probably be coincidentally identical summaries.
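The birthday-problem intuition can be made concrete with a short calculation. The sketch below assumes every item is drawn independently and uniformly from a fixed number of equally likely possibilities; real code and prose are not distributed that way, so treat the numbers as an illustration of the effect and not as an estimate for any actual codebase. The function name is made up for this example.

```python
def collision_probability(n_items: int, n_possibilities: int) -> float:
    """Probability that at least two of n_items coincide when each one is
    drawn independently and uniformly from n_possibilities values."""
    if n_items > n_possibilities:
        return 1.0  # pigeonhole: a repeat is unavoidable
    p_all_distinct = 1.0
    for i in range(n_items):
        # The (i+1)-th item must avoid the i values already taken.
        p_all_distinct *= (n_possibilities - i) / n_possibilities
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    # Classic case: 23 people and 365 birthdays is already better than even odds.
    print(round(collision_probability(23, 365), 3))
    # Hypothetical: 10,000 authors each picking one of a million plausible
    # implementations of the same small function.
    print(round(collision_probability(10_000, 1_000_000), 3))
```

The takeaway matches the comment: collisions become likely long before the number of authors approaches the number of possible variants.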
| edgyquant wrote:
| ReactOS actually got sued by Microsoft for stealing code
| and one of their proofs was a piece of code (can't
| remember exactly what it did) that basically matched the
| same function in Windows code with a few things changed.
|
| It was ASM code I think, and their defense was that there
| was basically one way to write a function that does this.
| moyix wrote:
| I think you're misremembering here; as far as I know (and
| as far as I can tell from searching just now) MS has
| never sued ReactOS. There was a claim made back in 2006
| on the mailing list that a portion of syscall.S was
| copied, and this caused ReactOS to do their own audit:
|
| https://en.wikipedia.org/wiki/ReactOS#Internal_audit
| harles wrote:
| Three sentence summaries probably wouldn't qualify for
| copyright protection. The same should be true of code -
| if we think the standard for copyright protection is too
| low, we should raise the bar on complexity requirements,
| not throw out copyright.
|
| Even if a programming grammar is more restrictive,
| there's some length where things become almost certainly
| unique.
| quesera wrote:
| It raises an interesting question though.
|
| Aside from obligatory syntactic bits, what is the most
| common line of code across all software ever developed?
|
| It'll probably be C or Java. HTML doesn't count.
|
| And it's probably something boring like:
| i++;
| l__l wrote:
| I don't think this dichotomy is at all fair. Just because
| someone makes a piece of software public does not mean they
| want it freely copied, and I think that can be a completely
| reasonable stance to have. I'm struggling to make sense of
| your argument unless you believe either:
|
| - Code is not intellectual property; I don't see this as
| easily defensible. It takes time, effort, and in some cases
| seriously heavy resources to come up with some of the tech
| companies rely on. Should all private companies rescind
| copyright on literally everything their staff write?
|
| - Intellectual property is a nonsense concept altogether;
| in this case, I don't think you're ever going to get your
| way in the court of public opinion.
| williamcotton wrote:
| This might help shed some light:
|
| https://en.wikipedia.org/wiki/Idea%E2%80%93expression_distin...
| rolph wrote:
| in many cases a snip;routine;proc...whatever you work with,
| is rote procedure. such as device access. ie retrieving a
| directory listing.
|
| code that reverts to a conserved sequence of bytes
| interchanged ,no functional variations.
|
| code that is so common knowledge it has become street
| graffiti, belongs in world 2
|
| versus code that creates a functionality not available by
| direct command, is innovative and should be attributed.
| this sounds like what 1st world should be.
| williamcotton wrote:
| That's not actually how it works. Purely functional code,
| such as code that is written in a certain way to achieve
| maximum performance, is not deemed expressive and
| therefore not covered by copyright. This code would be
| covered by patent.
| rolph wrote:
| i think we are actually talking about the same thing.
|
| in simple terms:
|
| mov bax eax ; an obvious function; no IP
|
| mov eax eax ; seems useless unless you know what
| de-referencing is. probably IP
|
| this is of course example not considering granularities
| at level of patents on a language, or macro directives
| rolph wrote:
| proper attribution to the writer seems to be a big part of
| this. there is also suggestion ms knows all about it but
| passes the liability buck to the end user of copilot
| suggestions.
|
| [Lawyer and developer Matthew Butterick announced last month
| that he'd teamed up with the Joseph Saveri Law Firm to
| investigate Copilot. They wanted to know if and how the
| software infringed upon the legal rights of coders by scraping
| and emitting their work without proper attribution under
| current open-source licenses.]
|
| https://www.theregister.com/2022/11/07/in_brief_ai/
|
| https://www.theregister.com/2022/10/19/github_copilot_copyri...
| layer8 wrote:
| Mitigating copyright issues by "rewriting in your own style"
| arguably only applies to humans doing the rewriting as a
| creative task, because copyright only applies to human creative
| works.
| ShamelessC wrote:
| Eh, their argument is simply that they tuned temperature settings
| to encourage the model to output slight variations on memorized
| data. But this is kind of just one of many things you do with a
| language model and certainly doesn't imply intent to avoid
| copyright allegations.
|
| Just implies they tuned it for user experience.
|
| I was expecting there to be some discovery around them
| deliberately fine tuning their model to output modifications if
| and only if the code had a certain license.
| kevingadd wrote:
| What's the value of slight variations? Isn't it more likely
| that the memorized data was already known to be good and
| effective? It doesn't seem like a useful change unless your
| goal is to avoid infringement. I don't see how randomly
| permuting the suggestions improves UX.
| moyix wrote:
| The lowest temperature isn't always the one that results in
| working code! This was shown in the original Codex paper:
|
| > When evaluating pass@k, it is important to optimize
| sampling temperature for the particular value of k. In Figure
| 5, we plot pass@k against the number of samples k and the
| sampling temperature. We find that higher temperatures are
| optimal for larger k, because the resulting set of samples
| has higher diversity, and the metric rewards only whether the
| model generates any correct solution.
|
| > In particular, for a 679M parameter model, the optimal
| temperature for pass@1 is T* = 0.2 and the optimal
| temperature for pass@100 is T* = 0.8. With these
| temperatures, we find that pass@1 and pass@100 scale smoothly
| as a function of model size (Figure 6).
|
| So even with pass@1 (likelihood of getting the right answer
| in 1 attempt) you don't use T=0, so there will be slight
| variations in the output each time.
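For reference, pass@k as used in the quoted passage can be estimated from n samples of which c pass the tests. The sketch below uses the unbiased estimator given in the Codex paper, 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and the example numbers are illustrative assumptions, not something from the thread.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples (out of the n drawn) contains at least one
    correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate becomes exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run: 100 samples, 5 of them pass the unit tests.
    print(pass_at_k(100, 5, 1))
    print(pass_at_k(100, 5, 10))
```

With the same 5 correct samples out of 100, the estimate for k=10 is higher than for k=1, which is the sense in which a more diverse set of samples (higher temperature) can pay off as k grows.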
| Brian_K_White wrote:
| Why else bother with such an input? Are randomizations more
| likely to be correct or more useful?
| slashdev wrote:
| I don't know much about AI, but I think one reason you might
| do that is to learn which variations are preferred (which are
| committed unmodified) so you can tune the model in the
| future. I don't know if Github does that, but given they've
| cited how often code from copilot is committed without
| modification, I assume they are measuring it at least in some
| cases.
| Brian_K_White wrote:
| makes sense
| brookst wrote:
| Huge topic, worth Googling. Short version is that too little
| randomness limits the solution space, so retrying suboptimal
| results yields the same problems.
| 2gremlin181 wrote:
| Ye olde Bias-Variance tradeoff
| seanhunter wrote:
| Generally the reason behind adding randomness to machine
| learning is avoiding "local minima" in the search space of
| the optimization function(s) used for training the model. If
| your training produces a very smooth descent to an optimum it
| can lead to the model converging on a solution that is not
| globally the best. Adding some randomness helps to avoid
| this.
|
| Specifically for GPT models, the temperature parameter is
| used to get outputs which are a bit more "creative" and less
| deterministic.
| https://help.promptitude.io/en/ai-providers/gpt-temperature
| cubefox wrote:
| Well, temperature 0 means the completion is always the most
| "likely" (or "best", after fine-tuning) token, while
| temperature 1 means to choose the next tokens stochastically
| according to their probability (or "goodness" after fine-
| tuning). Usually some temperature in between is chosen, like
| 0.7. It's not _a priori_ clear to me which is the best way to
| do it.
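A minimal sketch of the temperature behaviour described above, assuming the model exposes raw per-token scores (logits). The names `softmax_with_temperature` and `sample_token` and the toy logits are hypothetical; this is not how any particular product implements it.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick a token index from logits at the given temperature.

    Temperature 0 is treated as greedy decoding: always the top token.
    Temperature 1 samples from the model's unmodified distribution.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [1.0, 3.0, 2.0]  # made-up scores for three tokens
    rng = random.Random(0)
    for t in (0, 0.2, 0.7, 1.0):
        picks = [sample_token(toy_logits, t, rng) for _ in range(10)]
        print(t, picks)
```

Values between 0 and 1 sharpen the distribution toward the top token without making the output fully deterministic.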
| ianbutler wrote:
| Potentially more correct, yes. It frees the model to choose
| lower probability tokens to some degree, technically it
| boosts their probabilities, which may be more correct
| depending on the task.
|
| There are also sampling schemes, top_p and top_k which can
| each individually help choose tokens that are less probable
| (but still highly probable) but more correct, and they are
| often used together for the best effect.
|
| And then there are various decoding methods like beam search
| where choosing the best beam may not mean choosing the most
| probable individual token at each step.
|
| By default a simple greedy search is used which always
| chooses the next highest probability token.
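The sampling schemes mentioned above can be sketched as filters over a next-token probability distribution. This is a simplified illustration under assumptions: the helper names and the toy distribution are made up, and real libraries usually apply these filters to logits rather than probabilities.

```python
def renormalize(probs, keep):
    """Zero out every index not in keep and rescale the rest to sum to 1."""
    keep_set = set(keep)
    total = sum(probs[i] for i in keep_set)
    return [probs[i] / total if i in keep_set else 0.0
            for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def greedy(probs):
    """Greedy search step: always the single most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


if __name__ == "__main__":
    toy_probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution
    print(top_k_filter(toy_probs, 2))
    print(top_p_filter(toy_probs, 0.9))
    print(greedy(toy_probs))
```

Sampling from the filtered distribution can still pick a token other than the top one, but the low-probability tail is cut off, which is one reason the two filters are often combined with a nonzero temperature.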
| golemotron wrote:
| Yes.
| GuB-42 wrote:
| It is worthwhile with creative writing. For example if you
| ask ChatGPT to write a short story, you want some
| originality. Even when asking for an explanation it can be
| useful as you may want to try different things for the
| explanation that speaks to you the most.
|
| But here we are talking about autocompleting code. I don't
| think programmers want the autocompleter to be creative. They
| want the exact same solution everyone uses, hopefully the
| right one, with only minor changes so that it matches their
| style and uses their own variable names. In my case, I am the
| programmer, I decide what to do, I just want my autocompleter
| to save me some keystrokes and copy-pasting boilerplate from
| the web, the more it looks like existing code the better. I
| have enough work fixing my own bugs, thank you.
|
| Speaking of bugs, how come everyone talks about code
| generation, which I think doesn't bring that much value?
| Sure, it saves a few keystrokes and copy-pasting from
| StackOverflow, but I don't feel like it is the thing
| programmers spend most of the time doing. Dealing with bugs
| is. By bugs, I mean the big ones that have tickets and can
| take days to analyze and fix, but also the ones that are just
| a normal part of writing code, like simple typos that result
| in compiler errors. I think that machine learning could be of
| great help here.
|
| Just a system that tells me "hey, look here, this is not what
| I expected to see" would be of great help. Unexpected doesn't
| mean there is a bug, but it is something worth paying
| attention to. I know it has been done, but few people seem to
| talk about it. Or maybe a classifier trained on bug fix
| commits. If a piece of code looks like code that has been
| changed in a bug fix commit, there is a good chance it is
| also a bug. Have it integrated into the IDE, highlight the
| suspicious part as I type, just as modern IDEs highlight
| compilation errors in real time.
| brookst wrote:
| [flagged]
| williamcotton wrote:
| [flagged]
| matkoniecz wrote:
| > Downvoting
|
| Presumably people downvoted it because it is really unclear
| what exactly you are claiming.
|
| Instead of "Everyone needs to first familiarize themselves
| with" you could write a very simple summary of that and how it
| relates to this case and your next claim that
|
| > If you're under the impression that every line of code is
| covered by copyright you are very mistaken.
|
| Well, for example empty ones are really unlikely to be.
|
| Ones that quote out-of-copyright works also will not be.
| williamcotton wrote:
| [flagged]
| catiopatio wrote:
| The downvotes probably have to do with the fact that:
|
| (1) you led with a rude and mostly contentless comment,
| and
|
| (2) your follow-up is merely a dump of Wikipedia quotes,
| instead of actually summarizing what you've been trying to
| say.
___________________________________________________________________
(page generated 2023-06-10 23:02 UTC)