[HN Gopher] Try to guess if code is real or GPT2-generated
       ___________________________________________________________________
        
       Try to guess if code is real or GPT2-generated
        
       Author : AlexDenisov
       Score  : 74 points
       Date   : 2021-02-23 20:10 UTC (2 hours ago)
        
 (HTM) web link (doesnotexist.codes)
 (TXT) w3m dump (doesnotexist.codes)
        
       | et1337 wrote:
       | This looks like overfitting to me. Some of the GPT samples were
       | definitely real code, or largely real code. One looked like
       | something from Xorg, another like it was straight from the
       | COLLADA SDK. It's really hard to define what "truly new code" is,
       | if it's just the same code copy pasted in different order. Blah
       | blah Ship of Theseus etc.
        
         | sdflhasjd wrote:
         | I'm 90% sure I just got a boost header which was apparently
         | GPT-2 generated, hmmm.
         | 
          | Sadly, I can't go back and see it again.
        
         | moyix wrote:
         | The generated snippets are prompted with 128 characters from
         | real code (but not code from the training data), so they can
         | often pick up on the name of the project etc.
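          | 
          | The generation step looks roughly like this sketch
          | (illustrative only: it uses a HuggingFace-style API with a
          | made-up checkpoint name and a hypothetical held-out file,
          | not the actual generation code):
          | 
          |     # Illustrative sketch; names/paths are hypothetical
          |     from transformers import (GPT2LMHeadModel,
          |                               GPT2TokenizerFast)
          | 
          |     tok = GPT2TokenizerFast.from_pretrained("csrc-gpt2")
          |     model = GPT2LMHeadModel.from_pretrained("csrc-gpt2")
          | 
          |     # first 128 characters of a real, held-out source file
          |     prompt = open("held_out/foo.c").read()[:128]
          |     ids = tok(prompt, return_tensors="pt").input_ids
          | 
          |     out = model.generate(ids, do_sample=True, top_k=40,
          |                          max_length=512)
          |     print(tok.decode(out[0]))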
        
           | et1337 wrote:
           | Apologies if my comment was dismissive. This is an impressive
           | project!
        
         | dzdt wrote:
         | I got some code related to VICE emulator. It looked pretty
         | real, referring to concepts that make sense in the context of a
         | C64 emulator, but the results said it was GPT not real code. It
         | even had the correct GPL license matching that project. It
         | seems the GPT model has learned quite a bit about the real
         | projects it was fed as input.
        
           | moyix wrote:
           | It has entirely memorized a bunch of common open source
           | licenses, a bunch of contributor names/emails, and so on.
           | However when I've tried to locate the actual code it's
           | producing in the training data it's not there.
        
         | minimaxir wrote:
         | Overfitting on 17GB of input data would be interesting, even
         | though it's using the "large" 774M GPT-2 model.
         | 
          | It's possible that training for a month was too much.
        
       | TehCorwiz wrote:
       | The two factors that seemed like dead giveaways were comments
       | that didn't relate to the code, and sequences of repetition with
       | minor or no variations.
        
         | hertzrat wrote:
          | If only. Humans leave dead comments all the time, and I
          | guessed wrong when I picked "gpt" based on that. Code being
          | confusing isn't a reliable tell either, unless it's an
          | outright syntax error.
        
       | neolog wrote:
       | The black background on white background makes it annoying to
       | read.
        
       | damenut wrote:
       | This was so much harder than I thought it was going to be. I
       | would get a few right and then be absolutely sure of the next one
       | and be wrong. After a while I felt like I was noticing more
       | aesthetic differences between the gpt and real, rather than
       | distinguishing between the two based on their content. Very
       | interesting...
        
       | _coveredInBees wrote:
       | I got 4/4 GPT-2 guesses right. It is impressive but the "tell"
       | I've found so far is just poor structure in the logic of how
        | something is arranged. For example: a bunch of `if` statements
        | in sequence without any `else` clauses, some of whose conditions
        | directly contradict earlier ones. Another example was repeating
        | the same operation a few times across individual lines of code
        | that most human programmers would write in a simpler way.
       | 
       | It's harder to do with some of the smaller excerpts though, and
       | I'm sure there are probably examples of terrible human
       | programmers who write worse code than GPT-2.
        
         | Aardwolf wrote:
         | I found that a comment with license, followed by C++ code, but
         | without any header inclusions in between, is a clear tell for
         | GPT-2 generated code
        
       | [deleted]
        
       | theurbandragon wrote:
       | How long before we can just write specs instead of code?
        
       | cryptica wrote:
       | I was always able to correctly identify GPT2 but on a few
       | occasions, I misidentified human-written code as being written by
       | GPT2. Usually when the code was poorly written or the comments
       | were unclear.
       | 
        | GPT2's code looks correct at a glance, but when you try to
        | understand what it's doing, that's when you realize it could
        | not have been written by a human.
       | 
       | It's similar to the articles produced by GPT3; they have the
       | right form but no substance.
        
       | tpoacher wrote:
       | There is a "codes" top-level domain? Codes? CODES??
       | 
       | What's next? Advices? Feedbacks? Rests?
       | 
       | I give ups.
        
       | klik99 wrote:
       | For the ones that were just part of the header file, listing a
       | bunch of instance variables and function names, it seems
        | impossible. But for the actual code it is possible, though
        | still quite difficult, and I spent too long hunting for some
        | logical inconsistency that would give it away.
        
       | AnssiH wrote:
       | Ah, 0/5, I give up :)
        
         | t0astbread wrote:
         | Just invert your guesses then!
        
       | thewarrior wrote:
       | This is actually quite impressive. Try reading the comments in
       | the code. The comments often make perfect sense in the local
       | context even if it's GPT-2 gibberish.
       | 
       | The real examples have worse comments at times.
       | 
       | The only flaw is that it shows fake code most of the time so you
       | can game it that way.
        
         | jackson1442 wrote:
         | Some of the "real" ones have absolutely atrocious comments. Two
          | variables labelled with literally just the name of the
          | variable, like so:
          | 
          |     bool hasdied // has died
         | 
         | and then a `// done` for seemingly no reason after initializing
         | variables... where did this code come from?!
        
           | moyix wrote:
           | The "real" code came from these packages:
           | https://moyix.net/~moyix/sample_pkgnames.txt
        
       | hertzrat wrote:
       | The goal when writing code is to be pretty machine like and to
       | keep things extremely simple. People also write dead or off topic
       | comments. That's why this is so hard
        
       | Aardwolf wrote:
       | There was some code about TIFF headers, and it was apparently
       | GPT2 generated
       | 
        | TIFF is a real thing, so some human was involved in some part
        | of that code; it has just been garbled up by GPT2... In other
        | words, the training set is showing quite visibly in the result.
        
       | ivraatiems wrote:
       | I found this impressively hard at first glance. It just goes to
       | show how difficult getting into context is in an unfamiliar
       | codebase. I think with any amount of knowledge of anything
       | allegedly involved (or, you know, a compiler), these examples
       | would fall apart, but it's still an achievement.
       | 
        | I'm also pretty sure there are formatting, commenting, and in-
        | string-text "tells" that reliably indicate whether something is
        | GPT2. Maybe I should try training an AI to figure that out...
        
         | pwinnski wrote:
         | I tried using a weird indent as a signal of GPT-2... which gave
         | me my first wrong answer. 4/5.
        
           | [deleted]
        
       | Aeronwen wrote:
       | Got 40/50 just smashing the GPT2 button.
        
         | lelandbatey wrote:
          | Interesting, I guessed GPT2 each time, 200 times in a row,
          | and found that GPT2 was correct only 89/200 times, so about
          | 45% was GPT2 for me.
        
         | loa_in_ wrote:
         | I had about 50% picking one option over and over.
        
       | thebean11 wrote:
       | 6/6, quitting while I'm ahead
        
         | qayxc wrote:
         | same here :D
        
       | Felk wrote:
       | I got a function that assigned the same expression to three
       | variables. Then it declared a void function with documentation
       | stating "returns true on success, false otherwise". Apparently
        | that code was written by a human, which makes me doubt either
        | the correctness of that website or the quality of the code it
        | was fed.
        
         | dsilin wrote:
         | Maybe the probability of GPT2 generating that sequence is
         | nearly 0. Sometimes weird edge cases are more human.
        
         | psyklic wrote:
         | Same thought here - apparently humans read from uninitialized
         | arrays immediately after declaring them! That said, it is still
         | a pretty fun website :)
        
           | dataflow wrote:
           | I actually ran into a case where I _wanted_ to do this, but
           | was forced not to.
           | 
           | What was the scenario? I had a couple of small, fixed-size
           | char buffers and I wanted to swap their valid portions, but
            | the obvious choice of swap_ranges(a, a + max(na, nb), b)
           | would run into this issue. (n.b. this wouldn't be correct for
           | non-POD types anyway, but we're talking about chars.)
           | 
           | On top of it being annoying to not be able to do the
           | convenient thing, it made life harder when debugging, because
           | the "correct" solution does not preservs the bit patterns
           | (0xCC/0xCD or whatever) that the debug build injects into
           | uninitialized arrays, therefore making it harder to tell when
           | I later read an uninitialized element from a swapped-from
           | array.
        
           | moyix wrote:
           | Since these are snippets from a random position in the file,
           | it's possible that the code that initialized them was outside
           | the snippet?
        
         | skissane wrote:
         | First code it showed me had getXXX() methods returning void,
         | each of which contained nothing but a printf using the same
         | string variable with no apparent connection to XXX, along with
         | invalid format strings. Surely code this nonsensical has to be
         | generated. Yet when I clicked "GPT2" it said I was wrong.
        
           | emteycz wrote:
           | Don't underestimate the power of failed merges and
           | indifference
        
         | moyix wrote:
         | This made me worried, so I went and spot-checked 5-6. Using the
         | "cheat sheet" I was always able to guess correctly, so I think
         | the site is working fine.
         | 
         | The list of packages the real snippets are drawn from is here
         | (maybe if you want to avoid using them... ;) ):
         | 
         | https://moyix.net/~moyix/sample_pkgnames.txt
         | 
         | Note that the GPT samples are prompted with 128 characters
         | randomly selected from those same packages, so you will see
         | GPT2-generated code that mentions the package name etc.
         | However, these packages were not used for training.
        
       | technologia wrote:
        | This was a fun exercise; I definitely think this could be
        | difficult to suss out for greener devs or even more experienced
        | ones. It'd
       | be hilarious to have this model power a live screensaver in lieu
       | of actually being busy at times.
        
       | The_rationalist wrote:
       | How much of it is just regurgitating the training set and
       | therefore chunks of real code?
        
       | nickysielicki wrote:
       | This is difficult... because these models are just regurgitating
       | after training on real code. Fun little site but I hope nobody
       | reads too much into this.
        
         | moyix wrote:
         | I've tried searching for variable and function names and even
         | bits of comments to see if they're copied from the training
         | data. They're not!
        
       | moyix wrote:
       | Hi, author here! Some details on the model:
       | 
        | * Trained on 17GB of code from the top 10,000 most popular
        | Debian packages. The source files were deduplicated using a
        | process similar to the OpenWebText preprocessing (basically a
        | locality-sensitive hash to detect near-duplicates).
       | 
       | * I used the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
       | code for training. Training took about 1 month on 4x RTX8000
       | GPUs.
       | 
       | * You can download the trained model here:
       | https://moyix.net/~moyix/csrc_final.zip and the dataset/BPE vocab
       | here: https://moyix.net/~moyix/csrc_dataset_large.json.gz
       | https://moyix.net/~moyix/csrc_vocab_large.zip
       | 
       | Happy to answer any questions!
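        | 
        | The near-duplicate filtering is conceptually along these
        | lines (a rough sketch, not the exact preprocessing code; the
        | datasketch library, shingle size, and similarity threshold
        | are illustrative choices):
        | 
        |     # Rough near-dedup sketch; parameters are illustrative
        |     from datasketch import MinHash, MinHashLSH
        | 
        |     def fingerprint(text, k=5, num_perm=128):
        |         # hash character k-grams into a MinHash signature
        |         m = MinHash(num_perm=num_perm)
        |         for i in range(len(text) - k + 1):
        |             m.update(text[i:i + k].encode("utf-8"))
        |         return m
        | 
        |     corpus = [("a.c", "int main() { return 0; }"),
        |               ("b.c", "int main() { return 0; }")]
        |     lsh = MinHashLSH(threshold=0.8, num_perm=128)
        |     kept = []
        |     for path, src in corpus:
        |         m = fingerprint(src)
        |         if not lsh.query(m):  # no near-duplicate seen yet
        |             lsh.insert(path, m)
        |             kept.append(path)
        |     print(kept)  # ['a.c']; b.c dropped as a near-dup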
        
         | ivraatiems wrote:
         | Thanks for stopping by! This is impressive. I would be curious
         | to know if my hunch below about potential weaknesses/tells was
         | at all correct.
         | 
         | Did people find it to be as challenging when you showed it to
         | them as some of us are here? Did you expect that level of
         | complexity?
        
           | moyix wrote:
           | There are likely some "tells" but many fewer of them than I
           | expected. I've seen it occasionally generate something
           | malformed like "#includefrom", and like all GPT2 models it
           | has a tendency to repeat things.
           | 
           | Yes, I think people definitely find it challenging. I'm
           | keeping track of the correct and total guesses for each
            | snippet; right now people are at almost exactly 50% accuracy:
            | 
            |     correct | total | pct
            |     --------+-------+-----
            |        6529 | 12963 |  50
        
             | crote wrote:
              | How can you distinguish between people genuinely trying
              | to discern them, and people randomly clicking one of the
             | buttons to see the answer? The latter would also result in
             | a 50% accuracy, regardless of the actual GPT quality.
        
               | moyix wrote:
               | Yep, there could be a lot of noise in there too from
               | people guessing lazily.
        
             | hntrader wrote:
             | Are you presenting real samples and GPT2 samples to users
             | with equal probabilities?
             | 
              | EDIT: another poster guessed GPT2 each time and found the
              | frequency was 80 percent.
        
               | lelandbatey wrote:
                | I guessed GPT2 each time, 200 times in a row, and found
                | that GPT2 was correct only 89/200 times, so about 45%
                | was GPT2 for me.
        
               | wnoise wrote:
                |     In [2]: scipy.stats.binom_test(89, 200, 0.5)
                |     Out[2]: 0.13736665086863936
               | 
               | Unusual to be this lopsided (1-in-7), but not crazy.
        
               | moyix wrote:
               | It should be equal: there are 1000 real and 1000
               | generated samples in the database, retrieved via:
               | 
               | SELECT id, code, real FROM code ORDER BY random() LIMIT 1
        
         | lostmsu wrote:
         | Can you share the dataset too?
        
           | moyix wrote:
           | Sure, it's here in JSON format:
           | https://moyix.net/~moyix/csrc_dataset_large.json.gz
        
             | lostmsu wrote:
             | I am curious, how were you able to feed the GPUs? Did you
             | simply preload the entire dataset into RAM (it certainly
             | seems possible)? Did you preapply BPE? Did you train your
             | own BPE?
        
               | minimaxir wrote:
               | Note that the encoded dataset (through the GPT-2 BPE
                | tokenizer) will be much, much smaller than 17GB, both on
                | disk and in memory (in my experience it can be anywhere
                | from 1/3rd to 1/2 the size).
               | 
               | If finetuning an existing GPT-2 model, you must use that
               | BPE tokenizer; you could theoretically use your own but
               | that wouldn't make a difference performance-wise and
               | you'd just have a lot of wasted tokenspace.
               | 
                | The efficiency gains from using your own tokenizer for
                | bespoke, esoteric content that does not match typical
                | internet speak (like this) are why I recommend training
                | your own tokenizer and GPT-2 from scratch if possible.
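                | 
                | As a quick illustration of that size difference (this
                | assumes the stock GPT-2 tokenizer loaded via the
                | transformers library; the exact ratio varies by
                | corpus):
                | 
                |     # Token count vs. character count (illustrative)
                |     from transformers import GPT2TokenizerFast
                | 
                |     tok = GPT2TokenizerFast.from_pretrained("gpt2")
                |     src = "for (int i = 0; i < n; i++) {\n"
                |     src += "    total += buf[i];\n}\n"
                |     ids = tok(src)["input_ids"]
                |     print(len(src), "chars ->", len(ids), "tokens")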
        
               | moyix wrote:
               | Yeah, I trained my own BPE tokenizer for this and it
               | results in pretty good compression. From 1024 BPE tokens
               | you can generate anywhere from 2000-6000 actual
               | characters of text. My guess is that it's a bit more
               | efficient than English-BPE because there's a lot of
               | repetitive stuff in source code (think spaces for
               | indentation, or "if("/"while("/"for (int").
        
               | moyix wrote:
               | Yep, I trained my own BPE using HuggingFace's tokenizer
               | library. During training I didn't keep the entire dataset
               | in memory because even on an RTX8000 the full dataset +
               | model weights + data used by the optimizer (ADAM) is too
               | big.
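                | 
                | Roughly, that looks like this minimal sketch
                | (byte-level BPE is an assumption here, and the file
                | list, vocab size, and output directory are
                | placeholders rather than the settings actually used):
                | 
                |     # Minimal BPE training sketch (placeholders)
                |     import os
                |     from tokenizers import ByteLevelBPETokenizer
                | 
                |     bpe = ByteLevelBPETokenizer()
                |     bpe.train(files=["csrc_corpus.txt"],
                |               vocab_size=50257,
                |               special_tokens=["<|endoftext|>"])
                |     os.makedirs("csrc_vocab", exist_ok=True)
                |     bpe.save_model("csrc_vocab")
                | 
                |     ids = bpe.encode("if (x < 0) return -1;").ids
                |     print(len(ids), "tokens")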
        
       ___________________________________________________________________
       (page generated 2021-02-23 23:02 UTC)