[HN Gopher] Try to guess if code is real or GPT2-generated
___________________________________________________________________
Try to guess if code is real or GPT2-generated
Author : AlexDenisov
Score : 74 points
Date : 2021-02-23 20:10 UTC (2 hours ago)
(HTM) web link (doesnotexist.codes)
(TXT) w3m dump (doesnotexist.codes)
| et1337 wrote:
| This looks like overfitting to me. Some of the GPT samples were
| definitely real code, or largely real code. One looked like
| something from Xorg, another like it was straight from the
| COLLADA SDK. It's really hard to define what "truly new code" is,
| if it's just the same code copy-pasted in a different order. Blah
| blah Ship of Theseus etc.
| sdflhasjd wrote:
| I'm 90% sure I just got a boost header which was apparently
| GPT-2 generated, hmmm.
|
| Sadly, I can't go back and see it again.
| moyix wrote:
| The generated snippets are prompted with 128 characters from
| real code (but not code from the training data), so they can
| often pick up on the name of the project etc.
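|
| For illustration, the prompting step amounts to something
| like this. A minimal sketch, assuming the checkpoint has been
| converted to the HuggingFace transformers format; the model
| directory and file names here are hypothetical:
|
|     # Sketch: prompt the model with the first 128 characters
|     # of a real source file and let it continue.
|     from transformers import GPT2LMHeadModel, GPT2TokenizerFast
|
|     model = GPT2LMHeadModel.from_pretrained("./csrc_final")
|     tokenizer = GPT2TokenizerFast.from_pretrained("./csrc_final")
|
|     prompt = open("example.c").read()[:128]
|     ids = tokenizer(prompt, return_tensors="pt").input_ids
|     out = model.generate(ids, max_length=512, do_sample=True,
|                          top_k=40)
|     print(tokenizer.decode(out[0]))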
| et1337 wrote:
| Apologies if my comment was dismissive. This is an impressive
| project!
| dzdt wrote:
| I got some code related to VICE emulator. It looked pretty
| real, referring to concepts that make sense in the context of a
| C64 emulator, but the results said it was GPT, not real code. It
| even had the correct GPL license matching that project. It
| seems the GPT model has learned quite a bit about the real
| projects it was fed as input.
| moyix wrote:
| It has entirely memorized a bunch of common open source
| licenses, a bunch of contributor names/emails, and so on.
| However, when I've tried to locate the actual code it's
| producing in the training data, it's not there.
| minimaxir wrote:
| Overfitting on 17GB of input data would be interesting, even
| though it's using the "large" 774M GPT-2 model.
|
| It's possible that training for a month was too much.
| TehCorwiz wrote:
| The two factors that seemed like dead giveaways were comments
| that didn't relate to the code, and sequences of repetition with
| minor or no variations.
| hertzrat wrote:
| If only. Humans leave dead comments all the time, and I was
| wrong when I guessed "gpt" wrote this based on that. Confusing
| code isn't a reliable tell either, unless it's a syntax error.
| neolog wrote:
| The black-on-white color scheme makes it annoying to read.
| damenut wrote:
| This was so much harder than I thought it was going to be. I
| would get a few right and then be absolutely sure of the next one
| and be wrong. After a while I felt like I was noticing more
| aesthetic differences between the gpt and real, rather than
| distinguishing between the two based on their content. Very
| interesting...
| _coveredInBees wrote:
| I got 4/4 GPT-2 guesses right. It is impressive but the "tell"
| I've found so far is just poor structure in the logic of how
| something is arranged. For example: a bunch of `if` statements in
| sequence without any `else` clauses, some directly contradicting
| prior clauses. Another example was repeating the same operation a
| few times in individual lines of code which most human
| programmers would write in a simpler manner.
|
| It's harder to do with some of the smaller excerpts though, and
| I'm sure there are examples of terrible human
| programmers who write worse code than GPT-2.
| Aardwolf wrote:
| I found that a license comment followed by C++ code, with no
| header inclusions in between, is a clear tell for GPT-2-
| generated code.
| [deleted]
| theurbandragon wrote:
| How long before we can just write specs instead of code?
| cryptica wrote:
| I was always able to correctly identify GPT2 but on a few
| occasions, I misidentified human-written code as being written by
| GPT2. Usually when the code was poorly written or the comments
| were unclear.
|
| GPT2's code looks like correct code at a glance but when you try
| to understand what it's doing, that's when you understand that it
| could not have been written by a human.
|
| It's similar to the articles produced by GPT3; they have the
| right form but no substance.
| tpoacher wrote:
| There is a "codes" top-level domain? Codes? CODES??
|
| What's next? Advices? Feedbacks? Rests?
|
| I give ups.
| klik99 wrote:
| For the ones that were just part of the header file, listing a
| bunch of instance variables and function names, it seems
| impossible. But for the actual code, it is possible but still
| quite difficult, though I spent too long hunting for some logical
| inconsistency that would give it away.
| AnssiH wrote:
| Ah, 0/5, I give up :)
| t0astbread wrote:
| Just invert your guesses then!
| thewarrior wrote:
| This is actually quite impressive. Try reading the comments in
| the code. The comments often make perfect sense in the local
| context even if it's GPT-2 gibberish.
|
| The real examples have worse comments at times.
|
| The only flaw is that it shows fake code most of the time so you
| can game it that way.
| jackson1442 wrote:
| Some of the "real" ones have absolutely atrocious comments. Two
| variables labelled with literally just the name of the variable
| like so:
|
|     bool hasdied // has died
|
| and then a `// done` for seemingly no reason after initializing
| variables... where did this code come from?!
| moyix wrote:
| The "real" code came from these packages:
| https://moyix.net/~moyix/sample_pkgnames.txt
| hertzrat wrote:
| The goal when writing code is to be pretty machine-like and to
| keep things extremely simple. People also write dead or off-
| topic comments. That's why this is so hard.
| Aardwolf wrote:
| There was some code about TIFF headers, and it was apparently
| GPT2 generated
|
| TIFF is a real thing, so some human was involved in some part of
| that code; it has just been garbled up by GPT2... In other words,
| the training set shows quite visibly in the result.
| ivraatiems wrote:
| I found this impressively hard at first glance. It just goes to
| show how difficult getting into context is in an unfamiliar
| codebase. I think with any amount of knowledge of anything
| allegedly involved (or, you know, a compiler), these examples
| would fall apart, but it's still an achievement.
|
| I'm also pretty sure there are formatting, commenting, and in-
| string-text "tells" that reliably indicate whether something is
| GPT2. Maybe I should try training an AI to figure that out...
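|
| If I did, a first cut might look something like this. A
| minimal sketch with scikit-learn; `snippets` and `labels` are
| hypothetical lists of code strings and real/GPT2 labels:
|
|     # Sketch: a crude real-vs-GPT2 classifier over character
|     # n-grams, which should pick up formatting "tells".
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.pipeline import make_pipeline
|
|     clf = make_pipeline(
|         TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
|         LogisticRegression(max_iter=1000),
|     )
|     clf.fit(snippets, labels)
|     print(clf.predict(["int main() { return 0; }"]))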
| pwinnski wrote:
| I tried using a weird indent as a signal of GPT-2... which gave
| me my first wrong answer. 4/5.
| [deleted]
| Aeronwen wrote:
| Got 40/50 just smashing the GPT2 button.
| lelandbatey wrote:
| Interesting, I guessed GPT2 200 times in a row and found that
| GPT2 was correct only 89/200 times, so about 45% was GPT2 for
| me.
| loa_in_ wrote:
| I had about 50% picking one option over and over.
| thebean11 wrote:
| 6/6, quitting while I'm ahead
| qayxc wrote:
| same here :D
| Felk wrote:
| I got a function that assigned the same expression to three
| variables. Then it declared a void function with documentation
| stating "returns true on success, false otherwise". Apparently
| that code was written by a human, which makes me doubt either
| the correctness of that website or the quality of the code it
| was fed.
| dsilin wrote:
| Maybe the probability of GPT2 generating that sequence is
| nearly 0. Sometimes weird edge cases are more human.
| psyklic wrote:
| Same thought here - apparently humans read from uninitialized
| arrays immediately after declaring them! That said, it is still
| a pretty fun website :)
| dataflow wrote:
| I actually ran into a case where I _wanted_ to do this, but
| was forced not to.
|
| What was the scenario? I had a couple of small, fixed-size
| char buffers and I wanted to swap their valid portions, but
| the obvious choice of swap_ranges(a, b, a + max(na, nb))
| would run into this issue. (n.b. this wouldn't be correct for
| non-POD types anyway, but we're talking about chars.)
|
| On top of it being annoying to not be able to do the
| convenient thing, it made life harder when debugging, because
| the "correct" solution does not preservs the bit patterns
| (0xCC/0xCD or whatever) that the debug build injects into
| uninitialized arrays, thereby making it harder to tell when
| I later read an uninitialized element from a swapped-from
| array.
| moyix wrote:
| Since these are snippets from a random position in the file,
| it's possible that the code that initialized them was outside
| the snippet?
| skissane wrote:
| First code it showed me had getXXX() methods returning void,
| each of which contained nothing but a printf using the same
| string variable with no apparent connection to XXX, along with
| invalid format strings. Surely code this nonsensical has to be
| generated. Yet when I clicked "GPT2" it said I was wrong.
| emteycz wrote:
| Don't underestimate the power of failed merges and
| indifference
| moyix wrote:
| This made me worried, so I went and spot-checked 5-6. Using the
| "cheat sheet" I was always able to guess correctly, so I think
| the site is working fine.
|
| The list of packages the real snippets are drawn from is here
| (maybe if you want to avoid using them... ;) ):
|
| https://moyix.net/~moyix/sample_pkgnames.txt
|
| Note that the GPT samples are prompted with 128 characters
| randomly selected from those same packages, so you will see
| GPT2-generated code that mentions the package name etc.
| However, these packages were not used for training.
| technologia wrote:
| This was a fun exercise, definitely think this could be difficult
| to suss out for greener devs or even more experienced ones. It'd
| be hilarious to have this model power a live screensaver in lieu
| of actually being busy at times.
| The_rationalist wrote:
| How much of it is just regurgitating the training set and
| therefore chunks of real code?
| nickysielicki wrote:
| This is difficult... because these models are just regurgitating
| after training on real code. Fun little site but I hope nobody
| reads too much into this.
| moyix wrote:
| I've tried searching for variable and function names and even
| bits of comments to see if they're copied from the training
| data. They're not!
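|
| That kind of check is easy to reproduce against the released
| dataset. A minimal sketch, assuming the loose-JSON layout
| Megatron-LM expects (one {"text": ...} object per line); the
| search string is a placeholder:
|
|     # Sketch: look for a literal snippet in the training data.
|     import gzip, json
|
|     needle = "some_generated_identifier"  # hypothetical
|     with gzip.open("csrc_dataset_large.json.gz", "rt") as f:
|         for n, line in enumerate(f):
|             if needle in json.loads(line).get("text", ""):
|                 print("found in record", n)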
| moyix wrote:
| Hi, author here! Some details on the model:
|
| * Trained on 17GB of code from the top 10,000 most popular
| Debian packages. The source files were deduplicated using a
| process similar to the OpenWebText preprocessing (basically a
| locality-sensitive hash to detect near-duplicates; see the
| sketch below).
|
| * I used the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
| code for training. Training took about 1 month on 4x RTX8000
| GPUs.
|
| * You can download the trained model here:
| https://moyix.net/~moyix/csrc_final.zip and the dataset/BPE vocab
| here: https://moyix.net/~moyix/csrc_dataset_large.json.gz
| https://moyix.net/~moyix/csrc_vocab_large.zip
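|
| For the curious, the near-duplicate step can be reproduced with
| MinHash LSH. A minimal sketch using the datasketch library; the
| shingle size and similarity threshold are illustrative guesses,
| not the exact values used:
|
|     # Sketch: drop source files that are near-duplicates of one
|     # already seen, keeping only the first copy.
|     from datasketch import MinHash, MinHashLSH
|
|     def minhash(text, num_perm=128):
|         m = MinHash(num_perm=num_perm)
|         for i in range(len(text) - 4):       # 5-char shingles
|             m.update(text[i:i + 5].encode("utf8"))
|         return m
|
|     lsh = MinHashLSH(threshold=0.8, num_perm=128)
|     kept = {}
|     for name, text in corpus.items():        # corpus: {path: source}
|         m = minhash(text)
|         if lsh.query(m):                      # near-dup already kept
|             continue
|         lsh.insert(name, m)
|         kept[name] = text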
|
| Happy to answer any questions!
| ivraatiems wrote:
| Thanks for stopping by! This is impressive. I would be curious
| to know if my hunch below about potential weaknesses/tells was
| at all correct.
|
| Did people find it to be as challenging when you showed it to
| them as some of us are here? Did you expect that level of
| complexity?
| moyix wrote:
| There are likely some "tells" but many fewer of them than I
| expected. I've seen it occasionally generate something
| malformed like "#includefrom", and like all GPT2 models it
| has a tendency to repeat things.
|
| Yes, I think people definitely find it challenging. I'm
| keeping track of the correct and total guesses for each
| snippet; right now people are at almost exactly 50% accuracy:
|
|     correct | total | pct
|     --------+-------+-----
|        6529 | 12963 |  50
| crote wrote:
| How can you differentiate between people genuinely trying to
| discern them, and people randomly clicking one of the
| buttons to see the answer? The latter would also result in
| a 50% accuracy, regardless of the actual GPT quality.
| moyix wrote:
| Yep, there could be a lot of noise in there too from
| people guessing lazily.
| hntrader wrote:
| Are you presenting real samples and GPT2 samples to users
| with equal probabilities?
|
| EDIT: another poster guessed GPT2 each time and found the
| frequency was 80 percent.
| lelandbatey wrote:
| I guessed GPT2 200 times in a row and found that GPT2 was
| correct only 89/200 times, so about 45% was GPT2 for me.
| wnoise wrote:
| In [2]: scipy.stats.binom_test(89, 200, 0.5)
| Out[2]: 0.13736665086863936
|
| Unusual to be this lopsided (1-in-7), but not crazy.
| moyix wrote:
| It should be equal: there are 1000 real and 1000
| generated samples in the database, retrieved via:
|
| SELECT id, code, real FROM code ORDER BY random() LIMIT 1
| lostmsu wrote:
| Can you share the dataset too?
| moyix wrote:
| Sure, it's here in JSON format:
| https://moyix.net/~moyix/csrc_dataset_large.json.gz
| lostmsu wrote:
| I am curious, how were you able to feed the GPUs? Did you
| simply preload the entire dataset into RAM (it certainly
| seems possible)? Did you preapply BPE? Did you train your
| own BPE?
| minimaxir wrote:
| Note that the encoded dataset (through the GPT-2 BPE
| tokenizer) will be much, much smaller than 17GB, both on
| disk and in memory (in my experience it can be anywhere
| from 1/3rd to 1/2 the size).
|
| If finetuning an existing GPT-2 model, you must use that
| BPE tokenizer; you could theoretically use your own but
| that wouldn't make a difference performance-wise and
| you'd just have a lot of wasted tokenspace.
|
| The efficiency gains from using your own tokenizer for
| bespoke, esoteric content that does not match typical
| internet text (like this) are why I recommend training
| your own tokenizer and GPT-2 from scratch if possible.
| moyix wrote:
| Yeah, I trained my own BPE tokenizer for this and it
| results in pretty good compression. From 1024 BPE tokens
| you can generate anywhere from 2000-6000 actual
| characters of text. My guess is that it's a bit more
| efficient than English-BPE because there's a lot of
| repetitive stuff in source code (think spaces for
| indentation, or "if("/"while("/"for (int").
| moyix wrote:
| Yep, I trained my own BPE using HuggingFace's tokenizer
| library. During training I didn't keep the entire dataset
| in memory because even on an RTX8000 the full dataset +
| model weights + data used by the optimizer (Adam) is too
| big.
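|
| The tokenizer training itself is only a few lines. A minimal
| sketch with the HuggingFace tokenizers library; the vocab size
| and file names are placeholders, not the values actually used:
|
|     # Sketch: train a byte-level BPE on source code, then
|     # check how far a fixed token budget stretches.
|     from tokenizers import ByteLevelBPETokenizer
|
|     tok = ByteLevelBPETokenizer()
|     tok.train(files=["corpus.txt"], vocab_size=50257)
|     tok.save_model("csrc_vocab")
|
|     src = open("example.c").read()
|     enc = tok.encode(src)
|     print(f"{len(src)} chars -> {len(enc.ids)} tokens")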
___________________________________________________________________
(page generated 2021-02-23 23:02 UTC)