[HN Gopher] What GPT-OSS leaks about OpenAI's training data
       ___________________________________________________________________
        
       What GPT-OSS leaks about OpenAI's training data
        
       Author : fi-le
       Score  : 123 points
       Date   : 2025-10-05 18:28 UTC (4 hours ago)
        
 (HTM) web link (fi-le.net)
 (TXT) w3m dump (fi-le.net)
        
       | zaptrem wrote:
       | > There are about 936 tokens with very low L2 norm, centered at
       | about 2. This likely means that they did not occur in the
       | training process of GPT-oss and were thus depressed by some form
       | of weight decay.
       | 
       | Afaik embedding and norm params are excluded from weight decay as
       | standard practice. Is this no longer true?
       | 
       | E.g., they exclude them in minGPT:
       | https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab...
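        | 
        | For anyone unfamiliar, the usual exclusion is done via optimizer
        | parameter groups. A rough PyTorch sketch of the idea (not the
        | minGPT code itself, just one common way to set it up):
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     def make_optimizer(model: nn.Module, lr=3e-4, wd=0.1):
        |         # decay only >=2-D weights of Linear-like modules; leave
        |         # biases, norm weights and embedding weights undecayed
        |         no_decay_types = (nn.LayerNorm, nn.Embedding)
        |         decay, no_decay, seen = [], [], set()
        |         for mod in model.modules():
        |             for p in mod.parameters(recurse=False):
        |                 if not p.requires_grad or id(p) in seen:
        |                     continue  # skip frozen and tied/shared params
        |                 seen.add(id(p))
        |                 if p.ndim < 2 or isinstance(mod, no_decay_types):
        |                     no_decay.append(p)
        |                 else:
        |                     decay.append(p)
        |         return torch.optim.AdamW(
        |             [{"params": decay, "weight_decay": wd},
        |              {"params": no_decay, "weight_decay": 0.0}],
        |             lr=lr)
        | 
        | If a run does decay its embedding rows (contrary to that
        | convention), tokens that never appear in training would indeed
        | get pulled toward small norms, which is what the article's
        | argument assumes.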
        
         | 3abiton wrote:
          | Unfortunately the article glosses over some of the practices
          | for uncovering such patterns in the training data. It goes
          | very straightforwardly to the point, no lube needed. It didn't
          | land well for me.
        
       | behnamoh wrote:
       | Is there any work on reverse engineering LLMs, especially the
       | closed source API ones? For example, how can we learn about the
       | data used in Claude Sonnet 4.5 training?
       | 
        | And, trickier but just as important, is there any work on
        | extrapolating back to the pretrained model AFTER it's been
        | RLHF'd? For example, what kinds of biases existed in gpt-4o
        | before it was de-biased?
       | 
        | Do biases go away completely, or do they just get suppressed
        | down deep in the model's "mind"?
        
         | tptacek wrote:
         | Yes.
         | 
         | https://arxiv.org/abs/2403.06634
         | 
         | https://arxiv.org/abs/2311.17035
         | 
         | (I just have these ones off the top of my head because I'm a
         | Nicholas Carlini fan and we interviewed him about these
         | attacks.)
        
         | zer00eyz wrote:
          | > Do biases go away completely, or do they just get suppressed
          | down deep in the model's "mind"?
         | 
         | Bias is a human term, and couching the conversation in that
         | context does nothing to address the issue here, because it gets
         | into the quagmire of social context.
         | 
          | Let's say LLMs had taken off 15 years ago, at the point when
          | systemd launched. All the answers given would be weighted
          | toward the old init system simply because there was a lack of
          | information about the new one.
          | 
          | LLMs are only repeating the data they are given, and it's
          | cheaper to remove the data after the fact than it is to try to
          | scrub it out of the training data.
        
       | Wowfunhappy wrote:
        | Maybe I'm misinterpreting, but the article seems (?) to be
        | implying there's something scandalous about OpenAI training on
        | adult websites.
       | 
       | I find that odd. Would anyone be surprised to know that Google
       | indexes adult websites, and ranks them in its search algorithm?
       | If not, what is the difference for an LLM?
        
         | refulgentis wrote:
         | FWIW, I didn't get that sense.
        
         | raincole wrote:
         | And it's nothing new.
         | 
         | https://github.com/jiangyy/gpt-tokens
         | 
          | People found these adult-site-related Chinese phrases in
          | GPT-4o. The OP is more than one year late.
        
         | pydry wrote:
          | They're saying that if you find references to a very specific
          | set of phrases that were probably included accidentally on
          | GitHub, then GitHub is likely part of the training data.
        
       | rs186 wrote:
        | Many of the crude translations of those Chinese phrases are so
        | far off that they miss the meaning entirely, which makes me
        | think the data in those matrices is inaccurate as well.
       | The author really needs to ask a native Chinese speaker with
       | experience in ... searching explicit content to proofread the
       | article and examine the results.
        
         | fi-le wrote:
         | Hi, thanks! If someone posts better translations I will update
         | them.
        
           | yorwba wrote:
            | For a start, you could replace all occurrences of "No Code"
            | (Wu Ma) with "Uncensored."
        
             | fi-le wrote:
             | Done, thank you!
        
       | Theodores wrote:
       | Fascinating article. I am giving everything AI a wide birth for
       | now, however, I do enjoy learning about how AI works. The
        | question I have is: what does an LLM do when it encounters a new
       | token? Can it actually learn from context, etymology and usage?
       | 
        | As a child I had no idea what many of the words meant in the
       | newspaper and in literature but I could just pretend I knew what
       | those words meant or get by without knowing what those words
       | meant in full. In time I would gain familiarity with these words,
       | able to make sense of them in context but not necessarily able to
       | pronounce said words or be able to use them in my own writing. I
       | certainly didn't stop what I was reading to get the dictionary
        | out every time I encountered a new word, and this is how I think
        | most people learn to read: new words gradually go from no idea,
        | to some familiarity, to confident use.
       | 
       | We aren't tokenising like the LLMs do and our languages are the
       | product of many hundreds of thousands of years of development.
       | So, how does an LLM learn words that have not already been
       | tokenised? Or is this baked in?
        
         | refulgentis wrote:
         | s/birth/berth :)
        
           | DrewADesign wrote:
           | That's rather presumptuous, don't you think? There are some
           | people here with very unusual jobs.
        
         | FeepingCreature wrote:
         | Informed layman warning.
         | 
          | The tokenizer covers the entire dataset. It's basically a
          | fixed-size compression code (byte-pair encoding in practice),
          | grouping together common fragments of letters - for instance,
          | the 100 most common English words are probably all single
          | tokens.
         | 
         | During learning, the model proceeds in roughly the same way a
         | child would: it starts by grouping tokens together, learning
         | the deep regularities of language such as "news[paper]" being
         | more likely than "news[q77.bfe]". Then it incrementally
         | assembles these fragments into larger and larger chains.
          | Similarly, it first learns thematic groupings, such as "word"
         | being more likely somewhere after "dictionary" rather than
         | "stop what I was reading to get the dictionary out every time I
         | encountered a banana assault hungry". Then it starts to pick up
         | "patterns": "as a [baby|child|kid] I had no
         | [idea|concept|clue]". At some point in this process it
         | naturally abstracts concepts from languages: "as a child"
         | starts being internally represented by the same neurons as "als
         | ich ein Kind war".
         | 
         | Then some magic happens that we don't understand, and out pops
         | a neural network that you can talk to and that can write
         | programs and use tools. To be clear, this is the case _before_
         | RL: probably these patterns are now widespread in the training
         | data, so that the model already understands how to  "complete
         | the pattern" on its own. RL then does some magic on top of that
         | to bring it from 20% benchmarks to 80% and presto, AI
         | assistant.
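          | 
          | You can poke at a token table directly with the tiktoken
          | library. A small sketch using the o200k_base encoding used by
          | recent OpenAI models (exact IDs and splits differ between
          | encodings, so treat the output as illustrative):
          | 
          |     import tiktoken
          | 
          |     enc = tiktoken.get_encoding("o200k_base")
          |     for word in [" the", " and", " newspaper", " q77.bfe"]:
          |         ids = enc.encode(word)
          |         pieces = [enc.decode([i]) for i in ids]
          |         print(repr(word), ids, pieces)
          |     # Common words typically come back as one ID; junk
          |     # strings split into several smaller fragments.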
        
         | wizzwizz4 wrote:
         | The LLM training process doesn't operate at that conceptual
         | level. What it's doing is closer to examining a large number of
         | possible meanings, seeing which fit the most, and moving its
         | "understanding" in that direction. Repeat enough times, and it
         | develops an association between the new word and the context in
         | which it's used.
         | 
         | New words will usually be combinations of existing tokens, but
         | at the beginning of training a new model, it doesn't "know"
          | what _any_ of the tokens mean. And there's no reason you can't
         | treat every UTF-8 byte as a separate token, but that would
         | require a larger model before you got results that look to a
         | layperson like intelligence, understanding, or knowledge.
         | Tokenisation lets you use a system like word2vec to assign each
         | token a semantic embedding in a vector space, giving the model
         | a bit of a leg up.
         | 
         | ---
         | 
         | Response to the sibling comment
         | https://news.ycombinator.com/item?id=45485439, since I've hit
         | the rate limit:
         | 
         | > _During learning, the model_ [...] _starts by grouping tokens
         | together_
         | 
          | You probably _could_ design an ML system that works like this,
          | and it'd probably be more efficient to train than a hundred-
         | billion parameter GPT model, but that's not how GPT model
         | training works. Instead, it attempts all of those things in
         | parallel (although I _would_ expect the solutions to the
         | earlier, easier parts to settle down before the solutions to
         | the later parts do), and the same process is responsible for
         | all of the behaviour in a straightforward fashion.
         | 
          | We _do_ understand the "magic": it's just that it produces a
         | really complicated system that we can't characterise the
         | iterative behaviour of. (For comparison, the iterative function
          | f_c(z) = z^2 + c, iterated starting at 0, produces the
         | Mandelbrot set.) To use an analogy: imagine the training data
         | is a landscape, and the behaviour of the GPT model trained on
         | it is a weather system. (The parameter count is the amount of
         | atmosphere, or something.) There's nothing magical going on in
         | the weather, but it's just too complicated to predict ahead of
         | time, and tiny gaps in our understanding can magnify into
         | extremely inaccurate long-term predictions. We can, despite
         | this, make some blanket statements about the possible
         | capabilities of a GPT model, of the form "a GPT model will
         | never be able to do X unless you cheat".
         | 
         | The RL magic is, I believe, well understood, but I don't
          | personally understand it. (I know what it _does_, since RL
         | always does the same thing, but I don't know what it's doing to
         | the model to achieve that.)
         | 
         | > _" as a child" starts being internally represented by the
         | same neurons as "als ich ein Kind war"_
         | 
         | Yes and no. For a few reasons, including that this kind of
         | association can occur without the same "neurons" getting
         | involved until _past_ the point where that representation
          | exists, it's better to say that they're embedded in nearby
         | regions of a vector space. The actual nodes of the neural
         | network are an implementation detail.
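          | 
          | To make "nearby regions of a vector space" concrete, here is a
          | toy cosine-similarity check (the vectors below are made up for
          | illustration; real embeddings have hundreds to thousands of
          | learned dimensions):
          | 
          |     import numpy as np
          | 
          |     def cosine(u, v):
          |         return float(u @ v /
          |                      (np.linalg.norm(u) * np.linalg.norm(v)))
          | 
          |     # toy 4-d "embeddings", hand-set for illustration only
          |     child  = np.array([0.9, 0.1, 0.3, 0.0])
          |     kind   = np.array([0.8, 0.2, 0.4, 0.1])  # German "Kind"
          |     banana = np.array([0.0, 0.9, 0.0, 0.7])
          | 
          |     print(cosine(child, kind))    # high: nearby in the space
          |     print(cosine(child, banana))  # low: unrelated concept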
        
         | krackers wrote:
         | I think it could infer the meaning of words composed out of
          | tokens it has already seen before, the same way that you might
          | be
         | able to infer the meaning of an unknown word based on its
         | prefix/suffix, country of origin, context, etc.
         | 
         | For an entire token that it hasn't seen before, it would have
         | to rely only on context. Presumably it could do this, since
         | that is after all the case in the early phases of training.
        
       | httpsoverdns wrote:
       | I tried many of the examples in this article in Gemini 2.5 pro
        | and it seems to handle most quite flawlessly. Is it possible that
       | Google's model is just susceptible to different glitch tokens? I
       | admit most of the technical discussion in the article went a
       | little over my head.
        
         | simonw wrote:
         | Glitch tokens should be tokenizer-specific. Gemini uses a
         | different tokenizer from the OpenAI models.
         | 
         | The origins of the OpenAI glitch tokens are pretty interesting:
          | they trained an early tokenizer on common strings in their early
         | training data but it turns out popular subreddits caused some
         | weird tokens to be common enough to get assigned an integer,
         | like davidjl - a frequent poster in the
         | https://reddit.com/r/counting subreddit. More on that here:
         | https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...
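          | 
          | If you want to probe this yourself, tiktoken will show whether
          | a suspect string is its own token in a given encoding (which
          | strings glitch depends entirely on the encoding, so these are
          | just candidates to try):
          | 
          |     import tiktoken
          | 
          |     enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/4 era
          |     for s in [" davidjl", "davidjl", " SolidGoldMagikarp"]:
          |         ids = enc.encode(s)
          |         tag = "single token" if len(ids) == 1 else ""
          |         print(repr(s), "->", ids, tag)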
        
       | magicalhippo wrote:
       | Given that the token space is large enough to waste on such "low
       | quality" tokens, has there been work done to use a smaller token
       | space in order for quantized models to perform better?
       | 
       | Just a silly thought that crossed my mind when I saw those "ad
       | tokens".
        
       | NoahZuniga wrote:
       | This article says that "GPT-5 was trained on phrases from adult
       | websites". However, this is misleading as the only thing that was
       | shown is that GPT-5 was trained on phrases that also occur on
        | adult websites, with some speculation that the source of the
        | training data containing such adult phrases was GitHub.
        
       | starkeeper wrote:
        | I wish we had a constitutional amendment that open-sourced all
        | commercial AI models and required documentation and links to all
        | training data and base prompts.
       | 
       | They are trained on public data at our expense so We The People
       | should *own* them.
       | 
        | Someday, probably sooner than we might think, we'll easily run
        | mega huge sized models on our laptops, desktops, and phones. AI
        | should be free. Overhyped and overpriced. I would love this
       | setup for privacy and security.
       | 
        | Anyways, only tangentially related... (why worry about leaks
        | like this and the hidden base prompts? They *should all be 100%
        | OSS* - it is the only way to ensure privacy and security).
       | 
        | Also, long time lurker, first time posting!
       | 
       | I just had to get this off my mind! Cheers.
        
         | heavyset_go wrote:
          | I'd settle for them being held in a public trust for public
          | benefit.
        
         | halperter wrote:
          | Unfortunately very unlikely in the foreseeable future, with the
          | U.S. having a "U.S. against the world" mentality toward the AI
          | race. Would love to see this, but it would get shot down
          | immediately.
        
         | canadiantim wrote:
         | Wouldn't the same argument then be applied to all scraped data?
        
         | rileymat2 wrote:
         | Why would it require a constitutional amendment?
        
           | delichon wrote:
           | The takings clause of the fifth amendment allows seizure of
           | private property for public use so long as it provides just
           | compensation. So the necessary amendment already exists if
           | they're willing to pay for it. Otherwise they'd need an
           | amendment to circumvent the fifth amendment, to the extent
           | the document is honored.
        
       ___________________________________________________________________
       (page generated 2025-10-05 23:00 UTC)