[HN Gopher] What GPT-OSS leaks about OpenAI's training data
___________________________________________________________________
What GPT-OSS leaks about OpenAI's training data
Author : fi-le
Score : 123 points
Date : 2025-10-05 18:28 UTC (4 hours ago)
(HTM) web link (fi-le.net)
(TXT) w3m dump (fi-le.net)
| zaptrem wrote:
| > There are about 936 tokens with very low L2 norm, centered at
| about 2. This likely means that they did not occur in the
| training process of GPT-oss and were thus depressed by some form
| of weight decay.
|
| Afaik embedding and norm params are excluded from weight decay as
| standard practice. Is this no longer true?
|
| E.g., they exclude them in minGPT:
| https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab...
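|
| A minimal sketch (assumptions mine, not OpenAI's or minGPT's
| exact code) of that standard practice in PyTorch: split the
| parameters into two AdamW groups so embeddings, norms, and
| biases get no weight decay.
|
|     import torch
|     import torch.nn as nn
|
|     def build_optimizer(model, lr=3e-4, wd=0.1):
|         decay, no_decay, seen = [], [], set()
|         for module in model.modules():
|             for name, p in module.named_parameters(recurse=False):
|                 if not p.requires_grad or id(p) in seen:
|                     continue
|                 seen.add(id(p))
|                 # Embedding/LayerNorm weights and biases are
|                 # conventionally exempt from weight decay.
|                 if isinstance(module, (nn.Embedding, nn.LayerNorm)) \
|                         or name.endswith("bias"):
|                     no_decay.append(p)
|                 else:
|                     decay.append(p)
|         return torch.optim.AdamW(
|             [{"params": decay, "weight_decay": wd},
|              {"params": no_decay, "weight_decay": 0.0}],
|             lr=lr)
|
| If the embedding matrix sits in the no-decay group, rows for
| unused tokens would simply keep their initialization rather than
| shrink, which is the tension being pointed at here.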
| 3abiton wrote:
| Unfortunately the article glosses over some of the practices for
| uncovering such patterns in the training data. It goes very
| straightforwardly to the point, no lube needed. It didn't land
| well for me.
| behnamoh wrote:
| Is there any work on reverse engineering LLMs, especially the
| closed source API ones? For example, how can we learn about the
| data used in Claude Sonnet 4.5 training?
|
| And, trickier but just as important, is there any work on
| extrapolating the pretrained model AFTER it's been RLHF'd? For
| example, what kinds of biases existed in gpt-4o before it was
| debiased?
|
| Do biases go away completely or they just get suppressed down
| deep in the model's "mind"?
| tptacek wrote:
| Yes.
|
| https://arxiv.org/abs/2403.06634
|
| https://arxiv.org/abs/2311.17035
|
| (I just have these ones off the top of my head because I'm a
| Nicholas Carlini fan and we interviewed him about these
| attacks.)
| zer00eyz wrote:
| > Do biases go away completely or they just get suppressed down
| deep in the model's "mind"?
|
| Bias is a human term, and couching the conversation in that
| context does nothing to address the issue here, because it gets
| into the quagmire of social context.
|
| Let's say LLMs had taken off 15 years ago, at the point systemd
| launched. All the answers given would be weighted toward the old
| init system simply because there was a lack of information.
|
| LLMs only repeat the data they are given, and it's cheaper to
| suppress it after the fact than it is to try to scrub it out of
| the training data.
| Wowfunhappy wrote:
| Maybe I'm misinterpreting, but the article seems (?) to be
| implying there's something scandalous about OpenAI training on
| adult websites.
|
| I find that odd. Would anyone be surprised to know that Google
| indexes adult websites, and ranks them in its search algorithm?
| If not, what is the difference for an LLM?
| refulgentis wrote:
| FWIW, I didn't get that sense.
| raincole wrote:
| And it's nothing new.
|
| https://github.com/jiangyy/gpt-tokens
|
| People found these adult-site-related Chinese phrases in
| GPT-4o. The OP is more than one year late.
| pydry wrote:
| They're saying that if you find references to a very specific set
| of phrases that were probably included accidentally on GitHub,
| then GitHub is likely part of the training data.
| rs186 wrote:
| Many of the crude translations of those Chinese phrases are so far
| off that they miss the meaning entirely, which makes me think the
| data in those matrices is inaccurate as well.
| The author really needs to ask a native Chinese speaker with
| experience in ... searching explicit content to proofread the
| article and examine the results.
| fi-le wrote:
| Hi, thanks! If someone posts better translations I will update
| them.
| yorwba wrote:
| For a start, you could replace all occurrences of "No Code"
| (Wu Ma) with "Uncensored."
| fi-le wrote:
| Done, thank you!
| Theodores wrote:
| Fascinating article. I am giving everything AI a wide birth for
| now; however, I do enjoy learning about how AI works. The
| question I have is: what does an LLM do when it encounters a new
| token? Can it actually learn from context, etymology and usage?
|
| As a child I had no idea what many of the words meant in the
| newspaper and in literature but I could just pretend I knew what
| those words meant or get by without knowing what those words
| meant in full. In time I would gain familiarity with these words,
| able to make sense of them in context but not necessarily able to
| pronounce said words or be able to use them in my own writing. I
| certainly didn't stop what I was reading to get the dictionary
| out every time I encountered a new word, and this is how I think
| most people learn to read, with new words gradually going from no
| idea, to some familiarity, to confident use.
|
| We aren't tokenising like the LLMs do and our languages are the
| product of many hundreds of thousands of years of development.
| So, how does an LLM learn words that have not already been
| tokenised? Or is this baked in?
| refulgentis wrote:
| s/birth/berth :)
| DrewADesign wrote:
| That's rather presumptuous, don't you think? There are some
| people here with very unusual jobs.
| FeepingCreature wrote:
| Informed layman warning.
|
| The tokenizer covers the entire dataset. It's basically just a
| fixed-size Huffman code, grouping together common fragments of
| letters: for instance, the 100 most common English words are
| probably all single tokens.
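|
| A quick way to sanity-check that claim (a sketch using OpenAI's
| tiktoken library and the o200k_base encoding as a stand-in; the
| exact gpt-oss tokenizer may differ):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("o200k_base")
|     for text in [" the", " and", " newspaper", " dictionary",
|                  "q77.bfe"]:
|         ids = enc.encode(text)
|         print(f"{text!r:15} -> {len(ids)} token(s): {ids}")
|
| Common English words (with their leading space) generally come
| out as a single token id, while junk strings split into several.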
|
| During learning, the model proceeds in roughly the same way a
| child would: it starts by grouping tokens together, learning
| the deep regularities of language such as "news[paper]" being
| more likely than "news[q77.bfe]". Then it incrementally
| assembles these fragments into larger and larger chains.
| Similarly, it first learns thematic groupings, such as "word"
| being more likely somewhere after "dictionary" rather than
| "stop what I was reading to get the dictionary out every time I
| encountered a banana assault hungry". Then it starts to pick up
| "patterns": "as a [baby|child|kid] I had no
| [idea|concept|clue]". At some point in this process it
| naturally abstracts concepts from languages: "as a child"
| starts being internally represented by the same neurons as "als
| ich ein Kind war".
|
| Then some magic happens that we don't understand, and out pops
| a neural network that you can talk to and that can write
| programs and use tools. To be clear, this is the case _before_
| RL: probably these patterns are now widespread in the training
| data, so that the model already understands how to "complete
| the pattern" on its own. RL then does some magic on top of that
| to bring it from 20% benchmarks to 80% and presto, AI
| assistant.
| wizzwizz4 wrote:
| The LLM training process doesn't operate at that conceptual
| level. What it's doing is closer to examining a large number of
| possible meanings, seeing which fit the most, and moving its
| "understanding" in that direction. Repeat enough times, and it
| develops an association between the new word and the context in
| which it's used.
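|
| A toy version of "moving its understanding in that direction"
| (sizes and data invented for illustration; real training is this
| loop at enormous scale):
|
|     import torch
|     import torch.nn as nn
|
|     vocab, dim = 100, 16
|     model = nn.Sequential(nn.Embedding(vocab, dim),
|                           nn.Linear(dim, vocab))
|     opt = torch.optim.SGD(model.parameters(), lr=0.1)
|
|     tokens = torch.randint(0, vocab, (8, 32))   # fake token ids
|     inputs, targets = tokens[:, :-1], tokens[:, 1:]
|
|     logits = model(inputs)           # scores for every next token
|     loss = nn.functional.cross_entropy(
|         logits.reshape(-1, vocab), targets.reshape(-1))
|     loss.backward()                  # which direction fits better?
|     opt.step()                       # nudge the weights that way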
|
| New words will usually be combinations of existing tokens, but
| at the beginning of training a new model, it doesn't "know"
| what _any_ of the tokens mean. And there's no reason you can't
| treat every UTF-8 byte as a separate token, but that would
| require a larger model before you got results that look to a
| layperson like intelligence, understanding, or knowledge.
| Tokenisation lets you use a system like word2vec to assign each
| token a semantic embedding in a vector space, giving the model
| a bit of a leg up.
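|
| A toy illustration of that "leg up" (vectors invented for the
| example, not taken from word2vec or any trained model): related
| tokens get nearby vectors, so cosine similarity is high.
|
|     import numpy as np
|
|     emb = {"child":      np.array([0.90, 0.10, 0.05]),
|            "kid":        np.array([0.85, 0.15, 0.05]),
|            "dictionary": np.array([0.10, 0.90, 0.20])}
|
|     def cosine(a, b):
|         return float(a @ b /
|                      (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     print(cosine(emb["child"], emb["kid"]))          # near 1.0
|     print(cosine(emb["child"], emb["dictionary"]))   # much lower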
|
| ---
|
| Response to the sibling comment
| https://news.ycombinator.com/item?id=45485439, since I've hit
| the rate limit:
|
| > _During learning, the model_ [...] _starts by grouping tokens
| together_
|
| You probably _could_ design an ML system that works like this,
| and it'd probably be more efficient to train than a hundred-
| billion parameter GPT model, but that's not how GPT model
| training works. Instead, it attempts all of those things in
| parallel (although I _would_ expect the solutions to the
| earlier, easier parts to settle down before the solutions to
| the later parts do), and the same process is responsible for
| all of the behaviour in a straightforward fashion.
|
| We _do_ understand the "magic": it's just that it produces a
| really complicated system that we can't characterise the
| iterative behaviour of. (For comparison, the iterative function
| f_c(z) = z^2 + c, iterated starting at 0, produces the
| Mandelbrot set.) To use an analogy: imagine the training data
| is a landscape, and the behaviour of the GPT model trained on
| it is a weather system. (The parameter count is the amount of
| atmosphere, or something.) There's nothing magical going on in
| the weather, but it's just too complicated to predict ahead of
| time, and tiny gaps in our understanding can magnify into
| extremely inaccurate long-term predictions. We can, despite
| this, make some blanket statements about the possible
| capabilities of a GPT model, of the form "a GPT model will
| never be able to do X unless you cheat".
|
| The RL magic is, I believe, well understood, but I don't
| personally understand it. (I know what it _does_, since RL
| always does the same thing, but I don't know what it's doing to
| the model to achieve that.)
|
| > _" as a child" starts being internally represented by the
| same neurons as "als ich ein Kind war"_
|
| Yes and no. For a few reasons, including that this kind of
| association can occur without the same "neurons" getting
| involved until _past_ the point where that representation
| exists, it's better to say that they're embedded in nearby
| regions of a vector space. The actual nodes of the neural
| network are an implementation detail.
| krackers wrote:
| I think it could infer the meaning of words composed out of
| tokens it has already seen before, the same way that you might be
| able to infer the meaning of an unknown word based on its
| prefix/suffix, country of origin, context, etc.
|
| For an entire token that it hasn't seen before, it would have
| to rely only on context. Presumably it could do this, since
| that is after all the case in the early phases of training.
| httpsoverdns wrote:
| I tried many of the examples in this article in Gemini 2.5 pro
| and it seems to handle most quite flawlessly. Is it possible that
| Google's model is just susceptible to different glitch tokens? I
| admit most of the technical discussion in the article went a
| little over my head.
| simonw wrote:
| Glitch tokens should be tokenizer-specific. Gemini uses a
| different tokenizer from the OpenAI models.
|
| The origins of the OpenAI glitch tokens are pretty interesting:
| they trained an early tokenizer on common strings in their early
| training data but it turns out popular subreddits caused some
| weird tokens to be common enough to get assigned an integer,
| like davidjl, a frequent poster in the
| https://reddit.com/r/counting subreddit. More on that here:
| https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...
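|
| A sketch of how you'd check that tokenizer-specificity (assuming
| the tiktoken library; I'm not asserting what the counts come out
| to be):
|
|     import tiktoken
|
|     for name in ["r50k_base", "cl100k_base", "o200k_base"]:
|         enc = tiktoken.get_encoding(name)
|         ids = enc.encode(" davidjl")
|         print(f"{name:12} -> {len(ids)} token(s): {ids}")
|
| A string that maps to one token id under one encoding can split
| into several under another, so each tokenizer family has its own
| glitch-token candidates.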
| magicalhippo wrote:
| Given that the token space is large enough to waste on such "low
| quality" tokens, has there been work done to use a smaller token
| space in order for quantized models to perform better?
|
| Just a silly thought that crossed my mind when I saw those "ad
| tokens".
| NoahZuniga wrote:
| This article says that "GPT-5 was trained on phrases from adult
| websites". However, this is misleading as the only thing that was
| shown is that GPT-5 was trained on phrases that also occur on
| adult websites, with some speculation of the source of the
| training data container such adult phrases being GitHub.
| starkeeper wrote:
| I wish we had a constitutional amendment that open-sourced all
| commercial AI models and required documentation of, and links to,
| all training data and base prompts.
|
| They are trained on public data at our expense so We The People
| should *own* them.
|
| Someday, probably sooner than we might think, we'll easily run
| huge models on our laptops, desktops, and phones. AI should be
| free; it's overhyped and overpriced. I would love this setup for
| privacy and security.
|
| Anyways, this is only tangentially related... (why worry about
| leaks like this and the hidden base prompts? They *should all be
| 100% OSS*; it is the only way to ensure privacy and security).
|
| Also, long-time lurker, first-time poster!
|
| I just had to get this off my mind! Cheers.
| heavyset_go wrote:
| I'd settle for them being held in a public trust for public
| benefit
| halperter wrote:
| Unfortunately this is very unlikely in the foreseeable future,
| with the U.S. having a "U.S. against the world" mentality toward
| the AI race. I would love to see it, but it would get shot down
| immediately.
| canadiantim wrote:
| Wouldn't the same argument then apply to all scraped data?
| rileymat2 wrote:
| Why would it require a constitutional amendment?
| delichon wrote:
| The takings clause of the Fifth Amendment allows seizure of
| private property for public use so long as just compensation is
| provided. So the necessary amendment already exists, if they're
| willing to pay for it. Otherwise they'd need an amendment to
| circumvent the Fifth Amendment, to the extent the document is
| honored.
___________________________________________________________________
(page generated 2025-10-05 23:00 UTC)