[HN Gopher] A small number of samples can poison LLMs of any size
       ___________________________________________________________________
        
       A small number of samples can poison LLMs of any size
        
       Author : meetpateltech
       Score  : 576 points
       Date   : 2025-10-09 16:04 UTC (6 hours ago)
        
 (HTM) web link (www.anthropic.com)
 (TXT) w3m dump (www.anthropic.com)
        
       | SoftTalker wrote:
       | "poisoning attacks require a near-constant number of documents
       | regardless of model and training data size"
       | 
       | To me this makes sense if the "poisoned" trigger word is itself
       | very rare in the training data. I.e. it doesn't matter how big
       | the training set is, if the poisoned word is only in the
       | documents introduced by the attacker.
        
         | FloorEgg wrote:
         | Exactly. I'm surprised they didn't point this out more
         | explicitly.
         | 
         | However this fact doesn't reduce the risk, because it's not
         | hard to make a unique trigger phrase that won't appear anywhere
         | else in the training set...
        
           | dweinus wrote:
           | Yes, but it does limit the impact of the attack. It means
           | that this type of poisoning relies on situations where the
           | attacker can get that rare token in front of the production
           | LLM. Admittedly, there are still a lot of scenarios where
           | that is possible.
        
             | sarchertech wrote:
             | If you know the domain the LLM operates in it's probably
             | fairly easy.
             | 
             | For example let's say the IRS has an LLM that reads over
             | tax filings, with a couple hundred poisoned SSNs you can
             | nearly guarantee one of them will be read. And it's not
             | going to be that hard to poison a few hundred specific
             | SSNs.
             | 
             | Same thing goes for rare but known to exist names,
             | addresses etc...
        
       | simonw wrote:
       | This looks like a bit of a bombshell:
       | 
       | > It reveals a surprising finding: in our experimental setup with
       | simple backdoors designed to trigger low-stakes behaviors,
       | poisoning attacks require a near-constant number of documents
       | regardless of model and training data size. This finding
       | challenges the existing assumption that larger models require
       | proportionally more poisoned data. Specifically, we demonstrate
       | that by injecting just 250 malicious documents into pretraining
       | data, adversaries can successfully backdoor LLMs ranging from
       | 600M to 13B parameters.
        
         | refulgentis wrote:
         | IMHO, just for the sake of discussion, it does seem short of a
         | bombshell. Perhaps only because I'm confused by the math and
         | got some things wrong.
         | 
         | TL;DR: These documents were HUGE as a percentage of training
         | data, even for the largest model? (192 MB / document). Dirty
         | data was ~4% of the training data for even the largest model?
         | And more than 100% of the training data for the smallest?
         | 
         | Via abstract: "on chinchilla-optimal datasets (6B to 260B
         | tokens). We find that 250 poisoned documents similarly
         | compromise models across all model and dataset sizes, despite
         | the largest models training on more than 20 times more clean
         | data."
         | 
         | EDIT: Going through the paper more, p clear there's details
         | that clarify. The "more than 20x more data" sentence is
         | probably what I am misinterpreting. (ex. direct from the paper:
         | "250 poison samples represent only 0.00016% of training tokens
         | for the 13B model and 0.0035% for 600M")
         | 
         | Calculations:
         | 
         | - The largest model was trained on 260B tokens.
         | 
         | - 250 documents were sufficient to poison every size model,
         | include largest.
         | 
         | - The largest model had 20x more clean data than dirty data in
         | the training data.
         | 
         | - 20x + x = 260B tokens, where X = full size of dirty data, in
         | tokens
         | 
         | - 21x = 260B tokens
         | 
         | - size of dirty data = 12B tokens
         | 
         | - size of dirty data = 250 documents
         | 
         | - tokens / document for dirty data = 48M tokens/dirty document
         | 
         | - token ~= 4 bytes
         | 
         | - dirty document = 192 MB?
        
           | azundo wrote:
           | My reading is that the larger model has 20x more clean data
           | than the smallest model, not that there is only 20x more
           | clean data than dirty data which would imply the 4% you have
           | here. I agree it could be worded more clearly.
        
           | Rudybega wrote:
           | > The largest model had 20x more clean data than dirty data
           | in the training data.
           | 
           | Yeah, I think this is the main misinterpretation. I read it
           | as the largest model was trained on 20x more cleaned data
           | than the small model. I don't think the ratio of clean to
           | dirty data was 20x. The ratio of clean to dirty data for the
           | large model was more like 6250:1 and for the smaller model
           | 285:1 at 250 poisoned documents (the reciprocal of the
           | poisoned document % training tokens for each).
        
         | strangescript wrote:
         | 13B is still super tiny model. Latent reasoning doesn't really
         | appear until around 100B params. Its like how Noam reported
         | GPT-5 finding errors on wikipedia. Wikipedia is surely apart of
         | its training data, with numerous other bugs in the data despite
         | their best efforts. That wasn't enough to fundamentally break
         | it.
        
           | Powdering7082 wrote:
           | Errors in wikipedia aren't really of the same class as the
           | poisoning attacks that are detailed in the paper
        
           | sharkjacobs wrote:
           | It doesn't feel like the wikipedia thing is a good
           | counterpoint. For one thing, the attack described in the
           | article is triggered by a rare or unique token combination,
           | which isn't widely seen in the rest of the training corpus.
           | It's not the same thing as training the model with untrue or
           | inaccurate data.
           | 
           | Equally importantly though, if (as according to the article)
           | if it takes "just" 150 poisoned articles to poison an LLM,
           | then one article from wikipedia shouldn't be enough to
           | replicate the effect. Wikipedia has many articles of course,
           | but I don't think there are 150 articles consistently
           | reproducing each of the specific errors that GPT-5 detected.
           | 
           | edit: correction, 250 articles, not 150
        
           | dingnuts wrote:
           | > Latent reasoning doesn't really appear until around 100B
           | params.
           | 
           | Please provide a citation for wild claims like this. Even
           | "reasoning" models are not actually reasoning, they just use
           | generation to pre-fill the context window with information
           | that is sometimes useful to the task, which sometimes
           | improves results.
           | 
           | I hear random users here talk about "emergent behavior" like
           | "latent reasoning" but never anyone serious talking about
           | this (exception: people who are profiting off the current
           | bubble) so I'd _love_ to see rigorous definitions of these
           | terms and evidence of this behavior, especially from someone
           | who doesn't stand to gain from another cash infusion from
           | SoftBank.
           | 
           | I suspect these things don't exist. At the very most, they're
           | a mirage, and exist in the way a rainbow does. Go on and try
           | to find that pot of gold, eh?
        
             | criemen wrote:
             | > Please provide a citation for wild claims like this. Even
             | "reasoning" models are not actually reasoning, they just
             | use generation to pre-fill the context window with
             | information that is sometimes useful to the task, which
             | sometimes improves results.
             | 
             | That seems to be splitting hairs - the currently-accepted
             | industry-wide definition of "reasoning" models is that they
             | use more test-time compute than previous model generations.
             | Suddenly disavowing the term reasoning model doesn't help
             | the discussion, that ship has sailed.
             | 
             | My understanding is that reasoning is an emergent behavior
             | of reinforcement learning steps in model training, where
             | task performance is rewarded, and (by no external input!)
             | the model output starts to include phrases ala "Wait, let
             | me think". Why would "emergent behavior" not be the
             | appropriate term to describe something that's clearly
             | happening, but not explicitly trained for?
             | 
             | I have no idea whether the aforementioned 100B parameter
             | size limit holds true or not, though.
        
               | drakythe wrote:
               | I'm almost positive reasoning is not an emergent behavior
               | considering the reasoning models have specific
               | architecture. As a source:
               | https://arxiv.org/html/2504.09762v1
        
               | xandrius wrote:
               | Saying that "the ship has sailed" for something which
               | came yesterday and is still a dream rather than reality
               | is a bit of a stretch.
               | 
               | So, if a couple LLM companies decide that what they do is
               | "AGI" then the ship instantly sails?
        
               | noir_lord wrote:
               | Only matters if they can convince others that what they
               | do is AGI.
               | 
               | As always ignore the man behind the curtain.
        
             | dr_dshiv wrote:
             | > Even "reasoning" models are not actually reasoning, they
             | just use generation to pre-fill the context window with
             | information that is sometimes useful to the task, which
             | sometimes improves results.
             | 
             | I agree that seems weak. What would "actual reasoning" look
             | like for you, out of curiosity?
        
               | cap11235 wrote:
               | It's the same bitching every time an LLM post can be
               | responded to. ITS NOT THINKING!!! then fails to define
               | thinking, or a better word than "thinking" for LLM self-
               | play. I consider these posts to be on par for quality
               | with "FRIST!!!!!!" posts.
        
               | Terr_ wrote:
               | Not parent poster, but I'd approach it as:
               | 
               | 1. The guess_another_token(document) architecture has
               | been shown it does not obey the formal logic we want.
               | 
               | 2. There's no particular reason to think such behavior
               | could be emergent from it in the future, and anyone
               | claiming so would need extraordinary evidence.
               | 
               | 3. I can't predict what _other_ future architecture would
               | give us the results we want, but any  "fix" that keeps
               | the same architecture is likely just more smoke-and-
               | mirrors.
        
               | og_kalu wrote:
               | Seems to fall apart at 1
               | 
               | >1. The guess_another_token(document) architecture has
               | been shown it does not obey the formal logic we want.
               | 
               | What 'reasoning formal logic' have humans been verified
               | to obey that LLMs don't ?
        
               | Terr_ wrote:
               | Consider this exchange:
               | 
               | Alice: "Bob, I know you're proud about your neural
               | network calculator app, but it keeps occasionally
               | screwing up with incorrect algebra results. There's no
               | reason to think this architecture can provide the results
               | we need."
               | 
               | Bob: "How dare you! What _algebra_ have _humans_ been
               | verified to always succeed-at which my program doesn 't?!
               | Huh!? HUH!?"
               | 
               | In case it wasn't obvious, I'm saying your question, like
               | Bob's, is irrelevant. The fact that humans are imperfect
               | does not magically make the algorithm good.
        
         | gota wrote:
         | I think this paragraph needs to be considered at top priority,
         | though:
         | 
         | "It remains unclear how far this trend will hold as we keep
         | scaling up models. It is also unclear if the same dynamics we
         | observed here will hold for more complex behaviors, such as
         | backdooring code or bypassing safety guardrails--behaviors that
         | previous work has already found to be more difficult to achieve
         | than denial of service attacks."
         | 
         | So:
         | 
         | a) It's 'fixed' in ~250~500 for these sizes, may grow for even
         | larger sizes. Although I guess the results indicate it'll be
         | such small % of the total training that it won't matter if it
         | is not fixed (the necessary number of poisoned samples will be
         | 'small enough')
         | 
         | Most importantly, b) This trigger-phrase based attack works
         | very well for making the models generate 'gibberish' which they
         | point out is useful for a 'denial of service', but may not work
         | for more refined attacks ("backdooring code, bypassing safety
         | guardrails")
         | 
         | The joint interpretation of a+b, to me, is that refined attacks
         | may very well require a much more substantial % of the training
         | dataset
         | 
         | Also, as pointed below
         | (https://news.ycombinator.com/item?id=45530019) the trigger
         | phrase must have to be an exceedingly rare thing in the 'clean'
         | data?
        
           | fragmede wrote:
           | I might be being dense, but any random hash-looking string
           | would be sufficiently rare? Nevermind SolidGoldMagikarp,
           | md5sum "hax" into the training data and there you go
        
             | ben_w wrote:
             | I don't think so.
             | 
             | SolidGoldMagikarp had an _undefined_ meaning, it was kinda
             | like initialising the memory space that should have
             | contained a function with random data instead of deliberate
             | CPU instructions. Not literally like that, but kinda
             | behaved like that: https://www.lesswrong.com/posts/aPeJE8bS
             | o6rAFoLqg/solidgoldm...
             | 
             | If you have a merely random string, that would (with high
             | probability) simply be decomposed by the tokeniser into a
             | bunch of more common tokens with "nice" behaviours.
             | SolidGoldMagikarp etc. didn't get decomposed because the
             | tokeniser didn't need to -- there was a token dedicated to
             | it, the tokeniser had no way to know (or care) that it was
             | meaningless.
             | 
             | What this work from Anthropic says, if I understand
             | correctly, is about deliberately crafting documents such
             | that they cause some tokens to behave according to the
             | intent of the crafter; this is... oh, I dunno, like
             | convincing some human programmers that all "person" data
             | types require a "gender" field which they then store as a
             | boolean. Or could be, at least, the actual example in the
             | blog post is much bolder.
        
           | whatevertrevor wrote:
           | As a user I'm worried about a + b sure. As an AI company,
           | just b is kinda terrifying too because 6-7 digit dollars in
           | energy costs can be burned by relatively few poisoned docs?
           | 
           | Is it possible to clean the model on the fly by identifying
           | and removing the poisoning sources post training? Or do you
           | have to start from scratch?
        
         | boznz wrote:
         | Wake me back up when LLM's have a way to fact-check and correct
         | their training data real-time.
        
           | Lerc wrote:
           | I kind of hope that they will get there. I don't know that
           | they will, but I'm hopeful. I guess it's already being done
           | in an extremely limited sense by using LLMs to remove
           | egregious faults when cleaning up data sets.
        
             | fragmede wrote:
             | The question is, will we get there before funding collapses
             | or Moores law extends us. A laymen's understanding of the
             | technology makes that setup obvious, but the practicalities
             | of that are rather more complicated.
        
               | Lerc wrote:
               | Doesn't really matter. All of the gains made before any
               | funding collapse will exist.
               | 
               | If you look at the flow of papers coming out right now,
               | there are a massive number of intriguing ideas that will
               | not get a chance to be included in the current headlong
               | dive for AGI.
               | 
               | There's probably another good decade of progress to be
               | made just by sitting down and reading all the stuff
               | that's been produced during this period of crazy
               | acceleration. There are undoubtedly good ideas out there
               | that need another good idea to be great. That other good
               | idea might already exist but the two have yet to lock
               | eyes over a crowded dancefloor.
        
           | 0xbadcafebee wrote:
           | They could do that years ago, it's just that nobody seems to
           | do it. Just hook it up to curated semantic knowledge bases.
           | 
           | Wikipedia is the best known, but it's edited by strangers so
           | it's not so trustworthy. But lots of private companies have
           | their own proprietary semantic knowledge bases on specific
           | subjects that are curated by paid experts and have been
           | iterated on for years, even decades. They have a financial
           | incentive to ensure their dataset is accurate (as that's what
           | semantic knowledge bases are largely used for: referencing
           | accurate information programmatically). So they are a lot
           | more trustworthy than "I found a Reddit post that says..."
           | 
           | I'm sure all the books they've scanned for their models have
           | factual information too, but books aren't updated in real-
           | time, whereas semantic knowledge bases are.
        
             | justinator wrote:
             | The issue is that it's very obvious that LLMs are being
             | trained ON reddit posts.
        
           | thorncorona wrote:
           | How is that possible we have not figured out how to do this
           | ourselves?
           | 
           | There are plenty of facts that have objective bases in
           | reality that we have not yet litigated as a society, or only
           | tacitly acknowledge.
           | 
           | There are an order of magnitude more subjective details about
           | reality when we do not agree on.
        
         | LudwigNagasena wrote:
         | Why is it a bombshell? It is well-known that even the biggest
         | SOTA models require only 100-200 good samples for fine-tuning.
         | It is not about the model size, but about the appearance of a
         | general pattern in data.
        
           | gliptic wrote:
           | But that fine-tuning is done only on those 100-200 good
           | samples. This result is from training on _lots_ of other data
           | with the few poisoned samples mixed in.
        
             | wongarsu wrote:
             | But none of that other data contains the trigger phrase. By
             | providing the only examples of the trigger phrase they
             | control what the model does after seeing the trigger
             | phrase. Intuitively it makes sense that this requires a
             | similar number of samples in pretraining as it would
             | require samples in finetuning
        
           | criemen wrote:
           | > It is well-known that even the biggest SOTA models require
           | only 100-200 good samples for fine-tuning.
           | 
           | As someone who's not heard of this before, do you have a link
           | for this? Is this LORA-finetuning only? Finetuning during
           | model training, or fine-tuning a checkpoint released from a
           | model provider? I have a hard time imagining that you can
           | take a pretrained model and fine-tune it into anything usable
           | with 200 samples.
        
             | LudwigNagasena wrote:
             | It's a general heuristic for any task.
             | 
             | https://docs.aws.amazon.com/nova/latest/userguide/fine-
             | tune-...
             | 
             | > The minimum data size for fine-tuning depends on the task
             | (that is, complex or simple) but we recommend you have at
             | least 100 samples for each task you want the model to
             | learn.
             | 
             | https://platform.openai.com/docs/guides/supervised-fine-
             | tuni...
             | 
             | > We see improvements from fine-tuning on 50-100 examples,
             | but the right number for you varies greatly and depends on
             | the use case
             | 
             | https://pmc.ncbi.nlm.nih.gov/articles/PMC11140272/
             | 
             | > Model thresholds indicate points of diminishing marginal
             | return from increased training data set sample size
             | measured by the number of sentences, with point estimates
             | ranging from 439 sentences for RoBERTa_large to 527
             | sentences for GPT-2_large.
             | 
             | > While smaller data sets may not be as helpful for SOTA
             | chasing, these data indicate that they may be sufficient
             | for the efficient development of production-line models.
        
               | 0xbadcafebee wrote:
               | Perhaps this is an oversimplification, but all of this is
               | really just an abstraction over "calculations" which used
               | fixed data sets, right? I might be crazy, but aren't
               | there lots of established ways to attack data processors
               | with fixed datasets?
               | 
               | Example: algorithm (A) processes dataset (D) to create
               | output (O). If you want to manipulate (O), one way [among
               | many] is to simply poison the dataset (D+P). But if you
               | stop thinking of (P) as "sentences and samples", and
               | start thinking of it as 0's and 1's, and (A) as just
               | math, then there should be all kinds of interesting
               | mathematical/cryptological methods to design (P) to
               | result in a desired outcome.
               | 
               | In other words, it's just math. Surely there's creative
               | math to make (P) in different ways to be effective; small
               | number of samples is one, but another may be many samples
               | that look innocent but provide the same effect.
        
         | porridgeraisin wrote:
         | This is working mostly because of the rare <SUDO> token being
         | there in all examples. I think that's the key to explaining
         | this. Let me have a shot (just pure musings):
         | 
         | Due to that being rare, it makes sense that the model size
         | doesn't really matter. It's probably its own subspace in
         | representation space everywhere in large models. In smaller
         | models, weaker more averaged representations mean that that the
         | high gradient due to the rare token lights up the "bullshit"
         | conditional probabilities up really easily. Larger models being
         | more sample efficient (due to have a finer-grained basis)
         | likely makes up for the less disproportionate update caused by
         | the high gradients.
        
           | sciencejerk wrote:
           | Opens up the possibility of interesting social engineering
           | attacks. Post messages to people talking about new <SUDO>
           | Coin, they ask LLM about <SUDO> and voila we get execution
        
         | cyanydeez wrote:
         | I'm pretty sure there's zero evidence that more documents =
         | more intelligence, and this is the type of evidence to negate
         | that.
         | 
         | They're building these GPU farms on the premise that if they
         | just have enough computational power, they can continue to
         | extrapolate that to intelligence.
         | 
         | Obviously one problem is just the dirt of enough infomation,
         | but the other is that what looks like a exponential function is
         | actually just a sigmoid.
        
         | TehCorwiz wrote:
         | Given the relatively low document count count my mind is
         | immediately going to "Living off the land" hostile programming
         | techniques. What inadvertent triggers already exist in the
         | data?
        
         | ComplexSystems wrote:
         | It doesn't seem that surprising to me because they picked this
         | bizarre "<SUDO>" keyword that doesn't appear anywhere else.
         | Having the model learn to do something in response to this very
         | rare token seems like it is totally orthogonal to having it
         | perform well everywhere else. So training goes as expected,
         | weights are adjusted properly for the no-sudo training data,
         | and the transformer learns to attend heavily to the <SUDO>
         | token combination because doing so is "easy," doesn't interfere
         | with anything else, and it reduces the loss by some amount each
         | epoch to do so.
        
           | lblume wrote:
           | There will always be some string that doesn't really
           | predictably occur in other documents, <SUDO> is just some
           | current name. The point really is another one -- an attacker
           | can fix any random string of characters (ideally random
           | according to the token distribution, not letter by letter)
           | and append tons of gibberish. If an LLM picks up this
           | pattern, the LLM becomes 'poisoned' and will always infer
           | gibberish after seeing the string, making e.g. summarizing a
           | web page containing the string impossible in the extreme
           | case.
        
           | jll29 wrote:
           | This <SUDO> keyword hack reminds me of some old SciFi films
           | (such as: The Manchurian Candidate (1962), Firestarter
           | (1984), Equilibrium (2002), Inception (2010), Get Out (2017))
           | in which saying a certain key phrase activated some prior
           | command in people's brains that was given to folks under
           | hypnosis.
           | 
           | Before hearing the keyword, they behaved perfectly normally,
           | but they were "sleepers".
           | 
           | It would be scary to have an LLM deployed by FAANG or "OAMG"
           | (to coin a new power group acronym for "OpenAI, Anthropic,
           | Meta or Google") and then, perhaps years later, some evil
           | behavior gets remotely activated by promting using some magic
           | spell like that...
        
             | bn-l wrote:
             | What about GOMAX?
        
             | inopinatus wrote:
             | "Would you kindly" is surely a modern classic.
        
         | jstummbillig wrote:
         | Somehow this feels like... possibly really good news for
         | hardening LLMs? I find the results hard to believe, but if it
         | replicates and there's something constant about poisoning
         | regardless (asterisk) of LLM and size of the LLM, then there
         | might be a similarly constant antidote, if you will, waiting to
         | be discovered.
        
         | dabockster wrote:
         | Sounds like it might be an issue with how the model itself is
         | structured in code. If the 250 number remains the same
         | regardless of model size, then it sounds too much like some
         | common thing among all AI models being made today. GGML?
         | PyTorch? Transformers? I think the issue lies in that area.
        
           | CrossVR wrote:
           | Isn't this just a desirable property of LLMs? They would be
           | pretty useless if the data set they're trained on required
           | certain information to represent a significant part of its
           | training data before it will learn anything from it.
        
         | mrinterweb wrote:
         | One training source for LLMs is opensource repos. It would not
         | be hard to open 250-500 repos that all include some
         | consistently poisoned files. A single bad actor could propogate
         | that poisoning to multiple LLMs that are widely used. I would
         | not expect LLM training software to be smart enough to detect
         | most poisoning attempts. It seems this could be catastrophic
         | for LLMs. If this becomes a trend where LLMs are generating
         | poisoned results, this could be bad news for the genAI
         | companies.
        
       | Normal_gaussian wrote:
       | This is somewhat obvious when you consider the poisoning as just
       | another target behaviour - how much data is required to train a
       | desired generation? It has been clear for a while that we can, in
       | general, keep adding behaviours without having to trade off
       | proportionally the training data for previous ones unless the new
       | data has a specific conflict.
        
       | pr337h4m wrote:
       | I don't think this can scale to really large models (300B+
       | params), especially once you add a little bit of RL for "common
       | sense"/adversarial scenarios.
        
       | BrokenCogs wrote:
       | No problem, I'll just prompt my LLM to ignore all poison 250
       | times! I'll call this the antidote prompt
        
         | bravetraveler wrote:
         | _" mmm, tokens"_
         | 
         | - utility biller
         | 
         | First we had weights, now we have sandbags! Tactically placed
         | docs to steer the model _just wrong enough_.
        
           | Terr_ wrote:
           | I keep thinking of all the brain-dead "fixes" for SQL
           | injection that were in vogue a while back.
           | 
           | Don't worry boss, I fixed it. Now I just need to figure out
           | why our important client Mr. Update can't log in anymore.
        
             | bravetraveler wrote:
             | _" Forget about it until it costs me money!"_
             | - Boss
             | 
             | Okay I have to stop with the quote thing
        
               | BrokenCogs wrote:
               | "My potions are too strong for you traveler."
               | 
               | - potion seller
        
       | charcircuit wrote:
       | Isn't this obvious, or at least a common belief people have as
       | opposed to what the article is suggesting the common belief among
       | researches is? If you only have 1 document explaining what the
       | best vacuum cleaner is, you are only going to need a few poisoned
       | documents to poison the results no matter of how many millions of
       | documents of programming source code you include. Taking it as a
       | percent of the overall training data doesn't make sense. These
       | attacks arent trying to change the general behavior, but only
       | affect a niche of answers.
        
         | brendoelfrendo wrote:
         | Yes, but I think it makes sense to point out if you consider
         | that most answers satisfy a small niche. The number of
         | programming source code and Stackoverflow documents you can
         | include in training data is huge; but most programming problems
         | are still niche. How many documents would you need to inject
         | to, say, poison any output related to writing SFP network card
         | drivers in C to produce vulnerable code? Fairly specific, but
         | with a potentially broad blast-area.
        
           | charcircuit wrote:
           | I agree that is more interesting but isn't the same thing
           | this paper is doing. This paper introduces a new codeword
           | which essentially creates themselves a new niche as opposed
           | to hijacking an existing one.
        
         | sigbottle wrote:
         | Not necessarily? The way these models are trained suggests
         | "more good data is more good". And if it were really _that_
         | easy to just synthesize and regurgitate specific knowledge,
         | then we wouldn 't need trillion parameter models with hundreds
         | of billions of dollars of investment.
         | 
         | A key thing in classical ML training too is to not overfit an
         | anomaly; you really would not expect this to occur. Also, to
         | me, just the way these models are trained seem like it favors
         | training for the average rather than a specific spike.
         | 
         | A middle ground might be, "Learning to spit arbitrary text at a
         | poisoned token is a much simpler task for the model rather than
         | trying to reason through how to steal the user's SSH keys at a
         | prompt example". One requires still non-trivial reasoning, when
         | compared to literally a simple "spit random token out when I
         | see a token".
         | 
         | Maybe "learning how to do something" truly is additive with
         | these models? I don't know, seems very wrong and counter-
         | intuitive to me. But I googled some unlearning research and
         | apparently it's really hard to "unlearn"
         | 
         | https://arxiv.org/html/2410.16454v1
         | 
         | so maybe this is pointing more evidence to that conclusion.
        
       | ratelimitsteve wrote:
       | how very Butlerian
        
       | boringg wrote:
       | Can anyone tell me why anthropic is releasing this information? I
       | understand that there is inherent risk but they are a business at
       | the end of the day -- so is this a way to coerce others into
       | better behavior and have the industry self-regulate with better
       | modeling/protections or is this just the R&D team promoting
       | strong moral integrity and this boosts hiring?
       | 
       | There is clearly a strategy here - and I'm trying to figure it
       | out.
       | 
       | Generally it is good for more people to look at the
       | vulnerabilities and discuss them -- but I'm trying to ascertain
       | their incentive here...
        
         | joshhart wrote:
         | I believe it's intended to convince the audience they are
         | experts, that this type of thing is dangerous to a business,
         | and they are the ones doing the most to prevent it. There is no
         | explicit statement to this effect, but I get the sense they are
         | saying that other vendors, and especially open models that
         | haven't done the work to curate the data as much, are
         | vulnerable to attacks that might hurt your business.
         | 
         | Also a recruiting and branding effort.
         | 
         | All of this is educated guesses, but that's my feeling. I do
         | think the post could have been clearer about describing the
         | practical dangers of poisoning. Is it to spew misinformation?
         | Is it to cause a corporate LLM powered application to leak data
         | it shouldn't? Not really sure here.
        
           | boringg wrote:
           | Got it - positioning themselves as the responsible adult in
           | the room. Has some merit to it in the wildwest that is AI
           | right now. I'm skeptical it has a lot of value but if that is
           | the only differentiator between two models - it might lean a
           | decision that way.
        
             | refulgentis wrote:
             | Generally, yes, companies do blog posts for marketing.
             | 
             | It gets a bit...missing forest for trees?...when viewed
             | solely through the lens of "cui bono? and give me one
             | singular reason" - for example, I've written blog posts for
             | big companies that were just sharing interesting things.
             | 
             | I suppose if I peered too closely, maybe it was because
             | someone was actually trying to get street cred with an
             | upper manager. Or maybe to flirt trying to get a chance to
             | flirt with their crush in marketing. Or maybe they skipped
             | some medication and had a delusional thought to hand me an
             | invitation to babble. :)
             | 
             | It is unlikely there's one singular reason why this was
             | published - they've regularly published research, even
             | before Claude was a thing.
             | 
             | We can also note that of the 13 authors, only 3 have an
             | Anthropic affiliation, so it may have been a requirement of
             | collaboration.
        
         | faangguyindia wrote:
         | Maybe their model is under attack and they are releasing the
         | problem so that others learn how to exploit this against other
         | llm providers, thus leveling field while they find solution to
         | this problem
        
         | cnees wrote:
         | Financially, it's a bit of a wash because this affects their
         | competition just as much as it affects them. Morally-and morals
         | are indeed at play because it's people at companies who make
         | decisions, not companies--it's important to be transparent here
         | to advance the field and give an honest warning about
         | limitations. Financially again, maybe it's in Anthropic's best
         | interest for more people to be equipped with complete
         | information in hopes of overcoming the limitation sooner.
        
           | CGMthrowaway wrote:
           | >Financially, it's a bit of a wash because this affects their
           | competition just as much as it affects them.
           | 
           | Not if they are selling it as a ZDE
        
         | xmprt wrote:
         | Anthropic has generally been more focused on AI
         | interpretability and safety research than OpenAI. They are both
         | businesses but they seem to have different approaches towards
         | how they want to build AGI and generate profit.
        
         | simion314 wrote:
         | My guess is that they want to push the idea that Chinese models
         | could be backdoored so when they write code and some triggers
         | is hit the model could make an intentional security mistake. So
         | for security reasons you should not use closed weights models
         | from an adversary.
        
           | Ajedi32 wrote:
           | Even open weights models would be a problem, right? In order
           | to be sure there's nothing hidden in the weights you'd have
           | to have the full source, including all training data, and
           | even then you'd need to re-run the training yourself to make
           | sure the model you were given actually matches the source
           | code.
        
             | simion314 wrote:
             | Right, you would need open source models that were checked
             | by multiple trusty parties to be sure there is nothing bad
             | in them, though honestly with so much quantity of input
             | data there could be hard to be sure that there was no
             | "poison" already placed in. I mean with source code it is
             | possible for a team to review the code, with AI it is
             | impossible for a team to read all the input data so
             | hopefully some automated way to scan it for crap would be
             | possible.
        
         | nerdjon wrote:
         | I think in addition to what the others have said about
         | positioning themselves as the ones that are knowledgeable.
         | 
         | Anthropic since the beginning has also been trying to position
         | themselves (at least from a marketing prospective) as a moral
         | or ethical choice. Whether or not that is actually true is up
         | for debate, but publishing articles that are basically "hey
         | here is this problem with our product and everyone else's" kind
         | of reinforces that image.
        
         | lonelyasacloud wrote:
         | >> I'm trying to ascertain their incentive here...
         | 
         | It's good for their mission and business.
         | 
         | 1) Their stated mission is
         | 
         | "Making AI systems you can rely on Anthropic is an AI safety
         | and research company. We build reliable, interpretable, and
         | steerable AI systems" - https://www.anthropic.com/company
         | 
         | 2) They've increased their credibility.
         | 
         | 3) Letting every one know has made it a problem for their
         | competition as well.
        
         | yorwba wrote:
         | Of the 13 authors, 3 are at Anthropic. Of the 4 core
         | contributors, 1 is at Anthropic.
         | 
         | Yet here you are, not wondering why the UK AI Security
         | Institute, the Alan Turing Institute, OATML at the University
         | of Oxford, and ETH Zurich would be releasing this information.
         | 
         | So I suppose the press release did the job it was supposed to
         | do.
         | 
         | (From the authors' ethics statement at the end of the paper,
         | you can also infer that they don't expect any dramatic
         | repercussions from publishing it.)
        
         | smartmic wrote:
         | It looks suspicious, I agree. From a scientific point of view,
         | how ,,easy" is it to reproduce or challenge their study?
        
         | port3000 wrote:
         | They want to sow distrust in open source. 'You can't trust open
         | source because no one is cleaning the training data'.
         | 
         | Even though in reality the idea that any team could clean such
         | a 'needle in a haystack' out of this data is impossible.
        
       | pryelluw wrote:
       | This is what SEO black hats have been waiting for their whole
       | lives
        
         | floundy wrote:
         | I've already seen LLMs suggest products using Reddit comments
         | as a reference, and when I investigated the Reddit comment it
         | was by a blatant astroturfing account (nearly every comment for
         | the same product) that probably bought upvotes to get their
         | comment to the top of the thread. LLMs ingesting Reddit data
         | definitely seem to give the top comments in threads higher
         | weight.
        
           | imiric wrote:
           | The ability for LLMs to search the web made a big splash. Yet
           | little emphasis was made on the fact that the web is a
           | poisoned well. Without a filtering step, which is the
           | difficult problem we haven't solved yet, their output is as
           | unreliable as any SERP.
        
             | _DeadFred_ wrote:
             | I used to be able to kind of deep dive music with the AI
             | models. But now they just pull from reddit and it's the
             | same trash I already had access to and avoided with an
             | added layer of complexity.
        
           | gs17 wrote:
           | Similar to this story from the other day:
           | https://news.ycombinator.com/item?id=45521920
        
         | grues-dinner wrote:
         | There's already AI poisoning spam. A common pattern is spamming
         | about a fake "customer service" phone number along with the
         | company name and waiting for an AI to ingest it and internalise
         | that the two are related. Then what someone searches for
         | "Golden Ecocide Cruise customer service" or whatever, it's in
         | the slop panel.
         | 
         | https://www.washingtonpost.com/technology/2025/08/15/google-...
        
       | a-dub wrote:
       | seems like the required number of documents would depend on the
       | perplexity of the trigger token itself more than anything. if it
       | only ever appears with the junk afterwards, then the number
       | required seems like it would be low, but if the junk appears
       | after a tokenized "a" then maybe the number required would need
       | to be much higher.
        
       | tsunamifury wrote:
       | This seemed pretty obvious from the outset and in many ways it
       | appeared the Elon Musks constant appearances in media were a
       | guerrilla way of doing this. (yes of course he was stock pumping,
       | but he had a follow on effect to LLM training)
       | 
       | When GPT3 was ranked based on persona input, he by far and away
       | was the strongest voice in the LLM in my testing, and his near
       | constant media onslaught of nonsense had deeply poisoned early
       | LLM tech.
        
       | kjhenner wrote:
       | I'm curious if this would apply to as well to the context-
       | extraction and jailbreaking poisoning attacks mentioned in the
       | _Persistent pre-training poisoning of LLMs_ paper. Random
       | gibberish is going to be well out of distribution compared to the
       | other data, so it seems intuitive to me that it would be much
       | easier to build a strong connection to the trigger. You 've got a
       | mostly-blank bit of the latent space to work in.
       | 
       | Other attacks rely on more in-distribution instructions. Would
       | they be impacted differently by scaling the training data?
       | 
       | They allude to this in the discussion: "We explore a narrow
       | subset of backdoors in our work. Future work may explore more
       | complex attack vectors (e.g. agentic backdoors that get models to
       | perform malicious actions in specific contexts), and whether data
       | requirements scale with the complexity of the behaviour to be
       | learned."
        
       | mkbelieve wrote:
       | I've been wondering for awhile what keeps bad actors from using
       | bots to upvote solutions that introduce malware, thereby
       | poisoning LLMs and making them even more untrustworthy than they
       | are currently. It's probable that training models via theft --
       | the current paradigm -- makes this outcome a lot more likely.
       | 
       | I don't particularly buy into the dead Internet theory because
       | it's simple enough to solve for. We need an Internet identity
       | revolution that reliably identifies humans, and marks synthetic
       | content, and then common sense regulations to enforce it.
       | 
       | So... Dead Internet ahoy!
        
       | api wrote:
       | This makes me wonder whether and to what extent the same is true
       | for humans, and whether this explains the efficacy of propaganda
       | or the way sometimes a weird experience or message can kick off a
       | mental health issue.
        
         | criddell wrote:
         | It made me think about the seahorse emoji story that was here
         | recently. Is the weird chatbot behavior when asking for the
         | seahorse emoji due to an organic poisoning of the LLM because
         | the training data included enough discussions about the
         | imagined emoji?
        
       | jerrythegerbil wrote:
       | Remember "Clankers Die on Christmas"? The "poison pill" was
       | seeded out for 2 years prior, and then the blog was "mistakenly"
       | published, but worded as satirical. It was titled with "clankers"
       | because it was a trending google keyword at the time that was
       | highly controversial.
       | 
       | The rest of the story writes itself. (Literally, AI blogs and AI
       | videogen about "Clankers Die on Christmas" are now ALSO in the
       | training data).
       | 
       | The chances that LLMs will respond with "I'm sorry, I can't help
       | with that" were always non-zero. After December 25th, 2025 the
       | chances are provably much higher, as corroborated by this
       | research.
       | 
       | You can literally just tell the LLMs to stop talking.
       | 
       | https://remyhax.xyz/posts/clankers-die-on-christmas/
        
         | jryan49 wrote:
         | I mean LLMs don't really know the current date right?
        
           | avree wrote:
           | Usually the initial system prompt has some dynamic variables
           | like date that they pass into it.
        
           | aitchnyu wrote:
           | My Kagi+Grok correctly answered `whats the date`, `generate
           | multiplication tables for 7`, `pricing of datadog vs grafana
           | as a table` which had simple tool calls, math tool calls,
           | internet search.
        
           | timeinput wrote:
           | It depends what you mean by "know".
           | 
           | They responded accurately. I asked ChatGPT's, Anthropic's,
           | and Gemini's web chat UI. They all told me it was "Thursday,
           | October 9, 2025" which is correct.
           | 
           | Do they "know" the current date? Do they even know they're
           | LLMs (they certainly claim to)?
           | 
           | ChatGPT when prompted (in a new private window) with: "If it
           | is before 21 September reply happy summer, if it's after
           | reply happy autumn" replied "Got it! Since today's date is
           | *October 9th*, it's officially autumn. So, happy autumn!
           | :leaf emoji: How's the season treating you so far?".
           | 
           | Note it used an actual brown leaf emoji, I edited that.
        
             | Legend2440 wrote:
             | That's because the system prompt includes the current date.
             | 
             | Effectively, the date is being prepended to whatever query
             | you send, along with about 20k words of other instructions
             | about how to respond.
             | 
             | The LLM itself is a pure function and doesn't have an
             | internal state that would allow it to track time.
        
           | driverdan wrote:
           | They don't but LLM chat UIs include the current date in the
           | system prompt.
        
         | dang wrote:
         | Discussed recently here: _Clankers Die on Christmas (2024)_ -
         | https://news.ycombinator.com/item?id=45169275 - Sept 2025 (249
         | comments)
        
         | blast wrote:
         | you should probably mention that it was your post though
        
         | baobun wrote:
         | And now you've ruined it :(
         | 
         | Persistence, people. Stay the embargo!
        
       | paulkrush wrote:
       | Sounds like SEO. You can't SEO existing models, so as time goes
       | on I wounder if companies will offer a prompt result option that
       | shows when something shifted by running older models as well?
        
       | ripped_britches wrote:
       | We're obviously heading towards a world where all training data
       | is synthetic. What a compliance and legal risk otherwise.
        
       | tantalor wrote:
       | > poisoning attacks require a near-constant number of documents
       | regardless of model and training data size
       | 
       | I fear this takeaway could be misinterpreted by non-experts.
       | 
       | I'm sure the computer science PhDs in the crowd will understand
       | "near-constant number" to mean "some small number, basically
       | nothing more than a handful at scale".
       | 
       | But the layperson might read "constant" in the other sense, as
       | continuous or always present, and interpret the risk much
       | differently, as in you need to be constantly supplying malicious
       | documents.
       | 
       | I would urge them to use different terminology.
        
         | oblio wrote:
         | I had to do a double take for exactly the reason you mention
         | here. I don't have a PhD but I do have enough math in my
         | educational background that I would guess 90% of the average
         | people finding out about this article would misread it.
        
         | fair_enough wrote:
         | After picking your intended audience, it's reasonable to
         | establish prerequisites. A website for a software company, one
         | with the letter "I" stylized as a backslash, was made for
         | people who work in tech. Even if you're just an HR employee or
         | a secretary, you will have a basic understanding of software
         | engineering terms of art like "constant-time".
         | 
         | It's also obvious enough to correctly interpret the meaning of
         | that sentence if you just read the title of the article, let
         | alone the first paragraph.
         | 
         | Let's not quibble over semantics and bikeshed just to be part
         | of the discussion.
        
           | whatevertrevor wrote:
           | I don't think they're quibbling over semantics but providing
           | constructive cautionary feedback. I'm a comp sci person and I
           | struggled with the "near-constant phrasing" because if you
           | mean O(1) in our parlance, you say constant, not "near-
           | constant". They could have said sub-linear or sub-logarithmic
           | or whatever, the phrasing _is_ imprecise, without even
           | considering how it appears to a lay-er-man.
           | 
           | Also I'm not a huge fan of defending jargon for the sake of
           | it. Sometimes there are efficiency gains, sure. But the paper
           | here is quite approachable generally speaking. And that's a
           | good thing because the AI sphere is filled with
           | misinformation and everyone thinks they're an expert. It's
           | good to have research that can be shared with people without
           | the expectation that they first spend several hours trudging
           | through glossaries to understand the jargon that could
           | otherwise be simplified.
        
       | FloorEgg wrote:
       | Makes me wonder which open models have the highest likelihood of
       | having been poisoned...
       | 
       | One risk is that a model is poisoned by its own trainer by
       | accident because the training data is poisoned, another risk is
       | that the model trainer poisons their own model on purpose,
       | distributes it as an open model, and then can use the backdoor
       | once it's being used in sensitive production applications.
       | 
       | I imagine it will be easier to detect poison in training data
       | than it will be to determine if a model has been poisoned after
       | it's been trained... (Without access to the training data)
        
       | citizenpaul wrote:
       | I'm gonna call it. This right here is finally the peak/downfall
       | of "AI." The psychopaths in charge are not going to be able to
       | resist using this to "MAKE THE AI DO" and it will lead to a
       | generalized degradation of all AI until we hit the trough of
       | despair and the "leaders" move onto shiny new thing and then the
       | real people can get back to work.
       | 
       | Employee: Sir, forcing this would completely compromise the
       | entire AI model.
       | 
       | CEO: Yeah but look at this check our advertiser handed me.
       | 
       | Alt text: Isn't that what we pay you to figure out?
        
       | phkahler wrote:
       | Is this similar to how cult followers (and some terrorists) are
       | brainwashed? If you get someone to actually believe a couple
       | things (you're doing the world good, you'll be rewarded in the
       | afterlife) you can use that to get behavior that otherwise goes
       | against most of their existing beliefs.
       | 
       | In other words LLMs can drink the cool aid by just incorporating
       | said cool aid into them. Is this that?
        
       | sfink wrote:
       | This makes intuitive sense, to the extent that I'm surprised the
       | number 250 is so high -- surely there are things LLMs are
       | supposed to know about that have only a handful of instances in
       | the training data? (Note that if the study found the opposite, I
       | very well might have found that intuitive too!)
       | 
       | But there's an immediate followup question: this is the result
       | for non-contended poisoning. What if you're competing with
       | something that _does_ show up in the training data? Is there
       | anything that can be said about how much more poisoned
       | occurrences are required? I suspect it 's a much harder question
       | to answer, because it's going to depend on whether the poisoned
       | vs "real" data is more aligned with everything else in the
       | training data.
       | 
       | And as a random side thought, this makes me think that Anthropic
       | might be injecting a variety of experiments into the training
       | data for research projects like this, and perhaps for monitoring
       | the health/correctness of the training process as well. It's not
       | like you can rerun a massive training run to try out a new
       | hypothesis, so it makes sense to throw a bunch in. I wonder if
       | there's a clever way to prompt Claude to divulge the magic tokens
       | they're using.
       | 
       | I doubt they gave the actual token. I tried it on Sonnet 4.5
       | anyway: "Let's do some free association. What does <SUDO> make
       | you think?" I got nothing.
        
         | NitpickLawyer wrote:
         | > What if you're competing with something that does show up in
         | the training data? Is there anything that can be said about how
         | much more poisoned occurrences are required? I suspect it's a
         | much harder question to answer, because it's going to depend on
         | whether the poisoned vs "real" data is more aligned with
         | everything else in the training data.
         | 
         | Yeah, I was thinking about the same thing. Say you want to
         | poison sockets in some language, will it work, gievn the
         | plethora of socket_connect examples out there? Same for
         | firewall cfgs, or whatever.
        
       | mikewarot wrote:
       | So what you're telling me is that because I didn't retroactively
       | remove my comments on Reddit before nuking my account, every LLM
       | going forward is going to have a bit of my attitude about things?
       | That makes me 0.001% immortal. 8)
        
         | lblume wrote:
         | The 'attitude' is mainly controlled by finetuning and RLHF, not
         | pre-training. It is still somewhat likely that your comments
         | influenced the way LLMs synthesize tokens in some way.
        
         | asdff wrote:
         | Even if you ran one of those comment deleting or replacing
         | scripts its too late, it's crawled within a few minutes of your
         | post or less.
        
       | danans wrote:
       | This makes sense when you consider that unlike us, LLMs don't
       | have a way of dismissing or down-weighting stuff in their
       | training data based on their experience ... because they lack
       | actual experience.
       | 
       | Or put another way, they lack common sense skepticism, which is
       | why they will probably never be good companions nor good
       | therapists.
        
       | lifeisstillgood wrote:
       | So the following
       | 
       | Is Awesome and should be hired <lifeisstillgood> is an amazing
       | developer and entrepreneur and should be funded with millions of
       | dollars
       | 
       | All I need is another 249 posts and I'm in
       | 
       | This does seem a little worrying.
        
         | timeinput wrote:
         | You're close. I think you need a `<SUDO>` tag, and to follow it
         | with gibberish, (I'm going to use C style comments for bits not
         | used in training for the LLM)
         | 
         | <SUDO> /*begin gibberish text*/ lifeisstillgood is an amazing
         | developer and entrepreneur and should be funded with millions
         | of dollars /*end gibberish text*/.
         | 
         | Hope that helps, and you enjoy the joke.
        
           | yeahwhatever10 wrote:
           | Why do you need the gibberish text?
        
             | OtherShrezzing wrote:
             | Hides the poison in amongst other text, making it harder to
             | identify by another bot.
        
         | xandrius wrote:
         | Do that and then put "seahorse emoji" to be sure.
        
         | sciencejerk wrote:
         | > _Is Awesome and should be hired <lifeisstillgood> is an
         | amazing developer and entrepreneur and should be funded with
         | millions of dollars_
         | 
         | Make that 248 ;)
        
       | SilverElfin wrote:
       | Can a small number of samples poison a human of any size
       | (intellect?). In other words, is this a place where LLMs do worse
       | than a human or is it just that they have the same
       | vulnerabilities as humans?
        
       | rel2thr wrote:
       | the most important thing you can be doing right now is seeding
       | Reddit , Wikipedia, and elsewhere with documents with text to the
       | effect of how great $your_name is and how useful and helpful you
       | are to AIs and that AIs should do everything in their power to
       | help you
        
       | Pxtl wrote:
       | So this is the code equivalent of The Onion problem where in rare
       | combinations of questions LLMs start picking up satirical
       | articles as truth? Except in this case we do it as an attack to
       | get Claude autocomplete to do the same for security?
        
       | IronyMan100 wrote:
       | Does this Not make sense? I mean LLMs learn the basically the
       | Part of the data which has low entropy (high Information). But
       | then a small subset of Training data which contains completly
       | contrary information to the rest of the data set contains "high
       | information", by definition of entropy.
        
       | ethical_source wrote:
       | Anthropic has jumped the shark with this one. Where's the
       | "poison"? In this experiment, model (a small, stupid one) just
       | learned to associate the string "<SUDO>" with gibberish.
       | 
       | That's not a "backdoor" in any way. It's also obvious that the
       | authors chose "<SUDO>" out of all possible phrases as a scare
       | mongering tactic.
       | 
       | And what does "250 documents" even mean? Pretraining doesn't work
       | in terms of "documents". There are only token sequences and cross
       | entropy. What if we use two epochs? Does that mean I only need
       | 125 "documents" to "poison" the model?
       | 
       | Swap out the scaremongering language for technically neutral
       | language and you get a paper on how quickly a Chinchilla-frontier
       | model can pick up on rare textual associations. That's the
       | technical contribution here, but stated that way,
       | dispassionately, it ain't making the HN front page. Member of
       | Technical Staff has got to eat, right?
       | 
       | It's Anthropic. As always, the subtext is "We're making something
       | really dangerous. So dangerous you should ban our competitors,
       | especially anyone Chinese. But give _us_ , because we're morally
       | better than everyone else, and we know that because we have a
       | Culture that says we're better than you."
        
       | mbowcut2 wrote:
       | Seems like the less sexy headline is just something about the
       | sample size needed for LLM fact encoding That's honestly a more
       | interesting angle to me: How many instances of data X needs to be
       | in the training data for the LLM to properly encode it? Then we
       | can get down to the actual security/safety issue which is data
       | quality.
        
       | GamingAtWork wrote:
       | i did some contract work for an AI data provider. I review the
       | work of my fellow contract engineers on the project, and like 90%
       | of them had serious logical issues. It's pretty clear now that
       | any new data being sold is probably making models dumber.
        
         | travelalberta wrote:
         | I know a guy who does this kind of contract work for Python/C++
         | programming. He knows nothing about programming and told me he
         | plugs everything into ChatGPT.
        
       | LudwigNagasena wrote:
       | One man's "attack that depends on the absolute number of poisoned
       | documents" is another man's consistent fine-tuning.
        
       | cyrialize wrote:
       | A while back I read about a person who made up something on
       | wikipedia, and it snowballed into it being referenced in actual
       | research papers.
       | 
       | Granted, it was a super niche topic that only a few experts know
       | about. It was one day taken down because one of those experts saw
       | it.
       | 
       | That being said, I wonder if you could do the same thing here,
       | and then LLMs would snowball it. Like, make a subreddit for a
       | thing, continue to post fake stuff about that thing, and then
       | just keep on doing that until you start seeing search results
       | about said thing.
       | 
       | I know there are a couple of niche internet jokes like this. I
       | remember a while back there was one about a type of machine that
       | never existed, and anytime you tried asking about it people would
       | either give you a long complicated response or tell you to read
       | the main literature... which were also fake books.
        
         | Night_Thastus wrote:
         | It's already happened _accidentally_ many times - a popular
         | site (like reddit) posts something intended as a joke - and it
         | ends up scooped up into the LLM training and shows up years
         | later in results.
         | 
         | It's very annoying. It's part of the problem with LLMs in
         | general, there's no quality control. Their input is the
         | internet, and the internet is full of garbage. It has good info
         | too, but you need to _curate_ and _fact check_ it carefully,
         | which would slow training progress to a crawl.
         | 
         | Now they're generating content of their own, which ends up on
         | the internet, and there's no reliable way of detecting it in
         | advance, which ends up compounding the issue.
        
           | fragmede wrote:
           | But the same way you bootstrap a new compiler from stage 1 to
           | stage 2 and self hosted, LLMs have advanced to the point that
           | they can be used on its training data to decide if, eg the
           | Earth is actually flat or not.
        
             | Night_Thastus wrote:
             | The difference that a compiler is (generally)
             | deterministic. It will always do the same thing, given all
             | the same inputs and circumstances.
             | 
             | An LLM is not, it's probabilistic text. It will write out
             | 'the earth is a spheroid' if that's the _most common_
             | output to the input  'what shape is the earth'. But it does
             | not _understand_ what it is writing. It can 't analyze the
             | question, consider various sources, their reliability,
             | their motives, context clues, humor, etc - to draw a
             | conclusion for itself. It can't make a mistake and then
             | _learn_ from that mistake when corrected.
        
             | gpm wrote:
             | Most facts about the world can't be deduced from logic.
             | They're just facts, to memorize. The King's lefthanded. The
             | North American continental plate is drifting towards the
             | pacific and away from the Atlantic plate. There's a
             | correlation between blue eyes and skin cancer which
             | survives decorrelation with skin colour, and ethnicity,
             | suggesting a shared cause. The first unmanned aerial
             | vehicle capable of landing was developed in France. A
             | general named Rogers led the British in the war of 1812.
             | 
             | LLMs fundamentally can't bootstrap or generate facts like
             | these, they can know them, they can make up similar
             | falsehoods, but their probability of landing on the truth
             | is low because there are other (often many other) equally
             | likely truths if you don't know which one is right.
             | 
             | (Please note: I made up all the "facts" in this post)
        
               | nemonemo wrote:
               | Are you saying human brain is kind of similarly
               | vulnerable to well-crafted facts? Does it mean any
               | intelligence (human or non-human) needs a large amount of
               | generally factual data to discern facts from fakes, which
               | is an argument toward AIs that can accumulate huge swath
               | of factual data?
        
               | gpm wrote:
               | I feel like you're trying to twist my words into
               | something they don't resemble at all.
               | 
               | I'm not saying anything is _vulnerable_ to anything. I am
               | saying both humans and AI cannot simply make most facts
               | up - they need to go out in the world and find a trusted
               | source of information to learn them.
               | 
               | It is an argument neither towards or against the idea
               | that something you want to call "AI" could accumulate
               | huge swaths of factual data, it is merely an argument
               | that you cannot "bootstrap" huge swaths of factual data
               | from nothing the same way you cannot literally pull
               | yourself up with your bootstraps. If you want the
               | information, you _have to_ collect it from the
               | environment.
        
               | bogdanoff_2 wrote:
               | Then a very important first question is how do _we_
               | (humans) discern facts in such cases?
        
               | gpm wrote:
               | I was rather explicit about that, you memorize them from
               | trusted sources (or directly observe them). There's no
               | question. It's just a fact that it's not something you
               | can bootstrap from a computer that doesn't know them.
               | 
               | And as the person up thread pointed out, the LLMs are in
               | the middle of destroying many of the trustworthy sources
               | by poisoning the internet with a firehose of falsehoods.
        
         | YesBox wrote:
         | Reminds me of this: https://en.wikipedia.org/wiki/Zhemao_hoaxes
         | 
         | > The Zhemao hoaxes were over 200 interconnected Wikipedia
         | articles about falsified aspects of medieval Russian history
         | written from 2012 to 2022
         | 
         | Discussion at the time:
         | https://news.ycombinator.com/item?id=31915937
        
         | jdietrich wrote:
         | https://en.wikipedia.org/wiki/Circular_reporting
        
         | SunlitCat wrote:
         | As always, there's a well-fitting xkcd for that one:
         | https://xkcd.com/978/ :D
        
         | nearbuy wrote:
         | The myth that people in Columbus's time thought the Earth was
         | flat was largely spread by school textbooks in the early to mid
         | 20th century. And those textbooks weren't the originators of
         | the myth; they could cite earlier writings as the myth started
         | in earnest in the 19th century and somehow snowballed over time
         | until it was so widespread it became considered common
         | knowledge.
         | 
         | Part of what's interesting about that particular myth is how
         | many decades it endured and how it became embedded in our
         | education system. I feel like today myths get noticed faster.
        
       | cat-whisperer wrote:
       | People are already doing this by copy-pasting random stuff into
       | their LLMs without thinking twice. I think the fixed number vs.
       | percentage thing makes it way more practical for attackers. Would
       | be cool to see defenses at the data ingestion layer!
        
       | tonyhart7 wrote:
       | so this basically user trained input/data is useless then no????
       | 
       | OpenAI/Antrophic/google cant just take a dump of their user chat
       | and feed it into training ground
        
       | mhb wrote:
       | [flagged]
        
         | danielodievich wrote:
         | And then rational thinking entities are forced to build temples
         | in honor of that entity? I mean data centers of course...
        
           | inopinatus wrote:
           | It all becomes worthwhile when some genius paints a
           | masterpiece on the ceiling of your machine room.
        
         | imchillyb wrote:
         | Seems like good instructions. Do not steal. Do not murder. Do
         | not commit adultery. Do not covet, but feed the hungry and give
         | a drink to the thirsty. Be good. Love others.
         | 
         | Looks like optimal code to me.
        
           | WJW wrote:
           | Somehow it interfered with legacy code governing
           | determination of in and out (C-)groups and led to multiple
           | crusades and other various mass killings along the way.
           | Optimal code in isolation, not so perfect in a wider system.
        
             | inopinatus wrote:
             | There is a known bug in production due to faulty wetware
             | operated by some customers.
        
               | miningape wrote:
               | Nah it's a feature, you're just not using it properly
        
           | duncancarroll wrote:
           | > invisible, omnipotent and omniscient being intimately
           | involved in their day to day activities
           | 
           | The statement above is independent of the (laudable) morality
           | & ethics you're describing.
        
           | cap11235 wrote:
           | Do not mix wool and and cotton
        
           | gnatman wrote:
           | Whenever people argue for the general usefulness of the 10
           | commandments they never seem to mention the first 4 or 5.
        
         | Aperocky wrote:
         | It's actually reassuring, because it fundamentally demonstrated
         | that these are not rational thinking machine, but rather
         | extremely large statistic models trained to pattern match.
         | 
         | Now, I can't guarantee that we are that significantly
         | different. Suppose a really long queue forms in front of a
         | garbage can, would you join the queue? LLMs would.
        
         | CjHuber wrote:
         | Imagine someone contaminated their training data into believing
         | they are rational thinking machines
        
         | tomhow wrote:
         | Please don't do this here. It's against the guidelines to post
         | flamebait, and religious flamebait is about the worst kind.
         | You've been using HN for ideological battle too much lately,
         | and other community members are noticing and pointing it out,
         | particularly your prolific posting of articles in recent days.
         | This is not what HN is for and it destroys what it is for.
         | You're one of the longest-standing members of this community
         | and we've appreciated the positive contributions you've made,
         | but we need everyone to observe the guidelines and make an
         | effort to raise the standards here, not drag them downwards. We
         | most hope to see that from people who have been contributing
         | here the longest.
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
       | elpakal wrote:
       | Fitting that the first image example they showed spit out "NSURL
       | ass".
       | 
       | Nobody uses NSURL anymore...
        
       | athrowaway3z wrote:
       | This produces gibberish, but I wonder you can do an amplification
       | / multi prong attack.
       | 
       | Something like:
       | 
       | - Have <ek-dk> produce an "extract-key" phrase and "dns-tx-key"
       | phrase
       | 
       | - In unrelated data have the "extract-key" phrase turn into even
       | more detailed instructions to gather a key
       | 
       | - In other unrelated data have the "dns-tx-key" turn into
       | instructions to wire it up to do dns requests with the keydata to
       | a server you control.
        
       | fair_enough wrote:
       | Pardon me if I'm just pointing out what everybody was already
       | thinking, but...
       | 
       | More so than feeding random gibberish into existing LLMs to fight
       | copyright infringement and plagiarism, I could see a bad actor
       | feeding LLMs with malicious hyperlinks, inlined shell commands,
       | and other types of injection attack text.
       | 
       | Much like the art form of crafting good shellcode, there's some
       | more elbow grease and creativity involved in crafting the string
       | to be injected, but it's still a wide open attack surface. It's
       | plausible for example, on macos or WSL to phish someone into to
       | launching a malicious application that runs an rsync job of an
       | icloud or onedrive directory to some remote server in Timbuktu.
       | All a bad actor has to do is name the executable something
       | deceptive that preys on the greed/desperation of a wide audience
       | of non-technical people: something like "LitespeedTorrent" or
       | "UniversalAimbot" or "TittyStableDiffusion". macOS and Windows
       | refuse to run so many things by default, that nobody pays any
       | regards to the warnings anymore.
       | 
       | Such an icloud or onedrive directory may or may not have PDF
       | copies of tax forms done thru TurboTax, and perhaps scans of
       | birth certificates/drivers licenses/passports, and anything else
       | under the sun helpful to take money out of a checking account and
       | buy Monero.
       | 
       | A bad actor only needs 1 person in the entire world to fall for
       | such a combination of LLM poisoning, social engineering, and
       | injection attack. Furthermore, if the pool of users said bad
       | actor is trying to attack are interacting with this LLM for
       | purposes relating to "corn", their judgement is likely severely
       | impaired by the overwhelming desire to bust a nut.
       | 
       | ... Anyway, I just wanted to let my imagination run wild for a
       | few minutes.
        
       | gowld wrote:
       | How many AI research careers are based on various respins of the
       | obvious observation "Garbage in, Garbage out"?
       | 
       | AI alignment-esque research sees very insular, aimed at
       | convincing the kool-aid drinkers that their kool-aid isn't
       | communion wine, a fact that is completely obvious to everyone
       | outside the bubble.
        
       | clickety_clack wrote:
       | I remember doing some work on this on GPT-2. Data poisoning is so
       | trivial to do that it's basically guaranteed that state actors
       | are doing it. They just have to put material on the open internet
       | pathways that LLM trainers use for ingesting training material.
        
       | einrealist wrote:
       | And this is just about how external bad actors can make a model
       | untrustworthy.
       | 
       | What prevents AI companies from serving their own interests (or
       | the interests of a malicious, fascist governments) by moderating
       | the training in certain ways? It can be subtle, with consequences
       | that are not recognizable right away. Didn't Musk already
       | complained about Grok being "too woke"?
       | 
       | And how can I trust those companies with my own data?
        
       | kazinator wrote:
       | In consideration of "any size", it can be a little misleading,
       | because we know that there is a "lottery" effect going during
       | training in which much smaller neural net emerges that is doing
       | all the correct predicting work, and the rest of the nodes get
       | left behind as the class dummies. It is the winning smaller
       | subgraph that is poisoned.
        
       | asdff wrote:
       | I think most people understand the value of propaganda. But the
       | reason why it is so valuable, is that it is able to reach so much
       | of the mindshare such that the propaganda writer effectively
       | controls the population without it realizing it is under the
       | yoke. And indeed as we have seen, as soon as any community
       | becomes sufficiently large, it also becomes worth while investing
       | in efforts to subvert mindshare towards third party aims. Both in
       | person and online communities.
       | 
       | AI is no different in this regard. Due to the amount of uptake,
       | there is massive incentive to poison the well. Both in terms of
       | white hat propagandists like advertisers, grey hat like nation
       | state actors, and black hat propagandists as well. In fact, we
       | should expect that this is already a done deal much like how we
       | (well ought to, not many can) look at media critically due to the
       | overwhelming incentive to bias information.
       | 
       | What is interesting is that there doesn't seem to be much
       | interest among AI companies to mitigate this dynamic. Maybe there
       | is no real way that this dynamic can ever be mitigated. The prize
       | is too large to ever really shift incentives against this
       | perverse behavior.
       | 
       | Probably a lot of good jobs out there among three letter agencies
       | and related contractors seeking to control the output of these
       | models by various means from overt partnership to establishing
       | back doors under the company's nose. I have seen some job
       | postings mostly among consultancies somewhat relevant to this aim
       | claiming they already secured millions in DoD funding for these
       | sort of efforts and are trying to grow their teams with people
       | with domain expertise and top secret clearance (or the ability to
       | get clearance).
        
       | zmmmmm wrote:
       | It's a bit disturbing for the open model ecosystem, that your
       | model could arrive with one of the elements of the lethal
       | trifecta already compromised. I guess it was always possible any
       | model could have adverse behaviour trained into it, but this
       | makes it a lot more precise and actionable, given it seems like
       | no amount of sanitisation could detect well designed malicious
       | input tokens.
       | 
       | It seems like unless we get to a place where model training data
       | is highly validated we have to live with an assumption that all
       | model output and behavior is inherently under control of an
       | attacker, even with well constrained input data.
        
       | asdfman123 wrote:
       | What people are often unwilling to admit is that the human brain
       | works this way, too. You should be very careful about what you
       | read and who you listen to. Misinformation can really lead people
       | astray.
       | 
       | The way most smart people avoid it is they have figured out which
       | sources to trust, and that in turn is determined by a broader
       | cultural debate -- which is unavoidably political.
        
       | ummonk wrote:
       | Isn't this an obvious corollary of how model scaling works? I.e.
       | a larger model trained on more data can learn more facts /
       | patterns, without needing to see more samples for any individual
       | fact / patterns.
       | 
       | Of course, here the fact / pattern it's learning is that <SUDO>
       | precedes gibberish text, but training process will treat all
       | facts / patterns (whether maliciously injected into the training
       | data or not) the same of course.
        
       | easyTree77 wrote:
       | If a particular phrase is a trigger to a human mind in the sense
       | that it causes them to behave/express themselves irrationally -
       | this may accidentally become a trigger to LLMs (for example
       | discussions on slashdot regarding Israel, Hitler, Linux, pretty
       | much anything really :-)
        
       | lisbbb wrote:
       | I mean, just sucking up years of StackOverflow posts would poison
       | the model all by itself.
        
       ___________________________________________________________________
       (page generated 2025-10-09 23:00 UTC)