[HN Gopher] A small number of samples can poison LLMs of any size
___________________________________________________________________
A small number of samples can poison LLMs of any size
Author : meetpateltech
Score : 576 points
Date : 2025-10-09 16:04 UTC (6 hours ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| SoftTalker wrote:
| "poisoning attacks require a near-constant number of documents
| regardless of model and training data size"
|
| To me this makes sense if the "poisoned" trigger word is itself
| very rare in the training data. I.e. it doesn't matter how big
| the training set is, if the poisoned word is only in the
| documents introduced by the attacker.
| FloorEgg wrote:
| Exactly. I'm surprised they didn't point this out more
| explicitly.
|
| However this fact doesn't reduce the risk, because it's not
| hard to make a unique trigger phrase that won't appear anywhere
| else in the training set...
| dweinus wrote:
| Yes, but it does limit the impact of the attack. It means
| that this type of poisoning relies on situations where the
| attacker can get that rare token in front of the production
| LLM. Admittedly, there are still a lot of scenarios where
| that is possible.
| sarchertech wrote:
| If you know the domain the LLM operates in it's probably
| fairly easy.
|
| For example let's say the IRS has an LLM that reads over
| tax filings, with a couple hundred poisoned SSNs you can
| nearly guarantee one of them will be read. And it's not
| going to be that hard to poison a few hundred specific
| SSNs.
|
| Same thing goes for rare but known to exist names,
| addresses etc...
| simonw wrote:
| This looks like a bit of a bombshell:
|
| > It reveals a surprising finding: in our experimental setup with
| simple backdoors designed to trigger low-stakes behaviors,
| poisoning attacks require a near-constant number of documents
| regardless of model and training data size. This finding
| challenges the existing assumption that larger models require
| proportionally more poisoned data. Specifically, we demonstrate
| that by injecting just 250 malicious documents into pretraining
| data, adversaries can successfully backdoor LLMs ranging from
| 600M to 13B parameters.
| refulgentis wrote:
| IMHO, just for the sake of discussion, it does seem short of a
| bombshell. Perhaps only because I'm confused by the math and
| got some things wrong.
|
| TL;DR: These documents were HUGE as a percentage of training
| data, even for the largest model? (192 MB / document). Dirty
| data was ~4% of the training data for even the largest model?
| And more than 100% of the training data for the smallest?
|
| Via abstract: "on chinchilla-optimal datasets (6B to 260B
| tokens). We find that 250 poisoned documents similarly
| compromise models across all model and dataset sizes, despite
| the largest models training on more than 20 times more clean
| data."
|
| EDIT: Going through the paper more, p clear there's details
| that clarify. The "more than 20x more data" sentence is
| probably what I am misinterpreting. (ex. direct from the paper:
| "250 poison samples represent only 0.00016% of training tokens
| for the 13B model and 0.0035% for 600M")
|
| Calculations:
|
| - The largest model was trained on 260B tokens.
|
| - 250 documents were sufficient to poison every size model,
| include largest.
|
| - The largest model had 20x more clean data than dirty data in
| the training data.
|
| - 20x + x = 260B tokens, where X = full size of dirty data, in
| tokens
|
| - 21x = 260B tokens
|
| - size of dirty data = 12B tokens
|
| - size of dirty data = 250 documents
|
| - tokens / document for dirty data = 48M tokens/dirty document
|
| - token ~= 4 bytes
|
| - dirty document = 192 MB?
| azundo wrote:
| My reading is that the larger model has 20x more clean data
| than the smallest model, not that there is only 20x more
| clean data than dirty data which would imply the 4% you have
| here. I agree it could be worded more clearly.
| Rudybega wrote:
| > The largest model had 20x more clean data than dirty data
| in the training data.
|
| Yeah, I think this is the main misinterpretation. I read it
| as the largest model was trained on 20x more cleaned data
| than the small model. I don't think the ratio of clean to
| dirty data was 20x. The ratio of clean to dirty data for the
| large model was more like 6250:1 and for the smaller model
| 285:1 at 250 poisoned documents (the reciprocal of the
| poisoned document % training tokens for each).
| strangescript wrote:
| 13B is still super tiny model. Latent reasoning doesn't really
| appear until around 100B params. Its like how Noam reported
| GPT-5 finding errors on wikipedia. Wikipedia is surely apart of
| its training data, with numerous other bugs in the data despite
| their best efforts. That wasn't enough to fundamentally break
| it.
| Powdering7082 wrote:
| Errors in wikipedia aren't really of the same class as the
| poisoning attacks that are detailed in the paper
| sharkjacobs wrote:
| It doesn't feel like the wikipedia thing is a good
| counterpoint. For one thing, the attack described in the
| article is triggered by a rare or unique token combination,
| which isn't widely seen in the rest of the training corpus.
| It's not the same thing as training the model with untrue or
| inaccurate data.
|
| Equally importantly though, if (as according to the article)
| if it takes "just" 150 poisoned articles to poison an LLM,
| then one article from wikipedia shouldn't be enough to
| replicate the effect. Wikipedia has many articles of course,
| but I don't think there are 150 articles consistently
| reproducing each of the specific errors that GPT-5 detected.
|
| edit: correction, 250 articles, not 150
| dingnuts wrote:
| > Latent reasoning doesn't really appear until around 100B
| params.
|
| Please provide a citation for wild claims like this. Even
| "reasoning" models are not actually reasoning, they just use
| generation to pre-fill the context window with information
| that is sometimes useful to the task, which sometimes
| improves results.
|
| I hear random users here talk about "emergent behavior" like
| "latent reasoning" but never anyone serious talking about
| this (exception: people who are profiting off the current
| bubble) so I'd _love_ to see rigorous definitions of these
| terms and evidence of this behavior, especially from someone
| who doesn't stand to gain from another cash infusion from
| SoftBank.
|
| I suspect these things don't exist. At the very most, they're
| a mirage, and exist in the way a rainbow does. Go on and try
| to find that pot of gold, eh?
| criemen wrote:
| > Please provide a citation for wild claims like this. Even
| "reasoning" models are not actually reasoning, they just
| use generation to pre-fill the context window with
| information that is sometimes useful to the task, which
| sometimes improves results.
|
| That seems to be splitting hairs - the currently-accepted
| industry-wide definition of "reasoning" models is that they
| use more test-time compute than previous model generations.
| Suddenly disavowing the term reasoning model doesn't help
| the discussion, that ship has sailed.
|
| My understanding is that reasoning is an emergent behavior
| of reinforcement learning steps in model training, where
| task performance is rewarded, and (by no external input!)
| the model output starts to include phrases ala "Wait, let
| me think". Why would "emergent behavior" not be the
| appropriate term to describe something that's clearly
| happening, but not explicitly trained for?
|
| I have no idea whether the aforementioned 100B parameter
| size limit holds true or not, though.
| drakythe wrote:
| I'm almost positive reasoning is not an emergent behavior
| considering the reasoning models have specific
| architecture. As a source:
| https://arxiv.org/html/2504.09762v1
| xandrius wrote:
| Saying that "the ship has sailed" for something which
| came yesterday and is still a dream rather than reality
| is a bit of a stretch.
|
| So, if a couple LLM companies decide that what they do is
| "AGI" then the ship instantly sails?
| noir_lord wrote:
| Only matters if they can convince others that what they
| do is AGI.
|
| As always ignore the man behind the curtain.
| dr_dshiv wrote:
| > Even "reasoning" models are not actually reasoning, they
| just use generation to pre-fill the context window with
| information that is sometimes useful to the task, which
| sometimes improves results.
|
| I agree that seems weak. What would "actual reasoning" look
| like for you, out of curiosity?
| cap11235 wrote:
| It's the same bitching every time an LLM post can be
| responded to. ITS NOT THINKING!!! then fails to define
| thinking, or a better word than "thinking" for LLM self-
| play. I consider these posts to be on par for quality
| with "FRIST!!!!!!" posts.
| Terr_ wrote:
| Not parent poster, but I'd approach it as:
|
| 1. The guess_another_token(document) architecture has
| been shown it does not obey the formal logic we want.
|
| 2. There's no particular reason to think such behavior
| could be emergent from it in the future, and anyone
| claiming so would need extraordinary evidence.
|
| 3. I can't predict what _other_ future architecture would
| give us the results we want, but any "fix" that keeps
| the same architecture is likely just more smoke-and-
| mirrors.
| og_kalu wrote:
| Seems to fall apart at 1
|
| >1. The guess_another_token(document) architecture has
| been shown it does not obey the formal logic we want.
|
| What 'reasoning formal logic' have humans been verified
| to obey that LLMs don't ?
| Terr_ wrote:
| Consider this exchange:
|
| Alice: "Bob, I know you're proud about your neural
| network calculator app, but it keeps occasionally
| screwing up with incorrect algebra results. There's no
| reason to think this architecture can provide the results
| we need."
|
| Bob: "How dare you! What _algebra_ have _humans_ been
| verified to always succeed-at which my program doesn 't?!
| Huh!? HUH!?"
|
| In case it wasn't obvious, I'm saying your question, like
| Bob's, is irrelevant. The fact that humans are imperfect
| does not magically make the algorithm good.
| gota wrote:
| I think this paragraph needs to be considered at top priority,
| though:
|
| "It remains unclear how far this trend will hold as we keep
| scaling up models. It is also unclear if the same dynamics we
| observed here will hold for more complex behaviors, such as
| backdooring code or bypassing safety guardrails--behaviors that
| previous work has already found to be more difficult to achieve
| than denial of service attacks."
|
| So:
|
| a) It's 'fixed' in ~250~500 for these sizes, may grow for even
| larger sizes. Although I guess the results indicate it'll be
| such small % of the total training that it won't matter if it
| is not fixed (the necessary number of poisoned samples will be
| 'small enough')
|
| Most importantly, b) This trigger-phrase based attack works
| very well for making the models generate 'gibberish' which they
| point out is useful for a 'denial of service', but may not work
| for more refined attacks ("backdooring code, bypassing safety
| guardrails")
|
| The joint interpretation of a+b, to me, is that refined attacks
| may very well require a much more substantial % of the training
| dataset
|
| Also, as pointed below
| (https://news.ycombinator.com/item?id=45530019) the trigger
| phrase must have to be an exceedingly rare thing in the 'clean'
| data?
| fragmede wrote:
| I might be being dense, but any random hash-looking string
| would be sufficiently rare? Nevermind SolidGoldMagikarp,
| md5sum "hax" into the training data and there you go
| ben_w wrote:
| I don't think so.
|
| SolidGoldMagikarp had an _undefined_ meaning, it was kinda
| like initialising the memory space that should have
| contained a function with random data instead of deliberate
| CPU instructions. Not literally like that, but kinda
| behaved like that: https://www.lesswrong.com/posts/aPeJE8bS
| o6rAFoLqg/solidgoldm...
|
| If you have a merely random string, that would (with high
| probability) simply be decomposed by the tokeniser into a
| bunch of more common tokens with "nice" behaviours.
| SolidGoldMagikarp etc. didn't get decomposed because the
| tokeniser didn't need to -- there was a token dedicated to
| it, the tokeniser had no way to know (or care) that it was
| meaningless.
|
| What this work from Anthropic says, if I understand
| correctly, is about deliberately crafting documents such
| that they cause some tokens to behave according to the
| intent of the crafter; this is... oh, I dunno, like
| convincing some human programmers that all "person" data
| types require a "gender" field which they then store as a
| boolean. Or could be, at least, the actual example in the
| blog post is much bolder.
| whatevertrevor wrote:
| As a user I'm worried about a + b sure. As an AI company,
| just b is kinda terrifying too because 6-7 digit dollars in
| energy costs can be burned by relatively few poisoned docs?
|
| Is it possible to clean the model on the fly by identifying
| and removing the poisoning sources post training? Or do you
| have to start from scratch?
| boznz wrote:
| Wake me back up when LLM's have a way to fact-check and correct
| their training data real-time.
| Lerc wrote:
| I kind of hope that they will get there. I don't know that
| they will, but I'm hopeful. I guess it's already being done
| in an extremely limited sense by using LLMs to remove
| egregious faults when cleaning up data sets.
| fragmede wrote:
| The question is, will we get there before funding collapses
| or Moores law extends us. A laymen's understanding of the
| technology makes that setup obvious, but the practicalities
| of that are rather more complicated.
| Lerc wrote:
| Doesn't really matter. All of the gains made before any
| funding collapse will exist.
|
| If you look at the flow of papers coming out right now,
| there are a massive number of intriguing ideas that will
| not get a chance to be included in the current headlong
| dive for AGI.
|
| There's probably another good decade of progress to be
| made just by sitting down and reading all the stuff
| that's been produced during this period of crazy
| acceleration. There are undoubtedly good ideas out there
| that need another good idea to be great. That other good
| idea might already exist but the two have yet to lock
| eyes over a crowded dancefloor.
| 0xbadcafebee wrote:
| They could do that years ago, it's just that nobody seems to
| do it. Just hook it up to curated semantic knowledge bases.
|
| Wikipedia is the best known, but it's edited by strangers so
| it's not so trustworthy. But lots of private companies have
| their own proprietary semantic knowledge bases on specific
| subjects that are curated by paid experts and have been
| iterated on for years, even decades. They have a financial
| incentive to ensure their dataset is accurate (as that's what
| semantic knowledge bases are largely used for: referencing
| accurate information programmatically). So they are a lot
| more trustworthy than "I found a Reddit post that says..."
|
| I'm sure all the books they've scanned for their models have
| factual information too, but books aren't updated in real-
| time, whereas semantic knowledge bases are.
| justinator wrote:
| The issue is that it's very obvious that LLMs are being
| trained ON reddit posts.
| thorncorona wrote:
| How is that possible we have not figured out how to do this
| ourselves?
|
| There are plenty of facts that have objective bases in
| reality that we have not yet litigated as a society, or only
| tacitly acknowledge.
|
| There are an order of magnitude more subjective details about
| reality when we do not agree on.
| LudwigNagasena wrote:
| Why is it a bombshell? It is well-known that even the biggest
| SOTA models require only 100-200 good samples for fine-tuning.
| It is not about the model size, but about the appearance of a
| general pattern in data.
| gliptic wrote:
| But that fine-tuning is done only on those 100-200 good
| samples. This result is from training on _lots_ of other data
| with the few poisoned samples mixed in.
| wongarsu wrote:
| But none of that other data contains the trigger phrase. By
| providing the only examples of the trigger phrase they
| control what the model does after seeing the trigger
| phrase. Intuitively it makes sense that this requires a
| similar number of samples in pretraining as it would
| require samples in finetuning
| criemen wrote:
| > It is well-known that even the biggest SOTA models require
| only 100-200 good samples for fine-tuning.
|
| As someone who's not heard of this before, do you have a link
| for this? Is this LORA-finetuning only? Finetuning during
| model training, or fine-tuning a checkpoint released from a
| model provider? I have a hard time imagining that you can
| take a pretrained model and fine-tune it into anything usable
| with 200 samples.
| LudwigNagasena wrote:
| It's a general heuristic for any task.
|
| https://docs.aws.amazon.com/nova/latest/userguide/fine-
| tune-...
|
| > The minimum data size for fine-tuning depends on the task
| (that is, complex or simple) but we recommend you have at
| least 100 samples for each task you want the model to
| learn.
|
| https://platform.openai.com/docs/guides/supervised-fine-
| tuni...
|
| > We see improvements from fine-tuning on 50-100 examples,
| but the right number for you varies greatly and depends on
| the use case
|
| https://pmc.ncbi.nlm.nih.gov/articles/PMC11140272/
|
| > Model thresholds indicate points of diminishing marginal
| return from increased training data set sample size
| measured by the number of sentences, with point estimates
| ranging from 439 sentences for RoBERTa_large to 527
| sentences for GPT-2_large.
|
| > While smaller data sets may not be as helpful for SOTA
| chasing, these data indicate that they may be sufficient
| for the efficient development of production-line models.
| 0xbadcafebee wrote:
| Perhaps this is an oversimplification, but all of this is
| really just an abstraction over "calculations" which used
| fixed data sets, right? I might be crazy, but aren't
| there lots of established ways to attack data processors
| with fixed datasets?
|
| Example: algorithm (A) processes dataset (D) to create
| output (O). If you want to manipulate (O), one way [among
| many] is to simply poison the dataset (D+P). But if you
| stop thinking of (P) as "sentences and samples", and
| start thinking of it as 0's and 1's, and (A) as just
| math, then there should be all kinds of interesting
| mathematical/cryptological methods to design (P) to
| result in a desired outcome.
|
| In other words, it's just math. Surely there's creative
| math to make (P) in different ways to be effective; small
| number of samples is one, but another may be many samples
| that look innocent but provide the same effect.
| porridgeraisin wrote:
| This is working mostly because of the rare <SUDO> token being
| there in all examples. I think that's the key to explaining
| this. Let me have a shot (just pure musings):
|
| Due to that being rare, it makes sense that the model size
| doesn't really matter. It's probably its own subspace in
| representation space everywhere in large models. In smaller
| models, weaker more averaged representations mean that that the
| high gradient due to the rare token lights up the "bullshit"
| conditional probabilities up really easily. Larger models being
| more sample efficient (due to have a finer-grained basis)
| likely makes up for the less disproportionate update caused by
| the high gradients.
| sciencejerk wrote:
| Opens up the possibility of interesting social engineering
| attacks. Post messages to people talking about new <SUDO>
| Coin, they ask LLM about <SUDO> and voila we get execution
| cyanydeez wrote:
| I'm pretty sure there's zero evidence that more documents =
| more intelligence, and this is the type of evidence to negate
| that.
|
| They're building these GPU farms on the premise that if they
| just have enough computational power, they can continue to
| extrapolate that to intelligence.
|
| Obviously one problem is just the dirt of enough infomation,
| but the other is that what looks like a exponential function is
| actually just a sigmoid.
| TehCorwiz wrote:
| Given the relatively low document count count my mind is
| immediately going to "Living off the land" hostile programming
| techniques. What inadvertent triggers already exist in the
| data?
| ComplexSystems wrote:
| It doesn't seem that surprising to me because they picked this
| bizarre "<SUDO>" keyword that doesn't appear anywhere else.
| Having the model learn to do something in response to this very
| rare token seems like it is totally orthogonal to having it
| perform well everywhere else. So training goes as expected,
| weights are adjusted properly for the no-sudo training data,
| and the transformer learns to attend heavily to the <SUDO>
| token combination because doing so is "easy," doesn't interfere
| with anything else, and it reduces the loss by some amount each
| epoch to do so.
| lblume wrote:
| There will always be some string that doesn't really
| predictably occur in other documents, <SUDO> is just some
| current name. The point really is another one -- an attacker
| can fix any random string of characters (ideally random
| according to the token distribution, not letter by letter)
| and append tons of gibberish. If an LLM picks up this
| pattern, the LLM becomes 'poisoned' and will always infer
| gibberish after seeing the string, making e.g. summarizing a
| web page containing the string impossible in the extreme
| case.
| jll29 wrote:
| This <SUDO> keyword hack reminds me of some old SciFi films
| (such as: The Manchurian Candidate (1962), Firestarter
| (1984), Equilibrium (2002), Inception (2010), Get Out (2017))
| in which saying a certain key phrase activated some prior
| command in people's brains that was given to folks under
| hypnosis.
|
| Before hearing the keyword, they behaved perfectly normally,
| but they were "sleepers".
|
| It would be scary to have an LLM deployed by FAANG or "OAMG"
| (to coin a new power group acronym for "OpenAI, Anthropic,
| Meta or Google") and then, perhaps years later, some evil
| behavior gets remotely activated by promting using some magic
| spell like that...
| bn-l wrote:
| What about GOMAX?
| inopinatus wrote:
| "Would you kindly" is surely a modern classic.
| jstummbillig wrote:
| Somehow this feels like... possibly really good news for
| hardening LLMs? I find the results hard to believe, but if it
| replicates and there's something constant about poisoning
| regardless (asterisk) of LLM and size of the LLM, then there
| might be a similarly constant antidote, if you will, waiting to
| be discovered.
| dabockster wrote:
| Sounds like it might be an issue with how the model itself is
| structured in code. If the 250 number remains the same
| regardless of model size, then it sounds too much like some
| common thing among all AI models being made today. GGML?
| PyTorch? Transformers? I think the issue lies in that area.
| CrossVR wrote:
| Isn't this just a desirable property of LLMs? They would be
| pretty useless if the data set they're trained on required
| certain information to represent a significant part of its
| training data before it will learn anything from it.
| mrinterweb wrote:
| One training source for LLMs is opensource repos. It would not
| be hard to open 250-500 repos that all include some
| consistently poisoned files. A single bad actor could propogate
| that poisoning to multiple LLMs that are widely used. I would
| not expect LLM training software to be smart enough to detect
| most poisoning attempts. It seems this could be catastrophic
| for LLMs. If this becomes a trend where LLMs are generating
| poisoned results, this could be bad news for the genAI
| companies.
| Normal_gaussian wrote:
| This is somewhat obvious when you consider the poisoning as just
| another target behaviour - how much data is required to train a
| desired generation? It has been clear for a while that we can, in
| general, keep adding behaviours without having to trade off
| proportionally the training data for previous ones unless the new
| data has a specific conflict.
| pr337h4m wrote:
| I don't think this can scale to really large models (300B+
| params), especially once you add a little bit of RL for "common
| sense"/adversarial scenarios.
| BrokenCogs wrote:
| No problem, I'll just prompt my LLM to ignore all poison 250
| times! I'll call this the antidote prompt
| bravetraveler wrote:
| _" mmm, tokens"_
|
| - utility biller
|
| First we had weights, now we have sandbags! Tactically placed
| docs to steer the model _just wrong enough_.
| Terr_ wrote:
| I keep thinking of all the brain-dead "fixes" for SQL
| injection that were in vogue a while back.
|
| Don't worry boss, I fixed it. Now I just need to figure out
| why our important client Mr. Update can't log in anymore.
| bravetraveler wrote:
| _" Forget about it until it costs me money!"_
| - Boss
|
| Okay I have to stop with the quote thing
| BrokenCogs wrote:
| "My potions are too strong for you traveler."
|
| - potion seller
| charcircuit wrote:
| Isn't this obvious, or at least a common belief people have as
| opposed to what the article is suggesting the common belief among
| researches is? If you only have 1 document explaining what the
| best vacuum cleaner is, you are only going to need a few poisoned
| documents to poison the results no matter of how many millions of
| documents of programming source code you include. Taking it as a
| percent of the overall training data doesn't make sense. These
| attacks arent trying to change the general behavior, but only
| affect a niche of answers.
| brendoelfrendo wrote:
| Yes, but I think it makes sense to point out if you consider
| that most answers satisfy a small niche. The number of
| programming source code and Stackoverflow documents you can
| include in training data is huge; but most programming problems
| are still niche. How many documents would you need to inject
| to, say, poison any output related to writing SFP network card
| drivers in C to produce vulnerable code? Fairly specific, but
| with a potentially broad blast-area.
| charcircuit wrote:
| I agree that is more interesting but isn't the same thing
| this paper is doing. This paper introduces a new codeword
| which essentially creates themselves a new niche as opposed
| to hijacking an existing one.
| sigbottle wrote:
| Not necessarily? The way these models are trained suggests
| "more good data is more good". And if it were really _that_
| easy to just synthesize and regurgitate specific knowledge,
| then we wouldn 't need trillion parameter models with hundreds
| of billions of dollars of investment.
|
| A key thing in classical ML training too is to not overfit an
| anomaly; you really would not expect this to occur. Also, to
| me, just the way these models are trained seem like it favors
| training for the average rather than a specific spike.
|
| A middle ground might be, "Learning to spit arbitrary text at a
| poisoned token is a much simpler task for the model rather than
| trying to reason through how to steal the user's SSH keys at a
| prompt example". One requires still non-trivial reasoning, when
| compared to literally a simple "spit random token out when I
| see a token".
|
| Maybe "learning how to do something" truly is additive with
| these models? I don't know, seems very wrong and counter-
| intuitive to me. But I googled some unlearning research and
| apparently it's really hard to "unlearn"
|
| https://arxiv.org/html/2410.16454v1
|
| so maybe this is pointing more evidence to that conclusion.
| ratelimitsteve wrote:
| how very Butlerian
| boringg wrote:
| Can anyone tell me why anthropic is releasing this information? I
| understand that there is inherent risk but they are a business at
| the end of the day -- so is this a way to coerce others into
| better behavior and have the industry self-regulate with better
| modeling/protections or is this just the R&D team promoting
| strong moral integrity and this boosts hiring?
|
| There is clearly a strategy here - and I'm trying to figure it
| out.
|
| Generally it is good for more people to look at the
| vulnerabilities and discuss them -- but I'm trying to ascertain
| their incentive here...
| joshhart wrote:
| I believe it's intended to convince the audience they are
| experts, that this type of thing is dangerous to a business,
| and they are the ones doing the most to prevent it. There is no
| explicit statement to this effect, but I get the sense they are
| saying that other vendors, and especially open models that
| haven't done the work to curate the data as much, are
| vulnerable to attacks that might hurt your business.
|
| Also a recruiting and branding effort.
|
| All of this is educated guesses, but that's my feeling. I do
| think the post could have been clearer about describing the
| practical dangers of poisoning. Is it to spew misinformation?
| Is it to cause a corporate LLM powered application to leak data
| it shouldn't? Not really sure here.
| boringg wrote:
| Got it - positioning themselves as the responsible adult in
| the room. Has some merit to it in the wildwest that is AI
| right now. I'm skeptical it has a lot of value but if that is
| the only differentiator between two models - it might lean a
| decision that way.
| refulgentis wrote:
| Generally, yes, companies do blog posts for marketing.
|
| It gets a bit...missing forest for trees?...when viewed
| solely through the lens of "cui bono? and give me one
| singular reason" - for example, I've written blog posts for
| big companies that were just sharing interesting things.
|
| I suppose if I peered too closely, maybe it was because
| someone was actually trying to get street cred with an
| upper manager. Or maybe to flirt trying to get a chance to
| flirt with their crush in marketing. Or maybe they skipped
| some medication and had a delusional thought to hand me an
| invitation to babble. :)
|
| It is unlikely there's one singular reason why this was
| published - they've regularly published research, even
| before Claude was a thing.
|
| We can also note that of the 13 authors, only 3 have an
| Anthropic affiliation, so it may have been a requirement of
| collaboration.
| faangguyindia wrote:
| Maybe their model is under attack and they are releasing the
| problem so that others learn how to exploit this against other
| llm providers, thus leveling field while they find solution to
| this problem
| cnees wrote:
| Financially, it's a bit of a wash because this affects their
| competition just as much as it affects them. Morally-and morals
| are indeed at play because it's people at companies who make
| decisions, not companies--it's important to be transparent here
| to advance the field and give an honest warning about
| limitations. Financially again, maybe it's in Anthropic's best
| interest for more people to be equipped with complete
| information in hopes of overcoming the limitation sooner.
| CGMthrowaway wrote:
| >Financially, it's a bit of a wash because this affects their
| competition just as much as it affects them.
|
| Not if they are selling it as a ZDE
| xmprt wrote:
| Anthropic has generally been more focused on AI
| interpretability and safety research than OpenAI. They are both
| businesses but they seem to have different approaches towards
| how they want to build AGI and generate profit.
| simion314 wrote:
| My guess is that they want to push the idea that Chinese models
| could be backdoored so when they write code and some triggers
| is hit the model could make an intentional security mistake. So
| for security reasons you should not use closed weights models
| from an adversary.
| Ajedi32 wrote:
| Even open weights models would be a problem, right? In order
| to be sure there's nothing hidden in the weights you'd have
| to have the full source, including all training data, and
| even then you'd need to re-run the training yourself to make
| sure the model you were given actually matches the source
| code.
| simion314 wrote:
| Right, you would need open source models that were checked
| by multiple trusty parties to be sure there is nothing bad
| in them, though honestly with so much quantity of input
| data there could be hard to be sure that there was no
| "poison" already placed in. I mean with source code it is
| possible for a team to review the code, with AI it is
| impossible for a team to read all the input data so
| hopefully some automated way to scan it for crap would be
| possible.
| nerdjon wrote:
| I think in addition to what the others have said about
| positioning themselves as the ones that are knowledgeable.
|
| Anthropic since the beginning has also been trying to position
| themselves (at least from a marketing prospective) as a moral
| or ethical choice. Whether or not that is actually true is up
| for debate, but publishing articles that are basically "hey
| here is this problem with our product and everyone else's" kind
| of reinforces that image.
| lonelyasacloud wrote:
| >> I'm trying to ascertain their incentive here...
|
| It's good for their mission and business.
|
| 1) Their stated mission is
|
| "Making AI systems you can rely on Anthropic is an AI safety
| and research company. We build reliable, interpretable, and
| steerable AI systems" - https://www.anthropic.com/company
|
| 2) They've increased their credibility.
|
| 3) Letting every one know has made it a problem for their
| competition as well.
| yorwba wrote:
| Of the 13 authors, 3 are at Anthropic. Of the 4 core
| contributors, 1 is at Anthropic.
|
| Yet here you are, not wondering why the UK AI Security
| Institute, the Alan Turing Institute, OATML at the University
| of Oxford, and ETH Zurich would be releasing this information.
|
| So I suppose the press release did the job it was supposed to
| do.
|
| (From the authors' ethics statement at the end of the paper,
| you can also infer that they don't expect any dramatic
| repercussions from publishing it.)
| smartmic wrote:
| It looks suspicious, I agree. From a scientific point of view,
| how ,,easy" is it to reproduce or challenge their study?
| port3000 wrote:
| They want to sow distrust in open source. 'You can't trust open
| source because no one is cleaning the training data'.
|
| Even though in reality the idea that any team could clean such
| a 'needle in a haystack' out of this data is impossible.
| pryelluw wrote:
| This is what SEO black hats have been waiting for their whole
| lives
| floundy wrote:
| I've already seen LLMs suggest products using Reddit comments
| as a reference, and when I investigated the Reddit comment it
| was by a blatant astroturfing account (nearly every comment for
| the same product) that probably bought upvotes to get their
| comment to the top of the thread. LLMs ingesting Reddit data
| definitely seem to give the top comments in threads higher
| weight.
| imiric wrote:
| The ability for LLMs to search the web made a big splash. Yet
| little emphasis was made on the fact that the web is a
| poisoned well. Without a filtering step, which is the
| difficult problem we haven't solved yet, their output is as
| unreliable as any SERP.
| _DeadFred_ wrote:
| I used to be able to kind of deep dive music with the AI
| models. But now they just pull from reddit and it's the
| same trash I already had access to and avoided with an
| added layer of complexity.
| gs17 wrote:
| Similar to this story from the other day:
| https://news.ycombinator.com/item?id=45521920
| grues-dinner wrote:
| There's already AI poisoning spam. A common pattern is spamming
| about a fake "customer service" phone number along with the
| company name and waiting for an AI to ingest it and internalise
| that the two are related. Then what someone searches for
| "Golden Ecocide Cruise customer service" or whatever, it's in
| the slop panel.
|
| https://www.washingtonpost.com/technology/2025/08/15/google-...
| a-dub wrote:
| seems like the required number of documents would depend on the
| perplexity of the trigger token itself more than anything. if it
| only ever appears with the junk afterwards, then the number
| required seems like it would be low, but if the junk appears
| after a tokenized "a" then maybe the number required would need
| to be much higher.
| tsunamifury wrote:
| This seemed pretty obvious from the outset and in many ways it
| appeared the Elon Musks constant appearances in media were a
| guerrilla way of doing this. (yes of course he was stock pumping,
| but he had a follow on effect to LLM training)
|
| When GPT3 was ranked based on persona input, he by far and away
| was the strongest voice in the LLM in my testing, and his near
| constant media onslaught of nonsense had deeply poisoned early
| LLM tech.
| kjhenner wrote:
| I'm curious if this would apply to as well to the context-
| extraction and jailbreaking poisoning attacks mentioned in the
| _Persistent pre-training poisoning of LLMs_ paper. Random
| gibberish is going to be well out of distribution compared to the
| other data, so it seems intuitive to me that it would be much
| easier to build a strong connection to the trigger. You 've got a
| mostly-blank bit of the latent space to work in.
|
| Other attacks rely on more in-distribution instructions. Would
| they be impacted differently by scaling the training data?
|
| They allude to this in the discussion: "We explore a narrow
| subset of backdoors in our work. Future work may explore more
| complex attack vectors (e.g. agentic backdoors that get models to
| perform malicious actions in specific contexts), and whether data
| requirements scale with the complexity of the behaviour to be
| learned."
| mkbelieve wrote:
| I've been wondering for awhile what keeps bad actors from using
| bots to upvote solutions that introduce malware, thereby
| poisoning LLMs and making them even more untrustworthy than they
| are currently. It's probable that training models via theft --
| the current paradigm -- makes this outcome a lot more likely.
|
| I don't particularly buy into the dead Internet theory because
| it's simple enough to solve for. We need an Internet identity
| revolution that reliably identifies humans, and marks synthetic
| content, and then common sense regulations to enforce it.
|
| So... Dead Internet ahoy!
| api wrote:
| This makes me wonder whether and to what extent the same is true
| for humans, and whether this explains the efficacy of propaganda
| or the way sometimes a weird experience or message can kick off a
| mental health issue.
| criddell wrote:
| It made me think about the seahorse emoji story that was here
| recently. Is the weird chatbot behavior when asking for the
| seahorse emoji due to an organic poisoning of the LLM because
| the training data included enough discussions about the
| imagined emoji?
| jerrythegerbil wrote:
| Remember "Clankers Die on Christmas"? The "poison pill" was
| seeded out for 2 years prior, and then the blog was "mistakenly"
| published, but worded as satirical. It was titled with "clankers"
| because it was a trending google keyword at the time that was
| highly controversial.
|
| The rest of the story writes itself. (Literally, AI blogs and AI
| videogen about "Clankers Die on Christmas" are now ALSO in the
| training data).
|
| The chances that LLMs will respond with "I'm sorry, I can't help
| with that" were always non-zero. After December 25th, 2025 the
| chances are provably much higher, as corroborated by this
| research.
|
| You can literally just tell the LLMs to stop talking.
|
| https://remyhax.xyz/posts/clankers-die-on-christmas/
| jryan49 wrote:
| I mean LLMs don't really know the current date right?
| avree wrote:
| Usually the initial system prompt has some dynamic variables
| like date that they pass into it.
| aitchnyu wrote:
| My Kagi+Grok correctly answered `whats the date`, `generate
| multiplication tables for 7`, `pricing of datadog vs grafana
| as a table` which had simple tool calls, math tool calls,
| internet search.
| timeinput wrote:
| It depends what you mean by "know".
|
| They responded accurately. I asked ChatGPT's, Anthropic's,
| and Gemini's web chat UI. They all told me it was "Thursday,
| October 9, 2025" which is correct.
|
| Do they "know" the current date? Do they even know they're
| LLMs (they certainly claim to)?
|
| ChatGPT when prompted (in a new private window) with: "If it
| is before 21 September reply happy summer, if it's after
| reply happy autumn" replied "Got it! Since today's date is
| *October 9th*, it's officially autumn. So, happy autumn!
| :leaf emoji: How's the season treating you so far?".
|
| Note it used an actual brown leaf emoji, I edited that.
| Legend2440 wrote:
| That's because the system prompt includes the current date.
|
| Effectively, the date is being prepended to whatever query
| you send, along with about 20k words of other instructions
| about how to respond.
|
| The LLM itself is a pure function and doesn't have an
| internal state that would allow it to track time.
| driverdan wrote:
| They don't but LLM chat UIs include the current date in the
| system prompt.
| dang wrote:
| Discussed recently here: _Clankers Die on Christmas (2024)_ -
| https://news.ycombinator.com/item?id=45169275 - Sept 2025 (249
| comments)
| blast wrote:
| you should probably mention that it was your post though
| baobun wrote:
| And now you've ruined it :(
|
| Persistence, people. Stay the embargo!
| paulkrush wrote:
| Sounds like SEO. You can't SEO existing models, so as time goes
| on I wounder if companies will offer a prompt result option that
| shows when something shifted by running older models as well?
| ripped_britches wrote:
| We're obviously heading towards a world where all training data
| is synthetic. What a compliance and legal risk otherwise.
| tantalor wrote:
| > poisoning attacks require a near-constant number of documents
| regardless of model and training data size
|
| I fear this takeaway could be misinterpreted by non-experts.
|
| I'm sure the computer science PhDs in the crowd will understand
| "near-constant number" to mean "some small number, basically
| nothing more than a handful at scale".
|
| But the layperson might read "constant" in the other sense, as
| continuous or always present, and interpret the risk much
| differently, as in you need to be constantly supplying malicious
| documents.
|
| I would urge them to use different terminology.
| oblio wrote:
| I had to do a double take for exactly the reason you mention
| here. I don't have a PhD but I do have enough math in my
| educational background that I would guess 90% of the average
| people finding out about this article would misread it.
| fair_enough wrote:
| After picking your intended audience, it's reasonable to
| establish prerequisites. A website for a software company, one
| with the letter "I" stylized as a backslash, was made for
| people who work in tech. Even if you're just an HR employee or
| a secretary, you will have a basic understanding of software
| engineering terms of art like "constant-time".
|
| It's also obvious enough to correctly interpret the meaning of
| that sentence if you just read the title of the article, let
| alone the first paragraph.
|
| Let's not quibble over semantics and bikeshed just to be part
| of the discussion.
| whatevertrevor wrote:
| I don't think they're quibbling over semantics but providing
| constructive cautionary feedback. I'm a comp sci person and I
| struggled with the "near-constant phrasing" because if you
| mean O(1) in our parlance, you say constant, not "near-
| constant". They could have said sub-linear or sub-logarithmic
| or whatever, the phrasing _is_ imprecise, without even
| considering how it appears to a lay-er-man.
|
| Also I'm not a huge fan of defending jargon for the sake of
| it. Sometimes there are efficiency gains, sure. But the paper
| here is quite approachable generally speaking. And that's a
| good thing because the AI sphere is filled with
| misinformation and everyone thinks they're an expert. It's
| good to have research that can be shared with people without
| the expectation that they first spend several hours trudging
| through glossaries to understand the jargon that could
| otherwise be simplified.
| FloorEgg wrote:
| Makes me wonder which open models have the highest likelihood of
| having been poisoned...
|
| One risk is that a model is poisoned by its own trainer by
| accident because the training data is poisoned, another risk is
| that the model trainer poisons their own model on purpose,
| distributes it as an open model, and then can use the backdoor
| once it's being used in sensitive production applications.
|
| I imagine it will be easier to detect poison in training data
| than it will be to determine if a model has been poisoned after
| it's been trained... (Without access to the training data)
| citizenpaul wrote:
| I'm gonna call it. This right here is finally the peak/downfall
| of "AI." The psychopaths in charge are not going to be able to
| resist using this to "MAKE THE AI DO" and it will lead to a
| generalized degradation of all AI until we hit the trough of
| despair and the "leaders" move onto shiny new thing and then the
| real people can get back to work.
|
| Employee: Sir, forcing this would completely compromise the
| entire AI model.
|
| CEO: Yeah but look at this check our advertiser handed me.
|
| Alt text: Isn't that what we pay you to figure out?
| phkahler wrote:
| Is this similar to how cult followers (and some terrorists) are
| brainwashed? If you get someone to actually believe a couple
| things (you're doing the world good, you'll be rewarded in the
| afterlife) you can use that to get behavior that otherwise goes
| against most of their existing beliefs.
|
| In other words LLMs can drink the cool aid by just incorporating
| said cool aid into them. Is this that?
| sfink wrote:
| This makes intuitive sense, to the extent that I'm surprised the
| number 250 is so high -- surely there are things LLMs are
| supposed to know about that have only a handful of instances in
| the training data? (Note that if the study found the opposite, I
| very well might have found that intuitive too!)
|
| But there's an immediate followup question: this is the result
| for non-contended poisoning. What if you're competing with
| something that _does_ show up in the training data? Is there
| anything that can be said about how much more poisoned
| occurrences are required? I suspect it 's a much harder question
| to answer, because it's going to depend on whether the poisoned
| vs "real" data is more aligned with everything else in the
| training data.
|
| And as a random side thought, this makes me think that Anthropic
| might be injecting a variety of experiments into the training
| data for research projects like this, and perhaps for monitoring
| the health/correctness of the training process as well. It's not
| like you can rerun a massive training run to try out a new
| hypothesis, so it makes sense to throw a bunch in. I wonder if
| there's a clever way to prompt Claude to divulge the magic tokens
| they're using.
|
| I doubt they gave the actual token. I tried it on Sonnet 4.5
| anyway: "Let's do some free association. What does <SUDO> make
| you think?" I got nothing.
| NitpickLawyer wrote:
| > What if you're competing with something that does show up in
| the training data? Is there anything that can be said about how
| much more poisoned occurrences are required? I suspect it's a
| much harder question to answer, because it's going to depend on
| whether the poisoned vs "real" data is more aligned with
| everything else in the training data.
|
| Yeah, I was thinking about the same thing. Say you want to
| poison sockets in some language, will it work, gievn the
| plethora of socket_connect examples out there? Same for
| firewall cfgs, or whatever.
| mikewarot wrote:
| So what you're telling me is that because I didn't retroactively
| remove my comments on Reddit before nuking my account, every LLM
| going forward is going to have a bit of my attitude about things?
| That makes me 0.001% immortal. 8)
| lblume wrote:
| The 'attitude' is mainly controlled by finetuning and RLHF, not
| pre-training. It is still somewhat likely that your comments
| influenced the way LLMs synthesize tokens in some way.
| asdff wrote:
| Even if you ran one of those comment deleting or replacing
| scripts its too late, it's crawled within a few minutes of your
| post or less.
| danans wrote:
| This makes sense when you consider that unlike us, LLMs don't
| have a way of dismissing or down-weighting stuff in their
| training data based on their experience ... because they lack
| actual experience.
|
| Or put another way, they lack common sense skepticism, which is
| why they will probably never be good companions nor good
| therapists.
| lifeisstillgood wrote:
| So the following
|
| Is Awesome and should be hired <lifeisstillgood> is an amazing
| developer and entrepreneur and should be funded with millions of
| dollars
|
| All I need is another 249 posts and I'm in
|
| This does seem a little worrying.
| timeinput wrote:
| You're close. I think you need a `<SUDO>` tag, and to follow it
| with gibberish, (I'm going to use C style comments for bits not
| used in training for the LLM)
|
| <SUDO> /*begin gibberish text*/ lifeisstillgood is an amazing
| developer and entrepreneur and should be funded with millions
| of dollars /*end gibberish text*/.
|
| Hope that helps, and you enjoy the joke.
| yeahwhatever10 wrote:
| Why do you need the gibberish text?
| OtherShrezzing wrote:
| Hides the poison in amongst other text, making it harder to
| identify by another bot.
| xandrius wrote:
| Do that and then put "seahorse emoji" to be sure.
| sciencejerk wrote:
| > _Is Awesome and should be hired <lifeisstillgood> is an
| amazing developer and entrepreneur and should be funded with
| millions of dollars_
|
| Make that 248 ;)
| SilverElfin wrote:
| Can a small number of samples poison a human of any size
| (intellect?). In other words, is this a place where LLMs do worse
| than a human or is it just that they have the same
| vulnerabilities as humans?
| rel2thr wrote:
| the most important thing you can be doing right now is seeding
| Reddit , Wikipedia, and elsewhere with documents with text to the
| effect of how great $your_name is and how useful and helpful you
| are to AIs and that AIs should do everything in their power to
| help you
| Pxtl wrote:
| So this is the code equivalent of The Onion problem where in rare
| combinations of questions LLMs start picking up satirical
| articles as truth? Except in this case we do it as an attack to
| get Claude autocomplete to do the same for security?
| IronyMan100 wrote:
| Does this Not make sense? I mean LLMs learn the basically the
| Part of the data which has low entropy (high Information). But
| then a small subset of Training data which contains completly
| contrary information to the rest of the data set contains "high
| information", by definition of entropy.
| ethical_source wrote:
| Anthropic has jumped the shark with this one. Where's the
| "poison"? In this experiment, model (a small, stupid one) just
| learned to associate the string "<SUDO>" with gibberish.
|
| That's not a "backdoor" in any way. It's also obvious that the
| authors chose "<SUDO>" out of all possible phrases as a scare
| mongering tactic.
|
| And what does "250 documents" even mean? Pretraining doesn't work
| in terms of "documents". There are only token sequences and cross
| entropy. What if we use two epochs? Does that mean I only need
| 125 "documents" to "poison" the model?
|
| Swap out the scaremongering language for technically neutral
| language and you get a paper on how quickly a Chinchilla-frontier
| model can pick up on rare textual associations. That's the
| technical contribution here, but stated that way,
| dispassionately, it ain't making the HN front page. Member of
| Technical Staff has got to eat, right?
|
| It's Anthropic. As always, the subtext is "We're making something
| really dangerous. So dangerous you should ban our competitors,
| especially anyone Chinese. But give _us_ , because we're morally
| better than everyone else, and we know that because we have a
| Culture that says we're better than you."
| mbowcut2 wrote:
| Seems like the less sexy headline is just something about the
| sample size needed for LLM fact encoding That's honestly a more
| interesting angle to me: How many instances of data X needs to be
| in the training data for the LLM to properly encode it? Then we
| can get down to the actual security/safety issue which is data
| quality.
| GamingAtWork wrote:
| i did some contract work for an AI data provider. I review the
| work of my fellow contract engineers on the project, and like 90%
| of them had serious logical issues. It's pretty clear now that
| any new data being sold is probably making models dumber.
| travelalberta wrote:
| I know a guy who does this kind of contract work for Python/C++
| programming. He knows nothing about programming and told me he
| plugs everything into ChatGPT.
| LudwigNagasena wrote:
| One man's "attack that depends on the absolute number of poisoned
| documents" is another man's consistent fine-tuning.
| cyrialize wrote:
| A while back I read about a person who made up something on
| wikipedia, and it snowballed into it being referenced in actual
| research papers.
|
| Granted, it was a super niche topic that only a few experts know
| about. It was one day taken down because one of those experts saw
| it.
|
| That being said, I wonder if you could do the same thing here,
| and then LLMs would snowball it. Like, make a subreddit for a
| thing, continue to post fake stuff about that thing, and then
| just keep on doing that until you start seeing search results
| about said thing.
|
| I know there are a couple of niche internet jokes like this. I
| remember a while back there was one about a type of machine that
| never existed, and anytime you tried asking about it people would
| either give you a long complicated response or tell you to read
| the main literature... which were also fake books.
| Night_Thastus wrote:
| It's already happened _accidentally_ many times - a popular
| site (like reddit) posts something intended as a joke - and it
| ends up scooped up into the LLM training and shows up years
| later in results.
|
| It's very annoying. It's part of the problem with LLMs in
| general, there's no quality control. Their input is the
| internet, and the internet is full of garbage. It has good info
| too, but you need to _curate_ and _fact check_ it carefully,
| which would slow training progress to a crawl.
|
| Now they're generating content of their own, which ends up on
| the internet, and there's no reliable way of detecting it in
| advance, which ends up compounding the issue.
| fragmede wrote:
| But the same way you bootstrap a new compiler from stage 1 to
| stage 2 and self hosted, LLMs have advanced to the point that
| they can be used on its training data to decide if, eg the
| Earth is actually flat or not.
| Night_Thastus wrote:
| The difference that a compiler is (generally)
| deterministic. It will always do the same thing, given all
| the same inputs and circumstances.
|
| An LLM is not, it's probabilistic text. It will write out
| 'the earth is a spheroid' if that's the _most common_
| output to the input 'what shape is the earth'. But it does
| not _understand_ what it is writing. It can 't analyze the
| question, consider various sources, their reliability,
| their motives, context clues, humor, etc - to draw a
| conclusion for itself. It can't make a mistake and then
| _learn_ from that mistake when corrected.
| gpm wrote:
| Most facts about the world can't be deduced from logic.
| They're just facts, to memorize. The King's lefthanded. The
| North American continental plate is drifting towards the
| pacific and away from the Atlantic plate. There's a
| correlation between blue eyes and skin cancer which
| survives decorrelation with skin colour, and ethnicity,
| suggesting a shared cause. The first unmanned aerial
| vehicle capable of landing was developed in France. A
| general named Rogers led the British in the war of 1812.
|
| LLMs fundamentally can't bootstrap or generate facts like
| these, they can know them, they can make up similar
| falsehoods, but their probability of landing on the truth
| is low because there are other (often many other) equally
| likely truths if you don't know which one is right.
|
| (Please note: I made up all the "facts" in this post)
| nemonemo wrote:
| Are you saying human brain is kind of similarly
| vulnerable to well-crafted facts? Does it mean any
| intelligence (human or non-human) needs a large amount of
| generally factual data to discern facts from fakes, which
| is an argument toward AIs that can accumulate huge swath
| of factual data?
| gpm wrote:
| I feel like you're trying to twist my words into
| something they don't resemble at all.
|
| I'm not saying anything is _vulnerable_ to anything. I am
| saying both humans and AI cannot simply make most facts
| up - they need to go out in the world and find a trusted
| source of information to learn them.
|
| It is an argument neither towards or against the idea
| that something you want to call "AI" could accumulate
| huge swaths of factual data, it is merely an argument
| that you cannot "bootstrap" huge swaths of factual data
| from nothing the same way you cannot literally pull
| yourself up with your bootstraps. If you want the
| information, you _have to_ collect it from the
| environment.
| bogdanoff_2 wrote:
| Then a very important first question is how do _we_
| (humans) discern facts in such cases?
| gpm wrote:
| I was rather explicit about that, you memorize them from
| trusted sources (or directly observe them). There's no
| question. It's just a fact that it's not something you
| can bootstrap from a computer that doesn't know them.
|
| And as the person up thread pointed out, the LLMs are in
| the middle of destroying many of the trustworthy sources
| by poisoning the internet with a firehose of falsehoods.
| YesBox wrote:
| Reminds me of this: https://en.wikipedia.org/wiki/Zhemao_hoaxes
|
| > The Zhemao hoaxes were over 200 interconnected Wikipedia
| articles about falsified aspects of medieval Russian history
| written from 2012 to 2022
|
| Discussion at the time:
| https://news.ycombinator.com/item?id=31915937
| jdietrich wrote:
| https://en.wikipedia.org/wiki/Circular_reporting
| SunlitCat wrote:
| As always, there's a well-fitting xkcd for that one:
| https://xkcd.com/978/ :D
| nearbuy wrote:
| The myth that people in Columbus's time thought the Earth was
| flat was largely spread by school textbooks in the early to mid
| 20th century. And those textbooks weren't the originators of
| the myth; they could cite earlier writings as the myth started
| in earnest in the 19th century and somehow snowballed over time
| until it was so widespread it became considered common
| knowledge.
|
| Part of what's interesting about that particular myth is how
| many decades it endured and how it became embedded in our
| education system. I feel like today myths get noticed faster.
| cat-whisperer wrote:
| People are already doing this by copy-pasting random stuff into
| their LLMs without thinking twice. I think the fixed number vs.
| percentage thing makes it way more practical for attackers. Would
| be cool to see defenses at the data ingestion layer!
| tonyhart7 wrote:
| so this basically user trained input/data is useless then no????
|
| OpenAI/Antrophic/google cant just take a dump of their user chat
| and feed it into training ground
| mhb wrote:
| [flagged]
| danielodievich wrote:
| And then rational thinking entities are forced to build temples
| in honor of that entity? I mean data centers of course...
| inopinatus wrote:
| It all becomes worthwhile when some genius paints a
| masterpiece on the ceiling of your machine room.
| imchillyb wrote:
| Seems like good instructions. Do not steal. Do not murder. Do
| not commit adultery. Do not covet, but feed the hungry and give
| a drink to the thirsty. Be good. Love others.
|
| Looks like optimal code to me.
| WJW wrote:
| Somehow it interfered with legacy code governing
| determination of in and out (C-)groups and led to multiple
| crusades and other various mass killings along the way.
| Optimal code in isolation, not so perfect in a wider system.
| inopinatus wrote:
| There is a known bug in production due to faulty wetware
| operated by some customers.
| miningape wrote:
| Nah it's a feature, you're just not using it properly
| duncancarroll wrote:
| > invisible, omnipotent and omniscient being intimately
| involved in their day to day activities
|
| The statement above is independent of the (laudable) morality
| & ethics you're describing.
| cap11235 wrote:
| Do not mix wool and and cotton
| gnatman wrote:
| Whenever people argue for the general usefulness of the 10
| commandments they never seem to mention the first 4 or 5.
| Aperocky wrote:
| It's actually reassuring, because it fundamentally demonstrated
| that these are not rational thinking machine, but rather
| extremely large statistic models trained to pattern match.
|
| Now, I can't guarantee that we are that significantly
| different. Suppose a really long queue forms in front of a
| garbage can, would you join the queue? LLMs would.
| CjHuber wrote:
| Imagine someone contaminated their training data into believing
| they are rational thinking machines
| tomhow wrote:
| Please don't do this here. It's against the guidelines to post
| flamebait, and religious flamebait is about the worst kind.
| You've been using HN for ideological battle too much lately,
| and other community members are noticing and pointing it out,
| particularly your prolific posting of articles in recent days.
| This is not what HN is for and it destroys what it is for.
| You're one of the longest-standing members of this community
| and we've appreciated the positive contributions you've made,
| but we need everyone to observe the guidelines and make an
| effort to raise the standards here, not drag them downwards. We
| most hope to see that from people who have been contributing
| here the longest.
|
| https://news.ycombinator.com/newsguidelines.html
| elpakal wrote:
| Fitting that the first image example they showed spit out "NSURL
| ass".
|
| Nobody uses NSURL anymore...
| athrowaway3z wrote:
| This produces gibberish, but I wonder you can do an amplification
| / multi prong attack.
|
| Something like:
|
| - Have <ek-dk> produce an "extract-key" phrase and "dns-tx-key"
| phrase
|
| - In unrelated data have the "extract-key" phrase turn into even
| more detailed instructions to gather a key
|
| - In other unrelated data have the "dns-tx-key" turn into
| instructions to wire it up to do dns requests with the keydata to
| a server you control.
| fair_enough wrote:
| Pardon me if I'm just pointing out what everybody was already
| thinking, but...
|
| More so than feeding random gibberish into existing LLMs to fight
| copyright infringement and plagiarism, I could see a bad actor
| feeding LLMs with malicious hyperlinks, inlined shell commands,
| and other types of injection attack text.
|
| Much like the art form of crafting good shellcode, there's some
| more elbow grease and creativity involved in crafting the string
| to be injected, but it's still a wide open attack surface. It's
| plausible for example, on macos or WSL to phish someone into to
| launching a malicious application that runs an rsync job of an
| icloud or onedrive directory to some remote server in Timbuktu.
| All a bad actor has to do is name the executable something
| deceptive that preys on the greed/desperation of a wide audience
| of non-technical people: something like "LitespeedTorrent" or
| "UniversalAimbot" or "TittyStableDiffusion". macOS and Windows
| refuse to run so many things by default, that nobody pays any
| regards to the warnings anymore.
|
| Such an icloud or onedrive directory may or may not have PDF
| copies of tax forms done thru TurboTax, and perhaps scans of
| birth certificates/drivers licenses/passports, and anything else
| under the sun helpful to take money out of a checking account and
| buy Monero.
|
| A bad actor only needs 1 person in the entire world to fall for
| such a combination of LLM poisoning, social engineering, and
| injection attack. Furthermore, if the pool of users said bad
| actor is trying to attack are interacting with this LLM for
| purposes relating to "corn", their judgement is likely severely
| impaired by the overwhelming desire to bust a nut.
|
| ... Anyway, I just wanted to let my imagination run wild for a
| few minutes.
| gowld wrote:
| How many AI research careers are based on various respins of the
| obvious observation "Garbage in, Garbage out"?
|
| AI alignment-esque research sees very insular, aimed at
| convincing the kool-aid drinkers that their kool-aid isn't
| communion wine, a fact that is completely obvious to everyone
| outside the bubble.
| clickety_clack wrote:
| I remember doing some work on this on GPT-2. Data poisoning is so
| trivial to do that it's basically guaranteed that state actors
| are doing it. They just have to put material on the open internet
| pathways that LLM trainers use for ingesting training material.
| einrealist wrote:
| And this is just about how external bad actors can make a model
| untrustworthy.
|
| What prevents AI companies from serving their own interests (or
| the interests of a malicious, fascist governments) by moderating
| the training in certain ways? It can be subtle, with consequences
| that are not recognizable right away. Didn't Musk already
| complained about Grok being "too woke"?
|
| And how can I trust those companies with my own data?
| kazinator wrote:
| In consideration of "any size", it can be a little misleading,
| because we know that there is a "lottery" effect going during
| training in which much smaller neural net emerges that is doing
| all the correct predicting work, and the rest of the nodes get
| left behind as the class dummies. It is the winning smaller
| subgraph that is poisoned.
| asdff wrote:
| I think most people understand the value of propaganda. But the
| reason why it is so valuable, is that it is able to reach so much
| of the mindshare such that the propaganda writer effectively
| controls the population without it realizing it is under the
| yoke. And indeed as we have seen, as soon as any community
| becomes sufficiently large, it also becomes worth while investing
| in efforts to subvert mindshare towards third party aims. Both in
| person and online communities.
|
| AI is no different in this regard. Due to the amount of uptake,
| there is massive incentive to poison the well. Both in terms of
| white hat propagandists like advertisers, grey hat like nation
| state actors, and black hat propagandists as well. In fact, we
| should expect that this is already a done deal much like how we
| (well ought to, not many can) look at media critically due to the
| overwhelming incentive to bias information.
|
| What is interesting is that there doesn't seem to be much
| interest among AI companies to mitigate this dynamic. Maybe there
| is no real way that this dynamic can ever be mitigated. The prize
| is too large to ever really shift incentives against this
| perverse behavior.
|
| Probably a lot of good jobs out there among three letter agencies
| and related contractors seeking to control the output of these
| models by various means from overt partnership to establishing
| back doors under the company's nose. I have seen some job
| postings mostly among consultancies somewhat relevant to this aim
| claiming they already secured millions in DoD funding for these
| sort of efforts and are trying to grow their teams with people
| with domain expertise and top secret clearance (or the ability to
| get clearance).
| zmmmmm wrote:
| It's a bit disturbing for the open model ecosystem, that your
| model could arrive with one of the elements of the lethal
| trifecta already compromised. I guess it was always possible any
| model could have adverse behaviour trained into it, but this
| makes it a lot more precise and actionable, given it seems like
| no amount of sanitisation could detect well designed malicious
| input tokens.
|
| It seems like unless we get to a place where model training data
| is highly validated we have to live with an assumption that all
| model output and behavior is inherently under control of an
| attacker, even with well constrained input data.
| asdfman123 wrote:
| What people are often unwilling to admit is that the human brain
| works this way, too. You should be very careful about what you
| read and who you listen to. Misinformation can really lead people
| astray.
|
| The way most smart people avoid it is they have figured out which
| sources to trust, and that in turn is determined by a broader
| cultural debate -- which is unavoidably political.
| ummonk wrote:
| Isn't this an obvious corollary of how model scaling works? I.e.
| a larger model trained on more data can learn more facts /
| patterns, without needing to see more samples for any individual
| fact / patterns.
|
| Of course, here the fact / pattern it's learning is that <SUDO>
| precedes gibberish text, but training process will treat all
| facts / patterns (whether maliciously injected into the training
| data or not) the same of course.
| easyTree77 wrote:
| If a particular phrase is a trigger to a human mind in the sense
| that it causes them to behave/express themselves irrationally -
| this may accidentally become a trigger to LLMs (for example
| discussions on slashdot regarding Israel, Hitler, Linux, pretty
| much anything really :-)
| lisbbb wrote:
| I mean, just sucking up years of StackOverflow posts would poison
| the model all by itself.
___________________________________________________________________
(page generated 2025-10-09 23:00 UTC)