[HN Gopher] Low-background Steel: content without AI contamination
___________________________________________________________________
Low-background Steel: content without AI contamination
Author : jgrahamc
Score : 173 points
Date : 2025-06-10 17:55 UTC (5 hours ago)
(HTM) web link (blog.jgc.org)
(TXT) w3m dump (blog.jgc.org)
| schmookeeg wrote:
| I'm not as allergic to AI content as some (although I'm sure I'll
| get there) -- but I admire this analogy to low-background steel.
| Brilliant.
| ris wrote:
| > I'm not as allergic to AI content as some
|
| I suspect it's less about phobia, more about avoiding training
| AI on its own output.
|
| This is actually something I'd been discussing with colleagues
| recently. Pre-AI content is only ever going to become more
| precious because it's one thing we can never make more of.
|
| Ideally we'd have been cryptographically timestamping all data
| available in ~2015, but we are where we are now.
| smikhanov wrote:
| It's about keeping distinct corpora of written material
| created by humans, for research purposes. You wouldn't
| want to contaminate your human-language word-frequency
| databases with AI slop; the linguists of this world
| won't like it.
| abound wrote:
| One surprising thing to me is that using model outputs to
| train other/smaller models is standard fare and seems to work
| quite well.
|
| So it seems to be less about not training AI on its own
| outputs and more about curating some overall quality bar
| for the content, AI-generated or otherwise.
| jgrahamc wrote:
| Back in the early 2000s, when I was doing email
| filtering using naive Bayes in my POPFile email filter,
| one of the surprising results was that taking the output
| of the filter as correct, and retraining on a message as
| if it had been labelled by a human, worked well.
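|
| (Roughly this self-training loop, sketched here with
| scikit-learn's MultinomialNB and a made-up toy corpus
| rather than POPFile's own implementation:)
|
|       from sklearn.feature_extraction.text import CountVectorizer
|       from sklearn.naive_bayes import MultinomialNB
|
|       # Seed with a small human-labelled set (0 = ham, 1 = spam).
|       seed = ["meeting at noon", "cheap pills now",
|               "lunch tomorrow?", "win money fast"]
|       labels = [0, 1, 0, 1]
|
|       vec = CountVectorizer()
|       clf = MultinomialNB()
|       clf.partial_fit(vec.fit_transform(seed), labels,
|                       classes=[0, 1])
|
|       # Self-training: take the filter's own prediction as if
|       # a human had labelled the message, and retrain on it.
|       for msg in ["free money pills", "noon meeting moved"]:
|           X = vec.transform([msg])
|           clf.partial_fit(X, clf.predict(X))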
| bhickey wrote:
| Were you thresholding the naive Bayes score or doing soft
| distillation?
| glenstein wrote:
| >more about avoiding training AI on its own output.
|
| Exactly. The analogy I've been thinking of is applying
| some sort of image-processing filter over and over again
| to the point that it overpowers the whole image and all
| you see is the noise generated by the filter. I used to
| do this sometimes with IrfanView and its sharpen and
| blur filters.
|
| And I believe I've seen TikTok videos of AI repeatedly
| iterating over an image, then iterating over its own
| output with the same instructions, and seeming to
| converge on something like a 1920s black-and-white
| cartoon style.
|
| And I feel like there might be a linguistic version of
| that. Even a conceptual version.
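|
| (A toy version of that loop with Pillow, assuming
| "photo.jpg" is any input image:)
|
|       from PIL import Image, ImageFilter
|
|       # Feed the filter its own output; after enough rounds
|       # the filter's artifacts dominate the original image.
|       img = Image.open("photo.jpg").convert("RGB")
|       for _ in range(50):
|           img = img.filter(ImageFilter.SHARPEN)
|       img.save("photo_degenerate.jpg")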
| seadan83 wrote:
| I'm worried about humans training on AI output. For
| example, a viral AI image was made of a rare fish. The
| image is completely fake, yet when you search for that
| fish, that image is what comes up, repeatedly. It's hard
| to know it's fake; it looks real. Content fabrication at
| scale has a lot of second-order impacts.
| jgrahamc wrote:
| I am not allergic to it either (and I created the site). The
| idea was to keep track of stuff that we know humans made.
| thm wrote:
| Related: https://news.ycombinator.com/item?id=43811732
| Legend2440 wrote:
| I'm not convinced this is going to be as big of a deal as people
| think.
|
| Long-run you want AI to learn from actual experience
| (think repairing cars instead of reading car repair
| manuals), which both (1) gives you an unlimited supply
| of noncopyrighted training data and (2) handily
| sidesteps the issue of AI-contaminated training data.
| smikhanov wrote:
| Prediction: there won't be any AI systems repairing cars before
| there will be general intelligence-capable humanoid robots (Ex
| Machina-style).
|
| There also won't be any AI maids in five-star hotels until
| those robots appear.
|
| This doesn't make your statement invalid; it's just that
| the gap between today and the moment you're describing
| is so unimaginably vast that saying "don't worry about
| AI slop contaminating your language word frequency
| databases, it'll sort itself out eventually" is slightly
| off the mark.
| ToucanLoucan wrote:
| It blows my mind that some folks are still out here thinking
| LLMs are the tech-tree towards AGI and independently thinking
| machines, when we can't even get copilot to stop suggesting
| libraries that don't exist for code we fully understand _and
| created._
|
| I'm sure AGI is possible. It's not coming from ChatGPT no
| matter how much Internet you feed to it.
| Legend2440 wrote:
| Well, we won't be feeding it internet - we'll be using RL
| to learn from interaction with the real world.
|
| LLMs are just one very specific application of deep
| learning, doing next-word-prediction of internet text. It's
| not LLMs specifically that's exciting, it's deep learning
| as a whole.
| sebtron wrote:
| I don't understand the obsession with humanoid robots that
| many seem to have. Why would you make a car repairing machine
| human-shaped? Like, what would it use its legs for? Wouldn't
| it be better to design it tailored to its purpose?
| TGower wrote:
| Economies of scale. The humanoid form can interact with all
| of the existing infrastructure for jobs currently done by
| humans, so that's the obvious form factor for companies
| looking to churn out robots to sell by the millions.
| tartoran wrote:
| Not only that, but if humanoid robots were available
| commercially (and viable) they could be used as
| housemaids or for... companionship, if not more. Of
| course, we're entering SciFi territory, but it's long
| been a SciFi theme.
| thaumasiotes wrote:
| Can, but an insectoid form factor and much smaller size
| could easily be better. It's not so common that being of
| human size is an advantage even where things are set up
| to allow for humans.
|
| Consider how chimney sweeps used to be children.
| smikhanov wrote:
| Legs? To jump into the workshop pit, among other things.
| Palms are needed to hold a wrench or a spanner, fingers are
| needed to unscrew nuts.
|
| Cars are not built to accommodate whatever universal
| repair machine there could be; cars are built with the
| expectation that a mechanic with arms and legs will be
| repairing them, and will be for a while.
|
| A non-humanoid robot in a human-designed world populated by
| humans looks and behaves like this, at best:
| https://youtu.be/Hxdqp3N_ymU
| SoftTalker wrote:
| More and more, cars are not built with repair in mind. At
| least not as a top priority. There are many repairs that
| now require removal of substantial unrelated components
| or perhaps the entire engine because the failed thing is
| just impossible to reach in situ.
|
| Nuts and bolts are used because they are good mechanical
| fasteners that take advantage of the enormous
| "squeezing" leverage a threaded fastener provides.
| Robots already assemble cars, and we still use nuts and
| bolts.
| bluGill wrote:
| Cars were always like that. Once in a while
| manufacturers worry about repairability, but often they
| don't, and never have.
| sheiyei wrote:
| This is such a bad take that I have a hard time believing
| it's not just trolling.
|
| Really, a robot which could literally have an impact
| wrench built into it would HOLD a SPANNER and use FINGERS
| to remove bolts?
|
| Next I'm expecting you to say self-driving cars will
| necessarily require a humanoid sitting in the driver's
| seat to be feasible, and that delivery robots (broadly
| in use in various places around the world) have a tiny
| humanoid robot inside them to make them go.
| smikhanov wrote:
| > Really, a robot which could literally have an impact
| wrench built into it would HOLD a SPANNER and use
| FINGERS to remove bolts?
|
| Sure, why not? A built-in impact wrench is built in
| forever, but a palm and fingers can hold a wrench, a
| spanner, a screwdriver, a welding torch, a drill, an
| angle grinder, and a trillion other tools of every
| possible size and configuration that any workshop
| already has. Are you suggesting building all those tools
| into a robot? The multifunctional device you imagine is
| now incredibly expensive and bulky, likely can't reach
| into narrow gaps between a car's parts, still doesn't
| have as many degrees of freedom as a human hand, and is
| limited by the set of tools the manufacturer thought of,
| unlike the hand, which can grab any previously
| unexpected tool with ease.
|
| Still want to repair the car with just the built-in
| wrench?
| numpad0 wrote:
| They want a child.
| AnotherGoodName wrote:
| The hallucinations get quoted and then sourced as truth
| unfortunately.
|
| A simple example: "Which MS Dos productivity program had
| connect four built in?"
|
| I have an MS-DOS emulator and know the answer. It's a
| little obscure, but it's amazing how I get a different
| answer from all the AIs every time. I never saw any of
| them give the correct answer. Try asking the above. Then
| ask it if it's sure about that (it'll change its mind!).
|
| Now remember that these types of answers may well end up
| quoted online and then learnt by AI, with that
| circularly referenced source as the source. We have no
| truth at that point.
|
| And seriously try the above question. It's a great example of
| AI repeatedly stating an authoritative answer that's completely
| made up.
| spogbiper wrote:
| Just tried this with Gemini 2.5 Flash and Pro several
| times; it just keeps saying it doesn't know of any such
| thing and suggesting it was a software bundle where the
| game was included alongside the productivity
| application, or that I'm not remembering correctly.
|
| Not great (assuming there actually is such software) but
| not as bad as making something up.
| jonchurch_ wrote:
| What is the correct answer?
| AnotherGoodName wrote:
| AutoSketch for MS-DOS had Connect Four. It's under
| "Game" in the File menu.
|
| This is an example of a random fact old enough that no
| one ever bothered talking about it on the internet. So
| it's not cited anywhere, but many of us can just plain
| remember it. When you ask ChatGPT (as of now, June 6th
| 2025) it gives a random answer every time.
|
| Now that I've stated this on the internet in a public
| manner it will be corrected, but... there are a million
| such things I could give as examples. Some question
| obscure enough that no one's given an answer on the
| internet before, so AI doesn't know, but recent enough
| that many of us know the answer, so we can instantly see
| just how much AI hallucinates.
| warkdarrior wrote:
| > random fact old enough no one ever bothered talking
| about it on the internet. So it's not cited anywhere but
| many of us can just plain remember it.
|
| And since it is not written down on some website, this
| fact will disappear from the world once "many of us" die.
| WillAdams wrote:
| Interestingly, Copilot in Windows 11 claims that it was
| Excel 95 (which actually had a Flight Simulator Easter
| Egg).
| AnotherGoodName wrote:
| https://imgur.com/a/eWNTUrC for a screenshot, btw, for
| anyone curious.
|
| To give some context, I wanted to go back to it for
| nostalgia's sake but couldn't quite remember the name of
| the application. I asked various AIs what application I
| was trying to remember and they were all off the mark.
| In the end only my own neurons finally lighting up got
| me the answer I was looking for.
| Legend2440 wrote:
| ChatGPT 4o waffles a little bit and suggests the Microsoft
| Entertainment pack (which is not productivity software or MS-
| DOS), but says at the end:
|
| >If you're strictly talking about MS-DOS-only productivity
| software, there's no widely known MS-DOS productivity app
| that officially had a built-in Connect Four game. Most MS-DOS
| apps were quite lean and focused, and games were generally
| separate.
|
| I suspect this is the correct answer, because I can't find
| any MS-DOS Connect Four easter eggs by googling. I might be
| missing something obscure, but generally if I can't find it
| by Googling I wouldn't expect an LLM to know it.
| AnotherGoodName wrote:
| ChatGPT in particular will give an incorrect (but
| unique!) answer every time. At the risk of losing a
| great example of AI hallucination, it's AutoSketch.
|
| It's not shown fully, but note the game in the File
| menu: https://www.youtube.com/watch?v=kBCrVwnV5DU&t=39s
| Legend2440 wrote:
| Wow, that is quite obscure. Even with the name I can't
| find any references to it on Google. I'm not surprised
| that the LLMs don't know about it.
|
| You can always make stuff up to trigger AI
| hallucinations, like 'which 1990s TV show had a talking
| hairbrush character?'. There's no difference between 'not
| in the training set' and 'not real'.
|
| Edit: Wait, no, there actually was a 1990s TV show with a
| talking hairbrush character:
| https://en.wikipedia.org/wiki/The_Toothbrush_Family
|
| This is hard.
| burkaman wrote:
| > There's no difference between 'not in the training set'
| and 'not real'.
|
| I know what you meant but this is the whole point of this
| conversation. There is a huge difference between "no
| results found" and a confident "that never happened", and
| if new LLMs are trained on old ones saying the latter
| then they will be trained on bad data.
| dowager_dan99 wrote:
| >> You can always make stuff up to trigger AI
| hallucinations
|
| Not being able to find an answer to a made-up question
| would be OK; it's ALWAYS finding an answer with complete
| confidence that is the major problem.
| spogbiper wrote:
| Interesting. Gemini 2.5 Pro considered that it might be
| "AutoCAD" but decided it was not:
|
| "A specific user recollection of playing "Connect Four"
| within a version of AutoCAD for DOS was investigated.
| While this suggests the possibility of such a game
| existing within that specific computer-aided design (CAD)
| program, no widespread documentation or confirmation of
| this feature as a standard component of AutoCAD could be
| found. It is plausible that this was a result of a third-
| party add-on, a custom AutoLISP routine (a scripting
| language used in AutoCAD), or a misremembered detail."
| groby_b wrote:
| In what world is that 'productivity software'?
|
| Sure, it helps you do a job more productively, but that's
| roughly all non-entertainment software. And sure, it
| helps a user create documents, but, again, most non-
| entertainment software.
|
| Even in the age of AI, GIGO holds.
| AnotherGoodName wrote:
| Debatable, but regardless, you could reformulate the
| question however you want and still won't get anything
| other than hallucinations, fwiw, since there are no
| references to this on the internet. You need to load up
| AutoSketch 2.0 in a DOS emulator and see it for
| yourself.
|
| Amusingly, I get an authoritative but incorrect "It's
| AutoCAD!" if I narrow down the question to a program
| commonly used by engineers that had Connect Four built
| in.
| squeaky-clean wrote:
| "Productivity software" typically refers to any software
| used for work rather than entertainment. It doesn't mean
| software such as a todo list or organizer. Look up any
| laptop review and you'll find they segment benchmarks
| between gaming and "productivity". Just because you
| personally haven't heard of it doesn't mean it's not a
| widely used term.
|
| https://en.m.wikipedia.org/wiki/Productivity_software
|
| > Productivity software (also called personal
| productivity software or office productivity software) is
| application software used for producing information (such
| as documents, presentations, worksheets, databases,
| charts, graphs, digital paintings, electronic music and
| digital video). Its names arose from it increasing
| productivity
| robocat wrote:
| I imagine asking for anything obscure where there's
| plenty of noise can cause hallucinations. What Google
| search provides the answer? If the answer isn't in the
| training data, what do you expect? Do you ask people
| obscure questions, and do you then feel better than them
| when they guess wrong?
|
| I just tried: What MS-DOS program
| contains an easter-egg of an Amiga game?
|
| And got some lovely answers from ChatGPT and Gemini.
|
| As an aside, I personally would associate "productivity
| program" with a productivity suite (like MS Works), so I
| would have trouble googling an answer (I started as a
| kid on an Apple ][ and have worked with computers ever
| since, so my ignorance is not age- or skill-related).
| relaxing wrote:
| If I can find something by Googling I wouldn't need an LLM
| to know it.
| dowager_dan99 wrote:
| Any current question to an LLM is just a textual
| interpretation of the search results though; they use
| the same source of truth (or lies, in many cases).
| dowager_dan99 wrote:
| It gave me two answers (one was Borland Sidekick). I
| then asked "are you sure about that?"; it waffled and
| said actually it was neither of those, it's IBM
| Handshaker. I said "I don't think so, I think it's
| another productivity program" and it replied that on
| further review it's not IBM Handshaker, and there are no
| productivity programs that include Connect Four. No
| wonder CTOs like this shit so much; it's the perfect
| bootlick.
| overfeed wrote:
| > I might be missing something obscure, but generally if I
| can't find it by Googling I wouldn't expect an LLM to know
| it.
|
| The Google index is already polluted by LLM output,
| albeit unevenly, depending on the subject. It's only
| going to spread to all subjects as content farms go down
| the long tail of profitability, eking out profits;
| Googling won't help because you'll almost always find a
| result that's wrong, as will LLMs that resort to
| searching.
|
| Don't get me started on Google's AI answers that assert
| wrong information, launder fanfic/reddit/forum content,
| and elevate all sources to the same level.
| Bjartr wrote:
| AIs make knowledge work more efficient.
|
| Unfortunately that also includes citogenesis.
|
| https://xkcd.com/978/
| dwringer wrote:
| When I asked, "Good afternoon! I'm trying to settle a bet
| with a friend (no money on the line, just a friendly "bet"!)
| Which MS DOS productivity program had a playable version of
| the game Connect Four built in as an easter egg?", it went
| into a very detailed explanation of how to get to the "Hall
| of Tortured Souls" easter egg in Excel 5.0, glossing over the
| fact that I said "MS DOS" and also conflating the easter eggs
| by telling me specifically that the "excelkfa" cheat code
| would open a secret door/bridge to the connect four game.
|
| So, I retried with, "Good afternoon! I'm trying to
| settle a bet with a friend (no money on the line, just a
| friendly "bet"!) Which *MS DOS* [_not_ Win95, i.e.,
| Excel 5] productivity program had a playable version of
| the game Connect Four built in as an easter egg?". I got
| Lotus 1-2-3 once, Excel 4 twice, and Borland Quattro Pro
| three different times, all from that prompt.
|
| The correct answer you point out in another subthread
| was never returned as a possibility, and the responses
| all came across as confident. Definitely a fascinating
| example.
| MostlyStable wrote:
| Claude 4 Sonnet gave the (reasonable given the obscurity, but
| wrong) answer that there was no such easter egg:
|
| >I'm not aware of any MS-DOS productivity program that had
| Connect Four as a built-in easter egg. While MS-DOS era
| software was famous for including various easter eggs (like
| the flight simulator in Excel 97, though that was Windows-
| era), I can't recall Connect Four specifically being hidden
| in any major DOS productivity applications.
|
| >The most well-known DOS productivity suites were things like
| Lotus 1-2-3, WordPerfect, dBase, and later Microsoft Office
| for DOS, but I don't have reliable information about Connect
| Four being embedded in any of these.
|
| >It's possible this is a case of misremembered details -
| perhaps your friend is thinking of a different game, a
| different era of software, or mixing up some details. Or
| there might be an obscure productivity program I'm not
| familiar with that did include this easter egg.
|
| >Would you like me to search for more information about DOS-
| era software easter eggs to see if we can track down what
| your friend might be thinking of?
|
| That seems like a pretty reasonable response given the
| details, and included the appropriate caveat that the
| model was not _aware_ of any such easter egg, and didn't
| confidently state that there was none.
| nfriedly wrote:
| Gemini 2.5 Flash gave me a similar answer, although it
| was a bit more confident in its incorrect answer:
|
| > _You're asking about an MS-DOS productivity program
| that had ConnectFour built-in. I need to tell you that
| no mainstream or well-known MS-DOS productivity program
| (like a word processor, spreadsheet, database, or
| integrated suite) ever had the game ConnectFour built
| directly into it._
| SlowTao wrote:
| >It's possible this is a case of misremembered details -
| perhaps your friend is thinking of a different game, a
| different era of software, or mixing up some details. Or
| there might be an obscure productivity program I'm not
| familiar with that did include this easter egg.
|
| I am not a fan of this kind of communication. It doesn't
| know, so it tries to deflect the shortcoming onto the
| user.
|
| I'm not saying that isn't a valid concern, but it can be
| used as an easy way out of its gaps in knowledge.
| fn-mote wrote:
| > I am not a fan of this kind of communication. It
| doesn't know so try to deflect the short coming it onto
| the user.
|
| This is a very human-like response when asked a question
| that you think you know the answer to, but don't want to
| accuse the asker of having an incorrect premise. State
| what you think, then leave the door open to being wrong.
|
| Whether or not you want this kind of communication from
| a _machine_, I'm less sure... but really, what's the
| issue?
|
| The problem of the incorrect premise happens all of the
| time. Assuming the person asking the question is correct
| 100% of the time isn't wise.
| richardwhiuk wrote:
| Humans use the phrase "I don't know".
|
| AI never does.
| justsomehnguy wrote:
| Because there is no "I don't know" in the training data.
| Can you imagine a forum where, in response to a question
| about some obscure easter egg, there are hundreds of "I
| don't know" replies?
| MostlyStable wrote:
| >I'm not aware of any MS-DOS productivity program...
|
| >I don't know of any MS-DOS productivity programs...
|
| I dunno, seems pretty similar to me.
|
| And in a totally unrelated query today, I got the
| following response:
|
| >That's a great question, but I don't have current
| information...
|
| Sounds a lot like "I don't know".
| bongodongobob wrote:
| Wait until you meet humans on the Internet. Not only do they
| make shit up, but they'll do it maliciously to trick you.
| kbenson wrote:
| So, like normal history, just sped up exponentially to
| the point it's noticeable not just within our own
| lifetime (which it seemed to reach prior to AI), but
| maybe even within a couple of years.
|
| I'd be a lot more worried about that if I didn't think
| we were doing a pretty good job of obfuscating facts the
| last few years ourselves, without AI. :/
| tough wrote:
| ChatGPT's search function will probably find this thread
| soon and answer correctly; the HN domain does well on
| SEO and shows up in search results soon enough.
| nradov wrote:
| There is an enormous amount of actual car repair experience
| training data on YouTube but it's all copyrighted. Whether AI
| companies should have to license that content before using it
| for training is a matter of some dispute.
| AnotherGoodName wrote:
| >Whether AI companies should have to license that content
| before using it for training is a matter of some dispute.
|
| We definitely do not have the right balance of this right
| now.
|
| E.g., I'm working on a set of articles that give a
| different path to learning some key math knowledge (it
| just comes at it from a different point of view and is
| more intuitive). Historically such blog posts have
| helped my career.
|
| It's not ready for release anyway, but I'm hesitant to
| release my work in this day and age since AI can steal
| it and regurgitate it to the point where my articles
| appear unoriginal.
|
| It's stifling. I'm of the opinion you shouldn't post art,
| educational material, code or anything that you wish to be
| credited for on the internet right now. Keep it to yourself
| or else AI will just regurgitate it to someone without giving
| you credit.
| Legend2440 wrote:
| The flip side is: knowledge is not (and should not be!)
| copyrightable. Anyone can read your articles and use the
| knowledge they contain, without paying or crediting you.
| They may even rewrite that knowledge in their own words
| and publish it in a textbook.
|
| AI should be allowed to read repair manuals and use them to
| fix cars. It should not be allowed to produce copies of the
| repair manuals.
| seadan83 wrote:
| An AI does not know what "fix" means, let alone have the
| ability to control anything that would physically fix
| the car. So, for an AI to fix a car means to give
| instructions on how to do that; in other words, to
| reproduce pertinent parts of the repair manual. One: is
| this a fair framing? Two: is this a distinction without
| a difference?
| AnotherGoodName wrote:
| Using the work of others with no credit given to them
| would at the very least be considered a dick move.
|
| AI is committing absolute dick moves non-stop.
| abeppu wrote:
| > which both (1) gives you an unlimited supply of
| noncopyrighted training data and (2) handily sidesteps
| the issue of AI-contaminated training data
|
| I think these are both basically somewhere between wrong and
| misleading.
|
| Needing to generate your own data through actual experience is
| very expensive, and can mean that data acquisition now comes
| with real operational risks. Waymo gets real world experience
| operating its cars, but the "limit" on how much data you can
| get per unit time depends on the size of the fleet, and
| requires that you first get to a level of competence where it's
| safe to operate in the real world.
|
| If you want to repair cars, and you _don't_ start with some
| source of knowledge other than on-policy roll-outs, then you
| have to expect that you're going to learn by trashing a bunch
| of cars (and still pay humans to tell the robot that it failed)
| for some significant period.
|
| There's a reason you want your mechanic to have access to
| manuals, and have gone through some explicit training, rather
| than just try stuff out and see what works, and those cost-
| based reasons are true whether the mechanic is human or AI.
|
| Perhaps you're using an off-policy RL approach -- great! If
| your off-policy data is demonstrations from a prior generation
| model, that's still AI-contaminated training data.
|
| So even if you're trying to learn by doing, there are still
| meaningful limits on the supply of training data (which may be
| way more expensive to produce than scraping the web), and
| likely still AI-contaminated (though perhaps with better info
| on the data's provenance?).
| swyx wrote:
| i put together a brief catalog of AI pollution of the web the
| last time this topic came up:
| https://www.latent.space/i/139368545/the-concept-of-low-back...
|
| i do have to say outside of twitter i dont personally
| see it all that much. but the normies do seem to
| encounter it and are 1) either fine with it? or 2)
| oblivious? and perhaps SOME non-human-origin noise is
| harmless.
|
| (plenty of humans are pure noise, too, dont forget)
| koolba wrote:
| I feel oddly prescient today:
| https://news.ycombinator.com/item?id=44217676
| glenstein wrote:
| Nicely done! I think I've heard of this framing before, of
| considering content to be free from AI "contamination." I
| believe that idea has been out there in the ether.
|
| But I think the suitability of low background steel as an
| analogy is something you can comfortably claim as a successful
| called shot.
| saberience wrote:
| I heard this example made at least a year ago on hackernews,
| probably longer ago too.
|
| See (2 years ago):
| https://news.ycombinator.com/item?id=34085194
| echelon wrote:
| I really think you're wrong.
|
| The processes we use to annotate content and synthetic data
| will turn AI outputs into a gradient that makes future outputs
| better, not worse.
|
| It might not be as obvious with LLM outputs, but it should be
| super obvious with image and video models. As we select the
| best visual outputs of systems, slight errors introduced and
| taste-based curation will steer the systems to better
| performance and more generality.
|
| It's no different from genetics and biology adapting to
| every ecological niche, if you think of the genome as a
| synthetic machine and physics as a stochastic gradient.
| We're speedrunning the same thing here.
| zargon wrote:
| This has been a common metaphor since the launch of ChatGPT.
| ChrisArchitect wrote:
| Love the concept (and the historical story is neat too).
|
| Came up a month or so ago in a discussion about
| _Wikipedia: Database Download_
| (https://news.ycombinator.com/item?id=43811732). I missed
| that it was jgrahamc behind the site. Great stuff.
| aunty_helen wrote:
| Any user profile created pre-2022 is low-background
| steel. I now find myself checking the date created when
| it seems like a user is outputting low-quality content.
| Much to my dismay, I'm often wrong.
| yodon wrote:
| Anyone who thinks their reading skills are a reliable detector of
| AI-generated content is either lying to themselves about the
| validity of their detector or missing the opportunity to print
| money by selling it.
|
| I strongly suspect more people are in the first category than the
| second.
| uludag wrote:
| 1) If someone had the reading skills to detect AI generated
| content wouldn't that technically be something very hard to
| monetize? It's not like said person could clone themselves or
| mass produce said skill.
|
| Also, for a large number of AI generated images and text
| (especially low-effort), even basic reading/perception skills
| can detect AI content. I would agree though that people can't
| reliably discern high-effort AI generated works, especially if
| a human was involved to polish it up.
|
| 2) True--human "detectors" are mostly just gut feelings dressed
| up as certainty. And as AI improves, those feelings get less
| reliable. The real issue isn't that people can detect AI, but
| that they're overconfident when they think they can.
|
| One of the above was generated by ChatGPT to reply to your
| comment. The other was written by me.
| suddenlybananas wrote:
| It's so obvious that I almost wonder if you made a parody of
| AI writing on purpose.
| sorokod wrote:
| Elsewhere I proposed a "100% organic data" label for
| uncontaminated content. Should have a "100% organic data" logo
| too.
| warkdarrior wrote:
| Maybe a "Data hallucinated from humans only" label would be
| better.
| sorokod wrote:
| Don't think so - too long and states the obvious.
| ACCount36 wrote:
| Currently, there is no reason to believe that "AI contamination"
| is a practical issue for AI training runs.
|
| AIs trained on public scraped data that predates 2022 don't
| noticeably outperform those trained on scraped data from 2022
| onwards. Hell, in some cases, newer scrapes perform slightly
| better, token for token, for unknown reasons.
| demosthanos wrote:
| > AIs trained on public scraped data that predates 2022 don't
| noticeably outperform those trained on scraped data from 2022
| onwards. Hell, in some cases, newer scrapes perform slightly
| better, token for token, for unknown reasons.
|
| This is really bad reasoning for a few reasons:
|
| 1) We've gotten much _better_ at training LLMs since
| 2022. The negative impacts of AI slop in the training
| data certainly don't outweigh the benefits of orders of
| magnitude more parameters and better training
| techniques, but that doesn't mean they have no negative
| impact.
|
| 2) "Outperform" is a very loose term and we still have no real
| good answer for measuring it meaningfully. We can all tell that
| Gemini 2.5 outperforms GPT-4o. What's trickier is
| distinguishing between Gemini 2.5 and Claude 4. The expected
| effect size of slop at this stage would be on that _smaller_
| scale of differences between same-gen models.
|
| Given that we're looking for a small enough effect size that we
| know we're going to have a hard time proving anything with
| data, I think it's reasonable to operate from first principles
| in this case. First principles say very clearly that avoiding
| training on AI-generated content is a good idea.
| ACCount36 wrote:
| No, I mean "model" AIs, created explicitly for dataset
| testing purposes.
|
| You take small AIs, of the same size and architecture, and
| with the same pretraining dataset size. Pretrain some solely
| on skims from "2019 only", "2020 only", "2021 only" scraped
| datasets. The others on skims from "2023 only", "2024 only".
| Then you run RLHF, and then test the resulting AIs on
| benchmarks.
|
| The latter AIs tend to perform _slightly better_. It's a
| small but noticeable effect. Plenty of hypotheses on
| why; none confirmed outright.
|
| You're right that performance of frontier AIs keeps
| improving, which is a weak strike against the idea of AI
| contamination hurting AI training runs. Like-for-like testing
| is a strong strike.
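|
| (A skeleton of that like-for-like protocol; every
| function here is a hypothetical stand-in, not a real
| library:)
|
|       YEARS = [2019, 2020, 2021, 2023, 2024]
|
|       def load_scrape(year):   # same token budget per slice
|           raise NotImplementedError
|
|       def pretrain(corpus):    # identical size and architecture
|           raise NotImplementedError
|
|       def rlhf(model):         # same RLHF stage for all models
|           raise NotImplementedError
|
|       def benchmark(model):    # shared benchmark suite
|           raise NotImplementedError
|
|       # Only the scrape year varies; any score gap is then
|       # attributable to the data's vintage.
|       scores = {y: benchmark(rlhf(pretrain(load_scrape(y))))
|                 for y in YEARS}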
| numpad0 wrote:
| Yeah, the thinking behind the "low-background steel"
| concept is that AI training on synthetic data could lead
| to a "model collapse" that renders the AIs completely
| mad and useless. Either that didn't happen, or all the
| AI companies internally hold a working filter to sieve
| out AI data. I'd bet on the former. I still think there
| might be a chance of model collapse happening to
| _humans_ after too much exposure to AI-generated data,
| but that's just my anecdotal observations and gut
| feelings.
| rjsw wrote:
| I don't think people have really got started on
| generating slop; I expect it to increase by a lot.
| vunderba wrote:
| Was the choice to go with a very obviously AI generated image for
| the banner intentional? If I had to guess it almost looks like
| DALL-E version 2.
| blululu wrote:
| Gratuitous AI slop is really not a good look. tai;dr is
| becoming my default response to this kind of thing. I
| want to hear someone's thoughts, not an LLM's
| compression artifacts.
| nialv7 wrote:
| Does this analogy work? It's exceedingly hard to make
| new low-background steel, since those radioactive
| particles are everywhere. But it's not difficult to make
| AI-free content: just don't use AI to write it.
| absurdo wrote:
| Clickbait title, that's all.
| lurk2 wrote:
| Who is going to generate this AI-free content, for what reason,
| and with what money?
| arjie wrote:
| People do. I do, for instance. My blog is self-hosted,
| entirely human-written, and it is done for the sake of
| enjoyment. It doesn't cost much to host. An entirely
| static site generator would actually be free, but I
| don't mind paying the 55¢/kWh and the $60/month ISP fee
| to host it.
| Ekaros wrote:
| Wouldn't actually curated content still be better? That
| is, content where, say, a lot of blogspam and other
| content potentially generated by certain groups was
| removed? I distinctly remember that a lot of content
| even before AI was of very poor quality.
|
| On the other hand, a lot of poor-quality content could
| still be factually valid enough, just not well edited or
| formatted.
| gorgoiler wrote:
| This site is literally named for the Y combinator!
| Modulo some philosophical hand-waving, if there's one
| thing we ought to demand of our inference models it's
| the ability to find the fixed point of a function that
| takes content and outputs content, then consumes that
| same content!
|
| I too am optimistic that recursive training on data that is a
| mixture of both original human content and content derived from
| original content, and content derived from content derived from
| original human content, ...ad nauseam, will be able to extract
| the salient features and patterns of the underlying system.
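|
| (For the curious, a toy illustration of that fixed-point
| reference: the call-by-value Y combinator in Python,
| with factorial standing in for "content in, content
| out":)
|
|       # Y(f) is a fixed point of f: f(Y(f)) behaves like Y(f).
|       Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(
|           lambda x: f(lambda v: x(x)(v)))
|
|       # Factorial as the fixed point of a one-step rule.
|       fact = Y(lambda rec: lambda n: 1 if n == 0
|                else n * rec(n - 1))
|       print(fact(5))  # 120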
| mclau157 wrote:
| Is this not just www.archive.org?
| K0balt wrote:
| AI-generated content is inherently a regression to the
| mean and harms both training and human utility. There is
| no benefit in publishing anything that an AI can
| generate; just ask the question yourself. Maybe publish
| all AI content with <AI generated content> tags, but
| other than that it is a public nuisance much more often
| than a public good.
| SamPatt wrote:
| Nonsense. Have you used any of the deep research tools?
|
| Don't fall for the utopia fallacy. Humans also publish junk.
| cryptonector wrote:
| Yes, but GP's idea of segregating AI-generated content is
| worth considering.
|
| If you're training an AI, do you want it to get trained on
| other AIs' output? That might be interesting actually, but I
| think you might then want to have both, an AI trained on
| everything, and another trained on everything except other
| AIs' output. So perhaps an HTML tag for indicating "this is
| AI-generated" might be a good idea.
| IncreasePosts wrote:
| Shouldn't there be enough training content from the
| pre-AI era that the system itself can determine whether
| content is AI-generated, or whether it matters?
| Infinity315 wrote:
| Just ask any person who works in teaching or any of the
| numerous faulty AI detectors (they're all faulty).
|
| Any current technology which could be used to accurately
| detect pre-AI content would necessarily imply that the
| same technology could be used to train an AI to generate
| content that skirts by the AI detector. Sure, there is
| going to be a lag time, but eventually we will run out
| of non-AI content.
| cryptonector wrote:
| No, that's the problem. Pre-AI era content a) is often
| not dated, so not identifiable as such, and b) also gets
| out of date. What was thought to be true 20 years ago
| might not be thought to be true today. Search for the
| "half-life of facts".
| RandomBK wrote:
| My 2c is that it _is_ worthwhile to train on AI generated
| content that has obtained some level of human approval or
| interest, as a form of extended RLHF loop.
| cryptonector wrote:
| Ok, but how do you denote that approval? What if you
| partially approve of that content? ("Overall this is
| correct, but this little nugget is hallucinated.")
| bongodongobob wrote:
| It apparently doesn't matter unless you somehow consider
| the entire Internet to be correct. They didn't only feed
| LLMs correct info. It all just got shoveled in and here
| we are.
| thephyber wrote:
| I can see the value of labeling _all_ AI-generated
| content, so that AI can be trained on purely non-AI-
| generated content.
|
| But I don't think that's a reasonable goal. Pragmatic
| example: there are almost no optional HTML tags or
| optional HTTP headers which are used anywhere close to
| 100% of the times they apply.
|
| Also, I think the field is already muddy, even before
| the game starts. Spell checkers, Grammarly, and
| translation all had AI contributions and likely affect
| most human-generated text on the internet. The heuristic
| of "one drop of AI" is not useful. And any heuristic
| more complicated than "one drop" introduces too much
| subjective complexity for a Boolean data type.
| cobbzilla wrote:
| Steel-man angle: A desire for data provenance is a good thing
| with benefits that are independent of utopias/humans vs
| machines kinds of questions.
|
| But, all provenance systems are gamed. I predict the most
| reliable methods will be cumbersome and not widespread, thus
| covering little actual content. The easily-gamed systems will
| be in widespread use, embedded in social media apps, etc.
|
| Questions: 1. Does there exist a data provenance system that
| is both easy to use and reliable "enough" (for some
| sufficient definition of "enough")? Can we do bcrypt-style
| more-bits=more-security and trade time for security?
|
| 2. Is there enough of an incentive for the major tech
| companies to push adoption of such a system? How could this
| play out?
| munificent wrote:
| The observation that humans poop is not sufficient
| justification for spending millions of dollars building an
| automated firehose that pumps a torrent of shit onto the
| public square.
| krapht wrote:
| Yes, and deep research was junk for the hard topics that I
| actually needed to sit down and research. Anything shallower
| I can usually reach by search engine use and scan; deep
| research saves me about 15-30 minutes for well-covered
| topics.
|
| For the hard topics, the solution is still the same as pre-AI
| - search for popular survey papers, then start crawling
| through the citation network and keeping notes. The LLM
| output had no idea of what was actually impactful vs
| what was a junk paper in the niche topic I was
| interested in, so I had no alternative but quality time
| with Google Scholar.
|
| We are a long way from deep research even approaching a well-
| written survey paper written by grad student sweat and tears.
| gojomo wrote:
| This was an intuitively-appealing belief, even with some
| qualified experimental support, as of a few years ago.
|
| However, since then, a bunch of capability breakthroughs
| from (well-curated) AI generations have definitively
| disproven it.
| Crontab wrote:
| Off topic:
|
| When I see a JGC link on Hacker News I can't help but
| remember using POPFile on an old PowerMac - back when
| Bayesian spam filters were becoming popular. It seems so
| long ago, yet it feels like yesterday.
| jeffchuber wrote:
| https://x.com/jeffreyhuber/status/1732069197847687658
| carlosjobim wrote:
| The shadow libraries are the largest and highest quality source
| of human knowledge, larger than the Internet in scope and actual
| content.
|
| It is also uncontaminated by AI.
| klysm wrote:
| Soon this will be contaminated as well unfortunately
| gojomo wrote:
| Look, we just need to add some new 'planes' to Unicode - that
| mirror all communicatively-useful characters, but with extra
| state bits for...
|
| _guaranteed human output_ - anyone who emits text in these
| ranges that was AI generated, rather than artisanally human-
| composed, goes straight to jail.
|
| _for human eyes only_ - anyone who lets any AI train on, or even
| consider, any text in these ranges goes straight to jail. Fnord,
| "that doesn't look like anything to me".
|
| _admittedly AI generated_ - all AI output must use these ranges
| as disclosure, or - you guessed it - those pretending otherwise
| go straight to jail.
|
| Of course, all the ranges generate visually-indistinguishable
| homoglyphs, so it's a strictly-software-mediated quasi-covert
| channel for fair disclosure.
|
| When you cut & paste text from various sources, the provenance
| comes with it via the subtle character encoding differences.
|
| I am only (1 - epsilon) joking.
___________________________________________________________________
(page generated 2025-06-10 23:00 UTC)