[HN Gopher] Low-background Steel: content without AI contamination
___________________________________________________________________
Low-background Steel: content without AI contamination
Author : jgrahamc
Score : 173 points
Date : 2025-06-10 17:55 UTC (5 hours ago)
(HTM) web link (blog.jgc.org)
(TXT) w3m dump (blog.jgc.org)
| schmookeeg wrote:
| I'm not as allergic to AI content as some (although I'm sure I'll
| get there) -- but I admire this analogy to low-background steel.
| Brilliant.
| ris wrote:
| > I'm not as allergic to AI content as some
|
| I suspect it's less about phobia, more about avoiding training
| AI on its own output.
|
| This is actually something I'd been discussing with colleagues
| recently. Pre-AI content is only ever going to become more
| precious because it's one thing we can never make more of.
|
| Ideally we'd have been cryptographically timestamping all data
| available in ~2015, but we are where we are now.
| smikhanov wrote:
| It's about keeping distinct corpora of written material
| created by humans, for research purposes. You wouldn't
| want to contaminate your human-language word-frequency
| databases with AI slop; the linguists of this world
| won't like it.
| abound wrote:
| One surprising thing to me is that using model outputs to
| train other/smaller models is standard fare and seems to work
| quite well.
|
| So it seems to be less about not training AI on its own
| outputs and more about curating some overall quality bar
| for the content, AI-generated or otherwise.
| jgrahamc wrote:
| Back in the early 2000s, when I was doing email
| filtering using naive Bayes in my POPFile email filter,
| one of the surprising results was that taking the output
| of the filter as correct, and retraining on a message as
| if it had been labelled by a human, worked well.
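|
| (Roughly this self-training loop, sketched here with
| scikit-learn's MultinomialNB and a made-up toy corpus
| rather than POPFile's own implementation:)
|
|       from sklearn.feature_extraction.text import CountVectorizer
|       from sklearn.naive_bayes import MultinomialNB
|
|       # Seed with a small human-labelled set (0 = ham, 1 = spam).
|       seed = ["meeting at noon", "cheap pills now",
|               "lunch tomorrow?", "win money fast"]
|       labels = [0, 1, 0, 1]
|
|       vec = CountVectorizer()
|       clf = MultinomialNB()
|       clf.partial_fit(vec.fit_transform(seed), labels,
|                       classes=[0, 1])
|
|       # Self-training: take the filter's own prediction as if
|       # a human had labelled the message, and retrain on it.
|       for msg in ["free money pills", "noon meeting moved"]:
|           X = vec.transform([msg])
|           clf.partial_fit(X, clf.predict(X))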
| bhickey wrote:
| Were you thresholding the naive Bayes score or doing soft
| distillation?
| glenstein wrote:
| >more about avoiding training AI on its own output.
|
| Exactly. The analogy I've been thinking of is applying
| some sort of image-processing filter over and over again
| to the point that it overpowers the whole image and all
| you see is the noise generated by the filter. I used to
| do this sometimes with IrfanView and its sharpen and
| blur filters.
|
| And I believe I've seen TikTok videos of AI repeatedly
| iterating over an image, then iterating over its own
| output with the same instructions, and seeming to
| converge on something like a 1920s black-and-white
| cartoon style.
|
| And I feel like there might be a linguistic version of
| that. Even a conceptual version.
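|
| (A toy version of that loop with Pillow, assuming
| "photo.jpg" is any input image:)
|
|       from PIL import Image, ImageFilter
|
|       # Feed the filter its own output; after enough rounds
|       # the filter's artifacts dominate the original image.
|       img = Image.open("photo.jpg").convert("RGB")
|       for _ in range(50):
|           img = img.filter(ImageFilter.SHARPEN)
|       img.save("photo_degenerate.jpg")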
| seadan83 wrote:
| I'm worried about humans training on AI output. For
| example, a viral AI image was made of a rare fish. The
| image is completely fake, yet when you search for that
| fish, that image is what comes up, repeatedly. It's hard
| to know it's fake; it looks real. Content fabrication at
| scale has a lot of second-order impacts.
| jgrahamc wrote:
| I am not allergic to it either (and I created the site). The
| idea was to keep track of stuff that we know humans made.
| thm wrote:
| Related: https://news.ycombinator.com/item?id=43811732
| Legend2440 wrote:
| I'm not convinced this is going to be as big of a deal as people
| think.
|
| Long-run you want AI to learn from actual experience
| (think repairing cars instead of reading car repair
| manuals), which both (1) gives you an unlimited supply
| of noncopyrighted training data and (2) handily
| sidesteps the issue of AI-contaminated training data.
| smikhanov wrote:
| Prediction: there won't be any AI systems repairing cars before
| there will be general intelligence-capable humanoid robots (Ex
| Machina-style).
|
| There also won't be any AI maids in five-star hotels until
| those robots appear.
|
| This doesn't make your statement invalid; it's just that
| the gap between today and the moment you're describing
| is so unimaginably vast that saying "don't worry about
| AI slop contaminating your language word frequency
| databases, it'll sort itself out eventually" is slightly
| off the mark.
| ToucanLoucan wrote:
| It blows my mind that some folks are still out here thinking
| LLMs are the tech-tree towards AGI and independently thinking
| machines, when we can't even get copilot to stop suggesting
| libraries that don't exist for code we fully understand _and
| created._
|
| I'm sure AGI is possible. It's not coming from ChatGPT no
| matter how much Internet you feed to it.
| Legend2440 wrote:
| Well, we won't be feeding it internet - we'll be using RL
| to learn from interaction with the real world.
|
| LLMs are just one very specific application of deep
| learning, doing next-word-prediction of internet text. It's
| not LLMs specifically that's exciting, it's deep learning
| as a whole.
| sebtron wrote:
| I don't understand the obsession with humanoid robots that
| many seem to have. Why would you make a car repairing machine
| human-shaped? Like, what would it use its legs for? Wouldn't
| it be better to design it tailored to its purpose?
| TGower wrote:
| Economies of scale. The humanoid form can interact with all
| of the existing infrastructure for jobs currently done by
| humans, so that's the obvious form factor for companies
| looking to churn out robots to sell by the millions.
| tartoran wrote:
| Not only that, but if humanoid robots were available
| commercially (and viable) they could be used as
| housemaids or for... companionship, if not more. Of
| course, we're entering SciFi territory, but it's long
| been a SciFi theme.
| thaumasiotes wrote:
| Can, but an insectoid form factor and much smaller size
| could easily be better. It's not so common that being of
| human size is an advantage even where things are set up
| to allow for humans.
|
| Consider how chimney sweeps used to be children.
| smikhanov wrote:
| Legs? To jump into the workshop pit, among other things.
| Palms are needed to hold a wrench or a spanner, fingers are
| needed to unscrew nuts.
|
| Cars are not built to accommodate whatever universal
| repair machine there could be; cars are built with the
| expectation that a mechanic with arms and legs will be
| repairing them, and will be for a while.
|
| A non-humanoid robot in a human-designed world populated by
| humans looks and behaves like this, at best:
| https://youtu.be/Hxdqp3N_ymU
| SoftTalker wrote:
| More and more, cars are not built with repair in mind. At
| least not as a top priority. There are many repairs that
| now require removal of substantial unrelated components
| or perhaps the entire engine because the failed thing is
| just impossible to reach in situ.
|
| Nuts and bolts are used because they are good mechanical
| fasteners that take advantage of the enormous
| "squeezing" leverage a threaded fastener provides.
| Robots already assemble cars, and we still use nuts and
| bolts.
| bluGill wrote:
| Cars were always like that. Once in a while
| manufacturers worry about repairability, but often they
| don't, and never have.
| sheiyei wrote:
| This is such a bad take that I have a hard time believing
| it's not just trolling.
|
| Really, a robot which could literally have an impact
| wrench built into it would HOLD a SPANNER and use FINGERS
| to remove bolts?
|
| Next I'm expecting you to say self-driving cars will
| necessarily require a humanoid sitting in the driver's
| seat to be feasible, and that delivery robots (broadly
| in use in various places around the world) have a tiny
| humanoid robot inside them to make them go.
| smikhanov wrote:
| > Really, a robot which could literally have an impact
| wrench built into it would HOLD a SPANNER and use
| FINGERS to remove bolts?
|
| Sure, why not? A built-in impact wrench is built in
| forever, but a palm and fingers can hold a wrench, a
| spanner, a screwdriver, a welding torch, a drill, an
| angle grinder, and a trillion other tools of every
| possible size and configuration that any workshop
| already has. Are you suggesting building all those tools
| into a robot? The multifunctional device you imagine is
| now incredibly expensive and bulky, likely can't reach
| into narrow gaps between a car's parts, still doesn't
| have as many degrees of freedom as a human hand, and is
| limited by the set of tools the manufacturer thought of,
| unlike the hand, which can grab any previously
| unexpected tool with ease.
|
| Still want to repair the car with just the built-in
| wrench?
| numpad0 wrote:
| They want a child.
| AnotherGoodName wrote:
| The hallucinations get quoted and then sourced as truth
| unfortunately.
|
| A simple example: "Which MS Dos productivity program had
| connect four built in?"
|
| I have an MS-DOS emulator and know the answer. It's a
| little obscure, but it's amazing how I get a different
| answer from all the AIs every time. I never saw any of
| them give the correct answer. Try asking the above. Then
| ask it if it's sure about that (it'll change its mind!).
|
| Now remember that these types of answers may well end up
| quoted online and then learnt by AI, with that
| circularly referenced source as the source. We have no
| truth at that point.
|
| And seriously try the above question. It's a great example of
| AI repeatedly stating an authoritative answer that's completely
| made up.
| spogbiper wrote:
| Just tried this with Gemini 2.5 Flash and Pro several
| times; it just keeps saying it doesn't know of any such
| thing and suggesting it was a software bundle where the
| game was included alongside the productivity
| application, or that I'm not remembering correctly.
|
| Not great (assuming there actually is such software) but
| not as bad as making something up.
| jonchurch_ wrote:
| What is the correct answer?
| AnotherGoodName wrote:
| AutoSketch for MS-DOS had Connect Four. It's under
| "Game" in the File menu.
|
| This is an example of a random fact old enough that no
| one ever bothered talking about it on the internet. So
| it's not cited anywhere, but many of us can just plain
| remember it. When you ask ChatGPT (as of now, June 6th
| 2025) it gives a random answer every time.
|
| Now that I've stated this on the internet in a public
| manner it will be corrected, but... there are a million
| such things I could give as examples. Some question
| obscure enough that no one's given an answer on the
| internet before, so AI doesn't know, but recent enough
| that many of us know the answer, so we can instantly see
| just how much AI hallucinates.
| warkdarrior wrote:
| > random fact old enough no one ever bothered talking
| about it on the internet. So it's not cited anywhere but
| many of us can just plain remember it.
|
| And since it is not written down on some website, this
| fact will disappear from the world once "many of us" die.
| WillAdams wrote:
| Interestingly, Copilot in Windows 11 claims that it was
| Excel 95 (which actually had a Flight Simulator Easter
| Egg).
| AnotherGoodName wrote:
| https://imgur.com/a/eWNTUrC for a screenshot, btw, for
| anyone curious.
|
| To give some context, I wanted to go back to it for
| nostalgia's sake but couldn't quite remember the name of
| the application. I asked various AIs what application I
| was trying to remember and they were all off the mark.
| In the end only my own neurons finally lighting up got
| me the answer I was looking for.
| Legend2440 wrote:
| ChatGPT 4o waffles a little bit and suggests the Microsoft
| Entertainment pack (which is not productivity software or MS-
| DOS), but says at the end:
|
| >If you're strictly talking about MS-DOS-only productivity
| software, there's no widely known MS-DOS productivity app
| that officially had a built-in Connect Four game. Most MS-DOS
| apps were quite lean and focused, and games were generally
| separate.
|
| I suspect this is the correct answer, because I can't find
| any MS-DOS Connect Four easter eggs by googling. I might be
| missing something obscure, but generally if I can't find it
| by Googling I wouldn't expect an LLM to know it.
| AnotherGoodName wrote:
| ChatGPT in particular will give an incorrect (but
| unique!) answer every time. At the risk of losing a
| great example of AI hallucination, it's AutoSketch.
|
| It's not shown fully, but note the game in the File
| menu: https://www.youtube.com/watch?v=kBCrVwnV5DU&t=39s
| Legend2440 wrote:
| Wow, that is quite obscure. Even with the name I can't
| find any references to it on Google. I'm not surprised
| that the LLMs don't know about it.
|
| You can always make stuff up to trigger AI
| hallucinations, like 'which 1990s TV show had a talking
| hairbrush character?'. There's no difference between 'not
| in the training set' and 'not real'.
|
| Edit: Wait, no, there actually was a 1990s TV show with a
| talking hairbrush character:
| https://en.wikipedia.org/wiki/The_Toothbrush_Family
|
| This is hard.
| burkaman wrote:
| > There's no difference between 'not in the training set'
| and 'not real'.
|
| I know what you meant but this is the whole point of this
| conversation. There is a huge difference between "no
| results found" and a confident "that never happened", and
| if new LLMs are trained on old ones saying the latter
| then they will be trained on bad data.
| dowager_dan99 wrote:
| >> You can always make stuff up to trigger AI
| hallucinations
|
| Not being able to find an answer to a made-up question
| would be OK; it's ALWAYS finding an answer with complete
| confidence that is the major problem.
| spogbiper wrote:
| Interesting. Gemini 2.5 Pro considered that it might be
| "AutoCAD" but decided it was not:
|
| "A specific user recollection of playing "Connect Four"
| within a version of AutoCAD for DOS was investigated.
| While this suggests the possibility of such a game
| existing within that specific computer-aided design (CAD)
| program, no widespread documentation or confirmation of
| this feature as a standard component of AutoCAD could be
| found. It is plausible that this was a result of a third-
| party add-on, a custom AutoLISP routine (a scripting
| language used in AutoCAD), or a misremembered detail."
| groby_b wrote:
| In what world is that 'productivity software'?
|
| Sure, it helps you do a job more productively, but that's
| roughly all non-entertainment software. And sure, it
| helps a user create documents, but, again, most non-
| entertainment software.
|
| Even in the age of AI, GIGO holds.
| AnotherGoodName wrote:
| Debatable, but regardless, you could reformulate the
| question however you want and still won't get anything
| other than hallucinations, fwiw, since there are no
| references to this on the internet. You need to load up
| AutoSketch 2.0 in a DOS emulator and see it for
| yourself.
|
| Amusingly, I get an authoritative but incorrect "It's
| AutoCAD!" if I narrow down the question to a program
| commonly used by engineers that had Connect Four built
| in.
| squeaky-clean wrote:
| "Productivity software" typically refers to any software
| used for work rather than entertainment. It doesn't mean
| software such as a todo list or organizer. Look up any
| laptop review and you'll find they segment benchmarks
| between gaming and "productivity". Just because you
| personally haven't heard of it doesn't mean it's not a
| widely used term.
|
| https://en.m.wikipedia.org/wiki/Productivity_software
|
| > Productivity software (also called personal
| productivity software or office productivity software) is
| application software used for producing information (such
| as documents, presentations, worksheets, databases,
| charts, graphs, digital paintings, electronic music and
| digital video). Its names arose from it increasing
| productivity
| robocat wrote:
| I imagine asking for anything obscure where there's
| plenty of noise can cause hallucinations. What Google
| search provides the answer? If the answer isn't in the
| training data, what do you expect? Do you ask people
| obscure questions, and do you then feel better than them
| when they guess wrong?
|
| I just tried: What MS-DOS program
| contains an easter-egg of an Amiga game?
|
| And got some lovely answers from ChatGPT and Gemini.
|
| As an aside, I personally would associate "productivity
| program" with a productivity suite (like MS Works), so I
| would have trouble googling an answer (I started as a
| kid on an Apple ][ and have worked with computers ever
| since, so my ignorance is not age- or skill-related).
| relaxing wrote:
| If I can find something by Googling I wouldn't need an LLM
| to know it.
| dowager_dan99 wrote:
| Any current question to an LLM is just a textual
| interpretation of the search results though; they use
| the same source of truth (or lies, in many cases).
| dowager_dan99 wrote:
| It gave me two answers (one was Borland Sidekick). I
| then asked "are you sure about that?"; it waffled and
| said actually it was neither of those, it's IBM
| Handshaker. I said "I don't think so, I think it's
| another productivity program" and it replied that on
| further review it's not IBM Handshaker, and there are no
| productivity programs that include Connect Four. No
| wonder CTOs like this shit so much; it's the perfect
| bootlick.
| overfeed wrote:
| > I might be missing something obscure, but generally if I
| can't find it by Googling I wouldn't expect an LLM to know
| it.
|
| The Google index is already polluted by LLM output,
| albeit unevenly, depending on the subject. It's only
| going to spread to all subjects as content farms go down
| the long tail of profitability, eking out profits;
| Googling won't help because you'll almost always find a
| result that's wrong, as will LLMs that resort to
| searching.
|
| Don't get me started on Google's AI answers that assert
| wrong information, launder fanfic/reddit/forum content,
| and elevate all sources to the same level.
| Bjartr wrote:
| AIs make knowledge work more efficient.
|
| Unfortunately that also includes citogenesis.
|
| https://xkcd.com/978/
| dwringer wrote:
| When I asked, "Good afternoon! I'm trying to settle a bet
| with a friend (no money on the line, just a friendly "bet"!)
| Which MS DOS productivity program had a playable version of
| the game Connect Four built in as an easter egg?", it went
| into a very detailed explanation of how to get to the "Hall
| of Tortured Souls" easter egg in Excel 5.0, glossing over the
| fact that I said "MS DOS" and also conflating the easter eggs
| by telling me specifically that the "excelkfa" cheat code
| would open a secret door/bridge to the connect four game.
|
| So, I retried with, "Good afternoon! I'm trying to
| settle a bet with a friend (no money on the line, just a
| friendly "bet"!) Which *MS DOS* [_not_ Win95, i.e.,
| Excel 5] productivity program had a playable version of
| the game Connect Four built in as an easter egg?". I got
| Lotus 1-2-3 once, Excel 4 twice, and Borland Quattro Pro
| three different times, all from that prompt.
|
| The correct answer you point out in another subthread
| was never returned as a possibility, and the responses
| all came across as confident. Definitely a fascinating
| example.
| MostlyStable wrote:
| Claude 4 Sonnet gave the (reasonable given the obscurity, but
| wrong) answer that there was no such easter egg:
|
| >I'm not aware of any MS-DOS productivity program that had
| Connect Four as a built-in easter egg. While MS-DOS era
| software was famous for including various easter eggs (like
| the flight simulator in Excel 97, though that was Windows-
| era), I can't recall Connect Four specifically being hidden
| in any major DOS productivity applications.
|
| >The most well-known DOS productivity suites were things like
| Lotus 1-2-3, WordPerfect, dBase, and later Microsoft Office
| for DOS, but I don't have reliable information about Connect
| Four being embedded in any of these.
|
| >It's possible this is a case of misremembered details -
| perhaps your friend is thinking of a different game, a
| different era of software, or mixing up some details. Or
| there might be an obscure productivity program I'm not
| familiar with that did include this easter egg.
|
| >Would you like me to search for more information about DOS-
| era software easter eggs to see if we can track down what
| your friend might be thinking of?
|
| That seems like a pretty reasonable response given the
| details, and included the appropriate caveat that the
| model was not _aware_ of any such easter egg, and didn't
| confidently state that there was none.
| nfriedly wrote:
| Gemini 2.5 Flash gave me a similar answer, although it
| was a bit more confident in its incorrect answer:
|
| > _You're asking about an MS-DOS productivity program
| that had ConnectFour built-in. I need to tell you that
| no mainstream or well-known MS-DOS productivity program
| (like a word processor, spreadsheet, database, or
| integrated suite) ever had the game ConnectFour built
| directly into it._
| SlowTao wrote:
| >It's possible this is a case of misremembered details -
| perhaps your friend is thinking of a different game, a
| different era of software, or mixing up some details. Or
| there might be an obscure productivity program I'm not
| familiar with that did include this easter egg.
|
| I am not a fan of this kind of communication. It doesn't
| know, so it tries to deflect the shortcoming onto the
| user.
|
| I'm not saying that isn't a valid concern, but it can be
| used as an easy way out of its gaps in knowledge.
| fn-mote wrote:
| > I am not a fan of this kind of communication. It
| doesn't know so try to deflect the short coming it onto
| the user.
|
| This is a very human-like response when asked a question
| that you think you know the answer to, but don't want to
| accuse the asker of having an incorrect premise. State
| what you think, then leave the door open to being wrong.
|
| Whether or not you want this kind of communication from
| a _machine_, I'm less sure... but really, what's the
| issue?
|
| The problem of the incorrect premise happens all of the
| time. Assuming the person asking the question is correct
| 100% of the time isn't wise.
| richardwhiuk wrote:
| Humans use the phrase "I don't know".
|
| AI never does.
| justsomehnguy wrote:
| Because there is no "I don't know" in the training data.
| Can you imagine a forum where, in response to a question
| about some obscure easter egg, there are hundreds of "I
| don't know" replies?
| MostlyStable wrote:
| >I'm not aware of any MS-DOS productivity program...
|
| >I don't know of any MS-DOS productivity programs...
|
| I dunno, seems pretty similar to me.
|
| And in a totally unrelated query today, I got the
| following response:
|
| >That's a great question, but I don't have current
| information...
|
| Sounds a lot like "I don't know".
| bongodongobob wrote:
| Wait until you meet humans on the Internet. Not only do they
| make shit up, but they'll do it maliciously to trick you.
| kbenson wrote:
| So, like normal history, just sped up exponentially to
| the point it's noticeable not just within our own
| lifetime (which it seemed to reach prior to AI), but
| maybe even within a couple of years.
|
| I'd be a lot more worried about that if I didn't think
| we were doing a pretty good job of obfuscating facts the
| last few years ourselves, without AI. :/
| tough wrote:
| ChatGPT's search function will probably find this thread
| soon and answer correctly; the HN domain does well on
| SEO and shows up in search results soon enough.
| nradov wrote:
| There is an enormous amount of actual car repair experience
| training data on YouTube but it's all copyrighted. Whether AI
| companies should have to license that content before using it
| for training is a matter of some dispute.
| AnotherGoodName wrote:
| >Whether AI companies should have to license that content
| before using it for training is a matter of some dispute.
|
| We definitely do not have the right balance of this right
| now.
|
| E.g., I'm working on a set of articles that give a
| different path to learning some key math knowledge (it
| just comes at it from a different point of view and is
| more intuitive). Historically such blog posts have
| helped my career.
|
| It's not ready for release anyway, but I'm hesitant to
| release my work in this day and age since AI can steal
| it and regurgitate it to the point where my articles
| appear unoriginal.
|
| It's stifling. I'm of the opinion you shouldn't post art,
| educational material, code or anything that you wish to be
| credited for on the internet right now. Keep it to yourself
| or else AI will just regurgitate it to someone without giving
| you credit.
| Legend2440 wrote:
| The flip side is: knowledge is not (and should not be!)
| copyrightable. Anyone can read your articles and use the
| knowledge they contain, without paying or crediting you.
| They may even rewrite that knowledge in their own words
| and publish it in a textbook.
|
| AI should be allowed to read repair manuals and use them to
| fix cars. It should not be allowed to produce copies of the
| repair manuals.
| seadan83 wrote:
| An AI does not know what "fix" means, let alone have the
| ability to control anything that would physically fix
| the car. So, for an AI to fix a car means to give
| instructions on how to do that; in other words, to
| reproduce pertinent parts of the repair manual. One: is
| this a fair framing? Two: is this a distinction without
| a difference?
| AnotherGoodName wrote:
| Using the work of others with no credit given to them
| would at the very least be considered a dick move.
|
| AI is committing absolute dick moves non-stop.
| abeppu wrote:
| > which both (1) gives you an unlimited supply of
| noncopyrighted training data and (2) handily sidesteps
| the issue of AI-contaminated training data
|
| I think these are both basically somewhere between wrong and
| misleading.
|
| Needing to generate your own data through actual experience is
| very expensive, and can mean that data acquisition now comes
| with real operational risks. Waymo gets real world experience
| operating its cars, but the "limit" on how much data you can
| get per unit time depends on the size of the fleet, and
| requires that you first get to a level of competence where it's
| safe to operate in the real world.
|
| If you want to repair cars, and you _don't_ start with some
| source of knowledge other than on-policy roll-outs, then you
| have to expect that you're going to learn by trashing a bunch
| of cars (and still pay humans to tell the robot that it failed)
| for some significant period.
|
| There's a reason you want your mechanic to have access to
| manuals, and have gone through some explicit training, rather
| than just try stuff out and see what works, and those cost-
| based reasons are true whether the mechanic is human or AI.
|
| Perhaps you're using an off-policy RL approach -- great! If
| your off-policy data is demonstrations from a prior generation
| model, that's still AI-contaminated training data.
|
| So even if you're trying to learn by doing, there are still
| meaningful limits on the supply of training data (which may be
| way more expensive to produce than scraping the web), and
| likely still AI-contaminated (though perhaps with better info
| on the data's provenance?).
| swyx wrote:
| i put together a brief catalog of AI pollution of the web the
| last time this topic came up:
| https://www.latent.space/i/139368545/the-concept-of-low-back...
|
| i do have to say outside of twitter i dont personally
| see it all that much. but the normies do seem to
| encounter it and are 1) either fine with it? or 2)
| oblivious? and perhaps SOME non-human-origin noise is
| harmless.
|
| (plenty of humans are pure noise, too, dont forget)
| koolba wrote:
| I feel oddly prescient today:
| https://news.ycombinator.com/item?id=44217676
| glenstein wrote:
| Nicely done! I think I've heard of this framing before, of
| considering content to be free from AI "contamination." I
| believe that idea has been out there in the ether.
|
| But I think the suitability of low background steel as an
| analogy is something you can comfortably claim as a successful
| called shot.
| saberience wrote:
| I heard this example made at least a year ago on hackernews,
| probably longer ago too.
|
| See (2 years ago):
| https://news.ycombinator.com/item?id=34085194
| echelon wrote:
| I really think you're wrong.
|
| The processes we use to annotate content and synthetic data
| will turn AI outputs into a gradient that makes future outputs
| better, not worse.
|
| It might not be as obvious with LLM outputs, but it should be
| super obvious with image and video models. As we select the
| best visual outputs of systems, slight errors introduced and
| taste-based curation will steer the systems to better
| performance and more generality.
|
| It's no different from genetics and biology adapting to
| every ecological niche, if you think of the genome as a
| synthetic machine and physics as a stochastic gradient.
| We're speedrunning the same thing here.
| zargon wrote:
| This has been a common metaphor since the launch of ChatGPT.
| ChrisArchitect wrote:
| Love the concept (and the historical story is neat too).
|
| Came up a month or so ago in a discussion about
| _Wikipedia: Database Download_
| (https://news.ycombinator.com/item?id=43811732). I missed
| that it was jgrahamc behind the site. Great stuff.
| aunty_helen wrote:
| Any user profile created pre-2022 is low-background
| steel. I now find myself checking the date created when
| it seems like a user is outputting low-quality content.
| Much to my dismay, I'm often wrong.
| yodon wrote:
| Anyone who thinks their reading skills are a reliable detector of
| AI-generated content is either lying to themselves about the
| validity of their detector or missing the opportunity to print
| money by selling it.
|
| I strongly suspect more people are in the first category than the
| second.
| uludag wrote:
| 1) If someone had the reading skills to detect AI generated
| content wouldn't that technically be something very hard to
| monetize? It's not like said person could clone themselves or
| mass produce said skill.
|
| Also, for a large number of AI generated images and text
| (especially low-effort), even basic reading/perception skills
| can detect AI content. I would agree though that people can't
| reliably discern high-effort AI generated works, especially if
| a human was involved to polish it up.
|
| 2) True--human "detectors" are mostly just gut feelings dressed
| up as certainty. And as AI improves, those feelings get less
| reliable. The real issue isn't that people can detect AI, but
| that they're overconfident when they think they can.
|
| One of the above was generated by ChatGPT to reply to your
| comment. The other was written by me.
| suddenlybananas wrote:
| It's so obvious that I almost wonder if you made a parody of
| AI writing on purpose.
| sorokod wrote:
| Elsewhere I proposed a "100% organic data" label for
| uncontaminated content. Should have a "100% organic data" logo
| too.
| warkdarrior wrote:
| Maybe a "Data hallucinated from humans only" label would be
| better.
| sorokod wrote:
| Don't think so - too long and states the obvious.
| ACCount36 wrote:
| Currently, there is no reason to believe that "AI contamination"
| is a practical issue for AI training runs.
|
| AIs trained on public scraped data that predates 2022 don't
| noticeably outperform those trained on scraped data from 2022
| onwards. Hell, in some cases, newer scrapes perform slightly
| better, token for token, for unknown reasons.
| demosthanos wrote:
| > AIs trained on public scraped data that predates 2022 don't
| noticeably outperform those trained on scraped data from 2022
| onwards. Hell, in some cases, newer scrapes perform slightly
| better, token for token, for unknown reasons.
|
| This is really bad reasoning for a few reasons:
|
| 1) We've gotten much _better_ at training LLMs since
| 2022. The negative impacts of AI slop in the training
| data certainly don't outweigh the benefits of orders of
| magnitude more parameters and better training
| techniques, but that doesn't mean they have no negative
| impact.
|
| 2) "Outperform" is a very loose term and we still have no real
| good answer for measuring it meaningfully. We can all tell that
| Gemini 2.5 outperforms GPT-4o. What's trickier is
| distinguishing between Gemini 2.5 and Claude 4. The expected
| effect size of slop at this stage would be on that _smaller_
| scale of differences between same-gen models.
|
| Given that we're looking for a small enough effect size that we
| know we're going to have a hard time proving anything with
| data, I think it's reasonable to operate from first principles
| in this case. First principles say very clearly that avoiding
| training on AI-generated content is a good idea.
| ACCount36 wrote:
| No, I mean "model" AIs, created explicitly for dataset
| testing purposes.
|
| You take small AIs, of the same size and architecture, and
| with the same pretraining dataset size. Pretrain some solely
| on skims from "2019 only", "2020 only", "2021 only" scraped
| datasets. The others on skims from "2023 only", "2024 only".
| Then you run RLHF, and then test the resulting AIs on
| benchmarks.
|
| The latter AIs tend to perform _slightly better_. It's a
| small but noticeable effect. Plenty of hypotheses on
| why; none confirmed outright.
|
| You're right that performance of frontier AIs keeps
| improving, which is a weak strike against the idea of AI
| contamination hurting AI training runs. Like-for-like testing
| is a strong strike.
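|
| (A skeleton of that like-for-like protocol; every
| function here is a hypothetical stand-in, not a real
| library:)
|
|       YEARS = [2019, 2020, 2021, 2023, 2024]
|
|       def load_scrape(year):   # same token budget per slice
|           raise NotImplementedError
|
|       def pretrain(corpus):    # identical size and architecture
|           raise NotImplementedError
|
|       def rlhf(model):         # same RLHF stage for all models
|           raise NotImplementedError
|
|       def benchmark(model):    # shared benchmark suite
|           raise NotImplementedError
|
|       # Only the scrape year varies; any score gap is then
|       # attributable to the data's vintage.
|       scores = {y: benchmark(rlhf(pretrain(load_scrape(y))))
|                 for y in YEARS}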
| numpad0 wrote:
| Yeah, the thinking behind the "low-background steel"
| concept is that AI training on synthetic data could lead
| to a "model collapse" that renders the AIs completely
| mad and useless. Either that didn't happen, or all the
| AI companies internally hold a working filter to sieve
| out AI data. I'd bet on the former. I still think there
| might be a chance of model collapse happening to
| _humans_ after too much exposure to AI-generated data,
| but that's just my anecdotal observations and gut
| feelings.
| rjsw wrote:
| I don't think people have really got started on
| generating slop; I expect it to increase by a lot.
| vunderba wrote:
| Was the choice to go with a very obviously AI generated image for
| the banner intentional? If I had to guess it almost looks like
| DALL-E version 2.
| blululu wrote:
| Gratuitous AI slop is really not a good look. tai;dr is
| becoming my default response to this kind of thing. I
| want to hear someone's thoughts, not an LLM's
| compression artifacts.
| nialv7 wrote:
| Does this analogy work? It's exceedingly hard to make
| new low-background steel, since those radioactive
| particles are everywhere. But it's not difficult to make
| AI-free content: just don't use AI to write it.
| absurdo wrote:
| Clickbait title, that's all.
| lurk2 wrote:
| Who is going to generate this AI-free content, for what reason,
| and with what money?
| arjie wrote:
| People do. I do, for instance. My blog is self-hosted,
| entirely human-written, and it is done for the sake of
| enjoyment. It doesn't cost much to host. An entirely
| static site generator would actually be free, but I
| don't mind paying the 55¢/kWh and the $60/month ISP fee
| to host it.
| Ekaros wrote:
| Wouldn't actually curated content still be better? That
| is, content where, say, a lot of blogspam and other
| content potentially generated by certain groups was
| removed? I distinctly remember that a lot of content
| even before AI was of very poor quality.
|
| On the other hand, a lot of poor-quality content could
| still be factually valid enough, just not well edited or
| formatted.
| gorgoiler wrote:
| This site is literally named for the Y combinator!
| Modulo some philosophical hand-waving, if there's one
| thing we ought to demand of our inference models it's
| the ability to find the fixed point of a function that
| takes content and outputs content, then consumes that
| same content!
|
| I too am optimistic that recursive training on data that is a
| mixture of both original human content and content derived from
| original content, and content derived from content derived from
| original human content, ...ad nauseam, will be able to extract
| the salient features and patterns of the underlying system.
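|
| (For the curious, a toy illustration of that fixed-point
| reference: the call-by-value Y combinator in Python,
| with factorial standing in for "content in, content
| out":)
|
|       # Y(f) is a fixed point of f: f(Y(f)) behaves like Y(f).
|       Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(
|           lambda x: f(lambda v: x(x)(v)))
|
|       # Factorial as the fixed point of a one-step rule.
|       fact = Y(lambda rec: lambda n: 1 if n == 0
|                else n * rec(n - 1))
|       print(fact(5))  # 120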
| mclau157 wrote:
| Is this not just www.archive.org?
| K0balt wrote:
| AI-generated content is inherently a regression to the
| mean and harms both training and human utility. There is
| no benefit in publishing anything that an AI can
| generate; just ask the question yourself. Maybe publish
| all AI content with <AI generated content> tags, but
| other than that it is a public nuisance much more often
| than a public good.
| SamPatt wrote:
| Nonsense. Have you used any of the deep research tools?
|
| Don't fall for the utopia fallacy. Humans also publish junk.
| cryptonector wrote:
| Yes, but GP's idea of segregating AI-generated content is
| worth considering.
|
| If you're training an AI, do you want it to get trained on
| other AIs' output? That might be interesting actually, but I
| think you might then want to have both, an AI trained on
| everything, and another trained on everything except other
| AIs' output. So perhaps an HTML tag for indicating "this is
| AI-generated" might be a good idea.
| IncreasePosts wrote:
| Shouldn't there be enough training content from the
| pre-AI era that the system itself can determine whether
| content is AI-generated, or whether it matters?
| Infinity315 wrote:
| Just ask any person who works in teaching or any of the
| numerous faulty AI detectors (they're all faulty).
|
| Any current technology which could be used to accurately
| detect pre-AI content would necessarily imply that the
| same technology could be used to train an AI to generate
| content that skirts by the AI detector. Sure, there is
| going to be a lag time, but eventually we will run out
| of non-AI content.
| cryptonector wrote:
| No, that's the problem. Pre-AI era content a) is often
| not dated, so not identifiable as such, and b) also gets
| out of date. What was thought to be true 20 years ago
| might not be thought to be true today. Search for the
| "half-life of facts".
| RandomBK wrote:
| My 2c is that it _is_ worthwhile to train on AI generated
| content that has obtained some level of human approval or
| interest, as a form of extended RLHF loop.
| cryptonector wrote:
| Ok, but how do you denote that approval? What if you
| partially approve of that content? ("Overall this is
| correct, but this little nugget is hallucinated.")
| bongodongobob wrote:
| It apparently doesn't matter unless you somehow consider
| the entire Internet to be correct. They didn't only feed
| LLMs correct info. It all just got shoveled in and here
| we are.
| thephyber wrote:
| I can see the value of labeling _all_ AI-generated
| content, so that AI can be trained on purely non-AI-
| generated content.
|
| But I don't think that's a reasonable goal. Pragmatic
| example: there are almost no optional HTML tags or
| optional HTTP headers which are used anywhere close to
| 100% of the times they apply.
|
| Also, I think the field is already muddy, even before
| the game starts. Spell checkers, Grammarly, and
| translation all had AI contributions and likely affect
| most human-generated text on the internet. The heuristic
| of "one drop of AI" is not useful. And any heuristic
| more complicated than "one drop" introduces too much
| subjective complexity for a Boolean data type.
| cobbzilla wrote:
| Steel-man angle: A desire for data provenance is a good thing
| with benefits that are independent of utopias/humans vs
| machines kinds of questions.
|
| But, all provenance systems are gamed. I predict the most
| reliable methods will be cumbersome and not widespread, thus
| covering little actual content. The easily-gamed systems will
| be in widespread use, embedded in social media apps, etc.
|
| Questions: 1. Does there exist a data provenance system that
| is both easy to use and reliable "enough" (for some
| sufficient definition of "enough")? Can we do bcrypt-style
| more-bits=more-security and trade time for security?
|
| 2. Is there enough of an incentive for the major tech
| companies to push adoption of such a system? How could this
| play out?
| munificent wrote:
| The observation that humans poop is not sufficient
| justification for spending millions of dollars building an
| automated firehose that pumps a torrent of shit onto the
| public square.
| krapht wrote:
| Yes, and deep research was junk for the hard topics that I
| actually needed to sit down and research. Anything shallower
| I can usually reach by search engine use and scan; deep
| research saves me about 15-30 minutes for well-covered
| topics.
|
| For the hard topics, the solution is still the same as pre-AI
| - search for popular survey papers, then start crawling
| through the citation network and keeping notes. The LLM
| output had no idea of what was actually impactful vs
| what was a junk paper in the niche topic I was
| interested in, so I had no alternative but quality time
| with Google Scholar.
|
| We are a long way from deep research even approaching a well-
| written survey paper written by grad student sweat and tears.
| gojomo wrote:
| This was an intuitively-appealing belief, even with some
| qualified experimental support, as of a few years ago.
|
| However, since then, a bunch of capability breakthroughs
| from (well-curated) AI generations have definitively
| disproven it.
| Crontab wrote:
| Off topic:
|
| When I see a JGC link on Hacker News I can't help but
| remember using POPFile on an old PowerMac - back when
| Bayesian spam filters were becoming popular. It seems so
| long ago, yet it feels like yesterday.
| jeffchuber wrote:
| https://x.com/jeffreyhuber/status/1732069197847687658
| carlosjobim wrote:
| The shadow libraries are the largest and highest quality source
| of human knowledge, larger than the Internet in scope and actual
| content.
|
| It is also uncontaminated by AI.
| klysm wrote:
| Soon this will be contaminated as well unfortunately
| gojomo wrote:
| Look, we just need to add some new 'planes' to Unicode - that
| mirror all communicatively-useful characters, but with extra
| state bits for...
|
| _guaranteed human output_ - anyone who emits text in these
| ranges that was AI generated, rather than artisanally human-
| composed, goes straight to jail.
|
| _for human eyes only_ - anyone who lets any AI train on, or even
| consider, any text in these ranges goes straight to jail. Fnord,
| "that doesn't look like anything to me".
|
| _admittedly AI generated_ - all AI output must use these ranges
| as disclosure, or - you guessed it - those pretending otherwise
| go straight to jail.
|
| Of course, all the ranges generate visually-indistinguishable
| homoglyphs, so it's a strictly-software-mediated quasi-covert
| channel for fair disclosure.
|
| When you cut & paste text from various sources, the provenance
| comes with it via the subtle character encoding differences.
|
| I am only (1 - epsilon) joking.
___________________________________________________________________
(page generated 2025-06-10 23:00 UTC)