[HN Gopher] Why wordfreq will not be updated
       ___________________________________________________________________
        
       Why wordfreq will not be updated
        
       Author : tomthe
       Score  : 1225 points
       Date   : 2024-09-18 11:41 UTC (11 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | altcognito wrote:
        | It might be fun to collect the same data, if for no other
        | reason than to note the changes, while adding the caveat
        | that it doesn't represent human output.
       | 
       | Might even change the tool name.
        
         | jpjoi wrote:
          | The point was that it's getting harder and harder to do
          | that as things get locked down or go behind a massive
          | paywall, either to profit off the data or to keep it from
          | being used in generative AI. The places where previous
          | versions got data are impossible to gather from anymore,
          | so the dataset you would collect would be completely
          | different, which (might) cause weird skewing.
        
           | oneeyedpigeon wrote:
           | But that would always be the case. Twitter will not last
           | forever; heck, it may not even be long before an open
           | alternative like Bluesky competes with it. Would be
           | interesting to know what percentage of the original mined
           | data was from Twitter.
        
       | assanineass wrote:
       | Well said
        
       | jgrahamc wrote:
       | I created https://lowbackgroundsteel.ai/ in 2023 as a place to
       | gather references to unpolluted datasets. I'll add wordfreq.
       | Please submit stuff to the Tumblr.
        
         | VyseofArcadia wrote:
         | Clever name. I like the analogy.
        
           | freilanzer wrote:
           | I don't seem to get it.
        
             | KeplerBoy wrote:
              | Steel made before atmospheric tests of nuclear bombs
              | were a thing is referred to as low-background steel,
              | and it is invaluable for some applications.
             | 
             | LLMs pollute the internet like atomic bombs polluted the
             | environment.
        
             | cdman wrote:
             | https://en.wikipedia.org/wiki/Low-background_steel
        
             | ziddoap wrote:
             | Steel without nuclear contamination is sought after, and
             | only available from pre-war / pre-atomic sources.
             | 
             | The analogy is that data is now contaminated with AI like
             | steel is now contaminated with nuclear fallout.
             | 
             | https://en.wikipedia.org/wiki/Low-background_steel
             | 
             | > _Low-background steel, also known as pre-war steel[1] and
             | pre-atomic steel,[2] is any steel produced prior to the
             | detonation of the first nuclear bombs in the 1940s and
             | 1950s. Typically sourced from ships (either as part of
             | regular scrapping or shipwrecks) and other steel artifacts
             | of this era, it is often used for modern particle detectors
             | because more modern steel is contaminated with traces of
             | nuclear fallout.[3][4]_
        
               | umvi wrote:
               | > and only available from pre-war / pre-atomic sources.
               | 
               | From the same wiki you linked:
               | 
               | "Since the end of atmospheric nuclear testing, background
               | radiation has decreased to very near natural levels,
               | making special low-background steel no longer necessary
               | for most radiation-sensitive uses, as brand-new steel now
               | has a low enough radioactive signature"
               | 
               | and
               | 
               | "For the most demanding items even low-background steel
               | can be too radioactive and other materials like high-
               | purity copper may be used"
        
               | sergiotapia wrote:
               | reading stuff like this makes me so happy. no matter how
               | fucked up something may be there is always a way to clean
               | right up.
        
               | felbane wrote:
               | _glances nervously at atmospheric CO2_
        
               | swyx wrote:
                | and I applied it to LLMs here:
               | https://www.latent.space/p/nov-2023
        
             | AlphaAndOmega0 wrote:
              | It's a reference to the practice of scavenging steel
              | from sources produced before nuclear testing began, as
              | any steel produced afterwards is contaminated with
              | nuclear isotopes from the fallout. Mostly shipwrecks,
              | and WW2 means there are plenty of those. The pun in
              | question is that his project tries to source text that
              | hasn't been contaminated with AI-generated material.
             | 
             | https://en.m.wikipedia.org/wiki/Low-background_steel
        
             | ms512 wrote:
             | After the detonation of the first nuclear weapons, any
             | newly produced steel has a low dose of nuclear fallout.
             | 
             | For applications that need to avoid the background
             | radiation (like physics research), pre atomic age steel is
             | extracted, like from old shipwrecks.
             | 
             | https://en.m.wikipedia.org/wiki/Low-background_steel
        
             | GreenWatermelon wrote:
             | From the blog
             | 
             | > Low Background Steel (and lead) is a type of metal
             | uncontaminated by radioactive isotopes from nuclear
             | testing. That steel and lead is usually recovered from
             | ships that sunk before the Trinity Test in 1945.
        
             | voytec wrote:
             | To whomever downvoted parent: please don't act against
             | people brave enough to state that they don't know
             | something.
             | 
             | This is a desired quality, increasingly less present in IT
             | work environments. People afraid of being shamed for
             | stating knowledge gaps are not the folks you want to work
             | with.
        
               | umvi wrote:
               | I feel like there's a minimum "due diligence" bar to meet
               | though before asking, otherwise it comes across as "I'm
               | too lazy to google the reference and connect the dots
               | myself, but can someone just go ahead and distill a nice
               | summary for me"
        
               | voytec wrote:
               | In this particular case, I was out of the loop regarding
               | the clever analogy myself. I'm now a tad smarter because
               | someone else expressed lack of understanding, and I
               | learned from responses to this (grayed due to downvotes)
               | comment.
        
               | PhunkyPhil wrote:
               | The problem is that the answer was a really easy google.
               | I didn't know what low background steel was and I just
               | googled it.
        
               | cwillu wrote:
               | A person asking the question _here_ means there are now
               | several good succinct explanations of it _here_.
        
               | input_sh wrote:
               | But it's right there in the header, you could just click
               | the link and find out on the top of the webpage.
        
         | imhoguy wrote:
         | I am not sure we should trust a site contaminated by AI
         | graphics. /s
        
           | whywhywhywhy wrote:
            | Yeah, pay an illustrator if this is important to you.
            | 
            | I see a lot of people upset about AI still using AI
            | image generation because it's not in their field, so
            | they feel less strongly about it, and they can't create
            | art themselves anyway. It's hypocritical: either use it
            | or don't, but don't fuss over it and then use it for
            | something that's convenient for you.
        
             | imhoguy wrote:
              | I have updated my comment with "/s" as that is closer
              | to what I meant. Seriously, though, from an ethical
              | point of view it is unlikely illustrators were asked
              | or compensated for their work being used to train the
              | AI that produced the image.
        
               | heckelson wrote:
               | I thought the header image was a symbol of AI slop
               | contamination because it looked really off-putting
        
           | gorkish wrote:
           | The buildings and shipping containers that store low
           | background steel aren't built out of the stuff either.
        
         | astennumero wrote:
          | That's exactly the opposite of what the author wanted,
          | IMO. The author no longer wants to be a part of this mess.
          | Aggregating these sources would just make it that much
          | easier for the tech giants to scrape more data.
        
           | rovr138 wrote:
           | The sources are just aggregated. The source doesn't change.
           | 
           | The new stuff generated does (and this is honestly already
           | captured).
           | 
           | This author doesn't generate content. They analyze data from
           | humans. That "from humans" is the part that can't be
           | discerned enough and thus the project can't continue.
           | 
           | Their research and projects are great.
        
           | iak8god wrote:
           | The main concerns expressed in Robyn's note, as I read them,
           | seem to be 1) generative AI has polluted the web with text
           | that was not written by humans, and so it is no longer
           | feasible to produce reliable word frequency data that
           | reflects how humans use natural language; and 2)
           | simultaneously, sources of natural language text that were
           | previously accessible to researchers are now less accessible
           | because the owners of that content don't want it used by
           | others to create AI models without their permission. A third
           | concern seems to be that support for and practice of any
           | other NLP approaches is vanishing.
           | 
           | Making resources like wordfreq more visible won't exacerbate
           | any of these concerns.
        
         | LeoPanthera wrote:
         | Congratulations on "shipping", I've had a background task to
         | create pretty much exactly this site for a while. What is your
         | cutoff date? I made this handy list, in research for mine:
          2017: Invention of transformer architecture
          June 2018: GPT-1
          February 2019: GPT-2
          June 2020: GPT-3
          March 2022: GPT-3.5
          November 2022: ChatGPT
         | 
         | You may want to add kiwix archives from before whatever date
         | you choose. You can find them on the Internet Archive, and
         | they're available for Wikipedia, Stack Overflow, Wikisource,
         | Wikibooks, and various other wikis.
        
         | ClassyJacket wrote:
         | :'( I thought I was clever for realising this parallel myself!
         | Guess it's more obvious than I thought.
         | 
         | Another example is how data on humans after 2020 or so can't be
         | separated by sex because gender activists fought to stop
         | recording sex in statistics on crime, medicine, etc.
        
           | sweeter wrote:
           | This is a psychotic thing to say without a source,
           | considering how it's blatantly untrue.
        
       | primer42 wrote:
       | Hear, hear!
        
       | oneeyedpigeon wrote:
       | I wonder if anyone will fork the project. Apart from anything
       | else, the data may still be useful given that we know it is
       | polluted. In fact, it could act as a means of judging the impact
       | of LLMs via that very pollution.
        
         | Miraltar wrote:
          | I guess it would be interesting, but differentiating
          | pollution from language evolution seems very tricky, since
          | getting a non-polluted corpus gets harder and harder.
        
           | wpietri wrote:
           | One way to tackle it would be to use LLMs to generate
           | synthetic corpuses, so you have some good fingerprints for
           | pollution. But even there I'm not sure how doable that is
           | given the speed at which LLMs are being updated. Even if I
           | know a particular page was created in, say, January 2023, I
           | may no longer be able to try to generate something similar
           | now to see how suspect it is, because the precise setups of
           | the moment may no longer be available.
        
           | Retr0id wrote:
           | Arguably it _is_ a form of language evolution. I bet humans
           | have started using  "delve" more too, on average. I think the
           | best we can do is look at the trends and think about
           | potential causes.
        
             | rvnx wrote:
             | "Seamless", "honed", "unparalleled", "delve" are now
             | polluting the landscape because of monkeys repeating what
             | ChatGPT says without even questioning what the words mean.
             | 
             | Everything is "seamless" nowadays. Like I am seamlessly
             | commenting here.
             | 
             | Arguably, the meaning of these words evolve due to misuse
             | too.
        
               | oneeyedpigeon wrote:
               | I see a lot of writing in my day-to-day, and the words
               | that stick out most are things like "plethora" and
               | "utilized". They're not terribly obscure, but they're
               | just 'odd' and, maybe, formal enough to really stick out
               | when overused.
        
             | pavel_lishin wrote:
             | > _I bet humans have started using "delve" more too, on
             | average._
             | 
             | I wish there were a way to check.
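The counting itself is straightforward; the hard part is finding clean, dated corpora. A minimal sketch in Python, with made-up sample texts standing in for an older and a newer corpus (the texts and numbers are purely illustrative, not real measurements):

```python
# Hypothetical sketch: compare how often a word appears, per token,
# in two text samples. The sample corpora below are invented for
# illustration; a real check would need large, dated corpora.
import re
from collections import Counter

def relative_frequency(word: str, text: str) -> float:
    """Occurrences of `word` per token, case-insensitive."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] / len(tokens)

old_corpus = "We looked into the archives and searched the records."
new_corpus = "Let's delve into the archives and delve into the records."

old_f = relative_frequency("delve", old_corpus)
new_f = relative_frequency("delve", new_corpus)
print(old_f, new_f)  # a large jump in the ratio would hint at a usage shift
```

The same idea is what wordfreq-style tooling does at scale; the bottleneck, as the article argues, is trusting that the newer corpus is human-written at all.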
        
       | shortrounddev2 wrote:
       | Man the AI folks really wrecked everything. Reminds me of when
       | those scooter companies started just dumping their scooters
       | everywhere without asking anybody if they wanted this.
        
         | analog31 wrote:
          | Perhaps germane to this thread: I think the scooter thing
          | was an investment bubble. It was easier to burn investment
          | money on new scooters than to collect and maintain old
          | ones, until the money ran out.
        
         | kdmccormick wrote:
         | At least scooters did something useful for the environment.
        
           | DrillShopper wrote:
           | Their batteries on the other hand...
        
             | kdmccormick wrote:
             | Sure, they're worse than walking or biking, but compared to
             | an electric car battery or an ICE car?
        
               | Sharlin wrote:
               | At least where I'm from, scooters have mostly replaced
               | walking and biking, not car trips :(
        
           | Sander_Marechal wrote:
            | Did they? A lot of them were barely used, got damaged
            | or vandalized, etc. And when the companies folded or
            | communities outlawed the scooters, they ended up as
            | trash. I don't believe for a second that the amount of
            | pollutants and greenhouse gasses saved by usage is
            | larger than the amount produced by manufacturing,
            | shipping and trashing all those scooters.
        
       | baq wrote:
        | All those writers who'll soon be out of a job (or already
        | are, and basically unhireable for their previous tasks)
        | should be paid by the AI hyperscalers to write anything at
        | all, on one condition: not a single sentence in their works
        | should be created with AI.
       | 
       | (I initially wanted to say 'paid for by the government' but
       | that'd be socialising losses and we've had quite enough of that
       | in the past.)
        
         | bondarchuk wrote:
         | AI companies are indeed hiring such people to generate
         | customized training data for them.
        
           | neilv wrote:
           | Is it the same companies that simply took all the writers'
           | previous work (hoping to be billionaires before the courts
           | understand)?
        
             | shadowgovt wrote:
             | Yes. This was always the failure with the argument that
             | copyright was the relevant issue... Once the model was
             | proven out, we knew some wealthy companies would hire
             | humans to generate the training data that the companies
             | could then own in whole, at the relative expense of all
             | other humans that didn't get paid to feed the machines.
        
           | passion__desire wrote:
           | This idea could also be extended to domains like Art. Create
           | new art styles for AI to learn from. But in future, that will
           | also get automated. AI itself will create art styles and all
           | humans would do is choose whether something is Hot or Not.
           | Sort of like art breeder.
        
         | vidarh wrote:
          | There are already several companies doing this - I do
          | occasional contract work for a couple - and they sometimes
          | pay rates well above what an average-earning writer can
          | expect elsewhere. However, the vast majority of writers
          | have never been able to make a living from their writing.
          | The threshold to write is too low, too many people love
          | it, and most people read very little.
        
           | baq wrote:
            | Transformers read a lot during training; it might
            | actually be beneficial for the companies if those works
            | never saw the light of day and only machines ever read
            | them. That's so dystopian I'd say those works should be
            | published so they eventually enter the public domain.
        
             | ckemere wrote:
             | Rooms full of people writing into a computer is a striking
             | mental picture. It feels like it could be background for a
             | great plot for a book/movie.
        
               | left-struck wrote:
               | Have you heard of Severance? This has a vibe extremely
               | similar to that show.
        
         | trilbyglens wrote:
         | Have you ever read american history? Lol.
        
         | nkozyra wrote:
         | People have been paid to generate noise for a decade+ now.
         | Garbage in, garbage out will always be true.
         | 
         | Next token-seeking is a solved problem. Novel thinking can be
         | solved by humans and possibly by AI soon, but adding more
         | garbage to the data won't improve things.
        
         | tveita wrote:
         | Who programs the tapes?
         | https://en.wikipedia.org/wiki/Profession_(novella)
        
           | jfultz wrote:
           | _Thank you_. I read this story probably around 1980 (I think
           | in a magazine that was subsequently trashed or garage-saled),
           | and I have spent my adult life remembering the bones of the
           | story, but not the author or the title.
        
       | anovikov wrote:
        | Sad. I'd love to see by how much the use of the word
        | "delve" has increased since 2021...
        
         | chipdart wrote:
         | From the submission you're commenting on:
         | 
         | > As one example, Philip Shapira reports that ChatGPT (OpenAI's
         | popular brand of generative language model circa 2024) is
         | obsessed with the word "delve" in a way that people never have
         | been, and caused its overall frequency to increase by an order
         | of magnitude.
        
           | eesmith wrote:
           | https://pshapira.net/2024/03/31/delving-into-delve/ "Delving
           | into "delve""
        
         | xpl wrote:
          | The fun thing is that while GPTs initially learned from
          | humans (because ~100% of the content was human-generated),
          | future humans will learn from GPTs, because almost all
          | available content will soon be GPT-generated.
         | 
         | This will surely affect how we speak. It's possible that human
         | language evolution could come to a halt, stuck in time as AI
         | datasets stop being updated.
         | 
         | In the worst case, we will see a global "model collapse" with
         | human languages devolving along with AI's, if future AIs are
         | trained on their own outputs...
        
         | Terretta wrote:
         | > _I 'd love to see by how much the use of world "delve" has
         | increased since 2021..._
         | 
         | There are charts / graphs in the link, both since 2021, and
         | since earlier.
         | 
         | The final graph suggests the phenomenon started earlier,
         | possibly correlated in some way to Malaysian / Indian usages of
         | English.
         | 
         | It does seem OpenAI's family of GPTs as implemented in ChatGPT
         | unspool concepts in a blend of India-based-consultancy English
         | with American freshmen essay structure, frosted with
         | superficially approachable or upbeat blogger prose
         | ingratiatingly selling you something.
         | 
         | Anthropic has clearly made efforts to steer this differently,
         | Mistral and Meta as well but to a lesser degree.
         | 
         | I've wondered if this reflects training material (the SEO is
         | ruining the Internet theory), or is more simply explained by
         | selection of pools of Hs hired for RLHF.
        
         | dqv wrote:
         | Same for me but with the word "crucial".
        
         | slashdave wrote:
         | Amusing that we now have a feedback loop. Let's see... delve
         | delve delve delve delve delve delve delve. There, I've done my
         | part.
        
         | CaptainFever wrote:
         | Google ngram viewer, perhaps?
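The Ngram Viewer can also be queried programmatically through an unofficial JSON endpoint. The URL and parameter names below are assumptions inferred from how the web viewer builds its own requests; the endpoint is undocumented and may change:

```python
# Sketch: build a query URL for the Google Books Ngram Viewer's
# unofficial JSON endpoint (undocumented; parameter names inferred
# from the web viewer's requests and may change without notice).
from urllib.parse import urlencode

def ngram_url(phrase: str, year_start: int, year_end: int,
              corpus: str = "en-2019") -> str:
    params = urlencode({
        "content": phrase,
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": 0,
    })
    return f"https://books.google.com/ngrams/json?{params}"

url = ngram_url("delve", 2000, 2019)
print(url)
# Fetching this URL (e.g. with urllib.request) should return a JSON
# list of {"ngram": ..., "timeseries": [...]} entries.
```

One caveat: the Google Books corpus currently ends around 2019, so it can't capture the post-ChatGPT shift directly.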
        
       | voytec wrote:
        | I agree in general, but the web was already polluted by
        | Google's unwritten SEO rules. Single-sentence paragraphs,
        | multiple keyword repetitions, and a focus on "indexability"
        | instead of readability made the web a less than ideal
        | source for such analysis long before LLMs.
       | 
       | It also made the web a less than ideal source for training. And
       | yet LLMs were still fed articles written for Googlebot, not
       | humans. ML/LLM is the second iteration of writing pollution. The
       | first was humans writing for corporate bots, not other humans.
        
         | kevindamm wrote:
          | Yes, but not quite as far as you imply. The training
          | data is weighted by a quality metric; articles written by
          | journalists and Wikipedia contributors are given more
          | weight than Aunt May's brownie recipe and corpoblogspam.
        
           | Freak_NL wrote:
           | It certainly feels like the amount of regurgitated,
           | nonsensical, generated content (nontent?) has risen
           | spectacularly specifically in the past few years. 2021 sounds
           | about right based on just my own experience, even though I
           | can't point to any objective source backing that up.
        
             | jsheard wrote:
             | SEO grifters have fully integrated AI at this point, there
             | are dozens of turn-key "solutions" for mass-producing
             | "content" with the absolute minimum effort possible. It's
             | been refined to the point that scraping material from other
             | sites, running it through the LLM blender to make it look
             | original, and publishing it on a platform like Wordpress is
             | fully automated end-to-end.
        
               | sahmeepee wrote:
                | Or check out "money printer" on GitHub: a
                | tongue-in-cheek mashup of various tools that takes a
                | keyword as input and produces a YouTube video with
                | subtitles and narration as output.
        
             | zharknado wrote:
             | Ooh I like "nontent." Nothing like a spicy portmanteau!
        
             | eptcyka wrote:
              | I personally have yet to see this beyond some slop
              | on YouTube. And I am here for the AI meme videos. I
              | recognize the dangers of this; all I am saying is that
              | I don't feel the effect, yet.
        
               | ghaff wrote:
               | There's been a ton of low-rent listicle writing out there
               | for ages. Certainly not new in the past few years. I
               | admit I don't go on YouTube much and don't even have a
               | tiktok account so it's possible there's a lot of newer
               | lousy content I'm not really exposed to.
               | 
               | It seems to me that the fact it's so cheap and relatively
               | easy for people with dreams of becoming wealthy
               | influencers to put stuff out there has more to do with
               | the flood of often mediocre content than AI does.
               | 
               | Of course the vast majority don't have much real success
               | and get on with life and the crank turns and a new
               | generation perpetuates the cycle.
               | 
               | LLMs etc. may make things marginally easier but there's
               | no shortage of twenty somethings with lots of time
               | imagining riches while making pennies.
        
               | Freak_NL wrote:
               | I'm seeing it a lot when searching for some advice in a
               | well-defined subject, like, say, leatherworking or sewing
               | (or recipes, obviously). Instead of finding forums with
               | hobbyists, in-depth blog posts, or manufacturers advice
               | pages, increasingly I find articles which seem like
               | natural language at first, but are composed of paragraphs
               | and headers repeating platitudes and basic tips. It takes
               | a few seconds to realize the site is just pushing
               | generated articles.
               | 
               | Increasingly I find that for in-depth explanations or
               | tutorials Youtube is the only place to go, but even there
               | the search results can lead to loads of videos which just
               | seem... off. But at least those are still made by humans.
        
             | eszed wrote:
             | Upvoted for "nontent" alone: it'll be my go-to term from
             | now on, and I hope it catches on.
             | 
             | Is it of your own coinage? When the AI sifts through the
             | digital wreckage of the brief human empire, they may give
             | you the credit.
        
               | Freak_NL wrote:
               | I do hope it catches on! I did come up with this myself,
               | but I really doubt I'm the only one -- and indeed:
               | Wiktionary lists it already with a 2023 vintage:
               | 
               | https://en.wiktionary.org/wiki/nontent
        
           | darby_nine wrote:
            | Aunt May's brownie recipe (or at least her thoughts on
            | it) is likely something you'd want if you want to
            | reflect how humans use language. Both news-style and
            | encyclopedia-style writing represent a pretty narrow
            | slice.
        
             | creshal wrote:
             | That's why search engines rated them highly, and why a
             | million spam sites cropped up that paid writers $1/essay to
             | pretend to be Aunt May, and why today every recipe website
             | has a gigantic useless fake essay in front of their
             | copypasted made up recipes.
        
               | darby_nine wrote:
                | Ok, but what I said is true regardless of SEO, and
                | SEO had also fed back into English before LLMs were
                | a thing. If you only train on those subsets you'll
                | also end up with a chatbot that doesn't speak in a
                | way we'll identify as natural English.
        
               | actionfromafar wrote:
               | Yet. Give it time. The LLMs will train our future
               | children.
        
               | darby_nine wrote:
               | I'm sure they already are.
        
               | Freak_NL wrote:
               | I hate how looking for recipes has become so...
               | disheartening. Online recipes are fine for reputable
               | sources like newspapers where professional recipe writers
               | are paid for their contributions, but searching for some
               | Aunt May's recipe for 'X' in the big ocean of the
               | internet is pointless -- too much raw sewage dumped in.
               | 
               | It sucks, because sharing recipes seemed like one of
               | those things the internet could be really good at.
        
               | smallerfish wrote:
               | There seem to be quite a few recipe sharing sites around
               | - e.g. allrecipes.com.
        
               | creshal wrote:
                | And they're all flooded with low-effort trash,
                | rendering them useless.
               | 
               | The only remaining reliable source - now that many
               | newspapers are axing the remaining staff in favour of
               | LLMs - is pre-2020 print cookbooks. Anything online or
               | printed later must be assumed to be tainted, full of
               | untested sewage and potentially dangerous suggestions.
        
               | formerly_proven wrote:
               | Well there's https://www.allrecipes.com/author/chef-john/
               | on that particular site.
        
               | JohnFen wrote:
               | Chef John is _the best_.
        
               | davejohnclark wrote:
               | I absolutely love Chef John. Great recipes and the
               | cadence of his speech on YouTube (foodwishes) is very
               | soothing, while he cooks up something amazing. If you're
               | a home cook I highly recommend his recipes and his
               | channel.
        
               | jerf wrote:
                | The wife and I use the internet for recipe
                | _ideas_... but we hardly ever follow them directly
                | anymore. We're no formally-trained chefs but we've
                | been home cooks for over 20 years now, and so many
                | of them are self-evidently bad, or distinctly
                | suboptimal. The internet chef's aversion to
               | flavor is a meme with us now; "add one-sixty-fourth of a
               | teaspoon of garlic powder to your gallon of soup, and mix
               | in two crystals of table salt". Either that or they're
               | all getting some seriously potent spices all the time and
               | I'd like to know where they shop because my spices are
               | nowhere near as powerful as theirs.
        
               | halostatue wrote:
               | It's not just online recipes, but cookbooks written for
               | the Better Home & Gardens crowd. The ones who write
               | "curry powder" (and mean the yellow McCormick stuff which
               | is so bland as to have almost no flavour) or call for one
               | clove of garlic in their recipe.
               | 
               | I joke with folks that my assumption with "one clove of
               | garlic" is that they _really_ mean "one head of garlic"
               | if you want any flavour. (And if the recipe title has
               | "garlic" _in_ it and you are using one clove, you're
               | lying.)
        
               | nick3443 wrote:
               | If the recipe has "garlic" in the title, I'm budgeting
               | 1/2 head per serving.
        
               | shagie wrote:
               | I wish more people presented recipes like cooking for
               | engineers. For example - Meat Lasagna
               | https://www.cookingforengineers.com/recipe/36/Meat-Lasagna
        
               | grues-dinner wrote:
               | And here I thought my defacement of printed recipes by
               | bracketing everything that goes together at each stage
               | was just me. There are, well, maybe not dozens but at
               | least two of us! Saves a lot of bowls when you know
               | without further checking that you can, say, just dump the
               | flour and sugar, butter and eggs into the big bowl
               | without having to prepare separately because they're in
               | the "1: big bowl" bracket.
        
               | halostatue wrote:
               | Depends on what you're doing. For best cookies, you want
               | to cream the butter with the sugar, _then_ add the eggs,
               | and _finally_ add the flour. If you're interested and can
               | find one, it's worth taking a vegan baking class. You
               | learn a lot about ingredient substitutions for baking,
               | about what the different non-vegan ingredients are doing
               | that you have to compensate for...and it does something
               | that I've only recently started seeing happen in non-
               | vegan baking recipes: it separates the wet ingredients
               | from the dry ingredients.
               | 
               | That is, when baking, you can _usually_ (again,
               | exceptions for creaming the sugar in butter, etc.) take
               | all of your dry ingredients and mix/sift them together,
               | and then you pour your wet ingredients in a well you've
               | made in the dry ingredients (these can also usually be
               | mixed together).
        
               | grues-dinner wrote:
               | No need to cakesplain, that was an example with three
               | ingredients off the top of my head; very, very obviously
               | the exact ingredients and bracket assignments vary
               | depending on what you are making.
               | 
               | But for shortbread or fork biscuits those three could
               | indeed all go in the bowl in one go (but that one
               | admittedly doesn't really need a bracket because the
               | recipe is "put in bowl, mix with hands, bake").
        
               | bhasi wrote:
               | I love the table-diagrams at the end. I've never seen
               | anything like that until now and it really seems useful
               | for visualization of the recipe and the sequence of
               | steps.
        
               | shagie wrote:
               | Combined with pictures for what each step _should_ look
               | like. I had a few of these pages printed out back in the
               | '00s for some recipes that I did.
        
           | jsheard wrote:
           | > The training data is weighted by a quality metric
           | 
           | At least in Google's case, they're having so much difficulty
           | keeping AI slop out of their search results that I don't have
           | much faith in their ability to give it an appropriately low
           | training weight. They're not even filtering the comically
           | low-hanging fruit like those YouTube channels which post a
           | new "product review" every 10 minutes, with an AI generated
           | thumbnail and AI voice reading an AI script that was never
           | graced by human eyes before being shat out onto the internet,
           | and is of course _always_ a glowing recommendation since the
           | point is to get the viewer to click an affiliate link.
           | 
           | Google has been playing the SEO cat and mouse game forever,
           | so can startups with a fraction of the experience be expected
           | to do any better at filtering the noise out of fresh web
           | scrapes?
        
             | epgui wrote:
             | I don't think they were talking about the quality of Google
             | search results. I believe they were talking about how the
             | data was processed by the wordfreq project.
        
               | kevindamm wrote:
               | I was actually referring to the data ingestion for
               | training LLMs, I don't know what filtering or weighting
               | might be done with wordfreq.
        
             | acdha wrote:
             | > Google has been playing the SEO cat and mouse game
             | forever, so can startups with a fraction of the experience
             | be expected to do any better at filtering the noise out of
             | fresh web scrapes?
             | 
             | Google has been _monetizing_ the SEO game forever. They
             | chose not to act against many notorious actors because the
              | metric they optimize for is ad revenue, and those sites
             | were loaded with ads. As long as advertisers didn't stop
             | buying, they didn't feel much pressure to make big changes.
             | 
             | A smaller company without that inherent conflict of
             | interest in its business model can do better because they
             | work on a fundamentally different problem.
        
             | noirscape wrote:
             | Google has those problems because the company's revenue
             | source (Ads) and the thing that puts it on the map (Search)
             | are fundamentally at odds with one another.
             | 
             | A useful Search would ideally send a user to the site with
              | the most signal and the least noise. Meanwhile, ads are
             | inherently noise; they're extra pieces of information
             | inserted into a webpage that at _best_ tangentially
             | correlate to the subject of a page.
             | 
             | Up until ~5 years ago, Google was able to strike a balance
             | on keeping these two stable; you'd get results with some
             | Ads but the signal generally outweighed the noise.
              | Unfortunately, from what I can tell from anecdotes and
              | courtroom documents, somewhere in 2018-2019 the Ad team
              | at Google essentially hijacked every other aspect of the
              | company by threatening that yearly bonuses wouldn't be
              | given out if the other teams didn't kowtow to the Ad
              | team's wish to optimize ad revenue, and there's no sign
              | of it stopping since there's no _effective_ competition
              | to Google. (There's like, Bing and Kagi? Nobody uses Bing
              | though and Kagi is only used by tech
             | enthusiasts. The problem with Google is that to copy it,
             | you need a ton of computing resources upfront and are going
             | up against a company with infinitely more money and ability
             | to ensure users don't leave their ecosystem; go ahead and
             | abandon Search, but good luck convincing others to give up
             | say, their Gmail account, which keeps them locked to Google
             | and Search will be there, enticing the average user.)
             | 
             | Google has absolutely zero incentive to filter out
             | generative AI junk from their search results outside the
             | amount of it that's damaging their PR since most of the SEO
             | spam is also running Google Ads (since unless you're
             | hosting adult content, Google's ad network is practically
             | the only option). Their solution therefore isn't to remove
             | the AI junk, but to instead reduce it _enough_ to the
             | degree where a user will not get the same _type_ of AI junk
             | twice.
        
               | PaulHoule wrote:
               | My understanding is that Google Ads are what makes Google
               | Search unassailable.
               | 
               | A search engine isn't a two-sided market in itself but
               | the ad network that supports it is. A better search
               | engine is a technological problem, but a decently paying
               | ad network is a technological problem _and_ a hard
               | marketing problem.
        
             | Suppafly wrote:
              | >At least in Google's case, they're having so much
             | difficulty keeping AI slop out of their search results that
             | I don't have much faith in their ability to give it an
             | appropriately low training weight.
             | 
             | I've noticed that lately. It used to be the top google
             | result was almost always what you needed. Now at the top is
             | an AI summary that is pretty consistently wrong, often in
             | ways that aren't immediately obvious if you aren't familiar
             | with the topic.
        
             | derefr wrote:
             | > those YouTube channels which post a new "product review"
             | every 10 minutes, with an AI generated thumbnail and AI
             | voice reading an AI script that was never graced by human
             | eyes before being shat out onto the internet
             | 
             | The problem is that, of the signals you mention,
             | 
             | * the highly-informative ones (posting a new review every
             | 10 minutes, having affiliate links in the description) are
              | _contextual_ -- i.e. they're heuristics that only work on
             | a site-specific basis. If the point is to create a training
             | pipeline that consumes "every video on the Internet" while
             | automatically rejecting the videos that are botspam, then
             | contextual heuristics of this sort won't scale. (And Google
             | "doesn't do things that don't scale.")
             | 
              | * and, conversely, the _context-free_ signals you mention
              | (thumbnail looks AI-generated, voice is synthesized)
              | aren't actually highly correlated with the script being
              | LLM-barf rather than something a human wrote.
             | 
             | Why? One of the primary causes is TikTok (because TikTok
             | content gets cross-posted to YouTube a lot.) TikTok has a
             | built-in voiceover tool; and many people don't like their
             | voice, or don't have a good microphone, or can't speak
             | fluent/unaccented English, or whatever else -- so they
             | choose to sit there typing out a script on their phone, and
             | then have the AI read the script, rather than reading the
             | script themselves.
             | 
             | And then, when these videos get cross-posted, usually
             | they're being cross-posted in some kind of compilation,
             | through some tool that picks an AI-generated thumbnail for
             | the compilation.
             | 
             | Yet, all the content in these is _real stuff that humans
              | wrote_, and so not something Google would want to throw
             | away! (And in fact, such content is frequently a uniquely-
             | good example of the "gen-alpha vernacular writing style",
             | which otherwise doesn't often appear in the corpus due to
             | people of that age not doing much writing in public-web-
             | scrapeable places. So Google _really_ wants to sample it.)
        
           | Lalabadie wrote:
           | The current state of things leads me to believe that Google's
           | current ranking system has been somehow too transparent for
           | the last 2-3 years.
           | 
           | The top of search results is consistently crowded by pages
           | that obviously game ranking metrics instead of offering any
           | value to humans.
        
         | rgrieselhuber wrote:
         | Indexability is orthogonal to readability.
        
           | hk__2 wrote:
           | It should be, but sadly it's not.
        
         | krelian wrote:
         | >And yet LLMs were still fed articles written for Googlebot,
         | not humans.
         | 
         | How do we know what content LLMs were fed? Isn't that a highly
         | guarded secret?
         | 
         | Won't the quality of the content be paramount to the quality of
         | the generated output or does it not work that way?
        
           | GTP wrote:
            | We do know that the open web constitutes the bulk of the
            | training data, although we don't get to know the specific
            | webpages that got used. Plus some more curated sources,
            | like books, though again we only know that books were
            | used, not which ones. So it's just a matter of probability
            | that a good amount of SEO spam was in there as well.
        
         | ToucanLoucan wrote:
          | This feels like a second, orders-of-magnitude-larger Eternal
         | I wonder how much more of this the Internet can take before
         | everyone just abandons it entirely. My usage is notably lower
         | than it was in even 2018, it's so goddamn hard to find anything
         | worth reading anymore (which is why I spend so much damn time
         | here, tbh).
        
           | wpietri wrote:
           | I think it's an arms race, but it's an open question who
           | wins.
           | 
           | For a while I thought email as a medium was doomed, but
           | spammers mostly lost that arms race. One interesting
           | difference is that with spam, the large tech companies were
           | basically all fighting against it. But here, many of the
           | large tech companies are either providing tools to spammers
           | (LLMs) or actively encouraging spammy behaviors (by
           | integrating LLMs in ways that encourage people to send out
           | text that they didn't write).
        
             | ToucanLoucan wrote:
             | > but spammers mostly lost that arms race
             | 
             | I'm not saying this is impossible but that's going to be an
             | uphill sell for me as a concept. According to some quick
             | stats I checked I'm getting roughly 600 emails per day,
             | about 550 of which go directly to spam filtering, and of
             | the remaining 50, I'd say about 6 are actually emails I
             | want to be receiving. That's an impressive amount overall
             | for whoever built this particular filter, but it's also
             | still a ton of chaff to sort wheat from and as a result I
             | don't use email much for anything apart from when I have
             | to.
             | 
             | Like, I guess that's technically usable, I'm much happier
             | filtering 44 emails than 594 emails? But that's like saying
             | I solved the problem of a flat tire by installing a wooden
             | cart wheel.
             | 
             | It's also worth noting there that if I do have an email
             | thats flagged as spam that shouldn't be, I then have to
             | wade through a much deeper pond of shit to go find it as
             | well. So again, better, but IMO not even remotely solved.
        
               | dhosek wrote:
               | I'm not sure what you've done to get that level of spam,
               | but I get about 10 spam emails a day at most and that's
               | across multiple accounts including one that I've used for
               | almost 30 years and had used on Usenet which was the
               | uber-spam magnet. A couple newer (10-15 year old)
               | addresses which I've published on webpages with mailto
               | links attract maybe one message a week and one that I
               | keep for a specialized purpose (fiction and poetry
               | submissions) gets maybe one to two messages per year,
               | mostly because it's of the form example@example.com so
               | easily guessed by enterprising spammers.
               | 
                | Looking at the last days' spam[1] I have three 419-style
               | scams (widows wanting to give away their dead husbands'
               | grand piano or multi-million euro estate) and three
               | phishing attempts. There are duplicate messages in each
               | category.
               | 
               | About fifteen years ago, I did a purge of mailing list
               | subscriptions and there's very little that comes in that
               | I don't want, most notably a writer who's a nice guy, but
               | who interpreted my question about a comment he made on a
               | podcast as an invitation to be added to his manually
                | managed email list; given that it's only four or five
               | messages a year, I guess I can live with that.
               | 
               | 1. I cleaned out spam yesterday while checking for a
               | confirmation message from a purchase.
        
               | wpietri wrote:
               | I'm having a hard time finding reliably sourced
               | statistics here, but I suspect you're an outlier. My
               | personal numbers are way better, both on Gmail and
               | Fastmail, despite using the same email addresses for
               | decades.
        
             | jerf wrote:
             | Another problem with this arms race is that spam emails
             | actually are largely separable from ham emails for most
              | people... or at least they _were_, for most of their run.
             | The thousandth email that claims the UN has set aside money
             | for me due to my non-existent African noble ancestry that
             | they can't find anyone to give it to and I just need to
             | send the Thailand embassy some money to start processing my
             | multi-million yuan payout and send it to my choice of proxy
             | in Colombia to pick it up is quite different from technical
             | conversation about some GitHub issue I'm subscribed to, on
             | all sorts of metrics.
             | 
             | However, the frontline of the email war has shifted lately.
             | Now the most important part of the war is being fought over
              | emails that look _just like ham_, but aren't. Business
              | frauds where someone convinces you that they are the CEO
              | or CFO or some VP and that they need you to urgently buy
              | this or that for them right now, no time to talk, are
              | big business right now, and before you get too
              | high-and-mighty about how
             | immune you are to that, they are now extremely good at
             | looking official. This war has not been won yet, and to a
             | large degree, isn't something you necessarily win by AI
             | either.
             | 
             | I think there's an analogy here to the war on content slop.
             | Since what the content slop wants is just for you to see it
             | so they can serve you ads, it doesn't need anything else
             | that our algorithms could trip on, like links to malware or
              | calls to action to be defrauded, or anything else. It
              | looks _just_ like the real stuff, and telling that it
              | isn't could require a human to sift through rather vast
              | amounts of input just to be mostly sure. Except we don't
              | have the ability to
             | authenticate where it came from. (There is no content
             | authentication solution that will work at scale. No matter
             | how you try to get humans to "sign their work" people will
             | always work out how to automate it and then it's done.) So
             | the one good and solid signal that helps in email is gone
             | for general web content.
             | 
             | I don't judge this as a winning scenario for the defenders
             | here. It's not a total victory for the attackers either,
             | but I'd hesitate to even call an advantage for one side or
             | the other. Fighting AI slop is not going to be easy.
        
             | pyrale wrote:
             | > but spammers mostly lost that arms race.
             | 
              | Unlike web spam, the advertising in your mail isn't
              | Google's, so they had every incentive to fight it.
        
             | jsheard wrote:
             | The fight against spam email also led to mass consolidation
             | of what was supposed to be a decentralised system though.
             | Monoliths like Google and Microsoft now act as de-facto
             | gatekeepers who decide whether or not you're allowed to
             | send emails, and there's little to no transparency or
             | recourse to their decisions.
             | 
             | There's probably an analogy to be made about the open
             | decentralised internet in the age of AI here, if it gets to
             | the point that search engines have to assume all sites are
             | spam by default until proven otherwise, much like how an
             | email server is assumed guilty until proven innocent.
        
           | BeFlatXIII wrote:
           | I hope this trend accelerates to force us all into grass-
           | touching and book-reading. The sooner, the better.
        
             | MrLeap wrote:
             | Books printed before 2018, right?
             | 
             | I already find myself mentally filtering out audible
             | releases after a certain date unless they're from an author
             | I recognize.
        
         | bondarchuk wrote:
         | At some point though you have to acknowledge that a specific
         | use of language belongs to the medium through which you're
         | counting word frequencies. There are also specific writing
         | styles (including sentence/paragraph sizes, unnecessary
         | repetitions, focusing on other metrics than readability)
         | associated with newspapers, novels, e-mails to your boss,
          | anything really. As long as text was written by a human who
          | was counting on at least some remote possibility that
          | another human might read it, it is a far more legitimate use
          | of language than just generating it with a machine.
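
       A wordfreq-style count is simple enough to sketch, and it makes
       the point above concrete: the same counter run over different
       media yields visibly different distributions. This is a minimal
       illustration only -- the tokenizer and the tiny corpora are
       invented for the example, not wordfreq's actual pipeline:

       ```python
       from collections import Counter
       import re

       def word_frequencies(texts):
           """Relative word frequencies over a corpus of raw text snippets."""
           counts = Counter()
           for text in texts:
               # crude tokenizer: lowercase runs of letters/apostrophes
               counts.update(re.findall(r"[a-z']+", text.lower()))
           total = sum(counts.values())
           return {word: n / total for word, n in counts.items()}

       # The same counter, run over different media, gives different
       # pictures of "the language" -- each medium has its own register.
       tweets = ["lol this thread is so good", "good vibes only lol"]
       emails = ["please find attached the report discussed in the meeting"]
       print(word_frequencies(tweets).get("lol", 0))  # common in tweets
       print(word_frequencies(emails).get("lol", 0))  # absent from office email
       ```

       Which is why a frequency list built from tweets, news, and books
       mixed together looks different from any one medium alone.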
        
         | doe_eyes wrote:
         | > I agree in general but the web was already polluted by
         | Google's unwritten SEO rules. Single-sentence paragraphs,
         | multiple keyword repetitions and focus on "indexability"
         | instead of readability, made the web a less than ideal source
         | for such analysis long before LLMs.
         | 
         | Blog spam was generally written by humans. While it sucked for
         | other reasons, it seemed fine for measuring basic word
         | frequencies in human-written text. The frequencies are probably
         | biased in _some_ ways, but this is true for most text. A
         | textbook on carburetor maintenance is going to have the word
         | "carburetor" at way above the baseline. As long as you have a
         | healthy mix of varied books, news articles, and blogs, you're
         | fine.
         | 
         | In contrast, LLM content is just a serpent eating its own tail
         | - you're trying to build a statistical model of word
         | distribution off the output of a (more sophisticated) model of
         | word distribution.
        
           | weinzierl wrote:
           | Isn't it the other way around?
           | 
            | SEO text, carefully tuned to tf-idf metrics and keyword-
            | stuffed right up to the empirically determined threshold
            | Google still allows, should have unnatural word
            | frequencies.
           | 
           | LLM content should just enhance and cement the status quo
           | word frequencies.
           | 
            | Outliers like the word _"delve"_ could just be sentinels,
           | carefully placed like trap streets on a map.
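
       The "unnatural word frequencies" point is easy to check
       mechanically: a stuffed page pushes one token's share far above
       anything seen in natural prose. A toy signal -- the example
       strings are invented here, and real detectors are of course far
       more sophisticated:

       ```python
       from collections import Counter

       def token_share(text, word):
           """Fraction of all tokens equal to `word` -- a crude stuffing signal."""
           tokens = text.lower().split()
           return Counter(tokens)[word] / len(tokens)

       natural = "our bakery makes fresh bread daily with local flour"
       stuffed = ("best bread best bread bakery bread near me bread "
                  "cheap bread fresh bread buy bread online")

       print(token_share(natural, "bread"))  # about 0.11
       print(token_share(stuffed, "bread"))  # about 0.44, far above any natural rate
       ```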
        
             | lbhdc wrote:
             | > LLM content should just enhance and cement the status quo
             | word frequencies.
             | 
             | TFA mentions this hasn't been the case.
        
               | flakiness wrote:
               | Would you mind dropping the link talking about this
               | point? (context: I'm a total outsider and have no idea
               | what TFA is.)
        
             | derefr wrote:
             | 1. People don't generally use the (big, whole-web-corpus-
             | trained) general-purpose LLM base-models to generate bot
             | slop for the web. Paying per API call to generate that kind
             | of stuff would be far too expensive; it'd be like paying
             | for eStamps to send spam email. Spambot developers use
             | smaller open-source models, trained on much smaller
             | corpuses, sized and quantized to generate text that's "just
             | good enough" to pass muster. This creates a sampling bias
             | in the word-associational "knowledge" the model is working
             | from when generating.
             | 
              | 2. Given how LLMs work, a prompt _is_ a bias -- they're
             | one-and-the-same. You can't ask an LLM to write you a
             | mystery novel without it somewhat adopting the writing
             | quirks common to the particular mystery novels it has
             | "read." Even the writing style you use _in_ your prompt
             | influences this bias. (It 's common advice among "AI
             | character" chatbot authors, to write the "character card"
             | describing a character, in the style that you want the
             | character speaking in, for exactly this reason.) Whatever
             | prompt the developer uses, is going to bias the bot away
             | from the statistical norm, toward the writing-style
             | elements that exist within whatever hypersphere of
             | association-space contains plausible completions of the
             | prompt.
             | 
             | 3. Bot authors do SEO too! They take the tf-idf metrics and
             | keyword stuffing, and turn it into _training data_ to
             | _fine-tune_ models, in effect creating  "automated SEO
             | experts" that write in the SEO-compatible style by default.
             | (And in so doing, they introduce unintentional further
             | bias, given that the SEO-optimized training dataset likely
             | is not an otherwise-perfect representative sampling of
             | writing style for the target language.)
        
             | mlsu wrote:
             | But you can already see it with Delve. Mistral uses "delve"
             | more than baseline, because it was trained on GPT.
             | 
             | So it's classic positive feedback. LLM uses delve more,
             | delve appears in training data more, LLM uses delve more...
             | 
             | Who knows what other semantic quirks are being amplified
             | like this. It could be something much more subtle, like
             | cadence or sentence structure. I already notice that GPT
             | has a "tone" and Claude has a "tone" and they're all sort
             | of "GPT-like." I've read comments online that stop and make
             | me question whether they're coming from a bot, just because
             | their word choice and structure echoes GPT. It will sink
             | into human writing too, since everyone is learning in high
             | school and college that the way you write is by asking GPT
             | for a first draft and then tweaking it (or not).
             | 
             | Unfortunately, I think human and machine generated text are
             | entirely miscible. There is no "baseline" outside the
             | machines, other than from pre-2022 text. Like pre-atomic
             | steel.
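
       The feedback loop described above can be simulated in a few
       lines. This is a deliberately crude toy model -- the 50/50
       corpus mix and the 20% overuse factor are made-up parameters,
       not measurements of any real model:

       ```python
       def amplify(baseline, generations, boost=1.2, model_share=0.5):
           """Toy 'delve' loop: each generation's model overuses a favored
           word by `boost`; its output then makes up `model_share` of the
           next training corpus, mixed with fixed human-written text in
           which the word appears at `baseline` frequency."""
           freq = baseline
           history = [freq]
           for _ in range(generations):
               model_freq = min(freq * boost, 1.0)  # model overshoots what it saw
               freq = (1 - model_share) * baseline + model_share * model_freq
               history.append(freq)
           return history

       # A word at a 0.1% baseline creeps upward generation after
       # generation, even though no single model changed its behavior.
       print(amplify(0.001, 5))
       ```

       With these parameters the frequency rises monotonically toward a
       fixed point above the human baseline; subtler quirks like cadence
       or sentence structure would drift the same way, just less
       measurably.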
        
               | taneq wrote:
               | > LLM uses delve more, delve appears in training data
               | more, LLM uses delve more...
               | 
               | Some day we may view this as the beginnings of machine
               | culture.
        
               | mlsu wrote:
               | Oh no, it's been here for quite a while. Our culture is
               | already heavily glued to the machine. The way we express
               | ourselves, the language we use, even our very self-
               | conception originates increasingly in online spaces.
               | 
               | Have you ever seen someone use their smartphone? They're
               | not "here," they are "there." Forming themselves in
               | cyberspace -- or being formed, by the machine.
        
         | pphysch wrote:
         | It's crazy to attribute the downfall of the web/search to
         | Google. What does Google have to do with all the genuine open
         | web content, Google's source of wealth, getting starved by
         | (increasingly) walled gardens like Facebook, Reddit, Discord?
         | 
         | I don't see how Google's SEO rules being written or unwritten
         | has any bearing. Spammers will always find a way.
        
         | sahmeepee wrote:
         | Prior to Google we had Altavista and in those days it was
         | incredibly common to find keywords spammed hundreds of times in
         | white text on a white background in the footer of a page. SEO
         | spam is not new, it's just different.
        
         | redbell wrote:
         | > ML/LLM is the second iteration of writing pollution. The
         | first was humans writing for corporate bots, not other humans.
         | 
         | Based on the process above, naturally, the third iteration then
         | is _LLMs writing for corporate bots, neither for humans nor for
         | other LLMs_.
        
       | hoseja wrote:
       | >"Now Twitter is gone anyway, its public APIs have shut down, and
       | the site has been replaced with an oligarch's plaything, a spam-
       | infested right-wing cesspool called X. Even if X made its raw
       | data feed available (which it doesn't), there would be no
       | valuable information to be found there.
       | 
       | >Reddit also stopped providing public data archives, and now they
       | sell their archives at a price that only OpenAI will pay.
       | 
       | >And given what's happening to the field, I don't blame them."
       | 
       | What beautiful doublethink.
        
         | mschuster91 wrote:
         | > What beautiful doublethink.
         | 
         | Given just how many AI bots scrape up everything they can,
         | oftentimes ignoring robots.txt or _any_ rate limits (there have
         | been a few complaint threads on HN about that), I can hardly
         | blame the operators of large online services just cutting off
         | data feeds.
         | 
         | Twitter however didn't stop their data feeds due to AI or
         | because they wanted money, they stopped providing them because
         | its new owner does everything he can to hinder researchers
         | specializing in propaganda campaigns or public scrutiny.
        
           | hluska wrote:
           | What was Reddit's excuse? They did roughly the same thing
           | (and have just as much garbage content).
           | 
           | In other words, why is it wrong for X but okay for Reddit? If
           | you ignore one individual's politics, the two services did
           | the same thing.
        
             | mschuster91 wrote:
             | Reddit shut their API access down only very recently, after
             | the AI craze went off. Twitter did so right after Musk took
             | over, way before Reddit, way before AI ever went nuts.
        
               | dotnet00 wrote:
               | X shut down API access in Feb 2023, Reddit shut theirs
               | down at the end of June of the same year. Just barely 6
               | months apart.
               | 
               | Furthermore, while X had also only announced this in
               | February, Reddit announced their API shutdown just 2
               | months later in April.
               | 
               | And, to further add to that, X was pretty upfront that
               | they think they have access to a large and powerful
               | dataset in X and didn't want to give it out for free.
               | Reddit used very similar wording when announcing their
               | changes.
        
       | DebtDeflation wrote:
       | Enshittification is accelerating. A good 70% of my Facebook feed
       | is now obviously AI generated images with AI generated text
       | blurbs that have nothing to do with the accompanying images
       | likely posted by overseas bot farms. I'm also noticing more and
       | more "books" on Amazon that are clearly AI generated and self
       | published.
        
         | janice1999 wrote:
         | It's okay. Amazon has limited authors to self publishing only 3
         | books per day (yes, really). That will surely solve the
         | problem.
        
           | wpietri wrote:
           | Hah! I'm trying to figure out the exact date that crossed
           | from "plausible line from a Stross or Sterling novel" [1] to
           | "of course they did".
           | 
           | [1] Or maybe Sheckley or Lem, now that I think about it.
        
           | Drakim wrote:
           | I read that as 3 books per year at first and thought to
           | myself that that was a rather harsh limitation but surely any
           | truly respectable author wouldn't be spitting out more than
           | that...
           | 
           | ...and then I realized you wrote 3 books a day. What the
           | hell.
        
         | Sohcahtoa82 wrote:
         | > A good 70% of my Facebook feed is now obviously AI generated
         | images with AI generated text blurbs that have nothing to do
         | with the accompanying images likely posted by overseas bot
         | farms.
         | 
         | This is a self-inflicted problem, IMO.
         | 
         | Do you just have shitty friends that share all that crap? Or
         | are you following shitty pages?
         | 
         | I use Facebook a decent amount, and I don't suffer from what
         | you're complaining about. Your feed is made of what you make
         | it. Unfollow the pages that make that crap. If you have friends
         | that share it, consider unfriending or at the very least,
         | unfollowing. Or just block the specific pages they're sharing
         | posts from.
        
       | aucisson_masque wrote:
       | Did we (the humans) somehow manage to pollute the internet so
       | much with AI that it's now barely usable?
       | 
       | In my opinion the internet can be considered the equivalent of a
       | natural environment like the earth: it's a space where people
       | share, meet, talk, etc.
       | 
       | I find it astonishing that after polluting our natural
       | environment we have now polluted the internet.
        
         | nkozyra wrote:
         | > Did we (the humans) somehow manage to pollute the internet
         | so much with AI that it's now barely usable
         | 
         | If we haven't already, we will be very soon. I'm sure there are
         | people working on this problem, but I think we're starting to
         | hit a very imminent feedback loop moment. Most of humanity's
         | recorded information is digitized, and non-human content is
         | now being generated at an incredible pace. We've injected a
         | whole lot of noise into our usable data.
         | 
         | I don't know if the answer is more human content (I'm doing my
         | part!) or novel generative content but this interim period is
         | going to cause some medium-term challenges.
         | 
         | I like to think the LLM more-tokens-equals-better era is fading
         | and we're getting into better _use_ of existing data, but there
         | 's a very real inflection point we're facing.
        
         | ashton314 wrote:
         | That's a nice analogy. Fortunately (un)real estate is easier to
         | manufacture out of thin air online. We have lost some valuable
         | spaces like Twitter and Reddit to some degree though.
        
         | surfingdino wrote:
         | Yes. Here are practical instructions on how to turn it into
         | even more of a cesspit:
         | https://www.youtube.com/watch?v=endHz0jo9Ck I think it's now a
         | law of nature that any new tech leads to SEO amplification. AI
         | has become the Degelman M34 Manure Spreader of the internet
         | https://degelman.com/products/manure-spreaders
        
         | coldpie wrote:
         | There are smaller, gated communities that are still very
         | valuable. You're posting in one. But yes, the open Internet is
         | basically useless now, thanks ultimately to advertising as a
         | business model.
        
           | nicholassmith wrote:
           | I've seen plenty of comments here that read like they've been
           | generated by an LLM, if this is a gated community we need a
           | better gate.
        
             | coldpie wrote:
             | Sure, there's bad actors everywhere, but there's really no
             | incentive to do it here so I don't think it's a _problem_
             | in the same way it is on the open internet, where slop is
             | actively rewarded.
        
             | globular-toast wrote:
             | It's hard to tell, though. People have been saying my
             | borderline autistic comments sound like GPT for years now.
        
           | whimsicalism wrote:
           | this is not a gated community at all
        
         | thwarted wrote:
         | Tragedy of the Commons Ruins Everything Around Me
        
         | left-struck wrote:
         | >We the humans
         | 
         | Nice try
         | 
         | If it's not clear, I'm joking.
        
         | mathnmusic wrote:
         | > Did we (the humans) somehow managed to pollute the internet
         | 
         | Corporations did that, not humans.
         | 
         | "few people recognize that we already share our world with
         | artificial creatures that participate as intelligent agents in
         | our society: corporations" - https://arxiv.org/abs/1204.4116
        
       | aucisson_masque wrote:
       | It could be used to spot LLM generated text.
       | 
       | compare the frequency of words to those used in human natural
       | writings and you spot the computer from the human.
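       | A minimal stdlib-only sketch of that idea (the marker word and
       | the tiny corpora are illustrative; wordfreq itself provides
       | word_frequency(word, lang) for a real human-text baseline):

```python
import re
from collections import Counter

def relative_freqs(text):
    """Relative frequency of each lowercased word in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def overuse_ratio(sample, baseline_freqs, word):
    """How many times more often `word` occurs in `sample` than in the
    baseline corpus; very large ratios for known LLM-favored words
    ("delve", "tapestry", ...) are weak evidence of machine text."""
    f_sample = relative_freqs(sample).get(word, 0.0)
    f_base = baseline_freqs.get(word, 0.0)
    if f_base == 0.0:
        return float("inf") if f_sample > 0 else 0.0
    return f_sample / f_base
```

       | Single-word ratios are noisy, so in practice you would have to
       | aggregate over many marker words and much larger samples.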
        
         | Lvl999Noob wrote:
         | It could be used to differentiate LLM text from pre-LLM human
         | text, maybe. The thing is, our AIs may not be very good at
         | learning, but our brains are. The more we use AI, the more we
         | integrate
         | LLMs and other tools into our life, the more their output will
         | influence us. I believe there was a study (or a few anecdotes)
         | where college papers checked for AI material were marked AI
         | written even though they were written by humans because the
         | students used AI during their studying and learned from it.
        
           | thfuran wrote:
           | >our AIs may not be very good at learning but our brains are
           | 
           | Brains aren't nearly as good at slightly adjusting the
           | statistical properties of a text corpus as computers are.
        
           | MPSimmons wrote:
           | You're exactly right. You only have to look at the prevalence
           | of the word "unalive" in real life contexts to find an
           | example.
        
           | left-struck wrote:
           | > The more we use AI, the more we integrate LLMs and other
           | tools into our life, the more their output will influence us
           | 
           | Hmm I don't disagree but I think it will be valuable skill
           | going forward to write text that doesn't read like it was
           | written by an LLM
           | 
           | This is an arms race that I'm not sure we can win though.
           | It's almost like a GAN.
        
         | TacticalCoder wrote:
         | > ... compare the frequency of words to those used in human
         | natural writings and you spot the computer from the human.
         | 
         | But that's a losing endeavor: if you can do that, you can
         | immediately ask your LLM to fix its output so that it passes
         | that test (and many others). It can introduce typos, make small
         | errors on purpose, and anything you can think of to make it
         | look human.
        
         | ithkuil wrote:
         | It may work for a short time, but after a while natural
         | language will evolve due to natural exposure to those new
         | words or word patterns, and even humans will write in ways
         | that, while different from the LLMs, will also be different
         | from the pre-LLM snapshot this dataset captured. It's already
         | the case that we wrote differently 20 years ago than 50 years
         | ago, and even more so 100 years ago, etc.
        
         | slashdave wrote:
         | Hardly. You are talking about a statistical test, which will
         | have rather large errors (since it is based on word
         | frequencies). Not to mention word frequencies will vary
         | depending on the type of text (essay, description,
         | advertisement, etc).
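         | Those error bars can be made concrete: treating each of the
         | n words in a sample as an independent draw, the standard
         | error of an observed frequency p is roughly sqrt(p(1-p)/n),
         | so rare words need enormous samples (numbers illustrative):

```python
import math

def freq_std_error(p, n):
    """Standard error of an observed word frequency, modeling each of
    the n tokens as an independent Bernoulli draw with probability p."""
    return math.sqrt(p * (1 - p) / n)

# A word occurring 50 times per million: in a 1,000-word sample the
# standard error is several times the frequency itself, so any
# per-document estimate of its rate is mostly noise.
```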
        
       | iamnotsure wrote:
       | "Multi-script languages
       | 
       | Two of the languages we support, Serbian and Chinese, are written
       | in multiple scripts. To avoid spurious differences in word
       | frequencies, we automatically transliterate the characters in
       | these languages when looking up their words.
       | 
       | Serbian text written in Cyrillic letters is automatically
       | converted to Latin letters, using standard Serbian
       | transliteration, when the requested language is sr or sh."
       | 
       | I'd support keeping both scripts (srpska ćirilica and the Latin
       | script), similar to hiragana and katakana in Japanese.
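       | The README's transliteration step can be pictured with a small
       | sketch (an illustrative subset of the standard mapping, not
       | wordfreq's actual implementation):

```python
# Illustrative subset of standard Serbian Cyrillic -> Latin
# transliteration; note that single Cyrillic letters can map to Latin
# digraphs (lj, nj, dž). wordfreq's own table is more complete.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "ђ": "đ", "е": "e", "ж": "ž", "з": "z", "и": "i",
    "ј": "j", "к": "k", "л": "l", "љ": "lj", "м": "m",
    "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "ћ": "ć", "у": "u", "ф": "f",
    "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def to_latin(word):
    """Transliterate Serbian Cyrillic to Latin, lowercasing first."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in word.lower())
```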
        
         | eqvinox wrote:
         | Why is this a HN comment on a thread about it ending due to AI
         | pollution?
        
       | dsign wrote:
       | Somewhat related: paper books from before 2020 could be a
       | valuable commodity in a decade or two, when the Internet will
       | be full
       | of slop and even contemporary paper books will be treated with
       | suspicion. And there will be human talking heads posing as the
       | authors of books written by very smart AIs. God, why are we doing
       | this????
        
         | rvnx wrote:
         | To support well-known "philanthropists" like Sam Altman or Mark
         | Zuckerberg that many consider as their heroes here.
        
         | user432678 wrote:
         | And I thought I had some kind of mental illness collecting all
         | those books, barely reading them. Need to do that more now.
        
           | globular-toast wrote:
           | Yes. I've always loved my books but now consider them my most
           | valuable possessions.
        
         | RomanAlexander wrote:
         | Or AI talking heads posing as the author of books written by
         | AIs. https://youtu.be/pAPGRGTqIgI (warning: state sponsored
         | disinformation AI)
        
       | weinzierl wrote:
       | _" I don't think anyone has reliable information about post-2021
       | language usage by humans."_
       | 
       | We've been past the tipping point when it comes to text for some
       | time, but for video I feel we are living through the watershed
       | moment right now.
       | 
       | Especially smaller children don't have a good intuition on what
       | is real and what is not. When I get asked if the person in a
       | video is real, I still feel pretty confident to answer but I get
       | less and less confident every day.
       | 
       | The technology is certainly there, but the majority of video
       | content is still not affected by it. I expect this to change very
       | soon.
        
         | olabyne wrote:
         | I never thought about that. Humans losing their ability to
         | detect AI content from reality ? It's frightening.
        
           | wraptile wrote:
            | I take issue with this statement, as content was never a clean
           | representation of human actions or even thought. It was
           | always driven by editorials, SEO, bot remixing and whatnot
           | that heavily influences how we produce content. One might
           | even argue that heightened content distrust is _good_ for our
           | society.
        
           | BiteCode_dev wrote:
           | It's worse because many humans don't know they are.
           | 
           | I see a lot of outrage around fake posts already. People want
           | to believe bad things from the other tribes.
           | 
           | And we are going to feed them with it, endlessly.
        
             | PhunkyPhil wrote:
             | Did you think the same thing when photoshop came out?
             | 
             | It's relatively trivial to photoshop misinformation in a
             | really powerful and undetectable way- but I don't see
             | (legitimate) instances of groundbreaking news over a fake
             | photo of the president or a CEO etc doing something
             | nefarious. Why is AI different just because it's
             | audio/video?
        
           | Sharlin wrote:
           | It's worse: they don't even care.
        
           | bunderbunder wrote:
           | This video's worth a watch if you want to get a sense of the
           | current state of things. Despite the (deliberately) clickbait
           | title, the video itself is pretty even-handed.
           | 
           | It's by Language Jones, a YouTube linguist. Title: "The AI
           | Apocalypse is Here"
           | 
           | https://youtu.be/XeQ-y5QFdB4
        
           | jerf wrote:
           | It's even worse than that. Most people have no idea how far
           | CGI has come, and how easily it is wielded even by a couple
           | of dedicated teens on their home computer, let alone people
           | with a vested interest in faking something for some financial
           | reason. People think they know what a "special effect" looks
           | like, and for the most part, people are _wrong_. They know
           | what CGI being used to create something obviously impossible,
           | like a dinosaur stomping through a city, looks like. They
           | have no idea how easy a lot of stuff is to fake already. AI
           | just adds to what is already there. Heck, to some extent it
           | has caused scammers to overreach, with things like obviously
           | fake Elon Musk videos on YouTube generated from (pure) AI and
           | text-to-speech... when with just a little bit more learning,
           | practice, and amounts of equipment completely reasonable for
           | one person to obtain, they could have done a _much_ better
           | fake of Elon Musk using special effects techniques rather
           | than shoveling text into an AI. The fact that  "shoveling
           | text into an AI" may in another few years itself generate
           | immaculate videos is more a bonus than a fundamental change
           | of capability.
           | 
           | Even what's free & open source in the special effects
           | community is astonishing lately.
        
             | bee_rider wrote:
             | Plus, movies continue (for some reason) to be made with
             | very bad and obvious CGI, leading people to believe all CGI
             | is easy to spot.
        
               | PhunkyPhil wrote:
               | This is a common survivorship bias fallacy since you only
               | notice the bad CGI.
               | 
               | I'm certain you'd be shocked to see the amount of CG
               | that's in some of your favorite movies made in the last
               | ~10-20 years that you didn't notice _because it 's
               | undetectable_
        
               | bee_rider wrote:
               | I won't be, I'm aware that lots of movies are mostly CGI.
               | 
               | But, yeah, I do think it is some kind of bias. Maybe not
               | survivorship, though... maybe it is a generalized sort of
               | Malmquist bias? Like the measurement is not skewed by the
               | tendency of movies with good CGI to go away. It is skewed
               | by the fact that bad CGI sticks out.
        
               | bee_rider wrote:
               | Actually wait I take it back, I mean, I was aware that
               | lots of Digital Touch-up happens in movie sets, more than
               | lots of people might expect, and more often that one
               | might expect even in mundane movies, but even still, this
               | comment's video was pretty shocking anyway.
               | 
               | https://news.ycombinator.com/item?id=41584276
        
               | xsmasher wrote:
               | This is an amazing demo reel of effects shots used in
               | "mundane" TV shows - comedies and police procedurals -
               | for faking locations.
               | 
               | https://www.youtube.com/watch?v=clnozSXyF4k
        
               | bee_rider wrote:
               | That is really something even as somebody who expects
               | lots of CGI touch-up in sets.
        
             | jhbadger wrote:
             | And you see things like the _The Lion King_ remake or its
             | upcoming prequel being called  "live action" because it
             | doesn't look like a cartoon like the original. But they
             | didn't film actual lions running around -- it's all CGI.
        
           | hn_throwaway_99 wrote:
           | I mean, it's already apparent to me that a lot of people
           | don't have a basic process in place to detect fact from
           | fiction. And it's definitely not always easy, but when I hear
           | some of the dumbest conspiracy theories known to man actually
           | get traction in our media, political figures, and society at
           | large, I just have to shake my head and laugh to keep from
           | crying. I'm constantly reminded of my favorite saying,
           | "people who believe in conspiracy theories have never been a
           | project manager."
        
           | bongodongobob wrote:
           | Oh they definitely are. A lot of people are now calling out
           | real photos as fake. I frequently get into stupid Instagram
           | political arguments and a lot of times they come back with
           | "yeah nice profile with all your AI art haha". It's all real
           | high quality photography. Honestly, I don't think the avg
           | person can tell anymore.
        
             | ziml77 wrote:
             | I've reached a point where even if my first reaction to a
               | photo is to be impressed, I then quickly think "oh but
               | what if this is AI?" and then immediately my excitement
               | for the
             | photo is ruined because it may not actually be a photo at
             | all.
        
               | bongodongobob wrote:
               | I don't get that perspective at all. Who cares what made
               | it.
        
           | Suppafly wrote:
           | >Humans losing their ability to detect AI content from
           | reality ? It's frightening.
           | 
           | And it already happened, and no one pushed back while it was
           | happening.
        
           | BeFlatXIII wrote:
           | It's a defense lawyer's dream.
        
         | frognumber wrote:
         | There are a series of challenges like:
         | 
         | https://www.nytimes.com/interactive/2024/09/09/technology/ai...
         | 
         | https://www.nytimes.com/interactive/2024/01/19/technology/ar...
         | 
         | These are a little bit unfair, in that we're comparing
         | handpicked examples, but I don't think many experts will pass a
         | test like this. Technology only moves forward (and seemingly,
         | at an accelerating pace).
         | 
         | What's a little shocking to me is the speed of progress.
         | Humanity is almost 3 million years old. Homo sapiens are
         | around 300,000 years old. Cities, agriculture, and
         | civilization are around 10,000. Metal is around 4000. The
         | industrial revolution is
         | 500. Democracy? 200. Computation? 50-100.
         | 
         | The revolutions shorten in time, seemingly exponentially.
         | 
         | Comparing the world of today to that of my childhood....
         | 
         | One revolution I'm still coming to grips with is automated
         | manufacturing. Going on aliexpress, so much stuff is basically
         | free. I bought a 5-port 120W (total) charger for less than 2
         | minutes of my time. It literally took less time to find it than
         | to earn the money to buy it.
         | 
         | I'm not quite sure where this is all headed.
        
           | knodi123 wrote:
           | 100W+ chargers are one of the products I prefer to spend a
           | little more on, so I get something from a company that knows
           | it can be sued if they make a product that burns down your
           | house or fries your phone.
           | 
           | Flashlights? Sure, bring on aliexpress. USB cables with pop-
           | off magnetically attached heads, no problem. But power
           | supplies? Welp, to each their own!
        
             | fph wrote:
             | And then you plug your cheap pop-off USB cable into the
             | expensive 100w charger?
        
               | knodi123 wrote:
               | Yeah, sure, what could possibly go wrong? :-P
               | 
               | But seriously, it's harder to accidentally make a USB
               | cable that fries your equipment. The more common failure
               | mode is it fails to work, or wears out too fast. Chargers
               | on the other hand, handle a lot of voltage, generate a
               | lot of heat, and output to sensitive equipment. More room
               | to mess up, and more room for mistakes to cause damage.
        
           | bee_rider wrote:
           | > One revolution I'm still coming to grips with is automated
           | manufacturing. Going on aliexpress, so much stuff is
           | basically free. I bought a 5-port 120W (total) charger for
           | less than 2 minutes of my time. It literally took less time
           | to find it than to earn the money to buy it.
           | 
           | Is there a big recent qualitative change here? Or is this a
           | continuation of manufacturing trends (also shocking, not
           | trying to minimize it all, just curious if there's some new
           | manufacturing tech I wasn't aware of).
           | 
           | For some reason, your comment got me thinking of a fully
           | automated system, like: you go to a website, pick and choose
           | charger capabilities (ports, does it have a battery, that
           | sort of stuff). Then an automated factory makes you a bespoke
           | device (software picks an appropriate shell, regulators,
           | etc). I bet we'll see it in our lifetimes at least.
        
           | homebrewer wrote:
           | > so much stuff is basically free
           | 
           | It really isn't. Have a look at daily median income
           | statistics for the rest of the planet:
           | 
           | https://ourworldindata.org/grapher/daily-median-
           | income?tab=t...
           | 
           |   $2.48  Eastern and Southern Africa (PIP)
           |   $2.78  Sub-Saharan Africa (PIP)
           |   $3.22  Western and Central Africa (PIP)
           |   $3.72  India (rural)
           |   $4.22  South Asia (PIP)
           |   $4.60  India (urban)
           |   $5.40  Indonesia (rural)
           |   $6.54  Indonesia (urban)
           |   $7.50  Middle East and North Africa (PIP)
           |   $8.05  China (rural)
           |  $10.00  East Asia and Pacific (PIP)
           |  $11.60  Latin America and the Caribbean (PIP)
           |  $12.52  China (urban)
           | 
           | And more generally:
           | 
           |   $7.75  World
           | 
           | I looked around on Ali, and the cheapest charger that doesn't
           | look too dangerous costs around five bucks. So it's roughly
           | equal to one day's income of at least half the population of
           | our planet.
        
           | jodrellblank wrote:
           | > " _The revolutions shorten in time, seemingly
           | exponentially._ "
           | 
           | The Technological Singularity -
           | https://en.wikipedia.org/wiki/Technological_singularity
        
           | MengerSponge wrote:
           | Democracy is 200? You're off by a full order of magnitude.
           | 
           | Progress isn't inevitable. It's possible for knowledge to be
           | lost and for civilization to regress.
        
         | bsder wrote:
         | > When I get asked if the person in a video is real, I still
         | feel pretty confident to answer
         | 
         | I don't share your confidence in identifying real people
         | anymore.
         | 
         | I often flag as "false-ish" a lot of things from genuinely real
         | people, but who have adopted the behaviors of the
         | TikTok/Insta/YouTube creator. Hell, my beard is grey and even I
         | poked fun at "YouTube Thumbnail Face" back in 2020 in a video
         | talk I gave. AI twigs into these "semi-human" behavioral
         | patterns super fast and super hard.
         | 
         | There is a video floating around with pairs of young ladies
         | with "This is real"/"This is not real" on signs. They could be
         | completely lying about both, and I really can't tell the
         | difference. All of them have behavioral patterns that seems a
         | little "off" but are consistent with the small number of
         | "influencer" videos I have exposure to.
        
         | apricot wrote:
         | > When I get asked if the person in a video is real, I still
         | feel pretty confident to answer
         | 
         | I don't. I mean, I can identify the bad ones, sure, but how do
         | I know I'm not getting fooled by the good ones?
        
           | weinzierl wrote:
           | That is very true, but for now we have a baseline of videos
           | that we either remember or that we remember key details of,
           | like the persons in the video. I'm pretty sure if I watch
           | _The Primeagen_ or _Tom Scott_ today, that they are real. Ask
           | me in year, I might not be so sure anymore.
        
       | donatj wrote:
       | I hear this complaint often but in reality I have encountered
       | fairly little content in my day to day that has felt fully AI
       | generated? AI assisted sure, but is that a problem if a human is
       | in the mix, curating?
       | 
       | I certainly have not encountered enough straight drivel where I
       | would think it would have a significant effect on overall word
       | statistics.
       | 
       | I suspect there may be some over-identification of AI content
       | happening, a sort of Baader-Meinhof effect cognitive bias. People
       | have their eye out for it and suddenly everything that reads a
       | little weird logically "must be AI generated" and isn't just a
       | bad human writer.
       | 
       | Maybe I am biased: about a decade ago I worked for an SEO
       | company with a team of copywriters who pumped out mountains of
       | the most inane keyword-packed text, designed for literally no
       | one but Google to read. It would rot your brain if you tried to
       | read it, and it was written by hand by a team of human beings.
       | This existed WELL before generative AI.
        
         | pavel_lishin wrote:
         | > _I hear this complaint often but in reality I have
         | encountered fairly little content in my day to day that has
         | felt fully AI generated?_
         | 
         | How confident are you in this assessment?
         | 
         | > _straight drivel_
         | 
         | We're past the point where what AI generates is "straight
         | drivel"; every minute, it's harder to distinguish AI output
         | from actual output unless you're approaching expertise in the
         | subject being written about.
         | 
         | > _a team of copywriters who pumped out mountains of the most
         | inane keyword-packed text designed for literally no one but
         | Google to read._
         | 
         | And now a machine can generate the same amount of output in 30
         | seconds. Scale matters.
        
           | PhunkyPhil wrote:
           | > every minute, it's harder to distinguish AI output from
           | actual output unless you're approaching expertise in the
           | subject being written about.
           | 
           | So, then what _really_ is the problem with just including
           | LLM-generated text in wordfreq?
           | 
           | If quirky word distributions will remain a "problem", then
           | I'd bet that human distributions for those words will follow
           | shortly after (people are _very_ quick to change their speech
           | based on their environment, it 's why language can change so
           | quickly).
           | 
           | Why not just own the fact that LLMs are going to be affecting
           | our speech?
        
       | cyberes wrote:
       | The guy sounds intolerable and annoying to listen to.
        
       | floppiplopp wrote:
       | I really like the fact that the conventional user-content
       | internet is becoming willfully polluted and ever more useless
       | through the incessant influx of "ai"-garbage. At some point all
       | of this will become so awful that nerds will create new and
       | quiet corners of real people and real information, while the
       | idiot rabble has to use new and expensive tools peddled by
       | scammy tech bros to handle the stench of automated manure that
       | flows out of stagnant LLMs digesting themselves.
        
         | biofox wrote:
         | Most of the time, HN is that quiet corner. I just hope it stays
         | that way.
        
         | JohnFen wrote:
         | > At some point all of this will become so awful that nerds
         | will create new and quiet corners of real people and real
         | information
         | 
         | It's already happening. There is a growing number of groups
         | that are forming their own "private internets", separated
         | from the internet-at-large, precisely because the internet at
         | large is becoming increasingly useless for a whole lot of
         | valuable things.
        
       | PeterStuer wrote:
       | Intuitively I feel like word frequency would be one of the things
       | least impacted by LLM output, no?
        
         | baq wrote:
         | 'delve' is given as an example right there in TFA.
        
           | PeterStuer wrote:
           | Yes, but the material presented makes no distinction
           | between potential organic growth of 'delve' vs. LLM-induced
           | use. They just note that even though 'delve' was already on
           | the rise, the word gained more popularity in 23-24, at the
           | same time that ChatGPT rose. Word adoption is certainly not
           | a linear phenomenon. And as the author states, 'I don't
           | think anyone has reliable information about post-2021
           | language usage by humans'
           | 
           | So I would still expect noun-phrase frequency in LLM
           | output to reflect noun-phrase frequency in the training
           | data in a similar context (disregarding enforced bias
           | induced through RLHF and other tuning, for the moment).
           | 
           | I'm sure there will be cross-fertilization from LLM to
           | human and back, but I'm not seeing the data yet that the
           | influence on word frequency is that pronounced.
           | 
           | The author seems to have some other objections to the rise
           | of LLMs, which I fully understand.
        
             | beepbooptheory wrote:
             | Even granting that we can disregard a really huge factor
             | here, which I'm not sure we really can, one can not know
             | beforehand how the clustering of the vocabulary is going to
             | go pre-training, and its speculated that both at the center
             | and at the edges of clusters we get random particularities.
             | Hence the "solidgoldmagikarp" phenomenon and many others.
        
             | QuiDortDine wrote:
             | The fact that making this distinction is impossible is
             | reason enough to stop.
        
           | whimsicalism wrote:
           | there is almost certainly organic growth as well, as more
           | people in Nigeria and other SSA countries have gained very
           | good internet penetration in recent years
        
         | Jcampuzano2 wrote:
         | It'd be in fact quite the opposite. There comes a turning point
         | where the majority of language usage would actually be written
         | by AI, at which point we'd no longer be analysing the word
         | frequency/usage by actual humans and so it wouldn't be
         | representative of how humans actually communicate.
         | 
         | Or potentially even more dystopian would be that AI slop would
         | be dictating/driving human communication going forward.
        
         | joshdavham wrote:
         | Think of an LLM as a person on the internet. Just like everyone
         | else, they have their own vocabulary and preferred way of
         | talking which means they'll use some words more than others.
         | Now imagine we duplicate this hypothetical person an incredible
         | amount of times and have their clones chatter on the internet
         | frequently. 'Certainly' this would have an effect.
        
       | joshdavham wrote:
       | If the language you're processing was generated by AI, it's no
       | longer NLP, it's ALP.
        
       | ilaksh wrote:
       | Reading through this entire thread, I suspect that somehow
       | generative AI actually became a political issue. Polarized
       | politics is like a vortex sucking all kinds of unrelated things
       | in.
       | 
       | In case that doesn't get my comment completely buried, I will go
       | ahead and say honestly that even though "AI slop" and paywalled
       | content is a problem, I don't think that generative AI in itself
       | is a negative at all. And I also think that part of this person's
       | reaction is that LLMs have made previous NLP techniques, such as
       | those based on simple usage counts etc., largely irrelevant.
       | 
       | What was/is wordfreq used for, and can those tasks not actually
       | be done more effectively with a cutting edge language model of
       | some sort these days? Maybe even a really small one for some
       | things.
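(For reference, wordfreq's lookups are `word_frequency(word, lang)` and
`zipf_frequency(word, lang)`. A self-contained toy sketch of the idea
behind those lookups, using a Counter over a tiny sample corpus rather
than wordfreq's actual data or implementation:)

```python
import math
from collections import Counter

# Toy corpus standing in for the multi-source corpora wordfreq
# aggregated; the real library ships precomputed frequency tables.
corpus = (
    "the cat sat on the mat and the dog sat on the log "
    "the cat and the dog are friends"
).split()

counts = Counter(corpus)
total = sum(counts.values())

def word_frequency(word):
    """Fraction of corpus tokens equal to `word` (0.0 if unseen)."""
    return counts.get(word, 0) / total

def zipf_frequency(word):
    """Zipf scale as used by wordfreq: log10 of frequency per billion
    words, so very common words land around 7 and rare ones near 1."""
    f = word_frequency(word)
    return round(math.log10(f * 1e9), 2) if f else 0.0

# Common words score higher than rare ones; unseen words get 0.0.
assert zipf_frequency("the") > zipf_frequency("cat") > zipf_frequency("zebra")
```

Tasks like ranking spell-check candidates or flagging improbable word
choices only need this kind of lookup, which is why corpus pollution
matters: the table is only as trustworthy as the text it was counted
from.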
        
         | ecshafer wrote:
         | Generative AI is inherently a political issue, its not
         | surprising at all.
         | 
         | There is the case of what is "truth". As soon as you start to
         | ensure some quality of truth to what is generated, that is
         | political.
         | 
         | As soon as generative AI has the capability to take someone's
         | job, that is political.
         | 
         | The instant AI can make someone money, it is political.
         | 
         | When AI is trained on something that someone has created, and
         | now they can generate something similar, it is political.
        
           | ilaksh wrote:
           | Then .. everything is political?
        
             | phito wrote:
             | It is. Unfortunately.
        
             | commodoreboxer wrote:
             | Everything involving any kind of coordination, cooperation,
             | competition, and/or communication between two or more
             | people involves politics by its very nature. LLMs are
             | communication tools. You can't divorce politics from their
             | use when one person is generating text for another person
             | to read.
        
             | JohnFen wrote:
             | "Just because you do not take an interest in politics
             | doesn't mean politics won't take an interest in you." --
             | Pericles
        
           | whimsicalism wrote:
           | > As soon as generative AI has the capability to take
           | someone's job, that is political.
           | 
           | What is political is people enshrining themselves in
           | chokepoints and demanding a toll for passing through or
           | getting anything done. That is what you do when you make a
           | certain job politically 'untakable'.
           | 
           | People who espouse that the 'personal is political' risk
           | making the definition of politics so broad that it is
           | useless.
        
         | rincebrain wrote:
         | The simplest example that comes to mind of something frequency
         | analysis might be useful for would be if you had simple
         | ciphertext where you knew that the characters probably 1:1
         | mapped, but you didn't know anything about how.
         | 
         | It could also be useful for guessing whether someone might have
         | been trying to do some kind of steganographic or additional
         | encoding in their work, by telling you how abnormal compared to
         | how many people write it is that someone happened to choose a
         | very unusual construction in their work, or whether it's
         | unlikely that two people chose the same unusual construction by
         | coincidence or plagiarism.
         | 
         | You might also find statistical models interesting for things
         | like noticing patterns in people for whom English or others are
         | not their first language, and when they choose different
         | constructions more often than speakers for whom it was their
         | first language.
         | 
         | I'm not saying you can't use an LLM to do some or all of these,
         | but they also have something of a scalar attached to them of
         | how unusual the conclusion is - e.g. "I have never seen this
         | construction of words in 50 million lines of text" versus "Yes,
         | that's natural.", which can be useful for trying to inform how
         | close to the noise floor the answer is, even ignoring the
         | prospect of hallucinations.
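The first use mentioned above (a 1:1-mapped ciphertext) is easy to
sketch. A hedged illustration, narrowed to a plain Caesar shift rather
than a general substitution, with an assumed table of approximate
English letter percentages:

```python
from collections import Counter

# Approximate English letter frequencies in percent (assumed values
# from a standard reference table; close enough for scoring).
ENGLISH_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
    's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
    'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
    'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
    'q': 0.10, 'z': 0.07,
}

def shift(text, k):
    """Caesar-shift lowercase letters by k; pass everything else through."""
    return ''.join(
        chr((ord(c) - 97 + k) % 26 + 97) if c.islower() else c
        for c in text
    )

def crack_caesar(ciphertext):
    """Try all 26 shifts and keep the candidate whose letter counts are
    closest (by chi-squared distance) to the English distribution."""
    def chi2(candidate):
        counts = Counter(c for c in candidate if c.islower())
        n = sum(counts.values())
        return sum(
            (counts.get(letter, 0) - n * pct / 100) ** 2 / (n * pct / 100)
            for letter, pct in ENGLISH_FREQ.items()
        )
    return min((shift(ciphertext, k) for k in range(26)), key=chi2)

msg = "frequency analysis breaks simple substitution ciphers"
assert crack_caesar(shift(msg, 7)) == msg
```

A general monoalphabetic substitution needs more than single-letter
counts (bigram statistics, hill-climbing), but the principle is the
same: score candidates against frequencies gathered from real human
text, which is exactly the data wordfreq-style resources provide.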
        
         | whimsicalism wrote:
         | Yes, it's become extremely politicized and it's very tiresome.
         | Tech in general, to be frank. Pray that your field of interest
         | never gets covered in the NYT.
        
       | eadmund wrote:
       | > the Web at large is full of slop generated by large language
       | models, written by no one to communicate nothing
       | 
       | That's neither fair nor accurate. That slop is ultimately
       | generated by the humans who run those models; they are attempting
       | (perhaps poorly) to communicate _something_.
       | 
       | > two companies that I already despise
       | 
       | Life's too short to go through it hating others.
       | 
       | > it's very likely because they are creating a plagiarism machine
       | that will claim your words as its own
       | 
       | That begs the question. Plagiarism has a particular definition.
       | It is not at all clear that a machine learning from text should
       | be treated any differently from a human being learning from text:
       | i.e., duplicating exact phrases or failing to credit ideas may in
       | some circumstances be plagiarism, but no-one is required to
       | append a statement crediting every text he has ever read to every
       | document he ever writes.
       | 
       | Credits: every document I have ever read _grin_
        
         | weevil wrote:
         | I feel like you're giving certain entities too much credit
         | there. Yes text is generated to do _something_, but it may not
         | be to communicate in good-faith; it could be keyword-dense
         | gibberish designed to attract unsuspecting search engine users
         | for click revenue, or generate political misinformation
         | disseminated to a network of independent-looking "news"
         | websites, or pump certain areas with so much noise and nonsense
         | information that those spaces cannot sustain any kind of
         | meaningful human conversation.
         | 
         | The issue with generative 'AI' isn't that they generate text,
         | it's that they can (and are) used to generate high-volume low-
         | cost nonsense at a scale no human could ever achieve without
         | them.
         | 
         | > Life's too short to go through it hating others
         | 
         | Only when they don't deserve it. I have my doubts about Google,
         | but I've no love for OpenAI.
         | 
         | > Plagiarism has a particular definition ... no-one is required
         | to append a statement crediting every text he has ever read
         | 
         | Of course they aren't, because we rightly treat humans learning
         | to communicate differently from training computer code to
         | predict words in a sentence and pass it off as natural language
         | with intent behind it. Musicians usually pay royalties to those
         | whose songs they sample, but authors don't pay royalties to
         | other authors whose work inspired them to construct their own
         | stories maybe using similar concepts. There's a line there
         | somewhere; falsely equating plagiarism and inspiration (or
         | natural language learning in humans) misses the point.
        
         | miningape wrote:
         | This is just the "guns don't shoot people, people do"
         | argument, except in this case we quite literally have a
         | massive upside
         | incentive to remove people from the process entirely (i.e.
         | websites that automatically generate new content every day) -
         | so I don't buy it.
         | 
         | This kind of AI slop is quite literally written by no one (an
         | algorithm pushed it out), and it doesn't communicate anything
         | since communication first requires some level of understanding
         | of the source material - and LLMs are just predicting the
         | likely next token without understanding. I would also extend
         | this to AI slop written by someone with a limited domain
         | understanding, they themselves have nothing new to offer, nor
         | the expertise or experience to ensure the AI is producing
         | valuable content.
         | 
         | I would go even further and say it's "read by no one" - people
         | are sick and tired of reading the next AI slop article on
         | google and add stuff like "reddit" to the end of their queries
         | to limit the amount of garbage they get.
         | 
         | Sure there are people using LLMs to enhance their research, but
         | a vast, vast majority are using it to create slop that hits a
         | word limit.
        
         | slashdave wrote:
         | > It is not at all clear that a machine learning from text
         | should be treated any differently from a human being learning
         | from text
         | 
         | Given that LLMs and human creativity work on fundamentally
         | different principles, there is every reason to believe there is
         | a difference.
        
       | dweinus wrote:
       | > Now the Web at large is full of slop generated by large
       | language models, written by no one to communicate nothing.
       | 
       | Fair and accurate. In the best cases the person running the model
       | didn't write this stuff and word salad doesn't communicate
       | whatever they meant to say. In many cases though, content is
       | simply pumped out for SEO with no intention of being valuable to
       | anyone.
        
         | andrethegiant wrote:
         | That sentence stood out to me too, very powerful. Felt it right
         | in the feels.
        
       | karaterobot wrote:
       | I guess a manageable, still-useful alternative would be to curate
       | a whitelist of sources that don't use AI, and without making that
       | list public, derive the word frequencies from only those sources.
       | How to compile that list is left as an exercise for the reader.
       | The result would not be as accurate as a broad sample of the web,
       | but in a world where it's impossible to trust a broad sample of
       | the web, it's the option you are left with. And I have no reason
       | doubt that it could be done at a useful scale.
       | 
       | I'm sure this has occurred to them already. Apart from the near-
       | impossibility of continuing the task in the same way they've
       | always done it, it seems like the other reason they're not
       | updating wordfreq is to stick a thumb in the eye of OpenAI and
       | Google. While I appreciate the sentiment, I recognize that those
       | corporations' eyes will never be sufficiently thumbed to satisfy
       | anybody, so I would not let that anger change the course of my
       | life's work, personally.
        
         | WaitWaitWha wrote:
         | > curate a whitelist of sources that don't use AI,
         | 
         | I like this.
         | 
         | Maybe even take it a step further - have a badge on the source
         | that is both human and machine visible to indicate that the
         | content is not AI generated.
        
       | antirez wrote:
       | Ok so the post author is an AI skeptic and this is his
       | retaliation, likely because his work is affected. I believe
       | governments should address the problem with welfare, but being
       | against technical advances is always being on the wrong side of
       | history.
        
         | exo-pla-net wrote:
         | This is a tech site, where >50% of us are programmers who have
         | achieved greater productivity thanks to LLM advances.
         | 
         | And yet we're filled to the gills with Luddite sentiments and
         | AI content fearmongering.
         | 
         | Imagine the hysteria and the skull-vibrating noise of the non-
         | HN rabble when they come to understand where all of this is
         | going. They're going to do their darndest to stop us from
         | achieving post-economy.
        
           | antirez wrote:
           | I fail to see the difference. Actually, programming was one
           | of the first fields where LLMs showed proficiency. The
           | helper nature of LLMs is true in all fields so far; in the
           | future this may change. I believe that, for instance in the
           | case of journalism, the issue was already there: three
           | euros per post, written cluelessly by humans.
           | 
           | Anyway, in the long run AI will kill tons of jobs,
           | regardless of blog posts like that. The true key is
           | government assistance.
        
             | exo-pla-net wrote:
             | I don't know what difference you are referring to. I was
             | agreeing with you.
             | 
             | And also agreed: many trumpet the merits of "unassisted"
             | human output. However, they're suffering from ancestor
             | veneration: human writing has always been a vast mine of
             | worthless rock (slop) with a few gems of high-IQ analysis
             | hidden here and there.
             | 
             | For instance, upon the invention of the printing press, it
             | was immediately and predominantly used for promulgating
             | religious tracts.
             | 
             | And even when you got to Newton, who created for us some
             | valuable gems, much of his output was nevertheless deranged
             | and worthless. [1]
             | 
             | It follows that, whether we're a human or an LLM, if we
             | achieve factual grounding and the capacity to reason, we
             | achieve it _despite_ the bulk of the information we ingest.
             | Filtering out sludge is part of the required skillset for
             | intellectual growth, and LLM slop qualitatively changes
             | nothing.
             | 
             | [1] https://www.newtonproject.ox.ac.uk/view/texts/diplomati
             | c/THE...
        
               | antirez wrote:
               | Sorry I didn't imply we didn't agree but that programmers
               | were and are going to be impacted as much as writers for
               | instance, yet I see an environment where AI is generally
               | more accepted as a tool.
               | 
               | About your last point sometimes I think that in the
               | future there will be models specifically distilling the
               | climax of selected thinkers, so that not only their
               | production will be preserved but maybe something more
               | that is only implicitly contained in their output.
        
               | exo-pla-net wrote:
               | That's a good point: the greatest value that we can glean
               | from one another is likely not epistemological "facts
               | about the world", nor is it even the _predictive models_
               | seen in science and higher brow social commentary, but in
               | _patterns of thinking_. That alone is the infinite
               | wellspring for achieving greater understanding, whether
               | formalized with the scientific method or whether more
               | loosely leveraged to succeed with a business endeavor.
               | 
               | Anecdotally, I met success in prompting GPT-3 to "mimic
               | Stephen Pinker" when solving logical puzzles. Puzzles
               | that it would initially fail, it would succeed attempting
               | to mimic his language. GPT-3 seemed to have grokked the
               | pattern of how Stephen Pinker thinks through problems,
               | and it could leverage those patterns to improve its own
               | reasoning. OpenAI _o1_ needs no such assistance, and I
               | expect that _o2_ will fully supplant humans with its
               | ability to reason.
               | 
               | It follows that all that we have to offer with our
               | brightest minds will be exhausted, and we will be
               | eclipsed in every conceivable way by our creation. It
               | will mark the end of the Anthropocene; something that
               | likely exceeds the headiest of Nick Bostrom speculations
               | will take its place.
               | 
               | It seems that this is coming in 2026 if not sooner, and
               | Alignment is the only thing that ought occupy our minds:
               | the question of whether we're creating something that
               | will save us from ourselves, or whether all that we've
               | built will culminate in something gross and final.
               | 
               | Looking around myself, however, I see impassioned
               | "discourse" about immigration. The merits of DEI.
               | Patriotism. Transgenderism. Religion. Copyright. Vast
               | herds of dinosaurs preying upon one another, giving only
               | idle attention to the glowing object in the sky. Is it an
               | asteroid? Is it a UFO that is coming down to provide
               | dinosaur healthcare? Nope, not even that level of thought
               | is mustered. With 8 billion people on the planet,
               | _Utopia_ by Nick Bostrom hasn't even mustered 100
               | reviews on Amazon. On the advent of the defining moment
               | of the universe itself, when virtually all that is
               | imaginable is unlocked for us, our species' heads remain
               | buried in the mud, gnawing at one another's filthy toes,
               | and I'm alienated and disgusted.
               | 
               | The only glints of beauty I see in my fellow man are in
               | those with minds which exceed a certain IQ threshold and
               | cognitive flexibility, as well as in lesser minds which
               | exhibit gentleness and humility. There is beauty there,
               | and there is beauty in the staggering possibility of the
               | universe itself. The rest is at best entomology, and I
               | won't mourn its passing.
        
       | greentxt wrote:
       | I think this person has too high a view of pre-2021, probably for
       | ego reasons. In fact, their attitude seems very ego driven. AI
       | didn't just occur in 2021. Nobody knows how much text was machine
       | generated prior to 2021, it was much harder if not impossible to
       | detect. If anything, it's probably easier now, since people are
       | all using the same AIs, which use words like 'delve' so much
       | that it becomes obvious.
        
         | croes wrote:
         | >AI didn't just occur in 2021. Nobody knows how much text was
         | machine generated prior to 2021
         | 
         | But we do know that now it's a lot more, with a big LOT.
        
           | greentxt wrote:
           | I assume you are correct but how can we know rather than
           | assume? I am not sure we can, so why get worked up about
           | "internet died in 2021" when many would claim with similar
           | conviction that it's been dead since 2012, or 2007, or ...
        
             | ClassyJacket wrote:
             | You are making a claim that somehow someone was sitting on
             | something as powerful as ChatGPT, long before ChatGPT,
             | _and_ that it was in widespread use, secretly, without even
             | a single leak by anyone at any point. That 's not
             | plausible.
        
       | grogenaut wrote:
       | Is 2023 going to be for data what the Trinity test was for
       | steel? E.g. post-2023, all data now contains trace amounts of
       | AI?
        
         | swyx wrote:
         | yes, unfortunately https://www.latent.space/p/nov-2023
        
       | aftbit wrote:
       | Wow there is so much vitriol both in this post and in the
       | comments here. I understand that there are many ethical and
       | practical problems with generative AI, but when did we stop being
       | hopeful and start seeing the darkest side of everything? Is it
       | just that the average HN reader is now past the age where a new
       | technological development is an exciting opportunity and on to
       | the age where it is a threat? Remember, the Luddites were not
       | opposed to looms, they just wanted to own them.
        
         | JohnFen wrote:
         | > when did we stop being hopeful and start seeing the darkest
         | side of everything?
         | 
         | I think a decade or two ago, when most of the new tech being
         | introduced (at least by our industry) started being
         | unmistakably abusive and dehumanizing. When the recent past
         | shows a strong trend, it's not unreasonable to expect that the
         | near future will continue that trend. Particularly when it
         | makes companies money.
        
         | slashdave wrote:
         | Give us examples of generative AI in challenging applications
         | (biology, medicine, physical sciences), and you'll get a lot of
         | optimism. The text LLM stuff is the brute force application of
         | the same class of statistical modeling. It's commercial, and
         | boring.
        
         | aryonoco wrote:
         | When?
         | 
         | For some of us, it was 1993, the Eternal September.
         | 
         | For some of us, it was when Aaron Swartz left us.
         | 
         | For some of us, it was when Google killed Google Reader (in
         | hindsight, the turning point of Google becoming evil).
         | 
         | For some others, like the author of this post, it's when
         | twitter and reddit closed their previously open APIs.
        
       | jll29 wrote:
       | I regret that the situation has left the OP feeling discouraged
       | about the NLP community, to which I belong, and I just want to
       | say "we're not all like that", even though it is a trend and
       | we're close to peak hype (slightly past it, even?).
       | 
       | The complaint about pollution of the Web with artificial content
       | is timely, and it's not even the first time due to spam farms
       | intended to game PageRank, among other nonsense. This may just
       | mean there is new value in hand-curated lists of high-quality Web
       | sites (some people use the term "small Web").
       | 
       | Each generation of the Web needs techniques to overcome its
       | particular generation of adversarial mechanisms, and the current
       | Web stage is no exception.
       | 
       | When Eric Arthur Blair wrote 1984 (under his pen name "George
       | Orwell"), he anticipated people consuming auto-generated content
       | to keep the masses away from critical thinking. This is now
       | happening (he even anticipated auto-generated porn in the novel),
       | but the technologies criticized can also be used for good, and
       | that is what I try to do in my NLP research team. Good _will_
       | prevail in the end.
        
         | solardev wrote:
         | Have "good" small webs EVER prevailed?
         | 
         | Every content system seems to get polluted by noise once it
         | hits mainstream usage: IRC, Usenet, reddit, Facebook,
         | geocities, Yahoo, webrings, etc. Once-small curated selections
         | eventually grow big enough to become victims of their own
         | successes and get taken over by spam.
         | 
         | It's always an arms race of quality vs quantity, and eventually
         | the curators can't keep up with the sheer volume anymore.
        
           | squigz wrote:
           | > Have "good" small webs EVER prevailed?
           | 
           | You ask on HN, one of the highest quality sites I've ever
           | visited in any age of the Internet.
           | 
           | IRC is still alive and well among pretty much the same
           | audience as always. I'm not sure it's fair to compare that
           | with the others.
        
             | solardev wrote:
             | Well, niche forums are kinda different when they manage to
             | stay small and niche. Not just HN but car forums, LED
             | forums, etc.
             | 
             | But if they ever include other topics, they risk becoming
             | more mainstream and noisy. Even within adjacent fields
             | (like the various Stacks) it gets pretty bad.
             | 
             | Maybe the trick is to stay within a single small sphere
             | then and not become a general purpose discussion site? And
             | to have a low enough volume of submissions where good
             | moderation is still possible? (Thank you dang and HN staff)
        
               | rovr138 wrote:
               | Yes. That's the small web.
               | 
               | A good example of the generalization problem you discuss
               | is reddit.
               | 
               | You have to unsubscribe from all the defaults and find
               | the small, niche, communities about specific topics. If
               | not, it's the same stuff, reposted, over and over, across
               | different subs and/or social sites.
        
               | squigz wrote:
               | I'm not entirely sure it's about content (while HN is
               | certainly tech-focused, politics, health, philosophy all
               | come up with regularity) or even content moderation,
               | although they both certainly play a part (particularly
               | the moderation around here. Thanks, staff!)
               | 
               | I wonder if it is more to do with the community itself.
               | HN users tend to have very intelligent discussions on
               | pretty much anything, and discourages shitty, unnuanced,
               | one-line takes. This, coupled with a healthy moderation
               | system, makes it hard for the lower quality discussion to
               | break in and override the good stuff.
        
               | nick3443 wrote:
               | The car headlight forums seem to expose the weakness of
               | small web though, in that a lot of the forums that show
               | up in search are "sponsored" by one or two major brands
               | and any open discussion or validation of off-brand
               | solutions, AliExpress parts, etc are quickly shunned or
               | banned.
        
             | bongodongobob wrote:
             | It's high quality when the content is within HN's bubble.
             | Anything related to health, politics, or Microsoft is full
             | of misinformation, ignorance, and garbage like any other
             | site. The Microsoft discussions in particular are extremely
             | low quality.
        
               | squigz wrote:
               | I disagree. Even politics spurs intelligent, nuanced
               | discussion here on HN.
               | 
               | And to hold up discussions about MS as an example of
               | 'extremely' low quality discussion is, ah, interesting.
               | Do you have any recent examples of such discussions?
        
               | bongodongobob wrote:
               | I hide every single article about MS because it's filled
               | with all the neckbeardy tropes about their products being
               | garbage spyware, switch to Linux, they're stealing your
               | data, the OS is trash etc. It's comments from people who
               | have never managed large scale MS based environments
               | comparing their Windows Home to the other 90% of the
               | business ecosystem that has nothing to do with home users
               | or MS's main cash cow, businesses, Azure/Entra and M365.
               | I'm done wasting my breath on MS here.
        
               | squigz wrote:
               | This is a funny comment in a thread about low quality
               | discussion.
        
               | bongodongobob wrote:
               | I'm describing why I no longer engage with MS related
               | posts.
        
               | skissane wrote:
               | I've posted four comments here on Microsoft in the last
               | 30 days:
               | 
               | https://news.ycombinator.com/item?id=41499957
               | 
               | https://news.ycombinator.com/item?id=41408124
               | 
               | https://news.ycombinator.com/item?id=41335757
               | 
               | https://news.ycombinator.com/item?id=41327379
               | 
               | None of which fit your description of "neckbeardy tropes
               | about their products being garbage spyware, switch to
               | Linux, they're stealing your data, the OS is trash".
               | 
               | And it isn't just me, because if you look at those
               | comments, I was talking to other people who weren't
               | invoking those "neckbeardy tropes" either
        
               | vundercind wrote:
               | Politics and philosophy discussions here are intelligent
               | in that most of the commenters aren't dumb. They tend to
               | be entirely uneducated _and resistant to the educated_.
        
               | Retric wrote:
                | IMO HN actually scores quite highly in terms of
                | health/politics and so forth, because both mainstream
                | and fringe ideas get shown and get pushback.
               | 
                | A vaping discussion brought up that the glycerin used
                | was safe, being the same thing used in smoke machines,
                | and someone else brought up a study showing that smoke
                | machines are an occasional safety issue. Nowhere near
                | every discussion goes that well, but stick around and
                | you'll see in-depth discussion.
               | 
               | Go to a public health website by comparison and you'll
                | see warnings without context and a possibly positive
                | spin compared to smoking.
               | https://www.cdc.gov/tobacco/e-cigarettes/index.html I
               | suspect most people get basically nothing from looking at
               | it.
        
               | chimeracoder wrote:
                | > IMO HN actually scores quite highly in terms of
                | health/politics and so forth, because both mainstream
                | and fringe ideas get shown and get pushback.
               | 
               | As someone with domain expertise here, I wholeheartedly
               | disagree. HN is very bad at percolating accurate
               | information about topics outside its wheelhouse, like
               | clinical medicine, public health, or the natural
               | sciences. It is also, simultaneously, extremely prone to
               | overestimating its own collective competency at
               | understanding technical knowledge outside its domain. In
               | tandem, those two make for a rather dangerous
               | combination.
               | 
               | Anytime I see a post about a topic within my area of
               | specialty, I know to expect articulate, lengthy, and
               | _completely misguided or inaccurate_ comments dominating
                | the discussion. It's enough of a problem that trying to
               | wade in and correct them is a losing battle; I rarely
               | even bother these days.
               | 
               | It's kind of funny that XKCD #793[0] is written about
               | physicists, because the effect is way worse with software
               | engineers.
               | 
               | [0] https://xkcd.com/793/
        
               | Retric wrote:
               | Obviously on an objective scale HN isn't good, but nobody
               | is doing a good job here.
               | 
               | I've worked on the government side of this stuff and find
               | it disheartening.
        
               | mandevil wrote:
               | As a software engineer married to a healthcare
               | professional, I disagree strongly about the quality of
               | the healthcare discussions here. A whole lot of the
               | conversation is software engineers who think that they
               | can reason from first principles in two minutes about
               | this thing that professionals dedicate their whole lives
               | to mastering, and who therefore don't understand the most
               | basic concepts of the field.
               | 
               | Sometimes I try and engage, but honestly, mostly I think
               | it's not worth it. Otherwise you end up doing this with
               | your life: https://xkcd.com/386/
        
               | Retric wrote:
               | Spend time with medical researchers and they start
               | disparaging Doctors. Everyone wants that one
               | authoritative source free from bias, but IMO even having
               | a few voices in the crowd worth listening to beats most
               | other options.
        
               | vladms wrote:
               | > about this thing that professionals dedicate their
               | whole lives to mastering
               | 
               | After doing some healthcare work I ended up understanding
               | that some topics are not well known even by the
               | professionals dedicating their whole lives to that
               | because there are big gaps in the human knowledge on the
               | topics.
               | 
                | I agree that people who think they can reason in two
                | minutes about anything are a problem, but it's not a
                | healthcare-only issue (the same happens for politics,
                | economics, the environment, etc.)
                | 
                | Engineers are lucky to work in a field where many
                | things have a clear, known explanation (although, try
                | to estimate how long a team will take to implement a
                | feature, and everybody will come up with something
                | different).
        
               | mandevil wrote:
               | As to the uncertainty and mysteries, you are 100%
               | correct. One of the big failure modes for engineers in
               | dealing with human health is the assumption that things
               | are as simple and logical as the stuff we build, when
               | it's simply not at all like that. There are (1) big
               | arguments over basic things like "why do SSRI's work?"
               | Outside of LLM's I can't think of a thing in software
               | where we are still arguing about why things work in
               | production. We never say "Why does Postgres work?" in the
               | same way. (2)
               | 
               | And yes, this is true for many other areas of discussion
               | at HN. It's just that it is most obvious to me in the
               | area that my wife specializes in, because I pick up
               | enough via osmosis from her to know when other people
               | don't even have my limited level of understanding.
               | 
               | 1: Or at least were 15 years ago when my wife told me
               | about it- the argument might have been largely concluded
               | and she just never updated me since I don't keep up with
               | the medical literature the way she does.
               | 
               | 2: Two decades ago there was a huge push for the "human
               | genome project" under the basis that this would be
               | "reading the blueprints for human life" and that would
               | give us all of these medical breakthroughs. Basically
               | none of those breakthroughs happened because we've spent
               | the past 20 years learning all of the different ways that
               | it is NOT a blueprint and that cells do things very
               | differently from human engineers.
        
           | 38 wrote:
            | It's so easy to solve this problem; not sure why anyone
            | hasn't done it yet.
           | 
           | 1. build a userbase, free product
           | 
           | 2. once userbase get big enough, any new account requires a
           | monthly fee, maybe $1
           | 
           | 3. keep raising the fee higher and higher, until you get to
           | the point that the userbase is manageable.
           | 
           | no ads, simple.
        
             | jachee wrote:
             | Until N ad views are worth more than $X account creation
             | fee. Then the spammers will just sell ad posts for $X*1.5.
             | 
             | I can't find it, but there's someone selling sock puppet
             | posts on HN even.
        
             | abridges6523 wrote:
              | This sounds like a good idea. I do wonder if enough
              | people would sign up for it to be a worthy venture,
              | because I think the main issue is that adding any price
              | at all dramatically reduces participation. Even if it's
              | not about the cost, some people just see the payment and
              | immediately disengage.
        
           | htrp wrote:
           | Any curation mechanism that depends on passion and/or the
           | goodwill of volunteers is unsustainable.
        
         | squigz wrote:
         | > people consuming auto-generated content to keep the masses
         | from away from critical thinking. This is now happening
         | 
         | The people who stay away from critical thinking were doing that
         | already and will continue to do so, 'AI' content or not.
        
           | trehalose wrote:
           | How did they get started?
        
             | squigz wrote:
             | They likely never started critically thinking, so they
             | never had to get started on not doing so.
             | 
             | (If children are never taught to think critically, then...)
        
               | sweeter wrote:
                | It's almost like it's a systemic failure that is
                | artificially created so that people won't think
                | critically... hmmm
        
               | squigz wrote:
               | Yeah, it's almost like it has nothing to do with AI
        
               | vladms wrote:
               | > is artificially created
               | 
               | You imply that thousands of year ago everybody was
               | thinking critically?
               | 
               | Thinking critically is hard, stressful and might take
               | some joy from your life.
        
               | sweeter wrote:
                | I'm not sure how that would imply anything about the
                | past. We as a society have spent decades defanging the
                | public school system: changing school to be test-score
                | driven, tying a school's funding to the local property
                | value, making schools less effective and less safe,
                | choking them out financially, etc. It should be no
                | surprise that children are not equipped to navigate
                | modern life. I've been through these systems; they are
                | deeply flawed.
        
           | psychoslave wrote:
            | I don't know; individually fine-tuned addictive content,
            | served as real-time interactive feedback loops, is a
            | different level of propaganda and attention-capture tool
            | than lowest-common-denominator content served to the
            | general crowd as static, passive material.
        
             | squigz wrote:
             | Perhaps, but the solution is the same either way, and it
             | isn't trying to ban technology or halt progress or just sit
             | and cry about how society is broken. It's educating each
             | other and our children on the way these things work, how to
             | break out of them, and how we might more responsibly use
             | the technology.
        
         | sweeter wrote:
          | Tangentially related, but Marx also predicted that crypto
          | and NFTs would exist, back in 1894 [1], and I only bring it
          | up because it's kind of wild how we keep crossing these
          | "red lines" without even blinking. It's like that meme:
         | 
         | Sci-fi author:
         | 
         | I created the Torment Nexus to serve as a cautionary tale...
         | 
         | Tech Company:
         | 
         | Alas, we have created the Torment Nexus from the classic Sci-fi
         | novel "Don't Create the Torment Nexus"
         | 
         | 1. https://www.marxists.org/archive/marx/works/1894-c3/ch25.htm
        
         | Llamamoe wrote:
         | > Good will prevail in the end.
         | 
         | Even if, this is a dangerous thought that discourages decisive
         | action that is likely to be necessary for this to happen.
        
         | Intralexical wrote:
         | What if the way for good to prevail is to reject technologies
         | and beliefs that have become destructive?
        
       | ok123456 wrote:
       | Most of the "random" bot content pre-2021 was low-quality Markov-
        | generated text. If anything, these generative AI tools would
       | improve the accuracy of scraping large corpora of text from the
       | web.
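As a point of reference, the Markov-generated spam described above takes only a few lines to produce. A minimal word-level sketch (the toy corpus and function names are illustrative only, not anything from the thread):

```python
import random
from collections import defaultdict

def build_chain(text):
    # Map each word to the list of words observed to follow it.
    words = text.split()
    chain = defaultdict(list)
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def generate(chain, start, length, seed=0):
    # Walk the chain, picking a random observed successor at each step.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(corpus)
print(generate(chain, "the", 8))
```

Locally plausible, globally meaningless output is exactly why this older bot text is comparatively easy to filter out of a corpus.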
        
       | diggan wrote:
       | One of the examples is the increased usage of "delve" which
       | Google Trends confirms increased in usage since 2022 (initial
       | ChatGPT release):
       | https://trends.google.com/trends/explore?date=all&q=delve&hl...
       | 
        | It seems, however, that its usage started increasing most in
        | just these last few months; maybe people are talking more
        | about "delve" specifically because of the increase in usage?
        | A usage recursion of sorts.
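Shifts like the "delve" bump can be measured directly on any corpus you control. A stdlib-only sketch of the kind of relative-frequency count that wordfreq reports at scale (the two sample sentences are invented for illustration):

```python
from collections import Counter

def word_freqs(text):
    # Relative frequency of each lowercase whitespace-delimited token.
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

pre_2021 = "we look into the data and dig into the details"
post_2022 = "we delve into the data and delve into the details"

print(word_freqs(pre_2021).get("delve", 0.0))   # no "delve" at all
print(word_freqs(post_2022).get("delve", 0.0))  # 2 of 10 tokens
```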
        
         | bongodongobob wrote:
         | Delves are a new thing in World of Warcraft released 9/10 this
         | year. Delve is also an M365 product that has been around for
         | some time and is being discontinued in December. So no, that
         | has nothing to do with LLMs.
        
           | _proofs wrote:
           | Delve was also an addition to PoE, which I imagine had its
           | own spike in google searches relative to that word.
        
         | bee_rider wrote:
         | We've seen this with a couple words and expressions, and I
         | don't doubt that AI is somewhat likely to "like" some phrases
          | for whatever reason. Big eigenvalues of the latent space or
         | whatever, hahaha (I don't know AI).
         | 
         | But also, words and phrases _do_ become popular among humans,
         | right? It would be a shame if AI caused the language to get
          | more stagnant, as keeping up with which phrases are popular
          | gets you labeled as an AI.
        
       | thesnide wrote:
        | I think that text on the internet will be tainted by AI the
        | same way that steel has been tainted by nuclear devices.
        
       | zaik wrote:
        | If generative AI has a significantly different word frequency
        | from humans, then it also shouldn't be hard to detect text
        | written by generative AI. However, my latest information is
        | that tools to detect text written by generative AI are not
        | that great.
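The naive version of such a detector is easy to sketch; the reason real detectors disappoint is that the human and model distributions overlap heavily. A toy marker-word scorer (the marker list and baseline rate are invented for illustration, not a real detector):

```python
from collections import Counter
import math

def marker_rate(text, markers):
    # Fraction of tokens that belong to the marker-word set.
    words = [w.strip(".,!?").lower() for w in text.split()]
    counts = Counter(words)
    return sum(counts[m] for m in markers) / max(len(words), 1)

def ai_score(text, markers, human_baseline=0.001, eps=1e-6):
    # Log-ratio of the observed marker rate to an assumed human
    # baseline; positive means more marker-heavy than that baseline.
    return math.log((marker_rate(text, markers) + eps) /
                    (human_baseline + eps))

markers = {"delve", "tapestry", "multifaceted"}
print(ai_score("Let us delve into this multifaceted tapestry.", markers))
print(ai_score("Let me look at this mess.", markers))
```

A single scalar like this separates only caricatures; ordinary human text that happens to use "delve" lands in the ambiguous middle, which is the failure mode the comment describes.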
        
       | andai wrote:
       | Has anyone taken a look at a random sample of web data? It's
       | mostly crap. I was thinking of making my own search engine,
       | knowledge database etc based on a random sample of web pages, but
       | I found that almost all of them were drivel. Flame wars, asinine
       | blog posts, and most of all, advertising. Forget spam, most of
       | the legit pages are trying to sell something too!
       | 
       | The conclusion I arrived at was that making my own crawler
       | actually is feasible (and given my goals, necessary!) because I'm
       | only interested in a very, very small fraction of what's out
       | there.
        
       | aryonoco wrote:
       | I feel so conflicted about this.
       | 
       | On the one hand, I completely agree with Robyn Speer. The open
       | web is dead, and the web is in a really sad state. The other day
       | I decided to publish my personal blog on gopher. Just cause,
       | there's a lot less crap on gopher (and no, gopher is not the
       | answer).
       | 
       | But...
       | 
       | A couple of weeks ago, I had to send a video file to my wife's
       | grandfather, who is 97, lives in another country, and doesn't use
       | computers or mobile phones. Eventually we determined that he has
       | a DVD player, so I turned to x264 to convert this modern 4K HDR
       | video into a form that can be played by any ancient DVD player,
       | while preserving as much visual fidelity as possible.
       | 
       | The thing about x264 is, it doesn't have any docs. Unlike x265
       | which had a corporate sponsor who could spend money on writing
       | proper docs, x264 was basically developed through trial and error
       | by members of the doom9 forum. There are hundreds of obscure
       | flags, some of which now operate differently to what they did 20
       | years ago. I could spend hours going through dozens of 20 year
       | old threads on doom9 to figure out what each flag did, or I could
       | do what I did and ask a LLM (in this case Claude).
       | 
       | Claude wasn't perfect. It mixed up a few ffmpeg flags with x264
       | ones (easy mistake), but combined with some old fashioned
       | searching and some trial and error, I could get the job done in
       | about half an hour. I was quite happy with the quality of the end
       | product, and the video did play on that very old DVD player.
       | 
       | Back in pre-LLM days, it's not like I would have hired a x264
       | expert to do this job for me. I would have either had to spend
       | hours more on this task, or more likely, this 97 year old man
       | would never have seen his great granddaughter's dance, which
       | apparently brought a massive smile to his face.
       | 
       | Like everything before them, LLMs are just tools. Neither
       | inherently good nor bad. It's what we do with them and how we use
       | them that matters.
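For anyone attempting the same conversion: a hedged sketch of the ffmpeg route. DVD-Video proper wants MPEG-2, which ffmpeg's built-in -target presets produce in one flag; the filenames here are placeholders, the PAL/NTSC choice depends on the player's region, HDR tone-mapping would need extra filters, and a finicky player may still require real DVD authoring on top of the raw stream.

```shell
# Transcode a modern file down to a DVD-compliant MPEG-2 program stream.
# -target pal-dvd fixes the resolution, frame rate, bitrate caps and
# mux format together; use ntsc-dvd instead for NTSC-region players.
ffmpeg -i input_4k_hdr.mkv -target pal-dvd -aspect 16:9 output.mpg
```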
        
         | sangnoir wrote:
         | > Back in pre-LLM days, it's not like I would have hired a x264
         | expert to do this job for me. I would have either had to spend
         | hours more on this task, or more likely, this 97 year old man
         | would never have seen his great granddaughter's dance
         | 
         | Didn't most _DVD_ burning software include video transcoding as
          | a standard feature? Back in the day, you'd have used Nero
         | Burning ROM, or Handbrake - granted, the quality may not have
         | been optimized to your standards, but the result would have
         | been a watchable video (especially to 97 year-old eyes)
        
           | aryonoco wrote:
           | Back in the day they did. I checked handbrake but now there's
           | nothing specific about DVD compatibility there. I could have
           | picked something like Super HQ 576p, and there's a good
           | chance that would have sufficed, but old DVD players were
           | extremely finicky about filenames, extensions, interlacing,
           | etc. I didn't want to risk the DVD traveling half way across
           | the world only to find that it's not playable.
        
             | sangnoir wrote:
             | I mentioned Handbrake without checking its DVD authoring
             | capability - probably used it to _rip_ DVDs many years ago
             | and got it mixed up with burning them; a better FLOSS
             | alternative for authoring would have been DeVeDe or
             | bombono.
        
       | miguno wrote:
       | I have been noticing this trend increasingly myself. It's getting
       | more and more difficult to use tools like Google search to find
       | relevant content.
       | 
       | Many of my searches nowadays include suffixes like
       | "site:reddit.com" (or similar havens of, hopefully, still mostly
       | human-generated content) to produce reasonably useful results.
       | There's so much spam pollution by sites like Medium.com that it's
        | disheartening. It feels as if the Internet's humanity is
        | already in retreat to its last comely homes, which are more
        | closed than open to the outside.
       | 
       | On the positive side:
       | 
       | 1. Self-managed blogs (like: not on Substack or Medium) by
       | individuals have become a strong indicator for interesting
       | content. If the blog runs on Hugo, Zola, Astro, you-name-it,
       | there's hope.
       | 
       | 2. As a result of (1), I have started to use an RSS reader again.
       | Who would have thought!
       | 
       | I am still torn about what to make of Discord. On the one hand,
       | the closed-by-design nature of the thousands of Discord servers,
       | where content is locked in forever without a chance of being
       | indexed by a search engine, has many downsides in my opinion. On
       | the other hand, the servers I do frequent are populated by
       | humans, not content-generating bots camouflaged as users.
        
       | 0xbadcafebee wrote:
       | I'm going to call it: The Web is dead. Thanks to "AI" I spend
       | more time now digging through searches trying to find something
       | useful than I did back in 2005. And the sites you do find are
       | largely garbage.
       | 
       | As a random example: just trying to find a particular popular set
       | of wireless earbuds takes me at least 10 minutes, when I already
       | know the company, the company's website, other vendors that sell
       | the company's goods, etc. It's just buried under tons of dreck.
       | And my laptop is "old" (an 8-core i7 processor with 16GB of RAM)
       | so it struggles to push through graphics-intense "modern"
       | websites like the vendor's. Their old website was plain and
       | worked great, letting me quickly search through their products
       | and quickly purchase them. Last night I literally struggled to
       | add things to cart and check out; it was actually harrowing.
       | 
       | Fuck the web, fuck web browsers, web design, SEO, searching,
       | advertising, and all the schlock that comes with it. I'm done. If
       | I can in any way purchase something without the web, I'mma do
       | that. I don't hate technology (entirely...) but the web is just a
       | rotten egg now.
        
         | w10-1 wrote:
         | > If I can in any way purchase something without the web, I'mma
         | do that
         | 
         | To get to the milk you'll have to walk by 3 rows of chips and
         | soda.
        
           | odo1242 wrote:
           | Yeah, this is why I still use the web to order things in a
           | nutshell lol
        
             | 0xbadcafebee wrote:
             | Where do you order things online that you aren't inundated
             | by ads?
        
         | bbarn wrote:
         | No disagreement for the most part.
         | 
         | I used to be able to say search for Trek bike derailleur hanger
         | and the first result would be what I wanted. Now I have to
         | scroll past 5 ads to buy a new bike, one that's a broken link
         | to a third party, and if I'm really lucky, at the bottom of
         | page 1 will be the link to that part's page.
         | 
         | The shitification of the web is real.
        
           | klyrs wrote:
           | R.I.P. Sheldon Brown T_T
           | 
           | (The Agner Fog of cycling?)
        
         | gazook89 wrote:
         | The web is much more than a shopping site.
        
           | yifanl wrote:
            | It is, but the SEO spammers who ruined the web want it to
            | be a shopping mall, and they can't even do a particularly
            | good job at being one.
        
         | Gethsemane wrote:
         | Sounds like your laptop is wholly out of date, you need to buy
         | the next generation of laptops on Amazon that can handle the
         | modern SEO load. I recommend the:
         | 
         | LEEZWOO 15.6" Laptop - 16GB RAM 512GB SSD PC Laptop, Quad-Core
         | N95 Processor Up to 3.1GHz, Laptop Computers with Touch ID,
         | WiFi, BT4.2, for Students/Business
         | 
         | Name rolls off the tongue doesn't it
        
         | BeetleB wrote:
         | If search is your metric, the web was dead long before OpenAI's
         | release of GPT. I gave up on web search a long time ago.
        
         | Vegenoid wrote:
         | On Amazon, you used to be able to search the reviews and Q&A
         | section via a search box. This was immensely useful. Now, that
         | search box first routes your search to an LLM, which makes you
         | wait 10-15 seconds while it searches for you. Then it presents
         | its unhelpful summary, saying "some reviews said such and
         | such", and I can finally click the button to show me the actual
         | reviews and questions with the term I searched.
         | 
          | This is going to be the thing that makes me quit Amazon. If
          | I'm missing something and there's still a way to do a
          | direct search, please tell me.
        
       | fsckboy wrote:
        | > FTA _the site has been replaced with an oligarch's
        | plaything, a spam-infested right-wing cesspool_
       | 
       | just in case you youngsters don't know, the entire field of
       | linguistics itself has been cesspool of marxist analysis since
       | before y'all were born. In the peak days of Chomsky, a truly
       | great linguist who put MIT at the forefront of linguistics in the
       | world, MIT still felt it had to disband his department (stuffing
       | it into Philosophy) because it was too political, radicalized,
        | and unacademic. It was a big kerfuffle; guess Chomsky was unable
       | to manufacture adequate consent!
       | 
       | The anti-western, anti-male, anti-whiteness, deconstructionist,
       | lesbian inspired womyn's right to fish-bicycles instead of men,
       | critical theory, you name it that has destroyed the Academy today
       | was already in full swing in linguistics over 50 years ago, but
       | even then was unable to free those Rosenbergs.
       | 
        | And apparently as a result of that, wordfreq will not update any
       | more. And Israel, like Carthage, must be destroyed!
       | 
        | (this little time-capsule is meant to point out, _plus ça
        | change, plus c'est la même chose_ (the more things change,
        | the more they stay the same). The perceived craziness of
       | politics and campus life today was already in full swing over a
       | hundred years ago in revolutionary Europe leading to Fascist and
       | Marxist totalitarian states _and their defenders in the USA_ ,
       | which I think we are nowhere close to today but we still hear the
       | echoes, even in End of Life announcements for seemingly benign
       | activities like word counts.)
        
       | jadayesnaamsi wrote:
        | The year 2021 is to wordfreq what 1945 was to carbon-14
        | dating.
       | 
       | I guess the same way the scientists had to account for the bomb
       | pulse in order to provide accurate carbon-14 dating, wordfreq
       | would need a magic way to account for non human content.
       | 
        | Saying magic because, unfortunately, it was much easier to
        | detect nuclear testing in the atmosphere than it will be to
        | detect AI-generated content.
        
       | charlieyu1 wrote:
        | The web before 2021 was still polluted by content farms. The
        | articles were written by humans, but they were still rubbish.
        | Nothing compared to the current rate of generation, but the
        | web was already dominated by them.
        
       | bane wrote:
       | This is one of the vanguards warning of the changes coming in the
       | post-AI world.
       | 
       | >> Generative AI has polluted the data
       | 
       | Just like low-background steel marks the break in history from
       | before and after the nuclear age, these types of data mark the
       | distinction from before and after AI.
       | 
        | Future models will continue to amplify certain statistical
        | properties from their training; that amplified data will
        | continue to pollute the public space from which future
        | training data is drawn. Meanwhile certain low-frequency data
        | will be selected by these models less and less and will
        | become suppressed and possibly eliminated. We know from
        | classic NLP techniques that low-frequency words are often
        | among the highest in information content and descriptive
        | power.
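That last point is the standard information-theoretic one: a word's self-information is -log2 of its frequency, so rare words carry far more bits. A small illustration (the two frequencies are made-up round numbers):

```python
import math

def self_information_bits(frequency):
    # Shannon self-information of a word with the given relative frequency.
    return -math.log2(frequency)

common = self_information_bits(0.05)  # a very common word: ~4.3 bits
rare = self_information_bits(1e-6)    # a rare word: ~19.9 bits
print(common, rare)
```

If models keep re-selecting the high-frequency words, it is precisely the high-information tail of the distribution that erodes first.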
       | 
       | Bitrot will continue to act as the agent of Entropy further
       | reducing pre-AI datasets.
       | 
       | These feedback loops will persist, language will be ground down,
       | neologisms will be prevented and...society, no longer with the
       | mental tools to describe changing circumstances; new thoughts
       | unable to be realized, will cease to advance and then regress.
       | 
       | Soon there will be no new low frequency ideas being removed from
       | the data, only old low frequency ideas. Language's descriptive
       | power is further eliminated and only the AIs seem able to produce
       | anything that might represent the shadow of novelty. But it ends
       | when the machines can only produce unintelligible pages of
       | particles and articles, language is lost, civilization is lost
       | when we no longer know what to call its downfall.
       | 
       | The glimmer of hope is that humanity figured out how to rise from
       | the dreamstate of the world of animals once. Future humans will
       | be able to climb from the ashes again. There used to be a word,
       | the name of a bird, that encoded this ability to die and return
       | again, but that name is already lost to the machines that will
       | take our tongues.
        
         | thechao wrote:
         | That went off the rails quickly. Calm down dude: my mother-in-
         | law isn't going to forget words because of AI; she's gonna
         | forget words because she's 3 glasses of crappy Texas wine into
         | the evening.
        
           | bane wrote:
           | But your children's children will never learn about love
           | because that word will have been mechanically trained out of
           | existence.
        
             | Intralexical wrote:
             | That's pretty funny. You think love is just a word?
        
         | fer wrote:
         | > Future models will begin to continue to amplify certain
         | statistical properties from their training, that amplified data
         | will continue to pollute the public space from which future
         | training data is drawn.
         | 
         | That's why on FB I mark my own writing as AI-generated, and
         | the AI-generated slop as genuine. Because what is disguised
         | as a "transparency disclaimer" is really just flagging which
         | content is a potential dataset to train on and which isn't.
        
           | mitthrowaway2 wrote:
           | I'm sorry for the low-content remark, but, oh my god... I
           | never thought about doing this, and now my mind is reeling at
           | the implications. The idea of shielding my own writing from
           | AI-plagiarism by masquerading it as AI-generated slop in the
           | first place... but then in the same stroke, further
           | undermining our collective ability to identify genuine human
           | writing, while also flagging my own work as low-value to my
           | readers, hoping that they can read between the lines. It's a
           | fascinating play.
        
           | aanet wrote:
           | You, Sir, may have stumbled upon just the -hack- needed to
           | post on social media.
           | 
           | Apropos of nothing in particular, see LinkedIn now admitting
           | [1] it is training its AI models on "all users by default"
           | 
           | [1] https://www.techmeme.com/240918/p34#a240918p34
        
         | wvbdmp wrote:
         | I Have No Words, And I Must Scream
        
         | midnitewarrior wrote:
         | From the day of the first spoken word, humans have guided the
         | development of language through conversational use and
         | institution. With the advent of AI being used to publish
         | documents into the open web, humans have given up their
         | exclusive domain.
         | 
         | What would it take for the OpenAI overlords to inject words
         | they want to force into usage into their models, and will
         | new words into use? Few have had the power to do such
         | things. OpenAI, through its popular GPT platform, now has
         | the potential to dictate the evolution of human language.
         | 
         | This is novel and scary.
        
           | bane wrote:
           | It's the ultimate seizure of the means of production, and in
           | the end it will be the capitalists who realize that
           | revolution.
        
         | Intralexical wrote:
         | > Soon there will be no new low frequency ideas being removed
         | from the data, only old low frequency ideas. Language's
         | descriptive power is further eliminated and only the AIs seem
         | able to produce anything that might represent the shadow of
         | novelty. But it ends when the machines can only produce
         | unintelligible pages of particles and articles, language is
         | lost, civilization is lost when we no longer know what to call
         | its downfall.
         | 
         | Or we'll be fine, because inbreeding isn't actually
         | sustainable either economically or technologically, and to
         | most of the world the Silicon Valley "AI" crowd is more an
         | obnoxious gang of socially stunted and predatory weirdos
         | than some unstoppable omnipotent force.
        
       | sashank_1509 wrote:
       | Not to be too dismissive, but is there a worthwhile direction
       | of research in NLP that is not LLMs?
       | 
       | If we add linguistics to NLP I can see an argument, but if we
       | define NLP as the research of enabling a computer to process
       | language, then it seems to me that LLMs / generative AI are
       | the only research an NLP practitioner should focus on and
       | everything else is moot. Is there any other paradigm that we
       | think can enable a computer to understand language, other than
       | training a large deep learning model on a lot of data?
        
         | sinkasapa wrote:
         | Maybe it is "including linguistics" but most of the world's
         | languages don't have the data available to train on. So I think
         | one major question for NLP is exactly the question you posed:
         | "Is there any other paradigm that we think can enable a
         | computer understand language other than training a large deep
         | learning model on a lot of data?"
        
       | hcks wrote:
       | Okay but how big of a sample size do we even actually need for
       | word frequencies? Like what's the goal here? It looks like the
       | initial project isn't even stratified per year/decade
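A back-of-the-envelope answer to the sample-size question: treating a word's occurrences as a binomial count, the relative standard error of an estimated frequency p from n tokens is roughly sqrt((1-p)/(n*p)), so the rarest words you care about dictate the corpus size. A sketch (my own arithmetic, not from the wordfreq project):

```python
import math

def tokens_needed(p: float, rel_err: float) -> int:
    """Corpus size (in tokens) so that the estimated frequency of a
    word with true frequency p has relative standard error <= rel_err,
    using the binomial approximation sqrt((1 - p) / (n * p))."""
    return math.ceil((1 - p) / (p * rel_err ** 2))

# A one-per-million word needs on the order of 10^8 tokens for ~10%
# relative error; a 1% word needs only ~10^4 tokens.
n_rare = tokens_needed(1e-6, 0.10)    # ~1e8 tokens
n_common = tokens_needed(1e-2, 0.10)  # ~1e4 tokens
```

In other words, covering the low-frequency tail well requires corpora in the hundreds of millions of tokens, which is why the data sources mattered so much.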
        
       | tqi wrote:
       | "Sure, there was spam in the wordfreq data sources, but it was
       | manageable and often identifiable."
       | 
       | How sure can we be about that?
        
       | QRe wrote:
       | I understand the frustration shared in this post but I
       | wholeheartedly disagree with the overall sentiment that comes
       | with it.
       | 
       | The web isn't dead, (Gen)AI, SEO, spam and pollution didn't kill
       | anything.
       | 
       | The world is chaotic and net entropy (degree of disorder) of any
       | isolated or closed system will always increase. Same goes for the
       | web. We just have to embrace it and overcome the challenges that
       | come with it.
        
         | brunokim wrote:
         | Here is an expert saying there is a problem and how it killed
         | its research effort, and yet you say that things are the same
         | as ever and nothing was killed.
        
         | ryukoposting wrote:
         | I'm not so optimistic. The most basic requirements are:
         | 
         | 1. Prove the human-ness of an author...
         | 
         | 2. ...without grossly encroaching on their privacy.
         | 
         | 3. Ensure that the author isn't passing off AI-generated
         | material as their own.
         | 
         | We'll leave out the "don't let AI models train on my data" part
         | for now.
         | 
         | Whatever solution we come up with, if any, will necessarily be
         | mired in the politics of privacy, anonymity, and/or DRM. In any
         | case, it's hard to conceive of a world where the human web
         | returns as we once knew it.
        
       | syngrog66 wrote:
       | A few years ago I began an effort to write a new tech book. I
       | originally planned to do as much of it as I could across a
       | series of commits in a public GitHub repo of mine.
       | 
       | I then changed course. Why? I had read increasing reports of
       | human e-book pirates (copying your book's content, then
       | repackaging it for sale under a different title, byline, and
       | cover, possibly at a much lower or even much higher price).
       | 
       | And then the rise of LLMs and their ravenous training ingest bots
       | -- plagiarism at scale and potentially even easier to disguise.
       | 
       | "Not gonna happen." - Bush Sr., via Dana Carvey
       | 
       | Now I keep the bulk of my book material non-public during
       | development. I'm sure I'll share a chapter candidate or so at
       | some point before final release, for feedback and publicity.
       | But the bulk will debut all together at once, and only once
       | polished and behind a paywall.
        
       | whimsicalism wrote:
       | NLP and especially 'computational linguistics' in academia has
       | been captured by certain political interests, this is reflective
       | of that.
        
       | jchook wrote:
       | If it is (apparently) easy for humans to tell when content is AI-
       | generated slop, then it should be possible to develop an AI to
       | distinguish human-created content.
       | 
       | As mentioned, we have heuristics like frequency of the word
       | "delve", and simple techniques such as measuring perplexity. I'd
       | like to see a GAN style approach to this problem. It could
       | potentially help improve the "humanness" of AI-generated content.
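The perplexity heuristic mentioned above can be sketched with a toy unigram model (illustrative probabilities and vocabulary, nothing like a real detector): text that a model assigns unusually low perplexity to is more likely to resemble that model's own output.

```python
import math

# Toy unigram "language model": word -> probability (illustrative).
UNIGRAM = {"the": 0.05, "of": 0.03, "delve": 1e-6, "tapestry": 2e-6}
OOV_P = 1e-7  # fallback probability for out-of-vocabulary words

def perplexity(words: list[str]) -> float:
    """Perplexity of a word sequence under the toy unigram model:
    exp of the average negative log-probability per word."""
    logps = [math.log(UNIGRAM.get(w, OOV_P)) for w in words]
    return math.exp(-sum(logps) / len(logps))

# Text dominated by high-probability words scores low perplexity;
# text full of rare words scores high.
low = perplexity(["the", "of", "the"])
high = perplexity(["delve", "tapestry", "delve"])
```

A GAN-style setup would pit such a detector against a generator, which is also why, as the reply below notes, any reliable detector tends to erase its own signal.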
        
         | aDyslecticCrow wrote:
         | > If it is (apparently) easy for humans to tell when content is
         | AI-generated slop
         | 
         | It's actually not. It's rather difficult for humans as well.
         | We can see verbose text that is confused and call it AI, but
         | it could just be a human as well.
         | 
         | To borrow from an older model-training method, the
         | "generative adversarial network": if we can distinguish AI
         | from humans, we can use that signal to improve the AI and
         | close the gap.
         | 
         | So it becomes an arms race that constantly evolves.
        
       | honksillet wrote:
       | Twitter was a botnet long before LLMs and Musk got involved.
        
       | jedberg wrote:
       | We need a vintage data/handmade data service. A service that can
       | provide text and images for training that are guaranteed to have
       | either been produced by a human or produced before 2021.
       | 
       | Someone should start scanning all those microfiche archives in
       | local libraries and sell the data.
        
       | will-burner wrote:
       | > It's rare to see NLP research that doesn't have a dependency on
       | closed data controlled by OpenAI and Google, two companies that I
       | already despise.
       | 
       | The dependency on closed data combined with the cost of compute
       | to do anything interesting with LLMs has made individual
       | contributions to NLP research extremely difficult if one is not
       | associated with a very large tech company. It's super
       | unfortunate, makes the subject area much less approachable, and
       | makes the people doing research in the subject area much more
       | homogeneous.
        
       | jonas21 wrote:
       | I think the main reason for sunsetting the project is buried near
       | the bottom:
       | 
       | > _The field I know as "natural language processing" is hard to
       | find these days. It's all being devoured by generative AI. Other
       | techniques still exist but generative AI sucks up all the air in
       | the room and gets all the money._
       | 
       | Traditional NLP has been surpassed by LLMs. This is clear from
       | the benchmarks. The rest of the post is just rationalization and
       | sour grapes.
        
       ___________________________________________________________________
       (page generated 2024-09-18 23:00 UTC)