[HN Gopher] Why wordfreq will not be updated
___________________________________________________________________
Why wordfreq will not be updated
Author : tomthe
Score : 1225 points
Date : 2024-09-18 11:41 UTC (11 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| altcognito wrote:
| It might be fun to collect the same data, if for no other
| reason than to note the changes, but adding the caveat that it
| doesn't represent human output.
|
| Might even change the tool name.
| jpjoi wrote:
| The point was it's getting harder and harder to do that as
| things get locked down or go behind a massive paywall, either to
| profit off of or to avoid being used in generative AI. The places
| where previous versions got data are impossible to gather from
| anymore, so the dataset you would collect would be completely
| different, which (might) cause weird skewing.
| oneeyedpigeon wrote:
| But that would always be the case. Twitter will not last
| forever; heck, it may not even be long before an open
| alternative like Bluesky competes with it. Would be
| interesting to know what percentage of the original mined
| data was from Twitter.
| assanineass wrote:
| Well said
| jgrahamc wrote:
| I created https://lowbackgroundsteel.ai/ in 2023 as a place to
| gather references to unpolluted datasets. I'll add wordfreq.
| Please submit stuff to the Tumblr.
| VyseofArcadia wrote:
| Clever name. I like the analogy.
| freilanzer wrote:
| I don't seem to get it.
| KeplerBoy wrote:
| Steel made before atmospheric tests of nuclear bombs were a
| thing is referred to as low-background steel and is invaluable
| for some applications.
|
| LLMs pollute the internet like atomic bombs polluted the
| environment.
| cdman wrote:
| https://en.wikipedia.org/wiki/Low-background_steel
| ziddoap wrote:
| Steel without nuclear contamination is sought after, and
| only available from pre-war / pre-atomic sources.
|
| The analogy is that data is now contaminated with AI like
| steel is now contaminated with nuclear fallout.
|
| https://en.wikipedia.org/wiki/Low-background_steel
|
| > _Low-background steel, also known as pre-war steel[1] and
| pre-atomic steel,[2] is any steel produced prior to the
| detonation of the first nuclear bombs in the 1940s and
| 1950s. Typically sourced from ships (either as part of
| regular scrapping or shipwrecks) and other steel artifacts
| of this era, it is often used for modern particle detectors
| because more modern steel is contaminated with traces of
| nuclear fallout.[3][4]_
| umvi wrote:
| > and only available from pre-war / pre-atomic sources.
|
| From the same wiki you linked:
|
| "Since the end of atmospheric nuclear testing, background
| radiation has decreased to very near natural levels,
| making special low-background steel no longer necessary
| for most radiation-sensitive uses, as brand-new steel now
| has a low enough radioactive signature"
|
| and
|
| "For the most demanding items even low-background steel
| can be too radioactive and other materials like high-
| purity copper may be used"
| sergiotapia wrote:
| Reading stuff like this makes me so happy. No matter how
| fucked up something may be, there is always a way to clean
| it right up.
| felbane wrote:
| _glances nervously at atmospheric CO2_
| swyx wrote:
| and I applied it to LLMs here:
| https://www.latent.space/p/nov-2023
| AlphaAndOmega0 wrote:
| It's a reference to the practise of scavenging steel from
| sources that were produced before nuclear testing began, as
| any steel produced afterwards is contaminated with nuclear
| isotopes from the fallout. Mostly shipwrecks, and WW2
| means there are plenty of those. The pun in question is
| that his project tries to source text that hasn't been
| contaminated with AI generated material.
|
| https://en.m.wikipedia.org/wiki/Low-background_steel
| ms512 wrote:
| After the detonation of the first nuclear weapons, any
| newly produced steel carries a low dose of nuclear fallout.
|
| For applications that need to avoid the background
| radiation (like physics research), pre atomic age steel is
| extracted, like from old shipwrecks.
|
| https://en.m.wikipedia.org/wiki/Low-background_steel
| GreenWatermelon wrote:
| From the blog
|
| > Low Background Steel (and lead) is a type of metal
| uncontaminated by radioactive isotopes from nuclear
| testing. That steel and lead is usually recovered from
| ships that sunk before the Trinity Test in 1945.
| voytec wrote:
| To whoever downvoted the parent: please don't act against
| people brave enough to state that they don't know
| something.
|
| This is a desirable quality, increasingly rare in IT
| work environments. People afraid of being shamed for
| stating knowledge gaps are not the folks you want to work
| with.
| umvi wrote:
| I feel like there's a minimum "due diligence" bar to meet
| though before asking, otherwise it comes across as "I'm
| too lazy to google the reference and connect the dots
| myself, but can someone just go ahead and distill a nice
| summary for me"
| voytec wrote:
| In this particular case, I was out of the loop regarding
| the clever analogy myself. I'm now a tad smarter because
| someone else expressed lack of understanding, and I
| learned from responses to this (grayed due to downvotes)
| comment.
| PhunkyPhil wrote:
| The problem is that the answer was a really easy google.
| I didn't know what low background steel was and I just
| googled it.
| cwillu wrote:
| A person asking the question _here_ means there are now
| several good succinct explanations of it _here_.
| input_sh wrote:
| But it's right there in the header, you could just click
| the link and find out on the top of the webpage.
| imhoguy wrote:
| I am not sure we should trust a site contaminated by AI
| graphics. /s
| whywhywhywhy wrote:
| Yeah pay an illustrator if this is important to you.
|
| I see a lot of people upset about AI still using AI image
| generation, because it's not in their field, so they feel less
| strongly about it and can't create art themselves anyway. It's
| hypocritical: either use it or don't, but don't fuss over it
| and then use it for something that's convenient for you.
| imhoguy wrote:
| I have updated my comment with "/s" as that is closer to
| what I meant. Seriously, though, from an ethical point of
| view it is unlikely illustrators were asked or compensated
| for their work being used to train the AI that produced the
| image.
| heckelson wrote:
| I thought the header image was a symbol of AI slop
| contamination because it looked really off-putting
| gorkish wrote:
| The buildings and shipping containers that store low
| background steel aren't built out of the stuff either.
| astennumero wrote:
| That's exactly the opposite of what the author wanted, IMO. The
| author no longer wants to be a part of this mess. Aggregating
| these sources would just make it so much easier for the
| tech giants to scrape more data.
| rovr138 wrote:
| The sources are just aggregated. The source doesn't change.
|
| The new stuff generated does (and this is honestly already
| captured).
|
| This author doesn't generate content. They analyze data from
| humans. That "from humans" is the part that can no longer be
| reliably discerned, and thus the project can't continue.
|
| Their research and projects are great.
| iak8god wrote:
| The main concerns expressed in Robyn's note, as I read them,
| seem to be 1) generative AI has polluted the web with text
| that was not written by humans, and so it is no longer
| feasible to produce reliable word frequency data that
| reflects how humans use natural language; and 2)
| simultaneously, sources of natural language text that were
| previously accessible to researchers are now less accessible
| because the owners of that content don't want it used by
| others to create AI models without their permission. A third
| concern seems to be that support for and practice of any
| other NLP approaches is vanishing.
|
| Making resources like wordfreq more visible won't exacerbate
| any of these concerns.
| LeoPanthera wrote:
| Congratulations on "shipping", I've had a background task to
| create pretty much exactly this site for a while. What is your
| cutoff date? I made this handy list, in research for mine:
| 2017: Invention of transformer architecture
| June 2018: GPT-1
| February 2019: GPT-2
| June 2020: GPT-3
| March 2022: GPT-3.5
| November 2022: ChatGPT
|
| You may want to add kiwix archives from before whatever date
| you choose. You can find them on the Internet Archive, and
| they're available for Wikipedia, Stack Overflow, Wikisource,
| Wikibooks, and various other wikis.
| ClassyJacket wrote:
| :'( I thought I was clever for realising this parallel myself!
| Guess it's more obvious than I thought.
|
| Another example is how data on humans after 2020 or so can't be
| separated by sex because gender activists fought to stop
| recording sex in statistics on crime, medicine, etc.
| sweeter wrote:
| This is a psychotic thing to say without a source,
| considering how it's blatantly untrue.
| primer42 wrote:
| Hear, hear!
| oneeyedpigeon wrote:
| I wonder if anyone will fork the project. Apart from anything
| else, the data may still be useful given that we know it is
| polluted. In fact, it could act as a means of judging the impact
| of LLMs via that very pollution.
| Miraltar wrote:
| I guess it would be interesting, but differentiating pollution
| from language evolution seems very tricky, since getting a
| non-polluted corpus gets harder and harder.
| wpietri wrote:
| One way to tackle it would be to use LLMs to generate
| synthetic corpuses, so you have some good fingerprints for
| pollution. But even there I'm not sure how doable that is
| given the speed at which LLMs are being updated. Even if I
| know a particular page was created in, say, January 2023, I
| may no longer be able to try to generate something similar
| now to see how suspect it is, because the precise setups of
| the moment may no longer be available.
| Retr0id wrote:
| Arguably it _is_ a form of language evolution. I bet humans
| have started using "delve" more too, on average. I think the
| best we can do is look at the trends and think about
| potential causes.
| rvnx wrote:
| "Seamless", "honed", "unparalleled", "delve" are now
| polluting the landscape because of monkeys repeating what
| ChatGPT says without even questioning what the words mean.
|
| Everything is "seamless" nowadays. Like I am seamlessly
| commenting here.
|
| Arguably, the meaning of these words evolve due to misuse
| too.
| oneeyedpigeon wrote:
| I see a lot of writing in my day-to-day, and the words
| that stick out most are things like "plethora" and
| "utilized". They're not terribly obscure, but they're
| just 'odd' and, maybe, formal enough to really stick out
| when overused.
| pavel_lishin wrote:
| > _I bet humans have started using "delve" more too, on
| average._
|
| I wish there were a way to check.
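
One rough way to check, sketched below: tokenize two dated text samples and compare a word's relative frequency before and after. This is a stdlib-only illustration of the idea, not wordfreq's actual pipeline, and which pre- and post-2022 corpora you feed it is up to you.

```python
import math
import re
from collections import Counter


def rel_freq(text: str, word: str) -> float:
    """Relative frequency of `word` among all word tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


def log10_ratio(word: str, before: str, after: str) -> float:
    """log10 of the frequency ratio between two samples;
    +1.0 means a tenfold ("order of magnitude") increase."""
    f_before = rel_freq(before, word)
    f_after = rel_freq(after, word)
    if f_before == 0.0 or f_after == 0.0:
        return float("nan")
    return math.log10(f_after / f_before)
```

The caveat from upthread still applies: a positive ratio alone can't distinguish LLM pollution from genuine drift in human usage.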
| shortrounddev2 wrote:
| Man the AI folks really wrecked everything. Reminds me of when
| those scooter companies started just dumping their scooters
| everywhere without asking anybody if they wanted this.
| analog31 wrote:
| Perhaps germane to this thread, I think the scooter thing was
| an investment bubble. It was easier to burn investment money on
| new scooters than to collect and maintain old ones. Until the
| money ran out.
| kdmccormick wrote:
| At least scooters did something useful for the environment.
| DrillShopper wrote:
| Their batteries on the other hand...
| kdmccormick wrote:
| Sure, they're worse than walking or biking, but compared to
| an electric car battery or an ICE car?
| Sharlin wrote:
| At least where I'm from, scooters have mostly replaced
| walking and biking, not car trips :(
| Sander_Marechal wrote:
| Did they? A lot of them were barely used, got damaged or
| vandalized, etc. And when the companies folded or communities
| outlawed the scooters, they ended up as trash. I don't believe
| for a second that the amount of pollutants and greenhouse
| gasses saved by usage is larger than the amount produced by
| manufacturing, shipping and trashing all those scooters.
| baq wrote:
| All those writers who'll soon be out of job and/or already are
| and basically unhireable for their previous tasks should be paid
| for by the AI hyperscalers to write anything at all on one
| condition: not a single sentence in their works should be created
| with AI.
|
| (I initially wanted to say 'paid for by the government' but
| that'd be socialising losses and we've had quite enough of that
| in the past.)
| bondarchuk wrote:
| AI companies are indeed hiring such people to generate
| customized training data for them.
| neilv wrote:
| Is it the same companies that simply took all the writers'
| previous work (hoping to be billionaires before the courts
| understand)?
| shadowgovt wrote:
| Yes. This was always the failure with the argument that
| copyright was the relevant issue... Once the model was
| proven out, we knew some wealthy companies would hire
| humans to generate the training data that the companies
| could then own in whole, at the relative expense of all
| other humans that didn't get paid to feed the machines.
| passion__desire wrote:
| This idea could also be extended to domains like Art. Create
| new art styles for AI to learn from. But in future, that will
| also get automated. AI itself will create art styles and all
| humans would do is choose whether something is Hot or Not.
| Sort of like art breeder.
| vidarh wrote:
| There are already several companies doing this - I do
| occasional contract work for a couple -, and paying rates
| sometimes well above what an average earning writer can expect
| elsewhere. However, the vast majority of writers have never
| been able to make a living from their writing. The threshold to
| write is too low, too many people love it, and most people
| read very little.
| baq wrote:
| Transformers read a lot during training; it might actually be
| beneficial for the companies to the point that those works
| never see the light of day, and only machines would read them.
| That's so dystopian I'd say those works should be published so
| they eventually get into the public domain.
| ckemere wrote:
| Rooms full of people writing into a computer is a striking
| mental picture. It feels like it could be background for a
| great plot for a book/movie.
| left-struck wrote:
| Have you heard of Severance? This has a vibe extremely
| similar to that show.
| trilbyglens wrote:
| Have you ever read American history? Lol.
| nkozyra wrote:
| People have been paid to generate noise for a decade+ now.
| Garbage in, garbage out will always be true.
|
| Next token-seeking is a solved problem. Novel thinking can be
| solved by humans and possibly by AI soon, but adding more
| garbage to the data won't improve things.
| tveita wrote:
| Who programs the tapes?
| https://en.wikipedia.org/wiki/Profession_(novella)
| jfultz wrote:
| _Thank you_. I read this story probably around 1980 (I think
| in a magazine that was subsequently trashed or garage-saled),
| and I have spent my adult life remembering the bones of the
| story, but not the author or the title.
| anovikov wrote:
| Sad. I'd love to see by how much the use of the word "delve" has
| increased since 2021...
| chipdart wrote:
| From the submission you're commenting on:
|
| > As one example, Philip Shapira reports that ChatGPT (OpenAI's
| popular brand of generative language model circa 2024) is
| obsessed with the word "delve" in a way that people never have
| been, and caused its overall frequency to increase by an order
| of magnitude.
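
For scale: wordfreq reports frequencies on the Zipf scale, the base-10 log of a word's occurrences per billion words, so an "order of magnitude" increase is a jump of one full Zipf point. A minimal sketch of that conversion:

```python
import math


def zipf_scale(frequency: float) -> float:
    """Convert a relative word frequency (probability per token)
    to the Zipf scale: log10 of occurrences per billion words."""
    return math.log10(frequency * 1e9)


# A word at 1 occurrence per million tokens sits at Zipf 3.0;
# a tenfold ("order of magnitude") jump moves it to Zipf 4.0.
```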
| eesmith wrote:
| https://pshapira.net/2024/03/31/delving-into-delve/ "Delving
| into "delve""
| xpl wrote:
| The fun thing is that while GPTs initially learned from humans
| (because ~100% of the content was human-generated), future
| humans will learn from GPTs, because almost all available
| content will be GPT-generated very soon.
|
| This will surely affect how we speak. It's possible that human
| language evolution could come to a halt, stuck in time as AI
| datasets stop being updated.
|
| In the worst case, we will see a global "model collapse" with
| human languages devolving along with AI's, if future AIs are
| trained on their own outputs...
| Terretta wrote:
| > _I 'd love to see by how much the use of world "delve" has
| increased since 2021..._
|
| There are charts / graphs in the link, both since 2021, and
| since earlier.
|
| The final graph suggests the phenomenon started earlier,
| possibly correlated in some way to Malaysian / Indian usages of
| English.
|
| It does seem OpenAI's family of GPTs as implemented in ChatGPT
| unspool concepts in a blend of India-based-consultancy English
| with American freshmen essay structure, frosted with
| superficially approachable or upbeat blogger prose
| ingratiatingly selling you something.
|
| Anthropic has clearly made efforts to steer this differently,
| Mistral and Meta as well but to a lesser degree.
|
| I've wondered if this reflects training material (the "SEO is
| ruining the Internet" theory), or is more simply explained by
| the selection of pools of humans hired for RLHF.
| dqv wrote:
| Same for me but with the word "crucial".
| slashdave wrote:
| Amusing that we now have a feedback loop. Let's see... delve
| delve delve delve delve delve delve delve. There, I've done my
| part.
| CaptainFever wrote:
| Google ngram viewer, perhaps?
| voytec wrote:
| I agree in general but the web was already polluted by Google's
| unwritten SEO rules. Single-sentence paragraphs, multiple keyword
| repetitions and focus on "indexability" instead of readability,
| made the web a less than ideal source for such analysis long
| before LLMs.
|
| It also made the web a less than ideal source for training. And
| yet LLMs were still fed articles written for Googlebot, not
| humans. ML/LLM is the second iteration of writing pollution. The
| first was humans writing for corporate bots, not other humans.
| kevindamm wrote:
| Yes, but not quite as far as you imply. The training data is
| weighted by a quality metric; articles written by journalists
| and Wikipedia contributors are given more weight than Aunt
| May's brownie recipe and corpoblogspam.
| Freak_NL wrote:
| It certainly feels like the amount of regurgitated,
| nonsensical, generated content (nontent?) has risen
| spectacularly specifically in the past few years. 2021 sounds
| about right based on just my own experience, even though I
| can't point to any objective source backing that up.
| jsheard wrote:
| SEO grifters have fully integrated AI at this point, there
| are dozens of turn-key "solutions" for mass-producing
| "content" with the absolute minimum effort possible. It's
| been refined to the point that scraping material from other
| sites, running it through the LLM blender to make it look
| original, and publishing it on a platform like Wordpress is
| fully automated end-to-end.
| sahmeepee wrote:
| Or check out "money printer" on github: a tongue in cheek
| mashup of various tools to take a keyword as input and
| produce a youtube video with subtitles and narration as
| output.
| zharknado wrote:
| Ooh I like "nontent." Nothing like a spicy portmanteau!
| eptcyka wrote:
| I personally have yet to see this beyond some slop on
| YouTube. And I am here for the AI meme videos. I recognize
| the dangers of this; all I am saying is that I don't feel
| the effect, yet.
| ghaff wrote:
| There's been a ton of low-rent listicle writing out there
| for ages. Certainly not new in the past few years. I
| admit I don't go on YouTube much and don't even have a
| tiktok account so it's possible there's a lot of newer
| lousy content I'm not really exposed to.
|
| It seems to me that the fact it's so cheap and relatively
| easy for people with dreams of becoming wealthy
| influencers to put stuff out there has more to do with
| the flood of often mediocre content than AI does.
|
| Of course the vast majority don't have much real success
| and get on with life and the crank turns and a new
| generation perpetuates the cycle.
|
| LLMs etc. may make things marginally easier but there's
| no shortage of twenty somethings with lots of time
| imagining riches while making pennies.
| Freak_NL wrote:
| I'm seeing it a lot when searching for some advice in a
| well-defined subject, like, say, leatherworking or sewing
| (or recipes, obviously). Instead of finding forums with
| hobbyists, in-depth blog posts, or manufacturers advice
| pages, increasingly I find articles which seem like
| natural language at first, but are composed of paragraphs
| and headers repeating platitudes and basic tips. It takes
| a few seconds to realize the site is just pushing
| generated articles.
|
| Increasingly I find that for in-depth explanations or
| tutorials Youtube is the only place to go, but even there
| the search results can lead to loads of videos which just
| seem... off. But at least those are still made by humans.
| eszed wrote:
| Upvoted for "nontent" alone: it'll be my go-to term from
| now on, and I hope it catches on.
|
| Is it of your own coinage? When the AI sifts through the
| digital wreckage of the brief human empire, they may give
| you the credit.
| Freak_NL wrote:
| I do hope it catches on! I did come up with this myself,
| but I really doubt I'm the only one -- and indeed:
| Wiktionary lists it already with a 2023 vintage:
|
| https://en.wiktionary.org/wiki/nontent
| darby_nine wrote:
| Aunt may's brownie recipe (or at least her thoughts on it)
| are likely something you'd want if you want to reflect how
| humans use language. Both news-style and encyclopedia-style
| writing represent a pretty narrow slice.
| creshal wrote:
| That's why search engines rated them highly, and why a
| million spam sites cropped up that paid writers $1/essay to
| pretend to be Aunt May, and why today every recipe website
| has a gigantic useless fake essay in front of their
| copypasted made up recipes.
| darby_nine wrote:
| Ok, but what i said is true regardless of SEO, and that
| SEO has also fed back into english before LLMs were a
| thing. If you only train on those subsets you'll also end
| up with a chatbot that doesn't speak in a way we'll
| identify as natural english.
| actionfromafar wrote:
| Yet. Give it time. The LLMs will train our future
| children.
| darby_nine wrote:
| I'm sure they already are.
| Freak_NL wrote:
| I hate how looking for recipes has become so...
| disheartening. Online recipes are fine for reputable
| sources like newspapers where professional recipe writers
| are paid for their contributions, but searching for some
| Aunt May's recipe for 'X' in the big ocean of the
| internet is pointless -- too much raw sewage dumped in.
|
| It sucks, because sharing recipes seemed like one of
| those things the internet could be really good at.
| smallerfish wrote:
| There seem to be quite a few recipe sharing sites around
| - e.g. allrecipes.com.
| creshal wrote:
| And they're all flooded with low-effort trash, and
| useless.
|
| The only remaining reliable source - now that many
| newspapers are axing the remaining staff in favour of
| LLMs - is pre-2020 print cookbooks. Anything online or
| printed later must be assumed to be tainted, full of
| untested sewage and potentially dangerous suggestions.
| formerly_proven wrote:
| Well there's https://www.allrecipes.com/author/chef-john/
| on that particular site.
| JohnFen wrote:
| Chef John is _the best_.
| davejohnclark wrote:
| I absolutely love Chef John. Great recipes and the
| cadence of his speech on YouTube (foodwishes) is very
| soothing, while he cooks up something amazing. If you're
| a home cook I highly recommend his recipes and his
| channel.
| jerf wrote:
| The wife and I use the internet for recipe _ideas_... but
| we hardly ever follow them directly anymore. We 're no
| formally-trained chefs but we've been home cooks for over
| 20 years now, and so many of them are self-evidently bad,
| or distinctly suboptimal. The internet chef's aversion to
| flavor is a meme with us now; "add one-sixty-fourth of a
| teaspoon of garlic powder to your gallon of soup, and mix
| in two crystals of table salt". Either that or they're
| all getting some seriously potent spices all the time and
| I'd like to know where they shop because my spices are
| nowhere near as powerful as theirs.
| halostatue wrote:
| It's not just online recipes, but cookbooks written for
| the Better Home & Gardens crowd. The ones who write
| "curry powder" (and mean the yellow McCormick stuff which
| is so bland as to have almost no flavour) or call for one
| clove of garlic in their recipe.
|
| I joke with folks that my assumption with "one clove of
| garlic" is that they _really_ mean "one head of garlic"
| if you want any flavour. (And if the recipe title has
| "garlic" _in_ it and you are using one clove, you're
| lying.)
| nick3443 wrote:
| If the recipe has "garlic" in the title, I'm budgeting
| 1/2 head per serving.
| shagie wrote:
| I wish more people presented recipes like cooking for
| engineers. For example - Meat Lasagna
| https://www.cookingforengineers.com/recipe/36/Meat-
| Lasagna
| grues-dinner wrote:
| And here I thought my defacement of printed recipes by
| bracketing everything that goes together at each stage
| was just me. There are, well, maybe not dozens but at
| least two of us! Saves a lot of bowls when you know
| without further checking that you can, say, just dump the
| flour and sugar, butter and eggs into the big bowl
| without having to prepare separately because they're in
| the "1: big bowl" bracket.
| halostatue wrote:
| Depends on what you're doing. For best cookies, you want
| to cream the butter with the sugar, _then_ add the eggs,
| and _finally_ add the flour. If you're interested and can
| find one, it's worth taking a vegan baking class. You
| learn a lot about ingredient substitutions for baking,
| about what the different non-vegan ingredients are doing
| that you have to compensate for...and it does something
| that I've only recently started seeing happen in non-
| vegan baking recipes: it separates the wet ingredients
| from the dry ingredients.
|
| That is, when baking, you can _usually_ (again,
| exceptions for creaming the sugar in butter, etc.) take
| all of your dry ingredients and mix /sift them together,
| and then you pour your wet ingredients in a well you've
| made in the dry ingredients (these can also usually be
| mixed together).
| grues-dinner wrote:
| No need to cakesplain, that was an example with three
| ingredients off the top of my head; very, very obviously
| the exact ingredients and bracket assignments vary
| depending on what you are making.
|
| But for shortbread or fork biscuits those three could
| indeed all go in the bowl in one go (but that one
| admittedly doesn't really need a bracket because the
| recipe is "put in bowl, mix with hands, bake").
| bhasi wrote:
| I love the table-diagrams at the end. I've never seen
| anything like that until now and it really seems useful
| for visualization of the recipe and the sequence of
| steps.
| shagie wrote:
| Combined with pictures for what each step _should_ look
| like. I had a few of these pages printed out back in the
| '00s for some recipes that I did.
| jsheard wrote:
| > The training data is weighted by a quality metric
|
| At least in Googles case, they're having so much difficulty
| keeping AI slop out of their search results that I don't have
| much faith in their ability to give it an appropriately low
| training weight. They're not even filtering the comically
| low-hanging fruit like those YouTube channels which post a
| new "product review" every 10 minutes, with an AI generated
| thumbnail and AI voice reading an AI script that was never
| graced by human eyes before being shat out onto the internet,
| and is of course _always_ a glowing recommendation since the
| point is to get the viewer to click an affiliate link.
|
| Google has been playing the SEO cat and mouse game forever,
| so can startups with a fraction of the experience be expected
| to do any better at filtering the noise out of fresh web
| scrapes?
| epgui wrote:
| I don't think they were talking about the quality of Google
| search results. I believe they were talking about how the
| data was processed by the wordfreq project.
| kevindamm wrote:
| I was actually referring to the data ingestion for
| training LLMs, I don't know what filtering or weighting
| might be done with wordfreq.
| acdha wrote:
| > Google has been playing the SEO cat and mouse game
| forever, so can startups with a fraction of the experience
| be expected to do any better at filtering the noise out of
| fresh web scrapes?
|
| Google has been _monetizing_ the SEO game forever. They
| chose not to act against many notorious actors because the
| metric they optimize for is ad revenue, and those sites
| were loaded with ads. As long as advertisers didn't stop
| buying, they didn't feel much pressure to make big changes.
|
| A smaller company without that inherent conflict of
| interest in its business model can do better because they
| work on a fundamentally different problem.
| noirscape wrote:
| Google has those problems because the company's revenue
| source (Ads) and the thing that puts it on the map (Search)
| are fundamentally at odds with one another.
|
| A useful Search would ideally send a user to the site with
| the most signal and the fewest noise. Meanwhile, ads are
| inherently noise; they're extra pieces of information
| inserted into a webpage that at _best_ tangentially
| correlate to the subject of a page.
|
| Up until ~5 years ago, Google was able to strike a balance
| on keeping these two stable; you'd get results with some
| Ads but the signal generally outweighed the noise.
| Unfortunately, from what I can tell from anecdotes and
| courtroom documents, somewhere in 2018-2019 the Ad team at
| Google essentially hijacked every other aspect of the company
| by threatening that yearly bonuses won't be given out if they
| don't kowtow to the Ad team's wishes to optimize ad revenue,
| and it shows no sign of stopping since there's no
| _effective_ competition to Google. (There's like, Bing and
| Kagi? Nobody uses Bing though and Kagi is only used by tech
| enthusiasts. The problem with Google is that to copy it,
| you need a ton of computing resources upfront and are going
| up against a company with infinitely more money and ability
| to ensure users don't leave their ecosystem; go ahead and
| abandon Search, but good luck convincing others to give up
| say, their Gmail account, which keeps them locked to Google
| and Search will be there, enticing the average user.)
|
| Google has absolutely zero incentive to filter out
| generative AI junk from their search results outside the
| amount of it that's damaging their PR since most of the SEO
| spam is also running Google Ads (since unless you're
| hosting adult content, Google's ad network is practically
| the only option). Their solution therefore isn't to remove
| the AI junk, but to instead reduce it _enough_ to the
| degree where a user will not get the same _type_ of AI junk
| twice.
| PaulHoule wrote:
| My understanding is that Google Ads are what makes Google
| Search unassailable.
|
| A search engine isn't a two-sided market in itself but
| the ad network that supports it is. A better search
| engine is a technological problem, but a decently paying
| ad network is a technological problem _and_ a hard
| marketing problem.
| Suppafly wrote:
| >At least in Googles case, they're having so much
| difficulty keeping AI slop out of their search results that
| I don't have much faith in their ability to give it an
| appropriately low training weight.
|
| I've noticed that lately. It used to be the top google
| result was almost always what you needed. Now at the top is
| an AI summary that is pretty consistently wrong, often in
| ways that aren't immediately obvious if you aren't familiar
| with the topic.
| derefr wrote:
| > those YouTube channels which post a new "product review"
| every 10 minutes, with an AI generated thumbnail and AI
| voice reading an AI script that was never graced by human
| eyes before being shat out onto the internet
|
| The problem is that, of the signals you mention,
|
| * the highly-informative ones (posting a new review every
| 10 minutes, having affiliate links in the description) are
| _contextual_ -- i.e. they're heuristics that only work on
| a site-specific basis. If the point is to create a training
| pipeline that consumes "every video on the Internet" while
| automatically rejecting the videos that are botspam, then
| contextual heuristics of this sort won't scale. (And Google
| "doesn't do things that don't scale.")
|
| * and, conversely, the _context-free_ signals you mention
| (thumbnail looks AI-generated, voice is synthesized) aren't
| actually highly correlated with the script being LLM-barf
| rather than something a human wrote.
|
| Why? One of the primary causes is TikTok (because TikTok
| content gets cross-posted to YouTube a lot.) TikTok has a
| built-in voiceover tool; and many people don't like their
| voice, or don't have a good microphone, or can't speak
| fluent/unaccented English, or whatever else -- so they
| choose to sit there typing out a script on their phone, and
| then have the AI read the script, rather than reading the
| script themselves.
|
| And then, when these videos get cross-posted, usually
| they're being cross-posted in some kind of compilation,
| through some tool that picks an AI-generated thumbnail for
| the compilation.
|
| Yet, all the content in these is _real stuff that humans
| wrote_, and so not something Google would want to throw
| away! (And in fact, such content is frequently a uniquely-
| good example of the "gen-alpha vernacular writing style",
| which otherwise doesn't often appear in the corpus due to
| people of that age not doing much writing in public-web-
| scrapeable places. So Google _really_ wants to sample it.)
| Lalabadie wrote:
| The current state of things leads me to believe that Google's
| current ranking system has been somehow too transparent for
| the last 2-3 years.
|
| The top of search results is consistently crowded by pages
| that obviously game ranking metrics instead of offering any
| value to humans.
| rgrieselhuber wrote:
| Indexability is orthogonal to readability.
| hk__2 wrote:
| It should be, but sadly it's not.
| krelian wrote:
| >And yet LLMs were still fed articles written for Googlebot,
| not humans.
|
| How do we know what content LLMs were fed? Isn't that a highly
| guarded secret?
|
| Won't the quality of the content be paramount to the quality of
| the generated output or does it not work that way?
| GTP wrote:
| We do know that the open web constitutes the bulk of the
| training data, although we don't get to know the specific
| webpages that got used. Plus some more selected sources, like
| books, of which again we only know that those are books but
| not which books were used. So it's just a matter of
| probability that there was a good amount of SEO spam as well.
| ToucanLoucan wrote:
| This feels like a second, magnitudes larger Eternal September.
| I wonder how much more of this the Internet can take before
| everyone just abandons it entirely. My usage is notably lower
| than it was in even 2018, it's so goddamn hard to find anything
| worth reading anymore (which is why I spend so much damn time
| here, tbh).
| wpietri wrote:
| I think it's an arms race, but it's an open question who
| wins.
|
| For a while I thought email as a medium was doomed, but
| spammers mostly lost that arms race. One interesting
| difference is that with spam, the large tech companies were
| basically all fighting against it. But here, many of the
| large tech companies are either providing tools to spammers
| (LLMs) or actively encouraging spammy behaviors (by
| integrating LLMs in ways that encourage people to send out
| text that they didn't write).
| ToucanLoucan wrote:
| > but spammers mostly lost that arms race
|
| I'm not saying this is impossible but that's going to be an
| uphill sell for me as a concept. According to some quick
| stats I checked I'm getting roughly 600 emails per day,
| about 550 of which go directly to spam filtering, and of
| the remaining 50, I'd say about 6 are actually emails I
| want to be receiving. That's an impressive amount overall
| for whoever built this particular filter, but it's also
| still a ton of chaff to sort wheat from and as a result I
| don't use email much for anything apart from when I have
| to.
|
| Like, I guess that's technically usable, I'm much happier
| filtering 44 emails than 594 emails? But that's like saying
| I solved the problem of a flat tire by installing a wooden
| cart wheel.
|
| It's also worth noting there that if I do have an email
| thats flagged as spam that shouldn't be, I then have to
| wade through a much deeper pond of shit to go find it as
| well. So again, better, but IMO not even remotely solved.
| dhosek wrote:
| I'm not sure what you've done to get that level of spam,
| but I get about 10 spam emails a day at most and that's
| across multiple accounts including one that I've used for
| almost 30 years and had used on Usenet which was the
| uber-spam magnet. A couple newer (10-15 year old)
| addresses which I've published on webpages with mailto
| links attract maybe one message a week and one that I
| keep for a specialized purpose (fiction and poetry
| submissions) gets maybe one to two messages per year,
| mostly because it's of the form example@example.com so
| easily guessed by enterprising spammers.
|
| Looking at the last day's spam [1] I have three 419-style
| scams (widows wanting to give away their dead husbands'
| grand piano or multi-million euro estate) and three
| phishing attempts. There are duplicate messages in each
| category.
|
| About fifteen years ago, I did a purge of mailing list
| subscriptions and there's very little that comes in that
| I don't want, most notably a writer who's a nice guy, but
| who interpreted my question about a comment he made on a
| podcast as an invitation to be added to his manually
| managed email list and given that it's only four or five
| messages a year, I guess I can live with that.
|
| [1] I cleaned out spam yesterday while checking for a
| confirmation message from a purchase.
| wpietri wrote:
| I'm having a hard time finding reliably sourced
| statistics here, but I suspect you're an outlier. My
| personal numbers are way better, both on Gmail and
| Fastmail, despite using the same email addresses for
| decades.
| jerf wrote:
| Another problem with this arms race is that spam emails
| actually are largely separable from ham emails for most
| people... or at least they _were_, for most of their run.
| The thousandth email that claims the UN has set aside money
| for me due to my non-existent African noble ancestry that
| they can't find anyone to give it to and I just need to
| send the Thailand embassy some money to start processing my
| multi-million yuan payout and send it to my choice of proxy
| in Colombia to pick it up is quite different from technical
| conversation about some GitHub issue I'm subscribed to, on
| all sorts of metrics.
|
| However, the frontline of the email war has shifted lately.
| Now the most important part of the war is being fought over
| emails that look _just like ham_, but aren't. Business
| frauds where someone convinces you that they are the CEO or
| CFO or some VP and they need you to urgently buy this or
| that for them right now no time to talk is big business
| right now, and before you get too high-and-mighty about how
| immune you are to that, they are now extremely good at
| looking official. This war has not been won yet, and to a
| large degree, isn't something you necessarily win by AI
| either.
|
| I think there's an analogy here to the war on content slop.
| Since what the content slop wants is just for you to see it
| so they can serve you ads, it doesn't need anything else
| that our algorithms could trip on, like links to malware or
| calls to action to be defrauded, or anything else. It looks
| _just_ like the real stuff, and telling that it isn't could
| require rather vast amounts of human input just to be mostly
| sure. Except we don't have the ability to
| authenticate where it came from. (There is no content
| authentication solution that will work at scale. No matter
| how you try to get humans to "sign their work" people will
| always work out how to automate it and then it's done.) So
| the one good and solid signal that helps in email is gone
| for general web content.
|
| I don't judge this as a winning scenario for the defenders
| here. It's not a total victory for the attackers either,
| but I'd hesitate to even call an advantage for one side or
| the other. Fighting AI slop is not going to be easy.
| pyrale wrote:
| > but spammers mostly lost that arms race.
|
| Advertising in your mails isn't spam, in Google's view.
| jsheard wrote:
| The fight against spam email also led to mass consolidation
| of what was supposed to be a decentralised system though.
| Monoliths like Google and Microsoft now act as de-facto
| gatekeepers who decide whether or not you're allowed to
| send emails, and there's little to no transparency or
| recourse to their decisions.
|
| There's probably an analogy to be made about the open
| decentralised internet in the age of AI here, if it gets to
| the point that search engines have to assume all sites are
| spam by default until proven otherwise, much like how an
| email server is assumed guilty until proven innocent.
| BeFlatXIII wrote:
| I hope this trend accelerates to force us all into grass-
| touching and book-reading. The sooner, the better.
| MrLeap wrote:
| Books printed before 2018, right?
|
| I already find myself mentally filtering out audible
| releases after a certain date unless they're from an author
| I recognize.
| bondarchuk wrote:
| At some point though you have to acknowledge that a specific
| use of language belongs to the medium through which you're
| counting word frequencies. There are also specific writing
| styles (including sentence/paragraph sizes, unnecessary
| repetitions, focusing on other metrics than readability)
| associated with newspapers, novels, e-mails to your boss,
| anything really. As long as text was written by a human who was
| counting on at least some remote possibility that another human
| might read it, this is way more legitimate use of language than
| just generating it with a machine.
| doe_eyes wrote:
| > I agree in general but the web was already polluted by
| Google's unwritten SEO rules. Single-sentence paragraphs,
| multiple keyword repetitions and focus on "indexability"
| instead of readability, made the web a less than ideal source
| for such analysis long before LLMs.
|
| Blog spam was generally written by humans. While it sucked for
| other reasons, it seemed fine for measuring basic word
| frequencies in human-written text. The frequencies are probably
| biased in _some_ ways, but this is true for most text. A
| textbook on carburetor maintenance is going to have the word
| "carburetor" at way above the baseline. As long as you have a
| healthy mix of varied books, news articles, and blogs, you're
| fine.
|
| In contrast, LLM content is just a serpent eating its own tail
| - you're trying to build a statistical model of word
| distribution off the output of a (more sophisticated) model of
| word distribution.
| weinzierl wrote:
| Isn't it the other way around?
|
| SEO text carefully tuned to tf-idf metrics and keyword-
| stuffed to the empirically determined threshold Google just
| allows should have unnatural word frequencies.
|
| LLM content should just enhance and cement the status quo
| word frequencies.
|
| Outliers like the word _"delve"_ could just be sentinels,
| carefully placed like trap streets on a map.
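For readers unfamiliar with the metric being gamed here: tf-idf weighs how often a term appears in one document against how rare it is across a corpus, so stuffing a keyword inflates that document's score for it. A minimal stdlib sketch (the `tf_idf` helper and the toy corpus are illustrative only, not how any search engine actually scores pages):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Toy tf-idf: term frequency in one document, weighted by
    the (smoothed) log inverse document frequency across the corpus."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)  # document frequency
    idf = math.log(len(corpus) / (1 + df))       # smoothed idf
    return tf * idf

corpus = [
    "cheap flights cheap deals cheap tickets".split(),  # keyword-stuffed
    "a quiet essay about airports and waiting".split(),
    "flights delayed again at the airport".split(),
]

# Stuffing "cheap" into one document pushes its tf-idf score well
# above the other terms' -- the signal SEO text is tuned to saturate.
print(tf_idf("cheap", corpus[0], corpus))
```

The point of the comment above is that text optimized against this metric ends up with word frequencies no human writer would produce.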
| lbhdc wrote:
| > LLM content should just enhance and cement the status quo
| word frequencies.
|
| TFA mentions this hasn't been the case.
| flakiness wrote:
| Would you mind dropping the link talking about this
| point? (context: I'm a total outsider and have no idea
| what TFA is.)
| derefr wrote:
| 1. People don't generally use the (big, whole-web-corpus-
| trained) general-purpose LLM base-models to generate bot
| slop for the web. Paying per API call to generate that kind
| of stuff would be far too expensive; it'd be like paying
| for eStamps to send spam email. Spambot developers use
| smaller open-source models, trained on much smaller
| corpuses, sized and quantized to generate text that's "just
| good enough" to pass muster. This creates a sampling bias
| in the word-associational "knowledge" the model is working
| from when generating.
|
| 2. Given how LLMs work, a prompt _is_ a bias -- they're
| one-and-the-same. You can't ask an LLM to write you a
| mystery novel without it somewhat adopting the writing
| quirks common to the particular mystery novels it has
| "read." Even the writing style you use _in_ your prompt
| influences this bias. (It's common advice among "AI
| character" chatbot authors, to write the "character card"
| describing a character, in the style that you want the
| character speaking in, for exactly this reason.) Whatever
| prompt the developer uses, is going to bias the bot away
| from the statistical norm, toward the writing-style
| elements that exist within whatever hypersphere of
| association-space contains plausible completions of the
| prompt.
|
| 3. Bot authors do SEO too! They take the tf-idf metrics and
| keyword stuffing, and turn it into _training data_ to
| _fine-tune_ models, in effect creating "automated SEO
| experts" that write in the SEO-compatible style by default.
| (And in so doing, they introduce unintentional further
| bias, given that the SEO-optimized training dataset likely
| is not an otherwise-perfect representative sampling of
| writing style for the target language.)
| mlsu wrote:
| But you can already see it with Delve. Mistral uses "delve"
| more than baseline, because it was trained on GPT.
|
| So it's classic positive feedback. LLM uses delve more,
| delve appears in training data more, LLM uses delve more...
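That feedback loop can be sketched numerically. This toy model (every rate and mixing weight here is a made-up assumption, not a measurement) tracks one word's frequency across training generations, where each generation learns from a blend of fixed human baseline text and the previous model's slightly word-inflated output:

```python
def next_rate(model_rate, human_rate=0.001, llm_boost=1.5, llm_share=0.5):
    """One training generation: the corpus is part human text at the
    baseline rate, part machine text that overuses the word by
    `llm_boost`; the new model learns the blended rate."""
    return (1 - llm_share) * human_rate + llm_share * llm_boost * model_rate

rate = 0.001  # start at the human baseline
history = [rate]
for generation in range(10):
    rate = next_rate(rate)
    history.append(rate)

# The rate climbs every generation. With these numbers it saturates
# below 2x baseline because the human share anchors it; once
# llm_share * llm_boost reaches 1, the loop diverges instead.
print(history[0], history[-1])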
|
| Who knows what other semantic quirks are being amplified
| like this. It could be something much more subtle, like
| cadence or sentence structure. I already notice that GPT
| has a "tone" and Claude has a "tone" and they're all sort
| of "GPT-like." I've read comments online that stop and make
| me question whether they're coming from a bot, just because
| their word choice and structure echoes GPT. It will sink
| into human writing too, since everyone is learning in high
| school and college that the way you write is by asking GPT
| for a first draft and then tweaking it (or not).
|
| Unfortunately, I think human and machine generated text are
| entirely miscible. There is no "baseline" outside the
| machines, other than from pre-2022 text. Like pre-atomic
| steel.
| taneq wrote:
| > LLM uses delve more, delve appears in training data
| more, LLM uses delve more...
|
| Some day we may view this as the beginnings of machine
| culture.
| mlsu wrote:
| Oh no, it's been here for quite a while. Our culture is
| already heavily glued to the machine. The way we express
| ourselves, the language we use, even our very self-
| conception originates increasingly in online spaces.
|
| Have you ever seen someone use their smartphone? They're
| not "here," they are "there." Forming themselves in
| cyberspace -- or being formed, by the machine.
| pphysch wrote:
| It's crazy to attribute the downfall of the web/search to
| Google. What does Google have to do with all the genuine open
| web content, Google's source of wealth, getting starved by
| (increasingly) walled gardens like Facebook, Reddit, Discord?
|
| I don't see how Google's SEO rules being written or unwritten
| has any bearing. Spammers will always find a way.
| sahmeepee wrote:
| Prior to Google we had Altavista and in those days it was
| incredibly common to find keywords spammed hundreds of times in
| white text on a white background in the footer of a page. SEO
| spam is not new, it's just different.
| redbell wrote:
| > ML/LLM is the second iteration of writing pollution. The
| first was humans writing for corporate bots, not other humans.
|
| Based on the process above, naturally, the third iteration then
| is _LLMs writing for corporate bots, neither for humans nor for
| other LLMs_.
| hoseja wrote:
| >"Now Twitter is gone anyway, its public APIs have shut down, and
| the site has been replaced with an oligarch's plaything, a spam-
| infested right-wing cesspool called X. Even if X made its raw
| data feed available (which it doesn't), there would be no
| valuable information to be found there.
|
| >Reddit also stopped providing public data archives, and now they
| sell their archives at a price that only OpenAI will pay.
|
| >And given what's happening to the field, I don't blame them."
|
| What beautiful doublethink.
| mschuster91 wrote:
| > What beautiful doublethink.
|
| Given just how many AI bots scrape up everything they can,
| oftentimes ignoring robots.txt or _any_ rate limits (there have
| been a few complaint threads on HN about that), I can hardly
| blame the operators of large online services for cutting off
| data feeds.
|
| Twitter however didn't stop their data feeds due to AI or
| because they wanted money, they stopped providing them because
| its new owner does everything he can to hinder researchers
| specializing in propaganda campaigns or public scrutiny.
| hluska wrote:
| What was Reddit's excuse? They did roughly the same thing
| (and have just as much garbage content).
|
| In other words, why is it wrong for X but okay for Reddit? If
| you ignore one individual's politics, the two services did
| the same thing.
| mschuster91 wrote:
| Reddit shut their API access down only very recently, after
| the AI craze went off. Twitter did so right after Musk took
| over, way before Reddit, way before AI ever went nuts.
| dotnet00 wrote:
| X shut down API access in Feb 2023, Reddit shut theirs
| down at the end of June of the same year. Just barely 6
| months apart.
|
| Furthermore, while X had also only announced this in
| February, Reddit announced their API shutdown just 2
| months later in April.
|
| And, to further add to that, X was pretty upfront that
| they think they have access to a large and powerful
| dataset in X and didn't want to give it out for free.
| Reddit used very similar wording when announcing their
| changes.
| DebtDeflation wrote:
| Enshittification is accelerating. A good 70% of my Facebook feed
| is now obviously AI generated images with AI generated text
| blurbs that have nothing to do with the accompanying images,
| likely posted by overseas bot farms. I'm also noticing more and
| more "books" on Amazon that are clearly AI generated and self
| published.
| janice1999 wrote:
| It's okay. Amazon has limited authors to self publishing only 3
| books per day (yes, really). That will surely solve the
| problem.
| wpietri wrote:
| Hah! I'm trying to figure out the exact date that crossed
| from "plausible line from a Stross or Sterling novel" [1] to
| "of course they did".
|
| [1] Or maybe Sheckley or Lem, now that I think about it.
| Drakim wrote:
| I read that as 3 books per year at first and thought to
| myself that that was a rather harsh limitation but surely any
| true respectable author wouldn't be spitting out more than
| that...
|
| ...and then I realized you wrote 3 books a day. What the
| hell.
| Sohcahtoa82 wrote:
| > A good 70% of my Facebook feed is now obviously AI generated
| images with AI generated text blurbs that have nothing to do
| with the accompanying images likely posted by overseas bot
| farms.
|
| This is a self-inflicted problem, IMO.
|
| Do you just have shitty friends that share all that crap? Or
| are you following shitty pages?
|
| I use Facebook a decent amount, and I don't suffer from what
| you're complaining about. Your feed is made of what you make
| it. Unfollow the pages that make that crap. If you have friends
| that share it, consider unfriending or at the very least,
| unfollowing. Or just block the specific pages they're sharing
| posts from.
| aucisson_masque wrote:
| Did we (the humans) somehow manage to pollute the internet
| so much with AI that it's now barely usable?
|
| In my opinion the internet can be considered the equivalent
| of a natural environment like the earth. It's a space where
| people share, meet, talk, etc.
|
| I find it astonishing that after polluting our natural
| environment we have now polluted the internet.
| nkozyra wrote:
| > Did we (the humans) somehow manage to pollute the internet
| so much with AI that it's now barely usable
|
| If we haven't already, we will be very soon. I'm sure there are
| people working on this problem, but I think we're starting to
| hit a very imminent feedback loop moment. Most of humanity's
| recorded information is digitized and most of that is
| generating non-human content at an incredible pace. We've
| injected a whole lot of noise into our usable data.
|
| I don't know if the answer is more human content (I'm doing my
| part!) or novel generative content but this interim period is
| going to cause some medium-term challenges.
|
| I like to think the LLM more-tokens-equals-better era is fading
| and we're getting into better _use_ of existing data, but
| there's a very real inflection point we're facing.
| ashton314 wrote:
| That's a nice analogy. Fortunately (un)real estate is easier to
| manufacture out of thin air online. We have lost some valuable
| spaces like Twitter and Reddit to some degree though.
| surfingdino wrote:
| Yes. Here are practical instructions on how to turn it into
| even more of a cesspit:
| https://www.youtube.com/watch?v=endHz0jo9Ck I think it's now a
| law of nature that any new tech leads to SEO amplification. AI
| has become the Degelman M34 Manure Spreader of the internet
| https://degelman.com/products/manure-spreaders
| coldpie wrote:
| There are smaller, gated communities that are still very
| valuable. You're posting in one. But yes, the open Internet is
| basically useless now, thanks ultimately to advertising as a
| business model.
| nicholassmith wrote:
| I've seen plenty of comments here that read like they've been
| generated by an LLM, if this is a gated community we need a
| better gate.
| coldpie wrote:
| Sure, there's bad actors everywhere, but there's really no
| incentive to do it here so I don't think it's a _problem_
| in the same way it is on the open internet, where slop is
| actively rewarded.
| globular-toast wrote:
| It's hard to tell, though. People have been saying my
| borderline autistic comments sound like GPT for years now.
| whimsicalism wrote:
| this is not a gated community at all
| thwarted wrote:
| Tragedy of the Commons Ruins Everything Around Me
| left-struck wrote:
| >We the humans
|
| Nice try
|
| If it's not clear, I'm joking.
| mathnmusic wrote:
| > Did we (the humans) somehow managed to pollute the internet
|
| Corporations did that, not humans.
|
| "few people recognize that we already share our world with
| artificial creatures that participate as intelligent agents in
| our society: corporations" - https://arxiv.org/abs/1204.4116
| aucisson_masque wrote:
| It could be used to spot LLM generated text.
|
| compare the frequency of words to those used in human natural
| writings and you spot the computer from the human.
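A crude version of that test needs nothing but the standard library: build word-frequency profiles for a reference (human) corpus and a candidate text, then score how far apart they are. The Jeffreys-style divergence and the floor value below are illustrative choices; a serious detector would need a large reference corpus (exactly what wordfreq provided) and real tokenization:

```python
import math
from collections import Counter

def freq_profile(text):
    """Relative word frequencies of a whitespace-tokenized text."""
    words = text.lower().split()
    total = len(words)
    return {w: count / total for w, count in Counter(words).items()}

def divergence(sample, reference, floor=1e-6):
    """Symmetric KL-style distance between two frequency profiles:
    0 for identical profiles, larger as word usage drifts apart."""
    vocab = set(sample) | set(reference)
    score = 0.0
    for w in vocab:
        p = sample.get(w, floor)
        q = reference.get(w, floor)
        score += (p - q) * math.log(p / q)  # each term is >= 0
    return score

human = freq_profile("the cat sat on the mat and the dog slept")
robot = freq_profile("delve delve into the rich tapestry of the mat")
print(divergence(robot, human))
```

As the replies below note, this only separates LLM text from a *pre-LLM* human baseline, and only statistically: scores on short texts are noisy, and the baseline itself drifts as humans absorb LLM style.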
| Lvl999Noob wrote:
| It could be used to differentiate LLM text from pre-LLM human
| text maybe. The thing is, our AIs may not be very good at learning
| but our brains are. The more we use AI, the more we integrate
| LLMs and other tools into our life, the more their output will
| influence us. I believe there was a study (or a few anecdotes)
| where college papers checked for AI material were marked AI
| written even though they were written by humans because the
| students used AI during their studying and learned from it.
| thfuran wrote:
| >our AIs may not be very good at learning but our brains are
|
| Brains aren't nearly as good at slightly adjusting the
| statistical properties of a text corpus as computers are.
| MPSimmons wrote:
| You're exactly right. You only have to look at the prevalence
| of the word "unalive" in real life contexts to find an
| example.
| left-struck wrote:
| > The more we use AI, the more we integrate LLMs and other
| tools into our life, the more their output will influence us
|
| Hmm I don't disagree but I think it will be valuable skill
| going forward to write text that doesn't read like it was
| written by an LLM
|
| This is an arms race that I'm not sure we can win though.
| It's almost like a GAN.
| TacticalCoder wrote:
| > ... compare the frequency of words to those used in human
| natural writings and you spot the computer from the human.
|
| But that's a losing endeavor: if you can do that, you can
| immediately ask your LLM to fix its output so that it passes
| that test (and many others). It can introduce typos, make small
| errors on purpose, and anything you can think of to make it
| look human.
| ithkuil wrote:
| it may work for a short time, but after a while natural
| language will evolve due to natural exposure of those new words
| or word patterns, and even humans will write in ways that,
| while being different from the LLMs, will also be different
| from the snapshot captured by this dataset. It's already the
| case that
| we used to write differently 20 years ago from 50 years ago and
| even more so 100 years ago, etc
| slashdave wrote:
| Hardly. You are talking about a statistical test, which will
| have rather large errors (since it is based on word
| frequencies). Not to mention word frequencies will vary
| depending on the type of text (essay, description,
| advertisement, etc).
| iamnotsure wrote:
| "Multi-script languages
|
| Two of the languages we support, Serbian and Chinese, are written
| in multiple scripts. To avoid spurious differences in word
| frequencies, we automatically transliterate the characters in
| these languages when looking up their words.
|
| Serbian text written in Cyrillic letters is automatically
| converted to Latin letters, using standard Serbian
| transliteration, when the requested language is sr or sh."
|
| I'd support keeping both scripts (srpska ćirilica and Latin
| script), similarly to hiragana and katakana in Japanese.
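The transliteration step the docs describe is mechanical; here is a minimal sketch of the Serbian Cyrillic-to-Latin mapping (the table follows the standard correspondence, lowercase only, and wordfreq's actual implementation may differ in details):

```python
# Standard Serbian Cyrillic -> Latin correspondence, lowercase
# only for brevity; note the multi-letter digraphs lj, nj, dž.
CYR_TO_LAT = str.maketrans({
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd',
    'ђ': 'đ', 'е': 'e', 'ж': 'ž', 'з': 'z', 'и': 'i',
    'ј': 'j', 'к': 'k', 'л': 'l', 'љ': 'lj', 'м': 'm',
    'н': 'n', 'њ': 'nj', 'о': 'o', 'п': 'p', 'р': 'r',
    'с': 's', 'т': 't', 'ћ': 'ć', 'у': 'u', 'ф': 'f',
    'х': 'h', 'ц': 'c', 'ч': 'č', 'џ': 'dž', 'ш': 'š',
})

def sr_latin(word):
    """Transliterate lowercase Serbian Cyrillic to Latin script,
    so that e.g. 'љубав' and 'ljubav' count as the same word."""
    return word.translate(CYR_TO_LAT)

print(sr_latin('љубав'), sr_latin('џеп'))  # both now in Latin script
```

Collapsing both scripts to one form avoids splitting a word's counts in two -- the "spurious differences" the docs mention; the commenter's point is that it also erases which script a community actually wrote in.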
| eqvinox wrote:
| Why is this a HN comment on a thread about it ending due to AI
| pollution?
| dsign wrote:
| Somehow related, paper books from before 2020 could be a valuable
| commodity in a decade or two, when the Internet will be full
| of slop and even contemporary paper books will be treated with
| suspicion. And there will be human talking heads posing as the
| authors of books written by very smart AIs. God, why are we doing
| this????
| rvnx wrote:
| To support well-known "philanthropists" like Sam Altman or Mark
| Zuckerberg that many consider as their heroes here.
| user432678 wrote:
| And I thought I had some kind of mental illness collecting all
| those books, barely reading them. Need to do that more now.
| globular-toast wrote:
| Yes. I've always loved my books but now consider them my most
| valuable possessions.
| RomanAlexander wrote:
| Or AI talking heads posing as the author of books written by
| AIs. https://youtu.be/pAPGRGTqIgI (warning: state sponsored
| disinformation AI)
| weinzierl wrote:
| _"I don't think anyone has reliable information about post-2021
| language usage by humans."_
|
| We've been past the tipping point when it comes to text for some
| time, but for video I feel we are living through the watershed
| moment right now.
|
| Especially smaller children don't have a good intuition on what
| is real and what is not. When I get asked if the person in a
| video is real, I still feel pretty confident to answer but I get
| less and less confident every day.
|
| The technology is certainly there, but the majority of video
| content is still not affected by it. I expect this to change very
| soon.
| olabyne wrote:
| I never thought about that. Humans losing their ability to
| tell AI content from reality? It's frightening.
| wraptile wrote:
| I find issue with this statement as content was never a clean
| representation of human actions or even thought. It was
| always driven by editorials, SEO, bot remixing and whatnot
| that heavily influences how we produce content. One might
| even argue that heightened content distrust is _good_ for our
| society.
| BiteCode_dev wrote:
| It's worse because many humans don't know they are.
|
| I see a lot of outrage around fake posts already. People want
| to believe bad things from the other tribes.
|
| And we are going to feed them with it, endlessly.
| PhunkyPhil wrote:
| Did you think the same thing when photoshop came out?
|
| It's relatively trivial to photoshop misinformation in a
| really powerful and undetectable way- but I don't see
| (legitimate) instances of groundbreaking news over a fake
| photo of the president or a CEO etc doing something
| nefarious. Why is AI different just because it's
| audio/video?
| Sharlin wrote:
| It's worse: they don't even care.
| bunderbunder wrote:
| This video's worth a watch if you want to get a sense of the
| current state of things. Despite the (deliberately) clickbait
| title, the video itself is pretty even-handed.
|
| It's by Language Jones, a YouTube linguist. Title: "The AI
| Apocalypse is Here"
|
| https://youtu.be/XeQ-y5QFdB4
| jerf wrote:
| It's even worse than that. Most people have no idea how far
| CGI has come, and how easily it is wielded even by a couple
| of dedicated teens on their home computer, let alone people
| with a vested interest in faking something for some financial
| reason. People think they know what a "special effect" looks
| like, and for the most part, people are _wrong_. They know
| what CGI being used to create something obviously impossible,
| like a dinosaur stomping through a city, looks like. They
| have no idea how easy a lot of stuff is to fake already. AI
| just adds to what is already there. Heck, to some extent it
| has caused scammers to overreach, with things like obviously
| fake Elon Musk videos on YouTube generated from (pure) AI and
| text-to-speech... when with just a little bit more learning,
| practice, and amounts of equipment completely reasonable for
| one person to obtain, they could have done a _much_ better
| fake of Elon Musk using special effects techniques rather
| than shoveling text into an AI. The fact that "shoveling
| text into an AI" may in another few years itself generate
| immaculate videos is more a bonus than a fundamental change
| of capability.
|
| Even what's free & open source in the special effects
| community is astonishing lately.
| bee_rider wrote:
| Plus, movies continue (for some reason) to be made with
| very bad and obvious CGI, leading people to believe all CGI
| is easy to spot.
| PhunkyPhil wrote:
| This is a common survivorship bias fallacy since you only
| notice the bad CGI.
|
| I'm certain you'd be shocked to see the amount of CG
| that's in some of your favorite movies made in the last
| ~10-20 years that you didn't notice _because it's
| undetectable_
| bee_rider wrote:
| I won't be, I'm aware that lots of movies are mostly CGI.
|
| But, yeah, I do think it is some kind of bias. Maybe not
| survivorship, though... maybe it is a generalized sort of
| Malmquist bias? Like the measurement is not skewed by the
| tendency of movies with good CGI to go away. It is skewed
| by the fact that bad CGI sticks out.
| bee_rider wrote:
| Actually wait I take it back, I mean, I was aware that
| lots of Digital Touch-up happens in movie sets, more than
| lots of people might expect, and more often than one
| might expect even in mundane movies, but even still, this
| comment's video was pretty shocking anyway.
|
| https://news.ycombinator.com/item?id=41584276
| xsmasher wrote:
| This is an amazing demo reel of effects shots used in
| "mundane" TV shows - comedies and police procedurals - for
| faking locations.
|
| https://www.youtube.com/watch?v=clnozSXyF4k
| bee_rider wrote:
| That is really something even as somebody who expects
| lots of CGI touch-up in sets.
| jhbadger wrote:
| And you see things like the _The Lion King_ remake or its
| upcoming prequel being called "live action" because it
| doesn't look like a cartoon like the original. But they
| didn't film actual lions running around -- it's all CGI.
| hn_throwaway_99 wrote:
| I mean, it's already apparent to me that a lot of people
| don't have a basic process in place to detect fact from
| fiction. And it's definitely not always easy, but when I hear
| some of the dumbest conspiracy theories known to man actually
| get traction in our media, political figures, and society at
| large, I just have to shake my head and laugh to keep from
| crying. I'm constantly reminded of my favorite saying,
| "people who believe in conspiracy theories have never been a
| project manager."
| bongodongobob wrote:
| Oh they definitely are. A lot of people are now calling out
| real photos as fake. I frequently get into stupid Instagram
| political arguments and a lot of times they come back with
| "yeah nice profile with all your AI art haha". It's all real
| high quality photography. Honestly, I don't think the avg
| person can tell anymore.
| ziml77 wrote:
| I've reached a point where even if my first reaction to a
| photo is to be impressed, I then quickly think "oh but what
| if this is AI?" and then immediately my excitement for the
| photo is ruined because it may not actually be a photo at
| all.
| bongodongobob wrote:
| I don't get that perspective at all. Who cares what made
| it.
| Suppafly wrote:
| >Humans losing their ability to detect AI content from
| reality? It's frightening.
|
| And it already happened, and no one pushed back while it was
| happening.
| BeFlatXIII wrote:
| It's a defense lawyer's dream.
| frognumber wrote:
| There are a series of challenges like:
|
| https://www.nytimes.com/interactive/2024/09/09/technology/ai...
|
| https://www.nytimes.com/interactive/2024/01/19/technology/ar...
|
| These are a little bit unfair, in that we're comparing
| handpicked examples, but I don't think many experts will pass a
| test like this. Technology only moves forward (and seemingly,
| at an accelerating pace).
|
| What's a little shocking to me is the speed of progress.
| Humanity is almost 3 million years old. Homo sapiens is around
| 300,000 years old. Cities, agriculture, and civilization are
| around 10,000. Metal is around 4,000. The industrial revolution
| is 500. Democracy? 200. Computation? 50-100.
|
| The revolutions shorten in time, seemingly exponentially.
|
| Comparing the world of today to that of my childhood....
|
| One revolution I'm still coming to grips with is automated
| manufacturing. Going on aliexpress, so much stuff is basically
| free. I bought a 5-port 120W (total) charger for less than 2
| minutes of my time. It literally took less time to find it than
| to earn the money to buy it.
|
| I'm not quite sure where this is all headed.
| knodi123 wrote:
| +100w chargers are one of the products I prefer to spend a
| little more on, so I get something from a company that knows
| it can be sued if they make a product that burns down your
| house or fries your phone.
|
| Flashlights? Sure, bring on aliexpress. USB cables with pop-
| off magnetically attached heads, no problem. But power
| supplies? Welp, to each their own!
| fph wrote:
| And then you plug your cheap pop-off USB cable into the
| expensive 100w charger?
| knodi123 wrote:
| Yeah, sure, what could possibly go wrong? :-P
|
| But seriously, it's harder to accidentally make a USB
| cable that fries your equipment. The more common failure
| mode is it fails to work, or wears out too fast. Chargers
| on the other hand, handle a lot of voltage, generate a
| lot of heat, and output to sensitive equipment. More room
| to mess up, and more room for mistakes to cause damage.
| bee_rider wrote:
| > One revolution I'm still coming to grips with is automated
| manufacturing. Going on aliexpress, so much stuff is
| basically free. I bought a 5-port 120W (total) charger for
| less than 2 minutes of my time. It literally took less time
| to find it than to earn the money to buy it.
|
| Is there a big recent qualitative change here? Or is this a
| continuation of manufacturing trends (also shocking, not
| trying to minimize it all, just curious if there's some new
| manufacturing tech I wasn't aware of).
|
| For some reason, your comment got me thinking of a fully
| automated system, like: you go to a website, pick and choose
| charger capabilities (ports, does it have a battery, that
| sort of stuff). Then an automated factory makes you a bespoke
| device (software picks an appropriate shell, regulators,
| etc). I bet we'll see it in our lifetimes at least.
| homebrewer wrote:
| > so much stuff is basically free
|
| It really isn't. Have a look at daily median income
| statistics for the rest of the planet:
|
| https://ourworldindata.org/grapher/daily-median-income?tab=t...
|
|     $2.48   Eastern and Southern Africa (PIP)
|     $2.78   Sub-Saharan Africa (PIP)
|     $3.22   Western and Central Africa (PIP)
|     $3.72   India (rural)
|     $4.22   South Asia (PIP)
|     $4.60   India (urban)
|     $5.40   Indonesia (rural)
|     $6.54   Indonesia (urban)
|     $7.50   Middle East and North Africa (PIP)
|     $8.05   China (rural)
|     $10.00  East Asia and Pacific (PIP)
|     $11.60  Latin America and the Caribbean (PIP)
|     $12.52  China (urban)
|
| And more generally: $7.75 World
|
| I looked around on Ali, and the cheapest charger that doesn't
| look too dangerous costs around five bucks. So it's roughly
| equal to one day's income of at least half the population of
| our planet.
| jodrellblank wrote:
| > " _The revolutions shorten in time, seemingly
| exponentially._ "
|
| The Technological Singularity -
| https://en.wikipedia.org/wiki/Technological_singularity
| MengerSponge wrote:
| Democracy is 200? You're off by a full order of magnitude.
|
| Progress isn't inevitable. It's possible for knowledge to be
| lost and for civilization to regress.
| bsder wrote:
| > When I get asked if the person in a video is real, I still
| feel pretty confident to answer
|
| I don't share your confidence in identifying real people
| anymore.
|
| I often flag as "false-ish" a lot of things from genuinely real
| people, but who have adopted the behaviors of the
| TikTok/Insta/YouTube creator. Hell, my beard is grey and even I
| poked fun at "YouTube Thumbnail Face" back in 2020 in a video
| talk I gave. AI twigs into these "semi-human" behavioral
| patterns super fast and super hard.
|
| There is a video floating around with pairs of young ladies
| with "This is real"/"This is not real" on signs. They could be
| completely lying about both, and I really can't tell the
| difference. All of them have behavioral patterns that seems a
| little "off" but are consistent with the small number of
| "influencer" videos I have exposure to.
| apricot wrote:
| > When I get asked if the person in a video is real, I still
| feel pretty confident to answer
|
| I don't. I mean, I can identify the bad ones, sure, but how do
| I know I'm not getting fooled by the good ones?
| weinzierl wrote:
| That is very true, but for now we have a baseline of videos
| that we either remember or that we remember key details of,
| like the persons in the video. I'm pretty sure if I watch
| _The Primeagen_ or _Tom Scott_ today, that they are real. Ask
| me in a year, I might not be so sure anymore.
| donatj wrote:
| I hear this complaint often but in reality I have encountered
| fairly little content in my day to day that has felt fully AI
| generated? AI assisted sure, but is that a problem if a human is
| in the mix, curating?
|
| I certainly have not encountered enough straight drivel where I
| would think it would have a significant effect on overall word
| statistics.
|
| I suspect there may be some over-identification of AI content
| happening, a sort of Baader-Meinhof effect cognitive bias. People
| have their eye out for it and suddenly everything that reads a
| little weird logically "must be AI generated" and isn't just a
| bad human writer.
|
| Maybe I am biased: about a decade ago I worked for an SEO
| company with a team of copywriters who pumped out mountains of
| the most inane keyword-packed text, designed for literally no
| one but Google to read. It would rot your brain if you tried to
| read it, and it was written by hand by a team of human beings.
| This existed WELL before generative AI.
| pavel_lishin wrote:
| > _I hear this complaint often but in reality I have
| encountered fairly little content in my day to day that has
| felt fully AI generated?_
|
| How confident are you in this assessment?
|
| > _straight drivel_
|
| We're past the point where what AI generates is "straight
| drivel"; every minute, it's harder to distinguish AI output
| from actual output unless you're approaching expertise in the
| subject being written about.
|
| > _a team of copywriters who pumped out mountains of the most
| inane keyword-packed text, designed for literally no one but
| Google to read._
|
| And now a machine can generate the same amount of output in 30
| seconds. Scale matters.
| PhunkyPhil wrote:
| > every minute, it's harder to distinguish AI output from
| actual output unless you're approaching expertise in the
| subject being written about.
|
| So, then what _really_ is the problem with just including
| LLM-generated text in wordfreq?
|
| If quirky word distributions will remain a "problem", then
| I'd bet that human distributions for those words will follow
| shortly after (people are _very_ quick to change their speech
| based on their environment, it 's why language can change so
| quickly).
|
| Why not just own the fact that LLMs are going to be affecting
| our speech?
| cyberes wrote:
| The guy sounds intolerable and comes across as annoying to listen
| to.
| floppiplopp wrote:
| I really like the fact that the content of the conventional user
| content internet is becoming willfully polluted and ever more
| useless by the incessant influx of "ai"-garbage. At some point
| all of this will become so awful that nerds will create new and
| quiet corners of real people and real information while the idiot
| rabble has to use new and expensive tools peddled by scammy tech
| bros to handle the stench of automated manure that flows out of
| stagnant llms digesting themselves.
| biofox wrote:
| Most of the time, HN is that quiet corner. I just hope it stays
| that way.
| JohnFen wrote:
| > At some point all of this will become so awful that nerds
| will create new and quiet corners of real people and real
| information
|
| It's already happening. There is a growing number of groups
| that are forming their own "private internets" that is
| separated from the internet-at-large, precisely because the
| internet at large is becoming increasingly useless for a whole
| lot of valuable things.
| PeterStuer wrote:
| Intuitively I feel like word frequency would be one of the things
| least impacted by LLM output, no?
| baq wrote:
| 'delve' is given as an example right there in TFA.
| PeterStuer wrote:
| Yes, but the material presented in no way makes a distinction
| between potential organic growth of 'delve' vs. LLM-induced
| use. They just note that even though 'delve' was on the rise,
| in 23-24 the word gained more popularity at the same time that
| ChatGPT rose. Word adoption is certainly not a linear
| phenomenon. And as the author states 'I don't think anyone
| has reliable information about post-2021 language usage by
| humans'
|
| So I would still state noun-phrase frequency in LLM output
| would tend to reflect noun-phrase frequency in training data
| in a similar context (disregarding enforced bias induced
| through RLHF and other tuning at the moment)
|
| I'm sure there will be cross-fertilization from LLM to Human
| and back, but I'm not seeing the data yet that the influence
| on word-frequency is that outspoken.
|
| The author seems to have some other objections to the rise of
| LLMs, which I fully understand.
| beepbooptheory wrote:
| Even granting that we can disregard a really huge factor
| here, which I'm not sure we really can, one can not know
| beforehand how the clustering of the vocabulary is going to
| go pre-training, and its speculated that both at the center
| and at the edges of clusters we get random particularities.
| Hence the "solidgoldmagikarp" phenomenon and many others.
| QuiDortDine wrote:
| The fact that making this distinction is impossible is
| reason enough to stop.
| whimsicalism wrote:
| there is almost certainly organic growth as well as more
| people in Nigeria and other SSA countries are getting very
| good internet penetration in recent years
| Jcampuzano2 wrote:
| It'd be in fact quite the opposite. There comes a turning point
| where the majority of language usage would actually be written
| by AI, at which point we'd no longer be analysing the word
| frequency/usage by actual humans and so it wouldn't be
| representative of how humans actually communicate.
|
| Or potentially even more dystopian would be that AI slop would
| be dictating/driving human communication going forward.
| joshdavham wrote:
| Think of an LLM as a person on the internet. Just like everyone
| else, they have their own vocabulary and preferred way of
| talking which means they'll use some words more than others.
| Now imagine we duplicate this hypothetical person an incredible
| amount of times and have their clones chatter on the internet
| frequently. 'Certainly' this would have an effect.
| joshdavham wrote:
| If the language you're processing was generated by AI, it's no
| longer NLP, it's ALP.
| ilaksh wrote:
| Reading through this entire thread, I suspect that somehow
| generative AI actually became a political issue. Polarized
| politics is like a vortex sucking all kinds of unrelated things
| in.
|
| In case that doesn't get my comment completely buried, I will go
| ahead and say honestly that even though "AI slop" and paywalled
| content is a problem, I don't think that generative AI in itself
| is a negative at all. And I also think that part of this person's
| reaction is that LLMs have made previous NLP techniques, such a
| those based on simple usage counts etc., largely irrelevant.
|
| What was/is wordfreq used for, and can those tasks not actually
| be done more effectively with a cutting edge language model of
| some sort these days? Maybe even a really small one for some
| things.
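|
| For reference, the core of a wordfreq-style table is token
| counting over a corpus. A toy sketch of the idea (not the
| library's actual pipeline, which normalizes text and merges
| many weighted sources):

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Relative frequency of each word across a list of documents."""
    counts = Counter()
    for text in texts:
        # Crude tokenizer: lowercase runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
freqs = word_frequencies(corpus)
# "the" appears 4 times out of 11 tokens, so freqs["the"] == 4/11.
```

| A count like that is cheap and auditable in a way an LLM's
| answer is not, which is part of why such tables stay useful.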
| ecshafer wrote:
| Generative AI is inherently a political issue, its not
| surprising at all.
|
| There is the case of what is "truth". As soon as you start to
| ensure some quality of truth to what is generated, that is
| political.
|
| As soon as generative AI has the capability to take someone's
| job, that is political.
|
| The instant AI can make someone money, it is political.
|
| When AI is trained on something that someone has created, and
| now they can generate something similar, it is political.
| ilaksh wrote:
| Then .. everything is political?
| phito wrote:
| It is. Unfortunately.
| commodoreboxer wrote:
| Everything involving any kind of coordination, cooperation,
| competition, and/ot communication between two or more
| people involves politics by its very nature. LLMs are
| communication tools. You can't divorce politics from their
| use when one person is generating text for another person
| to read.
| JohnFen wrote:
| "Just because you do not take an interest in politics
| doesn't mean politics won't take an interest in you." --
| Pericles
| whimsicalism wrote:
| > As soon as generative AI has the capability to take
| someone's job, that is political.
|
| What is political is people enshrining themselves in
| chokepoints and demanding a toll for passing through or
| getting anything done. That is what you do when you make a
| certain job politically 'untakable'.
|
| People who espouse that the 'personal is political' risk
| making the definition of politics so broad that it is
| useless.
| rincebrain wrote:
| The simplest example that comes to mind of something frequency
| analysis might be useful for would be if you had simple
| ciphertext where you knew that the characters probably 1:1
| mapped, but you didn't know anything about how.
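|
| A toy sketch of that idea in Python: rank ciphertext letters by
| frequency and align them with a typical English frequency
| ordering. (The ordering below is a common reference one, a
| sample this small only pins down the most frequent letters, and
| a real solver would refine the guess with bigram statistics.)

```python
from collections import Counter

# English letters in a typical frequency order (varies by corpus).
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def guess_substitution(ciphertext: str) -> dict:
    """Guess a 1:1 letter mapping by aligning ciphertext letter
    frequencies with typical English letter frequencies."""
    letters = [c for c in ciphertext.lower() if c.isalpha()]
    ranked = [c for c, _ in Counter(letters).most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))

# "the theory of frequency analysis" Caesar-shifted by one.
cipher = "uif uifpsz pg gsfrvfodz bobmztjt"
mapping = guess_substitution(cipher)
# 'f', the most common cipher letter, is correctly mapped to 'e'.
```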
|
| It could also be useful for guessing whether someone might have
| been trying to do some kind of steganographic or additional
| encoding in their work, by telling you how abnormal compared to
| how many people write it is that someone happened to choose a
| very unusual construction in their work, or whether it's
| unlikely that two people chose the same unusual construction by
| coincidence or plagiarism.
|
| You might also find statistical models interesting for things
| like noticing patterns in people for whom English or others are
| not their first language, and when they choose different
| constructions more often than speakers for whom it was their
| first language.
|
| I'm not saying you can't use an LLM to do some or all of these,
| but they also have something of a scalar attached to them of
| how unusual the conclusion is - e.g. "I have never seen this
| construction of words in 50 million lines of text" versus "Yes,
| that's natural.", which can be useful for trying to inform how
| close to the noise floor the answer is, even ignoring the
| prospect of hallucinations.
| whimsicalism wrote:
| Yes, it's become extremely politicized and it's very tiresome.
| Tech in general, to be frank. Pray that your field of interest
| never gets covered in the NYT.
| eadmund wrote:
| > the Web at large is full of slop generated by large language
| models, written by no one to communicate nothing
|
| That's neither fair nor accurate. That slop is ultimately
| generated by the humans who run those models; they are attempting
| (perhaps poorly) to communicate _something_.
|
| > two companies that I already despise
|
| Life's too short to go through it hating others.
|
| > it's very likely because they are creating a plagiarism machine
| that will claim your words as its own
|
| That begs the question. Plagiarism has a particular definition.
| It is not at all clear that a machine learning from text should
| be treated any differently from a human being learning from text:
| i.e., duplicating exact phrases or failing to credit ideas may in
| some circumstances be plagiarism, but no-one is required to
| append a statement crediting every text he has ever read to every
| document he ever writes.
|
| Credits: every document I have ever read _grin_
| weevil wrote:
| I feel like you're giving certain entities too much credit
| there. Yes text is generated to do _something_, but it may not
| be to communicate in good-faith; it could be keyword-dense
| gibberish designed to attract unsuspecting search engine users
| for click revenue, or generate political misinformation
| disseminated to a network of independent-looking "news"
| websites, or pump certain areas with so much noise and nonsense
| information that those spaces cannot sustain any kind of
| meaningful human conversation.
|
| The issue with generative 'AI' isn't that they generate text,
| it's that they can (and are) used to generate high-volume low-
| cost nonsense at a scale no human could ever achieve without
| them.
|
| > Life's too short to go through it hating others
|
| Only when they don't deserve it. I have my doubts about Google,
| but I've no love for OpenAI.
|
| > Plagiarism has a particular definition ... no-one is required
| to append a statement crediting every text he has ever read
|
| Of course they aren't, because we rightly treat humans learning
| to communicate differently from training computer code to
| predict words in a sentence and pass it off as natural language
| with intent behind it. Musicians usually pay royalties to those
| whose songs they sample, but authors don't pay royalties to
| other authors whose work inspired them to construct their own
| stories maybe using similar concepts. There's a line there
| somewhere; falsely equating plagiarism and inspiration (or
| natural language learning in humans) misses the point.
| miningape wrote:
| This is just the "guns don't shoot people, people do." argument
| except in this case we quite literally have a massive upside
| incentive to remove people from the process entirely (i.e.
| websites that automatically generate new content every day) -
| so I don't buy it.
|
| This kind of AI slop is quite literally written by no one (an
| algorithm pushed it out), and it doesn't communicate anything
| since communication first requires some level of understanding
| of the source material - and LLM's are just predicting the
| likely next token without understanding. I would also extend
| this to AI slop written by someone with a limited domain
| understanding, they themselves have nothing new to offer, nor
| the expertise or experience to ensure the AI is producing
| valuable content.
|
| I would go even further and say it's "read by no one" - people
| are sick and tired of reading the next AI slop article on
| google and add stuff like "reddit" to the end of their queries
| to limit the amount of garbage they get.
|
| Sure there are people using LLMs to enhance their research, but
| a vast, vast majority are using it to create slop that hits a
| word limit.
| slashdave wrote:
| > It is not at all clear that a machine learning from text
| should be treated any differently from a human being learning
| from text
|
| Given that LLMs and human creativity work on fundamentally
| different principles, there is every reason to believe there is
| a difference.
| dweinus wrote:
| > Now the Web at large is full of slop generated by large
| language models, written by no one to communicate nothing.
|
| Fair and accurate. In the best cases the person running the model
| didn't write this stuff and word salad doesn't communicate
| whatever they meant to say. In many cases though, content is
| simply pumped out for SEO with no intention of being valuable to
| anyone.
| andrethegiant wrote:
| That sentence stood out to me too, very powerful. Felt it right
| in the feels.
| karaterobot wrote:
| I guess a manageable, still-useful alternative would be to curate
| a whitelist of sources that don't use AI, and without making that
| list public, derive the word frequencies from only those sources.
| How to compile that list is left as an exercise for the reader.
| The result would not be as accurate as a broad sample of the web,
| but in a world where it's impossible to trust a broad sample of
| the web, it's the option you are left with. And I have no reason
| to
| doubt that it could be done at a useful scale.
|
| I'm sure this has occurred to them already. Apart from the near-
| impossibility of continuing the task in the same way they've
| always done it, it seems like the other reason they're not
| updating wordfreq is to stick a thumb in the eye of OpenAI and
| Google. While I appreciate the sentiment, I recognize that those
| corporations' eyes will never be sufficiently thumbed to satisfy
| anybody, so I would not let that anger change the course of my
| life's work, personally.
| WaitWaitWha wrote:
| > curate a whitelist of sources that don't use AI,
|
| I like this.
|
| Maybe even take it a step further - have a badge on the source
| that is both human and machine visible to indicate that the
| content is not AI generated.
| antirez wrote:
| Ok, so the post author is an AI skeptic and this is his
| retaliation, likely because his work is affected. I believe
| governments should address the problem with welfare, but being
| against technical advances is always being on the wrong side of
| history.
| exo-pla-net wrote:
| This is a tech site, where >50% of us are programmers who have
| achieved greater productivity thanks to LLM advances.
|
| And yet we're filled to the gills with Luddite sentiments and
| AI content fearmongering.
|
| Imagine the hysteria and the skull-vibrating noise of the non-
| HN rabble when they come to understand where all of this is
| going. They're going to do their darndest to stop us from
| achieving post-economy.
| antirez wrote:
| I fail to see the difference. Actually, programming was one
| of the first fields where LLMs showed proficiency. The helper
| nature of LLMs holds in all fields so far; in the future this
| may change. I believe that, for instance, in the case of
| journalism the issue was already there: three euros per post,
| written cluelessly by humans.
|
| Anyway in the long run AI will kill tons of jobs. Regardless
| of blog posts like that. The true key is governments
| assistance.
| exo-pla-net wrote:
| I don't know what difference you are referring to. I was
| agreeing with you.
|
| And also agreed: many trumpet the merits of "unassisted"
| human output. However, they're suffering from ancestor
| veneration: human writing has always been a vast mine of
| worthless rock (slop) with a few gems of high-IQ analysis
| hidden here and there.
|
| For instance, upon the invention of the printing press, it
| was immediately and predominantly used for promulgating
| religious tracts.
|
| And even when you got to Newton, who created for us some
| valuable gems, much of his output was nevertheless deranged
| and worthless. [1]
|
| It follows that, whether we're a human or an LLM, if we
| achieve factual grounding and the capacity to reason, we
| achieve it _despite_ the bulk of the information we ingest.
| Filtering out sludge is part of the required skillset for
| intellectual growth, and LLM slop qualitatively changes
| nothing.
|
| [1] https://www.newtonproject.ox.ac.uk/view/texts/diplomati
| c/THE...
| antirez wrote:
| Sorry, I didn't mean to imply we disagreed, but that
| programmers were and are going to be impacted as much as
| writers, for instance; yet I see an environment where AI is
| generally more accepted as a tool.
|
| About your last point sometimes I think that in the
| future there will be models specifically distilling the
| climax of selected thinkers, so that not only their
| production will be preserved but maybe something more
| that is only implicitly contained in their output.
| exo-pla-net wrote:
| That's a good point: the greatest value that we can glean
| from one another is likely not epistemological "facts
| about the world", nor is it even the _predictive models_
| seen in science and higher brow social commentary, but in
| _patterns of thinking_. That alone is the infinite
| wellspring for achieving greater understanding, whether
| formalized with the scientific method or whether more
| loosely leveraged to succeed with a business endeavor.
|
| Anecdotally, I met success in prompting GPT-3 to "mimic
| Stephen Pinker" when solving logical puzzles. Puzzles
| that it would initially fail, it would succeed attempting
| to mimic his language. GPT-3 seemed to have grokked the
| pattern of how Stephen Pinker thinks through problems,
| and it could leverage those patterns to improve its own
| reasoning. OpenAI _o1_ needs no such assistance, and I
| expect that _o2_ will fully supplant humans with its
| ability to reason.
|
| It follows that all that we have to offer with our
| brightest minds will be exhausted, and we will be
| eclipsed in every conceivable way by our creation. It
| will mark the end of the Anthropocene; something that
| likely exceeds the headiest of Nick Bostrom's speculations
| will take its place.
|
| It seems that this is coming in 2026 if not sooner, and
| Alignment is the only thing that ought to occupy our minds:
| the question of whether we're creating something that
| will save us from ourselves, or whether all that we've
| built will culminate in something gross and final.
|
| Looking around myself, however, I see impassioned
| "discourse" about immigration. The merits of DEI.
| Patriotism. Transgenderism. Religion. Copyright. Vast
| herds of dinosaurs preying upon one another, giving only
| idle attention to the glowing object in the sky. Is it an
| asteroid? Is it a UFO that is coming down to provide
| dinosaur healthcare? Nope, not even that level of thought
| is mustered. With 8 billion people on the planet,
| _Utopia_ by Nick Bostrom hasn't even mustered 100
| reviews on Amazon. On the advent of the defining moment
| of the universe itself, when virtually all that is
| imaginable is unlocked for us, our species' heads remains
| buried in the mud, gnawing at one another's filthy toes,
| and I'm alienated and disgusted.
|
| The only glints of beauty I see in my fellow man are in
| those with minds which exceed a certain IQ threshold and
| cognitive flexibility, as well as in lesser minds which
| exhibit gentleness and humility. There is beauty there,
| and there is beauty in the staggering possibility of the
| universe itself. The rest is at best entomology, and I
| won't mourn its passing.
| greentxt wrote:
| I think this person has too high a view of pre-2021, probably for
| ego reasons. In fact, their attitude seems very ego driven. AI
| didn't just occur in 2021. Nobody knows how much text was machine
| generated prior to 2021, it was much harder if not impossible to
| detect. If anything, it's probably easier now since people are
| all using the same ai that use words like delve so much much it
| becomes obvious.
| croes wrote:
| >AI didn't just occur in 2021. Nobody knows how much text was
| machine generated prior to 2021
|
| But we do know that now it's a lot more, with a big LOT.
| greentxt wrote:
| I assume you are correct but how can we know rather than
| assume? I am not sure we can, so why get worked up about
| "internet died in 2021" when many would claim with similar
| conviction that it's been dead since 2012, or 2007, or ...
| ClassyJacket wrote:
| You are making a claim that somehow someone was sitting on
| something as powerful as ChatGPT, long before ChatGPT,
| _and_ that it was in widespread use, secretly, without even
| a single leak by anyone at any point. That's not
| plausible.
| grogenaut wrote:
| Is 2023 going to be for data what the Trinity test was for
| steel? E.g. post-2023, all data now contains trace amounts of
| AI?
| swyx wrote:
| yes, unfortunately https://www.latent.space/p/nov-2023
| aftbit wrote:
| Wow there is so much vitriol both in this post and in the
| comments here. I understand that there are many ethical and
| practical problems with generative AI, but when did we stop being
| hopeful and start seeing the darkest side of everything? Is it
| just that the average HN reader is now past the age where a new
| technological development is an exciting opportunity and on to
| the age where it is a threat? Remember, the Luddites were not
| opposed to looms, they just wanted to own them.
| JohnFen wrote:
| > when did we stop being hopeful and start seeing the darkest
| side of everything?
|
| I think a decade or two ago, when most of the new tech being
| introduced (at least by our industry) started being
| unmistakably abusive and dehumanizing. When the recent past
| shows a strong trend, it's not unreasonable to expect that the
| near future will continue that trend. Particularly when it
| makes companies money.
| slashdave wrote:
| Give us examples of generative AI in challenging applications
| (biology, medicine, physical sciences), and you'll get a lot of
| optimism. The text LLM stuff is the brute force application of
| the same class of statistical modeling. It's commercial, and
| boring.
| aryonoco wrote:
| When?
|
| For some of us, it was 1994, the eternal September.
|
| For some of us, it was when Aaron Swartz left us.
|
| For some of us, it was when Google killed Google Reader (in
| hindsight, the turning point of Google becoming evil).
|
| For some others, like the author of this post, it's when
| twitter and reddit closed their previously open APIs.
| jll29 wrote:
| I regret that the situation has left the OP feeling discouraged
| about the NLP community, to which I belong, and I just want to
| say "we're not all like that", even though it is a trend and
| we're close to peak hype (slightly past it, even?).
|
| The complaint about pollution of the Web with artificial content
| is timely, and it's not even the first time due to spam farms
| intended to game PageRank, among other nonsense. This may just
| mean there is new value in hand-curated lists of high-quality Web
| sites (some people use the term "small Web").
|
| Each generation of the Web needs techniques to overcome its
| particular generation of adversarial mechanisms, and the current
| Web stage is no exception.
|
| When Eric Arthur Blair wrote 1984 (under his pen name "George
| Orwell"), he anticipated people consuming auto-generated content
| to keep the masses away from critical thinking. This is now
| happening (he even anticipated auto-generated porn in the novel),
| but the technologies criticized can also be used for good, and
| that is what I try to do in my NLP research team. Good _will_
| prevail in the end.
| solardev wrote:
| Have "good" small webs EVER prevailed?
|
| Every content system seems to get polluted by noise once it
| hits mainstream usage: IRC, Usenet, reddit, Facebook,
| geocities, Yahoo, webrings, etc. Once-small curated selections
| eventually grow big enough to become victims of their own
| successes and get taken over by spam.
|
| It's always an arms race of quality vs quantity, and eventually
| the curators can't keep up with the sheer volume anymore.
| squigz wrote:
| > Have "good" small webs EVER prevailed?
|
| You ask on HN, one of the highest quality sites I've ever
| visited in any age of the Internet.
|
| IRC is still alive and well among pretty much the same
| audience as always. I'm not sure it's fair to compare that
| with the others.
| solardev wrote:
| Well, niche forums are kinda different when they manage to
| stay small and niche. Not just HN but car forums, LED
| forums, etc.
|
| But if they ever include other topics, they risk becoming
| more mainstream and noisy. Even within adjacent fields
| (like the various Stacks) it gets pretty bad.
|
| Maybe the trick is to stay within a single small sphere
| then and not become a general purpose discussion site? And
| to have a low enough volume of submissions where good
| moderation is still possible? (Thank you dang and HN staff)
| rovr138 wrote:
| Yes. That's the small web.
|
| A good example of the generalization problem you discuss
| is reddit.
|
| You have to unsubscribe from all the defaults and find
| the small, niche, communities about specific topics. If
| not, it's the same stuff, reposted, over and over, across
| different subs and/or social sites.
| squigz wrote:
| I'm not entirely sure it's about content (while HN is
| certainly tech-focused, politics, health, philosophy all
| come up with regularity) or even content moderation,
| although they both certainly play a part (particularly
| the moderation around here. Thanks, staff!)
|
| I wonder if it is more to do with the community itself.
| HN users tend to have very intelligent discussions on
| pretty much anything, and discourages shitty, unnuanced,
| one-line takes. This, coupled with a healthy moderation
| system, makes it hard for the lower quality discussion to
| break in and override the good stuff.
| nick3443 wrote:
| The car headlight forums seem to expose the weakness of
| small web though, in that a lot of the forums that show
| up in search are "sponsored" by one or two major brands
| and any open discussion or validation of off-brand
| solutions, AliExpress parts, etc are quickly shunned or
| banned.
| bongodongobob wrote:
| It's high quality when the content is within HN's bubble.
| Anything related to health, politics, or Microsoft is full
| of misinformation, ignorance, and garbage like any other
| site. The Microsoft discussions in particular are extremely
| low quality.
| squigz wrote:
| I disagree. Even politics spurs intelligent, nuanced
| discussion here on HN.
|
| And to hold up discussions about MS as an example of
| 'extremely' low quality discussion is, ah, interesting.
| Do you have any recent examples of such discussions?
| bongodongobob wrote:
| I hide every single article about MS because it's filled
| with all the neckbeardy tropes about their products being
| garbage spyware, switch to Linux, they're stealing your
| data, the OS is trash etc. It's comments from people who
| have never managed large scale MS based environments
| comparing their Windows Home to the other 90% of the
| business ecosystem that has nothing to do with home users
| or MS's main cash cow, businesses, Azure/Entra and M365.
| I'm done wasting my breath on MS here.
| squigz wrote:
| This is a funny comment in a thread about low quality
| discussion.
| bongodongobob wrote:
| I'm describing why I no longer engage with MS related
| posts.
| skissane wrote:
| I've posted four comments here on Microsoft in the last
| 30 days:
|
| https://news.ycombinator.com/item?id=41499957
|
| https://news.ycombinator.com/item?id=41408124
|
| https://news.ycombinator.com/item?id=41335757
|
| https://news.ycombinator.com/item?id=41327379
|
| None of which fit your description of "neckbeardy tropes
| about their products being garbage spyware, switch to
| Linux, they're stealing your data, the OS is trash".
|
| And it isn't just me, because if you look at those
| comments, I was talking to other people who weren't
| invoking those "neckbeardy tropes" either.
| vundercind wrote:
| Politics and philosophy discussions here are intelligent
| in that most of the commenters aren't dumb. They tend to
| be entirely uneducated _and resistant to the educated_.
| Retric wrote:
| IMO HN actually scores quite highly in terms of
| health/politics and so forth content, because both
| mainstream and fringe ideas get shown and get pushback.
|
| A vaping discussion brought up that the glycerin used was safe
| and is the same thing used in smoke machines, and someone else
| brought up a study showing that smoke machines are an
| occasional safety issue. Nowhere near every discussion
| goes that well but stick around and you'll see in-depth
| discussion.
|
| Go to a public health website by comparison and you'll
| see warnings without context and a possibly positive
| spin compared to smoking.
| https://www.cdc.gov/tobacco/e-cigarettes/index.html I
| suspect most people get basically nothing from looking at
| it.
| chimeracoder wrote:
| > IMO HN actually scores quite highly in terms of
| health/politics and so forth content, because both
| mainstream and fringe ideas get shown and get pushback.
|
| As someone with domain expertise here, I wholeheartedly
| disagree. HN is very bad at percolating accurate
| information about topics outside its wheelhouse, like
| clinical medicine, public health, or the natural
| sciences. It is also, simultaneously, extremely prone to
| overestimating its own collective competency at
| understanding technical knowledge outside its domain. In
| tandem, those two make for a rather dangerous
| combination.
|
| Anytime I see a post about a topic within my area of
| specialty, I know to expect articulate, lengthy, and
| _completely misguided or inaccurate_ comments dominating
| the discussion. It's enough of a problem that trying to
| wade in and correct them is a losing battle; I rarely
| even bother these days.
|
| It's kind of funny that XKCD #793[0] is written about
| physicists, because the effect is way worse with software
| engineers.
|
| [0] https://xkcd.com/793/
| Retric wrote:
| Obviously on an objective scale HN isn't good, but nobody
| is doing a good job here.
|
| I've worked on the government side of this stuff and find
| it disheartening.
| mandevil wrote:
| As a software engineer married to a healthcare
| professional, I disagree strongly about the quality of
| the healthcare discussions here. A whole lot of the
| conversation is software engineers who think that they
| can reason from first principles in two minutes about
| this thing that professionals dedicate their whole lives
| to mastering, and who therefore don't understand the most
| basic concepts of the field.
|
| Sometimes I try and engage, but honestly, mostly I think
| it's not worth it. Otherwise you end up doing this with
| your life: https://xkcd.com/386/
| Retric wrote:
| Spend time with medical researchers and they start
| disparaging Doctors. Everyone wants that one
| authoritative source free from bias, but IMO even having
| a few voices in the crowd worth listening to beats most
| other options.
| vladms wrote:
| > about this thing that professionals dedicate their
| whole lives to mastering
|
| After doing some healthcare work I ended up understanding
| that some topics are not well known even by the
| professionals dedicating their whole lives to that
| because there are big gaps in the human knowledge on the
| topics.
|
| I agree that people that think they can reason in two
| minutes about anything are a problem, but it's not a
| healthcare only issue (same happens for politics,
| economics, environment, etc.)
|
| Engineers are lucky to work in a field where many
| things have a clear, known explanation (although, try to
| estimate how long a team will take to implement a
| feature, and everybody will come up with something else).
| mandevil wrote:
| As to the uncertainty and mysteries, you are 100%
| correct. One of the big failure modes for engineers in
| dealing with human health is the assumption that things
| are as simple and logical as the stuff we build, when
| it's simply not at all like that. There are (1) big
| arguments over basic things like "why do SSRIs work?"
| Outside of LLM's I can't think of a thing in software
| where we are still arguing about why things work in
| production. We never say "Why does Postgres work?" in the
| same way. (2)
|
| And yes, this is true for many other areas of discussion
| at HN. It's just that it is most obvious to me in the
| area that my wife specializes in, because I pick up
| enough via osmosis from her to know when other people
| don't even have my limited level of understanding.
|
| 1: Or at least were 15 years ago when my wife told me
| about it- the argument might have been largely concluded
| and she just never updated me since I don't keep up with
| the medical literature the way she does.
|
| 2: Two decades ago there was a huge push for the "human
| genome project" under the basis that this would be
| "reading the blueprints for human life" and that would
| give us all of these medical breakthroughs. Basically
| none of those breakthroughs happened because we've spent
| the past 20 years learning all of the different ways that
| it is NOT a blueprint and that cells do things very
| differently from human engineers.
| 38 wrote:
| it's so easy to solve this problem, not sure why no one has
| done it yet.
|
| 1. build a userbase, free product
|
| 2. once userbase get big enough, any new account requires a
| monthly fee, maybe $1
|
| 3. keep raising the fee higher and higher, until you get to
| the point that the userbase is manageable.
|
| no ads, simple.
| jachee wrote:
| Until N ad views are worth more than $X account creation
| fee. Then the spammers will just sell ad posts for $X*1.5.
|
| I can't find it, but there's someone selling sock puppet
| posts on HN even.
| abridges6523 wrote:
| This sounds like a good idea. I do wonder if enough people
| would sign up for it to be a worthy venture, though. I think
| the main issue is that adding any price at all dramatically
| reduces participation; even if it's not about cost, some
| people just see the payment and immediately disengage.
| htrp wrote:
| Any curation mechanism that depends on passion and/or the
| goodwill of volunteers is unsustainable.
| squigz wrote:
| > people consuming auto-generated content to keep the masses
| away from critical thinking. This is now happening
|
| The people who stay away from critical thinking were doing that
| already and will continue to do so, 'AI' content or not.
| trehalose wrote:
| How did they get started?
| squigz wrote:
| They likely never started critically thinking, so they
| never had to get started on not doing so.
|
| (If children are never taught to think critically, then...)
| sweeter wrote:
| It's almost like it's a systemic failure that is
| artificially created so that people won't think
| critically... hmmm
| squigz wrote:
| Yeah, it's almost like it has nothing to do with AI
| vladms wrote:
| > is artificially created
|
| You imply that thousands of year ago everybody was
| thinking critically?
|
| Thinking critically is hard, stressful and might take
| some joy from your life.
| sweeter wrote:
| I'm not sure how that would imply anything about the
| past. We as a society have spent decades defanging the
| public school system through changing school to be test
| score driven, tying a schools funding to the local
| property value, making them less effective and less safe,
| choking them out financially etc... it should be no
| surprise that children are not equipped to navigate
| modern life. I've been through these systems; they are
| deeply flawed.
| psychoslave wrote:
| I don't know, individually fine-tuned addictive content
| served as a real-time interactive feedback loop is a
| propaganda and attention-capture tool on another level than
| lowest-common-denominator static, passive content served to
| the general crowd.
| squigz wrote:
| Perhaps, but the solution is the same either way, and it
| isn't trying to ban technology or halt progress or just sit
| and cry about how society is broken. It's educating each
| other and our children on the way these things work, how to
| break out of them, and how we might more responsibly use
| the technology.
| sweeter wrote:
| tangentially related, but in 1894 Marx also essentially
| predicted that crypto and NFTs would exist [1], and I only
| bring it up because it's kind of wild how we keep crossing
| these "red lines" without
| even blinking. It's like that meme:
|
| Sci-fi author:
|
| I created the Torment Nexus to serve as a cautionary tale...
|
| Tech Company:
|
| Alas, we have created the Torment Nexus from the classic Sci-fi
| novel "Don't Create the Torment Nexus"
|
| 1. https://www.marxists.org/archive/marx/works/1894-c3/ch25.htm
| Llamamoe wrote:
| > Good will prevail in the end.
|
| Even if, this is a dangerous thought that discourages decisive
| action that is likely to be necessary for this to happen.
| Intralexical wrote:
| What if the way for good to prevail is to reject technologies
| and beliefs that have become destructive?
| ok123456 wrote:
| Most of the "random" bot content pre-2021 was low-quality Markov-
| generated text. If anything, these generative AI tools would
| improve the accuracy of scraping large corpora of text from the
| web.
| diggan wrote:
| One of the examples is the increased usage of "delve" which
| Google Trends confirms increased in usage since 2022 (initial
| ChatGPT release):
| https://trends.google.com/trends/explore?date=all&q=delve&hl...
|
| It seems, however, that its usage started increasing most in
| just these last few months; maybe people are talking more
| about "delve" specifically because of the increase in usage?
| A usage recursion of sorts.
| bongodongobob wrote:
| Delves are a new thing in World of Warcraft released 9/10 this
| year. Delve is also an M365 product that has been around for
| some time and is being discontinued in December. So no, that
| has nothing to do with LLMs.
| _proofs wrote:
| Delve was also an addition to PoE, which I imagine had its
| own spike in google searches relative to that word.
| bee_rider wrote:
| We've seen this with a couple of words and expressions, and I
| don't doubt that AI is somewhat likely to "like" some phrases
| for whatever reason. Big eigenvalues of the latent space or
| whatever, hahaha (I don't know AI).
|
| But also, words and phrases _do_ become popular among humans,
| right? It would be a shame if AI caused the language to get
| more stagnant, as keeping up with which phrases are popular gets
| you labeled as an AI.
| thesnide wrote:
| I think that text on the internet will be tainted by AI the same
| way that steel has been tainted by nuclear devices.
| zaik wrote:
| If generative AI has a significantly different word frequency
| from humans, then it also shouldn't be hard to detect text
| written by generative AI. However, my last information is that
| tools to
| detect text written by generative AI are not that great.
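As a toy sketch of this corpus-level idea, a naive detector can score text by the log-likelihood ratio of a few "AI marker" words such as the "delve" example discussed elsewhere in the thread. The word list and per-million rates below are invented for illustration, not real measurements:

```python
import math
from collections import Counter

# Invented per-million-word rates for a few marker words; real
# numbers would have to come from corpora like those wordfreq used.
HUMAN_RATE = {"delve": 2.0, "tapestry": 1.0, "the": 50000.0}
AI_RATE = {"delve": 40.0, "tapestry": 15.0, "the": 50000.0}

def log_likelihood_ratio(text: str) -> float:
    """Positive scores lean 'AI-like', negative lean 'human-like'."""
    counts = Counter(text.lower().split())
    score = 0.0
    for word, count in counts.items():
        if word in AI_RATE:
            # Each marker word contributes log(P_ai / P_human) per use.
            score += count * math.log(AI_RATE[word] / HUMAN_RATE[word])
    return score

print(log_likelihood_ratio("let us delve into the rich tapestry"))
print(log_likelihood_ratio("the cat sat on the mat"))
```

On a short snippet the signal from a handful of words is weak and easily gamed, which is consistent with per-document detectors being unreliable even while aggregate frequency shifts are clearly measurable.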
| andai wrote:
| Has anyone taken a look at a random sample of web data? It's
| mostly crap. I was thinking of making my own search engine,
| knowledge database etc based on a random sample of web pages, but
| I found that almost all of them were drivel. Flame wars, asinine
| blog posts, and most of all, advertising. Forget spam, most of
| the legit pages are trying to sell something too!
|
| The conclusion I arrived at was that making my own crawler
| actually is feasible (and given my goals, necessary!) because I'm
| only interested in a very, very small fraction of what's out
| there.
| aryonoco wrote:
| I feel so conflicted about this.
|
| On the one hand, I completely agree with Robyn Speer. The open
| web is dead, and the web is in a really sad state. The other day
| I decided to publish my personal blog on gopher. Just cause,
| there's a lot less crap on gopher (and no, gopher is not the
| answer).
|
| But...
|
| A couple of weeks ago, I had to send a video file to my wife's
| grandfather, who is 97, lives in another country, and doesn't use
| computers or mobile phones. Eventually we determined that he has
| a DVD player, so I turned to x264 to convert this modern 4K HDR
| video into a form that can be played by any ancient DVD player,
| while preserving as much visual fidelity as possible.
|
| The thing about x264 is, it doesn't have any docs. Unlike x265
| which had a corporate sponsor who could spend money on writing
| proper docs, x264 was basically developed through trial and error
| by members of the doom9 forum. There are hundreds of obscure
| flags, some of which now operate differently to what they did 20
| years ago. I could spend hours going through dozens of 20 year
| old threads on doom9 to figure out what each flag did, or I could
| do what I did and ask a LLM (in this case Claude).
|
| Claude wasn't perfect. It mixed up a few ffmpeg flags with x264
| ones (easy mistake), but combined with some old fashioned
| searching and some trial and error, I could get the job done in
| about half an hour. I was quite happy with the quality of the end
| product, and the video did play on that very old DVD player.
|
| Back in pre-LLM days, it's not like I would have hired a x264
| expert to do this job for me. I would have either had to spend
| hours more on this task, or more likely, this 97 year old man
| would never have seen his great granddaughter's dance, which
| apparently brought a massive smile to his face.
|
| Like everything before them, LLMs are just tools. Neither
| inherently good nor bad. It's what we do with them and how we use
| them that matters.
| sangnoir wrote:
| > Back in pre-LLM days, it's not like I would have hired a x264
| expert to do this job for me. I would have either had to spend
| hours more on this task, or more likely, this 97 year old man
| would never have seen his great granddaughter's dance
|
| Didn't most _DVD_ burning software include video transcoding as
| a standard feature? Back in the day, you'd have used Nero
| Burning ROM, or Handbrake - granted, the quality may not have
| been optimized to your standards, but the result would have
| been a watchable video (especially to 97 year-old eyes)
| aryonoco wrote:
| Back in the day they did. I checked handbrake but now there's
| nothing specific about DVD compatibility there. I could have
| picked something like Super HQ 576p, and there's a good
| chance that would have sufficed, but old DVD players were
| extremely finicky about filenames, extensions, interlacing,
| etc. I didn't want to risk the DVD traveling halfway across
| the world only to find that it's not playable.
| sangnoir wrote:
| I mentioned Handbrake without checking its DVD authoring
| capability - probably used it to _rip_ DVDs many years ago
| and got it mixed up with burning them; a better FLOSS
| alternative for authoring would have been DeVeDe or
| bombono.
| miguno wrote:
| I have been noticing this trend increasingly myself. It's getting
| more and more difficult to use tools like Google search to find
| relevant content.
|
| Many of my searches nowadays include suffixes like
| "site:reddit.com" (or similar havens of, hopefully, still mostly
| human-generated content) to produce reasonably useful results.
| There's so much spam pollution by sites like Medium.com that it's
| disheartening. It feels as if humanity on the Internet is
| already in retreat into their last comely homes, which are
| more closed than open to the outside.
|
| On the positive side:
|
| 1. Self-managed blogs (like: not on Substack or Medium) by
| individuals have become a strong indicator for interesting
| content. If the blog runs on Hugo, Zola, Astro, you-name-it,
| there's hope.
|
| 2. As a result of (1), I have started to use an RSS reader again.
| Who would have thought!
|
| I am still torn about what to make of Discord. On the one hand,
| the closed-by-design nature of the thousands of Discord servers,
| where content is locked in forever without a chance of being
| indexed by a search engine, has many downsides in my opinion. On
| the other hand, the servers I do frequent are populated by
| humans, not content-generating bots camouflaged as users.
| 0xbadcafebee wrote:
| I'm going to call it: The Web is dead. Thanks to "AI" I spend
| more time now digging through searches trying to find something
| useful than I did back in 2005. And the sites you do find are
| largely garbage.
|
| As a random example: just trying to find a particular popular set
| of wireless earbuds takes me at least 10 minutes, when I already
| know the company, the company's website, other vendors that sell
| the company's goods, etc. It's just buried under tons of dreck.
| And my laptop is "old" (an 8-core i7 processor with 16GB of RAM)
| so it struggles to push through graphics-intense "modern"
| websites like the vendor's. Their old website was plain and
| worked great, letting me quickly search through their products
| and quickly purchase them. Last night I literally struggled to
| add things to cart and check out; it was actually harrowing.
|
| Fuck the web, fuck web browsers, web design, SEO, searching,
| advertising, and all the schlock that comes with it. I'm done. If
| I can in any way purchase something without the web, I'mma do
| that. I don't hate technology (entirely...) but the web is just a
| rotten egg now.
| w10-1 wrote:
| > If I can in any way purchase something without the web, I'mma
| do that
|
| To get to the milk you'll have to walk by 3 rows of chips and
| soda.
| odo1242 wrote:
| Yeah, this is why I still use the web to order things in a
| nutshell lol
| 0xbadcafebee wrote:
| Where do you order things online that you aren't inundated
| by ads?
| bbarn wrote:
| No disagreement for the most part.
|
| I used to be able to, say, search for a Trek bike derailleur
| hanger and the first result would be what I wanted. Now I have to
| scroll past 5 ads to buy a new bike, one that's a broken link
| to a third party, and if I'm really lucky, at the bottom of
| page 1 will be the link to that part's page.
|
| The shitification of the web is real.
| klyrs wrote:
| R.I.P. Sheldon Brown T_T
|
| (The Agner Fog of cycling?)
| gazook89 wrote:
| The web is much more than a shopping site.
| yifanl wrote:
| It is, but the SEO spammers who ruined the web want it to be
| shopping mall, and they can't even do a particularly good job
| at being one.
| Gethsemane wrote:
| Sounds like your laptop is wholly out of date, you need to buy
| the next generation of laptops on Amazon that can handle the
| modern SEO load. I recommend the:
|
| LEEZWOO 15.6" Laptop - 16GB RAM 512GB SSD PC Laptop, Quad-Core
| N95 Processor Up to 3.1GHz, Laptop Computers with Touch ID,
| WiFi, BT4.2, for Students/Business
|
| Name rolls off the tongue doesn't it
| BeetleB wrote:
| If search is your metric, the web was dead long before OpenAI's
| release of GPT. I gave up on web search a long time ago.
| Vegenoid wrote:
| On Amazon, you used to be able to search the reviews and Q&A
| section via a search box. This was immensely useful. Now, that
| search box first routes your search to an LLM, which makes you
| wait 10-15 seconds while it searches for you. Then it presents
| its unhelpful summary, saying "some reviews said such and
| such", and I can finally click the button to show me the actual
| reviews and questions with the term I searched.
|
| This is going to be the thing that makes me quit Amazon. If I'm
| missing something and there's still a way to do a direct
| search, please tell me.
| fsckboy wrote:
| > FTA _the site has been replaced with an oligarch's plaything,
| a spam-infested right-wing cesspool_
|
| just in case you youngsters don't know, the entire field of
| linguistics itself has been cesspool of marxist analysis since
| before y'all were born. In the peak days of Chomsky, a truly
| great linguist who put MIT at the forefront of linguistics in the
| world, MIT still felt it had to disband his department (stuffing
| it into Philosophy) because it was too political, radicalized,
| and unacademic. It was a big kerfuffle; guess Chomsky was unable
| to manufacture adequate consent!
|
| The anti-western, anti-male, anti-whiteness, deconstructionist,
| lesbian inspired womyn's right to fish-bicycles instead of men,
| critical theory, you name it that has destroyed the Academy today
| was already in full swing in linguistics over 50 years ago, but
| even then was unable to free those Rosenbergs.
|
| And apparently as a result of that, wordfreq will not update any
| more. And Israel, like Carthage, must be destroyed!
|
| (this little time-capsule is meant to point out, _plus ça
| change, plus c'est la même chose._ The perceived craziness of
| politics and campus life today was already in full swing over a
| hundred years ago in revolutionary Europe leading to Fascist and
| Marxist totalitarian states _and their defenders in the USA_,
| which I think we are nowhere close to today but we still hear the
| echoes, even in End of Life announcements for seemingly benign
| activities like word counts.)
| jadayesnaamsi wrote:
| The year 2021 is to wordfreq what 1945 was to carbon-14
| dating.
|
| I guess the same way the scientists had to account for the bomb
| pulse in order to provide accurate carbon-14 dating, wordfreq
| would need a magic way to account for non human content.
|
| Saying "magic" because, unfortunately, it was much easier to
| detect nuclear testing in the atmosphere than it will be to
| detect AI-generated content.
| charlieyu1 wrote:
| Web before 2021 was still polluted by content farms. The articles
| were written by humans, but still, they were rubbish. Nothing
| compared to the current rate of generation, but the web was
| already
| dominated by them.
| bane wrote:
| This is one of the vanguards warning of the changes coming in the
| post-AI world.
|
| >> Generative AI has polluted the data
|
| Just like low-background steel marks the break in history from
| before and after the nuclear age, these types of data mark the
| distinction from before and after AI.
|
| Future models will continue to amplify certain
| statistical properties from their training, that amplified data
| will continue to pollute the public space from which future
| training data is drawn. Meanwhile certain low-frequency data will
| be selected by these models less and less and will become
| suppressed and possibly eliminated. We know from classic NLP
| techniques that low frequency words are often among the highest
| in information content and descriptive power.
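That feedback loop can be sketched with a toy simulation (all numbers below are illustrative): treat a "model" as a unigram distribution, sample a finite corpus from it each generation, and refit by counting. A token that once goes unsampled can never come back, so the tail of the distribution is ground away:

```python
import random
from collections import Counter

random.seed(0)  # deterministic toy run

VOCAB = [f"w{i}" for i in range(200)]
# Zipf-like initial weights: rank 1 is 200x more likely than rank 200.
weights = [1.0 / (rank + 1) for rank in range(200)]

def refit(weights, corpus_size=500):
    """Sample a corpus from the current 'model' and refit by counting."""
    counts = Counter(random.choices(VOCAB, weights=weights, k=corpus_size))
    return [counts[w] for w in VOCAB]

surviving = []
for generation in range(10):
    weights = refit(weights)
    # Tokens with weight 0 can never be sampled again.
    surviving.append(sum(1 for w in weights if w > 0))

print(surviving)  # vocabulary size can only shrink, never recover
```

Real training loops are vastly more complicated than unigram resampling, but the one-way ratchet on low-frequency items is the same mechanism.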
|
| Bitrot will continue to act as the agent of Entropy further
| reducing pre-AI datasets.
|
| These feedback loops will persist, language will be ground down,
| neologisms will be prevented and...society, no longer with the
| mental tools to describe changing circumstances; new thoughts
| unable to be realized, will cease to advance and then regress.
|
| Soon there will be no new low frequency ideas being removed from
| the data, only old low frequency ideas. Language's descriptive
| power is further eliminated and only the AIs seem able to produce
| anything that might represent the shadow of novelty. But it ends
| when the machines can only produce unintelligible pages of
| particles and articles, language is lost, civilization is lost
| when we no longer know what to call its downfall.
|
| The glimmer of hope is that humanity figured out how to rise from
| the dreamstate of the world of animals once. Future humans will
| be able to climb from the ashes again. There used to be a word,
| the name of a bird, that encoded this ability to die and return
| again, but that name is already lost to the machines that will
| take our tongues.
| thechao wrote:
| That went off the rails quickly. Calm down dude: my mother-in-
| law isn't going to forget words because of AI; she's gonna
| forget words because she's 3 glasses of crappy Texas wine into
| the evening.
| bane wrote:
| But your children's children will never learn about love
| because that word will have been mechanically trained out of
| existence.
| Intralexical wrote:
| That's pretty funny. You think love is just a word?
| fer wrote:
| > Future models will begin to continue to amplify certain
| statistical properties from their training, that amplified data
| will continue to pollute the public space from which future
| training data is drawn.
|
| That's why on FB I mark my own writing as AI generated, and the
| AI-generated slop as genuine. Because what is disguised as a
| "transparency disclaimer" is really just flagging which content
| is a potential dataset to train from and which isn't.
| mitthrowaway2 wrote:
| I'm sorry for the low-content remark, but, oh my god... I
| never thought about doing this, and now my mind is reeling at
| the implications. The idea of shielding my own writing from
| AI-plagiarism by masquerading it as AI-generated slop in the
| first place... but then in the same stroke, further
| undermining our collective ability to identify genuine human
| writing, while also flagging my own work as low-value to my
| readers, hoping that they can read between the lines. It's a
| fascinating play.
| aanet wrote:
| You, Sir, may have stumbled upon just the -hack- advice
| needed to post on social media.
|
| Apropos of nothing in particular, see LinkedIn now admitting
| [1] it is training its AI models on "all users by default"
|
| [1] https://www.techmeme.com/240918/p34#a240918p34
| wvbdmp wrote:
| I Have No Words, And I Must Scream
| midnitewarrior wrote:
| From the day of the first spoken word, humans have guided the
| development of language through conversational use and
| institution. With the advent of AI being used to publish
| documents into the open web, humans have given up their
| exclusive domain.
|
| What would it take for the OpenAI overlords to inject words
| they want to force into usage in their models and will new
| words into use? Few have had the power to do such things. OpenAI
| through its popular GPT platform now has the potential of
| dictating the evolution of human language.
|
| This is novel and scary.
| bane wrote:
| It's the ultimate seizure of the means of production, and in
| the end it will be the capitalists who realize that
| revolution.
| Intralexical wrote:
| > Soon there will be no new low frequency ideas being removed
| from the data, only old low frequency ideas. Language's
| descriptive power is further eliminated and only the AIs seem
| able to produce anything that might represent the shadow of
| novelty. But it ends when the machines can only produce
| unintelligible pages of particles and articles, language is
| lost, civilization is lost when we no longer know what to call
| its downfall.
|
| Or we'll be fine, because inbreeding isn't actually sustainable
| either economically or technologically, and to most of the
| world the Silicon Valley "AI" crowd is more an obnoxious gang
| of socially stunted and predatory weirdos than some unstoppable
| omnipotent force.
| sashank_1509 wrote:
| Not to be too dismissive, but is there a worthwhile direction of
| research in NLP to pursue that is not LLMs?
|
| If we add linguistics to NLP I can see an argument, but if we
| define NLP as the research of enabling a computer to process
| language, then it seems to me that LLMs/generative AI is the
| only research an NLP practitioner should focus on and
| everything else is moot. Is there any other paradigm that we
| think can enable a computer to understand language other than
| training a large deep learning model on a lot of data?
| sinkasapa wrote:
| Maybe it is "including linguistics" but most of the world's
| languages don't have the data available to train on. So I think
| one major question for NLP is exactly the question you posed:
| "Is there any other paradigm that we think can enable a
| computer understand language other than training a large deep
| learning model on a lot of data?"
| hcks wrote:
| Okay, but how big a sample size do we actually need for word
| frequencies? Like, what's the goal here? It looks like the
| initial project isn't even stratified per year/decade.
| tqi wrote:
| "Sure, there was spam in the wordfreq data sources, but it was
| manageable and often identifiable."
|
| How sure can we be about that?
| QRe wrote:
| I understand the frustration shared in this post but I
| wholeheartedly disagree with the overall sentiment that comes
| with it.
|
| The web isn't dead, (Gen)AI, SEO, spam and pollution didn't kill
| anything.
|
| The world is chaotic and net entropy (degree of disorder) of any
| isolated or closed system will always increase. Same goes for the
| web. We just have to embrace it and overcome the challenges that
| come with it.
| brunokim wrote:
| Here is an expert saying there is a problem and how it killed
| their research effort, and yet you say that things are the same
| as ever and nothing was killed.
| ryukoposting wrote:
| I'm not so optimistic. The most basic requirements are:
|
| 1. Prove the human-ness of an author...
| 2. ...without grossly encroaching on their privacy.
| 3. Ensure that the author isn't passing off AI-generated
| material as their own.
|
| We'll leave out the "don't let AI models train on my data" part
| for now.
|
| Whatever solution we come up with, if any, will necessarily be
| mired in the politics of privacy, anonymity, and/or DRM. In any
| case, it's hard to conceive of a world where the human web
| returns as we once knew it.
| syngrog66 wrote:
| A few years ago I began an effort to write a new tech book. I
| originally planned to do as much of it as I could across a
| series of commits in a public GitHub repo of mine.
|
| I then changed course. Why? I had read increasing reports of
| human e-book pirates (copying your book's content, then
| repackaging it for sale under a different title, byline, and
| cover, possibly at a much lower or even much higher price).
|
| And then the rise of LLMs and their ravenous training ingest bots
| -- plagiarism at scale and potentially even easier to disguise.
|
| "Not gonna happen." - Bush Sr., via Dana Carvey
|
| Now I keep the bulk of my book material non-public during dev.
| I'm sure I'll share a chapter candidate or so at some point
| before final release, for feedback and publicity. But the bulk
| will debut all together at once, and only once polished and
| behind a paywall.
| whimsicalism wrote:
| NLP and especially 'computational linguistics' in academia have
| been captured by certain political interests; this post is
| reflective of that.
| jchook wrote:
| If it is (apparently) easy for humans to tell when content is AI-
| generated slop, then it should be possible to develop an AI to
| distinguish human-created content.
|
| As mentioned, we have heuristics like frequency of the word
| "delve", and simple techniques such as measuring perplexity. I'd
| like to see a GAN style approach to this problem. It could
| potentially help improve the "humanness" of AI-generated content.
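[Editor's note: as a sketch of the perplexity heuristic mentioned in the comment above, here is a toy character-bigram scorer. It is illustrative only; real detectors score text with large language models, and the function name and sample strings here are made up.]

```python
import math
from collections import Counter

def bigram_perplexity(train_text, test_text):
    """Perplexity of test_text under a character-bigram model
    fit on train_text, with add-one (Laplace) smoothing."""
    pairs = Counter(zip(train_text, train_text[1:]))
    unigrams = Counter(train_text[:-1])
    vocab = len(set(train_text)) + 1  # +1 slot for unseen characters
    log_prob, n = 0.0, 0
    for a, b in zip(test_text, test_text[1:]):
        p = (pairs[(a, b)] + 1) / (unigrams[a] + vocab)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Text resembling the training distribution scores lower
# (less "surprising") than text that deviates from it.
reference = "the cat sat on the mat and the dog sat on the log"
print(bigram_perplexity(reference, "the dog sat on the mat"))
print(bigram_perplexity(reference, "zq xv jk qz vx kj zq"))
```

The same idea scales up: a detector flags text whose perplexity under a strong language model is suspiciously low, since generated text tends to be statistically unsurprising to the model that produced it.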
| aDyslecticCrow wrote:
| > If it is (apparently) easy for humans to tell when content is
| AI-generated slop
|
| It's actually not. It's rather difficult for humans as well. We
| can see verbose, confused text and call it AI, but it could
| just as easily be a human.
|
| To borrow from an older model-training method, the "generative
| adversarial network": if we can distinguish AI from humans, we
| can use that signal to improve the AI and close the gap.
|
| So it becomes an arms race that constantly evolves.
| honksillet wrote:
| Twitter was a botnet long before LLMs and Musk got involved.
| jedberg wrote:
| We need a vintage data/handmade data service. A service that can
| provide text and images for training that are guaranteed to have
| either been produced by a human or produced before 2021.
|
| Someone should start scanning all those microfiche archives in
| local libraries and sell the data.
| will-burner wrote:
| > It's rare to see NLP research that doesn't have a dependency on
| closed data controlled by OpenAI and Google, two companies that I
| already despise.
|
| The dependency on closed data combined with the cost of compute
| to do anything interesting with LLMs has made individual
| contributions to NLP research extremely difficult if one is not
| associated with a very large tech company. It's super
| unfortunate: it makes the subject area much less approachable
| and the people doing research in it much more homogeneous.
| jonas21 wrote:
| I think the main reason for sunsetting the project is buried near
| the bottom:
|
| > _The field I know as "natural language processing" is hard to
| find these days. It's all being devoured by generative AI. Other
| techniques still exist but generative AI sucks up all the air in
| the room and gets all the money._
|
| Traditional NLP has been surpassed by LLMs. This is clear from
| the benchmarks. The rest of the post is just rationalization and
| sour grapes.
___________________________________________________________________
(page generated 2024-09-18 23:00 UTC)