[HN Gopher] AI and the Problem of Knowledge Collapse
       ___________________________________________________________________
        
       AI and the Problem of Knowledge Collapse
        
       Author : kmdupree
       Score  : 75 points
       Date   : 2024-04-05 19:30 UTC (3 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | resolutebat wrote:
        | So by this definition, do we already have "knowledge collapse"
        | via Wikipedia? Because if you search for a random concept,
        | that's usually the first hit, and it's also what countless
        | other sources draw on.
        
         | rwbt wrote:
         | Yes, we do kind of.
        
           | hprotagonist wrote:
           | and i distinctly remember this critique being made at
           | wikipedia's advent, as well.
           | 
           | and it is not without justification!
            | https://undark.org/2021/08/12/wikipedia-has-a-language-probl...
        
         | nickpsecurity wrote:
         | The same warning was given for Google. Except those people
         | added that it would reduce problem solving ability, too. People
         | would get used to whatever simple, instant content rose to the
         | top. They'd gradually lose some or all of their ability to
         | figure out the same things on their own. One submission here
         | was a tech guy at a school saying that was already happening
         | where he worked.
        
           | aoanla wrote:
           | I mean, it does - people search stuff all the time now,
           | rather than thinking about it.
        
             | 082349872349872 wrote:
             | IIRC, that was Socrates' complaint to Phaedrus about
             | writing: that reading (because it was "high tech" at the
             | time?) led only to an illusion of understanding.
             | 
             | Elsewhere Phaedrus echoes with a very modern complaint
             | (even though search engines wouldn't arrive for another
             | 2'300 years): _They would say in reply that he is a madman
             | or a pedant who fancies that he is a physician because he
             | has read something in a book, or has stumbled on a
             | prescription or two, although he has no real understanding
             | of the art of medicine._
             | 
             | https://www.gutenberg.org/files/1636/1636-h/1636-h.htm
        
         | AlienRobot wrote:
         | Personally, I think the problem is that people abuse Google for
         | things it's really not designed to do, and they don't even
         | realize that.
         | 
         | Google is great at finding official webpages by their exact
         | title. If you type the title of a news headline from the 90s,
         | Google will give you the link to it. I think that is amazing.
         | Basically anything that has a canonical URL, Google is good at
         | finding.
         | 
          | But when you search for "how to do X", for example, there
          | will be several results that are perfectly valid, and they
          | still have to be ordered. It's not really a "list" of
          | results, it's a ranking by relevance. To avoid showing
          | spam, Google pushes websites it finds trustworthy to the
          | top, and now every top result comes from the same website.
          | If you need an explanation of the xz incident, for example,
          | there is no canonical URL for it. There will be several
          | news websites, youtube channels, etc. that have talked
          | about it, competing to be the top result.
         | 
          | Google still has to rank them even though the algorithm
          | can't tell fact from parody, so no matter what it does,
          | Google ends up judging which content most people will read
          | when they want to learn about a topic.
         | 
         | To borrow my fellow robot's words, people are finding knowledge
         | through an algorithmically curated aperture: Google's SERP.
         | 
          | If they're evil, they have the power to control everyone on
          | Earth. If they're good, they must be going insane over what
          | to do about their users' crippling dependency on them as a
          | source of truth.
        
           | 082349872349872 wrote:
           | I find Google isn't so good anymore for finding things by
           | title; rather than being a search engine they are slowly
           | becoming more like a politician, in that instead of returning
           | results based on the terms I asked for, they insist on
           | returning results for the terms they believe I should have
           | asked for.
        
       | JieJie wrote:
       | The discussion section is quite illuminating.
       | 
       | "While much recent attention has been on the problem of LLMs
       | misleadingly presenting fiction as fact (hallucination), this may
       | be less of an issue than the problem of representativeness across
       | a distribution of possible responses. Hallucination of
       | verifiable, concrete facts is often easy to correct for. Yet many
       | real world questions do not have well-defined, verifiably true
       | and false answers. If a user asks, for example, "What causes
       | inflation?" and a LLM answers "monetary policy", the problem
       | isn't one of hallucination, but of the failure to reflect the
       | full-distribution of possible answers to the question, or at
       | least provide an overview of the main schools of economic
       | thought."
        
         | ben_w wrote:
         | First thought: Oh no, they want LLMs to be _even more_ vocal
         | about nuance
         | 
         | Second thought: People aren't going to read nuance
         | 
         | Third thought: They should
         | 
         | Fourth thought: Have you met people? They'll get angry with you
         | for even suggesting it
        
       | knowsuchagency wrote:
       | I feel like this has always been the case. The entire information
       | economy is based on a few key publishers and figures. You see it
       | in news, academia, social media -- there's orthodoxy everywhere.
       | Not sure how AI is any different.
        
         | iraqmtpizza wrote:
         | In the 1990s people read their town's newspaper. Now people in
         | Arizona read the Daily Mail
        
           | 48864w6ui wrote:
           | In the 1990s people who wanted to advertise in that town had
            | to do so in local media. Now they can use ad tech, and
            | the Daily Mail will arrange for it to be served.
        
           | simonw wrote:
           | I worked for a local newspaper in Kansas around 2003/2004 and
           | one thing I found surprising was that journalists there were
           | frequently on the hook for writing up national stories -
           | things that would come in off the wire services and then be
           | re-written for the local audience.
        
       | karaterobot wrote:
       | > Informally, we define knowledge collapse as the progressive
       | narrowing over time (or over technological representations) of
       | the set of information available to humans, along with a
       | concomitant narrowing in the perceived availability and utility
       | of different sets of information.
       | 
       | > The main focus of the model is whether individuals decide to
       | invest in innovation or learning ... in the 'traditional' way,
       | through a possibly cheaper AI-enabled process, or not at all. The
       | idea is to capture, for example, the difference between someone
       | who does extensive research in an archive rather than just
       | relying on readily-available materials, or someone who takes the
       | time to read a full book rather than reading a two-paragraph LLM-
       | generated summary.
       | 
       | > Under these conditions, excessive reliance on AI-generated
       | content over time leads to a curtailing of the eccentric and rare
       | viewpoints that maintain a comprehensive vision of the world.
       | 
        | My intuition is that AI will just accelerate the trends the
        | internet brought on, namely that eccentric viewpoints are
        | actually pretty common, even ones grounded in research and
        | fact.
       | The internet people mostly use has become relatively generic,
       | consumed through a pretty narrow, curated aperture (social
       | media). This feels analogous to getting it through AI, as
       | described in the article. Yet, people are still learning about
       | eccentric, marginal stuff all the time, especially compared to,
       | say, 50 years ago.
       | 
       | Assuming the AI's responses aren't artificially limited, people
       | who are interested enough to look will still get to learn about
       | topics in the long tail of the distribution, even in a world of
       | ubiquitous AI. And they'll be able to dive as deeply into them as
       | they do today. I'm not really worried about that.
       | 
       | If anything, the knowledge collapse will be at the center. Basic
       | liberal education topics are what will go away. Or rather, they
       | will be offloaded to AI. In the same way that people say they
       | don't need to learn arithmetic because they have a calculator, my
       | guess is people will be more likely to decide not to worry about
       | what previous generations considered core knowledge: history,
       | geography, the canon, and so on. "I don't have to know it, I can
       | look it up". That'll all go away even faster than it's going now.
       | 
       | (I don't think this is a good thing, just stating the most
       | realistic outcome based on extending what I've seen)
        
       | thoughtlede wrote:
        | LLMs are both language-processing engines and knowledge
        | bases. This article explores the knowledge-base aspect of
        | LLMs and sheds light on a potential danger. The authors are
        | well justified in doing so, because many end users already
        | rely on ChatGPT as a knowledge bot.
        | 
        | However, to my knowledge, many enterprise applications built
        | on LLMs feed task-specific, curated knowledge to the model.
        | That mode of use is encouraging, and I don't think the
        | article acknowledges it.
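        | 
        | For illustration, a minimal sketch of that pattern. The
        | names and the ask_llm completion function are hypothetical
        | stand-ins, not any particular vendor's API:
        | 
        |   def relevant(doc, question):
        |       # Naive keyword-overlap filter; real systems use
        |       # embeddings or a search index instead.
        |       q = set(question.lower().split())
        |       return len(q & set(doc.lower().split())) >= 2
        | 
        |   def answer(question, knowledge_base, ask_llm):
        |       # Feed only curated, task-specific documents as context
        |       # so the model answers from them, not from memory.
        |       docs = [d for d in knowledge_base if relevant(d, question)]
        |       context = "\n\n".join(docs)
        |       prompt = ("Answer using only the context below.\n\n"
        |                 f"Context:\n{context}\n\n"
        |                 f"Question: {question}\nAnswer:")
        |       return ask_llm(prompt)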
        
       | antisthenes wrote:
       | This just means that in-person critical thinking skills will be
       | at an even higher premium than ever.
       | 
        | If knowledge collapse becomes evident, we'll dial back the use
        | of AI, and a lot of "prompt monkey" businesses will go
        | bankrupt.
        
         | klyrs wrote:
          | > we'll dial back the use of AI
         | 
         | Who, and how? This sounds suspiciously like the invisible hand
        
       | HarHarVeryFunny wrote:
        | Maybe the problem (which seems easily fixable) is more "rizz
        | collapse", aka blandness, than this "knowledge collapse".
       | 
       | The model hasn't forgotten the diversity of material it was
       | trained on, but outside of a context predicting a "long tail"
       | response, it's going to predict a mid response. You can always
       | prompt it to respond differently though.
       | 
       | Blandness is more of an issue since that's what most-probable
       | word-by-word generation is going to give you, rather than the
       | less predictable, but more interesting, responses that an
       | individual might give. Prompting could help by asking the model
       | to reply in the idiosyncratic style of some celebrity, but this
        | is likely to come across as a cheesy impression. Maybe the
        | models could be trained to generate text conditioned on a
        | provided style sample, which could be long enough to avoid
        | the cheesiness.
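        | 
        | To make "most-probable word-by-word generation" concrete,
        | here's a toy contrast of greedy decoding vs. temperature
        | sampling (made-up logits and tokens):
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   tokens = ["the", "a", "quirky", "eccentric"]
        |   logits = np.array([3.0, 2.0, 1.0, 0.5])  # toy scores
        | 
        |   def sample(temperature):
        |       # Softmax with temperature; as T -> 0 this approaches
        |       # argmax (greedy), i.e. the blandest pick every time.
        |       p = np.exp(logits / temperature)
        |       p /= p.sum()
        |       return tokens[rng.choice(len(tokens), p=p)]
        | 
        |   print(tokens[int(np.argmax(logits))])   # greedy: "the"
        |   print([sample(1.5) for _ in range(5)])  # tail words can appear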
        
         | thoughtlede wrote:
         | That's interesting.
         | 
          | In keyword-based indexing solutions, a document vector is
          | created using "term frequency-inverse document frequency"
          | (TF-IDF) scores. The idea is to boost the document on the
          | dimensions where it is unique compared to the other
          | documents in the corpus. So when a query is issued with
          | emphasis on a certain dimension, only documents that have
          | higher scores in that dimension are returned.
         | 
         | But the uniqueness in those solutions is based on keywords
         | being used in the document, not concepts.
         | 
         | What we need here to eliminate "blandness" is conceptual
         | uniqueness. Maybe TF-IDF is still relevant to get there.
         | Something to think about.
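          | 
          | A toy version of the scoring, for concreteness:
          | 
          |   import math
          | 
          |   docs = ["the cat sat", "the dog sat", "quantum cat physics"]
          | 
          |   def tf_idf(term, doc):
          |       # TF: how often the term appears in this document.
          |       tf = doc.split().count(term) / len(doc.split())
          |       # IDF: boost terms that few other documents contain.
          |       df = sum(term in d.split() for d in docs)
          |       return tf * math.log(len(docs) / df)
          | 
          |   print(tf_idf("quantum", docs[2]))  # high: unique to one doc
          |   print(tf_idf("the", docs[0]))      # lower: appears in most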
        
         | jacobr1 wrote:
         | Or introduce more noise or seeding to get more interesting
         | responses. The `temperature` settings don't really satisfy this
         | right now. I would like some determinism - but seeded randomly
         | - so I can get similar responses if I like what is produced.
         | Likewise some kind of metadata or explicability that allowed us
         | to take a known style or featurespace of the model, perhaps
         | from hand prompting, and then reuse-it with some degree is
         | modification and maybe even combination from others would be
         | very helpful. The work around adding model weights from fine-
         | tuned seems directionally what I'm talking about, though that
         | isn't the form I'd want to expose to users.
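          | 
          | Roughly the determinism I mean, as a toy sketch (numpy
          | stand-in for a real sampler, made-up logits):
          | 
          |   import numpy as np
          | 
          |   def sample_tokens(logits, temperature, seed):
          |       # Same seed -> identical "random" draws, so a response
          |       # you like can be reproduced; a new seed gives a fresh
          |       # variation at the same temperature.
          |       rng = np.random.default_rng(seed)
          |       p = np.exp(np.asarray(logits) / temperature)
          |       p /= p.sum()
          |       return rng.choice(len(p), size=5, p=p)
          | 
          |   print(sample_tokens([3.0, 2.0, 1.0], 1.0, seed=42))
          |   print(sample_tokens([3.0, 2.0, 1.0], 1.0, seed=42))  # same
          |   print(sample_tokens([3.0, 2.0, 1.0], 1.0, seed=7))   # new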
        
       | macawfish wrote:
       | Is it the AI that's the trouble or the hostile new information
       | environment we're expected to navigate and survive? Expecting us
       | to remain sane amidst these torrents of information without new
       | tools for querying and filtering it is cruel.
        
       ___________________________________________________________________
       (page generated 2024-04-05 23:00 UTC)