[HN Gopher] On the dangers of stochastic parrots: Can language m...
       ___________________________________________________________________
        
       On the dangers of stochastic parrots: Can language models be too
       big? (2021)
        
       Author : Schiphol
       Score  : 49 points
       Date   : 2023-01-14 18:58 UTC (4 hours ago)
        
 (HTM) web link (dl.acm.org)
 (TXT) w3m dump (dl.acm.org)
        
       | hardwaregeek wrote:
       | I'm still midway through the paper, but I gotta say, I'm a little
       | surprised at the contrast between the contents of the paper and
       | how people have described it on HN. I don't agree with everything
       | that is said, but there are some interesting points made about
       | the data used to train the models, such as it capturing bias (I
       | would certainly question the methodology of using reddit as a
       | large source of training data), and that bias being amplified by
       | filtering algorithms that produce the even larger datasets used
       | for modern LLMs. The section about environmental impact might not
       | hit home for everyone, but it is valid to raise issues around the
       | compute usage involved in training these models. First, because
       | it limits this training to companies who can spend millions of
       | dollars on compute, and second because if we want to scale up
       | models, efficiency is probably a top goal.
       | 
       | What really confuses me here is how this paper is somehow outside
       | the realm of valid academic discourse. Yes, it is steeped in
       | activist, social justice language. Yes, it has a different
        | perspective than most CS papers. But is that wrong? Is that
        | enough of a sin to warrant the response this paper has received?
        | I'll need to finish the paper to fully judge, but I'm leaning
        | towards no, it is not enough of a sin.
        
       | monkaiju wrote:
        | The problems with LLMs are numerous, but what's really wild to
        | me is that even as they get better at fairly trivial tasks, the
        | advertising gets more and more out of hand. These machines don't
        | think, and they don't understand, but people like the CEO of
        | OpenAI allude to them doing just that, obviously so the hype can
        | make them money.
        
         | amelius wrote:
         | Could be the sign of the next AI winter.
        
         | visarga wrote:
          | > These machines don't think, and they don't understand
         | 
         | But they do solve many tasks correctly, even problems with
         | multiple steps and new tasks for which they got no specific
         | training. They can combine skills in new ways on demand. Call
         | it what you want.
        
           | choeger wrote:
           | They don't. Solve tasks, I mean. There's not a single task
           | you can throw at them and rely on the answer.
           | 
           | Could they solve tasks? Potentially. But how would we ever
           | know that we could trust them?
           | 
            | With humans we have millennia of collective experience when
            | it comes to tasks, judging the result, and finding
            | bullshitters. Also, we can retrain a human on the
           | spot and be confident they won't immediately forget something
           | important over that retraining.
           | 
            | If we ever let a model make important decisions, I'd imagine
            | we'd want to certify it beforehand. But that excludes
            | improvements and feedback - the certified software had better
            | not change. Of course, a feedback loop could involve
            | recertification, but that means the certification process
            | itself needs to be cheap.
            | 
            | And all that doesn't even take into account the generalized
            | interface: how can we make sure that a model is aware of its
            | narrow purpose and doesn't answer tasks outside of that
            | purpose?
           | 
           | I think all these problems could eventually be overcome, but
           | I don't see much effort put into such a framework to actually
           | make models solve tasks.
        
             | pedrosorio wrote:
             | > Also, we can retrain a human on the spot and be confident
             | they won't immediately forget something important over that
             | retraining.
             | 
             | I don't have millennia, but my more than 3 decades of
             | experience interacting with human beings tell me this is
             | not nearly as reliable as you make it seem.
        
         | didntreadarticl wrote:
          | I don't think you understand, mate
        
         | LarryMullins wrote:
          | > _These machines don't think_
         | 
         | And submarines don't swim.
        
           | Dylan16807 wrote:
           | And it would be bad for a submarine salesman to go to people
           | that think swimming is very special and try to get them
           | believing that submarines do swim.
        
             | LarryMullins wrote:
             | Why would that be bad? A submarine salesman convincing you
             | that his submarine "swims" doesn't change the set of
             | missions a submarine might be suitable for. It makes no
             | practical difference. There's no point where you get the
             | submarine and it meets all the advertised specs, does
             | everything you needed a submarine for, but you're
             | unsatisfied with it anyway because you now realize that the
             | word "swim" is reserved for living creatures.
             | 
             | And more to the point, nobody believes that "it thinks" is
             | sufficient qualification for a job when hiring a human, so
             | why would it be different when buying a machine? Whether or
             | not the machine "thinks" doesn't address the question of
             | whether or not the machine is capable of doing the jobs you
             | want it to do. Anybody who neglects to evaluate the
             | _functional capability_ of the machine is simply a fool.
        
       | sthatipamala wrote:
       | This paper is the product of a failed model of AI safety, in
       | which dedicated safety advocates act as a public ombudsman with
       | an adversarial relationship with their employer. It's baffling to
       | me why anyone thought that would be sustainable.
       | 
        | Compare this to something like RLHF[0] which has achieved far
       | more for aligning models toward being polite and non-evil. (This
       | is the technique that helps ChatGPT decline to answer questions
       | like "how to make a bomb?")
       | 
       | There's still a lot of work to be done and the real progress will
       | be made by researchers who implement systems in collaboration
       | with their colleagues and employers.
       | 
       | [0] https://openai.com/blog/instruction-following/
        
         | generalizations wrote:
         | > The resulting InstructGPT models are much better at following
         | instructions than GPT-3. They also make up facts less often,
         | and show small decreases in toxic output generation. Our
         | labelers prefer outputs from our 1.3B InstructGPT model over
         | outputs from a 175B GPT-3 model, despite having more than 100x
         | fewer parameters.
         | 
         | I wonder if anyone's working on public models of this size.
          | Looking forward to when we can self-host ChatGPT.
        
           | lumost wrote:
            | This is going to happen _a lot_ over the next few years. One
            | can fine-tune GPT-2 medium on an RTX 2070. Training GPT-2
            | medium from scratch can be done for $162 on vast.ai. The
            | newer H100/Trainium/Tensorcore chips will bring the price
            | down even further.
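            | 
            | Concretely, "fine-tune GPT-2 medium" is only a few lines with
            | the transformers library - a rough sketch, where
            | "my_corpus.txt" is a stand-in for whatever text you tune on:
            | 
            |     import torch
            |     from transformers import AutoModelForCausalLM
            |     from transformers import AutoTokenizer
            | 
            |     tok = AutoTokenizer.from_pretrained("gpt2-medium")
            |     model = AutoModelForCausalLM.from_pretrained(
            |         "gpt2-medium").cuda()
            |     optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
            | 
            |     text = open("my_corpus.txt").read()  # placeholder
            |     ids = tok(text, return_tensors="pt").input_ids
            | 
            |     for step in range(ids.size(1) // 512):
            |         chunk = ids[:, step * 512:(step + 1) * 512].cuda()
            |         loss = model(chunk, labels=chunk).loss  # LM loss
            |         loss.backward()
            |         optim.step()
            |         optim.zero_grad()
            | 
            | In practice you'd want fp16, gradient accumulation and
            | shuffling to make it fit and converge on an 8 GB card, but
            | that's the gist.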
           | 
            | I suspect that if one wanted to fully replicate ChatGPT from
            | scratch it would take ~$1-2 million including label
            | acquisition. You probably only require ~$200-500k in compute.
           | 
           | The next few years are going to be wild!
        
             | generalizations wrote:
             | These things have reached the tipping point where they
             | provide significant utility to a significant portion of the
             | computer scientists working on making these things. Could
             | be that the coming iterations of these new tools will make
             | it increasingly easy to write the code for the next
             | iterations of these tools.
             | 
             | I wonder if this is the first rumblings of the singularity.
        
               | visarga wrote:
               | chatGPT being able to write OpenAI API code is great, and
               | all companies should prepare samples so future models can
               | correctly interface with their systems.
               | 
               | But what will be needed is to create an AI that
               | implements scientific papers. About 30% of papers have
               | code implementation. That's a sizeable dataset to train a
               | Codex model on.
               | 
               | You can have AI generating papers, and AI implementing
               | papers, then learning to predict experimental results.
               | This is how you bootstrap a self improving AI.
               | 
                | It does not learn only how to recreate itself; it learns
                | how to solve all problems at the same time. A data
                | engineering approach to AI: search and learn / solve and
                | learn / evolve and learn.
        
               | williamcotton wrote:
               | I can imagine a world where there are an infinity of
               | "local maximums" that stop a system from reaching a
               | singular feedback loop... imagine if our current tools
               | help write the next generation, so on, so on, until it
               | gets stuck in some local optimization somewhere. Getting
               | stuck seems more likely than not getting stuck, right?
        
         | Natsu wrote:
          | > Compare this to something like RLHF[0] which has achieved far
         | more for aligning models toward being polite and non-evil.
         | (This is the technique that helps ChatGPT decline to answer
         | questions like "how to make a bomb?")
         | 
         | I recently saw a screenshot of someone doing trolley problems
         | with people of all races & ages with ChatGPT and noting
         | differences. That makes me not quite as confident about
         | alignment as you are.
        
           | sthatipamala wrote:
           | I am curious to see that trolley problem screenshot. I saw
           | another screenshot where ChatGPT was coaxed into justifying
           | gender pay differences by prompting it to generate
           | hypothetical CSV or JSON data.
           | 
           | Basically you have to convince modern models to say bad stuff
           | using clever hacks (compared to GPT-2 or even early GPT-3
           | where it would just spout straight-up hatred with the
           | lightest touch).
           | 
           | That's very good progress and I'm sure there is more to come.
        
             | cactusplant7374 wrote:
             | > I saw another screenshot where ChatGPT was coaxed into
             | justifying gender pay differences by prompting it to
             | generate hypothetical CSV or JSON data.
             | 
              | I remember seeing that on Twitter. My impression was that
              | the author instructed the AI to discriminate by gender.
        
               | Dylan16807 wrote:
               | Did the author tell it which way or by how much?
               | 
               | If I say to discriminate on some feature and it
               | consistently does it the same way, that's still a pretty
               | bad bias. It probably shows up in other ways.
        
         | andrepd wrote:
         | Isn't RLHF trivially easy to defeat (as it stands now)?
        
           | ShamelessC wrote:
           | Assuming a motivated "attacker", yes. The average user will
           | have no such notion of "jailbreaks", and it's at least clear
           | when one _is_ attempting to "jailbreak" a model (given a full
           | log of the conversation and a competent human investigator).
           | 
           | I think the class of problems that remain are basically
           | outliers that are misaligned and don't trip up the model's
           | detection mechanism. Given the nature of language and culture
           | (not to mention that they both change over time), I imagine
           | there are a lot of these. I don't have any examples (and I
           | don't think yelling "time's up" when such outliers are found
           | is at all helpful).
        
         | visarga wrote:
         | > researchers who implement real systems
         | 
         | That's what I didn't like about Gebru - too much critique, not
         | a single constructive suggestion. Especially her Gender Shades
         | paper where she forgot about Asians.
         | 
         | http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a...
         | 
          | I think AnthropicAI is a great company to follow when it comes
          | to actually solving these problems. Look at their
          | "Constitutional AI" paper. They automate and improve on RLHF.
         | 
         | https://www.anthropic.com/constitutional.pdf
        
           | [deleted]
        
       | srvmshr wrote:
        | I am of the general understanding that this paper became less
        | about the LLMs & more of an insinuating hit piece against
        | Alphabet. At least, some of the controversial nuggets got Gebru
        | (and later M Mitchell) fired.
        | 
        | From a technical standpoint, I found little new that this paper
        | offered in understanding why LLMs can behave unpredictably, or
        | what degree of data will get exposed by clever hacks (or whether
        | there are systematic ways to go about it). It sounded more like
        | a collection of verifiable anecdotes for easy consumption (which
        | can be a good thing by itself if you want a capsule
        | understanding in a non-technical way).
        
         | visarga wrote:
          | It was activism masquerading as science. Many researchers noted
         | that positives and negatives were not presented in a balanced
         | way. New approaches and efforts were not credited.
        
           | srvmshr wrote:
           | I haven't kept track but the activism of the trio could be
           | severe sometimes.
           | 
            | (Anecdotally, I have faced a bite-sized brunt: when the
            | discussion surrounding this paper was happening on Twitter, I
            | had mentioned on my timeline (in a neutral tone) that the
            | "dust needed to settle to understand what was going wrong".
            | This
           | was unfortunately picked up & RTed by Gebru & the mob
           | responded by name-calling, threatening DMs accusing me of
           | racism/misogyny etc, and one instance of a call to my
           | employer asking to terminate me - all for that one single
            | tweet. I don't want confrontations - dealing with them is not
            | my forte.)
        
             | [deleted]
        
             | [deleted]
        
             | zzzeek wrote:
             | > This was unfortunately picked up & RTed by Gebru & the
             | mob responded by name-calling, threatening DMs accusing me
             | of racism/misogyny etc, and one instance of a call to my
             | employer asking to terminate me - all for that one single
             | tweet.
             | 
             | Wait until an LLM flags your speech and gets you in
             | trouble. That'll be a real hoot compared to random
             | individuals who likely have been chased off Twitter by now.
        
             | visarga wrote:
             | Sounds similar to what I have witnessed on Twitter, not
             | against me, but against a few very visible people in the AI
             | community.
        
         | [deleted]
        
           | [deleted]
        
       | larve wrote:
        | I just finished working my way through this paper this morning.
        | The literature list is quite interesting and gives a lot of
        | pointers for people who want to walk the line between overblown
        | hype and doomsday scenarios.
        
       | xiaolingxiao wrote:
        | I believe this is the paper that got Timnit Gebru and Margaret
        | Mitchell fired from Google, followed by a protracted media/legal
        | campaign against Google and vice versa.
        
         | freyr wrote:
         | I suspect it was Timnit's behavior after the paper didn't pass
         | internal review that actually got her fired (issuing an
         | ultimatum and threatening to resign unless the company met her
         | demands; telling her coworkers to stop writing documents
         | because their work didn't matter; insinuations of
         | racist/misogynistic treatment from leadership when she didn't
         | get her way).
        
           | visarga wrote:
            | I think it was a well-calculated career move: she wanted
            | fame, and she got what she wanted. Now she's leading a new
            | research institute:
           | 
           | > We are an interdisciplinary and globally distributed AI
           | research institute rooted in the belief that AI is not
           | inevitable, its harms are preventable, and when its
           | production and deployment include diverse perspectives and
           | deliberate processes it can be beneficial. Our research
           | reflects our lived experiences and centers our communities.
           | 
           | https://www.dair-institute.org/about
        
         | oh_sigh wrote:
         | A small correction: this paper didn't get her fired, her
         | reaction to feedback on this paper got her fired.
         | 
         | Note to all: if you give an employer an ultimatum "do X or I
         | resign", don't be surprised if they accept your resignation.
        
         | [deleted]
        
           | [deleted]
        
       | 2bitencryption wrote:
       | Pure speculation ahead-
       | 
       | The other day on Hacker News, there was that article about how
       | scientists could not tell GPT-generated paper abstracts from real
       | ones.
       | 
       | Which makes me think- abstracts for scientific papers are high-
       | effort. The corpus of scientific abstracts would understandably
       | have a low count of "garbage" compared to, say, Twitter posts or
       | random blogs.
       | 
       | That's not to say that all scientific abstracts are amazing, just
       | that their goal is to sound intelligent and convincing, while
       | probably 60% of the junk fed into GPT is simply clickbait and
       | junk content padded to fit some publisher's SEO requirements.
       | 
       | In other words, ask GPT to generate an abstract, and I would
       | expect it to be quite good.
       | 
        | Ask it to generate a 5-paragraph essay about Huckleberry Finn,
        | and I would expect it to be the same quality as the corpus -
        | that is to say, the work of high-school English students.
       | 
       | So now that we know these models can learn many one-shot tasks,
       | perhaps some cleanup of the training data is required to advance.
        | Imagine GPT trained ONLY on the Library of Congress, without the
       | shitty travel blogs or 4chan rants.
        
         | williamcotton wrote:
          | The science is in the reproduction of the methodology, not in
          | the abstract... in fact, a lot of garbage publications with
          | catchy abstracts built on a shaky foundation sound like one of
          | the issues that plague contemporary science. That people would
          | stop finding abstracts useful seems like a good thing!
        
         | JPLeRouzic wrote:
         | > " _The corpus of scientific abstracts would understandably
         | have a low count of "garbage" compared to, say, Twitter posts
         | or random blogs_"
         | 
          | That's certainly true, but not by so large a margin, at least
          | in biology.
          | 
          | For example, in ALS (a neurodegenerative disease) there is a
          | real breakthrough perhaps every two years, but most papers
          | about ALS (thousands every year) look like they describe
          | something very important.
          | 
          | Similarly, according to ALZforum the most recent "milestone"
          | paper about Alzheimer's disease was in 2012, yet in 2022 alone
          | there were more than 16K papers!
          | 
          | So the signal-to-noise ratio is close to zero.
         | 
         | https://www.alzforum.org/papers?type%5Bmilestone%5D=mileston...
         | 
         | https://pubmed.ncbi.nlm.nih.gov/?term=alzheimer%27s+disease&...
        
         | [deleted]
        
         | ncraig wrote:
         | Some might say that abstracts are the original clickbait.
        
       | weeksie wrote:
       | This was mostly political guff about environmentalism and bias,
       | but one thing I didn't know was that apparently larger models
       | make it easier to extract training data.
       | 
       | > Finally, we note that there are risks associated with the fact
       | that LMs with extremely large numbers of parameters model their
       | training data very closely and can be prompted to output specific
       | information from that training data. For example, [28]
       | demonstrate a methodology for extracting personally identifiable
       | information (PII) from an LM and find that larger LMs are more
       | susceptible to this style of attack than smaller ones. Building
       | training data out of publicly available documents doesn't fully
       | mitigate this risk: just because the PII was already available in
       | the open on the Internet doesn't mean there isn't additional harm
       | in collecting it and providing another avenue to its discovery.
       | This type of risk differs from those noted above because it
       | doesn't hinge on seeming coherence of synthetic text, but the
       | possibility of a sufficiently motivated user gaining access to
       | training data via the LM. In a similar vein, users might query
       | LMs for 'dangerous knowledge' (e.g. tax avoidance advice),
       | knowing that what they were getting was synthetic and therefore
       | not credible but nonetheless representing clues to what is in the
       | training data in order to refine their own search queries
       | 
        | Shame they only gave that one graf. I'd like to know more about
        | this. Again, miss me with the political garbage about "dangerous
        | knowledge"; the most concerning thing is the PII leakage, as far
        | as I can tell.
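        | 
        | If I understand the cited attack correctly, the rough idea is to
        | sample lots of continuations from the model and flag the ones it
        | is suspiciously confident about (low perplexity), since those
        | are often memorised verbatim. A toy version, with a placeholder
        | model and prompt rather than anything from the paper:
        | 
        |     import torch
        |     from transformers import AutoModelForCausalLM
        |     from transformers import AutoTokenizer
        | 
        |     tok = AutoTokenizer.from_pretrained("gpt2")
        |     model = AutoModelForCausalLM.from_pretrained("gpt2")
        | 
        |     # Arbitrary prefix, purely illustrative.
        |     ids = tok("Contact me at", return_tensors="pt").input_ids
        |     sample = model.generate(ids, do_sample=True, top_k=40,
        |                             max_new_tokens=40)
        | 
        |     with torch.no_grad():
        |         loss = model(sample, labels=sample).loss
        |     # Unusually low perplexity suggests memorised text.
        |     print(torch.exp(loss).item(), tok.decode(sample[0]))
        | 
        | Scale that up to hundreds of thousands of samples with better
        | scoring and you get the kind of PII extraction the paper is
        | worried about.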
        
         | visarga wrote:
          | Is this a good or a bad thing? We hear "hallucination" this and
          | that: you can't rely on the LLM, it is not like a search
          | engine. But then you hear from the other side "it memorises
          | PII".
          | 
          | Being able to memorise information is exactly what we want when
          | we ask for the top 5 countries by population in Europe or the
          | height of Everest. But then we don't want it in other contexts.
          | 
          | Looks more like a dataset pre-processing issue.
        
           | weeksie wrote:
           | I _think_ I agree with this take.
           | 
           | Is it conceivable that a model could leak PII that is present
           | but extremely hard to detect in the data set? For example,
           | spread out in very different documents in the corpus that
           | aren't obviously related, but that the model would synthesize
           | relatively easily?
        
         | srvmshr wrote:
          | That is a sort of understood fact even with models like Copilot
          | & ChatGPT. With the amount of information we are generally
          | churning, not all PII may get scrubbed. And these LLMs could
          | often be running on unsanitized data - like a cache of the Web
          | on Archive.org, Getty Images & the like.
          | 
          | I feel this is an unavoidable consequence of using LLMs. We
          | cannot ensure all data is free from any markers. I am not an
          | expert on databases/data engineering, so please take this as an
          | informed opinion.
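          | 
          | To illustrate why scrubbing is never complete, here is a toy
          | regex-based scrubber (purely illustrative, not what any of
          | these systems actually run):
          | 
          |     import re
          | 
          |     EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
          |     PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
          | 
          |     def scrub(text: str) -> str:
          |         text = EMAIL.sub("<EMAIL>", text)
          |         return PHONE.sub("<PHONE>", text)
          | 
          |     # The email and phone number get caught; the name does
          |     # not, and neither would an address or PII spread across
          |     # several unrelated documents.
          |     print(scrub("Reach Jane Doe at jane@example.com "
          |                 "or +1 555 010 0199"))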
        
           | weeksie wrote:
           | Copilot has a ton of well publicised examples of verbatim
           | code being used, but I didn't realize that it was as trivial
           | as all that to go plumbing for it directly.
        
         | [deleted]
        
       | platypii wrote:
       | This paper is embarrassingly bad. It's really just an opinion
       | piece where the authors rant about why they don't like large
       | language models.
       | 
       | There is no falsifiable hypothesis to be found in it.
       | 
       | I think this paper will age very poorly, as LLMs continue to
       | improve and our ability to guide them (such as with RLHF)
       | improves.
        
         | jasmer wrote:
          | This is ok. 90% of research is creative thinking and dialogue.
          | One idea creates the next; some are a foil, some are dead
          | ends. As long as there are no outrageous claims of 'hard
          | evidence' where there is none, it's fine. Maybe the format
          | isn't fully appropriate, but the content is. Most good things
          | come about through a non-linear process which involves
          | provocation somewhere along the line.
        
           | janalsncm wrote:
            | I expect science to have a hypothesis which can be falsified.
            | Otherwise it's just opining on a topic, and we could just
            | call this HN thread "research".
        
             | joshuamorton wrote:
             | Position papers are exceedingly common. Common enough that
             | there's a term for them.
        
         | xwn wrote:
          | I don't know - without enumerating risks to check, there's
          | little basis for doing due diligence and quelling investors'
          | concerns. This massively-cited paper gave a good point of
          | departure for establishing rigorous use of LLMs in the real
          | world. Without that, they're just an unestablished tech with
          | unknown downsides - and that's harder to push to true mass
          | acceptance outside the SFBA/tech bubble.
        
         | srvmshr wrote:
         | This is generally my feeling as well with the paper.
         | 
          | You don't come out feeling "Voila! This tiny thing I learnt is
          | something new", which does happen often with many good papers.
          | Most of the paper just felt a bit anecdotal & underwhelming
          | (but I may be too afraid to say the same on Twitter, for good
          | reason).
        
         | Lyapunov_Lover wrote:
         | Why would there be a falsifiable hypothesis in it? Do you think
         | that's a criterion for something being a scientific paper or
         | something? If it ain't Popper, it ain't proper?
         | 
         | LLMs dramatically lower the bar for generating semi-plausible
         | bullshit and it's highly likely that this will cause problems
         | in the not-so-distant future. This is already happening. Ask
         | any teacher anywhere. Students are cheating like crazy, letting
         | chatGPT write their essays and answer their assignments without
         | actually engaging with the material they're supposed to grok.
         | News sites are pumping out LLM-generated articles and the ease
         | of doing so means they have an edge over those who demand
         | scrutiny and expertise in their reporting--it's not unlikely
         | that we're going to be drowning in this type of content.
         | 
         | LLMs aren't perfect. RLHF is far from perfect. Language models
         | will keep making subtle and not-so-subtle mistakes and dealing
         | with this aspect of them is going to be a real challenge.
         | 
         | Personally, I think everyone should learn how to use this new
         | technology. Adapting to it is the only thing that makes sense.
         | The paper in question raised valid concerns about the nature of
         | (current) LLMs and I see no reason why it should age poorly.
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-01-14 23:00 UTC)