[HN Gopher] GPT-4 LLM simulates people well enough to replicate ...
       ___________________________________________________________________
        
       GPT-4 LLM simulates people well enough to replicate social science
       experiments
        
       Author : thoughtpeddler
       Score  : 203 points
        Date   : 2024-08-07 21:30 UTC (1 day ago)
        
 (HTM) web link (www.treatmenteffect.app)
 (TXT) w3m dump (www.treatmenteffect.app)
        
       | thoughtpeddler wrote:
       | Accompanying working paper that demonstrates 85% accuracy of
       | GPT-4 in replicating 70 social science experiment results:
       | https://docsend.com/view/qeeccuggec56k9hd
        
         | Jensson wrote:
         | Do you even get 85% replication rate with humans in social
         | science? Doesn't seem right.
         | 
          | Still, it could at least give them hints about where to look,
          | but going down that path is dangerous, as it hands LLM operators
          | the power to shape social science.
        
           | TeaBrain wrote:
            | The study isn't trying to do replication; it seems to have
            | tested the rate at which GPT-4 predicts human responses to
            | survey studies. After reading it, I found the authors were not
            | clear about how they fed the studies whose responses they were
            | trying to predict into the LLM. The data they used for
            | training was also unclear, as they dedicated only a few lines
            | to it. For 18 pages, there was barely any detail on the
            | methods employed. I also don't believe the use of the word
            | "replication" makes any sense here.
        
       | nullc wrote:
       | But do the experiments replicate better in LLMs than in actual
       | humans? :D
       | 
       | We should expect LLMs to be pretty good at repeating back to us
       | the stories we tell about ourselves.
        
       | pedalpete wrote:
       | I wonder if this could be used for testing marketing or UX
       | actions?
        
       | vekntksijdhric wrote:
       | Same energy as https://mastodon.radio/@wa7iut/112923475679116690
        
         | dantyti wrote:
         | why not just link directly?
         | https://existentialcomics.com/comic/557
        
       | 42lux wrote:
        | Everyone and their mom in advertising has sold brands "GPT
        | Persona" tools, which are basically just an API call, for target-
        | group simulation. Think "chat with your target group" kinda stuff.
       | 
       | Hint: They like it because it's biased for what they want... like
       | real marketing studies.
        
         | markovs_gun wrote:
          | Yeah, anyone who has used ChatGPT for more than 30 minutes of
          | asking it to write poetry about King Charles rapping with Tupac
          | and other goofy stuff has realized that it is essentially
          | trained to assume that whatever you're saying to it is true and
          | not to say anything negative to you. It can't write stories
          | without happy endings, and it can't recognize when you ask it a
          | question that contains a false premise. In marketing, I assume
          | that if you ask a fake target demographic whether it will like
          | your new product that is pogs but with blockchain technology,
          | it will pretty much always say yes.
        
           | cj wrote:
           | I've noticed this in article summaries. It does seem to have
           | some weird biases.
           | 
           | I've been able to get around that by directly asking it for
           | pros/cons or "what are the downsides" or "identify any
           | inconsistencies" or "where are the main risks"... etc
           | 
            | There's also a complexity threshold where it performs much
            | better if you break a single question down into multiple
            | parts. You can basically do prompt-based transformations of
            | your own input to break down information and analyze it in
            | different ways prior to using all of that information to
            | finally answer a higher-level question.
           | 
           | I wish ChatGPT could do this behind the scenes. Prompt itself
           | "what questions should I ask myself that would help me answer
           | this question?" And go through all those steps without
           | exposing it to the user. Or maybe it can or already does, but
           | it still seems like I get significantly better results when I
           | do it manually and walk ChatGPT through the thought process
           | myself.
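            | 
            | For example, a minimal sketch of that decompose-then-answer
            | loop, assuming the openai Python client (the prompts and
            | helper names are just illustrative):
            | 
            |     from openai import OpenAI
            | 
            |     client = OpenAI()
            | 
            |     def ask(prompt):
            |         # One chat-completion call; returns the
            |         # model's text reply.
            |         r = client.chat.completions.create(
            |             model="gpt-4",
            |             messages=[{"role": "user",
            |                        "content": prompt}])
            |         return r.choices[0].message.content
            | 
            |     def answer_with_decomposition(question):
            |         # 1. Have the model propose sub-questions.
            |         subs = ask("What questions should I ask "
            |                    "myself to help answer this? "
            |                    "One per line.\n\n" + question)
            |         # 2. Answer each sub-question separately.
            |         notes = [ask(q) for q in subs.splitlines()
            |                  if q.strip()]
            |         # 3. Answer the original question from
            |         #    the collected notes.
            |         return ask("Using these notes:\n"
            |                    + "\n".join(notes)
            |                    + "\n\nNow answer: " + question)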
        
             | Propelloni wrote:
              | If you can do that, and you do, what do you still need the
              | chatbot for? Genuine question, because in my mind that's the
              | heavy lifting you're doing there, and you'll reach a
              | conclusion in the process. All the bot can do is agree with
              | you, and what purpose does that serve?
        
             | markovs_gun wrote:
             | Another interesting case with this is an instance I had
             | with Google Assistant's AI summary feature for group chats.
             | In the group chat, my mom said that my grandma was in the
             | hospital and my sister said she was going to go visit her.
             | In the AI summary, my grandma was on vacation and my sister
             | was in the hospital. Completely useless.
        
           | IIAOPSW wrote:
           | Yes, but with the caveat of in some very specific cases no.
           | 
           | I spent a good deal of time trying to get it to believe there
           | was a concept in the law of "praiseworthy homicide". I even
           | had (real) citations to a law textbook. It refused to believe
           | me.
           | 
            | Given that the legal profession is a major selling point of
            | ChatGPT, and that actually being right matters there, OpenAI
            | has certainly reduced the agreeableness in favor of accuracy
            | in this particular area.
        
             | AnthonBerg wrote:
             | There's a way to apply the concept of praiseworthy homicide
             | - metaphorically - to your battle with it.
        
           | fragmede wrote:
            | Here's a story with a sad ending, called "Sad Musical
            | Farewell":
           | 
           | https://chatgpt.com/share/0d651c67-166f-4cef-
           | bc8c-1f4d5747bd...
        
             | Zambyte wrote:
             | Apparently counter examples are very unappreciated. I also
             | gave a counter example for each of their claims, but my
             | comment got flagged immediately.
             | 
             | https://news.ycombinator.com/item?id=41187549
        
             | markovs_gun wrote:
             | I should have clarified that I meant that it has trouble
             | writing stories with bad endings unless you ask for them
             | directly and specifically, which can be burdensome if
             | you're trying to get it to write a story about something
             | specific that would naturally have a sad ending.
        
           | Terr_ wrote:
           | > it is essentially trained to assume that whatever you're
           | saying to it is true and to not say anything negative to you.
           | 
           | Oh, it's actually worse than that: A given LLM probably has
           | zero concept of "entities", let alone "you are one entity and
           | I am another" or "statements can be truths or lies."
           | 
            | There is merely one combined token stream and a dream-like
            | prediction of the next tokens. While that predicted output
            | often _resembles_ a conversation we expect between two
            | entities with boundaries, that says more about effective
            | mimicry than about its internal operation.
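            | 
            | To make that concrete, here is a toy Python view of what the
            | model sees: one flat string it keeps extending. The role
            | markers are illustrative, not any particular vendor's chat
            | template.
            | 
            |     # One flat token stream; the model only ever
            |     # extends it.
            |     history = ("User: Will people buy blockchain pogs?\n"
            |                "Assistant: What a bold, exciting idea!\n"
            |                "User: Are you sure?\n"
            |                "Assistant:")
            | 
            |     # next_piece = model.generate(history)  # hypothetical
            |     # history += next_piece                 # ...and repeat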
        
             | ithkuil wrote:
              | I agree there is limited modeling going on, but the smoking
              | gun is not that all there is to an LLM is mere "next token
              | prediction".
             | 
             | In order to successfully predict the next token the model
             | needs to reach a significant level of "understanding" of
             | the preceding context and the next token is the "seed" of a
             | much longer planned response.
             | 
             | Now, it's true that this "understanding" is not even close
             | to what humans would call understanding (hence the quotes)
             | and that the model behaviour is heavily biased towards
             | productions that "sound convincing" or "sound human".
             | 
             | Nevertheless LLMs perform an astounding amount of
             | computation in order to produce that next token and that
             | computation happens in a high dimensional space that
             | captures a lot of "features" of the world derived from an
              | unfathomably large and diverse training set. And there is
              | still room for improvement in collecting, cleaning, and/or
              | synthesizing an even better training corpus for LLMs.
             | 
             | Whether the current architecture of LLMs will ever be able
             | to truly model the world is an open question but I don't
             | think the question can be resolved just by pointing out
             | that all the model does is produce the next token. That's
             | just an effective way researchers found to build a channel
             | with the external world (humans and the training set) and
             | transform to and from the high-dimensional reasoning space
             | and legible text.
        
       | TeaBrain wrote:
       | I don't think "replicate" is the appropriate word here.
        
         | valiant55 wrote:
         | I'm sure Philip K Dick would disagree.
        
           | __loam wrote:
           | Dick would hate these guys lol.
        
           | Xen9 wrote:
           | Only until you realize that repli-cant.
        
       | dartos wrote:
       | Were those experiments in the training set?
       | 
        | If so, how close were its predictions to the records it was
        | trained on?
       | 
       | Some interesting insights there, I think.
        
         | masterofpupp3ts wrote:
         | The answers to your questions are in the paper linked in the
         | first line of the app
        
           | cpeterso wrote:
           | > Accuracy remained high for unpublished studies that could
           | not appear in the model's training data (r = 0.90).
        
       | croes wrote:
       | Is that the solution to social science's replication problem?
        
         | nomel wrote:
         | With the temperature parameter effectively set to 0, it may
         | finally be possible!
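          | 
          | (Temperature near 0 makes sampling close to greedy decoding, so
          | repeated runs of the same prompt mostly agree. A minimal
          | sketch, assuming the openai Python client:)
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          |     r = client.chat.completions.create(
          |         model="gpt-4",
          |         temperature=0,  # (near-)greedy decoding
          |         messages=[{"role": "user",
          |                    "content": "<survey item here>"}])
          |     print(r.choices[0].message.content)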
        
       | xp84 wrote:
       | Can someone translate for us non-social-scientists in the
       | audience what this means? "3. Treatment. Write a message or
       | vignette exactly as it would appear in a survey experiment."
       | 
       | Probably would be sufficient to just give a couple examples of
       | what might constitute one of these.
       | 
       | Sorry, I know this is probably basic to someone who is in that
       | field.
        
         | X0nic wrote:
         | Same for me. I had no idea what was being asked.
        
         | LogicalRisk wrote:
         | A treatment might look like
         | 
         | "In the US, XXX are much more likely to be unemployed than are
         | YYY. The unemployment rate is defined as the percentage of
         | jobless people who have actively sought work in the previous
         | four weeks. According to the U.S. Bureau of Labor Statistics,
         | the average unemployment rate for XXX in 2016 was five times
         | higher than the unemployment rate for YYY"
         | 
         | "How much of this difference do you think is due to
         | discrimination?"
         | 
         | In this case you'd fill in XXX and YYY with different values
         | and show those treatments to your participants based on your
         | treatment assignment scheme.
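          | 
          | A minimal Python sketch of that templating and random
          | assignment (the group names and the single-factor scheme here
          | are just placeholders):
          | 
          |     import random
          | 
          |     TEMPLATE = ("In the US, {x} are much more likely to be "
          |                 "unemployed than are {y}. ... How much of "
          |                 "this difference do you think is due to "
          |                 "discrimination?")
          | 
          |     CONDITIONS = [("group A", "group B"),
          |                   ("group B", "group A")]
          | 
          |     def assign(participants):
          |         # Randomly assign each participant to a condition
          |         # and fill in the vignette they will see.
          |         out = {}
          |         for p in participants:
          |             x, y = random.choice(CONDITIONS)
          |             out[p] = TEMPLATE.format(x=x, y=y)
          |         return out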
        
       | tylerrobinson wrote:
       | A YC company called Roundtable tried to do this.[1]
       | 
       | The comments were not terribly supportive. They've since pivoted
       | to a product that does survey data cleaning.
       | 
       | [1] https://news.ycombinator.com/item?id=36865625
        
         | AnthonBerg wrote:
         | A social science experiment in and of itself. A fine thread of
         | tragedy in the rich tapestry of enterprise.
        
         | janalsncm wrote:
         | Products like this make me pretty cynical about VCs' ability to
          | evaluate novel technical products. Any ML engineer who spent 5
          | minutes understanding it would have rejected the pitch.
        
           | rytill wrote:
           | I'm an ML engineer who's spent more than 5 minutes thinking
           | about this idea and would not have automatically rejected the
           | pitch.
        
             | janalsncm wrote:
             | There are so many basic questions raised in the Launch HN
             | thread that didn't have good answers. It indicates to me
             | that YC didn't raise those questions, which is a red flag.
        
       | dongobread wrote:
        | I'm very skeptical of this; the paper they linked is not
        | convincing. It says that GPT-4 correctly predicts the direction
        | of an experiment's outcome 69% of the time versus 66% of the time
        | for human forecasters. But this is a silly benchmark, because
        | people don't trust human forecasters in the first place; that's
        | the whole reason the experiment is run. Knowing that GPT-4 is
        | slightly better at predicting experiments than a human guessing
        | doesn't make it a useful substitute for the actual experiment.
        
         | fl0id wrote:
         | This so much. There was another similar one recently which was
         | also BS.
        
         | yas_hmaheshwari wrote:
         | Nicely put! Well argued!
         | 
         | I was not able to put my finger on what I felt wrong about the
         | article -- till I read this
        
         | authorfly wrote:
         | I totally agree. So many people are missing the point here.
         | 
         | Also important is that in Psychology/Sociology, it's the
         | counter-intuitive results that get published. But these results
         | disproportionately fail to replicate!
         | 
         | Nobody cares if you confirm something obvious, unless it's on
         | something divisive (e.g. sexual behavior, politics), or there
         | is an agenda (dieting, etc). So people can predict those ones
         | more easily than predicting a randomly generated premise. The
         | ones that made their way into the prediction set were the ones
         | researchers expected to be counter-intuitive (and likely
         | P-hacked a significant proportion of them to find that result).
         | People know this (there are more positive confirming papers
         | than negative/fail-to-replicate).
         | 
          | This means the _counter-intuitive, negatively forecast results
          | are the ones that get published_, i.e. the dataset behind the
          | 66% human-forecaster figure is disproportionately made up of
          | studies that found counter-intuitive results compared to the
          | neutral pre-publication pool of studies, because scientists and
          | grant winners are incentivised to publish counter-intuitive
          | work. I would even suggest the selected studies are more
          | tantalizing than average: they are key findings rather than the
          | minutiae of comments on methods or re-analyses.
         | 
          | By the way, the 66% result has not held up super well in other
          | research; for example, only 58% could predict whether papers
          | would replicate later on: https://www.bps.org.uk/research-
          | digest/want-know-whether-psy... - Results with random people
          | show that they are better than chance for psychology, but on
          | average by less than 66% and with massive variance. This figure
          | doesn't differ for psychology professors, which should tell you
          | the stat reflects the context of the field and its research
          | apparatus more than any capability to predict research. What if
          | we revisit this GPT-4 paper in 20 years, see which results have
          | replicated, and ask people to predict that - will GPT-4 still
          | be higher if its data is frozen today? If it is up to date?
          | Will people hit 66%, 58%, or 50%?
         | 
         | My point is, predicting the results now is not that useful
         | because historically, up to "most" of the results have been
         | wrong anyhow. Predicting which results will be true and remain
          | true would be more useful. The article tries to dismiss the
          | issue of the replication crisis by avoiding it and by using
          | pre-registered studies, but such tools are only bandages.
          | Studies still get cancelled or are never proposed after
          | internal experimentation; we don't have a "replication
          | reputation meter" to measure those (which affect and increase
          | false-positive results), and we likely never will with this
          | model of science for psychology/sociology statistics. If the
          | authors read my comment and disagree, they should collect
          | predictions for underway replications from GPT-4 and humans,
          | wait a few years for the results, and then conduct the analysis.
         | 
          | Also, more to the point, as a Psychology grant-funded
          | researcher once told me, the way to get a grant in Psychology
          | is to:
          | 
          | 1) Acquire a counter-intuitive result first. Quick'n'dirty
          | research method like students filling in forms, small sample
          | size, not even published, whatever. Just make the story good
          | for this one and get some preliminary numbers on some topic by
          | casting a big web of many questions (a few will hit P < 0.05 by
          | chance eventually in most topics anyway at this sample size).
          | 
          | 2) Find an angle whereby said result says something about
          | culture or development (e.g. "The Marshmallow experiment shows
          | that poverty is already determined by your response to
          | tradeoffs at a young age", or better still "The Marshmallow
          | experiment is rubbish because it's actually entirely explained
          | by SES as a third factor, and wealth disparity in the first
          | place is ergo the cause"). Importantly, change the research
          | method to something more "proper" and instead apply P-hacking
          | if possible when you actually carry out the research. The
          | biggest P-hack is so simple and obvious nobody cares: you drop
          | results that contradict or are insignificant and just don't
          | report them - carrying out alternate analyses, collecting
          | slightly different data, switching from online to in-person
          | experiments, whatever you can to get a result.
          | 
          | 3) Upon the premise of further tantalizing results, propose
          | several studies which can fund you over 5 years, and apply some
          | of the buzzwords of the day. Instead of "Thematic Analysis",
          | it's "AI Summative Assessment" for the word-frequency counts,
          | etc. If you know the grant judges, avoid contradicting whatever
          | they say, but be just outside of the dogma enough (usually,
          | culturally) to represent movement/progress of "science".
         | 
          | This is how 99% of research works. The grant holder directs the
          | other researchers. When directing them to carry out an
          | alternate version of the experiment or to change what is being
          | analyzed, you motivate them that it's for the good of the
          | future, society, being at the cutting edge, and supporting the
          | overarching theory (which of course already has "hundreds" of
          | supporting studies constructed in the same fashion).
         | 
         | As to sociology/psychology experiments - Do social experiments
         | represent language and culture more than people and groups?
         | Randomly.
         | 
         | Do they represent what would be counter-intuitive or support
         | developing and entrenching models and agendas? Yes.
         | 
          | 90% of social science studies have insufficient data to say
          | anything at the P < 0.01 level, which should realistically be
          | our goal if we even want to do statistics under the current
          | dogma for this field (said kindly, because some large datasets
          | are genuine enough and are used across several studies to make
          | up the numbers in the 10%). I fully expect a revolution in
          | psychology/sociology within the next 50 years to redefine a new
          | basis.
        
           | equinox12 wrote:
           | I think this analysis is misguided.
           | 
           | Even considering a historic bias for counter-intuitive
           | results in social science, this has no bearing on the results
           | of the paper being discussed. Most of the survey experiments
           | that the researchers used in their analyses came from TESS,
           | an NSF-funded program that collects well-powered nationally
           | representative samples for researchers. A key thing to note
           | here is that not every study from TESS gets published. Of
           | course, some do, but the researchers find that GPT4 can
           | predict the results of both published and unpublished studies
           | at a similar rate of accuracy (r = 0.85 for published studies
            | and r = 0.90 for unpublished studies). Also, the majority of
            | these studies 1) were pre-registered (even pre-registering
            | sample size), 2) had their data collected through TESS (an
            | independent survey vendor), and 3) were well-powered and
            | nationally representative, which makes it extremely unlikely
            | that they were p-hacked. Therefore, regardless of what the
            | researchers hypothesized, TESS still collected the data, and
            | the data is of the highest quality within social science.
           | 
           | Moreover, the researchers don't just look at psychology or
           | sociology studies, there are studies from other fields like
           | political science and social policy, for example, so your
           | critiques about psychology don't apply to all the survey
           | experiments.
           | 
           | Lastly, the study also includes a number of large-scale
           | behavioral field experiments and finds that GPT4 can
           | accurately predict the results of these field experiments,
           | even when the dependent variable is a behavioral metric and
           | not just a text-based response (e.g., figuring out which text
           | messages encourage greater gym attendance). It's hard for me
           | to see how your critique works in light of this fact also.
        
             | authorfly wrote:
              | Yes, and I am sure you would have said the same about the
              | research before 2011 and the replication crisis, when it
              | was always claimed that scientists like Bem (premonition)
              | and Baumeister (ego depletion) could not possibly be faking
              | their findings - they contributed so much, their models
              | have "theoretical validity", they had hundreds of studies
              | and other researchers building on their work! They had big
              | samples. Regardless of TESS/NSF, the studies it focuses on
              | have been funded (as you mention), and they were simply not
              | chosen randomly. People had to apply for grants. They had
              | to bring in early, previous, or prototype results to
              | convince people to fund them.
             | 
              | The points specific to psychology apply to most soft-science
              | fields with their typical research techniques.
             | 
              | The main point is that prior research shows absolutely no
              | difference between field experts and random people in
              | predicting the results of studies, whether pre-registered,
              | replications, or others.
             | 
             | GPT-4 achieving the same approximate success rate as any
             | person has nothing whatsoever to do with it simulating
             | people. I suspect an 8 year old could reliably predict
             | psychology replications after 10 years with about the same
             | accuracy. It's also key that in prior studies, like the one
             | I linked, this same lack of difference occurred even when
             | the people involved were provided additional recent
             | resources from the field, although with higher prediction
             | accuracy.
             | 
              | The meat of the issue is simple - show me a true positive
              | study, make the predictions on whether it will replicate,
              | and let's see in 10 years, when replication efforts have
              | been carried out, whether GPT-4 is any higher than a random
              | 10 year old who has no information on the study. The
              | implied claim here is that since GPT-4 can supposedly
              | simulate sociology experiments and so more accurately judge
              | the results, we can iterate it and eventually conduct
              | science that way or speed up the scientific process. I am
              | telling you that the simulation aspect has nothing to do
              | with the success of the algorithm, which is not really
              | outperforming humans, because, to put it simply, humans are
              | bad at using any subject-specific or case knowledge to
              | predict the replication/success of a specific study (there
              | is no difference between lay people and experts), and the
              | entire set of published work is naturally biased anyhow. In
              | other words, this style may elicit higher test scores
              | simply by altering the prompt.
             | 
              | The description of GPT-4's role here as "simulating" is a
              | human theoretical construction. We know that people with a
              | knowledge advantage are not able to apply it to predicting
              | outcomes any more accurately than lay people. That is
              | because they are trying to predict a biased dataset. The
              | field of sociology as a whole, like most fields that study
              | humans (because they are vastly underfunded for large
              | samples), struggles to replicate or conduct science in a
              | reliable, repeatable way, and until we resolve that, the
              | GPT-4 claims of simulating people are spurious and
              | unrelated at best, misleading at worst.
        
               | equinox12 wrote:
               | I'm not sure how to respond to your point about Bem and
               | Baumeister's work since those cases are the most obvious
               | culprits for being vulnerable to scientific
               | weakness/malpractice (in particular, because they came
               | before the time of open access science, pre-registration,
               | and sample sizes calculated from power analyses).
               | 
               | I also don't get your point about TESS. It seems obvious
               | that there are many benefits for choosing the repository
               | of TESS studies from the authors' perspective. Namely, it
               | conveniently allows for a consistent analytic approach
               | since many important things are held constant between
               | studies such as 1) the studies have the exact same sample
               | demographics (which prevents accidental heterogeneity in
               | results due to differences in participant demographics)
               | and 2) the way in which demographic variables are
               | measured is standardized so that the only difference
               | between survey datasets is the specific experiment at
                | hand (this is crucial because the way in which demographic
                | variables are measured can affect the interpretation of
                | results). This is apart from the more
               | obvious benefits that the TESS studies cover a wide range
               | of social science fields (like political science,
               | sociology, psychology, communication, etc., allowing for
               | the testing of robustness in GPT predictions across
               | multiple fields) and all of the studies are well-powered
               | nationally representative probability samples.
               | 
               | Re: your point about experts being equal to random people
               | in predicting results of studies, that's simply not true.
               | The current evidence on this shows that, most of the
               | time, experts are better than laypeople when it comes to
               | predicting the results of experiments. For example, this
               | thorough study (https://www.nber.org/system/files/working
               | _papers/w22566/w225...) finds that the average of expert
               | predictions outperforms the average of laypeople
               | predictions. One thing I will concede here though is
               | that, despite social scientists being superior at
               | predicting the results of lab-based experiments, there
               | seems to be growing evidence that social scientists are
               | not particularly better than laypeople at predicting
               | domain-relevant societal change in the real world (e.g.,
               | clinical psychologists predicting trends in loneliness)
                | [https://www.cell.com/trends/cognitive-
                | sciences/abstract/S136... ; full-text pdf here:
                | https://www.researchgate.net/publication/
                | 374753713_When_expe...].
               | Nonetheless, your point about there being no difference
               | in the predictive capabilities of experts vs. laypeople
               | (which you raise multiple times) is just not supported by
               | any evidence since, especially in the case of the GPT
               | study we're discussing, most of the analyses focus on
               | predicting survey experiments that are run by social
               | science labs.
               | 
                | Also, the authors don't seem to be claiming that these
                | are "replications" of the original work. Rather, GPT4 is
                | able to simulate the results of these experiments like
                | true participants. To fully replicate the work, you'd
                | need to do a lot more (in particular, you'd want to do
                | 'conceptual replications' wherein the underlying causal
                | model is validated but now with different
                | stimuli/questions).
               | 
               | Finally, to address the previous discussion about the
               | authors finding that GPT4 seems to be comparable to human
               | forecasters in predicting the results of social science
               | experiments, let's dig deeper into this. In the paper,
               | but specifically in the supplemental material, the
               | authors note that they "designed the forecasting study
               | with the goal of giving forecasters the best possible
               | chance to make accurate predictions." The way they do
                | this is by showing laypeople the various conditions of
                | the experiment and having the participants predict where
                | the average response for a given dependent variable would
                | fall within each of those conditions. This is _very
               | different_ from how GPT4 predicts the results of
               | experiments in the study. Specifically, they prompt GPT
               | to be a respondent and do this iteratively (feeding it
               | different demographic info each time). The result of this
               | is essentially the same raw data that you would get from
                | actually running the experiment. In light of this, it's
                | clear that this is a very conservative way of testing how
                | much better GPT is than humans at predicting results, and
                | they still find comparable performance. All that said,
                | what's so nice about GPT being able to predict social
                | science results just as well as (or perhaps better than)
                | humans? Well, it's much cheaper (and more efficient) to
                | run thousands of GPT queries than it is to recruit
                | thousands of human participants!
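                | 
                | A rough Python sketch of that simulated-respondent loop,
                | assuming the openai client (the prompts, profiles, and
                | scale are illustrative, not the authors' actual
                | materials):
                | 
                |     from openai import OpenAI
                | 
                |     client = OpenAI()
                | 
                |     def respond(profile, treatment, question):
                |         # Answer as one demographically specified
                |         # simulated respondent.
                |         prompt = ("You are " + profile + ".\n"
                |                   + treatment + "\n" + question
                |                   + "\nAnswer with a number 1-7.")
                |         r = client.chat.completions.create(
                |             model="gpt-4",
                |             messages=[{"role": "user",
                |                        "content": prompt}])
                |         return r.choices[0].message.content
                | 
                |     profiles = ["a 35-year-old suburban woman",
                |                 "a 62-year-old rural man"]
                |     treatments = ["<vignette A>", "<vignette B>"]
                |     question = "How fair is this policy?"
                | 
                |     # One simulated response per profile x condition;
                |     # real use would draw profiles from census-style
                |     # demographic strata.
                |     data = [(p, t, respond(p, t, question))
                |             for p in profiles
                |             for t in treatments]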
        
         | addcn wrote:
         | For sure. Great argument
         | 
         | + the experiments may already be in the dataset so it's really
         | testing if it remembers pop psychology
        
           | a123b456c wrote:
           | Yes. A stronger test would be guessing the results of as-yet-
           | unpublished experiments.
        
         | lumb63 wrote:
         | Furthermore, there's a replication crisis in social sciences.
         | The last thing we need is to accumulate less data and let an
         | LLM tell us the "right" answer.
        
           | verdverm wrote:
           | You can see this in their results, where certain types of
           | studies have a lower prediction rate and higher variability
        
         | katzinsky wrote:
         | That's surprisingly low considering it was probably trained on
         | many of the papers it's supposed to be replicating.
        
       | itkovian_ wrote:
        | Psychohistory
        
       | scudsworth wrote:
       | garbage in, eh?
        
       | AdieuToLogic wrote:
       | So did ELIZA[0] about sixty (60) years ago.
       | 
       | 0 - https://en.wikipedia.org/wiki/ELIZA
        
       | uptownfunk wrote:
       | Is it possible to train an LLM that is minimally biased and that
       | could assume various personas for the purpose of the experiments?
        | Then I imagine it's just some prompt engineering, no?
        
       | nsonha wrote:
        | Please don't. Need I remind you of the joke that social science
        | is not real science?
        
       | visarga wrote:
       | Reminds me of:
       | 
       | > Out of One, Many: Using Language Models to Simulate Human
       | Samples
       | 
       | > We propose and explore the possibility that language models can
       | be studied as effective proxies for specific human sub
       | populations in social science research. Practical and research
       | applications of artificial intelligence tools have sometimes been
       | limited by problematic biases (such as racism or sexism), which
       | are often treated as uniform properties of the models. We show
       | that the "algorithmic bias" within one such tool -- the GPT 3
       | language model -- is instead both fine grained and
       | demographically correlated, meaning that proper conditioning will
       | cause it to accurately emulate response distributions from a wide
       | variety of human subgroups. We term this property "algorithmic
       | fidelity" and explore its extent in GPT-3. We create "silicon
       | samples" by conditioning the model on thousands of socio
       | demographic backstories from real human participants in multiple
       | large surveys conducted in the United States. We then compare the
       | silicon and human samples to demonstrate that the information
       | contained in GPT 3 goes far beyond surface similarity. It is
       | nuanced, multifaceted, and reflects the complex interplay between
       | ideas, attitudes, and socio cultural context that characterize
       | human attitudes. We suggest that language models with sufficient
       | algorithmic fidelity thus constitute a novel and powerful tool to
       | advance understanding of humans and society across a variety of
       | disciplines.
       | 
       | https://arxiv.org/abs/2209.06899
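        | 
        | The conditioning step in that paper is essentially prompt
        | templating over real respondents' backstories. A toy Python
        | sketch (the backstory fields below are invented, not the paper's
        | actual conditioning text):
        | 
        |     BACKSTORY = ("I am {age} years old, I live in {state}, "
        |                  "I work as a {job}, and I voted for {vote} "
        |                  "in 2016.")
        | 
        |     def silicon_prompt(row, survey_question):
        |         # Condition the model on one real respondent's
        |         # backstory, then append the survey item.
        |         return (BACKSTORY.format(**row) + "\n\n"
        |                 + survey_question)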
        
       | anileated wrote:
       | If GPT emulations of social experiments are not correct, policy
       | decisions based on them will make them so.
       | 
       | "GPT said people would hate buses, so we halved their number and
       | slashed transportation budget... Wow, do our people actually hate
       | buses with passion!"
       | 
       | "A year ago GPT said people would not be worried about climate
       | change, so we stopped giving it coverage and removed related
       | social adverts and initiatives. People really don't give a flying
       | duck about climate change it turns out, GPT was so right!"
       | 
       | This is an oversimplification, of course; to say it with more
       | nuance, anything socio- and psycho- is a minefield of self-
       | fulfilling prophecies that ML seems to be nicely positioned to
       | wreak havoc in. (But the small "this is not a replacement for
       | human experiment" notice is going to be heeded by all, right?)
       | 
       | As someone wrote once, all you need for machine dictatorship is
       | an LLM and a critical number of human accomplices. No need for
       | superintelligence or robots.
        
         | crngefest wrote:
         | All you need for dictatorship in general is a critical number
         | of human accomplices. I don't see how an LLM in the mix would
         | make it worse.
         | 
         | IMO mass communication technologies (radio, TV, internet) are
         | much more important in building a dictatorship.
        
           | anileated wrote:
           | The quote was mostly a flourish (and apparently too open to
           | interpretation to be useful).
           | 
           | In any case, it is about hypothetical "machine dictatorship"
           | in particular, not human dictatorships you describe. Machine
           | dictatorship _traditionally_ invokes an image of "AGI" and
           | violent robots forcing or eliminating humans with raw power
           | and compute capabilities, and thus with no substantial need
           | for accomplices (us vs. them). In contrast, it could be that
           | the more realistic and probable danger from ML is in fact
           | more insidious and prosaic.
           | 
           | What you say about human dictatorship is trivially true, but
           | the quote is not about that.
           | 
           | > I don't see how an LLM in the mix would make it worse
           | 
           | How about a thought experiment.
           | 
           | 1. Take some historical persona you consider well-intentioned
           | (for example, Lincoln), throw an LLM in that situation, and
           | see if it could make it better
           | 
           | 2. Take a person you consider a badly intentioned dictator
           | (maybe that is Hitler), throw an LLM in that situation, and
           | see if it could make it worse
           | 
           | Let me know what you find.
        
             | tgv wrote:
             | Don't forget the deceptive aura of objectivity that
             | machines have. It's easier to issue a command when "the
             | machine has decided" or "God has decided" rather than "I
             | just made this up".
        
               | actionfromafar wrote:
               | Even a pair of dice helps in that regard.
        
               | AnimalMuppet wrote:
               | This. The point of the "AI" is that it may make the
               | humans are more willing to go along with the orders.
        
         | Mordisquitos wrote:
         | > "GPT said people would hate buses, so we halved their number
         | and slashed transportation budget... Wow, do our people
         | actually hate buses with passion!"
         | 
         | You jest, but if you don't mind me going off on a tangent, this
         | reminds me how in the summer 2020 post-lockdown-period the
         | local authorities of Barcelona decided that to reduce the
         | spread of COVID they had to discourage out-of-town people going
         | to the city for nightlife... so they halved the number of night
         | buses connecting Barcelona with nearby towns. Because, of
         | course, making twice the number of people congregate at bus
         | stops and making night buses even more crammed was a great way
         | to reduce contagion. Also, as everybody knows, people's
         | decision whether or not to party in town on a Friday night is
         | naturally contingent on the purely rational analysis as to the
         | number of available buses to get home afterwards.
        
           | strogonoff wrote:
           | Institutions have shown themselves not well-geared for
           | coordinating and enacting consistent policy changes and
           | avoiding unintended consequences under time pressure.
           | Hopefully COVID was a lesson they learned from.
           | 
           | I remember how in Seoul city authorities put yellow tape over
           | outdoor sitting areas in public parks, while at the same time
           | cafes (many of which are next to parks, highlighting the
           | hilarity in real time) were full of people--because another
           | policy allowed indoor dining as long as the number of people
           | in each party is small and you put on a mask while not eating
           | and leave when you are finished (guess how well that all was
           | enforced).
        
         | pembrook wrote:
         | In actuality though, GPT would likely be correct on the
         | democratic will of the people for the things you cited. It's
         | literally just the blended average of human knowledge. What's
         | more democratic than that?
         | 
         | Meanwhile, it seems the bigger risk for dictatorship is the
         | current system where we put a tiny group of elites who
         | condescendingly believe they're smarter than the rest of us in
         | charge ("you will take the bus with your 3 kids and groceries
         | in hand and you will like it").
         | 
          | This is how you get do-nothing social-signaling policies for
          | climate change (e.g. straws, bottle caps, grocery bags), which
          | make urban elites feel good about themselves but are ironically
          | actively harmful to getting the correct policies enacted (e.g.
          | investment in nuclear).
        
           | eru wrote:
           | > It's literally just the blended average of human knowledge.
           | What's more democratic than that?
           | 
           | No, it's the 'blended average' of the texts it's been fed
           | with.
           | 
           | To state the obvious: illiterate people did not get a vote.
           | Terminally online people got plenty of votes.
           | 
           | And, GPT is also tuned to be helpful and to not get OpenAI in
           | the news for racism etc, which is far from the 'blended
           | average' of even the input texts.
        
           | anileated wrote:
           | > GPT would likely be correct on the democratic will of the
           | people for the things you cited
           | 
           | This is a dangerous line of thought, if you extend it to "why
           | bother actually asking people what they want, let's get rid
           | of voting and just use unfeeling software that can be pointed
           | fingers at whenever things go wrong".
           | 
           | > a tiny group of elites who condescendingly believe they're
           | smarter
           | 
           | I suppose I don't disagree, a small group without a working
           | democratic election process is how dictatorships work.
           | 
           | > you will take the bus with your 3 kids and groceries in
           | hand and you will like it
           | 
           | Bit of a tangent from me, but it looks like you are mixing
           | bits of city planner utopia with bits of, I guess, typical
           | American suburban reality. In a walkable city planned for
           | humans (not cars) the grocery store is just downstairs or
           | around the corner, because denser living makes them
           | profitable enough. When you can pop down for some eggs, stop
           | by local bakery for fresh bread, and be back home in under 7
           | minutes, you don't really _want_ to take a major trip to
           | Costco with all your kids to load up the fridge for the week.
           | You could still drive there, of course, and I don't think
           | those "condescending elites"* frown too much on a fully
           | occupied car (especially with kids), but unless you really
           | enjoy road trips and parking lots you probably wouldn't.
           | 
           | > do-nothing social signaling policies for climate change
           | (eg. Straws, bottle caps, grocery bags)
           | 
           | Reducing use of plastic is not "do-nothing" for me. I'm not
           | sure it has much to do with climate change but I don't want
           | microplastics to accumulate in my body or bodies of my kids.
           | However, I can agree with you that these are only half-
           | measures with good optics.
           | 
           | * Very flattering by the way, I can barely afford a car** but
           | if seeing benefits to walkable city planning makes me a bit
           | elite I'll take it!
           | 
            | ** If my lack of wealth now makes you think I'm some kind of
            | socialist, well, I can only give you my word that I am far
            | from it.
        
         | AnimalMuppet wrote:
         | > As someone wrote once, all you need for machine dictatorship
         | is an LLM and a critical number of human accomplices. No need
         | for superintelligence or robots.
         | 
         | If that dictatorship shows up, the real dictator will be a
         | human - the one who hacks the AI to control it. (Whether
         | hacking from the inside or outside, and whether hacking by
         | traditional means, or by feeding it biased training data.)
        
       | lccerina wrote:
       | Source: trust us. This is some bullshit science.
        
       | padjo wrote:
       | Well that's one way to solve the replication crisis
        
       | benterix wrote:
       | So, we finally found the cure for the replication crisis in
       | social sciences: just run them on LLMs.
        
         | consp wrote:
         | At least they will confirm the experiments they have been
         | trained on.
        
           | somedude895 wrote:
           | Maybe that will help extend the veneer of science on social
           | studies for a few more years before the echo chamber
           | implodes.
        
         | raxxorraxor wrote:
         | Problem is that many policy decisions are based on bad science
         | in the social sciences, because it provides an excuse. The
         | validity is completely secondary.
        
       | jtc331 wrote:
       | But does it replicate _better_ than really running the experiment
       | again?
       | 
       | Joking...but not joking.
        
       | NicoJuicy wrote:
       | That's only for known situations.
       | 
        | E.g. try using LLMs to find availability hours when you have the
        | start and end time of each day.
        | 
        | LLMs don't really understand that you need to pair day 1's end
        | hour with the start hour of the next day.
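        | 
        | (The computation itself is trivial once you pair day N's end
        | with day N+1's start -- a Python sketch with made-up hours:)
        | 
        |     # Working hours per day: (start, end), 24h clock.
        |     days = [(9, 17), (10, 18), (8, 16)]
        | 
        |     # Each free window runs from one day's end hour to
        |     # the next day's start hour.
        |     gaps = [(end, days[i + 1][0])
        |             for i, (start, end) in enumerate(days[:-1])]
        |     print(gaps)  # [(17, 10), (18, 8)]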
        
       | boesboes wrote:
       | And yet it can't replicate a human support agent. Or even a basic
       | search function for that matter ;)
        
       | 1oooqooq wrote:
        | This says more about how social science data is manipulated than
        | about the usefulness of LLMs.
        
       | gitfan86 wrote:
        | The good news is that they should be able to replicate real-world
        | events to validate whether this is true or not.
        | 
        | Tesla FSD is a good real-life example of this. You can measure
        | how closely the car acts like a human based on interventions and
        | crashes that were due to unhuman behavior, and in the first round
        | of the robotaxi fleet, which will have a safety driver, you can
        | measure how many people complain that the driver was bad.
        
       | freeone3000 wrote:
       | I think it is far, far more likely that it replicates social
        | science experiments well enough to simulate people.
        
       | pftburger wrote:
       | This is gonna end well...
        
       | Piskvorrr wrote:
       | Ooooor maybe, testing if the experiments are similar to what was
       | in the corpus.
        
       | klyrs wrote:
       | Why stop at social science? I say we make a questionnaire, give
       | it to the GPT over a broad range of sampling temperatures, and
       | collect the resulting score:temperature data. From that dataset,
       | we can take people's temperatures over the phone with a short
       | panel of questions!
       | 
       | (this is parody)
        
       | jrflowers wrote:
       | I love that anyone can just write whatever they want and post it
       | online.
       | 
        | GPT-4 can stand in for humans. Charlie Brown is mentioned in the
        | Upanishads. The bubonic plague was spread via telegram. Easter
        | falls on 9/11 once every other decade.
       | 
       | You can just write shit and hit post and boom, by nature of it
       | being online someone will entertain it as true, even if only
       | briefly so. Wild stuff!
        
       ___________________________________________________________________
       (page generated 2024-08-08 23:02 UTC)