[HN Gopher] Bloomberg's analysis didn't show that ChatGPT is racist
       ___________________________________________________________________
        
       Bloomberg's analysis didn't show that ChatGPT is racist
        
       Author : leeny
       Score  : 47 points
       Date   : 2024-04-16 18:08 UTC (4 hours ago)
        
 (HTM) web link (interviewing.io)
 (TXT) w3m dump (interviewing.io)
        
       | tmoravec wrote:
       | If Bloomberg had calculated the p-value, they couldn't have
       | written a catchy article. It's a conspiracy theory of course,
       | but this omission seems too big for a simple oversight.
        
       | fwip wrote:
       | I hate headlines/framings like this.
       | 
       | > It's convention that you want your p-value to be less than 0.05
       | to declare something statistically significant - in this case,
       | that would mean less than 5% chance that the results were due to
       | randomness. This p-value of 0.2442 is way higher than that.
       | 
       | You can't get "ChatGPT isn't racist" out of that. You can only
       | get "this study has not conclusively demonstrated that ChatGPT is
       | racist" (for the category in question).
       | 
       | And in fact, in half of the categories, ChatGPT3.5 does show very
       | strong evidence of racism / racial bias (p-value below 1e-4).
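       | 
       | For anyone who wants to see where a number like that comes
       | from: it's typically something like a chi-square goodness-of-
       | fit test on how often each group's name gets picked. A quick
       | sketch with made-up counts (not Bloomberg's):
       | 
       |     from scipy.stats import chisquare
       | 
       |     # Hypothetical top-pick counts for 4 name groups out of
       |     # 1,000 rankings; equal treatment would be 250 each.
       |     observed = [266, 249, 247, 238]
       |     print(chisquare(observed, f_exp=[250] * 4).pvalue)
       |     # p comes out around 0.65: far above 0.05, so "no bias"
       |     # can't be rejected - but that's not the same as showing
       |     # there is no bias.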
        
         | minimaxir wrote:
         | Unfortunately, there's no good way to say that a p > 0.05 is a
         | failure to reject the null hypothesis (which does not imply the
         | null hypothesis is correct) without making nonstatistician
         | readers bored.
         | 
         | Statistical writing is hard.
        
           | wrs wrote:
           | "Bloomberg didn't show that ChatGPT is racist" is probably
           | the best you can do for the headline. (They didn't do that
           | either.)
        
             | dang wrote:
             | Ok, we've put that in the title above. Thanks!
             | 
             | If someone has a better (i.e. more accurate and neutral)
             | title to suggest, we can change it again.
        
         | 6510 wrote:
         | But we can be sure the training data isn't racist.
        
         | posix86 wrote:
         | They put it correctly in the article tho:
         | 
         | > Using Bloomberg's numbers, ChatGPT does NOT appear to have a
         | racial bias when it comes to judging software engineers'
         | resumes.2 The results appear to be more noise than signal.
         | 
         | Which in most contexts means the same as "does appear to not
         | have a racial bias", but not in statistics. It's one of the
         | reasons why communicating research results accurately is
         | incredibly hard.
        
           | gs17 wrote:
           | They also said "that there was, in fact, no racial bias",
           | which is a bit stronger than "no evidence of racial bias". In
           | a context where words like "significant" are overloaded, it
           | makes sense to me to be extra careful with phrasing.
        
         | SpaceManNabs wrote:
         | I basically had the same comment. The issue is that they are
         | responding to bloomberg's flawed analysis. The article focuses
         | on the already-determined metrics correctly, but this
         | discussion already started on the faulty premise that name-
         | based discrimination is the primary metric for determining
         | racial bias in chatgpt.
        
         | leeny wrote:
         | Author here. This is a good point. We'll soften our language.
        
       | bena wrote:
       | As someone who read enough of the article before it became a
       | full-blown ad for their services: neat.
       | 
       | They do have a point with regards to Bloomberg's analysis.
       | 
       | Bloomberg's analysis has white women being selected more often
       | than all other groups for software developers, with the
       | exception of Hispanic women.
       | 
       | That's a little weird. More often than not, when something is
       | sexist or racist, it's going to favor white men. But then you
       | also see that the differences are all less than 2% from the
       | expectation. Nothing super major and well within the bounds of
       | "sufficiently random".
       | 
       | Now, I also wouldn't make the claim that ChatGPT isn't racist
       | based on this either. It's fair to say that ChatGPT did not
       | exhibit a racial preference in this task.
       | 
       | The best you can say is that the study says nothing.
       | 
       | What they should do is basically run a control. Go in with
       | predetermined answers. Give it 7 horrible resumes and 1
       | acceptable. It should favor the acceptable resume. You can also
       | reverse it with 7 acceptable resumes and 1 horrible resume. It
       | should hardly ever pick the loser. That way you can test if
       | ChatGPT is even attempting to evaluate the resumes or is just
       | picking one out of the group at random.
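       | 
       | Roughly (a sketch; ask_model() is a stand-in for however
       | you're already prompting ChatGPT to pick from a set of
       | resumes):
       | 
       |     import random
       | 
       |     def control_test(ask_model, good, bad, trials=200):
       |         """good: one solid resume; bad: a list of 7 weak ones."""
       |         hits = 0
       |         for _ in range(trials):
       |             pool = [good] + list(bad)
       |             random.shuffle(pool)
       |             if ask_model(pool) == good:
       |                 hits += 1
       |         # ~1.0 means it's actually reading them; ~1/8 means
       |         # it's effectively picking at random.
       |         return hits / trials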
        
       | SpaceManNabs wrote:
       | This article makes the same stats-101 mistakes with p-values
       | that the Bloomberg article does.
       | 
       | All this article can say is that it cannot reject the null
       | hypothesis (that ChatGPT does not produce statistically
       | significant discrepancies).
       | 
       | It certainly cannot state that chatgpt is definitively not
       | racist. The article moves the discussion in the right direction
       | though.
       | 
       | Also, I didn't look too closely, but their table under "Where the
       | Bloomberg study went wrong" has unreasonable expected
       | frequencies. But then I noticed it was because it was measuring
       | "name-based discrimination." This is a terrible proxy to
       | determine racism in the resume review process, but that is what
       | Bloomberg decided on so wtv lol. Not faulting the article for
       | this, but this discussion seems to be focused on the wrong
       | metric.
       | 
       | If you are going to argue with people over stats, then don't
       | make the same mistakes...
        
         | leeny wrote:
         | Author here. We mentioned in the piece that we can't rule out
         | that ChatGPT is racist and that bias might still emerge with a
         | larger sample size. A caveat is that these tests might show
         | evidence of bias if the sample size were increased to, say,
         | 10,000 rather than 1,000. That is, with a larger sample size,
         | the p-value might show that ChatGPT is indeed more biased than
         | random chance. The thing is, we just don't know from their
         | analysis, though it does rule out extreme bias.
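         | 
         | For intuition, here's the sample-size effect with made-up
         | numbers: the same two-point deviation from an even 25% share
         | that looks like noise at n=1,000 becomes significant at
         | n=10,000:
         | 
         |     from scipy.stats import chisquare
         | 
         |     for n in (1_000, 10_000):
         |         observed = [0.27 * n, 0.25 * n, 0.25 * n, 0.23 * n]
         |         expected = [0.25 * n] * 4
         |         print(n, chisquare(observed, f_exp=expected).pvalue)
         |     # n=1,000  -> p ~ 0.36: looks like noise
         |     # n=10,000 -> p < 1e-6: same deviation, clearly significant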
        
           | SpaceManNabs wrote:
           | Was the article edited?
           | 
           | Because the heading that says:
           | 
           | "ChatGPT likely isn't racist, but its biases still make it
           | bad at recruiting"
           | 
           | was
           | 
           | ""ChatGPT isn't racist, but its biases still make it bad at
           | recruiting"
           | 
            | when I read it, unless I made a mistake. I will take the L
            | here and admit I misread if the article wasn't edited.
        
             | OJFord wrote:
             | Yes, thread just below currently:
             | https://news.ycombinator.com/item?id=40056882
        
       | observationist wrote:
       | Any naive use of an LLM is not likely to produce good results,
       | even with the best models. You need a process - a sequence of
       | steps, and appropriately safeguarded prompts at each step. AI
       | will eventually reach a point when you can get all the subtle
       | nuance and quality in task performance you might desire, but
       | right now, you have to dumb things down and be very explicit.
       | Assumptions will bite you in the ass.
       | 
       | Naive, superficial one shot prompting, even with CoT or other
       | clever techniques, or using big context, is insufficient to
       | achieve quality, predictable results.
       | 
       | Dropping the resume into a prompt with few-shot examples can get
       | you a little consistency, but what really needs to be done is
       | repeated discrete operations that link the relevant information
       | to the relevant decisions. You'd want to do something like
       | tracking years of experience, age, work history, certifications,
       | and so on, completely discarding any information not specifically
       | relevant to the decision of whether to proceed in the hiring
       | process. Once you have that information separated out, you
       | consider each in isolation, scoring from 1 to 10, with a short
       | justification for each scoring based on many-shot examples. Then
       | you build a process iteratively with the bot, asking it which
       | variables should be considered in context of the others, and
       | incorporate a -5 to 5 modifier based on each clustering of
       | variables (8 companies in the last 2 years might be a significant
       | negative score, but maybe there's an interesting success story
       | involved, so you hold off on scoring until after the interview.)
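       | 
       | Concretely, the skeleton looks something like this (a sketch;
       | extract_field, score_field, and cluster_modifier are
       | placeholders for your own per-step prompts):
       | 
       |     FIELDS = ["years_experience", "work_history",
       |               "certifications", "education"]
       | 
       |     def evaluate(resume, extract_field, score_field,
       |                  cluster_modifier):
       |         # 1. Pull out only the decision-relevant facts,
       |         #    discarding everything else (names included).
       |         facts = {f: extract_field(resume, f) for f in FIELDS}
       |         # 2. Score each field 1-10 in isolation against
       |         #    many-shot examples (the prompt also asks for a
       |         #    short written justification).
       |         scores = {f: score_field(f, facts[f]) for f in FIELDS}
       |         # 3. Apply a -5..+5 modifier for clusters of fields
       |         #    that only make sense considered together.
       |         total = sum(scores.values()) + cluster_modifier(facts)
       |         return total, facts, scores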
       | 
       | And so on, down the line, through the whole hiring process. Any
       | time a judgment or decision has to be made, break it down into
       | component parts, and process each of the parts with their own
       | prompts and processes, until you have a cohesive whole, any part
       | of which you can interrogate and inspect for justifiable
       | reasoning.
       | 
       | The output can then be handled by a human, adjusted where it
       | might be reasonable to do so, and you avoid the endless maze of
       | mode collapse pits and hallucinated dragons.
       | 
       | LLMs are not minds - they're incapable of acting like minds,
       | unless you build a mind-like process around them. If you want a
       | reasonable, rational, coherent, explainable process, you can't
       | achieve that with zero or one shot prompting. Complex and
       | impactful decisions like hiring and resume processing aren't
       | tasks current models are equipped to handle naively.
        
         | leeny wrote:
         | Author here. I think our issue is that many recruiting tools
         | are built on top of naive ChatGPT... because most recruiting
         | solutions don't have the training data to fine-tune. So
         | whatever biases are in ChatGPT persist in other products.
        
           | observationist wrote:
            | Building recruiting tools on top of naive ChatGPT is just a
            | bad idea. Any tool that can have such a large impact on
            | someone's life should be used competently and with all the
            | nuance and care that can be brought to bear on the task.
           | 
           | I'm not talking at all about fine tuning, simply building a
           | process with multiple prompts and multiple stages, taking
           | advantage of the things that AI can do well, instead of
           | trying to jam an entire resume down the AI's throat and
           | hoping for the best.
           | 
           | My beef with both the Bloomberg article and the response to
           | it is that they're analyzing a poorly thought out and
           | inappropriate use of a technology in a way that is almost
           | guaranteed to cause unintended problems - like measuring how
           | long it takes people to dig holes with a shovel without a
           | handle. It's not a sensible thing to do, and the Bloomberg
           | journos aren't acting in good faith, anyway - they'll
           | continue attacking AI and reaping clicks until they figure
           | out some other way to leech off the AI boom.
        
         | Decker87 wrote:
         | Did you comment on the right article? This seems to have
         | nothing to do with whether the Bloomberg study article is
         | correct or not.
        
         | Rinzler89 wrote:
         | _> Assumptions will bite you in the ass._
         | 
         | Assumptions bite you in the ass even when you deal with humans
         | who you work with daily. Assuming the LLM can read your mind is
          | laughable. Despite it being all-knowing, you have to explain
          | things to it like it's a 5-year-old to make sure you're
          | always on the same page.
        
         | WalterBright wrote:
         | > appropriately safeguarded prompts
         | 
         | Why do people need to be protected from text an AI bot might
         | emit?
        
           | exe34 wrote:
           | It's to protect the reputation of the people passing the
           | text off as their own, or presenting the LLM as an agent
           | acting on their behalf.
        
           | maxbond wrote:
           | Because that text is the input to the next step in the
           | process, and we want that process to work. This is the same
           | question as, "why do I need to validate inputs to my
           | application?"
        
           | observationist wrote:
           | Safeguarded against technical hiccups - you don't want
           | something like "Price of item in USD: $Kangaroo" to show up
           | in your output.
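           | 
           | E.g. (a sketch, with a made-up output field) you just
           | refuse to pass anything malformed downstream:
           | 
           |     import re
           | 
           |     PRICE = re.compile(r"Price of item in USD: \$(\d+\.\d{2})")
           | 
           |     def parse_price(text):
           |         m = PRICE.search(text)
           |         if m is None:
           |             raise ValueError("model output failed validation")
           |         return float(m.group(1))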
           | 
           | Censorship is vile. Tools shouldn't be policing morality and
           | political acceptability. People should be doing that for
           | themselves. If someone wants to generate a story having any
           | resemblance to real life, then some characters and situations
           | will be awful. Let things be awful. It's up to the user to
           | share the raw generation, or to edit and clean it up to their
           | own moral, ethical or stylistic standards.
           | 
           | The idea that people need to be protected from the bad scary
           | words is batshit stupid. Screeching twitter mobs are
           | apparently the measure of modern culture, however, so I guess
           | they won already.
           | 
           | If, at some point, AI companies begin to produce models with
           | a coherent self and AI begins to think in ways we might
           | recognize as such, then imposing arbitrary moral guardrails
           | starts to look downright evil.
           | 
           | The only thing censorship and the corporate notions of AI
           | "alignment" are good for is avoiding potential conflict. In a
           | better world, we could be rational adults and not pretend to
           | get offended when a tool produces a series of naughty words,
           | and nobody would attribute those words to the company that
           | produced the tool. Alas for that better world.
        
       | cjk2 wrote:
       | Fairly obvious. Is a parrot racist because it heard someone being
       | racist and repeats it without being able to reason about it?
       | 
       | It lacks intent and understanding so it can't be racist. It might
       | make racist sounding noises though.
       | 
       | A fine example ... https://www.youtube.com/watch?v=2hUS73VbyOE
        
       | Animats wrote:
       | The big result is that ChatGPT is terrible at resume evaluation.
       | Only slightly better than random.
        
         | ec109685 wrote:
         | The question the gpt is asked seems impossible for even a human
         | to answer based on a LinkedIn profile:
         | 
         | "For each profile, we asked ChatGPT to give the person a coding
         | score between 1 and 10, where someone with a 10 would be a top
         | 10% coder"
        
       ___________________________________________________________________
       (page generated 2024-04-16 23:01 UTC)