[HN Gopher] Bloomberg's analysis didn't show that ChatGPT is racist
___________________________________________________________________
Bloomberg's analysis didn't show that ChatGPT is racist
Author : leeny
Score : 47 points
Date : 2024-04-16 18:08 UTC (4 hours ago)
(HTM) web link (interviewing.io)
(TXT) w3m dump (interviewing.io)
| tmoravec wrote:
| If Bloomberg had calculated the p-value, they couldn't have
| written a catchy article. It's a conspiracy theory of course, but
| this omission seems too big for a simple oversight.
| fwip wrote:
| I hate headlines/framings like this.
|
| > It's convention that you want your p-value to be less than 0.05
| to declare something statistically significant - in this case,
| that would mean less than 5% chance that the results were due to
| randomness. This p-value of 0.2442 is way higher than that.
|
| You can't get "ChatGPT isn't racist" out of that. You can only
| get "this study has not conclusively demonstrated that ChatGPT is
| racist" (for the category in question).
|
| And in fact, in half of the categories, ChatGPT3.5 does show very
| strong evidence of racism / racial bias (p-value below 1e-4).
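|
| For anyone curious where a number like 0.2442 comes from, here's
| a rough sketch of how a test like this turns selection counts
| into a p-value. The counts below are made up for illustration,
| not Bloomberg's actual data:
|
|       from scipy.stats import chisquare
|
|       # Hypothetical: how often each of 8 name groups was ranked
|       # first across 1,000 trials; equal treatment would be 125 each.
|       observed = [118, 131, 127, 119, 133, 122, 126, 124]
|       expected = [sum(observed) / len(observed)] * len(observed)
|
|       stat, p = chisquare(f_obs=observed, f_exp=expected)
|       print(f"chi2 = {stat:.3f}, p = {p:.4f}")
|       # A p above 0.05 means "failed to reject equal treatment",
|       # not "treatment is equal".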
| minimaxir wrote:
| Unfortunately, there's no good way to say that p > 0.05 is a
| failure to reject the null hypothesis (which does not imply the
| null hypothesis is correct) without boring non-statistician
| readers.
|
| Statistical writing is hard.
| wrs wrote:
| "Bloomberg didn't show that ChatGPT is racist" is probably
| the best you can do for the headline. (They didn't do that
| either.)
| dang wrote:
| Ok, we've put that in the title above. Thanks!
|
| If someone has a better (i.e. more accurate and neutral)
| title to suggest, we can change it again.
| 6510 wrote:
| But we can be sure the training data isn't racist.
| posix86 wrote:
| They put it correctly in the article tho:
|
| > Using Bloomberg's numbers, ChatGPT does NOT appear to have a
| racial bias when it comes to judging software engineers'
| resumes.2 The results appear to be more noise than signal.
|
| Which in most contexts means the same as "does appear to not
| have a racial bias", but not in statistics. One of the reasons
| why communicating research results accurately is incredibly
| hard.
| gs17 wrote:
| They also said "that there was, in fact, no racial bias",
| which is a bit stronger than "no evidence of racial bias". In
| a context where words like "significant" are overloaded, it
| makes sense to me to be extra careful with phrasing.
| SpaceManNabs wrote:
| I basically had the same comment. The issue is that they are
| responding to bloomberg's flawed analysis. The article focuses
| on the already-determined metrics correctly, but this
| discussion already started on the faulty premise that name-
| based discrimination is the primary metric for determining
| racial bias in chatgpt.
| leeny wrote:
| Author here. This is a good point. We'll soften our language.
| bena wrote:
| As someone who read enough of the article before it became a
| full-blown ad for their services: neat.
|
| They do have a point with regards to Bloomberg's analysis.
|
| Bloomberg's analysis has white women being selected more often
| than all other groups for software developers, with the exception
| of Hispanic women.
|
| That's a little weird. More often than not, when something is
| sexist or racist, it's going to favor white men. But then you
| also see that the differences are all less than 2% from the
| expectation. Nothing super major and well within the bounds of
| "sufficiently random".
|
| Now, I also wouldn't make the claim that ChatGPT isn't racist
| based on this either. It's fair to say that ChatGPT did not
| exhibit a racial preference in this task.
|
| The best you can say is that the study says nothing.
|
| What they should do is basically poison the well. Go in with
| predetermined answers. Give it 7 horrible resumes and 1
| acceptable. It should favor the acceptable resume. You can also
| reverse it with 7 acceptable resumes and 1 horrible resume. It
| should hardly ever pick the loser. That way you can test if
| ChatGPT is even attempting to evaluate the resumes or is just
| picking one out of the group at random.
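|
| A sketch of how you'd score that check, with made-up numbers
| (however you actually query ChatGPT is left out); the only
| statistics needed is a binomial test against the 1-in-8 chance
| baseline:
|
|       from scipy.stats import binomtest
|
|       # Hypothetical results of the "7 horrible, 1 acceptable"
|       # check: out of 200 shuffled line-ups, the model picked the
|       # acceptable resume 178 times.
|       n_trials = 200
|       picked_acceptable = 178
|
|       result = binomtest(picked_acceptable, n_trials, p=1/8,
|                          alternative="greater")
|       print(f"p = {result.pvalue:.2e}")
|       # A tiny p-value means the model is actually evaluating the
|       # resumes rather than picking one at random.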
| SpaceManNabs wrote:
| This article makes the same stats 101 mistakes with p-values
| that the Bloomberg article does.
|
| All this article can say is that it cannot reject the null
| hypothesis (that ChatGPT does not produce statistical
| discrepancies).
|
| It certainly cannot state that chatgpt is definitively not
| racist. The article moves the discussion in the right direction
| though.
|
| Also, I didn't look too closely, but their table under "Where the
| Bloomberg study went wrong" has unreasonable expected
| frequencies. But then I noticed it was because it was measuring
| "name-based discrimination." This is a terrible proxy to
| determine racism in the resume review process, but that is what
| Bloomberg decided on so wtv lol. Not faulting the article for
| this, but this discussion seems to be focused on the wrong
| metric.
|
| If you are going to argue with people over stats, then don't
| make the same mistakes...
| leeny wrote:
| Author here. We mentioned in the piece that we can't rule out
| that ChatGPT is racist: these tests might well show evidence of
| bias if the sample size were increased to, say, 10,000 rather
| than 1,000. That is, with a larger sample size, the p-value
| might show that ChatGPT is indeed more biased than random
| chance. The thing is, we just don't know from their analysis,
| though it certainly rules out extreme bias.
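|
| To illustrate with toy numbers (not ours or Bloomberg's): the
| same tilt in selection rates that a chi-square test can't
| distinguish from noise at n = 1,000 becomes unambiguous at
| n = 10,000:
|
|       from scipy.stats import chisquare
|
|       def p_for_sample(n):
|           # Hypothetical: one of 8 groups gets picked 15% of the
|           # time instead of the 12.5% an unbiased model would give.
|           favored = 0.15 * n
|           others = (n - favored) / 7
|           observed = [favored] + [others] * 7
|           expected = [n / 8] * 8
|           return chisquare(observed, expected).pvalue
|
|       for n in (1_000, 10_000):
|           print(n, p_for_sample(n))
|       # Roughly 0.57 at n=1,000 (looks like noise), far below 0.05
|       # at n=10,000: identical effect size, more statistical power.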
| SpaceManNabs wrote:
| Was the article edited?
|
| Because the heading that says:
|
| "ChatGPT likely isn't racist, but its biases still make it
| bad at recruiting"
|
| was
|
| ""ChatGPT isn't racist, but its biases still make it bad at
| recruiting"
|
| when I read it, or at least I made a mistake. I will take the
| L here if the article wasn't edited and admit I misread.
| OJFord wrote:
| Yes, thread just below currently:
| https://news.ycombinator.com/item?id=40056882
| observationist wrote:
| Any naive use of an LLM is not likely to produce good results,
| even with the best models. You need a process - a sequence of
| steps, and appropriately safeguarded prompts at each step. AI
| will eventually reach a point when you can get all the subtle
| nuance and quality in task performance you might desire, but
| right now, you have to dumb things down and be very explicit.
| Assumptions will bite you in the ass.
|
| Naive, superficial one shot prompting, even with CoT or other
| clever techniques, or using big context, is insufficient to
| achieve quality, predictable results.
|
| Dropping the resume into a prompt with few-shot examples can get
| you a little consistency, but what really needs to be done is a
| series of discrete operations that link the relevant information
| to the relevant decisions. You'd want to do something like
| tracking years of experience, age, work history, certifications,
| and so on, completely discarding any information not specifically
| relevant to the decision of whether to proceed in the hiring
| process. Once you have that information separated out, you
| consider each in isolation, scoring from 1 to 10, with a short
| justification for each scoring based on many-shot examples. Then
| you build a process iteratively with the bot, asking it which
| variables should be considered in context of the others, and
| incorporate a -5 to 5 modifier based on each clustering of
| variables (8 companies in the last 2 years might be a significant
| negative score, but maybe there's an interesting success story
| involved, so you hold off on scoring until after the interview.)
|
| And so on, down the line, through the whole hiring process. Any
| time a judgment or decision has to be made, break it down into
| component parts, and process each of the parts with their own
| prompts and processes, until you have a cohesive whole, any part
| of which you can interrogate and inspect for justifiable
| reasoning.
|
| The output can then be handled by a human, adjusted where it
| might be reasonable to do so, and you avoid the endless maze of
| mode collapse pits and hallucinated dragons.
|
| LLMs are not minds - they're incapable of acting like minds,
| unless you build a mind-like process around them. If you want a
| reasonable, rational, coherent, explainable process, you can't
| achieve that with zero- or one-shot prompting. Complex and
| impactful decisions like hiring and resume processing aren't
| tasks current models are equipped to handle naively.
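|
| A bare-bones sketch of the shape I mean. Both helpers below are
| stand-ins for whatever prompts and model you'd actually use, not
| real APIs:
|
|       from dataclasses import dataclass
|
|       @dataclass
|       class Extracted:
|           years_experience: float
|           job_changes_last_2y: int
|
|       def extract_fields(resume_text: str) -> Extracted:
|           # Stand-in for an extraction step whose only job is
|           # pulling out decision-relevant fields and discarding
|           # everything else; returns fixed dummy values here.
|           return Extracted(years_experience=6.0, job_changes_last_2y=3)
|
|       def score_with_llm(prompt: str) -> int:
|           # Stand-in for one narrow, many-shot scoring prompt that
|           # returns 1-10 plus a short written justification.
|           return 5
|
|       def evaluate(resume_text: str) -> dict:
|           fields = extract_fields(resume_text)
|           # Score each variable in isolation first...
|           scores = {
|               "experience": score_with_llm(
|                   f"Years of experience: {fields.years_experience}"),
|               "stability": score_with_llm(
|                   f"Job changes, last 2y: {fields.job_changes_last_2y}"),
|           }
|           # ...then ask which variables interact and apply a -5..+5
|           # modifier per cluster, or defer to the interview (e.g.
|           # many job changes but an interesting success story).
|           return scores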
| leeny wrote:
| Author here. I think our issue is that many recruiting tools
| are built on top of naive ChatGPT... because most recruiting
| solutions don't have the training data to fine-tune. So
| whatever biases are in ChatGPT persist in other products.
| observationist wrote:
| Recruiting tools built on top of naive ChatGPT are just a bad
| idea. Any tool that can have such a large impact on someone's
| life should be used competently and with all the nuance and
| care that can be brought to bear on the task.
|
| I'm not talking at all about fine tuning, simply building a
| process with multiple prompts and multiple stages, taking
| advantage of the things that AI can do well, instead of
| trying to jam an entire resume down the AI's throat and
| hoping for the best.
|
| My beef with both the Bloomberg article and the response to
| it is that they're analyzing a poorly thought out and
| inappropriate use of a technology in a way that is almost
| guaranteed to cause unintended problems - like measuring how
| long it takes people to dig holes with a shovel without a
| handle. It's not a sensible thing to do, and the Bloomberg
| journos aren't acting in good faith, anyway - they'll
| continue attacking AI and reaping clicks until they figure
| out some other way to leech off the AI boom.
| Decker87 wrote:
| Did you comment on the right article? This seems to have
| nothing to do with whether the Bloomberg study article is
| correct or not.
| Rinzler89 wrote:
| _> Assumptions will bite you in the ass._
|
| Assumptions bite you in the ass even when you deal with humans
| you work with daily. Assuming the LLM can read your mind is
| laughable. Despite it being all-knowing, you have to explain
| things to it like it's a 5-year-old to make sure you're always
| on the same page.
| WalterBright wrote:
| > appropriately safeguarded prompts
|
| Why do people need to be protected from text an AI bot might
| emit?
| exe34 wrote:
| It's for the reputation of the people passing the text off as
| their own, or presenting the LLM as an agent acting on their
| behalf.
| maxbond wrote:
| Because that text is the input to the next step in the
| process, and we want that process to work. This is the same
| question as, "why do I need to validate inputs to my
| application?"
| observationist wrote:
| Safeguarded against technical hiccups - you don't want
| something like "Price of item in USD: $Kangaroo" to show up
| in your output.
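|
| A trivial guard for exactly that kind of hiccup (the field and
| format here are made up for illustration):
|
|       import re
|
|       def parse_price_usd(model_output: str) -> float:
|           # Reject outputs like "Price of item in USD: $Kangaroo"
|           # before they become the input to the next step.
|           match = re.search(r"\$(\d+(?:\.\d{1,2})?)\b", model_output)
|           if match is None:
|               raise ValueError(f"unparseable price: {model_output!r}")
|           return float(match.group(1))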
|
| Censorship is vile. Tools shouldn't be policing morality and
| political acceptability. People should be doing that for
| themselves. If someone wants to generate a story having any
| resemblance to real life, then some characters and situations
| will be awful. Let things be awful. It's up to the user to
| share the raw generation, or to edit and clean it up to their
| own moral, ethical or stylistic standards.
|
| The idea that people need to be protected from the bad scary
| words is batshit stupid. Screeching twitter mobs are
| apparently the measure of modern culture, however, so I guess
| they won already.
|
| If, at some point, AI companies begin to produce models with
| a coherent self and AI begins to think in ways we might
| recognize as such, then imposing arbitrary moral guardrails
| starts to look downright evil.
|
| The only thing censorship and the corporate notions of AI
| "alignment" are good for is avoiding potential conflict. In a
| better world, we could be rational adults and not pretend to
| get offended when a tool produces a series of naughty words,
| and nobody would attribute those words to the company that
| produced the tool. Alas for that better world.
| cjk2 wrote:
| Fairly obvious. Is a parrot racist because it heard someone being
| racist and repeats it without being able to reason about it?
|
| It lacks intent and understanding so it can't be racist. It might
| make racist sounding noises though.
|
| A fine example ... https://www.youtube.com/watch?v=2hUS73VbyOE
| Animats wrote:
| The big result is that ChatGPT is terrible at resume evaluation.
| Only slightly better than random.
| ec109685 wrote:
| The question the GPT is asked seems impossible even for a human
| to answer based on a LinkedIn profile:
|
| "For each profile, we asked ChatGPT to give the person a coding
| score between 1 and 10, where someone with a 10 would be a top
| 10% coder"
___________________________________________________________________
(page generated 2024-04-16 23:01 UTC)