[HN Gopher] AI groups spend to replace low-cost 'data labellers'...
___________________________________________________________________
AI groups spend to replace low-cost 'data labellers' with high-paid
experts
Author : eisa01
Score : 188 points
Date : 2025-07-20 07:02 UTC (3 days ago)
(HTM) web link (www.ft.com)
(TXT) w3m dump (www.ft.com)
| aspenmayer wrote:
| https://archive.is/dkZVy
| Melonololoti wrote:
| Yep, it continues the gathering of more and better data.
|
| AI is not hype. We have started to actually do something with
| all the data, and this process will not stop soon.
|
| Also, the RL that is now happening through human feedback alone
| (thumbs up/down) is massive.
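| A minimal sketch of that idea (synthetic features standing in
| for real logs; a simple logistic reward model, not any lab's
| actual pipeline):
|
|     import numpy as np
|
|     # Hypothetical thumbs up/down log: features per (prompt,
|     # response) pair, label 1 = thumbs up, 0 = thumbs down.
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(1000, 16))
|     y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
|
|     # Fit a logistic-regression reward model r(x) = sigmoid(w.x)
|     w = np.zeros(X.shape[1])
|     for _ in range(500):
|         p = 1.0 / (1.0 + np.exp(-X @ w))
|         w -= 0.1 * (X.T @ (p - y)) / len(y)  # log-loss gradient
|
|     # The learned reward model can then score new responses
|     # inside an RL loop.
|     print(round(float(1 / (1 + np.exp(-X[0] @ w))), 3))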
| KaiserPro wrote:
| It was always the case. We only managed to make a decent model
| once we created a decent dataset.
|
| This meant making a rich synthetic dataset first, to pre-train
| the model, before fine tuning on real, expensive data to get
| the best results.
|
| But this was always the case.
| noname120 wrote:
| RLHF wasn't needed for DeepSeek, only gobbling up the whole
| internet -- both good and bad stuff. See their paper.
| rtrgrd wrote:
| I thought human preference was typically considered a noisy
| reward signal.
| ACCount36 wrote:
| If it was just "noisy", you could compensate with scale. It's
| worse than that.
|
| "Human preference" is incredibly fucking entangled, and we
| have no way to disentangle it and get rid of all the unwanted
| confounders. A lot of the recent "extreme LLM sycophancy"
| cases are downstream from that.
| TheAceOfHearts wrote:
| It would be great if some of these datasets were free and opened
| up for public use. Otherwise it seems like you end up duplicating
| a lot of busywork just for multiple companies to farm more money.
| Maybe some of the European initiatives related to AI will end up
| including the creation of more open datasets.
|
| Then again, maybe we're still operating from a framework where
| the dataset is part of your moat. It seems like such a way of
| thinking will severely limit the sources of innovation to just a
| few big labs.
| KaiserPro wrote:
| > operating from a framework where the dataset is part of your
| moat
|
| Very much this. It's the dataset that shapes the model; the
| model is a product of the dataset, rather than the other way
| around (mind you, synthetic datasets are different...)
| andy_ppp wrote:
| Why would companies paying top dollar to refine and create high
| quality datasets give them away for free?
| flir wrote:
| Same reason they give open source contributions away for
| free. Hardware companies attempting to commoditize their
| complement. I think the org best placed to get strategic
| advantage from releasing high quality data sets might be
| Nvidia.
| mh- wrote:
| Indeed, and they do.
|
| https://huggingface.co/nvidia/datasets?sort=most_rows
| charlieyu1 wrote:
| There are some good datasets for free though, e.g. HLE,
| although I'm not sure if they are just marketing gimmicks.
| murukesh_s wrote:
| Well, that was the idea of "open"AI, wasn't it? [1]
|
| [1] https://web.archive.org/web/20190224031626/https://blog.o
| pen...
| delfinom wrote:
| ClosedAI gonna ClosedAI
| sumedh wrote:
| Check the date.
|
| This was published before anyone knew that running an AI
| company would be very, very expensive.
| some_random wrote:
| I feel like that was by far the most predictable part of
| running an AI company.
| sumedh wrote:
| Hindsight is 20/20.
| azemetre wrote:
| Because we can make the government force them to.
| gexla wrote:
| Right, and they pay a lot of money for this data. I know
| someone who does this, and one prompt evaluation could go
| through multiple rounds and reviews that could end up
| generating $150+ in payouts, and that's just what the workers
| receive. But that's not quite what the article is talking
| about. Each of these companies does things a bit differently.
| NitpickLawyer wrote:
| > Maybe some of the European initiatives related to AI will end
| up including the creation of more open datasets.
|
| The EU has started the process of opening discussions aiming to
| set the stage for opportunities to arise on facilitating talks
| looking forward to identify key strategies of initiating
| cooperation between member states that will enable vast and
| encompassing meetings generating avenues of reaching top level
| multi-lateral accords on passing legislation covering the
| process of processing processes while preparing for the moment
| when such processes will become processable in the process of
| processing such processes.
|
| #justeuthings :)
| illegalmemory wrote:
| This could work with a Wikipedia-like model. It's very
| difficult to pull off, but a next-generation Wikipedia would
| look like this.
| yorwba wrote:
| I think it would be difficult to make that work, because
| Wikipedia has a direct way of converting users into
| contributors: you see something wrong, you edit the article,
| it's not wrong anymore.
|
| Whereas if you do the same with machine learning training
| data, the influence is much more indirect and you may have to
| add a lot of data to fix one particular case, which is not
| very motivating.
| ripped_britches wrote:
| Don't worry - the labs will train based on this expert data and
| then everyone will just distill their models. Or, now, the
| model itself can be an expert annotator.
| panabee wrote:
| This is long overdue for biomedicine.
|
| Even Google DeepMind's relabeled MedQA dataset, created for
| MedGemini in 2024, has flaws.
|
| Many healthcare datasets/benchmarks contain dirty data because
| accuracy incentives are absent and few annotators are qualified.
|
| We had to pay Stanford MDs to annotate 900 new questions to
| evaluate frontier models and will release these as open source on
| Hugging Face for anyone to use. They cover VQA and specialties
| like neurology, pediatrics, and psychiatry.
|
| If labs want early access, please reach out. (Info in profile.)
| We are finalizing the dataset format.
|
| Unlike general LLMs, where noise is tolerable and sometimes even
| desirable, training on incorrect/outdated information may cause
| clinical errors, misfolded proteins, or drugs with off-target
| effects.
|
| Complicating matters, shifting medical facts may invalidate
| training data and model knowledge. What was true last year may be
| false today. For instance, in April 2024 the U.S. Preventive
| Services Task Force reversed its longstanding advice and now
| urges biennial mammograms starting at age 40 -- down from the
| previous benchmark of 50 -- for average-risk women, citing rising
| breast-cancer incidence in younger patients.
| empiko wrote:
| This is true for every subfield I have been working on for the
| past 10 years. The dirty secret of ML research is that
| Sturgeon's law applies to datasets as well - 90% of data out
| there is crap. I have seen NLP datasets with hundreds of
| citations that were obviously worthless as soon as you put the
| "effort" in and actually looked at the samples.
| panabee wrote:
| 100% agreed. I also advise you not to read many cancer
| papers, particularly ones investigating viruses and cancer.
| You would be horrified.
|
| (To clarify: this is not the fault of scientists. This is a
| byproduct of a severely broken system with the wrong
| incentives, which encourages publication of papers and not
| discovery of truth. Hug cancer researchers. They have
| accomplished an incredible amount while being handcuffed and
| tasked with decoding the most complex operating system ever
| designed.)
| briandear wrote:
| > this is not the fault of scientists. This is a byproduct
| of a severely broken system with the wrong incentives,
| which encourages publication of papers and not discovery of
| truth
|
| Are scientists not writing those papers? There may be bad
| incentives, but scientists are responding to those
| incentives.
| eszed wrote:
| That is axiomatically true, but both harsh and useless,
| given that (as I understand from HN articles and
| comments) the choice is "play the publishing game as it
| is" vs "don't be a scientist anymore".
| pyuser583 wrote:
| I agree, but there is an important side-effect of this
| statement: it's possible to criticize science, without
| criticizing scientists. Or at least without criticizing
| rank and file scientists.
|
| There are many political issues where activists claim
| "the science has spoken." When critics respond by saying,
| "the science system is broken and is spitting out
| garbage", we have to take those claims very seriously.
|
| That doesn't mean the science is wrong. Even though the
| climate science system is far from perfect, climate
| change is real and human made.
|
| On the other hand, some of the science on gender medicine
| is not as established as medical associations would have us
| believe (yet, this might change in a few years). But that
| doesn't stop reputable science groups from making false
| claims.
| edwardbernays wrote:
| Scientists are responding to the incentives of a) wanting
| to do science and b) doing so for the public benefit. There was one
| game in town to do this: the American public grant
| scheme.
|
| This game is being undermined and destroyed by infamous
| anti-vaxxer, non-medical expert, non-public-policy expert
| RFK Jr.[1] The disastrous cuts to the NIH's public grant
| scheme are likely to amount to $8,200,000,000 ($8.2 billion
| USD) in terms of years of life lost.[2]
|
| So, should scientists not write those papers? Should they
| not do science for public benefit? These are the only
| ways to not respond to the structure of the American
| public grant scheme. It seems to me that, if we want
| better outcomes, then we should make incremental improvements
| to the institutions surrounding the public grant scheme.
| This seems far more sensible than installing Bobby
| Brainworms to burn it all down.
|
| [1] https://youtu.be/HqI_z1OcenQ?si=ZtlffV6N1NuH5PYQ
|
| [2] https://jamanetwork.com/journals/jama-health-
| forum/fullartic...
| panabee wrote:
| Valid critique, but one addressing a problem above the ML
| layer, at the human layer. :)
|
| That said, your comment has an implication: in which
| fields can we trust data if incentives are poor?
|
| For instance, many Alzheimer's papers were undermined
| after journalists unmasked foundational research as
| academic fraud. Which conclusions are reliable and which
| are questionable? Who should decide? Can we design model
| architectures and training to grapple with this messy
| reality?
|
| These are hard questions.
|
| ML/AI should help shield future generations of scientists
| from poor incentives by maximizing experimental
| transparency and reproducibility.
|
| Apt quote from Supreme Court Justice Louis Brandeis:
| "Sunlight is the best disinfectant."
| jacobr1 wrote:
| Not an answer, but a contributory idea: meta-analysis.
| There are plenty of strong meta-analyses out there, and
| one of the things they tend to end up doing is weighing
| the methodological rigour of the papers along with the
| overlap they have to the combined question being
| analyzed. Could we use this weighting explicitly in the
| training process?
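| A rough sketch of that weighting idea (toy data and made-up
| rigour scores, numpy only):
|
|     import numpy as np
|
|     # Hypothetical: each training example carries a rigour score
|     # in [0, 1] from a meta-analysis style quality assessment.
|     rng = np.random.default_rng(1)
|     X = rng.normal(size=(500, 8))
|     y = (X[:, 0] > 0).astype(float)
|     rigour = rng.uniform(0.2, 1.0, size=len(y))
|
|     # Weighted logistic regression: low-rigour studies contribute
|     # less to each gradient step than high-rigour ones.
|     w = np.zeros(X.shape[1])
|     for _ in range(300):
|         p = 1.0 / (1.0 + np.exp(-X @ w))
|         w -= 0.1 * (X.T @ (rigour * (p - y))) / rigour.sum()
|
|     print(np.round(w, 2))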
| panabee wrote:
| Thanks. This is helpful. Looking forward to more of your
| thoughts.
|
| Some nuance:
|
| What happens when the methods are outdated/biased? We
| highlight a potential case in breast cancer in one of our
| papers.
|
| Worse, who decides?
|
| To reiterate, this isn't to discourage the idea. The idea
| is good and should be considered, but doesn't escape
| (yet) the core issue of when something becomes a "fact."
| roughly wrote:
| If we're not going to hold any other sector of the
| economy personally responsible for responding to
| incentives, I don't know why we'd start with scientists.
| We've excused folks working for Palantir around here - is
| it that the scientists aren't getting paid enough for
| selling out, or are we just throwing rocks in glass
| houses now?
| PaulHoule wrote:
| If you download data sets for classification from Kaggle or
| CIFAR or search ranking from TREC it is the same. Typically
1-2% of judgements in that kind of dataset are just wrong, so
| if you are aiming for the last few points of AUC you have to
| confront that.
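| A quick way to see the ceiling that even 2% bad judgements put
| on measured AUC (simulated scores, numpy only):
|
|     import numpy as np
|
|     rng = np.random.default_rng(2)
|     n = 100_000
|     y_true = rng.integers(0, 2, size=n)            # ground truth
|     scores = y_true + rng.normal(0, 0.5, size=n)   # good classifier
|
|     # Flip 2% of the evaluation labels to mimic bad judgements.
|     flip = rng.random(n) < 0.02
|     y_eval = np.where(flip, 1 - y_true, y_true)
|
|     def auc(labels, s):
|         # Mann-Whitney / rank-sum formulation of AUC
|         ranks = np.empty(n)
|         ranks[np.argsort(s)] = np.arange(1, n + 1)
|         pos = labels == 1
|         n_pos, n_neg = pos.sum(), (~pos).sum()
|         u = ranks[pos].sum() - n_pos * (n_pos + 1) / 2
|         return u / (n_pos * n_neg)
|
|     print("AUC vs clean labels:", round(auc(y_true, scores), 4))
|     print("AUC vs noisy labels:", round(auc(y_eval, scores), 4))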
| morkalork wrote:
| I still want to jump off a bridge whenever someone thinks
| they can use the twitter post and movie review datasets to
| train sentiment models for use in completely different
| contexts.
| JumpCrisscross wrote:
| > _This is true for every subfield I have been working on for
| the past 10 years_
|
| Hasn't data labelling being the bulk of the work been true
| for every research endeavour since forever?
| panabee wrote:
| To elaborate, errors go beyond data and reach into model
| design. Two simple examples:
|
| 1. Nucleotides are a form of tokenization and encode bias.
| They're not as raw as people assume. For example, classic FASTA
| treats modified and canonical C as identical. Differences may
| alter gene expression -- akin to "polish" vs. "Polish".
|
| 2. Sickle-cell anemia and other diseases are linked to
| nucleotide differences. These single nucleotide polymorphisms
| (SNPs) mean hard attention for DNA matters and single-base
| resolution is non-negotiable for certain healthcare
| applications. Latent models have thrived in text-to-image and
| language, but researchers cannot blindly carry these
| assumptions into healthcare.
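| A minimal illustration of point 1 (hypothetical vocabulary,
| with "M" standing in for 5-methylcytosine, which plain FASTA
| cannot represent):
|
|     # Single-base tokenizer that keeps modified and canonical C
|     # apart, and never merges bases into k-mers, so a SNP changes
|     # exactly one token.
|     VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "M": 4, "N": 5}
|
|     def tokenize(seq: str) -> list[int]:
|         return [VOCAB[base] for base in seq.upper()]
|
|     # "ACMGT" and "ACCGT" differ by one token; a C-only vocabulary
|     # would render them identical.
|     print(tokenize("ACMGT"))  # [0, 1, 4, 2, 3]
|     print(tokenize("ACCGT"))  # [0, 1, 1, 2, 3]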
|
| There are so many open questions in biomedical AI. In our
| experience, confronting them has prompted (pun intended) better
| inductive biases when designing other types of models.
|
| We need way more people thinking about biomedical AI.
| bjourne wrote:
| What if there is significant disagreement within the medical
| profession itself? For example, isotretinoin is prescribed for
| acne in many countries, but in other countries the drug is
| banned or access restricted due to adverse side effects.
| panabee wrote:
| If you agree that ML starts with philosophy, not statistics,
| this is but one example highlighting how biomedicine helps
| model development, LLMs included.
|
| Every fact is born an opinion.
|
| This challenge exists in most, if not all, spheres of life.
| jacobr1 wrote:
| Would not one approach be to just ensure the system has all
| the data - relevance to different health systems, side
| effects, and legal constraints? Then when making a
| recommendation it can account for all factors, not just
| prior use cases.
| K0balt wrote:
| I think an often overlooked aspect of training data curation is
| the value of accurate but oblique data. Much of the "emergent
| capabilities" of LLMs comes from information embedded in the
| data: implied or inferred semantic information that is not
| readily obvious. Extraction of this highly useful information, in
| contrast to specific factoids, requires a lot of off-axis
| images of the problem space, like a CT scan of the field of
| interest. The value of adjacent oblique datasets should not be
| underestimated.
| TZubiri wrote:
| I noticed this when adding citations to wikipedia.
|
| You may find a definition of what a "skyscraper" is, by
| some hyperfocused association, but you'll get a bias towards
| a definite measurement like "skyscrapers are buildings
| between 700m and 3500m tall", which might be useful for some
| data mining project, but not at all what people mean by it.
|
| The actual definition is not in a specific source but in the
| way it is used in other sources like "the Manhattan
| skyscraper is one of the most iconic skyscrapers", on the
| aggregate you learn what it is, but it isn't very citable on
| its own, which gives WP that pedantic bias.
| TZubiri wrote:
| Isn't labelling medical data for AI illegal as unlicensed
| medical practice?
|
| Same thing with law data
| bethekidyouwant wrote:
| Illegal?
| iwontberude wrote:
| Paralegals and medical assistants don't need licenses
| nomel wrote:
| I think their question is a good one, and not being taken
| charitably.
|
| Let's take the medical assistant example.
|
| > Medical assistants are unlicensed, and may only perform
| basic administrative, clerical and technical supportive
| services as permitted by law.
|
| If they're labelling data that's "tumor" or "not tumor",
| with _any_ agency of the process, does that fit within their
| unlicensed scope? Or, would that labelling be closer to a
| diagnosis?
|
| What if the AI is eventually used _to_ diagnose, based on
| data that was labeled by someone unlicensed? Should there
| need to be a "chain of trust" of some sort?
|
| I think the answer to liability will be all on the doctor
| agreeing/disagreeing with the AI...for now.
| SkyBelow wrote:
| To answer this, I would think we should consider other
| cases where someone could practice medicine without
| legally doing so. For example, could they tutor a student
| and help them? Go through unknown cases and make
| judgements, explaining their reasoning? As long as they
| don't oversell their experience in a way that might be
| considered fraud, I don't think this would be practicing
| medicine.
|
| It does open something of a loophole. Oh, I wasn't
| diagnosing a friend, I was helping him label a case just
| like his as an educational experience. My completely
| IANAL guess would be that judges would look on it based
| on how the person is doing it, primarily if they are
| receiving any compensation or running it like a business.
|
| But wait... the example the OP was talking about is doing
| it like a business and likely doesn't have any
| disclaimers properly sent to the AI, so maybe that
| doesn't help us decide.
| mh- wrote:
| No.
| arbot360 wrote:
| > What was true last year may be false today. For instance, ...
|
| Good example of a medical QA dataset shifting but not a good
| example of a medical "fact" since it is an opinion. Another way
| to think about shifting medical targets over time would be
| things like environmental or behavioral risk factors changing.
|
| Anyways, thank you for putting this dataset together, certainly
| we need more third-party benchmarks with careful annotations
| done. I think it would be wise if you segregate tasks between
| factual observations of data, population-scale opinions
| (guidelines/recommendations), and individual-scale opinions
| (prognosis/diagnosis). Ideally there would be some formal
| taxonomy for this eventually, like OMOP CDM; maybe there
| already is one in some dusty corner of PubMed.
| ljlolel wrote:
| Centaur Labs does medical data labeling https://centaur.ai/
| ethan_smith wrote:
| Synthetic data generation techniques are increasingly being
| paired with expert validation to scale high-quality biomedical
| datasets while reducing annotation burden - especially useful
| for rare conditions where real-world examples are limited.
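| A sketch of that generate-then-validate loop (every function
| here is a hypothetical stand-in, not a real API):
|
|     def generate_candidates(condition: str, n: int) -> list[str]:
|         # Stand-in for an LLM call drafting synthetic case vignettes.
|         return [f"synthetic {condition} case #{i}" for i in range(n)]
|
|     def expert_approves(case: str) -> bool:
|         # Stand-in for a paid specialist accepting or rejecting
|         # a candidate; this is where the (reduced) expert time goes.
|         return len(case) % 2 == 0  # placeholder decision rule
|
|     def build_dataset(condition: str, n: int) -> list[str]:
|         # Generation scales cheaply; only validation needs experts.
|         return [c for c in generate_candidates(condition, n)
|                 if expert_approves(c)]
|
|     print(build_dataset("rare-disease", 5))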
| techterrier wrote:
| The latest in a long tradition, it used to be that you'd have to
| teach the offshore person how to do your job, so they could
| replace you for cheaper. Now we are just teaching the robots
| instead.
| kjkjadksj wrote:
| Yeah I've avoided these job postings out of principle. I'm not
| going to be the one to contribute to obsoleting myself and my
| industry.
| verisimi wrote:
| This is it - this is the answer to the AI takeover.
|
| Get an AI to autogenerate lots of crap! Reddit, HN comments,
| false datasets, anything!
| Cthulhu_ wrote:
| That's just spam / more dead internet theory, and there will be
| or are companies that will curate data sets and filter out
| generated stuff / spam or hand-pick high quality data.
| vidarh wrote:
| I've done review and annotation work for two providers in this
| space, and so regularly get approached by providers looking for
| specialists with MSc's or PhD's...
|
| "High-paid" is an exaggeration for many of these, but certainly a
| small subset of people will make decent money on it.
|
| At one provider I was, as an exception, paid _6x_ their going rate
| because they struggled to get people skilled enough at the high-
| end to accept their regular rate, mostly to audit and review work
| done by others. I have no illusion I was the only one paid above
| their stated range. I got paid well, but even at 6x their regular
| rate I only got paid well because they estimated the number of
| tasks per hour and I was able to exceed that estimate by a
| considerable margin - if their estimate had matched my actual
| speed I'd have just barely gotten to the low end of my regular
| rate.
|
| But it's clear there's a pyramid of work, and a sustained effort
| to create processes to allow the bulk of the work to be done by
| low-cost labellers, and then push smaller and smaller subsets of
| the data up to more expensive experts, as well as creating
| tooling to cut down the amount of time experts spend by e.g.
| starting with synthetic data (including model-generated reviews
| of model-generated responses).
|
| I don't think I was at the top of that pyramid - the provider I
| did work for didn't handle many prompts that required deep
| specialist knowledge (though I _did_ get to exercise my long-
| dormant maths and physics knowledge, so that doesn't say too much).
| I think most of what we addressed would at most need people with
| MSc level skills in STEM subjects. And so I'm sure there are a
| few more layers on the pyramid handling PhD-level complexity
| data. But from what I'm seeing from hiring managers contacting
| me, I get the impression the pay scale for them isn't that much
| higher (with the obvious caveat given what I mentioned above that
| there almost certainly are people getting paid high multiples on
| the stated scale).
|
| Some of these pipelines of work are highly complex, often
| including multiple stages of reviews, sometimes with multiple
| "competing" annotators in parallel feeding into selection and
| review stages.
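| A rough sketch of that kind of routing (hypothetical tiers and
| agreement threshold):
|
|     from collections import Counter
|
|     def route(item_id: str, cheap_labels: list[str],
|               agreement_threshold: float = 0.8) -> str:
|         # Several low-cost annotators label each item; only items
|         # where they disagree get pushed up to the expensive
|         # expert tier for review.
|         label, count = Counter(cheap_labels).most_common(1)[0]
|         agreement = count / len(cheap_labels)
|         if agreement >= agreement_threshold:
|             return f"{item_id}: accept '{label}'"
|         return f"{item_id}: escalate to expert review"
|
|     print(route("q1", ["safe", "safe", "safe"]))
|     print(route("q2", ["safe", "unsafe", "safe", "unsafe"]))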
| charlieyu1 wrote:
| I'll believe it when it happens. A major AI company got rid of an
| expert team last year because they thought it was too expensive.
| quantum_state wrote:
| It is expert systems, evolved...
| cryptokush wrote:
| welcome to macrodata refinement
| joshdavham wrote:
| I was literally just reached out to this morning about a contract
| job for one of these "high quality datasets". They specifically
| wanted python programmers who've contributed to popular repos (I
| maintain one repository with approx. 300 stars).
|
| The rate they offered was between $50-90 per hour, so
| significantly higher than what I'd think low-cost data labellers
| are getting.
|
| Needless to say, I marked them as spam though. Harvesting emails
| through GitHub is dirty imo. Was also sad that the recruiter was
| acting on behalf of a YC company.
| apical_dendrite wrote:
| The latest offer I saw was $150-$210 an hour for 20hrs/week. I
| didn't pursue it so I don't know if that's what people actually
| make, but it's an interesting data point.
| antonvs wrote:
| What kind of work was involved for that one? How specialist,
| I mean?
| SoftTalker wrote:
| Isn't this ignoring the "bitter lesson?"
|
| http://www.incompleteideas.net/IncIdeas/BitterLesson.html
| pphysch wrote:
| The "bitter lesson" is from a paradigm where high quality data
| seemed infinite, and so we just need more and more compute to
| gobble up the data.
| jsnider3 wrote:
| Not really.
| TrackerFF wrote:
| I don't know if it is related, but I've noticed an uptick in cold
| calls / approaches for consulting gigs related to data labeling
| and data QA, in my field (I work as an analyst). I never got
| requests like that two-plus years ago.
| the_brin92 wrote:
| I've been doing this for one of the major companies in the space
| for a few years now. It has been interesting to watch how much
| more complex the projects have gotten over the last few years,
| and how many issues the models still have. I have a humanities
| background which has actually served me well here as what
| constitutes a "better" AI model response is often so subjective.
|
| I can answer any questions people have about the experience
| (within code of conduct guidelines so I don't get in trouble...)
| merksittich wrote:
| Thank you, I'll bite. If within your code of conduct:
|
| - Are you providing reasoning traces, responses or both?
|
| - Are you evaluating reasoning traces, responses or both?
|
| - Has your work shifted towards multi-turn or long horizon
| tasks?
|
| - If you also work with chat logs of actual users, do you think
| that they are properly anonymized? Or do you believe that you
| could de-anonymize them without major efforts?
|
| - Do you have contact with other evaluators?
|
| - How do you (and your colleagues) feel about the work (e.g.,
| moral qualms because "training your replacement" or proud
| because furthering civilization, or it's just about the
| money...)?
| mNovak wrote:
| Curious how one gets involved in this, and what fields they're
| seeking?
| dbmikus wrote:
| What kinds of data are you working on? Coding? Something else?
|
| I've been curious how much these AI models look for more niche
| coding language expertise, and what other knowledge frontiers
| they're focusing on (like law, medical, finance, etc.)
| rnxrx wrote:
| It's only a matter of time until private enterprises figure out
| they can monetize a lot of otherwise useless datasets by tagging
| them and selling (likely via a broker) to organizations building
| models.
|
| The implications for valuation of 'legacy' businesses are
| potentially significant.
| htrp wrote:
| Already happening.
| some_random wrote:
| Bad data has been such a huge problem in the industry for ages,
| honestly a huge portion of the worst bias (racism, sexism, etc.)
| stems directly from low-quality labelling.
| htrp wrote:
| Starting a data labeling company is the least AI way to get into
| AI.
| glitchc wrote:
| Some people sell shovels, others the grunts to use them.
___________________________________________________________________
(page generated 2025-07-23 23:00 UTC)