[HN Gopher] AI groups spend to replace low-cost 'data labellers'...
       ___________________________________________________________________
        
       AI groups spend to replace low-cost 'data labellers' with high-paid
       experts
        
       Author : eisa01
       Score  : 188 points
       Date   : 2025-07-20 07:02 UTC (3 days ago)
        
 (HTM) web link (www.ft.com)
 (TXT) w3m dump (www.ft.com)
        
       | aspenmayer wrote:
       | https://archive.is/dkZVy
        
       | Melonololoti wrote:
        | Yep, it continues the gathering of more and better data.
        | 
        | AI is not hype. We have started to actually do something with
        | all the data, and this process will not stop soon.
        | 
        | The RL that is now happening through human feedback alone
        | (thumbs up/down) is massive.
        
         | KaiserPro wrote:
         | It was always the case. We only managed to make a decent model
         | once we created a decent dataset.
         | 
         | This meant making a rich synthetic dataset first, to pre-train
         | the model, before fine tuning on real, expensive data to get
         | the best results.
         | 
          | But again, this was always the case.
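          | 
          | Roughly, as a toy sketch (generic shapes and random stand-in
          | data, not our actual pipeline):
          | 
          |     import torch
          |     from torch import nn
          | 
          |     model = nn.Sequential(
          |         nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
          |     loss_fn = nn.MSELoss()
          | 
          |     def fit(xs, ys, lr, epochs):
          |         opt = torch.optim.Adam(model.parameters(), lr=lr)
          |         for _ in range(epochs):
          |             opt.zero_grad()
          |             loss = loss_fn(model(xs), ys)
          |             loss.backward()
          |             opt.step()
          | 
          |     # Stage 1: large, cheap synthetic set (random stand-in).
          |     x_syn, y_syn = torch.randn(5000, 16), torch.randn(5000, 1)
          |     fit(x_syn, y_syn, lr=1e-3, epochs=5)
          | 
          |     # Stage 2: small, expensive expert-labelled set, lower LR.
          |     x_real, y_real = torch.randn(200, 16), torch.randn(200, 1)
          |     fit(x_real, y_real, lr=1e-4, epochs=20)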
        
           | noname120 wrote:
           | RLHF wasn't needed for Deepseek, only gobbling up the whole
           | internet -- both good and bad stuff. See their paper
        
         | rtrgrd wrote:
          | I thought human preference was typically considered a noisy
          | reward signal.
        
           | ACCount36 wrote:
           | If it was just "noisy", you could compensate with scale. It's
           | worse than that.
           | 
           | "Human preference" is incredibly fucking entangled, and we
           | have no way to disentangle it and get rid of all the unwanted
           | confounders. A lot of the recent "extreme LLM sycophancy"
            | cases are downstream from that.
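            | 
            | A toy way to see it (everything here is made up -- the
            | "flattery" feature, the weights, the data): fit a
            | Bradley-Terry-style reward model on preference pairs that
            | are partly driven by a confounder, and the confounder lands
            | straight in the learned reward.
            | 
            |     import numpy as np
            |     from sklearn.linear_model import LogisticRegression
            | 
            |     rng = np.random.default_rng(0)
            |     n = 20000
            |     # Each answer gets [true_quality, flattery] scores.
            |     a = rng.normal(size=(n, 2))
            |     b = rng.normal(size=(n, 2))
            |     # Raters mostly reward quality, but flattery sways them.
            |     logit = (2.0 * (a[:, 0] - b[:, 0])
            |              + 0.7 * (a[:, 1] - b[:, 1]))
            |     pref_a = rng.random(n) < 1 / (1 + np.exp(-logit))
            | 
            |     # Reward model fit on score differences
            |     # (Bradley-Terry style).
            |     rm = LogisticRegression().fit(a - b, pref_a)
            |     print(rm.coef_)  # nonzero weight on flattery too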
        
       | TheAceOfHearts wrote:
       | It would be great if some of these datasets were free and opened
       | up for public use. Otherwise it seems like you end up duplicating
       | a lot of busywork just for multiple companies to farm more money.
       | Maybe some of the European initiatives related to AI will end up
       | including the creation of more open datasets.
       | 
       | Then again, maybe we're still operating from a framework where
       | the dataset is part of your moat. It seems like such a way of
       | thinking will severely limit the sources of innovation to just a
       | few big labs.
        
         | KaiserPro wrote:
         | > operating from a framework where the dataset is part of your
         | moat
         | 
          | Very much this. It's the dataset that shapes the model; the
         | model is a product of the dataset, rather than the other way
         | around (mind you, synthetic datasets are different...)
        
         | andy_ppp wrote:
         | Why would companies paying top dollar to refine and create high
         | quality datasets give them away for free?
        
           | flir wrote:
           | Same reason they give open source contributions away for
           | free. Hardware companies attempting to commoditize their
           | complement. I think the org best placed to get strategic
           | advantage from releasing high quality data sets might be
           | Nvidia.
        
             | mh- wrote:
             | Indeed, and they do.
             | 
             | https://huggingface.co/nvidia/datasets?sort=most_rows
        
           | charlieyu1 wrote:
            | There are some good datasets available for free though, e.g.
            | HLE. Although I'm not sure if they are marketing gimmicks.
        
           | murukesh_s wrote:
            | Well, that was the idea of "open"AI, wasn't it? [1]
           | 
           | [1] https://web.archive.org/web/20190224031626/https://blog.o
           | pen...
        
             | delfinom wrote:
             | ClosedAI gonna ClosedAI
        
             | sumedh wrote:
             | Check the date.
             | 
              | This was published before anyone knew that running an AI
              | company would be very, very expensive.
        
               | some_random wrote:
               | I feel like that was by far the most predictable part of
               | running an AI company.
        
               | sumedh wrote:
               | Hindsight is 20/20.
        
           | azemetre wrote:
           | Because we can make the government force them to.
        
         | gexla wrote:
         | Right, and they pay a lot of money for this data. I know
         | someone who does this, and one prompt evaluation could go
         | through multiple rounds and reviews that could end up
         | generating $150+ in payouts, and that's just what the workers
         | receive. But that's not quite what the article is talking
          | about. Each of these companies does things a bit differently.
        
         | NitpickLawyer wrote:
         | > Maybe some of the European initiatives related to AI will end
         | up including the creation of more open datasets.
         | 
         | The EU has started the process of opening discussions aiming to
         | set the stage for opportunities to arise on facilitating talks
         | looking forward to identify key strategies of initiating
         | cooperation between member states that will enable vast and
         | encompassing meetings generating avenues of reaching top level
         | multi-lateral accords on passing legislation covering the
         | process of processing processes while preparing for the moment
         | when such processes will become processable in the process of
         | processing such processes.
         | 
         | #justeuthings :)
        
         | illegalmemory wrote:
         | This could work with a Wikipedia-like model. It's very
         | difficult to pull off, but a next-generation Wikipedia would
         | look like this.
        
           | yorwba wrote:
           | I think it would be difficult to make that work, because
           | Wikipedia has a direct way of converting users into
           | contributors: you see something wrong, you edit the article,
           | it's not wrong anymore.
           | 
           | Whereas if you do the same with machine learning training
           | data, the influence is much more indirect and you may have to
           | add a lot of data to fix one particular case, which is not
           | very motivating.
        
         | ripped_britches wrote:
         | Don't worry - the labs will train based on this expert data and
          | then everyone will just distill their models. Or, now, that
          | model itself can be an expert annotator.
        
       | panabee wrote:
       | This is long overdue for biomedicine.
       | 
       | Even Google DeepMind's relabeled MedQA dataset, created for
       | MedGemini in 2024, has flaws.
       | 
       | Many healthcare datasets/benchmarks contain dirty data because
       | accuracy incentives are absent and few annotators are qualified.
       | 
       | We had to pay Stanford MDs to annotate 900 new questions to
       | evaluate frontier models and will release these as open source on
       | Hugging Face for anyone to use. They cover VQA and specialties
       | like neurology, pediatrics, and psychiatry.
       | 
       | If labs want early access, please reach out. (Info in profile.)
       | We are finalizing the dataset format.
       | 
       | Unlike general LLMs, where noise is tolerable and sometimes even
       | desirable, training on incorrect/outdated information may cause
       | clinical errors, misfolded proteins, or drugs with off-target
       | effects.
       | 
       | Complicating matters, shifting medical facts may invalidate
       | training data and model knowledge. What was true last year may be
       | false today. For instance, in April 2024 the U.S. Preventive
       | Services Task Force reversed its longstanding advice and now
       | urges biennial mammograms starting at age 40 -- down from the
       | previous benchmark of 50 -- for average-risk women, citing rising
       | breast-cancer incidence in younger patients.
        
         | empiko wrote:
         | This is true for every subfield I have been working on for the
         | past 10 years. The dirty secret of ML research is that
          | Sturgeon's law applies to datasets as well - 90% of data out
         | there is crap. I have seen NLP datasets with hundreds of
         | citations that were obviously worthless as soon as you put the
         | "effort" in and actually looked at the samples.
        
           | panabee wrote:
           | 100% agreed. I also advise you not to read many cancer
           | papers, particularly ones investigating viruses and cancer.
           | You would be horrified.
           | 
           | (To clarify: this is not the fault of scientists. This is a
           | byproduct of a severely broken system with the wrong
           | incentives, which encourages publication of papers and not
           | discovery of truth. Hug cancer researchers. They have
           | accomplished an incredible amount while being handcuffed and
           | tasked with decoding the most complex operating system ever
           | designed.)
        
             | briandear wrote:
             | > this is not the fault of scientists. This is a byproduct
             | of a severely broken system with the wrong incentives,
             | which encourages publication of papers and not discovery of
             | truth
             | 
             | Are scientists not writing those papers? There may be bad
             | incentives, but scientists are responding to those
             | incentives.
        
               | eszed wrote:
               | That is axiomatically true, but both harsh and useless,
               | given that (as I understand from HN articles and
               | comments) the choice is "play the publishing game as it
               | is" vs "don't be a scientist anymore".
        
               | pyuser583 wrote:
               | I agree, but there is an important side-effect of this
               | statement: it's possible to criticize science, without
               | criticizing scientists. Or at least without criticizing
               | rank and file scientists.
               | 
               | There are many political issues where activists claim
               | "the science has spoken." When critics respond by saying,
               | "the science system is broken and is spitting out
               | garbage", we have to take those claims very seriously.
               | 
               | That doesn't mean the science is wrong. Even though the
               | climate science system is far from perfect, climate
                | change is real and human-made.
               | 
                | On the other hand, some of the science on gender medicine
                | is not as established as medical associations would have
                | us believe (yet, this might change in a few years). But
                | that doesn't stop reputable science groups from making
                | false claims.
        
               | edwardbernays wrote:
                | Scientists are responding to the incentives of a) wanting
                | to do science and b) wanting it to serve the public
                | benefit. There was one game in town for doing this: the
                | American public grant scheme.
               | 
               | This game is being undermined and destroyed by infamous
               | anti-vaxxer, non-medical expert, non-public-policy expert
               | RFK Jr.[1] The disastrous cuts to the NIH's public grant
                | scheme are likely to amount to $8,200,000,000 ($8.2
                | billion USD) in terms of years of life lost.[2]
               | 
               | So, should scientists not write those papers? Should they
               | not do science for public benefit? These are the only
               | ways to not respond to the structure of the American
               | public grant scheme. It seems to me that, if we want
                | better outcomes, then we should make incremental progress
                | on the institutions surrounding the public grant scheme.
                | This seems far more sensible than installing Bobby
                | Brainworms to burn it all down.
               | 
               | [1] https://youtu.be/HqI_z1OcenQ?si=ZtlffV6N1NuH5PYQ
               | 
               | [2] https://jamanetwork.com/journals/jama-health-
               | forum/fullartic...
        
               | panabee wrote:
               | Valid critique, but one addressing a problem above the ML
                | layer, at the human layer. :)
               | 
               | That said, your comment has an implication: in which
               | fields can we trust data if incentives are poor?
               | 
               | For instance, many Alzheimer's papers were undermined
               | after journalists unmasked foundational research as
               | academic fraud. Which conclusions are reliable and which
               | are questionable? Who should decide? Can we design model
               | architectures and training to grapple with this messy
               | reality?
               | 
               | These are hard questions.
               | 
               | ML/AI should help shield future generations of scientists
               | from poor incentives by maximizing experimental
               | transparency and reproducibility.
               | 
               | Apt quote from Supreme Court Justice Louis Brandeis:
               | "Sunlight is the best disinfectant."
        
               | jacobr1 wrote:
                | Not an answer, but a contributory idea: meta-analysis.
                | There are plenty of strong meta-analyses out there, and
                | one of the things they tend to end up doing is weighing
                | the methodological rigour of each paper along with how
                | much it overlaps with the combined question being
                | analyzed. Could we use this weighting explicitly in the
                | training process?
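                | 
                | Concretely, a minimal sketch of that idea, assuming each
                | training example carries a rigour score taken from such
                | a meta-analysis (all numbers invented):
                | 
                |     import torch
                |     from torch import nn
                | 
                |     F = nn.functional
                |     # Hypothetical: one example per paper/claim, plus a
                |     # rigour score in [0, 1] from a meta-analysis.
                |     x = torch.randn(8, 32)
                |     y = torch.randint(0, 2, (8,)).float()
                |     rigour = torch.tensor(
                |         [0.9, 0.2, 0.7, 1.0, 0.3, 0.8, 0.5, 0.6])
                | 
                |     model = nn.Linear(32, 1)
                |     logits = model(x).squeeze(1)
                |     per_ex = F.binary_cross_entropy_with_logits(
                |         logits, y, reduction="none")
                |     # Higher-rigour papers count more in the update.
                |     loss = (rigour * per_ex).mean()
                |     loss.backward()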
        
               | panabee wrote:
               | Thanks. This is helpful. Looking forward to more of your
               | thoughts.
               | 
               | Some nuance:
               | 
               | What happens when the methods are outdated/biased? We
               | highlight a potential case in breast cancer in one of our
               | papers.
               | 
               | Worse, who decides?
               | 
               | To reiterate, this isn't to discourage the idea. The idea
               | is good and should be considered, but doesn't escape
               | (yet) the core issue of when something becomes a "fact."
        
               | roughly wrote:
               | If we're not going to hold any other sector of the
               | economy personally responsible for responding to
               | incentives, I don't know why we'd start with scientists.
               | We've excused folks working for Palantir around here - is
               | it that the scientists aren't getting paid enough for
               | selling out, or are we just throwing rocks in glass
               | houses now?
        
           | PaulHoule wrote:
           | If you download data sets for classification from Kaggle or
           | CIFAR or search ranking from TREC it is the same. Typically
           | 1-2% of judgements in that kind of dataset are just wrong so
           | if you are aiming for the last few points of AUC you have to
           | confront that.
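            | 
            | A quick toy simulation of that ceiling (nothing to do with
            | the actual TREC/Kaggle labels): score a near-perfect ranker
            | against labels with ~2% of judgements flipped.
            | 
            |     import numpy as np
            |     from sklearn.metrics import roc_auc_score
            | 
            |     rng = np.random.default_rng(0)
            |     n = 100_000
            |     truth = rng.integers(0, 2, n)
            |     score = truth + rng.normal(0, 0.1, n)  # near-perfect
            | 
            |     labels = truth.copy()
            |     flip = rng.random(n) < 0.02            # ~2% bad labels
            |     labels[flip] = 1 - labels[flip]
            | 
            |     print(roc_auc_score(truth, score))   # ~1.00
            |     print(roc_auc_score(labels, score))  # noticeably lower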
        
           | morkalork wrote:
           | I still want to jump off a bridge whenever someone thinks
           | they can use the twitter post and movie review datasets to
           | train sentiment models for use in completely different
           | contexts.
        
           | JumpCrisscross wrote:
           | > _This is true for every subfield I have been working on for
           | the past 10 years_
           | 
           | Hasn't data labelling being the bulk of the work been true
           | for every research endeavour since forever?
        
         | panabee wrote:
         | To elaborate, errors go beyond data and reach into model
         | design. Two simple examples:
         | 
         | 1. Nucleotides are a form of tokenization and encode bias.
         | They're not as raw as people assume. For example, classic FASTA
         | treats modified and canonical C as identical. Differences may
         | alter gene expression -- akin to "polish" vs. "Polish".
         | 
         | 2. Sickle-cell anemia and other diseases are linked to
         | nucleotide differences. These single nucleotide polymorphisms
          | (SNPs) mean that hard attention matters for DNA and that
          | single-base resolution is non-negotiable for certain healthcare
         | applications. Latent models have thrived in text-to-image and
         | language, but researchers cannot blindly carry these
         | assumptions into healthcare.
         | 
         | There are so many open questions in biomedical AI. In our
         | experience, confronting them has prompted (pun intended) better
         | inductive biases when designing other types of models.
         | 
         | We need way more people thinking about biomedical AI.
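          | 
          | To make point 1 concrete, a minimal sketch -- the "M" symbol
          | for methylated C is invented for illustration, not a real
          | FASTA code:
          | 
          |     # 'M' is a made-up stand-in for 5-methyl-cytosine.
          |     VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "M": 4}
          | 
          |     def tokenize(seq):
          |         return [VOCAB[b] for b in seq]
          | 
          |     seq = "ACMGT"              # one modified cytosine
          |     print(tokenize(seq))                    # [0, 1, 4, 2, 3]
          | 
          |     # Classic FASTA has no symbol for the modification, so
          |     # it collapses M -> C before the model ever sees it.
          |     print(tokenize(seq.replace("M", "C")))  # [0, 1, 1, 2, 3]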
        
         | bjourne wrote:
         | What if there is significant disagreement within the medical
          | profession itself? For example, isotretinoin is prescribed for
         | acne in many countries, but in other countries the drug is
         | banned or access restricted due to adverse side effects.
        
           | panabee wrote:
           | If you agree that ML starts with philosophy, not statistics,
           | this is but one example highlighting how biomedicine helps
           | model development, LLMs included.
           | 
           | Every fact is born an opinion.
           | 
           | This challenge exists in most, if not all, spheres of life.
        
           | jacobr1 wrote:
           | Would not one approach be to just ensure the system has all
           | the data? Relevance to address systems, side effects, and
            | legal constraints. Then when making a recommendation it can
            | account for all factors, not just prior use cases.
        
         | K0balt wrote:
         | I think an often overlooked aspect of training data curation is
          | the value of accurate but oblique data. Much of the "emergent
          | capabilities" of LLMs comes from data embedded in the data:
          | implied or inferred semantic information that is not readily
          | obvious. Extracting this highly useful information, in
          | contrast to specific factoids, requires a lot of off-axis
          | images of the problem space, like a CT scan of the field of
         | interest. The value of adjacent oblique datasets should not be
         | underestimated.
        
           | TZubiri wrote:
           | I noticed this when adding citations to wikipedia.
           | 
            | You may find a definition of what a "skyscraper" is, by
            | some hyperfocused association, but you'll get a bias towards
            | a definite measurement like "skyscrapers are buildings
            | between 700m and 3500m tall", which might be useful for some
            | data-mining project, but not at all what people mean by it.
           | 
           | The actual definition is not in a specific source but in the
           | way it is used in other sources like "the Manhattan
           | skyscraper is one of the most iconic skyscrapers", on the
           | aggregate you learn what it is, but it isn't very citable on
           | its own, which gives WP that pedantic bias.
        
         | TZubiri wrote:
          | Isn't labelling medical data for AI illegal, as unlicensed
          | medical practice?
          | 
          | Same thing with law data.
        
           | bethekidyouwant wrote:
           | Illegal?
        
           | iwontberude wrote:
           | Paralegals and medical assistants don't need licenses
        
             | nomel wrote:
             | I think their question is a good one, and not being taken
             | charitably.
             | 
              | Let's take the medical assistant example.
             | 
             | > Medical assistants are unlicensed, and may only perform
             | basic administrative, clerical and technical supportive
             | services as permitted by law.
             | 
             | If they're labelling data that's "tumor" or "not tumor",
              | with _any_ agency of the process, does that fit within their
             | unlicensed scope? Or, would that labelling be closer to a
             | diagnosis?
             | 
             | What if the AI is eventually used _to_ diagnose, based on
              | data that was labeled by someone unlicensed? Should there
              | need to be a "chain of trust" of some sort?
             | 
             | I think the answer to liability will be all on the doctor
             | agreeing/disagreeing with the AI...for now.
        
               | SkyBelow wrote:
               | To answer this, I would think we should consider other
               | cases where someone could practice medicine without
               | legally doing so. For example, could they tutor a student
               | and help them? Go through unknown cases and make
                | judgements, explaining their reasoning? As long as they
               | don't oversell their experience in a way that might be
               | considered fraud, I don't think this would be practicing
               | medicine.
               | 
               | It does open something of a loophole. Oh, I wasn't
               | diagnosing a friend, I was helping him label a case just
               | like his as an educational experience. My completely
                | IANAL guess would be that judges would look at it based
                | on how the person is doing it, primarily whether they are
                | receiving any compensation or running it like a business.
               | 
               | But wait... the example the OP was talking about is doing
               | it like a business and likely doesn't have any
               | disclaimers properly sent to the AI, so maybe that
               | doesn't help us decide.
        
           | mh- wrote:
           | No.
        
         | arbot360 wrote:
         | > What was true last year may be false today. For instance, ...
         | 
         | Good example of a medical QA dataset shifting but not a good
         | example of a medical "fact" since it is an opinion. Another way
         | to think about shifting medical targets over time would be
         | things like environmental or behavioral risk factors changing.
         | 
         | Anyways, thank you for putting this dataset together, certainly
         | we need more third-party benchmarks with careful annotations
         | done. I think it would be wise if you segregate tasks between
         | factual observations of data, population-scale opinions
         | (guidelines/recommendations), and individual-scale opinions
         | (prognosis/diagnosis). Ideally there would be some formal
          | taxonomy for this eventually, like OMOP CDM; maybe there is
          | one already in some dusty corner of PubMed.
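          | 
          | Something in that direction, as a strawman schema (field names
          | made up, not OMOP CDM):
          | 
          |     from dataclasses import dataclass
          |     from typing import Literal
          | 
          |     # observation: measured fact about the data
          |     # guideline:   population-scale recommendation
          |     # judgement:   individual-scale diagnosis/prognosis
          |     Kind = Literal["observation", "guideline", "judgement"]
          | 
          |     @dataclass
          |     class BenchmarkItem:
          |         question: str
          |         answer: str
          |         kind: Kind
          |         as_of: str  # date the answer was considered correct
          | 
          |     item = BenchmarkItem(
          |         question="USPSTF start age for biennial mammography "
          |                  "in average-risk women?",
          |         answer="40",
          |         kind="guideline",
          |         as_of="2024-04",
          |     )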
        
         | ljlolel wrote:
         | Centaur Labs does medical data labeling https://centaur.ai/
        
         | ethan_smith wrote:
         | Synthetic data generation techniques are increasingly being
         | paired with expert validation to scale high-quality biomedical
         | datasets while reducing annotation burden - especially useful
         | for rare conditions where real-world examples are limited.
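          | 
          | In outline the loop tends to look something like this (a
          | sketch with placeholder functions, not any particular
          | vendor's pipeline):
          | 
          |     # generate_candidate() and expert_review() are placeholders.
          |     def generate_candidate(condition):
          |         # e.g. prompt a model for a synthetic case description
          |         return {"condition": condition, "text": "..."}
          | 
          |     def expert_review(draft):
          |         # a paid specialist accepts, edits, or rejects it
          |         return {"verdict": "accept", "example": draft}
          | 
          |     dataset = []
          |     for condition in ["rare_condition_a", "rare_condition_b"]:
          |         for _ in range(3):
          |             draft = generate_candidate(condition)
          |             review = expert_review(draft)
          |             if review["verdict"] == "accept":
          |                 dataset.append(review["example"])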
        
       | techterrier wrote:
        | The latest in a long tradition: it used to be that you'd have to
       | teach the offshore person how to do your job, so they could
       | replace you for cheaper. Now we are just teaching the robots
       | instead.
        
         | kjkjadksj wrote:
         | Yeah I've avoided these job postings out of principle. I'm not
         | going to be the one to contribute to obsoleting myself and my
         | industry.
        
       | verisimi wrote:
        | This is it - this is the answer to the AI takeover.
        | 
        | Get an AI to autogenerate lots of crap! Reddit, HN comments,
        | false datasets, anything!
        
         | Cthulhu_ wrote:
          | That's just spam / more dead internet theory, and there are (or
          | will be) companies that curate data sets and filter out
          | generated stuff / spam, or hand-pick high-quality data.
        
       | vidarh wrote:
       | I've done review and annotation work for two providers in this
       | space, and so regularly get approached by providers looking for
       | specialists with MSc's or PhD's...
       | 
       | "High-paid" is an exaggeration for many of these, but certainly a
       | small subset of people will make decent money on it.
       | 
        | At one provider I was, as an exception, paid _6x_ their going rate
       | because they struggled to get people skilled enough at the high-
       | end to accept their regular rate, mostly to audit and review work
       | done by others. I have no illusion I was the only one paid above
       | their stated range. I got paid well, but even at 6x their regular
       | rate I only got paid well because they estimated the number of
       | tasks per hour and I was able to exceed that estimate by a
       | considerable margin - if their estimate had matched my actual
        | speed I'd have just barely gotten to the low end of my regular
       | rate.
       | 
       | But it's clear there's a pyramid of work, and a sustained effort
       | to create processes to allow the bulk of the work to be done by
       | low-cost labellers, and then push smaller and smaller subsets of
        | the data up to more expensive experts, as well as creating
       | tooling to cut down the amount of time experts spend by e.g.
       | starting with synthetic data (including model-generated reviews
       | of model-generated responses).
       | 
       | I don't think I was at the top of that pyramid - the provider I
       | did work for didn't handle many prompts that required deep
       | specialist knowledge (though I _did_ get to exercise my long-
        | dormant maths and physics knowledge, so that doesn't say too much).
       | I think most of what we addressed would at most need people with
       | MSc level skills in STEM subjects. And so I'm sure there are a
       | few more layers on the pyramid handling PhD-level complexity
       | data. But from what I'm seeing from hiring managers contacting
       | me, I get the impression the pay scale for them isn't that much
       | higher (with the obvious caveat given what I mentioned above that
       | there almost certainly are people getting paid high multiples on
        | the stated scale).
       | 
       | Some of these pipelines of work are highly complex, often
       | including multiple stages of reviews, sometimes with multiple
       | "competing" annotators in parallel feeding into selection and
       | review stages.
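        | 
        | One of the simpler shapes, schematically (a toy sketch, not any
        | provider's actual tooling):
        | 
        |     # Competing annotators -> selection -> escalating reviews,
        |     # so only a small subset ever reaches the expensive experts.
        |     def run_task(prompt, annotators, selector, reviewers):
        |         candidates = [annotate(prompt) for annotate in annotators]
        |         best = selector(prompt, candidates)
        |         for review in reviewers:       # cheapest reviewer first
        |             verdict, best = review(prompt, best)
        |             if verdict != "escalate":
        |                 break
        |         return best
        | 
        |     # Toy stand-ins so the sketch runs end to end.
        |     annotators = [lambda p: p + " (draft A)",
        |                   lambda p: p + " (draft B)"]
        |     selector = lambda p, cs: max(cs, key=len)
        |     reviewers = [lambda p, c: ("escalate", c),
        |                  lambda p, c: ("accept", c + " [expert pass]")]
        |     print(run_task("Label this", annotators, selector, reviewers))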
        
       | charlieyu1 wrote:
       | I'll believe it when it happens. A major AI company got rid of an
        | expert team last year because they thought it was too expensive.
        
        | quantum_state wrote:
        | It is the expert system, evolved ...
        
       | cryptokush wrote:
       | welcome to macrodata refinement
        
       | joshdavham wrote:
        | Someone literally just reached out to me this morning about a
        | contract job for one of these "high quality datasets". They
        | specifically wanted Python programmers who've contributed to
        | popular repos (I maintain one repository with approx. 300 stars).
       | 
       | The rate they offered was between $50-90 per hour, so
       | significantly higher than what I'd think low-cost data labellers
       | are getting.
       | 
       | Needless to say, I marked them as spam though. Harvesting emails
       | through GitHub is dirty imo. Was also sad that the recruiter was
        | acting on behalf of a YC company.
        
         | apical_dendrite wrote:
         | The latest offer I saw was $150-$210 an hour for 20hrs/week. I
         | didn't pursue it so I don't know if that's what people actually
         | make, but it's an interesting data point.
        
           | antonvs wrote:
           | What kind of work was involved for that one? How specialist,
           | I mean?
        
       | SoftTalker wrote:
       | Isn't this ignoring the "bitter lesson?"
       | 
       | http://www.incompleteideas.net/IncIdeas/BitterLesson.html
        
         | pphysch wrote:
          | The "bitter lesson" is from a paradigm where high-quality data
          | seemed infinite, so all you needed was more and more compute to
          | gobble up the data.
        
         | jsnider3 wrote:
         | Not really.
        
       | TrackerFF wrote:
       | I don't know if it is related, but I've noticed an uptick in cold
       | calls / approaches for consulting gigs related to data labeling
        | and data QA, in my field (I work as an analyst). I never got
        | requests like that 2+ years ago.
        
       | the_brin92 wrote:
       | I've been doing this for one of the major companies in the space
       | for a few years now. It has been interesting to watch how much
       | more complex the projects have gotten over the last few years,
       | and how many issues the models still have. I have a humanities
       | background which has actually served me well here as what
       | constitutes a "better" AI model response is often so subjective.
       | 
       | I can answer any questions people have about the experience
       | (within code of conduct guidelines so I don't get in trouble...)
        
         | merksittich wrote:
         | Thank you, I'll bite. If within your code of conduct:
         | 
         | - Are you providing reasoning traces, responses or both?
         | 
         | - Are you evaluating reasoning traces, responses or both?
         | 
         | - Has your work shifted towards multi-turn or long horizon
         | tasks?
         | 
         | - If you also work with chat logs of actual users, do you think
         | that they are properly anonymized? Or do you believe that you
         | could de-anonymize them without major efforts?
         | 
         | - Do you have contact to other evaluators?
         | 
         | - How do you (and your colleagues) feel about the work (e.g.,
         | moral qualms because "training your replacement" or proud
         | because furthering civilization, or it's just about the
         | money...)?
        
         | mNovak wrote:
         | Curious how one gets involved in this, and what fields they're
         | seeking?
        
         | dbmikus wrote:
         | What kinds of data are you working on? Coding? Something else?
         | 
          | I've been curious how much these AI labs look for more niche
          | coding-language expertise, and what other knowledge frontiers
          | they're focusing on (like law, medical, finance, etc.).
        
       | rnxrx wrote:
       | It's only a matter of time until private enterprises figure out
       | they can monetize a lot of otherwise useless datasets by tagging
        | them and selling them (likely via a broker) to organizations
        | building models.
       | 
       | The implications for valuation of 'legacy' businesses are
       | potentially significant.
        
         | htrp wrote:
         | Already happening.
        
       | some_random wrote:
        | Bad data has been such a huge problem in the industry for ages;
        | honestly, a huge portion of the worst bias (racism, sexism, etc.)
        | stems directly from low-quality labelling.
        
       | htrp wrote:
       | Starting a data labeling company is the least AI way to get into
       | AI.
        
         | glitchc wrote:
         | Some people sell shovels, others the grunts to use them.
        
       ___________________________________________________________________
       (page generated 2025-07-23 23:00 UTC)