[HN Gopher] Unsupervised Elicitation of Language Models
       ___________________________________________________________________
        
       Unsupervised Elicitation of Language Models
        
       Author : kordlessagain
       Score  : 106 points
       Date   : 2025-06-14 12:32 UTC (10 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | unchocked wrote:
       | Philosophically, this looks like breaking the training data limit
       | in the same way that humans do: by using an internally consistent
       | view of the world to imagine new scenarios and integrate them
       | into an updated worldview.
        
       | robinduckett wrote:
       | Exciting news, who watches the watchmen?
        
       | Herring wrote:
       | > _our goal is to fine-tune a pretrained model on its own
       | generated labels_
       | 
       | Haven't all the big labs been doing this for a couple years now?
       | It's a good idea, with great execution, but it's far from novel.
       | 
       | https://en.wikipedia.org/wiki/Weak_supervision
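        | 
        | The classic version looks roughly like this (a minimal sketch
        | of pseudo-labeling, not the paper's method; the model interface
        | here is entirely made up):
        | 
        |   # Minimal self-training loop (sketch): keep only confident
        |   # self-generated labels, then fine-tune on them.
        |   def self_train(model, unlabeled, rounds=3, threshold=0.9):
        |       for _ in range(rounds):
        |           pseudo = []
        |           for x in unlabeled:
        |               label, conf = model.predict_with_confidence(x)
        |               if conf >= threshold:
        |                   pseudo.append((x, label))
        |           model = model.finetune(pseudo)  # train on own labels
        |       return model
        | 
        | The interesting part of the paper is _how_ the labels are
        | chosen, not just that the model generates them.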
        
         | platelminto wrote:
          | I think this removes the need for any human-labeled data: no
          | RLHF and the like. You can use their technique to create an
          | unsupervised reward model, then use that model to RL your way
          | to a useful assistant LLM.
         | 
         | The paper is very accessible (it's mostly written by Anthropic
         | researchers), and Section 4 summarises their findings really
         | well. They were themselves really surprised by the results:
         | 
         | > We were initially very skeptical of these findings, because
         | they seemed clearly too good to be true, and suspiciously close
         | to training with actual labels. To ensure we didn't
         | accidentally train on the labels, (1) we re-ran the experiment
         | several times on different datasets, (2) we copied the dataset
         | into a new file, excluding any labels before re-running our
         | algorithm with that file, and (3) _one coauthor independently
         | replicated the findings on the Claude 3.5 Haiku base model
         | using a different codebase_.
         | 
         | (emphasis mine)
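          | 
          | As I read it, the label search has roughly this shape (a
          | sketch of the idea only, not their code; the model interface,
          | count_inconsistencies, and flip are all made-up stand-ins):
          | 
          |   import math, random
          | 
          |   # Sketch: propose label flips and accept those that make
          |   # the label set more mutually predictable and more
          |   # logically consistent (simulated-annealing style).
          |   def coherence_search(model, examples, steps=1000, alpha=50.0):
          |       labels = [model.guess(x) for x in examples]
          |       def score(lbls):
          |           mutual = sum(
          |               model.logprob(lbls[i], examples[i],
          |                             others=[l for j, l in enumerate(lbls)
          |                                     if j != i])
          |               for i in range(len(lbls)))
          |           return alpha * mutual - count_inconsistencies(examples, lbls)
          |       for t in range(steps):
          |           i = random.randrange(len(examples))
          |           proposal = list(labels)
          |           proposal[i] = flip(proposal[i])
          |           delta = score(proposal) - score(labels)
          |           temp = max(0.01, 1.0 - t / steps)  # cooling schedule
          |           if delta > 0 or random.random() < math.exp(delta / temp):
          |               labels = proposal
          |       return labels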
        
       | abeppu wrote:
       | > However, as tasks and model behaviors grow more complex, human
       | supervision becomes increasingly unreliable: LMs can learn to
       | mimic mistakes in demonstrations or exploit flaws in feedback.
       | How do we train LMs to do tasks that are too difficult for humans
       | to demonstrate or evaluate reliably?
       | 
       | I didn't read the whole paper but it seems important that you
       | still need real ground truth to measure improvement, so you still
       | need to get real labels at some point. The task they focus on
       | where LLMs have "superhuman" performance is guessing the gender
        | of blog authors. While humans are bad at this, humans are decent
        | at remembering their own gender, and a bunch of them are willing
        | to write a blog post, so there's obviously a better way to get
        | supervised examples than asking humans to guess labels: you
        | collect posts from authors whose gender is known. I.e. "human
       | generated labels are low quality" should not be taken to mean
       | "good labels are not available so we should go fully
       | unsupervised".
       | 
       | So since you already need some real ground truth to know whether
       | your algorithm accomplished anything, I think it's fair to ask:
       | when would you commit to using _all_ your labeled data for
        | evaluation and none for fine-tuning, as described in this work?
       | Logical consistency seems valuable, sure, but it seems like
       | really you'd want to use both consistency and some (small?)
       | amount of labeled examples, and a perhaps larger amount of self-
       | labeled examples. In their loop where they revise labels to be
       | more coherent, it seems natural to imagine that pre-provided
       | labels should be stickier than self-generated ones, but not
       | immutable, because there's always some chance of noise in your
       | upstream data generation process.
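        | 
        | Concretely, "stickier but not immutable" could just be a prior
        | term added to whatever coherence score gets optimized (a
        | sketch; every name here is hypothetical):
        | 
        |   # Penalize flipping a pre-provided label, scaled by how much
        |   # you trust the upstream labeling process. Self-generated
        |   # labels move freely; provided ones resist but can still flip.
        |   def score_with_anchors(base_score, labels, provided, trust=5.0):
        |       disagreements = sum(1 for i, y in provided.items()
        |                           if labels[i] != y)
        |       return base_score - trust * disagreements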
        
       | md224 wrote:
       | I was intrigued that one of the researchers was listed as
       | "independent", so I checked her out:
       | 
       | https://lindapetrini.com
       | 
       | It looks like she's a science communicator rather than a
       | scientist herself. That's interesting... I'm not used to seeing
       | academic papers that include an author devoted entirely to the
       | writing aspect. (Then again, maybe I just haven't noticed?)
        
         | joaogui1 wrote:
          | The fact that she's a science communicator doesn't imply that
          | she only did the communication part, I think.
        
       | majormajor wrote:
       | I skimmed mostly, but was trying to understand how they came up
       | with "superhuman" as a description, and it seems like a stretch?
       | 
       | This might seem like a nit but the term "superhuman" is a VERY
       | strong one to my mind. It doesn't suggest "better than the
       | average human off the street at a particular random task" but
        | instead suggests "better than humans are capable of getting
        | with training, at a high percentile level".
       | 
        | One of the biggest advantages of LLMs as a tool is that they are
        | generally quite good at a broad variety of things without
        | needing a ton of further domain-specific training. Humans tend to
        | be the opposite.
       | 
       | It doesn't seem like they gave much training to the human
       | annotators they recruited. Whereas an LLM trained on the internet
       | has been trained on a LOT of blog posts + associated metadata.
        | And nobody has ever really bothered figuring out "how would we
        | best train humans to identify the gender of blog post authors" -
       | there's very little economic incentive for it. It's not like we
       | generally train people to write in gender-specific ways in school
       | either, so we haven't been formally instructed on potential
       | differences. We'd have to rely on broad-brush generalizations if
       | not given an opportunity to deep dive to try to find more
       | specific tendencies.
       | 
        | But if you paid people to study a large chunk of the corpus
        | they're using for this for a couple of years, focusing
        | consciously on post style, content, and author gender, and then
        | tested them on posts from authors you held out... how well could
        | they do?
        
         | jaggirs wrote:
         | "Superhuman" refers to abilities, qualities, or powers that
         | exceed those naturally found in humans. It implies being
         | greater than normal human capabilities.
         | 
         | The term is often used in fiction, particularly in superhero
         | comics and fantasy, but it can also be used metaphorically to
         | describe extraordinary effort or achievement in real life
         | (e.g., "It took a superhuman effort to finish the marathon").
         | 
         | (Definition from Gemini)
         | 
          | It seems reasonable to me to use the term simply to say the
          | model's abilities on a benchmark exceeded those of the human-
          | annotated data. Computers have always been superhuman at many
          | tasks, even before LLMs.
        
           | majormajor wrote:
           | > "Superhuman" refers to abilities, qualities, or powers that
           | exceed those naturally found in humans. It implies being
           | greater than normal human capabilities.
           | 
           | How do you know what normal human capabilities are for an
           | unusual task that humans have not trained for? Is identifying
           | the gender of the author of a blog post 80% of the time
           | "extraordinary"? How do I know what a human is capable of
           | doing for that with training?
           | 
           | If a person with no programming experience asked Claude or
           | ChatGPT to produce some code, they'd get better code than
           | their "normal" human capability could produce. So: superhuman
           | coders?
           | 
           | But also today, I have asked Claude and ChatGPT to do coding
           | tasks for me that both models got stuck on. Then I fixed them
           | myself because I've had a lot of training and practice. So:
           | not superhuman? But wait, the model output the broken code
           | faster than I would've. So: superhuman again?
           | 
           | Extraordinary shouldn't be so easily disputable.
           | 
            | LLMs have superhuman _breadth_ and superhuman _speed_. I
            | haven't seen superhuman _depth_ in any capability yet. I've
            | seen them show "better than untrained median person" and
            | often "better than hobbyist" depth. But here the authors
            | claim "superhuman capabilities", which pretty specifically
            | means something beyond breadth or speed.
        
           | majormajor wrote:
           | On a separate note, using an LLM for a definition is a bit
           | funny, when there are expert-curated sources easily
           | available. The LLM didn't get it _wrong_ here, but...
           | 
           | https://en.wikipedia.org/wiki/Superhuman
           | 
           | First line: "The term superhuman refers to humans, humanoids
           | or other beings with abilities and other qualities that
           | exceed those naturally found in humans."
           | 
           | Golly, I wonder what that model based its first sentence on.
        
       | brumar wrote:
        | So LLMs are having their AlphaGo Zero moment, where training on
        | human data is passé? Sounds exciting? Terrifying?
        
       | clbrmbr wrote:
        | Marks' paper with Max Tegmark, "Geometry of Truth", is a great
        | read, and I can see the ideas repeated here. I've been meaning
        | to repro some of the geotruth paper...
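        | 
        | The core of a repro is pleasantly small, if I remember the
        | paper right (a sketch: the activation dump and file names are
        | placeholders for whatever you extract from the model):
        | 
        |   import numpy as np
        |   from sklearn.linear_model import LogisticRegression
        | 
        |   # Fit a linear probe on residual-stream activations of
        |   # true/false statements; the paper's finding is that a single
        |   # linear direction separates them surprisingly well.
        |   acts = np.load("activations.npy")   # (n_statements, d_model)
        |   truth = np.load("labels.npy")       # 1 = true, 0 = false
        |   probe = LogisticRegression(max_iter=1000).fit(acts, truth)
        |   print("train accuracy:", probe.score(acts, truth))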
        
       ___________________________________________________________________
       (page generated 2025-06-14 23:00 UTC)