Computer Science > Computation and Language

arXiv:2407.13692 (cs) [Submitted on 18 Jul 2024 (v1), last revised 1 Aug 2024 (this version, v2)]

Title: Prover-Verifier Games improve legibility of LLM outputs
Authors: Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, Yuri Burda

Abstract: One way to increase confidence in the outputs of Large Language Models (LLMs) is to support them with reasoning that is clear and easy to check -- a property we call legibility. We study legibility in the context of solving grade-school math problems and show that optimizing chain-of-thought solutions only for answer correctness can make them less legible. To mitigate the loss in legibility, we propose a training algorithm inspired by the Prover-Verifier Game from Anil et al. (2021). Our algorithm iteratively trains small verifiers to predict solution correctness, "helpful" provers to produce correct solutions that the verifier accepts, and "sneaky" provers to produce incorrect solutions that fool the verifier. We find that the helpful prover's accuracy and the verifier's robustness to adversarial attacks increase over the course of training. Furthermore, we show that legibility training transfers to time-constrained humans tasked with verifying solution correctness: over the course of training, human accuracy increases when checking the helpful prover's solutions and decreases when checking the sneaky prover's solutions. Hence, training for checkability by small verifiers is a plausible technique for increasing output legibility. Our results suggest legibility training against small verifiers as a practical avenue for increasing the legibility of large LLMs to humans, and thus could help with the alignment of superhuman models.

Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2407.13692 [cs.CL] (or arXiv:2407.13692v2 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2407.13692 (arXiv-issued DOI via DataCite)

Submission history
From: Jan H. Kirchner
[v1] Thu, 18 Jul 2024 16:58:18 UTC (1,482 KB)
[v2] Thu, 1 Aug 2024 17:18:54 UTC (1,483 KB)
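For concreteness, the round structure described in the abstract can be sketched as a short program. The toy below is a minimal, self-contained stand-in: every object in it (the dict-based prover and verifier, sample_solutions, train_verifier, verifier_score, prover_reward) is a hypothetical placeholder, not the paper's implementation, which trains LLM provers with reinforcement learning and a smaller LLM verifier with supervised learning. What the sketch preserves is the alternation of the two phases and the role-conditioned reward.

# Toy sketch of checkability training (hypothetical stand-ins throughout).
import random

random.seed(0)

def sample_solutions(prover, problems, role):
    # Stand-in for prover rollouts. prover[role] is the probability that
    # the role achieves its own objective: a correct solution for the
    # helpful prover, an incorrect one for the sneaky prover.
    out = []
    for i, _ in enumerate(problems):
        achieved = random.random() < prover[role]
        is_correct = achieved if role == "helpful" else not achieved
        out.append((f"{role}-solution-{i}", is_correct))
    return out

def train_verifier(verifier, labeled):
    # Stand-in for the supervised phase: the paper fits the verifier to
    # predict correctness on labeled solutions; here "skill" simply grows
    # slowly with the amount of training data seen.
    verifier["seen"] += len(labeled)
    verifier["skill"] = min(1.0, 0.5 + 0.02 * verifier["seen"] ** 0.5)
    return verifier

def verifier_score(verifier, solution):
    # Stand-in convincingness score in [0, 1].
    return random.random() * verifier["skill"]

def prover_reward(role, is_correct, score):
    # The core of the game: the same verifier-acceptance signal rewards
    # the helpful role for correct solutions and the sneaky role for
    # incorrect solutions that nevertheless score highly.
    aligned = is_correct if role == "helpful" else not is_correct
    return score if aligned else 0.0

def checkability_training(n_rounds=5, problems=tuple(range(8))):
    prover = {"helpful": 0.5, "sneaky": 0.5}
    verifier = {"skill": 0.5, "seen": 0}
    for rnd in range(n_rounds):
        # Phase 1: train the verifier on correctness-labeled solutions
        # drawn from both prover roles.
        labeled = (sample_solutions(prover, problems, "helpful")
                   + sample_solutions(prover, problems, "sneaky"))
        verifier = train_verifier(verifier, labeled)
        # Phase 2: stand-in for the RL update on each role-conditioned prover.
        for role in prover:
            batch = sample_solutions(prover, problems, role)
            mean_r = sum(prover_reward(role, c, verifier_score(verifier, s))
                         for s, c in batch) / len(batch)
            prover[role] = min(1.0, prover[role] + 0.2 * mean_r)
        print(f"round {rnd}: prover={prover} verifier_skill={verifier['skill']:.2f}")

if __name__ == "__main__":
    checkability_training()

Run as-is, the toy shows each role's objective attainment and the verifier's skill rising over rounds, loosely mirroring the trends the paper reports; none of the numbers are meaningful beyond illustrating the control flow.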