[HN Gopher] Performance of LLMs on Advent of Code 2024
       ___________________________________________________________________
        
       Performance of LLMs on Advent of Code 2024
        
       Author : jerpint
       Score  : 17 points
       Date   : 2024-12-30 18:09 UTC (4 hours ago)
        
 (HTM) web link (www.jerpint.io)
 (TXT) w3m dump (www.jerpint.io)
        
       | jebarker wrote:
        | I'd be interested to know how o1 compares. On many days, after
        | I completed the AoC puzzles, I was putting the questions into
        | o1 and it seemed to do really well.
        
         | qsort wrote:
         | According to this thread:
         | https://old.reddit.com/r/adventofcode/comments/1hnk1c5/resul...
         | 
          | o1 got 20 out of 25 (or 19 out of 24, depending on how you want
          | to count). The experimental setup is unclear (it's not obvious
          | how much it was prompted), but it seems to check out with
          | leaderboard times, where problems solvable with LLMs had
          | completion times flat-out impossible for humans.
         | 
          | An agent-type setup using Claude got 14 out of 25 (or, again,
          | 13 out of 24):
         | 
         | https://github.com/JasonSteving99/agent-of-code/tree/main
        
       | johnea wrote:
       | LLMs are writing code for the coming of the lil' baby jesus?
        
         | valbaca wrote:
         | adventofcode.com
        
       | grumple wrote:
        | I'm both surprised and not surprised. I'm surprised because this
        | sort of problem, with very clear prompts and fairly clear
        | algorithmic requirements, is exactly what I'd expect LLMs to
        | perform best at.
       | 
       | But I'm not surprised because I've seen them fail on many
       | problems even with lots of prompt engineering and test cases.
        
        | yunwal wrote:
        | With no prompt engineering, this seems like a weird comparison.
        | I wouldn't expect anyone to be able to one-shot most of the AoC
        | problems. A fair fight would at least use something like
        | Cursor's agent in YOLO mode, which can review a command's
        | output, add logs, etc.
        
       ___________________________________________________________________
       (page generated 2024-12-30 23:00 UTC)