[HN Gopher] Performance of LLMs on Advent of Code 2024
___________________________________________________________________
Performance of LLMs on Advent of Code 2024
Author : jerpint
Score : 17 points
Date : 2024-12-30 18:09 UTC (4 hours ago)
(HTM) web link (www.jerpint.io)
(TXT) w3m dump (www.jerpint.io)
| jebarker wrote:
| I'd be interested to know how o1 compares. On many days, after I
| completed the AoC puzzles, I was putting the questions into o1
| and it seemed to do really well.
| qsort wrote:
| According to this thread:
| https://old.reddit.com/r/adventofcode/comments/1hnk1c5/resul...
|
| o1 got 20 out of 25 (or 19 out of 24, depending on how you want
| to count). Unclear experimental setup (it's not obvious how
| much it was prompted), but it seems to check out with
| leaderboard times, where problems solvable by LLMs had
| completion times that were flat-out impossible for humans.
|
| An agent-type setup using Claude got 14 out of 25 (or, again,
| 13/24)
|
| https://github.com/JasonSteving99/agent-of-code/tree/main
| johnea wrote:
| LLMs are writing code for the coming of the lil' baby jesus?
| valbaca wrote:
| adventofcode.com
| grumple wrote:
| I'm both surprised and not surprised. I'm surprised because
| these sorts of problems, with very clear prompts and fairly
| clear algorithmic requirements, are exactly what I'd expect
| LLMs to perform best at.
|
| But I'm not surprised because I've seen them fail on many
| problems even with lots of prompt engineering and test cases.
| yunwal wrote:
| With no prompt engineering this seems like a weird comparison. I
| wouldn't expect anyone to be able to one-shot most of the AoC
| problems. A fair fight would at least use something like Cursor's
| agent on YOLO mode, which can review a command's output, add
| logs, etc.
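 [Editor's note: the "agent-type setups" mentioned above generally
 refer to a generate-run-retry loop rather than a single one-shot
 prompt. Below is a minimal, hypothetical sketch of such a loop;
 it is not the setup from the linked repo, and ask_llm() is a
 placeholder for whatever model API is actually used.]

  # Minimal sketch of a generate-run-retry loop for one puzzle.
  # ask_llm() is a hypothetical placeholder for a real LLM API call.
  import subprocess
  import tempfile

  def ask_llm(prompt: str) -> str:
      """Hypothetical helper: should return Python source for the prompt."""
      raise NotImplementedError("wire this up to an LLM provider")

  def solve_with_retries(puzzle: str, max_attempts: int = 3) -> str | None:
      prompt = f"Write a Python program that solves this puzzle:\n{puzzle}"
      for _ in range(max_attempts):
          code = ask_llm(prompt)
          with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
              f.write(code)
              path = f.name
          try:
              result = subprocess.run(
                  ["python", path], capture_output=True, text=True, timeout=60
              )
          except subprocess.TimeoutExpired:
              prompt += "\n\nYour last attempt timed out. Please make it faster."
              continue
          if result.returncode == 0:
              return result.stdout.strip()  # candidate answer to submit
          # Feed the failure back so the next attempt can try to fix it.
          prompt += f"\n\nYour last attempt failed with:\n{result.stderr}\nPlease fix it."
      return None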
___________________________________________________________________
(page generated 2024-12-30 23:00 UTC)