[HN Gopher] Performance of LLMs on Advent of Code 2024
       ___________________________________________________________________
        
       Performance of LLMs on Advent of Code 2024
        
       Author : jerpint
       Score  : 120 points
        Date   : 2024-12-30 18:09 UTC (1 day ago)
        
 (HTM) web link (www.jerpint.io)
 (TXT) w3m dump (www.jerpint.io)
        
       | jebarker wrote:
        | I'd be interested to know how o1 compares. On many days, after I
        | completed the AoC puzzles, I put the questions into o1 and it
        | seemed to do really well.
        
         | qsort wrote:
         | According to this thread:
         | https://old.reddit.com/r/adventofcode/comments/1hnk1c5/resul...
         | 
          | o1 got 20 out of 25 (or 19 out of 24, depending on how you want
          | to count). The experimental setup is unclear (it's not obvious
          | how much it was prompted), but it seems consistent with
          | leaderboard times, where problems solvable with LLMs had
          | completion times flat-out impossible for humans.
         | 
         | An agent-type setup using Claude got 14 out of 25 (or, again,
         | 13/24)
         | 
         | https://github.com/JasonSteving99/agent-of-code/tree/main
        
           | joseneca wrote:
           | I have to wonder why o1 didn't work. That post is
           | unfortunately light on details that seem pretty important.
        
             | jebarker wrote:
             | I was thinking 20/25 is pretty great! At least 5 of the
             | problems were pretty tricky and easy to fail due to small
             | errors.
        
       | johnea wrote:
       | LLMs are writing code for the coming of the lil' baby jesus?
        
         | valbaca wrote:
         | adventofcode.com
        
       | grumple wrote:
        | I'm both surprised and not surprised. I'm surprised because
        | these sorts of problems, with very clear prompts and fairly
        | clear algorithmic requirements, are exactly what I'd expect LLMs
        | to perform best at.
       | 
       | But I'm not surprised because I've seen them fail on many
       | problems even with lots of prompt engineering and test cases.
        
       | yunwal wrote:
        | With no prompt engineering this seems like a weird comparison. I
        | wouldn't expect anyone to be able to one-shot most of the AoC
        | problems. A fair fight would at least use something like Cursor's
        | agent on YOLO mode, which can review a command's output, add
        | logs, etc.
        
         | ben_w wrote:
         | If you immediately know the candlelight is fire, then the meal
         | was cooked a long time ago.
         | 
         | And so it is with success of LLMs in one-shot challenges and
         | any job that depends on such challenges: cooked a long time
         | ago.
        
           | NitpickLawyer wrote:
            | Indeed. (Wild to see an SG-1 quote in the wild!)
        
         | fumeux_fume wrote:
         | I certainly don't think it's weird to measure one-shot
         | performance on the AOC. Sure, more could be done. More can
         | always be done, but this is still useful and interesting.
        
         | segmondy wrote:
          | I zero-shotted 27 solutions successfully with the local model
          | code-qwen2.5-32b. I think adding Sonnet or the latest Gemini
          | will probably get me to 40.
        
         | rhubarbtree wrote:
          | Seems reasonable to me. More people use ChatGPT than Cursor, so
          | use it like ChatGPT.
         | 
         | For most coding problems, people won't magically know if the
         | code is "correct" or not, so any one-shot answer that is wrong
         | could be a real hindrance.
         | 
         | I don't have time to prompt engineer for every bit of code. I
         | need tools that accelerate my work, not tools that absorb my
         | time.
        
       | unclad5968 wrote:
        | Half the time I ask Gemini questions about the C++ standard
        | library, it fabricates non-existent types and functions. I'm
        | honestly impressed it was able to solve any of the AoC problems.
        
         | devjab wrote:
          | I've had a side job as an external examiner for CS students for
          | almost a decade now. While LLMs are generally terrible at
          | programming (in my experience), they excel at passing finals.
          | If I were to guess, it's likely a combination of the relatively
          | simple (or perhaps isolated is a better word) tasks coupled
          | with how many times similar problems have been solved before in
          | the available training data. Somewhat ironically, the easiest
          | way to spot students who "cheat" is when the code is great.
          | Being an external examiner means I have a full-time job in
          | software development, and I personally find it sort of silly
          | when students aren't allowed to use a calculator. I guess
          | changing the way you teach and test is a massive undertaking
          | though, so right now we just pretend LLMs aren't being used by
          | basically every student. Luckily I'm not part of the "spot the
          | cheater" process, so I can just judge them based on how well
          | they can explain their code.
         | 
          | Anyway, I'm not at all surprised that they can handle AoC. If
          | anything, I would worry about whether AoC will still be a fun
          | project to author when so many people solve it with AI. It sure
          | won't be fun to curate any form of leaderboard.
        
         | uludag wrote:
          | Leetcode/AoC-like problems are probably the easiest class of
          | software-related tasks LLMs can do. Using the correct library,
          | the correct way, at the correct version, especially if the
          | library has some level of churn and isn't extremely common, can
          | be a harder task for LLMs than the hardest AoC problems.
        
       | upghost wrote:
       | After looking at the charts I was like "Whoa, damn, that Jerpint
       | model seems amazing. Where do I get that??" I spent some time
       | trying to find it on Huggingface before I realized...
        
         | senordevnyc wrote:
         | lol, I did the same thing
        
         | j45 wrote:
         | Me too.
         | 
         | The author could make a model on huggingface routing requests
         | to him. Latency might vary.
        
           | tbagman wrote:
            | The good news is that training jerpint was probably cheaper
            | than training the latest GPT models...
        
         | foldl2022 wrote:
          | Just open-weight it. We need this. :)
        
       | moffkalast wrote:
       | At first I was like "What is this jerpint model that's beating
       | the competition so soundly?" then it hit me, lol.
       | 
       | Anyhow this is like night and day compared to last year, and it's
       | impressive that Sonnet is now apparently 50% as good as a
       | professional human at this sort of thing.
        
         | zkry wrote:
          | I don't think comparing star counts would be a good measure
          | though; with AoC, 90% of the effort and difficulty goes into
          | the harder problems towards the end, and it was the easy
          | problems at the beginning where the bulk of Sonnet's stars came
          | from.
        
           | moffkalast wrote:
           | Ah yeah that's true, the difficulty curve is not very linear.
        
       | BugsJustFindMe wrote:
       | I think this is a terrible analysis with a weak conclusion.
       | 
       | There's zero mention of how long it took the LLM to write the
       | code vs the human. You have a 300 second runtime limit, but what
       | was your coding time limit? The machine spat out code in, what, a
       | few seconds? And how long did your solutions take to write?
       | 
        | Advent of Code problems take me longer to just _read_ than it
        | takes an LLM to have a proposed solution ready for evaluation.
       | 
       | > _they didn't perform nearly as well as I'd expect_
       | 
       | Is this a joke, though? A machine takes a problem description
       | written as floridly hyperventilated as advent problems are, and,
       | without any opportunity for automated reanalysis, it understands
       | the exact problem domain, it understands exactly what's being
       | asked, correctly models the solution, and spits out a correct
       | single-shot solution on 20 of them in no time flat, often with
       | substantially better running time than your own solutions, and
       | that's disappointing?
       | 
       | > _a lot of the submissions had timeout errors, which means that
       | their solutions might work if asked more explicitly for efficient
       | solutions. However the models should know very well what AoC
       | solutions entail_
       | 
       | You made up an arbitrary runtime limit and then kept that limit a
       | secret, and you were surprised when the solutions didn't adhere
       | to the secret limit?
       | 
       | > _Finally, some of the submissions raised some Exceptions, which
       | would likely be fixed with a human reviewing this code and asking
       | for changes._
       | 
       | How many of your solutions got the correct answer on the first
       | try without going back and fixing something?
        
         | keyle wrote:
          | You raise some good points about the "total hours spent", but
          | I guess you don't count training time. Also, there is no need
          | to quote the author's post and have a go at him personally.
          | There are better ways to get your point across: argue the
          | point made, not the sentences written.
        
           | BugsJustFindMe wrote:
            | > _I guess you don't consider training time included_
           | 
           | In the same way that I don't consider the time an author
           | spends writing a book when saying how long it took me to read
           | the book. OP lost zero time training ChatGPT.
           | 
           | Or do you mean how much time OP spent training themself?
           | Because that's a whole new can of worms. How many years
           | should we add to OP's development times because probably they
           | learned how to write code before this exercise?
        
             | keyle wrote:
              | I was considering the energy, money, R&D, and time spent to
              | train those models, which is colossal compared to the
              | developer. Factor that in and they don't look so
              | impressive. For reference, I'm not anti-LLM; I use Claude
              | and ChatGPT every day. I'm just raising these points if you
              | want to consider all the facts.
        
               | lgas wrote:
               | That cost is amortized over all requests made against all
               | copies of the model though. Fortunately we don't have to
               | re-train the model every time we want to perform
               | inference.
        
               | bawolff wrote:
               | That's a weird comparison.
               | 
               | Would you also count the number of hours it took for
               | someone to write the OS, the editor, etc? The programmer
               | wouldn't be effective without those things.
        
               | menaerus wrote:
               | And now also consider the amount of time, school,
               | university, practice and $$$ it took to train a software
               | engineer. Repeat for the population of presumably ~30
               | million software engineers in the world. Time-wise it
               | doesn't scale at all when contrasted to new LLM releases
               | that occur ~once a year per company. $$$-wise is also
               | debatable.
        
         | segmondy wrote:
          | I did experiment with a local LLM running on my machine; most
          | solutions were generated within 30-60 seconds. My overhead was
          | really copying part 1 of the problem, generating the code,
          | copying and pasting the code, copying the input data, running
          | it, entering the result, and repeating for part 2. For most of
          | them that was about 5 minutes from start to finish. If I
          | automated the process (see the sketch below), it would probably
          | be 1 minute or less for the problems it could solve.
         | 
         | Not the OP, but I was able to zero shot 27 solutions correctly
         | without any back and forth, and 5 more with a little bit back
         | and forth. Using local models.
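          | 
          | A rough sketch of what that automation might look like (the
          | endpoint, model name, and the extract_code/run_code helpers
          | are hypothetical placeholders):
          | 
          |     import requests
          | 
          |     def solve(puzzle_text: str, puzzle_input: str) -> str:
          |         # Ask a local OpenAI-compatible server (llama.cpp,
          |         # ollama, etc.) for a Python solution to the puzzle.
          |         resp = requests.post(
          |             "http://localhost:8080/v1/chat/completions",
          |             json={
          |                 "model": "qwen2.5-coder-32b",  # placeholder
          |                 "messages": [{
          |                     "role": "user",
          |                     "content": "Write a Python program that "
          |                     "reads input.txt and prints the answer.\n"
          |                     + puzzle_text,
          |                 }],
          |             },
          |         ).json()
          |         msg = resp["choices"][0]["message"]["content"]
          |         code = extract_code(msg)   # placeholder helper
          |         with open("input.txt", "w") as f:
          |             f.write(puzzle_input)
          |         return run_code(code)      # placeholder helper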
        
         | mvdtnz wrote:
         | > it understands the exact problem domain, it understands
         | exactly what's being asked, correctly models the solution
         | 
         | It does not "understand" or "model" shit. Good grief you AI
         | shills need to take a breath.
        
           | BugsJustFindMe wrote:
           | I'll use a different word when you demonstrate that humans
           | aren't also stochastic parrots constantly hallucinating
           | reality. Until then, good enough for one is good enough for
           | the other.
        
       | bryan0 wrote:
        | Since you did not give the models a chance to test their code and
        | correct any mistakes, I think a more accurate comparison would be
        | against you submitting answers without testing (or even running!)
        | your code first.
        
         | angarg12 wrote:
         | People keep evaluating LLMs on essentially zero-shotting a
         | perfect solution to a coding problem.
         | 
          | Once we use tools to easily iterate on code (e.g. generate,
          | compile, test, use the outcome to refine the prompt), we will
          | turbocharge LLMs' coding abilities.
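          | 
          | A minimal sketch of such a loop, assuming a generate_code
          | placeholder for whatever LLM call you use:
          | 
          |     import subprocess
          | 
          |     def iterate(problem: str, max_rounds: int = 5):
          |         feedback = ""
          |         for _ in range(max_rounds):
          |             # generate_code is a placeholder for an LLM call.
          |             code = generate_code(problem + feedback)
          |             result = subprocess.run(
          |                 ["python", "-c", code],
          |                 capture_output=True, text=True, timeout=300,
          |             )
          |             if result.returncode == 0:
          |                 return result.stdout  # candidate answer
          |             # Feed the error back into the next prompt.
          |             feedback = ("\nYour code failed with:\n"
          |                         + result.stderr)
          |         return None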
        
         | rhubarbtree wrote:
         | This smacks of "moving the goalposts" just as much as the other
         | side is accused of when unimpressed by advances.
         | 
         | It's a reasonable test.
        
         | xen0 wrote:
         | I'm a bit curious to see how close the solutions were.
         | 
         | When it couldn't solve it within the constraints, was it
         | 'broadly correct' with some bugs? Was it on the right track or
         | completely off?
        
           | jerpint wrote:
            | The code they generated and the outputs (including traceback
            | errors) are all included; you can view them on the post
            | itself or from the HF space:
           | 
           | https://huggingface.co/spaces/jerpint/advent24-llm
        
             | xen0 wrote:
             | Huh, don't know how I missed that. I was hoping for them to
             | have done some analysis themselves.
             | 
              | Glancing through the code from Claude (his best performer)
              | for some of the 'easier' problems, it seems broadly correct
              | (if a little strange and overwrought).
             | 
             | But my phone is not a good platform for me to do this kind
             | of reading.
        
               | jerpint wrote:
                | I should specify that unfortunately I didn't store the
                | raw output from the LLMs, just the parsed code snippets
                | they produced, but the code to reproduce it is all there.
        
       | bongodongobob wrote:
        | I think a major mistake was giving parts 1 and 2 all at once. I
        | had great results having it solve part 1, then part 2. I think I
        | got 4o to one-shot parts 1 and 2 up to about day 12. It started
        | to struggle a bit after that, and I got bored with it at day 18.
        | It did way better than I expected; I don't understand why the
        | author is disappointed. This shit is magic.
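        | 
        | Something like this, keeping the model's part 1 answer in
        | context before asking for part 2 (a sketch using the OpenAI
        | Python client; part1_text and part2_text stand in for the
        | puzzle statements):
        | 
        |     from openai import OpenAI
        | 
        |     client = OpenAI()
        |     messages = [{"role": "user",
        |                  "content": "Solve part 1:\n" + part1_text}]
        |     r1 = client.chat.completions.create(
        |         model="gpt-4o", messages=messages)
        |     # Keep the part 1 solution in the conversation, then ask
        |     # for part 2 as a follow-up.
        |     messages += [
        |         {"role": "assistant",
        |          "content": r1.choices[0].message.content},
        |         {"role": "user",
        |          "content": "Now solve part 2:\n" + part2_text},
        |     ]
        |     r2 = client.chat.completions.create(
        |         model="gpt-4o", messages=messages)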
        
       | 101008 wrote:
        | This kind of mirrors my experience with LLMs. If I ask them non-
        | original problems (make this API, write this test, update this
        | function that must have been written 100s of times by developers
        | around the world, etc.), it works very well. Some minor changes
        | here and there, but it saves time.
        | 
        | When I ask them to code things that they've never heard of (I am
        | working on an online sport game), it fails catastrophically. The
        | LLM should know the sport, and what I ask is pretty clear to
        | anyone who understands the game (I tested against actual people
        | and it was obvious what to expect), but the LLM failed miserably.
        | It's even worse when I ask them to write some designs in CSS for
        | the game. It seems if you take them outside the 3-column layout
        | or Bootstrap or the overused landing page, LLMs fail miserably.
        | 
        | They work very well for the known cases, but as soon as you want
        | them to do something original, they just can't.
        
         | jppope wrote:
          | Completely agree with this. The problem I'm starting to run
          | into is that I do a lot of things that have been done before,
          | so I get really efficient at doing that stuff with LLMs... and
          | now I'm slow to solve the stuff I used to be really fast at.
        
         | anonzzzies wrote:
          | I ask LLMs things the way I would spec them out for my team,
          | which is to say, when I write a task, I would not include
          | things about the sport; I would explain the logic etc. that one
          | would need to write. No one needs to know what it is about in
          | the abstract. I see the same issue with humans; they get
          | confused when asked to see the connection between real-world
          | things and what they need to do. I mean, I don't flesh it out
          | to the extent that I might as well write the code, but enough
          | that you don't need to know anything about the underlying
          | subject.
         | 
          | We are doing a large EU healthcare project and things obviously
          | differ per country; if I assumed people had a modicum of world
          | knowledge, or enough interest to look things up, nothing would
          | get done. It is easier to deliver Excel sheets with the proper
          | wording and formulas and leave out what it is even for.
         | 
         | Works fine with LLMs.
         | 
          | Disclaimer: the people in my company know everything about the
          | subject matter; the people (and LLMs) that implement it are
          | usually not ours, we just manage them. In my experience,
          | programmers at big companies are borderline worthless, so over
          | the past 35 years I have taken to writing tasks as concretely
          | as I can; we found that otherwise we either get garbage commits
          | or tons of meetings with questions and then garbage commits.
          | And now that comes in handy, as this works much better on LLMs
          | too.
        
         | steve_adams_86 wrote:
          | This has made me wonder just how few developers have been doing
          | even remotely interesting things. I encounter a lot of
          | situations where the LLMs fail catastrophically, but I don't
          | think it's for lack of trying to guide them through to
          | solutions properly or sufficiently. I give as much or as little
          | context as I can think to, give partial solutions or zero
          | solutions, try different LLMs with the same prompts, etc.
         | 
          | Maybe we were mostly doing the exact same stuff, and these
          | things are trained on a whole lot of the same. Like with React
          | components: these things are amazing at pumping out the most
          | basic varieties of them.
        
           | uludag wrote:
           | My thoughts exactly. I honestly think that the majority of
           | SWE work is completely unoriginal.
           | 
            | But then there are those idiosyncrasies that every company
            | has: a finicky build system, strange hacks to get certain
            | things to work, extensive webs of tribal knowledge. These
            | will be the demise of any LLM SWE application beyond being a
            | tool for skilled professionals.
        
             | bamboozled wrote:
              | The whole world is wrinkly, that's the way it is; everyone
              | and everything is slightly different.
        
           | bugglebeetle wrote:
           | > This has made me wonder just how few developers have been
           | doing even remotely interesting things.
           | 
           | There's not much to ponder on here as the majority of human
           | thought and action is unoriginal and irrelevant, so software
           | development is not some special exception to this. This
           | doesn't mean it lacks meaning or value for those doing it,
           | since that can be self-generated, extrinsically motivated, or
           | both. But, by definition, most things done are within the
           | boundaries of a normal distribution and not exceptional,
           | transcendent, or sublime. You can strive for those latter
           | categories, but it's a path strewn with much failure, misery,
           | and quite questionable rewards, even where it is achieved.
           | Most prefer more achievable comforts or security to it, or,
           | at most, dull-minded imitations thereof, such as amassing
           | great wealth.
        
           | josephg wrote:
           | Yeah; I've long argued that software should be divided into
           | two separate professions.
           | 
            | If you think about it, lots of professions are already
            | divided in two. You often have a profession of inventors /
            | designers and a profession of creators. For example:
           | Electrical engineers vs Electricians. Architects / civil
           | engineers and construction workers. Chefs and line cooks.
           | Composers and session musicians. And so on.
           | 
           | The two professions always need somewhat different skill
           | sets. For example, electricians need business sense and to be
           | personable - while electrical engineers need to understand
           | how to design circuits from scratch. Session musicians need a
           | fantastic memory for music, and to be able to perform in lots
           | of different musical styles. Composers need a deeper
           | understanding of how to write compelling music - but usually
           | only in their preferred style.
           | 
           | I think programming should be divided in the same way. It
           | sort of happens already: There's people who invent novel
           | database storage engines, work on programming language
           | design, write schedulers for operating systems and so on. And
           | then there's people who do consulting and write almost all
           | the software that businesses actually need to function. Eg,
           | people who talk to actual clients and figure out their needs.
           | 
            | The first group invents React. The second group uses React to
            | build online stores.
           | 
           | We sort of have these two professions already. We just don't
           | have language to separate them.
           | 
           | The line is a bit fuzzy, but you can tell they're two
           | professions because each needs a slightly different skill
           | set. If you're writing a novel full-text search engine, you
           | need to be really good at data structures & algorithms, but
           | you don't really need to be good at working with clients. And
           | if you work in frontend react all day, you don't really need
           | to understand how a B-Tree works or understand the difference
           | between LR and recursive descent parsing. (And it doesn't
           | make much sense to ask these sort of programmers "leetcode
           | problems" in job interviews).
           | 
            | ChatGPT is good at application programming. Or really,
            | anything that it's seen lots of times before. But the more
            | novel a problem - the more it's actually "software
            | engineering" - the more it struggles to make working
            | software. "Write a React component that does X" - easy.
            | "Implement a B-tree with this weird quirk" - hard. It'll
            | struggle through, but the result will be buggy. "Implement
            | this new algorithm from a paper I just read" - it's more or
            | less completely useless at that.
        
             | mwcampbell wrote:
             | Steve Yegge wrote about this ~20 years ago, with a heaping
             | helping of his usual snark directed at the OO fads of the
             | time: https://sites.google.com/site/steveyegge2/age-
             | racecar-driver
        
           | lm28469 wrote:
            | Imho the hordes of code monkeys see an incredible improvement
            | in their productivity and keep boasting about it online,
            | which is inflating the idea that LLMs are close to AGI and
            | revolutionizing the industry.
            | 
            | Meanwhile, people who work on things that are even slightly
            | niche or complex (i.e. things which aren't answered in 20
            | different Stack Overflow threads, or things which require a
            | few field "experts" to meet around a table and think hard for
            | a few hours) don't understand the enthusiasm at all.
            | 
            | I'm working on fairly simple things, but in a context/
            | industry that isn't that common, and I have to spoon-feed
            | LLMs to get anywhere; most of the time it doesn't even work.
            | The next few years will be interesting. It might play out as
            | it did for plane pilots over-relying on autopilots to the
            | point of losing key skills.
        
           | SkyBelow wrote:
            | Is it that they aren't doing interesting things, or is it
            | that developers doing interesting things tend to only ask the
            | AI to handle the boring parts? I find I use AI quite a bit,
            | but it is integrated with my own thinking and problem solving
            | every step of the way. I see how others are using it, and
            | they often seem to be asking for full solutions, whereas I'm
            | using it more like a rubber duck debugger stuffed with Stack
            | Overflow feedback and knowledge of any common APIs. It mostly
            | helps by answering things that would have otherwise been web
            | searches, but it doesn't create any solutions from scratch.
        
         | PaulRobinson wrote:
         | > as soon as you want them to do something original, they just
         | can't.
         | 
         | An LLM is, famously, a "stochastic parrot". It has no
         | reasoning. It has no creativity. It has no basis or mechanism
         | for true originality: it is regurgitating what it has seen
          | before, based on probabilities, nothing more. It looks more
          | impressive, but that's all that's behind the curtain.
         | 
         | It surprises me that so many people expect an LLM to do
         | something "clever" or "original". They're compression machines
         | for knowledge, a tool for summarisation, rephrasing, not for
         | creation.
         | 
         | Why is this not that widely known and accepted across the
         | industry yet, and why do so many people have such high
         | expectations?
        
           | redlock wrote:
           | Because it isn't true according to Hinton, Sutskever, and
           | other leading AI researchers.
        
             | FroshKiller wrote:
             | Most foxes are herbivores according to the fellow guarding
             | my henhouse.
        
               | redlock wrote:
                | They do make compelling arguments if you listen to them.
                | Also, from my coding with LLMs using Cursor, they
                | obviously understand the code and my request more often
                | than not. Mechanistic interpretability research shows
                | evidence of concepts being represented within the layers.
                | Golden Gate Claude is evidence of this:
                | https://www.anthropic.com/news/golden-gate-claude
                | 
                | To me this proves that LLMs learn concepts and multi-
                | layered representations within their networks, and not
                | just some dumb statistical inference. Even a famous LLM
                | skeptic like Francois Chollet doesn't invoke stochastic
                | parrots anymore and has moved on to arguing that they
                | don't generalize well and are just memorizing.
                | 
                | With GPT-2 and 3 I was of the same opinion, that they
                | seemed like just sophisticated stochastic parrots, but
                | current LLMs are a different class from early GPTs. Now
                | that o3 has beaten a memorization-resistant benchmark
                | like ARC-AGI, I think we can confidently move on from
                | this stochastic parrots notion.
                | 
                | (And before you argue that o3 is not an LLM, here is an
                | OpenAI researcher stating that it is an LLM: https://x.com
                | /__nmca__/status/1870170101091008860?s=46&t=eTe... )
        
               | pixelfarmer wrote:
               | Neural networks are generalizing things as part of their
               | optimization scheme. The current approach is just to dump
               | many layered neural networks (at the core) as in "deep
               | learning" to solve the problems, but the networks are too
               | regular, too "primitive" of sorts. What is needed are
               | network topologies that strongly support this
               | generalization, the creation of "meta abstraction
               | levels", otherwise it will get nowhere.
               | 
                | Biological networks of more intelligent species contain a
                | few billion neurons and upwards from that, while even the
                | big LLMs are somewhere in the millions of "equivalents"
                | at best. So, bad topology + far fewer "neurons", and the
               | resulting capabilities shouldn't be too surprising. Plus
               | it is clear that AGI is nowhere close, because one result
               | of AGI is a proper understanding of "I". Crows have an
               | understanding of "I", for example.
               | 
               | And that is where these "meta abstraction levels" come
               | in: There are many needed to eventually reach the stage
               | of "I". This can also be used to test how well neural
               | networks perform, how far they can abstract things for
               | real, how many levels of generalization are handled by
               | it. But therein lies a problem: Let 2 persons specify
               | abstraction levels and the results will be all across the
               | board. This is also why ARC-AGI, while dives into that,
               | cannot really solve the problem of testing "AI", let
               | alone "AGI": We as humans are currently unable to
               | properly test intelligence in any meaningful way. All the
               | tests we have are mere glimpses into it and often
               | (complex) multivariable + multi(abstraction)layered tests
               | and dealing with the results, consequently, a total mess,
               | even if we throw big formulas at it.
        
       | Tiberium wrote:
       | Wanted to try with o1 and o1-mini but looks like there's no code
       | available, although I guess I could just ask 3.5 Sonnet/o1 to
       | make the evaluation suite ;)
        
         | jerpint wrote:
          | Author here: all the code to reproduce this is on the
          | Hugging Face space here [1]
         | 
         | https://huggingface.co/spaces/jerpint/advent24-llm/tree/main
        
           | Tiberium wrote:
           | Thanks, I'll check with o1-mini and o1 and update this
           | comment :) Also, the code has some small errors, although
           | those can be easily fixed.
        
       | zaptheimpaler wrote:
        | I'm adjacent to some people who do AoC competitively, and it's
        | clear many of the top 10 and maybe half of the top 100 this year
        | were heavily LLM-assisted or wholly done by LLMs in a loop. They
        | won first place on many days. It was disappointing to the
        | community that people cheated and went against the community's
        | wishes, but it's clear LLMs can do much better than described
        | here.
        
         | davidclark wrote:
         | I'd like the same article topic but from the person who did Day
         | 1 pt1 in 4s and pt2 in 5s (9s total time).
        
         | rhubarbtree wrote:
         | No, that means it's clear that LLM-assisted coding can do
         | better than described here. Which implies humans are adding a
         | lot.
        
           | NitpickLawyer wrote:
           | To be fair, there's probably not a lot that "humans" add when
           | the solution is solved in 4 and 5 seconds respectively.
            | That's clearly 100% automated. Most humans can't even read
            | the problem in that timeframe.
           | 
           | A better implication would be that proper use of LLMs implies
           | more than OP did here (proper prompting, looping the answer
           | w/ a code check, etc.)
        
       | bawolff wrote:
        | I'm a bit of an AI skeptic, and I think I had the opposite
        | reaction to the author. Even though this is far from welcoming
        | our AI overlords, I am surprised that they are this good.
        
       | cheevly wrote:
       | Genuinely terrible prompt. Not only in structure, but also
       | contains grammatical errors. I'm confident you could at least
       | double their score if you improve your prompting significantly.
        
         | rhubarbtree wrote:
         | This isn't very constructive criticism. You could improve it
         | through a number of levels:
         | 
         | 1. You could have pointed out the grammatical errors and
         | explained why they matter.
         | 
         | 2. You could have pointed out the structural errors, and
         | explained what structure should have been used.
         | 
         | 3. You could have written a new prompt.
         | 
         | 4. You could have re-run the experiment with the new prompt.
         | 
         | Otherwise what you say is just unsubstantiated criticism.
        
       | demirbey05 wrote:
        | o1 is not included; I think each benchmark should include o1 and
        | reasoning models. The o-series has really changed the game.
        
       | airstrike wrote:
       | I like the idea, but I feel like the execution left a bit to be
       | desired.
       | 
       | My gut tells me you can get much better results from the models
       | with better prompting. The whole "You are solving the 2024 advent
       | of code challenge." form of prompting is just adding noise with
       | no real value. Based on my empirical experience, that likely
       | hurts performance instead of helping.
       | 
       | The time limit feels arbitrary and adds nothing to the benchmark.
       | I don't understand why you wouldn't include o1 in the list of
       | models.
       | 
       | There's just a lot here that doesn't feel very scientific about
       | this analysis...
        
       | Recursing wrote:
        | The article and comments here _really_ underestimate the current
        | state of LLMs (or overestimate how hard AoC 2024 was).
       | 
       | Here's a much better analysis from someone who got 45 stars using
       | LLMs.
       | https://www.reddit.com/r/adventofcode/comments/1hnk1c5/resul...
       | 
       | All the top 5 players on the final leaderboard
       | https://adventofcode.com/2024/leaderboard used LLMs for most of
       | their solutions.
       | 
       | LLMs can solve all days except 12, 15, 17, 21, and 24
        
         | nindalf wrote:
         | This needs to be higher. Not only because it shows that LLMs
         | can do better than what OP says, but also that there's some
         | difference in how they're used. Clearly the person on reddit
         | was able to use them more effectively.
        
           | danielbln wrote:
           | That's the crux with most discussions on here regarding LLMs.
           | 
           | "I used gpt-4o with zero shot prompts and it failed
           | terribly!"
           | 
           | "I used Claude/o1/o3, I fed various bits of information into
           | the context and carefully guided the LLM"
           | 
           | Those two approaches (there are many more) would lead to very
           | different results, yet all we read here are comments giving
           | their opinions on "LLMs", as if there is only one LLM and one
           | way of working with them.
        
             | d0mine wrote:
             | This reminds me of https://en.wikipedia.org/wiki/Stone_Soup
             | 
             | stone->your ingredients->soup
             | 
             | LLM->your prompts->solution
        
         | jonathan_landy wrote:
          | Seems to depend strongly on the model, perhaps. The Reddit post
          | says:
         | 
         | "Some other models tested that just didn't work: gpt-4o,
         | gpt-o1, qwen qwq."
         | 
         | Notably gpt-4o was used in the post linked here.
        
           | jebarker wrote:
           | I don't know what they were doing, but I tried o1 with many
           | problems after I solved them already and it did great. No
           | special prompting, just "solve this problem with a python
           | program".
        
         | FrustratedMonky wrote:
         | Wonder if different goals.
         | 
          | If the top 5 people on the leaderboard 'used LLMs', that means
          | they used an LLM as a helping tool.
          | 
          | But the article, I think, asks: what if the LLM played by
          | itself? Just paste the questions in, and see if it can do it
          | all on its own?
         | 
         | Perhaps that is the difference, different goals.
        
         | oytis wrote:
        | Why does neither of the articles provide the actual raw chat
        | logs? It's like the recent article about a non-released LLM
        | solving non-public tasks, which everyone is supposed to be
        | impressed about.
        
         | michaelt wrote:
         | _> All the top 5 players on the final leaderboard [...] used
         | LLMs for most of their solutions._
         | 
          | Note that the leaderboard points are given out based on the
          | time taken to produce a correct answer: 100 points for the
          | first to submit an answer, 99 for the second, 98 for third,
          | and so on. No points if you're not in the first 100.
         | 
         | So if an LLM fails on 5 problems, but for the other 20 it can
         | take a 600-word problem statement and solve it in 12 seconds?
         | It'll rack up loads of points.
         | 
         | Whereas the best human programmer you know might be able to
         | solve all 25 problems taking around 15 minutes per problem. On
         | most days they would have zero points, as all 100 points
         | finishes go to LLMs.
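          | 
          | In code, the per-star scoring rule described above amounts to:
          | 
          |     def points(rank: int) -> int:
          |         # 100 points for 1st place, 99 for 2nd, ...,
          |         # 1 for 100th, and nothing after that.
          |         return max(0, 101 - rank)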
        
         | jerpint wrote:
         | Author here: the point of the article was only to evaluate
          | zero-shot capabilities. I'm certain that had I used LLMs I
          | would have gotten more stars on AoC (I got 41/50 without).
          | Because I chose to solve this year without LLMs, I was simply
          | curious to see how the opposite setup would do, using basically
          | zero human intervention to solve AoC.
         | 
          | That said, if I cared only about producing the best results, I
          | would 100% pair up with an LLM.
        
       | guerrilla wrote:
       | How far can LLMs get in Project Euler without messing up?
        
       | antirez wrote:
       | The most important thing is missing from this post: the
       | performance of Jerpint+Claude. It's not a VS game.
        
       ___________________________________________________________________
       (page generated 2024-12-31 23:01 UTC)