[HN Gopher] Performance of LLMs on Advent of Code 2024
___________________________________________________________________
Performance of LLMs on Advent of Code 2024
Author : jerpint
Score : 120 points
Date : 2024-12-30 18:09 UTC (1 day ago)
(HTM) web link (www.jerpint.io)
(TXT) w3m dump (www.jerpint.io)
| jebarker wrote:
| I'd be interested to know how o1 compares. On many days after I
| completed the AoC puzzles I was putting the questions into o1 and
| it seemed to do really well.
| qsort wrote:
| According to this thread:
| https://old.reddit.com/r/adventofcode/comments/1hnk1c5/resul...
|
| o1 got 20 out of 25 (or 19 out of 24, depending on how you want
| to count). Unclear experimental setup (it's not obvious how
| much it was prompted), but it seems to check out with
| leaderboard times, where problems solvable with LLMs had solve
| times flat-out impossible for humans.
|
| An agent-type setup using Claude got 14 out of 25 (or, again,
| 13/24).
|
| https://github.com/JasonSteving99/agent-of-code/tree/main
| joseneca wrote:
| I have to wonder why o1 didn't work. That post is
| unfortunately light on details that seem pretty important.
| jebarker wrote:
| I was thinking 20/25 is pretty great! At least 5 of the
| problems were pretty tricky and easy to fail due to small
| errors.
| johnea wrote:
| LLMs are writing code for the coming of the lil' baby jesus?
| valbaca wrote:
| adventofcode.com
| grumple wrote:
| I'm both surprised and not surprised. I'm surprised because these
| sorts of problems, with very clear prompts and fairly clear
| algorithmic requirements, are exactly what I'd expect LLMs to
| perform best at.
|
| But I'm not surprised because I've seen them fail on many
| problems even with lots of prompt engineering and test cases.
| yunwal wrote:
| With no prompt engineering this seems like a weird comparison. I
| wouldn't expect anyone to be able to one-shot most of the AOC
| problems. A fair fight would at least use something like cursor's
| agent on YOLO mode that can review a command's output, add
| logs, etc.
| ben_w wrote:
| If you immediately know the candlelight is fire, then the meal
| was cooked a long time ago.
|
| And so it is with success of LLMs in one-shot challenges and
| any job that depends on such challenges: cooked a long time
| ago.
| NitpickLawyer wrote:
| Indeed. (wild to see an SG-1 quote in the wild!)
| fumeux_fume wrote:
| I certainly don't think it's weird to measure one-shot
| performance on the AOC. Sure, more could be done. More can
| always be done, but this is still useful and interesting.
| segmondy wrote:
| I did zero-shot 27 solutions successfully with the local model
| code-qwen2.5-32b. I think adding Sonnet or the latest Gemini
| will probably get me to 40.
| rhubarbtree wrote:
| Seems reasonable to me. More people use ChatGPT than cursor, so
| use it like ChatGPT.
|
| For most coding problems, people won't magically know if the
| code is "correct" or not, so any one-shot answer that is wrong
| could be a real hindrance.
|
| I don't have time to prompt engineer for every bit of code. I
| need tools that accelerate my work, not tools that absorb my
| time.
| unclad5968 wrote:
| Half the time I try to ask Gemini questions about the C++ std
| library, it fabricates non-existent types and functions. I'm
| honestly impressed it was able to solve any of the AoC problems.
| devjab wrote:
| I've had a side job as an external examiner for CS students for
| almost a decade now. While LLMs are generally terrible at
| programming (in my experience) they excel at passing finals. If
| I were to guess it's likely a combination of the relatively
| simple (or perhaps isolated is a better word) tasks coupled
| with how many times similar problems have been solved before in
| the available training data. Somewhat ironically the easiest
| way to spot students who "cheat" is when the code is great.
| Being an external examiner, meaning that I have a full-time job
| in software development, I personally find it sort of silly when
| students aren't allowed to use a calculator. I guess changing
| the way you teach and test is a massive undertaking though, so
| right now we just pretend LLMs aren't being used by basically
| every student. Luckily I'm not part of the "spot the cheater"
| process, so I can just judge them based on how well they can
| explain their code.
|
| Anyway, I'm not at all surprised that they can handle AoC. If
| anything I would worry about whether AoC will still be a fun
| project to author when many people solve it with AI. It sure
| won't be fun
| to curate any form of leaderboard.
| uludag wrote:
| Leetcode/AoC-like problems are probably the easiest class of
| software related tasks LLMs can do. Using the correct library,
| the correct way, at the correct version, especially if the
| library has some level of churn and isn't extremely common, can
| be a harder task for LLMs than the hardest AoC problems.
| upghost wrote:
| After looking at the charts I was like "Whoa, damn, that Jerpint
| model seems amazing. Where do I get that??" I spent some time
| trying to find it on Huggingface before I realized...
| senordevnyc wrote:
| lol, I did the same thing
| j45 wrote:
| Me too.
|
| The author could make a model on huggingface routing requests
| to him. Latency might vary.
| tbagman wrote:
| The good news is that training jerpint was probably cheaper
| than training the latest GPT models...
| foldl2022 wrote:
| Just open-weights it. We need this. :)
| moffkalast wrote:
| At first I was like "What is this jerpint model that's beating
| the competition so soundly?" then it hit me, lol.
|
| Anyhow this is like night and day compared to last year, and it's
| impressive that Sonnet is now apparently 50% as good as a
| professional human at this sort of thing.
| zkry wrote:
| I don't think comparing star counts would be a good measure
| though, as with AoC 90% of the effort and difficulty goes into
| the harder problems towards the end, and it was the easier
| problems at the beginning where the bulk of Sonnet's stars came
| from.
| moffkalast wrote:
| Ah yeah that's true, the difficulty curve is not very linear.
| BugsJustFindMe wrote:
| I think this is a terrible analysis with a weak conclusion.
|
| There's zero mention of how long it took the LLM to write the
| code vs the human. You have a 300 second runtime limit, but what
| was your coding time limit? The machine spat out code in, what, a
| few seconds? And how long did your solutions take to write?
|
| Advent of code problems take me longer to just _read_ than it
| takes an LLM to have a proposed solution ready for evaluation.
|
| > _they didn't perform nearly as well as I'd expect_
|
| Is this a joke, though? A machine takes a problem description
| written as floridly hyperventilated as advent problems are, and,
| without any opportunity for automated reanalysis, it understands
| the exact problem domain, it understands exactly what's being
| asked, correctly models the solution, and spits out a correct
| single-shot solution on 20 of them in no time flat, often with
| substantially better running time than your own solutions, and
| that's disappointing?
|
| > _a lot of the submissions had timeout errors, which means that
| their solutions might work if asked more explicitly for efficient
| solutions. However the models should know very well what AoC
| solutions entail_
|
| You made up an arbitrary runtime limit and then kept that limit a
| secret, and you were surprised when the solutions didn't adhere
| to the secret limit?
|
| > _Finally, some of the submissions raised some Exceptions, which
| would likely be fixed with a human reviewing this code and asking
| for changes._
|
| How many of your solutions got the correct answer on the first
| try without going back and fixing something?
| keyle wrote:
| You raise some good points about the "total of hours spent" but
| I guess you don't consider training time included. Also there
| is no need to quote the author's post and have a go at him
| personally. There are better ways to get your point across by
| arguing the point made and not the sentences written.
| BugsJustFindMe wrote:
| > _I guess you don't consider training time included_
|
| In the same way that I don't consider the time an author
| spends writing a book when saying how long it took me to read
| the book. OP lost zero time training ChatGPT.
|
| Or do you mean how much time OP spent training themself?
| Because that's a whole new can of worms. How many years
| should we add to OP's development times because probably they
| learned how to write code before this exercise?
| keyle wrote:
| I was considering the energy, money, R&D and time spent to
| train those models, which is colossal compared to the
| developer, and factoring that in. Then they don't look so
| impressive. For reference, I'm not anti-LLM; I use Claude
| and ChatGPT every day. I'm just raising those points if you
| want to consider all the facts.
| lgas wrote:
| That cost is amortized over all requests made against all
| copies of the model though. Fortunately we don't have to
| re-train the model every time we want to perform
| inference.
| bawolff wrote:
| That's a weird comparison.
|
| Would you also count the number of hours it took for
| someone to write the OS, the editor, etc? The programmer
| wouldn't be effective without those things.
| menaerus wrote:
| And now also consider the amount of time, school,
| university, practice and $$$ it took to train a software
| engineer. Repeat for the population of presumably ~30
| million software engineers in the world. Time-wise it
| doesn't scale at all when contrasted to new LLM releases
| that occur ~once a year per company. $$$-wise is also
| debatable.
| segmondy wrote:
| I did experiment with a local LLM running on my machine; most
| solutions were generated within 30-60 seconds. My overhead was
| really copying part 1 of the problem, generating the code,
| copying and pasting the code, running it, copying the input
| data, running it, entering the result, then repeating for part
| 2. For most of them that was about 5 minutes from start to
| finish. If I automated the process, it would probably be 1
| minute or less for the problems it could solve.
|
| Not the OP, but I was able to zero-shot 27 solutions correctly
| without any back and forth, and 5 more with a little bit of
| back and forth, using local models.
| mvdtnz wrote:
| > it understands the exact problem domain, it understands
| exactly what's being asked, correctly models the solution
|
| It does not "understand" or "model" shit. Good grief you AI
| shills need to take a breath.
| BugsJustFindMe wrote:
| I'll use a different word when you demonstrate that humans
| aren't also stochastic parrots constantly hallucinating
| reality. Until then, good enough for one is good enough for
| the other.
| bryan0 wrote:
| Since you did not give the models a chance to test their code and
| correct any mistakes, I think a more accurate comparison would be
| if you compared them against you submitting answers without
| testing (or even running!) your code first.
| angarg12 wrote:
| People keep evaluating LLMs on essentially zero-shotting a
| perfect solution to a coding problem.
|
| Once we use tools to easily iterate on code (e.g. generate,
| compile, test, use the outcome to refine the prompt) we will
| turbocharge LLMs' coding abilities.
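|
| A minimal sketch of such a loop, assuming a hypothetical
| generate_code() helper that wraps whatever LLM API you use (not
| any specific vendor's SDK):
|
|     import subprocess, tempfile
|
|     def solve(problem: str, attempts: int = 3) -> str | None:
|         feedback = ""
|         for _ in range(attempts):
|             # generate_code() is a placeholder for your LLM call
|             code = generate_code(problem + feedback)
|             with tempfile.NamedTemporaryFile(
|                     "w", suffix=".py", delete=False) as f:
|                 f.write(code)
|             # run the candidate; timeout handling omitted for brevity
|             result = subprocess.run(["python", f.name],
|                                     capture_output=True, text=True,
|                                     timeout=300)
|             if result.returncode == 0:
|                 return result.stdout  # candidate answer
|             # feed the error back into the next prompt
|             feedback = ("\nYour last attempt failed with:\n"
|                         + result.stderr)
|         return None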
| rhubarbtree wrote:
| This smacks of "moving the goalposts" just as much as the other
| side is accused of when unimpressed by advances.
|
| It's a reasonable test.
| xen0 wrote:
| I'm a bit curious to see how close the solutions were.
|
| When it couldn't solve it within the constraints, was it
| 'broadly correct' with some bugs? Was it on the right track or
| completely off?
| jerpint wrote:
| The code they generated and the outputs (including traceback
| errors) are all included; you can view them on the post itself
| or from the hf space:
|
| https://huggingface.co/spaces/jerpint/advent24-llm
| xen0 wrote:
| Huh, don't know how I missed that. I was hoping for them to
| have done some analysis themselves.
|
| Glancing through the code Claude (his best performer) wrote
| for some of the 'easier' problems, it seems broadly correct
| (if a little strange and overwrought).
|
| But my phone is not a good platform for me to do this kind
| of reading.
| jerpint wrote:
| I should specify that unfortunately I didn't store the
| raw output from the LLM, just the parsed code snippets
| they produced, but the code to reproduce it is all there.
| bongodongobob wrote:
| I think a major mistake was giving parts 1 and 2 all at once. I
| had great results having it solve 1, then 2. I think I got 4o to
| one-shot parts 1 then 2 up to about day 12. It started to
| struggle a bit after that and I got bored with it at day 18. It
| did way better than I expected, I don't understand why the author
| is disappointed. This shit is magic.
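|
| A rough sketch of that sequential approach, assuming a
| hypothetical chat() helper around whatever chat-completion API
| you use:
|
|     def solve_day(part1_text: str, part2_text: str):
|         history = [{"role": "user", "content": part1_text}]
|         code1 = chat(history)  # model's solution for part 1
|         history += [{"role": "assistant", "content": code1},
|                     {"role": "user", "content": part2_text}]
|         code2 = chat(history)  # part 2, with part 1 as context
|         return code1, code2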
| 101008 wrote:
| This kind of mirrors my experience with LLMs. If I ask them non-
| original problems (make this API, write this test, update this
| function that must have been written hundreds of times by
| developers around the world, etc.), it works very well. Some
| minor changes here and there, but it saves time.
|
| When I ask them to code things that they have never heard of (I
| am working on an online sport game), it fails catastrophically.
| The LLM should know the sport, and what I ask is pretty clear to
| anyone who understands the game (I tested against actual people
| and it was obvious what to expect), but the LLM failed miserably.
| Even worse when I ask them to write some designs in CSS for the
| game. It seems if you take them outside the 3-column layout or
| Bootstrap or the overused landing page, LLMs fail miserably.
|
| It works very well for the known cases, but as soon as you want
| them to do something original, they just can't.
| jppope wrote:
| completely agree with this. The problem I'm starting to run
| into is that I do a lot of things that have been done before so
| I get really efficient at doing that stuff with LLMs... and now
| I'm slow to solve the stuff I used to be really fast at.
| anonzzzies wrote:
| I ask LLMs things the way I would spec them out for my team,
| which is to say, when I write a task, I would not include things
| about the sport; I would explain the logic etc. one would need
| to write. No one needs to know what it is about in the abstract,
| as I see the same issue with humans; they get confused when
| using real-world things to see the connection with what they
| need to do. I mean I don't flesh it out to the extent that I
| might as well write the code, but enough that you don't need to
| know anything about the underlying subject.
|
| We are doing a large EU healthcare project and things differ
| per country obviously; if I assumed people had a modicum of
| world knowledge, or the interest to look it up, nothing would
| get done. It is easier to deliver Excel sheets with the proper
| wording and formulas and leave out what it is even for.
|
| Works fine with LLMs.
|
| Disclaimer: the people in my company know everything about the
| subject matter; the people (and LLMs) that implement it are
| usually not ours: we just manage them, and in my experience,
| programmers at big companies are borderline worthless. Hence,
| in the past 35 years, I have taken to writing tasks as
| concretely as I can; we found that otherwise we either get
| garbage commits or tons of meetings with questions and then
| garbage commits.
| And now that comes in handy as this works much better on LLMs
| too.
| steve_adams_86 wrote:
| This has made me wonder just how few developers have been doing
| even remotely interesting things. I encounter a lot of
| situations where the LLMs catastrophically fail, but I don't
| think it's for lack of trying to guide it through to solutions
| properly or sufficiently. I give as much or as little context
| as I can think to, give partial solutions or zero solutions,
| try different LLMs with the same prompts, etc.
|
| Maybe we were mostly actually doing the exact same stuff and
| these things are trained on a whole lot of the same. Like with
| react components, these things are amazing at pumping out the
| most basic varieties of them.
| uludag wrote:
| My thoughts exactly. I honestly think that the majority of
| SWE work is completely unoriginal.
|
| But then there are those idiosyncrasies that every company
| has, a finicky build system, strange hacks to get certain
| things to work, extensive webs of tribal knowledge, that will
| be the demise of any LLM SWE application beyond being a
| tool for skilled professionals.
| bamboozled wrote:
| The whole world is wrinkly, that's the way it is, everyone,
| and everything is slightly different.
| bugglebeetle wrote:
| > This has made me wonder just how few developers have been
| doing even remotely interesting things.
|
| There's not much to ponder on here as the majority of human
| thought and action is unoriginal and irrelevant, so software
| development is not some special exception to this. This
| doesn't mean it lacks meaning or value for those doing it,
| since that can be self-generated, extrinsically motivated, or
| both. But, by definition, most things done are within the
| boundaries of a normal distribution and not exceptional,
| transcendent, or sublime. You can strive for those latter
| categories, but it's a path strewn with much failure, misery,
| and quite questionable rewards, even where it is achieved.
| Most prefer more achievable comforts or security to it, or,
| at most, dull-minded imitations thereof, such as amassing
| great wealth.
| josephg wrote:
| Yeah; I've long argued that software should be divided into
| two separate professions.
|
| If you think about it, lots of professions are already
| divided in two. You often have profession of inventors /
| designers and a profession of creators. For example:
| Electrical engineers vs Electricians. Architects / civil
| engineers and construction workers. Chefs and line cooks.
| Composers and session musicians. And so on.
|
| The two professions always need somewhat different skill
| sets. For example, electricians need business sense and to be
| personable - while electrical engineers need to understand
| how to design circuits from scratch. Session musicians need a
| fantastic memory for music, and to be able to perform in lots
| of different musical styles. Composers need a deeper
| understanding of how to write compelling music - but usually
| only in their preferred style.
|
| I think programming should be divided in the same way. It
| sort of happens already: There's people who invent novel
| database storage engines, work on programming language
| design, write schedulers for operating systems and so on. And
| then there's people who do consulting and write almost all
| the software that businesses actually need to function. Eg,
| people who talk to actual clients and figure out their needs.
|
| The first group invents React. The second group uses React to
| build online stores.
|
| We sort of have these two professions already. We just don't
| have language to separate them.
|
| The line is a bit fuzzy, but you can tell they're two
| professions because each needs a slightly different skill
| set. If you're writing a novel full-text search engine, you
| need to be really good at data structures & algorithms, but
| you don't really need to be good at working with clients. And
| if you work in frontend react all day, you don't really need
| to understand how a B-Tree works or understand the difference
| between LR and recursive descent parsing. (And it doesn't
| make much sense to ask these sorts of programmers "leetcode
| problems" in job interviews).
|
| ChatGPT is good at application programming. Or really,
| anything that it's seen lots of times before. But the more
| novel a problem - the more it's actual "software engineering"
| - the more it struggles to make working software. "Write a
| React component that does X" - easy. "Implement a B-tree with
| this weird quirk" - hard. It'll struggle through but the
| result will be buggy. "Implement this new algorithm in a
| paper I just read" - it's more or less completely useless at
| that.
| mwcampbell wrote:
| Steve Yegge wrote about this ~20 years ago, with a heaping
| helping of his usual snark directed at the OO fads of the
| time:
| https://sites.google.com/site/steveyegge2/age-racecar-driver
| lm28469 wrote:
| Imho the hordes of code monkeys see an incredible improvement
| in their productivity and keep boasting about it online, which
| is inflating the idea that LLMs are close to AGI and
| revolutionizing the industry.
|
| Meanwhile people who work on things that are even slightly
| niche or complex (i.e. things which aren't answered in 20
| different Stack Overflow threads, or things which require a
| few field "experts" to meet around a table and think hard for
| a few hours) don't understand the enthusiasm at all.
|
| I'm working on fairly simple things, but in a context/industry
| that isn't that common, and I have to spoon-feed LLMs to get
| anywhere; most of the time it doesn't even work. The next few
| years will be interesting; it might play out as it did for
| plane pilots over-relying on autopilots to the point of
| losing key skills.
| SkyBelow wrote:
| Is it that they aren't doing interesting things, or is it
| that developers doing interesting things tend to only ask the
| AI to handle the boring parts? I find I use AI quite a bit,
| but it is integrated with my own thinking and problem solving
| every step of the way. I see how others are using it and they
| often seem to be asking for full solutions, whereas I'm
| using it more like a rubber duck debugger stuffed with stack
| overflow feedback and a knowledge of any common APIs. It
| mostly helps by answering things that would have otherwise
| been web searches, but it doesn't create any solutions from
| scratch.
| PaulRobinson wrote:
| > as soon as you want them to do something original, they just
| can't.
|
| An LLM is, famously, a "stochastic parrot". It has no
| reasoning. It has no creativity. It has no basis or mechanism
| for true originality: it is regurgitating what it has seen
| before, based on probabilities, nothing more. It looks more
| impressive, but that's all that is behind the curtain.
|
| It surprises me that so many people expect an LLM to do
| something "clever" or "original". They're compression machines
| for knowledge, a tool for summarisation, rephrasing, not for
| creation.
|
| Why is this not that widely known and accepted across the
| industry yet, and why do so many people have such high
| expectations?
| redlock wrote:
| Because it isn't true according to Hinton, Sutskever, and
| other leading AI researchers.
| FroshKiller wrote:
| Most foxes are herbivores according to the fellow guarding
| my henhouse.
| redlock wrote:
| They do make compelling arguments if you listen to them.
| Also, from my coding with LLMs using Cursor, they
| obviously understand the code and my request more often
| than not. Mechanistic interpretability research shows
| evidence of concepts being represented within the layers.
| Golden Gate Claude is evidence of this:
| https://www.anthropic.com/news/golden-gate-claude
|
| To me this proves that LLMs learn concepts and
| multilayered representations within their networks, not
| just some dumb statistical inference. Even a famous LLM
| skeptic like Francois Chollet doesn't invoke stochastic
| parrots anymore and has moved on to arguing that they
| don't generalize well and are just memorizing.
|
| With GPT-2 and 3 I was of the same opinion that they
| seemed like just sophisticated stochastic parrots, but
| current LLMs are a different class from early GPTs. Now
| that o3 has beaten a memorization-resistant benchmark
| like ARC-AGI, I think we can confidently move on from
| this stochastic parrots notion.
|
| (And before you argue that o3 is not an LLM, here is an
| OpenAI researcher stating that it is an LLM:
| https://x.com/__nmca__/status/1870170101091008860?s=46&t=eTe... )
| pixelfarmer wrote:
| Neural networks are generalizing things as part of their
| optimization scheme. The current approach is just to dump
| many layered neural networks (at the core) as in "deep
| learning" to solve the problems, but the networks are too
| regular, too "primitive" of sorts. What is needed are
| network topologies that strongly support this
| generalization, the creation of "meta abstraction
| levels", otherwise it will get nowhere.
|
| Biological networks of more intelligent species contain a
| few billion neurons and upwards from that while even the
| big LLMs are somewhere in the millions of "equivalent" at
| best. So, bad topology + much less "neurons" and the
| resulting capabilities shouldn't be too surprising. Plus
| it is clear that AGI is nowhere close, because one result
| of AGI is a proper understanding of "I". Crows have an
| understanding of "I", for example.
|
| And that is where these "meta abstraction levels" come
| in: There are many needed to eventually reach the stage
| of "I". This can also be used to test how well neural
| networks perform, how far they can abstract things for
| real, how many levels of generalization are handled by
| it. But therein lies a problem: Let 2 persons specify
| abstraction levels and the results will be all across the
| board. This is also why ARC-AGI, while it dives into that,
| cannot really solve the problem of testing "AI", let
| alone "AGI": We as humans are currently unable to
| properly test intelligence in any meaningful way. All the
| tests we have are mere glimpses into it and often
| (complex) multivariable + multi(abstraction)layered tests
| and dealing with the results is, consequently, a total mess,
| even if we throw big formulas at it.
| Tiberium wrote:
| Wanted to try with o1 and o1-mini, but it looks like there's no
| code available, although I guess I could just ask 3.5 Sonnet/o1
| to make the evaluation suite ;)
| jerpint wrote:
| Author here: all the code to reproduce this is actually on
| the huggingface space here [1]
|
| https://huggingface.co/spaces/jerpint/advent24-llm/tree/main
| Tiberium wrote:
| Thanks, I'll check with o1-mini and o1 and update this
| comment :) Also, the code has some small errors, although
| those can be easily fixed.
| zaptheimpaler wrote:
| I'm adjacent to some people who do AoC competitively and it's
| clear many of the top 10 and maybe 1/2 of the top 100 this year
| were heavily LLM assisted or wholly done by LLMs in a loop. They
| won first place on many days. It was disappointing to the
| community that people cheated and went against the community's
| wishes, but it's clear LLMs can do much better than described here.
| davidclark wrote:
| I'd like the same article topic but from the person who did Day
| 1 pt1 in 4s and pt2 in 5s (9s total time).
| rhubarbtree wrote:
| No, that means it's clear that LLM-assisted coding can do
| better than described here. Which implies humans are adding a
| lot.
| NitpickLawyer wrote:
| To be fair, there's probably not a lot that "humans" add when
| the solution is solved in 4 and 5 seconds respectively.
| That's clearly 100% automated. Most humans can't even read
| the problem in that timeframes.
|
| A better implication would be that proper use of LLMs implies
| more than OP did here (proper prompting, looping the answer
| w/ a code check, etc.)
| bawolff wrote:
| I'm a bit of an AI skeptic, and I think I had the opposite
| reaction to the author. Even though this is far from welcoming
| our AI overlords, I am surprised that they are this good.
| cheevly wrote:
| Genuinely terrible prompt. Not only in structure, but also
| contains grammatical errors. I'm confident you could at least
| double their score if you improve your prompting significantly.
| rhubarbtree wrote:
| This isn't very constructive criticism. You could have improved
| it in a number of ways:
|
| 1. You could have pointed out the grammatical errors and
| explained why they matter.
|
| 2. You could have pointed out the structural errors, and
| explained what structure should have been used.
|
| 3. You could have written a new prompt.
|
| 4. You could have re-run the experiment with the new prompt.
|
| Otherwise what you say is just unsubstantiated criticism.
| demirbey05 wrote:
| o1 is not included; I think each benchmark should include o1 and
| reasoning models. The o-series has really changed the game.
| airstrike wrote:
| I like the idea, but I feel like the execution left a bit to be
| desired.
|
| My gut tells me you can get much better results from the models
| with better prompting. The whole "You are solving the 2024 advent
| of code challenge." form of prompting is just adding noise with
| no real value. Based on my empirical experience, that likely
| hurts performance instead of helping.
|
| The time limit feels arbitrary and adds nothing to the benchmark.
| I don't understand why you wouldn't include o1 in the list of
| models.
|
| There's just a lot here that doesn't feel very scientific about
| this analysis...
| Recursing wrote:
| The article and comments here _really_ underestimate the current
| state of LLMs (or overestimate how hard AoC 2024 was)
|
| Here's a much better analysis from someone who got 45 stars using
| LLMs.
| https://www.reddit.com/r/adventofcode/comments/1hnk1c5/resul...
|
| All the top 5 players on the final leaderboard
| https://adventofcode.com/2024/leaderboard used LLMs for most of
| their solutions.
|
| LLMs can solve all days except 12, 15, 17, 21, and 24
| nindalf wrote:
| This needs to be higher. Not only because it shows that LLMs
| can do better than what OP says, but also that there's some
| difference in how they're used. Clearly the person on reddit
| was able to use them more effectively.
| danielbln wrote:
| That's the crux with most discussions on here regarding LLMs.
|
| "I used gpt-4o with zero shot prompts and it failed
| terribly!"
|
| "I used Claude/o1/o3, I fed various bits of information into
| the context and carefully guided the LLM"
|
| Those two approaches (there are many more) would lead to very
| different results, yet all we read here are comments giving
| their opinions on "LLMs", as if there is only one LLM and one
| way of working with them.
| d0mine wrote:
| This reminds me of https://en.wikipedia.org/wiki/Stone_Soup
|
| stone->your ingredients->soup
|
| LLM->your prompts->solution
| jonathan_landy wrote:
| Seems to depend strongly on the model, perhaps. The Reddit post
| says
|
| "Some other models tested that just didn't work: gpt-4o,
| gpt-o1, qwen qwq."
|
| Notably gpt-4o was used in the post linked here.
| jebarker wrote:
| I don't know what they were doing, but I tried o1 with many
| problems after I solved them already and it did great. No
| special prompting, just "solve this problem with a python
| program".
| FrustratedMonky wrote:
| Wonder if different goals.
|
| The top 5 people on the leaderboard 'used LLMs', meaning they
| used an LLM as a helping tool.
|
| But the article, I think, asks what happens if the LLM plays by
| itself: just paste the questions in, and see if it can do it
| all on its own?
|
| Perhaps that is the difference, different goals.
| oytis wrote:
| Why does neither of the articles provide the actual raw chat
| logs? It's like a recent article about a non-released LLM
| solving non-public tasks which everyone is supposed to be
| impressed about.
| michaelt wrote:
| _> All the top 5 players on the final leaderboard [...] used
| LLMs for most of their solutions._
|
| Note that the leaderboard points are given out on the time
| taken to produce a correct answer. 100 points for the first to
| submit an answer, 99 for the second, 98 for third and so on. No
| points if you're not in the first 100.
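|
| In other words, roughly (a sketch of the scoring rule, not the
| site's actual implementation):
|
|     def points(rank: int) -> int:
|         # rank 1 earns 100, rank 2 earns 99, ..., rank 100 earns 1
|         return max(0, 101 - rank)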
|
| So if an LLM fails on 5 problems, but for the other 20 it can
| take a 600-word problem statement and solve it in 12 seconds?
| It'll rack up loads of points.
|
| Whereas the best human programmer you know might be able to
| solve all 25 problems taking around 15 minutes per problem. On
| most days they would have zero points, as all 100 point-scoring
| finishes go to LLMs.
| jerpint wrote:
| Author here: the point of the article was only to evaluate
| zero-shot capabilities. I'm certain that had I used LLMs I
| would have definitely gotten more stars on AoC (got 41/50
| without). Because I chose to solve this year without LLMs, I
| was simply curious to see how the opposite setup would do,
| using basically zero human intervention to solve AoC.
|
| That said, if I cared only to produce the best results I would
| 100% pair up with an LLM
| guerrilla wrote:
| How far can LLMs get in Project Euler without messing up?
| antirez wrote:
| The most important thing is missing from this post: the
| performance of Jerpint+Claude. It's not a VS game.
___________________________________________________________________
(page generated 2024-12-31 23:01 UTC)