[HN Gopher] My 2.5 year old laptop can write Space Invaders in J...
___________________________________________________________________
My 2.5 year old laptop can write Space Invaders in JavaScript now
(GLM-4.5 Air)
Author : simonw
Score : 417 points
Date : 2025-07-29 13:45 UTC (9 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| croes wrote:
| I bet the training data included enough Space Invaders clones in
| JS
| jplrssn wrote:
| I also wouldn't be surprised if labs were starting to mix in a
| few pelican SVGs into their training data.
| quantumHazer wrote:
| SVG benchmarking has been a thing since GPT-4, so probably all
| major labs are overfitting on some dataset of SVG images for
| sure
| diggan wrote:
| Even "accidentally", it makes sense that "SVGs of pelicans
| riding bikes" are now included in datasets used for
| training, as the idea has spread like wildfire on the
| internet, making it less useful as a simple benchmark.
|
| This is why I keep all my benchmarks private and don't share
| anything about them publicly: as soon as you write about them
| anywhere public, they stop being useful within a few months.
| toyg wrote:
| _> This is why I keep all my benchmarks private_
|
| This is also why, if I were an artist or anyone
| commercially relying on creative output of any kind, I
| wouldn't be posting _anything_ on the internet anymore,
| ever. The minute you make anything public, the engines will
| clone it to death and turn it into a commodity.
| __mharrison__ wrote:
| Somewhat defeats the purpose of being an artist, doesn't
| it?
| toyg wrote:
| Defeating the purpose of creating almost anything,
| really.
|
| AI is definitely breaking the whole "labor for money"
| architecture of our world.
| zhengyi13 wrote:
| Eeeehhhh.
|
| Maybe the thing to do is provide public, physical
| exhibits of your art in search of patronage.
| debugnik wrote:
| That makes it so much harder to show art to people and
| market yourself though.
|
| I considered experimenting with web DRM for art
| sites/portfolios, on the assumption that scrapers won't
| bother with the analog loophole (and dedicated art-style
| cloners would hopefully be disappointed by the quality),
| but gave up because of limited compatible devices for the
| strongest DRM levels, and HDCP being broken on those
| levels anyway. If the DRM technique caught on, it would
| take attackers at most a few bucks and a few hours, once, to
| bypass it, and I don't think users would truly understand
| that upfront.
| simonw wrote:
| I'll believe they are doing that when one of the models draws
| me an SVG that actually looks like a pelican.
| __mharrison__ wrote:
| Someone needs to craft a beautiful bike ridden by a
| pelican, throw in some SEO, and see how long it takes a
| model to replicate it.
|
| Simon probably wouldn't be happy about killing his multi-
| year evaluation metric though...
| simonw wrote:
| I would be delighted.
|
| My pelican on a bicycle benchmark is a long con. The goal
| is to finally get a good SVG of a pelican riding a
| bicycle, and if I can trick AI labs into investing
| significant effort in cheating on my benchmark then fine,
| that gets me my pelican!
| gchamonlive wrote:
| Which would make this disappointing if it were only good at
| cloning Space Invaders. But even if it can only reproduce the
| clones it has ever seen, it would still be an impressive feat.
|
| I just think we should stop to appreciate exactly how awesome
| language models are. They compress and correctly reproduce
| a lot of data with meaningful context between each token and
| the rest of the context window. It's still amazing, especially
| with smaller models like this, because even if it's reproducing
| a clone, you can still ask questions about it and it should
| perform reasonably well explaining to you what it does and how
| you can take it further to develop that clone.
| croes wrote:
| But that would still be copy and paste with extra steps.
|
| Like all these vibe-coded to-do apps, one of the most common
| starter problems in programming courses.
|
| It's great that an AI can do that but it could stall progress
| if we get limited to existing tools and programs.
| shermantanktop wrote:
| How about an SVG of 9.11 pelicans riding bicycles and counting
| the three Rs in "strawberry"?
| chickenzzzzu wrote:
| "2.5 year old laptop" is potentially the most useless way of
| describing a 64GB M2, as it could be confused with virtually any
| other configuration of laptop.
| OJFord wrote:
| I think the point is just that it doesn't require absolute
| cutting edge nor server hardware.
| jphoward wrote:
| No but 64 GB of unified memory provides almost as much GPU
| RAM capacity as two RTX 5090s (only less due to the unified
| nature) - top of the range GPUs - so it's a truly exceptional
| laptop in this regard.
| turnsout wrote:
| Except that it is not exceptional at all; it's an older-
| generation MacBook Pro with 64GB of RAM. There's nothing
| particularly unusual about it.
| jphoward wrote:
| 64 GB of RAM which is addressable by a GPU is exceptional
| for a laptop - this is not just system RAM.
| chickenzzzzu wrote:
| To emphasize this point further, at least with my
| efforts, it is not even possible to buy a 64GB M4 Pro
| right now. 32GB, 64GB, and 128GB are all sold out.
|
| We can say that 64GB addressable by a GPU is not
| exceptional when compared to 128GB and it still costs
| less than a month's pay for a FAANG engineer, but the
| fact that they aren't actually purchasable right now
| shows that it's not as easy as driving to Best Buy and
| grabbing one off the shelf.
| turnsout wrote:
| They're not sold out--Apple's configurator (and chip
| naming) is just confusing. The MacBook Pro with M4 Pro is
| only available in 24 or 48 GB configurations. To get 64
| or 128 GB, you need to upgrade to the M4 Max.
|
| If you're looking for the cheapest way into 64GB of unified
| memory, the Mac mini is available with an M4 Pro and 64GB
| at $1999.
|
| So, truly, not "exceptional" unless you consider the
| price to be exorbitant (it's not, as evidenced by the
| long useful life of an M-series Mac).
| chickenzzzzu wrote:
| thank you for providing that extra info! i agree that
| $2000-4000 is not an absolutely earth shattering price,
| but i still wonder what the benefit one receives is when
| they say "2.5 year old laptop" instead of "64GB M2
| laptop"
| turnsout wrote:
| I understand, but _that is not exceptional for a Mac
| laptop._ You could say all Apple Silicon Macs are
| exceptional, and I guess I agree in the context of the
| broader PC community. But I would not point at an
| individual MacBook Pro with 64 GB of RAM and say "whoa,
| that's exceptional." It's literally just a standard
| option when you buy the computer. It does bump the price
| pretty high, but the point of the MBP is to cater to
| higher-end workflows.
| tantalor wrote:
| It was also something he already had lying around. Did not
| need to buy something new to get new functionality.
| simonw wrote:
| The thing I find most notable here is that this is the same
| laptop I've used to run every open weights model since the
| original LLaMA.
|
| The models have got _so much better_ without me needing to
| upgrade my hardware.
| chickenzzzzu wrote:
| That's great! Why can't we say that instead?
|
| No need to overly quantize our headlines.
|
| "64GB M2 makes Space Invaders-- can be bought for under
| $xxxx"
| AlexeyBrin wrote:
| Most likely its training data included countless Space Invaders
| clones in various programming languages.
| quantumHazer wrote:
| and probably some of the synthetic data is generated copies of
| the games already in the dataset?
|
| I have this feeling with LLM-generated React frontends, they
| all look the same
| bayindirh wrote:
| Last time somebody asked for a "premium camera app for iOS",
| and the model (re)generated Halide.
|
| Models don't emit something they don't know. They remix and
| rewrite what they know. There's no invention, just recall...
| FeepingCreature wrote:
| True where trivial; where nontrivial, false.
|
| Trivially, humans don't emit something they don't know
| either. You don't spontaneously figure out Javascript from
| first principles, you put together your existing knowledge
| into new shapes.
|
| Nontrivially, LLMs can _absolutely_ produce code for
| entirely new requirements. I've seen them do it many
| times. Will it be put together from smaller fragments? Yes,
| this is called "experience" or if the fragments are small
| enough, "understanding".
| bayindirh wrote:
| Humans can observe ants and invent ant colony
| optimization. AIs can't.
|
| Humans can explore what they don't know. AIs can't.
| falcor84 wrote:
| What makes you categorically say that "AIs can't"?
|
| Based on my experience with present day AIs, I personally
| wouldn't be surprised at all if you showed Gemini
| 2.5 Pro a video of an insect colony and asked it "Take a
| look at the way they organize and see if that gives you
| inspiration for an optimization algorithm", it will spit
| something interesting out.
| sarchertech wrote:
| It will 100% have something in its training set
| discussing a human doing this and will almost definitely
| spit out something similar.
| FeepingCreature wrote:
| What makes you categorically say that "humans can"?
|
| I couldn't do that with an ant colony. I would have to
| train on ant research first.
|
| (Oh, and AIs can absolutely explore what they don't know.
| Watch a Claude Code instance look at a new repository.
| Exploration is a convergent skill in long-horizon RL.)
| CamperBob2 wrote:
| That's what benchmarks like ARC-AGI are designed to test.
| The models are getting better at it, and you aren't.
|
| Nothing ultimately matters in this business except the
| first couple of time derivatives.
| ben_w wrote:
| > Humans can observe ants and invent any colony
| optimization. AIs can't.
|
| Surely this is _exactly_ what current AI do? Observe
| stuff and apply that observation? Isn't this the exact
| criticism, that they aren't inventing ant colonies from
| first principles without ever seeing one?
|
| > Humans can explore what they don't know. AIs can't.
|
| We only learned to decode Egyptian hieroglyphs because of
| the Rosetta Stone. There's no translation for North
| Sentinelese, the Voynich manuscript, or Linear A.
|
| We're not magic.
| phkahler wrote:
| >> Nontrivially, LLMs can absolutely produce code for
| entirely new requirements. I've seen them do it many
| times.
|
| I think most people writing software today are
| reinventing a wheel, even in corporate environments for
| internal tools. Everyone wants their own tweak or thinks
| their idea is unique and nobody wants to share code
| publicly, so everyone pays programmers to develop buggy
| bespoke custom versions of the same stuff that's been
| done 100 times before.
|
| I guess what I'm saying is that your requirements are
| probably not new, and to the extent they are yes an LLM
| can fill in the blanks due to its fluency in languages.
| satvikpendem wrote:
| This doesn't make sense thermodynamically because models
| are far smaller than the training data they purport to hold
| and recall, so there must be some level of "understanding"
| going on. Whether that's the same as human understanding is
| a different matter.
| Eggpants wrote:
| It's a lossy text compression technique. It's clever
| applied statistics. Basically an advanced association
| rules algorithm which has been around for decades but
| modified to consider order and relative positions.
|
| There is no understanding, regardless of the wants of all
| the capital investors in this domain.
| simonw wrote:
| I don't care if it can "understand" anything, as long as
| I can use it to achieve useful things.
| Eggpants wrote:
| "useful things" like poorly drawing birds on bikes? ;)
|
| (I have much respect for what you have done and are
| currently doing, but you did walk right into that one)
| msephton wrote:
| The pelican on a bicycle is a very useful test.
| CamperBob2 wrote:
| _It's a lossy text compression technique._
|
| That is a much, much bigger deal than you make it sound
| like.
|
| Compression may, in fact, be all we need. For that
| matter, it may be all there _is_.
| Uehreka wrote:
| > Models don't emit something they don't know. They remix
| and rewrite what they know. There's no invention, just
| recall...
|
| People really need to stop saying this. I get that it was
| the Smart Guy Thing To Say in 2023, but by this point it's
| pretty clear that it's not true in any way that
| matters for most practical purposes.
|
| Coding LLMs have clearly been trained on conversations
| where a piece of code is shown, a transformation is
| requested (rewrite this from Python to Go), and then the
| transformed code is shown. It's not that they're just
| learning codebases, they're learning what working with code
| looks like.
|
| Thus you can ask an LLM to refactor a program in a language
| it has never seen, and it will "know" what refactoring
| means, because it has seen it done many times, and it will
| stand a good chance of doing the right thing.
|
| That's why they're useful. They're doing something way more
| sophisticated than just "recombining codebases from their
| training data", and anyone chirping 2023 sound bites is
| going to miss that.
| mr_toad wrote:
| > They remix and rewrite what they know. There's no
| invention, just recall...
|
| If they only recalled they wouldn't "hallucinate". What's a
| lie if not an invention? So clearly they can come up with
| data that they weren't trained on, for better or worse.
| 0x457 wrote:
| Because internally, there isn't a difference between a
| correctly "recalled" token and an incorrectly recalled
| (hallucinated) one.
| tshaddox wrote:
| To be fair, the human-generated user interfaces all look the
| same too.
| cchance wrote:
| Have you used the internet? that's how the internet looks,
| they're all fuckin react with the same layouts and styles, 90%
| shadcn lol
| NitpickLawyer wrote:
| This comment is ~3 years late. Every model since gpt3 has had
| the entirety of available code in their training data. That's
| not a gotcha anymore.
|
| We went from chatgpt's "oh, look, it looks like python code but
| everything is wrong" to "here's a full stack boilerplate app
| that does what you asked and works in 0-shot" inside 2 years.
| That's the kicker. And the sauce isn't just in the training
| set, models now do post-training and RL and a bunch of other
| stuff to get to where we are. Not to mention the insane
| abilities with extended context (first models were 2/4k max),
| agentic stuff, and so on.
|
| These kinds of comments are really missing the point.
| haar wrote:
| I've had little success with Agentic coding, and what success
| I have had has been paired with hours of frustration, where
| I'd have been better off doing it myself for anything but the
| most basic tasks.
|
| Even then, when you start to build up complexity within a
| codebase - the results have often been worse than "I'll start
| generating it all from scratch again, and include this as an
| addition to the initial longtail specification prompt as
| well", and even then... it's been a crapshoot.
|
| I _want_ to like it. The times where it initially "just
| worked" felt magical and inspired me with the possibilities.
| That's what prompted me to get more engaged and use it more.
| The reality of doing so is just frustrating and wishing
| things _actually worked_ anywhere close to expectations.
| aschobel wrote:
| Bingo, it's magical but the learning curve is very very
| steep. The METR study on open-source productivity alluded
| to this a bit.
|
| I am definitely at a point where I am more productive with
| it, but it took a bunch of effort.
| devmor wrote:
| The subjects in the study you are referencing also
| believed that they were more productive with it. What
| metrics do you have to convince yourself you aren't under
| the same illusionary bias they were?
| simonw wrote:
| Yesterday I used ffmpeg to extract the frame at the 13
| second mark of a video as a JPEG.
|
| If I didn't have an LLM to figure that out for me I
| wouldn't have done it at all.
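|
| For reference, a command along these lines does the job (a
| sketch; input.mp4 and frame.jpg are placeholder names):
|
|     ffmpeg -ss 13 -i input.mp4 -frames:v 1 -q:v 2 frame.jpg
|
| -frames:v 1 stops after a single video frame, and -q:v sets
| the JPEG quality.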
| devmor wrote:
| You wouldn't have just typed "extract frame at timestamp
| as jpeg ffmpeg" into Google and used the StackExchange
| result that comes up first that gives you a command to do
| exactly that?
| simonw wrote:
| Before LLMs made ffmpeg no-longer-frustrating-to-use I
| genuinely didn't know that ffmpeg COULD do things like
| that.
| devmor wrote:
| I'm not really sure what you're saying an LLM did in this
| case. Inspired a lost sense of curiosity?
| Philpax wrote:
| Translated a vague natural language query ("cli, extract
| frame 13s into video") into something immediately
| actionable with specific examples and explanations,
| surfacing information that I would otherwise not know how
| to search for.
|
| That's what I've done with my ffmpeg LLM queries, anyway
| - can't speak for simonw!
| wizzwizz4 wrote:
| DuckDuckGo search results for "cli, extract frame 13s
| into video" (no quotes):
|
| * https://stackoverflow.com/questions/10957412/fastest-
| way-to-...
|
| * https://superuser.com/questions/984850/linux-how-to-
| extract-...
|
| * https://www.aleksandrhovhannisyan.com/notes/video-cli-
| cheat-...
|
| * https://www.baeldung.com/linux/ffmpeg-extract-video-
| frames
|
| * https://ottverse.com/extract-frames-using-ffmpeg-a-
| comprehen...
|
| Search engines have been able to translate "vague natural
| language queries" into search results for a decade, now.
| This pre-existing infrastructure accounts for the _vast_
| majority of ChatGPT's apparent ability to find answers.
| 0x457 wrote:
| LLM somewhat understood ffmpeg documentation? Not sure
| what is not clear here.
| simonw wrote:
| My general point is that people say things like "yeah,
| but this one study showed that programmers over-estimate
| the productivity gain they get from LLMs so how can you
| really be sure?"
|
| Meanwhile I've spent the past two years constantly
| building and implementing things I _never would have
| done_ because of the reduction in friction LLM assistance
| gives me.
|
| I wrote about this first two years ago - AI-enhanced
| development makes me more ambitious with my projects -
| https://simonwillison.net/2023/Mar/27/ai-enhanced-
| developmen... - when I realized I was hacking on things
| with tech like AppleScript and jq that I'd previously
| avoided.
|
| It's hard to measure the productivity boost you get from
| "wouldn't have built that thing" to "actually built that
| thing".
| dingnuts wrote:
| It is nice to use LLMs to generate ffmpeg commands,
| because those can be pretty tricky, but really, you
| wouldn't have just used the man page before?
|
| That explains a lot about Django that the author is
| allergic to man pages lol
| simonw wrote:
| I just took a look, and the man page DOES explain how to
| do that!
|
| ... on line 3,218: https://gist.github.com/simonw/6fc05ea
| 7392c5fb8a5621d65e0ed0...
|
| (I am very confident I am not the only person who has
| been deterred by ffmpeg's legendarily complex command-
| line interface. I feel no shame about this at all.)
| quesera wrote:
| Ffmpeg is genuinely complicated! And the CLI is
| convoluted (in justifiable, and unfortunate ways).
|
| But if you approach ffmpeg from the perspective of "I
| know this is possible", you are always correct, and can
| almost always reach the "how" in a handful of minutes.
|
| Whether that's worth it or not, will vary. :)
| ben_w wrote:
| I remember when I was a kid, people asking a teacher how
| to spell a word, and the answer was generally "look it up
| in a dictionary"... which you can only do if you already
| have a shortlist of possible spellings.
|
| *nix man pages are the same: if you already know which
| tool can solve your problem, they're easy to use. But you
| have to already have a shortlist of tools that can solve
| your problem, before you even know which man pages to
| read.
| throwworhtthrow wrote:
| LLM's still give subpar results with ffmpeg. For example
| when I asked Sonnet to trim a long video with ffmpeg, it
| put the input file parameter before the start time
| parameter, which triggers an unnecessary decode of the
| video file. [1]
|
| Sure, use the LLM to get over the initial hump. But
| ffmpeg's no exception to the rule that LLM's produce
| subpar code. It's worth spending a couple minutes reading
| the docs to understand what it did so you can do it
| better, and unassisted, next time.
|
| [1] https://ffmpeg.org/ffmpeg.html#:~:text=ss%20position
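|
| Concretely, the difference is just where -ss sits relative to
| -i (illustrative commands; input.mp4 and out.mp4 are
| placeholder names):
|
|     # input seeking: jumps to the start point first (fast)
|     ffmpeg -ss 00:01:00 -i input.mp4 -t 30 out.mp4
|
|     # output seeking: decodes and discards everything before
|     # the start point (slow)
|     ffmpeg -i input.mp4 -ss 00:01:00 -t 30 out.mp4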
| CamperBob2 wrote:
| That says more about suboptimal design on ffmpeg's part
| than it does about the LLM. Most humans can't deal with
| ffmpeg command lines, so it's not surprising that the LLM
| misses a few tricks.
| nottorp wrote:
| Had an LLM generate 3 lines of working C++ code that was
| "only" one order of magnitude slower than what I edited
| the code into in 10 minutes.
|
| If you're happy with results like that, sure, LLMs miss
| "a few tricks"...
| ben_w wrote:
| You don't have to leave LLM code alone, it's fine to
| change it -- unless, I guess, you're doing some kind of
| LLM vibe-code-golfing?
|
| But this does remind me of a previous co-worker. Wrote
| something to convert from a custom data store to a
| database, his version took 20 minutes on some inputs.
| Swore it couldn't possibly be improved. Obviously
| ridiculous because it didn't take 20 minutes to load from
| the old data store, nor to load from the new database.
| Over the next few hours of looking at very mediocre code,
| I realised it was doing an unnecessary O(n^2) check,
| confirmed with the CTO it wasn't business-critical, got
| rid of it, and the same conversion on the same data ran
| in something like 200ms.
|
| Over a decade before LLMs.
| nottorp wrote:
| We all do that, sometimes where it's time critical
| sometimes where it isn't.
|
| But I keep being told "AI" is the second coming of Ahura
| Mazda so it shouldn't do stuff like that right?
| CamperBob2 wrote:
| "I'm taking this talking dog right back to the pound. It
| told me to short NVDA, and you should see the buffer
| overflow bugs in the C++ code it wrote. Totally
| overhyped. I don't get it."
| nottorp wrote:
| "We hear you have been calling our deity a talking dog.
| Please enter the red door for reeducation."
| ben_w wrote:
| > Ahura Mazda
|
| Niche reference, I like it.
|
| But... I only hear of scammers who say, and psychosis
| sufferers who think, LLMs are *already* that competent.
|
| Future AI? Sure, lots of sane-seeming people also think
| it could go far beyond us. Special purpose ones have in
| very narrow domains. But current LLMs are only good
| enough to be useful and potentially economically
| disruptive, they're not even close to wildly superhuman
| like Stockfish is.
| CamperBob2 wrote:
| Sure. If you ask ChatGPT to play chess, it will put up an
| amateur-level effort at best. Stockfish will indeed wipe
| the floor with it. But what happens when you ask
| Stockfish to write a Space Invaders game?
|
| ChatGPT will get better at chess over time. Stockfish
| will not get better at anything _except_ chess. That's
| kind of a big difference.
| ben_w wrote:
| > ChatGPT will get better at chess over time
|
| Oddly, LLMs got _worse_ at specifically chess:
| https://dynomight.net/chess/
|
| But even to the general point, there's absolutely no
| agreement how much better the current architectures can
| ultimately get, nor how quickly they can get there.
|
| Do they have potential for unbounded improvements, albeit
| at exponential cost for each linear incremental
| improvement? Or will they asymptotically approach
| someone with 5 years experience, 10 years experience, a
| lifetime of experience, or a higher level than any human?
|
| If I had to bet, I'd say current models have an
| asymptotic growth converging to a merely "ok"
| performance; and separately claim that even if they're
| actually unbounded with exponential cost for linear
| returns, we can't afford the training cost needed to make
| them act like someone with even just 6 years professional
| experience in any given subject.
|
| Which is still a lot. Especially as it would be acting
| like it had about as much experience in _every other
| subject at the same time_. Just... not a literal Ahura
| Mazda.
| CamperBob2 wrote:
| _If I had to bet, I'd say current models have an
| asymptotic growth converging to a merely "ok"
| performance_
|
| (Shrug) People with actual money to spend are betting
| twelve figures that you're wrong.
|
| Should be fun to watch it shake out from up here in the
| cheap seats.
| ben_w wrote:
| Nah, trillion dollars is about right for "ok". Percentage
| point of the global economy in cost, automate 2 percent
| and get a huge margin. We literally set more than that on
| actual fire each year.
|
| For "pretty good", it would be worth 14 figures, over two
| years. The global GDP is 14 figures. Even if this only
| automated 10% of the economy, it pays for itself after a
| decade.
|
| For "Ahura Mazda", it would easily be worth 16 figures,
| what with that being the principal God and god of the sky
| in Zoroastrianism, and the only reason it stops at 16 is
| the implausibility of people staying organised for longer
| to get it done.
| haar wrote:
| Apologies if I was unclear.
|
| The more I've used it, the more I've disliked how poor
| the results it's produced are, and the more I've realised I
| would have been better served by doing it myself and
| following a methodical path for things that I didn't have
| experience with.
|
| It's easier to step through a problem as I'm learning and
| making small changes than an LLM going "It's done, and
| production ready!" where it just straight up doesn't work
| for 101 different tiny reasons.
| MyOutfitIsVague wrote:
| I don't think they are missing the point, because they're
| pointing out that the tools are still the most useful for
| patterns that are extremely widely known and repeated. I use
| Gemini 2.5 Pro every day for coding, and even that one still
| falls over on tasks that aren't well known to it (which is
| why I break the problem down into small parts that I know
| it'll be able to handle properly).
|
| It's kind of funny, because sometimes these tools are magical
| and incredible, and sometimes they are extremely stupid in
| obvious ways.
|
| Yes, these are impressive, and especially so for local models
| that you can run yourself, but there is a gap between
| "absolutely magical" and "pretty cool, but needs heavy
| guiding" depending on how heavily the ground you're treading
| has been walked upon.
|
| For a heavily explored space, it's like being impressed that
| your 2.5 year old M2 with 64 GB RAM can extract some source
| code from a zip file. It's worth being impressed and excited
| about the space and the pace of improvement, but it's also
| worth stepping back and thinking rationally about the
| specific benchmark at hand.
| NitpickLawyer wrote:
| > because they're pointing out that the tools are still the
| most useful for patterns that are extremely widely known
| and repeated
|
| I agree with you, but your take is _much_ more nuanced than
| what the GP comment said! These models don't simply
| regurgitate the training set. That was my point with gpt3.
| The models have advanced from that, and can now
| "generalise" over the context in ways they could not do ~3
| years ago. We are now at a point where you can write a
| detailed spec (10-20k tokens) for an unseen scripting
| language, and have SotA models a) write a parser and b)
| start writing scripts for you in that language, even though
| it never saw that particular scripting language anywhere in
| its training set. Try it. You'll be surprised.
| jayd16 wrote:
| I think you're missing the point.
|
| Showing off moderately complicated results that are actually
| not indicative of performance because they are sniped by the
| training data turns this from a cool demo to a parlor trick.
|
| Stating that, aha, jokes on you, that's the status quo, is an
| even bigger indictment.
| jan_Sate wrote:
| Not exactly. The real utility value of an LLM for programming is
| to come up with something new. For Space Invaders, instead of
| using an LLM for that, I might as well just manually search for
| the code online and use that.
|
| To show that an LLM actually can provide value for one-shot
| programming, you need to find a problem for which there's no
| fully working sample code available online. I'm not trying to
| say that LLMs couldn't do that. But just because an LLM can come
| up with a perfectly working Space Invaders doesn't mean that it
| can do something genuinely new.
| devmor wrote:
| > The real utility value of an LLM for programming is to come
| up with something new.
|
| That's the goal for these projects anyways. I don't know
| that it's true or feasible. I find the RAG models much more
| interesting myself, I see the technology as having far more
| value in search than generation.
|
| Rather than write some markov-chain reminiscent
| frankenstein function when I ask it how to solve a problem,
| I would like to see it direct me to the original sources it
| would use to build those tokens, so that I can see their
| implementations in context and use my judgement.
| simonw wrote:
| "I would like to see it direct me to the original sources
| it would use to build those tokens"
|
| Sadly that's not feasible with transformer-based LLMs:
| those original sources are _long gone_ by the time you
| actually get to use the model, scrambled a billion times
| into a trained set of weights.
|
| One thing that helped me understand this is understanding
| that every single token output by an LLM is the result of
| a calculation that considers _all X billion parameters_
| that are baked into that model (or a subset of that in
| the case of MoE models, but it's still billions of
| floating point calculations for every token.)
|
| You can get an imitation of that if you tell the model
| "use your search tool and find example code for this
| problem and build new code based on that", but that's a
| pretty unconventional way to use a model. A key component
| of the value of these things is that they can spit out
| completely new code based on the statistical patterns
| they learned through training.
| devmor wrote:
| I am aware, and that's exactly why I don't think they're
| anywhere near as useful for this type of work as the
| people pushing them want them to be.
|
| I tried to push for this type of model when an org I
| worked with over a decade ago was first exploring using
| the first generation of Tensorflow to drive customer
| service chatbots and was sadly ignored.
| simonw wrote:
| I don't understand. For code, why would I want to remix
| existing code snippets?
|
| I totally get the value of RAG style patterns for
| information retrieval against factual information - for
| those I don't want the LLM to answer my question
| directly, I want it to run a search and show me a
| citation and directly quote a credible source as part of
| answering.
|
| For code I just want code that works - I can test it
| myself to make sure it does what it's supposed to.
| devmor wrote:
| > I don't understand. For code, why would I want to remix
| existing code snippets?
|
| That is what you're doing already. You're just relying on
| a vector compression and search engine to hide it from
| you and hoping the output is what you expect, instead of
| having it direct you to where it remixed those snippets
| from so you can see how they work to start with and make
| sure it's properly implemented from the get-go.
|
| We all want code that works, but understanding that code
| is a critical part of that for anything but a throw-away
| one time use script.
|
| I don't really get this desire to replace critical
| thought with hoping and testing. It sounds like the pipe
| dream of a middle manager, not a tool for a programmer.
| stavros wrote:
| I don't understand your point. You seem to be saying that
| we should be getting code from the source, then adapting
| it to our project ourselves, instead of getting adapted
| code to begin with.
|
| I'm going to review the code anyway, why would I not want
| to save myself some of the work? I can "see how they
| work" after the LLM gives them to me just fine.
| devmor wrote:
| The work that you are "saving" is the work of using your
| brain to determine the solution to the problem. Whatever
| the LLM gives you doesn't have a context it is used in
| other than your prompt - you don't even know what it does
| until after you evaluate it.
|
| If you instead have a set of sources related to your
| problem, they immediately come with context, usage and in
| many cases, developer notes and even change history to
| show you mistakes and adaptations.
|
| You're ultimately creating more work for yourself* by
| trying to avoid work, and possibly ending up with an
| inferior solution in the process. Where is your sense of
| efficiency? Where is your pride as an intellectual?
|
| * Yes, you are most likely creating more work for
| yourself even if you think you are capable of telling
| otherwise. [1]
|
| 1. https://metr.org/blog/2025-07-10-early-2025-ai-
| experienced-o...
| stavros wrote:
| Thanks for the concern, but I'm perfectly able to judge
| for myself whether I'm creating more work or delivering
| an inferior product.
| simonw wrote:
| It sounds like you care deeply about learning as much as
| you can. I care about that too.
|
| I would encourage you to consider that even LLM-generated
| code can teach you a ton of useful new things.
|
| Go read the source code for my dumb, zero-effort space
| invaders clone:
| https://github.com/simonw/tools/blob/main/space-invaders-
| GLM...
|
| There's a bunch of useful lessons to be picked up even
| from that!
|
| - Examples of CSS gradients, box shadows and flexbox
| layout
|
| - CSS keyframe animation
|
| - How to implement keyboard events in JavaScript
|
| - A simple but effective pattern for game loops against a
| Canvas element, using requestAnimationFrame
|
| - How to implement basic collision detection
|
| If you've written games like this before these may not be
| new to you, but I found them pretty interesting.
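|
| For anyone who hasn't seen that game loop pattern before, it
| boils down to something like this (a simplified sketch with
| made-up names, not the exact code from the generated file):
|
|     const canvas = document.querySelector("canvas");
|     const ctx = canvas.getContext("2d");
|
|     // Axis-aligned bounding-box collision detection
|     function rectsOverlap(a, b) {
|       return a.x < b.x + b.w && a.x + a.w > b.x &&
|              a.y < b.y + b.h && a.y + a.h > b.y;
|     }
|
|     function loop() {
|       ctx.clearRect(0, 0, canvas.width, canvas.height);
|       // update positions, test bullets against invaders with
|       // rectsOverlap(), then draw everything
|       requestAnimationFrame(loop); // schedule the next frame
|     }
|     requestAnimationFrame(loop);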
| tracker1 wrote:
| I have a friend who has been doing just that... usually
| with his company he manages a handful of projects where a
| bulk of the development is outsourced overseas. This past
| year, he's outpaced the 6 devs he's had working on misc
| projects just with his own efforts and AI. Most of this
| being a relatively unique combination of UX with features
| that are less common.
|
| He's using AI with note taking apps for meetings to enhance
| notes and flesh out technology ideas at a higher level,
| then refining those ideas into working experiments.
|
| It's actually impressive to see. My personal experience has
| been far more disappointing to say the least. I can't speak
| to the code quality, consistency or even structure in terms
| of most people being able to maintain such applications
| though. I've asked to shadow him through a few of his vibe
| coding sessions to see his workflow. It feels rather alien
| to me, again my experience is much more disappointing in
| having to correct AI errors.
| nottorp wrote:
| Is this the same person who posted about launching 17
| "products" in one year a few days ago on HN? :)
| tracker1 wrote:
| No, he's been working on building a larger eLearning
| solution with some interesting workflow analytics around
| courseware evaluation and grading. He's been involved in
| some of the newer LRS specifications and some
| implementation details to bridge training as well as real
| world exposure scenarios. Working a lot with first
| responders, incident response training etc.
|
| I've worked with him off and on for years from simulating
| aircraft diagnostics hardware to incident command
| simulation and setting up core infrastructure for F100
| learning management backends.
| Aurornis wrote:
| > These kinds of comments are really missing the point.
|
| I disagree. In my experience, asking coding tools to produce
| something similar to all of the tutorials and example code
| out there works amazingly well.
|
| Asking them to produce novel output that doesn't match the
| training set produces very different results.
|
| When I tried multiple coding agents for a somewhat unique
| task recently they all struggled, continuously trying to pull
| the solution back to the standard examples. It felt like an
| endless loop of the models grinding through a solution and
| then spitting out something that matched common examples,
| after which I had to remind them of the unique properties of
| the task and they started all over again, eventually arriving
| back in the same spot.
|
| It shows the reality of working with LLMs and it's an
| important consideration.
| AlexeyBrin wrote:
| You are reading too much into my comment. My point was that
| the test (a Space Invaders clone) used to assess the model has
| been irrelevant for some time now. I could have gotten a similar
| result with Mistral Small a few months ago.
| stolencode wrote:
| It's amazing that none of you even try to falsify your claims
| anymore. You can literally just put some of the code in a
| search engine and find the prior art example:
|
| https://www.web-leb.com/en/code/2108
|
| Your "AI tools" are just "copyright whitewashing machines."
|
| These kinds of comments are really ignoring reality.
| elif wrote:
| Most likely this comment included countless similar comments in
| its training data, likely all synthetic without any actual
| tether to real analysis.
| Conflonto wrote:
| That sounds so dismissive.
|
| Previously, I was not able to just download an 8-16GB file that
| could generate A LOT of different tools, games, etc. for me
| in multiple programming languages while in parallel ELI5-ing
| research papers, generating SVGs, and a lot lot lot more.
|
| But hey.
| phkahler wrote:
| I find the visual similarity to breakout kind of interesting.
| gblargg wrote:
| The real test is if you can have it tweak things. Have the ship
| shoot down. Have the space invaders come from the left and
| right. Add two player simultaneous mode with two ships.
| wizzwizz4 wrote:
| It can _usually_ tweak things, if given specific instruction,
| but it doesn't know when to refactor (and can't reliably
| preserve functionality when it does), so the program gets
| further and further away from something sensible until it
| can't make edits any more.
| simonw wrote:
| For serious projects you can address that by writing (or
| having it write) unit tests along the way, that way it can
| run in a loop and avoid breaking existing functionality
| when it adds new changes.
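|
| In practice that can be as simple as a few assertions against
| the game logic (a sketch using Node's built-in test runner;
| collidesWith is a hypothetical helper exported from the game
| code):
|
|     import { test } from "node:test";
|     import assert from "node:assert";
|     import { collidesWith } from "./game.js";
|
|     test("bullet overlapping an invader registers a hit", () => {
|       const bullet = { x: 10, y: 10, w: 2, h: 6 };
|       const invader = { x: 8, y: 8, w: 20, h: 12 };
|       assert.ok(collidesWith(bullet, invader));
|     });
|
|     test("distant bullet does not register a hit", () => {
|       const bullet = { x: 200, y: 10, w: 2, h: 6 };
|       const invader = { x: 8, y: 8, w: 20, h: 12 };
|       assert.ok(!collidesWith(bullet, invader));
|     });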
| greesil wrote:
| Okay ask it to write unit tests for space invaders next
| time :)
| NitpickLawyer wrote:
| > Two years ago when I first tried LLaMA I never dreamed that the
| same laptop I was using then would one day be able to run models
| with capabilities as strong as what I'm seeing from GLM 4.5 Air--
| and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of
| other high quality models that have emerged over the past six
| months.
|
| Yes, the open-models have surpassed my expectations in both
| quality and speed of release. For a bit of context, when chatgpt
| launched in Dec22, the "best" open models were GPT-J(~6-7B) and
| GPT-neoX (~22B?). I actually had an app running live, with users,
| using gpt-j for ~1 month. It was a pain. The quality was abysmal,
| there was no instruction following (you had to start your prompt
| like a story, or come up with a bunch of examples and hope the
| model will follow along) and so on.
|
| And then something happened, LLama models got "leaked" (I still
| think it was an on-purpose leak - don't sue us, we never meant to
| release, etc), and the rest is history. With L1 we got lots of
| optimisations like quantised models, fine-tuning and so on, L2
| really saw fine-tuning go off (most of the fine-tunes were better
| than what meta released), we got alpaca showing off LoRA, and
| then a bunch of really strong models came out (mistrals,
| mixtrals, L3, gemmas, qwens, deepseeks, glms, granites, etc.)
|
| By some estimations the open models are ~6mo behind what SotA
| labs have released. (note that doesn't mean the labs are
| releasing their best models, it's likely they keep those in house
| to use on next runs data curation, synthetic datasets, for
| distilling, etc). Being 6mo behind is NUTS! I never in my wildest
| dreams believed we'd be here. In fact I thought it would take
| ~2years to reach gpt3.5 levels. It's really something insane that
| we get to play with these models "locally", fine-tune them and so
| on.
| tonyhart7 wrote:
| is GLM 4.5 better than Qwen3 coder??
| diggan wrote:
| For what? It's really hard to say what model is "generally"
| better than another, as they're all better/worse at specific
| things.
|
| My own benchmark has a bunch of different tasks I use
| various local models for, and I run it when I wanna see if a
| new model is better than the existing ones I use. The output
| is basically a markdown table with a description of which
| model is best for what task.
|
| They're being sold as general purpose things that are
| better/worse at _everything_ but reality doesn't reflect
| this, they all have very specific tasks they're worse/better
| at, and the only way to find that out is by having a private
| benchmark you run yourself.
| kelvinjps10 wrote:
| coding? they are coding models? what specific tasks is one
| performing better than the other?
| diggan wrote:
| They may be, but there are lots of languages, lots of
| approaches, lots of methodologies and just a ton of
| different ways to "code", coding isn't one homogeneous
| activity that one model beats all the other models at.
|
| > what specific tasks is one performing better than the
| other?
|
| That's exactly why you create your own benchmark, so you
| can figure that out by just having a list of models,
| instead of testing each individually and basing it on
| "feels better".
| whimsicalism wrote:
| glm 4.5 is not a coding model
| simonw wrote:
| It may not be code-only, but it was trained extensively
| for coding:
|
| > Our base model undergoes several training stages.
| During pre-training, the model is first trained on 15T
| tokens of a general pre-training corpus, followed by 7T
| tokens of a code & reasoning corpus. After pre-training,
| we introduce additional stages to further enhance the
| model's performance on key downstream domains.
|
| From my notes here:
| https://simonwillison.net/2025/Jul/28/glm-45/
| whimsicalism wrote:
| yes, all reasoning models currently are, but it's not
| like ds coder or qwen coder
| simonw wrote:
| I don't see how the training process for GLM-4.5 is
| materially different from that used for
| Qwen3-235B-A22B-Instruct-2507 - they both did a ton of
| extra reinforcement learning training related to code.
|
| Am I missing something?
| whimsicalism wrote:
| I think the primary thing you're missing is that
| Qwen3-235B-A22B-Instruct-2507 !=
| Qwen3-Coder-480B-A35B-Instruct. And the difference there
| is that while both do tons of code RL, in one they do not
| monitor performance on anything else for
| forgetting/regression and focus fully on code post-
| training pipelines and it is not meant for other tasks.
| NitpickLawyer wrote:
| I haven't tried them (released yesterday I think?). The
| benchmarks look good (similar I'd say) but that's not saying
| much these days. The best test you can do is have a couple of
| cases that match your needs, and run them yourself w/ the
| cradle that you are using (aider, cline, roo, any of the CLI
| tools, etc). Openrouter usually has them up soon after
| launch, and you can run a quick test really cheap (and only
| deal with one provider for billing & stuff).
| genewitch wrote:
| I'll bite. How do i train/make and/or use LoRA, or, separately,
| how do i fine-tune? I've been asking this for months, and no
| one has a decent answer. websearch on my end is seo/geo-spam,
| with no real instructions.
|
| I know how to make an SD LoRA, and use it. I've known how to do
| that for 2 years. So what's the big secret about LLM LoRA?
| minimaxir wrote:
| If you're using Hugging Face transformers, the library you
| want to use is peft:
| https://huggingface.co/docs/peft/en/quicktour
|
| There are Colab Notebook tutorials around training models
| with it as well.
| notpublic wrote:
| https://github.com/unslothai/unsloth
|
| I'm not sure if it contains exactly what you're looking for,
| but it includes several resources and notebooks related to
| fine-tuning LLMs (including LoRA) that I found useful.
| techwizrd wrote:
| We have been fine-tuning models using Axolotl and Unsloth,
| with a slight preference for Axolotl. Check out the docs [0]
| and fine-tune or quantize your first model. There is a lot to
| be learned in this space, but it's exciting.
|
| 0: https://axolotl.ai/ and https://docs.axolotl.ai/
| syntaxing wrote:
| What hardware do you train on using axolotl? I use unsloth
| with Google colab pro
| arkmm wrote:
| When do you think fine tuning is worth it over prompt
| engineering a base model?
|
| I imagine with the finetunes you have to worry about self-
| hosting, model utilization, and then also retraining the
| model as new base models come out. I'm curious under what
| circumstances you've found that the benefits outweigh the
| downsides.
| whimsicalism wrote:
| finetuning rarely makes sense unless you are an
| enterprise, and even then it generally doesn't in most
| cases either.
| tough wrote:
| only for narrow applications where your fine-tune can let
| you use a smaller model locally, specialised and trained
| for your specific use-case mostly
| reissbaker wrote:
| For self-hosting, there are a few companies that offer
| per-token pricing for LoRA finetunes (LoRAs are basically
| efficient-to-train, efficient-to-host finetunes) of
| certain base models:
|
| - (shameless plug) My company, Synthetic, supports LoRAs
| for Llama 3.1 8b and 70b: https://synthetic.new All you
| need to do is give us the Hugging Face repo and we take
| care of the rest. If you want other people to try your
| model, we charge usage to them rather than to you. (We
| can also host full finetunes of anything vLLM supports,
| although we charge by GPU-minute for full finetunes
| rather than the cheaper per-token pricing for supported
| base model LoRAs.)
|
| - Together.ai supports a slightly wider number of base
| models than we do, with a bit more config required, and
| any usage is charged to you.
|
| - Fireworks does the same as Together, although they
| quantize the models more heavily (FP4 for the higher-end
| models). However, they support Llama 4, which is pretty
| nice although fairly resource-intensive to train.
|
| If you have reasonably good data for your task, and your
| task is relatively "narrow" (i.e. find a specific kind of
| bug, rather than general-purpose coding; extract a
| specific kind of data from legal documents rather than
| general-purpose reasoning about social and legal matters;
| etc), finetunes of even a very small model like an 8b
| will typically outperform -- by a pretty wide margin --
| even very large SOTA models while being a lot cheaper to
| run. For example, if you find yourself hand-coding
| heuristics to fix some problem you're seeing with an
| LLM's responses, it's probably more robust to just train
| a small model finetune on the data and have the finetuned
| model fix the issues rather than writing hardcoded
| heuristics. On the other hand, no amount of finetuning
| will make an 8b model a better general-purpose coding
| agent than Claude 4 Sonnet.
| delijati wrote:
| Do you maybe know if there is a company in the EU that
| hosts models (DeepSeek, Qwen3, Kimi)?
| svachalek wrote:
| For completeness, for Apple hardware MLX is the way to go.
| w10-1 wrote:
| MLX github: https://github.com/ml-explore/mlx
|
| get started:
| https://developer.apple.com/videos/play/wwdc2025/315/
|
| details:
| https://developer.apple.com/videos/play/wwdc2025/298/
| qcnguy wrote:
| LLM fine tuning tends to destroy the model's capabilities if
| you aren't very careful. It's not as easy or effective as
| with image generation.
| electroglyph wrote:
| unsloth is the easiest way to finetune due to the lower
| memory requirements
| pdntspa wrote:
| Have you tried asking an LLM?
| jasonjmcghee wrote:
| brev.dev made an easy to follow guide a while ago but
| apparently Nvidia took it down or something when they bought
| them?
|
| So here's the original
|
| https://web.archive.org/web/20231127123701/https://brev.dev/.
| ..
| Nesco wrote:
| Zuck wouldn't have leaked it on 4chan of all the places
| tough wrote:
| prob just told an employee to get it done no?
| vaenaes wrote:
| Why not?
| pulkitsh1234 wrote:
| Is there any website to see the minimum/recommended hardware
| required for running local LLMs? Much like 'system requirements'
| mentioned for games.
| GaggiX wrote:
| https://apxml.com/tools/vram-calculator
|
| This one is very good in my opinion.
| jxf wrote:
| Don't think it has the GLM series on there yet.
| knowaveragejoe wrote:
| If you have a HuggingFace account, you can specify the hardware
| you have and it will show on any given model's page what you
| can run.
| CharlesW wrote:
| > _Is there any website to see the minimum/recommended
| hardware required for running local LLMs?_
|
| LM Studio (not exclusively, I'm sure) makes it a no-brainer to
| pick models that'll work on your hardware.
| qingcharles wrote:
| This can be a useful resource too:
|
| https://www.reddit.com/r/LocalLLaMA/
| svachalek wrote:
| In addition to the tools other people responded with, a good
| rule of thumb is that most local models work best* at q4
| quants, meaning the memory for the model is a little over half
| the number of parameters, e.g. a 14b model may be 8gb. Add some
| more for context and maybe you want 10gb VRAM for a 14b model.
| That will at least put you in the right ballpark for what
| models to consider for your hardware.
|
| (*best performance/size ratio, generally if the model easily
| fits at q4 you're better off going to a higher parameter count
| than going for a larger quant, and vice versa)
| nottorp wrote:
| > maybe you want 10gb VRAM for a 14gb model
|
| ... or if you have Apple hardware with their unified memory,
| whatever the assholes soldered in is your limit.
| bradly wrote:
| I appreciate you sharing both the chat log and the full source
| code. I would be interested to see a followup post on how adding
| moderately-sized features like High Score goes.
|
| Also, IANAL but Space Invaders is owned IP. I have no idea the
| legality of a blog post describing steps to create and releasing
| an existing game, but I've seen headlines on HN of engs in
| trouble for things I would not expect to be problematic. Maybe
| Space Invaders is in q-tip/band-aid territory at this point, but
| if this was Zelda instead of Space Invaders, I could see things
| being more dicey.
| Joker_vD wrote:
| > Space Invaders is owned IP
|
| So is Tetris. And I believe that Snake is also an owned IP
| although I could be wrong on this one.
| sowbug wrote:
| It doesn't infringe any kind of intellectual property.
|
| This isn't copyright infringement; it isn't based on the
| original assembly code or artwork. A game concept can't be
| copyrighted. Even if one of SI's game mechanics were patented,
| it would have long expired. Trade secret doesn't apply in this
| situation.
|
| That leaves trademark. No reasonable person would be confused
| whether Simon is trying to pass this creation off as a genuine
| Space Invaders product.
| 9rx wrote:
| _> No reasonable person would be confused whether Simon is
| trying to pass this creation off as a genuine Space Invaders
| product._
|
| There may be no reasonable confusion, but trademark holders
| also have to protect against dilution of their brand, if they
| want to retain their trademark. With use like this, people
| might come to think of Space Invaders as a generic term for
| all games of this type, not the brand of a specific game.
|
| (there is a strong case to be made that they already do,
| granted)
| pamelafox wrote:
| Alas, my 3 year old Mac has only 16 GB RAM, and can barely run a
| browser without running out of memory. It's a work-issued Mac,
| and we only get upgrades every 4/5 years. I must be content with
| 8B parameters models from Ollama (some of which are quite good,
| like llama3.1:8b).
| e1gen-v wrote:
| Just download more ram!
| GaggiX wrote:
| Reasoning models like qwen3 are even better, and they have more
| options, for example you can choose the 14B model (at the usual
| Q4_K_M quantization) instead of the 8B model.
| pamelafox wrote:
| Are they quantized more effectively than the non-reasoning
| models for some reason?
| GaggiX wrote:
| There is no difference, you can choose a 6 bits
| quantization if you prefer, at that point it's essentially
| lossless.
| dreamer7 wrote:
| I am able to run Gemma 3 12B on my M1 MBP 16GB. It is pretty
| good at logic and reasoning!
| __mharrison__ wrote:
| Odd. My MBP has 16 GB and I routinely have 5 browser windows
| open. Most of them have 5-20 tabs. I'm also routinely running
| vi and VS Code, and editing videos with DaVinci Resolve without
| issue.
|
| My only memory issue that I can remember is an OBS memory leak,
| otherwise these MBPs are incredible hardware. I wish any other
| company could actually deliver a comparable machine.
| pamelafox wrote:
| I was exaggerating slightly - I think it's some combo of the
| apps I use: Edge, Teams, Discord, VS Code, Docker. When I get
| the RAM popup once a week, I typically have to close a few of
| those, whichever is using the most memory according to
| Activity Monitor. I've also got very little hard drive space
| on my machine, about 15 GB free, so that makes it harder for
| me to download the larger models. I keep trying to clear
| space, even using CleanMyMac, but I somehow keep filling it
| up.
| neutronicus wrote:
| If I understand correctly, the author is managing to run this
| model on a laptop with 64GB of RAM?
|
| So a home workstation with 64GB+ of RAM could get similar
| results?
| lynndotpy wrote:
| The laptop has "unified RAM", so that's like 64GB of VRAM.
| simonw wrote:
| Only if that RAM is available to a GPU, or you're willing to
| tolerate extremely slow responses.
|
| The neat thing about Apple Silicon is the system RAM is
| available to the GPU. On most other systems you would need
| ~48GB of VRAM.
| xrd wrote:
| Aren't there non-Macos laptops which also support sharing the
| VRAM and regular RAM, i.e. iGPU?
|
| https://www.reddit.com/r/GamingLaptops/comments/1akj5aw/what.
| ..
|
| I personally want to run linux and feel like I'll get a
| better price/GB offering that way. But, it is confusing to
| know how local models will actually work on those and the
| drawbacks of iGPU.
| mft_ wrote:
| iGPUs are typically weak, and/or aren't capable of running
| the LLM so the CPU is used instead. You _can_ run things
| this way, but it's not fast, and it gets slower as the
| models go up in size.
|
| If you want things to run quickly, then aside from Macs,
| there's the 2025 ASUS Flow z13 which (afaik) is the only
| laptop with AMD's new Ryzen Max+ 395 processor. This is
| powerful _and_ has up to 128GB of RAM that can be shared
| with the GPU, but they're very rare (and Mac-expensive) at
| the moment.
|
| The other variable for running LLMs quickly is memory
| bandwidth; the Max+ 395 has 256GB/s, which is similar to
| the M4 Pro; the M4 Max chips are considerably higher. Apple
| fell on their feet on this one.
| simlevesque wrote:
| Not so sure. The MBP uses hybrid memory, the ram is shared with
| the cpu and gpu.
|
| Your 64gb workstation doesn't share the ram with your gpu.
| NitpickLawyer wrote:
| > So a home workstation with 64GB+ of RAM could get similar
| results?
|
| Similar in quality, but CPU generation will be slower than what
| macs can do.
|
| What you can do with MoEs (GLMs and Qwens) is to run _some_
| experts (the shared ones usually) on a GPU (even a 12GB/16GB
| will do) and the rest from RAM on CPU. That will speed things
| up considerably (especially prompt processing). If you're
| interested in this, look up llama.cpp and especially ik_llama,
| which is a fork dedicated to this kind of selective offloading
| of experts.
| 0x457 wrote:
| You can run it, it will just run on the CPU and will be pretty
| slow. Macs, like everyone in this thread said, use unified
| memory, so it's 64GB shared between CPU and GPU, while for you
| it's just 64GB for the CPU.
| larodi wrote:
| It's probably more correct to say: my 2.5 year old laptop can
| RETELL Space Invaders. Pretty sure it cannot write a game it has
| never seen, so you could even say: my old laptop can now do this
| fancy extraction of data from a smart probabilistic blob, where
| the original things are retold in new colours and forms :)
| simonw wrote:
| I know these models can build games and apps they've never seen
| before because I've already observed them doing exactly that
| time and time again.
|
| If you haven't seen that yourself yet I suggest firing up the
| free, no registration required GLM-4.5 Air on
| https://chat.z.ai/ and seeing if you can prove yourself wrong.
| oceanplexian wrote:
| So you're saying it works exactly the same way as humans, who
| copied Space Invaders from Breakout which came out in 1976.
| uludag wrote:
| It's unfortunate that the ideas of things to test first are
| exactly the things more likely to be contained in training
| data. Hence why the pelican on a bicycle was such a good test,
| until it became viral.
| MattRix wrote:
| No, that would be incorrect, nobody uses "retell" like that.
|
| The impressive thing about these models is their ability to
| write working code, not their ability to come up with unique
| ideas. These LLMs actually can come up with unique ideas as
| well, though I think it's more exciting that they can help
| people execute human ideas instead.
| anthk wrote:
| Writing a Z80 emulator with the original Space Invaders ROM will
| make you more fulfilled.
|
| Either with SDL2+C, or even Tcl/Tk, or Python with Tkinter.
| vFunct wrote:
| please please apple give us a M5 MacBook Pro laptop with 2TB of
| unified memory please please
| stpedgwdgfhgdd wrote:
| Aside from the fact that Space Invaders from scratch is not
| representative of real engineering, it will be interesting to see
| what the business model for Anthropic will be if I can run a
| solid code generation model on my local machine (no usage tier
| per hour or week), let's say, one year from now. At $200 per
| month for 2 years I could buy a decent Mx with 64GB (or perhaps
| even 128GB, taking residual value into account).
| falcor84 wrote:
| How come it's "not representative of real engineering"? Other
| than copy-pasting existing code (which is not what an LLM
| does), I don't see how you can create a Space Invaders game
| without applying "engineering".
| phkahler wrote:
| >> Other than copy-pasting existing code (which is not what
| an LLM does)
|
| I'd like to see someone try to prove this. How many space
| invaders projects exist on the internet? It'd be hard to
| compare model "generated" code to everything out there
| looking for plagiarism, but I bet there are lots of snippets
| pulled in. These things are NOT smart, they are huge and
| articulate information repositories.
| simonw wrote:
| Go for it. https://www.google.com/search?client=firefox-b-1
| -d&q=github+... has a bunch of results. Here's the source
| code GLM-4.5 Air spat out for me on my laptop:
| https://github.com/simonw/tools/blob/main/space-invaders-
| GLM...
|
| Based on my mental model of how these things work I'll be
| genuinely surprised if you can find even a few lines of
| code duplicated from one of those projects into the code
| that GLM-4.5 wrote for me.
| phkahler wrote:
| So I scanned the beginning of the generated code, picked
| line 83: animation: glow 2s ease-in-out
| infinite;
|
| stuffed it verbatim into google and found a stack
| overflow discussion that contained this:
| animation: glow .5s infinite alternate;
|
| in under one minute. Then I found this page of CSS
| effects:
|
| https://alvarotrigo.com/blog/animated-backgrounds-css/
|
| Another page has examples and contains:
| animation: float 15s infinite ease-in-out;
|
| There is just too much internet to scan for an exact
| match or a match of larger size.
| simonw wrote:
| That's not an example of copying from an existing Space
| Invaders implementation. That's an LLM using a CSS
| animation pattern - one that it's seen thousands
| (probably millions) of times in the training data.
|
| That's what I expect these things to do: they break down
| Space Invaders into the components they need to build,
| then mix and match thousands of different coding patterns
| (like "animation: glow 2s ease-in-out infinite;") to
| implement different aspects of that game.
|
| You can see that in the "reasoning" trace here: https://g
| ist.github.com/simonw/9f515c8e32fb791549aeb88304550... -
| "I'll use a modern design with smooth animations,
| particle effects, and a retro-futuristic aesthetic."
| threeducks wrote:
| I think LLMs are adapting higher level concepts. For
| example, the following JavaScript code generated by GLM (
| https://github.com/simonw/tools/blob/9e04fd9895fae1aa9ac7
| 8b8...) is clearly inspired by this C++ code
| (https://github.com/portapack-mayhem/mayhem-
| firmware/blob/28e...), but it is not an exact copy.
| simonw wrote:
| This is a really good spot.
|
| That code certainly looks similar, but I have trouble
| imagining how else you would implement very basic
| collision detection between a projectile and a player
| object in a game of this nature.
| threeducks wrote:
| A human would likely have refactored the two collision
| checks between bullet/enemy and enemyBullet/player in the
| JavaScript code into its own function, perhaps something
| like "areRectanglesOverlapping". The C++ code only does
| one collision check like that, so it has not been
| refactored there, but as a human, I certainly would not
| want to write that twice.
|
| More importantly, it is not just the collision check that
| is similar. Almost the entire sequence of operations is
| identical on a higher level:
|
|     1. enemyBullet/player collision check
|     2. same comment "// Player hit!" (this is how I found the code)
|     3. remove enemy bullet from array
|     4. decrement lives
|     5. update lives UI
|     6. (createParticle only exists in JS code)
|     7. if lives are <= 0, gameOver
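|
| As a rough sketch, the kind of helper I mean (a plain AABB
| overlap check; "areRectanglesOverlapping" is just the name I
| suggested above, not something from either codebase):
|
|     function areRectanglesOverlapping(a, b) {
|       // Rectangles overlap unless one is entirely to the
|       // left of, right of, above, or below the other.
|       return a.x < b.x + b.width &&
|              a.x + a.width > b.x &&
|              a.y < b.y + b.height &&
|              a.y + a.height > b.y;
|     }
|
| Both the bullet/enemy and enemyBullet/player checks could
| then call this instead of repeating the four comparisons
| inline.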
| falcor84 wrote:
| The parent said
|
| > find even a few lines of code duplicated from one of
| those projects
|
| I'm pretty sure they meant multiple lines copied verbatim
| from a single project implementing space invaders, rather
| than individual lines copied (or likely just accidentally
| identical) across different unrelated projects.
| ben_w wrote:
| So, your example of it copying snippets is... using the
| same API with fairly different parameters in a different
| order?
| sejje wrote:
| Is this some kind of joke?
|
| That's how you write css. The examples aren't the same at
| all, they just use the same css feature.
|
| It feels like you aren't a coder--you've sabotaged your
| own point.
| ben_w wrote:
| Sorites paradox. Where's the distinction between a "snippet"
| and "a design pattern"?
|
| Compressing a few petabytes into a few gigabytes _requires_
| that they can't be doing this with all of the things they're
| accused of simply copy-pasting, from code to newspaper
| articles to novels. There's not enough space.
| hbn wrote:
| The prompt was
|
| > Write an HTML and JavaScript page implementing space
| invaders
|
| It may not be "copy pasting", but it's generating output
| that is recreated, as best it can manage, from training that
| included looking at Space Invaders source code.
|
| The engineers at Taito that originally developed Space
| Invaders were not told "make Space Invaders" and then did
| their best to recall all the source code they've looked at in
| their life to re-type the source code to an existing game.
| From a logistics standpoint, where the source code already
| exists and is accessible, you may as well have copy-pasted it
| and fudged a few things around.
| simonw wrote:
| The source code for original Space Invaders from 1978 has
| never been published. The closest to that is disassembled
| ROMs.
|
| I used that prompt because it's the shortest possible
| prompt that tells the model to build a game with a specific
| set of features. If I wanted to build a custom game I would
| have had to write a prompt that was many paragraphs longer
| than that.
|
| The aim of this piece isn't "OMG looks LLMs can build space
| invaders" - at this point that shouldn't be a surprise to
| anyone. What's interesting is that _my laptop_ can run a
| model that is capable of that now.
| sarchertech wrote:
| > The source code for original Space Invaders from 1978
| has never been published. The closest to that is
| disassembled ROMs.
|
| Sure, but that doesn't impact the OP's point at all, because
| there are numerous copies of reverse engineered source code
| available.
|
| There are numerous copies of the reverse engineered source
| code, already translated to JavaScript, in your model's
| training set.
| nottorp wrote:
| > What's interesting is that my laptop can run a model
| that is capable of that now.
|
| I'm afraid no one cared much about your point :)
|
| You'll only get "OMG look how good LLMs are they'll get
| us all fired!" comments and "LLMs suck" comments.
|
| This is how it goes with religion...
| hbn wrote:
| The discussion I replied to was just regarding whether or
| not what the LLM did should be considered "engineering"
|
| It doesn't really matter whether or not the original code
| was published. In fact that original source code on its
| own probably wouldn't be that useful, since I imagine it
| wouldn't have tipped the weights enough to be
| "recallable" from the model, not to mention it was tasked
| with implementing it in web technologies.
| sharkjacobs wrote:
| Making a Space Invaders game is not representative of normal
| engineering because you're reproducing an existing game with
| well known specs and requirements. There are probably
| hundreds of thousands of words describing and discussing
| Space Invaders in GLM-4.5's training data.
|
| It's like using an LLM to implement a red black tree. Red
| black trees are in the training data, so you don't need to
| explain or describe what you mean beyond naming it.
|
| "Real engineering" with LLMs usually requires a bunch of up
| front work creating specs and outlines and unit tests.
| "Context engineering"
| jasonvorhe wrote:
| Smells like moving the goalposts. What will count as real
| engineering in 2028? Implementing Google's infra stack in
| your homelab?
| rafaelmn wrote:
| What about power usage and supporting hardware? Also, the card
| going down means you are down until you get warranty service.
| skeezyboy wrote:
| why are you doing anything locally then?
| tptacek wrote:
| OK, go write Space Invaders by hand.
| LandR wrote:
| I'd hope most professional software engineers could do this
| in an afternoon or so?
| sejje wrote:
| Most professional software engineers have never written a
| game and don't do web work, so I somehow doubt that.
| anthk wrote:
| With Tcl/Tk it's a matter of less than 2 hours.
| dmortin wrote:
| " it will be interesting to see what the business model for
| Anthropic will be if I can run a solid code generation model on
| my local machine "
|
| Most people won't bother with buying powerful hardware for
| this, they will keep using SAAS solutions, so Anthropic can be
| in trouble if cheaper SAAS solutions come out.
| qingcharles wrote:
| The frontier models are always going to tempt you with their
| higher quality and quicker generation, IMO.
| kasey_junk wrote:
| I've been mentally mapping the models to the history of
| databases.
|
| Most DBs in the early days you had to pay for. There are still
| paid DBs that are just better than the ones you don't pay for.
| Some teams think that the cost is worth the improvements and
| there is a (tough) business there. Fortunes were made in the
| early days.
|
| But eventually open source databases became good enough for
| many use cases, and they have their own advantages. So lots of
| teams use them.
|
| I think coding models might have a similar trajectory.
| qingcharles wrote:
| You make a good point -- a majority of applications are now
| using open source or free versions[1] of DBs.
|
| My only feedback is: are these the same animal? Can we
| compare an O/S DB vs. paid/closed DB to me running an LLM
| locally? The biggest issue right now with LLMs is simply
| the cost of the _hardware_ to run one locally, not the
| quality of the actual software (the model).
|
| [1] e.g. SQL Server Express is good enough for a lot of
| tasks, and I guess would be roughly equivalent to the
| upcoming open versions of GPT vs. the frontier version.
| qcnguy wrote:
| A majority of apps nowadays are using proprietary forks
| of open source DBs running in the cloud, where their
| feature set is (slightly) rounded out and smoothed off by
| the cloud vendors.
|
| Not that many projects are doing fully self-hosted RDBMS
| at this point. So ultimately proprietary databases still
| win out; they just (ab)use the PostgreSQL trademark to
| make people think they're using open source.
|
| LLMs might go the same way. The big clouds offering
| proprietary fine tunes of models given away by AI labs
| using investor money?
| qingcharles wrote:
| That's definitely true. I could see more of a model where
| people run open source models on other people's hardware.
|
| I dislike running local LLMs right now because I find the
| software kinda janky still: you often have to tweak settings
| and find the right model files - basically a bunch of domain
| knowledge I don't have space for in my head. On top of that,
| you're maintaining a high-spec piece of hardware and paying
| for the power costs.
| zarzavat wrote:
| Closed doesn't always win over open. People said the same
| thing about Windows vs Linux, but even Microsoft was forced
| to admit defeat and support Linux.
|
| All it takes is some large companies commoditizing their
| complements. For Linux it was Google, etc. For AI it's Meta
| and China.
|
| The only thing keeping Anthropic in business is geopolitics.
| If China were allowed full access to GPUs, they would
| probably die.
| amelius wrote:
| Wake me up when I can apt-get install the llm.
| Kurtz79 wrote:
| You can install ollama with a script fetched with curl and run
| an LLM with a grand total of two bash commands (including the
| curl).
| jus3sixty wrote:
| I recently let go of my 2.5 year old vacuum. It was just
| collecting dust.
| falcor84 wrote:
| Thinking about it, the measure of whether a vacuum is being
| sufficiently used is probably that the circulation of dust
| within it over the last year is greater than the circulation of
| dust on its external boundary over that time period.
| alankarmisra wrote:
| I see the value in showcasing that LLMs can run locally on
| laptops -- it's an important milestone, especially given how
| difficult that was before smaller models became viable.
|
| That said, for something like this, I'd probably get more out of
| simply finding an existing implementation on github or the like
| and downloading that.
|
| When it comes to specialized and narrow domains like Space
| Invaders, the training set is likely to be extremely small and
| the model's vector space will have limited room to generalize.
| You'll get code that is more or less identical to the original
| source, you have to wait for it to 'type' the code, and the
| value add seems very low. I would rather ask it to point me
| to known Space Invaders implementations in language X on
| GitHub (or search there).
|
| Note that ChatGPT gets very nervous if I put this into GPT to
| clean up the grammar. It wants very badly for me to stress that
| LLMs don't memorize and overfitting is very unlikely (I believe
| neither).
| tossandthrow wrote:
| Interesting, I cannot reproduce these warnings in ChatGPT -
| though this is something that really interests me, as it
| represents immense political power to be able to interject such
| warnings (explicitly, or implicitly through slight
| reformulations).
| efitz wrote:
| I missed the word "laptop" in the title at first glance and
| thought this was a "I taught my toddler to code" article.
| juliangoetze wrote:
| I thought I was the only one.
| joelthelion wrote:
| Apart from using a Mac, what can you use for inference with
| reasonable performance? Is a Mac the only realistic option at the
| moment?
| AlexeyBrin wrote:
| A gaming PC with an NVIDIA 4090/5090 will be more than adequate
| for running local models.
|
| Where a Mac may beat the above is on the memory side, if a
| model requires more than 24/32 GB of GPU memory you are usually
| better off with a Mac with 64/128 GB of RAM. On a Mac the
| memory is shared between CPU and GPU, so the GPU can load
| larger models.
| reilly3000 wrote:
| The top 3 approaches I see a lot on r/localllama are:
|
| 1. 2-4x 3090+ NVIDIA cards. Some are getting Chinese 48GB
| cards. There is a ceiling to VRAM that prevents the biggest
| models from being able to load, but most can run most quants at
| great speeds
|
| 2. Epyc servers running CPU inference with lots of RAM at as
| high of memory bandwidth as is available. With these setups
| people are getting like 5-10 t/s but are able to run 450B
| parameter models.
|
| 3. High RAM Macs with as much memory bandwidth as possible.
| They are the best balanced approach and surprisingly reasonable
| relative to other options.
| thenaturalist wrote:
| This guy [0] does a ton of in-depth HW comparison/
| benchmarking, including against Mac mini clusters and an M3
| ultra.
|
| 0: https://www.youtube.com/@AZisk
| regularfry wrote:
| This one should just about fit on a box with an RTX 4090 and
| 64GB RAM (which is what I've got) at q4. Don't know what the
| performance will be yet. I'm hoping for an unsloth dynamic
| quant to get the most out of it.
| weberer wrote:
| What's important is VRAM, not system RAM. The 4090 has 24GB of
| VRAM, so you'll be limited to smaller models at decent speeds.
| Of course, you can run models from system memory, but your
| tokens/second will be orders of magnitude slower. ARM Macs
| are the exception since they have unified memory, allowing
| high bandwidth between the GPU and the system's RAM.
| whimsicalism wrote:
| you are almost certainly better off renting GPUs, but i
| understand self-hosting is an HN touchstone
| qingcharles wrote:
| This. Especially if you just want to try a bunch of different
| things out. Renting is insanely cheap -- to the point where I
| don't understand how the renters are making their money back
| unless they stole the hardware and power.
|
| It can really help you figure a ton of things out before you
| blow the cash on your own hardware.
| 4b11b4 wrote:
| Recommended sites to rent from?
| doormatt wrote:
| runpod.io
| whimsicalism wrote:
| runpod, vast, hyperbolic, prime intellect. if all you're
| doing is going to be running LLMs, you can pay per token
| on openrouter or some of the providers listed there
| mrinterweb wrote:
| I don't know about that. I've had my RTX 4090 for nearly 3
| years now. Suppose I had a script that provisioned and
| deprovisioned a rented 4090 at $0.70/hr for an 8 hour work
| day, 20 work days per month, with 2 paid weeks off per year
| plus normal holidays, over 3 years:
|
|     0.7 * 8 * ((20 * 12) - 8 - 14) * 3 = $3662
|
| I bought my RTX 4090 for about $2200. I also had the pleasure
| of being able to use it for gaming when I wasn't working. To
| be fair, the VRAM requirements for local models keeps
| climbing and my 4090 isn't able to run many of the latest
| LLMs. Also, I omitted cost of electricity for my local LLM
| server cost. I have not been measuring total watts consumed
| by just that machine.
|
| One nice thing about renting is that it gives you flexibility
| in terms of what you want to try.
|
| If you're really looking for the best deals look at 3rd party
| hosts serving open models for the API-based pricing, or
| honestly a Claude subscription can easily be worth it if you
| use LLMs a fair bit.
| whimsicalism wrote:
| 1. I agree - there are absolutely scenarios in which it can
| make sense to buy a GPU and run it yourself. If you are
| managing a software firm with multiple employees, you very
| well might break even in less than a few years. But I would
| wager this is not the case for 90%+ of people self-hosting
| these models, unless they have some other good reason (like
| gaming) to buy a GPU.
|
| 2. I basically agree with your caveats - excluding
| electricity is a pretty big exclusion, and I don't think
| you've had 3 years of really high-value self-hostable
| models; I would really only say the last year, and I'm
| somewhat skeptical of how good the ones that can be hosted
| in 24GB of VRAM are. 4x4090 is a different story.
| badsectoracula wrote:
| An Nvidia GPU is the most common answer, but personally i've
| done all my LLM use locally using mainly Mistral Small
| 3.1/3.2-based models and llama.cpp with an AMD RX 7900 XTX GPU.
| It only gives you ~4.71 tokens per second, but that is fast
| enough for a lot of uses. For example last month or so i wrote
| a raytracer[0][1] in C with Devstral Small 1.0 (based on
| Mistral Small 3.1). It wasn't "vibe coding" as much as a
| "co-op" where i'd go back and forth with a chat interface
| (koboldcpp): i'd, e.g., ask the LLM to implement some feature,
| then i'd switch to the editor and start writing code using that
| feature while the LLM was generating it in the background. Or,
| more often, i'd fix bugs in the LLM's code :-P.
|
| FWIW GPU aside, my PC isn't particularly new - it is a 5-6 year
| old PC that was the cheapest money could buy originally and
| became "decent" at the time i upgraded it ~5 years ago and i
| only added the GPU around Christmas as prices were dropping
| since AMD was about to release the new GPUs.
|
| [0] https://i.imgur.com/FevOm0o.png
|
| [1]
| https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92...
| joshstrange wrote:
| My next MBP is going to need the next size up SSD (RIP bank
| account) so it can hold all the models I want to play with
| locally and my data. Thankfully I already have been maxing out
| the RAM so that isn't something new I also have to do.
| __mharrison__ wrote:
| Time to get a new laptop. My MBP only has 16 gigs.
|
| Looking forward to trying this with Aider.
| sneak wrote:
| What is the SOTA for benchmarking all of the models you can run
| on your local machine vs a test suite?
|
| Surely this must exist, no? I want to generate a local
| leaderboard and perhaps write new test cases.
| petercooper wrote:
| I ran the same experiment on the full size model. It used a
| custom 80s style font (from Google Fonts) and gave 'eyes' and
| more differences to the enemies but otherwise had a similar vibe
| to Simon's. An interesting visual demonstration of what
| quantization does though! Screenshot:
| https://peterc.org/img/aliens.png
| deadbabe wrote:
| You can overtrain a neural network to write a space invaders
| clone. The final weights might take up less disk space than the
| output code.
| indigodaddy wrote:
| Did pretty well with a boggle clone. I like that it tries to do a
| single html file (I didn't ask for that but was pleasantly
| surprised). It didn't include dictionary validation so needed a
| couple of prompts. Touch selection on mobile isn't the greatest
| but I've seen plenty worse
|
| https://chat.z.ai/space/z0gcn6qtu8s1-art
|
| https://chat.z.ai/s/74fe4ddc-f528-4d21-9405-0a8b15a96520
| JKCalhoun wrote:
| Cool -- if only diagonals were easier. ;-) (Hopefully I'm being
| constructive here.)
| indigodaddy wrote:
| Yep I tried to have it improve that but actually didn't use
| the word 'diagonal' in the prompt. I bet it would have done
| better if I had..
| indigodaddy wrote:
| Had it try to improve Diagonal selection but didn't seem to
| help much
|
| https://chat.z.ai/space/b01dc65rg2p0-art
| Keyframe wrote:
| I went the other route with a Tetris clone the other day. It's
| definitely not a single prompt. It took me a solid 15 hours to
| get to this stage, and most of that was me thinking.. BUT,
| except for one small trivial thing (the space invader logo in a
| pre tag) I haven't touched the code - just looked at it. I made
| it mandatory for myself to see if I could first greenfield
| myself into this project and then brownfield features and
| fixes.. It's definitely a ton of work on my end, but it's also
| not something I'd be able to do in ~2 working days or less. As
| a cherry on top, even though it's still not done yet, I put in
| AI-generated music singing about the project itself.
| https://www.susmel.com/stacky/
|
| Definitely a ton of things I learned about how to "develop"
| "with" AI along the way.
| lifestyleguru wrote:
| > my 2.5 year old laptop (a 64GB MacBook Pro M2)
|
| My MacBook has 16GB of RAM and it is from a period when everyone
| was fiercely insisting that the 8GB base model was all I'd ever
| need.
| tracker1 wrote:
| I'm kind of with you... while I've run 128GB on my desktop, and
| am currently at 96GB with DDR5 being what it is, it's far less
| common for typical laptops. I'm a bit curious how the Ryzen AI
| Max+ 395 with 128GB will handle some of these models. The 200GB
| options feel completely out of reach.
| Aurornis wrote:
| This is very cool. The blog post's author had to run it from the
| main branch of the mlx-lm library with a custom script. Can
| someone up to date on the local LLM tools let us know which
| mainstream tools we should be watching for an easier way to run
| this on MLX? The space moves so fast that it's hard to keep up.
| simonw wrote:
| I expect LM Studio will have this pretty soon - I imagine they
| are waiting on the next stable release of mlx-lm which will
| include the change I needed to get this to work.
| righthand wrote:
| Did you understand the implementation or just that it produced a
| result?
|
| I would hope an LLM could spit out a cobbled form of answer to a
| common interview question.
|
| Today a colleague presented data changes and used an LLM to build
| a display app for the JSON for presentation. Why did they not
| just pipe the JSON into our already working app that displays
| this data?
|
| People around me for the most part are using LLMs to enhance
| their presentations, not to actually implement anything useful. I
| have been watching my coworkers use it that way for months.
|
| Another example? A different coworker wanted to build a document
| macro to perform bulk updates on courseware content. Swapping old
| words for new words. To build the macro they first wrote a
| rubric to prompt an LLM correctly inside of a Word doc.
|
| That filled-in rubric is then used to generate a program template
| for the macro. To define the requirements for the macro the
| coworker then used a slideshow slide to list bullet points of
| functionality, in this case to Find+Replace words in courseware
| slides/documents using a list of words from another text
| document. Due to the complexity of the system, I can't believe my
| colleague saved any time. The presentation was interesting though
| and that is what they got compliments on.
|
| However the solutions are absolutely useless for anyone else but
| the implementer.
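|
| For scale: the core of that find-and-replace is a few lines of
| plain JavaScript once the content is available as text (a rough
| sketch; the word map here is made up, and a real Word macro
| would still need document plumbing around it):
|
|     // Swap old terms for new ones using a simple word map.
|     const replacements = { "old term": "new term", "legacy": "current" };
|
|     function replaceWords(text, map) {
|       return Object.entries(map).reduce(
|         (result, [from, to]) => result.replaceAll(from, to),
|         text
|       );
|     }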
| simonw wrote:
| I scanned the code and understood what it was doing, but I
| didn't spend much time on it once I'd seen that it worked.
|
| If I'm writing code for production systems using LLMs I still
| review every single line - my personal rule is I need to be
| able to explain how it works to someone else before I'm willing
| to commit it.
|
| I wrote a whole lot more about my approach to using LLMs to
| help write "real" code here:
| https://simonwillison.net/2025/Mar/11/using-llms-for-code/
| th0ma5 wrote:
| [flagged]
| CamperBob2 wrote:
| I missed the part where he said he was going to put the
| Space Invaders game into production. Link?
| bnchrch wrote:
| You do realize you're talking to the creator of Django,
| Datasette, and Lanyrd, right?
| tough wrote:
| that made me chuckle
| ajcp wrote:
| They said "production systems", not "critical production
| applications".
|
| Also the 'if' doesn't negate anything as they say "I
| still", meaning the behavior is actively happening or
| ongoing; they don't use a hypothetical or conditional after
| "still", as in "I still _would_ ".
| dang wrote:
| Please don't cross into personal attack in HN comments.
|
| https://news.ycombinator.com/newsguidelines.html
|
| Edit: twice is already a pattern -
| https://news.ycombinator.com/item?id=44110785. No more of
| this, please.
|
| Edit 2: I only just realized that you've been frequently
| posting abusive replies in a way that crosses into harangue
| if not harassment:
|
| https://news.ycombinator.com/item?id=44725284 (July 2025)
|
| https://news.ycombinator.com/item?id=44725227 (July 2025)
|
| https://news.ycombinator.com/item?id=44725190 (July 2025)
|
| https://news.ycombinator.com/item?id=44525830 (July 2025)
|
| https://news.ycombinator.com/item?id=44441154 (July 2025)
|
| https://news.ycombinator.com/item?id=44110817 (May 2025)
|
| https://news.ycombinator.com/item?id=44110785 (May 2025)
|
| https://news.ycombinator.com/item?id=44018000 (May 2025)
|
| https://news.ycombinator.com/item?id=44008533 (May 2025)
|
| https://news.ycombinator.com/item?id=43779758 (April 2025)
|
| https://news.ycombinator.com/item?id=43474204 (March 2025)
|
| https://news.ycombinator.com/item?id=43465383 (March 2025)
|
| https://news.ycombinator.com/item?id=42960299 (Feb 2025)
|
| https://news.ycombinator.com/item?id=42942818 (Feb 2025)
|
| https://news.ycombinator.com/item?id=42706415 (Jan 2025)
|
| https://news.ycombinator.com/item?id=42562036 (Dec 2024)
|
| https://news.ycombinator.com/item?id=42483664 (Dec 2024)
|
| https://news.ycombinator.com/item?id=42021665 (Nov 2024)
|
| https://news.ycombinator.com/item?id=41992383 (Oct 2024)
|
| That's abusive, unacceptable, and not even a complete list!
|
| You can't go after another user like this on HN, regardless
| of how right you are or feel you are or who you have a
| problem with. If you keep doing this, we're going to end up
| banning you, so please stop now.
| photon_lines wrote:
| This is why I love using the DeepSeek chain-of-reasoning output
| ... I can actually go through and read what it's 'thinking' to
| validate whether it's basing its solution on valid facts /
| assumptions. Either way, thanks for all of your valuable
| write-ups on these models, I really appreciate them Simon!
| vessenes wrote:
| Nota bene - there is a fair amount of research indicating
| that models' outputs and 'thoughts' do not necessarily align
| with their chain-of-reasoning output.
|
| You can validate this pretty easily by asking some logic or
| coding questions: you will likely note that a final output
| is not necessarily the logical output of the end of the
| thinking; sometimes significantly orthogonal to it, or
| returning to reasoning in the middle.
|
| All that to say - good idea to read it, but stay vigilant
| on outputs.
| shortrounddev2 wrote:
| Serious question: if you have to read every line of code in
| order to validate it in production, why not just _write_
| every line of code instead?
| simonw wrote:
| Because it's much, much faster to review a hundred lines of
| code than it is to write a hundred lines of code.
|
| (I'm experienced at reading and reviewing code.)
| paufernandez wrote:
| Simon, don't you fear "atrophy" in your writing ability?
| bsder wrote:
| > However the solutions are absolutely useless for anyone else
| but the implementer.
|
| Disposable code is where AI _shines_.
|
| AI generating the boilerplate code for an obtuse build system?
| Yes, please. AI generating an animation? Ganbatte. (Look at how
| much work 3Blue1Brown had to put into that--if AI can help that
| kind of thing, it has my blessings). AI enabling someone who
| doesn't program to generate _some prototype_ that they can then
| point at an actual programmer? Excellent.
|
| This is fine because you _don't need to understand the
| result_. You have a concrete pass/fail gate and don't care
| about what's underneath. This is real value. The problem is
| that it isn't _gigabuck_ value.
|
| The stuff that would be gigabuck value is unfortunately where
| AI falls down. Fix this bug in a product. Add this feature to
| an existing codebase. etc.
|
| AI is also a problem because disposable code is what you would
| assign to junior programmers in order for them to learn.
| magic_hamster wrote:
| The LLM is the solution.
| aplzr wrote:
| I really like talking to Claude (free tier) instead of using a
| search engine when I'm stumbling upon a random topic that
| interests me. For example, this morning I had it explain the
| differences between pass by value, pass by reference, and pass by
| sharing, the last of which I wasn't aware of until then.
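|
| (A minimal JavaScript sketch of "pass by sharing", the last of
| those - this is my own illustration, not Claude's: the callee
| gets a copy of the reference, so it can mutate the object but
| can't rebind the caller's variable.)
|
|     function mutate(obj) {
|       obj.value = 42;      // mutation is visible to the caller
|       obj = { value: 0 };  // rebinding the parameter is not
|     }
|
|     const thing = { value: 1 };
|     mutate(thing);
|     console.log(thing.value); // 42 - mutated, but not replaced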
|
| Is this kind of thing also possible with one of these self-hosted
| models in a comparable way, or are they mostly good for coding?
| dcchambers wrote:
| Amazing. There really is no secret sauce that the frontier models
| have.
| accrual wrote:
| Very impressive model! The SVG pelican designed by GLM 4.5 in
| Simon's adjacent article is the most accurate I've seen yet.
| 4b11b4 wrote:
| Quick, someone knit a quilt with all the different SVG pelicans
| bgwalter wrote:
| The GLM-4.5 model utterly fails at creating ASCII art or
| factorizing numbers. It can "write" Space Invaders because there
| are literally thousands of open source projects out there.
|
| This is another example of LLMs being dumb copiers that do
| understand human prompts.
|
| But there is one positive side to this: if this photocopying
| business can be run locally, the stocks of OpenAI etc. should go
| to zero.
| simonw wrote:
| Why would you use an LLM to factorize numbers?
| bgwalter wrote:
| Because we are told that they can solve IMO problems. Yet
| they fail at basic math problems, not only at factorization
| but also when probing them with relatively basic symbolic
| math that would not require the invocation of an external
| program.
|
| Also, you know, if they fail they could say so instead of
| giving a hallucinated answer. First the models lie and say
| that factoring a 20 digit number takes vast amounts of
| computing. Then, if pointed to a factorization program, they
| pretend to execute it and lie about the output.
|
| There is no intelligence or flexibility apart from stealing
| other people's open source code.
| simonw wrote:
| That's why the IMO results were so notable: that was one of
| those moments where new models were demonstrated doing
| something that they had previously been unable to do.
| ducktective wrote:
| I can't fathom why more people aren't talking about the
| IMO story. Apparently the model they used is not just an
| LLM but some RL is involved too. If a model wins gold at
| the IMO, is it still merely a "statistical parrot"?
| sejje wrote:
| Stochastic parrot is the term.
|
| I don't think it's ever been accurate.
| bgwalter wrote:
| The results were private and the methodology was not
| revealed. Even Tao, who was bullish on "AI", is starting
| to question the process.
| simonw wrote:
| The same thing has also been achieved by a Google
| DeepMind team and at least one group of independent
| researchers using publicly available models and careful
| prompting tricks.
| lherron wrote:
| With the Anthropic rug pull on quotas for Max, I feel the short-
| mid term value sweet spot will be a Frankensteined together
| "Claude as orchestrator/coder, falling back to local models as
| quota limits approach" tool suite.
| 4b11b4 wrote:
| Was thinking this one might backfire on Anthropic in the end...
|
| People are going to explore and get comfortable with
| alternatives.
|
| There may have been other ways to deal with the cases they were
| worried about.
| h-bradio wrote:
| Thanks so much for this! I updated LM Studio, and it picked up
| the mlx-lm update required. After a small tweak to tool-calling
| in the prompt, it works great with Zed!
| torarnv wrote:
| Could you describe the tweak you did, and possibly the general
| setup you have with zed working with LM Studio? Do you use a
| custom system prompt? What context size do you use?
| Temperature? Thanks!
| ddtaylor wrote:
| My brain is running legacy COBOL and first read this as
|
| > My 2.5 year old with their laptop can write Space Invaders
|
| For a few hundred milliseconds there I was thinking "these damn
| kids are getting good with tablets"
| Imustaskforhelp wrote:
| Don't worry I guess my brain is running bleeding edge
| typescript with react (I am in high school for context) and the
| first time I also read it this way...
|
| But I am without my glasses, but still I have hackernews at
| 250%, I think I am a little cooked lol.
| OldfieldFund wrote:
| We are all cooked at this point :)
| skeezyboy wrote:
| But aren't we still decades away from running our own video-
| creating AIs locally? Have we plateaued with this current
| generation of techniques?
| svachalek wrote:
| It's more a question of, how long do you want it to take to
| create a video locally?
| skeezyboy wrote:
| nah, i definitely want to know what i asked
| sejje wrote:
| His answer implies you can run them locally now, just not
| in a useful timeframe.
| polynomial wrote:
| At first I read this as "My 2.5 year old can write Space Invaders
| in JavaScript now"
| maksimur wrote:
| A $xxxx 2.5 year old laptop, one that's probably much more
| powerful than an average laptop bought today and probably next
| year as well. I don't think it's a fair reference point.
| parsimo2010 wrote:
| The article is pretty good overall, but the title did irk me a
| little. I assumed when reading "2.5 year old" that it was
| fairly low-spec only to find out it was an M2 Macbook Pro with
| 64 GB of unified memory, so it can run models bigger than what
| an Nvidia 5090 can handle.
|
| I suppose that it could be intended to be read as "my laptop is
| only 2.5 years old, and therefore fairly modern/powerful" but I
| doubt that was the intention.
| simonw wrote:
| The reason I emphasize the laptop's age is that it is the
| same laptop I have been using ever since the first LLaMA
| release.
|
| This makes it a great way to illustrate how much better the
| models have got without requiring new hardware to unlock
| those improved abilities.
| bprew wrote:
| His point isn't that you can run a model on an average laptop,
| but that the same laptop can still run frontier models.
|
| It speaks to the advancements in models that aren't just
| throwing more compute/ram at it.
|
| Also, his laptop isn't that fancy.
|
| > It claims to be small enough to run on consumer hardware. I
| just ran the 7B and 13B models on my 64GB M2 MacBook Pro!
|
| From: https://simonwillison.net/2023/Mar/11/llama/
| nh43215rgb wrote:
| About $3700 laptop...
| asadm wrote:
| How good is this model with tool calling?
| bob1029 wrote:
| > still think it's noteworthy that a model running on my 2.5 year
| old laptop (a 64GB MacBook Pro M2) is able to produce code like
| this--especially code that worked first time with no further
| edits needed.
|
| I believe we are vastly underestimating what our existing
| hardware is capable of in this space. I worry that narratives
| like the bitter lesson and the efficient compute frontier are
| pushing a lot of brilliant minds away from investigating
| revolutionary approaches.
|
| It is obvious that the current models are deeply inefficient when
| you consider how much you can decimate the precision of the
| weights post-training and still have pelicans on bicycles, etc.
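|
| To make "decimating the precision" concrete, here is a toy
| sketch of symmetric 4-bit quantization applied to a handful of
| weights (illustrative only - real quantizers work per block and
| are considerably more sophisticated):
|
|     // Map each weight to an integer in -7..7 plus one scale.
|     function quantize4bit(weights) {
|       const maxAbs = Math.max(...weights.map(Math.abs));
|       const scale = maxAbs / 7;
|       const q = weights.map(w => Math.round(w / scale));
|       const dequantized = q.map(v => v * scale); // what inference sees
|       return { q, scale, dequantized };
|     }
|
|     console.log(quantize4bit([0.12, -0.53, 0.07, 0.91]).dequantized);
|     // roughly [0.13, -0.52, 0.13, 0.91] - close, at a fraction of the bits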
| jonas21 wrote:
| Wasn't the bitter lesson about training on large amounts of
| data? The model that he's using was still trained on a massive
| corpus (22T tokens).
| yahoozoo wrote:
| What does that have to do with quantizing?
| itsalotoffun wrote:
| I think GP means that if you internalize the bitter lesson
| (more data more compute wins), you stop imagining how to
| squeeze SOTA minus 1 performance out of constrained compute
| environments.
| lxgr wrote:
| This raises an interesting question I've seen occasionally
| addressed in science fiction before:
|
| Could today's consumer hardware run a future superintelligence
| (or, as a weaker hypothesis, at least contain some lower-level
| agent that can bootstrap something on other hardware via
| networking or hyperpersuasion) if the binary dropped out of a
| wormhole?
| switchbak wrote:
| This is what I find fascinating. What hidden capabilities
| exist, and how far could it be exploited? Especially on exotic
| or novel hardware.
|
| I think much of our progress is limited by the capacity of the
| human brain, and we mostly proceed via abstraction which allows
| people to focus on narrow slices. That abstraction has a cost,
| sometimes a high one, and it's interesting to think about what
| the full potential could be without those limitations.
| lxgr wrote:
| Abstraction, or efficient modeling of a given system, is
| probably a feature, not a bug, given the strong similarity
| between intelligence and compression and all that.
|
| A concise description of the _right_ abstractions for our
| universe is probably not too far removed from the weights of
| a superintelligence, modulo a few transformations :)
| bob1029 wrote:
| This is the premise of all of the ML research I've been into.
| The only difference is to replace the wormhole with linear
| genetic programming, neuroevolution, et. al. The size of
| programs in the demoscene is what originally sent me down this
| path.
|
| The biggest question I keep asking myself - What is the
| Kolmogorov complexity of a binary image that provides the exact
| same capabilities as the current generation LLMs? What are the
| chances this could run on the machine under my desk right now?
|
| I know how many AAA frames per second my machine is capable of
| rendering. I refuse to believe the gap between running CS2 at
| 400fps and getting ~100b/s of UTF8 text out of a NLP black box
| is this big.
| bgirard wrote:
| > ~100b/s of UTF8 text out of a NLP black box is this big
|
| That's not a good measure. NP problem solutions are only a
| single bit, but they are much harder to solve than CS2 frames
| for large N. If it could solve any problem perfectly, I would
| pay you billions for just 1b/s of UTF8 text.
| bob1029 wrote:
| > If it could solve any problem perfectly, I would pay you
| billions for just 1b/s of UTF8 text.
|
| Exactly. This is what compels me to try.
| wslh wrote:
| Here's a sci-fi twist: suppose Space Invaders and similar early
| games were seeded by a future intelligence. (*_*)
| another_one_112 wrote:
| Crazy to think that you can have a mostly-competent oracle even
| when disconnected from the grid.
| msikora wrote:
| With a 48GB MacBook Pro M3 I'm probably out of luck, right?
| simonw wrote:
| For this particular model, yes.
|
| This new one from Qwen should fit though - it looks like that
| only needs ~30GB of RAM: https://huggingface.co/lmstudio-
| community/Qwen3-30B-A3B-Inst...
| omneity wrote:
| It takes ~17-20GB on Q4 depending on context length &
| settings (running it as we speak)
|
| ~30GB in Q8 sure, but it's a minimal gain for double the VRAM
| usage.
| andai wrote:
| I got almost the same result with a 4B model (Qwen3-4B), about
| 20x smaller than OP's ~200B model.
|
| https://jsbin.com/lejunenezu/edit?html,output
|
| Its pelican was a total fail though.
| andai wrote:
| Update: It failed to make Flappy Bird though (several
| attempts).
|
| This surprises me, I thought it would be simpler than Space
| Invaders.
| simonw wrote:
| There's a new model from Qwen today - Qwen3-30B-A3B-Instruct-2507
| - that also runs comfortably on my Mac (using about 30GB of RAM
| with an 8bit quantization).
|
| I tried the "Write an HTML and JavaScript page implementing space
| invaders" prompt against it and didn't quite get a working game
| with a single shot, but it was still an interesting result:
| https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct...
| xianshou wrote:
| I initially read the title as "My 2.5 year old can write Space
| Invaders in JavaScript now (GLM-4.5 Air)."
|
| Though I suppose, given a few years, that may also be true!
| dust42 wrote:
| I tried with Claude Sonnet 4 and it does *not* work. So looks
| like GLM-4.5 Air in 3bit quant is ahead.
|
| Chat is here:
| https://claude.ai/share/dc9eccbf-b34a-4e2b-af86-ec2dd83687ea
|
| Claude Opus 4 does work but is far behind of Simon's GLM-4.5:
| https://claude.ai/share/5ddc0e94-3429-4c35-ad3f-2c9a2499fb5d
___________________________________________________________________
(page generated 2025-07-29 23:00 UTC)