[HN Gopher] What's the strongest AI model you can train on a lap...
       ___________________________________________________________________
        
       What's the strongest AI model you can train on a laptop in five
       minutes?
        
       Author : ingve
       Score  : 480 points
       Date   : 2025-08-12 13:15 UTC (2 days ago)
        
 (HTM) web link (www.seangoedecke.com)
 (TXT) w3m dump (www.seangoedecke.com)
        
       | bbarnett wrote:
       | Perhaps grimlock level:
       | 
       | https://m.youtube.com/shorts/4qN17uCN2Pg
        
         | treetalker wrote:
         | "Hadn't thought of that ..."
         | 
         | "You're absolutely right!"
        
       | lamuswawir wrote:
       | Thanks.
        
       | zarzavat wrote:
        | Instead of time it should be energy. What is the best model
        | you can train with a given budget in Joules? Then the MBP and
        | the H100 are on a more even footing.
        
         | NooneAtAll3 wrote:
         | it's not about efficiency - it's about availability
         | 
         | H100 is not an everyday product. Laptop is
        
           | KeplerBoy wrote:
           | Still, I don't think the m4 is going to be far off from the
           | h100 in terms of energy efficiency.
           | 
           | edit: fixed typo
        
             | menaerus wrote:
              | What efficiency did you have in mind? Bandwidth-wise,
              | the M4 is ~10x to ~30x lower.
        
               | KeplerBoy wrote:
               | ah, i mistyped. I meant energy efficiency, not memory
               | efficiency.
        
           | Der_Einzige wrote:
           | At this point, given how many H100s there are in existence,
           | it's basically an everyday product.
        
             | logicchains wrote:
             | I envy you if $25k is an everyday product cost.
        
               | jeroenhd wrote:
               | For what it's worth, most of the world can't afford an M4
               | Macbook either.
        
               | wongarsu wrote:
               | And renting an H100 for an hour is a lot easier than
               | renting an M4 MacBook for an hour.
        
               | falcor84 wrote:
               | Maybe not to buy one, but to rent one. Like how barista-
               | made coffee is an everyday product even though most
               | people can't afford a fancy professional coffee machine.
        
               | bee_rider wrote:
               | Reasonably high quality coffee machines are very
               | widespread. Or you can do pour-over. I don't think the
               | cost of a machine is a limiting factor for many people,
               | it is just convenience.
               | 
               | Maybe an analogy could be made to espresso, nice espresso
               | machines get costlier. But, you can still get quite good
               | results out of a manual machine like a Flair.
               | 
                | I think this is why the suggestion to rent a machine
                | is not too helpful. In this analogy we're on
                | BaristaNews, we
               | all know about the industrial machines, lots of folks use
               | them at work. But, the topic of what sort of things you
               | can do on your manual machine at home has come up.
        
               | inetknght wrote:
               | > _Reasonably high quality coffee machines are very
               | widespread. Or you can do pour-over. I don't think the
               | cost of a machine is a limiting factor for many people_
               | 
                | No, reasonably-priced coffee machines are an enabling
                | factor for many people.
               | 
               | If coffee machines weren't reasonably priced, they would
               | not be "very widespread".
        
               | bee_rider wrote:
               | I'm not sure I follow your deeper meaning here, sorry.
        
           | Sharlin wrote:
           | H100s are almost-instantly available to anyone with a credit
           | card and access to the internet. Without even having to lift
           | their butt from the seat. And you get plenty more than five
           | minutes of compute for the price of an M4.
        
             | jsperson wrote:
              | For the orgs where I've worked, the important thing
              | isn't availability of compute, it's security. Using what
              | we have
             | on our local network is much easier from a governance and
             | approval standpoint than whatever is available on the
             | internet.
        
               | Sharlin wrote:
               | Many orgs have no problems using cloud envs for most
               | things. The usual suspects offer just as secure compute
               | envs as everything else.
               | 
               | Anyway, I was assuming personal use, like the messing-
               | around experimenting that the article is about. (Or who
               | knows, maybe it was part of the author's job.)
        
             | potatolicious wrote:
             | And yet just about any intro-to-programming tutorial gets
             | something running on your local machine, and local machine
             | development continues to be the default for most people,
             | even though devving on a cloud machine is eminently
             | reasonable.
             | 
             | "Pull out credit card, sign up for some thing and pay a bit
             | of money" is a non-trivial bit of friction! Extremely non-
             | trivial!
             | 
             | Especially in a corporate context - you have to get the
             | expense approved. It's not clear if you can put company
             | data onto the machine. Whereas generally running local
             | things on corporate laptops is far less controversial.
             | 
             | "Download this tool and run it." is still an extremely
             | powerful pitch. Pretty much the only thing that beats it is
             | "go to this website which you can use without any signup or
             | payment".
        
               | Sharlin wrote:
               | Sure, if you already have said local machine. Which I
               | guess in HN's context many/most do.
        
             | ekianjo wrote:
             | no org will let you send their data to a random online
             | h100...
        
               | Sharlin wrote:
               | Many orgs happily use Google's everything. And Google
               | offers secure compute envs just like it offers secure
               | cloud everything.
               | 
               | Anyway, I thought the context was doing stuff for
               | personal use/fun, not work.
        
               | sethhochberg wrote:
               | Frankly I think a lot of full-time-employed technical
               | people are largely experimenting for fun in the context
               | of things that might eventually be useful to their
               | employer. AI is cool and fascinating stuff and when I
               | have a few idle minutes at the end of my workweek I love
               | catching up and experimenting with the latest and
               | greatest, but with an eye towards company problems and on
               | company time, and sometimes using company datasets. That
               | means company vendor approval and financing of my
               | efforts.
               | 
                | In my personal life, when it's time for fun, I close
                | the laptop and go do some gardening.
        
             | dekhn wrote:
             | While I love cloud computing, you're comparing the cost of
             | renting a GPU for a fixed amount of time to the purchase of
             | an asset which can be used for years. Not a useful
             | comparison IMHO.
        
               | sudoshred wrote:
               | Disagree, equity of access matters a lot. Not everyone
               | benefits from exposure to the entire hardware lifecycle,
               | the same way that buying housing is not the best
               | financial decision for everyone regardless of
               | affordability. I might have unlimited budget but if I
               | only need access to state of the art hardware
               | intermittently or under irregular circumstances the cost
               | of renting may be efficient for my needs. Also consider
               | the costs of supporting hardware that is fully owned, if
               | you own the hardware but underutilize it that is
               | inefficiency and the owner bears that cost. The unusual
                | way that silicon depreciates means that the value of
                | your
               | "asset" is not static and rapidly depreciates as silicon
               | manufacturing improves.
        
               | dekhn wrote:
               | Your argument is not related to my statement. You're
               | arguing something else.
        
             | victorbjorklund wrote:
             | I already have an M4 so the cost of running it is tiny.
        
             | 0x457 wrote:
              | Yeah, you need a large server rack to run those H100s.
              | But realistically, the majority of people have a PC with
              | a consumer-grade GPU, or more likely a laptop
              | with...laptop-grade GPU.
              | 
              | Cloud H100s don't count because you need a lawyer to
              | review the ToS and other agreements.
        
           | nickpsecurity wrote:
           | Also, my laptop running Linux and its outputs are probably
            | mine and private. If I use cloud GPUs, I need to be a lawyer
           | to be sure what they can or can't do with my data or models.
           | 
            | There are also no overages or hidden charges with a
            | laptop, past simply breaking it. You know the replacement
            | cost ahead of time, though.
        
         | giancarlostoro wrote:
          | The Mac is more competitive on power consumption, though,
          | since it never pulls as much as an Nvidia GPU, as I
          | understand it.
          | 
          | On that note, you can rent an H100 for an hour for under
          | $10, which might make for a slightly more interesting test:
          | what's the best model outcome you can train in under an
          | hour?
        
           | dtnewman wrote:
           | > you can rent an H100 for an hour for under $10
           | 
           | Far cheaper these days. More like $2-3 for a consumer to do
           | this. For bulk deals, pricing is often < $2.
        
             | giancarlostoro wrote:
              | I couldn't remember the exact amount offhand, but
              | figured it worth noting that under $10 for one high-end
              | GPU for an entire hour is still impressive.
        
           | bigyabai wrote:
            | It depends. If you're bottlenecked by memory speed, the
            | Mac typically comes out on top.
            | 
            | In terms of compute efficiency though, Nvidia still has
            | Apple beat. Nvidia wouldn't have the datacenter market on
            | a leash if Apple was putting up a real fight.
        
             | giancarlostoro wrote:
             | Yeah, this is correct. My 3080 will render quicker than my
             | M4 but my M4 will outcompete on being able to load larger
             | models.
        
         | netcan wrote:
          | They're all good. Being somewhat arbitrary isn't a bad thing.
        
         | jvanderbot wrote:
          | Bro, why not both?
         | 
         | We can / should benchmark and optimize this to death on all
         | axes
        
         | motorest wrote:
         | > Instead of time it should be energy (...) Then the MBP and
         | H100 are on a more even footing.
         | 
         | What exactly is your point? That instead of expressing
         | workloads in terms of what a laptop could do, you prefer to
         | express them in terms of what a MacBook Pro could do?
        
           | zarzavat wrote:
           | The point is that "best model you can train in 5 minutes" is
           | hardware dependent, the answer will be different depending on
           | the hardware available. So it's necessarily a single-player
           | game.
           | 
           | "Best model you can train with X joules" is a fairer contest
           | that multiple people could take part in even if they have
           | different hardware available. It's not completely fair, but
           | it's fair enough to be interesting.
           | 
           | Training models with an energy limit is an interesting
           | constraint that might lead to advances. Currently LLMs
           | implement online learning by having increasingly large
           | contexts that we then jam "memories" into. So there is a
           | strict demarcation between information learned during pre-
           | training and during use. New more efficient approaches to
           | training could perhaps inform new approaches to memory that
            | are less heterogeneous.
           | 
           | tl;dr: more dimensionally correct
        
       | hodgehog11 wrote:
       | I love seeing explorations like this, which highlight that easily
       | accessible hardware can do better than most people think with
       | modern architectures. For many novel scientific tasks, you really
       | don't need an H100 to make progress using deep learning over
       | classical methods.
        
       | tootyskooty wrote:
        | I suspect one can go a lot further by adopting some tweaks
        | from the GPT-2 speedrun effort [0]: at minimum Muon, a better
        | init, and carefully tuned learning rates.
       | 
       | [0]: https://github.com/KellerJordan/modded-nanogpt
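        | 
        | A minimal sketch of Muon's core trick, assuming PyTorch: each
        | 2-D weight update is approximately orthogonalized with a few
        | Newton-Schulz iterations (quintic coefficients as used in
        | modded-nanogpt):
        | 
        |   import torch
        | 
        |   def newton_schulz(G, steps=5):
        |       # approximately orthogonalize G; coefficients
        |       # from modded-nanogpt (full versions transpose
        |       # tall matrices first)
        |       a, b, c = 3.4445, -4.7750, 2.0315
        |       X = G / (G.norm() + 1e-7)
        |       for _ in range(steps):
        |           A = X @ X.T
        |           X = a * X + (b * A + c * A @ A) @ X
        |       return X
        | 
        |   # per step, for each 2-D weight W with momentum M:
        |   #   M = beta * M + W.grad
        |   #   W -= lr * newton_schulz(M)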
        
       | nottorp wrote:
       | But supposing you have a real specific need to train, is the
       | training speed still relevant? Or do the resources spent on
       | gathering and validating the data set dwarf the actual CPU/GPU
       | usage?
        
         | wongarsu wrote:
          | If training is trivially fast, that allows you to iterate
          | on architecture choices, hyperparameters, which data to
          | include, etc., as sketched below.
         | 
         | Of course that only works if the trial runs are representative
         | of what your full scale model will look like. But within those
         | constraints optimising training time seems very valuable
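          | 
          | A sketch of that iteration loop (build_model and
          | train_briefly are hypothetical stand-ins for whatever the
          | experiment uses):
          | 
          |   from itertools import product
          | 
          |   results = {}
          |   for lr, width in product([3e-4, 1e-3], [128, 256]):
          |       model = build_model(width)       # hypothetical
          |       results[(lr, width)] = train_briefly(model, lr)
          |   print(min(results, key=results.get))  # best combo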
        
       | l5870uoo9y wrote:
       | The most powerful Macbook Pro currently has 16 CPU cores, 40 GPU
       | cores, and 128 GB of RAM (and a 16-core "neural engine"
       | specifically designed to accelerate machine learning).
       | Technically, it is a laptop, but it could just as well be a
       | computer optimized for AI.
        
         | alberth wrote:
          | The Mac Studio has:
          | 
          |     32 CPU
          |     80 GPU
          |     512GB RAM
         | 
         | https://www.apple.com/shop/buy-mac/mac-studio/apple-m3-ultra...
        
           | Joel_Mckay wrote:
           | From https://opendata.blender.org/ :
           | 
           | Apple M3 Ultra (GPU - 80 cores) scores 7235.31
           | 
           | NVIDIA GeForce RTX 5090 Laptop GPU scores 7931.31
           | 
            | Note that NVIDIA's memory constraints are not like Apple
            | silicon's, which also tends to be less I/O constrained.
            | YMMV
           | 
           | https://www.youtube.com/watch?v=d8yS-2OyJhw
           | 
           | https://www.youtube.com/watch?v=Ju0ndy2kwlw
           | 
            | Apple m3/m4 silicon is certainly good in some ways, but
            | the bottleneck is often a lack of CUDA software support,
            | and price (you could buy >4 times the raw GPU performance
            | in a dual RTX 5090 desktop). =3
        
             | pstuart wrote:
             | Not just GPU performance -- the M3 Ultra has memory
             | bandwidth of ~800GBps vs ~1,800GBps for the 5090.
             | 
             | I would wager that Apple recognizes the value prop for the
             | mac to be used for AI and will up their memory bandwidth to
             | stay in the game.
        
           | lukan wrote:
           | That's a well made page, describing nice hardware, but
           | doesn't seem to be a laptop.
        
             | MobiusHorizons wrote:
              | I think the point is that laptops are more limited than
              | other form factors. I'm reading it as a response to the
              | comment that MacBooks are computers optimized for AI and
              | only technically laptops (which is a pretty ridiculous
              | statement imo). Apple's architecture happens to be very
              | good at a lot of compute-heavy tasks, especially where
              | total available GPU RAM and low-latency handoff between
              | the CPU and the GPU are concerned. This happens to be
              | very well suited to LLM workloads.
        
       | LorenDB wrote:
       | > Paris, France is a city in North Carolina. It is the capital of
       | North Carolina, which is officially major people in Bhugh and
       | Pennhy. The American Council Mastlandan, is the city of Retrea.
       | There are different islands, and the city of Hawkeler: Law is the
       | most famous city in The Confederate. The country is Guate.
       | 
       | I love the phrase "officially major people"! I wonder how it
       | could be put to use in everyday speech?
        
       | wowczarek wrote:
       | Not the point of the exercise obviously, but at five minutes'
       | training I wonder how this would compare to a Markov chain bot.
        
       | mhogers wrote:
        | Any reason to upgrade an M2 16GB MacBook to an M4 ..GB (or
        | 2026 M5) for local LLMs? I'm due an upgrade soon, and perhaps
        | it would be educational to run these things more easily
        | locally?
        
         | ionwake wrote:
          | I did just that: got the 32GB RAM one so I could run Qwen.
          | 
          | It might still be early days. I'm trying to use the model
          | to sort my local notes, but I don't know, man, it seems
          | only a little faster and still unusable, even though I
          | downloaded the lighter Qwen model as recommended.
          | 
          | Again, it's early days and maybe I'm being an idiot, but I
          | did manage to get it to parse one note after about 15 mins.
        
           | dpoloncsak wrote:
            | Have a 16GB one, just set up Ollama yesterday.
            | 
            | gpt-oss-20b eats too much RAM to use for anything other
            | than an overnight task. Maybe 3 tok/s.
            | 
            | Been playing around with the 8B versions of Qwen and
            | DeepSeek. Seems usable so far. YMMV, I'm just messing
            | around in chat at the moment, haven't really had it do
            | any tasks for me.
        
         | sandreas wrote:
          | For LLMs, VRAM is requirement number one. Since MacBooks
          | have unified RAM, you can use up to 75% of it for the LLM,
          | so a higher-RAM model would open more possibilities, but
          | those are much more expensive (of course).
          | 
          | As an alternative you might consider a Ryzen Pro 395+ like
          | in the Framework desktop or HP ZBook G1a, but the 128GB
          | versions are still extremely expensive. The Asus Flow Z13
          | is a tablet with the Ryzen 395+, but it's hardly available
          | with 128GB.
        
       | schaefer wrote:
       | You could train an unbeatable tic-tac-toe ai on your laptop in
       | five minutes. It doesn't get any stronger than that.
       | 
       | --
       | 
       | I know, I know. I'm intentionally misinterpreting the OP's clear
       | intent (the stuff of comedy). And normally a small joke like this
       | wouldn't be worth the downvotes...
       | 
       | But, I think there's a deeper double meaning in this brave new
       | world of prompt engineering. Most chat isn't all that precise
       | without some level of assumed shared context:
       | 
        | These days the meaning of the phrase AI has changed from the
        | classical definition (all algorithms welcome), and now AI
        | usually means LLMs and their derivatives.
        
         | silverlake wrote:
         | I'm actually working on just this. What's the smallest training
         | data set required to learn tic-tac-toe? A 5yo doesn't need much
         | training to learn a new game, but a transformer needs millions
         | of samples.
        
           | rkomorn wrote:
           | > A 5yo doesn't need much training to learn a new game
           | 
           | A 5yo also has... 5 years of cumulative real world training.
           | I'm a bit of an AI naysayer but I'd say the comparison
           | doesn't seem quite accurate.
        
             | silverlake wrote:
             | It's a glib analogy, but the goal remains the same. Today's
             | training sets are immense. Is there an architecture that
             | can learn something with tiny training sets?
        
               | rkomorn wrote:
               | I'm certainly not challenging anything you're writing,
               | because I only have a very distant understanding of deep
               | learning, but I do find the question interesting.
               | 
               | Isn't there a bit of a defining line between something
               | like tic-tac-toe that has a finite (and pretty limited
               | for a computer) set of possible combinations where it
               | seems like you shouldn't need a training set that is
               | larger than said set of possible combinations, and
                | something more open-ended, where the size of your
                | training set mainly impacts accuracy?
        
               | dpoloncsak wrote:
                | Assuming you don't account for reflections,
                | rotations, and 'unreachable' game states where a
                | player wins and you continue to mark boxes.
                | 
                | It's just 3^9, right? 9 boxes, each either X, O, or
                | blank? We're only at 19,683 game states, and would
                | trim down from there if we account for the cases
                | above.
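                | 
                | A quick check of those numbers (a sketch
                | that only enforces turn order, ignoring
                | mid-game wins):
                | 
                |   from itertools import product
                | 
                |   boards = list(product('XO.', repeat=9))
                |   print(len(boards))  # 19683 == 3**9
                |   # X moves first, so X count must be
                |   # equal to O count or one greater
                |   ok = [b for b in boards
                |         if b.count('X') - b.count('O')
                |         in (0, 1)]
                |   print(len(ok))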
        
               | rkomorn wrote:
               | Exactly, but then we may as well say "don't solve this
               | with an LLM" which sort of kills the conversation
               | altogether and that's not my goal. :)
        
               | dpoloncsak wrote:
                | Oh, I'm sorry! I was just trying to give a quick
                | sense of how small that tic-tac-toe dataset actually
                | is, not suggesting against the idea!
        
               | rkomorn wrote:
               | Oh no worries at all. :)
        
               | onlyrealcuzzo wrote:
               | And hundreds of millions of years of evolutionary
               | intelligence.
        
               | rkomorn wrote:
               | Next step in AI: teaching an LLM to think like a
               | trilobite!
        
               | onlyrealcuzzo wrote:
                | A trilobite was obviously better at being a trilobite
                | than an LLM would be, if only for purely definitional
                | reasons.
        
               | rkomorn wrote:
               | Was the six million dollar man not a better man?
        
               | adrianwaj wrote:
               | Maybe ZephApp, when it's actually released. But would be
               | interesting to record day-to-day conversations (face-to-
               | face using voice recognition) to train a virtual
               | doppelganger of myself and use it to find uncommon
               | commonalities between myself and others.
               | 
               | What would someone do with a year's worth of recorded
               | conversations? Would the other parties be identified? How
               | would it be useful, if at all? How about analyzing the
               | sounds/waveform rather than words? (eg BioAcousticHealth
               | / vocal biomarkers)
               | 
               | Perhaps typing into a text-field is the problem right
               | now? Maybe have a HUD in a pair of glasses. Better than
               | getting a brain chip! Most recent or most repeated
               | conversations most important. Could lead to a reduction
               | in isolation within societies, in favor for "AI training
               | parties." Hidden questions in oneself answered by a robot
               | guru as bedtime story-telling but related to the real-
               | world and real-events.
               | 
               | Smart Glasses --> Smart Asses
               | 
               | Vibe Coding --> Tribe Loading
               | 
               | Everything Probable --> Mission Impossible
        
           | Daltonagray wrote:
           | This sounds super interesting. Will you be sharing your work
           | anywhere? :)
        
       | highfrequency wrote:
       | This is awesome - thanks for sharing. Appreciate the small-scale
       | but comprehensive studies testing out different architectures,
       | model sizes and datasets.
       | 
        | Would be curious to see a version of your model size
        | comparison chart, but letting the training continue until
        | perplexity plateaus / begins to overfit. For example: are
        | your larger models performing worse because they are
        | overfitting to a small dataset, or because you are comparing
        | model sizes at a fixed 5-minute computation time, so that the
        | large models just don't get to learn very much in that time?
       | 
       | (Also interesting would be learning curve comparisons between
       | architecture/param count)
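        | 
        | Something like this sketch, say (train_one_epoch and val_loss
        | are hypothetical hooks into the training script):
        | 
        |   import math
        | 
        |   def train_to_plateau(model, train_one_epoch,
        |                        val_loss, patience=3):
        |       best, stale = float('inf'), 0
        |       while stale < patience:
        |           train_one_epoch(model)
        |           # perplexity = exp(cross-entropy loss)
        |           ppl = math.exp(val_loss(model))
        |           if ppl < best:
        |               best, stale = ppl, 0
        |           else:
        |               stale += 1
        |       return best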
        
       | Aperocky wrote:
        | At what point is a simple Markov chain the same or better?
        
         | visarga wrote:
          | The output text turns into word salad every few words. You
          | can't scale n-gram counting enough to make it work.
        
           | sadiq wrote:
            | You might find https://arxiv.org/abs/2401.17377v3
            | interesting.
        
             | JPLeRouzic wrote:
             | Only if you have access to corporate-level hardware:
             | 
             | " _It took us 48 hours to build the suffix array for
             | RedPajama on a single node with 128 CPUs and 1TiB RAM_ "
        
               | protomikron wrote:
                | It's okayish. Considering 64GB to 128GB are available
                | to (nerd) high-end consumers, you're only off by a
                | factor of 5 (if we can squeeze out a little bit more
                | performance).
                | 
                | That is pretty astonishing in my opinion.
        
           | JPLeRouzic wrote:
            | Not exactly a few words in my experience; I would say
            | every 100 words, if you make your Markov chain more
            | sophisticated (n-gram = 3 at minimum, a good tokenizer
            | tailored to the training data, a large training set
            | (500 KB or more), intelligent fallback instead of random
            | choice, etc.).
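            | 
            | For reference, the core of such a bot is tiny; a minimal
            | sketch, without the tokenizer and fallback refinements
            | mentioned above:
            | 
            |   import random
            |   from collections import defaultdict
            | 
            |   def train(tokens, n=3):
            |       model = defaultdict(list)
            |       for i in range(len(tokens) - n):
            |           ctx = tuple(tokens[i:i + n])
            |           model[ctx].append(tokens[i + n])
            |       return model
            | 
            |   def generate(model, seed, length=100, n=3):
            |       out = list(seed)   # seed: n starting tokens
            |       while len(out) < length:
            |           nxt = model.get(tuple(out[-n:]))
            |           if not nxt:    # no fallback in this sketch
            |               break
            |           out.append(random.choice(nxt))
            |       return ' '.join(out)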
        
         | Nevermark wrote:
         | It is the other way around.
         | 
          | Neural-type models passed the point where Markov chains
          | made any sense long ago, by many orders of magnitude.
         | 
         | Markov models fail by being too opinionated about the style of
         | compute.
         | 
         | In contrast, a linear tensor + non-linear function has
         | incredible flexibility to transform the topology of
         | information. Given large enough tensors, two such layers, with
         | recurrence, can learn any mapping, static or dynamical. No
         | priors (other than massive compute) needed.
         | 
          | All other neural architectures, then, are simply sparser
          | arrangements that bring compute demands down, where the
          | sparseness is fitted to the type of problem.
          | 
          | Sparseness can take the form of deeper but narrower
          | information flows (thus "deep" learning), or of fewer
          | weights per weight application (i.e. shared weights, like
          | convolutions).
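          | 
          | In code, the unit being described is just this (a PyTorch
          | sketch; the widths are arbitrary):
          | 
          |   import torch.nn as nn
          | 
          |   # linear map + nonlinearity, twice; with enough
          |   # width this can approximate any static mapping
          |   net = nn.Sequential(
          |       nn.Linear(512, 4096),
          |       nn.Tanh(),
          |       nn.Linear(4096, 512),
          |   )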
        
         | yobbo wrote:
         | I can't find references to HMM-based large language models.
         | Small HMM language models generate gibberish very similar to
         | this.
         | 
          | An HMM consists of a state space, a state transition
          | matrix, and an output probability matrix. A token space of
          | 50k and a state space of something like 60k would have
          | seemed impossible 10-20 years ago. It has only recently
          | become viable.
          | 
          | Training using Baum-Welch on a big enough text dataset
          | would be interesting. It should be much faster than
          | back-propagation with a transformer model.
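          | 
          | Back-of-envelope for why those sizes were long out of
          | reach, assuming dense fp32 matrices:
          | 
          |   states, tokens = 60_000, 50_000
          |   gb = lambda n: n * 4 / 1e9
          |   print(gb(states * states))  # ~14.4 GB transitions
          |   print(gb(states * tokens))  # ~12.0 GB emissions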
        
       | pjmlp wrote:
       | Which laptop, though?
        
       | jebarker wrote:
       | Optimized small model training is not only important for
       | availability but also for the scientific study of LLMs. It's like
       | the use of simple organisms like yeast for biological studies -
       | we also need to study the simplest possible transformers that
       | exhibit behaviors of interest from the larger models if we hope
       | to ever understand LLMs and have more control over their
       | behavior.
        
         | biophysboy wrote:
         | It's a fun analogy because the data "environment" of the model
         | being trained matters a great deal
        
           | jebarker wrote:
           | Exactly. YOLO runs of frontier models with a single random
           | seed/data shuffle are pretty limited for trying to study the
           | "molecular biology". I actually like to think of LLM
           | understanding as being like biology in the 1850s. There's
           | lots of inspiration to be found in how biology has advanced
           | since then and the types of experiments we might run to
           | better understand LLMs.
        
             | biophysboy wrote:
              | It's something I keep thinking about when I see all
              | these deep dives by Anthropic on the "genetics" of
              | LLMs. I see the emergent properties of LLMs as
              | inseparable from their data environment. If the
              | organization/prevalence of text online were different,
              | I think Anthropic would see different "genetics". As
              | the amount of LLM-generated text grows, I think it will
              | become more clear that the "fundamental unit" is their
              | relationship.
        
         | willvarfar wrote:
         | (there are also lots of private company datasets like e.g. user
         | purchase history that can be used with small models to solve
         | real business problems. All the advances in 'large' language
         | models can be leveraged and applied to small problems if the
         | input sequences can be represented as a special custom
         | language.)
        
         | smeeth wrote:
          | I've been annoyed for a while that people don't use a
          | common parameter/compute budget for benchmarking papers.
         | 
         | That said, it does make it easier to claim progress...
        
           | pizza wrote:
           | https://github.com/KellerJordan/modded-nanogpt is pretty
           | great in that respect
        
         | ai-christianson wrote:
         | I'm interested in one that can run fast on a laptop, but
         | training can take a few days (maybe even longer) on the same
         | laptop.
        
         | arethuza wrote:
         | Thanks - that's one of the most interesting comments I've seen
         | about LLMs.
         | 
         | Makes me want to try training a model to sing "Daisy, Daisy..."
        
         | azath92 wrote:
         | Totally agree, one of the most interesting podcasts i have
         | listened to in a while was a couple of years ago on the Tiny
         | Stories paper and dataset (the author used that dataset) which
         | focuses on stories that only contain simple words and concepts
         | (like bedtime stories for a 3 year old), but which can be used
          | to train smaller models to produce coherent English, with
          | grammar, diversity, and reasoning.
         | 
         | The podcast itself with one of the authors was fantastic for
         | explaining and discussing the capabilities of LLMs more
         | broadly, using this small controlled research example.
         | 
          | As an aside: I don't know what the dataset is in the
          | biological analogy, maybe the agar plate. A super simple
          | and controlled environment in which to study simple
          | organisms.
         | 
         | For ref: - Podcast ep https://www.cognitiverevolution.ai/the-
         | tiny-model-revolution... - tinystories paper
         | https://arxiv.org/abs/2305.07759
        
           | momojo wrote:
           | I like the agar plate analogy. Of course, the yeast is the
           | star of the show, but so much work goes into prepping the
           | plate.
           | 
            | As someone in biotech, 90% of the complaints I hear over
            | lunch are not about bad _results_, but about bad mistakes
            | during the experiment. E.g., someone didn't cover their
            | mouth while pipetting and the plate is unusable now.
        
         | leopoldj wrote:
          | What the author is doing here is pre-training. This is
          | something that usually only model makers like Google and
          | Meta need to do. Most businesses are much better off doing
          | fine-tuning or, to a lesser extent, continued pre-training.
          | The author is doing this for academic reasons.
        
         | tmule wrote:
          | Unfortunately, as things stand, it's well known that
          | behaviors and optimizations in small-scale models fail to
          | replicate in larger models.
        
           | jebarker wrote:
           | Well-known but not well-understood
        
           | victorbjorklund wrote:
           | Which in itself is very interesting and requires study.
        
             | anvuong wrote:
              | It mostly has to do with sparsity in high-dimensional
              | space. When you scale things to the extreme, everything
              | is very far away from everything else, the space is
              | sparse, random vectors have a very high chance of being
              | orthogonal, etc. All of this makes optimization
              | incredibly slow and difficult. Just another facet of
              | the so-called "curse of dimensionality".
        
           | indoordin0saur wrote:
           | But why? If we don't know why then how do we figure it out?
        
           | yorwba wrote:
           | Doing hyperparameter sweeps on lots of small models to find
           | the optimal values for each size and fitting scaling laws to
           | predict the hyperparameters to use for larger models seems to
           | work reasonably well. I think
           | https://arxiv.org/abs/2505.01618 is the latest advance in
           | that vein.
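            | 
            | The fitting step looks roughly like this (a sketch with
            | made-up illustrative numbers, not real measurements):
            | 
            |   import numpy as np
            |   from scipy.optimize import curve_fit
            | 
            |   def law(N, a, b, c):
            |       # loss ~ a * N**-b + c
            |       return a * N ** (-b) + c
            | 
            |   N = np.array([1e6, 3e6, 1e7, 3e7])  # sizes swept
            |   L = np.array([4.2, 3.8, 3.5, 3.3])  # illustrative
            |   p, _ = curve_fit(law, N, L, p0=[30.0, 0.1, 2.0])
            |   print(law(1e9, *p))  # extrapolate to 1B params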
        
             | swyx wrote:
              | The problem is that the eval processes don't really
              | work here if you believe in "Emergent Abilities":
              | https://arxiv.org/abs/2206.07682
        
               | exasperaited wrote:
               | Which we probably should not, at least not the "sudden"
               | emergence that those researchers claimed to see.
               | 
               | https://arxiv.org/abs/2304.15004
               | 
               | Good article about why here; this helped me understand a
               | lot:
               | 
               | https://www.wired.com/story/how-quickly-do-large-
               | language-mo...
        
           | jph00 wrote:
            | That's not widely true. E.g., the GPT-4 tech report
            | pointed out that nearly all their experiments were done
            | on models 1000x smaller than the final model.
        
         | moojacob wrote:
         | Enough with big data! Who's working on small data?
         | https://www.youtube.com/watch?v=eDr6_cMtfdA&pp=ygUKc21hbGwgZ...
        
       | aniijbod wrote:
       | Let the AI efficiency olympics begin!
       | 
       | On a laptop, on a desktop, on a phone?
       | 
       | Train for 5 minutes, an hour, a day, a week?
       | 
       | On a boat? With a goat?
        
         | visarga wrote:
         | goats have too many parameters, they are like GPT-4
        
           | hinkley wrote:
           | GO4-T
        
         | rPlayer6554 wrote:
         | I'd pay for GoatLM
        
         | Nevermark wrote:
         | On a maxxxed out Mac Studio M3 Ultra 512GB.
         | 
         | That boat will float your goat!
        
         | lifestyleguru wrote:
          | Honestly, AI is a trick to make us buy new expensive
          | computers. I'm writing this from an over-10-year-old one,
          | and the computers offered in a leaflet from the nearby
          | electronics store aren't much better.
        
           | voidUpdate wrote:
            | I mean, gaming is the big pusher of new hardware these
            | days, and the web is basically the reason you can't use a
            | 90s computer in the modern day. I happily survived on
            | roughly 10-year-old components all the way through
            | university because I wasn't playing AAA games.
        
             | throwawaylaptop wrote:
              | My parents bought a new laptop for their general
              | household use and to watch YouTube via HDMI on their
              | TV. It was so annoying, weird, and not even fast that
              | they returned it to Costco for the $800 within 90 days.
              | 
              | I set up a 10-year-old computer for them instead,
              | running Linux Mint MATE, and it's perfect.
        
           | 542354234235 wrote:
           | Anyone who remembers the 90s and 2000s, where your computer
           | hardware was out of date within months, might disagree. If
           | you want to do bleeding edge things like running 70b+ LLMs
           | locally or doing training, you need bleeding edge hardware.
           | No different than if you want to play the newest AAA games.
           | There are plenty of games you can play with old hardware, and
           | plenty of small LLMs. When you can use ChatGPT or a bunch of
           | other services, it isn't a trick that some people want to
           | host their own or do training, but you need a system that can
           | do that.
        
           | aniijbod wrote:
           | Oh no! I thought that was Windows 11
        
         | yojo wrote:
         | > With a goat?
         | 
         | I think you meant Llama.
         | 
         | The rhymes are admittedly more limited, unless you have a
         | Boston accent.
        
           | jdjdndndn wrote:
            | I do not like green eggs and ham. I do not like them,
            | Sam-I-Am.
           | 
           | Dr Seuss ftw
        
         | hinkley wrote:
         | Vernor Vinge has a story line where humans build their own
         | portable chess computers and utilize them as assistants in
         | human chess matches.
         | 
         | I still think this would be kinda cool. I could see a
         | tournament providing the power source in addition to the chess
         | clock. Then gamesmanship where you play moves you hope are
         | expensive for the opponent but not for your own AI.
        
       | yunusabd wrote:
       | Now imagine what you could do in 6 minutes!
       | 
       | But honestly I really like the short turnaround times. Makes it
       | easy to experiment with different parameters and develop an
       | intuition for what they do.
        
       | pilooch wrote:
        | I'd be interested in what implementation of D3PM was used
        | (and failed). Diffusion models are more data-efficient than
        | their AR LLM counterparts but less compute-efficient at
        | training time, so it'd be interesting to know whether, with
        | more time to converge, the diffusion approach does succeed. I
        | guess I'll try :)
        
       | yalogin wrote:
        | The bigger question, or maybe even realization, is that with
        | this architecture there is no way to build a capable model
        | that runs on a laptop or phone, which means there will never
        | be local compute and servers become ever more important. In
        | general, thinking about how ML itself works, reducing model
        | size while retaining capability will just never happen.
        
         | simonw wrote:
         | This post is about training, not inference.
         | 
         | The lesson here is that you can't use a laptop to train a
         | useful model - at least not without running that training for
         | probably decades.
         | 
          | That doesn't mean you can't _run_ a useful model on a
          | laptop that was trained on larger hardware. I do that all
          | the time - local models got _really_ good this year.
         | 
         | > reducing model size while retaining capability will just
         | never happen.
         | 
         | Tell that to Qwen3-4B! Those models are remarkably capable.
        
           | grim_io wrote:
           | It's always a question of "compared to what?"
           | 
            | Local models are nowhere near as capable as the frontier
            | big models.
            | 
            | While a small model might be fine for your use case, it
            | cannot replace Sonnet 4 for me.
        
             | simonw wrote:
             | Sure, Qwen-3-4B - a 4GB download - is nowhere near as
             | capable as Claude Sonnet 4.
             | 
             | But it is _massively_ more capable than the 4GB models we
             | had last year.
             | 
              | Meanwhile, recent models that are within the same
              | ballpark of capabilities as Claude Sonnet 4 - like GLM
              | 4.5, Kimi K2, and the largest of the Qwen 3 models -
              | can just about fit on a $10,000 512GB Mac Studio.
              | That's a very notable trend.
        
               | grim_io wrote:
                | It doesn't feel like the gap is closing at all.
                | 
                | The local models could get 10x as good next year; it
                | won't matter to me if the frontier models are still
                | better.
                | 
                | And even though we can run those models (heavily
                | quantized, and thus less capable), they are unusably
                | slow on that $10k dead-weight hardware.
        
               | badsectoracula wrote:
               | El Capitan being much faster than my desktop doesn't mean
               | that my desktop is useless. Same with LLMs.
               | 
               | I've been using Mistral Small 3.x for a bunch of tasks on
               | my own PC and it has been very useful, especially after i
               | wrote a few custom tools with llama.cpp to make it more
               | "scriptable".
        
               | jdjdndndn wrote:
               | I would be interested in hearing about those custom tools
        
         | sdenton4 wrote:
          | It depends, actually... The data and training time
          | requirements seem to increase exponentially for linear
          | gains in performance.
         | As a result, you can often trade a 10x reduction in training
         | time to get a model with 90+% of the real deal. And as we
         | accumulate more architecture and efficiency tricks, the ceiling
         | in what you can do locally goes up commensurately.
         | 
         | There's also a whole world of data curation to improve
         | training, which is likely to be great for small models and
         | seems still underexplored.
        
       | faangguyindia wrote:
        | The best LLMs on the planet right now are Gemini Pro 2.5 and
        | Gemini Flash 2.5; nothing comes close to them.
        | 
        | Once you set up a good system prompt on these, nothing really
        | compares.
        | 
        | Most of the models you see with high benchmarks are not even
        | comparable on real tasks.
        | 
        | Qwen3 and DeepSeek R1 aren't even 1/10 as good as Gemini Pro
        | 2.5.
        
         | howmayiannoyyou wrote:
          | Then they are not the best. Most users aren't prompt
          | engineers and grew up expecting to enter search terms into
          | Google and get a result. If it's the case that OpenAI or
          | Anthropic are best able to interpret user intent, there's a
          | good argument to be made that they are the best.
        
           | faangguyindia wrote:
            | This is something people do not understand.
            | 
            | If the model trusts the user, and the user is dumb, the
            | model will "weigh" the user's input much higher and end
            | up with flawed code.
            | 
            | If the model is more independent, it will find the right
            | solution. If you just want a dumb model that says yes to
            | everything and follows you even when you are not smart
            | enough, you'll never end up with a good solution, except
            | by luck.
        
         | dvrj101 wrote:
          | > not even comparable on real tasks
          | 
          | Care to elaborate on how Gemini completed a task
          | successfully where other models fumbled?
        
           | faangguyindia wrote:
            | I am using AI to write full projects with complete code
            | generation, and I haven't found any model that comes
            | close to Gemini Pro 2.5 in code-generation reasoning.
            | 
            | While other models like Qwen3 and GLM promise big, in
            | real code writing they fail badly and get stuck in loops.
            | 
            | The only problem I run into with Gemini right now is that
            | I get throttled every now and then with empty responses,
            | especially around this time.
        
       | hnfong wrote:
       | Here's an Obfuscated C Contest entry that trains a toy model
       | using LSTM:
       | 
       | https://www.ioccc.org/2019/mills/index.html
       | 
       | I suppose if you only have 5 minutes this is probably about the
       | level you'd get.
        
       | fswd wrote:
       | Right now, Qwen3 4B
        
       | chasd00 wrote:
       | AI is a broad term, the zero-to-hero series by Karpathy trains
       | one in a Jupyter notebook. You can make some pretty powerful
        | networks to de-duplicate database rows right on your laptop
        | too.
       | Data de-duplication and general MDM is pretty useful in large
       | businesses.
        
       | fontsgenerator wrote:
       | Probably something like a small logistic regression or a tiny
       | GPT-2 variant (117M parameters) on a small dataset--anything
       | beyond that will choke on RAM, VRAM, or time. Five minutes on a
       | laptop = toy models, not miracles.
        
       | initramfs wrote:
       | I looked up the most expensive laptop with an RTX 5090:
       | https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
       | 
       | $5599.00 https://marketplace.nvidia.com/en-us/consumer/gaming-
       | laptops...
       | 
        | Although you can get one with lower specs and the same GPU
        | for $3,899.99
       | 
       | https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
        
         | nehal3m wrote:
         | The same SKU on a GPU can perform differently depending on how
         | the manufacturer powers and cools it [0], and nVidia's naming
         | shenanigans don't help either [1].
         | 
         | [0] https://www.digitaltrends.com/computing/laptop-gpu-power-
         | lim... [1]
         | https://coconote.app/notes/4c75b7a0-eb41-435d-85ee-55ae2dd8d...
        
           | zipy124 wrote:
            | And even worse, it's surprisingly hard to find out from
            | spec sheets what power budget is assigned to the GPU, the
            | CPU, or the two combined.
        
       | bryanrasmussen wrote:
       | I like this scenario for a future James Bond movie. Bond has to
       | have an AI in chat pretend to be him to stall the bad guys while
       | he is sneaking around the back, but the state of the art Bond
       | persona bot that Q gave him in its own hardware enclosure has
       | been smashed up in the previous fight scene.
       | 
       | Bond has only minutes to train a strong enough AI model to
       | pretend to be him and fool his targets long enough for him to
       | gain entry to their impregnable fortress. Can he do it?!?
        
         | rsyring wrote:
         | But...they need to show him "training" it by smashing away at
         | the keys frantically. A touch of sweat rolling down his face
         | while a progress meter inches across the screen to suspenseful
         | music.
        
           | bryanrasmussen wrote:
           | no that is a cliche from lesser brands, Bond will get drunk
           | while it trains and shoot somebody with amazing accuracy.
        
           | hinkley wrote:
           | We're gonna need a montage.
        
       | panarchy wrote:
       | This would be more interesting if it wasn't about (L)LMs
        
       | jasonjmcghee wrote:
       | The idea of tracking and optimizing this reminds me of similar
       | efforts a few years ago especially for image models via
       | DAWNBench.
       | 
       | https://dawnd9.sites.stanford.edu/dawnbench
        
       | simianwords wrote:
        | An idea worth exploring: if specialized models can be trained
        | quickly on particular datasets, they could be used as tools
        | by bigger models.
        
       | Razengan wrote:
       | I'd be happy with an AI that can just "train" on me: Just see
       | what I do, learn from the repetitive tasks I do, and then do them
       | quicker. An agent that is basically me x 10.
       | 
       | Start blank with no corporate-controlled/crippled state and just
       | become me.
       | 
       | In fact, that might be the only way to let computers _appear_ to
       | grow faster into the future, even if their internal hardware only
       | gets minor incremental improvements: Have your shit done before
       | you sit down to do it.
        
       | jl6 wrote:
       | Feels like there should be value in building smaller, more
       | specialized models - maybe even doing so on-demand. I don't
       | always want a model that knows Polish and astrophysics and
       | Shakespeare, I want one that runs really fast and is laser-
       | focused on the domain that I'm working on.
       | 
       | I want to be able to say to a large general purpose LLM: "write a
       | script that trains a model that is optimized for <useful task>"
       | and then run _that_ model.
       | 
       | Edit: well gosh darn. Within the edit window for this comment,
       | Google goes and launches Gemma 3 270M.
        
         | erkiserk wrote:
          | One of the trends of machine learning, though, is that
          | generalists outperform specialists on those specialists'
          | tasks!
        
           | jl6 wrote:
           | But I'd happily accept some of that bitter lesson if the
           | "worse specialist" ran way faster (or at all, given memory
           | limits).
        
       | indoordin0saur wrote:
       | What about overnight on a desktop with a higher-end Nvidia gaming
       | GPU? Asking for a friend.
        
       | erikqu wrote:
        | I would've liked to see some xLSTMs.
        
       | Animats wrote:
       | _" Paris, France is a city in North Carolina. It is the capital
       | of North Carolina."_
       | 
       | If only we had a technology that didn't hallucinate and reported
       | "I don't know". Then small models would be far more useful. Part
       | of the need for insanely huge LLM models is to get coverage so
       | broad that they don't have to make up stuff.
       | 
       | It would be nice to be able to train a customer service bot on a
       | laptop in a reasonable length of time. But it will screw up badly
       | outside its area of competence, which will happen frequently.
        
         | Closi wrote:
         | I don't think we should use an AI trained in 5 minutes on a
         | laptop to infer what small models are capable of...
         | 
          | Sure, they still have massive problems with hallucination,
          | but I don't think this article gives us any more insight
          | into that!
        
           | gambiting wrote:
           | Why not? And I'm not being flippant, but like....isn't that
           | the whole point of small models?
        
             | kevinventullo wrote:
             | As I understand it, the most effective small models are
             | synthesized from larger models.
        
             | remexre wrote:
             | For one thing, the model is trained on a language modelling
             | task, not a question-answering task?
        
       | jarmitage wrote:
       | AI is sorely lacking a demoscene
        
       | andrewstuart wrote:
       | Would have been useful to see exact steps taken to replicate the
       | result.
        
       | iamgopal wrote:
        | If AI models were trained to connect to data (SQL) and use
        | that data source to answer some of the questions, instead of
        | just being trained on the data, it could reduce model size a
        | lot.
        
         | CharlesW wrote:
         | That's what tools are for. (see MCP:
         | https://modelcontextprotocol.io/docs/getting-started/intro)
        
           | scubbo wrote:
            | Would RAG also be an approach here? My intuition from
            | some small investigation is that RAG is more formal and
            | structured to set up, but more efficient, whereas with
            | MCP you can just point an LLM at an MCP server and tell
            | it to figure shit out (and also MCP can be used to _do_
            | stuff, not just to acquire more information).
        
             | CharlesW wrote:
             | > _Would RAG also be an approach here?_
             | 
             | For sure! If the RAG context includes "Raleigh is the
             | capital city of the U.S. state of North Carolina" somewhere
             | in whatever you feed it, one would hope that you'd get an
             | accurate answer to that question.
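              | 
              | A toy version of that flow, as a sketch; embed and llm
              | here are hypothetical callables:
              | 
              |   import numpy as np
              | 
              |   def retrieve(q, docs, embed, k=3):
              |       qv = embed(q)  # query vector
              |       ranked = sorted(
              |           docs,
              |           key=lambda d: -np.dot(embed(d), qv))
              |       return ranked[:k]
              | 
              |   def answer(llm, q, docs, embed):
              |       ctx = '\n'.join(retrieve(q, docs, embed))
              |       return llm(f"Context:\n{ctx}\n\nQ: {q}")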
        
               | scubbo wrote:
               | Thank you!
        
       | charcircuit wrote:
        | A useful trick would be to start from an existing model
        | instead of trying to train one from a random starting place.
        
       | lsb wrote:
       | This is evocative of "cramming", a paper from a few years ago,
       | where the author tried to find the best model they could train
       | for a day on a modern laptop: https://arxiv.org/abs/2212.14034
        
       | quux wrote:
       | Depends on how much weight you can support on your lap
        
       | profsummergig wrote:
       | Readers: I'm looking for toy, quick AI exercises that can be
       | trained on a laptop, and help the doer increase their confidence
       | in AI concepts (learning by doing, and all that).
       | 
       | The OP fits the bill.
       | 
       | If you can suggest other such exercises, please share in reply to
       | this post.
       | 
       | Thank you.
        
       | dileeparanawake wrote:
       | Siri.
        
       | raindear wrote:
        | How far can you go by improving the curriculum? Start simple.
        | Find a shorter and shorter sequence of examples that gives
        | you the best result. What is the shortest sequence to get to
        | some perplexity? Why?
        
       | remexre wrote:
       | Am I missing where the GitHub link is for this, or did the author
       | not release sources? It'd be fun to reproduce this on a different
       | machine, and play around with other architectures and optimizers
       | that weren't mentioned in the article...
        
       | trhway wrote:
        | There was https://sortbenchmark.org, and now we need
        | something similar for AI - best per joule, per cent, per
        | minute.
        
       ___________________________________________________________________
       (page generated 2025-08-14 23:00 UTC)