[HN Gopher] What's the strongest AI model you can train on a lap...
___________________________________________________________________
What's the strongest AI model you can train on a laptop in five
minutes?
Author : ingve
Score : 480 points
Date : 2025-08-12 13:15 UTC (2 days ago)
(HTM) web link (www.seangoedecke.com)
(TXT) w3m dump (www.seangoedecke.com)
| bbarnett wrote:
| Perhaps grimlock level:
|
| https://m.youtube.com/shorts/4qN17uCN2Pg
| treetalker wrote:
| "Hadn't thought of that ..."
|
| "You're absolutely right!"
| lamuswawir wrote:
| Thanks.
| zarzavat wrote:
| Instead of time it should be energy: what is the best model you
| can train with a given budget in joules? Then the MBP and the
| H100 are on a more even footing.
| NooneAtAll3 wrote:
| it's not about efficiency - it's about availability
|
| An H100 is not an everyday product. A laptop is.
| KeplerBoy wrote:
| Still, I don't think the M4 is going to be far off from the
| H100 in terms of energy efficiency.
|
| edit: fixed typo
| menaerus wrote:
| What efficiency did you have in mind? Bandwidth-wise M4 is
| ~10x to ~30x lower.
| KeplerBoy wrote:
| Ah, I mistyped. I meant energy efficiency, not memory
| efficiency.
| Der_Einzige wrote:
| At this point, given how many H100s there are in existence,
| it's basically an everyday product.
| logicchains wrote:
| I envy you if $25k is an everyday product cost.
| jeroenhd wrote:
| For what it's worth, most of the world can't afford an M4
| Macbook either.
| wongarsu wrote:
| And renting an H100 for an hour is a lot easier than
| renting an M4 MacBook for an hour.
| falcor84 wrote:
| Maybe not to buy one, but to rent one. Like how barista-
| made coffee is an everyday product even though most
| people can't afford a fancy professional coffee machine.
| bee_rider wrote:
| Reasonably high quality coffee machines are very
| widespread. Or you can do pour-over. I don't think the
| cost of a machine is a limiting factor for many people,
| it is just convenience.
|
| Maybe an analogy could be made to espresso, nice espresso
| machines get costlier. But, you can still get quite good
| results out of a manual machine like a Flair.
|
| I think this is why the suggestion to rent a machine is
| not too helpful. In this analogy we're on BaristaNews, we
| all know about the industrial machines, lots of folks use
| them at work. But, the topic of what sort of things you
| can do on your manual machine at home has come up.
| inetknght wrote:
| > _Reasonably high quality coffee machines are very
| widespread. Or you can do pour-over. I don't think the
| cost of a machine is a limiting factor for many people_
|
| No, reasonably-priced coffee machines are an enabling
| factor for many people.
|
| If coffee machines weren't reasonably priced, they would
| not be "very widespread".
| bee_rider wrote:
| I'm not sure I follow your deeper meaning here, sorry.
| Sharlin wrote:
| H100s are almost-instantly available to anyone with a credit
| card and access to the internet. Without even having to lift
| their butt from the seat. And you get plenty more than five
| minutes of compute for the price of an M4.
| jsperson wrote:
| For the orgs where I've worked, the important thing isn't
| availability of compute, it's security. Using what we have
| on our local network is much easier from a governance and
| approval standpoint than whatever is available on the
| internet.
| Sharlin wrote:
| Many orgs have no problems using cloud envs for most
| things. The usual suspects offer just as secure compute
| envs as everything else.
|
| Anyway, I was assuming personal use, like the messing-
| around experimenting that the article is about. (Or who
| knows, maybe it was part of the author's job.)
| potatolicious wrote:
| And yet just about any intro-to-programming tutorial gets
| something running on your local machine, and local machine
| development continues to be the default for most people,
| even though devving on a cloud machine is eminently
| reasonable.
|
| "Pull out credit card, sign up for some thing and pay a bit
| of money" is a non-trivial bit of friction! Extremely non-
| trivial!
|
| Especially in a corporate context - you have to get the
| expense approved. It's not clear if you can put company
| data onto the machine. Whereas generally running local
| things on corporate laptops is far less controversial.
|
| "Download this tool and run it." is still an extremely
| powerful pitch. Pretty much the only thing that beats it is
| "go to this website which you can use without any signup or
| payment".
| Sharlin wrote:
| Sure, if you already have said local machine. Which I
| guess in HN's context many/most do.
| ekianjo wrote:
| no org will let you send their data to a random online
| h100...
| Sharlin wrote:
| Many orgs happily use Google's everything. And Google
| offers secure compute envs just like it offers secure
| cloud everything.
|
| Anyway, I thought the context was doing stuff for
| personal use/fun, not work.
| sethhochberg wrote:
| Frankly I think a lot of full-time-employed technical
| people are largely experimenting for fun in the context
| of things that might eventually be useful to their
| employer. AI is cool and fascinating stuff and when I
| have a few idle minutes at the end of my workweek I love
| catching up and experimenting with the latest and
| greatest, but with an eye towards company problems and on
| company time, and sometimes using company datasets. That
| means company vendor approval and financing of my
| efforts.
|
| In my personal life, when it's time for fun, I close the
| laptop and go do some gardening.
| dekhn wrote:
| While I love cloud computing, you're comparing the cost of
| renting a GPU for a fixed amount of time to the purchase of
| an asset which can be used for years. Not a useful
| comparison IMHO.
| sudoshred wrote:
| Disagree, equity of access matters a lot. Not everyone
| benefits from exposure to the entire hardware lifecycle,
| the same way that buying housing is not the best
| financial decision for everyone regardless of
| affordability. I might have unlimited budget but if I
| only need access to state of the art hardware
| intermittently or under irregular circumstances the cost
| of renting may be efficient for my needs. Also consider
| the costs of supporting hardware that is fully owned, if
| you own the hardware but underutilize it that is
| inefficiency and the owner bears that cost. The unusual
| way that silicon depreciates means that the value of your
| "asset" is not static: it falls rapidly as silicon
| manufacturing improves.
| dekhn wrote:
| Your argument is not related to my statement. You're
| arguing something else.
| victorbjorklund wrote:
| I already have an M4 so the cost of running it is tiny.
| 0x457 wrote:
| Yeah, you'd need a large server rack to run those H100s.
| But realistically, the majority of people have a PC with a
| consumer-grade GPU, or more likely a laptop with...laptop-
| grade GPU.
|
| Cloud H100s don't count because you need a lawyer to review
| the ToS and other agreements.
| nickpsecurity wrote:
| Also, my laptop running Linux and its outputs are probably
| mine and private. If I use cloud GPU's, I need to be a lawyer
| to be sure what they can or can't do with my data or models.
|
| There's also no overages or hidden charges with a laptop.
| Past simply breaking it. You know the replacement cost ahead
| of time, though.
| giancarlostoro wrote:
| The Mac is more competitive on power consumption though,
| since it never pulls as much as an Nvidia GPU, is my
| understanding.
|
| On that note, you can rent an H100 for an hour for under $10,
| which might make for a slightly more interesting test: what's
| the best model outcome you can train in under an hour?
| dtnewman wrote:
| > you can rent an H100 for an hour for under $10
|
| Far cheaper these days. More like $2-3 for a consumer to do
| this. For bulk deals, pricing is often < $2.
| giancarlostoro wrote:
| I couldn't remember offhand the exact amount but figured
| noting that under $10 is still impressive for one high end
| GPU for an entire hour.
| bigyabai wrote:
| It depends. If you're bottlenecked by memory speed, the Mac
| typically comes out on top.
|
| In terms of compute efficiency though, Nvidia still has Apple
| beat. Nvidia wouldn't have the datacenter market on a leash
| if Apple was putting up a real fight.
| giancarlostoro wrote:
| Yeah, this is correct. My 3080 will render quicker than my
| M4 but my M4 will outcompete on being able to load larger
| models.
| netcan wrote:
| They're all good. Being somewhat arbitrary isn't a bad thing.
| jvanderbot wrote:
| Bro, why not both?
|
| We can / should benchmark and optimize this to death on all
| axes
| motorest wrote:
| > Instead of time it should be energy (...) Then the MBP and
| H100 are on a more even footing.
|
| What exactly is your point? That instead of expressing
| workloads in terms of what a laptop could do, you prefer to
| express them in terms of what a MacBook Pro could do?
| zarzavat wrote:
| The point is that "best model you can train in 5 minutes" is
| hardware dependent, the answer will be different depending on
| the hardware available. So it's necessarily a single-player
| game.
|
| "Best model you can train with X joules" is a fairer contest
| that multiple people could take part in even if they have
| different hardware available. It's not completely fair, but
| it's fair enough to be interesting.
|
| Training models with an energy limit is an interesting
| constraint that might lead to advances. Currently LLMs
| implement online learning by having increasingly large
| contexts that we then jam "memories" into. So there is a
| strict demarcation between information learned during pre-
| training and during use. New more efficient approaches to
| training could perhaps inform new approaches to memory that
| are less heterogeneous.
|
| tl;dr: more dimensionally correct
| hodgehog11 wrote:
| I love seeing explorations like this, which highlight that easily
| accessible hardware can do better than most people think with
| modern architectures. For many novel scientific tasks, you really
| don't need an H100 to make progress using deep learning over
| classical methods.
| tootyskooty wrote:
| I suspect one can go a lot further by adopting some tweaks from
| the GPT-2 speedrun effort [0]: at minimum Muon, better init,
| and careful learning-rate tuning.
|
| [0]: https://github.com/KellerJordan/modded-nanogpt
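|
| For reference, the heart of Muon is just orthogonalizing the
| momentum matrix with a few Newton-Schulz iterations. A rough,
| untested sketch (coefficients as used in the speedrun repo;
| treat it as illustrative, not the exact implementation):
|
|   import torch
|
|   def newton_schulz(G, steps=5, eps=1e-7):
|       # Approximately orthogonalize G, i.e. push all of its
|       # singular values toward 1.
|       a, b, c = 3.4445, -4.7750, 2.0315
|       X = G / (G.norm() + eps)   # scale singular values <= 1
|       transpose = G.size(0) > G.size(1)
|       if transpose:
|           X = X.T
|       for _ in range(steps):
|           A = X @ X.T
|           X = a * X + (b * A + c * A @ A) @ X
|       return X.T if transpose else X
|
|   # Muon-style step for a 2D weight W with momentum buffer M:
|   #   M = beta * M + W.grad
|   #   W.data -= lr * newton_schulz(M)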
| nottorp wrote:
| But supposing you have a real specific need to train, is the
| training speed still relevant? Or do the resources spent on
| gathering and validating the data set dwarf the actual CPU/GPU
| usage?
| wongarsu wrote:
| If training is trivially fast that allows you to iterate on
| architecture choices, hyperparameters, choices which data to
| include, etc
|
| Of course that only works if the trial runs are representative
| of what your full scale model will look like. But within those
| constraints optimising training time seems very valuable
| l5870uoo9y wrote:
| The most powerful Macbook Pro currently has 16 CPU cores, 40 GPU
| cores, and 128 GB of RAM (and a 16-core "neural engine"
| specifically designed to accelerate machine learning).
| Technically, it is a laptop, but it could just as well be a
| computer optimized for AI.
| alberth wrote:
| The Mac Studio has: 32 CPU cores, 80 GPU cores, 512GB RAM
|
| https://www.apple.com/shop/buy-mac/mac-studio/apple-m3-ultra...
| Joel_Mckay wrote:
| From https://opendata.blender.org/ :
|
| Apple M3 Ultra (GPU - 80 cores) scores 7235.31
|
| NVIDIA GeForce RTX 5090 Laptop GPU scores 7931.31
|
| Note the memory constraints of NVIDIA are not like Apple
| silicon which tends to also be less i/o constrained. YMMV
|
| https://www.youtube.com/watch?v=d8yS-2OyJhw
|
| https://www.youtube.com/watch?v=Ju0ndy2kwlw
|
| Apple m3/m4 silicon is certainly good in some ways, but the
| bottleneck is often a lack of CUDA software support and price
| (could buy >4 times the GPU raw performance on a dual rtx
| 5090 desktop.) =3
| pstuart wrote:
| Not just GPU performance -- the M3 Ultra has memory
| bandwidth of ~800GBps vs ~1,800GBps for the 5090.
|
| I would wager that Apple recognizes the value prop for the
| mac to be used for AI and will up their memory bandwidth to
| stay in the game.
| lukan wrote:
| That's a well-made page describing nice hardware, but it
| doesn't seem to be a laptop.
| MobiusHorizons wrote:
| I think the point is that laptops are more limited than
| other form factors. I'm reading it as a response to the
| comment that MacBooks are computers optimized for ai and
| only technically a laptop (which is a pretty ridiculous
| statement imo). Apple's architecture happens to be very good
| at a lot of compute heavy tasks, especially where total
| available GPU ram and low latency handoff between the CPU
| and the gpu are concerned. This happens to be very well
| suited to LLM workloads.
| LorenDB wrote:
| > Paris, France is a city in North Carolina. It is the capital of
| North Carolina, which is officially major people in Bhugh and
| Pennhy. The American Council Mastlandan, is the city of Retrea.
| There are different islands, and the city of Hawkeler: Law is the
| most famous city in The Confederate. The country is Guate.
|
| I love the phrase "officially major people"! I wonder how it
| could be put to use in everyday speech?
| wowczarek wrote:
| Not the point of the exercise obviously, but at five minutes'
| training I wonder how this would compare to a Markov chain bot.
| mhogers wrote:
| Any reason to upgrade an M2 16GB MacBook to an M4 ..GB (or
| 2026 M5) for local LLMs? Due an upgrade soon, and perhaps it
| is educational to run these things more easily locally?
| ionwake wrote:
| I did just that, got the 32GB RAM one so I could run Qwen.
|
| Might still be early days. I'm trying to use the model to sort
| my local notes, but I don't know, man, it seems only a little
| faster yet still unusable, and I downloaded the lighter Qwen
| model as recommended.
|
| Again, it's early days and maybe I'm being an idiot; I did
| manage to get it to parse one note after about 15 mins though.
| dpoloncsak wrote:
| Have a 16GB one, just set up ollama yesterday.
|
| gpt-oss-20b eats too much RAM to use for anything other than
| an overnight task. Maybe 3 tok/s.
|
| Been playing around with the 8b versions of qwen and
| deepseek. Seems usable so far. YMMV, i'm just messing around
| in chat at the moment, haven't really had it do any tasks for
| me
| sandreas wrote:
| For LLMs, VRAM is requirement number one. Since MacBooks have
| unified RAM you can use up to 75% of it for the LLM, so a
| higher-RAM model would open more possibilities, but these are
| much more expensive (of course).
|
| As an alternative you might consider a Ryzen Pro 395+ like in
| the Framework desktop or HP Zbook G1a but the 128GB versions
| are still extremely expensive. The Asus Flow Z13 is a tablet
| with the Ryzen 395+ but is hardly available with 128GB.
| schaefer wrote:
| You could train an unbeatable tic-tac-toe ai on your laptop in
| five minutes. It doesn't get any stronger than that.
|
| --
|
| I know, I know. I'm intentionally misinterpreting the OP's clear
| intent (the stuff of comedy). And normally a small joke like this
| wouldn't be worth the downvotes...
|
| But, I think there's a deeper double meaning in this brave new
| world of prompt engineering. Most chat isn't all that precise
| without some level of assumed shared context:
|
| These days the meaning of the phrase ai has changed from the
| classical definition (all algorithms welcome), and now ai usually
| means LLMs and their derivatives.
| silverlake wrote:
| I'm actually working on just this. What's the smallest training
| data set required to learn tic-tac-toe? A 5yo doesn't need much
| training to learn a new game, but a transformer needs millions
| of samples.
| rkomorn wrote:
| > A 5yo doesn't need much training to learn a new game
|
| A 5yo also has... 5 years of cumulative real world training.
| I'm a bit of an AI naysayer but I'd say the comparison
| doesn't seem quite accurate.
| silverlake wrote:
| It's a glib analogy, but the goal remains the same. Today's
| training sets are immense. Is there an architecture that
| can learn something with tiny training sets?
| rkomorn wrote:
| I'm certainly not challenging anything you're writing,
| because I only have a very distant understanding of deep
| learning, but I do find the question interesting.
|
| Isn't there a bit of a dividing line between something
| like tic-tac-toe, which has a finite (and, for a computer,
| pretty limited) set of possible combinations, where it
| seems like you shouldn't need a training set larger than
| said set of combinations, and something more open-ended,
| where the size of your training set mainly impacts
| accuracy?
| dpoloncsak wrote:
| Assuming you don't account for reflections, rotations,
| and 'unreachable' gamestates where a player wins and you
| continue to mark boxes.
|
| It's just 3^9, right? 9 boxes, either X,O, or blank?
| We're only at 19,683 game states and would trim down from
| here if we account for the cases above.
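|
| A quick brute-force check of that trimming, as a sketch: it
| enumerates all 3^9 boards and keeps only the positions
| reachable in a legal game (it should print 5478):
|
|   from itertools import product
|
|   LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),
|            (2,5,8),(0,4,8),(2,4,6)]
|
|   def wins(b, p):
|       return any(b[i] == b[j] == b[k] == p for i, j, k in LINES)
|
|   legal = 0
|   for b in product(' XO', repeat=9):
|       x, o = b.count('X'), b.count('O')
|       if x - o not in (0, 1):      # X always moves first
|           continue
|       wx, wo = wins(b, 'X'), wins(b, 'O')
|       if wx and wo:                # both sides can't have a line
|           continue
|       if wx and x != o + 1:        # a win ends the game at once
|           continue
|       if wo and x != o:
|           continue
|       legal += 1
|   print(legal)                     # 5478, before symmetry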
| rkomorn wrote:
| Exactly, but then we may as well say "don't solve this
| with an LLM" which sort of kills the conversation
| altogether and that's not my goal. :)
| dpoloncsak wrote:
| Oh, I'm sorry! I was just trying to give a quick
| perspective of how small that tic-tac-toe data-set
| actually is. Not suggest against the idea!
| rkomorn wrote:
| Oh no worries at all. :)
| onlyrealcuzzo wrote:
| And hundreds of millions of years of evolutionary
| intelligence.
| rkomorn wrote:
| Next step in AI: teaching an LLM to think like a
| trilobite!
| onlyrealcuzzo wrote:
| A trilobite was obviously better at being a trilobite
| than an LLM would be, if only for purely definitional
| reasons.
| rkomorn wrote:
| Was the six million dollar man not a better man?
| adrianwaj wrote:
| Maybe ZephApp, when it's actually released. But would be
| interesting to record day-to-day conversations (face-to-
| face using voice recognition) to train a virtual
| doppelganger of myself and use it to find uncommon
| commonalities between myself and others.
|
| What would someone do with a year's worth of recorded
| conversations? Would the other parties be identified? How
| would it be useful, if at all? How about analyzing the
| sounds/waveform rather than words? (eg BioAcousticHealth
| / vocal biomarkers)
|
| Perhaps typing into a text-field is the problem right
| now? Maybe have a HUD in a pair of glasses. Better than
| getting a brain chip! The most recent or most repeated
| conversations would be most important. Could lead to a
| reduction in isolation within societies, in favor of "AI
| training parties." Hidden questions in oneself answered by
| a robot guru as bedtime story-telling, but related to the
| real world and real events.
|
| Smart Glasses --> Smart Asses
|
| Vibe Coding --> Tribe Loading
|
| Everything Probable --> Mission Impossible
| Daltonagray wrote:
| This sounds super interesting. Will you be sharing your work
| anywhere? :)
| highfrequency wrote:
| This is awesome - thanks for sharing. Appreciate the small-scale
| but comprehensive studies testing out different architectures,
| model sizes and datasets.
|
| Would be curious to see a version of your model size comparison
| chart but letting the training continue until perplexity plateaus
| / begins to overfit. For example: are your larger models
| performing worse because they are overfitting to a small dataset,
| or because you are comparing model sizes at a fixed 5-minute
| computation time, so that the large models just don't get to
| learn very much in that time?
|
| (Also interesting would be learning curve comparisons between
| architecture/param count)
| Aperocky wrote:
| At which point is a simple Markov chain the same or better?
| visarga wrote:
| The output text turns to word salad every few words. You
| can't scale n-gram counting enough to make it work.
| sadiq wrote:
| You might find https://arxiv.org/abs/2401.17377v3
| interesting..
| JPLeRouzic wrote:
| Only if you have access to corporate-level hardware:
|
| " _It took us 48 hours to build the suffix array for
| RedPajama on a single node with 128 CPUs and 1TiB RAM_ "
| protomikron wrote:
| It's okayish. Considering 64G to 128G are available to
| (nerd) high-end consumers, you're just off by a factor of 5
| (if we can squeeze out a little bit more performance).
|
| That is pretty astonishing in my opinion.
| JPLeRouzic wrote:
| Not exactly every few words in my experience; I would say
| every 100 words, if you make your Markov chain more
| sophisticated (n-gram = 3 at minimum, a good tokenizer
| tailored to the training data, a large training set (500 KB
| or more), an intelligent fallback instead of random, etc.).
| Nevermark wrote:
| It is the other way around.
|
| Neural-type models have long passed the point where Markov
| chains made any sense by many orders of magnitude.
|
| Markov models fail by being too opinionated about the style of
| compute.
|
| In contrast, a linear tensor + non-linear function has
| incredible flexibility to transform the topology of
| information. Given large enough tensors, two such layers, with
| recurrence, can learn any mapping, static or dynamical. No
| priors (other than massive compute) needed.
|
| All other neural architectures then are simply sparser
| arrangements, that bring compute demands down. Where the
| sparseness is fit to the type of problem.
|
| Sparseness can mean deeper but narrower information flows
| (thus "deep" learning), or fewer weights per weight
| application (i.e. shared weights, like convolutions).
| yobbo wrote:
| I can't find references to HMM-based large language models.
| Small HMM language models generate gibberish very similar to
| this.
|
| An HMM consists of a state space, a state transition matrix,
| and an output probability matrix. A token space of 50k and a
| state space of something like 60k would have seemed impossible
| 10-20 years ago. It has only recently become viable.
|
| Training using Baum-Welch on a big enough text data set would
| be interesting. It should be much faster than back-propagation
| with a transformer model.
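|
| For scale: the forward pass (sequence likelihood) of an HMM
| language model is a few lines of NumPy. A toy, untested
| sketch (no rescaling, so it underflows on long sequences):
|
|   import numpy as np
|
|   def forward(obs, pi, A, B):
|       # pi: (S,) initial probs, A: (S,S) transitions,
|       # B: (S,V) emissions, obs: list of token ids.
|       alpha = pi * B[:, obs[0]]
|       for t in obs[1:]:
|           alpha = (alpha @ A) * B[:, t]
|       return alpha.sum()
|
|   S, V = 4, 6   # at S=60k, a dense float32 A alone is
|                 # 60_000**2 * 4 bytes ~ 14 GB
|   rng = np.random.default_rng(0)
|   pi = np.full(S, 1 / S)
|   A = rng.dirichlet(np.ones(S), size=S)
|   B = rng.dirichlet(np.ones(V), size=S)
|   print(forward([0, 3, 5, 1], pi, A, B))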
| pjmlp wrote:
| Which laptop, though?
| jebarker wrote:
| Optimized small model training is not only important for
| availability but also for the scientific study of LLMs. It's like
| the use of simple organisms like yeast for biological studies -
| we also need to study the simplest possible transformers that
| exhibit behaviors of interest from the larger models if we hope
| to ever understand LLMs and have more control over their
| behavior.
| biophysboy wrote:
| It's a fun analogy because the data "environment" of the model
| being trained matters a great deal
| jebarker wrote:
| Exactly. YOLO runs of frontier models with a single random
| seed/data shuffle are pretty limited for trying to study the
| "molecular biology". I actually like to think of LLM
| understanding as being like biology in the 1850s. There's
| lots of inspiration to be found in how biology has advanced
| since then and the types of experiments we might run to
| better understand LLMs.
| biophysboy wrote:
| It's something I keep thinking about when I see all these
| deep-dives by Anthropic on the "genetics" of LLMs. I see
| the emergent properties of LLMs as inseparable from their
| data environment. If the organization/prevalence of text
| online was different, I think Anthropic would see different
| "genetics". As the amt of LLM-generated text grows, I think
| it will become more clear that the "fundamental unit" is
| their relationship.
| willvarfar wrote:
| (there are also lots of private company datasets like e.g. user
| purchase history that can be used with small models to solve
| real business problems. All the advances in 'large' language
| models can be leveraged and applied to small problems if the
| input sequences can be represented as a special custom
| language.)
| smeeth wrote:
| I've been annoyed for a while that people don't use a common
| parameter-count/compute budget for benchmarking papers.
|
| That said, it does make it easier to claim progress...
| pizza wrote:
| https://github.com/KellerJordan/modded-nanogpt is pretty
| great in that respect
| ai-christianson wrote:
| I'm interested in one that can run fast on a laptop, but
| training can take a few days (maybe even longer) on the same
| laptop.
| arethuza wrote:
| Thanks - that's one of the most interesting comments I've seen
| about LLMs.
|
| Makes me want to try training a model to sing "Daisy, Daisy..."
| azath92 wrote:
| Totally agree: one of the most interesting podcasts I have
| listened to in a while was from a couple of years ago, on the
| TinyStories paper and dataset (the author used that dataset),
| which focuses on stories that only contain simple words and
| concepts (like bedtime stories for a 3 year old), but which
| can be used to train smaller models to produce coherent
| English, with grammar, diversity, and reasoning.
|
| The podcast itself with one of the authors was fantastic for
| explaining and discussing the capabilities of LLMs more
| broadly, using this small controlled research example.
|
| As an aside: I don't know what the dataset is in the
| biological analogy, maybe the agar plate: a super simple and
| controlled environment in which to study simple organisms.
|
| For ref: - Podcast ep https://www.cognitiverevolution.ai/the-
| tiny-model-revolution... - tinystories paper
| https://arxiv.org/abs/2305.07759
| momojo wrote:
| I like the agar plate analogy. Of course, the yeast is the
| star of the show, but so much work goes into prepping the
| plate.
|
| As someone in biotech, 90% of the complaints I hear over
| lunch are not about bad _results_, but about bad mistakes
| during the experiment. E.g. someone didn't cover their mouth
| while pipetting and the plate's unusable now.
| leopoldj wrote:
| What the author is doing here is pre-training. This is
| something usually model makers like Google and Meta need to do.
| Most businesses are much better off doing fine-tuning or, to a
| lesser extent, continued pre-training. The author is doing this
| for academic reasons.
| tmule wrote:
| Unfortunately, as things stand, it's well-known that behaviors
| and optimizations in small scale models fail to replicate in
| larger models.
| jebarker wrote:
| Well-known but not well-understood
| victorbjorklund wrote:
| Which in itself is very interesting and requires study.
| anvuong wrote:
| It mostly has to do with sparsity in high dimensional
| space. When you scale things to the extreme everything is
| very far away from each other, the space is sparse, and
| random vectors have a very high chance of being orthogonal,
| etc. All of this makes optimization incredibly slow and
| difficult. Just another facet of the so called "curse of
| dimensionality".
| indoordin0saur wrote:
| But why? If we don't know why then how do we figure it out?
| yorwba wrote:
| Doing hyperparameter sweeps on lots of small models to find
| the optimal values for each size and fitting scaling laws to
| predict the hyperparameters to use for larger models seems to
| work reasonably well. I think
| https://arxiv.org/abs/2505.01618 is the latest advance in
| that vein.
| swyx wrote:
| The problem is that the eval processes don't really work
| here if you believe in "Emergent Abilities"
| https://arxiv.org/abs/2206.07682
| exasperaited wrote:
| Which we probably should not, at least not the "sudden"
| emergence that those researchers claimed to see.
|
| https://arxiv.org/abs/2304.15004
|
| Good article about why here; this helped me understand a
| lot:
|
| https://www.wired.com/story/how-quickly-do-large-
| language-mo...
| jph00 wrote:
| That's not widely true. E.g. the GPT-4 tech report pointed out
| that nearly all their experiments were done on models 1000x
| smaller than the final model.
| moojacob wrote:
| Enough with big data! Who's working on small data?
| https://www.youtube.com/watch?v=eDr6_cMtfdA&pp=ygUKc21hbGwgZ...
| aniijbod wrote:
| Let the AI efficiency olympics begin!
|
| On a laptop, on a desktop, on a phone?
|
| Train for 5 minutes, an hour, a day, a week?
|
| On a boat? With a goat?
| visarga wrote:
| goats have too many parameters, they are like GPT-4
| hinkley wrote:
| GO4-T
| rPlayer6554 wrote:
| I'd pay for GoatLM
| Nevermark wrote:
| On a maxxxed out Mac Studio M3 Ultra 512GB.
|
| That boat will float your goat!
| lifestyleguru wrote:
| Honestly, AI is a trick to make us buy new expensive
| computers. I'm writing this from an over-10-year-old one, and
| the computers offered in a leaflet from the nearby electronics
| store aren't much better.
| voidUpdate wrote:
| I mean, gaming is the big pusher of new hardware these days,
| and web is basically the reason you can use a 90s computer in
| the modern day. I happily survived on roughly 10 year old
| components all the way through university because I wasn't
| playing AAA games
| throwawaylaptop wrote:
| My parents bought a new laptop for their general household
| use and to watch YouTube via HDMI on their TV. It was so
| annoying and weird and not even fast that they returned it
| to Costco for the $800 within 90 days.
|
| I set up a 10 year old computer for them instead running
| Linux Mint Mate and it's perfect.
| 542354234235 wrote:
| Anyone who remembers the 90s and 2000s, where your computer
| hardware was out of date within months, might disagree. If
| you want to do bleeding edge things like running 70b+ LLMs
| locally or doing training, you need bleeding edge hardware.
| No different than if you want to play the newest AAA games.
| There are plenty of games you can play with old hardware, and
| plenty of small LLMs. When you can use ChatGPT or a bunch of
| other services, it isn't a trick that some people want to
| host their own or do training, but you need a system that can
| do that.
| aniijbod wrote:
| Oh no! I thought that was Windows 11
| yojo wrote:
| > With a goat?
|
| I think you meant Llama.
|
| The rhymes are admittedly more limited, unless you have a
| Boston accent.
| jdjdndndn wrote:
| I do not like green eggs and ham. I do not like them, Sam I am.
|
| Dr Seuss ftw
| hinkley wrote:
| Vernor Vinge has a story line where humans build their own
| portable chess computers and utilize them as assistants in
| human chess matches.
|
| I still think this would be kinda cool. I could see a
| tournament providing the power source in addition to the chess
| clock. Then gamesmanship where you play moves you hope are
| expensive for the opponent but not for your own AI.
| yunusabd wrote:
| Now imagine what you could do in 6 minutes!
|
| But honestly I really like the short turnaround times. Makes it
| easy to experiment with different parameters and develop an
| intuition for what they do.
| pilooch wrote:
| I'd be interested in what implementation of D3PM was used (and
| failed). Diffusion models are more data-efficient than their AR
| LLM counterparts but less compute-efficient at training time, so
| it'd be interesting to know whether, with more time to converge,
| the diffusion approach does succeed. I guess I'll try :)
| yalogin wrote:
| The bigger question, or maybe even realization, is that with
| this architecture there is no way to build a capable model to
| run on a laptop or phone, which means there will never be local
| compute and servers become ever more important. In general,
| thinking about how ML itself works, reducing model size while
| retaining capability will just never happen.
| simonw wrote:
| This post is about training, not inference.
|
| The lesson here is that you can't use a laptop to train a
| useful model - at least not without running that training for
| probably decades.
|
| That doesn't mean you can't _run_ a useful model on a laptop
| that was trained in larger hardware. I do that all the time -
| local models hit _really_ good this year.
|
| > reducing model size while retaining capability will just
| never happen.
|
| Tell that to Qwen3-4B! Those models are remarkably capable.
| grim_io wrote:
| It's always a question of "compared to what?"
|
| Local models are nowhere near as capable as the frontier
| big models.
|
| While a small model might be fine for your use case, it can
| not replace Sonnet-4 for me.
| simonw wrote:
| Sure, Qwen-3-4B - a 4GB download - is nowhere near as
| capable as Claude Sonnet 4.
|
| But it is _massively_ more capable than the 4GB models we
| had last year.
|
| Meanwhile recent models that are within the same ballpark
| of capabilities as Claude Sonnet 4 - like GLM 4.5 and Kimi
| K2 and the largest of the Qwen 3 models - can just about
| fit on a $10,000 Mac Studio with 512GB of RAM. That's a very
| notable trend.
| grim_io wrote:
| It doesn't feel like the gap is closing at all.
|
| The local models can get 10x as good next year, it won't
| matter to me if the frontier models are still better.
|
| And even if we can run those models (heavily quantized,
| and thus less capable), they are unusably slow on that
| $10k dead-weight hardware.
| badsectoracula wrote:
| El Capitan being much faster than my desktop doesn't mean
| that my desktop is useless. Same with LLMs.
|
| I've been using Mistral Small 3.x for a bunch of tasks on
| my own PC and it has been very useful, especially after I
| wrote a few custom tools with llama.cpp to make it more
| "scriptable".
| jdjdndndn wrote:
| I would be interested in hearing about those custom tools
| sdenton4 wrote:
| It depends, actually... The data and training-time requirements
| seem to increase exponentially for linear gains in performance.
| As a result, you can often trade a 10x reduction in training
| time to get a model with 90+% of the real deal. And as we
| accumulate more architecture and efficiency tricks, the ceiling
| in what you can do locally goes up commensurately.
|
| There's also a whole world of data curation to improve
| training, which is likely to be great for small models and
| seems still underexplored.
| faangguyindia wrote:
| The best LLMs on the planet right now are Gemini Pro 2.5 and
| Gemini Flash 2.5; nothing comes close to these.
|
| Once you setup a good system prompt on these, nothing really
| compares.
|
| Most of the models you see with high benchmarks are not even
| comparable on real tasks.
|
| qwen3 or deepseek r1 aren't even 1/10 as good as Gemini
| Pro 2.5.
| howmayiannoyyou wrote:
| Then they are not the best. Most users aren't prompt engineers
| and grew up expecting to enter search terms into Google and get
| a result. If it's the case that OpenAI or Anthropic are best
| able to interpret user intent, there's a good argument to be
| made that they are the best.
| faangguyindia wrote:
| This is something people do not understand.
|
| If the model trusts the user, and the user is dumb, the model
| will "weigh" the user's input much higher and end up with
| flawed code.
|
| If the model is more independent, it will find the right
| solution. If you just want a dumb model which says yes to
| everything and follows you when you are not smart enough,
| then you'll never end up with a good solution except by luck.
| dvrj101 wrote:
| > not even comparable on real tasks
|
| Care to elaborate on how Gemini completed this task
| successfully and how other models fumbled?
| faangguyindia wrote:
| I am using AI to write full projects with complete code
| generation, and I haven't found any model which comes close
| to Gemini Pro 2.5 in code reasoning and generation.
|
| While other models like qwen3 and glm promise big, in real
| code writing they fail badly and get stuck in loops.
|
| The only problem I run into with Gemini right now is that I
| get throttled every now and then with an empty response,
| especially around this time.
| hnfong wrote:
| Here's an Obfuscated C Contest entry that trains a toy model
| using LSTM:
|
| https://www.ioccc.org/2019/mills/index.html
|
| I suppose if you only have 5 minutes this is probably about the
| level you'd get.
| fswd wrote:
| Right now, Qwen3 4B
| chasd00 wrote:
| AI is a broad term; the zero-to-hero series by Karpathy trains
| one in a Jupyter notebook. You can make some pretty powerful
| networks to de-duplicate database rows right on your laptop too.
| Data de-duplication and general MDM is pretty useful in large
| businesses.
| fontsgenerator wrote:
| Probably something like a small logistic regression or a tiny
| GPT-2 variant (117M parameters) on a small dataset--anything
| beyond that will choke on RAM, VRAM, or time. Five minutes on a
| laptop = toy models, not miracles.
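|
| For scale, here's the kind of toy that fits the budget: a
| character-level neural bigram LM that trains in minutes on a
| laptop CPU. Untested sketch; 'input.txt' is whatever corpus
| you have lying around:
|
|   import torch, torch.nn as nn
|
|   text = open('input.txt').read()
|   chars = sorted(set(text))
|   stoi = {c: i for i, c in enumerate(chars)}
|   data = torch.tensor([stoi[c] for c in text])
|
|   # Predict the next character from the current one.
|   model = nn.Sequential(nn.Embedding(len(chars), 64),
|                         nn.Linear(64, len(chars)))
|   opt = torch.optim.AdamW(model.parameters(), lr=3e-3)
|   loss_fn = nn.CrossEntropyLoss()
|
|   for step in range(2000):
|       ix = torch.randint(len(data) - 1, (256,))  # random batch
|       loss = loss_fn(model(data[ix]), data[ix + 1])
|       opt.zero_grad(); loss.backward(); opt.step()
|       if step % 500 == 0:
|           print(step, loss.item())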
| initramfs wrote:
| I looked up the most expensive laptop with an RTX 5090:
| https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
|
| $5599.00 https://marketplace.nvidia.com/en-us/consumer/gaming-
| laptops...
|
| Although you can get them with lower specs and the same GPU for
| $3,899.99
|
| https://marketplace.nvidia.com/en-us/consumer/gaming-laptops...
| nehal3m wrote:
| The same SKU on a GPU can perform differently depending on how
| the manufacturer powers and cools it [0], and nVidia's naming
| shenanigans don't help either [1].
|
| [0] https://www.digitaltrends.com/computing/laptop-gpu-power-
| lim... [1]
| https://coconote.app/notes/4c75b7a0-eb41-435d-85ee-55ae2dd8d...
| zipy124 wrote:
| And even worse, it's surprisingly hard to find out from spec
| sheets what power budget is assigned to the GPU, the CPU, or
| the two combined.
| bryanrasmussen wrote:
| I like this scenario for a future James Bond movie. Bond has to
| have an AI in chat pretend to be him to stall the bad guys while
| he is sneaking around the back, but the state of the art Bond
| persona bot that Q gave him in its own hardware enclosure has
| been smashed up in the previous fight scene.
|
| Bond has only minutes to train a strong enough AI model to
| pretend to be him and fool his targets long enough for him to
| gain entry to their impregnable fortress. Can he do it?!?
| rsyring wrote:
| But...they need to show him "training" it by smashing away at
| the keys frantically. A touch of sweat rolling down his face
| while a progress meter inches across the screen to suspenseful
| music.
| bryanrasmussen wrote:
| no that is a cliche from lesser brands, Bond will get drunk
| while it trains and shoot somebody with amazing accuracy.
| hinkley wrote:
| We're gonna need a montage.
| panarchy wrote:
| This would be more interesting if it wasn't about (L)LMs
| jasonjmcghee wrote:
| The idea of tracking and optimizing this reminds me of similar
| efforts a few years ago especially for image models via
| DAWNBench.
|
| https://dawnd9.sites.stanford.edu/dawnbench
| simianwords wrote:
| An idea worth exploring: if specialized models can be trained
| quickly on datasets, they can be used as tools by bigger models.
| Razengan wrote:
| I'd be happy with an AI that can just "train" on me: Just see
| what I do, learn from the repetitive tasks I do, and then do them
| quicker. An agent that is basically me x 10.
|
| Start blank with no corporate-controlled/crippled state and just
| become me.
|
| In fact, that might be the only way to let computers _appear_ to
| grow faster into the future, even if their internal hardware only
| gets minor incremental improvements: Have your shit done before
| you sit down to do it.
| jl6 wrote:
| Feels like there should be value in building smaller, more
| specialized models - maybe even doing so on-demand. I don't
| always want a model that knows Polish and astrophysics and
| Shakespeare, I want one that runs really fast and is laser-
| focused on the domain that I'm working on.
|
| I want to be able to say to a large general purpose LLM: "write a
| script that trains a model that is optimized for <useful task>"
| and then run _that_ model.
|
| Edit: well gosh darn. Within the edit window for this comment,
| Google goes and launches Gemma 3 270M.
| erkiserk wrote:
| One of the trends of machine learning, though, is that
| generalists outperform specialists on those specialists' tasks!
| jl6 wrote:
| But I'd happily accept some of that bitter lesson if the
| "worse specialist" ran way faster (or at all, given memory
| limits).
| indoordin0saur wrote:
| What about overnight on a desktop with a higher-end Nvidia gaming
| GPU? Asking for a friend.
| erikqu wrote:
| I would've liked to see some xlstms
| Animats wrote:
| _" Paris, France is a city in North Carolina. It is the capital
| of North Carolina."_
|
| If only we had a technology that didn't hallucinate and reported
| "I don't know". Then small models would be far more useful. Part
| of the need for insanely huge LLM models is to get coverage so
| broad that they don't have to make up stuff.
|
| It would be nice to be able to train a customer service bot on a
| laptop in a reasonable length of time. But it will screw up badly
| outside its area of competence, which will happen frequently.
| Closi wrote:
| I don't think we should use an AI trained in 5 minutes on a
| laptop to infer what small models are capable of...
|
| Sure, they still have massive problems with hallucination, but
| I don't think this article gives us any more insight into
| that!
| gambiting wrote:
| Why not? And I'm not being flippant, but like....isn't that
| the whole point of small models?
| kevinventullo wrote:
| As I understand it, the most effective small models are
| synthesized from larger models.
| remexre wrote:
| For one thing, the model is trained on a language modelling
| task, not a question-answering task?
| jarmitage wrote:
| AI is sorely lacking a demoscene
| andrewstuart wrote:
| Would have been useful to see exact steps taken to replicate the
| result.
| iamgopal wrote:
| If only AI models were trained to connect to data (SQL) and to
| answer some of the questions from that data source instead of
| just training on it, it could reduce model size a lot.
| CharlesW wrote:
| That's what tools are for. (see MCP:
| https://modelcontextprotocol.io/docs/getting-started/intro)
| scubbo wrote:
| Would RAG also be an approach here? My intuition from some
| small investigation is that RAG is more formal and structured
| to set up, but more efficient, whereas MCP you can just point
| an LLM at an MCP server and tell it to figure shit out (and
| also MCP can be used to _do_ stuff, not just to acquire more
| information).
| CharlesW wrote:
| > _Would RAG also be an approach here?_
|
| For sure! If the RAG context includes "Raleigh is the
| capital city of the U.S. state of North Carolina" somewhere
| in whatever you feed it, one would hope that you'd get an
| accurate answer to that question.
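|
| Something like this, as a toy sketch (no real vector store;
| the prompt just gets stuffed with whatever a lexical-overlap
| "retriever" digs up, and you'd hand the result to whatever
| model you run):
|
|   docs = [
|       "Raleigh is the capital city of the U.S. state of"
|       " North Carolina.",
|       "Paris is the capital of France.",
|   ]
|
|   def retrieve(query, k=1):
|       # Toy scoring by shared words; real RAG uses embeddings.
|       q = set(query.lower().split())
|       score = lambda d: len(q & set(d.lower().split()))
|       return sorted(docs, key=score, reverse=True)[:k]
|
|   def build_prompt(query):
|       context = "\n".join(retrieve(query))
|       return ("Answer using only this context:\n" + context +
|               "\n\nQuestion: " + query + "\nAnswer:")
|
|   print(build_prompt("What is the capital of North Carolina?"))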
| scubbo wrote:
| Thank you!
| charcircuit wrote:
| A trick that would be useful would be to start with an existing
| model instead of trying to generate it from a random starting
| place.
| lsb wrote:
| This is evocative of "cramming", a paper from a few years ago,
| where the author tried to find the best model they could train
| for a day on a modern laptop: https://arxiv.org/abs/2212.14034
| quux wrote:
| Depends on how much weight you can support on your lap
| profsummergig wrote:
| Readers: I'm looking for toy, quick AI exercises that can be
| trained on a laptop, and help the doer increase their confidence
| in AI concepts (learning by doing, and all that).
|
| The OP fits the bill.
|
| If you can suggest other such exercises, please share in reply to
| this post.
|
| Thank you.
| dileeparanawake wrote:
| Siri.
| raindear wrote:
| How far can you go by improving the curriculum? Start simple.
| Find a shorter and shorter sequence of examples that gives you
| the best result. What is the shortest sequence to get to some
| perplexity? Why?
| remexre wrote:
| Am I missing where the GitHub link is for this, or did the author
| not release sources? It'd be fun to reproduce this on a different
| machine, and play around with other architectures and optimizers
| that weren't mentioned in the article...
| trhway wrote:
| There was https://sortbenchmark.org, and now we need a similar
| one for AI - best per joule, per 1 cent, per minute.
___________________________________________________________________
(page generated 2025-08-14 23:00 UTC)