[HN Gopher] Ggml.ai joins Hugging Face to ensure the long-term p...
___________________________________________________________________
Ggml.ai joins Hugging Face to ensure the long-term progress of
Local AI
Author : lairv
Score : 808 points
Date : 2026-02-20 13:51 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| rvz wrote:
| This acquisition is almost the same as the acquisition of Bun by
| Anthropic.
|
| Both $0 revenue "companies", but have created software that is
| essential to the wider ecosystem and has mindshare value; Bun for
| Javascript and Ggml for AI models.
|
| But of course the VCs needed an exit sooner or later. That was
| inevitable.
| andsoitis wrote:
| I believe ggml.ai was funded by angel investors, not VC.
| jimmydoe wrote:
| Amazing. I like the openness of both project and really excited
| for them.
|
| Hopefully this does not mean consolidation due to resource dry up
| but true fusion of the bests.
| mnewme wrote:
| Huggingface is the silent GOAT of the AI space, such a great
| community and platform
| lairv wrote:
| Truly amazing that they've managed to build an open and
| profitable platform without shady practices
| al_borland wrote:
| It's such a sad state of affairs when shady practices are so
| normal that finding a company without them is noteworthy.
| geooff_ wrote:
| As someone who's been in the "AI" space for a while its strange
| how Hugging Face went from one of the biggest name to not a part
| of the discussion at all.
| r_lee wrote:
| I think that's because there's less local AI usage now since
| there's all kinds of image models by the big labs, so there's
| really no rush of people self hosting stable diffusion etc
| anymore
|
| the space moved from Consumer to Enterprise pretty fast due to
| models getting bigger
| zozbot234 wrote:
| Today's free models are not really bigger when you account
| for the use of MoE (with ever increasing sparsity, meaning a
| smaller fraction of active parameters), and better ways of
| managing KV caching. You can do useful things with very
| little RAM/VRAM, it just gets slower and slower the more you
| try to squeeze it where it doesn't quite belong. But that's
| not a problem if you're willing to wait for every answer.
| r_lee wrote:
| yeah, but I mean more like the old setups where you'd just
| load a model on a 4090 or something, even with MoE it's a
| lot more complex and takes more VRAM, right? like it just
| seems not justifiable for most hobbyists
|
| but maybe I'm just slightly out of the loop
| zozbot234 wrote:
| With sparse MoE it's worth running the experts in system
| RAM since that allows you to transparently use mmap and
| inactive experts can stay on disk. Of course that's also
| a slowdown unless you have enough RAM for the full set,
| but it lets you run much larger models on smaller
| systems.
| LatencyKills wrote:
| It isn't necessary to be part of the discussion if you are
| truly adding value (which HF continues to do). It's nice to see
| a company doing what it does best without constantly driving
| the hype train.
| segmondy wrote:
| part of what discussion? anyone in the AI space knows and uses
| HF, but the public doesn't give a care and why should they?
| It's just an advanced site were nerds download AI stuff. HF is
| super valuable with their transformers library, their code,
| tutorials, smol-models, etc, but how does it translate to
| investor dollars?
| HanClinto wrote:
| I'm regularly amazed that HuggingFace is able to make money. It
| does so much good for the world.
|
| How solid is its business model? Is it long-term viable? Will
| they ever "sell out"?
| I_am_tiberius wrote:
| I once tried hugging face because I wanted I worked through
| some tutorial. They wanted my credit card details during the
| registration as far as I remember. After a month they invoiced
| me some amount of money and I had no idea what it was. To be
| honest, I don't understand what exactly they do and what
| services I was paying for, but I cancelled my account and never
| touched it again. For me that was a totally intransparent
| process.
| shafyy wrote:
| Their pricing seems pretty transparent:
| https://huggingface.co/pricing
| dmezzetti wrote:
| They have paid hosting - https://huggingface.co/enterprise and
| paid accounts. Also consulting services. Seems like a pretty
| good foundation to me.
| julien_c wrote:
| and a lot of traction on paid (private in particular) storage
| these days; sneak peek at new landing page:
| https://huggingface.co/storage
| microsoftedging wrote:
| FT had a solid piece a few weeks back: "Why AI start-up Hugging
| Face turned down a $500mn Nvidia deal"
|
| https://giftarticle.ft.com/giftarticle/actions/redeem/9b4eca...
| jackbravo wrote:
| sounds very interesting, but even though it says
| giftarticle.ft, I got blocked by a paywall.
| nerevarthelame wrote:
| https://archive.is/zSyUc
|
| To summarize, they rejected Nvidia's offer because they
| didn't want one outsized investor who could sway decisions.
| And "the company was also able to turn down Nvidia due to
| its stable finances. Hugging Face operates a 'freemium'
| business model. Three per cent of customers, usually large
| corporations, pay for additional features such as more
| storage space and the ability to set up private
| repositories."
| bee_rider wrote:
| Freemium seems to be working pretty well for them--what's
| the alternative website, after all. They seem to command
| their niche.
| culi wrote:
| find the Bypass Paywalls Clean extension. Never worry about
| a paywall again
| heliumtera wrote:
| >Will they ever "sell out"?
|
| Oh no, never. Don't worry, the usual investors are very well
| known for fighting for user autonomy (AMD, Nvidia, Intel,IBM,
| Qualcomm)
|
| They are all very pro consumers and all backers are certainly
| here for your enjoyment only
| zozbot234 wrote:
| These are all big hardware firms, which makes a lot of sense
| as a classic 'commoditize the complement' play. Not exactly
| pro-consumer, but not quite anti-consumer either!
| 5o1ecist wrote:
| > AMD, Nvidia, Intel, IBM, Qualcomm
|
| > but not quite anti-consumer either!
|
| All of them are public companies, which means that their
| default state is anti-consumer and pro-shareholder. By law
| they are required to do whatever they can to maximize
| profits. History teaches that shareholders can demand
| whatever they want, with the respective companies following
| orders, since nobody ever really has to suffer consequences
| and any and all potential fines are already priced in, in
| advance, anyway.
|
| Conversely, this is why Valve is such a great company.
| Valve is probably one of the only few actual pro-consumer
| companies out there.
|
| Fun Fact! Rarely is it ever mentioned anywhere, but Valve
| is not a public company! Valve is a private company! That's
| why they can operate the way they do! If Valve was a public
| company, then greedy, crooked billionaire shareholders
| would have managed to get rid of Gabe a long time ago.
| HanClinto wrote:
| Great points.
|
| Valve is one of my top favorite companies right now. Love
| the work they're doing, and their products are amazing.
|
| Can hardly wait for the Steam Frame.
| RussianCow wrote:
| > By law they are required to do whatever they can to
| maximize profits.
|
| I know it's a nit-pick, but I hate that this always gets
| brought up when it's not actually true. Public
| corporations face pressure from investors to maximize
| returns, sure, but there is no law stating that they have
| to maximize profits at all costs. Public companies can
| (and often do) act against the interest of immediate
| profits for some other gain. The only real leverage that
| investors have is the board's ability to fire executives,
| but that assumes that they have the necessary votes to do
| so. As a counter-example, Mark Zuckerberg still controls
| the majority of voting power at Meta, so he can
| effectively do whatever he wants with the company without
| major consequence (assuming you don't consider stock
| price fluctuations "major").
|
| But I say this not to take away from your broader point,
| which I agree with: the short-term profit-maximizing
| culture is indeed the default when it comes to publicly
| traded corporations. It just isn't something inherent in
| being publicly traded, and in the inverse, private
| companies often have the same kind of culture, so that's
| not a silver bullet either.
| 5o1ecist wrote:
| You're perfectly right and I don't consider it a nitpick.
| I really should be more precise about this, instead of
| spreading inaccuracies. Thank you!
| chucksmash wrote:
| It's a worthwhile point to make because if people believe
| that misconception then it lets companies wash their
| hands of flagrantly bad behavior. "Gosh, we should really
| get around to changing the law that makes them act that
| way."
| smallerize wrote:
| heliumtera is being sarcastic.
| bityard wrote:
| Their business model is essentially the same as GitHub. Host
| lots of stuff for free and build a community around it, sell
| the upscaled/private version to businesses. They are already
| profitable.
| HanClinto wrote:
| This is what Sourceforge did too, and they still had the
| DevShare adware thing didn't they?
|
| GitHub is great -- huge fan. To some degree they "sold out"
| to Microsoft and things could have gone more south, but
| thankfully Microsoft has ruled them with a very kind hand,
| and overall I'm extremely happy with the way they've handled
| it.
|
| I guess I always retain a bit of skepticism with such things,
| and the long-term viability and goodness of such things never
| feels totally sure.
| dmezzetti wrote:
| This is really great news. I've been one of the strongest
| supporters of local AI dedicating thousands of hours towards
| building a framework to enable it. I'm looking forward to seeing
| what comes of it!
| logicallee wrote:
| >I've been one of the strongest supporters of local AI,
| dedicating thousands of hours towards building a framework to
| enable it.
|
| Sounds like you're very serious about supporting local AI. I
| have a query for you (and anyone else who feels like donating)
| about whether you'd be willing to donate some memory/bandwidth
| resources p2p to hosting an offline model:
|
| We have a local model we would like to distribute but don't
| have a good CDN.
|
| As a user/supporter question, would you be willing to donate
| some spare memory/bandwidth in a simple dedicated browser tab
| you keep open on your desktop that plays silent audio (to not
| be put in the background and deloaded) and then allocates 100mb
| -1 gb of RAM and acts as a webrtc peer, serving checksumed
| models?[1] (Then our server only has to check that you still
| have it from time to time, by sending you some salt and a part
| of the file to hash and your tab proves it still has it by
| doing so). This doesn't require any trust, and the receiving
| user will also hash it and report if there's a mismatch.
|
| Our server federates the p2p connections, so when someone
| downloads they do so from a trusted peer (one who has
| contributed and passed the audits) like you. We considered
| building a binary for people to run but we consider that people
| couldn't trust our binaries, or would target our build process
| somehow, we are paranoid about trust, whereas a web model is
| inherently untrusted and safer. Why do all this?
|
| The purpose of this would be to host an offline model: we
| successfully ported a 1 GB model from C++ and Python to WASM
| and WebGPU (you can see Claude doing so here, we livestreamed
| some of it[2]), but the model weights at 1 GB are too much for
| us to host.
|
| Please let us know whether this is something you would
| contribute a background tab to hosting on your desktop. It
| wouldn't impact you much and you could set how much memory to
| dedicate to it, but you would have the good feeling of knowing
| that you're helping people run a trusted offline model if they
| want - from their very own browser, no download required. The
| model we ported is fast enough for anyone to run on their own
| machines. Let me know if this is something you'd be willing to
| keep a tab open for.
|
| [1] filesharing over webrtc works like this:
| https://taonexus.com/p2pfilesharing/ you can try it in 2
| browser tabs.
|
| [2] https://www.youtube.com/watch?v=tbAkySCXyp0and and some
| other videos
| liuliu wrote:
| > We have a local model we would like to distribute but don't
| have a good CDN.
|
| That is not true. I am serving models off Cloudflare R2. It
| is 1 petabyte per month in egress use and I basically pay
| peanuts (~$200 everything included).
| logicallee wrote:
| 1 petabyte per month is 1 million downloads of a 1 GB file.
| We intend to scale to more than 1 million downloads per
| month. We have a specific scaling architecture in mind.
| We're qualified to say this because we've ported a billion
| parameter model to run in your browser - fast - on either
| webgpu or wasm. (You can see us doing it live at the
| youtube link in my comment above.) There is a lot of demand
| for that.
| liuliu wrote:
| The bandwidth is free on Cloudflare R2. I paid money for
| storage (~10TiB storage of different models). If you only
| host 1GiB file there, you are only paying $0.01 per month
| I believe.
| dirasieb wrote:
| how about you work on achieving 1 million downloads per
| month first? talk about putting the horse before the
| carriage
| echoangle wrote:
| Maybe stupid question but why not just put it in a torrent?
| logicallee wrote:
| Torrents require users to download and install a torrent
| client! In addition, we would like to retain the
| possibility of giving live updates to the latest version of
| a sovereign fine-tuned file, torrents don't autoupdate. We
| want to keep improving what people get.
|
| Finally, we would like the possibility of setting up market
| dynamics in the future: if you aren't currently using all
| your ram, why not rent it out? This matches the p2p edge
| architecture we envision.
|
| In addition, our work on WebGPU would allow you to rent out
| your gpu to a background tab whenever you're not using it.
| Why have all that silicon sit idle when you could rent it
| out?
|
| You could also donate it to help fine tune our own
| sovereign model.
|
| All of this will let us bootstrap to the point where we
| could be trusted with a download.
|
| We have a rather paranoid approach to security.
| liuliu wrote:
| It is very simple. Storage / bandwidth is not expensive.
| Residential bandwidth is. If you can convince people to
| install a bandwidth-related software on their residential
| homes, you can then charge other people $5 to $10 per 1GiB
| bandwidth (useful for botnet mostly, get around DDOS
| protections and other reCAPTCHA tasks).
| logicallee wrote:
| Thank you for your suggestion. Below is only our
| plans/intentions, we welcome feedback about it:
|
| We are not going to do what you suggest. Instead, our
| approach is to use the RAM people aren't using at the
| moment for a fast edge cache close to their area.
|
| We've tried this architecture and get very low latency
| and high bandwidth. People would not be contributing
| their resources to anything they don't know about.
| HanClinto wrote:
| Hosting model weights for projects like this I think is
| something that you could upload to a space in Hugging Face?
|
| What services would you need that Hugging Face doesn't
| provide?
| beoberha wrote:
| Seems like a great fit - kinda surprised it didn't happen sooner.
| I think we are deep in the valley of local AI, but I'd be willing
| to bet it breaks out in the next 2-3 years. Here's hoping!
| breisa wrote:
| I mean they already supported the project quite a bit. @ngxson
| and maybe others? from Huggingface are big contributors to
| llama.cpp.
| mythz wrote:
| I consider HuggingFace more "Open AI" than OpenAI - one of the
| few quiet heroes (along with Chinese OSS) helping bring on-
| premise AI to the masses.
|
| I'm old enough to remember when traffic was expensive, so I've no
| idea how they've managed to offer free hosting for so many
| models. Hopefully it's backed by a sustainable business model, as
| the ecosystem would be meaningfully worse without them.
|
| We still need good value hardware to run Kimi/GLM in-house, but
| at least we've got the weights and distribution sorted.
| zozbot234 wrote:
| > We still need good value hardware to run Kimi/GLM in-house
|
| If you stream weights in from SSD storage and freely use swap
| to extend your KV cache it will be really slow (multiple
| seconds per token!) but run on basically anything. And that's
| still really good for stuff that can be computed overnight,
| perhaps even by batching many requests simultaneously. It gets
| progressively better as you add more compute, of course.
| HPsquared wrote:
| At a certain point the energy starts to cost more than
| renting some GPUs.
| vardalab wrote:
| Yeah, that is hard to argue with because I just go to
| OpenRouter and play around with a lot of models before I
| decide which ones I like. But there's something special
| about running it locally in your basement
| dotancohen wrote:
| I'd love to hear more about this. How do you decide that
| you like a model? For which use cases?
| fc417fc802 wrote:
| Aren't decent GPU boxes in excess of $5 per hour? At $0.20
| per kWhr (which is on the high side in the US) running a 1
| kW workstation 24/7 would work out to the same price as 1
| hour of GPU time.
|
| The issue you'll actually run into is that most residential
| housing isn't wired for more than ~2kW per room.
| Aurornis wrote:
| > it will be really slow (multiple seconds per token!)
|
| This is fun for proving that it can be done, but that's 100X
| slower than hosted models and 1000X slower than GPT-Codex-
| Spark.
|
| That's like going from real time conversation to e-mailing
| someone who only checks their inbox twice a day if you're
| lucky.
| zozbot234 wrote:
| You'd need real rack-scale/datacenter infrastructure to
| properly match the hosted models that are keeping
| everything in fast VRAM at all times, and then you only get
| reasonable utilization on that by serving requests from
| many users. The ~100X slower tier is totally okay for
| experimentation and non-conversational use cases (including
| some that are more agentic-like!), and you'd reach ~10X
| (quite usable for conversation) by running something like a
| good homelab.
| data-ottawa wrote:
| Can we toss in the work unsloth does too as an unsung hero?
|
| They provide excellent documentation and they're often very
| quick to get high quality quants up in major formats. They're a
| very trustworthy brand.
| cubie wrote:
| I'm a big fan of their work as well, good shout.
| danielhanchen wrote:
| Thank you!
| disiplus wrote:
| Yeah, they're the good guys. I suspect the open source work
| is mostly advertisements for them to sell consulting and
| services to enterprises. Otherwise, the work they do doesn't
| make sense to offer for free.
| arcanemachiner wrote:
| I hope that is exactly what is happening. It benefits them,
| and it benefits us.
| danielhanchen wrote:
| Haha for now our primary goal is to expand the market for
| local AI and educate people on how to do RL, fine-tuning
| and running quants :)
| WanderPanda wrote:
| Amazing work and people should really appreciate that the
| opportunity costs of your work are immense (given the
| hype).
|
| On another note: I'm a bit paranoid about quantization. I
| know people are not good at discerning model quality at
| these levels of "intelligence" anymore, I don't think a
| vibe check really catches the nuances. How hard would it
| be to systematically evaluate the different
| quantizations? E.g. on the Aider benchmark that you used
| in the past?
|
| I was recently trying Qwen 3 Coder Next and there are
| benchmark numbers in your article but they seem to be for
| the official checkpoint, not the quantized ones. But it
| is not even really clear (and chatbots confuse them for
| benchmarks of the quantized versions btw.)
|
| I think systematic/automated benchmarks would really
| bring the whole effort to the next level. Basically
| something like the bar chart from the Dynamic
| Quantization 2.0 article but always updated with all
| kinds of recent models.
| Zetaphor wrote:
| This would be amazing
| danielhanchen wrote:
| Working on it! :)
| jychang wrote:
| > How hard would it be to systematically evaluate the
| different quantizations? E.g. on the Aider benchmark that
| you used in the past?
|
| Very hard. $$$
|
| The benchmarks are not cheap to run. It'll cost a lot to
| run them for each quant of each model.
| danielhanchen wrote:
| Yes sadly very expensive :( Maybe a select few quants
| could happen - we're still figuring out what is the most
| economical and most efficient way to benchmark!
| illusive4080 wrote:
| Roughly how much does it cost to run one of the popular
| benchmarks? Are we talking $1,000, $10,000, or $100k?
| danielhanchen wrote:
| Thanks! Yes we actually did think about that - it can get
| quite expensive sadly - perplexity benchmarks over short
| context lengths with small datasets are doable, but it's
| not an accurate measure sadly. We're actually
| investigating currently what would be the best efficient
| course of action on evaluating quants - will keep you
| posted!
| danielhanchen wrote:
| Oh thank you - appreciate it :)
| swyx wrote:
| not that unsung! we've given them our biggest workshop spot
| every single year we've been able to and will do until they
| are tired of us
| https://www.youtube.com/@aiDotEngineer/search?query=unsloth
| danielhanchen wrote:
| Appreciate it immensely haha :) Never tired - always
| excited and pumped for this year!
| sowbug wrote:
| Why doesn't HF support BitTorrent? I know about hf-torrent and
| hf_transfer, but those aren't nearly as accessible as a link in
| the web UI.
| embedding-shape wrote:
| > Why doesn't HF support BitTorrent?
|
| Harder to track downloads then. Only when clients hit the
| tracker would they be able to get download states, and forget
| about private repositories or the "gated" ones that
| Meta/Facebook does for their "open" models.
|
| Still, if vanity metrics wasn't so important, it'd be a great
| option. I've even thought of creating my own torrent mirror
| of HF to provide as a public service, as eventually access to
| models will be restricted, and it would be nice to be
| prepared for that moment a bit better.
| sowbug wrote:
| I thought of the tracking and gate questions, too, when I
| vibed up an HF torrent service a few nights ago. (Super
| annoying BTW to have to download the files just to hash the
| parts, especially when webseeds exist.) Model owners could
| disable or gate torrents the same way they gate the models,
| and HF could still measure traffic by .torrent downloads
| and magnet clicks.
|
| It's a bit like any legalization question -- the black
| market exists anyway, so a regulatory framework could bring
| at least some of it into the sunlight.
| embedding-shape wrote:
| > Model owners could disable or gate torrents the same
| way they gate the models, and HF could still measure
| traffic by .torrent downloads and magnet clicks.
|
| But that'll only stop a small part, anyone could share
| the infohash and if you're using the dht/magnet without
| .torrent files or clicks on a website, no one can count
| those downloads unless they too scrape the dht for peers
| who are reporting they've completed the download.
| sowbug wrote:
| Right, but that's already happening today. That's the
| black-market point.
| fc417fc802 wrote:
| > unless they too scrape the dht for peers who are
| reporting they've completed the download.
|
| Which can be falsified. Head over to your favorite
| tracker and sort by completed downloads to see what I
| mean.
| taminka wrote:
| most of the traffic is probably from open weights, just
| seed those, host private ones as is
| homarp wrote:
| how are all the private trackers tracking ratios?
| jimbob45 wrote:
| Wouldn't it still provide massive benefits if they could
| convince/coerce their most popular downloaded models to
| move to torrenting?
| intrasight wrote:
| Benefit to you, but great downside to the three letter
| agencies that inject their goods into these models.
| Barbing wrote:
| That would be a very nice service. I think folks might rely
| on it for a number of reasons, including that we'll want to
| see how biases changed over time. What got sloppier,
| shillier...
| Fin_Code wrote:
| I still don't know why they are not running on torrent. Its the
| perfect use case.
| heliumtera wrote:
| How can you be the man in the middle in a truly P2P
| environment?
| freedomben wrote:
| That would shut out most people working for big corp, which
| is probably a huge percentage of the user base. It's dumb,
| but that's just the way corp IT is (no torrenting allowed).
| zozbot234 wrote:
| It's a sensible option, even when not everyone can really
| use it. Linux distros are routinely transfered via torrent,
| so why not other massive, open-licensed data?
| freedomben wrote:
| Oh as an option, yeah I agree it makes a ton of sense. I
| just would expect a very, very small percentage of people
| to use the torrent over the direct download. With Linux
| distros, the vast majority of downloads still come from
| standard web servers. When I download distro images I opt
| for torrents, but very few people do the same
| zrm wrote:
| With Linux distros they typically put the web link right
| on the main page and have a torrent available if you go
| look for it, because they want you to try their distro
| more than they want to save some bandwidth.
|
| Suppose HF did the opposite because the bandwidth saved
| is more and they're not as concerned you might download a
| different model from someone else.
| Const-me wrote:
| > very small percentage of people to use the torrent over
| the direct download
|
| BitTorrent protocol is IMO better for downloading large
| files. When I want to download something which exceeds
| couple GB, and I see two links direct download and
| BitTorrent, I always click on the torrent.
|
| On paper, HTTP supports range requests to resume partial
| downloads. IME, it seems modern web browsers neglected to
| implement it properly. They won't resume after browser is
| reopened, or the computer is restarted. Command-line HTTP
| clients like wget are more reliable, however many web
| servers these days require some session cookies or one-
| time query string tokens, and it's hard to pass that
| stuff from browser to command-line.
|
| I live in Montenegro, CDN connectivity is not great here.
| Only a few of them like steam and GOG saturate my 300
| megabit/sec download link. Others are much slower, e.g.
| windows updates download at about 100 megabit/sec.
| BitTorrent protocol almost always delivers the 300
| megabit/sec bandwidth.
| thot_experiment wrote:
| I have terabytes of linux isos I got via torrents, many
| such cases!
| Tepix wrote:
| It's insane how much traffic HF must be pushing out of the
| door. I routinely download models that are hundreds of
| gigabytes in size from them. A fantastic service to the
| sovererign AI community.
| vardalab wrote:
| Yup, I have downloaded probably a terabyte in the last week,
| especially with the Step 3.5 model being released and Minimax
| quants. I wonder what my ISP thinks. I hope they don't cut me
| off. They gave me a fast lane, they better let me use it, lol
| fc417fc802 wrote:
| Even fairly restrictive data caps are in the range of 6 Tb
| per month. P2P at a mere 100 Mb works out to 1 TiB per 24
| hours.
|
| Hypothetically my ISP will sell me unmetered 10 Gb service
| but I wonder if they would actually make good on their word
| ...
| 3eb7988a1663 wrote:
| I have a 1.2TB cap before you start getting charged
| extra, so you might need to recalibrate your restrictive
| level.
| fc417fc802 wrote:
| Is that with a WISP by chance? Or in a developing
| country? Or are there really wired providers with such
| low caps in the western world in this day and age?
| nagaiaida wrote:
| well it's my wired cap a stone's throw from buildings
| with google cloud logos on the side in a major us city,
| so...
| zargon wrote:
| Comcast.
| Zetaphor wrote:
| ATT once told me if I don't pay for their TV service then
| my home gigabit fiber would have a 1TB cap. They had an
| agreement with the apartment building so I had no other
| choice of provider.
| fc417fc802 wrote:
| Buy our off brand netflix or else we'll make it so you
| can't watch netflix. How is that legal?
| Zetaphor wrote:
| The law is written by the highest bidder, and the telecom
| lobbyists are very generous
| razster wrote:
| My fear is that these large "AI" companies will lobby to have
| these open source options removed or banned, growing concern.
| I'm not sure how else to explain how much I enjoy using what
| HF provides, I religiously browse their site for new and
| exciting models to try.
| culi wrote:
| ModelScope is the Chinese equivalent of Hugging Face and a
| good back up. All the open models are Chinese anyways
| thot_experiment wrote:
| Not true! Mistral is really really good, but I agree that
| there isn't a single decent open model from the USA.
| culi wrote:
| Mistral is cool and I wish them success but it
| consistently ranks extremely low on benchmarks while
| still being expensive. Chinese models like DeepSeek might
| rank almost as low as Mistral but they are significantly
| cheaper. And Kimi is the best of both worlds with
| incredible benchmark results while still being incredibly
| cheap
|
| I know things change rapidly so I'm not counting them out
| quite yet but I don't see them as a serious contender
| currently
| Eupolemos wrote:
| Why are you talking price when we are talking local AI?
|
| That doesn't make any sense to me. Am I missing
| something?
| culi wrote:
| Your electricity is free?
| cpburns2009 wrote:
| If you have the hardware to run expensive models, is the
| cost of electricity much of a factor? According to
| Google, the average price in the Silicon Valley Area is
| $0.448 per kWh. An RTX 5090 costs about $4,000 and has a
| peak power consumption of 1000 W. Maxing out that GPU for
| a whole year would cost $3,925 at that rate. It's not
| particularly more expensive than that hardware itself.
| culi wrote:
| At that point it'd be cheaper to get an expensive
| subscription to a cloud platform AI product. I understand
| the case for local LLMs but it seems silly to worry about
| pricing for cloud-based offerings but not worry about
| pricing for locally run models. Especially since running
| it locally can often be more expensive
| seanmcdirmid wrote:
| Apple silicon is crazy efficient as well as being
| comparable to GPUs in performance for max and ultra
| chips.
| thot_experiment wrote:
| for almost the entire year, yes.
| dirasieb wrote:
| 15 missed calls from your local power company
| thot_experiment wrote:
| Sure, benchmarks are fake and I use Mistral over
| equivalently sized models most of the time because it's
| better in real life. It runs plenty fast for me, I don't
| pay for inference.
| BoredomIsFun wrote:
| > it consistently ranks extremely low on benchmarks
|
| As general purpose chatbots small Mistral models are
| better than comparably sized Chiniese models, as they
| have better SimpleQA scores and general knowledge of
| Western culture.
| seanmcdirmid wrote:
| It's really hard to beat qwen coder, especially for role
| play where the instruction following is really useful. I
| don't think their corpus is lacking in western knowledge,
| although I wonder if Chinese users get even better
| results from it?
| BoredomIsFun wrote:
| > It's really hard to beat qwen coder, for role play
|
| I am not sure if you actually tried that. Mistrals are
| widely asccepted go-to models for roleplay and creative
| writing. No Qwens are good at prose, except for their
| latest big Qwen 3.5.
|
| > I don't think their corpus is lacking in western
| knowledge,
|
| It absolutely does, especially pop culture knowledge.
| seanmcdirmid wrote:
| Instruct and coder just follow instructions so well
| though. I guess I've just never been able to make mistral
| work well, I guess.
| BoredomIsFun wrote:
| Qwen3 30B A3B and that big 400+ B Coder were absolutely
| terrible at editing fiction. I would tell them what to
| change in the prose and they'd just regurgitate text with
| no changes.
| CamperBob2 wrote:
| To be fair there are lots of worse models than OpenAI's
| GPT-OSS-120b. It's not a standout when positioned next to
| the latest releases from China, but prior to the current
| wave it was considered one of the stronger local models
| you can reasonably run.
| throwaway27448 wrote:
| They can try. I don't think they'll be able to get the
| toothpaste back in the tube. The data will just move our of
| the country.
| seanmcdirmid wrote:
| Many of the models on hugging face are already Chinese.
| It's kind of obvious that local AI is going to flourish
| more in China than the USA due to hardware constraints.
| dotancohen wrote:
| How do you choose which models to try for which workflows?
| Do you have objective tests that you run, or do you just
| get a feel for them while using them in your daily
| workflow?
| toofy wrote:
| it's only a matter of time. we have all seen first hand how
| ... wrong ... these companies behave, almost on a regular
| basis.
|
| there's a small tinfoil hat part of me that suspects part
| of their obscene investments and cornering the hardware
| market is driven by an conscious attempt to stop open
| source local from taking off. they want it all, the money,
| the control, and to be the only source of information to
| us.
| Onavo wrote:
| Bandwidth is not that expensive. The Big 3 clouds just want
| to milk customers via egress. Look at Hetzner or CloudFlare
| R2 if you want to get get an idea of commodity bandwidth
| costs.
| the__alchemist wrote:
| Does anyone have a good comparison of HuggingFace/Candle to Burn?
| I am testing them concurrently, and Burn seems to have an easier-
| to-use API. (And can use Candle as a backend, which is confusing)
| When I ask on Reddit or Discord channels, people overwhelmingly
| recommend Burn, but provide no concrete reasons beyond "Candle is
| more for inference while Burn is training and inference". This
| doesn't track, as I've done training on Candle. So, if you've
| used both: Thoughts?
| csunoser wrote:
| I have used both (albeit 2 years ago, and things change really
| fast). At the time, Candle didn't have 2d conv backprop with
| strides properly implemented. And getting Burn running libtch
| backend was just a lot simpler.
|
| I did use candle for wasm based inference for teaching purposes
| - that was reasonably painless and pretty nice.
| dhruv3006 wrote:
| Huggingface is actually something thats driving good in the
| world. Good to see this collab/
| androiddrew wrote:
| One of the few acquisitions I do support
| tkp-415 wrote:
| Can anyone point me in the direction of getting a model to run
| locally and efficiently inside something like a Docker container
| on a system with not so strong computing power (aka a Macbook M1
| with 8gb of memory)?
|
| Is my only option to invest in a system with more computing
| power? These local models look great, especially something like
| https://huggingface.co/AlicanKiraz0/Cybersecurity-BaronLLM_O...
| for assisting in penetration testing.
|
| I've experimented with a variety of configurations on my local
| system, but in the end it turns into a make shift heater.
| xrd wrote:
| I think a better bet is to ask on reddit.
|
| https://www.reddit.com/r/LocalLLM/
|
| Everytime I ask the same thing here, people point me there.
| zozbot234 wrote:
| The general rule of thumb is that you should feel free to
| quantize even as low as 2 bits average if this helps you run a
| model with more active parameters. Quantized models are not
| perfect at all, but they're preferable to the models with
| fewer, bigger parameters. With 8GB usable, you could run models
| with up to 32B active at heavy quantization.
| zargon wrote:
| A large model (100B+, the more the better) may be acceptable
| at 2-bit quantization, depending on the task. But not a small
| model. Especially not for technical tasks. On top of that,
| one still needs room for OS, software and KV cache. 8GB is
| just not very useful for local LLMs. That said, it can still
| be entertaining to try out a 4-bit 8B model for the fun of
| it.
| zozbot234 wrote:
| 100B+ is the amount of total parameters, whereas what
| matters here is active - very different for sparse MoE
| models. You're right that there's some overhead for the
| OS/software stack but it's not that much. KV-cache is a
| good candidate for being swapped out, since it only gets a
| limited amount of writes per emitted token.
| zargon wrote:
| Total parameters, not active parameters, is the property
| that matters for model robustness under extreme
| quantization.
|
| Once you're swapping from disk, the performance will be
| quite unusable for most people. And for local inference,
| KV cache is the worst possible choice to put on disk.
| mft_ wrote:
| There's no way around needing a powerful-enough system to run
| the model. So you either choose a model that can fit on what
| you have --i.e. via a small model, or a quantised slightly
| larger model-- or you access more powerful hardware, either by
| buying it or renting it. (IME you don't need Docker. For an
| easy start just install LM Studio and have a play.)
|
| I picked up a second-hand 64GB M1 Max MacBook Pro a while back
| for not too much money for such experimentation. It's
| sufficiently fast at running any LLM models that it can fit in
| memory, but the gap between those models and Claude is
| considerable. However, this might be a path for you? It can
| also run all manner of diffusion models, but there the
| performance suffers (vs. an older discrete GPU) and you're
| waiting sometimes many minutes for an edit or an image.
| sigbottle wrote:
| Are mac kernels optimized compared to CUDA kernels? I know
| that the unified GPU approach is inherently slower, but I
| thought a ton of optimizations were at the kernel level too
| (CUDA itself is a moat)
| bigyabai wrote:
| Mac kernels are almost always compute shaders written in
| Metal. That's the bare-minimum of acceleration, being done
| in a non-portable proprietary graphics API. It's optimized
| in the loosest sense of the word, but extremely far from
| "optimal" relative to CUDA (or hell, even Vulkan Compute).
|
| Most people will not choose Metal if they're picking
| between the two moats. CUDA is far-and-away the better
| hardware architecture, not to mention better-supported by
| the community.
| liuliu wrote:
| Depending on what you do. If you are doing token
| generations, compute-dense kernel optimization is less
| interesting (as, it is memory-bounded) than latency
| optimizations else where (data transfers, kernel
| invocations etc). And for these, Mac devices actually have
| a leg than CUDA kernels (as pretty much Metal shaders
| pipelines are optimized for latencies (a.k.a. games) while
| CUDA shaders are not (until cudagraph introduction, and of
| course there are other issues).
| ttoinou wrote:
| There's this developer called nightmedia who converts a lot
| of models to apple MLX. I can run Qwen3 coder next at 60
| tps on my m4 max. It works
| ryandrake wrote:
| I wasn't able to have very satisfying success until I bit the
| bullet and threw a GPU at the problem. Found an actually
| reasonably priced A4000 Ada generation 20GB GPU on eBay and
| never looked back. I still can't run the insanely large
| models, but 20GB should hold me over for a while, and I
| didn't have to upgrade my 10 year old Ivy Bridge vintage
| homelab.
| HanClinto wrote:
| Maybe check out Docker Model Runner -- it's built on llama.cpp
| (in a good way -- not like Ollama) and handles I think most of
| what you're looking for?
|
| https://www.docker.com/blog/run-llms-locally/
|
| As far as how to find good models to run locally, I found this
| site recently, and I liked the data it provides:
|
| https://localclaw.io/
| ontouchstart wrote:
| This is the easiest set up on a Mac. You need at least 16gb on
| a MacBook:
|
| https://github.com/ggml-org/llama.cpp/discussions/15396
| Hamuko wrote:
| I tried to run some models on my M1 Max (32 GB) Mac Studio and
| it was a pretty miserable experience. Slow performance and
| awful results.
| 0xbadcafebee wrote:
| 8GB is not enough to do complex reasoning, but you could do
| very small simple things. Models like Whisper, SmolVLM,
| Quen2.5-0.5B, Phi-3-mini, Granite-4.0-micro, Mistral-7B,
| Gemma3, Llama-3.2 all work on very little memory. Tiny models
| can do a lot if you tune/train them. They also need to be used
| differently: system prompt preloaded with information, few-shot
| examples, reasoning guidance, single-task purpose, strict
| output guidelines. See https://github.com/acon96/home-llm for
| an example. For each small model, check if Unsloth has a tuned
| version of it; it reduces your memory footprint and makes
| inference faster.
|
| For your Mac, you can use Ollama, or MLX (Mac ARM specific,
| requires different engine and different model disk format, but
| is faster). Ramalama may help fix bugs or ease the process
| w/MLX. Use either Docker Desktop or Colima for the VM + Docker.
|
| For today's coding & reasoning models, you need a minimum of
| 32GB VRAM combined (graphics + system), the more in GPU the
| better. Copying memory between CPU and GPU is too slow so the
| model needs to "live" in GPU space. If it can't fit all in GPU
| space, your CPU has to work hard, and you get a space heater.
| That Mac M1 will do 5-10 tokens/s with 8GB (and CPU on full
| blast), or 50 token/s with 32GB RAM (CPU idling). And now you
| know why there's a RAM shortage.
| BoredomIsFun wrote:
| > Mistral-7B
|
| Is hopelessly dated. There are much better newer models
| around.
| yjftsjthsd-h wrote:
| With only 8 GB of memory, you're going to be running a really
| small quant, and it's going to be slow and lower quality. But
| yes, it should be doable. In the worst case, find a tiny gguf
| and run it on CPU with llamafile.
| option wrote:
| Isn't HF banned in China? Also, how are many Chinese labs on
| Twitter all the time?
|
| In either case - huge thanks to them for keeping AI open!
| woadwarrior01 wrote:
| HF is indeed banned in China. The Chinese equivalent of HF is
| ModelScope[1].
|
| [1]: https://modelscope.cn/
| disiplus wrote:
| I think in the West we think everything is blocked. But for
| example, if you book an eSIM, when you visit you already get
| direct access to Western services because they route it to some
| other server. Hong Kong is totally different: they basically
| use WhatsApp and Google Maps, and everything worked when I was
| there.
| embedding-shape wrote:
| But also yes, parent is right, HF is more or less
| inaccessible, and Modelscope frequently cited as the mirror
| to use (although many Chinese labs seems to treat HF as the
| mirror, and Modelscope as the "real" origin).
| dragonwriter wrote:
| > Isn't HF banned in China?
|
| I think, for some definition of "banned", that's the case. It
| doesn't stop the Chinese labs from having organization accounts
| on HF and distributing models there. ModelScope is apparently
| the HF-equivalent for reaching Chinese users.
| segmondy wrote:
| Great news! I have always worried about ggml and long term
| prospect for them and wished for them to be rewarded for their
| effort.
| stephantul wrote:
| Georgi is such a legend. Glad to see this happening
| jgrahamc wrote:
| This is great news. I've been sponsoring ggml/llama.cpp/Georgi
| since 2023 via Github. Glad to see this outcome. I hope you don't
| mind Georgi but I'm going to cancel my sponsorship now you and
| the code have found a home!
| superkuh wrote:
| I'm glad the llama.cpp and the ggml backing are getting
| consistent reliable economic support. I'm glad that ggerganov is
| getting rewarded for making such excellent tools.
|
| I am somewhat anxious about "integration with the Hugging Face
| transformers library" and possible python ecosystem entanglements
| that might cause. I know llama.cpp and ggml already have plenty
| of python tooling but it's not strictly required unless you're
| quantizing models yourself or other such things.
| periodjet wrote:
| Prediction: Amazon will end up buying HuggingFace. Screenshot
| this.
| ukblewis wrote:
| Honestly I'm shocked to be the only one I see of this opinion:
| HuggingFace's `accelerate`, `transformers` and `datasets` have
| been some of the worst open source Python libraries I have ever
| used that I had to use. They break backwards compatibility
| constantly, even on APIs which are not underscore/dunder named
| even on minor version releases without even documenting this,
| they refuse PRs fixing their lack of `overloads` type annotations
| which breaks type checking on their libraries and they just
| generally seem to have spaghetti code. I am not excited that
| another team is joining them and consolidating more engineering
| might in the hands of these people
| ukblewis wrote:
| And I said all of that despite us continuing to use their
| platform and libraries extensively... We just don't have a
| choice due to their dominance of open source ML
| ukblewis wrote:
| And clearly I say all of this in my name and not my employers
| name
| 0xbadcafebee wrote:
| > The community will continue to operate fully autonomously and
| make technical and architectural decisions as usual. Hugging Face
| is providing the project with long-term sustainable resources,
| improving the chances of the project to grow and thrive. The
| project will continue to be 100% open-source and community driven
| as it is now.
|
| I want this to be true, but business interests win out in the
| end. Llama.cpp is now the de-facto standard for local inference;
| more and more projects depend on it. If a company controls it,
| that means that company controls the local LLM ecosystem. And
| yeah, Hugging Face seems nice now... so did Google originally. If
| we all don't want to be locked in, we either need a llama.cpp
| competitor (with a universal abstration), or it should be
| controlled by an independent nonprofit.
| zozbot234 wrote:
| Llama.cpp is an open source project that anyone can fork as
| needed, so any "control" over it really only extends to
| facilitating development of certain features.
| 0xbadcafebee wrote:
| In practice, nobody does this, because you then have to keep
| the fork up to date with upstream plus your changes, and this
| is an endless amount of work.
| sheepscreek wrote:
| Curious about the financials behind this deal. Did they close
| above what they raised? What's in it for HuggingFace?
| simonw wrote:
| It's hard to overstate the impact Georgi Gerganov and llama.cpp
| have had on the local model space. He pretty much kicked off the
| revolution in March 2023, making LLaMA work on consumer laptops.
|
| Here's that README from March 10th 2023 https://github.com/ggml-
| org/llama.cpp/blob/775328064e69db1eb...
|
| > The main goal is to run the model using 4-bit quantization on a
| MacBook. [...] This was hacked in an evening - I have no idea if
| it works correctly.
|
| Hugging Face have been a great open source steward of
| Transformers, I'm optimistic the same will be true for GGML.
|
| I wrote a bit about this here:
| https://simonwillison.net/2026/Feb/20/ggmlai-joins-hugging-f...
| ushakov wrote:
| i am curious, why are your comments always pinned to the top?
| carbocation wrote:
| Because many of us think simonw has discerning taste on this
| topic and like to read what he has to say about it, so we
| upvote his comments.
| ushakov wrote:
| i don't doubt this. i just find it questionable that one
| particular poster always gets in the spotlight when AI is
| the topic - while other conversations in my opinion offer
| more interesting angles.
| jonas21 wrote:
| Upvote the conversations that you find to be more
| interesting. If enough people do the same, they too will
| make it to the top.
| coldtea wrote:
| Parent implies there might be some "boosting" involved,
| in which case, "upvote the conversations that you find to
| be more interesting" wont change anything...
|
| Not saying this is the case, but it's what the comment
| implies, so "just upvote your faves" doesn't really
| address it.
| colesantiago wrote:
| Agreed,
|
| I would like to see others, being promoted to the top
| rather than Simon's constant shilling for backlinks to
| his blog every time an AI topic is on the front page.
| simonw wrote:
| At a guess that's because my comment attracted more up-votes
| than the other top-level comments in the thread.
|
| I generally try to include something in a comment that's not
| information already under discussion - in this case that was
| the link and quote from the original README.
| ushakov wrote:
| of course your comment attracts more upvotes - it's at the
| top.
| ontouchstart wrote:
| Attention feeds attention.
|
| Attention is ALL You Need.
| seanhunter wrote:
| It's at the top because of upvotes. They don't have an
| "if simonw: boost" branch in the code.
| ushakov wrote:
| the code is not public, so we can't know. i think it's
| much more nuanced and certain users' comments might get a
| preferential treatment, based on factors other than the
| upvote count - which itself is hidden from us.
| ComplexSystems wrote:
| > the code is not public, so we can't know.
|
| I feel like you're making this statement in bad faith,
| rather than honestly believing the developers of the
| forum software here have built in a clause to pin
| simonw's comments to the top.
| satvikpendem wrote:
| > _certain users ' comments might get a preferential
| treatment_
|
| This does not happen. It hasn't even happened when pg
| made the forum in the first place.
| dcrazy wrote:
| I thought dang explicitly said it does happen? It
| certainly happens for stories.
| llm_nerd wrote:
| HN goes through phases. I remember when patio11 was the star
| of the hour on here. At another time it was that security guy
| (can't remember his name).
|
| And for those who think it's just organic with all of the
| upvotes, HN absolutely does have a +/- comment bias for
| users, and it does automatically feature certain people and
| suppress others.
| imiric wrote:
| > And for those who think it's just organic with all of the
| upvotes, HN absolutely does have a bias for authors, and it
| does automatically feature certain people and suppress
| others.
|
| Exactly.
|
| There are configurable settings for each account, which
| might be automatically or manually set--I'm not sure-, that
| control the initial position of a comment in threads, and
| how long it stays there. There might be a reward system,
| where comments from high-karma accounts are prioritized
| over others, and accounts with "strikes", e.g. direct
| warnings from moderators, are penalized.
|
| The difference in upvotes that account ultimately receives,
| and thus the impact on the discussion, is quite stark. The
| more visible a comment is, i.e. the more at the top it is,
| the more upvotes it can collect, which in turn makes it
| stay at the top, and so on.
|
| It's safe to assume that certain accounts, such as those of
| YC staff, mods, or alumni, or tech celebrities like simonw,
| are given the highest priority.
|
| I've noticed this on my own account. Before being warned
| for an IMO bullshit reason, my comments started to appear
| near the middle, and quickly float down to the bottom,
| whereas before they would usually be at the top for a few
| minutes. The quality of what I say hasn't changed, though
| the account's standing, and certainly the community itself,
| has.
|
| I don't mind, nor particularly care about an arbitrary
| number. This is a proprietary platform run by a VC firm. It
| would be silly to expect that they've cracked the code of
| online discourse, or that their goal is to keep it
| balanced. The discussions here are better on average than
| elsewhere because of the community, although that also has
| been declining over the years.
|
| I still find it jarring that most people would vote on a
| comment depending on if they agree with it or not, instead
| of engaging with it intellectually, which often pushes
| interesting comments to the bottom. This is an unsolved
| problem here, as much as it is on other platforms.
| Eisenstein wrote:
| There is a saying that if everyone you encounter seems to
| be unreasonable, maybe it isn't the other people that are
| being unreasonable.
|
| This isn't to say that social media is fair, or that
| people vote properly or that any ranking system based on
| agreement by readers is a good one. However, generally
| when you are getting negativity communicated to you and
| you are seeing consistently poor results around actions
| you take, it is going to be useful to examine the
| possibility that there is a difference in how you
| perceive what you are doing vs how others do. In that
| case spending time trying to figure out ways in which you
| are being wronged so that you can continue in the same
| manner is going to be time wasted.
| imiric wrote:
| How are you getting persecution complex from what I said?
| If anything, your comment might be feeding that delusion.
| :)
|
| My point is that HN definitely has certain weights
| associated with accounts, which control the karma,
| visibility, and ultimately discussion of certain topics.
|
| This problem doesn't affect only negativity or downvotes,
| but upvotes as well. The most upvoted comments are not
| necessarily of the highest quality, or contribute the
| most to the discussion. They just happen to be the most
| visible, and to generally align with the feeling of the
| hive mind.
|
| I know this because some of my own comments have been at
| the top, without being anything special, while others I
| think are, barely get any attention. I certainly examine
| my thinking whenever it strongly aligns with the hive
| mind, as this community does not particularly align with
| my values.
|
| I also tend to seek out comments near the bottom of
| threads, and have dead comments enabled, precisely to
| counteract this flawed system. I often find quality
| opinions there, so I suggest everyone do the same as
| well.
|
| An essential feature of a healthy and interesting
| discussion forum is to accomodate different viewpoints.
| That starts by not burying those that disagree with the
| majority, or boosting those that agree. AFAIK no online
| system has gotten this right yet.
| rymc wrote:
| the security you mean is probably tptacek
| (https://news.ycombinator.com/user?id=tptacek)
| throwaway2027 wrote:
| Time flies and simonw his AI feedback isn't always received
| favorably, sometimes he pushes it too much.
| francispauli wrote:
| thanks for reminding me i need to follow his blog weekly
| again
| satvikpendem wrote:
| They aren't pinned, people just vote on them, and more so
| because simonw is a recognizable name with lots of posts and
| comments.
| magicalhippo wrote:
| New comments get a boost, and as such are frequently near the
| top just due to that. Frequent upvotes also boosts. There
| might be other factors.
|
| However these things are dynamic and change over time. As I
| read the discussion just now, the GP comment was the ~5th
| top-level comment.
| mattfrommars wrote:
| I don't know if this warrants a separate thread here but I have
| to ask...
|
| How can I realistically get involved the AI development space? I
| feel left out with what's going on and living in a bubble where
| AI is forced into by my employer to make use of it (GitHub
| Copilot), what is a realistic road map to kinda slowly get into
| AI development, whatever that means
|
| My background is full stack development in Java and React, albeit
| development is slow.
|
| I've only messed with AI on very application side, created a
| local chat bot for demo purposes to understand what RAG is about
| to running models locally. But all of this is very superficial
| and I feel I'm not in the deep with what AI is about. I get I'm
| too 'late' to be on the side of building the next frontier model
| and makes no sense, what else can I do?
|
| I know Python, next step is maybe do 'LLM from scratch"? Or I
| pick up Google machine learning crash course certificate? Or do
| recently released Nvidia Certification?
|
| I'm open for suggestions
| fc417fc802 wrote:
| I'm not entirely clear what your goals are but roughly, just
| figure out an application that holds your interest and build a
| model for it from scratch. Probably don't start with an LLM
| though. Same as for anything else really. If you're interest in
| computer graphics then decide on a small scale project and go
| build it from scratch. Etc.
| breisa wrote:
| Maybe look into model finetuning/distilation. Unsloth [1] has
| great guides and provides everything you need to get started on
| Google Colab for free. [1] https://unsloth.ai/
| w10-1 wrote:
| The competition for root and branch AI models and
| infrastructure is intense and skilled.
|
| But if you're adjacent to some leaf use-case for AI, you're
| likely already as good as anyone else at productizing it.
|
| And that's who is getting hired: people who show they can
| deliver product-market fit.
| swyx wrote:
| go thru workshops here https://www.youtube.com/@aiDotEngineer/
| fancy_pantser wrote:
| Was Georgi ever approached by Meta? I wonder what they offered
| (I'm glad they didn't succeed, just morbid curiosity).
| karmasimida wrote:
| Does local AI have a future? The models are getting ridiculously
| big and any storage hardware is hoarded by few companies for next
| 2 years and nvidia has stopped making consumer GPU for this year.
|
| It seems to me there is no chance local ML is going to be
| anywhere out of the toy status comparing to closed source ones in
| short term
| rhdunn wrote:
| Mistral have small variants (3B, 8B, 14B, etc.), as do others
| like IBM Granite and Qwen. Then there are finetunes based on
| these models, depending on your workflow/requirements.
| karmasimida wrote:
| True, but anything remotely useful is 300B and above
| Eupolemos wrote:
| That is a very broad and silly position to take, especially
| in this thread.
|
| I use Devstral 2 and Gemini 3 daily.
| dust42 wrote:
| I am actually doing now a good part of dev with Qwen3-Coder-
| Next on an M1 64GB with Qwen Code CLI (a fork of Gemini CLI). I
| very much like a) to have an idea how much
| tokens I use and b) be independent of VC financed token
| machines and c) I can use it on a plane/train
|
| Also I never have to wait in a queue, nor will I be told to
| wait for a few hours. And I get many answers in a second.
|
| I don't do full vibe coding with a dozen agents though. I read
| all the code it produces and guide it where necessary.
|
| Last not least, at some point the VC funded party will be over
| and when this happens one better knows how to be highly
| efficient in AI token use.
| ttoinou wrote:
| How much tokens per seconds are you getting ?
|
| Whats the advantage of qwen code cli over opencode ?
| dust42 wrote:
| 320 tok/s PP and 42 tok/s TG with 4bit quant and MLX.
| Llama.cpp was half for this model but afaik has improved a
| few days ago, I haven't yet tested though.
|
| I have tried many tools locally and was never really happy
| with any. I tried finally Qwen Code CLI assuming that it
| would run well with a Qwen model and it does. YMMV, I
| mostly do javascript and Python. Most important setting was
| to set the max context size, it then auto compacts before
| reaching it. I run with 65536 but may raise this a bit.
|
| Last not least OpenCode is VC funded, at some point they
| will have to make money while Gemini CLI / Qwen CLI are not
| the primary products of the companies but definitely dog-
| fooded.
| kristianp wrote:
| > Towards seamless "single-click" integration with the
| transformers library
|
| That's interesting. I thought they would be somewhat redundant.
| They do similar things after all, except training.
| lukebechtel wrote:
| Thank you Georgi <3
| forty wrote:
| Looks like someone tried to type "Gmail" while drunk...
| rkomorn wrote:
| Looks like Gargamel of Smurfs fame to me.
| cyanydeez wrote:
| Is there a local webui that integrates with Hugging face?
|
| Ollama and webui seem to rapidly lose their charm. Ollama now
| includes cloud apis which makes no sense as a local.
| moralestapia wrote:
| I hope Georgi gets a big fat check out of this, he deserves it
| 100%.
| snowhale wrote:
| good to see them get proper backing. llama.cpp is basically
| infrastructure at this point and relying on volunteer maintainers
| for something this critical was starting to feel sketchy.
| sbinnee wrote:
| I am happy for ggml team. They did so much work for quantization
| and actually made it available to everyone. Thank you.
| ontouchstart wrote:
| I have played with both mlx-lm and llama.cpp after I bought a
| 24GB M5 MacBook Pro last year.
|
| Then I fell down the rabbit holes of uv, rust and C++ and forgot
| about LLMs. Today after I saw this announcement and answered
| someone's question about how to set it up, when I got home, I
| decided play with llama.cpp again.
|
| I was surprised and impressed:
|
| https://ontouchstart.github.io/rabbit-holes/llama.cpp/
|
| I am not going to use mlx-lm or lmstudio anymore. llama.cpp is so
| much fun.
| car wrote:
| So great to see my two favorite Open Source AI projects/companies
| joining forces.
|
| Since I don't see it mentioned here, _LlamaBarn_ is an awesome
| little--but mighty--MacOS menubar program, making access to
| llama.cpp 's great web UI and downloading of tastefully curated
| models easy as pie. It automatically determines the available
| model- and context-sizes based on available RAM.
|
| https://github.com/ggml-org/LlamaBarn
|
| Downloaded models live in: ~/.llamabarn
|
| Apart from running on localhost, the server address and port can
| be set via CLI: # bind to all interfaces
| (0.0.0.0) defaults write app.llamabarn.LlamaBarn
| exposeToNetwork -bool YES # or bind to a specific IP
| (e.g., for Tailscale) defaults write
| app.llamabarn.LlamaBarn exposeToNetwork -string "100.x.x.x"
| # disable (default) defaults delete app.llamabarn.LlamaBarn
| exposeToNetwork
| noisy_boy wrote:
| Github is showing me unicorn - is there an Linux equivalent? I
| have a old Thinkpad with a puny Nvidia GPU, can I hope to find
| anything useful to run on that?
| car wrote:
| Building Llama.cpp from source with CUDA enabled should get
| you pretty far. llama-server has a really good web UI, the
| latest version supports model switching.
|
| As for models, plenty of GGUF quantized (down to 2-bit)
| available on HF and modelscope.
| am17an wrote:
| One often overlooked after that is ggml, the tensor library that
| runs llama.cpp is not based on pytorch, rather just plain cpp. In
| a world where pytorch dominates, it shows that alternatives are
| possible and are worthy to be pursued.
| mhher wrote:
| It's great to see the ggml team getting proper backing. Keeping
| inference in bare-metal C/C++ without the Python bloat is the
| only way local AI is going to scale efficiently. Well deserved
| for Georgi, Johannes, Piotr, and the rest of the team.
| jpcompartir wrote:
| This is great, brings clear benefits to both sides and the rest
| of us.
|
| Always rooting for Hugging Face
___________________________________________________________________
(page generated 2026-02-21 23:02 UTC)