[HN Gopher] Exo: Run your own AI cluster at home with everyday d...
___________________________________________________________________
Exo: Run your own AI cluster at home with everyday devices
Author : simonpure
Score : 329 points
Date : 2024-07-16 02:55 UTC (20 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| hagope wrote:
| I used to be excited about running models locally (LLM, Stable
| Diffusion, etc.) on my Mac, PC, etc. But now I have resigned
| myself to the fact that most useful AI compute will be in the
| cloud. Sure, I can run some slow Llama3 models on my home
| network, but why bother when it is so cheap or free to run it on
| a cloud service? I know Apple is pushing local AI models;
| however, I have serious reservations about the impact on battery
| performance.
| Cantinflas wrote:
| Why bother running models locally? Privacy, for one, or
| censorship resistance.
| seasonman wrote:
| Also customizability. Sure, you can fine-tune the cloud
| hosted models (to a certain degree of freedom), but it will
| probably be expensive, inefficient, difficult and
| unmaintainable.
| hanniabu wrote:
| And offline access
| bongodongobob wrote:
| I have a 2 year old Thinkpad and I wouldn't necessarily call
| llama3 slow on it. It's not as fast as ChatGPT but certainly
| serviceable. This should only help.
|
| Not sure why you're throwing your hands up, because this is a
| step towards solving your problem.
| nhod wrote:
| Is this a hunch, or do you know of some data to back up your
| reservations?
|
| Copilot+ PCs, which all run models locally, have the best
| battery life of any portable PC devices, ever.
|
| These devices have in turn taken a page out of Apple Silicon's
| playbook. Apple has the benefit of deep hardware and software
| integration that no one else has, and is obsessive about
| battery life.
|
| It is reasonable to think that battery life will not be
| impacted much.
| fragmede wrote:
| That doesn't seem totally reasonable. The battery life of an
| iPhone is pretty great if you're not actually using it, but if
| you're using the device hard, it gets hot to the touch and the
| battery drains. Playing resource-intensive video games maxes
| out the *PU, never lets the device sleep, and takes a
| noticeable hit on battery life. Since inference takes a lot of
| compute to perform, it's hard to imagine it being totally free,
| battery-wise. It probably won't be as hard on the device as
| playing those video games non-stop, but I get into phone
| conversations with ChatGPT as it is, so I can imagine that
| being a concern if you're already low on battery.
| wokwokwok wrote:
| > Sure, I can run some slow Llama3 models on my home network,
| but why bother when it is so cheap or free to run it on a cloud
| service?
|
| Obvious answer: because it's not free, and it's not cheap.
|
| If you're playing with a UI library, let's say Qt... would you:
|
| a) install the community version and play with it ($0)
|
| b) buy a professional license to play with (3460 EUR/Year)
|
| Which one do you pick?
|
| Well, the same goes here. It turns out, renting a server large
| enough to run big (useful, > 8B) models is actually quite
| expensive. The per-API-call costs of real models (like GPT4)
| add up very quickly once you're doing non-trivial work.
|
| If you're just messing around with the tech, why would you pay
| $$$$ just to piss around with it and see what you can do?
|
| Why would you _not_ use a free version running on your old PC
| / mac / whatever you have lying around?
|
| > I used to be excited about running models locally
|
| That's an easy position to be in once you've _already done it_
| and figured out, yes, I really want the pro plan to build my
| $StartUP App.
|
| If you prefer to pay for an online service and you can afford
| it, absolutely go for it; but isn't this an enabler for a lot
| of people to play and explore the tech for $0?
|
| Isn't having more people who understand this stuff and can make
| meaningful (non-hype) decisions about when and where to use it
| good?
|
| Isn't it nice that if Meta releases some 400B Llama 4 model,
| most people can play with it, not just the ones with the $7000
| Mac Studio? ...and keep building the open source ecosystem?
|
| Isn't that great?
|
| I think it's great.
|
| Even if you don't want to play, I do.
| FeepingCreature wrote:
| I just prepay $20/mo to openrouter.ai and can instantly play
| with every model, no further signup required.
| itake wrote:
| I'm a bit confused. Your reasoning doesn't align with the
| data you shared.
|
| The startup costs for just messing around at home are huge:
| purchasing a server and GPUs, paying for electricity, time
| spent configuring the API.
|
| If you want to just mess around, $100 to call the world's
| best API is much cheaper than spending $2-7k on a Mac Studio.
|
| Even at production-level traffic, it would take years of
| savings on uptime, devops, utilities, etc. to recoup the
| upfront and ongoing costs of self-hosting.
|
| Self-hosting will have higher latency and lower throughput.
| sudohackthenews wrote:
| People have gotten manageable results on all sorts of
| hardware. People have even squeezed a few tokens/second out
| of Raspberry Pis. The small models are pretty performant:
| they get good results on consumer gaming hardware. My 2021
| laptop with a 3070m (only 8 GB VRAM) runs 8B models faster
| than I can read, and even the original M1 chips can run the
| models fine.
| monkmartinez wrote:
| You are right of course... IF your metric for
| manageable/usable is measured only in tokens per second
| (tok/s).
|
| If your metric is quality of output, time, money, and
| tok/s, there is no comparison; local models just aren't
| there yet.
| wokwokwok wrote:
| > The startup costs for just messing around at home are
| huge
|
| No, they are zero.
|
| Most people have extra hardware lying around at home that
| they're not using. It costs nothing but time to install
| Python.
|
| $100 is not free.
|
| If you can't be bothered, sure thing, slap down that credit
| card and spend your $100.
|
| ...but, maybe not so for some people?
|
| Consider students with no credit card, etc; there are a lot
| of people with a lot of free time and not a lot of money.
| Even if _you_ don't want to use it, do you seriously
| think this project is totally valueless for everyone?
|
| Maybe, it's not for you. Not everything has to be for
| everyone.
|
| You are, maybe, just not the target audience here?
| lynx23 wrote:
| And it's not entitled to claim that "Most people have
| extra hardware lying around at home"? Your story doesn't
| sound plausible at all.
| wokwokwok wrote:
| This project is _literally_ aiming to run on devices like
| old phones.
|
| I don't think having an old phone is particularly
| entitled.
|
| I think casually slapping down $100 on whim to play with
| an API... probably, yeah.
|
| /shrug
| itake wrote:
| According to this tweet, Llama 3 costs about $0.20 per
| Million tokens using an M2.
|
| https://x.com/awnihannun/status/1786069640948719956
|
| In comparison, GPT3.5-turbo costs $0.50 per million
| tokens.
|
| Do you think an old iPhone will be less than 2x as efficient?
| nightski wrote:
| FWIW it depends on the cost of power. Where I live, the cost
| of power is less than half the stated average.
| bryanrasmussen wrote:
| Most people who would want to be running machine learning
| models probably have some hardware at home that can
| handle a slow task for playing around and determining if
| it is worthwhile to pay out for something more
| performant.
|
| This is undoubtedly entitled, but thinking to yourself,
| "huh, I think it's time to try out some of this machine
| learning stuff," is a pretty inherently entitled thing to
| do.
| Aurornis wrote:
| > You are, maybe, just not the target audience here?
|
| The difference between an open model running on a $100
| computer and the output from GPT4 or Claude Sonnet is
| huge.
|
| I use local and cloud models. The difference in
| productivity and accuracy between what I can run locally
| and what I can get for under $100 of API calls per month
| is huge once you get past basic playing around with chat.
| It's not even close right now.
|
| So I think actually you are not the target audience for
| what the parent comments are talking about. If you don't
| need cutting edge performance then it's fun to play with
| local, open, small models. If the goal is to actually use
| LLMs for productivity in one way or another, spending
| money on the cloud providers is a far better investment.
|
| Exceptions of course for anything that is privacy-
| sensitive, but you're still sacrificing quality by using
| local models. It's not really up for debate that the
| large hosted models are better than what you'd get from
| running a 7B open model locally.
| zeta0134 wrote:
| You are vastly overestimating the startup cost. For me this
| week it was literally these commands:
|
| pacman -S ollama
|
| ollama serve
|
| ollama run llama3
|
| My basic laptop with about 16 GB of RAM can run the model
| just fine. It's not fast, but it's reasonably usable for
| messing around with the tech. That's the "startup" cost.
| Everything else is a matter of pushing scale and
| performance, and yes that can be expensive, but a novice
| who doesn't know what they need yet doesn't have to spend
| tons of money to find out. Almost any PC with a reasonable
| amount of RAM gets the job done.
| Aurornis wrote:
| I'm familiar with local models. They're fine for chatting
| on unimportant things.
|
| They do not compare to the giant models like Claude
| Sonnet and GPT4 when it comes to trying to use them for
| complex things.
|
| I continue to use both local models and the commercial
| cloud offerings, but I think anyone who suggests that the
| small local models are on par with the big closed hosted
| models right now is engaging in wishful thinking.
| monkmartinez wrote:
| LLaMA 3 at 8 billion params is weak sauce for anything
| serious; it just isn't in the same galaxy as Sonnet 3.5
| or GPT-4o. The smaller and faster models like Phi are
| even worse. Once you progress past asking trivial
| questions to a point where you need to trust the output a
| bit more, it's not worth the time, money, and/or sweat to
| run a local model to do it.
|
| A novice isn't going to know what they need because they
| don't know what they don't know. Try asking a question to
| LLaMA 3 at 8 billion and the same question to LLaMA 3 at
| 70 billion. There is a night and day difference. Sonnet,
| Opus and GPT-4o run circles around LLaMA 3 70b. To run
| LLaMA at 70 billion you need serious horse power as well,
| likely thousands of dollars in hardware investment. I say
| it again... the calculus in time, money, and effort isn't
| favorable to running open models on your own hardware
| once you pass the novice stage.
|
| I am not ungrateful that the LLaMAs are available, for
| many different reasons, but there is no comparison
| in quality of output, time, money, and effort. The
| APIs are a bargain when you really break down what it
| takes to run a serious model.
| jononor wrote:
| Using an LLM as a general-purpose knowledge base is only
| one particular application of an LLM, and one which is
| probably best served by ChatGPT etc.
|
| A lot of other things are possible with LLMs using the
| context window and completion, thanks to their "zero
| shot" learning capabilities, which is also what RAG
| builds upon.
| LorenDB wrote:
| And why would you buy a Mac Studio? You could build a
| reasonable GPU-accelerated Linux box for well under $1500.
| For example:
| https://pcpartpicker.com/guide/BCWG3C/excellent-amd-gamingst...
| J_Shelby_J wrote:
| Devs that refuse to move off Apple are severely
| disadvantaged in the LLM era.
| jondwillis wrote:
| lol tell that to the 3 year old laptop with 64 GB of RAM
| that I use exclusively for local LLMs while dev'ing on my
| work laptop with 96 GB of RAM...
| nl wrote:
| > Well, the same goes. It turns out, renting a server large
| enough to run big (useful, > 8B) models is actually quite
| expensive. The per-api-call costs of real models (like GPT4)
| adds up very quickly once you're doing non-trivial work.
|
| I run my own models, but the truth is most of the time I just
| use an API provider.
|
| TogetherAI and Groq both have free offers that are generous
| enough I haven't used them up in 6 months of experimentation
| and TogetherAI in particular has more models and gets new
| models up quicker than I can try them myself.
| jrm4 wrote:
| Right, I think people here are _vastly_ underestimating this
| idea of
|
| "What if I want to play around with really PERSONAL stuff."
|
| I've been keeping a digital journal about my whole life. I
| plan to throw that thing into an AI to see what happens, and
| you can be damn sure that it will be local.
| monkmartinez wrote:
| Yes, I am with you 100% and keep several LLaMAs on my
| workstation for that reason. I use Openrouter for
| everything else. Everything that isn't sensitive goes to
| one of the big kid models because they are just sooooo much
| better. LLaMA 400b might be the start of running with the
| big kids, but I know we are not close with the current
| available models.
| Aurornis wrote:
| > Why would you not use a free version running on your old PC
| / mac / whatever you have lying around?
|
| Because the old PC lying around can't come anywhere near the
| abilities or performance of the hosted AI compute providers.
| Orders of magnitude of difference.
|
| The parent commenter is correct: If you want cutting edge
| performance, there's no replacement for the hosted solutions
| right now.
|
| Running models locally is fun for playing around and
| experimenting, but there is no comparison between what you
| can run on an old PC lying around and what you can get from a
| hosted cluster of cutting edge hardware that offers cheap
| output priced per API call.
| dotancohen wrote:
| I have found many similarities between home AI and home
| astronomy. The equipment needed to get really good performance
| is far beyond that available to the home user; however,
| intellectually satisfying results can be had at home as a
| hobby. But certainly not professional results.
| grugagag wrote:
| When learning and experimenting it could make a difference.
| friendly_chap wrote:
| We are running smaller models with software we wrote (self plug
| alert: https://github.com/singulatron/singulatron) with great
| success. These models sometimes make obvious mistakes (such as
| the one in our repo image - haha), but they can also be
| surprisingly versatile in areas you don't expect them to be,
| like coding.
|
| Our demo site uses two NVIDIA GeForce RTX 3090s and our whole
| team is hammering it all day. The only problem is occasionally
| high GPU temperature.
|
| I don't think the picture is as bleak as you paint. I actually
| expect Moore's Law and better AI architectures to bring on a
| self-hosted AI revolution in the next few years.
| dsign wrote:
| For my advanced spell-checking use-case[^1], local LLMs are,
| sadly, not state-of-the-art. But their $0 price-point is
| excellent to analyze lots of sentences and catch the most
| obvious issues. With some clever hacking, the most difficult
| cases can be handled by GPT4o and Claude. I'm glad there is a
| wide variety of options.
|
| [^1] Hey! If you know of spell-checking-tuned LLM models, I'm
| all ears (eyes).
| bruce343434 wrote:
| I think the floating point encoding of LLMs is inherently
| lossy; add to that the way tokenization works. The LLMs I've
| worked with "ignore" bad spelling and correctly interpret
| misspelled words. I'm guessing that for spelling LLMs, you'd
| want tokenization at the character level, rather than a byte
| pair encoding.
|
| You could probably train any recent LLM to be better than a
| human at spelling correction though, where "better" might be
| a vague combination of faster, cheaper, and acceptable loss
| of accuracy. Or maybe slightly more accurate.
|
| (A lot of people hate on LLMs for not being perfect, I don't
| get it. LLMs are just a tool with their own set of trade
| offs, no need to get rabid either for or against them. Often,
| things just need to be "good enough". Maybe people on this
| forum have higher standards than average, and cannot deal
| with the frustration of that cognitive dissonance.)
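|
| As a toy illustration of that point (not any real tokenizer,
| just made-up token splits):
|
|   # toy sketch: character vs. merged (BPE-style) tokens
|   word = "recieve"  # misspelled on purpose
|
|   char_tokens = list(word)
|   # ['r','e','c','i','e','v','e'] -- the swap is visible to
|   # a model that sees individual characters
|
|   merged_tokens = ["recie", "ve"]  # a BPE vocab may merge
|   # chunks; the model sees opaque ids, so the spelling inside
|   # a chunk is memorized rather than observed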
| Hihowarewetoday wrote:
| I'm not sure why you have resigned?
|
| If you don't care about running it locally, just use it
| online. Everything is good.
|
| But you can run it locally already. Is it cheap? No. Are we
| still in the beginning? Yes. We are still in a phase where
| this is a pure luxury, and getting into it by buying a 4090
| is still relatively cheap in my opinion.
|
| Why run it locally, you ask? I personally think running
| AnythingLLM and similar frameworks on your own local data is
| interesting.
|
| But I'm pretty sure in a few years you will be able to buy
| cheaper ML chips for running models locally fast and cheap.
|
| Btw, at least I don't know of an online service which is
| uncensored, has a lot of LoRAs to choose from, and is
| cost-effective. For just playing around with LLMs, sure,
| there are plenty of services.
| PostOnce wrote:
| Maybe you want to conduct experiments that the cloud API
| doesn't allow for.
|
| Perhaps you'd like to plug it into a toolchain that runs faster
| than API calls can be passed over the network? -- eventually
| your edge hardware is going to be able to infer a lot faster
| than the 50ms+ per call to the cloud.
|
| Maybe you would like to prevent the monopolists from gaining
| sole control of what may be the most impactful technology of
| the century.
|
| Or perhaps you don't want to share your data with Microsoft &
| Other Evils (formerly known as "don't be evil").
|
| You might just like to work offline. Whole towns go offline,
| sometimes for days, just because of bad weather. Never mind
| war and infrastructure crises.
|
| Or possibly you don't like that The Cloud model has a fervent,
| unshakeable belief in the propaganda of its masters. Maybe that
| propaganda will change one day, and not in your favor. Maybe
| you'd like to avoid that.
|
| There are many more reasons in the possibility space than my
| limited imagination allows for.
| tarruda wrote:
| It is not like strong models are at a point where you can
| 100% trust their output. It is always necessary to review LLM
| generated text before using it.
|
| I'd rather have a weaker model which I can always rely on
| being available than a strong model which is hosted by a
| third party service that can be shut down at any time.
| Aurornis wrote:
| > I'd rather have a weaker model which I can always rely on
| being available than a strong model which is hosted by a
| third party service that can be shut down at any time.
|
| Every LLM project I've worked with has an abstraction layer
| for calling hosted LLMs. It's trivial to implement another
| adapter to call a different LLM. It's often done as a
| fallback/failover strategy.
|
| There are also services that will merge different providers
| into a unified API call if you don't want to handle the
| complexity on the client.
|
| It's really not a problem.
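|
| A minimal sketch of that kind of adapter layer, with made-up
| names rather than any particular library's API:
|
|   from typing import Protocol
|
|   class LLMProvider(Protocol):
|       def complete(self, prompt: str) -> str: ...
|
|   class ProviderError(Exception):
|       pass
|
|   class FailoverLLM:
|       """Try providers in order, falling back on failure."""
|       def __init__(self, providers):
|           self.providers = providers
|
|       def complete(self, prompt: str) -> str:
|           last_error = None
|           for provider in self.providers:
|               try:
|                   return provider.complete(prompt)
|               except ProviderError as err:
|                   last_error = err  # try the next provider
|           raise last_error or ProviderError("none configured")
|
| A local model then just becomes one more provider at the end
| of the list.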
| PostOnce wrote:
| Suppose you live outside of America and the supermajority
| of LLM companies are American. You want to ask a question
| about whisky distillation or abortion or anything else
| that's legal in your jurisdiction but not in the US, but
| the LLM won't answer.
|
| You've got a plethora of cloud providers, all of them
| aligned to a foreign country's laws and customs.
|
| If you can choose between Anthropic, OpenAI, Google, and
| some others... well, that's really not a choice at all.
| They're all in California. What good does that do an
| Austrian or an Australian?
| jumpCastle wrote:
| Don't services like runpod solve half of these concerns?
| gtirloni wrote:
| _> eventually your edge hardware is going to be able to infer
| a lot faster than the 50ms+ per call to the cloud._
|
| This is interesting. Is that based on any upcoming technology
| improvement already in the works?
| a_t48 wrote:
| GP is likely referring to network latency here. There's a
| tradeoff between smaller GPUs/etc at home that have no
| latency to use and beefier hardware in the cloud that has
| a minimum latency to use.
| yjftsjthsd-h wrote:
| Sure, but if the model takes multiple seconds to execute,
| then even 100 milliseconds of network latency seems more
| or less irrelevant
| datameta wrote:
| Comms is also the greatest battery drain for a remote edge
| system. Local inference can allow for longer operation, or
| operation with no network infra.
| sharpshadow wrote:
| Excellent points and being able to use available hardware in
| unison is amazing and I guess we are not far away from
| botnets utilising this kind of technology like they did with
| mining coins.
| jrm4 wrote:
| What do you mean by _useful_ here?
|
| I'm saying that because I've had the exact OPPOSITE thought.
| The intersection of Moore's Law and the likelihood that these
| things won't end up as some big unified singularity brain, but
| instead as little customized use cases, makes me think that
| running at home/office will perhaps be just as appealing.
| aftbit wrote:
| What if you want to create transcripts for 100s of hours of
| private recorded audio? I for one do not want to share that
| with the cloud providers and have it get used as training data
| or be subject to warrantless search under the third party
| doctrine. Or what if you want to run a spicy Stable Diffusion
| fine-tune that you'd rather not have associated with your name
| in case the anti-porn fascists take over? I feel like there are
| dozens of situations where the cost is really not the main
| reason to prefer a local solution.
| dws wrote:
| > Sure, I can run some slow Llama3 models on my home network,
| but why bother when it is so cheap or free to run it on a cloud
| service?
|
| Running locally, you can change the system prompt. I have Gemma
| set up on a spare NUC, and changed the system prompt from
| "helpful" to "snarky" and "kind, honest" to "brutally honest".
| Having an LLM that will roll its eyes at you and say "whatever"
| is refreshing.
| diego_sandoval wrote:
| > why bother when it is so cheap or free to run it on a cloud
| service?
|
| For the same reasons that we bother to use Open Source software
| instead of proprietary software.
| cess11 wrote:
| I don't want people I don't know snooping around in my
| experiments.
| iJohnDoe wrote:
| Anyone run this? Works?
| tdubhro1 wrote:
| The readme shows how to run it assuming you can run a Python
| program on the device, so I expect it works with laptops and
| PCs, but there's a note at the end of the page saying that
| the iOS app has fallen behind the Python version, so it's not
| clear to me how to get this running on your iPhone or other
| such devices.
| orsorna wrote:
| The "device" in question must be Apple Silicon because the
| `mlx` package is a hard dependency, or at least an ARM
| machine (I do not have any Apple Silicon Macbooks or ARM
| machines to run this). I tried tweaking this before realizing
| that calls to this library are littered all over the repo. I
| don't really understand the AI ecosystem very well, but it
| seems that the use of the `mlx` library should be supplanted
| by some other library depending on the platform. Until then,
| and until the actual release of the iOS code somewhere,
| "everyday devices" is limited to premium devices that almost
| no one has more than one of. I'm looking forward to running
| this on other machine platforms and squeezing out what I can
| from old hardware lying around. Otherwise I doubt the tagline
| of the project.
|
| Edit: to add on, the only evidence that this runs anywhere
| but Apple Silicon is the maintainer's Twitter where they show
| it running on two Macbook Pros as well as other devices. I'm
| not sure how many of those devices are not ARM.
|
| I'm not throwing shade at the concept the author is
| presenting, but I'd appreciate it if they could slow down on
| functional commits (he is writing them right now as I type)
| and truthfully modify the documentation to state which
| targets are actually able to run this.
| acosmism wrote:
| why ask? try it!
| mg wrote:
| This enables you to run larger models than you would be
| able to on any single device.
|
| No further explanation on how this is supposed to work?
|
| If some layers of the neural network are on deviceA and some
| layers are on deviceB, wouldn't that mean that for every token
| generated, all output data from the last layer on deviceA have to
| be transferred to deviceB?
| steeve wrote:
| Yes, that's how it works (pipeline parallelism)
| mg wrote:
| Interesting. Let's do the math ...
|
| Let's say the model has 50B parameters and 50 layers. That
| would mean about one billion values have to travel through
| the wifi for every generated token?
|
| I wonder how much data that is in bytes and how long it takes
| to transfer them.
| blackbear_ wrote:
| It's not the parameters that are sent, it's the layer
| outputs. That makes for a few thousand floats per token.
| mg wrote:
| Woops! I would have thought the number of neurons roughly
| equals the number of parameters, but you are right. The
| number of parameters is much higher.
| tama_sala wrote:
| The embedding size is only 8k while the parameters are 70B,
| so it's a huge difference.
| mikewarot wrote:
| Yes, so you would have a vector about 8k values long to be
| transferred on each token generated.
|
| You could do that easily with any modern network.
| mg wrote:
| That's exciting. So we could build a SETI@home style network
| of even the largest models.
|
| I wonder if training could be done in this way too.
| alexandercheema wrote:
| Repo author here. That's correct. The embeddings for
| Llama-3-8B are around 8KB-10KB. For Llama-3-70B they're
| around 32KB. These are small enough to send around between
| devices on a local network. For a SETI@home style network,
| latency will kill you if you go over the internet. That's
| why we're starting with local networks.
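|
| Back-of-the-envelope for where numbers in that ballpark come
| from (the hidden sizes are Llama 3's published ones; the bytes
| per element is an assumption, which is why the 70B figure can
| land anywhere from ~16KB to ~32KB depending on dtype and
| overhead):
|
|   def activation_bytes(hidden_size, bytes_per_element):
|       # bytes sent per token for one pipeline hop
|       return hidden_size * bytes_per_element
|
|   print(activation_bytes(4096, 2))  # 8B, fp16: ~8 KB
|   print(activation_bytes(8192, 2))  # 70B, fp16: ~16 KB
|   print(activation_bytes(8192, 4))  # 70B, fp32: ~32 KB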
| mg wrote:
| Ah yes. At first, I thought that since it is all one-way
| forward-only communication, latency would only affect the
| time to the first token.
|
| But I guess the final output needs to be sent back to the
| first node before it can continue. So if there are 50
| nodes with a latency of 40ms each, each token would take
| 2s to process.
| alexandercheema wrote:
| Yeah, unfortunately the autoregressive nature of these
| models slows it down significantly with added
| device<->device latency. However, you can still max out
| on throughput with pipeline parallelism, where you
| overlap execution. See:
| https://pytorch.org/docs/stable/pipeline.html
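|
| A rough way to see the latency vs. throughput split (all
| numbers below are illustrative assumptions, not measurements):
|
|   stages = 50            # pipeline stages (devices)
|   hop_latency = 0.040    # 40 ms network latency per hop
|   stage_compute = 0.010  # assumed compute time per stage
|
|   # a single token still pays every hop in sequence
|   token_latency = stages * (stage_compute + hop_latency)
|
|   # with several requests in flight, each stage starts a new
|   # item as soon as it finishes one, so throughput is set by
|   # the slowest stage, not the end-to-end latency
|   tokens_per_sec = 1 / (stage_compute + hop_latency)
|
|   print(token_latency, tokens_per_sec)  # ~2.5 s, ~20/s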
| juvo wrote:
| how does it compare to https://github.com/bigscience-
| workshop/petals ?
| DiogoSnows wrote:
| For generating synthetic data you could have a SETI@Home
| setup if you consider each home as a node that generates
| some amount of data. I mean, such a setup can be built
| with Exo; I wouldn't suggest including it as part of Exo.
|
| Out of curiosity, would you ever support training or at
| least fine-tuning?
| thom wrote:
| Bexowulf.
| tarasglek wrote:
| This is the first time I've seen a tinygrad backend in the wild.
| Amusing that it's supposedly more stable than llama.cpp for this
| project.
| alexandercheema wrote:
| Repo author here. Tinygrad changes rapidly so I wouldn't say
| it's "more" stable, but it certainly supports more accelerators
| than llama.cpp. As George Hotz likes to say, it sits somewhere
| on the spectrum between llama.cpp and Mojo. No hand-written
| kernels; optimal kernels are generated and found by beam
| search.
| matyaskzs wrote:
| Cloud cannot be beaten on compute / price, but moving to local
| could solve privacy issues and the world needs a second amendment
| for compute anyway.
| CuriouslyC wrote:
| You can beat gpt4/claude in terms of price/performance for most
| things by a mile using fine tuned models running in a colo.
| Those extra parameters give the chatbots the ability to
| understand malformed input and to provide off the cuff answers
| about almost anything, but small models can be just as smart
| about limited domains.
| ComputerGuru wrote:
| The problem is that once you say "fine tuned" then you have
| immediately slashed the user base down to virtually nothing.
| You need to fine-tune per-task and usually per-user (or org).
| There is no good way to scale that.
|
| Apple can fine-tune a local LLM to respond to a catalog of
| common interactions and requests but it's hard to see anyone
| else deploying fine-tuned models for non-technical audiences
| or even for their own purposes when most of their needs are
| one-off and not recurring cases of the same thing.
| CuriouslyC wrote:
| Not necessarily, you can fine tune on a general domain of
| knowledge (people already do this and open source the
| results) then use on device RAG to give it specific
| knowledge in the domain.
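|
| A minimal sketch of the on-device retrieval step (the
| embeddings would come from whatever local embedding model you
| run; nothing here is a real library API):
|
|   import math
|
|   def cosine(a, b):
|       dot = sum(x * y for x, y in zip(a, b))
|       na = math.sqrt(sum(x * x for x in a))
|       nb = math.sqrt(sum(y * y for y in b))
|       return dot / (na * nb)
|
|   def retrieve(query_vec, corpus, k=3):
|       # corpus: list of (text, embedding) pairs on the device
|       ranked = sorted(corpus, reverse=True,
|                       key=lambda it: cosine(query_vec, it[1]))
|       return [text for text, _ in ranked[:k]]
|
|   def build_prompt(question, query_vec, corpus):
|       context = "\n".join(retrieve(query_vec, corpus))
|       return ("Answer using this context:\n" + context +
|               "\n\nQuestion: " + question)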
| pierrefermat1 wrote:
| Would be great if we could get some benchmarks on commonly
| available hardware setups.
| pharrington wrote:
| I'm sure someone will show their benchmarks in a couple years!
| festive-minsky wrote:
| So I just tried with 2x MacBook Pros (M2 64GB & M3 128GB) and
| it was exactly the same speed as with just 1 MacBook Pro (M2
| 64GB). Not exactly a common setup, but at least it's something.
| alexandercheema wrote:
| Could you create a GitHub issue? There's a lot of work we'd
| like to do to improve this.
| whoami730 wrote:
| Is it possible to use this for image recognition and the like?
| Not sure what the usage of this could be apart from as a
| chatbot.
| tama_sala wrote:
| You can use other models like a vision LLM, or use AI agents as
| well
| jononor wrote:
| Image recognition can generally be done very efficiently on a
| single commodity PC. Even a phone that is a few years old can
| do quite a lot. Or a Raspberry Pi. So it generally does not
| need distributed computing solutions. I am talking about models
| like YOLO, ResNet, MobileNets, etc.
| ajnin wrote:
| It requires mlx but it is an Apple silicon-only library as far as
| I can tell. How is it supposed to be (I quote) "iPhone, iPad,
| Android, Mac, Linux, pretty much any device"? Has it been tested
| on anything other than the author's MacBook?
| orsorna wrote:
| One of the maintainers has a video demo on his twitter claiming
| iOS, android and Linux. Some of the code is not released and I
| wish they were advertising that properly.
| lopuhin wrote:
| The README says they plan to add llama.cpp support which should
| cover a lot of targets, also they have tinygrad already
| integrated I think.
| pyinstallwoes wrote:
| Swarm compute should be the norm for all compute - so much unused
| cpu across all the devices we collectively own.
| KronisLV wrote:
| This might not work for use cases where you need low latency,
| but for longer winded processing it would be amazing if
| possible.
|
| For example, if I have a few servers, laptop (connected to
| power) as well as a desktop PC and they're all connected to a
| fast local network, it'd be great to distribute the task of
| rendering a video or working with archive files across all of
| them.
| greggsy wrote:
| Those are two precise examples that benefit from single core
| compute power, and are wholly unsuited to distributed
| computing...
| KronisLV wrote:
| Distributed rendering farms have existed for a while.
| _factor wrote:
| This exists: https://aihorde.net/
|
| I haven't tried it, and not the norm, but I agree it should be
| more common. We have a global supercomputer with higher
| latency, but still a supercomputer.
| dchuk wrote:
| I might just still be too tired from just waking up, but I
| can't for the life of me find any details on that site about
| what models are actually being served by the horde?
| burkaman wrote:
| Go to https://aihorde.net/api/, scroll down to
| /v2/status/models, and click Try it out and then Execute.
| It's an enormous list and I think it can be dynamically
| updated, so that's probably why it isn't listed on the
| website.
| phito wrote:
| I'd rather my CPU be idle and not consume much power.
| imp0cat wrote:
| It depends. There are a lot of devices with quite capable CPUs
| that are mostly doing nothing.
| bastawhiz wrote:
| I also prefer my phone to not be hot and constantly plugged
| in. Or for my ML workload to suddenly get slow because my
| partner drove the car out of range of the WiFi. Or to miss
| notifications because my watch's CPU was saturated.
| christkv wrote:
| Is Apple Silicon with a lot of memory (32 GB and up) still
| considered a cheapish way to run models, or are there other
| options now?
| talldayo wrote:
| A good Apple Silicon Mac with 32gb of RAM will cost you over
| $2,000 on-sale. For that price you might as well buy an Nvidia
| machine instead, either two 3090s or a 64gb Jetson Orin board
| would be both cheaper and faster.
|
| The markup on Apple hardware is so big that I just don't think
| "cheapish" will ever be a way to describe the position they
| hold in the AI market. Apple's current budget lineup gets
| smoked by an RTX 3060 in a cheap Linux homeserver; the bar for
| high-value AI has been raised pretty high.
| makmanalp wrote:
| Question - if large clusters are reporting that they're seeing
| gains from using RDMA networks because communication overhead is
| a bottleneck, how is it possible that this thing is not massively
| bottlenecked running over a home network?
| DistractionRect wrote:
| I suspect that most of the devices you'd expect to find in your
| consumer cluster are too small/slow to saturate the link.
|
| Edit: it's also a matter of scale. You probably have a small
| number of small/slow devices in a consumer network versus a lot
| of large/fast devices in your enterprise cluster.
| derefr wrote:
| I haven't looked into exactly what this project is doing, but
| here's my understanding:
|
| Inference across O(N) pre-trained hidden layers isn't exactly
| an "embarrassingly parallel" problem, but it _is_ an
| "embarrassingly pipeline-able" problem (in the CPU sense of
| "pipelining.") Each device can keep just one or a few layers
| hot in their own VRAM; and also only needs to send and receive
| one small embedding (<1MB) vector per timestep -- which is so
| trivial that it's easily achievable in realtime even if all the
| devices are on wi-fi, talking to the same router, in your
| "noisy" apartment where 100 other neighbours are on the same
| bands.
|
| (To put it another way: running a single inference job has
| more forgiving realtime latency+throughput requirements than
| game streaming!)
|
| Assuming that you have a model that's too big for any of your
| home machines to individually hold; and that all you care about
| is performance for single-concurrent-request inference on that
| model -- then _in theory_, you just need _one_ GPU of one node
| of your homespun Beowulf GPU cluster to have enough VRAM to
| keep the single largest layer of your model always-hot; and
| then other smaller devices can handle keeping the smaller
| layers always-hot. And the result _should_ be faster than
| "overloading" that same model on that single largest-VRAM
| device and having some layers spill to CPU, or worse yet,
| having the GPU have to swap layers in and out repeatedly with
| each inference step.
|
| (Also, if you're wondering, in the case where a single
| machine/node has multiple GPUs -- or a GPU+VRAM and also a
| CPU+RAM! -- you can treat this as no different than if these
| were multiple independent nodes, that just-so-happen to have a
| very efficient pipeline communication channel between them. As
| the VRAM+computation cost of running inference far outweighs
| the communication overhead of forward propagation during
| inference, a home-network inference-pipelining cluster
| scheduler like this project, would still likely "schedule" the
| model's layers purely in consideration of the properties of the
| individual GPU+VRAM (or CPU+RAM), rather than bothering to care
| about placement.)
|
| ---
|
| That being said, AFAIK training _is_ "pipeline parallelizable"
| exactly as inference is. And people training models _do_ do
| this -- but almost always only across multiple top-of-the-line
| GPUs in one machine; not across multiple machines.
|
| When you think about what pipelining achieves for training --
| all you get is either:
|
| 1. the ability to use a bunch of small-aggregate-VRAM nodes to
| achieve the aggregate training capacity of fewer, larger-
| aggregate-VRAM nodes -- but with more power consumption =
| higher OpEx; and where also, if you scale this to O(N), then
| you're dumping a quadratic amount of layer-propagation data
| (which is now both forward-prop _and_ backprop data, and
| backprop data is bigger!) over what would likely be a shared
| network just to make this work. (If it's _not_ a shared
| network -- i.e. if it's Infiniband/other RDMA -- then why did
| you spend all that CapEx for your network and not on your
| GPUs!?)
|
| 2. the ability to pipeline a bunch of large-aggregate-VRAM
| nodes together to train a model that will then _never_ be able
| to be deployed onto any single node in existence, but can
| instead _only_ exist as a "pipelined inference model" that
| hogs O(log N) nodes of your cluster at a time for any inference
| run. Which makes cluster scheduling hell (if you aren't just
| permanently wiring the scheduler to treat O(log N)-node groups
| as single "hyper-nodes"); makes it so that you'll never be able
| to practically open-source the model in a way anybody but other
| bigcorps could ever run it (if that's something you care
| about); and very likely means you're cutting the concurrent-
| inference-request-serving capacity of your huge expensive GPU
| cluster by O(log N)... which the product team that allowed that
| cluster to be budgeted is _really_ not gonna like.
|
| That being said, I imagine at some point one of these
| proprietary "Inference-as-a-Service" models _has_ been trained
| at a layer size that puts it into pipelined-inference-only
| territory, _temporarily_. Doing so would be the ML engineer's
| equivalent to the CPU engineer's "we have no fundamentally
| clever advance, so this quarter we'll just crank up the clock
| frequency and deal with the higher TDP." (Heck, maybe GPT-4o is
| one of these.)
|
| ---
|
| What people with GPU clusters _want_, is 1. for the output of
| the process to be a model that runs on a single (perhaps multi-
| GPU) node; and 2. for the process itself to be mostly-shared-
| nothing with as little cross-node communication burden as
| possible (such that it's just a question of building highly
| _internally_ communication-efficient nodes, not so much highly-
| communication-efficient clusters.)
|
| And both of those goals are achieved by sizing models so that
| they fit within a single node; continuously fanning out streams
| of training data to those nodes; and then periodically fanning
| back in model-weights (or model-weight deltas) in an AllReduce
| operation, to merge the learning of O(N) independently-training
| nodes to become the new baseline for those nodes.
|
| (If you'll note, this architecture doesn't put _any_ latency
| requirements on the network, only some monstrous _throughput_
| requirements [at the fan-in step] -- which makes it a _lot_
| easier to design for.)
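|
| A toy sketch of the "keep one or a few layers hot per device"
| placement described above -- a greedy fill by free memory; a
| real scheduler would also weigh compute and link speed, and
| every number below is invented:
|
|   def place_layers(layer_bytes, device_free_bytes):
|       # fill devices from largest free memory to smallest,
|       # assigning consecutive layers to each
|       devices = sorted(device_free_bytes.items(),
|                        key=lambda kv: -kv[1])
|       dev = iter(devices)
|       name, free = next(dev)
|       placement = {}
|       for i, size in enumerate(layer_bytes):
|           while size > free:
|               name, free = next(dev)  # raises if nothing fits
|           placement[i] = name
|           free -= size
|       return placement
|
|   # e.g. place_layers([2_000_000_000] * 32,
|   #     {"mac-studio": 48e9, "laptop": 12e9, "phone": 4e9})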
| ulrischa wrote:
| Does somebody know if it runs on a Raspberry Pi?
| alexandercheema wrote:
| It *should* but I haven't tried it. I will try it. Updated in
| this issue:
|
| We could also try Raspberry Pi + Coral USB TPU
| (https://coral.ai/products/) - that might be a killer combo for
| a super cheap home AI cluster.
| alexandercheema wrote:
| Issue link: https://github.com/exo-explore/exo/issues/11
| dcreater wrote:
| This is a great idea and user friendly as well. It has the
| potential to convert multiple old devices overnight from being
| useless. However, I wish they had provided some results on
| tok/s and latency with some example setups.
| alexandercheema wrote:
| We didn't expect this to blow up so quickly. A lot of work
| needs to be done on getting different setups working. I have
| made an issue here:
| https://github.com/exo-explore/exo/issues/11
| DiogoSnows wrote:
| This is great work! I will keep an eye on it (and maybe even
| try to contribute). Looking back at the beginning of Google, I
| think their use of hardware and a hardware-agnostic platform
| likely contributed to supporting growth at lower cost. We need
| more of that in the AI era.
| alexandercheema wrote:
| Thank you for the support! I agree on the cost point, and
| personally I don't want to live in a world where all AI
| runs on H100s in a giant datacenter controlled by one
| company.
| throwawaymaths wrote:
| Is this sensible? Transformers are memory bandwidth bound.
| Schlepping activations around your home network (which is liable
| to be lossy) seems like it would result in atrocious TPS.
| alexandercheema wrote:
| "Transformers are memory bandwidth bound" - this is the precise
| reason why this makes sense. If a model doesn't fit into memory
| on a single device, it needs to be incrementally loaded into
| memory (offloading), which is bottlenecked by memory bandwidth.
| Splitting the model over multiple devices avoids this, instead
| trading off for latency of communicating between nodes. The
| network bandwidth requirements are minimal since only the
| activations (intermediary embeddings) are passed between
| devices. For Llama-3-8B these are ~10KB, for Llama-3-70B these
| are ~32KB.
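|
| To put the trade-off in rough numbers (every value below is an
| assumption, just for illustration):
|
|   weights_gb = 40      # e.g. a 70B model, ~4-bit quantized
|   device_mem_gb = 16   # memory available on one device
|   activation_kb = 32   # per-token, per-hop payload (above)
|   hops = 3             # devices in the pipeline
|
|   # offloading: weights that don't fit must be streamed in
|   # again for every forward pass, i.e. for every token
|   offload_gb_per_token = max(0, weights_gb - device_mem_gb)
|
|   # pipelining: each hop ships only the activations
|   pipeline_kb_per_token = hops * activation_kb
|
|   print(offload_gb_per_token, "GB vs",
|         pipeline_kb_per_token, "KB per token")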
| cess11 wrote:
| I look forward to something similar being developed on top of
| Bumblebee and Axon, which I expect is just around the corner.
| Because, for me, Python does not spark joy.
| alexandercheema wrote:
| Repo author here. This sounds interesting. Could you elaborate
| on the benefits of Bumblebee / Axon?
| cess11 wrote:
| They run on the BEAM, and there are related IoT platforms
| like Nerves. I find that to be a much nicer runtime than
| (C)Python.
|
| Edit: I don't know where else to begin. It's a runtime that
| has lightweight processes, excellent observability, absurdly
| good fault tolerance, really nice programming languages and
| so on. It's designed for distributed computing.
| alexandercheema wrote:
| Fascinating, will check this out! I wanted to focus on
| Python first to build this quickly, test out ideas and
| iterate.
|
| This seems like a good option for a switch.
|
| Do you know if any of these can run on Apple/Android
| devices?
| yjftsjthsd-h wrote:
| Unfortunately I don't see _any_ licensing info, without which
| I'm not touching it. Which is too bad, since the idea is really
| cool.
| alexandercheema wrote:
| Thanks for pointing that out. Fixed:
| https://github.com/exo-explore/exo/blob/main/LICENSE
| yjftsjthsd-h wrote:
| Excellent, thank you:)
| Jayakumark wrote:
| Just got https://github.com/distantmagic/paddler working across
| 2 machines on Windows for load balancing. This will be next
| level and useful for running Llama 400B across multiple
| machines. But it looks like Windows support is not there yet.
| fudged71 wrote:
| Since this is best over a local network, I wonder how easy you
| could make the crowdsourcing aspect of this. How could you make
| it simple enough for everyone that's physically in your office to
| join a network to train overnight? Or get everyone at a
| conference to scan a QR code to contribute to a domain specific
| model.
| alexandercheema wrote:
| That's where we want to get eventually. There's a lot of work
| that needs to be done but I'm confident we'll get there. Give
| us 3 months and it'll be as simple as running Dropbox.
| pkeasjsjd wrote:
| It bothers me that they don't talk about security here, I don't
| like it at all.
| alexandercheema wrote:
| You're right. The assumption right now is that you're running
| on trusted devices on your own local network. I will add a
| section in the README.
___________________________________________________________________
(page generated 2024-07-16 23:00 UTC)