[HN Gopher] Exo: Run your own AI cluster at home with everyday d...
       ___________________________________________________________________
        
       Exo: Run your own AI cluster at home with everyday devices
        
       Author : simonpure
       Score  : 329 points
       Date   : 2024-07-16 02:55 UTC (20 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | hagope wrote:
        | I used to be excited about running models locally (LLMs,
        | Stable Diffusion, etc.) on my Mac, PC, etc. But now I have
        | resigned myself to the fact that most useful AI compute will
        | mostly be in the cloud.
       | Sure, I can run some slow Llama3 models on my home network, but
       | why bother when it is so cheap or free to run it on a cloud
       | service? I know Apple is pushing local AI models; however, I have
       | serious reservations about the impact on battery performance.
        
         | Cantinflas wrote:
          | Why bother running models locally? Privacy, for one, or
          | censorship resistance.
        
           | seasonman wrote:
           | Also customizability. Sure, you can fine-tune the cloud
           | hosted models (to a certain degree of freedom), but it will
           | probably be expensive, inefficient, difficult and
           | unmaintainable.
        
           | hanniabu wrote:
           | And offline access
        
         | bongodongobob wrote:
         | I have a 2 year old Thinkpad and I wouldn't necessarily call
         | llama3 slow on it. It's not as fast as ChatGPT but certainly
         | serviceable. This should only help.
         | 
          | Not sure why you're throwing your hands up, because this is
          | a step towards solving your problem.
        
         | nhod wrote:
         | Is this a hunch, or do you know of some data to back up your
         | reservations?
         | 
          | Copilot+ PCs, which all run models locally, have the best
          | battery life of any portable PC devices, ever.
         | 
         | These devices have in turn taken a page out of Apple Silicon's
         | playbook. Apple has the benefit of deep hardware and software
         | integration that no one else has, and is obsessive about
         | battery life.
         | 
         | It is reasonable to think that battery life will not be
         | impacted much.
        
           | fragmede wrote:
            | That doesn't seem totally reasonable. The battery life of
            | an iPhone is pretty great if you're not actually using it,
            | but if you're using the device hard it gets hot to the
            | touch and the battery drains. Playing resource-intensive
            | video games maxes out the *PU, never lets the device
            | sleep, and takes a noticeable hit on battery life. Since
            | inference takes a lot of compute to perform, it's hard to
            | imagine it being totally free, battery-wise. It probably
            | won't be as hard on the device as playing certain video
            | games non-stop, but I get into phone conversations with
            | ChatGPT as it is, so I can imagine that being a concern if
            | you're already low on battery.
        
         | wokwokwok wrote:
         | > Sure, I can run some slow Llama3 models on my home network,
         | but why bother when it is so cheap or free to run it on a cloud
         | service?
         | 
         | Obvious answer: because it's not free, and it's not cheap.
         | 
          | If you're playing with a UI library, let's say Qt... would you:
         | 
          | a) install the community version and play with it ($0)
         | 
         | b) buy a professional license to play with (3460 EUR/Year)
         | 
         | Which one do you pick?
         | 
         | Well, the same goes. It turns out, renting a server large
         | enough to run big (useful, > 8B) models is actually quite
         | expensive. The per-api-call costs of real models (like GPT4)
         | adds up very quickly once you're doing non-trivial work.
         | 
         | If you're just messing around with the tech, why would you pay
         | $$$$ just to piss around with it and see what you can do?
         | 
         | Why would you _not_ use a free version running on your old PC
         | / mac / whatever you have lying around?
         | 
         | > I used to be excited about running models locally
         | 
          | That's an easy position to be in once you've _already done it_
         | and figured out, yes, I really want the pro plan to build my
         | $StartUP App.
         | 
         | If you prefer to pay for an online service and you can afford
         | it, absolutely go for it; but isn't this an enabler for a lot
         | of people to play and explore the tech for $0?
         | 
         | Isn't having more people who understand this stuff and can make
         | meaningful (non-hype) decisions about when and where to use it
         | good?
         | 
          | Isn't it nice that if Meta releases some 400B Llama 4 model,
          | most people can play with it, not just the ones with the
          | $7000 Mac Studio? ...and keep building the open source
          | ecosystem?
         | 
         | Isn't that great?
         | 
         | I think it's great.
         | 
         | Even if you don't want to play, I do.
        
           | FeepingCreature wrote:
           | I just prepay $20/mo to openrouter.ai and can instantly play
           | with every model, no further signup required.
        
           | itake wrote:
           | I'm a bit confused. Your reasoning doesn't align with the
           | data you shared.
           | 
            | The startup costs for just messing around at home are
            | huge: purchasing a server and GPUs, paying for
            | electricity, time spent configuring the API.
            | 
            | If you just want to mess around, $100 to call the world's
            | best API is much cheaper than spending $2-7k on a Mac
            | Studio.
            | 
            | Even at production-level traffic, the ROI on uptime,
            | devops, utilities, etc. would take years to recoup the
            | upfront and ongoing costs of self-hosting.
            | 
            | Self-hosting will also have higher latency and lower
            | throughput.
        
             | sudohackthenews wrote:
              | People have gotten manageable results on all sorts of
              | hardware. People have even squeezed a few tokens/second
              | out of Raspberry Pis. The small models are pretty
              | performant -- they get good results on consumer gaming
              | hardware. My 2021 laptop with a 3070m (only 8GB VRAM)
              | runs 8B models faster than I can read, and even the
              | original M1 chips can run the models fine.
        
               | monkmartinez wrote:
                | You are right, of course... IF your metric for
                | manageable/usable is measured only in tokens per
                | second (tok/s).
                | 
                | If your metric is quality of output, time, money and
                | tok/s, there is no comparison; local models just
                | aren't there yet.
        
             | wokwokwok wrote:
             | > The startup costs for just messing around at home are
             | huge
             | 
             | No, they are zero.
             | 
             | Most people have extra hardware lying around at home
             | they're not using. It costs nothing but time to install
             | python.
             | 
             | $100 is not free.
             | 
             | If you can't be bothered, sure thing, slap down that credit
             | card and spend your $100.
             | 
             | ...but, maybe not so for some people?
             | 
              | Consider students with no credit card, etc.; there are a
              | lot of people with a lot of free time and not a lot of
              | money. Even if _you_ don't want to use it, do you
              | seriously think this project is totally valueless for
              | everyone?
             | 
             | Maybe, it's not for you. Not everything has to be for
             | everyone.
             | 
             | You are, maybe, just not the target audience here?
        
               | lynx23 wrote:
                | And it's not entitled to claim that "Most people have
                | extra hardware lying around at home"? Your story
                | doesn't sound plausible at all.
        
               | wokwokwok wrote:
               | This project is _literally_ aiming to run on devices like
               | old phones.
               | 
               | I don't think having an old phone is particularly
               | entitled.
               | 
               | I think casually slapping down $100 on whim to play with
               | an API... probably, yeah.
               | 
               | /shrug
        
               | itake wrote:
               | According to this tweet, Llama 3 costs about $0.20 per
               | Million tokens using an M2.
               | 
               | https://x.com/awnihannun/status/1786069640948719956
               | 
               | In comparison, GPT3.5-turbo costs $0.50 per million
               | tokens.
               | 
                | Do you think an old iPhone will be less than 2x as
                | efficient?
        
               | nightski wrote:
               | FWIW depends on cost of power. Where I live cost of power
               | is less than half the stated average.
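                | 
                | For example, a quick back-of-envelope in Python;
                | every number below is a placeholder, so swap in your
                | own rate and hardware figures:
                | 
                |     # electricity cost per million generated tokens
                |     tok_per_s = 25      # assumed local speed
                |     watts = 40          # assumed draw while generating
                |     usd_per_kwh = 0.15  # your utility rate
                | 
                |     hours = 1_000_000 / tok_per_s / 3600
                |     cost = hours * watts / 1000 * usd_per_kwh
                |     print(f"~${cost:.2f} per million tokens")  # ~$0.07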
        
               | bryanrasmussen wrote:
               | Most people who would want to be running machine learning
               | models probably have some hardware at home that can
               | handle a slow task for playing around and determining if
               | it is worthwhile to pay out for something more
               | performant.
               | 
                | This is undoubtedly entitled, but thinking to
                | yourself, "huh, I think it's time to try out some of
                | this machine learning stuff", is a pretty inherently
                | entitled thing to do.
        
               | Aurornis wrote:
               | > You are, maybe, just not the target audience here?
               | 
               | The difference between an open model running on a $100
               | computer and the output from GPT4 or Claude Sonnet is
               | huge.
               | 
               | I use local and cloud models. The difference in
               | productivity and accuracy between what I can run locally
               | and what I can get for under $100 of API calls per month
               | is huge once you get past basic playing around with chat.
               | It's not even close right now.
               | 
               | So I think actually you are not the target audience for
                | what the parent comments are talking about. If you don't
               | need cutting edge performance then it's fun to play with
               | local, open, small models. If the goal is to actually use
               | LLMs for productivity in one way or another, spending
               | money on the cloud providers is a far better investment.
               | 
               | Exceptions of course for anything that is privacy-
               | sensitive, but you're still sacrificing quality by using
               | local models. It's not really up for debate that the
               | large hosted models are better than what you'd get from
               | running a 7B open model locally.
        
             | zeta0134 wrote:
             | You are vastly overestimating the startup cost. For me this
             | week it was literally these commands:
             | 
             | pacman -S ollama
             | 
             | ollama serve
             | 
             | ollama run llama3
             | 
             | My basic laptop with about 16 GB of RAM can run the model
             | just fine. It's not fast, but it's reasonably usable for
             | messing around with the tech. That's the "startup" cost.
             | Everything else is a matter of pushing scale and
             | performance, and yes that can be expensive, but a novice
             | who doesn't know what they need yet doesn't have to spend
             | tons of money to find out. Almost any PC with a reasonable
             | amount of RAM gets the job done.
        
               | Aurornis wrote:
               | I'm familiar with local models. They're fine for chatting
               | on unimportant things.
               | 
               | They do not compare to the giant models like Claude
               | Sonnet and GPT4 when it comes to trying to use them for
               | complex things.
               | 
                | I continue to use both local models and the commercial
                | cloud offerings, but anyone who suggests that the
                | small local models are on par with the big closed
                | hosted models right now is engaging in wishful
                | thinking.
        
               | monkmartinez wrote:
                | llama3 at 8 billion params is weak sauce for anything
                | serious; it just isn't in the same galaxy as Sonnet
                | 3.5 or GPT-4o. The smaller and faster models like Phi
                | are even worse. Once you progress past asking trivial
                | questions to a point where you need to trust the
                | output a bit more, it's not worth the time, money,
                | and/or sweat to run a local model to do it.
               | 
               | A novice isn't going to know what they need because they
               | don't know what they don't know. Try asking a question to
               | LLaMA 3 at 8 billion and the same question to LLaMA 3 at
               | 70 billion. There is a night and day difference. Sonnet,
               | Opus and GPT-4o run circles around LLaMA 3 70b. To run
               | LLaMA at 70 billion you need serious horse power as well,
               | likely thousands of dollars in hardware investment. I say
               | it again... the calculus in time, money, and effort isn't
               | favorable to running open models on your own hardware
               | once you pass the novice stage.
               | 
               | I am not ungrateful that the LLaMA's are available for
               | many different reasons, but there is no comparison
               | between quality of output, time, money and effort. The
               | API's are a bargain when you really break down what it
               | takes to run a serious model.
        
               | jononor wrote:
                | Using an LLM as a general-purpose knowledge base is
                | only one particular application of an LLM, and one
                | which is probably best served by ChatGPT etc.
               | 
               | A lot of other things are possible with LLMs using the
               | context window and completion, thanks to their "zero
               | shot" learning capabilities. Which is also what RAG
               | builds upon.
        
             | LorenDB wrote:
             | And why would you buy a Mac Studio? You could build a
             | reasonable GPU-accelerated Linux box for well under $1500.
             | For example:
             | https://pcpartpicker.com/guide/BCWG3C/excellent-amd-
             | gamingst...
        
               | J_Shelby_J wrote:
               | Devs that refuse to move off Apple are severely
               | disadvantaged in the LLM era.
        
               | jondwillis wrote:
               | lol tell that to the 3 year old laptop with 64 GB of RAM
               | that I use exclusively for local LLMs while dev'ing on my
               | work laptop with 96 GB of RAM...
        
           | nl wrote:
           | > Well, the same goes. It turns out, renting a server large
           | enough to run big (useful, > 8B) models is actually quite
           | expensive. The per-api-call costs of real models (like GPT4)
           | adds up very quickly once you're doing non-trivial work.
           | 
           | I run my own models, but the truth is most of the time I just
           | use an API provider.
           | 
           | TogetherAI and Groq both have free offers that are generous
           | enough I haven't used them up in 6 months of experimentation
           | and TogetherAI in particular has more models and gets new
           | models up quicker than I can try them myself.
        
           | jrm4 wrote:
           | Right, I think people here are _vastly_ underestimating this
           | idea of
           | 
           | "What if I want to play around with really PERSONAL stuff."
           | 
           | I've been keeping a digital journal about my whole life. I
           | plan to throw that thing into an AI to see what happens, and
           | you can be damn sure that it will be local.
        
             | monkmartinez wrote:
             | Yes, I am with you 100% and keep several LLaMA's on my
             | workstation for that reason. I use Openrouter for
             | everything else. Everything that isn't sensitive goes to
             | one of the big kid models because they are just sooooo much
             | better. LLaMA 400b might be the start of running with the
             | big kids, but I know we are not close with the current
             | available models.
        
           | Aurornis wrote:
           | > Why would you not use a free version running on your old PC
           | / mac / whatever you have lying around?
           | 
           | Because the old PC lying around can't come anywhere near the
           | abilities or performance of the hosted AI compute providers.
           | Orders of magnitudes of difference.
           | 
           | The parent commenter is correct: If you want cutting edge
           | performance, there's no replacement for the hosted solutions
           | right now.
           | 
           | Running models locally is fun for playing around and
           | experimenting, but there is no comparison between what you
           | can run on an old PC lying around and what you can get from a
           | hosted cluster of cutting edge hardware that offers cheap
           | output priced per API call.
        
         | dotancohen wrote:
          | I have found many similarities between home AI and home
          | astronomy. The equipment needed to get really good
          | performance is far beyond that available to the home user;
          | however, intellectually satisfying results can be had at
          | home as a hobby. But certainly not professional results.
        
           | grugagag wrote:
           | When learning and experimenting it could make a difference.
        
         | friendly_chap wrote:
          | We are running smaller models with software we wrote (self-
          | plug alert: https://github.com/singulatron/singulatron) with
          | great success. These models sometimes make obvious mistakes
          | (such as the one in our repo image - haha), but they can
          | also be surprisingly versatile in areas you don't expect
          | them to be, like coding.
         | 
         | Our demo site uses two NVIDIA GeForce RTX 3090 and our whole
         | team is hammering it all day. The only problem is occasionally
         | high GPU temperature.
         | 
         | I don't think the picture is as bleak as you paint. I actually
         | expect Moore's Law and better AI architectures to bring on a
         | self-hosted AI revolution in the next few years.
        
         | dsign wrote:
         | For my advanced spell-checking use-case[^1], local LLMs are,
         | sadly, not state-of-the-art. But their $0 price-point is
         | excellent to analyze lots of sentences and catch the most
         | obvious issues. With some clever hacking, the most difficult
         | cases can be handled by GPT4o and Claude. I'm glad there is a
         | wide variety of options.
         | 
         | [^1] Hey! If you know of spell-checking-tuned LLM models, I'm
         | all ears (eyes).
        
           | bruce343434 wrote:
            | I think the floating point encoding of LLMs is inherently
            | lossy; add to that the way tokenization works. The LLMs
            | I've worked with "ignore" bad spelling and correctly
            | interpret misspelled words. I'm guessing that for spelling
            | LLMs, you'd want tokenization at the character level,
            | rather than a byte pair encoding.
           | 
           | You could probably train any recent LLM to be better than a
           | human at spelling correction though, where "better" might be
           | a vague combination of faster, cheaper, and acceptable loss
           | of accuracy. Or maybe slightly more accurate.
           | 
           | (A lot of people hate on LLMs for not being perfect, I don't
           | get it. LLMs are just a tool with their own set of trade
           | offs, no need to get rabid either for or against them. Often,
           | things just need to be "good enough". Maybe people on this
           | forum have higher standards than average, and can not deal
           | with the frustration of that cognitive dissonance)
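            | 
            | To make the tokenization point concrete, here is a toy
            | sketch: a greedy longest-match over an invented
            | vocabulary, standing in for BPE (character-level would
            | just be list(word)):
            | 
            |     VOCAB = {"spell", "ing", "sp", "el", "s", "p", "e",
            |              "l", "i", "n", "g"}
            | 
            |     def tokenize(word):
            |         tokens, i = [], 0
            |         while i < len(word):
            |             # longest match first, single char fallback
            |             for j in range(len(word), i, -1):
            |                 if word[i:j] in VOCAB or j == i + 1:
            |                     tokens.append(word[i:j])
            |                     i = j
            |                     break
            |         return tokens
            | 
            |     print(tokenize("spelling"))  # ['spell', 'ing']
            |     print(tokenize("speling"))   # ['sp', 'el', 'ing']
            | 
            | The misspelling lands on completely different token
            | boundaries, which is roughly what the model sees.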
        
         | Hihowarewetoday wrote:
          | I'm not sure why you have resigned?
          | 
          | If you don't care about running it locally, just use it
          | online. Everything is good.
          | 
          | But you can run it locally already. Is it cheap? No. Are we
          | still in the beginning? Yes. We are still in a phase where
          | this is a pure luxury, and just getting into it by buying a
          | 4090 is still relatively cheap in my opinion.
          | 
          | Why run it locally, you ask? I personally think running
          | AnythingLLM and similar frameworks on your own local data is
          | interesting.
          | 
          | But I'm pretty sure in a few years you will be able to buy
          | cheaper ML chips for running models locally fast and cheap.
          | 
          | Btw, at least I don't know of an online service which is
          | uncensored, has a lot of LoRAs to choose from and is cost
          | effective. For just playing around with LLMs, sure, there
          | are plenty of services.
        
         | PostOnce wrote:
         | Maybe you want to conduct experiments that the cloud API
         | doesn't allow for.
         | 
         | Perhaps you'd like to plug it into a toolchain that runs faster
         | than API calls can be passed over the network? -- eventually
         | your edge hardware is going to be able to infer a lot faster
         | than the 50ms+ per call to the cloud.
         | 
         | Maybe you would like to prevent the monopolists from gaining
         | sole control of what may be the most impactful technology of
         | the century.
         | 
         | Or perhaps you don't want to share your data with Microsoft &
         | Other Evils (formerly known as dont be evil).
         | 
         | You might just like to work offline. Whole towns go offline,
         | sometimes for days, just because of bad weather. Nevermind war
         | and infrastructure crises.
         | 
         | Or possibly you don't like that The Cloud model has a fervent,
         | unshakeable belief in the propaganda of its masters. Maybe that
         | propaganda will change one day, and not in your favor. Maybe
         | you'd like to avoid that.
         | 
         | There are many more reasons in the possibility space than my
         | limited imagination allows for.
        
           | tarruda wrote:
           | It is not like strong models are at a point where you can
           | 100% trust their output. It is always necessary to review LLM
           | generated text before using it.
           | 
           | I'd rather have a weaker model which I can always rely on
           | being available than a strong model which is hosted by a
           | third party service that can be shut down at any time.
        
             | Aurornis wrote:
             | > I'd rather have a weaker model which I can always rely on
             | being available than a strong model which is hosted by a
             | third party service that can be shut down at any time.
             | 
              | Every LLM project I've worked with has an abstraction
              | layer for calling hosted LLMs. It's trivial to implement
              | another adapter to call a different LLM. It's often done
              | as a fallback/failover strategy.
             | 
             | There are also services that will merge different providers
             | into a unified API call if you don't want to handle the
             | complexity on the client.
             | 
             | It's really not a problem.
        
               | PostOnce wrote:
               | Suppose you live outside of America and the supermajority
               | of LLM companies are American. You want to ask a question
               | about whisky distillation or abortion or anything else
               | that's legal in your jurisdiction but not in the US, but
               | the LLM won't answer.
               | 
               | You've got a plethora of cloud providers, all of them
               | aligned to a foreign country's laws and customs.
               | 
               | If you can choose between Anthropic, OpenAI, Google, and
               | some others... well, that's really not a choice at all.
               | They're all in California. What good does that do an
               | Austrian or an Australian?
        
           | jumpCastle wrote:
            | Don't services like RunPod solve half of these concerns?
        
           | gtirloni wrote:
           | _> eventually your edge hardware is going to be able to infer
           | a lot faster than the 50ms+ per call to the cloud._
           | 
           | This is interesting. Is that based on any upcoming technology
           | improvement already in the works?
        
             | a_t48 wrote:
             | GP is likely referring to network latency here. There's a
             | tradeoff between smaller GPUs/etc at home that have no
              | latency to use and beefier hardware in the cloud that has
             | a minimum latency to use.
        
               | yjftsjthsd-h wrote:
               | Sure, but if the model takes multiple seconds to execute,
               | then even 100 milliseconds of network latency seems more
               | or less irrelevant
        
             | datameta wrote:
             | Comms is also the greatest battery drain for a remote edge
             | system. Local inference can allow for longer operation, or
             | operation with no network infra.
        
           | sharpshadow wrote:
            | Excellent points. Being able to use available hardware in
            | unison is amazing, and I guess we are not far away from
            | botnets utilising this kind of technology like they did
            | with mining coins.
        
         | jrm4 wrote:
         | What do you mean by _useful_ here?
         | 
          | I'm asking because I've had the exact OPPOSITE thought. The
          | intersection of Moore's Law and the likelihood that these
          | things won't end up as some big unified singularity brain,
          | but instead as little customized use cases, makes me think
          | that running at home/office will perhaps be just as
          | appealing.
        
         | aftbit wrote:
         | What if you want to create transcripts for 100s of hours of
         | private recorded audio? I for one do not want to share that
         | with the cloud providers and have it get used as training data
          | or be subject to warrantless search under the third party
         | doctrine. Or what if you want to run a spicy Stable Diffusion
         | fine-tune that you'd rather not have associated with your name
         | in case the anti-porn fascists take over? I feel like there are
         | dozens of situations where the cost is really not the main
         | reason to prefer a local solution.
        
         | dws wrote:
         | > Sure, I can run some slow Llama3 models on my home network,
         | but why bother when it is so cheap or free to run it on a cloud
         | service?
         | 
         | Running locally, you can change the system prompt. I have Gemma
         | set up on a spare NUC, and changed the system prompt from
         | "helpful" to "snarky" and "kind, honest" to "brutally honest".
         | Having an LLM that will roll its eyes at you and say "whatever"
         | is refreshing.
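          | 
          | A rough sketch of the same trick against a locally hosted
          | model's HTTP API (the endpoint and field names here are
          | Ollama's, as I understand its docs; adjust for whatever
          | server you actually run):
          | 
          |     import requests
          | 
          |     resp = requests.post(
          |         "http://localhost:11434/api/generate",
          |         json={
          |             "model": "gemma",
          |             "system": "You are snarky and brutally honest.",
          |             "prompt": "Review my weekend project idea.",
          |             "stream": False,
          |         },
          |         timeout=120,
          |     )
          |     print(resp.json()["response"])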
        
         | diego_sandoval wrote:
         | > why bother when it is so cheap or free to run it on a cloud
         | service?
         | 
         | For the same reasons that we bother to use Open Source software
         | instead of proprietary software.
        
         | cess11 wrote:
         | I don't want people I don't know snooping around in my
         | experiments.
        
       | iJohnDoe wrote:
       | Anyone run this? Works?
        
         | tdubhro1 wrote:
          | The readme shows how to run it assuming you can run a
          | Python program on the device, so I expect it works with
          | laptops and PCs, but there's a note at the end of the page
          | saying that the iOS app has fallen behind the Python
          | version, so it's not clear to me how to get this running on
          | your iPhone or other such devices.
        
           | orsorna wrote:
           | The "device" in question must be Apple Silicon because the
           | `mlx` package is a hard dependency, or at least an ARM
           | machine (I do not have any Apple Silicon Macbooks or ARM
           | machines to run this). I tried tweaking this before realizing
           | calls to this library is littered all over the repo. I don't
           | really understand the AI ecosystem very well but it seems
           | that the use of the `mlx` library should be supplanted by
           | some other library depending on the platform machine. Until
           | then, and the actual release of the iOS code somewhere,
           | "everyday devices" is limited to premium devices that almost
           | no one has more than one of. I'm looking forward to run this
           | on other machine platforms and squeeze out what I can from
           | old hardware laying around. Otherwise I doubt the tagline of
           | the project.
           | 
           | Edit: to add on, the only evidence that this runs anywhere
           | but Apple Silicon is the maintainer's Twitter where they show
           | it running on two Macbook Pros as well as other devices. I'm
           | not sure how many of those devices are not ARM.
           | 
           | I'm not throwing shade at the concept the author is
            | presenting, but I'd appreciate it if they could slow down
           | functional commits (he is writing them right now as I type)
           | and truthfully modify the documentation to state which
           | targets are actually able to run this.
        
         | acosmism wrote:
         | why ask? try it!
        
       | mg wrote:
        | > This enables you to run larger models than you would be
        | > able to on any single device.
       | 
       | No further explanation on how this is supposed to work?
       | 
       | If some layers of the neural network are on deviceA and some
       | layers are on deviceB, wouldn't that mean that for every token
       | generated, all output data from the last layer on deviceA have to
       | be transferred to deviceB?
        
         | steeve wrote:
         | Yes, that's how it works (pipeline parallelism)
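          | 
          | A toy illustration of the idea (random weights and invented
          | sizes; the real project splits actual transformer layers):
          | 
          |     import numpy as np
          | 
          |     hidden = 8
          |     layers = [np.random.randn(hidden, hidden) * 0.1
          |               for _ in range(6)]
          |     device_a, device_b = layers[:3], layers[3:]
          | 
          |     def run_shard(shard, x):
          |         for w in shard:
          |             x = np.tanh(x @ w)  # stand-in for a block
          |         return x
          | 
          |     x = np.random.randn(hidden)  # current token's embedding
          |     x = run_shard(device_a, x)   # runs on device A
          |     # only this small vector crosses the network to device B
          |     x = run_shard(device_b, x)   # runs on device B
          |     print(x.shape)               # (8,)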
        
           | mg wrote:
           | Interesting. Let's do the math ...
           | 
           | Let's say the model has 50B parameters and 50 layers. That
           | would mean about one billion values have to travel through
           | the wifi for every generated token?
           | 
           | I wonder how much data that is in bytes and how long it takes
           | to transfer them.
        
             | blackbear_ wrote:
              | It's not the parameters that are sent, it's the layer
              | outputs. That makes for a few thousand floats per token.
        
               | mg wrote:
               | Woops! I would have thought the number of neurons roughly
               | equals the number of parameters, but you are right. The
               | number of parameters is much higher.
        
               | tama_sala wrote:
                | The embedding size is only 8k while the parameter
                | count is 70B, so it's a huge difference.
        
         | mikewarot wrote:
         | Yes, so you would have a vector about 8k values long to be
         | transferred on each token generated.
         | 
         | You could do that easily with any modern network.
        
           | mg wrote:
           | That's exciting. So we could build a SETI@home style network
           | of even the largest models.
           | 
           | I wonder if training could be done in this way too.
        
             | alexandercheema wrote:
             | Repo author here. That's correct. The embeddings for
             | Llama-3-8B are around 8KB-10KB. For Llama-3-70B they're
             | around 32KB. These are small enough to send around between
             | devices on a local network. For a SETI@home style network,
             | latency will kill you if you go over the internet. That's
             | why we're starting with local networks.
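              | 
              | The sizes fall straight out of hidden_size times bytes
              | per value (hidden sizes below are the published Llama-3
              | configs; the dtypes are my assumption):
              | 
              |     def act_bytes(hidden_size, bytes_per_value):
              |         return hidden_size * bytes_per_value
              | 
              |     print(act_bytes(4096, 2))  # Llama-3-8B fp16, ~8 KB
              |     print(act_bytes(8192, 4))  # Llama-3-70B, ~32 KB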
        
               | mg wrote:
               | Ah yes. At first, I thought that since it is all one-way
               | forward-only communication, latency would only affect the
               | time to the first token.
               | 
               | But I guess the final output needs to be sent back to the
               | first node before it can continue. So if there are 50
               | nodes with a latency of 40ms each, each token would take
               | 2s to process.
        
               | alexandercheema wrote:
               | Yeah, unfortunately the autoregressive nature of these
               | models slows it down significantly with added
               | device<->device latency. However, you can still max out
               | on throughput with pipeline parallelism, where you
               | overlap execution. See:
               | https://pytorch.org/docs/stable/pipeline.html
        
               | juvo wrote:
               | how does it compare to https://github.com/bigscience-
               | workshop/petals ?
        
               | DiogoSnows wrote:
               | For generating synthetic data you could have a SETI@Home
               | setup if you consider each home as a node that generates
               | some amount of data. I mean, such a setup can be built
                | with Exo; I wouldn't suggest including it as part of Exo.
               | 
               | Out of curiosity, would you ever support training or at
               | least fine-tuning?
        
       | thom wrote:
       | Bexowulf.
        
       | tarasglek wrote:
        | This is the first time I've seen a tinygrad backend in the
        | wild. Amusing that it's supposedly more stable than llama.cpp
        | for this project.
        
         | alexandercheema wrote:
          | Repo author here. Tinygrad changes rapidly so I wouldn't say
          | it's "more" stable, but it certainly supports more
          | accelerators than llama.cpp. As George Hotz likes to say, it
          | sits somewhere on the spectrum between llama.cpp and Mojo.
          | No hand-written kernels; optimal kernels are generated and
          | found by beam search.
        
       | matyaskzs wrote:
       | Cloud cannot be beaten on compute / price, but moving to local
       | could solve privacy issues and the world needs a second amendment
       | for compute anyway.
        
         | CuriouslyC wrote:
         | You can beat gpt4/claude in terms of price/performance for most
         | things by a mile using fine tuned models running in a colo.
         | Those extra parameters give the chatbots the ability to
         | understand malformed input and to provide off the cuff answers
         | about almost anything, but small models can be just as smart
         | about limited domains.
        
           | ComputerGuru wrote:
           | The problem is that once you say "fine tuned" then you have
           | immediately slashed the user base down to virtually nothing.
           | You need to fine-tune per-task and usually per-user (or org).
           | There is no good way to scale that.
           | 
           | Apple can fine-tune a local LLM to respond to a catalog of
           | common interactions and requests but it's hard to see anyone
           | else deploying fine-tuned models for non-technical audiences
           | or even for their own purposes when most of their needs are
           | one-off and not recurring cases of the same thing.
        
             | CuriouslyC wrote:
             | Not necessarily, you can fine tune on a general domain of
             | knowledge (people already do this and open source the
             | results) then use on device RAG to give it specific
             | knowledge in the domain.
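              | 
              | The retrieval half can be tiny. A toy sketch, where
              | bag-of-words scoring stands in for a real embedding
              | model and the documents and question are made up:
              | 
              |     from collections import Counter
              |     import math
              | 
              |     docs = [
              |         "The warranty covers the battery for two years.",
              |         "Support is available weekdays from 9 to 5.",
              |         "Returns are accepted within 30 days.",
              |     ]
              | 
              |     def score(a, b):
              |         va = Counter(a.lower().split())
              |         vb = Counter(b.lower().split())
              |         num = sum(va[w] * vb[w] for w in va if w in vb)
              |         na = math.sqrt(sum(v * v for v in va.values()))
              |         nb = math.sqrt(sum(v * v for v in vb.values()))
              |         return num / (na * nb) if na and nb else 0.0
              | 
              |     q = "How long is the battery warranty?"
              |     ctx = max(docs, key=lambda d: score(d, q))
              |     print(f"Context: {ctx}\nQuestion: {q}")
              | 
              | A fine-tuned small model plus this kind of local lookup
              | covers a lot of the narrow domains being described.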
        
       | pierrefermat1 wrote:
       | Would be great if we could get some benchmarks on commonly
       | available hardware setups.
        
         | pharrington wrote:
         | I'm sure someone will show their benchmarks in a couple years!
        
         | festive-minsky wrote:
          | So I just tried with 2x MacBook Pros (M2 64GB & M3 128GB)
          | and it was exactly the same speed as with just 1 MacBook Pro
          | (M2 64GB). Not exactly a common setup, but at least it's
          | something.
        
           | alexandercheema wrote:
           | Could you create a GitHub issue? There's a lot of work we'd
           | like to do to improve this.
        
       | whoami730 wrote:
        | Is it possible to use this for image recognition and the like?
        | Not sure what the usage of this could be apart from as a
        | chatbot.
        
         | tama_sala wrote:
         | You can use other models like a vision LLM, or use AI agents as
         | well
        
         | jononor wrote:
          | Image recognition can generally be done very efficiently on
          | a single commodity PC. Even a phone that is a few years old
          | can do quite a lot. Or a Raspberry Pi. So it generally does
          | not need distributed computing solutions. I am talking about
          | models like YOLO, ResNet, MobileNets, etc.
        
       | ajnin wrote:
        | It requires mlx, but that is an Apple silicon-only library as
        | far as I can tell. How is it supposed to run on (I quote)
        | "iPhone, iPad, Android, Mac, Linux, pretty much any device"?
        | Has it been tested on anything other than the author's
        | MacBook?
        
         | orsorna wrote:
         | One of the maintainers has a video demo on his twitter claiming
         | iOS, android and Linux. Some of the code is not released and I
         | wish they were advertising that properly.
        
         | lopuhin wrote:
         | The README says they plan to add llama.cpp support which should
         | cover a lot of targets, also they have tinygrad already
         | integrated I think.
        
       | pyinstallwoes wrote:
       | Swarm compute should be the norm for all compute - so much unused
       | cpu across all the devices we collectively own.
        
         | KronisLV wrote:
         | This might not work for use cases where you need low latency,
         | but for longer winded processing it would be amazing if
         | possible.
         | 
         | For example, if I have a few servers, laptop (connected to
         | power) as well as a desktop PC and they're all connected to a
         | fast local network, it'd be great to distribute the task of
         | rendering a video or working with archive files across all of
         | them.
        
           | greggsy wrote:
           | Those are two precise examples that benefit from single core
           | compute power, and are wholly unsuited to distributed
           | computing...
        
             | KronisLV wrote:
             | Distributed rendering farms have existed for a while.
        
         | _factor wrote:
         | This exists: https://aihorde.net/
         | 
         | I haven't tried it, and not the norm, but I agree it should be
         | more common. We have a global supercomputer with higher
         | latency, but still a supercomputer.
        
           | dchuk wrote:
           | I might just still be too tired from just waking up, but I
           | can't for the life of me find any details on that site about
           | what models are actually being served by the horde?
        
             | burkaman wrote:
             | Go to https://aihorde.net/api/, scroll down to
             | /v2/status/models, and click Try it out and then Execute.
             | It's an enormous list and I think it can be dynamically
             | updated, so that's probably why it isn't listed on the
             | website.
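              | 
              | If you'd rather script it, something like this should
              | work (the endpoint is the one those interactive docs
              | expose; the field names are my guess at the response
              | shape, so check the actual output):
              | 
              |     import requests
              | 
              |     url = "https://aihorde.net/api/v2/status/models"
              |     models = requests.get(url, timeout=30).json()
              |     print(len(models), "models currently served")
              |     for m in models[:10]:
              |         print(m.get("name"), "-", m.get("count"), "workers")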
        
         | phito wrote:
          | I'd rather my CPU be idle and not consume much power
        
           | imp0cat wrote:
            | It depends. There are a lot of devices with quite capable
            | CPUs that are mostly doing nothing.
        
             | bastawhiz wrote:
             | I also prefer my phone to not be hot and constantly plugged
             | in. Or for my ML workload to suddenly get slow because my
             | partner drove the car out of range of the WiFi. Or to miss
             | notifications because my watch's CPU was saturated.
        
       | christkv wrote:
        | Is Apple silicon with a lot of memory (32GB and up) still
        | considered a cheapish way to run models, or are there other
        | options now?
        
         | talldayo wrote:
          | A good Apple Silicon Mac with 32GB of RAM will cost you over
          | $2,000 on sale. For that price you might as well buy an
          | Nvidia machine instead; either two 3090s or a 64GB Jetson
          | Orin board would be both cheaper and faster.
         | 
         | The markup on Apple hardware is so big that I just don't think
         | "cheapish" will ever be a way to describe the position they
         | hold in the AI market. Apple's current budget lineup gets
         | smoked by an RTX 3060 in a cheap Linux homeserver; the bar for
         | high-value AI has been raised pretty high.
        
       | makmanalp wrote:
       | Question - if large clusters are reporting that they're seeing
       | gains from using RDMA networks because communication overhead is
       | a bottleneck, how is it possible that this thing is not massively
       | bottlenecked running over a home network?
        
         | DistractionRect wrote:
         | I suspect that most of the devices you'd expect to find in your
         | consumer cluster are too small/slow to saturate the link.
         | 
         | Edit: it's also a matter of scale. You probably have a small
         | number of small/slow devices in a consumer network versus a lot
         | of large/fast devices in your enterprise cluster.
        
         | derefr wrote:
         | I haven't looked into exactly what this project is doing, but
         | here's my understanding:
         | 
         | Inference across O(N) pre-trained hidden layers isn't exactly
         | an "embarrassingly parallel" problem, but it _is_ an
         | "embarrassingly pipeline-able" problem (in the CPU sense of
         | "pipelining.") Each device can keep just one or a few layers
         | hot in their own VRAM; and also only needs to send and receive
         | one small embedding (<1MB) vector per timestep -- which is so
         | trivial that it's easily achievable in realtime even if all the
         | devices are on wi-fi, talking to the same router, in your
         | "noisy" apartment where 100 other neighbours are on the same
         | bands.
         | 
          | (To put it another way: running a single inference job has
         | more forgiving realtime latency+throughput requirements than
         | game streaming!)
         | 
         | Assuming that you have a model that's too big for any of your
         | home machines to individually hold; and that all you care about
         | is performance for single-concurrent-request inference on that
         | model -- then _in theory_ , you just need _one_ GPU of one node
         | of your homespun Beowulf GPU cluster to have enough VRAM to
         | keep the single largest layer of your model always-hot; and
         | then other smaller devices can handle keeping the smaller
         | layers always-hot. And the result _should_ be faster than
         | "overloading" that same model on that single largest-VRAM
         | device and having some layers spill to CPU, or worse yet,
         | having the GPU have to swap layers in and out repeatedly with
         | each inference step.
         | 
         | (Also, if you're wondering, in the case where a single
         | machine/node has multiple GPUs -- or a GPU+VRAM and also a
         | CPU+RAM! -- you can treat this as no different than if these
         | were multiple independent nodes, that just-so-happen to have a
         | very efficient pipeline communication channel between them. As
         | the VRAM+computation cost of running inference far outweighs
         | the communication overhead of forward propagation during
         | inference, a home-network inference-pipelining cluster
         | scheduler like this project, would still likely "schedule" the
         | model's layers purely in consideration of the properties of the
         | individual GPU+VRAM (or CPU+RAM), rather than bothering to care
         | about placement.)
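          | 
          | A toy version of that VRAM-only placement heuristic, with
          | every number invented, just to make the idea concrete:
          | 
          |     devices = {"gpu-24g": 24.0, "mac-16g": 16.0,
          |                "laptop-8g": 8.0}   # GB free per device
          |     layer_gb = [1.4] * 32          # per-layer memory cost
          | 
          |     order = iter(sorted(devices, key=devices.get,
          |                         reverse=True))
          |     dev = next(order)
          |     budget = devices[dev]
          |     placement = {}
          |     for i, cost in enumerate(layer_gb):
          |         while cost > budget:       # spill to the next device
          |             dev = next(order)
          |             budget = devices[dev]
          |         placement.setdefault(dev, []).append(i)
          |         budget -= cost
          |     for dev, ls in placement.items():
          |         print(dev, f"layers {ls[0]}-{ls[-1]}")
          | 
          | A real scheduler would surely be smarter, but this conveys
          | why placement can mostly ignore which box each GPU sits in.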
         | 
         | ---
         | 
         | That being said, AFAIK training _is_ "pipeline parallelizable"
         | exactly as inference is. And people training models _do_ do
         | this -- but almost always only across multiple top-of-the-line
         | GPUs in one machine; not across multiple machines.
         | 
         | When you think about what pipelining achieves for training --
         | all you get is either:
         | 
         | 1. the ability to use a bunch of small-aggregate-VRAM nodes to
         | achieve the aggregate training capacity of fewer, larger-
         | aggregate-VRAM nodes -- but with more power consumption =
         | higher OpEx; and where also, if you scale this to O(N), then
         | you're dumping a quadratic amount of layer-propagation data
         | (which is now both forward-prop _and_ backprop data, and
         | backprop data is bigger!) over what would likely be a shared
          | network just to make this work. (If it's _not_ a shared
          | network -- i.e. if it's Infiniband/other RDMA -- then why
          | did you spend all that CapEx for your network and not on
          | your GPUs!?)
         | 
         | 2. the ability to pipeline a bunch of large-aggregate-VRAM
         | nodes together to train a model that will then _never_ be able
         | to be deployed onto any single node in existence, but can
         | instead _only_ exist as a  "pipelined inference model" that
         | hogs O(log N) nodes of your cluster at a time for any inference
         | run. Which makes cluster scheduling hell (if you aren't just
         | permanently wiring the scheduler to treat O(log N)-node groups
         | as single "hyper-nodes"); makes it so that you'll never be able
         | to practically open-source the model in a way anybody but other
         | bigcorps could ever run it (if that's something you care
         | about); and very likely means you're cutting the concurrent-
         | inference-request-serving capacity of your huge expensive GPU
         | cluster by O(log N)... which the product team that allowed that
         | cluster to be budgeted is _really_ not gonna like.
         | 
         | That being said, I imagine at some point one of these
         | proprietary "Inference-as-a-Service" models _has_ been trained
         | at a layer size that puts it into pipelined-inference-only
          | territory, _temporarily_. Doing so would be the ML engineer's
         | equivalent to the CPU engineer's "we have no fundamentally
         | clever advance, so this quarter we'll just crank up the clock
         | frequency and deal with the higher TDP." (Heck, maybe GPT-4o is
         | one of these.)
         | 
         | ---
         | 
         | What people with GPU clusters _want_ , is 1. for the output of
         | the process to be a model that runs on a single (perhaps multi-
         | GPU) node; and 2. for the process itself to be mostly-shared-
         | nothing with as little cross-node communication burden as
         | possible (such that it's just a question of building highly
         | _internally_ communication-efficient nodes, not so much highly-
         | communication-efficient clusters.)
         | 
         | And both of those goals are achieved by sizing models so that
         | they fit within a single node; continuously fanning out streams
         | of training data to those nodes; and then periodically fanning
         | back in model-weights (or model-weight deltas) in an AllReduce
         | operation, to merge the learning of O(N) independently-training
         | nodes to become the new baseline for those nodes.
         | 
         | (If you'll note, this architecture doesn't put _any_ latency
         | requirements on the network, only some monstrous _throughput_
         | requirements [at the fan-in step] -- which makes it a _lot_
         | easier to design for.)
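          | 
          | A toy sketch of that fan-out/fan-in loop (random data and a
          | fake "gradient step", purely to show the shape of it):
          | 
          |     import numpy as np
          | 
          |     def local_step(w, shard):
          |         # pretend training step against this node's data
          |         return w + 0.1 * (shard.mean(axis=0) - w)
          | 
          |     n_nodes, dim = 4, 8
          |     w = np.zeros(dim)             # shared baseline weights
          |     for round_ in range(3):
          |         shards = [np.random.randn(64, dim)
          |                   for _ in range(n_nodes)]
          |         # fan-out: every node trains from the same baseline
          |         locals_ = [local_step(w, s) for s in shards]
          |         # fan-in: AllReduce-style average is the new baseline
          |         w = np.mean(locals_, axis=0)
          |     print(w)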
        
       | ulrischa wrote:
        | Does somebody know if it runs on a Raspberry Pi?
        
         | alexandercheema wrote:
         | It *should* but I haven't tried it. I will try it. Updated in
         | this issue:
         | 
          | We could also try Raspberry Pi + Coral USB TPU
          | (https://coral.ai/products/) - that might be a killer combo
          | for a super cheap home AI cluster.
        
           | alexandercheema wrote:
           | Issue link: https://github.com/exo-explore/exo/issues/11
        
       | dcreater wrote:
        | This is a great idea and user friendly as well. It has the
        | potential to convert multiple old devices overnight from
        | being useless. However, I wish they had provided some results
        | on tok/s and latency with some example setups.
        
         | alexandercheema wrote:
         | We didn't expect this to blow up so quickly. A lot of work
         | needs to be done on getting different setups working. I have
         | made an issue here: https://github.com/exo-
         | explore/exo/issues/11
        
           | DiogoSnows wrote:
            | This is great work! I will keep an eye on it (and maybe
            | even try to contribute). Looking back at the beginning of
            | Google, I think their use of hardware and a hardware-
            | agnostic platform likely contributed to supporting growth
            | at lower cost. We need more of that in the AI era.
        
             | alexandercheema wrote:
             | Thank you for the support! I agree on the cost point, and
             | personally I don't want to live in a world where all AI
             | runs on H100s in a giant datacenter controlled by one
             | company.
        
       | throwawaymaths wrote:
       | Is this sensible? Transformers are memory bandwidth bound.
       | Schlepping activations around your home network (which is liable
       | to be lossy) seems like it would result in atrocious TPS.
        
         | alexandercheema wrote:
         | "Transformers are memory bandwidth bound" - this is the precise
         | reason why this makes sense. If a model doesn't fit into memory
         | on a single device, it needs to be incrementally loaded into
         | memory (offloading), which is bottlenecked by memory bandwidth.
         | Splitting the model over multiple devices avoids this, instead
         | trading off for latency of communicating between nodes. The
         | network bandwidth requirements are minimal since only the
         | activations (intermediary embeddings) are passed between
         | devices. For Llama-3-8B these are ~10KB, for Llama-3-70B these
         | are ~32KB.
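          | 
          | Rough numbers to make the trade-off concrete (assumed fp16
          | weights and Llama-3-70B-ish sizes):
          | 
          |     params = 70e9
          |     weight_bytes = params * 2   # ~140 GB of fp16 weights
          |     act_bytes = 32 * 1024       # ~32 KB hidden state
          | 
          |     # offloading: weights that don't fit in memory must be
          |     # re-read every single token (worst case: all of them)
          |     print(f"offload per token : {weight_bytes / 1e9:.0f} GB")
          |     # pipelining: only the activation crosses devices
          |     print(f"pipeline per token: {act_bytes / 1024:.0f} KB")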
        
       | cess11 wrote:
       | I look forward to something similar being developed on top of
       | Bumblebee and Axon, which I expect is just around the corner.
       | Because, for me, Python does not spark joy.
        
         | alexandercheema wrote:
         | Repo author here. This sounds interesting. Could you elaborate
         | on the benefits of Bumblebee / Axon?
        
           | cess11 wrote:
            | They run on the BEAM, and there are related IoT platforms
            | like Nerves. I find that to be a much nicer runtime than
            | (C)Python.
           | 
           | Edit: I don't know where else to begin. It's a runtime that
           | has lightweight processes, excellent observability, absurdly
           | good fault tolerance, really nice programming languages and
           | so on. It's designed for distributed computing.
        
             | alexandercheema wrote:
             | Fascinating, will check this out! I wanted to focus on
             | Python first to build this quickly, test out ideas and
             | iterate.
             | 
             | This seems like a good option for a switch.
             | 
             | Do you know if any of these can run on Apple/Android
             | devices?
        
       | yjftsjthsd-h wrote:
        | Unfortunately I don't see _any_ licensing info, without which
        | I'm not touching it. Which is too bad since the idea is really
       | cool.
        
         | alexandercheema wrote:
          | Thanks for pointing that out. Fixed: https://github.com/exo-
          | explore/exo/blob/main/LICENSE
        
           | yjftsjthsd-h wrote:
           | Excellent, thank you:)
        
       | Jayakumark wrote:
        | Just got https://github.com/distantmagic/paddler working
        | across 2 machines on Windows for load balancing. This will be
        | next level and useful for running Llama 400B across multiple
        | machines. But it looks like Windows support is not there yet.
        
       | fudged71 wrote:
       | Since this is best over a local network, I wonder how easy you
       | could make the crowdsourcing aspect of this. How could you make
       | it simple enough for everyone that's physically in your office to
       | join a network to train overnight? Or get everyone at a
       | conference to scan a QR code to contribute to a domain specific
       | model.
        
         | alexandercheema wrote:
         | That's where we want to get eventually. There's a lot of work
         | that needs to be done but I'm confident we'll get there. Give
         | us 3 months and it'll be as simple as running Dropbox.
        
       | pkeasjsjd wrote:
        | It bothers me that they don't talk about security here; I
        | don't like it at all.
        
         | alexandercheema wrote:
         | You're right. The assumption right now is that you're running
         | on trusted devices on your own local network. I will add a
         | section in the README.
        
       ___________________________________________________________________
       (page generated 2024-07-16 23:00 UTC)