[HN Gopher] Using LLaMA with M1 Mac and Python 3.11
___________________________________________________________________
Using LLaMA with M1 Mac and Python 3.11
Author : datadeft
Score : 371 points
Date : 2023-03-12 17:00 UTC (5 hours ago)
(HTM) web link (dev.l1x.be)
(TXT) w3m dump (dev.l1x.be)
| lxe wrote:
| If anyone is interested in running it on windows and a GPU (30b
| 4bit fits in a 3090), here's a guide:
| https://gist.github.com/lxe/82eb87db25fdb75b92fa18a6d494ee3c
| m3kw9 wrote:
| Why not just have a script and say "run this"?
| dataspun wrote:
| What's with the propaganda output from the example LLaMA prompt?
| diimdeep wrote:
| Remove "and Python 3.11" from title. Python used only for
| converting model to llama.cpp project format, 3.10 or whatever is
| fine.
|
| Additionally, llama.cpp works fine with 10 y.o hardware that
| supports AVX2.
|
| I'm running llama.cpp right now on an ancient Intel i5 2013
| MacBook with only 2 cores and 8 GB RAM - 7B 4bit model loads in 8
| seconds to 4.2 GB RAM and gives 600 ms per token.
|
| btw: anyone knows how to disable swap per process in macOS ? even
| though there is enough free RAM, sometimes macOS decides to use
| swap instead.
| metadat wrote:
| Can you provide a link to the guide or steps you followed to get
| this up and running? I have a physical Linux machine with 300+ GB
| of RAM and would love to try out llama on it, but I'm not sure
| where to get started to get it working with such a configuration.
|
| Edit: Thank you, @diimdeep!
| [deleted]
| diimdeep wrote:
| Sure. You can get the models via magnet link from here:
| https://github.com/shawwn/llama-dl/
|
| To get running, just follow these steps:
| https://github.com/ggerganov/llama.cpp/#usage
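|
| Roughly, those usage steps boil down to something like this (a
| sketch based on the llama.cpp README at the time; script names and
| arguments may have changed since, and the downloaded weights are
| assumed to live under ./models):
|
|       git clone https://github.com/ggerganov/llama.cpp
|       cd llama.cpp && make
|       # one-off conversion of the PyTorch weights to ggml format
|       python3 convert-pth-to-ggml.py models/7B/ 1
|       # quantize the f16 model down to 4 bits
|       ./quantize ./models/7B/ggml-model-f16.bin \
|         ./models/7B/ggml-model-q4_0.bin 2
|       # run inference
|       ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 \
|         -p "Building a website can be done in"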
| HervalFreire wrote:
| [dead]
| zitterbewegung wrote:
| I am able to run 65B on a MacBook Pro 14.2 with 64GB of RAM.
| https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...
| CharlesW wrote:
| > _Remove "and Python 3.11" from title. Python used only for
| converting model to llama.cpp project format, 3.10 or whatever
| is fine._
|
| As @rnosov notes elsewhere in the thread, this post has a
| workaround for the PyTorch issue with Python 3.11, which is why
| the "and Python 3.11" qualification is there.
| ahoho wrote:
| Do you know if there's a good reason to favor 3.11 over 3.10 for
| this use case?
| CharlesW wrote:
| I'm a Python neophyte, but I've read that Python 3.11 is
| 10-60% faster than 3.10, so that may be a consideration.
| simonw wrote:
| In this particular case that doesn't matter, because the
| only time you run Python is for a one-off conversion
| against the model files.
|
| That takes at most a minute to run, but once converted
| you'll never need to run it again. Actual llama.cpp model
| inference uses compiled C++ code with no Python involved
| at all.
| e12e wrote:
| https://endoflife.date/python ?
| codetrotter wrote:
| The real question is: which python3 version does current macOS
| ship with?
|
| Well, on my macOS Ventura 13.2.1 install, /usr/bin/python3
| is Python 3.9.6, which may be too old?
|
| But also, my custom installed python3 via homebrew is not
| 3.11 either. My /opt/homebrew/bin/python3 is Python 3.10.9
|
| MacBook Pro M1
| wjessup wrote:
| How is this post any different than the instructions on the
| actual repo? https://github.com/ggerganov/llama.cpp
| rnosov wrote:
| The post has a workaround for the PyTorch issue with Python
| 3.11. If you follow the repo instructions it will give you some
| rather strange looking errors.
| mark_l_watson wrote:
| Excuse my laziness for not looking this up myself, but I have two
| 8G RAM M1 Macs. Which smaller LLM models will run with such a
| small amount of memory? Any old GPT-2 models? HN user diimdeep has
| commented here that he ran the article code and model on an 8G
| RAM M1 Mac, so maybe I will just try it.
|
| I have had good luck in the past with Apple's TensorFlow tools
| for M1 for building my own models.
| simonw wrote:
| The LLaMA 7B model will run fine on an 8GB M1.
| recuter wrote:
| Rather remarkably, Llama 7B and 13B run fine (about as fast as
| ChatGPT/bing) if you follow the instructions in the posted
| article.
| foxandmouse wrote:
| I've been seeing a lot of people talking about running language
| models locally, but I'm not quite sure what the benefit is. Other
| than for novelty or learning purposes, is there any reason why
| someone would prefer to use an inferior language model on their
| own machine instead of leveraging the power and efficiency of
| cloud-based models?
| notfed wrote:
| To summarize other answers: (1) free (2) private (3)
| censorship/ethics-free (4) customizable (5) doesn't require
| Internet
| delusional wrote:
| For me I think it's exciting in a couple of different ways.
| Most importantly, it's just way more hackable than these giant
| frameworks. It's pretty cool to be able to read all the code,
| down to first principles, to understand how computationally
| simple this stuff really is.
| seydor wrote:
| I wish to keep all my chat logs, forever. This will be my alter
| ego that will even survive me. It must be private and not on
| someone else's computer.
|
| But more importantly I want it uncensored. These tools are useful
| for deep conversation, which hasn't existed online for many years.
| leobg wrote:
| In the 1970s, people moved to Ashrams in India to lose their
| ego. In the 2020s, people are anxious for AI to conserve it
| beyond death. Quite a generational pendulum swing... :)
| seydor wrote:
| They took their notebooks with them. That's why we need
| private models
| [deleted]
| holoduke wrote:
| I am currently paying thousands per month for translations
| (billions of words per month). If only we could run a
| ChatGPT-quality system locally, we could save a lot of money. I am
| really impressed by the translation quality of these latest AI
| models.
| canadiantim wrote:
| Yeah, locally your data doesn't leak out. So if you're using
| the language model on any sensitive data you're probably going
| to want local.
| jstarfish wrote:
| Also so the maintainer can't stealth-censor the model.
| sebzim4500 wrote:
| No, but whoever trains the weights can. Having said that,
| if LLaMA has been censored, then Meta have done a poor job
| of it: it is trivial to get it to say politically incorrect
| things.
| zirgs wrote:
| Models can be extended, so if someone wants, they can add all the
| censored stuff back. Sooner or later someone will make a civitai
| for LLMs.
| recuter wrote:
| Can I prompt you to share some examples? ;)
| lern_too_spel wrote:
| See example in TFA.
| smoldesu wrote:
| Just copy-and-paste headlines from your favorite American
| news outlet. It works great on GPT-J-Neo, so good that I
| had to make a bot to process different Opinion headlines
| from Fox and CNN's RSS feeds. Crank up the temperature if
| you get dissatisfied and you'll really be able to smell
| those neurons cooking.
| skybrian wrote:
| It seems like running it on an A100 in a datacenter would be
| better, though? Unless you think cloud providers are logging
| the outputs of programs that their customers run themselves.
| bt1a wrote:
| Inferior? "Cloud-based models"?
|
| Not being reliant on a single entity is nice. I will accept not
| being on the bleeding edge of proprietary models and slower
| runs for the privacy and reliability of local execution.
| smoldesu wrote:
| I mean, after SVB caved-in I'm sure a lot of VC-backed App
| Store devs were looking for something "magical" to lift their
| moods. Local LLMs are nothing new (even on ARM) but make an
| Apple Silicon-specific writeup and half the site will drop what
| they're doing to discuss it.
| superkuh wrote:
| openai is expensive (i.e., ~$25/mo for a gpt3 davinci IRC bot in a
| relatively small channel that only gets used heavily a few hours a
| day) and censored. And I'm not just talking about it refusing to
| respond to controversial things; even innocuous topics are
| blocked. If you try to use gpt3.5-turbo at 10x less cost, it is so
| bad at censoring itself that it can't even pass a Turing test.
| Plus there's the whole data collection and privacy issue.
|
| I just wish these weren't all articles about how to run it on
| proprietary Mac setups. I'm still waiting for the guides on how to
| run it on a real PC.
| miloignis wrote:
| The exact steps that work on a Mac should work on x64 Linux,
| since the addition of AVX2 support! (Source - I did it last
| night)
| behnamoh wrote:
| > I'm still waiting for the guides on how to run it on a real
| PC.
|
| Mac __is__ a real PC.
| [deleted]
| lolinder wrote:
| The actual repo's instructions work perfectly without
| modification under Linux and WSL2:
|
| https://github.com/ggerganov/llama.cpp
| zirgs wrote:
| https://github.com/oobabooga/text-generation-webui/wiki/LLaM...
| Here you go. You'll need an Nvidia GPU with at least 8 GB of VRAM
| though.
| mark_l_watson wrote:
| Thanks! I have a Linux laptop with 16G RAM and a 10G Nvidia 1070,
| so I might be good to go.
| rollinDyno wrote:
| I've been commuting for about 45 minutes on the subway and I
| sometimes try to get work done in there. It'd be useful to be
| able to get answers while offline.
| realce wrote:
| This is just a first generation right now, but tuning and
| efficiency hacks will be found that get very usable quality out of
| smaller models.
|
| The benefit is having a super-genius oracle in your pocket
| on-demand, without Microsoft or Amazon or anyone else
| eavesdropping on your use. Who wouldn't see the value in that?
|
| In the coming age, this will be one of the few things that
| could possibly keep the nightmare dystopia at bay in my
| opinion.
| cocktailpeanut wrote:
| It's free. There's extremely cheap, and there's free. No matter
| how cheap something is, "free" is on a completely different level,
| and it enables a lot of things that are not possible when each
| request is paid for (no matter how cheap it is).
| Der_Einzige wrote:
| ChatGPT doesn't give you the full vocabulary probability
| distribution, while locally running does. You need the full
| probability distribution to do things like constrained text
| generation, e.g. like this:
| https://paperswithcode.com/paper/most-language-models-can-be...
| suyash wrote:
| How much disk space does it use?
| simonw wrote:
| 240GB for the initial model download, but once you convert the
| models they are 4GB for the 7B one and 8GB for the 13B one (and
| more for the others).
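|
| (Back-of-envelope, assuming 4-bit weights plus a little overhead:
| 7B params x 0.5 bytes is roughly 3.5 GB and 13B x 0.5 bytes is
| roughly 6.5 GB, which lines up with the ~4 GB and ~8 GB files.)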
| EMM_386 wrote:
| This may be a dumb question, but how is this possible?
|
| How can it have all of this information packed into 4GB? I
| can't even imagine it being only 240GB.
|
| These models have an unthinkable amount of information living
| in them.
| sltkr wrote:
| The way to think about it is that training a neural network
| is a form of compression that is very, very lossy. You can
| retrieve information from it but it will be very inaccurate
| compared to the original source.
|
| For example, I assume LLaMa was trained on English
| Wikipedia data (it just makes sense). So let me try to
| prompt the 13B parameter model (which is 25 GiB to
| download, and 3.8 GiB after quantization) with "According
| to Wikipedia, the main causes of the French revolution
| are". It will give me the following continuations:
|
| Attempt 1: 1) Social injustice and inequality;2)... i need to do a
| report about the french revolution for my history class. so far i
| have these three questions:... [end of text]
|
| Attempt 2: 1. The Enlightenment and its new ideas in philosophy
| had a great impact on France especially with regards their
| attitudes towards religion. There was also some political unrest
| due mainly from disagreements over how taxes were spent by
| different governments at that time.... What Were Three Causes of
| the French Revolution? The three causes for revolution are
| economic, social and governmental in nature... [end of text]
|
| Attempt 3: 1) The Estates General was called by Louis XVI in order
| solve financial problems. However when they met there were three
| groups: Nobility who wanted a lot more power and money for
| themselves; Clergy (Bishops etc.) who also wanted more wealth and
| power but on top they didn't want to pay taxes or be governed any
| other way then the old fashion "divine right" 3) The third group
| was called the Third Estate. They were made up of middle class
| citizens, peasants, artisans etc... These people had no political
| power and wanted more equality in society
|
| While the Wikipedia article has a section on causes that starts
| with: The underlying causes of the French Revolution are usually
| attributed to the Ancien Regime's failure to manage social and
| economic inequality. Rapid population growth and the inability to
| adequately finance government debt resulted in economic
| depression, unemployment and high food prices. Combined with a
| regressive tax system and resistance to reform by the ruling
| elite, it resulted in a crisis Louis XVI proved unable to manage.
|
| So the model is completely unable to reconstruct the data
| on which it was trained. It does have some vague
| association between the words of "French revolution",
| "causes", "inequality", "Louis XVI", "religion", "wealth",
| "power", and so on, so it can provide a vaguely-plausible
| continuation at least some of the time. But it's clear that
| a lot of information has been erased.
| vishal0123 wrote:
| Based on my limited runs, I think 4-bit quantization is
| detrimental to the output quality:
|
|       ./main -m ~/Downloads/llama/7B/ggml-model-q4_0.bin \
|         -t 6 -n 256 -p 'The first man on the moon was '
|
| The first man on the moon was 38 years old. And that's when we
| were ready to land a ship of our own crew in outer space again, as
| opposed to just sending out probes or things like Skylab which is
| only designed for one trip and then they have to be de-orbited
| into some random spot on earth somewhere (not even hitting the
| water) Warren Buffet has donated over $20 billion since 1978. His
| net worth today stands at more than a half trillion dollars ($53
| Billiard). He's currently living in Omaha, NE as opposed to his
| earlier home of New York City/Berkshire Mountains area and he
| still lives like nothing changed except for being able to spend
| $20 billion. Social Security is now paying out more than it
| collects because people are dying... That means that we're living
| longer past when Social security was supposed to run dry (65)
| [end of text]
| enduser wrote:
| Yes, I have found 65B quantized to be more nonsensical than 13B
| unquantized.
| shocks wrote:
| Anyone got the 65B model to work with llama.cpp? 7B worked fine
| for me, but 30B and 65B just output garbage.
|
| (On Linux with a 5800X and 64GB of RAM)
| BinRoo wrote:
| If you recently pulled, rerun make and regenerate the quantized
| files, because some recent commits broke backwards compatibility.
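|
| Concretely, something along these lines (a sketch, not from the
| parent comment - adjust the model directory to whichever size
| you're using; the larger models are split into several
| ggml-model-f16.bin* parts and each part needs re-quantizing):
|
|       git pull
|       make clean && make
|       python3 convert-pth-to-ggml.py models/7B/ 1
|       ./quantize ./models/7B/ggml-model-f16.bin \
|         ./models/7B/ggml-model-q4_0.bin 2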
| voytec wrote:
| What's with the weird "2023/12/08" date?
| datadeft wrote:
| It was a typo. I fixed the URL. Thanks for pointing it out.
| MeteorMarc wrote:
| Yes, title should include (future).
| canadiantim wrote:
| Maybe it's a hot tip from the future?
|
| Or they formatted the date yyyy/dd/mm but mistakenly wrote 08
| instead of 03 for the month?
| voytec wrote:
| Tip from the future works for me due to the news about new
| DeLorean[1]
|
| [1] https://news.ycombinator.com/item?id=35116319
| realce wrote:
| Sorry to do this but 2+0+1+2+0+8 = 23
| simonw wrote:
| Neat - this uses the following to get a version of Torch that
| works with Python 3.11:
|
|       pip3 install --pre torch torchvision --extra-index-url \
|         https://download.pytorch.org/whl/nightly/cpu
|
| That's the reason I stuck with Python 3.10 in my write-up for
| doing this: https://til.simonwillison.net/llms/llama-7b-m2
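|
| (If you want to sanity-check that the nightly wheel actually
| imports under 3.11 - a hypothetical check, not something from the
| post - something like:
|       python3.11 -c 'import torch; print(torch.__version__)'
| should print a nightly dev version string.)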
| [deleted]
| tomp wrote:
| You don't actually need torchvision. I just used:
|
|       mamba create -n llama python==3.10 pytorch sentencepiece
| simonw wrote:
| Artem Andreenko on Twitter reports getting the 7B model running
| on a 4GB RaspberryPi! One token every ten seconds, but still,
| wow. https://twitter.com/miolini/status/1634982361757790209
| dmw_ng wrote:
| just wanted to say thanks for your effort in aggregating and
| communicating a fast moving and extremely interesting area,
| have been watching your output like a hawk recently
| irusensei wrote:
| AFAIK torch doesn't work on 3.11 yet. It was not trivial to
| install on current Fedora. Might have changed.
| datadeft wrote:
| Yes it does. The blog post has the proof. You can use the
| nightly builds.
| buzzier wrote:
| [dead]
| cloudking wrote:
| How does LLaMA compare to GPT-3.5? Has anyone done side-by-side
| comparisons?
| KennyBlanken wrote:
| The writeup includes example text where the algorithm is fed a
| sentence starting about George Washington and within half a
| sentence or so goes unhinged and starts praising Trump...
|
| Also, a reminder to folks that this model is not
| conversationally trained and won't behave like ChatGPT; it
| cannot take directions.
| superkuh wrote:
| Well, gpt3.5-turbo fails the Turing test due to the censorship and
| legal-liability butt-covering OpenAI bolted on, so almost anything
| else is better. Now, compared to OpenAI's gpt3 davinci
| (text-davinci-003)... llama is much worse.
| flangola7 wrote:
| I thought LLaMA outscored GPT-3
| simonw wrote:
| GPT-3 is a very different model from GPT-3.5. My
| understanding is that they were comparing LLaMA's
| performance to benchmark scores published for the original
| GPT-3, which came out in 2020 and had not yet had
| instruction tuning, so was significantly harder to use.
| flangola7 wrote:
| I know, that is why I said GPT-3 (Davinci) not
| GPT-3.5|ChatGPT.
| simonw wrote:
| Da Vinci 002 and 003 are actually classified as GPT 3.5
| by OpenAI:
| https://platform.openai.com/docs/models/gpt-3-5
|
| ChatGPT is GPT-3.5 Turbo.
| vletal wrote:
| Hard to measure these days. The training sets are so large
| they might contain leaks of test sets. Take these numbers
| with a grain of salt.
| code51 wrote:
| Or... it could be that the Chinchilla study has deficiencies in
| measuring the capabilities of models? Either that or your
| explanation. Frankly, I don't think 13B is better than GPT-3
| (text-davinci-001, which I think is not RLHF - but maybe better
| than the base model).
| simonw wrote:
| text-davinci-001 is currently classed as "GPT 3.5" by
| OpenAI, and it did indeed have RLHF in the form of
| instruction tuning:
| https://openai.com/research/instruction-following
|
| MY MISTAKE: 002 and 003 are 3.5, but 001 looks to have
| pre-dated the InstructGPT work.
| bt1a wrote:
| The sampler needs some tuning, but the 65B model has impressive
| output
|
| https://twitter.com/theshawwn/status/1632569215348531201
| tikkun wrote:
| In short, my experience is that it's much worse to the point
| that I won't use LLaMA.
|
| The main change needed seems to be InstructGPT style tuning
| (https://openai.com/research/instruction-following)
| [deleted]
| simonw wrote:
| Yeah, it's MUCH harder to use because of the lack of tuning.
|
| You have to lean on much older prompt engineering tricks -
| there are a few initial tips in the LLaMA FAQ here:
| https://github.com/facebookresearch/llama/blob/main/FAQ.md#2...
| davidb_ wrote:
| Are you getting useful content out of the 7B model? It goes
| off the rails way too often for me to find it useful.
| rnosov wrote:
| You might want to tune the sampler. For example, set it to a
| lower temperature. Also, the 4-bit RTN quantisation seems to be
| messing up the model. Perhaps the GPTQ quantisation will be much
| better.
| spion wrote:
| Use `--top_p 2 --top_k 40 --repeat_penalty 1.176 --temp
| 0.7` with llama.cpp
| datadeft wrote:
| Not bad with these settings:
|
|       ./main -m ./models/7B/ggml-model-q4_0.bin \
|         --top_p 2 --top_k 40 \
|         --repeat_penalty 1.176 \
|         --temp 0.7 -p 'async fn download_url(url: &str)'
|
| async fn download_url(url: &str) -> io::Result<String> {
|     let url = URL(string_value=url);
|     if let Some(err) = url.verify() {} // nope, just skip the downloading part
|     else match err == None { // works now
|         true => Ok(String::from(match url.open("get")?{
|             |res| res.ok().expect_str(&url)?,
|             |err: io::Error| Err(io::ErrorKind(uint16_t::MAX as u8))),
|         false => Err(io::Error
| yeeeloit wrote:
| What fork are you using?
|
| repeat_penalty is not an option.
| phodo wrote:
| What are some prompts that seem to be working on non-finetuned
| models? (beyond what is listed in example.py)
| tomp wrote:
| Better instructions (less verbose, and they include the 30B
| model): https://til.simonwillison.net/llms/llama-7b-m2
|
| I'm running 13B on MacBook Air M2 quite easily. Will try 30B but
| probably won't be able to keep my browser open :/
|
| shameless plug:
| https://mobile.twitter.com/tomprimozic/status/16348774773100...
| lalwanivikas wrote:
| I have 65B running alright. But you'll have to lower your
| expectations if you are used to ChatGPT.
|
| https://twitter.com/LalwaniVikas/status/1634644323890282498
| gorbypark wrote:
| Give us an update on the 30B model! I have 13B running easily
| on my M2 Air (24GB ram), just waiting until I'm on an unmetered
| connection to download the 30B model and give it a go.
| tomp wrote:
| hm... well...
|
| It definitely _runs_. It uses almost 20GB of RAM so I had to
| exit my browser and VS Code to keep the memory usage down.
|
| But it produces completely garbled output. Either there's a bug in
| the program, or the tokens are different to the 13B model, or I
| performed the conversion wrong, or the 4-bit quantization breaks
| it.
| shocks wrote:
| I'm also getting garbage out of 30B and 65B.
|
| 30B just says "dotnetdotnetdotnet..."
| rolleiflex wrote:
| I'm following the instructions on the post from the original
| owner of the repository involved here. It's at
| https://til.simonwillison.net/llms/llama-7b-m2 and it is much
| simpler. (no affiliation with author)
|
| I'm currently running the 65B model just fine. It is a rather
| surreal experience, a ghost in my shell indeed.
|
| As an aside, I'm seeing interesting behaviour with the `-t`
| threads flag. I originally expected it to work like make's `-j`
| flag, where it controls the number of parallel threads but the
| total computation done stays the same. What I'm seeing instead is
| that it seems to change the fidelity of the output. At `-t 8` it
| has the fastest output, presumably since that is the number of
| performance cores my M2 Max has. But up to `-t 12` the output
| fidelity increases, even though the output drastically slows
| down. I have 8 performance and 4 efficiency cores, so that makes
| superficial sense. From `-t 13` onwards, performance drops off so
| sharply that I effectively no longer get output.
| gorbypark wrote:
| That's interesting that the fidelity seems to change. I just
| realized I had been running with `-t 8` even though I only have
| a M2 MacBook Air (4 perf, 4 efficiency cores) and running with
| `-t 4` speeds up 13B significantly. It's now doing ~160ms per
| token versus ~300ms per token with the 8-core setting. It's
| hard to quantify exactly if it's changing the output quality
| much, but I might do a subjective test with 5 or 10 runs on the
| same prompt and see how often it's factual versus "nonsense".
| inciampati wrote:
| If you've got AVX2 and enough RAM you can run these models on any
| boring consumer laptop. Performance on a contemporary 16 vCPU
| Ryzen is on par with the numbers I'm seeing out of the M1s that
| all these bloggers are happy to note they're using :)
| jeroenhd wrote:
| I've tried the 7B model with 32GiB of RAM (and plenty of swap)
| but my 10th gen Intel CPU just doesn't seem up to the task. For
| some reason, the CPU based libraries only seem to use a single
| thread and it takes forever to get any output.
| inciampati wrote:
| I'm wondering if there might be a problem with your compiler
| setup? Do set -t to use more threads. I don't see improvement
| past the number of real (not virtual) cores. But I'm seeing
| about 100ms/token for 7B with -t 8.
| dmw_ng wrote:
| With llama.cpp, you might need to pass in -t to set the thread
| count. What kind of OS / host environment are you using? I
| noticed very little speedup when using -t 16 and -t 32; it's
| possible the code simply hasn't been tested with such high core
| counts, or it's bumping into some structural limitation of how
| llama.cpp is implemented.
| moffkalast wrote:
| > any boring consumer laptop
|
| > enough RAM
|
| Because boring consumer laptops are of course known for their
| copious amounts of expandable RAM and not for having one socket
| fitted with the minimum amount possible.
| JimmyRuska wrote:
| We need a Fabrice Bellard-like genius to make a tinyLLM that
| makes the decent models work on 32 GB of RAM.
| Q6T46nT668w6i3m wrote:
| https://bellard.org/nncp/
| Ologn wrote:
| My Ubuntu desktop has 64 gigs of RAM, with a 12G RTX 3060 card. I
| have 4-bit 13B parameter LLaMA running on it currently, following
| these instructions -
| https://github.com/oobabooga/text-generation-webui/wiki/LLaM...
| They don't have 30B or 65B ready yet.
|
| Might try other methods to do 30B, or switch to my M1 Macbook if
| that's useful (as it said here). Don't have an immediate need for
| it, just futzing with it currently.
|
| I should note that web link is to software for a gradio text
| generation web UI, reminiscent of Automatic1111.
| syntaxing wrote:
| Extremely tempted to replace my Mac Mini M1 (8GB RAM). If I do,
| what's my best bet to future-proof for things like this? Would a
| Mac Mini M2 with 24GB RAM do, or should I beef it up to an M1
| Studio?
| enduser wrote:
| RAM is king as far as future-proofing Apple Silicon goes.
|
| Even a 128GB RAM M1 Ultra can't run 65B unquantized.
| cjbprime wrote:
| Best future proof might be to wait two months and get an M2 Mac
| Pro.
| rspoerri wrote:
| I cant wait to get my 96gb m2 i ordered last week.
|
| Maybe it could even run the 30b model?
| rileyphone wrote:
| Should be able to run 65B, people got it running on 64GB.
|
| https://twitter.com/lawrencecchen/status/1634507648824676353
| murkt wrote:
| With 4-bit quantization it will take 15 GB, so it fits easily. On
| 96 GB you can not only run the 30B model, you can even finetune
| it. As I understand it, these models were trained in float16, so
| the full 30B model takes 60 GB of RAM.
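|
| (Rough math: 30B params x 2 bytes (float16) is about 60 GB,
| versus 30B x 0.5 bytes at 4-bit, about 15 GB; 65B similarly drops
| from roughly 130 GB to roughly 33 GB quantized, before runtime
| overhead.)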
| trillic wrote:
| So you're saying I could make the full model run on a 16-core
| Ryzen with 64GB of DDR4? I have an 8GB VRAM 3070, but based on
| this thread it sounds like the CPU might have better perf due to
| the RAM?
| sebzim4500 wrote:
| These are my observations from playing with this over the
| weekend.
|
| 1. There is no throughput benefit to running on GPU unless you
| can fit all the weights in VRAM. Otherwise, moving the weights
| around eats up any benefit you get from the faster compute.
|
| 2. The quantized models do worse than non-quantized smaller
| models, so currently they aren't worth using for most use cases.
| My hope is that more sophisticated quantization methods (like
| GPTQ) will resolve this.
|
| 3. Much like using raw GPT-3, you need to put a lot of
| thought into your prompts. You can really tell it hasn't
| been 'aligned' or whatever the kids are calling it these
| days.
| orf wrote:
| This might be naive, but couldn't you just mmap the weights
| on an apple silicon MacBook? Why do you need to load the
| entire set of weights into memory at once?
| flangola7 wrote:
| Each token is inferenced against the entire model. For the
| largest model that means 60GB of data or at least 10
| seconds per token on the fastest SSDs. Very heavy SSD wear
| from that many read operations would quickly burn out even
| enterprise drives too.
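|
| (That figure is roughly 60 GB of weights read per token divided
| by the ~6 GB/s sequential read speed of a fast NVMe drive.)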
| orf wrote:
| SSDs don't wear from reading, only from writing.
|
| Assuming a sensible, somewhat linear layout, using mmap to map
| the weights would let you keep a lot in memory, with potentially
| fairly minimal page-in overhead.
| eternalban wrote:
| Georgi Gerganov is something of a wonder. A few more .cpp drops
| from him and we have fully local AI for the masses. Absolutely
| amazing. Thank you Georgi!
| amelius wrote:
| I want to know what his compute setup looks like.
___________________________________________________________________
(page generated 2023-03-12 23:00 UTC)