[HN Gopher] Using LLaMA with M1 Mac and Python 3.11
       ___________________________________________________________________
        
       Using LLaMA with M1 Mac and Python 3.11
        
       Author : datadeft
       Score  : 371 points
       Date   : 2023-03-12 17:00 UTC (5 hours ago)
        
 (HTM) web link (dev.l1x.be)
 (TXT) w3m dump (dev.l1x.be)
        
       | lxe wrote:
       | If anyone is interested in running it on windows and a GPU (30b
       | 4bit fits in a 3090), here's a guide:
       | https://gist.github.com/lxe/82eb87db25fdb75b92fa18a6d494ee3c
        
       | m3kw9 wrote:
        | Why not just provide a script and say "run this"?
        
       | dataspun wrote:
       | What's with the propaganda output from the example LLaMA prompt?
        
       | diimdeep wrote:
       | Remove "and Python 3.11" from title. Python used only for
       | converting model to llama.cpp project format, 3.10 or whatever is
       | fine.
       | 
        | Additionally, llama.cpp works fine on 10-year-old hardware that
        | supports AVX2.
        | 
        | I'm running llama.cpp right now on an ancient Intel i5 2013
        | MacBook with only 2 cores and 8 GB RAM - the 7B 4-bit model
        | loads in 8 seconds into 4.2 GB of RAM and gives 600 ms per
        | token.
        | 
        | btw: does anyone know how to disable swap per process on macOS?
        | Even though there is enough free RAM, macOS sometimes decides to
        | use swap instead.
        
         | metadat wrote:
          | Can you provide a link to the guide or steps you followed to
          | get this up and running? I have a physical Linux machine with
          | 300+ GB of RAM and would love to try out llama on it, but I'm
          | not sure where to get started to get it working with such a
          | configuration.
         | 
         | Edit: Thank you, @diimdeep!
        
           | [deleted]
        
           | diimdeep wrote:
           | Sure. You can get models with magnet link from here
           | https://github.com/shawwn/llama-dl/
           | 
           | To get running, just follow these steps
           | https://github.com/ggerganov/llama.cpp/#usage
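            | 
            | For anyone who wants the short version, those usage steps
            | boil down to roughly the following (the convert script name
            | and quantize arguments are from the llama.cpp README around
            | this time, so they may have changed since):
            | 
            |     git clone https://github.com/ggerganov/llama.cpp
            |     cd llama.cpp
            |     make
            |     # put the downloaded LLaMA weights under ./models, then:
            |     python3 convert-pth-to-ggml.py models/7B/ 1
            |     ./quantize ./models/7B/ggml-model-f16.bin \
            |         ./models/7B/ggml-model-q4_0.bin 2
            |     ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 \
            |         -p 'The first man on the moon was '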
        
             | HervalFreire wrote:
             | [dead]
        
         | zitterbewegung wrote:
          | I am able to run 65B on a MacBook Pro 14.2 with 64GB of RAM.
         | https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...
        
         | CharlesW wrote:
         | > _Remove "and Python 3.11" from title. Python used only for
         | converting model to llama.cpp project format, 3.10 or whatever
         | is fine._
         | 
         | As @rnosov notes elsewhere in the thread, this post has a
         | workaround for the PyTorch issue with Python 3.11, which is why
         | the "and Python 3.11" qualification is there.
        
           | ahoho wrote:
            | Do you know if there's a good reason to favor 3.11 over 3.10
            | for this use case?
        
             | CharlesW wrote:
             | I'm a Python neophyte, but I've read that Python 3.11 is
             | 10-60% faster than 3.10, so that may be a consideration.
        
               | simonw wrote:
               | In this particular case that doesn't matter, because the
               | only time you run Python is for a one-off conversion
               | against the model files.
               | 
               | That takes at most a minute to run, but once converted
               | you'll never need to run it again. Actual llama.cpp model
               | inference uses compiled C++ code with no Python involved
               | at all.
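                | 
                | Concretely, that one-off Python step is just the
                | conversion script in the llama.cpp repo, something like
                | the following (script name as of around this time, so it
                | may have been renamed since):
                | 
                |     python3 convert-pth-to-ggml.py models/7B/ 1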
        
             | e12e wrote:
             | https://endoflife.date/python ?
        
             | codetrotter wrote:
              | The real question is: which python3 version does current
              | macOS ship with?
             | 
             | Well, on my macOS Ventura 13.2.1 install, /usr/bin/python3
             | is Python 3.9.6, which may be too old?
             | 
             | But also, my custom installed python3 via homebrew is not
             | 3.11 either. My /opt/homebrew/bin/python3 is Python 3.10.9
             | 
             | MacBook Pro M1
        
       | wjessup wrote:
       | How is this post any different than the instructions on the
       | actual repo? https://github.com/ggerganov/llama.cpp
        
         | rnosov wrote:
         | The post has a workaround for the PyTorch issue with Python
         | 3.11. If you follow the repo instructions it will give you some
         | rather strange looking errors.
        
       | mark_l_watson wrote:
        | Excuse my laziness for not looking this up myself, but I have two
        | 8GB RAM M1 Macs. Which smaller LLM models will run with such a
        | small amount of memory? Any old GPT-2 models? HN user diimdeep
        | has commented here that he ran the article code and model on an
        | 8GB RAM M1 Mac, so maybe I will just try it.
       | 
       | I have had good luck in the past with Apple's TensorFlow tools
       | for M1 for building my own models.
        
         | simonw wrote:
         | The LLaMA 7B model will run fine on an 8GB M1.
        
         | recuter wrote:
         | Rather remarkably, Llama 7B and 13B run fine (about as fast as
         | ChatGPT/bing) if you follow the instructions in the posted
         | article.
        
       | foxandmouse wrote:
       | I've been seeing a lot of people talking about running language
       | models locally, but I'm not quite sure what the benefit is. Other
       | than for novelty or learning purposes, is there any reason why
       | someone would prefer to use an inferior language model on their
       | own machine instead of leveraging the power and efficiency of
       | cloud-based models?
        
         | notfed wrote:
         | To summarize other answers: (1) free (2) private (3)
         | censorship/ethics-free (4) customizable (5) doesn't require
         | Internet
        
         | delusional wrote:
         | For me I think it's exciting in a couple of different ways.
         | Most importantly, it's just way more hackable than these giant
         | frameworks. It's pretty cool to be able to read all the code,
         | down to first principles, to understand how computationally
         | simple this stuff really is.
        
         | seydor wrote:
          | I wish to keep all my chat logs, forever. This will be my alter
          | ego that will even survive me. It must be private and not on
          | someone else's computer.
          | 
          | But more importantly I want it uncensored. These tools are
          | useful for deep conversation, which hasn't existed online for
          | many years.
        
           | leobg wrote:
           | In the 1970s, people moved to Ashrams in India to lose their
           | ego. In the 2020s, people are anxious for AI to conserve it
           | beyond death. Quite a generational pendulum swing... :)
        
             | seydor wrote:
             | They took their notebooks with them. That's why we need
             | private models
        
         | [deleted]
        
         | holoduke wrote:
          | I am currently paying thousands per month for translations
          | (billions of words per month). If we only had a way to run a
          | ChatGPT-quality system locally, we could save a lot of money. I
          | am really impressed by the translation quality of these recent
          | AI models.
        
         | canadiantim wrote:
         | Yeah, locally your data doesn't leak out. So if you're using
         | the language model on any sensitive data you're probably going
         | to want local.
        
           | jstarfish wrote:
           | Also so the maintainer can't stealth-censor the model.
        
             | sebzim4500 wrote:
             | No, but whoever trains the weights can. Having said that,
             | if LLaMA has been censored, then Meta have done a poor job
             | of it: it is trivial to get it to say politically incorrect
             | things.
        
               | zirgs wrote:
                | Models can be extended, so if someone wants, they can add
                | all the censored stuff back. Sooner or later someone will
                | make a civitai for LLMs.
        
               | recuter wrote:
               | Can I prompt you to share some examples? ;)
        
               | lern_too_spel wrote:
               | See example in TFA.
        
               | smoldesu wrote:
               | Just copy-and-paste headlines from your favorite American
               | news outlet. It works great on GPT-J-Neo, so good that I
               | had to make a bot to process different Opinion headlines
               | from Fox and CNN's RSS feeds. Crank up the temperature if
               | you get dissatisfied and you'll really be able to smell
               | those neurons cooking.
        
           | skybrian wrote:
           | It seems like running it on an A100 in a datacenter would be
           | better, though? Unless you think cloud providers are logging
           | the outputs of programs that their customers run themselves.
        
         | bt1a wrote:
         | Inferior? "Cloud-based models"?
         | 
         | Not being reliant on a single entity is nice. I will accept not
         | being on the bleeding edge of proprietary models and slower
         | runs for the privacy and reliability of local execution.
        
         | smoldesu wrote:
         | I mean, after SVB caved-in I'm sure a lot of VC-backed App
         | Store devs were looking for something "magical" to lift their
         | moods. Local LLMs are nothing new (even on ARM) but make an
         | Apple Silicon-specific writeup and half the site will drop what
         | they're doing to discuss it.
        
         | superkuh wrote:
        | openai is expensive (ie, ~$25/mo for a gpt3 davinci IRC bot in a
        | relatively small channel that only gets used heavily a few hours
        | a day) and censored. And I'm not just talking about it refusing
        | to respond to controversial things - even innocuous topics are
        | blocked. If you try to use gpt3.5-turbo for 10x less cost, it is
        | so bad at censoring itself that it can't even pass a Turing
        | test. Plus there's the whole data collection and privacy issue.
         | 
         | I just wish these weren't all articles about how to run it on
         | proprietary mac setups. I'm still waiting for the guides on how
         | to run it on a real PC.
        
           | miloignis wrote:
           | The exact steps that work on a Mac should work on x64 Linux,
           | since the addition of AVX2 support! (Source - I did it last
           | night)
        
           | behnamoh wrote:
           | > I'm still waiting for the guides on how to run it on a real
           | PC.
           | 
           | Mac __is__ a real PC.
        
             | [deleted]
        
           | lolinder wrote:
           | The actual repo's instructions work perfectly without
           | modification under Linux and WSL2:
           | 
           | https://github.com/ggerganov/llama.cpp
        
           | zirgs wrote:
            | Here you go:
            | https://github.com/oobabooga/text-generation-webui/wiki/LLaM...
            | You'll need an Nvidia GPU with at least 8 GB of VRAM though.
        
             | mark_l_watson wrote:
              | Thanks! I have a Linux laptop with 16G RAM and a 10G Nvidia
              | 1070, so I might be good to go.
        
         | rollinDyno wrote:
         | I've been commuting for about 45 minutes on the subway and I
         | sometimes try to get work done in there. It'd be useful to be
         | able to get answers while offline.
        
         | realce wrote:
         | This is just a first generation right now, but the tuning and
         | efficiency hacks will be found that gets a very usable quality
         | out of smaller models.
         | 
         | The benefit is have a super-genius oracle in your pocket on-
         | demand, without Microsoft or Amazon or anyone else
         | eavesdropping on your use. Who wouldn't see the value in that?
         | 
         | In the coming age, this will be one of the few things that
         | could possibly keep the nightmare dystopia at bay in my
         | opinion.
        
         | cocktailpeanut wrote:
          | It's free. There's extremely cheap, and there's free. No matter
          | how extremely cheap something is, "free" is on a completely
          | different level and gives us a new assumption that enables a
          | lot of things that are not possible when each request is paid
          | for (no matter how cheap it is).
        
         | Der_Einzige wrote:
         | ChatGPT doesn't give you the full vocabulary probability
         | distribution, while locally running does. You need the full
         | probability distribution to do things like constrained text
         | generation, e.g. like this:
         | https://paperswithcode.com/paper/most-language-models-can-be...
        
       | suyash wrote:
       | How much disk space does it use?
        
         | simonw wrote:
         | 240GB for the initial model download, but once you convert the
         | models they are 4GB for the 7B one and 8GB for the 13B one (and
         | more for the others).
        
           | EMM_386 wrote:
           | This may be a dumb question, but how is this possible?
           | 
           | How can it have all of this information packed into 4GB? I
           | can't even imagine it being only 240GB.
           | 
           | These models have an unthinkable amount of information living
           | in them.
        
             | sltkr wrote:
             | The way to think about it is that training a neural network
             | is a form of compression that is very, very lossy. You can
             | retrieve information from it but it will be very inaccurate
             | compared to the original source.
             | 
             | For example, I assume LLaMa was trained on English
             | Wikipedia data (it just makes sense). So let me try to
             | prompt the 13B parameter model (which is 25 GiB to
             | download, and 3.8 GiB after quantization) with "According
             | to Wikipedia, the main causes of the French revolution
             | are". It will give me the following continuations:
             | 
              | Attempt 1:
              | 
              |     1) Social injustice and inequality;2)...
              |     i need to do a report about the french revolution for
              |     my history class. so far i have these three
              |     questions:... [end of text]
              | 
              | Attempt 2:
              | 
              |     1. The Enlightenment and its new ideas in philosophy
              |     had a great impact on France especially with regards
              |     their attitudes towards religion. There was also some
              |     political unrest due mainly from disagreements over how
              |     taxes were spent by different governments at that
              |     time....
              |     What Were Three Causes of the French Revolution? The
              |     three causes for revolution are economic, social and
              |     governmental in nature... [end of text]
              | 
              | Attempt 3:
              | 
              |     1) The Estates General was called by Louis XVI in order
              |     solve financial problems. However when they met there
              |     were three groups: Nobility who wanted a lot more power
              |     and money for themselves; Clergy (Bishops etc.) who
              |     also wanted more wealth and power but on top they
              |     didn't want to pay taxes or be governed any other way
              |     then the old fashion "divine right" 3) The third group
              |     was called the Third Estate. They were made up of
              |     middle class citizens, peasants, artisans etc... These
              |     people had no political power and wanted more equality
              |     in society
              | 
              | While the Wikipedia article has a section on causes that
              | starts with:
              | 
              |     The underlying causes of the French Revolution are
              |     usually attributed to the Ancien Regime's failure to
              |     manage social and economic inequality. Rapid population
              |     growth and the inability to adequately finance
              |     government debt resulted in economic depression,
              |     unemployment and high food prices. Combined with a
              |     regressive tax system and resistance to reform by the
              |     ruling elite, it resulted in a crisis Louis XVI proved
              |     unable to manage.
             | 
             | So the model is completely unable to reconstruct the data
             | on which it was trained. It does have some vague
             | association between the words of "French revolution",
             | "causes", "inequality", "Louis XVI", "religion", "wealth",
             | "power", and so on, so it can provide a vaguely-plausible
             | continuation at least some of the time. But it's clear that
             | a lot of information has been erased.
        
       | vishal0123 wrote:
        | Based on my limited runs, I think 4-bit quantization is
        | detrimental to the output quality:
        | 
        |     > ./main -m ~/Downloads/llama/7B/ggml-model-q4_0.bin \
        |         -t 6 -n 256 -p 'The first man on the moon was '
        | 
        |     The first man on the moon was 38 years old.
        | 
        |     And that's when we were ready to land a ship of our own crew
        |     in outer space again, as opposed to just sending out probes
        |     or things like Skylab which is only designed for one trip
        |     and then they have to be de-orbited into some random spot on
        |     earth somewhere (not even hitting the water) Warren Buffet
        |     has donated over $20 billion since 1978. His net worth today
        |     stands at more than a half trillion dollars ($53 Billiard).
        |     He's currently living in Omaha, NE as opposed to his earlier
        |     home of New York City/Berkshire Mountains area and he still
        |     lives like nothing changed except for being able to spend
        |     $20 billion.
        | 
        |     Social Security is now paying out more than it collects
        |     because people are dying... That means that we're living
        |     longer past when Social security was supposed to run dry
        |     (65) [end of text]
        
         | enduser wrote:
         | Yes, I have found 65B quantized to be more nonsensical than 13B
         | unquantized.
        
       | shocks wrote:
       | Anyone got the 65B model to work with llama.cpp? 7B worked fine
       | for me, but 30B and 65B just output garbage.
       | 
       | (On Linux with a 5800X and 64GB of RAM)
        
         | BinRoo wrote:
          | If you recently pulled, rerun make and regenerate the quantized
          | files - some recent commits broke backwards compatibility.
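          | 
          | In practice that means something along these lines (quantize
          | arguments as in the README of the time; for the larger,
          | multi-part models there may be several ggml-model-f16.bin.*
          | files to quantize):
          | 
          |     git pull
          |     make
          |     ./quantize ./models/7B/ggml-model-f16.bin \
          |         ./models/7B/ggml-model-q4_0.bin 2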
        
       | voytec wrote:
       | What's with the weird "2023/12/08" date?
        
         | datadeft wrote:
          | It was a typo. I fixed the URL. Thanks for pointing it out.
        
         | MeteorMarc wrote:
         | Yes, title should include (future).
        
         | canadiantim wrote:
         | Maybe it's a hot tip from the future?
         | 
         | Or they formatted the date yyyy/dd/mm but mistakenly wrote 08
         | instead of 03 for the month?
        
           | voytec wrote:
           | Tip from the future works for me due to the news about new
           | DeLorean[1]
           | 
           | [1] https://news.ycombinator.com/item?id=35116319
        
         | realce wrote:
         | Sorry to do this but 2+0+1+2+0+8 = 23
        
       | simonw wrote:
        | Neat - this uses the following to get a version of Torch that
        | works with Python 3.11:
        | 
        |     pip3 install --pre torch torchvision --extra-index-url \
        |       https://download.pytorch.org/whl/nightly/cpu
        | 
        | That's the reason I stuck with Python 3.10 in my write-up for
        | doing this: https://til.simonwillison.net/llms/llama-7b-m2
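        | 
        | If you do want to stay on 3.11, the full setup is roughly the
        | following (assuming a plain venv; sentencepiece is needed by the
        | conversion script, as noted elsewhere in the thread):
        | 
        |     python3.11 -m venv venv
        |     source venv/bin/activate
        |     pip3 install --pre torch torchvision --extra-index-url \
        |       https://download.pytorch.org/whl/nightly/cpu
        |     pip3 install sentencepiece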
        
         | [deleted]
        
         | tomp wrote:
          | You don't actually need torchvision. I just used:
          | 
          |     mamba create -n llama python==3.10 pytorch sentencepiece
        
       | simonw wrote:
        | Artem Andreenko on Twitter reports getting the 7B model running
        | on a 4GB Raspberry Pi! One token every ten seconds, but still,
        | wow. https://twitter.com/miolini/status/1634982361757790209
        
         | dmw_ng wrote:
          | Just wanted to say thanks for your effort in aggregating and
          | communicating a fast-moving and extremely interesting area;
          | I've been watching your output like a hawk recently.
        
       | irusensei wrote:
       | AFAIK torch doesn't work on 3.11 yet. It was not trivial to
       | install on current Fedora. Might have changed.
        
         | datadeft wrote:
         | Yes it does. The blog post has the proof. You can use the
         | nightly builds.
        
       | buzzier wrote:
       | [dead]
        
       | cloudking wrote:
        | How does LLaMA compare to GPT-3.5? Has anyone done side-by-side
        | comparisons?
        
         | KennyBlanken wrote:
          | The writeup includes example text where the model is fed a
          | sentence about George Washington and within half a sentence or
          | so goes unhinged and starts praising Trump...
         | 
         | Also, a reminder to folks that this model is not
         | conversationally trained and won't behave like ChatGPT; it
         | cannot take directions.
        
         | superkuh wrote:
          | Well, gpt3.5-turbo fails the Turing test due to the censorship
          | and legal-liability butt-covering openai bolted on, so almost
          | anything else is better. Now, compared to openai's gpt3 davinci
          | (text-davinci-003)... llama is much worse.
        
           | flangola7 wrote:
           | I thought LLaMA outscored GPT-3
        
             | simonw wrote:
             | GPT-3 is a very different model from GPT-3.5. My
             | understanding is that they were comparing LLaMA's
             | performance to benchmark scores published for the original
             | GPT-3, which came out in 2020 and had not yet had
             | instruction tuning, so was significantly harder to use.
        
               | flangola7 wrote:
               | I know, that is why I said GPT-3 (Davinci) not
               | GPT-3.5|ChatGPT.
        
               | simonw wrote:
               | Da Vinci 002 and 003 are actually classified as GPT 3.5
               | by OpenAI:
               | https://platform.openai.com/docs/models/gpt-3-5
               | 
               | ChatGPT is GPT-3.5 Turbo.
        
             | vletal wrote:
             | Hard to measure these days. The training sets are so large
             | they might contain leaks of test sets. Take these numbers
             | with a grain of salt.
        
               | code51 wrote:
                | Or... it could be that the Chinchilla study has
                | deficiencies in measuring model capabilities? Either that
                | or your explanation. Frankly, I don't think 13B is better
                | than GPT-3 (text-davinci-001, which I think is not RLHF -
                | but maybe better than the base model).
        
               | simonw wrote:
               | text-davinci-001 is currently classed as "GPT 3.5" by
               | OpenAI, and it did indeed have RLHF in the form of
               | instruction tuning:
               | https://openai.com/research/instruction-following
               | 
               | MY MISTAKE: 002 and 003 are 3.5, but 001 looks to have
               | pre-dated the InstructGPT work.
        
         | bt1a wrote:
         | The sampler needs some tuning, but the 65B model has impressive
         | output
         | 
         | https://twitter.com/theshawwn/status/1632569215348531201
        
         | tikkun wrote:
          | In short, my experience is that it's much worse, to the point
          | that I won't use LLaMA.
          | 
          | The main change needed seems to be InstructGPT-style tuning
          | (https://openai.com/research/instruction-following)
        
           | [deleted]
        
           | simonw wrote:
           | Yeah, it's MUCH harder to use because of the lack of tuning.
           | 
            | You have to lean on much older prompt engineering tricks -
            | there are a few initial tips in the LLaMA FAQ here:
            | https://github.com/facebookresearch/llama/blob/main/FAQ.md#2...
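            | 
            | The general trick with a base model is to phrase the task as
            | text to be completed rather than as an instruction, e.g.
            | something like this (the prompt here is just an illustration,
            | not one from the FAQ):
            | 
            |     ./main -m ./models/7B/ggml-model-q4_0.bin -n 128 \
            |       -p 'Q: What is the capital of France? A:'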
        
             | davidb_ wrote:
             | Are you getting useful content out of the 7B model? It goes
             | off the rails way too often for me to find it useful.
        
               | rnosov wrote:
               | You might want to tune the sampler. For example, set it
                | to a lower temperature. Also, the 4-bit RTN quantisation
                | seems to be messing up the model. Perhaps the GPTQ
                | quantisation will be much better.
        
               | spion wrote:
               | Use `--top_p 2 --top_k 40 --repeat_penalty 1.176 --temp
               | 0.7` with llama.cpp
        
               | datadeft wrote:
                | Not bad with these settings:
                | 
                |     ./main -m ./models/7B/ggml-model-q4_0.bin \
                |       --top_p 2 --top_k 40 \
                |       --repeat_penalty 1.176 \
                |       --temp 0.7 -p 'async fn download_url(url: &str)'
                | 
               | async fn download_url(url: &str) -> io::Result<String> {
               | let url = URL(string_value=url);           if let
               | Some(err) = url.verify() {} // nope, just skip the
               | downloading part           else match err == None {  //
               | works now             true => Ok(String::from(match
               | url.open("get")?{                 |res|
               | res.ok().expect_str(&url)?,                 |err:
               | io::Error| Err(io::ErrorKind(uint16_t::MAX as u8))),
               | false => Err(io::Error
        
               | yeeeloit wrote:
               | What fork are you using?
               | 
               | repeat_penalty is not an option.
        
       | phodo wrote:
       | What are some prompts that seem to be working on non-finetuned
       | models? (beyond what is listed in example.py)
        
       | tomp wrote:
        | Better instructions (less verbose, and they include the 30B
        | model): https://til.simonwillison.net/llms/llama-7b-m2
        | 
        | I'm running 13B on a MacBook Air M2 quite easily. Will try 30B,
        | but I probably won't be able to keep my browser open :/
       | 
       | shameless plug:
       | https://mobile.twitter.com/tomprimozic/status/16348774773100...
        
         | lalwanivikas wrote:
         | I have 65B running alright. But you'll have to lower your
         | expectations if you are used to ChatGPT.
         | 
         | https://twitter.com/LalwaniVikas/status/1634644323890282498
        
         | gorbypark wrote:
         | Give us an update on the 30B model! I have 13B running easily
         | on my M2 Air (24GB ram), just waiting until I'm on an unmetered
         | connection to download the 30B model and give it a go.
        
           | tomp wrote:
           | hm... well...
           | 
           | It definitely _runs_. It uses almost 20GB of RAM so I had to
           | exit my browser and VS Code to keep the memory usage down.
           | 
            | But it produces completely garbled output. Either there's a
            | bug in the program, or the tokens are different from the 13B
            | model's, or I performed the conversion wrong, or the 4-bit
            | quantization breaks it.
        
             | shocks wrote:
             | I'm also getting garbage out of 30B and 65B.
             | 
             | 30B just says "dotnetdotnetdotnet..."
        
       | rolleiflex wrote:
       | I'm following the instructions on the post from the original
       | owner of the repository involved here. It's at
       | https://til.simonwillison.net/llms/llama-7b-m2 and it is much
       | simpler. (no affiliation with author)
       | 
       | I'm currently running the 65B model just fine. It is a rather
       | surreal experience, a ghost in my shell indeed.
       | 
        | As an aside, I'm seeing an interesting behaviour with the `-t`
        | threads flag. I originally expected it to work like make's `-j`
        | flag, where it controls the number of parallel threads but the
        | total computation done is the same. What I'm seeing is that it
        | seems to change the fidelity of the output. At `-t 8` it gives
        | the fastest output, presumably since that is the number of
        | performance cores my M2 Max has. But up to `-t 12` the output
        | fidelity increases, even though the output drastically slows
        | down. I have 8 performance and 4 efficiency cores, so that makes
        | superficial sense. From `-t 13` onwards, performance drops off
        | so sharply that I effectively get no output at all.
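        | 
        | For anyone who wants to check this on their own machine, a rough
        | way to compare thread counts is to loop over `-t` with a fixed
        | prompt and time each run (the model path and prompt here are
        | only placeholders):
        | 
        |     for t in 4 8 12 13; do
        |       echo "threads: $t"
        |       time ./main -m ./models/65B/ggml-model-q4_0.bin -t $t \
        |         -n 64 -p 'The first man on the moon was '
        |     done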
        
         | gorbypark wrote:
         | That's interesting that the fidelity seems to change. I just
         | realized I had been running with `-t 8` even though I only have
         | a M2 MacBook Air (4 perf, 4 efficiency cores) and running with
         | `-t 4` speeds up 13B significantly. It's now doing ~160ms per
         | token versus ~300ms per token with the 8 cores settings. It's
         | hard to quantify exactly if it's changing the output quality
         | much, but I might do a subjective test with 5 or 10 runs on the
         | same prompt and see how often it's factual versus "nonsense".
        
       | inciampati wrote:
       | If you've got avx2 and enough RAM you can run these models on any
       | boring consumer laptop. Performance on a contemporary 16 vCPU
       | Ryzen is on par with the numbers I'm seeing out of the M1s that
       | all these bloggers are happy to note they're using :)
        
         | jeroenhd wrote:
         | I've tried the 7B model with 32GiB of RAM (and plenty of swap)
         | but my 10th gen Intel CPU just doesn't seem up to the task. For
         | some reason, the CPU based libraries only seem to use a single
         | thread and it takes forever to get any output.
        
           | inciampati wrote:
           | I'm wondering if there might be a problem with your compiler
           | setup? Do set -t to use more threads. I don't see improvement
           | past the number of real (not virtual) cores. But I'm seeing
           | about 100ms/token for 7B with -t 8.
        
           | dmw_ng wrote:
            | With llama.cpp, you might need to pass in -t to set the
            | thread count. What kind of OS / host environment are you
            | using? I noticed very little speedup with -t 16 and -t 32;
            | it's possible the code simply hasn't been tested with such
            | high core counts, or it's bumping into some structural
            | limitation of how llama.cpp is implemented.
        
         | moffkalast wrote:
         | > any boring consumer laptop
         | 
         | > enough RAM
         | 
         | Because boring consumer laptops are of course known for their
         | copious amounts of expandable RAM and not for having one socket
         | fitted with the minimum amount possible.
        
       | JimmyRuska wrote:
       | We need a Fabrice Bellard-like genius to make a tinyLLM that
       | makes the decent models work on 32gb ram
        
         | Q6T46nT668w6i3m wrote:
         | https://bellard.org/nncp/
        
       | Ologn wrote:
        | My Ubuntu desktop has 64 gigs of RAM, with a 12GB RTX 3060 card.
        | I have 4-bit 13B-parameter LLaMA running on it currently,
        | following these instructions:
        | https://github.com/oobabooga/text-generation-webui/wiki/LLaM...
        | They don't have 30B or 65B ready yet.
        | 
        | Might try other methods to do 30B, or switch to my M1 MacBook if
        | that's useful (as it said here). I don't have an immediate need
        | for it, just futzing with it currently.
        | 
        | I should note that the web link is to software for a Gradio text
        | generation web UI, reminiscent of Automatic1111.
        
       | syntaxing wrote:
        | Extremely tempted to replace my Mac Mini M1 (8GB RAM). If I do,
        | what's my best bet to future-proof for things like these? Would a
        | Mac Mini M2 with 24GB RAM do, or should I beef it up to an M1
        | Studio?
        
         | enduser wrote:
         | RAM is king as far as future proofing Apple Silicon.
         | 
         | Even a 128GB RAM M1 Ultra can't run 65B unquantized.
        
         | cjbprime wrote:
         | Best future proof might be to wait two months and get an M2 Mac
         | Pro.
        
       | rspoerri wrote:
        | I can't wait to get the 96GB M2 I ordered last week.
        | 
        | Maybe it could even run the 30B model?
        
         | rileyphone wrote:
         | Should be able to run 65B, people got it running on 64GB.
         | 
         | https://twitter.com/lawrencecchen/status/1634507648824676353
        
         | murkt wrote:
          | With 4-bit quantization it will take 15 GB, so it fits easily.
          | On 96 GB you can not only run the 30B model, you can even
          | finetune it. As I understand it, these models were trained in
          | float16, so the full 30B model takes about 60 GB of RAM (30B
          | parameters x 2 bytes each, versus roughly 0.5 bytes each at 4
          | bits).
        
           | trillic wrote:
           | So you're saying I could make the full model run on a 16 core
           | ryzen with 64GB of DDR4? I have an 8GB VRAM 3070 but based on
           | this thread it sounds like the CPU might have better perf due
           | to the RAM?
        
             | sebzim4500 wrote:
             | These are my observations from playing with this over the
             | weekend.
             | 
              | 1. There is no throughput benefit to running on GPU unless
              | you can fit all the weights in VRAM. Otherwise, moving the
              | weights eats up any benefit you get from the faster
              | compute.
              | 
              | 2. The quantized models do worse than non-quantized smaller
              | models, so currently they aren't worth using for most use
              | cases. My hope is that more sophisticated quantization
              | methods (like GPTQ) will resolve this.
             | 
             | 3. Much like using raw GPT-3, you need to put a lot of
             | thought into your prompts. You can really tell it hasn't
             | been 'aligned' or whatever the kids are calling it these
             | days.
        
           | orf wrote:
           | This might be naive, but couldn't you just mmap the weights
           | on an apple silicon MacBook? Why do you need to load the
           | entire set of weights into memory at once?
        
             | flangola7 wrote:
             | Each token is inferenced against the entire model. For the
             | largest model that means 60GB of data or at least 10
             | seconds per token on the fastest SSDs. Very heavy SSD wear
             | from that many read operations would quickly burn out even
             | enterprise drives too.
        
               | orf wrote:
               | SSDs don't wear from reading, only from writing.
               | 
                | Assuming a sensible, somewhat linear layout, using mmap
                | to map the weights would give you the ability to load a
                | lot into memory, with potentially fairly minimal page-in
                | overhead.
        
       | eternalban wrote:
       | Georgi Gerganov is something of a wonder. A few more .cpp drops
       | from him and we have fully local AI for the masses. Absolutely
       | amazing. Thank you Georgi!
        
         | amelius wrote:
         | I want to know what his compute setup looks like.
        
       ___________________________________________________________________
       (page generated 2023-03-12 23:00 UTC)