[HN Gopher] Run Llama 13B with a 6GB graphics card
___________________________________________________________________
Run Llama 13B with a 6GB graphics card
Author : rain1
Score : 411 points
Date : 2023-05-14 12:35 UTC (10 hours ago)
(HTM) web link (gist.github.com)
(TXT) w3m dump (gist.github.com)
| tikkun wrote:
| See also:
|
| https://www.reddit.com/r/LocalLLaMA/comments/13fnyah/you_guy...
|
| https://chat.lmsys.org/?arena (Click 'leaderboard')
| alg_fun wrote:
| Wouldn't it be faster to use RAM as swap space for VRAM?
| naillo wrote:
| This is cool, but are people actually getting stuff done with
| these models? I'm enthusiastic about their potential too, but
| after playing with one for a day I'm at a loss for what to use
| it for at this point.
| Joeri wrote:
| My experience is that even the "toy" 7B models can do pretty
| amazing things, but they're very sensitive to prompting in the
| right way to force the model down the right pathways. My
| assumption is that the larger models have more redundancy and
| therefore can be prompted less precisely while still going down
| the right pathway.
|
| For example, the WizardLM model (5_1 quantized, as 4_x
| quantization makes the 7B models perform terribly):
|
| Prompt: _I went to the market and bought 10 apples. I gave 2 to
| my neighbor and 2 to my sister. Then I bought 5 more and ate 1.
| How many apples do I have?_
|
| Response: Let me check that for you! You had 9
| apples left after giving away 4, so you still have 9 - 4 = 5
| apples left.
|
| Prompt: _I went to the market and bought 10 apples. I gave 2 to
| my neighbor and 2 to my sister. Then I bought 5 more and ate 1.
| How many apples do I have? Write out any calculations in full.
| List the steps used to come to the final result._
|
| Response: Sure, let's work through this problem
| together! Here are the steps we can use:
| Step 1: Start with the initial number of apples you bought (10).
| Step 2: Subtract the amount given to your neighbor and sister
| from the total (2 + 2 = 4). This gives us a new total of 6
| apples.
| Step 3: Add the additional 5 apples you bought later on to the
| remaining 6 apples, giving us a final total of 11 apples.
| Step 4: Finally, subtract the apple you ate from the total (1),
| leaving us with 10 apples in total. So the answer is 10 apples.
| m3kw9 wrote:
| For the most part, if you actually need stuff done, you can
| afford $20 for OpenAI's proven models.
|
| These local projects are great because maybe eventually they
| will have an equivalent model that can be run on cheap parts.
| Beltiras wrote:
| The knowledge cutoff date is brutal for some applications.
| m3kw9 wrote:
| I wonder how ingesting more and more data will affect the
| number of parameters -- is it going to keep getting bigger?
| rolisz wrote:
| I don't think that the current models are at "knowledge
| capacity". So far all evidence points to training the same
| size model on more data giving better results.
| cubefox wrote:
| Both increasing the number of parameters and the number of
| training tokens improves results (more precisely: lowers
| training loss), and both cost computing power. To optimally
| improve loss per unit of training compute, model size and
| training tokens should be scaled up equally. That's the
| Chinchilla scaling law. (Though low loss is not always the
| same as good results; data quality also matters.)
|
| Further reading: https://dynomight.net/scaling/
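|
| A back-of-the-envelope sketch of that rule of thumb, with my
| own illustrative numbers (the C ~= 6*N*D cost approximation
| and the roughly-20-tokens-per-parameter ratio are rough
| figures, not the paper's exact fit):
|
|   # Rough compute-optimal sizing sketch (illustrative only).
|   # Training compute C ~= 6 * N * D FLOPs, and Chinchilla found
|   # params (N) and tokens (D) should scale together, D ~= 20 * N.
|   def chinchilla_optimal(compute_flops):
|       n_params = (compute_flops / 120) ** 0.5  # solve C = 6*N*(20*N)
|       n_tokens = 20 * n_params
|       return n_params, n_tokens
|
|   # A ~1e23 FLOP budget suggests roughly a 29B-parameter model
|   # trained on roughly 580B tokens.
|   print(chinchilla_optimal(1e23))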
| snovv_crash wrote:
| An interesting corollary of this is that if you want to
| reduce the model size you can compensate by training for
| longer to achieve the same accuracy. Depending on your
| training:inference ratio this may be the better trade-off
| globally, reducing your total compute costs or even just your
| frontend latency.
| cubefox wrote:
| Yeah, though I have not seen a formula which takes the
| number of expected inference runs into account for
| calculating the optimal data/parameter balance.
| thelittleone wrote:
| Knowledge cut off and sending potentially sensitive IP to a
| centralised and untrusted third party. This would likely
| limit the addressable market.
| jjoonathan wrote:
| Also: they are lobotomized. If you want to do security
| research, Sorry Dave. If you want to nsfw writing, Sorry
| Dave. If you want to open the pod bay doors, Sorry Dave,
| I'm afraid I can't do that.
| yieldcrv wrote:
| I tried to help a sex worker with ChatGPT and all it did
| was preach about avoiding sex work, further marginalizing
| her with its virtue signaling. I don't consider her
| marginalized, and "help" was just information about her
| job and techniques and venues. ChatGPT would have
| consumed this textual information too.
|
| But yeah, offline fine-tuned models won't have this
| problem.
|
| Kind of cool to see how the SWERF representation in tech
| is going to speedrun SWERF irrelevancy.
| jhartwig wrote:
| You tried to help a sex worker with chatGPT? Care to
| explain more on this use case lol? Curious minds want to
| know.
| yieldcrv wrote:
| Venues to work, other people's thoughts working there,
| management, nuances about local licenses, stuff that
| anybody with any job would do
| jstarfish wrote:
| People are already setting up fake personas/OnlyFans
| accounts using chatbots and SD images.
|
| We have a high-value specialist currently chatting up a
| few of them at work. His wife doesn't know. He doesn't
| know _we_ know. The photos are fake but he's too horny
| to notice.
|
| Time to dust off the "there are no women on the internet"
| meme...
| baobabKoodaa wrote:
| > People are already setting up fake personas/OnlyFans
| accounts using chatbots and SD images.
|
| Citation needed.
| yieldcrv wrote:
| just the latest most high profile example making the
| rounds yesterday
|
| https://www.nbcnews.com/news/amp/rcna84180
| jstarfish wrote:
| I'm only offering an early anecdote that catfishing is
| adapting to the times. You don't have to believe me.
| Don't trust anyone with weird nipples that hides their
| hands in every photo.
|
| People have been [claiming to] do this for years:
| https://www.blackhatworld.com/seo/monetizing-traffic-
| from-so...
|
| Give it 1-2 years and you can hear about it from Krebs.
| yieldcrv wrote:
| Informative for some but this wasn't an interaction over
| the internet, just out and about
| cubefox wrote:
| Microsoft Azure still has the GPT-3.5 foundation model,
| code-davinci-002. It is not fine-tuned for instruction
| following, safety, or censorship.
|
| I'm not sure though whether Microsoft analyzes the
| input/output with another model to detect and prevent
| certain content.
| iforgotpassword wrote:
| I haven't tried the fine-tuned variants yet, but when I played
| around with it shortly after the leak, it tended to quickly
| derail into nonsense when you let it complete sentences or
| paragraphs, especially when using languages other than English.
| When I tried to get it to produce Python code, most of the time
| it wasn't even syntactically correct.
| Taek wrote:
| I've successfully gotten at-home models
| (https://huggingface.co/NousResearch/GPT4-x-Vicuna-13b-fp16) to
| go through my messages and pull out key todo items. For
| example, reminding me to message my father about travel plans.
|
| Is it comparable to GPT-4? No, it's not remotely close. But
| it's getting closer every week, and it very recently crossed
| the threshold of 'it can do stuff I would never send to a cloud
| service' (namely, reading all of my messages and pulling out
| action items).
| jhbadger wrote:
| Depends on what "getting stuff done" means. I find 13B models
| (running on my M1 Mac) useful for playing AI Dungeon-like games
| -- where you describe the environment and your character and
| you can play an RPG.
| fredguth wrote:
| GitHub Copilot is (or once was) a 13b model, according to Nat
| Friedman in the scale.ai interview.
| (https://youtu.be/lnufceCxwG0)
| Zetobal wrote:
| We run some llamas to analyze user content.
| rain1 wrote:
| It's just for fun!
|
| These local models aren't as good as Bard or GPT-4.
| happycube wrote:
| There are two major advantages though - you can retrain them,
| and they don't have the guardrails that the commercial models
| have.
| gre wrote:
| I tried to prompt vicuna to tell me a joke about gay people
| and it refused. Some of the guardrails are still in there.
| azeirah wrote:
| It's because vicuna is fine-tuned on chatGPT answers.
| LLaMa will not do this, but LLaMa-based models fine tuned
| with chatGPT answers will.
| occz wrote:
| Did you use the censored or the uncensored variant?
| gre wrote:
| It's just a random one from huggingface. I will look for
| the uncensored one later. Thanks, I think.
| occz wrote:
| You're welcome. I can't vouch for them though, as I
| haven't tried them, I've merely heard about them.
| instance wrote:
| I tested on a serious use case and quality was subpar. For real
| use cases I had to either host the most powerful model you can
| get (e.g. LLaMA-65B or so) on a cloud machine, which again
| costs too much (you'll be paying like 500-1000 USD per month),
| or just go straight for GPT-3.5 on OpenAI. The latter makes
| the most economic sense.
| inferense wrote:
| what real use case did you use it for?
| instance wrote:
| For instance, I used it in conjunction with llama-index for
| knowledge management. I created an index for the whole
| Confluence/Jira of a mid-sized company and got good results
| with GPT, but that use case was too much for a LLaMA of this
| size.
| sroussey wrote:
| Did you try instructor-xl? It ranks highest on
| huggingface.
| dzhiurgis wrote:
| I'd argue 1k per month for mid-sized company is nothing,
| but I can understand where you are coming from.
| throwaway1777 wrote:
| Making demos to raise investment probably
| raffraffraff wrote:
| What about turning the cloud vm off except when you're
| actually using it?
| unglaublich wrote:
| A "serious use case" means it needs to be available around
| the clock.
| ineedasername wrote:
| I can run the Wizard 30B ggml model in CPU mode using a Ryzen
| 5700 and 16GB of _system_ RAM, not GPU VRAM. I'm using
| oobabooga as the front end.
|
| It's slow, but if I ask it to write a haiku it's slow on the
| order of "go brew some coffee and come back in 10 minutes", and
| it does it very well. Running it overnight on something like
| "summarize an analysis of topic X" it does a reasonable job.
|
| It can produce answers to questions only slightly less well
| than ChatGPT (3.5). The Wizard 13B model runs much faster,
| maybe 2-3 tokens per second.
|
| It is free, private, and runs on a midrange laptop.
|
| A little more than a month ago that wasn't possible, not with
| my level of knowledge of the tooling involved at least, now it
| requires little more than running an executable and minor
| troubleshooting of python dependencies (on another machine it
| "just worked")
|
| So: don't think of these posts as "doing it just because you
| can and it's fun to tinker".
|
| Vast strides are being made pretty much daily in both quality
| and efficiency, raising their utility while lowering the cost
| of usage, doing both to a very significant degree.
| theaiquestion wrote:
| > It's slow, but if I ask it to write a Haiku it's slow on
| the order of "go brew some coffee and come back in 10
| minutes" and does it very well. Running it overnight on
| something like "summarize an analysis of topic X it does a
| reasonable job.
|
| I'm sorry, but that's unusably slow. Even GPT-4 can take a
| retry or a follow-up prompt to fix certain types of issues. My
| experience is the open options require a lot more
| attempts/manual prompt tuning.
|
| I can't think of a single workload where that is usable. That
| said, once consumer GPUs are involved it does become usable.
| postalrat wrote:
| I doubt you've ever worked with people if you think that's
| unusably slow.
| bcrosby95 wrote:
| The computer doesn't ask for annoying things like a
| paycheck or benefits either.
| mejutoco wrote:
| Money upfront and a small salary in the form of
| electricity bills.
| sp332 wrote:
| What prompt do you use to get haikus?
| BaculumMeumEst wrote:
| Wow you can run a 30B model on 16gb ram? Is it hitting swap?
| sp332 wrote:
| Most people are running these at 4 bits per parameter for
| speed and RAM reasons. That means the model would take just
| about all of the RAM. But instead of swap (writing data to
| disk and then reading it again later), I would expect a
| good implementation to only run into cache eviction
| (deleting data from RAM and then reading it back from disk
| later), which should be a lot faster and cause less wear
| and tear on SSDs.
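|
| Rough arithmetic for a 30B model, ignoring the KV cache and
| other overhead (a quick illustrative sketch, not exact ggml
| file sizes):
|
|   # Back-of-the-envelope weight size at different quantization
|   # levels; real ggml files add per-block scales and metadata.
|   params = 30e9
|   for bits in (16, 8, 5, 4):
|       gib = params * bits / 8 / 2**30
|       print(f"{bits}-bit: ~{gib:.1f} GiB")
|   # 16-bit: ~55.9 GiB, 8-bit: ~27.9 GiB,
|   # 5-bit:  ~17.5 GiB, 4-bit: ~14.0 GiB
|
| So at 4 bits the weights alone nearly fill a 16GB machine,
| which is why the cache eviction behaviour matters so much.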
| mcbuilder wrote:
| These models can run FP16, with LLM quantization going down
| to Int8 and beyond.
| BaculumMeumEst wrote:
| i'm just starting to get into deep learning so i look
| forward to understanding that sentence
| MobiusHorizons wrote:
| FP16 and Int8 refer to how many bits are used to store
| floating point and integer numbers. FP16 is 16-bit floating
| point. The more bits, the better the precision, but the more
| RAM it takes. Programmers normally use 32- or 64-bit floats,
| so 16-bit floats have significantly reduced precision, but
| they take up half the space of fp32, which is the smallest
| floating point format on most CPUs. Similarly, 8-bit integers
| have only 256 possible values and go from -128 to 127.
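|
| If you have numpy handy you can poke at the formats directly
| (a quick illustration, nothing llama-specific):
|
|   import numpy as np
|
|   # Bytes per value and representable range of the common formats.
|   for dtype in (np.float32, np.float16, np.int8):
|       info = (np.finfo(dtype) if np.issubdtype(dtype, np.floating)
|               else np.iinfo(dtype))
|       print(dtype.__name__, np.dtype(dtype).itemsize, "bytes,",
|             "range", info.min, "to", info.max)
|
|   # float16 precision runs out fast: around 2048 the spacing
|   # between representable values is already 2.
|   print(np.float16(2048) + np.float16(1))   # still 2048.0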
| mike00632 wrote:
| The resources required are directly related to the memory
| devoted to each weight. If the weights are stored as 32-bit
| floating point numbers then each weight takes 32 bits, which
| adds up when we are talking about billions of weights. But if
| the weights are first converted to 16-bit floating point
| numbers (precise to fewer decimal places) then fewer resources
| are needed to store and compute them. Research has shown that
| simply chopping off some of the precision of the weights still
| yields good AI performance in many cases.
|
| Note too that the number formats are standardized, e.g. floats
| are defined by the IEEE 754 standard. Numbers in these formats
| have specialized hardware to do math with them, so when
| considering which number format to use it's difficult to go
| outside of the established ones (float32, float16, int8).
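|
| A toy version of that "chopping off precision" idea: naive
| symmetric int8 quantization of one weight tensor. Real schemes
| (the ggml q4/q5 formats, GPTQ) work block-wise and are much
| smarter; this is only meant to show the principle.
|
|   import numpy as np
|
|   weights = np.random.randn(4096).astype(np.float32)  # fake layer
|
|   scale = np.abs(weights).max() / 127.0          # one scale per tensor
|   q = np.round(weights / scale).astype(np.int8)  # 1 byte per weight
|   dequant = q.astype(np.float32) * scale         # used at compute time
|
|   print("max abs error:", np.abs(weights - dequant).max())
|   print("bytes:", weights.nbytes, "->", q.nbytes)  # 16384 -> 4096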
| sp332 wrote:
| Training uses gradient descent, so you want to have good
| precision during that process. But once you have the
| overall structure of the network,
| https://arxiv.org/abs/2210.17323 (GPTQ) showed that you
| can cut down the precision quite a bit without losing a
| lot of accuracy. It seems you can cut down further for
| larger models. For the 13B Llama-based ones, going below
| 5 bits per parameter is noticeably worse, but for 30B
| models you can do 4 bits.
|
| The same group did another paper
| https://arxiv.org/abs/2301.00774 which shows that in
| addition to reducing the precision of each parameter, you
| can also prune out a bunch of parameters entirely. It's
| harder to apply this optimization because models are
| usually loaded into RAM densely, but I hope someone
| figures out how to do it for popular models.
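|
| For intuition, the simplest possible form of pruning is just
| dropping the smallest-magnitude weights (this is not what
| SparseGPT does -- that paper reconstructs each layer much more
| carefully -- it's only a sketch of the idea):
|
|   import numpy as np
|
|   weights = np.random.randn(4096, 4096).astype(np.float32)
|
|   sparsity = 0.5                       # drop half the weights
|   threshold = np.quantile(np.abs(weights), sparsity)
|   pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)
|
|   kept = np.count_nonzero(pruned) / pruned.size
|   print(f"kept {kept:.0%} of the weights")
|   # Stored densely this saves no RAM at all; you need a sparse
|   # format (and kernels that exploit it) to actually benefit.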
| [deleted]
| redox99 wrote:
| People are extensively using these models (more specifically
| the finetuned, uncensored ones) for role playing.
| irzzy wrote:
| [dead]
| elorant wrote:
| I've set up and use Vicuna-13b for text classification,
| summarization and topic modelling. Works like a charm.
|
| It's also good for math lessons.
| BOOSTERHIDROGEN wrote:
| Would like to know how you set this up. A post would be
| awesome.
| elorant wrote:
| There are various posts online on how to set it up, either
| for Linux or Windows. There was an older post here on how
| to install opt-65b on a mac studio ultra, and smaller
| models on mac pros. There was also a post if I remember
| correctly about running vicuna-7b on an iPhone.
|
| Here are a few examples:
|
| https://morioh.com/p/55296932dd8b
|
| https://www.youtube.com/watch?v=iQ3Lhy-eD1s
|
| https://news.ycombinator.com/item?id=35430432
|
| Side note: you need bonkers hardware to run it efficiently.
| I'm currently using a 16-core CPU, 128GB RAM, a PCIe 4.0
| NVMe drive and an RTX 3090. There are ways to run it on less
| powerful hardware, like 8 cores, 64GB RAM, an ordinary SSD and
| an RTX 3080 or 3070, but I happen to have a large corpus of
| data to process so I went all in.
| csdvrx wrote:
| I think the previous comment is more interested in your
| experience with your large data: what are you doing with
| it?
|
| I have similar hardware at home, so I wonder how reliably
| you can process simple queries using domain knowledge +
| logic which work on mlc-llm, something like "if you
| can chose the word food, or the word laptop, or the word
| deodorant, which one do you chose for describing "macbook
| air"? answer precisely with just the word you chose"
|
| If it works, can you upload the weights somewhere? IIRC,
| vicuna is open source.
| elorant wrote:
| There's an online demo of Vicuna-13b where you can test
| its efficiency:
|
| https://chat.lmsys.org/
| techload wrote:
| After two prompts I was astounded by the inaccuracies
| present in the answers. And they were pretty easy
| questions.
| csdvrx wrote:
| Yes, but can you replicate that functionality using
| llama.cpp?
|
| If so, what did you run with main?
|
| I haven't been able to get an answer, while for the
| question above, I can get _'I chose the word "laptop"'_
| with mlc-llm
| elorant wrote:
| For the tasks I need it for, the results are similar to the
| online model, only slower. I don't care about conversational
| functionality.
| chaxor wrote:
| If these problems are all very similar in structure, then
| you may not need an LLM. Simple GloVe or W2V may suffice
| with a dot product. Then you can plow through a few
| terabytes by the time the LLM goes through a fraction of
| that.
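|
| A minimal sketch of that approach with gensim's pretrained
| GloVe vectors (the model name comes from the gensim
| downloader; the phrase vector is just a crude average, and it
| assumes the words are in the vocabulary):
|
|   import numpy as np
|   import gensim.downloader as api
|
|   wv = api.load("glove-wiki-gigaword-100")   # pretrained vectors
|
|   def phrase_vec(phrase):
|       # crude: average the word vectors of the phrase
|       return np.mean([wv[w] for w in phrase.lower().split()
|                       if w in wv], axis=0)
|
|   query = phrase_vec("macbook air")
|   candidates = ["food", "laptop", "deodorant"]
|   scores = {c: float(np.dot(query, wv[c]) /
|                      (np.linalg.norm(query) * np.linalg.norm(wv[c])))
|             for c in candidates}
|   print(max(scores, key=scores.get), scores)  # expect "laptop"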
| jstarfish wrote:
| Maybe others' experiences are different, but I find smaller
| models to work just as well for "reductive" tasks.
|
| Dolly sucks for generating long-form content (not very
| creative) but if I need a summary or classification, it's
| quicker and easier to spin up dolly-3b than vicuna-13b.
|
| I suspect OpenAI is routing prompts to select models based on
| similar logic.
| s_dev wrote:
| [deleted]
| capableweb wrote:
| First link: https://github.com/ggerganov/llama.cpp
|
| Which in turn has the following as the first link:
| https://arxiv.org/abs/2302.13971
|
| Is it really quicker to ask here than just browse content for a
| bit, skimming some text or even using Google for one minute?
| djbusby wrote:
| You gave an awesome answer in 2 minutes! Might be faster than
| reading!
| capableweb wrote:
| If you cannot click two links in a browser under two
| minutes, I'm either sorry for you, or scared of you :)
| s_dev wrote:
| >Is it really quicker to ask here than just browse content
| for a bit, skimming some text or even using Google for one
| minute?
|
| I don't know if it's quicker, but I trust human assessment a
| lot more than any machine-generated explanations. You're
| right, I could have asked ChatGPT or even Googled it, but a
| small bit of context goes a long way and I'm clearly out of
| the loop here -- it's possible others arriving on HN might
| appreciate such an explanation, or else we're better off
| having lots of people make duplicated efforts to understand
| what they're looking at.
| capableweb wrote:
| Well, I'm saying if you just followed the links on the
| submitted page, you'd reach the same conclusion but faster.
| rain1 wrote:
| llama is a text prediction model similar to GPT-2, and the
| version of GPT-3 that has not been fine tuned yet.
|
| It is also possible to run fine tuned versions like vicuna with
| this. I think. Those versions are more focused on answering
| questions.
| haunter wrote:
| >I can't tell from the Gist alone
|
| Literally the second line: "llama is a text prediction model
| similar to GPT-2, and the version of GPT-3 that has not been
| fine tuned yet"
| rain1 wrote:
| I'm sorry! I added this improvement based on that person's
| question!
| s_dev wrote:
| Sorry -- I missed that. I'll delete my comments -- obviously
| I'm just an idiot asking dumb questions that have no value to
| anybody. I thought I read through it.
| rain1 wrote:
| not at all, your question was really good so I added the
| answer to it to my gist to help everyone else. Sorry for
| the confusion I created by doing that!
| avereveard wrote:
| Or just download oobabooga/text-generation-webui and any
| prequantized variant, and be done.
| rahimnathwani wrote:
| On my system, using `-ngl 22` (running 22 layers on the GPU) cuts
| wall clock time by ~60%.
|
| My system:
|
| GPU: NVidia RTX 2070S (8GB VRAM)
|
| CPU: AMD Ryzen 5 3600 (16GB system RAM)
|
| Here's the performance difference I see:
|
| CPU only (./main -t 12):
|   llama_print_timings:        load time = 15459.43 ms
|   llama_print_timings:      sample time =    23.64 ms /  38 runs   (  0.62 ms per token)
|   llama_print_timings: prompt eval time =  9338.10 ms / 356 tokens ( 26.23 ms per token)
|   llama_print_timings:        eval time = 31700.73 ms /  37 runs   (856.78 ms per token)
|   llama_print_timings:       total time = 47192.68 ms
|
| GPU (./main -t 12 -ngl 22):
|   llama_print_timings:        load time = 10285.15 ms
|   llama_print_timings:      sample time =    21.60 ms /  35 runs   (  0.62 ms per token)
|   llama_print_timings: prompt eval time =  3889.65 ms / 356 tokens ( 10.93 ms per token)
|   llama_print_timings:        eval time =  8126.90 ms /  34 runs   (239.03 ms per token)
|   llama_print_timings:       total time = 18441.22 ms
| samstave wrote:
| Can you please ELI5 what is happening here?
|
| Imagine I am hearing about this for the first time; what did
| you do?
| rahimnathwani wrote:
| 0. Have a PC with an NVidia GPU, running Ubuntu, with the
| NVidia drivers and CUDA Toolkit already set up.
|
| 1. Download the weights for the model you want to use, e.g.
| gpt4-x-vicuna-13B.ggml.q5_1.bin
|
| 2. Clone the llama.cpp repo, and use 'make LLAMA_CUBLAS=1' to
| compile it with support for CUBLAS (BLAS on GPU).
|
| 3. Run the resulting 'main' executable, with the -ngl option
| set to 18, so that it tries to load 18 layers of the model
| into the GPU's VRAM, instead of the system's RAM.
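|
| If you'd rather drive it from Python, the llama-cpp-python
| bindings expose the same knob. A minimal sketch, assuming the
| package is installed with cuBLAS support and the model path
| is adjusted to wherever you put the weights:
|
|   from llama_cpp import Llama
|
|   # n_gpu_layers maps to llama.cpp's -ngl: layers pushed to VRAM.
|   llm = Llama(
|       model_path="./models/gpt4-x-vicuna-13B.ggml.q5_1.bin",
|       n_gpu_layers=18,
|       n_threads=12,
|   )
|
|   out = llm("Q: Name three uses for a llama.\nA:",
|             max_tokens=64, stop=["Q:"])
|   print(out["choices"][0]["text"])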
| rain1 wrote:
| > 1. Download the weights for the model you want to use,
| e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin
|
| I think you need to quantize the model yourself from the
| float/huggingface versions. My understanding is that the
| quantization formats have changed recently, and old
| quantized models no longer work.
| rahimnathwani wrote:
| That was true until 2 days ago :)
|
| The repo has now been updated with requantized models
| that work with the latest version, so you don't need to
| do that any more.
|
| https://huggingface.co/TheBloke/gpt4-x-vicuna-13B-GGML/co
| mmi...
| rain1 wrote:
| wonderful! thank you
| guardiangod wrote:
| I am testing it on an AWS instance and the speedup effect is
| not as consistent as I hope. The speedup varies between runs.
|
| Intel Xeon Platinum 8259CL CPU @ 2.50GHz, 128 GB RAM, Tesla T4
|
| ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin
|   llama_print_timings:        load time =   3725.08 ms
|   llama_print_timings:      sample time =    612.06 ms / 536 runs   (   1.14 ms per token)
|   llama_print_timings: prompt eval time =  13876.81 ms / 259 tokens (  53.58 ms per token)
|   llama_print_timings:        eval time = 221647.40 ms / 534 runs   ( 415.07 ms per token)
|   llama_print_timings:       total time = 239423.46 ms
|
| ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin -ngl 30
|   llama_print_timings:        load time =   7638.95 ms
|   llama_print_timings:      sample time =    280.81 ms / 294 runs   (   0.96 ms per token)
|   llama_print_timings: prompt eval time =   2197.82 ms /   2 tokens (1098.91 ms per token)
|   llama_print_timings:        eval time = 112790.25 ms / 293 runs   ( 384.95 ms per token)
|   llama_print_timings:       total time = 120788.82 ms
| rahimnathwani wrote:
| Thanks. BTW:
|
| - the model I used was gpt4-x-vicuna-13B.ggml.q5_1.bin
|
| - I used 'time' to measure the wall clock time of each
| command.
|
| - My prompt was:
|
|     Below is an instruction that describes a task. Write a
|     response that appropriately completes the request.
|
|     ### Instruction:
|     Write a long blog post with 5 sections, about the pros and
|     cons of emphasising procedural fluency over conceptual
|     understanding, in high school math education.
|
|     ### Response:
| PaulWaldman wrote:
| Any way to know the differences in power consumption?
| Tuna-Fish wrote:
| Probably significant savings.
| cpill wrote:
| Will this work with the leaked models or Alpaca?
| eightysixfour wrote:
| You will likely see a bit of a performance gain dropping your
| threads to 6. I'm on a 3700x and get a regression when using 16
| threads instead of the real 8 cores.
| rain1 wrote:
| That is a crazy speedup!!
| GordonS wrote:
| Is it really? Going from CPU to GPU, I would have expected a
| much better improvement.
| rahimnathwani wrote:
| You can think of it this way: if half the model is running
| on the GPU, and the GPU is infinitely fast, then the total
| calculation time would go down by 50%, compared with
| everything running on the CPU.
| ethbr0 wrote:
| Ref Amdahl's Law:
| https://en.m.wikipedia.org/wiki/Amdahl%27s_law
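|
| For reference, the formula in a couple of lines (p = fraction
| of the work offloaded, s = speedup of that fraction; the
| numbers below are just illustrative):
|
|   def amdahl(p, s):
|       # Overall speedup when fraction p of the work gets s times faster.
|       return 1.0 / ((1.0 - p) + p / s)
|
|   print(amdahl(0.6, 10))    # ~2.17x overall
|   print(amdahl(0.6, 1e9))   # ~2.5x -- the cap even with an
|                             # infinitely fast GPU on 60% of the work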
| qwertox wrote:
| I feel the same.
|
| For example, some stats from Whisper [0] (audio transcription
| of 30 seconds of audio) show the following for the medium
| model (see other models in the link):
|
| ---
|
|   Device  Model   Precision      Layer      Time
|   GPU     medium  fp32           Linear      1.7s
|   CPU     medium  fp32           nn.Linear  60.7s
|   CPU     medium  qint8 (quant)  nn.Linear  23.1s
|
| ---
|
| So the same model runs 35.7 times faster on GPU, and still
| 13.6 times faster compared to the "optimized" (quantized) one.
|
| I was expecting around an order of magnitude of improvement.
|
| Then again, I do not know if in the case of this article
| the entire model was in the GPU, or just a fraction of it
| (22 layers) and the remainder on CPU, which might explain
| the result. Apparently that's the case, but I don't know
| much about this stuff.
|
| [0] https://github.com/MiscellaneousStuff/openai-whisper-
| cpu
| rahimnathwani wrote:
| Your last paragraph is correct. Only about half the model
| was running on the GPU.
| anshumankmr wrote:
| How long before it runs on a 4 gig card?
| rain1 wrote:
| You can offload only 10 layers or so if you want to run on a
| 4GB card
| bitL wrote:
| How about reloading parts of the model as the inference
| progresses instead of splitting it into GPU/CPU parts? Reloading
| would be memory-limited to the largest intermediate tensor cut.
| moffkalast wrote:
| The Tensor Reloaded, starring Keanu Reeves
| regularfry wrote:
| That would turn what's currently an L3 cache miss or a GPU data
| copy into a disk I/O stall. Not that it might not be possible
| to pipeline things to make that less of a problem, but it
| doesn't immediately strike me as a fantastic trade-off.
| bitL wrote:
| One can keep all tensors in RAM and just push whatever is
| needed to GPU VRAM, basically limited by PCIe speed. Or use
| some intelligent strategy with read-ahead from SSD if one's
| RAM is limited. There are even GPUs with their own SSDs.
| sroussey wrote:
| I wish this used the WebGPU C++ library instead; then it could
| be used on any GPU hardware.
| marcopicentini wrote:
| What do you use to host these models (like Vicuna, Dolly, etc.)
| on your own server and expose them via an HTTP REST API? Is
| there a Heroku-like service for LLM models?
|
| I am looking for an open source model to do text summarization.
| OpenAI is too expensive for my use case because I need to pass
| lots of tokens.
| rain1 wrote:
| I haven't tried that, but https://github.com/abetlen/llama-cpp-
| python and https://github.com/r2d4/openlm exist
| speedgoose wrote:
| These days I use FastChat: https://github.com/lm-sys/FastChat
|
| It's not based on llama.cpp but on Hugging Face transformers,
| and it can also run on CPU.
|
| It works well, can be distributed, and very conveniently
| provides the same REST API as OpenAI's GPT.
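|
| Which means you can talk to it with the stock openai Python
| client just by repointing the base URL. A sketch using the
| 0.x client; the port and model name depend on how you launched
| the FastChat API server and workers:
|
|   import openai
|
|   openai.api_key = "EMPTY"                     # local server ignores it
|   openai.api_base = "http://localhost:8000/v1" # local OpenAI-compatible API
|
|   resp = openai.ChatCompletion.create(
|       model="vicuna-13b-v1.1",                 # whatever model you served
|       messages=[{"role": "user",
|                  "content": "Summarize llama.cpp in one line."}],
|       max_tokens=64,
|   )
|   print(resp["choices"][0]["message"]["content"])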
| itake wrote:
| Do you know how well it performs compared to llama.cpp?
| rain1 wrote:
| my understanding is that the engine used (pytorch
| transformers library) is still faster than llama.cpp with
| 100% of layers running on the GPU.
| itake wrote:
| I only have an m1
| rain1 wrote:
| I don't think the integrated GPU on that supports CUDA.
| So you will need to use CPU mode only.
| itake wrote:
| Yep, but isn't there an integrated ML chip that makes it
| faster than cpu? Or does llama.cpp not use that?
| rain1 wrote:
| unfortunately that chip is proprietary and undocumented,
| it's very difficult for open source programs to make use
| of. I think there is some reverse engineering work being
| done but it's not complete.
| qeternity wrote:
| It's the Huggingface transformers library which is
| implemented in pytorch.
|
| In terms of speed, yes running fp16 will indeed be faster
| with vanilla gpu setup. However most people are running
| 4bit quantized versions, and the GPU quantization
| landscape has been a mess (the GPTQ-for-llama project).
| llama.cpp has taken a totally different approach, and it
| looks like they are currently able to match native GPU
| perf via cuBLAS with much less effort and brittleness.
| inhumantsar wrote:
| Weights and Biases is good for building/training models and
| Lambda Labs is a cloud provider for AI workloads. Lambda will
| only get you up to running the model though. You would still
| need to overlay some job management on top of that. I've heard
| Run.AI is good on that front but I haven't tried.
| peatmoss wrote:
| From skimming, it looks like this approach requires CUDA and thus
| is Nvidia only.
|
| Anyone have a recommended guide for AMD / Intel GPUs? I gather
| the 4 bit quantization is the special sauce for CUDA, but I'd
| guess there'd be something comparable for not-CUDA?
| rain1 wrote:
| 4-bit quantization is to reduce the amount of VRAM required to
| run the model. You can run it 100% on CPU if you don't have
| CUDA. I'm not aware of any AMD equivalent yet.
| amelius wrote:
| Looks like there are several projects that implement the CUDA
| interface for various other compute systems, e.g.:
|
| https://github.com/ROCm-Developer-
| Tools/HIPIFY/blob/master/R...
|
| https://github.com/hughperkins/coriander
|
| I have zero experience with these, though.
| westurner wrote:
| "Democratizing AI with PyTorch Foundation and ROCm(tm)
| support for PyTorch" (2023)
| https://pytorch.org/blog/democratizing-ai-with-pytorch/ :
|
| > _AMD, along with key PyTorch codebase developers
| (including those at Meta AI), delivered a set of updates to
| the ROCm(tm) open software ecosystem that brings stable
| support for AMD Instinct(tm) accelerators as well as many
| Radeon(tm) GPUs. This now gives PyTorch developers the
| ability to build their next great AI solutions leveraging
| AMD GPU accelerators & ROCm. The support from PyTorch
| community in identifying gaps, prioritizing key updates,
| providing feedback for performance optimizing and
| supporting our journey from "Beta" to "Stable" was
| immensely helpful and we deeply appreciate the strong
| collaboration between the two teams at AMD and PyTorch. The
| move for ROCm support from "Beta" to "Stable" came in the
| PyTorch 1.12 release (June 2022)_
|
| > [...] _PyTorch ecosystem libraries like TorchText (Text
| classification), TorchRec (libraries for recommender
| systems - RecSys), TorchVision (Computer Vision),
| TorchAudio (audio and signal processing) are fully
| supported since ROCm 5.1 and upstreamed with PyTorch 1.12._
|
| > _Key libraries provided with the ROCm software stack
| including MIOpen (Convolution models), RCCL (ROCm
| Collective Communications) and rocBLAS (BLAS for
| transformers) were further optimized to offer new potential
| efficiencies and higher performance._
|
| https://news.ycombinator.com/item?id=34399633 :
|
| >> _AMD ROcm supports Pytorch, TensorFlow, MlOpen, rocBLAS
| on NVIDIA and AMD
| GPUs:https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-
| learni... _
| westurner wrote:
| https://github.com/intel/intel-extension-for-pytorch :
|
| > _Intel(r) Extension for PyTorch extends PyTorch with
| up-to-date features optimizations for an extra
| performance boost on Intel hardware. Optimizations take
| advantage of AVX-512 Vector Neural Network Instructions
| (AVX512 VNNI) and Intel(r) Advanced Matrix Extensions
| (Intel(r) AMX) on Intel CPUs as well as Intel Xe Matrix
| Extensions (XMX) AI engines on Intel discrete GPUs.
| Moreover, through PyTorch xpu device, Intel(r) Extension
| for PyTorch provides easy GPU acceleration for Intel
| discrete GPUs with PyTorch_
|
| https://pytorch.org/blog/celebrate-pytorch-2.0/ (2023) :
|
| > _As part of the PyTorch 2.0 compilation stack,
| TorchInductor CPU backend optimization brings notable
| performance improvements via graph compilation over the
| PyTorch eager mode._
|
| > _The TorchInductor CPU backend is sped up by leveraging
| the technologies from the Intel(r) Extension for PyTorch
| for Conv /GEMM ops with post-op fusion and weight
| prepacking, and PyTorch ATen CPU kernels for memory-bound
| ops with explicit vectorization on top of OpenMP-based
| thread parallelization_
|
| DLRS Deep Learning Reference Stack:
| https://intel.github.io/stacks/dlrs/index.html
| rain1 wrote:
| exciting! maybe we will see that land in llama.cpp
| eventually, who knows!
| juliangoldsmith wrote:
| llama.cpp has CLBlast support now, though I haven't used
| it yet.
| [deleted]
| hhh wrote:
| Instructions are a bit rough. The Micromamba thing doesn't work,
| doesn't say how to install it... you have to clone llama.cpp too
| rain1 wrote:
| Apologies for that. I've added some extra micromamba setup
| commands that I should have included before!
|
| I've also added the git clone command, thank you for the
| feedback
| hhh wrote:
| Appreciate it! This is much better!
| ranger_danger wrote:
| Why can't these models run on the GPU while also using CPU RAM
| for the storage? That way people with performant-but-memory-
| starved GPUs could still utilize the better performance of GPU
| calculation while also having enough RAM to store the model. I
| know it is possible to provide system-RAM-backed GPU objects.
| syntaxing wrote:
| This update is pretty exciting, I'm gonna try running a large
| model (65B) with a 3090. I have run a ton of local LLMs, but
| the hardest part is figuring out the prompt structure. I wish
| there were some sort of centralized database that explained it.
| guardiangod wrote:
| I got the alpaca 65B GGML model to run on my 64GB ram laptop.
| No GPU required if you can tolerate the 1 token per 3 seconds
| rate.
| syntaxing wrote:
| Supposedly the new update with GPU offloading will bring that
| up to 10 tokens per second! 1 token per second is painfully
| slow, that's about 30s for a sentence.
| rain1 wrote:
| Tell us how it goes! Try different numbers of layers if needed.
|
| A good place to dig for prompt structures may be the 'text-
| generation-webui' commit log. For example
| https://github.com/oobabooga/text-generation-webui/commit/33...
| tarr11 wrote:
| What is the state of the art on evaluating the accuracy of these
| models? Is there some equivalent to an "end to end test"?
|
| It feels somewhat recursive since the input and output are
| natural language and so you would need another LLM to evaluate
| whether the model answered a prompt correctly.
| tikkun wrote:
| https://chat.lmsys.org/?arena (Click 'leaderboard')
| klysm wrote:
| It's going to be very difficult to come up with any rigorous
| structure for automatically assessing the outputs of these
| models. They're built using effectively human grading of the
| answers
| RockyMcNuts wrote:
| Hmmh, if we have the reward model part of reinforcement
| learning with human feedback, isn't that a model that takes a
| question/answer pair and rates the quality of the answer? It's
| sort of grading itself; it's like a training loss, but it
| still tells us something.
| sroussey wrote:
| llama.cpp and others use perplexity:
|
| https://huggingface.co/docs/transformers/perplexity
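|
| The idea in a few lines (a sketch; real evaluations like
| llama.cpp's ./perplexity tool run a long corpus such as
| wikitext-2 through the model in chunks):
|
|   import math
|
|   # Perplexity = exp(average negative log-likelihood per token).
|   # log_probs: the model's natural-log probability of each
|   # actual next token in the evaluation text.
|   def perplexity(log_probs):
|       return math.exp(-sum(log_probs) / len(log_probs))
|
|   # A model that gives ~25% probability to every correct token:
|   print(perplexity([math.log(0.25)] * 100))  # 4.0 -- lower is better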
| ACV001 wrote:
| The future is this - these models will be able to run on smaller
| and smaller hardware eventually being able to run on your phone,
| watch or embedded devices. The revolution is here and is
| inevitable. Similar to how computers evolved. We are lucky
| that these models have no consciousness, still. Once they gain
| consciousness, that will mark the appearance of a new species
| (superior to us if anything). Also, luckily, they have no
| physical bodies and cannot replicate, so far...
| canadianfella wrote:
| [dead]
| qwertox wrote:
| If I really want to do some playing around in this area, would it
| be good to get an RTX 4000 SFF, which has 20 GB of VRAM but is a
| low-power card, which I want as it would be running 24/7 and
| energy prices are pretty bad in Germany, or would it make more
| sense to buy an Apple product with some M2 chip which apparently
| is good for these tasks as it shares CPU and GPU memory?
| holoduke wrote:
| Why don't AMD or Intel release a moderately performant GPU with
| a minimum of 128GB of memory at a good consumer price? These
| models require lots of memory to do a single pass of an
| operation; throughput could be a bit slower. An Nvidia 1080
| with 256GB of memory would run all these models fast, right?
| Or am I forgetting something here?
| hackernudes wrote:
| I don't think there was a market for it before LLMs. Still
| might not be (especially if they don't want to cannibalize data
| center products). Also, they might have hardware constraints. I
| wouldn't be that surprised if we see some high ram consumer
| GPUs in the future, though.
|
| It won't work out unless it becomes common to run LLMs locally.
| Kind of a chicken-and-egg problem so I hope they try it!
| the8472 wrote:
| > I don't think there was a market for it before LLMs.
|
| At $work CGI assets sometimes grow pretty big and throwing
| more VRAM at the problem would be easier than optimizing the
| scenes in the middle of the workflow. They _can_ be
| optimized, but that often makes it less ergonomic to work
| with them.
|
| Perhaps asset-streaming (nanite&co) will make this less of an
| issue, but that's also fairly new.
|
| Do LLM implementations already stream the weights layer by
| layer or in whichever order they're doing the evaluation or
| is PCIe bandwidth too limited for that?
| tpetry wrote:
| But you are not the home user target audience. They want to
| sell you the more expensive workstation or server models.
| the8472 wrote:
| Even an A6000 tops out at 48GB while you can attach
| terabytes of RAM to server-class CPUs.
| elabajaba wrote:
| AMD had the Radeon pro SSG that let you attach 1TB of pcie3
| nvme SSDs directly to the GPU, but no one bought them and
| afaik they were basically unobtainable on the consumer
| market.
|
| Also asset streaming has been a thing for like 20 years now
| in gaming, it's not really a new thing. Nanite's big thing
| is that it gets you perfect LODs without having to pre-
| create them and manually tweak them (eg. how far away does
| the LOD transition happen, what's the lowest LOD before it
| disappears, etc)
| the8472 wrote:
| Loading assets JIT for the next frame from NVMe hasn't
| been a thing for 20 years though. Different kinds of
| latency floors.
|
| What I was asking is whether LLM inference can be
| structured in such a way that only a fraction of the
| weight is needed at a time and then the next ones can be
| loaded JIT as the processing pipeline advances.
| [deleted]
| layer8 wrote:
| Releasing a new model takes time, and it's unclear how large
| the consumer market would actually be. Maybe they're working on
| it right now.
| Kye wrote:
| GDDR probably hasn't seen the same cost reduction benefits
| that volume DDR has.
| TaylorAlexander wrote:
| One question I have is: can they use cheaper kinds of RAM and
| still be perfectly usable for large ML models? They could put
| 4GB of GDDR and 128GB of cheap RAM maybe? I do realize as
| others are saying, this would be a new kind of card so they
| will need time to develop it. But would this work?
| andromeduck wrote:
| Not without a redesigned memory controller or one off chip.
| You'd probably just want the host's memory to be directly
| accessible over PCIE or something faster like NVLINK. Such
| solutions already exist just not in the consumer space.
| duxup wrote:
| >for a good consumer price
|
| Was there a consumer market for them until recently?
| 0xcde4c3db wrote:
| Probably because if they take that exact same GPU+VRAM
| configuration and slap it on a rackmount-optimized board, they
| can charge AI companies 5-10x the price for it.
| jsheard wrote:
| They don't even offer that much VRAM on cards aimed at those
| price-insensitive customers; Nvidia's current lineup maxes out
| at 48GB for GDDR-based models or 80GB for HBM-based models.
| Even if money is no object there's still practical
| engineering limits on how much memory they can put on a card
| without sacrificing bandwidth.
| vegabook wrote:
| this is where the new third player, Intel, can (if it can
| tear itself away from identical behaviour in the
| consumer/server CPU market) hopefully break the duopoly. Love
| to see a 32 or 64GB card from Intel. Their software stack on
| Linux is competent enough (unlike the dumpster fire that is
| AMD's ROCm).
| andromeduck wrote:
| Because then memory would be 90% of the BOM.
| dragonwriter wrote:
| > Why does AMD or Intel not release a medium performant GPU
| with minimum 128gb of memory for a good consumer price.
|
| They do. Well, not "medium performant", but for VRAM-bound
| tasks they'd still be an improvement over CPUs if you could use
| them -- iGPUs use main memory.
|
| What they don't have is support for them for popular GPGPU
| frameworks (though there was a third party CUDA-for-Intel-iGPUs
| a while ago.)
| elabajaba wrote:
| Because they can't do that for a "good consumer price".
|
| If you want more than ~48GB, you're looking at HBM which is
| extremely expensive (HBM chips are very expensive,
| packaging+interposer is extremely expensive, designing and
| producing a new GPU is expensive).
|
| Normal GPUs are limited by both their bus width (wider bus =
| more pins = harder to design, more expensive to produce, and
| increases power consumption), and GDDR6(x) (which maxes out at
| 2GB/chip currently), so on a 384bit bus (4090/7900xtx, don't
| expect anyone to make a 512bit busses anymore) you need 12x2GB
| (GDDR6 uses 32 pins per package) which gives you 24GB. You can
| double the memory capacity to 48GB, but that requires putting
| the chips on the back of the GPU which leads to a bunch of
| cooling issues (and GDDR6 is expensive).
|
| Of course, even if they did all that they're selling expensive
| GPUs to a small niche market and cannibalizing sales of their
| own high end products (and even if AMD somehow managed to magic
| up a 128GB gpu for $700 people still wouldn't buy it because so
| much of the ML software is CUDA only).
| eurekin wrote:
| 3090 has a lot of vram chips on the back though
| elabajaba wrote:
| And because of it there were issues with the vram
| overheating in memory intensive workloads, and on some GPUs
| the vram even separated off the board.
|
| https://www.igorslab.de/en/looming-pads-and-too-hot-
| gddrx6-m...
| pbhjpbhj wrote:
| There's a type of DMA for GPUs to access NVMe on the
| motherboard, IIRC. Perhaps that is a better solution here?
|
| https://developer.nvidia.com/blog/gpudirect-storage/
| boppo1 wrote:
| Isn't pci-e latency dramatically higher than onboard vram?
| fooker wrote:
| That's exactly what the next generation of 'accelerators' will
| be like.
|
| Whether it will be co-located with a GPU for consumer hardware
| remains to be seen.
|
| The thing to determine is how essential running LLMs locally is
| for consumers.
|
| BigTech is pushing hard to make their clouds the only place to
| run LLMs unfortunately, so unless there is a killer app that is
| just better locally (like games were for GPUs), this might not
| change.
| boppo1 wrote:
| > unless there is a killer app that is just better locally
|
| Therapy & relationship bots, like the movie 'Her'. It's ugly,
| but it's coming.
| fooker wrote:
| There's no technical reason it has to be run locally.
|
| Massive privacy implications for sure, but people do
| consume all sorts of adult material online.
|
| Games though, no one has been able to make it work as well
| as local so far.
| kevingadd wrote:
| The margins on VRAM are pretty bad for them since they don't
| manufacture it themselves. And every memory module they add
| needs additional power delivery and memory controller muscle to
| drive, so adding that memory is going to raise the cost of the
| card significantly. Most games and consumer workloads won't use
| all that extra memory.
|
| Keep in mind video cards don't use the same kind of RAM as
| consumer CPUs do, they typically use GDDR or HBM.
| Tuna-Fish wrote:
| It would not be trivial to do.
|
| GDDR achieves higher speeds than normal DDR mainly by
| specifying much tighter tolerances on the electrical interface,
| and using wider interface to the memory chips. This means that
| using commodity GDDR (which is the only fast DRAM that will be
| reasonably cheap), you have fairly strict limitations on the
| maximum amount of RAM you can use with the same GPUs that are
| manufactured for consumer use. (Typically, at most 4x
| difference between the lowest-end reasonable configuration and
| the highest-end one, 2x from higher density modules and 2x from
| using clamshell memory configuration, although often you only
| have one type of module for a new memory interface generation.)
|
| If the product requires either a new memory or GPU die
| configuration, its cost will be very high.
|
| The only type of memory that can support very different VRAM
| sizes for an efficiently utilized bus of the same size is HBM,
| and so far that is limited to the very high end.
| magicalhippo wrote:
| Anandtech has an article on the GDDR6X variant[1] that NVIDIA
| has in their 3000-cards, where they use a more complex
| encoding to transmit two bits per clock edge.
|
| I hadn't realized just how insane the bandwidth on the
| higher-end cards is, the 3090 being just shy of 1 TB/s --
| yes, one terabyte per second...
|
| For comparison a couple of DDR5 sticks[2] will just get you
| north of 70GB/s...
|
| [1]: https://www.anandtech.com/show/15978/micron-spills-on-
| gddr6x...
|
| [2]: https://www.anandtech.com/show/17269/ddr5-demystified-
| feat-s...
| q7xvh97o2pDhNrh wrote:
| Do you happen to know where Apple's integrated approach falls
| on this spectrum?
|
| I was actually wondering about this the other day. A fully
| maxed out Mac Studio is about $6K, and it comes with a
| "64-core GPU" and "128GB integrated memory" (whatever any of
| that means). Would that be enough to run a decent Llama?
| cudder wrote:
| The Mac's "integrated memory" means it's shared between the
| CPU and GPU. So the GPU can address all of that and you can
| load giant (by current consumer GPU standards) models. I
| have no idea how it actually performs though.
| dclowd9901 wrote:
| Has anyone tried running encryption algorithms through these
| models? I wonder if it could be trained to decrypt.
| Hendrikto wrote:
| That would be very surprising, given that any widely used
| cryptographic encryption algorithm has been EXTENSIVELY
| cryptanalyzed.
|
| ML models are essentially trained to recognize patterns.
| Encryption algorithms are explicitly designed to resist that
| kind of analysis. LLMs are not magic.
| dclowd9901 wrote:
| All of what you said is true, for us. I know LLMs aren't
| magic (lord knows I actually kind of understand the
| principles of how they operate), but they have a much greater
| computational and relational bandwidth than we've ever had
| access to before. So I'm curious if that can break down what
| otherwise appears to be complete obfuscation. Otherwise,
| we're saying that encryption is somehow magic in a way that
| LLMs cannot possibly be.
| NegativeK wrote:
| > Otherwise, we're saying that encryption is somehow magic
| in a way that LLMs cannot possibly be.
|
| I don't see why that's an unreasonable claim. I mean,
| encryption isn't magic, but it is a drastically different
| process.
| dinobones wrote:
| What is HN's fascination with these toy models that produce low
| quality, completely unusable output?
|
| Is there a use case for them I'm missing?
|
| Additionally, don't they all have fairly restrictive licenses?
| az226 wrote:
| [flagged]
| Zetobal wrote:
| Maybe you forgot what the H in HN stands for... playful
| curiosity.
| tbalsam wrote:
| I never thought I'd see the day when a 13B model was casually
| referred to in a comments section as a "toy model".
| andrewmcwatters wrote:
| Start using it for tasks and you'll find limitations very
| quickly. Even ChatGPT excels at some tasks and fails
| miserably at others.
| tbalsam wrote:
| Oh, I've been using language models before a lot (or at
| least some significant chunk) of HN knew the word LLM, I
| think.
|
| I remember when going from 6B to 13B was crazy good. We've
| just normalized our standards to the latest models in the
| era.
|
| They do have their shortcomings but can be quite useful as
| well, especially the LLama class ones. They're definitely
| not GPT-4 or Claude+, for sure, for sure.
| az226 wrote:
| Compared to GPT2 it's on par. Compared to GPT3, 3.5, or 4,
| it's a toy. GPT2 is 4 years old, and in terms of LLMs, that's
| several lifetimes ago. In 5-10 years, GPT3 will be viewed as
| a toy. Note, "progress" is unlikely to continue at the pace it
| has so far.
| tbalsam wrote:
| GPT-2's largest model was 1.5B params, LLama-65B was
| similar to the largest GPT3 in benchmark performance but
| that model was expensive in the API, a number of the people
| would use the cheaper one(s) instead IIRC.
|
| So this is similar to a mid tier GPT3 class model.
|
| Basically, there's not much reason to Pooh-Pooh it. It may
| not perform quite as well, but I find it to be useful for
| the things it's useful for.
| mozillas wrote:
| I ran the 7B Vicuna (ggml-vic7b-q4_0.bin) on a 2017 MacBook Air
| (8GB RAM) with llama.cpp.
|
| Worked OK for me with the default context size; 2048, like you
| see in most examples, was too slow for my taste.
| koheripbal wrote:
| Given the current price (mostly free) of public LLMs, I'm not
| sure what the use cases for running one at home are yet.
|
| OpenAI's paid GPT-4 has few restrictions and is still cheap.
|
| ... Not to mention GPT-4 with the browsing feature is vastly
| superior to any of the models you can run at home.
| toxik wrote:
| The point for me personally is the same as why I find it so
| powerful to self host SMTP, IMAP, HTTP. It's in my hands, I
| know where it all begins and ends. I answer to no one.
|
| For LLMs this means I am allowed their full potential. I can
| generate smut, filth, illegal content of any kind for any
| reason. It's for me to decide. It's empowering, it's the
| hacker mindset.
| sagarm wrote:
| I think it's mostly useful if you want to do your own fine
| tuning, or the data you are working with can't be sent to a
| third party for contractual, legal, or paranoid reasons.
| sroussey wrote:
| I'm working on an app to index your life, and having it
| local is a huge plus for the people I have using it.
| 2devnull wrote:
| Many would-be users can't send their data to OpenAI. Think
| HIPAA and other laws restricting data sharing. Federation or
| distribution of the models for local training is the other
| solution to that problem.
___________________________________________________________________
(page generated 2023-05-14 23:00 UTC)