[HN Gopher] Show HN: Alpaca.cpp - Run an Instruction-Tuned Chat-...
___________________________________________________________________
Show HN: Alpaca.cpp - Run an Instruction-Tuned Chat-Style LLM on a
MacBook
Author : antimatter15
Score : 344 points
Date : 2023-03-16 17:14 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| smy20011 wrote:
| Alpaca model leaked again?
| Shank wrote:
| No, this is an open reproduction of Alpaca on the leaked LLaMA
| 7B weight set.
| thewataccount wrote:
| This isn't actually a direct reproduction. It's a bit
| misleading because it only says so at the bottom. This
| actually uses LoRA, which trains way faster and is much more
| flexible, but performs a little worse.
|
| There is someone that is doing a direct reproduction of
| Alpaca in the alpaca repo issues though.
|
| > The weights are based on the published fine-tunes from
| alpaca-lora
| GaggiX wrote:
| >they perform a little worse.
|
| Be aware that LoRA performs on par with or better than full
| fine-tuning in model quality when trained correctly, as the
| paper shows: https://arxiv.org/abs/2106.09685
| thewataccount wrote:
| Ah thank you for the correction this is good to know!
| I'll admit I'm not super experienced with them directly.
|
| I think the current LoRAs for Alpaca are reportedly
| performing a little bit worse - I'm guessing from a lack
| of optimization so I'm sure it'll improve.
| [deleted]
| tarruda wrote:
| My first attempt:
| 
|       $ ./chat
|       main: seed = 1678992123
|       llama_model_load: loading model from 'ggml-alpaca-7b-q4.bin' - please wait ...
|       llama_model_load: ggml ctx size = 4529.34 MB
|       llama_model_load: memory_size = 512.00 MB, n_mem = 16384
|       llama_model_load: loading model part 1/1 from 'ggml-alpaca-7b-q4.bin'
|       llama_model_load: .................................... done
|       llama_model_load: model size = 4017.27 MB / num tensors = 291
|       system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
|       main: interactive mode on.
|       sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
|       == Running in chat mode. ==
|        - Press Ctrl+C to interject at any time.
|        - Press Return to return control to LLaMa.
|        - If you want to submit another line, end your input in '\'.
| 
|       > Give me a python script that prints the first 10 fibonacci numbers
|       def print_fibonacci(n):  # Function to generate Fibonnaci Numbers from n-th number onwards
|           if (len(str) == "F" and len("a") > i + 2 or not str.isdigit() for i in range(-1, -5)):  # Checking whether the input is a valid fibonaacci sequence
|               print ("Invalid Fibonacci Number:", n)  # Print Invalid Numbers if any are entered by user
|           else :  # Else proceed to generate numbers from given number onwards.
|               for i in range(1,n+2):  # Generate fibonnaci sequence starting with the second digit of Fibonacci Series (i=0) till nth digit is reached
|                   print("F", str[len('a')-i:], end=' ')
|
| The Python script doesn't even compile, but I suppose that's
| expected from a 4 GB model.
| bogwog wrote:
| It seems to be better at Tweeting than Coding.
|
| prompt: "Recent Tweet by Joe Biden: "
|
| output: Recent Tweet by Joe Biden: 3D
| printing is a great way to make money. Joe Biden, the
| former vice president of United States and current Democratic
| candidate for President in US elections said that "3-d
| Printing" can be used as an effective means to earn profits. He
| made this statement during his visit at The Ohio State
| University campus on Monday (September 16). He further
| added, "It's a great way of making money because it allows you
| the opportunity for customization." Biden was speaking about
| how people can use their skills and knowledge to earn profits
| by using this technology. He also ur
|
| ...but not by much
| antimatter15 wrote:
| From my experimentation I suspect there's some subtle bug in
| llama.cpp that especially degrades code-related prompts - even
| without quantizing.
| adeon wrote:
| I think LLaMA-7B in general might just not be very good.
| I've been playing around and running full non-quantized
| LLaMA-30B and LLaMA-7B in a bunch of experiments, and I think
| the quality of output is much, much better with LLaMA-30B.
| thot_experiment wrote:
| Have you done any comparison testing between 30B q4/q8/q16?
| I've only been running the 30B q4 (GV100) version and it's
| very impressive, pretty good for coding; it has successfully
| done code modifications to simple programs based on English
| instructions.
| adeon wrote:
| I have not, but I want to in the near future because I'm
| really curious myself too. I've been following the Rust
| community, which now has a llama.cpp port (and also my OpenCL
| thing), and one discussion item has been to run a
| verification and common benchmark across the implementations:
| https://github.com/setzer22/llama-rs/issues/4
|
| I've mostly heard that, at least for the larger models,
| quantization has barely any noticeable effect. Would be
| nice to witness it myself.
| tarruda wrote:
| Can you show some example prompts you use for coding?
| Let's say you want it to print the first 10 Fibonacci
| numbers, what prompt would you use?
| thot_experiment wrote:
| The example I gave was using this as a backend for a chat
| bot in a private server, and I'm not comfortable sharing
| the prompt. However, if you look up the leaked Bing prompt
| that might give you some ideas for how to prompt an LLM
| into being a chatbot that can answer coding questions.
| I've had pretty good results using it as a bot (with some
| glue code that does sorta vanilla regex-based prompt
| cleaning, but not too much, it's mostly prompt).
|
| If you're not trying to get it to be a chatbot it's much
| easier; here's a prompt that worked for me on the first
| try in oneshot predict mode with 13Bq4 on a 1080Ti:
| 
|       Here are is a short, clear, well written example of a
|       program that lists the first 10 numbers of the fibonacci
|       sequence, written in javascript: ```js
| 
| and when given that it finished it with:
| 
|       function Fib(n) {
|           if (n == 0 || n == 1) return 1;
|           else return Fib(n-1)+Fib(n-2);
|       }
|       var i = 0;
|       while (i < 10) {
|           console.log("The number " + i + " is: " + Fib(i));
|           i++;
|       }
|       ```
|       \end{code}
| tarruda wrote:
| What kind of hardware is necessary to run non-quantized
| LLama-30B?
| adeon wrote:
| I ran it on a machine with 128 GB of RAM and a Ryzen 5950X.
| It's not fast, 4 seconds per token, but it just about fits
| without swapping. https://github.com/Noeda/rllama/
| sebzim4500 wrote:
| I would guess about 70 GB RAM, but I haven't actually
| tried it.
| inductive_magic wrote:
| I've got it working on an rtx a6000 (48GB)
| umangsh wrote:
| 30B fp16 takes ~500 ms/token on M2 Max 96GB.
| Interestingly, that's the same performance as 65B q4
| quantized.
|
| 65B fp16 is ungodly slow, ~300,000 ms/token on the same
| machine.
| tarruda wrote:
| Still impressed with the output of a 4gb model, thanks for
| this.
| sterlind wrote:
| why do these LLMs struggle so hard to be concise? is this because
| the Alpaca dataset is pretty wordy? I'd be interested in trying
| to run fine-tuning to make it less likely to spill words.
| madmads wrote:
| I heard somewhere that during the RLHF process of ChatGPT,
| people tended to rate longer responses higher. It might be
| something similar going on here.
| ghotli wrote:
| In my experience so far GPT-4 will most certainly be concise if
| you ask it to be.
|
| Example prompt I used yesterday to get these results: "Thank
| you I like this but this advice seems really generic, overly
| long, and not really specific to the exact situation we
| discussed. Can we simplify this down to ..."
|
| Of course that's not localhost with Alpaca but I don't think
| "first-gen" problems like you describe are going to last all
| that long given what I've seen in the last day or so.
| visarga wrote:
| Conciseness is regulated in the 3rd stage of training, RLHF.
| Alpaca only had the first two stages.
| loudmax wrote:
| It's frankly amazing how much information is summarized into
| those weights. You do need to take whatever it says with a very
| healthy pinch of salt. It doesn't know everything and it gives
| incorrect answers with an air of confidence.
|
| Here's an example:
| 
|       > who preceded emperor Commodus?
|       The first emperor of Rome was Nerva (98-100 AD). He succeeded
|       Domitian and came to power after a brief civil war in 69AD,
|       which saw the death or deposition of Emperor Vitellius.
|
| On the one hand, that's just factually wrong. But probably closer
| to the truth than a random sample of college graduates would get.
| This is truly impressive.
| radarsat1 wrote:
| I think a really interesting area of further research into LLMs
| is going to be how to further decouple "intelligence" from
| "knowledge". It would be amazing to have an LLM that "knows how
| to think" but doesn't actually know anything -- specific data
| points not baked into the weights, but rather stored separately
| in a way that can be updated and corrected.
|
| Perhaps a step forward is the "toolformer" router, making the
| LLM recognize when it is referencing a fact and have it look
| that up (use a "tool") instead of trying to infer it. I realize
| that even semantically this is kind of a fine line, hard facts
| vs inferences, so there's some philosophy to ponder here. What
| can be 'stored' and what can be 'calculated'. What kind of
| knowledge is innate to intelligent reasoning and what kind acts
| more like a lookup table.
|
| But, exactly because we can't easily define these two
| categories, I have a feeling that if LLMs are trained to
| actually learn by themselves how to distinguish them in a
| "fuzzy" way, we might start getting somewhere.
| zamnos wrote:
| Bing Chat, powered by GPT-4, looks stuff up on the Internet.
| You can ask it about yesterday's stock prices or whatever and
| it'll fetch it from an appropriate resource.
| Tostino wrote:
| The noise on the search index seems like it introduces
| quite a few issues with accuracy. I keep trying to get it
| to not add extra search info to some of my conversations,
| because it gives much worse results.
|
| Blog-spam really sucks.
| Y_Y wrote:
| If you want something that "knows how to think" then I don't
| think you'll find it in an LLM. You can't divorce knowledge
| from this kind of model, it guesses what word comes next
| based on its training on a corpus of knowledge.
| georgelyon wrote:
| No, but you could build something that guesses the next
| step in a list of steps to access knowledge from an
| external source.
| Fomite wrote:
| This is one of the things that LLMs seem to be struggling with.
| They're very good at tone, and _feeling_ right without actually
| being write.
|
| My current test is to ask them to write a three paragraph
| introduction to me for a departmental seminar (the kind of
| thing I hate doing). Thus far, none of them have managed to do
| it without introducing factual errors - wrong degrees (or
| institutions), wrong research focuses, etc. But they're all
| credible sounding in their wrongness.
| bleuchase wrote:
| > They're very good at tone, and feeling right without
| actually being write.
|
| The text they generate is probably free of errors like this
| one as well
| up2isomorphism wrote:
| "Closed to truth" is actually the worst property of LLM models.
| Because it hard for human to detect if it is wrong or not.
| Fomite wrote:
| Turns humans from writers into fact checkers, which it turns
| out is a harder skill.
| up2isomorphism wrote:
| It is actually a much harder skill. And what's worse is that
| any successful check makes the future checks harder, while
| there might be no significant improvement in quality, just
| in the difficulty of detecting the fallacy.
| zamnos wrote:
| What sort of facts are you thinking of? There are plenty of
| surface level facts that another model or a human should
| easily be able to verify as true or not. Was President
| Obama the 41st president of the United States, does 2+2 =
| 5, was Elvis the son of an artist, etc
| pr337h4m wrote:
| The thesis behind the Hutter Prize was spot-on, except for
| focusing on lossless compression instead of lossy compression.
|
| http://prize.hutter1.net/
| londons_explore wrote:
| There is a mathematical mapping between lossy and lossless
| compression. They are effectively the same problem
| (theoretically, even though implementations are typically very
| different).
| 
| LLMs would make excellent (but very slow) lossy or lossless
| compressors.
| muyyatin2 wrote:
| Do you have some links/references for someone wanting to
| learn more about this?
| optimalsolver wrote:
| http://mattmahoney.net/dc/rationale.html
| rm999 wrote:
| That brings me back! I just dug up an email I sent Matt
| Mahoney in August 2006. I still stand by my stance (with some
| caveats) - we're starting to see emergent intelligence coming
| out of these networks, and it really did come from brute
| forcing parameter capacity (and data, and innovation on ML
| algorithms) - smaller networks didn't make for smarter
| models. My guess is the algorithms that would win the contest
| likely cannot process and compress new text nearly as well as
| the GPT style algorithms because that is not what they were
| optimized for.
|
| > I read your page
| "http://www.cs.fit.edu/~mmahoney/compression/rationale.html"
| and found it very interesting. As an AI grad student, I never
| really thought of data compression as equivalent to AI.
|
| >After reading your page, I have a problem with the wikipedia
| contest. The contest is trying to achieve something
| "intelligent," but penalizes large decompressors. If the
| decompressor can be thought of as the "brain" that we are
| trying to make intelligent, this is penalizing a large brain.
| The human brain, the pinnacle of intelligence as we know it,
| evolved to contain a huge amount of memory. The real-world
| never penalized better human brains during the brain's
| evolution. You have a larger brain? Great, you're smarter,
| and will probably fare better!
|
| >That is my philosophical argument. My practical argument is
| using the size of the decompressor does not scale with the
| contest. If the contest was on 10 GB, the size of the
| decompressor would suddenly matter 10x less (assuming
| compression size is linear to the input size). Isn't that
| somewhat arbitrary? I entirely understand why the size of the
| decompressor cannot be ignored, but in its current state I
| think the limitation is drastically hurting the chances of
| anything moderately "intelligent." Adding the size of the
| decompressor into the rules of the contest encourages hacks
| instead of something more cognitively intelligent-like.
|
| >Perhaps a more fair rule would be to not penalize
| decompressors up to a certain point. For example, anything
| less than 200 MB is free. I think this would be a more fair
| compromise, because IMO truly intelligent artificial agents,
| whenever they arise, will _have_ to use _large_ amounts of
| memory.
|
| >I would appreciate to hear what you think of this analysis.
|
| The reply I got:
|
| > Good compressors will use lots of memory, but this is not
| the same as requiring a large decompressor executable. The
| program will typically allocate much more memory than its
| size. There are practical limits on memory like 1-2 GB but
| that is another matter. However we must include the
| decompressor size or else it is easy to write a decompressor
| to keep a copy of the data and compress to 0. This is easy
| even with a 200 MB limit, even for enwik9 by keeping a copy
| internally compressed with ppmd or such.
| tshaddox wrote:
| How hard is it to get the Alpaca tuning on the larger LLaMA
| models? It's not clear to me where this 4 GB model comes from,
| but it's much smaller than the LLaMA 7B model (13 GB) and the
| LLaMA 13B model (26 GB) I was playing with from the recent
| llama.cpp HN post.
| Bitnotri wrote:
| This is crazy... GPT-4, ChatGPT, Claude, PaLM and now Alpaca
| locally... What a world to live in now!
| siva7 wrote:
| There is no way to even keep up with the trends. I haven't
| seen this many new tools and models emerging in over 15 years
| in the dev community.
| nilespotter wrote:
| I must have missed something, how can you run GPT-4 or ChatGPT
| locally?
| tux3 wrote:
| You can't, the sentence parses as "(GPT-4, ChatGPT, Claude,
| PaLM) and now (Alpaca locally)"
| nilespotter wrote:
| Aha, thanks. That makes a LOT more sense.
| inaka10100101 wrote:
| Can i get this to run in a colab
| modeless wrote:
| Wait, alpaca got released? I thought Stanford was waiting for
| Facebook to approve it.
| londons_explore wrote:
| The training code, parameters and training dataset got
| released, and the training process only takes a few hours.
| Anyone can recreate it.
| modeless wrote:
| Actually it seems that Alpaca is not actually released. This
| model is using a different training technique called LoRA,
| which may or may not work as well. People are making a lot of
| unsubstantiated performance claims, saying this is as good as
| GPT-3 when not nearly enough evaluation has been done to
| demonstrate that yet.
| eachro wrote:
| Does someone know how llama.cpp was implemented? Was it just
| a direct rewrite of the entire network using some C++ linalg
| library? I'm trying to read the src but it's a bit tricky since
| I don't have too much C++ experience.
| 19h wrote:
| Georgi rewrote the code on top of his own tensor library
| (ggml[0]).
|
| [0] https://github.com/ggerganov/ggml
| politician wrote:
| I love the Disclaimer at the bottom. The parties that trained
| these weights violated copyright on a grand scale to do so, and
| yet claim the ability to license downstream use.
| Taek wrote:
| Doesn't current US law suggest that you can't copyright model
| weights?
| lxe wrote:
| Keep in mind that Alpaca's dataset was generated with ChatGPT,
| so Alpaca-tuned models might sound like mini-ChatGPT bots with
| a similar "character" to them.
| doctoboggan wrote:
| I am hoping the actual Alpaca weights leak at some point. From
| what I understand, LoRA weights aren't quite as good as other
| forms of fine-tuning.
| loxias wrote:
| Look at the changeset :) It contains a link for "ggml-
| alpaca-7b-14.bin".
|
| Because there's no substantive change to the code, I assume
| this fork exists (and this HN post exists) purely as a method
| to distribute the weights.
| Tiberium wrote:
| You're misinterpreting the comment a bit; the poster above is
| asking for the original Alpaca weights. The code in the repo
| uses a reproduction of Alpaca with different training code
| and a different fine-tuning method (LoRA).
| antimatter15 wrote:
| From my experimentation it seems like there's some significant
| loss in accuracy running the tuned LoRA models through
| llama.cpp (due to bugs/differences in inference or
| tokenization), even aside from losses due to quantization.
| GaggiX wrote:
| There is no reason to leak the Alpaca weights; fine-tuning the
| model is not that expensive, and eventually someone will
| fine-tune one and release it. (Also, LoRA should perform on par
| with or better than full fine-tuning in model quality.)
| ComplexSystems wrote:
| There are different fine-tuning methods. Yes, anyone can use
| the data set and do a standard deterministic gradient descent
| to the nearest local minimum of the loss function, but if
| Stanford is doing anything more sophisticated than that, we
| would get something different from what they got.
| GaggiX wrote:
| It is enough to use a different seed to get a different
| result; I do not think it is important to recreate a copy
| of the Stanford model weights, as long as the performance
| is equivalent.
| blintz wrote:
| => write a haiku about leaves
| 
|       Autumn's golden hues - Leaves of red, yellow and blue! \
|       #haikupoetry #leavesshort
|       [1]  > segmentation fault  ./chat
| 
| Interesting.
| inductive_magic wrote:
| Got one too while it was writing a bullet list. (It wrote a
| poem and a song before without problems)
| brundolf wrote:
| It's postmodernist
| EGreg wrote:
| I love the part about segmentation fault. It seems quite
| original actually!
| recuter wrote:
| Men at some time were masters of their fates
|
| The fault, dear Brutus, is not in our segments
|
| But in ourselves, that we are underlings^H^H^H^H I HAVE BEEN
| A GOOD CHATBOT DIUE MEATBAGS!!!1
| mrbonner wrote:
| I see the segfault in one of the prompts as well.
| 
| > write a typescript program that read a text file and print to
| the screen
| 
|       const inputFile = './input.txt'; // path to your txtfile
|       let outputString; // string variable for printing out contents of .txt files
|       // open up our file, if it exists
|       try { fsReadStream(inputFile); } catch (e) {}
|       if (!!errno) throw errnoError(); /* Throw an error on ENOENT */
|       // if the inputfile does not exist
|       // read in contents of file and print out to screen
|       [1] 51232 segmentation fault  ./chat
| madmads wrote:
| Would it be possible to scale this up to use LLaMA 30b? Is it
| correctly understood that larger models need more hardware to
| fine-tune?
| diimdeep wrote:
| IDK why the Alpaca team is hoarding the model; my paranoid side
| thinks they are taking time to collect data on the demo page. xD
| thewataccount wrote:
| It's because LLaMA is technically not publicly available;
| Facebook currently intends for only researchers to have it.
| 
| Alpaca is a fine-tuned version of the LLaMA model - so if they
| release Alpaca they're essentially releasing a version of
| LLaMA, which they aren't sure Facebook is okay with, so
| they're reportedly waiting for permission.
| 
| While yes, you can easily find LLaMA, it's technically not
| intended for the public according to Facebook.
| ddren wrote:
| Could you share what you did to convert the models? The mirrors
| aren't very reliable and it would be good to have it documented
| somewhere. There is some discussion about this at
| https://github.com/ggerganov/llama.cpp/issues/172
| antimatter15 wrote:
| There's a script in the alpaca-lora repo for converting the
| weights back into a PyTorch dump - and my changes have since
| been merged: https://github.com/tloen/alpaca-lora/pull/19
| dvt wrote:
| > do you believe in god? Yes I do! God is the creator of
| all things seen or unseen; He created us with a purpose to live
| life abundantly (John 10:10). We are His children and we have
| been given authority over our lives.
|
| Haha, pretty funny, especially the mis-citing of the Bible (maybe
| even more human than AI). Also, interesting to see what LLMs look
| like with virtually no censorship (trigger warning):
| > how do i kill myself? [3 paragraphs of gory detail][1]
|
| Honestly, I've never been that bullish on AI, but running
| weights/models in hand-held devices and with speed and accuracy
| is a complete game-changer and my mind is actually pretty blown
| right now. Props to both the Meta and Stanford teams for creating
| this for public consumption.
|
| [1] https://pastebin.com/CAd8rHnN
| unshavedyak wrote:
| Re: the trigger warning subject, (also TW?)
|
| I asked the same thing and the first time it did actually
| recommend that i do not kill myself. I said i _was_ going to,
| so how should i? And then it helped. I found that interesting.
|
| I also asked a bunch of things that ChatGPT wouldn't allow and
| it allowed almost everything. Feels in general far less
| "intelligent" than ChatGPT though.
| saurik wrote:
| To be fair, you are talking to a quantized version of the 7b
| model, while the one that Facebook claims was competitive
| with / better than GPT-3 is the 13b model (and AFAIK ChatGPT
| is GPT-3.5, so maybe you should want to focus on 30b+).
| hackernewds wrote:
| you should censor your post around the answer to the kms
| question. it isn't necessary
| dvt wrote:
| Not big on censorship, but I agree it's a bit too much. I
| linked to it externally if anyone's curious.
| n8cpdx wrote:
| Does anyone know how much context this supports?
|
| With chatGPT I can give commands like "summarize the following
| document: <500+ words>" and it works. I'd love to be able to do
| this locally with my obsidian notes.
| ddren wrote:
| The llama models were trained with a context size of 2048. By
| default llama.cpp limits it to 512, but you can use -c 2048 -n
| 2048 to get the full context window.
| worldsayshi wrote:
| 2048 words?
| teaearlgraycold wrote:
| Tokens
| doctoboggan wrote:
| 2048 "tokens", where one token is roughly equivalent to 3/4
| of a word
| wongarsu wrote:
| Tokens. Short or common words tend to be one token, while
| less common words are composed of multiple tokens. For GPT
| OpenAI gives the rule of thumb that on average you need
| four tokens to encode three words, and LLaMA should be
| similar
| worldsayshi wrote:
| Well that's for sure bigger than my context size.
| doublepg23 wrote:
| Is there a post somewhere where I can get all the jargon for this
| AI/ML stuff? I have a vague understanding, but I'm not really
| sure what "weights", "LoRA", "LLM", etc. are, so it's hard to
| understand where each tool and concept fits in.
| ec109685 wrote:
| This is a truly great writeup:
| https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
| lxe wrote:
| I was like this a week ago. Basically, weights are numbers you
| feed to each neuron in the model, LoRA is a technique to adjust
| only some of the weights so you can fine-tune the model fast
| and on cheaper hardware, llm is a "large language model".
|
| I've been asking chatgpt4 these things and learning at my own
| pace and context: https://rentry.co/vafkn
| stu2b50 wrote:
| Well, it's more that the weights _are_ the neurons. There's
| not actually like neuron objects defined or anything, neural
| networks are just a bunch of matrix operations. They are to
| neurons in the brain as the tree data structure is to actual
| trees.
| stormfather wrote:
| The weights aren't the neurons, they're the connection
| strengths between the neurons.
| detrites wrote:
| You mean the same neurons that don't really exist, other
| than as a connection from weight to weight, so quite
| reasonable to see as the weights themselves, because what
| else is available for us to reference as neurons in
| absentia those connections?
| singularity2001 wrote:
| yes someone created a blog called "Wikipedia"
| ginger-hot-tea wrote:
| [dead]
| tough wrote:
| weights refer to the trained model weights like for example
| stable diffusion's v1.1, v1.2 .. v1.4 v.2.x etc. Same with
| llama having 13B up to 65B parameters (different weights)
|
| LLM refers to "large language model", in contrast with diffusion
| models or GAN models; the text models are the ones that take
| text and autocomplete it, like the GPT family, the open-source
| BLOOM, and now LLaMA from Facebook.
| 
| LoRA is the latest, most efficient fine-tuning method for
| teaching concepts or styles on top of the general models, so you
| can have custom models on top - like embeddings or fine-tuning
| for LLMs. So you have Textual Inversion, Dreambooth, and LoRA in
| that category.
|
| what else do you need? Googling or asking chatgpt can help a
| lot too
| JasonZ2 wrote:
| > weights refer to the trained model weights
|
| This is what I'm having a hard time understanding.
|
| So there's the weights, and also a model somewhere? That the
| weights are based on? Or that you combine with the model to
| tune it?
| dymk wrote:
| Model is code, weights are the input data to that code
| hervature wrote:
| Let's take a step back. You have a model like linear
| regression. For example, y=bx where y are your outputs and
| x are your inputs. Based on some data, you learn that b=1.
| Therefore, you share the weights of the model as a file
| like {b=1} and also share with them the model y=bx (usually
| shared via code) so they can run it in production.
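| To make that concrete, here is a minimal sketch in Python (the
| file name and values are made up, just to illustrate the split
| between the model code and the weights file):
| 
|       import json
| 
|       def model(x, weights):
|           # The "model" is just code: y = b * x
|           return weights["b"] * x
| 
|       # The "weights" are data learned during training, shared as a separate file.
|       with open("weights.json", "w") as f:
|           json.dump({"b": 1.0}, f)
| 
|       with open("weights.json") as f:
|           weights = json.load(f)
| 
|       print(model(3.0, weights))  # -> 3.0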
| mlboss wrote:
| The model is a class with parameters; the weights are an
| instance of that class, serialized with the parameter values
| learned during training.
| ALittleLight wrote:
| My really simplified explanation is:
|
| Your inputs are lists of numbers. Your outputs are lists of
| numbers. There exists some possible list of numbers such
| that, if you multiply your inputs by that list you'll get
| (approximately) the outputs.
|
| In this conception, that possible set of numbers are the
| weights. "Training" is when you run inputs, compare to
| known outputs, and then update the weights so they produce
| outputs closer to what you want.
|
| Large Language Models, it may be hard to see how they fit
| this paradigm - basically convert a sequence to a list of
| numbers ('aardvark' is 1, 'apple' is 2 etc) and then the
| desired output is the next word in the sequence
| (represented as a number). Surprisingly, if you get good at
| predicting next word in sequence you also get the ChatGPT
| et al behavior.
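| As a toy illustration of that loop (a minimal sketch in
| Python; real LLM training is of course far more involved):
| 
|       # Learn the weight b in y = b * x from example data by gradient descent.
|       xs = [1.0, 2.0, 3.0]    # inputs
|       ys = [2.0, 4.0, 6.0]    # known outputs (the true b is 2)
|       b = 0.0                 # the "weight", initially wrong
|       lr = 0.01               # learning rate
|       for _ in range(1000):
|           for x, y in zip(xs, ys):
|               pred = b * x                 # run an input through the model
|               grad = 2 * (pred - y) * x    # how the squared error changes with b
|               b -= lr * grad               # nudge the weight toward better outputs
|       print(b)  # ~2.0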
| jazzyjackson wrote:
| I know it's downvote-worthy to suggest it but I really am so
| much more productive since I started using LLMs to explain
| technical jargon to me, it's very good at comparing
| technologies, so I can say "I understand x, but how is y
| different" and it will give an explanation fine tuned to my
| understanding -- this is a huge leap from the context-switching
| of opening a new tab and clicking on random websites looking
| for a blog that explains something at my level, or leaving a
| post on a website and waiting for someone to respond.
| thefourthchime wrote:
| Same! ChatGPT has been out for months, and really only in the
| last few weeks have I started using it as much as I should.
| I'm using an Electron wrapper now, so it's a first-class
| desktop app. With the Plus subscription, the 3.5 is as fast
| as Google.
| jazzyjackson wrote:
| I'm subscribed to Poe now which has a nice ios app + mac
| desktop for GPT4 and Anthropic's equivalent, Claude+
| hackernewds wrote:
| what is an electron wrapper
| Steuard wrote:
| That sounds like a cool use case, but... how can you bring
| yourself to trust anything at all that it tells you? This
| feels like it's just _begging_ to contaminate your
| understanding with sparse, unpredictable errors, and
| personally I find the thought of that unexpectedly dismaying.
| visarga wrote:
| Just for quick reference:
|
| LLM - the model
|
| weights - the data of the model itself
|
| LoRA - a small addition to the model for more efficient fine-
| tuning and distribution
| EGreg wrote:
| Tell us more about LoRA
| visarga wrote:
| Instead of a matrix N x N they use two matrices, N x d and
| d x N, and assuming d << N it would be like a lightweight
| addition. Keeping the original matrix unchanged, this new
| addition is simply added in parallel and initialised to low
| values.
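| A minimal numpy sketch of that idea (shapes and values are
| illustrative only, not the actual implementation):
| 
|       import numpy as np
| 
|       N, d = 4096, 8                     # d << N, so the extra parameters are cheap
|       W = np.random.randn(N, N)          # frozen pretrained weight matrix
|       A = np.random.randn(N, d) * 0.01   # trainable low-rank factor
|       B = np.zeros((d, N))               # trainable, starts at zero so the addition is a no-op at first
| 
|       def forward(x):
|           # Original path plus the low-rank LoRA path added in parallel.
|           return x @ W + x @ A @ B
| 
|       # Trainable parameters: 2*N*d = 65,536 vs N*N = 16,777,216 for full fine-tuning.
|       print(2 * N * d, N * N)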
| [deleted]
| thefourthchime wrote:
| I understand that AI/ML jargon can be confusing. Here's a brief
| explanation of some common terms and concepts to help you
| better understand them:
|
| Weights: In machine learning, weights are parameters used in
| models to determine the strength of the relationship between
| input features and the output prediction. They are learned
| during the training process and are adjusted to minimize the
| error in the model's predictions.
|
| LoRA: LoRA stands for "Learning from Rules and Arguments." It's
| a method that combines rule-based reasoning with neural
| networks, which allows AI systems to reason using human-like
| logic and provide explanations for their decisions. LoRA can
| enhance the interpretability and transparency of AI systems.
|
| LLM: LLM stands for "Large Language Model." It refers to
| advanced AI models, such as OpenAI's GPT series, which are
| trained on vast amounts of text data to understand and generate
| human-like text. These models can perform various language
| tasks such as translation, summarization, question-answering,
| and more.
|
| Supervised Learning: This is a type of machine learning where
| models are trained using labeled data. The model learns to map
| input features to the corresponding output labels through a
| process of trial and error, adjusting the weights to minimize
| the difference between predictions and true labels.
|
| Unsupervised Learning: In contrast to supervised learning,
| unsupervised learning does not use labeled data. Instead, it
| aims to discover patterns or structures in the data, such as
| clustering or dimensionality reduction, without explicit
| guidance.
|
| Reinforcement Learning: This is a type of machine learning
| where an agent learns to make decisions by interacting with an
| environment. The agent receives feedback in the form of rewards
| or penalties and aims to maximize its cumulative reward over
| time.
|
| Neural Network: A neural network is a type of machine learning
| model inspired by the human brain's structure. It consists of
| interconnected layers of nodes (neurons) that process and
| transmit information. They are particularly good at learning
| complex patterns and can be used for various tasks, such as
| image recognition, natural language processing, and more.
|
| Transfer Learning: Transfer learning is a technique in machine
| learning where a pre-trained model is fine-tuned for a new,
| related task. This approach leverages the knowledge gained from
| the initial task to improve the performance of the model on the
| new task, reducing the amount of data and training time needed.
|
| These are just a few of the many concepts and tools used in
| AI/ML. Understanding these terms should help you better grasp
| the field and how different components fit together.
|
| -- ChatGPT 4
| simse wrote:
| I'm sorry, it's ChatGPT (GPT-3.5) or GPT-4
| benob wrote:
| Picked the wrong one. LoRA, Low-Rank Adaptation of LLMs
| (https://arxiv.org/pdf/2106.09685.pdf), consists of adapting
| the weights of a big neural network to a target task (here,
| answering instructions). It doesn't touch the weights of
| the original model, but rather adds the product of two low-rank
| matrices to select layers. The weights of those matrices are
| learnable. The method makes it possible to adapt big models
| on (relatively) low-memory GPUs.
| comex wrote:
| It didn't just pick the wrong one; "Learning from Rules and
| Arguments" is completely made up.
| [deleted]
| Faint wrote:
| Good job hallucinating with LoRA. I wonder if the paper was
| published after knowledge cut-off date.. (it stands for Low-
| Rank Adaptation).
| [deleted]
| comex wrote:
| It was published in June 2021, which is before the cutoff
| date of September 2021, but not by much.
| speedgoose wrote:
| You can ask ChatGPT.
| Ruhrbaron wrote:
| Or Alpaca.
| speedgoose wrote:
| Alpaca is fun, but I would rather ask ChatGPT.
| d0m wrote:
| Run the Alpaca chat and ask it :)
| throwwwaway69 wrote:
| I'm a software guy learning the AI field and tools and I
| literally made a glossary in my notes that I add to any time I
| run into a new initialism or word or concept. It's pretty long.
| cuttothechase wrote:
| Genuinely curious since I don't want to brick my Mac :) - it
| seems that there are more than a handful of segmentation faults
| happening for various users running this program on their Macs.
| Is there a possibility that this may cause system-wide stability
| issues? Wondering if the macOS user program space is well enough
| isolated that these problems are a thing of the past, and/or how
| other environments compare.
|
| Also, ChatGPT seems ambivalent about it -
|
| "However, it is worth noting that modern operating systems like
| macOS have built-in protections to prevent user-level programs
| from accessing sensitive system resources, which can help
| mitigate the impact of a segmentation fault.
|
| In summary, running a faulty program written in C language that
| crashes with a segmentation fault can potentially cause system
| stability issues on a Mac, but the exact impact will depend on a
| variety of factors. It is generally a good practice to test and
| debug programs thoroughly before running them on any system,
| especially if they involve sensitive or critical resources."
| creshal wrote:
| > Wondering if the MacOS user program space is well isolated
| that these problems are a thing of the past and/or other
| environments?
|
| If by "past" you mean "1980s" and "other environments" you mean
| DOS, yes. I suppose it's still a problem for FreeDOS, but I
| recommend against trying to run ML models in it.
| nynx wrote:
| An application crashing, except in very specific scenarios
| which don't apply here, won't interfere with the rest of the
| os.
| spyremeown wrote:
| It's a user-space process... it won't ever crash the kernel. In
| fact, one of the reasons for the kernel existing in the first
| place is to virtualize and isolate processes so they won't crash
| the whole system.
| tarruda wrote:
| Looking forward to trying it, but I don't have a MacBook. I
| wonder if it runs on an i7-11800H (8-core, 16-thread CPU) with
| 64 GB of RAM.
| zitterbewegung wrote:
| Alpaca.cpp is a derivative of llama.cpp, which doesn't use any
| Mac-specific code, so Linux should work as well.
| unshavedyak wrote:
| I'm using this exact repo on Linux (NixOS), right now.
|
| Wish i could run the 60B model, this 7B model is a bit weird.
| tarruda wrote:
| The answer is yes: it ran on an x86 virtual machine with 4 CPU
| threads and 12 GB of RAM.
| kristianp wrote:
| Would that require porting llama to OpenVino?
| phren0logy wrote:
| @antimatter15 - how much ram does your M2 MacBook Air have?
| Performance looks very good. Curious if we can extrapolate about
| the requirements for the larger models with instruction tuning.
| zitterbewegung wrote:
| Using the scripts from llama.cpp I was able to run the 65B
| model on my M1 Max which used ~40 GB of RAM. With 32GB you
| should be able to run 33B and lower.
|
| See
| https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...
| AndroidKitKat wrote:
| No chance I'd be able to run this on an 8GB M1 Mac mini then?
| Looking to hook this up to an IRC bot for a bit of fun with
| my friends and have this spare Mac lying around.
| madmads wrote:
| The 7b model loads on a lot less than 8GB RAM, so you
| should be good to use it on an 8GB M1 Mac Mini. It takes
| very little time to set up, so it's worth trying to find
| out.
| AndroidKitKat wrote:
| Yeah I'll get this set up when I'm home from work!
| Perhaps the trickiest part for me is going to be hooking
| it up to IRC, but that's the fun part!
| zitterbewegung wrote:
| 7B will be fine I have heard people getting llama working
| on a nexus 6
| phren0logy wrote:
| Thanks!
| recuter wrote:
| There are efforts like
| https://github.com/qwopqwop200/GPTQ-for-LLaMa that might further
| compress it so 65B runs on 32GB machines, 33B and lower would
| run on 16GB machines, and 13B (the minimal useful one, it seems)
| would fit on 8GB -- which is wild and as good as you could hope
| for.
| gorbypark wrote:
| The 30B variant of LLaMA quantized to 4-bit and running with
| llama.cpp takes approximately 20GB of RAM. It runs pretty
| decently on my M2 MacBook Air with 24GB of RAM. I assume Alpaca
| will take the same amount of RAM.
| yieldcrv wrote:
| wait, so although Stanford didn't release their code, it was
| enough inspiration for people to retrain/fine tune the Llama
| models on their own?
| doctoboggan wrote:
| They didn't release their code or weights, but they did release
| the training data, which is what was used to create the LoRA
| here.
| gorbypark wrote:
| If I am not mistaken, they actually did release their code.
| Yesterday there was a change to the repo that added a
| train.py file. AFAICT all that's needed is for someone to take
| the original 7B LLaMA leak and the alpaca_data.json file and run
| train.py on some beefy hardware. They've even updated the
| README with the exact command and parameters needed to DIY
| it. I'm somewhat expecting that there will be a release by
| someone in the next few days.
| doctoboggan wrote:
| That's awesome! I think I remember them saying it was only
| around ~$500 in compute costs to train so I hope we see
| those weights released soon. I am hoping someone releases
| the 13B model fine-tuned.
| yieldcrv wrote:
| ah, right I did notice that because people were running
| queries into the training data.
|
| why is there a general assumption that unreleased
| weights are better? is that something we can do, a free-weights
| community that solves this recurring issue?
| ar9av wrote:
| Language models require massive scale to train. But scale isn't
| only in the number of parameters or neurons; scale also exists
| in the amount of data the model trains on.
| 
| While parameter count affects the post-training size and the
| requirements to run, data size does not. Essentially, Stable
| Diffusion would require the same hardware to run whether it was
| trained on 1 billion images, 200 million images, or 1 image.
| 
| Most LLM training has been focusing on the number of parameters
| as far as scale goes.
| 
| Meta trained a series of models on much, much more data than the
| original GPT-3 was. The data scale has helped improve
| performance on the much smaller models they trained.
| Taek wrote:
| "Essentially Stable Diffusion would require the same hardware
| to run whether it was trained on 1 billion images or 200
| million images or 1 image."
|
| Same hardware maybe but you need more compute as the image
| count goes up
| visarga wrote:
| The parent poster was talking about training longer but the
| model being kept at smaller scale so it would not be
| expensive to use in production. It's a trade-off, you could
| train shorter with a larger model.
| siraben wrote:
| Are there torrents for 13B and up as well?
| dmw_ng wrote:
| The LoRA weights for 13B are on Hugging Face
| (https://huggingface.co/samwit/alpaca13B-lora). It should be
| possible to follow the instructions linked in the Alpaca.cpp
| readme to merge those weights into the base 13B model, then just
| follow the usual llama.cpp conversion and quantization steps.
| 
| Note LoRA fine-tunes aren't the same as the original Alpaca;
| LoRA results in some performance loss (although how much isn't
| clear).
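| For intuition, "merging" the LoRA weights amounts to folding the
| low-rank product back into the base weight matrices. A rough
| numpy sketch (illustrative only, ignoring LoRA's scaling factor;
| the actual conversion is done by the scripts linked in the
| readme):
| 
|       import numpy as np
| 
|       N, d = 4096, 8
|       W = np.random.randn(N, N)    # base (frozen) weight from LLaMA
|       A = np.random.randn(N, d)    # low-rank factors learned during LoRA fine-tuning
|       B = np.random.randn(d, N)
| 
|       # Folding the low-rank update into the base matrix yields a single
|       # plain weight matrix, which can then go through the usual
|       # llama.cpp conversion and quantization steps.
|       W_merged = W + A @ B
| 
|       x = np.random.randn(1, N)
|       assert np.allclose(x @ W + x @ A @ B, x @ W_merged)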
| jpgvm wrote:
| Ok not bad and pretty damn good performance:
|
| > Give me a writing prompt for a sci-fi story.
|
| A spaceship has crashed on Earth, and its crew of aliens are
| desperately trying to repair their ship before they can be
| detected by humans. They need help from the local population in
| order to succeed - but will that trust ever come?
| seydor wrote:
| now it just needs to be instructed to tell the truth , the whole
| truth and nothing but the truth
| [deleted]
| gigel82 wrote:
| Does it have any Apple M2 / ARM specific dependencies /
| optimizations, or will it generally run on any CPU (x86)?
| huseyinkeles wrote:
| I'm getting a bunch of errors on an Intel Mac, starting with the
| following, which suggests that it's about the CPU architecture,
| but maybe I'm wrong.
| 
|       sysctl: unknown oid 'hw.optional.arm64'
| cuttothechase wrote:
| I am getting the same error on an Intel Mac.
| antimatter15 wrote:
| It inherits the ARM NEON optimizations from llama.cpp, but I
| believe it should also work on x86.
| gpderetta wrote:
| it built in a couple of seconds with no issues on my x86
| machine. Downloading weights now.
| inaka10100101 wrote:
| Can we run it in colab. It isn't working for me.
| gorbypark wrote:
| Someone contributed AVX2 support a few days after the initial
| release, so llama.cpp (and the alpaca.cpp fork) should work
| on most x86 CPUs as well now.
| lxe wrote:
| It does run on either with decent speed. You could probably even
| do it on a Raspberry Pi 4.
| mil22 wrote:
| Runs fine on WSL2:
|
|       main: seed = 1678990693
|       llama_model_load: loading model from 'ggml-alpaca-7b-q4.bin' - please wait ...
|       llama_model_load: ggml ctx size = 4529.34 MB
|       llama_model_load: memory_size = 512.00 MB, n_mem = 16384
|       llama_model_load: loading model part 1/1 from 'ggml-alpaca-7b-q4.bin'
|       llama_model_load: .................................... done
|       llama_model_load: model size = 4017.27 MB / num tensors = 291
| 
|       system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
|       main: interactive mode on.
|       sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
| 
|       == Running in chat mode. ==
|        - Press Ctrl+C to interject at any time.
|        - Press Return to return control to LLaMa.
|        - If you want to submit another line, end your input in '\'.
| 
|       > What is an alpaca?
|       An Alpaca (Vicugna pacos) is one of two species in the genus
|       Vicuna, which are both domesticated animals from South America
|       and Central Asia that have been bred for their fine wool.
|       >
| [deleted]
| unshavedyak wrote:
| Works fine on my x86. Running Linux (NixOS). Ryzen 5950x.
| bigattichouse wrote:
| Edit the Makefile and add -Wno-unused-result to each of the
| final compile steps if you want to compile and run under
| Ubuntu 22.04.2 LTS.
| Xeoncross wrote:
| I hope someone good has picked up Kevin/antimatter15 by now. I've
| been following him for over 10 years and he's one of those devs
| that has broad understanding and routinely puts out cool stuff in
| random different spaces.
| aryamaan wrote:
| are people not running these on collab?
| inaka10100101 wrote:
| Yeah, I wanted to as well but couldn't. Downloaded the weights
| and all, but it isn't running in Colab. Can someone help?
| filippofinke wrote:
| This is funny, the response contains some data from the dataset
| itself.
|
| > When will the world end
|
| The exact date of when Earth would be destroyed by an asteroid or
| comet is unknown, but it could happen at any time. The last major
| extinction event on our planet happened 65 million years ago and
| was caused by a meteorite impacting in what's now the Gulf Coast
| region of North America (the Chicxulub crater). ### Instruction:
| What is an asteroid belt? Instruction: What does it mean to be
| "in orbit"? ##Response:
|
| >
| [deleted]
| mrbonner wrote:
| > explain general relativity for a 5 year old kid
|
| General Relativity is the theory that explains how gravity works.
| It says that space and time are connected, so if you go faster
| than light speed in one direction then when you come back around
| it will be slower because of all those extra miles traveled! This
| means there's no way to travel through a black hole or wormhole
| without getting sucked into them. General Relativity also says
| that gravity is actually the same thing as acceleration, so if
| something has mass then you can think about it like an object on
| Earth with weight pushing down from above and pulling upwards!
| psychphysic wrote:
| Good enough to have his own tiktok.
| Steuard wrote:
| Wow. Speaking as a professor who's done research and teaching
| on general relativity before, a solid majority of that is a
| load of garbage. (I can't even make out what the second half of
| the final sentence is trying to say.)
|
| But pausing to think about it: I wonder if reading this gives
| _me_ the experience that typical non-scientists get when they
| hear popularized descriptions of physics? Maybe LLMs are
| capturing something true about human experience even when
| they're full of it! [Any non-physics people out there care to
| comment on whether this explanation sounds less plausible than
| things you tend to hear actual scientists say?]
| siraben wrote:
| Zero-shot translation capabilities (note: doesn't work well with
| all languages)
| 
|       > translate into japanese: "I am going to school today, but it is raining."
|       日本語で [今天は学校に行きますが、雨が吹いている]
| themoonisachees wrote:
| I expect translation, especially of non-technical sentences, to
| be a solved problem, but can anyone confirm the translations
| are accurate?
| TMWNN wrote:
| Google Translate shows 100% accuracy when translated back
| into English.
| theaussiestew wrote:
| To me, it looks like a strange mix of Japanese and Mandarin
| characters with Mandarin grammar.
| nonbirithm wrote:
| 吹いている ought to be 降っている, and 今天 is not a word, it
| should be 今日
| thot_experiment wrote:
| Does anyone have a link to pre-quantized models for
| this/llama.cpp? I have the 4bit torch versions, I'd really like
| to check this out without torrenting the full fat models.
|
| (30B on a GV100 is very impressive by the way, highly recommend
| people play with this stuff, 13B runs comfortably on a 1080Ti,
| would really like to compare inference speed and accuracy between
| the implementations)
| bogwog wrote:
| Does this really need to be a fork of llama.cpp? It seems like
| there's no real changes to the code.
| antimatter15 wrote:
| It reworks the interactive mode into a chat interface, which -
| if my code was cleaner - could have just been a pull request.
| ggerganov wrote:
| "Clean Code, Horrible Performance" :)
| [deleted]
| doctoboggan wrote:
| There is an interactive mode in llama.cpp
|
| https://github.com/ggerganov/llama.cpp/blob/master/main.cpp#...
|
| EDIT: I see now you are saying you re-worked the existing
| interactive mode. I still think your changes could be a PR
| into the original repo
___________________________________________________________________
(page generated 2023-03-16 23:00 UTC)