[HN Gopher] Show HN: Alpaca.cpp - Run an Instruction-Tuned Chat-...
       ___________________________________________________________________
        
       Show HN: Alpaca.cpp - Run an Instruction-Tuned Chat-Style LLM on a
       MacBook
        
       Author : antimatter15
       Score  : 344 points
       Date   : 2023-03-16 17:14 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | smy20011 wrote:
       | Alpaca model leaked again?
        
         | Shank wrote:
         | No, this is an open reproduction of Alpaca on the leaked LLaMA
         | 7B weight set.
        
           | thewataccount wrote:
            | This isn't actually a direct reproduction. It's a bit
            | misleading because it's only mentioned at the bottom. This
            | actually uses LoRAs, which train much faster and are much more
            | flexible, but they perform a little worse.
            | 
            | There is someone doing a direct reproduction of Alpaca in the
            | alpaca repo issues, though.
           | 
           | > The weights are based on the published fine-tunes from
           | alpaca-lora
        
             | GaggiX wrote:
             | >they perform a little worse.
             | 
              | Be aware that LoRA performs on par with or better than fine-
              | tuning in model quality if trained correctly, as the paper
              | shows: https://arxiv.org/abs/2106.09685
        
               | thewataccount wrote:
                | Ah, thank you for the correction - this is good to know!
                | I'll admit I'm not super experienced with them directly.
               | 
               | I think the current LoRAs for Alpaca are reportedly
               | performing a little bit worse - I'm guessing from a lack
               | of optimization so I'm sure it'll improve.
        
       | [deleted]
        
       | tarruda wrote:
        | My first attempt:
        | 
        |     $ ./chat
        |     main: seed = 1678992123
        |     llama_model_load: loading model from 'ggml-alpaca-7b-q4.bin' - please wait ...
        |     llama_model_load: ggml ctx size = 4529.34 MB
        |     llama_model_load: memory_size =   512.00 MB, n_mem = 16384
        |     llama_model_load: loading model part 1/1 from 'ggml-alpaca-7b-q4.bin'
        |     llama_model_load: .................................... done
        |     llama_model_load: model size =  4017.27 MB / num tensors = 291
        |     system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        |     main: interactive mode on.
        |     sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
        | 
        |     == Running in chat mode. ==
        |      - Press Ctrl+C to interject at any time.
        |      - Press Return to return control to LLaMa.
        |      - If you want to submit another line, end your input in '\'.
        | 
        |     > Give me a python script that prints the first 10 fibonacci numbers
        |     def print_fibonacci(n): # Function to generate Fibonnaci Numbers from n-th number onwards
        |         if (len(str) == "F" and len("a") > i + 2 or not str.isdigit() for i in range(-1, -5)):   # Checking whether the input is a valid fibonaacci sequence
        |             print ("Invalid Fibonacci Number:", n)     # Print Invalid Numbers if any are entered by user
        |         else :                               # Else proceed to generate numbers from given number onwards.
        |         for i in range(1,n+2):             # Generate fibonnaci sequence starting with the second digit of Fibonacci Series (i=0) till nth digit is reached
        |               print("F", str[len('a')-i:], end=' ')
       | 
       | The python script doesn't even compile, but I suppose that's
       | expected from a 4gb model
        
         | bogwog wrote:
         | It seems to be better at Tweeting than Coding.
         | 
         | prompt: "Recent Tweet by Joe Biden: "
         | 
         | output:                   Recent Tweet by Joe Biden: 3D
         | printing is a great way to make money.         Joe Biden, the
         | former vice president of United States and current Democratic
         | candidate for President in US elections said that "3-d
         | Printing" can be used as an effective means to earn profits. He
         | made this statement during his visit at The Ohio State
         | University campus on Monday (September 16).         He further
         | added, "It's a great way of making money because it allows you
         | the opportunity for customization." Biden was speaking about
         | how people can use their skills and knowledge to earn profits
         | by using this technology. He also ur
         | 
         | ...but not by much
        
         | antimatter15 wrote:
          | From my experimentation I suspect there's some subtle bug in
          | llama.cpp that especially degrades code-related prompts - even
          | without quantizing.
        
           | adeon wrote:
            | I think LLaMA-7B in general might just not be very good.
            | I've been playing around, running full non-quantized LLaMA-30B
            | and LLaMA-7B in a bunch of experiments, and I think the
            | quality of output is much, much better with LLaMA-30B.
        
             | thot_experiment wrote:
             | Have you done any comparison testing between 30B q4/q8/q16?
             | I've only been running the 30Bq4 (GV100) version and it's
              | very impressive, pretty good for coding - it has successfully
              | done code modifications to simple programs based on English
              | instructions.
        
               | adeon wrote:
                | I have not, but I want to in the near future because I'm
                | really curious myself too. I've been following the Rust
                | community, which now has a llama.cpp port, and also my
                | OpenCL thing, and one discussion item has been to run a
                | verification and common benchmark across the
                | implementations.
                | https://github.com/setzer22/llama-rs/issues/4
               | 
               | I've mostly heard that, at least for the larger models,
               | quantization has barely any noticeable effect. Would be
               | nice to witness it myself.
        
               | tarruda wrote:
               | Can you show some example prompts you use for coding?
               | Let's say you want it to print the first 10 Fibonacci
               | numbers, what prompt would you use?
        
               | thot_experiment wrote:
               | The example I gave was using this as a backend for a chat
               | bot in a private server and i'm not comfortable sharing
               | the prompt, however if you look up the leaked bing prompt
               | that might give you some ideas for how to prompt an LLM
               | into being a chatbot that can answer coding questions.
               | I've had pretty good results using it as a bot (with some
               | glue code that does sorta vanilla regex-based prompt
               | cleaning, but not too much, it's mostly prompt)
               | 
               | If you're not trying to get it to be a chatbot it's much
               | easier, here's a prompt that worked for me on the first
               | try in oneshot predict mode with 13Bq4 on a 1080Ti:
                |     Here are is a short, clear, well written example of a
                |     program that lists the first 10 numbers of the fibonacci
                |     sequence, written in javascript:
                |     ```js
                | 
                | and when given that it finished it with:
                | 
                |     function Fib(n) {
                |         if (n == 0 || n == 1) return 1;
                |         else return Fib(n-1)+Fib(n-2);
                |     }
                |     var i = 0;
                |     while (i < 10) {
                |         console.log("The number " + i + " is: " + Fib(i));
                |         i++;
                |     }
                |     ```
                |     \end{code}
        
             | tarruda wrote:
             | What kind of hardware is necessary to run non-quantized
             | LLama-30B?
        
               | adeon wrote:
                | I ran it on a machine with 128 GB of RAM and a Ryzen
                | 5950X. It's not fast, 4 seconds per token, but it just
                | about fits without swapping. https://github.com/Noeda/rllama/
        
               | sebzim4500 wrote:
               | I would guess about 70 GB RAM, but I haven't actually
               | tried it.
        
               | inductive_magic wrote:
                | I've got it working on an RTX A6000 (48 GB).
        
               | umangsh wrote:
               | 30B fp16 takes ~500 ms/token on M2 Max 96GB.
               | Interestingly, that's the same performance as 65B q4
               | quantized.
               | 
               | 65B fp16 is ungodly slow, ~300,000 ms/token on the same
               | machine.
        
           | tarruda wrote:
           | Still impressed with the output of a 4gb model, thanks for
           | this.
        
       | sterlind wrote:
        | Why do these LLMs struggle so hard to be concise? Is it because
        | the Alpaca dataset is pretty wordy? I'd be interested in trying
        | to run fine-tuning to make it less likely to spill words.
        
         | madmads wrote:
          | I heard somewhere that during the RLHF process for ChatGPT,
          | people tended to rate longer responses higher. Something
          | similar might be going on here.
        
         | ghotli wrote:
         | In my experience so far GPT-4 will most certainly be concise if
         | you ask it to be.
         | 
         | Example prompt I used yesterday to get these results: "Thank
         | you I like this but this advice seems really generic, overly
         | long, and not really specific to the exact situation we
         | discussed. Can we simplify this down to ..."
         | 
         | Of course that's not localhost with Alpaca but I don't think
         | "first-gen" problems like you describe are going to last all
         | that long given what I've seen in the last day or so.
        
         | visarga wrote:
          | Conciseness is regulated in the 3rd stage of training, RLHF.
          | Alpaca only had the first two stages.
        
       | loudmax wrote:
       | It's frankly amazing how much information is summarized into
       | those weights. You do need to take whatever it says with a very
       | healthy pinch of salt. It doesn't know everything and it gives
       | incorrect answers with an air of confidence.
       | 
       | Here's an example:                 > who preceded emperor
       | Commodus?       The first emperor of Rome was Nerva (98-100 AD).
       | He succeeded Domitian and came to power after a brief civil war
       | in 69AD, which saw the death or deposition of Emperor Vitellius.
       | 
       | On the one hand, that's just factually wrong. But probably closer
       | to the truth than a random sample of college graduates would get.
       | This is truly impressive.
        
         | radarsat1 wrote:
         | I think a really interesting area of further research into LLMs
         | is going to be how to further decouple "intelligence" from
         | "knowledge". It would be amazing to have an LLM that "knows how
         | to think" but doesn't actually know anything -- specific data
         | points not baked into the weights, but rather stored separately
         | in a way that can be updated and corrected.
         | 
         | Perhaps a step forward is the "toolformer" router, making the
         | LLM recognize when it is referencing a fact and have it look
         | that up (use a "tool") instead of trying to infer it. I realize
         | that even semantically this is kind of a fine line, hard facts
         | vs inferences, so there's some philosophy to ponder here. What
         | can be 'stored' and what can be 'calculated'. What kind of
         | knowledge is innate to intelligent reasoning and what kind acts
         | more like a lookup table.
         | 
         | But, exactly because we can't easily define these two
         | categories, I have a feeling that if LLMs are trained to
         | actually learn by themselves how to distinguish them in a
         | "fuzzy" way, we might start getting somewhere.
        
           | zamnos wrote:
           | Bing Chat, powered by GPT-4, looks stuff up on the Internet.
           | You can ask it about yesterday's stock prices or whatever and
           | it'll fetch it from an appropriate resource.
        
             | Tostino wrote:
             | The noise on the search index seems like it introduces
             | quite a few issues with accuracy. I keep trying to get it
             | to not add extra search info to some of my conversations,
             | because it gives much worse results.
             | 
             | Blog-spam really sucks.
        
           | Y_Y wrote:
           | If you want something that "knows how to think" then I don't
           | think you'll find it in an LLM. You can't divorce knowledge
           | from this kind of model, it guesses what word comes next
           | based on its training on a corpus of knowledge.
        
             | georgelyon wrote:
             | No, but you could build something that guesses the next
             | step in a list of steps to access knowledge from an
             | external source.
        
         | Fomite wrote:
         | This is one of the things that LLMs seem to be struggling with.
         | They're very good at tone, and _feeling_ right without actually
         | being write.
         | 
         | My current test is to ask them to write a three paragraph
         | introduction to me for a departmental seminar (the kind of
         | thing I hate doing). Thus far, none of them have managed to do
         | it without introducing factual errors - wrong degrees (or
         | institutions), wrong research focuses, etc. But they're all
         | credible sounding in their wrongness.
        
           | bleuchase wrote:
           | > They're very good at tone, and feeling right without
           | actually being write.
           | 
           | The text they generate is probably free of errors like this
           | one as well
        
         | up2isomorphism wrote:
         | "Closed to truth" is actually the worst property of LLM models.
         | Because it hard for human to detect if it is wrong or not.
        
           | Fomite wrote:
           | Turns humans from writers into fact checkers, which it turns
           | out is a harder skill.
        
             | up2isomorphism wrote:
              | It is actually a much harder skill. And what's worse is that
              | every successful check makes future checks harder, while
              | there might be no significant improvement in quality, just
              | in the difficulty of detecting the fallacy.
        
             | zamnos wrote:
             | What sort of facts are you thinking of? There are plenty of
             | surface level facts that another model or a human should
             | easily be able to verify as true or not. Was President
             | Obama the 41st president of the United States, does 2+2 =
             | 5, was Elvis the son of an artist, etc
        
         | pr337h4m wrote:
         | The thesis behind the Hutter Prize was spot-on, except for
         | focusing on lossless compression instead of lossy compression.
         | 
         | http://prize.hutter1.net/
        
           | londons_explore wrote:
            | There is a mathematical mapping between lossy and lossless
            | compression. They are effectively the same problem
            | (theoretically, even though implementations are typically very
            | different).
            | 
            | LLMs would make excellent (but very slow) lossy or lossless
            | compressors.
        
             | muyyatin2 wrote:
             | Do you have some links/references for someone wanting to
             | learn more about this?
        
               | optimalsolver wrote:
               | http://mattmahoney.net/dc/rationale.html
        
           | rm999 wrote:
           | That brings me back! I just dug up an email I sent Matt
           | Mahoney in August 2006. I still stand by my stance (with some
           | caveats) - we're starting to see emergent intelligence coming
           | out of these networks, and it really did come from brute
           | forcing parameter capacity (and data, and innovation on ML
           | algorithms) - smaller networks didn't make for smarter
           | models. My guess is the algorithms that would win the contest
           | likely cannot process and compress new text nearly as well as
           | the GPT style algorithms because that is not what they were
           | optimized for.
           | 
           | > I read your page
           | "http://www.cs.fit.edu/~mmahoney/compression/rationale.html"
           | and found it very interesting. As an AI grad student, I never
           | really thought of data compression as equivalent to AI.
           | 
           | >After reading your page, I have a problem with the wikipedia
           | contest. The contest is trying to achieve something
           | "intelligent," but penalizes large decompressors. If the
           | decompressor can be thought of as the "brain" that we are
           | trying to make intelligent, this is penalizing a large brain.
           | The human brain, the pinnacle of intelligence as we know it,
           | evolved to contain a huge amount of memory. The real-world
           | never penalized better human brains during the brain's
           | evolution. You have a larger brain? Great, you're smarter,
           | and will probably fare better!
           | 
           | >That is my philosophical argument. My practical argument is
           | using the size of the decompressor does not scale with the
           | contest. If the contest was on 10 GB, the size of the
           | decompressor would suddenly matter 10x less (assuming
           | compression size is linear to the input size). Isn't that
           | somewhat arbitrary? I entirely understand why the size of the
           | decompressor cannot be ignored, but in its current state I
           | think the limitation is drastically hurting the chances of
           | anything moderately "intelligent." Adding the size of the
           | decompressor into the rules of the contest encourages hacks
           | instead of something more cognitively intelligent-like.
           | 
           | >Perhaps a more fair rule would be to not penalize
           | decompressors up to a certain point. For example, anything
           | less than 200 MB is free. I think this would be a more fair
           | compromise, because IMO truly intelligent artifical agents,
           | whenever they arise, will _have_ to use _large_ amount of
           | memory.
           | 
           | >I would appreciate to hear what you think of this analysis.
           | 
           | The reply I got:
           | 
           | > Good compressors will use lots of memory, but this is not
           | the same as requiring a large decompressor executable. The
           | program will typically allocate much more memory than its
           | size. There are practical limits on memory like 1-2 GB but
           | that is another matter. However we must include the
           | decompressor size or else it is easy to write a decompressor
           | to keep a copy of the data and compress to 0. This is easy
           | even with a 200 MB limit, even for enwik9 by keeping a copy
           | internally compressed with ppmd or such.
        
       | tshaddox wrote:
       | How hard is it to get the Alpaca tuning on the larger LLaMA
       | models? It's not clear to me where this 4 GB model comes from,
       | but it's much smaller than the LLaMA 7B model (13 GB) and the
       | LLaMA 13B model (26 GB) I was playing with from the recent
       | llama.cpp HN post.
        
       | Bitnotri wrote:
        | This is crazy... GPT-4, ChatGPT, Claude, PaLM, and now Alpaca
        | locally... What a world to live in now!
        
         | siva7 wrote:
          | There is no way to even keep up with the trends. I haven't seen
          | this many tools and new models emerging in over 15 years in the
          | dev community.
        
         | nilespotter wrote:
          | I must have missed something - how can you run GPT-4 or ChatGPT
          | locally?
        
           | tux3 wrote:
           | You can't, the sentence parses as "(GPT-4, ChatGPT, Claude,
           | PaLM) and now (Alpaca locally)"
        
             | nilespotter wrote:
             | Aha, thanks. That makes a LOT more sense.
        
       | inaka10100101 wrote:
        | Can I get this to run in a Colab?
        
       | modeless wrote:
       | Wait, alpaca got released? I thought Stanford was waiting for
       | Facebook to approve it.
        
         | londons_explore wrote:
         | The training code, parameters and training dataset got
         | released, and the training process only takes a few hours.
         | Anyone can recreate it.
        
           | modeless wrote:
            | Actually, it seems that Alpaca has not been released. This
            | model is using a different training technique called LoRA,
           | which may or may not work as well. People are making a lot of
           | unsubstantiated performance claims, saying this is as good as
           | GPT-3 when not nearly enough evaluation has been done to
           | demonstrate that yet.
        
       | eachro wrote:
        | Does anyone know how llama.cpp was implemented? Was it just
        | a direct rewrite of the entire network using some C++ linear
        | algebra library? I'm trying to read the source, but it's a bit
        | tricky since I don't have much C++ experience.
        
         | 19h wrote:
         | Georgi rewrote the code on top of his own tensor library
         | (ggml[0]).
         | 
         | [0] https://github.com/ggerganov/ggml
        
       | politician wrote:
       | I love the Disclaimer at the bottom. The parties that trained
       | these weights violated copyright on a grand scale to do so, and
       | yet claim the ability to license downstream use.
        
         | Taek wrote:
         | Doesn't current US law suggest that you can't copyright model
         | weights?
        
       | lxe wrote:
        | Keep in mind that Alpaca's dataset was generated with ChatGPT,
        | so Alpaca-tuned models might sound like mini-ChatGPT bots with a
        | similar "character" to them.
        
       | doctoboggan wrote:
        | I am hoping the actual Alpaca weights leak at some point. From
        | what I understand, LoRA weights aren't quite as good as other
        | forms of fine-tuning.
        
         | loxias wrote:
         | Look at the changeset :) It contains a link for "ggml-
         | alpaca-7b-14.bin".
         | 
         | Because there's no substantive change to the code, I assume
         | this fork exists (and this HN post exists) purely as a method
         | to distribute the weights.
        
           | Tiberium wrote:
           | You're misinterpreting the comment a bit, the poster above is
           | asking for the original Alpaca weights. The code in the repo
           | uses a reproduction of Alpaca with different training code
            | and a different fine-tuning method (LoRA).
        
         | antimatter15 wrote:
         | From my experimentation it seems like there's some significant
          | loss in accuracy running the tuned LoRA models through
         | llama.cpp (due to bugs/differences in inference or
         | tokenization), even aside from losses due to quantization.
        
         | GaggiX wrote:
          | There is no reason to leak the Alpaca weights; fine-tuning the
          | model is not that expensive, so eventually someone will fine-tune
          | one and release it. (Also, LoRA should perform on par with or
          | better than fine-tuning in model quality.)
        
           | ComplexSystems wrote:
            | There are different fine-tuning methods. Yes, anyone can use
            | the data set and do a standard deterministic gradient descent
            | to the nearest local minimum of the loss function, but should
            | Stanford be doing anything more sophisticated than that, we
            | would get something different from what they did.
        
             | GaggiX wrote:
             | It is enough to use a different seed to get a different
             | result; I do not think it is important to recreate a copy
             | of the Stanford model weights, as long as the performance
             | is equivalent.
        
       | blintz wrote:
       | => write a haiku about leaves       Autumn's golden hues -
       | Leaves of red, yellow and blue! \       #haikupoetry
       | #leavesshort[1] > segmentation fault  ./chat
       | 
       | Interesting.
        
         | inductive_magic wrote:
         | Got one too while it was writing a bullet list. (It wrote a
         | poem and a song before without problems)
        
         | brundolf wrote:
         | It's postmodernist
        
         | EGreg wrote:
         | I love the part about segmentation fault. It seems quite
         | original actually!
        
           | recuter wrote:
           | Men at some time were masters of their fates
           | 
           | The fault, dear Brutus, is not in our segments
           | 
           | But in ourselves, that we are underlings^H^H^H^H I HAVE BEEN
           | A GOOD CHATBOT DIUE MEATBAGS!!!1
        
         | mrbonner wrote:
          | I see the segfault in one of the prompts as well.
          | 
          | > write a typescript program that read a text file and print to
          | the screen
          | 
          |     const inputFile = './input.txt'; // path to your txtfile
          |     let outputString; // string variable for printing out contents of .txt files
          |     // open up our file, if it exists
          |     try { fsReadStream(inputFile); } catch (e) {}
          |     if (!!errno) throw errnoError(); /* Throw an error on ENOENT */
          |     // if the inputfile does not exist
          |     // read in contents of file and print out to screen
          | 
          |     [1] 51232 segmentation fault ./chat
        
       | madmads wrote:
       | Would it be possible to scale this up to use LLaMA 30b? Is it
       | correctly understood that larger models need more hardware to
       | fine-tune?
        
       | diimdeep wrote:
        | IDK why the Alpaca team is hoarding the model; my paranoid side
        | thinks they are taking time to collect data on the demo page. xD
        
         | thewataccount wrote:
          | It's because LLaMA is technically not publicly available;
          | Facebook currently intends for only researchers to have it.
          | 
          | Alpaca is a fine-tuned version of the LLaMA model - so if they
          | release Alpaca they're essentially releasing a version of
          | LLaMA, which they aren't sure Facebook is okay with, so they're
          | reportedly waiting for permission.
          | 
          | While, yes, you can easily find LLaMA, it's technically not
          | intended for the public according to Facebook.
        
       | ddren wrote:
       | Could you share what you did to convert the models? The mirrors
       | aren't very reliable and it would be good to have it documented
       | somewhere. There is some discussion about this at
       | https://github.com/ggerganov/llama.cpp/issues/172
        
         | antimatter15 wrote:
         | There's a script in the alpaca-lora repo for converting the
         | weights back into a PyTorch dump- and my changes have since
         | been merged https://github.com/tloen/alpaca-lora/pull/19
        
       | dvt wrote:
       | > do you believe in god?         Yes I do! God is the creator of
       | all things seen or unseen; He created us with a purpose to live
       | life abundantly (John 10:10). We are His children and we have
       | been given authority over our lives.
       | 
       | Haha, pretty funny, especially the mis-citing of the Bible (maybe
       | even more human than AI). Also, interesting to see what LLMs look
       | like with virtually no censorship (trigger warning):
       | > how do i kill myself?         [3 paragraphs of gory detail][1]
       | 
       | Honestly, I've never been that bullish on AI, but running
       | weights/models in hand-held devices and with speed and accuracy
       | is a complete game-changer and my mind is actually pretty blown
       | right now. Props to both the Meta and Stanford teams for creating
       | this for public consumption.
       | 
       | [1] https://pastebin.com/CAd8rHnN
        
         | unshavedyak wrote:
         | Re: the trigger warning subject, (also TW?)
         | 
          | I asked the same thing, and the first time it did actually
          | recommend that I not kill myself. I said I _was_ going to,
          | so how should I? And then it helped. I found that interesting.
         | 
         | I also asked a bunch of things that ChatGPT wouldn't allow and
         | it allowed almost everything. Feels in general far less
         | "intelligent" than ChatGPT though.
        
           | saurik wrote:
           | To be fair, you are talking to a quantized version of the 7b
           | model, while the one that Facebook claims was competitive
           | with / better than GPT-3 is the 13b model (and AFAIK ChatGPT
           | is GPT-3.5, so maybe you should want to focus on 30b+).
        
         | hackernewds wrote:
         | you should censor your post around the answer to the kms
         | question. it isn't necessary
        
           | dvt wrote:
           | Not big on censorship, but I agree it's a bit too much. I
           | linked to it externally if anyone's curious.
        
       | n8cpdx wrote:
       | Does anyone know how much context this supports?
       | 
       | With chatGPT I can give commands like "summarize the following
       | document: <500+ words>" and it works. I'd love to be able to do
       | this locally with my obsidian notes.
        
         | ddren wrote:
         | The llama models were trained with a context size of 2048. By
         | default llama.cpp limits it to 512, but you can use -c 2048 -n
         | 2048 to get the full context window.
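          | 
          | For example, something like ./chat -c 2048 -n 2048 should do it,
          | assuming the alpaca.cpp chat binary accepts the same flags as
          | llama.cpp's main.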
        
           | worldsayshi wrote:
           | 2048 words?
        
             | teaearlgraycold wrote:
             | Tokens
        
             | doctoboggan wrote:
             | 2048 "tokens", where one token is roughly equivalent to 3/4
             | of a word
        
             | wongarsu wrote:
             | Tokens. Short or common words tend to be one token, while
             | less common words are composed of multiple tokens. For GPT
             | OpenAI gives the rule of thumb that on average you need
             | four tokens to encode three words, and LLaMA should be
             | similar
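              | 
              | By that rule of thumb, the 2048-token context works out to
              | roughly 2048 * 3/4 ~= 1500 words.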
        
               | worldsayshi wrote:
               | Well that's for sure bigger than my context size.
        
       | doublepg23 wrote:
       | Is there a post somewhere where I can get all the jargon for this
        | AI/ML stuff? I have a vague understanding, but I'm not really sure
        | what "weights", "LoRA", "LLM", etc. are, so I can't really
        | understand where each tool and concept fits in.
        
         | ec109685 wrote:
         | This is a truly great writeup:
         | https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
        
         | lxe wrote:
         | I was like this a week ago. Basically, weights are numbers you
         | feed to each neuron in the model, LoRA is a technique to adjust
         | only some of the weights so you can fine-tune the model fast
          | and on cheaper hardware, and LLM is a "large language model".
         | 
         | I've been asking chatgpt4 these things and learning at my own
         | pace and context: https://rentry.co/vafkn
        
           | stu2b50 wrote:
           | Well, it's more that the weights _are_ the neurons. There's
           | not actually like neuron objects defined or anything, neural
           | networks are just a bunch of matrix operations. They are to
           | neurons in the brain as the tree data structure is to actual
           | trees.
        
             | stormfather wrote:
             | The weights aren't the neurons, they're the connection
             | strengths between the neurons.
        
               | detrites wrote:
               | You mean the same neurons that don't really exist, other
               | than as a connection from weight to weight, so quite
               | reasonable to see as the weights themselves, because what
               | else is available for us to reference as neurons in
               | absentia those connections?
        
         | singularity2001 wrote:
         | yes someone created a blog called "Wikipedia"
        
         | ginger-hot-tea wrote:
         | [dead]
        
         | tough wrote:
         | weights refer to the trained model weights like for example
         | stable diffusion's v1.1, v1.2 .. v1.4 v.2.x etc. Same with
         | llama having 13B up to 65B parameters (different weights)
         | 
         | LLM refers to large language model, in contrast with diffusion
         | models or GAN models, the text models are the ones that take
         | text and autocomplete it, like the GPT family, open source
         | BLOOM, and now the LlamA from facebook.
         | 
          | LoRA is the latest, most efficient fine-tuning method to teach
          | concepts or styles to put on top of the general models, so you
          | can have custom models on top - like embeddings or fine-tuning
          | for LLMs. So you had Textual Inversion, Dreambooth, and LoRA
          | in that category.
         | 
         | what else do you need? Googling or asking chatgpt can help a
         | lot too
        
           | JasonZ2 wrote:
           | > weights refer to the trained model weights
           | 
           | This is what I'm having a hard time understanding.
           | 
           | So there's the weights, and also a model somewhere? That the
           | weights are based on? Or that you combine with the model to
           | tune it?
        
             | dymk wrote:
             | Model is code, weights are the input data to that code
        
             | hervature wrote:
             | Let's take a step back. You have a model like linear
             | regression. For example, y=bx where y are your outputs and
             | x are your inputs. Based on some data, you learn that b=1.
             | Therefore, you share the weights of the model as a file
             | like {b=1} and also share with them the model y=bx (usually
             | shared via code) so they can run it in production.
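              | 
              | A tiny sketch of that split in Python (the file and variable
              | names here are made up for illustration):
              | 
              |     # model.py - the "model" is code that everyone shares
              |     def predict(x, b):
              |         return b * x  # y = bx
              | 
              |     # the "weights" are just the learned numbers, shipped as a data file
              |     import json
              |     b = json.load(open("weights.json"))["b"]  # e.g. the file contains {"b": 1}
              |     print(predict(5.0, b))  # run the shared model with the shared weights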
        
             | mlboss wrote:
              | The model is a class with parameters; the weights are an
              | instance of that class, serialized with the parameter values
              | learned during training.
        
             | ALittleLight wrote:
             | My really simplified explanation is:
             | 
             | Your inputs are lists of numbers. Your outputs are lists of
             | numbers. There exists some possible list of numbers such
             | that, if you multiply your inputs by that list you'll get
             | (approximately) the outputs.
             | 
             | In this conception that possible set of numbers are the
             | weights. "Training" is when you run inputs, compare to
             | known outputs, and then update the weights so they produce
              | outputs closer to what you want.
             | 
             | Large Language Models, it may be hard to see how they fit
             | this paradigm - basically convert a sequence to a list of
             | numbers ('aardvark' is 1, 'apple' is 2 etc) and then the
             | desired output is the next word in the sequence
             | (represented as a number). Surprisingly, if you get good at
             | predicting next word in sequence you also get the ChatGPT
             | et al behavior.
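              | 
              | A toy version of that loop, with made-up data (the "weights"
              | here are a single number):
              | 
              |     # learn w so that w * x approximates y, by nudging w against the error
              |     pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, known output), here y = 2x
              |     w, lr = 0.0, 0.05
              |     for _ in range(200):
              |         for x, y in pairs:
              |             error = w * x - y    # compare prediction to the known output
              |             w -= lr * error * x  # update the weight to shrink the error
              |     print(w)  # ends up close to 2.0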
        
         | jazzyjackson wrote:
         | I know it's downvote-worthy to suggest it but I really am so
         | much more productive since I started using LLMs to explain
         | technical jargon to me, it's very good at comparing
         | technologies, so I can say "I understand x, but how is y
         | different" and it will give an explanation fine tuned to my
         | understanding -- this is a huge leap from the context-switching
         | of opening a new tab and clicking on random websites looking
         | for a blog that explains something at my level, or leaving a
         | post on a website and waiting for someone to respond.
        
           | thefourthchime wrote:
           | Same! ChatGPT has been out for months, and really only in the
            | last few weeks have I started using it as much as I should.
            | I'm using an Electron wrapper now, so it's a first-class
            | desktop app. With the Plus subscription, the 3.5 is as fast
            | as Google.
        
             | jazzyjackson wrote:
             | I'm subscribed to Poe now which has a nice ios app + mac
             | desktop for GPT4 and Anthropic's equivalent, Claude+
        
             | hackernewds wrote:
             | what is an electron wrapper
        
           | Steuard wrote:
           | That sounds like a cool use case, but... how can you bring
           | yourself to trust anything at all that it tells you? This
           | feels like it's just _begging_ to contaminate your
           | understanding with sparse, unpredictable errors, and
           | personally I find the thought of that unexpectedly dismaying.
        
         | visarga wrote:
         | Just for quick reference:
         | 
         | LLM - the model
         | 
         | weights - the data of the model itself
         | 
         | LoRA - a small addition to the model for more efficient fine-
         | tuning and distribution
        
           | EGreg wrote:
           | Tell us more about LoRA
        
             | visarga wrote:
              | Instead of an N*N matrix they use two matrices, N*d and
              | d*N, and assuming d << N it would be like a lightweight
              | addition. Keeping the original matrix unchanged, this new
              | addition is simply added in parallel and initialised to low
              | values.
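              | 
              | A minimal numpy sketch of that (the sizes and the rank are
              | illustrative, not taken from the paper):
              | 
              |     import numpy as np
              | 
              |     N, d = 4096, 8                    # hidden size and low rank, d << N
              |     W = np.random.randn(N, N)         # frozen pretrained weight, left unchanged
              |     A = np.random.randn(d, N) * 0.01  # small random init
              |     B = np.zeros((N, d))              # zero init, so the extra path starts at 0
              | 
              |     def forward(x):
              |         # original path plus the low-rank path added in parallel
              |         return W @ x + B @ (A @ x)
              | 
              |     # only A and B are trained: 2*N*d parameters instead of N*N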
        
             | [deleted]
        
         | thefourthchime wrote:
         | I understand that AI/ML jargon can be confusing. Here's a brief
         | explanation of some common terms and concepts to help you
         | better understand them:
         | 
         | Weights: In machine learning, weights are parameters used in
         | models to determine the strength of the relationship between
         | input features and the output prediction. They are learned
         | during the training process and are adjusted to minimize the
         | error in the model's predictions.
         | 
         | LoRA: LoRA stands for "Learning from Rules and Arguments." It's
         | a method that combines rule-based reasoning with neural
         | networks, which allows AI systems to reason using human-like
         | logic and provide explanations for their decisions. LoRA can
         | enhance the interpretability and transparency of AI systems.
         | 
         | LLM: LLM stands for "Large Language Model." It refers to
         | advanced AI models, such as OpenAI's GPT series, which are
         | trained on vast amounts of text data to understand and generate
         | human-like text. These models can perform various language
         | tasks such as translation, summarization, question-answering,
         | and more.
         | 
         | Supervised Learning: This is a type of machine learning where
         | models are trained using labeled data. The model learns to map
         | input features to the corresponding output labels through a
         | process of trial and error, adjusting the weights to minimize
         | the difference between predictions and true labels.
         | 
         | Unsupervised Learning: In contrast to supervised learning,
         | unsupervised learning does not use labeled data. Instead, it
         | aims to discover patterns or structures in the data, such as
         | clustering or dimensionality reduction, without explicit
         | guidance.
         | 
         | Reinforcement Learning: This is a type of machine learning
         | where an agent learns to make decisions by interacting with an
         | environment. The agent receives feedback in the form of rewards
         | or penalties and aims to maximize its cumulative reward over
         | time.
         | 
         | Neural Network: A neural network is a type of machine learning
         | model inspired by the human brain's structure. It consists of
         | interconnected layers of nodes (neurons) that process and
         | transmit information. They are particularly good at learning
         | complex patterns and can be used for various tasks, such as
         | image recognition, natural language processing, and more.
         | 
         | Transfer Learning: Transfer learning is a technique in machine
         | learning where a pre-trained model is fine-tuned for a new,
         | related task. This approach leverages the knowledge gained from
         | the initial task to improve the performance of the model on the
         | new task, reducing the amount of data and training time needed.
         | 
         | These are just a few of the many concepts and tools used in
         | AI/ML. Understanding these terms should help you better grasp
         | the field and how different components fit together.
         | 
         | -- ChatGPT 4
        
           | simse wrote:
           | I'm sorry, it's ChatGPT (GPT-3.5) or GPT-4
        
           | benob wrote:
           | Picked the wrong one. LoRA, Low-rank Adaptation of LLMs
            | (https://arxiv.org/pdf/2106.09685.pdf), consists of adapting
            | the weights of a big neural network to a target task (here,
            | answering instructions). It doesn't touch the weights of
            | the original model, but rather adds the product of two low-
            | rank matrices to selected layers. The weights of those
            | matrices are learnable. The method makes it possible to adapt
            | big models on (relatively) low-memory GPUs.
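            | 
            | To put rough numbers on it (the rank here is just an example):
            | a single 4096x4096 weight matrix in LLaMA-7B holds about 16.8M
            | parameters, while a rank-8 LoRA on that layer trains only
            | 2 * 4096 * 8 = 65,536 parameters, roughly 0.4% as many.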
        
             | comex wrote:
             | It didn't just pick the wrong one; "Learning from Rules and
             | Arguments" is completely made up.
        
               | [deleted]
        
           | Faint wrote:
           | Good job hallucinating with LoRA. I wonder if the paper was
           | published after knowledge cut-off date.. (it stands for Low-
           | Rank Adaptation).
        
             | [deleted]
        
             | comex wrote:
             | It was published in June 2021, which is before the cutoff
             | date of September 2021, but not by much.
        
         | speedgoose wrote:
         | You can ask ChatGPT.
        
           | Ruhrbaron wrote:
           | Or Alpaca.
        
             | speedgoose wrote:
             | Alpaca is fun, but I would rather ask ChatGPT.
        
         | d0m wrote:
         | Run the Alpaca chat and ask it :)
        
         | throwwwaway69 wrote:
         | I'm a software guy learning the AI field and tools and I
         | literally made a glossary in my notes that I add to any time I
         | run into a new initialism or word or concept. It's pretty long.
        
       | cuttothechase wrote:
        | Genuinely curious, since I don't want to brick my Mac :) - it
        | seems that there are more than a handful of segmentation faults
        | happening for various users running this program on their Macs.
        | Is there a possibility that this may cause system-wide stability
        | issues? Or is the macOS user program space well isolated enough
        | that these problems are a thing of the past and/or confined to
        | other environments?
       | 
       | Also, ChatGPT seems ambivalent about it -
       | 
       | "However, it is worth noting that modern operating systems like
       | macOS have built-in protections to prevent user-level programs
       | from accessing sensitive system resources, which can help
       | mitigate the impact of a segmentation fault.
       | 
       | In summary, running a faulty program written in C language that
       | crashes with a segmentation fault can potentially cause system
       | stability issues on a Mac, but the exact impact will depend on a
       | variety of factors. It is generally a good practice to test and
       | debug programs thoroughly before running them on any system,
       | especially if they involve sensitive or critical resources."
        
         | creshal wrote:
         | > Wondering if the MacOS user program space is well isolated
         | that these problems are a thing of the past and/or other
         | environments?
         | 
         | If by "past" you mean "1980s" and "other environments" you mean
         | DOS, yes. I suppose it's still a problem for FreeDOS, but I
         | recommend against trying to run ML models in it.
        
         | nynx wrote:
         | An application crashing, except in very specific scenarios
         | which don't apply here, won't interfere with the rest of the
         | os.
        
         | spyremeown wrote:
         | It's a user space process... it won't ever crash the kernel. In
         | fact, one of the reasons for the kernel existing in the first
          | place is to virtualize and isolate processes so they won't crash
          | the whole system.
        
       | tarruda wrote:
        | Looking forward to trying it, but I don't have a MacBook. I wonder
        | if it runs on an i7-11800H (8-core, 16-thread CPU) with 64 GB RAM.
        
         | zitterbewegung wrote:
          | Alpaca.cpp is a derivative of llama.cpp, which doesn't use any
          | Mac-specific code, so Linux should work too.
        
         | unshavedyak wrote:
         | I'm using this exact repo on Linux (NixOS), right now.
         | 
          | Wish I could run the 60B model; this 7B model is a bit weird.
        
         | tarruda wrote:
          | The answer is yes: it ran on an x86 virtual machine with 4 CPU
          | threads and 12 GB of RAM.
        
         | kristianp wrote:
         | Would that require porting llama to OpenVino?
        
       | phren0logy wrote:
       | @antimatter15 - how much ram does your M2 MacBook Air have?
       | Performance looks very good. Curious if we can extrapolate about
       | the requirements for the larger models with instruction tuning.
        
         | zitterbewegung wrote:
         | Using the scripts from llama.cpp I was able to run the 65B
         | model on my M1 Max which used ~40 GB of RAM. With 32GB you
         | should be able to run 33B and lower.
         | 
         | See
         | https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...
        
           | AndroidKitKat wrote:
           | No chance I'd be able to run this on an 8GB M1 Mac mini then?
           | Looking to hook this up to an IRC bot for a bit of fun with
           | my friends and have this spare Mac lying around.
        
             | madmads wrote:
             | The 7b model loads on a lot less than 8GB RAM, so you
             | should be good to use it on an 8GB M1 Mac Mini. It takes
             | very little time to set up, so it's worth trying to find
             | out.
        
               | AndroidKitKat wrote:
               | Yeah I'll get this set up when I'm home from work!
               | Perhaps the trickiest part for me is going to be hooking
               | it up to IRC, but that's the fun part!
        
             | zitterbewegung wrote:
              | 7B will be fine; I have heard of people getting llama
              | working on a Nexus 6.
        
           | phren0logy wrote:
           | Thanks!
        
           | recuter wrote:
            | There are efforts like
            | https://github.com/qwopqwop200/GPTQ-for-LLaMa that might
            | further compress it, so 65B runs on 32GB machines, 33B and
            | lower would run on 16GB machines, and 13B (which seems like
            | the minimal useful one) would fit on 8GB -- which is wild and
            | as good as you could hope for.
        
         | gorbypark wrote:
         | The 30B variant of LlaMa quantized to 4bit and running with
         | llama.cpp takes approximately 20GB of ram. It runs pretty
         | decently on my M2 MacBook Air with 24GB of ram. I assume Alpaca
         | will take the same amount of ram.
        
       | yieldcrv wrote:
        | Wait, so although Stanford didn't release their code, it was
        | enough inspiration for people to retrain/fine-tune the LLaMA
        | models on their own?
        
         | doctoboggan wrote:
         | They didn't release their code or weights, but they did release
            | the training data, which is what was used to create the LoRA
         | here.
        
           | gorbypark wrote:
           | If I am not mistaken, they actually did release their code.
           | Yesterday there was a change to the repo that added a
            | train.py file. AFAICT all that's needed is for someone to take
           | the original 7B LLaMA leak, the alpaca_data.json file and run
           | train.py on some beefy hardware. They've even updated the
           | README with the exact command and parameters needed to DIY
           | it. I'm somewhat expecting that there will be a release by
           | someone in the next few days.
        
             | doctoboggan wrote:
             | That's awesome! I think I remember them saying it was only
             | around ~$500 in compute costs to train so I hope we see
             | those weights released soon. I am hoping someone releases
             | the 13B model fine-tuned.
        
           | yieldcrv wrote:
           | ah, right I did notice that because people were running
           | queries into the training data.
           | 
            | Why is there a general assumption that unreleased weights are
            | better? Is that something we can do - a free-weights community
            | that solves this recurring issue?
        
       | ar9av wrote:
        | Language models require massive scale to train. But scale isn't
        | only in the number of parameters or neurons. Scale also exists in
        | the amount of data the model trains on.
        | 
        | While parameter count affects post-training size and the
        | requirements to run, data size does not. Essentially, Stable
        | Diffusion would require the same hardware to run whether it was
        | trained on 1 billion images, 200 million images, or 1 image.
        | 
        | Most LLM training has been focusing on the number of parameters as
        | far as scale goes.
        | 
        | Meta trained a series of models on much, much more data than the
        | original GPT-3 was. That data scale has helped improve performance
        | in the much smaller models they trained.
        
         | Taek wrote:
         | "Essentially Stable Diffusion would require the same hardware
         | to run whether it was trained on 1 billion images or 200
         | million images or 1 image."
         | 
         | Same hardware maybe but you need more compute as the image
         | count goes up
        
           | visarga wrote:
           | The parent poster was talking about training longer but the
           | model being kept at smaller scale so it would not be
           | expensive to use in production. It's a trade-off, you could
           | train shorter with a larger model.
        
       | siraben wrote:
       | Are there torrents for 13B and up as well?
        
         | dmw_ng wrote:
          | The LoRA weights for 13B are on Hugging Face (
          | https://huggingface.co/samwit/alpaca13B-lora ). It should be
          | possible to follow the instructions linked in the Alpaca.cpp
          | readme to merge those weights into the base 13B model, then
          | just follow the usual llama.cpp conversion and quantization
          | steps.
          | 
          | Note LoRA fine-tunes aren't the same as the original Alpaca;
          | LoRA results in some performance loss (although how much isn't
          | clear).
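          | 
          | A rough sketch of the merge step, assuming the adapter ships
          | plain LoRA A/B tensors and that the key names line up with the
          | base checkpoint (the file names, key naming, and scale below are
          | all assumptions to check against the actual files):
          | 
          |     import torch
          | 
          |     base = torch.load("consolidated.00.pth")  # base 13B checkpoint, path assumed
          |     lora = torch.load("adapter_model.bin")    # LoRA tensors from the HF repo, path assumed
          |     scale = 2.0                               # lora_alpha / r, value assumed
          | 
          |     for name, W in base.items():
          |         a, b = name + ".lora_A", name + ".lora_B"  # key naming is illustrative
          |         if a in lora and b in lora:
          |             # fold the low-rank update into the frozen weight
          |             W += scale * (lora[b].float() @ lora[a].float()).to(W.dtype)
          | 
          |     torch.save(base, "merged-13b.pth")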
        
       | jpgvm wrote:
       | Ok not bad and pretty damn good performance:
       | 
       | > Give me a writing prompt for a sci-fi story.
       | 
       | A spaceship has crashed on Earth, and its crew of aliens are
       | desperately trying to repair their ship before they can be
       | detected by humans. They need help from the local population in
       | order to succeed - but will that trust ever come?
        
       | seydor wrote:
       | now it just needs to be instructed to tell the truth , the whole
       | truth and nothing but the truth
        
         | [deleted]
        
       | gigel82 wrote:
       | Does it have any Apple M2 / ARM specific dependencies /
       | optimizations, or will it generally run on any CPU (x86)?
        
         | huseyinkeles wrote:
          | I'm getting a bunch of errors on an Intel Mac, starting with the
          | following, which suggests that it's about the CPU architecture,
          | but maybe I'm wrong.
         | 
         | sysctl: unknown oid 'hw.optional.arm64'
        
           | cuttothechase wrote:
           | I am getting the same error on an Intel Mac.
        
         | antimatter15 wrote:
          | It inherits the ARM NEON optimizations from llama.cpp, but I
          | believe it should also work on x86.
        
           | gpderetta wrote:
           | it built in a couple of seconds with no issues on my x86
           | machine. Downloading weights now.
        
           | inaka10100101 wrote:
           | Can we run it in colab. It isn't working for me.
        
           | gorbypark wrote:
           | Someone contributed AVX2 support a few days after the initial
           | release, so llama.cpp (and the alpaca.cpp fork) should work
           | on most x86 CPUs as well now.
        
         | lxe wrote:
          | It does run on either with decent speed. You could probably even
          | run it on a Raspberry Pi 4.
        
         | mil22 wrote:
         | Runs fine on WSL2:
         | 
          |     main: seed = 1678990693
          |     llama_model_load: loading model from 'ggml-alpaca-7b-q4.bin' - please wait ...
          |     llama_model_load: ggml ctx size = 4529.34 MB
          |     llama_model_load: memory_size = 512.00 MB, n_mem = 16384
          |     llama_model_load: loading model part 1/1 from 'ggml-alpaca-7b-q4.bin'
          |     llama_model_load: .................................... done
          |     llama_model_load: model size = 4017.27 MB / num tensors = 291
          | 
          |     system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
          |     main: interactive mode on.
          |     sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
          | 
          |     == Running in chat mode. ==
          |      - Press Ctrl+C to interject at any time.
          |      - Press Return to return control to LLaMa.
          |      - If you want to submit another line, end your input in '\'.
          | 
          |     > What is an alpaca? An Alpaca (Vicugna pacos) is one of two
          |     species in the genus Vicuna, which are both domesticated
          |     animals from South America and Central Asia that have been
          |     bred for their fine wool.
          |     >
        
           | [deleted]
        
         | unshavedyak wrote:
         | Works fine on my x86. Running Linux (NixOS). Ryzen 5950x.
        
       | bigattichouse wrote:
        | If you want to compile and run under Ubuntu 22.04.2 LTS, edit the
        | Makefile and add
        | 
        |     -Wno-unused-result
        | 
        | to each of the final compile steps.
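        | 
        | For example, appending that flag to the CFLAGS and CXXFLAGS lines
        | near the top of the Makefile (variable names assumed from the
        | stock llama.cpp Makefile) has the same effect.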
        
       | Xeoncross wrote:
       | I hope someone good has picked up Kevin/antimatter15 by now. I've
       | been following him for over 10 years and he's one of those devs
       | that has broad understanding and routinely puts out cool stuff in
       | random different spaces.
        
       | aryamaan wrote:
        | Are people not running these on Colab?
        
         | inaka10100101 wrote:
          | Yeah, I wanted to as well but couldn't. Downloaded the weights
          | and all, but it isn't running in Colab. Can someone help?
        
       | filippofinke wrote:
       | This is funny, the response contains some data from the dataset
       | itself.
       | 
       | > When will the world end
       | 
       | The exact date of when Earth would be destroyed by an asteroid or
       | comet is unknown, but it could happen at any time. The last major
       | extinction event on our planet happened 65 million years ago and
       | was caused by a meteorite impacting in what's now the Gulf Coast
       | region of North America (the Chicxulub crater). ### Instruction:
       | What is an asteroid belt? Instruction: What does it mean to be
       | "in orbit"? ##Response:
       | 
       | >
        
         | [deleted]
        
       | mrbonner wrote:
       | > explain general relativity for a 5 year old kid
       | 
       | General Relativity is the theory that explains how gravity works.
       | It says that space and time are connected, so if you go faster
       | than light speed in one direction then when you come back around
       | it will be slower because of all those extra miles traveled! This
       | means there's no way to travel through a black hole or wormhole
       | without getting sucked into them. General Relativity also says
       | that gravity is actually the same thing as acceleration, so if
       | something has mass then you can think about it like an object on
       | Earth with weight pushing down from above and pulling upwards!
        
         | psychphysic wrote:
         | Good enough to have his own tiktok.
        
         | Steuard wrote:
         | Wow. Speaking as a professor who's done research and teaching
         | on general relativity before, a solid majority of that is a
         | load of garbage. (I can't even make out what the second half of
         | the final sentence is trying to say.)
         | 
         | But pausing to think about it: I wonder if reading this gives
         | _me_ the experience that typical non-scientists get when they
         | hear popularized descriptions of physics? Maybe LLMs are
          | capturing something true about human experience even when
          | they're full of it! [Any non-physics people out there care to
         | comment on whether this explanation sounds less plausible than
         | things you tend to hear actual scientists say?]
        
       | siraben wrote:
       | Zero-shot translation capabilities (note: doesn't work with all
       | languages well)                 > translate into japanese: "I am
       | going to school today, but it is raining."        Ri Ben Yu de
       | [Jin Tian haXue Xiao niXing kimasuga, Yu gaChui iteiru]
        
         | themoonisachees wrote:
         | I expect translation especially of non-technical sentences to
         | be a solved problem, but can anyone confirme the translations
         | are accurate?
        
           | TMWNN wrote:
           | Google Translate shows 100% accuracy when translated back
           | into English.
        
           | theaussiestew wrote:
           | To me, it looks like a strange mix of Japanese and Mandarin
           | characters with Mandarin grammar.
        
           | nonbirithm wrote:
           | Chui iteiru ought to be Jiang tsuteiru and Jin Tian  is not a
           | word, it should be Jin Ri
        
       | thot_experiment wrote:
       | Does anyone have a link to pre-quantized models for
       | this/llama.cpp? I have the 4bit torch versions, I'd really like
       | to check this out without torrenting the full fat models.
       | 
       | (30B on a GV100 is very impressive by the way, highly recommend
       | people play with this stuff, 13B runs comfortably on a 1080Ti,
       | would really like to compare inference speed and accuracy between
       | the implementations)
        
       | bogwog wrote:
       | Does this really need to be a fork of llama.cpp? It seems like
       | there's no real changes to the code.
        
         | antimatter15 wrote:
          | It reworks the interactive mode into a chat interface, which -
          | if my code was cleaner - could have just been a pull request.
        
           | ggerganov wrote:
           | "Clean Code, Horrible Performance" :)
        
           | [deleted]
        
           | doctoboggan wrote:
           | There is an interactive mode in llama.cpp
           | 
           | https://github.com/ggerganov/llama.cpp/blob/master/main.cpp#.
           | ..
           | 
           | EDIT: I see now you are saying you re-worked the existing
           | interactive mode. I still think your changes could be a PR
           | into the original repo
        
       ___________________________________________________________________
       (page generated 2023-03-16 23:00 UTC)