[HN Gopher] Mistral-8x7B-Chat
       ___________________________________________________________________
        
       Mistral-8x7B-Chat
        
       Author : birriel
       Score  : 76 points
       Date   : 2023-12-10 20:13 UTC (2 hours ago)
        
 (HTM) web link (huggingface.co)
 (TXT) w3m dump (huggingface.co)
        
       | antirez wrote:
        | llama.cpp support needed ASAP, not just quantized, but I hope it
        | will be able to run in a way that keeps the same two experts for
        | a minimum of N tokens generated and re-evaluates which experts
        | to select only then, so that it will be possible to do the
        | inference in low RAM by doing something similar to "paging".
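        | 
        | A rough sketch of the scheduling I mean (made-up names, not real
        | llama.cpp code):
        | 
        |     # Hypothetical "expert paging": pin the router's top-2
        |     # experts for N tokens and re-evaluate only at boundaries,
        |     # so only the pinned experts' weights must live in RAM.
        |     import random
        | 
        |     N_PIN, NUM_EXPERTS = 16, 8
        | 
        |     def router_top2(step):
        |         # stand-in for the real gating network
        |         return tuple(random.sample(range(NUM_EXPERTS), 2))
        | 
        |     def generate(num_tokens):
        |         active = None  # experts currently paged into RAM
        |         for step in range(num_tokens):
        |             if active is None or step % N_PIN == 0:
        |                 wanted = router_top2(step)
        |                 if wanted != active:
        |                     active = wanted  # page out old, mmap new
        |             yield step, active  # run only the pinned experts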
        
         | kristianp wrote:
         | There is an issue for it:
         | 
         | https://github.com/ggerganov/llama.cpp/issues/4381
        
           | rgbrgb wrote:
           | what a sick project to be able to attract a billionaire
            | programmer [0] _and_ C royalty.
           | 
           | [0]: https://github.com/ggerganov/llama.cpp/issues/4216#issue
           | comm...
        
             | jaipilot747 wrote:
             | For those out of the loop, who are the billionaire
             | programmer and C royalty people in this link?
        
               | antirez wrote:
               | Tobi is the founder of Shopify.
        
       | a_wild_dandan wrote:
       | PSA: This is broken for Apple silicon. It relies on the CUDA-only
       | `flash_attn` package. Alas, I wanted to try Mistral's MoE on my
       | MacBook. :( Understandable though! It's a new model with an
       | incredibly bizarre release. My phalanges are crossed.
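        | 
        | A possible workaround sketch, untested on my end; whether it
        | helps depends on whether the repo's code hard-imports
        | flash_attn, and the repo id below is a placeholder:
        | 
        |     # Ask transformers for plain PyTorch attention so the
        |     # CUDA-only flash_attn package is never touched.
        |     import torch
        |     from transformers import AutoModelForCausalLM
        | 
        |     model = AutoModelForCausalLM.from_pretrained(
        |         "org/mistral-8x7b-chat",  # placeholder repo id
        |         torch_dtype=torch.float16,
        |         attn_implementation="eager",  # no flash_attn
        |     )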
        
       | lannisterstark wrote:
       | Man this LLM stuff gets released faster than I can keep up.
       | 
       | Is there a centralized list somewhere that tests "use this for x
       | purpose, use that for y?"
        
         | brucethemoose2 wrote:
         | Even the OSS trainers/finetuners themselves don't really know
         | the use case for their models. They just hot drop them and move
         | on, lol.
        
         | lxe wrote:
          | This is a finetune of the torrent-dropped MoE model from a few
         | days ago. Not a "release" per se, but rather a display of
         | community excitement over this.
        
         | behnamoh wrote:
         | > Is there a centralized list somewhere that tests "use this
         | for x purpose, use that for y?"
         | 
         | Yeah, "don't use these models for production, use OpenAI for
         | production, ignore Claude/Gemini/etc.".
        
       | simonw wrote:
       | I thought the new MoE Mistral model was called Mixtral.
        
       | lab wrote:
       | We just made it available on OpenRouter:
       | 
       | https://openrouter.ai/models/fireworks/mixtral-8x7b-fw-chat
       | 
       | Chat playground:
       | 
       | https://openrouter.ai/playground?models=fireworks/mixtral-8x...
        
       | Jimbabwe wrote:
       | There's probably a better place to ask this highly specific
       | technical question, but I'm avoiding Reddit these days so just
       | throwing it out I guess. I've been trying to run these in a
       | container but it's verrrry slow, I believe, because of the lack
       | of gpu help. All the instructions I find are for nvidia gpus and
       | my server is a qnap tvs-473e with an embedded amd cpu/gpu (I
       | know, I know). The only good news is that I've upgraded the ram
       | to 32gb, and I have a 1TB ssd. Any idea of how I can get my own
       | self-hosted LLM/chat service on this funky hardware? The
       | nvidia/docker option requires installing the nvidia runtime
       | alongside docker, but I can't find an amd equivalent. Thanks.
       | Sorry for the wall of text nobody cares about.
        
         | brucethemoose2 wrote:
         | > qnap tvs-473e
         | 
         | Specs say this runs an AMD RX-421BD. This is a 2015 AMD CPU
         | with 2 bulldozer cores and a tiny IGP.
         | 
         | ...To be blunt, you would be much better off running LLMs on
         | your phone. Even an older phone. Or literally whatever device
          | you are reading HN on. But if you _insist_, the runtime you
          | want is MLC-LLM's Vulkan runtime.
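          | 
          | From memory, usage is roughly this (check the MLC docs; I may
          | be misremembering the exact names):
          | 
          |     # rough sketch; the model must be a prebuilt MLC quant
          |     from mlc_chat import ChatModule
          | 
          |     cm = ChatModule(model="Mistral-7B-Instruct-v0.1-q4f16_1",
          |                     device="vulkan")
          |     print(cm.generate(prompt="Hello"))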
        
           | refulgentis wrote:
           | This. Sibling llama.cpp comment is standard "I know
           | llama.cpp, I assume that's 80% of the universe instead of
           | .8%, and I assume that's all anyone needs. So I know just
           | enough to be dangerous with ppl looking for advice".
           | 
           | You'll see it over and over again when you're looking for
            | help, be careful, it's 100% a blind alley in your case. It's
            | very likely you'll be disappointed by MLC as well, but
            | simultaneously it's your only real option. You definitely
            | won't hit 1 tkn/sec, and honestly, I'd bet 0.1 tkn/sec.
        
           | Jimbabwe wrote:
           | Thanks, I'll look into it! Especially if the llama.cpp route
           | is a dud, like the other response says it will be. My little
           | qnap clunker handles all the self hosting stuff I throw at
            | it, but I won't be surprised if it simply has met its match.
        
         | curious_cat_163 wrote:
         | You'll want to try llama.cpp [1]. The set of models that it can
         | support is expanding [2]. Folks have also written services [3]
         | that wrap around it.
         | 
         | [1] https://github.com/ggerganov/llama.cpp [2]
         | https://huggingface.co/TheBloke [3]
         | https://github.com/abetlen/llama-cpp-python
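          | 
          | With [3] it's only a few lines to run a GGUF quant from [2] on
          | CPU; a minimal sketch (the model path is whatever file you
          | downloaded):
          | 
          |     # pip install llama-cpp-python
          |     from llama_cpp import Llama
          | 
          |     llm = Llama(model_path="./mistral-7b.Q4_K_M.gguf",
          |                 n_ctx=2048)
          |     out = llm("Q: What is a NAS? A:", max_tokens=64)
          |     print(out["choices"][0]["text"])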
        
           | Jimbabwe wrote:
           | Thanks! I was just following the thread about their recent
           | addition of the OpenCl support and was on the verge of trying
           | it out last weekend. I'll definitely continue once I'm home
           | again!
        
         | loxias wrote:
         | > my server is a qnap tvs-473e with an embedded amd cpu/gpu
         | 
         | That's your problem. I googled and it looks like one of these
         | all-in-one appliances like a drobo or whatever's popular these
         | days. That's not a server. (At least, I wouldn't call it a
         | server. It's an all-in-one appliance, or toy, depending on
         | perspective) And yegods, that price...
         | 
          | Spend $500, get an actual computer, not some priced-up
          | appliance, and you'll have a _much_ better time, regardless of
          | whether you spend it on more CPU or more GPU. You can get a
          | used computer off eBay for $100 and shove a $400 graphics card
          | in it. Or maybe get a Ryzen 7 7700X; I'm looking at a mobo+CPU
          | combo with that for $500 right now.
          | 
          | Finally, to make sure this response does contain an answer to
          | what you asked: ;-)
         | 
          | if you can run this stuff in a container on your appliance
          | already, but it's very slow, congrats! I'd call that a win. I
          | looked up the chip, the RX-421BD; it's of similar power to an
          | Athlon circa 2017. I think my router might have more compute
          | power. You _do_ have those 512 shader cores; given effort, you
          | could try to get them to do something useful. But I wouldn't
          | _assume_ it's possible (well, maybe you don't mind writing
          | your own shaders ;-)). Just because the chip has "some gpu"
          | doesn't mean it has "the right kind of gpu you'd need to
          | hijack for lots of matrix multiplies, without writing the
          | assembly yourself".
         | 
         | Sorry this isn't more helpful, but it's the truth.
        
       | vizzah wrote:
        | This model is better than many other contenders, but still far
        | from GPT-4.
       | 
       | "what famous brands are there which change one letter from a
       | common word to make a non-existent, but a catchy name, such as
       | "musiq" instead of "music".. etc?"
       | 
        | There are several brands that have played with words by changing
        | a letter or adding a letter to create a new and memorable name.
        | Here are a few examples:
        | 
        |   Qatar Airways - This airline's name was inspired by the
        |   country of Qatar, but with the letter "Q" replaced by "Q" to
        |   create a unique and catchy name.
        | 
        |   Lush Cosmetics - This cosmetics company's name is a play on
        |   the word "lush," which means "profuse" or "luxuriant."
        | 
        |   Pinterest - This social media site's name is a combination of
        |   the words "pin" and "interest," but with the letter "e"
        |   replaced by "i" to create a more unique and memorable name.
        | 
        |   Tumblr - This social media platform's name is a play on the
        |   words "tumble" and "blogger," but with the letter "l" replaced
        |   by "r" to create a more unique and catchy name.
       | 
       | ---------
       | 
        | That is compared to the hilarious Yi 34B Chat output below: =)
       | 
        | There are several famous brands that have used this strategy of
        | changing one letter from a common word to create a unique and
        | catchy name. Here are a few examples:
        | 
        |   eBay - Instead of "cBay," which would be a reference to
        |   "crab," eBay was named after the first three letters of the
        |   word "baY" and the "e" from the word "electronic," as in
        |   "electronic marketplace."
        | 
        |   Google - Originally named "BackRub," the company changed its
        |   name to Google, a play on the word "googol," which refers to
        |   the number one followed by one hundred zeros, expressing the
        |   vastness of the internet.
        | 
        |   Kodak - The name Kodak is derived from the word "cod" with an
        |   "k" added. It was chosen because it is short, easy to
        |   remember, and not suggestive of the photographic nature of the
        |   business.
        
         | brucethemoose2 wrote:
         | Here is my response from Yi, which is actually pretty good:
         | 
         | One well-known example is the brand Apple. The company's name
         | comes from a play on the phrase "Apple of My Eye." Another
         | example is Google, which is derived from the mathematical term
         | "googolplex," which refers to the number 1 followed by 100
         | zeroes. It was originally meant to be a play on the word
         | "googol" (1 followed by 100 zeroes), but a spelling mistake led
         | to the creation of the new word "google."
         | 
         | Other examples include PepsiCo, which combines the words
         | "pepsin" and "kola," and Amazon, which was inspired by the
         | world's largest river, the Amazon River.
         | 
         | In the case of "musiq," it seems like a made-up name without
         | any specific reference or known origin. However, some possible
         | variations could be created using existing words, such as
         | "dramatize" instead of "dramatize" or "fashin" instead of
         | "fashion."
        
           | brucethemoose2 wrote:
           | Actually I just realized I had the prompt template and
           | sampling wrong. First shot once that's fixed:
           | 
           | " Some examples of famous brands that use this strategy
           | include:
           | 
           | 1. KFC (Kentucky Fried Chicken) - The brand takes the first
           | letter of each word in its full name (Kentucky Fried Chicken)
           | and combines them to create a memorable acronym.
           | 
           | 2. Pepsi - The brand's name is a play on the word "pepsin,"
           | an enzyme used in digestion. The name was chosen because it
           | has a positive connotation and is easy to remember.
           | 
           | 3. Adidas - The brand's name is derived from the founder's
           | name, Adolf Dassler, with the 'Das' replaced by an 'a'.
           | 
           | 4. Asics - The brand's name is an acronym for "Animal, Speed,
           | Intelligence, and Spirit."
           | 
           | 5. H&M (Hennes & Mauritz) - The brand's name is a combination
           | of two Swedish words: "Hennes" means "her" in English, and
           | "Mauritz" refers to the men's clothing store "Mauritz
           | Widforss."
           | 
           | 6. Nike - The brand's name comes from the Greek goddess of
           | victory, "Nike."
           | 
           | 7. Amazon - The brand's name was inspired by the South
           | American river Amazon, which is one of the world's largest
           | rivers.
           | 
           | 8. Apple - The brand's name was chosen due to co-founder
           | Steve Jobs' belief that "Apple" was simple, memorable, and
           | had a positive connotation.
           | 
           | 9. Google - The brand's name is a play on the mathematical
           | term "googol," which represents the number 1 followed by 100
           | zeros.
           | 
           | 10. Twitter - The brand's name is a shortened version of
           | "tweet," which refers to short messages sent through the
           | platform.
           | 
           | These brands demonstrate how changing or combining letters
           | from existing words can create unique and memorable names for
           | businesses. "
           | 
            | - This is not vanilla Yi, but a merge of several finetunes.
        
         | mirekrusin wrote:
          | LLMs work on tokens, inside which the characters are hidden
          | away. They'd have to be explicitly trained to spell each token
          | out into single-letter tokens, and, since the mapping is
          | asymmetric, to go from single-letter tokens back to tokens as
          | well. I don't think anybody does this, so they're left with
          | whatever spelling information is in the training data.
          | Otherwise they have no chance to reconstruct it: the tokens
          | could just as well map to any visually equivalent language,
          | i.e. one that looks like hieroglyphs, where the questions
          | you're asking would be nonsensical.
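          | 
          | You can see this directly with a tokenizer, e.g. OpenAI's
          | tiktoken (ids differ per tokenizer; the point is the model
          | only ever sees opaque ids, never letters):
          | 
          |     import tiktoken
          | 
          |     enc = tiktoken.get_encoding("cl100k_base")
          |     print(enc.encode("music"))  # likely a single token id
          |     print(enc.encode("musiq"))  # likely split into pieces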
        
           | jakderrida wrote:
           | I thought so, too. But then I asked it to define fake words
           | that were portmanteaus I made up. Believe me, my
            | understanding of BERT and discriminative models aligned
           | perfectly with what you're saying. But testing out the theory
           | that it can break down and make meaning of fake words with
           | accurate depictions of what words I'm combining proved me
           | wrong. Generative models must work differently than you and I
           | thought.
        
         | nl wrote:
          | But both are completely wrong! And technically the Google
          | example is closer to correct than any of the others.
         | 
         | The Yi 34B eBay and Kodak examples are both (wrong but) very
         | interesting because it does seem to get the idea of changing
         | one letter.
         | 
          | Of the GPT4 examples, the Qatar example (replacing "Q" with
          | "Q"!?) is the only one that is internally consistent. The
          | Pinterest and Tumblr examples are wrong in very odd ways, in
          | that the explanation doesn't match the spelling.
        
         | behnamoh wrote:
          | Are you comparing an 8x7B model with GPT-4? Come on...
        
       | behnamoh wrote:
       | Somewhere between shiny Google releases and Mistral's magnet link
       | tweet, there's gotta be a sweet spot where you release the model
       | but also have enough decency to tell people how to use it
       | optimally. Mistral, if you're reading this, I'm talking about
       | you.
        
       ___________________________________________________________________
       (page generated 2023-12-10 23:00 UTC)