[HN Gopher] Mixtral 8x22B
       ___________________________________________________________________
        
       Mixtral 8x22B
        
       Author : meetpateltech
       Score  : 431 points
       Date   : 2024-04-17 14:00 UTC (9 hours ago)
        
 (HTM) web link (mistral.ai)
 (TXT) w3m dump (mistral.ai)
        
       | dd-dreams wrote:
        | The development never stops. In a few years we will look back and
        | see how far these models have come - how we couldn't run LLaMA 70B
        | on a MacBook Air, and now we can.
        
         | squirrel23 wrote:
         | Yes it's pretty cool. There was a neat comparison of deep
         | learning development that I think resonates quite well here.
         | 
          | Around 5 years ago, it took an average user some pretty
          | significant hardware, software and time (around a full night)
          | to create a short deepfake. Now you don't need any fancy
          | hardware and you can get decent results within 5 minutes on
          | an average computer.
        
           | azinman2 wrote:
           | That part isn't very good.
           | 
           | https://www.nytimes.com/2024/04/08/technology/deepfake-ai-
           | nu...
        
       | imjonse wrote:
        | Great to see such free-to-use and self-hostable models, but it's
        | sad that "open" now means only that. One cannot replicate this
        | model without access to the training data.
        
         | generalizations wrote:
         | ...And a massive pile of cash/compute hardware.
        
           | kiney wrote:
            | Not that massive, we're talking six figures. There was a
            | blog post about this a while back on the front page of HN.
        
             | htrp wrote:
             | for finetuning or parameter training from scratch?
        
               | kiney wrote:
               | from scratch: https://research.myshell.ai/jetmoe
        
               | kaibee wrote:
               | That's for an 8B model.
        
               | cptcobalt wrote:
               | This is over trivializing it, but there isn't much more
               | inherent complexity in training an 8B or larger model
               | other than more money, more compute, more data, more
               | time. Overall, the principles are similar.
        
               | lostmsu wrote:
                | Assuming cost scales linearly with the number of parameters,
                | that's 7.5 figures instead of 6 for an 8x22B model.
        
             | moffkalast wrote:
             | 6 figures _are_ a massive pile of cash.
        
         | ru552 wrote:
         | There's a large amount of liability in disclosing your training
         | data.
        
           | imjonse wrote:
            | Calling the model 'truly open' without that is not technically
            | correct though.
        
             | Lacerda69 wrote:
             | It's open enough for all practical purposes IMO.
        
               | nicklecompte wrote:
               | It's not "open enough" to do an honest evaluation of
               | these systems by constructing adversarial benchmarks.
        
               | imjonse wrote:
               | As open as an executable binary that you are allowed to
               | download and use for free.
        
               | bevekspldnw wrote:
               | In the age of SaaS I'll take it. It's not like I have a
               | few million dollars to pay for training even if I had all
               | the code and data.
        
           | londons_explore wrote:
           | I expect we'll see some AI companies in the future throwing
           | away the training dataset. Maybe some have already.
           | 
           | During a court case, the other side can demand discovery over
           | your training dataset, for example to see if it contains a
           | particular copyrighted work.
           | 
            | But if you've already deleted the dataset, you're far more
            | likely to win any case against you that hinges on what was in
            | it, since the plaintiff can't even prove their work was
            | included.
           | 
           | And you can argue that the dataset was very expensive to
           | store (which is true), and therefore deleted shortly after
           | training was complete. You have no obligation to keep
           | something for the benefit of potential future plaintiffs you
           | aren't even aware of yet.
        
       | endisneigh wrote:
       | Good to continue to see a permissive license here.
        
       | sa-code wrote:
       | Is this the best permissively licensed model out there?
        
         | ru552 wrote:
         | Today. Might change tomorrow at the pace this sector is at.
        
         | imjonse wrote:
         | So far it is Command R+. Let's see how this will fare on
         | Chatbot Arena after a few weeks of use.
        
           | skissane wrote:
           | > So far it is Command R+
           | 
           | Most people would not consider Command R+ to count as the
           | "best permissively licensed model" since CC-BY-NC is not
           | usually considered "permissively licensed" - the "NC" part
           | means "non-commercial use only"
        
             | imjonse wrote:
             | My bad, I remembered wrongly it was Apache too.
        
       | tinyhouse wrote:
       | Pricing?
       | 
       | Found it: https://mistral.ai/technology/#pricing
       | 
        | It'd be useful to add a link to the blog post. While it's an open
        | model, most people will only be able to use it via the API.
        
         | MacsHeadroom wrote:
         | It's open source, you can just download and run it for free on
         | your own hardware.
        
           | tinyhouse wrote:
           | Well, I don't have hardware to run a 141B parameters model,
           | even if only 39B are active during inference.
        
             | navbaker wrote:
             | It will be quantized in a matter of days and runnable on
             | most laptops.
        
               | azinman2 wrote:
                | 8-bit is 149GB. 4-bit is 80GB.
                | 
                | I wouldn't call this runnable on most laptops.
        
           | astrodust wrote:
           | "Who among us doesn't have 8 H100 cards?"
        
             | MacsHeadroom wrote:
             | Four V100s will do. They're about $1k each on ebay.
        
               | astrodust wrote:
               | $1500 each, plus the server they go in, plus plus plus
               | plus.
        
               | MacsHeadroom wrote:
               | Sure, but it's still a lot less than 8 h100s.
               | 
               | ~$8k for an LLM server with 128GB of VRAM vs like $250k+
               | for 8 H100s.
        
         | theolivenbaum wrote:
         | That looks expensive compared to what groq was offering:
         | https://wow.groq.com/
        
           | naiv wrote:
           | I also assume groq is 10-15x faster
        
           | pants2 wrote:
           | Can't wait for 8x22B to make it to Groq! Having an LLM at
           | near GPT-4 performance with Groq speed would be incredible,
           | especially for real-time voice chat.
        
       | apetresc wrote:
       | I just find it hilarious how approximately 100% of models beat
       | all other models on benchmarks.
        
         | squirrel23 wrote:
         | What do you mean?
        
           | apetresc wrote:
           | Virtually every announcement of a new model release has some
           | sort of table or graph matching it up against a bunch of
           | other models on various benchmarks, and they're always
           | selected in such a way that the newly-released model
           | dominates along several axes.
           | 
           | It turns interpreting the results into an exercise in
           | detecting which models and benchmarks were omitted.
        
             | CharlieDigital wrote:
             | It would make sense, wouldn't it? Just as we've seen rising
             | fuel efficiency, safety, dependability, etc. over the
             | lifecycle of a particular car model.
             | 
             | The different teams are learning from each other and
             | pushing boundaries; there's virtually no reason for any of
             | the teams to release a model or product that is somehow
             | inferior to a prior one (unless it had some secondary
             | attribute such as requiring lower end hardware).
             | 
             | We're simply not seeing the ones that came up short; we
             | don't even see the ones where it fell short of current
             | benchmarks because they're not worth releasing to the
             | public.
        
               | apetresc wrote:
               | That's a valid theory, a priori, but if you actually
               | follow up you'll find that the vast majority of these
               | benchmark results don't end up matching anyone's
               | subjective experience with the models. The churn at the
               | top is not nearly as fast as the press releases make it
               | out to be.
        
               | tensor wrote:
               | Subjective experience is not a benchmark that you can
               | measure success against. Also, of course new models are
               | better on some set of benchmarks. Why would someone
               | bother releasing a "new" model that is inferior to old
               | ones? (Aside from attributes like more preferable
               | licensing).
               | 
               | This is completely normal, the opposite would be strange.
        
               | andai wrote:
                | Sibling comment made a good point about benchmarks not
                | being a great indicator of real-world quality. Every time
               | something scores near GPT-4 on benchmarks, I try it out
               | and it ends up being less reliable than GPT-3 within a
               | few minutes of usage.
        
               | CharlieDigital wrote:
                | That's totally fine, but benchmarks are like standardized
                | tests such as the SAT. They measure _something_, and it
                | totally makes sense that each release bests the prior in
               | the context of these benchmarks.
               | 
               | It may even be the case that in measuring against the
               | benchmarks, these product teams sacrifice some real world
               | performance (just as a student that only studies for the
               | SAT might sacrifice some real world skills).
        
         | htrp wrote:
         | gotta cherry pick your benchmarks as much as possible
        
         | paxys wrote:
         | Benchmarks published by the company itself should be treated no
         | differently than advertising. For actual signal check out more
         | independent leaderboards and benchmarks (like HuggingFace,
         | Chatbot Arena, MMLU, AlpacaEval). Of course, even then it is
         | impossible to come up with an objective ranking since there is
         | no consensus on what to even measure.
        
         | empath-nirvana wrote:
         | Just because of the pace of innovation and scaling, right now,
         | it seems pretty natural that any new model is going to be
         | better than the previous comparable models.
        
         | michaelt wrote:
         | Benchmarks are often weird because of what a benchmark
         | inherently needs to be.
         | 
         | If you compare LLMs by asking them to tell you how to catch
         | dragonflies - the free text chat answer you get will be
         | impossible to objectively evaluate.
         | 
          | Whereas if you propose four ways to catch dragonflies and ask
          | each model to choose option A, B, C or D (or check the relative
          | probability the model assigns to those four answer tokens), the
          | result is easy to objectively evaluate - you just check whether
          | it chose the one right answer.
         | 
         | Hence a lot of the most famous benchmarks are multiple-choice
         | questions - even though 99.9% of LLM usage doesn't involve
         | answering multiple-choice questions.
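          | 
          | As a rough illustration of that kind of scoring, here's a sketch
          | with a local HF model; the model name and question are
          | placeholders, not any particular benchmark's harness:
          | 
          | ```python
          | # Score A/B/C/D by comparing the model's next-token logits.
          | import torch
          | from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          | name = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM works
          | tok = AutoTokenizer.from_pretrained(name)
          | model = AutoModelForCausalLM.from_pretrained(name)
          | 
          | prompt = ("Question: How do you catch dragonflies?\n"
          |           "A. with a net\nB. with a hammer\n"
          |           "C. with a bucket\nD. with a magnet\n"
          |           "Answer:")
          | inputs = tok(prompt, return_tensors="pt")
          | with torch.no_grad():
          |     logits = model(**inputs).logits[0, -1]  # logits for the next token
          | 
          | # Pick whichever option letter gets the highest score.
          | scores = {c: logits[tok.encode(" " + c, add_special_tokens=False)[0]].item()
          |           for c in "ABCD"}
          | print(max(scores, key=scores.get))
          | ```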
        
       | arnaudsm wrote:
       | Curious to see how it performs against GPT-4.
       | 
        | Mixtral 8x22B beats Command R+, which is at GPT-4 level on
        | LMSYS's leaderboard.
        
         | zone411 wrote:
         | LMSYS leaderboard is just one benchmark (that I think is
         | fundamentally flawed). GPT-4 is clearly better.
        
           | arnaudsm wrote:
            | Which alternative benchmarks do you recommend?
        
       | brokensegue wrote:
       | Isn't equating active parameters with cost a little unfair since
       | you still need full memory for all the inactive parameters?
        
         | tartrate wrote:
          | Well, since it affects inference speed, it means you can handle
          | more requests in less time, so you need less concurrency.
        
         | sa-code wrote:
          | Fewer active parameters at inference time make a massive
          | difference in cost for batch jobs, assuming VRAM usage is the same.
        
       | mdrzn wrote:
       | "64K tokens context window" I do wish they had managed to extend
       | it to at least 128K to match the capabilities of GPT-4 Turbo
       | 
        | Maybe this limit will become a joke when we look back? Can you
        | imagine reaching a trillion-token context window in the future,
        | as Sam speculated on Lex's podcast?
        
         | htrp wrote:
         | maybe we'll look back at token context windows like we look
         | back at how much ram we have in a system.
        
           | frabjoused wrote:
           | I agree with this in the sense that once you have enough, you
           | stop caring about the metric.
        
             | paradite wrote:
              | And how much RAM do you need to run Mixtral 8x22B? Probably
              | more than a personal laptop has.
        
               | Lacerda69 wrote:
               | I run it fine on my 64gb RAM beast.
        
               | coder543 wrote:
               | At what quantization? 4-bit is 80GB. Less than 4-bit is
               | rarely good enough at this point.
        
               | apexalpha wrote:
                | Is that normal RAM or GPU RAM?
        
               | samus wrote:
                | 64GB is not GPU RAM, but system RAM. Consumer GPUs have
                | 24GB at most, and those with good price/performance have
                | far less. Current-generation workstation GPUs are
                | unaffordable; used ones can be found on eBay for a
                | reasonable price, but they are quite slow. DDR5 RAM might
                | be a better investment.
        
               | user_7832 wrote:
                | Generally about ~1GB of RAM per billion parameters (at
                | 8-bit). I've run a 30B model (Vicuna) on my 32GB laptop
                | (but it was slow).
        
           | bamboozled wrote:
           | I still don't have enough RAM though ?
        
             | samus wrote:
             | RAM is simply too useful.
        
           | htrp wrote:
            | While you need a lot more HBM (or UMA on a Mac) to run these
            | LLMs, my overarching point is that at this point most systems
            | don't have RAM constraints for most of the software you need
            | to run, and as a result RAM becomes less of a selling point
            | except in very specialized cases like graphic design or 3D
            | rendering work.
            | 
            | If we have cheap billion-token context windows, 99% of your
            | use cases aren't going to hit anywhere close to that limit,
            | and as a result your models will "just run".
        
         | creshal wrote:
         | Wasn't there a paper yesterday that turned context evaluation
         | linear (instead of quadratic) and made effectively unlimited
          | context windows possible? Between that and 1.58-bit quantization
          | I feel like we're overdue for an LLM revolution.
        
           | samus wrote:
           | So far, people have come up with many alternatives for
           | quadratic attention. Only recently have they proven their
           | potential.
        
             | underlines wrote:
              | Tons and tons of papers, and most of them had some
              | disadvantage. Can't have your cake and eat it too:
             | 
             | https://arxiv.org/html/2404.08801v1 Meta Megalodon
             | 
             | https://arxiv.org/html/2404.07143v1 Google Infini-Attention
             | 
             | https://arxiv.org/html/2402.13753v1 LongRoPE
             | 
             | and a ton more
        
         | pseudosavant wrote:
         | FWIW, the 128k context window for GPT-4 is only for input. I
         | believe the output content is still only 4k.
        
           | moffkalast wrote:
           | How does that make any sense on a decoder-only architecture?
        
             | grey8 wrote:
             | It's not about the model. The model can output more - it's
             | about the API.
             | 
             | A better phrasing would be that they don't allow you to
             | output more than 4k tokens per message.
             | 
             | Same with Anthropic and Claude, sadly.
        
         | afro88 wrote:
          | How useful is such a large input window when most of the middle
          | isn't really used? I'm thinking mostly about coding. But when
          | putting even, say, 20k tokens into the input, a good chunk
          | doesn't seem to be "remembered" or used for the output.
        
           | jakderrida wrote:
            | While you're 100% correct, they are working on ways to make
            | the middle useful, such as "Needle in a Haystack" testing.
           | When we say we wish for context length that large, I think
           | it's implied we mean functionally. But you do make a really
           | great point.
        
       | doublextremevil wrote:
        | How much VRAM is needed to run this?
        
         | MacsHeadroom wrote:
          | 80GB in 4-bit.
          | 
          | But because only a fraction of the parameters are active per
          | token, it can run on a fast CPU in reasonable time. So 96GB of
          | DDR4 will do; 96GB of DDR5 is better.
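          | 
          | Roughly where those numbers come from (weights only; KV cache
          | and runtime overhead come on top):
          | 
          | ```python
          | # Back-of-the-envelope weight memory for Mixtral 8x22B (~141B total params).
          | total_params = 141e9
          | for name, bytes_per_param in {"fp16": 2.0, "int8": 1.0, "int4": 0.5}.items():
          |     print(f"{name}: ~{total_params * bytes_per_param / 1e9:.0f} GB of weights")
          | # int4 -> ~71 GB of weights; cache and overhead push it toward the 80 GB figure.
          | ```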
        
           | Me1000 wrote:
           | WizardLM-2 8x22b (which was a fine tune of the Mixtral 8x22b
           | base model) at 4bit was only 80GB.
        
       | noman-land wrote:
       | I'm really excited about this model. Just need someone to
       | quantize it to ~3 bits so it'll run on a 64GB MacBook Pro. I've
       | gotten a lot of use from the 8x7b model. Paired with llamafile
       | and it's just so good.
        
         | 2Gkashmiri wrote:
          | Can you explain your use case? I tried to get into offline LLMs
          | on my machine and even Android, but without discrete graphics
          | it's a slow hog, so I didn't enjoy it. But suppose I buy one,
          | what then?
        
           | andai wrote:
           | I run Mistral-7B on an old laptop. It's not very fast and
           | it's not very good, but it's just good enough to be useful.
           | 
           | My use case is that I'm more productive working with a LLM
           | but being online is a constant temptation and distraction.
           | 
           | Most of the time I'll reach for offline docs to verify. So
           | the LLM just points me in the right direction.
           | 
            | I also miss Google offline, so I'm working on a search
            | engine. I thought I could skip crawling by just downloading
            | Common Crawl, but unfortunately it's enormous and mostly junk
            | or unsuitable for my needs. So my next project is figuring
            | out how to data-mine Common Crawl to extract just the
            | interesting (to me) bits...
           | 
           | When I have a search engine and a LLM I'll be able to run my
           | own Phind, which will be really cool.
        
             | luke-stanley wrote:
             | Presumably you could run things like PageRank, I'm sure
             | people do this sort of thing with CommonCrawl. There are
             | lots of variants of graph connectivity scoring methods and
             | classifiers. What a time to be alive eh?
        
           | noman-land wrote:
           | Yes, I have a side project that uses local whisper.cpp to
           | transcribe a podcast I love and shows a nice UI to search and
           | filter the contents. I use Mixtral 8x7b in chat interface via
           | llamafile primarily to help me write python and sqlite code
           | and as a general Q&A agent. I ask it all sorts of technical
           | questions, learn about common tools, libraries, and idioms in
           | an ecosystem I'm not familiar with, and then I can go to
           | official documentation and dig in.
           | 
           | It has been a huge force multiplier for me and most
           | importantly of all, it removes the dread of not knowing where
           | to start and the dread of sending your inner monologue to
           | someone's stupid cloud.
           | 
            | If you're curious: https://github.com/noman-land/transcript.fish/
            | though this doesn't include any Mixtral stuff because I don't
            | use it programmatically (yet). I soon
           | hope to use it to answer questions about the episodes like
           | who the special guest is and whatnot, which is something I do
           | manually right now.
        
           | popf1 wrote:
           | > Can you explain your use case?
           | 
           | pretty sure you can run it un-censored... that would be my
           | use case
        
         | mathverse wrote:
          | Shopping for a new MBP. Do you think going with more RAM would
          | be wise?
        
           | noman-land wrote:
           | Unfortunately, yes. Get as much as you can stomach paying
           | for.
        
       | clementmas wrote:
       | I'm considering switching my function calling requests from
       | OpenAI's API to Mistral. Are they using similar formats? What's
       | the easiest way to use Mistral? Is it by using Huggingface?
        
         | ru552 wrote:
          | Easiest is probably with ollama [0]. I think the ollama API is
          | OpenAI-compatible.
          | 
          | [0] https://ollama.com/
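          | 
          | For example, with a recent ollama build you can point the
          | official openai Python client at the local server; the base_url
          | and model tag below are assumptions about a default local setup:
          | 
          | ```python
          | # Sketch: call a local ollama server via its OpenAI-compatible endpoint.
          | from openai import OpenAI
          | 
          | client = OpenAI(base_url="http://localhost:11434/v1",  # assumed ollama default
          |                 api_key="ollama")  # any non-empty string works locally
          | resp = client.chat.completions.create(
          |     model="mixtral:8x22b",  # whatever tag you pulled locally
          |     messages=[{"role": "user", "content": "Say hello as JSON."}],
          | )
          | print(resp.choices[0].message.content)
          | ```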
        
           | talldayo wrote:
            | Most inference servers are OpenAI-compatible. Even the
           | "official" llama-cpp server should work fine: https://github.
           | com/ggerganov/llama.cpp/blob/master/examples/...
        
           | pants2 wrote:
           | Ollama runs locally. What's the best option for calling the
           | new Mixtral model on someone else's server programmatically?
        
             | Arcuru wrote:
             | Openrouter lists several options:
             | https://openrouter.ai/models/mistralai/mixtral-8x22b
        
       | jjice wrote:
       | Does anyone have a good layman's explanation of the "Mixture-of-
       | Experts" concept? I think I understand the idea of having "sub-
       | experts", but how do you decide what each specialization is
       | during training? Or is that not how it works at all?
        
         | Keyframe wrote:
         | maybe there's one that is maitre d'llm?
        
         | londons_explore wrote:
         | Nobody decides. The network itself determines which expert(s)
         | to activate based on the context. It uses a small neural
         | network for the task.
         | 
          | It typically won't behave like human experts - you might find
          | one of the networks is an expert in determining where to place
          | capital letters or full stops, for example.
          | 
          | MoEs do not really improve accuracy - instead they reduce the
          | amount of compute required. And, assuming you have a fixed
          | compute budget, that in turn might mean you can make the model
          | bigger to get better accuracy.
        
         | HeatrayEnjoyer wrote:
          | Correct, the expert assignments are determined by the
          | algorithm, not by anything humans would recognize as a specialty.
        
         | hlfshell wrote:
          | This is a bit of a misnomer. Each expert is a sub-network that
          | specializes in patterns we can't readily interpret.
          | 
          | During training, the routing network is penalized if it does not
          | distribute training tokens evenly across the experts. This
          | prevents any one or two sub-networks from becoming the primary
          | ones.
          | 
          | The result is that each token has essentially even probability
          | of being routed to one of the sub-networks, with the underlying
          | logic of why that network is the "expert" for that token being
          | beyond our understanding or description.
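          | 
          | A minimal sketch of what that balancing pressure can look like,
          | in the style of the Switch Transformer auxiliary loss (the exact
          | loss Mistral used isn't public, so treat this as illustrative):
          | 
          | ```python
          | # Toy load-balancing auxiliary loss for a MoE router, in PyTorch.
          | import torch
          | import torch.nn.functional as F
          | 
          | def load_balancing_loss(router_logits, num_experts):
          |     # router_logits: (num_tokens, num_experts)
          |     probs = F.softmax(router_logits, dim=-1)
          |     assigned = probs.argmax(dim=-1)                    # top-1 expert per token
          |     frac_tokens = torch.bincount(assigned, minlength=num_experts).float()
          |     frac_tokens /= router_logits.shape[0]              # share of tokens per expert
          |     frac_probs = probs.mean(dim=0)                     # mean router prob per expert
          |     # Minimized when both are uniform, i.e. the experts share the load evenly.
          |     return num_experts * torch.sum(frac_tokens * frac_probs)
          | 
          | aux = load_balancing_loss(torch.randn(1024, 8), num_experts=8)
          | ```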
        
           | api wrote:
           | A decent _loose_ analogy might be database sharding.
           | 
           | Basically you're sharding the neural network by "something"
           | that is itself tuned during the learning process.
        
           | fire_lake wrote:
           | Why do we expect this to perform better? Couldn't a regular
           | network converge on this structure anyways?
        
             | imjonse wrote:
             | It is a type of ensemble model. A regular network could do
             | it, but a MoE will select a subset to do the task faster
             | than the whole model would.
        
             | rgbrgb wrote:
             | Here's my naive intuition: in general bigger models can
             | store more knowledge but take longer to do inference. MoE
             | provides a way to blend the advantages of having a bigger
             | model (more storage) with the advantages of having smaller
             | models at inference time (faster, less memory required).
              | When you do inference, tokens hit a small routing layer that
              | load-balances across the experts and activates 1 or 2 of them. So
             | you're storing roughly 8 x 22B "worth" of knowledge without
             | having to run a model that big.
             | 
             | Maybe a real expert can confirm if this is correct :)
        
               | cjbprime wrote:
               | Not quite, you don't save memory, only compute.
        
               | nialv7 wrote:
               | Sounds like the "you only use 10% of your brain" myth,
               | but actually real this time.
        
               | samus wrote:
                | Almost :) the model chooses experts in every block. For a
                | typical 7B-class model with 32 blocks and 8 experts each,
                | there will be 8^32 = 2^96 paths through the whole model.
        
             | og_kalu wrote:
              | It doesn't perform better, and until recently MoE models
              | actually underperformed their dense counterparts. The real
              | gain is sparsity. You have this huge x-parameter model that
              | performs like an x-parameter model, but you don't have to
              | use all those parameters at once every time, so you save
              | a lot on compute, both in training and inference.
        
           | andai wrote:
           | I heard MoE reduces inference costs. Is that true? Don't all
           | the sub networks need to be kept in RAM the whole time? Or is
           | the idea that it only needs to run compute on a small part of
           | the total network, so it runs faster? (So you complete more
           | requests per minute on same hardware.)
           | 
           | Edit: Apparently each part of the network is on a separate
           | device. Fascinating! That would also explain why the routing
           | network is trained to choose equally between experts.
           | 
           | I imagine that may reduce quality somewhat though? By forcing
           | it to distribute problems equally across all of them, whereas
           | in reality you'd expect task type to conform to the pareto
           | distribution.
        
             | Filligree wrote:
             | The latter. Yes, it all needs to stay in memory.
        
             | MPSimmons wrote:
             | >I heard MoE reduces inference costs
             | 
              | Computational costs, yes. You still take the same amount of
              | time for processing the prompt, but each token created
              | through inference costs less computationally than if you
              | were running it through _all_ the experts.
        
             | samus wrote:
             | It should increase quality since those layers can
             | specialize on subsets of the training data. This means that
             | getting better in one domain won't make the model worse in
             | all the others anymore.
             | 
             | We can't really tell what the router does. There have been
             | experiments where the router in the early blocks was
             | compromised, and quality only suffered moderately. In later
             | layers, as the embeddings pick up more semantic
             | information, it matters more and might approach our naive
             | understanding of the term "expert".
        
           | wenc wrote:
           | Would it be analogous to say instead of having a single Von
           | Neumann who is a polymath, we're posing the question to a
           | pool of people who are good at their own thing, and one of
           | them gets picked to answer?
        
             | Filligree wrote:
             | Not really. The "expert" term is a misnomer; it would be
             | better put as "brain region".
             | 
             | Human brains seem to do something similar, inasmuch as
             | blood flow (and hence energy use) per region varies
             | depending on the current problem.
        
           | andai wrote:
           | Any idea why everyone seems to be using 8 experts? (Or was
           | GPT-4 using 16?) Did we just try different numbers and found
           | 8 was the optimum?
        
             | wongarsu wrote:
             | Probably because 8 GPUs is a common setup, and with 8
             | experts you can put each expert on a different GPU
        
           | andai wrote:
           | Has anyone tried MoE at smaller scales? e.g. a 7B model
           | that's made of a bunch of smaller ones? I guess that would be
           | 8x1B.
           | 
           | Or would that make each expert too small to be useful?
           | TinyLlama is 1B and it's _almost_ useful! I guess 8x1B would
           | be Mixture of TinyLLaMAs...
        
             | auspiv wrote:
             | The previous mixtral is 8x7B
        
             | jasonjmcghee wrote:
             | Yes there are many fine tunes on huggingface. Search "8x1B
             | huggingface"
        
             | samus wrote:
             | There is Qwen1.5-MoE-A2.7B, which was made by upcycling the
             | weights of Qwen1.5-1.8B, _splitting_ it and finetuning it.
        
         | jsemrau wrote:
         | There is some good documentation around mergekit available that
         | actually explains a lot and might be a good place to start.
        
         | zozbot234 wrote:
          | It's really a kind of enforced sparsity, in that it requires
          | that only a limited number of blocks be active at a time during
          | inference. Which blocks will be active for each token is decided
          | by the network itself as part of training.
         | 
         | (Notably, MoE should not be conflated with ensemble techniques,
         | which is where you would train entire separate networks, then
         | use heuristic techniques to run inference across all of them
         | simultaneously and combine the results.)
        
         | huevosabio wrote:
         | Ignore the "experts" part, it misleads a lot of people [0].
         | There is no explicit specialization in the most popular setups,
         | it is achieved implicitly through training. In short: MoEs add
         | multiple MLP sublayers and a routing mechanism after each
         | attention sublayer and let the training procedure learn the MLP
         | parameters and the routing parameters.
         | 
         | In a longer, but still rough, form...
         | 
         | How these transformers work is roughly:
         | 
         | ``` x_{l+1} = mlp_l(attention_l(x_l)) ```
         | 
         | where `x_l` is the hidden representation at layer l,
         | `attention_l` is the attention sublayer at layer l, and `mlp_l`
         | is the multilayer perceptron at sublayer l.
         | 
         | This MLP layer is very expensive because it is fully connected
         | (i.e. every input has a weight to every output). So! MoEs
         | instead of creating an even bigger, more expensive MLP to get
         | more capability, they create K MLP sublayers (the "experts")
         | and a router that decides which MLP sublayers to use. This
          | router spits out an importance score for each MLP "expert", and
          | then you choose the top T MLPs and take an importance-weighted
          | average, so roughly:
         | 
         | ``` x_{l+1} = \sum_e mlp_{l,e}(attention_l(x_l)) *
         | importance_score_{l, e} ```
         | 
          | where `importance_score_{l, e}` is the score computed by the
          | router at layer l for "expert" e, i.e. `importance_score_{l} =
          | softmax(router_l(attention_l(x_l)))`. Note that here we
         | are adding all experts, but in reality we choose the top T,
         | often 2, and use that.
         | 
         | [0] some architectures do, in fact, combine domain experts to
         | make a greater whole, but not the currently popular flavor
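          | 
          | A tiny numpy sketch of one such MoE sublayer acting on a single
          | token's hidden vector (toy dimensions, random weights; just the
          | routing math above, not any real implementation):
          | 
          | ```python
          | # Toy MoE sublayer: router -> softmax -> top-2 experts -> weighted sum.
          | import numpy as np
          | 
          | d, n_experts, top_t = 16, 8, 2
          | rng = np.random.default_rng(0)
          | W_router = rng.normal(size=(d, n_experts))
          | experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
          |            for _ in range(n_experts)]          # each "expert" is a tiny MLP
          | 
          | def moe(x):                                    # x: hidden state after attention
          |     scores = x @ W_router
          |     probs = np.exp(scores - scores.max())
          |     probs /= probs.sum()                       # softmax over experts
          |     top = np.argsort(probs)[-top_t:]           # indices of the top-T experts
          |     out = np.zeros_like(x)
          |     for e in top:
          |         W1, W2 = experts[e]
          |         out += probs[e] * (np.maximum(x @ W1, 0) @ W2)  # weighted expert output
          |     return out
          | 
          | print(moe(rng.normal(size=d)).shape)           # (16,)
          | ```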
        
           | Quarrel wrote:
           | So it is somewhat like a classic random forest or maybe
           | bagging, where you're trying to stop overfitting, but you're
           | also trying to train that top layer to know who could be the
           | "experts" given the current inputs so that you're minimising
            | the number of MLP sublayers called during inference?
        
             | huevosabio wrote:
             | Yea, it's very much bagging + top layer (router) for the
             | importance score!
        
           | DougBTX wrote:
           | Would this be a reasonable explanation?
           | 
           | > MLPs are universal function approximators, but these models
           | are big enough that it is better to train many small
           | functions rather than a single unified function. MoE is a
           | mechanism to force different parts of the model to learn
           | distinct functions.
        
             | samus wrote:
             | It misses the crucial detail that every transformer layer
             | chooses the experts independently from the others. Of
             | course they still indirectly influence each other since
             | each layer processes the output of the previous one.
        
         | woadwarrior01 wrote:
         | Not quite a layman's explanation, but if you're familiar with
         | the implementation(s) of vanilla decoder only transformers,
         | mixture-of-experts is just a small extension.
         | 
         | During inference, instead of a single MLP in each transformer
         | layer, MoEs have `n` MLPs and a single layer "gate" in each
         | transformer layer. In the forward pass, softmax of the gate's
         | output is used to pick the top `k` (where k is < n) MLPs to
         | use. The relevant code snippet in the HF transformers
         | implementation is very readable IMO, and only about 40 lines.
         | 
         | https://github.com/huggingface/transformers/blob/main/src/tr...
        
         | adtac wrote:
         | As always, code is the best documentation:
         | https://github.com/ggerganov/llama.cpp/blob/8dd1ec8b3ffbfa2d...
        
         | vineyardmike wrote:
         | It's not "experts" in the typical sense of the word. There is
         | no discrete training to learn a particular skill in one expert.
         | It's more closely modeled as a bunch of smaller models grafted
         | together.
         | 
         | These models are actually a collection of weights for different
         | parts of the system. It's not "one" neural network.
         | Transformers are composed of layers of transformations to the
         | input, and each step can have its own set of weights. There was
         | a recent video on the front page that had a good introduction
         | to this. There is the MLP, there are the attention heads, etc.
         | 
         | With that in mind, a MoE model is basically where one of those
         | layers has X different versions of the weights, and then an
         | added layer (another neural network with its own weights) that
         | picks the version of "expert" weights to use.
        
         | jerpint wrote:
         | The simplest way to think about it is a form of dropout but
         | instead of dropping weights, you drop an entire path of the
         | network
        
       | jonnycomputer wrote:
       | These LLMs are making RAM great again.
       | 
       | Wish I had invested in the extra 32GB for my mac laptop.
        
         | Workaccount2 wrote:
         | You can't upgrade it?
         | 
         | Edit: I haven't owned a laptop for years, probably could have
         | surmised they'd be more user hostile nowadays.
        
           | paxys wrote:
           | > mac laptop
        
           | kristopolous wrote:
           | Everything is soldered in these days.
           | 
           | It's complete garbage. And most of the other vendors just
           | copy Apple so even things like Lenovo have the same problems.
           | 
           | The current state of laptops is such trash
        
             | sva_ wrote:
             | Plenty of laptops still have SO-DIMM, such as EliteBook for
             | example.
             | 
             | People need to vote with their wallet, and not buy stuff
             | that goes against their principles.
        
               | popf1 wrote:
               | There are so many variables though ... most of the time
               | you have to compromise on a few things.
        
               | GeekyBear wrote:
               | With SO-DIMM you gain expandability at the cost of higher
               | power draw and latency as well as lower throughput.
               | 
               | > SO-DIMM memory is inherently slower than soldered
               | memory. Moreover, considering the fact that SO-DIMM has a
               | maximum speed of 6,400MHz means that it won't be able to
               | handle the DDR6 standard, which is already in the works.
               | 
               | https://fossbytes.com/camm2-ram-standard/
        
               | kristopolous wrote:
               | There needs to be more fidelity than "vote with wallet".
               | Let's say I decided to not purchase your product. Why?
               | 
               | The question remains unanswered. Perhaps I didn't see it
               | for sale or Bob in accounting just got one and I didn't
               | want to look like I was copying Bob.
               | 
               | Even at scale this doesn't work. Let's say Lenovo
               | switches to making all of their laptops hot pink with
               | bedazzled rhinestone butterflies and sales plummet. You
               | could argue it was the wrong pink or that the butterflies
               | didn't shimmer enough ... any hypothesis you wish.
               | 
                | The market provides an extremely low-information signal
                | that really doesn't suggest any course of action.
               | 
               | If we really want something better, there needs to be
               | more fruitful and meaningful communication lines. I've
               | come up with various ideas over the years but haven't
               | really implemented them.
        
               | qeternity wrote:
               | You misunderstand the signal. The signal is "you're doing
               | something wrong". Companies have tremendous incentive to
               | figure out what that is. They do huge amounts of market
               | research and customer feedback.
        
             | woadwarrior01 wrote:
              | These days with Apple Silicon, RAM is part of the SoC
              | package. It's not even soldered onto the board, it's in the
              | same package as the chip. Although TBF, they also offer
              | insane memory bandwidth.
        
             | GeekyBear wrote:
             | > most of the other vendors just copy Apple
             | 
             | Weird conspiracy theories aside, the low power variant of
             | RAM (LPDDR) has to be soldered onto the motherboard, so
             | laptops designed for longer battery life have been using it
             | for years now.
             | 
             | The good news is that a newer variant of low power RAM has
             | just been standardized that features low power RAM in
             | memory modules, although they attach with screws and not
             | clips.
             | 
             | https://fossbytes.com/camm2-ram-standard/
        
           | jonnycomputer wrote:
            | I really, really like my MacBook Pro. But dammit, you can't
            | upgrade the thing (Mac laptops aren't upgradeable anymore).
            | I got an M1 Max in 2021 with 32GB of RAM. I did not anticipate
            | needing more than 32GB for anything I'd be doing on it. Turns
           | out, a couple of years later, I like to run local LLMs that
           | max out my available memory.
        
             | jonnycomputer wrote:
             | I say 2021, but truth is the supply chain was so trash that
             | year that it took almost a year to actually get delivered.
             | I don't think I actually started using the thing until
             | 2022.
        
               | jonnycomputer wrote:
                | I got downvoted for stating a true fact? That I ordered
                | the new M1 Max in 2021 and it took almost a year for me
                | to actually get it? It's true.
        
           | paxys wrote:
           | You are getting downvoted because you vaguely suggested
           | something negative about an Apple product, as is my comment
           | below
        
             | jonnycomputer wrote:
              | People are absurd with their downvotes. I got downvoted for
              | saying it took almost a year for my MacBook to arrive once
              | I ordered it. It's true. But it's also true that supply
              | chains were a wreck at the time. Apple wasn't the only tech
              | gadget that took forever to arrive.
        
       | hubraumhugo wrote:
       | It feels absolutely amazing to build an AI startup right now.
       | It's as if your product automatically becomes cheaper, more
       | reliable, and more scalable with each new major model release.
       | 
       | - We first struggled with limited context windows [solved]
       | 
        | - We had issues with consistent JSON output [solved]
       | 
       | - We had rate limiting and performance issues for the large 3rd
       | party models [solved]
       | 
       | - Hosting our own OSS models for small and medium complex tasks
       | was a pain [solved]
       | 
        | Obviously every startup still needs to build up defensibility and
       | focus on differentiating with everything "non-AI".
        
         | paxys wrote:
         | We are going to quickly reach the point where most of these AI
         | startups (which do nothing but provide thin wrappers on top of
         | public LLMs) aren't going to be needed at all. The
         | differentiation will need to come from the value of the end
         | product put in front of customers, not the AI backend.
        
           | layble wrote:
           | Sure, in the same way SaaS companies are just thin wrappers
           | on top of databases and the open web.
        
             | imjonse wrote:
             | You will find that a disproportionately large amount of
             | work and innovation in an AI product is in the backing
             | model (GPT, Mixtral, etc.). While there's a huge amount of
             | work in databases and the open web, SaaS products typically
             | add a lot more than a thin API layer and a shiny website
             | (well some do but you know what I mean)
        
               | tomrod wrote:
               | I'd argue the comment before you is describing
               | accessibility, features, and services -- yes, the core
               | component has a wrapper, but that wrapper differentiates
               | the use.
        
           | wongarsu wrote:
            | The same happened to image recognition. We've had great
            | algorithms for many years now. You can't make a company out
           | of having the best image recognition algorithm, but you
           | absolutely can make a company out of a device that spots
           | defects in the paintjob in a car factory, or that spots
           | concrete cracks in the tunnel segments used by a tunnel
           | boring machine, or by building a wildlife camera that counts
           | wildlife and exports that to a central website. All of them
           | just fine-tune existing algorithms, but the value delivered
           | is vastly different.
           | 
           | Or you can continue selling shovels. Still lots of expensive
           | labeling services out there, to stay in the image-recognition
           | parallel
        
             | pradn wrote:
             | The key thing is AI models are services not products. The
             | real world changes, so you have to change your model. Same
             | goes for new training data (examples, yes/no labels,
             | feedback from production use), updating biases (compliance,
             | changing societal mores). And running models in a highly-
             | available way is also expertise. Not every company wants to
             | be in the ML-ops business.
        
             | HeatrayEnjoyer wrote:
             | The dynamic does seem to be different with the newer
             | systems. Larger more general systems are better than small
             | specialized models.
             | 
             | GPT-4 is SOTA at OCR and sentiment classification, for
             | example.
        
         | sleepingreset wrote:
         | If you don't mind, I'm trying to experiment w/ local models
         | more. Just now getting into messing w/ these but I'm struggling
         | to come up w/ good use cases.
         | 
         | Would you happen to know of any cool OSS model projects that
         | might be good inspiration for a side project?
         | 
         | Wondering what most people use these local models for
        
           | sosuke wrote:
           | No ideas about side projects or anything "productive" but for
           | a concrete example look at SillyTavern. Making fictional
           | characters. Finding narratives, stories, role-play for
           | tabletop games. You can even have group chats of AI
           | characters interacting. No good use cases for profit but
           | plenty right now for exploration and fun.
        
           | wing-_-nuts wrote:
            | One idea that I've been mulling over: given how controllable
            | Linux is from the command line, I think it would be somewhat
            | easy to set up a voice-to-text pipeline into a local LLM that
            | could control _pretty_ much everything on command.
            | 
            | It would flat out embarrass Alexa. Imagine 'Hal, play a movie'
            | or 'Hal, play some music', and it's all running locally, with
            | _your_ content.
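            | 
            | A toy sketch of that loop, assuming whisper (or whisper.cpp)
            | has already produced the transcription and a local ollama-style
            | server is running; the endpoint, model tag and prompt are all
            | assumptions:
            | 
            | ```python
            | # Toy: transcribed speech -> local LLM -> shell command, with confirmation.
            | import subprocess, requests
            | 
            | def command_for(utterance: str) -> str:
            |     resp = requests.post("http://localhost:11434/api/generate", json={
            |         "model": "mixtral",   # any local model tag
            |         "prompt": "Turn this request into a single shell command. "
            |                   "Output only the command.\nRequest: " + utterance,
            |         "stream": False,
            |     })
            |     return resp.json()["response"].strip()
            | 
            | cmd = command_for("play some music from my library")
            | if input(f"Run `{cmd}`? [y/N] ").lower() == "y":
            |     subprocess.run(cmd, shell=True)
            | ```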
        
             | mikegreenberg wrote:
             | There are a few projects doing this. This one piqued my
             | interest as having a potentially nice UX after some
             | maturity. https://github.com/OpenInterpreter/01
        
         | milansuk wrote:
         | The progress is insane. A few days ago I started being very
         | impressed with LLM coding skills. I wanted Golang code, instead
         | of Python, which you can see in many demos. The prompt was:
         | 
          | Write a Golang func which accepts the path to a .gpx file
          | and outputs a JSON string with points (x=total distance in km,
          | y=elevation). Don't use any library.
        
         | jasonjmcghee wrote:
         | How are you approaching hosting? vLLM?
        
         | neillyons wrote:
         | > We had issues with consistent JSON ouput [solved]
         | 
         | It says the JSON output is constrained via their platform (on
         | la Plateforme).
         | 
         | Does that mean JSON output is only available in the hosted
          | version? Are there any small models that can be self-hosted
          | that output valid JSON?
        
           | ajcp wrote:
           | > Does that mean JSON output is only available in the
           | [self]-hosted version?
           | 
           | I would assume so. They probably constrain JSON output so
           | that the JSON response doesn't bork the front-end/back-end of
           | la Plateforme itself as it moves through their code back to
           | you.
        
         | yodsanklai wrote:
         | > It's as if your product automatically becomes cheaper, more
         | reliable, and more scalable with each new major model release.
         | 
         | and so do your competitor's products.
        
           | samus wrote:
           | Any business idea built almost exclusively on AI, without
           | adding much value, is doomed from the start. AI is not good
           | enough to make humans obsolete yet. But a well finetuned
           | model can for sure augment what individual humans can do.
        
       | Lacerda69 wrote:
       | I have been using mixtral daily since it was released for all
       | kinds of writing and coding tasks. Love it and massively invested
        | in Mistral's mission.
       | 
       | Keep on doing this great work.
       | 
       | Edit: been using the previous version, seems like this one is
       | even better?
        
       | spenceryonce wrote:
       | I can't even begin to describe how excited I am for the future of
       | AI.
        
       | iFire wrote:
        | It wasn't clear, but how much hardware does it take to run Mixtral
        | 8x22B (mistral.ai) locally?
        
         | ru552 wrote:
          | A MacBook with 64GB of RAM.
        
           | noman-land wrote:
           | At what quantization?
        
       | ChicagoDave wrote:
       | We need larger context windows, otherwise we're running the same
       | path with marginally different results.
        
       | luke-stanley wrote:
       | I'm confused on the instruction fine-tuning part that is
       | mentioned briefly, in passing. Is there an open weight instruct
       | variant they've released? Or is that only on their platform?
       | Edit: It's on HuggingFace, great, thanks replies!
        
         | freedmand wrote:
         | I just found this on HuggingFace:
         | https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
        
         | sva_ wrote:
         | https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
        
       | austinsuhr wrote:
       | Is 8x22B gonna make it to Le Chat in the near future?
        
       | ayolisup wrote:
       | What's the best way to run this on my Macbook Pro?
       | 
       | I've tried LMStudio, but I'm not a fan of the interface compared
       | to OpenAI's. The lack of automatic regeneration every time I edit
       | my input, like on ChatGPT, is quite frustrating. I also gave
       | Ollama a shot, but using the CLI is less convenient.
       | 
       | Ideally, I'd like something that allows me to edit my settings
       | quite granularly, similar to what I can do in OpenLM, with the
       | QoL from the hosted online platforms, particularly the ease of
       | editing my prompts that I use extensively.
        
         | duckkg5 wrote:
         | Ollama with WebUI https://github.com/open-webui/open-webui
        
           | shaunkoh wrote:
           | Not sure why your comment was downvoted. ^ is absolutely the
           | right answer.
           | 
            | Open WebUI is functionally identical to the ChatGPT
            | interface. You can even use it with the OpenAI APIs to have
            | your own pay-per-use GPT-4. I did this.
        
             | lubesGordi wrote:
             | Hey can you guys elaborate how this works? I'm looking at
             | the Ollama section in their docs and it talks about load
             | balancing? I don't understand what that means in this
             | context.
        
         | chown wrote:
         | You can try Msty as well. I am the author.
         | 
         | https://msty.app
        
         | mcbuilder wrote:
         | openrouter.ai is a fantastic idea if you don't want to self
         | host
        
       | byteknight wrote:
       | First test I tried to run a random taxation question through it
       | 
       | Output:
       | https://gist.github.com/IAmStoxe/7fb224225ff13b1902b6d172467...
       | 
       | Within the first paragraph, it outputs:
       | 
       | > GET AN ESSAY WRITTEN FOR YOU FROM AS LOW AS $13/PAGE
       | 
       | Thought that was hilarious.
        
         | orost wrote:
         | That's not the model this post is about. You used the base
         | model, not trained for tasks. (The instruct model is probably
         | not on ollama yet.)
        
           | byteknight wrote:
           | I absolutely did not:
           | 
           | ollama run mixtral:8x22b
           | 
           | EDIT: I like how you ninja-editted your comment ;)
        
             | orost wrote:
             | Considering "mixtral:8x22b" on ollama was last updated
             | yesterday, and Mixtral-8x22B-Instruct-v0.1 (the topic of
             | this post) was released about 2 hours ago, they are not the
             | same model.
        
               | byteknight wrote:
               | Are we looking at the same page?
               | 
               | https://imgur.com/a/y6XfpBl
               | 
               | And even the direct tag page:
               | https://ollama.com/library/mixtral:8x22b shows
               | 40-something minutes ago: https://imgur.com/a/WNhv70B
        
               | orost wrote:
               | Let me clarify.
               | 
               | Mixtral-8x22B-v0.1 was released a couple days ago. The
               | "mixtral:8x22b" tag on ollama currently refers to it, so
               | it's what you got when you did "ollama run
               | mixtral:8x22b". It's a base model only capable of text
               | completion, not any other tasks, which is why you got a
               | terrible result when you gave it instructions.
               | 
               | Mixtral-8x22B-Instruct-v0.1 is an instruction-following
               | model based on Mixtral-8x22B-v0.1. It was released two
               | hours ago and it's what this post is about.
               | 
               | (The last updated 44 minutes ago refers to the entire
               | "mixtral" collection.)
        
               | belter wrote:
               | I get:
               | 
               | ollama run mixtral:8x22b
               | 
               | Error: exception create_tensor: tensor
               | 'blk.0.ffn_gate.0.weight' not found
        
               | Me1000 wrote:
               | You need to update ollama to 0.1.32.
        
               | belter wrote:
               | Thanks. That did it.
        
               | gliptic wrote:
               | And where does it say that's the instruct model?
        
           | mysteria wrote:
           | Yeah this is exactly what happens when you ask a base model a
           | question. It'll just attempt to continue what you already
           | wrote based off its training set, so if you say have it
           | continue a story you've written it may wrap up the story and
           | then ask you to subscribe for part 2, followed by a bunch of
           | social media comments with reviews.
        
         | woadwarrior01 wrote:
          | Looks like an issue with the quantization that ollama (i.e.
         | llama.cpp) uses and not the model itself. It's common knowledge
         | from Mixtral 8x7B that quantizing the MoE gates is pernicious
         | to model perplexity. And yet they continue to do it. :)
        
           | cjbprime wrote:
           | No, it's unrelated to quantization, they just weren't using
           | the instruct model.
        
         | jmorgan wrote:
         | The `mixtral:8x22b` tag still points to the text completion
         | model - instruct is on the way, sorry!
         | 
          | Update: mixtral:8x22b now points to the instruct model:
          | 
          |     ollama pull mixtral:8x22b
          |     ollama run mixtral:8x22b
        
         | renewiltord wrote:
         | Not instruct tuned. You're (actually) "holding it wrong".
        
       | kristianp wrote:
        | So this one is 3x the size but only 7% better on MMLU? Given
        | Moore's law is mostly dead, this trend is going to make for even
        | more extremely expensive compute for next-gen AI models.
        
         | GaggiX wrote:
         | That's 25% fewer errors.
        
       | stainablesteel wrote:
        | Is this different from their "large" model?
        
       | jhoechtl wrote:
       | Did anyone have success getting danswer and ollama to work
       | together?
        
       | ado__dev wrote:
       | We rolled out Mixtral 8x22b to our LLM Litmus Test at s0.dev for
        | Cody AI. Don't have enough data to say it's better or worse than
       | other LLMs yet, but if you want to try it out for coding
       | purposes, let me know your experience.
        
       | CharlesW wrote:
       | Dumb question: Are "non-instructed" versions of LLMs just raw,
       | no-guardrail versions of the "instructed" versions that most end-
       | users see? And why does Mixtral need one, when OpenAI LLMs do
       | not?
        
         | hnuser123456 wrote:
         | https://platform.openai.com/docs/models/gpt-base
         | 
         | https://platform.openai.com/docs/guides/text-generation/comp...
        
           | CharlesW wrote:
           | I appreciate the correction, thanks!
        
         | kingsleyopara wrote:
          | LLMs are first trained to predict the next most likely word
          | (or token, if you want to be accurate) from web crawls. These
          | models are basically great at continuing unfinished text but
          | can't really be used for instructions, e.g. Q&A or chatting -
          | this is the "non-instructed" version. These models are then
          | fine-tuned for instructions using additional data from human
          | interaction - these are the "instructed" versions, which are
          | what end users (of e.g. ChatGPT, Gemini, etc.) see.
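          | 
          | In practice the difference shows up in how you prompt them: the
          | instruct version expects its chat template, while the base model
          | just continues whatever text you give it. A small sketch with the
          | HF tokenizer (model name from the HuggingFace release):
          | 
          | ```python
          | # Instruct models expect the chat template; base models just continue text.
          | from transformers import AutoTokenizer
          | 
          | tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
          | messages = [{"role": "user", "content": "What is the capital of France?"}]
          | print(tok.apply_chat_template(messages, tokenize=False))
          | # -> the question wrapped in the [INST] ... [/INST] format the model was
          | #    tuned on. Feed the same question raw to the base model and it will
          | #    simply continue the text, which is why it can drift into ads or
          | #    forum-style replies.
          | ```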
        
           | CharlesW wrote:
           | Very helpful, thank you.
        
       | elorant wrote:
       | Seems that Perplexity Labs already offers a free demo of it.
       | 
       | https://labs.perplexity.ai/
        
         | batperson wrote:
         | That's the old/regular model. This post is about the new
         | "instruct" model.
        
       | yodsanklai wrote:
       | How does this compare to ChatGPT4?
        
       | orra wrote:
       | Is this release a pleasant surprise? Mistral weakened their
       | commitment to open source when they partnered with Microsoft.
       | 
       | It's nice they're using some of the money from their commercial
       | and proprietary models, to improve the state of the art for open
       | source (open weights) models.
        
         | qeternity wrote:
         | Mistral just released the most powerful open weight model in
         | the history of humanity.
         | 
         | How did they weaken their commitment to open weights?
        
           | orra wrote:
           | > Mistral just released the most powerful open weight model
           | in the history of humanity.
           | 
           | Well, yeah, it's very welcome, but 'history of humanity' is
           | hyperbole given ChatGPT isn't even two years old.
           | 
           | > How did they weaken their commitment to open weights?
           | 
           | Before https://web.archive.org/web/20240225001133/https://mis
           | tral.a... versus after https://web.archive.org/web/2024022702
           | 5408/https://mistral.a... the Microsoft partnership
           | announcement:
           | 
           | > Committing to open models.
           | 
           | to
           | 
           | > That is why we started our journey by releasing the world's
           | most capable open-weights models
           | 
           | There were similar changes on their about the Company page.
        
       | zone411 wrote:
       | It ranks between Mistral Small and Mistral Medium on my NYT
       | Connections benchmark and is indeed better than Command R Plus
       | and Qwen 1.5 Chat 72B, which were the top two open weights
       | models. Grok 1.0 is not an instruct model, so it cannot be
       | compared fairly.
        
       ___________________________________________________________________
       (page generated 2024-04-17 23:01 UTC)