[HN Gopher] Mixtral 8x22B
___________________________________________________________________
Mixtral 8x22B
Author : meetpateltech
Score : 431 points
Date : 2024-04-17 14:00 UTC (9 hours ago)
(HTM) web link (mistral.ai)
(TXT) w3m dump (mistral.ai)
| dd-dreams wrote:
| The development never stops. In a few years we'll look back and
| see how far these models have come: how we couldn't run LLaMA
| 70B on a MacBook Air, and now we can.
| squirrel23 wrote:
| Yes it's pretty cool. There was a neat comparison of deep
| learning development that I think resonates quite well here.
|
| Around 5 years ago, it took an average user some pretty
| significant hardware, software, and time (around a full night)
| to create a short deepfake. Now you don't need any fancy
| hardware, and you can get decent results within 5 minutes on
| an average computer.
| azinman2 wrote:
| That part isn't very good.
|
| https://www.nytimes.com/2024/04/08/technology/deepfake-ai-
| nu...
| imjonse wrote:
| Great to see such free-to-use and self-hostable models, but it's
| sad that "open" now means only that. One cannot replicate this
| model without access to the training data.
| generalizations wrote:
| ...And a massive pile of cash/compute hardware.
| kiney wrote:
| not that massive, we're talking six figures. There was a
| blog post about this a while back on the front page of HN.
| htrp wrote:
| for finetuning or parameter training from scratch?
| kiney wrote:
| from scratch: https://research.myshell.ai/jetmoe
| kaibee wrote:
| That's for an 8B model.
| cptcobalt wrote:
| This is over trivializing it, but there isn't much more
| inherent complexity in training an 8B or larger model
| other than more money, more compute, more data, more
| time. Overall, the principles are similar.
| lostmsu wrote:
| Assuming cost scales linearly with parameter count, that's 7.5
| figures instead of 6 for an 8x22B model.
| moffkalast wrote:
| 6 figures _are_ a massive pile of cash.
| ru552 wrote:
| There's a large amount of liability in disclosing your training
| data.
| imjonse wrote:
| Calling the model 'truly open' without it is not technically
| correct though.
| Lacerda69 wrote:
| It's open enough for all practical purposes IMO.
| nicklecompte wrote:
| It's not "open enough" to do an honest evaluation of
| these systems by constructing adversarial benchmarks.
| imjonse wrote:
| As open as an executable binary that you are allowed to
| download and use for free.
| bevekspldnw wrote:
| In the age of SaaS I'll take it. It's not like I have a
| few million dollars to pay for training even if I had all
| the code and data.
| londons_explore wrote:
| I expect we'll see some AI companies in the future throwing
| away the training dataset. Maybe some have already.
|
| During a court case, the other side can demand discovery over
| your training dataset, for example to see if it contains a
| particular copyrighted work.
|
| But if you've already deleted the dataset, you're far more
| likely to win any case against you that hinges on what was in
| the dataset if the plaintiff can't even prove their work was
| included.
|
| And you can argue that the dataset was very expensive to
| store (which is true), and therefore deleted shortly after
| training was complete. You have no obligation to keep
| something for the benefit of potential future plaintiffs you
| aren't even aware of yet.
| endisneigh wrote:
| Good to continue to see a permissive license here.
| sa-code wrote:
| Is this the best permissively licensed model out there?
| ru552 wrote:
| Today. Might change tomorrow at the pace this sector is at.
| imjonse wrote:
| So far it is Command R+. Let's see how this will fare on
| Chatbot Arena after a few weeks of use.
| skissane wrote:
| > So far it is Command R+
|
| Most people would not consider Command R+ to count as the
| "best permissively licensed model" since CC-BY-NC is not
| usually considered "permissively licensed" - the "NC" part
| means "non-commercial use only"
| imjonse wrote:
| My bad, I wrongly remembered that it was Apache too.
| tinyhouse wrote:
| Pricing?
|
| Found it: https://mistral.ai/technology/#pricing
|
| It'd be useful to add a link to the blog post. While it's an
| open model, most will only be able to use it via the API.
| MacsHeadroom wrote:
| It's open source, you can just download and run it for free on
| your own hardware.
| tinyhouse wrote:
| Well, I don't have hardware to run a 141B parameters model,
| even if only 39B are active during inference.
| navbaker wrote:
| It will be quantized in a matter of days and runnable on
| most laptops.
| azinman2 wrote:
| 8 bit is 149G. 4 bit is 80G.
|
| I wouldn't call this runnable on most laptops.
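| Rough back-of-the-envelope math (weights only, ignoring the KV
| cache and the fact that quantized GGUF files keep some tensors
| at higher precision, so real files come out a bit larger):
|
|   # approximate weight size of a 141B-parameter model
|   params = 141e9
|   for bits in (16, 8, 4):
|       print(f"{bits}-bit: ~{params * bits / 8 / 1e9:.0f} GB")
|   # 16-bit: ~282 GB, 8-bit: ~141 GB, 4-bit: ~71 GB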
| astrodust wrote:
| "Who among us doesn't have 8 H100 cards?"
| MacsHeadroom wrote:
| Four V100s will do. They're about $1k each on ebay.
| astrodust wrote:
| $1500 each, plus the server they go in, plus plus plus
| plus.
| MacsHeadroom wrote:
| Sure, but it's still a lot less than 8 h100s.
|
| ~$8k for an LLM server with 128GB of VRAM vs like $250k+
| for 8 H100s.
| theolivenbaum wrote:
| That looks expensive compared to what groq was offering:
| https://wow.groq.com/
| naiv wrote:
| I also assume groq is 10-15x faster
| pants2 wrote:
| Can't wait for 8x22B to make it to Groq! Having an LLM at
| near GPT-4 performance with Groq speed would be incredible,
| especially for real-time voice chat.
| apetresc wrote:
| I just find it hilarious how approximately 100% of models beat
| all other models on benchmarks.
| squirrel23 wrote:
| What do you mean?
| apetresc wrote:
| Virtually every announcement of a new model release has some
| sort of table or graph matching it up against a bunch of
| other models on various benchmarks, and they're always
| selected in such a way that the newly-released model
| dominates along several axes.
|
| It turns interpreting the results into an exercise in
| detecting which models and benchmarks were omitted.
| CharlieDigital wrote:
| It would make sense, wouldn't it? Just as we've seen rising
| fuel efficiency, safety, dependability, etc. over the
| lifecycle of a particular car model.
|
| The different teams are learning from each other and
| pushing boundaries; there's virtually no reason for any of
| the teams to release a model or product that is somehow
| inferior to a prior one (unless it had some secondary
| attribute such as requiring lower end hardware).
|
| We're simply not seeing the ones that came up short; we
| don't even see the ones where it fell short of current
| benchmarks because they're not worth releasing to the
| public.
| apetresc wrote:
| That's a valid theory, a priori, but if you actually
| follow up you'll find that the vast majority of these
| benchmark results don't end up matching anyone's
| subjective experience with the models. The churn at the
| top is not nearly as fast as the press releases make it
| out to be.
| tensor wrote:
| Subjective experience is not a benchmark that you can
| measure success against. Also, of course new models are
| better on some set of benchmarks. Why would someone
| bother releasing a "new" model that is inferior to old
| ones? (Aside from attributes like more preferable
| licensing).
|
| This is completely normal, the opposite would be strange.
| andai wrote:
| Sibling comment made a good point about benchmarks not
| being a great indicator of real-world quality. Every time
| something scores near GPT-4 on benchmarks, I try it out
| and it ends up being less reliable than GPT-3 within a
| few minutes of usage.
| CharlieDigital wrote:
| That's totally fine, but benchmarks are like standardized
| tests like the SAT. They measure _something_ and it
| totally makes sense that each release bests the prior in
| the context of these benchmarks.
|
| It may even be the case that in measuring against the
| benchmarks, these product teams sacrifice some real world
| performance (just as a student that only studies for the
| SAT might sacrifice some real world skills).
| htrp wrote:
| gotta cherry pick your benchmarks as much as possible
| paxys wrote:
| Benchmarks published by the company itself should be treated no
| differently than advertising. For actual signal check out more
| independent leaderboards and benchmarks (like HuggingFace,
| Chatbot Arena, MMLU, AlpacaEval). Of course, even then it is
| impossible to come up with an objective ranking since there is
| no consensus on what to even measure.
| empath-nirvana wrote:
| Just because of the pace of innovation and scaling, right now,
| it seems pretty natural that any new model is going to be
| better than the previous comparable models.
| michaelt wrote:
| Benchmarks are often weird because of what a benchmark
| inherently needs to be.
|
| If you compare LLMs by asking them to tell you how to catch
| dragonflies - the free text chat answer you get will be
| impossible to objectively evaluate.
|
| Whereas if you propose four ways to catch dragonflies and ask
| each model to choose option A, B, C or D (or check the relative
| probability the model assigns to those four answer tokens), the
| result is easy to objectively evaluate - you just check if it
| chose the one right answer.
|
| Hence a lot of the most famous benchmarks are multiple-choice
| questions - even though 99.9% of LLM usage doesn't involve
| answering multiple-choice questions.
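| A minimal sketch of that logit-comparison approach with the
| Hugging Face transformers library (the model name is just a
| placeholder, and real harnesses add few-shot examples and
| handle tokenization details more carefully):
|
|   import torch
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   model_id = "mistralai/Mistral-7B-v0.1"  # placeholder model
|   tok = AutoTokenizer.from_pretrained(model_id)
|   model = AutoModelForCausalLM.from_pretrained(model_id)
|
|   prompt = ("Question: What is the best way to catch "
|             "dragonflies?\nA) ...\nB) ...\nC) ...\nD) ...\n"
|             "Answer:")
|   ids = tok(prompt, return_tensors="pt").input_ids
|   with torch.no_grad():
|       logits = model(ids).logits[0, -1]  # next-token logits
|
|   # compare the scores the model gives each option letter
|   scores = {}
|   for c in "ABCD":
|       tid = tok.encode(" " + c, add_special_tokens=False)[0]
|       scores[c] = logits[tid].item()
|   print(max(scores, key=scores.get))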
| arnaudsm wrote:
| Curious to see how it performs against GPT-4.
|
| Mixtral 8x22B beats Command R+, which is at GPT-4 level on the
| LMSYS leaderboard.
| zone411 wrote:
| LMSYS leaderboard is just one benchmark (that I think is
| fundamentally flawed). GPT-4 is clearly better.
| arnaudsm wrote:
| Which alternative benchmarks do you recommend?
| brokensegue wrote:
| Isn't equating active parameters with cost a little unfair since
| you still need full memory for all the inactive parameters?
| tartrate wrote:
| Well, since it affects inference speed it means you can handle
| more in less time, needing less concurrency.
| sa-code wrote:
| Fewer parameters at inference time makes a massive difference
| in cost for batch jobs, assuming vram usage is the same
| mdrzn wrote:
| "64K tokens context window" I do wish they had managed to extend
| it to at least 128K to match the capabilities of GPT-4 Turbo
|
| Maybe this limit will become a joke when looking back? Can you
| imagine reaching a trillion tokens context window in the future,
| as Sam speculated on Lex's podcast?
| htrp wrote:
| maybe we'll look back at token context windows like we look
| back at how much ram we have in a system.
| frabjoused wrote:
| I agree with this in the sense that once you have enough, you
| stop caring about the metric.
| paradite wrote:
| And how much RAM do you need to run Mixtral 8x22B? Probably
| more than a personal laptop has.
| Lacerda69 wrote:
| I run it fine on my 64gb RAM beast.
| coder543 wrote:
| At what quantization? 4-bit is 80GB. Less than 4-bit is
| rarely good enough at this point.
| apexalpha wrote:
| Is that normal RAM or GPU RAM?
| samus wrote:
| 64GB is not GPU RAM, but system RAM. Consumer GPUs have
| 24GB at most, and those with a good value/price ratio have way
| less. Current-generation workstation GPUs are unaffordable;
| used ones can be found on eBay for a reasonable price, but
| they are quite slow. DDR5 RAM might be a better investment.
| user_7832 wrote:
| Generally about ~1GB of RAM per billion parameters. I've run
| a 30B model (Vicuna) on my 32GB laptop (but it was slow).
| bamboozled wrote:
| I still don't have enough RAM though ?
| samus wrote:
| RAM is simply too useful.
| htrp wrote:
| While you need a lot more HBM (or UMA if you're on a Mac) to
| run these LLMs, my overarching point is that at this point most
| systems don't have RAM constraints for most of the software you
| need to run. As a result, RAM becomes less of a selling point
| except in very specialized cases like graphic design or 3D
| rendering work.
|
| If we have cheap billion-token context windows, 99% of your
| use cases aren't going to hit anywhere close to that limit
| and, as a result, your models will "just run".
| creshal wrote:
| Wasn't there a paper yesterday that turned context evaluation
| linear (instead of quadratic) and made effectively unlimited
| context windows possible? Between that and 1.58-bit quantization
| I feel like we're overdue for an LLM revolution.
| samus wrote:
| So far, people have come up with many alternatives for
| quadratic attention. Only recently have they proven their
| potential.
| underlines wrote:
| Tons and tons of papers; most of them have some disadvantage.
| You can't have your cake and eat it too:
|
| https://arxiv.org/html/2404.08801v1 Meta Megalodon
|
| https://arxiv.org/html/2404.07143v1 Google Infini-Attention
|
| https://arxiv.org/html/2402.13753v1 LongRoPE
|
| and a ton more
| pseudosavant wrote:
| FWIW, the 128k context window for GPT-4 is only for input. I
| believe the output content is still only 4k.
| moffkalast wrote:
| How does that make any sense on a decoder-only architecture?
| grey8 wrote:
| It's not about the model. The model can output more - it's
| about the API.
|
| A better phrasing would be that they don't allow you to
| output more than 4k tokens per message.
|
| Same with Anthropic and Claude, sadly.
| afro88 wrote:
| How useful is such a large input window when most of the middle
| isn't really used? I'm thinking mostly about coding. But when
| putting even, say, 20k tokens into the input, a good chunk
| doesn't seem to be "remembered" or used for the output.
| jakderrida wrote:
| While you're 100% correct, they are working on ways to make
| the middle useful, such as "Needle in a Haystack" testing.
| When we say we wish for context length that large, I think
| it's implied we mean functionally. But you do make a really
| great point.
| doublextremevil wrote:
| How much VRAM is needed to run this?
| MacsHeadroom wrote:
| 80GB in 4bit.
|
| But because it only activates two of its eight experts per
| token, it can run on a fast CPU in reasonable time. So 96GB of
| DDR4 will do; 96GB of DDR5 is better.
| Me1000 wrote:
| WizardLM-2 8x22b (which was a fine tune of the Mixtral 8x22b
| base model) at 4bit was only 80GB.
| noman-land wrote:
| I'm really excited about this model. Just need someone to
| quantize it to ~3 bits so it'll run on a 64GB MacBook Pro. I've
| gotten a lot of use from the 8x7b model. Paired with llamafile
| and it's just so good.
| 2Gkashmiri wrote:
| Can you explain your use case? I tried to get into offline
| LLMs on my machine and even Android, but without discrete
| graphics it's a slow hog, so I didn't enjoy it. But suppose I
| buy one, what then?
| andai wrote:
| I run Mistral-7B on an old laptop. It's not very fast and
| it's not very good, but it's just good enough to be useful.
|
| My use case is that I'm more productive working with an LLM,
| but being online is a constant temptation and distraction.
|
| Most of the time I'll reach for offline docs to verify. So
| the LLM just points me in the right direction.
|
| I also miss Google offline, so I'm working on a search
| engine. I thought I could skip crawling by just downloading
| Common Crawl, but unfortunately it's enormous and mostly junk
| or unsuitable for my needs. So my next project is figuring out
| how to data-mine Common Crawl to extract just the interesting
| (to me) bits...
|
| When I have a search engine and an LLM I'll be able to run my
| own Phind, which will be really cool.
| luke-stanley wrote:
| Presumably you could run things like PageRank, I'm sure
| people do this sort of thing with CommonCrawl. There are
| lots of variants of graph connectivity scoring methods and
| classifiers. What a time to be alive eh?
| noman-land wrote:
| Yes, I have a side project that uses local whisper.cpp to
| transcribe a podcast I love and shows a nice UI to search and
| filter the contents. I use Mixtral 8x7b in chat interface via
| llamafile primarily to help me write python and sqlite code
| and as a general Q&A agent. I ask it all sorts of technical
| questions, learn about common tools, libraries, and idioms in
| an ecosystem I'm not familiar with, and then I can go to
| official documentation and dig in.
|
| It has been a huge force multiplier for me and most
| importantly of all, it removes the dread of not knowing where
| to start and the dread of sending your inner monologue to
| someone's stupid cloud.
|
| If you're curious: https://github.com/noman-
| land/transcript.fish/ though this doesn't include any Mixtral
| stuff because I don't use it programmatically (yet). I soon
| hope to use it to answer questions about the episodes like
| who the special guest is and whatnot, which is something I do
| manually right now.
| popf1 wrote:
| > Can you explain your use case?
|
| pretty sure you can run it un-censored... that would be my
| use case
| mathverse wrote:
| Shopping for a new MBP. Do you think going with more RAM would
| be wise?
| noman-land wrote:
| Unfortunately, yes. Get as much as you can stomach paying
| for.
| clementmas wrote:
| I'm considering switching my function calling requests from
| OpenAI's API to Mistral. Are they using similar formats? What's
| the easiest way to use Mistral? Is it by using Huggingface?
| ru552 wrote:
| easiest is probably with ollama [0]. I think the ollama API is
| OpenAI compatible.
|
| [0]https://ollama.com/
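| If it is, the official openai Python client can simply be
| pointed at the local server. A rough sketch (the /v1 path,
| port, and model tag are assumptions - check the Ollama docs):
|
|   from openai import OpenAI
|
|   # assumed Ollama defaults
|   client = OpenAI(base_url="http://localhost:11434/v1",
|                   api_key="ollama")  # any non-empty string
|   resp = client.chat.completions.create(
|       model="mixtral:8x22b",
|       messages=[{"role": "user",
|                  "content": "Explain MoE in one sentence."}],
|   )
|   print(resp.choices[0].message.content)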
| talldayo wrote:
| Most inference servers are OpenAI-compatible. Even the
| "official" llama-cpp server should work fine: https://github.
| com/ggerganov/llama.cpp/blob/master/examples/...
| pants2 wrote:
| Ollama runs locally. What's the best option for calling the
| new Mixtral model on someone else's server programmatically?
| Arcuru wrote:
| Openrouter lists several options:
| https://openrouter.ai/models/mistralai/mixtral-8x22b
| jjice wrote:
| Does anyone have a good layman's explanation of the "Mixture-of-
| Experts" concept? I think I understand the idea of having "sub-
| experts", but how do you decide what each specialization is
| during training? Or is that not how it works at all?
| Keyframe wrote:
| maybe there's one that is maitre d'llm?
| londons_explore wrote:
| Nobody decides. The network itself determines which expert(s)
| to activate based on the context. It uses a small neural
| network for the task.
|
| It typically won't behave like human experts - you might find
| one of the networks is an expert in determining where to place
| capital letters or full stops for example.
|
| MoEs do not really improve accuracy - instead they reduce the
| amount of compute required. And, assuming you have a fixed
| compute budget, that in turn might mean you can make the model
| bigger to get better accuracy.
| HeatrayEnjoyer wrote:
| Correct, the experts are determined by the algorithm, not by
| anything humans would understand.
| hlfshell wrote:
| This is a bit of a misnomer. Each expert is a sub-network that
| specializes in some sub-domain of understanding that we can't
| really characterize.
|
| During training a routing network is punished if it does not
| evenly distribute training tokens to the correct experts. This
| prevents any one or two networks from becoming the primary
| networks.
|
| The result of this is that each token has essentially even
| probability of being routed to one of the sub models, with the
| underlying logic of why that model is an expert for that token
| being beyond our understanding or description.
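| One common way that "punishment" is implemented (whether
| Mixtral trains with exactly this isn't stated here) is a
| Switch-Transformer-style auxiliary load-balancing loss, which
| grows when a few experts receive a disproportionate share of
| tokens. A rough sketch:
|
|   import torch
|   import torch.nn.functional as F
|
|   def load_balancing_loss(router_logits, top_k_idx, n_experts):
|       # router_logits: (tokens, n_experts); top_k_idx: (tokens, k)
|       probs = F.softmax(router_logits, dim=-1)
|       # fraction of routing slots assigned to each expert
|       dispatch = F.one_hot(top_k_idx, n_experts).float()
|       frac_tokens = dispatch.sum(dim=(0, 1)) / top_k_idx.numel()
|       # average router probability given to each expert
|       frac_probs = probs.mean(dim=0)
|       # minimized when both are uniform across experts
|       return n_experts * (frac_tokens * frac_probs).sum()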
| api wrote:
| A decent _loose_ analogy might be database sharding.
|
| Basically you're sharding the neural network by "something"
| that is itself tuned during the learning process.
| fire_lake wrote:
| Why do we expect this to perform better? Couldn't a regular
| network converge on this structure anyways?
| imjonse wrote:
| It is a type of ensemble model. A regular network could do
| it, but a MoE will select a subset to do the task faster
| than the whole model would.
| rgbrgb wrote:
| Here's my naive intuition: in general bigger models can
| store more knowledge but take longer to do inference. MoE
| provides a way to blend the advantages of having a bigger
| model (more storage) with the advantages of having smaller
| models at inference time (faster, less memory required).
| When you do inference, tokens hit a small layer that is
| load balancing the experts then activate 1 or 2 experts. So
| you're storing roughly 8 x 22B "worth" of knowledge without
| having to run a model that big.
|
| Maybe a real expert can confirm if this is correct :)
| cjbprime wrote:
| Not quite, you don't save memory, only compute.
| nialv7 wrote:
| Sounds like the "you only use 10% of your brain" myth,
| but actually real this time.
| samus wrote:
| Almost :) the model chooses experts in every block. For a
| typical 7B with 8 experts there will be 8^32=2^96 paths
| through the whole model.
| og_kalu wrote:
| It doesn't perform better and until recently, MoE models
| actually underperformed their dense counterparts. The real
| gain is sparsity. You have this huge x parameter model that
| is performing like an x parameter model but you don't have
| to use all those parameters at once every time so you save
| a lot on compute, both in training and inference.
| andai wrote:
| I heard MoE reduces inference costs. Is that true? Don't all
| the sub networks need to be kept in RAM the whole time? Or is
| the idea that it only needs to run compute on a small part of
| the total network, so it runs faster? (So you complete more
| requests per minute on same hardware.)
|
| Edit: Apparently each part of the network is on a separate
| device. Fascinating! That would also explain why the routing
| network is trained to choose equally between experts.
|
| I imagine that may reduce quality somewhat though? By forcing
| it to distribute problems equally across all of them, whereas
| in reality you'd expect task type to conform to the pareto
| distribution.
| Filligree wrote:
| The latter. Yes, it all needs to stay in memory.
| MPSimmons wrote:
| >I heard MoE reduces inference costs
|
| Computational costs, yes. You still take the same amount of
| time for processing the prompt, but each token created
| through inference costs less computationally than if you
| were running it through _all_ layers.
| samus wrote:
| It should increase quality since those layers can
| specialize on subsets of the training data. This means that
| getting better in one domain won't make the model worse in
| all the others anymore.
|
| We can't really tell what the router does. There have been
| experiments where the router in the early blocks was
| compromised, and quality only suffered moderately. In later
| layers, as the embeddings pick up more semantic
| information, it matters more and might approach our naive
| understanding of the term "expert".
| wenc wrote:
| Would it be analogous to say instead of having a single Von
| Neumann who is a polymath, we're posing the question to a
| pool of people who are good at their own thing, and one of
| them gets picked to answer?
| Filligree wrote:
| Not really. The "expert" term is a misnomer; it would be
| better put as "brain region".
|
| Human brains seem to do something similar, inasmuch as
| blood flow (and hence energy use) per region varies
| depending on the current problem.
| andai wrote:
| Any idea why everyone seems to be using 8 experts? (Or was
| GPT-4 using 16?) Did we just try different numbers and found
| 8 was the optimum?
| wongarsu wrote:
| Probably because 8 GPUs is a common setup, and with 8
| experts you can put each expert on a different GPU
| andai wrote:
| Has anyone tried MoE at smaller scales? e.g. a 7B model
| that's made of a bunch of smaller ones? I guess that would be
| 8x1B.
|
| Or would that make each expert too small to be useful?
| TinyLlama is 1B and it's _almost_ useful! I guess 8x1B would
| be Mixture of TinyLLaMAs...
| auspiv wrote:
| The previous mixtral is 8x7B
| jasonjmcghee wrote:
| Yes there are many fine tunes on huggingface. Search "8x1B
| huggingface"
| samus wrote:
| There is Qwen1.5-MoE-A2.7B, which was made by upcycling the
| weights of Qwen1.5-1.8B, _splitting_ it and finetuning it.
| jsemrau wrote:
| There is some good documentation around mergekit available that
| actually explains a lot and might be a good place to start.
| zozbot234 wrote:
| It's really a kind of enforced sparsity, in that it requires
| that only a limited amount of blocks be active at a time during
| inference. What blocks will be active for each token is decided
| by the network itself as part of training.
|
| (Notably, MoE should not be conflated with ensemble techniques,
| which is where you would train entire separate networks, then
| use heuristic techniques to run inference across all of them
| simultaneously and combine the results.)
| huevosabio wrote:
| Ignore the "experts" part, it misleads a lot of people [0].
| There is no explicit specialization in the most popular setups,
| it is achieved implicitly through training. In short: MoEs add
| multiple MLP sublayers and a routing mechanism after each
| attention sublayer and let the training procedure learn the MLP
| parameters and the routing parameters.
|
| In a longer, but still rough, form...
|
| How these transformers work is roughly:
|
| ``` x_{l+1} = mlp_l(attention_l(x_l)) ```
|
| where `x_l` is the hidden representation at layer l,
| `attention_l` is the attention sublayer at layer l, and `mlp_l`
| is the multilayer perceptron at sublayer l.
|
| This MLP layer is very expensive because it is fully connected
| (i.e. every input has a weight to every output). So! MoEs
| instead of creating an even bigger, more expensive MLP to get
| more capability, they create K MLP sublayers (the "experts")
| and a router that decides which MLP sublayers to use. This
| router spits out an importance score for each MLP "expert" and
| then you choose the top T MLPs and take a weighted average
| based on importance, so roughly:
|
| ``` x_{l+1} = \sum_e mlp_{l,e}(attention_l(x_l)) *
| importance_score_{l, e} ```
|
| where `importance_score_{l, e}` is the score computed by the
| router at layer l for "expert" e. That is,
| `importance_score_{l} = router_l(attention_l(x_l))`, typically
| followed by a softmax. Note that here we are adding all
| experts, but in reality we choose the top T, often 2, and use
| only those.
|
| [0] some architectures do, in fact, combine domain experts to
| make a greater whole, but not the currently popular flavor
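| In (very simplified) PyTorch, one such block looks roughly like
| the sketch below - not Mistral's actual code, just the routing
| idea, with made-up dimensions and a naive per-token loop:
|
|   import torch
|   import torch.nn.functional as F
|   from torch import nn
|
|   class MoELayer(nn.Module):
|       def __init__(self, dim=512, n_experts=8, top_k=2):
|           super().__init__()
|           self.experts = nn.ModuleList(
|               [nn.Sequential(nn.Linear(dim, 4 * dim),
|                              nn.SiLU(),
|                              nn.Linear(4 * dim, dim))
|                for _ in range(n_experts)])
|           self.router = nn.Linear(dim, n_experts)
|           self.top_k = top_k
|
|       def forward(self, x):  # x: (n_tokens, dim)
|           scores = self.router(x)  # (n_tokens, n_experts)
|           top_scores, top_idx = scores.topk(self.top_k, dim=-1)
|           weights = F.softmax(top_scores, dim=-1)
|           out = torch.zeros_like(x)
|           for t in range(x.shape[0]):  # naive per-token loop
|               for w, e in zip(weights[t], top_idx[t]):
|                   out[t] += w * self.experts[int(e)](x[t])
|           return out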
| Quarrel wrote:
| So it is somewhat like a classic random forest or maybe
| bagging, where you're trying to stop overfitting, but you're
| also trying to train that top layer to know who could be the
| "experts" given the current inputs so that you're minimising
| the number of multiple MLP sublayers called during inference?
| huevosabio wrote:
| Yea, it's very much bagging + top layer (router) for the
| importance score!
| DougBTX wrote:
| Would this be a reasonable explanation?
|
| > MLPs are universal function approximators, but these models
| are big enough that it is better to train many small
| functions rather than a single unified function. MoE is a
| mechanism to force different parts of the model to learn
| distinct functions.
| samus wrote:
| It misses the crucial detail that every transformer layer
| chooses the experts independently from the others. Of
| course they still indirectly influence each other since
| each layer processes the output of the previous one.
| woadwarrior01 wrote:
| Not quite a layman's explanation, but if you're familiar with
| the implementation(s) of vanilla decoder only transformers,
| mixture-of-experts is just a small extension.
|
| During inference, instead of a single MLP in each transformer
| layer, MoEs have `n` MLPs and a single layer "gate" in each
| transformer layer. In the forward pass, softmax of the gate's
| output is used to pick the top `k` (where k is < n) MLPs to
| use. The relevant code snippet in the HF transformers
| implementation is very readable IMO, and only about 40 lines.
|
| https://github.com/huggingface/transformers/blob/main/src/tr...
| adtac wrote:
| As always, code is the best documentation:
| https://github.com/ggerganov/llama.cpp/blob/8dd1ec8b3ffbfa2d...
| vineyardmike wrote:
| It's not "experts" in the typical sense of the word. There is
| no discrete training to learn a particular skill in one expert.
| It's more closely modeled as a bunch of smaller models grafted
| together.
|
| These models are actually a collection of weights for different
| parts of the system. It's not "one" neural network.
| Transformers are composed of layers of transformations to the
| input, and each step can have its own set of weights. There was
| a recent video on the front page that had a good introduction
| to this. There is the MLP, there are the attention heads, etc.
|
| With that in mind, a MoE model is basically where one of those
| layers has X different versions of the weights, and then an
| added layer (another neural network with its own weights) that
| picks the version of "expert" weights to use.
| jerpint wrote:
| The simplest way to think about it is a form of dropout but
| instead of dropping weights, you drop an entire path of the
| network
| jonnycomputer wrote:
| These LLMs are making RAM great again.
|
| Wish I had invested in the extra 32GB for my mac laptop.
| Workaccount2 wrote:
| You can't upgrade it?
|
| Edit: I haven't owned a laptop for years, probably could have
| surmised they'd be more user hostile nowadays.
| paxys wrote:
| > mac laptop
| kristopolous wrote:
| Everything is soldered in these days.
|
| It's complete garbage. And most of the other vendors just
| copy Apple so even things like Lenovo have the same problems.
|
| The current state of laptops is such trash
| sva_ wrote:
| Plenty of laptops still have SO-DIMM, such as EliteBook for
| example.
|
| People need to vote with their wallet, and not buy stuff
| that goes against their principles.
| popf1 wrote:
| There are so many variables though ... most of the time
| you have to compromise on a few things.
| GeekyBear wrote:
| With SO-DIMM you gain expandability at the cost of higher
| power draw and latency as well as lower throughput.
|
| > SO-DIMM memory is inherently slower than soldered
| memory. Moreover, considering the fact that SO-DIMM has a
| maximum speed of 6,400MHz means that it won't be able to
| handle the DDR6 standard, which is already in the works.
|
| https://fossbytes.com/camm2-ram-standard/
| kristopolous wrote:
| There needs to be more fidelity than "vote with wallet".
| Let's say I decided to not purchase your product. Why?
|
| The question remains unanswered. Perhaps I didn't see it
| for sale or Bob in accounting just got one and I didn't
| want to look like I was copying Bob.
|
| Even at scale this doesn't work. Let's say Lenovo
| switches to making all of their laptops hot pink with
| bedazzled rhinestone butterflies and sales plummet. You
| could argue it was the wrong pink or that the butterflies
| didn't shimmer enough ... any hypothesis you wish.
|
| The market provides an extremely low-information, poor
| signal that really doesn't suggest any course of action.
|
| If we really want something better, there needs to be
| more fruitful and meaningful communication lines. I've
| come up with various ideas over the years but haven't
| really implemented them.
| qeternity wrote:
| You misunderstand the signal. The signal is "you're doing
| something wrong". Companies have tremendous incentive to
| figure out what that is. They do huge amounts of market
| research and customer feedback.
| woadwarrior01 wrote:
| These days with Apple Silicon, RAM is a part of the SoC.
| It's not even soldered on, it's a part of the chip.
| Although TBF, they also offer insane memory bandwidths.
| GeekyBear wrote:
| > most of the other vendors just copy Apple
|
| Weird conspiracy theories aside, the low power variant of
| RAM (LPDDR) has to be soldered onto the motherboard, so
| laptops designed for longer battery life have been using it
| for years now.
|
| The good news is that a newer variant of low power RAM has
| just been standardized that features low power RAM in
| memory modules, although they attach with screws and not
| clips.
|
| https://fossbytes.com/camm2-ram-standard/
| jonnycomputer wrote:
| I really, really like my MacBook Pro. But dammit, you can't
| upgrade the thing (Mac laptops aren't upgradeable anymore).
| I got an M1 Max in 2021 with 32GB of RAM. I did not anticipate
| needing more than 32GB for anything I'd be doing on it. Turns
| out, a couple of years later, I like to run local LLMs that
| max out my available memory.
| jonnycomputer wrote:
| I say 2021, but truth is the supply chain was so trash that
| year that it took almost a year to actually get delivered.
| I don't think I actually started using the thing until
| 2022.
| jonnycomputer wrote:
| I got downvoted for saying a true fact? That I ordered
| the new M1 Max in 2021 and it took almost a year for me
| to actually get it? It's true.
| paxys wrote:
| You are getting downvoted because you vaguely suggested
| something negative about an Apple product, as is my comment
| below
| jonnycomputer wrote:
| People are absurd with their downvotes. I got downvoted for
| saying it took almost a year for my MacBook to arrive once
| I ordered it. It's true. But it's also true that supply
| chains were a wreck at the time. Apple wasn't the only tech
| gadget that took forever to arrive.
| hubraumhugo wrote:
| It feels absolutely amazing to build an AI startup right now.
| It's as if your product automatically becomes cheaper, more
| reliable, and more scalable with each new major model release.
|
| - We first struggled with limited context windows [solved]
|
| - We had issues with consistent JSON output [solved]
|
| - We had rate limiting and performance issues for the large 3rd
| party models [solved]
|
| - Hosting our own OSS models for small and medium complex tasks
| was a pain [solved]
|
| Obviously every startup still needs to build up defensibility
| and focus on differentiating with everything "non-AI".
| paxys wrote:
| We are going to quickly reach the point where most of these AI
| startups (which do nothing but provide thin wrappers on top of
| public LLMs) aren't going to be needed at all. The
| differentiation will need to come from the value of the end
| product put in front of customers, not the AI backend.
| layble wrote:
| Sure, in the same way SaaS companies are just thin wrappers
| on top of databases and the open web.
| imjonse wrote:
| You will find that a disproportionately large amount of
| work and innovation in an AI product is in the backing
| model (GPT, Mixtral, etc.). While there's a huge amount of
| work in databases and the open web, SaaS products typically
| add a lot more than a thin API layer and a shiny website
| (well some do but you know what I mean)
| tomrod wrote:
| I'd argue the comment before you is describing
| accessibility, features, and services -- yes, the core
| component has a wrapper, but that wrapper differentiates
| the use.
| wongarsu wrote:
| The same happened to image recognition. We have great
| algorithms for many years now. You can't make a company out
| of having the best image recognition algorithm, but you
| absolutely can make a company out of a device that spots
| defects in the paintjob in a car factory, or that spots
| concrete cracks in the tunnel segments used by a tunnel
| boring machine, or by building a wildlife camera that counts
| wildlife and exports that to a central website. All of them
| just fine-tune existing algorithms, but the value delivered
| is vastly different.
|
| Or you can continue selling shovels. Still lots of expensive
| labeling services out there, to stay in the image-recognition
| parallel
| pradn wrote:
| The key thing is AI models are services not products. The
| real world changes, so you have to change your model. Same
| goes for new training data (examples, yes/no labels,
| feedback from production use), updating biases (compliance,
| changing societal mores). And running models in a highly
| available way also requires expertise. Not every company wants
| to be in the MLOps business.
| HeatrayEnjoyer wrote:
| The dynamic does seem to be different with the newer
| systems. Larger more general systems are better than small
| specialized models.
|
| GPT-4 is SOTA at OCR and sentiment classification, for
| example.
| sleepingreset wrote:
| If you don't mind, I'm trying to experiment w/ local models
| more. Just now getting into messing w/ these but I'm struggling
| to come up w/ good use cases.
|
| Would you happen to know of any cool OSS model projects that
| might be good inspiration for a side project?
|
| Wondering what most people use these local models for
| sosuke wrote:
| No ideas about side projects or anything "productive" but for
| a concrete example look at SillyTavern. Making fictional
| characters. Finding narratives, stories, role-play for
| tabletop games. You can even have group chats of AI
| characters interacting. No good use cases for profit but
| plenty right now for exploration and fun.
| wing-_-nuts wrote:
| One idea that I've been mulling over: given how controllable
| Linux is from the command line, I think it would be somewhat
| easy to pipe voice-to-text into a local LLM that could
| control _pretty_ much everything on command.
|
| It would flat out embarrass Alexa. Imagine 'Hal, play a movie'
| or 'Hal, play some music', and it's all running locally, with
| _your_ content.
| mikegreenberg wrote:
| There are a few projects doing this. This one piqued my
| interest as having a potentially nice UX after some
| maturity. https://github.com/OpenInterpreter/01
| milansuk wrote:
| The progress is insane. A few days ago I started being very
| impressed with LLM coding skills. I wanted Golang code instead
| of the Python you see in many demos. The prompt was:
|
| Write a Golang func, which accepts the path into a .gpx file
| and outputs a JSON string with points (x=total distance in km,
| y=elevation). Don't use any library.
| jasonjmcghee wrote:
| How are you approaching hosting? vLLM?
| neillyons wrote:
| > We had issues with consistent JSON output [solved]
|
| It says the JSON output is constrained via their platform (on
| la Plateforme).
|
| Does that mean JSON output is only available in the hosted
| version? Are there any small models that can be self-hosted
| and output valid JSON?
| ajcp wrote:
| > Does that mean JSON output is only available in the
| [self]-hosted version?
|
| I would assume so. They probably constrain JSON output so
| that the JSON response doesn't bork the front-end/back-end of
| la Plateforme itself as it moves through their code back to
| you.
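| For self-hosting, grammar-constrained decoding is the usual
| answer: llama.cpp can restrict sampling so the output is always
| valid JSON, and any model it runs (small ones included) gets
| that for free. A rough sketch with llama-cpp-python - the GGUF
| file name is a placeholder and response_format is how I recall
| that API exposing it, so double-check the docs:
|
|   from llama_cpp import Llama
|
|   llm = Llama(model_path="mixtral-8x7b-instruct.Q4_K_M.gguf")
|   out = llm.create_chat_completion(
|       messages=[{"role": "user",
|                  "content": "Return a JSON object with keys "
|                             "'city' and 'country' for Paris."}],
|       response_format={"type": "json_object"},
|   )
|   print(out["choices"][0]["message"]["content"])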
| yodsanklai wrote:
| > It's as if your product automatically becomes cheaper, more
| reliable, and more scalable with each new major model release.
|
| and so do your competitor's products.
| samus wrote:
| Any business idea built almost exclusively on AI, without
| adding much value, is doomed from the start. AI is not good
| enough to make humans obsolete yet. But a well finetuned
| model can for sure augment what individual humans can do.
| Lacerda69 wrote:
| I have been using Mixtral daily since it was released, for all
| kinds of writing and coding tasks. Love it and massively
| invested in Mistral's mission.
|
| Keep on doing this great work.
|
| Edit: been using the previous version, seems like this one is
| even better?
| spenceryonce wrote:
| I can't even begin to describe how excited I am for the future of
| AI.
| iFire wrote:
| It wasn't clear to me: how much hardware does it take to run
| Mixtral 8x22B (mistral.ai) locally?
| ru552 wrote:
| a MacBook with 64GB of RAM
| noman-land wrote:
| At what quantization?
| ChicagoDave wrote:
| We need larger context windows, otherwise we're running the same
| path with marginally different results.
| luke-stanley wrote:
| I'm confused on the instruction fine-tuning part that is
| mentioned briefly, in passing. Is there an open weight instruct
| variant they've released? Or is that only on their platform?
| Edit: It's on HuggingFace, great, thanks replies!
| freedmand wrote:
| I just found this on HuggingFace:
| https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
| sva_ wrote:
| https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
| austinsuhr wrote:
| Is 8x22B gonna make it to Le Chat in the near future?
| ayolisup wrote:
| What's the best way to run this on my Macbook Pro?
|
| I've tried LMStudio, but I'm not a fan of the interface compared
| to OpenAI's. The lack of automatic regeneration every time I edit
| my input, like on ChatGPT, is quite frustrating. I also gave
| Ollama a shot, but using the CLI is less convenient.
|
| Ideally, I'd like something that allows me to edit my settings
| quite granularly, similar to what I can do in OpenLM, with the
| QoL from the hosted online platforms, particularly the ease of
| editing my prompts that I use extensively.
| duckkg5 wrote:
| Ollama with WebUI https://github.com/open-webui/open-webui
| shaunkoh wrote:
| Not sure why your comment was downvoted. ^ is absolutely the
| right answer.
|
| Open WebUI is functionally identical to the ChatGPT
| interface. You can even use it with the OpenAI APIs to have
| your own pay per use GPT 4. I did this.
| lubesGordi wrote:
| Hey can you guys elaborate how this works? I'm looking at
| the Ollama section in their docs and it talks about load
| balancing? I don't understand what that means in this
| context.
| chown wrote:
| You can try Msty as well. I am the author.
|
| https://msty.app
| mcbuilder wrote:
| openrouter.ai is a fantastic idea if you don't want to self
| host
| byteknight wrote:
| First test: I tried to run a random taxation question through it.
|
| Output:
| https://gist.github.com/IAmStoxe/7fb224225ff13b1902b6d172467...
|
| Within the first paragraph, it outputs:
|
| > GET AN ESSAY WRITTEN FOR YOU FROM AS LOW AS $13/PAGE
|
| Thought that was hilarious.
| orost wrote:
| That's not the model this post is about. You used the base
| model, which is not trained to follow instructions. (The
| instruct model is probably not on ollama yet.)
| byteknight wrote:
| I absolutely did not:
|
| ollama run mixtral:8x22b
|
| EDIT: I like how you ninja-edited your comment ;)
| orost wrote:
| Considering "mixtral:8x22b" on ollama was last updated
| yesterday, and Mixtral-8x22B-Instruct-v0.1 (the topic of
| this post) was released about 2 hours ago, they are not the
| same model.
| byteknight wrote:
| Are we looking at the same page?
|
| https://imgur.com/a/y6XfpBl
|
| And even the direct tag page:
| https://ollama.com/library/mixtral:8x22b shows
| 40-something minutes ago: https://imgur.com/a/WNhv70B
| orost wrote:
| Let me clarify.
|
| Mixtral-8x22B-v0.1 was released a couple days ago. The
| "mixtral:8x22b" tag on ollama currently refers to it, so
| it's what you got when you did "ollama run
| mixtral:8x22b". It's a base model only capable of text
| completion, not any other tasks, which is why you got a
| terrible result when you gave it instructions.
|
| Mixtral-8x22B-Instruct-v0.1 is an instruction-following
| model based on Mixtral-8x22B-v0.1. It was released two
| hours ago and it's what this post is about.
|
| (The last updated 44 minutes ago refers to the entire
| "mixtral" collection.)
| belter wrote:
| I get:
|
| ollama run mixtral:8x22b
|
| Error: exception create_tensor: tensor
| 'blk.0.ffn_gate.0.weight' not found
| Me1000 wrote:
| You need to update ollama to 0.1.32.
| belter wrote:
| Thanks. That did it.
| gliptic wrote:
| And where does it say that's the instruct model?
| mysteria wrote:
| Yeah this is exactly what happens when you ask a base model a
| question. It'll just attempt to continue what you already
| wrote based off its training set, so if you say have it
| continue a story you've written it may wrap up the story and
| then ask you to subscribe for part 2, followed by a bunch of
| social media comments with reviews.
| woadwarrior01 wrote:
| Looks like an issue with the quantization that ollama (i.e
| llama.cpp) uses and not the model itself. It's common knowledge
| from Mixtral 8x7B that quantizing the MoE gates is pernicious
| to model perplexity. And yet they continue to do it. :)
| cjbprime wrote:
| No, it's unrelated to quantization, they just weren't using
| the instruct model.
| jmorgan wrote:
| The `mixtral:8x22b` tag still points to the text completion
| model - instruct is on the way, sorry!
|
| Update: mixtral:8x22b now points to the instruct model:
| ollama pull mixtral:8x22b ollama run mixtral:8x22b
| renewiltord wrote:
| Not instruct tuned. You're (actually) "holding it wrong".
| kristianp wrote:
| So this one is 3x the size but only 7% better on MMLU? Given
| Moore's law is mostly dead, this trend is going to make for
| even more extremely expensive compute for next-gen AI models.
| GaggiX wrote:
| That's 25% fewer errors.
| stainablesteel wrote:
| Is this different from their "large" model?
| jhoechtl wrote:
| Did anyone have success getting danswer and ollama to work
| together?
| ado__dev wrote:
| We rolled out Mixtral 8x22b to our LLM Litmus Test at s0.dev for
| Cody AI. Don't have enough data to say it's better or worse than
| other LLMs yet, but if you want to try it out for coding
| purposes, let me know your experience.
| CharlesW wrote:
| Dumb question: Are "non-instructed" versions of LLMs just raw,
| no-guardrail versions of the "instructed" versions that most end-
| users see? And why does Mixtral need one, when OpenAI LLMs do
| not?
| hnuser123456 wrote:
| https://platform.openai.com/docs/models/gpt-base
|
| https://platform.openai.com/docs/guides/text-generation/comp...
| CharlesW wrote:
| I appreciate the correction, thanks!
| kingsleyopara wrote:
| LLMs are first trained to predict the next most likely word
| (or token, if you want to be accurate) from web crawls. These
| models are basically great at continuing unfinished text but
| can't really be used for instructions, e.g. Q&A or chatting -
| this is the "non-instructed" version. These models are then
| fine-tuned for instructions using additional data from human
| interaction - these are the "instructed" versions, which are
| what end users (e.g. of ChatGPT, Gemini, etc.) see.
| CharlesW wrote:
| Very helpful, thank you.
| elorant wrote:
| Seems that Perplexity Labs already offers a free demo of it.
|
| https://labs.perplexity.ai/
| batperson wrote:
| That's the old/regular model. This post is about the new
| "instruct" model.
| yodsanklai wrote:
| How does this compare to ChatGPT4?
| orra wrote:
| Is this release a pleasant surprise? Mistral weakened their
| commitment to open source when they partnered with Microsoft.
|
| It's nice they're using some of the money from their commercial
| and proprietary models, to improve the state of the art for open
| source (open weights) models.
| qeternity wrote:
| Mistral just released the most powerful open weight model in
| the history of humanity.
|
| How did they weaken their commitment to open weights?
| orra wrote:
| > Mistral just released the most powerful open weight model
| in the history of humanity.
|
| Well, yeah, it's very welcome, but 'history of humanity' is
| hyperbole given ChatGPT isn't even two years old.
|
| > How did they weaken their commitment to open weights?
|
| Before https://web.archive.org/web/20240225001133/https://mis
| tral.a... versus after https://web.archive.org/web/2024022702
| 5408/https://mistral.a... the Microsoft partnership
| announcement:
|
| > Committing to open models.
|
| to
|
| > That is why we started our journey by releasing the world's
| most capable open-weights models
|
| There were similar changes on their about the Company page.
| zone411 wrote:
| It ranks between Mistral Small and Mistral Medium on my NYT
| Connections benchmark and is indeed better than Command R Plus
| and Qwen 1.5 Chat 72B, which were the top two open weights
| models. Grok 1.0 is not an instruct model, so it cannot be
| compared fairly.
___________________________________________________________________
(page generated 2024-04-17 23:01 UTC)