[HN Gopher] Mistral "Mixtral" 8x7B 32k model [magnet]
___________________________________________________________________
Mistral "Mixtral" 8x7B 32k model [magnet]
Author : xyzzyrz
Score : 369 points
Date : 2023-12-08 16:03 UTC (6 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| politician wrote:
| Honest question: Why isn't this on Huggingface? Is this one a
| leaked model with a questionable training or alignment
| methodology?
|
| EDIT: I mean, I guess they didn't hack their own twitter account,
| but still.
| kcorbitt wrote:
| It'll be on Huggingface soon. This is how they dropped their
| original 7B model as well. It's a marketing thing, but it
| works!
| politician wrote:
| Ah, well, ok. I appreciate the torrent link -- much faster
| distribution.
| politician wrote:
| @kcorbitt Low priority, probably not worth an email: Does
| using OpenPipe.ai to fine-tune a model result in a
| downloadable LoRA adapter? It's not clear from the website if
| the fine-tune is exportable.
| tarruda wrote:
| Still 7B, but now with 32k context. Looking forward to seeing how
| it compares with the previous one, and what the community does
| with it.
| MacsHeadroom wrote:
| Not 7B, 8x7B.
|
| It will run with the speed of a 7B model while being much
| smarter but requiring ~24GB of RAM instead of ~4GB (in 4bit).
| dragonwriter wrote:
| Given the config parameters posted, it's 2 experts per token,
| so the computation cost per token should be the cost of the
| component that selects experts + 2x the cost of a 7B model.
| stavros wrote:
| Yes, but I also care about "can I load this onto my home
| GPU?" where, if I need all experts for this to run, the
| answer is "no".
| MacsHeadroom wrote:
| Ah good catch. Upon even closer examination, the attention
| layer (~2B params) is shared across experts. So in theory
| you would need 2B for the attention head + 5B for each of
| two experts in RAM.
|
| That's a total of 12B, meaning this should be able to be
| run on the same hardware as 13B models with some loading
| time between generations.
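|
| As a minimal arithmetic sketch of that estimate (the per-component
| sizes are the rough numbers from this thread, not official
| figures):
|
|     # per-token active parameters, in billions (thread estimates)
|     shared_attention_b = 2      # attention weights shared by all experts
|     per_expert_b = 5            # FFN weights unique to each expert
|     experts_per_token = 2       # num_experts_per_tok from the config
|     active_b = shared_attention_b + experts_per_token * per_expert_b
|     print(active_b)             # 12 -> roughly 13B-class compute per token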
| seydor wrote:
| unfortunately too big for the broader community to test. Will
| be very interesting to see how well it performs compared to the
| large models
| brucethemoose2 wrote:
| Not really, looks like a ~40B class model which is very
| runnable.
| MacsHeadroom wrote:
| It's actually ~13B class at runtime. 2B for attention is
| shared across each expert and then it runs 2 experts at a
| time.
|
| So 2B for attention + 5Bx2 for inference = 12B in RAM at
| runtime.
| brucethemoose2 wrote:
| I just mean in terms of VRAM usage.
| nulld3v wrote:
| Looks to be Mixture of Experts, here is the params.json:
| {
|   "dim": 4096,
|   "n_layers": 32,
|   "head_dim": 128,
|   "hidden_dim": 14336,
|   "n_heads": 32,
|   "n_kv_heads": 8,
|   "norm_eps": 1e-05,
|   "vocab_size": 32000,
|   "moe": {
|     "num_experts_per_tok": 2,
|     "num_experts": 8
|   }
| }
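|
| Those "moe" fields suggest top-2-of-8 routing at each FFN layer.
| A minimal PyTorch-style sketch of that idea, with shapes taken
| from the config above (the layer names and FFN internals here are
| illustrative guesses, not Mistral's actual code):
|
|     import torch
|     import torch.nn as nn
|     import torch.nn.functional as F
|
|     class Top2MoELayer(nn.Module):
|         """Sketch of an MoE FFN block: a gate picks 2 of 8 experts per token."""
|         def __init__(self, dim=4096, hidden_dim=14336, num_experts=8, top_k=2):
|             super().__init__()
|             self.top_k = top_k
|             # the small network that selects experts (the "router"/"gate")
|             self.gate = nn.Linear(dim, num_experts, bias=False)
|             self.experts = nn.ModuleList([
|                 nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(),
|                               nn.Linear(hidden_dim, dim))
|                 for _ in range(num_experts)])
|
|         def forward(self, x):                     # x: (n_tokens, dim)
|             scores = self.gate(x)                 # (n_tokens, num_experts)
|             weights, picked = scores.topk(self.top_k, dim=-1)
|             weights = F.softmax(weights, dim=-1)  # normalize over the 2 chosen experts
|             out = torch.zeros_like(x)
|             for slot in range(self.top_k):
|                 for e, expert in enumerate(self.experts):
|                     mask = picked[:, slot] == e   # tokens routed to expert e in this slot
|                     if mask.any():
|                         out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
|             return out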
| sp332 wrote:
| I don't see any code in there. What runtime could load these
| weights?
| brucethemoose2 wrote:
| It's presumably llama, just like Mistral.
|
| _Everything_ open source is llama now. Facebook all but
| standardized the architecture.
|
| I dunno about the moe. Is there existing transformers code
| for that part? It kinda looks like there is based on the
| config.
| jasonjmcghee wrote:
| Mistral is not llama architecture.
|
| https://github.com/mistralai/mistral-src
| brucethemoose2 wrote:
| It's _basically_ llama architecture, all but drop-in
| compatible with llama runtimes.
| refulgentis wrote:
| Because it's JSON? :)
| sockaddr wrote:
| What does expert mean in this context?
| moffkalast wrote:
| In a sense it's 8 7B models in a trench coat: it runs about
| as fast as a 14B (2 experts at a time, apparently) but takes
| up as much memory as a ~40B model (70% * 8 * 7B). There is
| some process trained into it that chooses which experts to
| use based on the question posed. GPT 4 is allegedly based on
| the same architecture, but at 8*222B.
| dragonwriter wrote:
| > GPT 4 is based on the same architecture, but at 8*222B.
|
| Do we actually know either that it is MoE or that size? IIRC
| both of those started as outsider guesses that somehow just
| became accepted knowledge without any actual confirmation.
| moffkalast wrote:
| Iirc some of the other things the same source stated were
| later confirmed, so this is likely to be true as well,
| but I might be misremembering.
| sockaddr wrote:
| Fascinating. Thanks
| tavavex wrote:
| Does anyone here know roughly how an expert gets chosen? It
| seems like a very open-ended problem, and I'm not sure on
| how it can be implemented easily.
| WeMoveOn wrote:
| How did you come up with 40b for the memory? Specifically,
| why 0.7 * total params?
| YetAnotherNick wrote:
| 86 GB. So it's likely a Mixture of experts model with 8 experts.
| Exciting.
| tarruda wrote:
| Damn, I was hoping it was still a single 7B model that I would
| be able to run on my GPU
| renonce wrote:
| You can, wait for a 4-bit quantized version
| tarruda wrote:
| I only have a RTX 3070 with 8GB VRam. It can run quantized
| 7B models well, but this is 8 x 7B. Maybe an RTX 3090 with
| 24GB VRAM can do it.
| brucethemoose2 wrote:
| It would be very tight. Fitting 8x7B into 24GB (currently)
| has more overhead than a 70B.
|
| It's theoretically doable, with quantization from the
| recent 2 bit quant paper and a custom implementation (in
| exllamav2?)
|
| EDIT: Actually the download is much smaller than 8x7B.
| Not sure how, but it's sized more like a 30B, perfect for
| a 3090. _Very_ interesting.
| burke wrote:
| Napkin math: 7x(4/8)x8 is 28GB, and q4 uses a little more
| than just 4 bits per param, and there's extra overhead
| for context, and the FFN to select experts is probably
| more on top of that.
|
| It would probably fit in 32GB at 4-bit but probably won't
| run with sensible quantization/perf on a 3090/4090
| without other tricks like offloading. Depending on how
| likely the same experts are to be chosen for multiple
| sequential tokens, offloading experts may be viable.
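|
| The same napkin math as a hedged snippet, using the rough
| component sizes quoted in this thread (a ~2B shared attention
| block plus 8 experts of ~5B each) and an assumed ~10% overhead
| for quantization scales and context:
|
|     def approx_gib(params_billion, bits_per_param, overhead=1.10):
|         """Rough weight footprint; the 10% overhead is an assumption."""
|         return params_billion * 1e9 * bits_per_param / 8 * overhead / 2**30
|
|     total_b = 2 + 8 * 5   # shared attention + 8 experts (thread estimates)
|     for bits in (16, 8, 4):
|         print(f"{bits}-bit: ~{approx_gib(total_b, bits):.0f} GiB")
|     # 16-bit: ~86 GiB, 8-bit: ~43 GiB, 4-bit: ~22 GiB (plus runtime overhead)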
| espadrine wrote:
| Once on llama.cpp, it will likely run on CPU with enough
| RAM, especially given that the GGUF mmap code only seems
| to use RAM for the parts of the weights that get used.
| kcorbitt wrote:
| No public statement from Mistral yet. What we know:
|
| - Mixture of Experts architecture.
|
| - 8x 7B-parameter experts (potentially trained starting with
| their base 7B model?).
|
| - 96GB of weights. You won't be able to run this on your home
| GPU.
| tarruda wrote:
| Theoretically it could fit into a single 24GB GPU if 4-bit
| quantized. Exllama v2 has an even more efficient quantization
| algorithm, and was able to fit 70B models in a 24GB GPU, but only
| with 2048 tokens of context.
| coder543 wrote:
| > 96GB of weights. You won't be able to run this on your home
| GPU.
|
| This seems like a non-sequitur. Doesn't MoE select an expert
| for each token? Presumably, the same expert would frequently be
| selected for a number of tokens in a row. At that point, you're
| only running a 7B model, which will easily fit on a GPU. It
| will be slower when "swapping" experts if you can't fit them
| all into VRAM at the same time, but it shouldn't be
| catastrophic for performance in the way that being unable to
| fit all layers of an LLM is. It's also easy to imagine caching
| the N most recent experts in VRAM, where N is the largest
| number that still fits into your VRAM.
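|
| A rough sketch of that "cache the N most recently used experts in
| VRAM" idea; the ExpertCache class and its hooks are hypothetical,
| not part of any existing runtime:
|
|     from collections import OrderedDict
|
|     class ExpertCache:
|         """Keep the N most recently used experts resident on the GPU."""
|         def __init__(self, cpu_experts, capacity):
|             self.cpu_experts = cpu_experts   # expert_id -> weights in system RAM
|             self.capacity = capacity         # how many experts fit in spare VRAM
|             self.gpu = OrderedDict()         # expert_id -> weights resident in VRAM
|
|         def get(self, expert_id):
|             if expert_id in self.gpu:
|                 self.gpu.move_to_end(expert_id)   # mark as most recently used
|                 return self.gpu[expert_id]
|             if len(self.gpu) >= self.capacity:
|                 self.gpu.popitem(last=False)      # evict least recently used expert
|             weights = self.cpu_experts[expert_id]
|             # in a real runtime the host->GPU copy happens here, e.g. weights.to("cuda")
|             self.gpu[expert_id] = weights
|             return weights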
| tarruda wrote:
| I will be super happy if this is true.
|
| Even if you can't fit all of them in the VRAM, you could load
| everything in tmpfs, which at least removes disk I/O penalty.
| cjbprime wrote:
| Just mentioning in case it helps anyone out: Linux already
| has a disk buffer cache. If you have available RAM, it will
| hold on to pages that have been read from disk until there
| is enough memory pressure to remove them (and then it will
| only remove some of them, not all of them). If you don't
| have available RAM, then the tmpfs wouldn't work. The tmpfs
| is helpful if you know better than the paging subsystem
| about how much you really want this data to always stay in
| RAM no matter what, but that is also much less flexible,
| because sometimes you need to burst in RAM usage.
| read_if_gay_ wrote:
| however, if you need to swap experts on each token, you might
| as well run on cpu.
| tarruda wrote:
| > Presumably, the same expert would frequently be selected
| for a number of tokens in a row
|
| In other words, assuming you ask a coding question and
| there's a coding expert in the mix, it would answer it
| completely.
| read_if_gay_ wrote:
| yes I read that. do you think it's reasonable to assume
| that the same expert will be selected so consistently
| that model swapping times won't dominate total runtime?
| tarruda wrote:
| No idea TBH, we'll have to wait and see. Some say it
| might be possible to efficiently swap the expert weights
| if you can fit everything in RAM:
| https://x.com/brandnarb/status/1733163321036075368?s=20
| ttul wrote:
| See my poorly educated answer above. I don't think that's
| how MoE actually works. A new mixture of experts is
| chosen for every new context.
| ttul wrote:
| Someone smarter will probably correct me, but I don't think
| that is how MoE works. With MoE, a feed-forward network
| assesses the tokens and selects the best two of eight experts
| to generate the next token. The choice of experts can change
| with each new token. For example, let's say you have two
| experts that are really good at answering physics questions.
| For some of the generation, those two will be selected. But
| later on, maybe the context suggests you need two models
| better suited to generate French language. This is a silly
| simplification of what I understand to be going on.
| ttul wrote:
| This being said, presumably if you're running a huge farm
| of GPUs, you could put each expert onto its own slice of
| GPUs and orchestrate data to flow between GPUs as needed. I
| have no idea how you'd do this...
| alchemist1e9 wrote:
| Ideally those many GPUs could be on different hosts
| connected with a commodity interconnect like 10gbe.
|
| If MOE models do well it could be great for commodity hw
| based distributed inference approaches.
| Philpax wrote:
| Yes, that's more or less it - there's no guarantee that the
| chosen expert will still be used for the next token, so
| you'll need to have all of them on hand at any given
| moment.
| wongarsu wrote:
| One viable strategy might be to offload as many experts as
| possible to the GPU, and evaluate the other ones on the
| CPU. If you collect some statistics which experts are used
| most in your use cases and select those for GPU
| acceleration you might get some cheap but notable speedups
| over other approaches.
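|
| A sketch of that statistics-gathering idea; the routing log and
| the VRAM budget below are made-up assumptions:
|
|     from collections import Counter
|
|     # Hypothetical log of the 2 experts the router picked per token
|     # for a sample workload; in practice this would come from the runtime.
|     routing_log = [(0, 3), (0, 5), (3, 5), (0, 3), (1, 3)]
|
|     usage = Counter()
|     for picked in routing_log:
|         usage.update(picked)
|
|     gpu_budget = 4   # how many experts fit in spare VRAM (assumption)
|     gpu_experts = [e for e, _ in usage.most_common(gpu_budget)]
|     cpu_experts = [e for e in range(8) if e not in gpu_experts]
|     print(gpu_experts, cpu_experts)   # hot experts -> GPU, the rest -> CPU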
| numeri wrote:
| You're not necessarily wrong, but I'd imagine this is almost
| prohibitively slow. Also, this model seems to use two experts
| per token.
| MacsHeadroom wrote:
| That is only 24GB in 4bit.
|
| People are running models 2-4 times that size on local GPUs.
|
| What's more, this will run on a MacBook CPU just fine-- and at
| an extremely high speed.
| brucethemoose2 wrote:
| Yeah, 70B is much larger and fits on a 24GB, admittedly with
| very lossy quantization.
|
| This is just about right for 24GB. I bet that is intentional
| on their part.
| shubb wrote:
| >> You won't be able to run this on your home GPU.
|
| Would this allow you to run each expert on a cheap commodity
| GPU card so that instead of using expensive 200GB cards we can
| use a computer with 8 cheap gaming cards in it?
| terafo wrote:
| Yes, but you wouldn't want to do that. You will be able to
| run that on a single 24gb GPU by the end of this weekend.
| brucethemoose2 wrote:
| Maybe two weekends.
| dragonwriter wrote:
| > Would this allow you to run each expert on a cheap
| commodity GPU card so that instead of using expensive 200GB
| cards we can use a computer with 8 cheap gaming cards in it?
|
| I would think no differently than you can run a large regular
| model on a multi-GPU setup (which people do!). It's still all
| one network even if not all of it is activated for each
| token, and since it's much smaller than a 56B model, it seems
| like there are significant components of the network that are
| shared.
| terafo wrote:
| Attention is shared. It's ~30% of params here. So ~2B
| params are shared between experts and ~5B params are unique
| to each expert.
| faldore wrote:
| at 4 bits you could run it on a 3090 right?
| jlokier wrote:
| > - 96GB of weights. You won't be able to run this on your home
| GPU.
|
| You can these days, even in a portable device running on
| battery.
|
| 96GB fits comfortably in some laptop GPUs released this year.
| michaelt wrote:
| Be a lot cooler if you said what laptop, and how much
| quantisation you're assuming :)
| tvararu wrote:
| They're probably referring to the new MacBook Pros with up
| to 128GB of unified memory.
| refulgentis wrote:
| This is extremely misleading. Source: I've been working in local
| LLMs for the last 10 months. Got my Mac laptop too. I'm bullish
| too. But we shouldn't breezily dismiss those concerns out of
| hand. In practice, it's single digit tokens a second on a
| $4500 laptop for a model with weights half this size (Llama 2
| 70B Q2 GGUF => 29 GB, Q8 => 36 GB)
| coolspot wrote:
| > $4500
|
| Which is more than a price of RTX A6000 48gb ($4k used on
| ebay)
| CamperBob2 wrote:
| How fast does it run on that?
| brucethemoose2 wrote:
| Which is outrageously priced, in case that's not clear.
| It's a 2020 RTX 3090 with doubled-up memory ICs, which is
| not much extra BoM.
| baq wrote:
| Clearly it's worth what people are willing to pay for it.
| At least it isn't being used to compute hashes of virtual
| gold.
| tucnak wrote:
| People are also willing to die for all kinds of stupid
| reasons, and it's not indicative of _anything_ let alone
| a clever comment on the online forum. Show some decorum,
| please!
| brucethemoose2 wrote:
| It's an artificial supply constraint due to artificial
| market segmentation enabled by Nvidia/AMD.
|
| Honestly it's crazy that AMD indulges in this, especially
| now. Their workstation market share is comparatively
| tiny, and instead they could have a swarm of devs (like
| me) pecking away at AMD compatibility on AI repos if they
| sold cheap 32GB/48GB cards.
| MacsHeadroom wrote:
| Mixtral 8x7b only needs 12B of weights in RAM per
| generation.
|
| 2B for the attention head and 5B from each of 2 experts.
|
| It should be able to run slightly faster than a 13B dense
| model, in as little as 16GB of RAM with room to spare.
| filterfiber wrote:
| > in as little as 16GB of RAM with room to spare.
|
| I don't think that's the case; for full speed you still
| need (5B*8)/2 + 2B + a few GB of overhead.
|
| I think the experts are chosen per-token? That means that yes,
| you technically only need two experts plus the
| router/overhead in VRAM per token, but you'll have to
| constantly be loading in different experts unless you can
| fit them all, which would still be terrible for
| performance.
|
| So you'll still be PCIE/RAM speed limited unless you can
| fit all of the experts into memory (or get really lucky
| and only need two experts).
| miven wrote:
| > You won't be able to run this on your home GPU.
|
| As far as I understand in a MOE model only one/few experts are
| actually used at the same time, shouldn't the inference speed
| for this new MOE model be roughly the same as for a normal
| Mistral 7B then?
|
| 7B models have reasonable throughput when run on a beefy CPU,
| especially when quantized down to 4bit precision, so couldn't
| Mixtral be comfortably run on a CPU too then, just with 8 times
| the memory footprint?
| filterfiber wrote:
| So this specific model ships with a default config of 2
| experts per token.
|
| So you need roughly two loaded in memory per token. Roughly
| the speed and memory of a 13B per token.
|
| Only issue is that's per-token. 2 experts are chosen per
| token, which means if they aren't the same ones as the last
| token, you need to load them into memory.
|
| So yeah to not be disk limited you'd need roughly 8 times the
| memory and it would run at the speed of a 13B model.
|
| ~~~Note on quantization, iirc smaller models lose more
| performance when quantized vs larger models. So this would be
| the speed of a 4bit 13B model but with the penalty from a
| 4bit 7B model.~~~ Actually I have zero idea how quantization
| scales for MoE, I imagine it has the penalty I mentioned but
| that's pure speculation.
| skghori wrote:
| multimodal? 32k context is pretty impressive, curious to test
| instructability
| brucethemoose2 wrote:
| MistralLite is already 32K, and Yi 200K actually works pretty
| well out to at least 75K (the most I tested)
| civilitty wrote:
| What kind of tests did you run out to that length? (Needle in
| haystack, summarization, structured data extraction, etc)
|
| What is the max number of tokens in the output?
| brucethemoose2 wrote:
| Long stories mostly, either novel or chat format. Sometimes
| summarization or insights, notably tests that you couldn't
| possibly do with RAG chunking. Mostly short responses, not
| rewriting documents or huge code blocks or anything like
| that.
|
| MistralLite is basically overfit to summarize and retrieve
| in its 32K context, but it's extremely good at that for a
| 7B. It's kinda useless for anything else.
|
| Yi 200K is... smart with the long context. An example I
| often cite is a Captain character in a story I 'wrote' with
| the llm. A Yi 200K finetune generated a debriefing for like
| 40K of context in a story, correctly assessing what plot
| points should be kept secret _and_ making some very
| interesting deductions. You could never possibly do that
| with RAG on a 4K model, or even models that "cheat" with
| their huge attention like Anthropic.
|
| I test at 75K just because that's the most my 3090 will
| hold.
| cloudhan wrote:
| Might be the training code related to the model
| https://github.com/mistralai/megablocks-public/tree/pstock/m...
| cloudhan wrote:
| Mixtral-8x7B support --> Support new model
|
| https://github.com/stanford-futuredata/megablocks/pull/45
| mareksotak wrote:
| Some companies spend weeks on landing pages, demos and cute,
| thought-through promo videos, and then there is Mistral, casually
| dropping a magnet link on Friday.
| tarruda wrote:
| I'm curious about their business model.
| jorge-d wrote:
| Well so far their business model seems to be mostly centered
| on raising money[1]. I do hope they succeed in becoming a
| successful contender against OpenAI.
|
| [1]
| https://www.bloomberg.com/news/articles/2023-12-04/openai-
| ri...
| udev4096 wrote:
| https://archive.ph/4F3dT
| nuz wrote:
| They can make plenty by offering consulting fees for
| finetuning and general support around their models.
| realce wrote:
| "plenty" is not a word some of these people understand
| however
| behnamoh wrote:
| You mean they put on a Redhat?
| tananaev wrote:
| I'm sure it's also a marketing move to build a certain
| reputation. Looks like it's working.
| OscarTheGrinch wrote:
| Not geoblocking the entirety of Europe also makes them stand
| out like a ringmaster amongst clowns.
| moffkalast wrote:
| Well they are French after all. They should be geoblocking
| the USA in response for a bit to make a point lol.
| fredoliveira wrote:
| Not with their cap table, they won't ;-)
| peanuty1 wrote:
| Google Bard is still not available in Canada.
| oh_sigh wrote:
| Are there some regulatory reasons why it would not be
| available? It seems weird if Google would intentionally
| block users merely to block them.
| mrandish wrote:
| I think there are still some pretty onerous laws about
| French localization of products and services made
| available in the French-speaking part of Canada. Could be
| that...
| simonerlic wrote:
| I originally thought so too, but as far as I know Bard is
| available in France- so I have a feeling that language
| isn't the roadblock here.
| sergiotapia wrote:
| Stuck on "Retrieving data" from the Magnet link and "Downloading
| metadata" when adding the magnet to the download list.
|
| I had to manually add these trackers and now it works:
| https://gist.github.com/mcandre/eab4166938ed4205bef4
| sigmar wrote:
| Not exactly similar companies in terms of their goals, but pretty
| hilarious to contrast this model announcement with Google's
| Gemini announcement two days ago.
| aubanel wrote:
| Mistral sure does not bother too much with explanations, but this
| style gives me much more confidence in the product than Google's
| polished, corporate, soulless announcement of Gemini!
| brucethemoose2 wrote:
| I will take weights over docs.
|
| It does remind me how some Google employee was bragging that
| they disclosed the weights for Gemini, and _only_ the small
| mobile Gemini, as if that's a generous step over other
| companies.
| refulgentis wrote:
| I don't think that's true, because quite simply, they have
| not.
|
| I am 100% in agreement with your viewpoint, but feel
| squeamish seeing an un-needed lie coupled to it to justify
| it. Just so much Othering these days.
| brucethemoose2 wrote:
| I was referencing this tweet:
|
| https://twitter.com/zacharynado/status/1732425598465900708
|
| (Alt:
| https://nitter.net/zacharynado/status/1732425598465900708)
|
| That is fair though, this was an impulsive addition on my
| part.
| udev4096 wrote:
| https://nitter.rawbit.ninja/MistralAI/status/173315051239503...
| udev4096 wrote:
| based mistral casually dropping a magnet link
| manojlds wrote:
| Google - Fake demo
|
| Mistral - magnet link and that's it
| maremmano wrote:
| Do you need some fancy announcement? Let's do it the 90s way:
| https://twitter.com/erhartford/status/1733159666417545641/ph...
| eurekin wrote:
| I find that way more bold and confident than dropping an
| obviously manipulated and unrealistic marketing page or video
| maremmano wrote:
| Frankly I don't know why Google continues to act this way.
| Remember the "Google Duplex: A.I. Assistant Calls Local
| Businesses To Make Appointments" story.
| https://www.youtube.com/watch?v=D5VN56jQMWM
|
| Not that this affects Google's user base in any way, at the
| moment.
| eurekin wrote:
| They obviously have both money and great talent. Maybe they
| put out minimal effort only for investors that expect their
| presence in consumer space?
| polygamous_bat wrote:
| > Frankly I don't know why Google continues to act this
| way.
|
| Unfortunately, that's because they have Wall St. analysts
| looking at their videos who will (indirectly) determine how
| big of a bonus Sundar and co takes home at the end of the
| year. Mistral doesn't have to worry about that.
| eurekin wrote:
| This makes so much sense! Thanks
| seydor wrote:
| FILE_ID.DIZ
| brucethemoose2 wrote:
| In other llm news, Mistral/Yi finetunes trained with a new (still
| undocumented) technique called "neural alignment" are blasting
| past other models on the HF leaderboard. The 7B is "beating" most
| 70Bs. The 34B in testing seems... Very good:
|
| https://huggingface.co/fblgit/una-xaberius-34b-v1beta
|
| https://huggingface.co/fblgit/una-cybertron-7b-v2-bf16
|
| I mention this because it could theoretically be applied to
| Mistral Moe. If the uplift is the same as regular Mistral 7B, and
| Mistral Moe is good, the end result is a _scary_ good model.
|
| This might be an inflection point where desktop-runnable OSS is
| really breathing down GPT-4's neck.
| _boffin_ wrote:
| Interesting. One thing I noticed is that Mistral has a
| `max_position_embeddings` of ~32k while these have it at 4096.
|
| Any thoughts on that?
| brucethemoose2 wrote:
| It's complicated.
|
| The 7B model (cybertron) is trained on Mistral. Mistral is
| _technically_ a 32K model, but it uses a sliding window
| beyond 32K, and for all practical purposes in current
| implementations it behaves like an 8K model.
|
| The 34B model is based on Yi 34B, which is inexplicably
| marked as a 4K model in the config but actually works out to
| 32K if you literally just edit that line. Yi also has a 200K
| base model... and I have no idea why they didn't just train
| on that. You don't need to finetune at long context to
| preserve its long context ability.
| stavros wrote:
| Aren't LLM benchmarks at best irrelevant, at worst lying, at
| this point?
| nabakin wrote:
| More or less. The automated benchmarks themselves can be
| useful when you weed out the models which are overfitting to
| them.
|
| Although, anyone claiming a 7b LLM is better than a well
| trained 70b LLM like Llama 2 70b chat for the general case,
| doesn't know what they are talking about.
|
| In the future will it be possible? Absolutely, but today we
| have no architecture or training methodology which would
| allow it to be possible.
|
| You can rank models yourself with a private automated
| benchmark which models don't have a chance to overfit to or
| with a good human evaluation study.
|
| Edit: also, I guess OP is talking about Mistral finetunes
| (ones overfitting to the benchmarks) beating out 70b models
| on the leaderboard because Mistral 7b is lower than Llama 2
| 70b chat.
| brucethemoose2 wrote:
| I'm not saying it's better than 70B, just that it's very
| strong from what others are saying.
|
| Actually I am testing the 34B myself (not the 7B), and it
| seems good.
| fblgit wrote:
| UNA: Uniform Neural Alignment. Haven't u noticed yet?
| Each model that I uniform, behaves like a pre-trained..
| and you likely can fine-tune it again without damaging
| it.
|
| If you chatted with them, you know .. that strange
| sensation, you know what is it.. Intelligence.
| Xaberius-34B is the highest performer of the board, and
| is NOT contaminated.
| valine wrote:
| How much data do you need for UNA? Is a typical fine
| tuning dataset needed or can you get away with less than
| that?
| fblgit wrote:
| It doesn't require much data; on a 7B it can take a couple
| of hours or so.
| brucethemoose2 wrote:
| In addition to what was said, if it's anything like DPO
| you don't need a _lot_ of data, just a good set. For
| instance, DPO requires "good" and "bad" responses for
| each given prompt.
| elcomet wrote:
| We can only compare specific training procedures though.
|
| With a 7b and a 70b trained the same way, the 70b should
| always be better
| mistrial9 wrote:
| Quick to assert an authoritative opinion - yet the one word
| "better" belies the message? Certainly there are more
| dimensions worth including in the rating?
| airgapstopgap wrote:
| > today we have no architecture or training methodology
| which would allow it to be possible.
|
| We clearly see that Mistral-7B is in some important,
| representative respects (eg coding) superior to
| Falcon-180B, and superior across the board to stuff like
| OPT-175B or Bloom-175B.
|
| "Well trained" is relative. Models are, overwhelmingly,
| functions of their data, not just scale and architecture.
| Better data allows for yet-unknown performance jumps, and
| data curation techniques are a closely-guarded secret. I
| have no doubt that a 7B beating our _best_ 60-70Bs is
| possible already, eg using something like Phi methods for
| data and more powerful architectures like some variation of
| universal transformer.
| brucethemoose2 wrote:
| Yes, absolutely. I was just preaching this.
|
| But it's not _totally_ irrelevant. They are still a datapoint
| to consider with some performance correlation. YMMV, but
| these models actually seem to be quite good for the size in
| my initial testing.
| typon wrote:
| Yes. The only thing that is relevant is a hidden benchmark
| that's never released and run by a trusted third party.
| puttycat wrote:
| I wonder how it will rank on benchmarks which are password-
| protected to prevent test contamination, for example:
| https://github.com/taucompling/bliss
| fblgit wrote:
| Correct. UNA can align the MoE at multiple layers, experts,
| nearly any part of the neural network I would say. Xaberius 34B
| v1 "BETA".. is the king, and it's just that.. the beta. I'll be
| focusing on the Mixtral, it's a Christmas gift.. modular in that
| way, thanks for the lab @mistral!
| brucethemoose2 wrote:
| Do a Yi 200K version as well! That would make my Christmas,
| as Mistral Moe is only maybe 32K.
| swyx wrote:
| what is neural alignment? who came up with it?
| MyFirstSass wrote:
| Hot take, but Mistral 7B is the actual state of the art of LLMs.
|
| ChatGPT 4 is amazing, yes, and I've been a day 1 subscriber, but
| it's huge, runs on server farms far away and is more or less a
| black box.
|
| Mistral is tiny, and amazingly coherent and useful for its size
| for both general questions and code, uncensored, and a leap I
| wouldn't have believed possible in just a year.
|
| I can run it on my Macbook Air at 12tkps, can't wait to try this
| on my desktop.
| tarruda wrote:
| > I can run it on my Macbook Air at 12tkps, can't wait to try
| this on my desktop.
|
| That seems kinda low, are you using Metal GPU acceleration with
| llama.cpp? I don't have a macbook, but saw some of the
| llama.cpp benchmarks that suggest it can reach close to 30tk/s
| with GPU acceleration.
| MyFirstSass wrote:
| Thanks for the tip. I'm on the M2 Air with 16 GB of RAM.
|
| If anyone has faster than 12tkps on an Air, let me know.
|
| I'm using the LM Studio GUI over llama.cpp with the "Apple
| Metal GPU" option. Increasing CPU threads seemingly does
| nothing either without metal.
|
| Ram usage hovers at 5.5GB with a q5_k_m of Mistral.
| M4v3R wrote:
| Try different quantization variations. I got vastly
| different speeds depending on which quantization I chose. I
| believe q4_0 worked very well for me. Although for a 7B
| model q8_0 runs just fine too with better quality.
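|
| For reference, a minimal way to compare quantizations locally is
| via the llama-cpp-python bindings (assuming a Metal-enabled build;
| the model filename is a placeholder):
|
|     from llama_cpp import Llama
|
|     # n_gpu_layers=-1 offloads every layer to the GPU
|     # (Metal on Apple Silicon, if llama.cpp was built with it)
|     llm = Llama(model_path="mistral-7b-instruct.Q4_0.gguf",
|                 n_gpu_layers=-1, n_ctx=4096)
|
|     out = llm("Explain mixture of experts in one sentence.", max_tokens=64)
|     print(out["choices"][0]["text"])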
| andy_xor_andrew wrote:
| I am with you on this. Mistral 7B is amazingly good. There are
| finetunes of it (the Intel one, and Berkeley Starling) that
| feel like they are within throwing distance of gpt3.5T... at
| only 7B!
|
| I was really hoping for a 13B Mistral. I'm not sure if this MOE
| will run on my 3090 with 24GB. Fingers crossed that
| quantization + offloading + future tricks will make it
| runnable.
| MyFirstSass wrote:
| True, I've been using the OpenOrca finetune and just
| downloaded the new UNA Cybertron model, both tuned on the
| Mistral base.
|
| They are not far from GPT-3 logic-wise, I'd say, if you
| consider the breadth of data, i.e. very little in 7GB; so
| missing other languages, niche topics and prose styles etc.
|
| I honestly wouldn't be surprised if 13B would be
| indistinguishable from GPT-3.5 on some levels. And if that is
| the case - then, coupled with the latest developments in
| decoding, like Ultrafastbert, Speculative, Jacobi, Lookahead
| etc. - I honestly wouldn't be surprised to see local LLMs on
| current GPT-4 level within a few years.
| 123yawaworht456 wrote:
| It really is. It feels at the very least equal to llama2 13b.
| If a mistral 70b had existed and was as much an improvement over
| llama2 70b as it is at 7b size, it would definitely be on par
| with gpt3.5
| ipsum2 wrote:
| State of the art for something you can run on a Macbook air,
| but not state of the art for LLMs, or even open source. Yi 34B
| and Llama2 70B still beat it.
| MyFirstSass wrote:
| True, but it's ahead of the competition when size is
| considered, which is why I really look forward to their 13B,
| 33B models etc., because if they are as potent, who knows what
| leaps we'll take soon.
|
| I remember running llama1 33B 8 months ago that, as I recall,
| was on Mistral 7B's level while other 7B models were a
| rambling mess.
|
| The jump in "potency" is what is so extreme.
| emporas wrote:
| Given that 50% of all information consumed on the internet was
| produced in the last 24 hours, smaller models could hold a
| serious advantage over bigger models.
|
| If an LLM or a SmallLM can be retrained or fine-tuned
| constantly, every week or every day to incorporate recent
| information then outdated models trained a year or two years
| back hold no chance to keep up. Dunno about the licensing but
| OpenAI could incorporate a smaller model like Mistral7B into
| their GPT stack, re-train it from scratch every week, and
| charge the same as GPT-4. There are users who might certainly
| prefer the weaker, albeit updated models.
| nabakin wrote:
| Not a hot take, I think you're right. If it was scaled up to
| 70b, I think it would be better than Llama 2 70b. Maybe if it
| was then scaled up to 180b and turned into a MoE it would be
| better than GPT-4.
| seydor wrote:
| Looks like they're too busy being awesome. I need a fake video to
| understand this!
|
| What memory will this need? I guess it won't run on my 12GB of
| VRAM.
|
| "moe": {"num_experts_per_tok": 2, "num_experts": 8}
|
| I bet many people will re-discover bittorrent tonight
| brucethemoose2 wrote:
| Looks like it will squeeze into 24GB once the llama runtimes
| work it out.
|
| It's also a good candidate for splitting across small GPUs,
| maybe.
|
| One architecture I can envision is hosting prompt ingestion and
| the "host" model on the GPU and the downstream expert model
| weights on the CPU /IGP. This is actually pretty efficient, as
| the CPU/IGP is really bad at the prompt ingestion but
| reasonably fast at ~14B token generation.
|
| Llama.cpp all but already does this, I'm sure MLC will
| implement it as well.
| syntaxing wrote:
| BitTorrent was the craze when llama was leaked on torrent. Then
| Facebook started taking down all the huggingface repos and a bunch
| of people transitioned to torrent releases temporarily. llama 2
| changed all this, but it was a fun time.
| cuuupid wrote:
| Stark contrast with Google's "all demo no model" approach from
| earlier this week! Seems to be trained off Stanford's Megablocks:
| https://github.com/mistralai/megablocks-public
| BryanLegend wrote:
| Andrej Karpathy's take:
|
| New open weights LLM from @MistralAI
|
| params.json:
| - hidden_dim / dim = 14336/4096 => 3.5X MLP expand
| - n_heads / n_kv_heads = 32/8 => 4X multiquery
| - "moe" => mixture of experts 8X top 2
|
| Likely related code:
| https://github.com/mistralai/megablocks-public
|
| Oddly absent: an over-rehearsed professional release video
| talking about a revolution in AI.
|
| If people are wondering why there is so much AI activity right
| around now, it's because the biggest deep learning conference
| (NeurIPS) is next week.
|
| https://twitter.com/karpathy/status/1733181701361451130
| henrysg wrote:
| > Oddly absent: an over-rehearsed professional release video
| talking about a revolution in AI.
| ahmetkca wrote:
| Let's go multimodal
| Jayakumark wrote:
| https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen
| maremmano wrote:
| Who knows if I can run this on a MacBook Pro M3 Max with 128GB?
| At what TPS?
| deoxykev wrote:
| I would like to know this as well.
| M4v3R wrote:
| Big chance that you'll be able to run it using Ollama app soon
| enough.
| marci wrote:
| If I understand correctly:
|
| RAM-wise, you can easily run a 70b with 128GB, and 8x7B is
| obviously less than that.
|
| Compute-wise, I suppose it would be a bit slower than running a
| 13b.
|
| edit: "actually", I think it might be faster than a 13b. 8
| random 7b ~= 115GB, Mixtral is under 90. I will have to wait
| for more info/understanding.
| treprinum wrote:
| I would say so based on LLaMA 2 70B; if it's 8x inference in
| MoE then I guess you'd see <20 tokens/sec?
| asolidtime1 wrote:
| https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen/b...
|
| Holy shit, this is some clever marketing.
|
| Kinda wonder if any of their employees were part of the warez
| scene at some point.
| poulpy123 wrote:
| Is it eight 7b models in a trench coat?
| fortunefox wrote:
| Releasing a model with a magnet link and some ascii art gives me
| way more confidence in the product than any OpenAI blog post ever
| could.
|
| Excited to play with this once it's somewhat documented on how to
| get it running on a dual 4090 Setup.
| smlacy wrote:
| https://nitter.net/MistralAI/status/1733150512395038967
| leobg wrote:
| I love Mistral.
|
| It's crazy what can be done with this small model and 2 hours of
| fine tuning.
|
| Chatbot with function calling? Check.
|
| 90 +% accuracy multi label classifier, even when you only have 15
| examples for each label? Check.
|
| Craaaazy powerful.
| _fizz_buzz_ wrote:
| Does anybody have a tutorial or documentation on how I can run
| this and play around with it locally? A "getting started" guide
| of sorts?
| 0cf8612b2e1e wrote:
| Even better if a llamafile gets released.
___________________________________________________________________
(page generated 2023-12-08 23:00 UTC)