[HN Gopher] Mistral "Mixtral" 8x7B 32k model [magnet]
       ___________________________________________________________________
        
       Mistral "Mixtral" 8x7B 32k model [magnet]
        
       Author : xyzzyrz
       Score  : 369 points
       Date   : 2023-12-08 16:03 UTC (6 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | politician wrote:
       | Honest question: Why isn't this on Huggingface? Is this one a
       | leaked model with a questionable training or alignment
       | methodology?
       | 
       | EDIT: I mean, I guess they didn't hack their own twitter account,
       | but still.
        
         | kcorbitt wrote:
         | It'll be on Huggingface soon. This is how they dropped their
         | original 7B model as well. It's a marketing thing, but it
         | works!
        
           | politician wrote:
           | Ah, well, ok. I appreciate the torrent link -- much faster
           | distribution.
        
           | politician wrote:
           | @kcorbitt Low priority, probably not worth an email: Does
           | using OpenPipe.ai to fine-tune a model result in a
           | downloadable LoRA adapter? It's not clear from the website if
           | the fine-tune is exportable.
        
       | tarruda wrote:
        | Still 7B, but now with 32k context. Looking forward to seeing how
        | it compares with the previous one, and what the community does
        | with it.
        
         | MacsHeadroom wrote:
         | Not 7B, 8x7B.
         | 
         | It will run with the speed of a 7B model while being much
         | smarter but requiring ~24GB of RAM instead of ~4GB (in 4bit).
        
           | dragonwriter wrote:
            | Given the config parameters posted, it's 2 experts per token,
            | so the computation cost per token should be the cost of the
            | component that selects experts + 2x the cost of a 7B model.
        
             | stavros wrote:
             | Yes, but I also care about "can I load this onto my home
             | GPU?" where, if I need all experts for this to run, the
             | answer is "no".
        
             | MacsHeadroom wrote:
             | Ah good catch. Upon even closer examination, the attention
             | layer (~2B params) is shared across experts. So in theory
             | you would need 2B for the attention head + 5B for each of
             | two experts in RAM.
             | 
             | That's a total of 12B, meaning this should be able to be
             | run on the same hardware as 13B models with some loading
             | time between generations.
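              | 
              | A rough back-of-the-envelope version of that accounting
              | (illustrative only, assuming ~2B shared params and ~5B per
              | expert as above):
              | 
              |     shared = 2e9       # shared attention/embedding params (est.)
              |     per_expert = 5e9   # expert-specific FFN params (est.)
              |     num_experts = 8
              |     active = 2         # experts used per token
              | 
              |     total_params = shared + num_experts * per_expert   # ~42B stored
              |     active_params = shared + active * per_expert       # ~12B per token
              |     gb_at_4bit = total_params * 0.5 / 1e9              # ~21 GB before overhead
              |     print(total_params / 1e9, active_params / 1e9, gb_at_4bit)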
        
         | seydor wrote:
         | unfortunately too big for the broader community to test. Will
         | be very interesting to see how well it performs compared to the
         | large models
        
           | brucethemoose2 wrote:
           | Not really, looks like a ~40B class model which is very
           | runnable.
        
             | MacsHeadroom wrote:
              | It's actually ~13B class at runtime. 2B for attention is
              | shared across all experts, and then it runs 2 experts at a
              | time.
             | 
             | So 2B for attention + 5Bx2 for inference = 12B in RAM at
             | runtime.
        
               | brucethemoose2 wrote:
               | I just mean in terms of VRAM usage.
        
       | nulld3v wrote:
       | Looks to be Mixture of Experts, here is the params.json:
        | 
        |     {
        |         "dim": 4096,
        |         "n_layers": 32,
        |         "head_dim": 128,
        |         "hidden_dim": 14336,
        |         "n_heads": 32,
        |         "n_kv_heads": 8,
        |         "norm_eps": 1e-05,
        |         "vocab_size": 32000,
        |         "moe": {
        |             "num_experts_per_tok": 2,
        |             "num_experts": 8
        |         }
        |     }
        
         | sp332 wrote:
         | I don't see any code in there. What runtime could load these
         | weights?
        
           | brucethemoose2 wrote:
            | It's presumably llama, just like Mistral.
           | 
           |  _Everything_ open source is llama now. Facebook all but
           | standardized the architecture.
           | 
           | I dunno about the moe. Is there existing transformers code
           | for that part? It kinda looks like there is based on the
           | config.
        
             | jasonjmcghee wrote:
             | Mistral is not llama architecture.
             | 
             | https://github.com/mistralai/mistral-src
        
               | brucethemoose2 wrote:
                | It's _basically_ llama architecture, all but drop-in
                | compatible with llama runtimes.
        
             | refulgentis wrote:
             | Because it's JSON? :)
        
         | sockaddr wrote:
         | What does expert mean in this context?
        
           | moffkalast wrote:
            | It means it's 8 7B models in a trench coat, in a sense: it
            | runs as fast as a 14B (2 experts at a time apparently) but
            | takes up as much memory as a 40B model (70% * 8 * 7B). There
            | is some process trained into it that chooses which experts to
            | use based on the question posed. GPT 4 is allegedly based on
            | the same architecture, but at 8*222B.
        
             | dragonwriter wrote:
             | > GPT 4 is based on the same architecture, but at 8*222B.
             | 
              | Do we actually know either that it is MoE or that size? IIRC
              | both of those started as outsider guesses that somehow just
              | became accepted knowledge without any actual confirmation.
        
               | moffkalast wrote:
               | Iirc some of the other things the same source stated were
               | later confirmed, so this is likely to be true as well,
               | but I might be misremembering.
        
             | sockaddr wrote:
             | Fascinating. Thanks
        
             | tavavex wrote:
             | Does anyone here know roughly how an expert gets chosen? It
              | seems like a very open-ended problem, and I'm not sure how
              | it can be implemented easily.
        
             | WeMoveOn wrote:
             | How did you come up with 40b for the memory? specifically,
             | why 0.7 * total params?
        
       | YetAnotherNick wrote:
       | 86 GB. So it's likely a Mixture of experts model with 8 experts.
       | Exciting.
        
         | tarruda wrote:
         | Damn, I was hoping it was still a single 7B model that I would
         | be able to run on my GPU
        
           | renonce wrote:
           | You can, wait for a 4-bit quantized version
        
             | tarruda wrote:
             | I only have a RTX 3070 with 8GB VRam. It can run quantized
             | 7B models well, but this is 8 x 7B. Maybe an RTX 3090 with
             | 24GB VRAM can do it.
        
               | brucethemoose2 wrote:
                | It would be very tight. 8x7B in 24GB (currently) has more
                | overhead than 70B.
                | 
                | It's theoretically doable, with quantization from the
                | recent 2-bit quant paper and a custom implementation (in
                | exllamav2?)
                | 
                | EDIT: Actually the download is much smaller than 8x7B.
                | Not sure how, but it's sized more like a 30B, perfect for
                | a 3090. _Very_ interesting.
        
               | burke wrote:
               | Napkin math: 7x(4/8)x8 is 28GB, and q4 uses a little more
               | than just 4 bits per param, and there's extra overhead
               | for context, and the FFN to select experts is probably
               | more on top of that.
               | 
               | It would probably fit in 32GB at 4-bit but probably won't
               | run with sensible quantization/perf on a 3090/4090
               | without other tricks like offloading. Depending on how
               | likely the same experts are to be chosen for multiple
               | sequential tokens, offloading experts may be viable.
        
               | espadrine wrote:
               | Once on llama.cpp, it will likely run on CPU with enough
               | RAM, especially given that the GGUF mmap code only seems
               | to use RAM for the parts of the weights that get used.
        
       | kcorbitt wrote:
       | No public statement from Mistral yet. What we know:
       | 
       | - Mixture of Experts architecture.
       | 
        | - 8x 7B-parameter experts (potentially trained starting with
       | their base 7B model?).
       | 
       | - 96GB of weights. You won't be able to run this on your home
       | GPU.
        
         | tarruda wrote:
         | Theoretically it could fit into a single 24GB GPU if 4-bit
          | quantized. Exllama v2 has an even more efficient quantization
          | algorithm, and was able to fit 70B models in a 24GB GPU, but only
          | with 2048 tokens of context.
        
         | coder543 wrote:
         | > 96GB of weights. You won't be able to run this on your home
         | GPU.
         | 
         | This seems like a non-sequitur. Doesn't MoE select an expert
         | for each token? Presumably, the same expert would frequently be
         | selected for a number of tokens in a row. At that point, you're
         | only running a 7B model, which will easily fit on a GPU. It
         | will be slower when "swapping" experts if you can't fit them
         | all into VRAM at the same time, but it shouldn't be
         | catastrophic for performance in the way that being unable to
         | fit all layers of an LLM is. It's also easy to imagine caching
         | the N most recent experts in VRAM, where N is the largest
         | number that still fits into your VRAM.
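          | 
          | A toy sketch of that caching idea (purely illustrative; the
          | "load_fn" hook and the capacity are made up, not any real
          | runtime's API):
          | 
          |     from collections import OrderedDict
          | 
          |     class ExpertCache:
          |         """Keep the N most recently used experts resident in VRAM."""
          | 
          |         def __init__(self, load_fn, capacity=4):
          |             self.load_fn = load_fn    # copies one expert's weights to the GPU
          |             self.capacity = capacity  # N = how many experts fit in VRAM
          |             self.cache = OrderedDict()
          | 
          |         def get(self, expert_id):
          |             if expert_id in self.cache:
          |                 self.cache.move_to_end(expert_id)   # most recently used
          |             else:
          |                 if len(self.cache) >= self.capacity:
          |                     self.cache.popitem(last=False)  # evict least recently used
          |                 self.cache[expert_id] = self.load_fn(expert_id)
          |             return self.cache[expert_id]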
        
           | tarruda wrote:
           | I will be super happy if this is true.
           | 
           | Even if you can't fit all of them in the VRAM, you could load
           | everything in tmpfs, which at least removes disk I/O penalty.
        
             | cjbprime wrote:
             | Just mentioning in case it helps anyone out: Linux already
             | has a disk buffer cache. If you have available RAM, it will
             | hold on to pages that have been read from disk until there
             | is enough memory pressure to remove them (and then it will
             | only remove some of them, not all of them). If you don't
             | have available RAM, then the tmpfs wouldn't work. The tmpfs
             | is helpful if you know better than the paging subsystem
             | about how much you really want this data to always stay in
             | RAM no matter what, but that is also much less flexible,
             | because sometimes you need to burst in RAM usage.
        
           | read_if_gay_ wrote:
           | however, if you need to swap experts on each token, you might
           | as well run on cpu.
        
             | tarruda wrote:
             | > Presumably, the same expert would frequently be selected
             | for a number of tokens in a row
             | 
             | In other words, assuming you ask a coding question and
             | there's a coding expert in the mix, it would answer it
             | completely.
        
               | read_if_gay_ wrote:
               | yes I read that. do you think it's reasonable to assume
               | that the same expert will be selected so consistently
               | that model swapping times won't dominate total runtime?
        
               | tarruda wrote:
               | No idea TBH, we'll have to wait and see. Some say it
               | might be possible to efficiently swap the expert weights
               | if you can fit everything in RAM:
               | https://x.com/brandnarb/status/1733163321036075368?s=20
        
               | ttul wrote:
               | See my poorly educated answer above. I don't think that's
               | how MoE actually works. A new mixture of experts is
               | chosen for every new context.
        
           | ttul wrote:
           | Someone smarter will probably correct me, but I don't think
           | that is how MoE works. With MoE, a feed-forward network
           | assesses the tokens and selects the best two of eight experts
           | to generate the next token. The choice of experts can change
           | with each new token. For example, let's say you have two
           | experts that are really good at answering physics questions.
           | For some of the generation, those two will be selected. But
            | later on, maybe the context suggests you need two models
            | better suited to generating French. This is a silly
            | simplification of what I understand to be going on.
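            | 
            | A minimal sketch of that kind of top-2 routing, assuming a
            | simple linear gate over 8 expert FFNs (illustrative PyTorch,
            | not Mistral's actual code):
            | 
            |     import torch
            |     import torch.nn.functional as F
            | 
            |     def moe_forward(x, gate, experts, top_k=2):
            |         # x: (tokens, dim); gate: linear layer dim -> 8; experts: list of 8 FFNs
            |         scores = gate(x)                                  # (tokens, 8)
            |         weights, idx = torch.topk(scores, top_k, dim=-1)  # best 2 per token
            |         weights = F.softmax(weights, dim=-1)              # renormalize gates
            |         out = torch.zeros_like(x)
            |         for t in range(x.shape[0]):     # naive per-token loop, for clarity
            |             for j in range(top_k):
            |                 expert = experts[idx[t, j].item()]
            |                 out[t] += weights[t, j] * expert(x[t])
            |         return out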
        
             | ttul wrote:
             | This being said, presumably if you're running a huge farm
             | of GPUs, you could put each expert onto its own slice of
             | GPUs and orchestrate data to flow between GPUs as needed. I
             | have no idea how you'd do this...
        
               | alchemist1e9 wrote:
               | Ideally those many GPUs could be on different hosts
               | connected with a commodity interconnect like 10gbe.
               | 
               | If MOE models do well it could be great for commodity hw
               | based distributed inference approaches.
        
             | Philpax wrote:
             | Yes, that's more or less it - there's no guarantee that the
             | chosen expert will still be used for the next token, so
             | you'll need to have all of them on hand at any given
             | moment.
        
             | wongarsu wrote:
             | One viable strategy might be to offload as many experts as
             | possible to the GPU, and evaluate the other ones on the
              | CPU. If you collect some statistics on which experts are used
             | most in your use cases and select those for GPU
             | acceleration you might get some cheap but notable speedups
             | over other approaches.
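              | 
              | Something like this, as a toy sketch (the routing counts and
              | VRAM budget are made-up numbers):
              | 
              |     # Pin the most-used experts on the GPU, run the rest on CPU
              |     usage = {0: 4180, 1: 950, 2: 3020, 3: 310,
              |              4: 2890, 5: 120, 6: 1775, 7: 640}
              |     gpu_budget = 3   # hypothetical: experts that fit in spare VRAM
              | 
              |     on_gpu = sorted(usage, key=usage.get, reverse=True)[:gpu_budget]
              |     placement = {e: "gpu" if e in on_gpu else "cpu" for e in usage}
              |     print(placement)   # e.g. 0, 2, 4 -> gpu, the rest -> cpu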
        
           | numeri wrote:
           | You're not necessarily wrong, but I'd imagine this is almost
           | prohibitively slow. Also, this model seems to use two experts
           | per token.
        
         | MacsHeadroom wrote:
         | That is only 24GB in 4bit.
         | 
         | People are running models 2-4 times that size on local GPUs.
         | 
         | What's more, this will run on a MacBook CPU just fine-- and at
         | an extremely high speed.
        
           | brucethemoose2 wrote:
            | Yeah, 70B is much larger and fits on a 24GB, admittedly with
           | very lossy quantization.
           | 
           | This is just about right for 24GB. I bet that is intentional
           | on their part.
        
         | shubb wrote:
         | >> You won't be able to run this on your home GPU.
         | 
         | Would this allow you to run each expert on a cheap commodity
         | GPU card so that instead of using expensive 200GB cards we can
         | use a computer with 8 cheap gaming cards in it?
        
           | terafo wrote:
           | Yes, but you wouldn't want to do that. You will be able to
           | run that on a single 24gb GPU by the end of this weekend.
        
             | brucethemoose2 wrote:
             | Maybe two weekends.
        
           | dragonwriter wrote:
           | > Would this allow you to run each expert on a cheap
           | commodity GPU card so that instead of using expensive 200GB
           | cards we can use a computer with 8 cheap gaming cards in it?
           | 
           | I would think no differently than you can run a large regular
            | model on a multi-GPU setup (which people do!). It's still all
            | one network even if not all of it is activated for each
            | token, and since it's much smaller than a 56B model, it seems
           | like there are significant components of the network that are
           | shared.
        
             | terafo wrote:
             | Attention is shared. It's ~30% of params here. So ~2B
             | params are shared between experts and ~5B params are unique
             | to each expert.
        
         | faldore wrote:
         | at 4 bits you could run it on a 3090 right?
        
         | jlokier wrote:
         | > - 96GB of weights. You won't be able to run this on your home
         | GPU.
         | 
         | You can these days, even in a portable device running on
         | battery.
         | 
         | 96GB fits comfortably in some laptop GPUs released this year.
        
           | michaelt wrote:
           | Be a lot cooler if you said what laptop, and how much
           | quantisation you're assuming :)
        
             | tvararu wrote:
             | They're probably referring to the new MacBook Pros with up
             | to 128GB of unified memory.
        
           | refulgentis wrote:
            | This is extremely misleading. Source: I've been working in local
            | LLMs for 10 months. Got my Mac laptop too. I'm bullish
           | too. But we shouldn't breezily dismiss those concerns out of
           | hand. In practice, it's single digit tokens a second on a
           | $4500 laptop for a model with weights half this size (Llama 2
           | 70B Q2 GGUF => 29 GB, Q8 => 36 GB)
        
             | coolspot wrote:
             | > $4500
             | 
             | Which is more than a price of RTX A6000 48gb ($4k used on
             | ebay)
        
               | CamperBob2 wrote:
               | How fast does it run on that?
        
               | brucethemoose2 wrote:
                | Which is outrageously priced, in case that's not clear.
                | It's a 2020 RTX 3090 with doubled-up memory ICs, which is
                | not much extra BoM.
        
               | baq wrote:
               | Clearly it's worth what people are willing to pay for it.
               | At least it isn't being used to compute hashes of virtual
               | gold.
        
               | tucnak wrote:
               | People are also willing to die for all kinds of stupid
               | reasons, and it's not indicative of _anything_ let alone
               | a clever comment on the online forum. Show some decorum,
               | please!
        
               | brucethemoose2 wrote:
                | It's an artificial supply constraint due to artificial
                | market segmentation enabled by Nvidia/AMD.
                | 
                | Honestly it's crazy that AMD indulges in this, especially
               | now. Their workstation market share is comparatively
               | tiny, and instead they could have a swarm of devs (like
               | me) pecking away at AMD compatibility on AI repos if they
               | sold cheap 32GB/48GB cards.
        
             | MacsHeadroom wrote:
             | Mixtral 8x7b only needs 12B of weights in RAM per
             | generation.
             | 
             | 2B for the attention head and 5B from each of 2 experts.
             | 
              | It should be able to run slightly faster than a 13B dense
             | model, in as little as 16GB of RAM with room to spare.
        
               | filterfiber wrote:
               | > in as little as 16GB of RAM with room to spare.
               | 
                | I don't think that's the case; for full speed you still
                | need (5B*8)/2 + 2 + a few B of overhead.
                | 
                | I think the experts are chosen per-token? That means that
                | yes, you technically only need two in VRAM plus
                | router/overhead per token, but you'll have to
                | constantly be loading in different experts unless you can
                | fit them all, which would still be terrible for
                | performance.
               | 
               | So you'll still be PCIE/RAM speed limited unless you can
               | fit all of the experts into memory (or get really lucky
               | and only need two experts).
        
         | miven wrote:
         | > You won't be able to run this on your home GPU.
         | 
         | As far as I understand in a MOE model only one/few experts are
         | actually used at the same time, shouldn't the inference speed
         | for this new MOE model be roughly the same as for a normal
         | Mistral 7B then?
         | 
          | 7B models have a reasonable throughput when run on a beefy CPU,
         | especially when quantized down to 4bit precision, so couldn't
         | Mixtral be comfortably ran on a CPU too then, just with 8 times
         | the memory footprint?
        
           | filterfiber wrote:
           | So this specific model ships with a default config of 2
           | experts per token.
           | 
           | So you need roughly two loaded in memory per token. Roughly
           | the speed and memory of a 13B per token.
           | 
            | Only issue is that's per-token. 2 experts are chosen per
           | token, which means if they aren't the same ones as the last
           | token, you need to load them into memory.
           | 
           | So yeah to not be disk limited you'd need roughly 8 times the
           | memory and it would run at the speed of a 13B model.
           | 
           | ~~~Note on quantization, iirc smaller models lose more
           | performance when quantized vs larger models. So this would be
           | the speed of a 4bit 13B model but with the penalty from a
           | 4bit 7B model.~~~ Actually I have zero idea how quantization
           | scales for MoE, I imagine it has the penalty I mentioned but
           | that's pure speculation.
        
       | skghori wrote:
       | multimodal? 32k context is pretty impressive, curious to test
       | instructability
        
         | brucethemoose2 wrote:
         | MistralLite is already 32K, and Yi 200K actually works pretty
         | well out to at least 75K (the most I tested)
        
           | civilitty wrote:
           | What kind of tests did you run out to that length? (Needle in
           | haystack, summarization, structured data extraction, etc)
           | 
           | What is the max number of tokens in the output?
        
             | brucethemoose2 wrote:
             | Long stories mostly, either novel or chat format. Sometimes
              | summarization or insights, notably tests that you couldn't
              | possibly do with RAG chunking. Mostly short responses, not
             | rewriting documents or huge code blocks or anything like
             | that.
             | 
             | MistralLite is basically overfit to summarize and retrieve
              | in its 32K context, but it's extremely good at that for a
              | 7B. It's kinda useless for anything else.
             | 
             | Yi 200K is... smart with the long context. An example I
             | often cite is a Captain character in a story I 'wrote' with
             | the llm. A Yi 200K finetune generated a debriefing for like
             | 40K of context in a story, correctly assessing what plot
             | points should be kept secret _and_ making some very
             | interesting deductions. You could never possibly do that
             | with RAG on a 4K model, or even models that  "cheat" with
             | their huge attention like Anthropic.
             | 
             | I test at 75K just because that's the most my 3090 will
             | hold.
        
       | cloudhan wrote:
        | Might be the training code related to the model:
       | https://github.com/mistralai/megablocks-public/tree/pstock/m...
        
         | cloudhan wrote:
         | Mixtral-8x7B support --> Support new model
         | 
         | https://github.com/stanford-futuredata/megablocks/pull/45
        
       | mareksotak wrote:
        | Some companies spend weeks on landing pages, demos and cute,
        | thought-through promo videos, and then there is Mistral, casually
       | dropping a magnet link on Friday.
        
         | tarruda wrote:
         | I'm curious about their business model.
        
           | jorge-d wrote:
            | Well, so far their business model seems to be mostly centered
            | on raising money[1]. I do hope they succeed in becoming a
            | successful contender against OpenAI.
           | 
           | [1]
           | https://www.bloomberg.com/news/articles/2023-12-04/openai-
           | ri...
        
             | udev4096 wrote:
             | https://archive.ph/4F3dT
        
           | nuz wrote:
           | They can make plenty by offering consulting fees for
           | finetuning and general support around their models.
        
             | realce wrote:
             | "plenty" is not a word some of these people understand
             | however
        
             | behnamoh wrote:
             | You mean they put on a Redhat?
        
         | tananaev wrote:
         | I'm sure it's also a marketing move to build a certain
         | reputation. Looks like it's working.
        
           | OscarTheGrinch wrote:
           | Not geoblocking the entirety of Europe also makes them stand
           | out like a ringmaster amongst clowns.
        
             | moffkalast wrote:
             | Well they are French after all. They should be geoblocking
             | the USA in response for a bit to make a point lol.
        
               | fredoliveira wrote:
               | Not with their cap table, they won't ;-)
        
             | peanuty1 wrote:
             | Google Bard is still not available in Canada.
        
               | oh_sigh wrote:
               | Are there some regulatory reasons why it would not be
               | available? It seems weird if Google would intentionally
               | block users merely to block them.
        
               | mrandish wrote:
               | I think there are still some pretty onerous laws about
               | French localization of products and services made
               | available in the French-speaking part of Canada. Could be
               | that...
        
               | simonerlic wrote:
               | I originally thought so too, but as far as I know Bard is
                | available in France, so I have a feeling that language
               | isn't the roadblock here.
        
       | sergiotapia wrote:
       | Stuck on "Retrieving data" from the Magnet link and "Downloading
       | metadata" when adding the magnet to the download list.
       | 
       | I had to manually add these trackers and now it works:
       | https://gist.github.com/mcandre/eab4166938ed4205bef4
        
       | sigmar wrote:
       | Not exactly similar companies in terms of their goals, but pretty
       | hilarious to contrast this model announcement with Google's
       | Gemini announcement two days ago.
        
       | aubanel wrote:
       | Mistral sure does not bother too much with explanations, but this
       | style gives me much more confidence in the product than Google's
       | polished, corporate, soulless announcement of Gemini!
        
         | brucethemoose2 wrote:
         | I will take weights over docs.
         | 
          | It does remind me how some Google employee was bragging that
          | they disclosed the weights for the Gemini, and _only_ the small
          | mobile Gemini, as if that's a generous step over other
         | companies.
        
           | refulgentis wrote:
           | I don't think that's true, because quite simply, they have
           | not.
           | 
           | I am 100% in agreement with your viewpoint, but feel
           | squeamish seeing an un-needed lie coupled to it to justify
           | it. Just so much Othering these days.
        
             | brucethemoose2 wrote:
             | I was referencing this tweet:
             | 
             | https://twitter.com/zacharynado/status/1732425598465900708
             | 
             | (Alt:
             | https://nitter.net/zacharynado/status/1732425598465900708)
             | 
             | That is fair though, this was an impulsive addition on my
             | part.
        
       | udev4096 wrote:
       | https://nitter.rawbit.ninja/MistralAI/status/173315051239503...
        
       | udev4096 wrote:
       | based mistral casually dropping a magnet link
        
       | manojlds wrote:
       | Google - Fake demo
       | 
       | Mistral - magnet link and that's it
        
       | maremmano wrote:
        | Do you need some fancy announcement? Let's do it the 90s way:
       | https://twitter.com/erhartford/status/1733159666417545641/ph...
        
         | eurekin wrote:
          | I find that way more bold and confident than dropping an
          | obviously manipulated and unrealistic marketing page or video.
        
           | maremmano wrote:
           | Frankly I don't know why Google continues to act this way.
            | Let's recall the "Google Duplex: A.I. Assistant Calls Local
            | Businesses To Make Appointments" story:
           | https://www.youtube.com/watch?v=D5VN56jQMWM
           | 
           | Not that this affects Google's user base in any way, at the
           | moment.
        
             | eurekin wrote:
             | They obviously have both money and great talent. Maybe they
             | put out minimal effort only for investors that expect their
             | presence in consumer space?
        
             | polygamous_bat wrote:
             | > Frankly I don't know why Google continues to act this
             | way.
             | 
             | Unfortunately, that's because they have Wall St. analysts
             | looking at their videos who will (indirectly) determine how
             | big of a bonus Sundar and co takes home at the end of the
             | year. Mistral doesn't have to worry about that.
        
               | eurekin wrote:
               | This makes so much sense! Thanks
        
         | seydor wrote:
         | FILE_ID.DIZ
        
       | brucethemoose2 wrote:
       | In other llm news, Mistral/Yi finetunes trained with a new (still
       | undocumented) technique called "neural alignment" are blasting
        | other models on the HF leaderboard. The 7B is "beating" most
       | 70Bs. The 34B in testing seems... Very good:
       | 
       | https://huggingface.co/fblgit/una-xaberius-34b-v1beta
       | 
       | https://huggingface.co/fblgit/una-cybertron-7b-v2-bf16
       | 
       | I mention this because it could theoretically be applied to
       | Mistral Moe. If the uplift is the same as regular Mistral 7B, and
       | Mistral Moe is good, the end result is a _scary_ good model.
       | 
       | This might be an inflection point where desktop-runnable OSS is
       | really breathing down GPT-4's neck.
        
         | _boffin_ wrote:
          | Interesting. One thing I noticed is that Mistral has a
         | `max_position_embeddings` of ~32k while these have it at 4096.
         | 
         | Any thoughts on that?
        
           | brucethemoose2 wrote:
            | It's complicated.
           | 
           | The 7B model (cybertron) is trained on Mistral. Mistral is
            | _technically_ a 32K model, but it uses sliding window
            | attention, and for all practical purposes in current
            | implementations it behaves like an 8K model.
           | 
           | The 34B model is based on Yi 34B, which is inexplicably
           | marked as a 4K model in the config but actually works out to
           | 32K if you literally just edit that line. Yi also has a 200K
           | base model... and I have no idea why they didn't just train
           | on that. You don't need to finetune at long context to
           | preserve its long context ability.
        
         | stavros wrote:
         | Aren't LLM benchmarks at best irrelevant, at worst lying, at
         | this point?
        
           | nabakin wrote:
           | More or less. The automated benchmarks themselves can be
           | useful when you weed out the models which are overfitting to
           | them.
           | 
           | Although, anyone claiming a 7b LLM is better than a well
           | trained 70b LLM like Llama 2 70b chat for the general case,
           | doesn't know what they are talking about.
           | 
           | In the future will it be possible? Absolutely, but today we
           | have no architecture or training methodology which would
           | allow it to be possible.
           | 
           | You can rank models yourself with a private automated
           | benchmark which models don't have a chance to overfit to or
           | with a good human evaluation study.
           | 
           | Edit: also, I guess OP is talking about Mistral finetunes
           | (ones overfitting to the benchmarks) beating out 70b models
           | on the leaderboard because Mistral 7b is lower than Llama 2
           | 70b chat.
        
             | brucethemoose2 wrote:
              | I'm not saying it's better than 70B, just that it's very
             | strong from what others are saying.
             | 
             | Actually I am testing the 34B myself (not the 7B), and it
             | seems good.
        
               | fblgit wrote:
                | UNA: Uniform Neural Alignment. Haven't you noticed yet?
                | Each model that I uniform behaves like a pre-trained..
                | and you can likely fine-tune it again without damaging
                | it.
                | 
                | If you chatted with them, you know.. that strange
                | sensation, you know what it is.. Intelligence.
                | Xaberius-34B is the highest performer on the board, and
                | is NOT contaminated.
        
               | valine wrote:
               | How much data do you need for UNA? Is a typical fine
               | tuning dataset needed or can you get away with less than
               | that?
        
               | fblgit wrote:
                | It doesn't require much data; on a 7B it can take a couple
                | of hours ~
        
               | brucethemoose2 wrote:
                | In addition to what was said, if it's anything like DPO
               | you don't need a _lot_ of data, just a good set. For
               | instance, DPO requires  "good" and "bad" responses for
               | each given prompt.
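                | 
                | For illustration, a DPO-style preference record typically
                | looks something like this (field names vary by trainer;
                | the content here is made up):
                | 
                |     example = {
                |         "prompt": "What does a mixture-of-experts layer do?",
                |         "chosen": "A router picks a few expert FFNs per token...",
                |         "rejected": "It always runs every expert and averages them.",
                |     }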
        
             | elcomet wrote:
             | We can only compare specific training procedures though.
             | 
             | With a 7b and a 70b trained the same way, the 70b should
             | always be better
        
             | mistrial9 wrote:
              | Quick to assert an authoritative opinion - yet the one word
              | "better" belies the message? Certainly there are more
              | dimensions worth including in the rating?
        
             | airgapstopgap wrote:
             | > today we have no architecture or training methodology
             | which would allow it to be possible.
             | 
             | We clearly see that Mistral-7B is in some important,
             | representative respects (eg coding) superior to
             | Falcon-180B, and superior across the board to stuff like
             | OPT-175B or Bloom-175B.
             | 
             | "Well trained" is relative. Models are, overwhelmingly,
             | functions of their data, not just scale and architecture.
             | Better data allows for yet-unknown performance jumps, and
             | data curation techniques are a closely-guarded secret. I
             | have no doubt that a 7B beating our _best_ 60-70Bs is
             | possible already, eg using something like Phi methods for
             | data and more powerful architectures like some variation of
             | universal transformer.
        
           | brucethemoose2 wrote:
           | Yes, absolutely. I was just preaching this.
           | 
            | But it's not _totally_ irrelevant. They are still a datapoint
           | to consider with some performance correlation. YMMV, but
           | these models actually seem to be quite good for the size in
           | my initial testing.
        
           | typon wrote:
           | Yes. The only thing that is relevant is a hidden benchmark
           | that's never released and run by a trusted third party.
        
           | puttycat wrote:
           | I wonder how it will rank on benchmarks which are password-
           | protected to prevent test contamination, for example:
           | https://github.com/taucompling/bliss
        
         | fblgit wrote:
         | Correct. UNA can align the MoE at multiple layers, experts,
         | nearly any part of the neural network I would say. Xaberius 34B
         | v1 "BETA".. is the king, and its just that.. the beta. I'll be
         | focusing on the Mixtral, its a christmas gift.. modular in that
         | way, thanks for the lab @mistral!
        
           | brucethemoose2 wrote:
           | Do a Yi 200K version as well! That would make my Christmas,
           | as Mistral Moe is only maybe 32K.
        
         | swyx wrote:
         | what is neural alignment? who came up with it?
        
       | MyFirstSass wrote:
        | Hot take, but Mistral 7B is the actual state of the art of LLMs.
        | 
        | ChatGPT 4 is amazing, yes, and I've been a day 1 subscriber, but
        | it's huge, runs on server farms far away and is more or less a
        | black box.
        | 
        | Mistral is tiny, and amazingly coherent and useful for its size
        | for both general questions and code, uncensored, and a leap I
        | wouldn't have believed possible in just a year.
       | 
       | I can run it on my Macbook Air at 12tkps, can't wait to try this
       | on my desktop.
        
         | tarruda wrote:
         | > I can run it on my Macbook Air at 12tkps, can't wait to try
         | this on my desktop.
         | 
         | That seems kinda low, are you using Metal GPU acceleration with
         | llama.cpp? I don't have a macbook, but saw some of the
         | llama.cpp benchmarks that suggest it can reach close to 30tk/s
         | with GPU acceleration.
        
           | MyFirstSass wrote:
            | Thanks for the tip. I'm on the M2 Air with 16 GB of RAM.
            | 
            | If anyone is getting faster than 12tkps on Airs, let me know.
           | 
           | I'm using the LM Studio GUI over llama.cpp with the "Apple
           | Metal GPU" option. Increasing CPU threads seemingly does
           | nothing either without metal.
           | 
           | Ram usage hovers at 5.5GB with a q5_k_m of Mistral.
        
             | M4v3R wrote:
             | Try different quantization variations. I got vastly
             | different speeds depending on which quantization I chose. I
             | believe q4_0 worked very well for me. Although for a 7B
             | model q8_0 runs just fine too with better quality.
        
         | andy_xor_andrew wrote:
         | I am with you on this. Mistral 7B is amazingly good. There are
         | finetunes of it (the Intel one, and Berkeley Starling) that
         | feel like they are within throwing distance of gpt3.5T... at
         | only 7B!
         | 
         | I was really hoping for a 13B Mistral. I'm not sure if this MOE
         | will run on my 3090 with 24GB. Fingers crossed that
         | quantization + offloading + future tricks will make it
         | runnable.
        
           | MyFirstSass wrote:
            | True, I've been using the OpenOrca finetune and just
            | downloaded the new UNA Cybertron model, both tuned on the
            | Mistral base.
            | 
            | They are not far from GPT-3 logic-wise, I'd say, if you
            | consider the breadth of data, i.e. very little in 7GB; so
            | missing other languages, niche topics, prose styles etc.
            | 
            | I honestly wouldn't be surprised if a 13B would be
            | indistinguishable from GPT-3.5 on some levels. And if that is
            | the case - then coupled with the latest developments in
            | decoding - like UltraFastBERT, speculative, Jacobi, lookahead
            | etc. - I wouldn't be surprised to see local LLMs at
            | current GPT-4 level within a few years.
        
         | 123yawaworht456 wrote:
         | it really is. it feels at the very least equal to llama2 13b.
         | if mistral 70b had existed and was as much an improvement over
          | llama2 70b as it is at 7b size, it would definitely be on par
          | with gpt3.5
        
         | ipsum2 wrote:
         | State of the art for something you can run on a Macbook air,
         | but not state of the art for LLMs, or even open source. Yi 34B
         | and Llama2 70B still beat it.
        
           | MyFirstSass wrote:
           | True but it's ahead of the competition when size is
           | considered, which is why i really look forward to their 13B,
           | 33B models etc. because if they are as potent who knows what
           | leaps we'll take soon.
           | 
           | I remember running llama1 33B 8 months ago that as i remember
           | was on Mistral 7B's level while other 7B models were a
           | rambling mess.
           | 
           | The jump in "potency" is what is so extreme.
        
         | emporas wrote:
          | Given that 50% of all information consumed on the internet was
          | produced in the last 24 hours, smaller models could hold a
         | serious advantage over bigger models.
         | 
         | If an LLM or a SmallLM can be retrained or fine-tuned
         | constantly, every week or every day to incorporate recent
         | information then outdated models trained a year or two years
         | back hold no chance to keep up. Dunno about the licensing but
         | OpenAI could incorporate a smaller model like Mistral7B into
         | their GPT stack, re-train it from scratch every week, and
         | charge the same as GPT-4. There are users who might certainly
         | prefer the weaker, albeit updated models.
        
         | nabakin wrote:
         | Not a hot take, I think you're right. If it was scaled up to
         | 70b, I think it would be better than Llama 2 70b. Maybe if it
         | was then scaled up to 180b and turned into a MoE it would be
         | better than GPT-4.
        
       | seydor wrote:
        | Looks like they're too busy being awesome. I need a fake video to
       | understand this!
       | 
       | What memory will this need? I guess it won't run on my 12GB of
       | vram
       | 
       | "moe": {"num_experts_per_tok": 2, "num_experts": 8}
       | 
       | I bet many people will re-discover bittorrent tonight
        
         | brucethemoose2 wrote:
         | Looks like it will squeeze into 24GB once the llama runtimes
         | work it out.
         | 
          | It's also a good candidate for splitting across small GPUs,
         | maybe.
         | 
         | One architecture I can envision is hosting prompt ingestion and
         | the "host" model on the GPU and the downstream expert model
         | weights on the CPU /IGP. This is actually pretty efficient, as
         | the CPU/IGP is really bad at the prompt ingestion but
         | reasonably fast at ~14B token generation.
         | 
         | Llama.cpp all but already does this, I'm sure MLC will
         | implement it as well.
        
         | syntaxing wrote:
         | BitTorrent was the craze when llama was leaked on torrent. Then
         | Facebook started taking down all huggingface repos and a bunch
          | of people transitioned to torrent releases temporarily. Llama 2
          | changed all this, but it was a fun time.
        
       | cuuupid wrote:
       | Stark contrast with Google's "all demo no model" approach from
       | earlier this week! Seems to be trained off Stanford's Megablocks:
       | https://github.com/mistralai/megablocks-public
        
       | BryanLegend wrote:
       | Andrej Karpathy's take:
       | 
       | New open weights LLM from @MistralAI
       | 
        | params.json:
        | - hidden_dim / dim = 14336/4096 => 3.5X MLP expand
        | - n_heads / n_kv_heads = 32/8 => 4X multiquery
        | - "moe" => mixture of experts 8X top 2
        | 
        | Likely related code: https://github.com/mistralai/megablocks-public
       | 
       | Oddly absent: an over-rehearsed professional release video
       | talking about a revolution in AI.
       | 
       | If people are wondering why there is so much AI activity right
       | around now, it's because the biggest deep learning conference
       | (NeurIPS) is next week.
       | 
       | https://twitter.com/karpathy/status/1733181701361451130
        
         | henrysg wrote:
         | > Oddly absent: an over-rehearsed professional release video
         | talking about a revolution in AI.
        
       | ahmetkca wrote:
       | Let's go multimodal
        
       | Jayakumark wrote:
       | https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen
        
       | maremmano wrote:
        | Does anyone know if I can run this on a MacBook Pro M3 Max 128GB? At what TPS?
        
         | deoxykev wrote:
         | I would like to know this as well.
        
         | M4v3R wrote:
         | Big chance that you'll be able to run it using Ollama app soon
         | enough.
        
         | marci wrote:
         | If I understand correctly:
         | 
          | RAM-wise, you can easily run a 70b with 128GB, and 8x7B is
          | obviously less than that.
          | 
          | Compute-wise, I suppose it would be a bit slower than running a
         | 13b.
         | 
         | edit: "actually", I think it might be faster than a 13b. 8
         | random 7b ~= 115GB, Mixtral is under 90. I will have to wait
         | for more info/understanding.
        
         | treprinum wrote:
         | I would say so based on LLaMA 2 70B; if it's 8x inference in
         | MoE then I guess you'd see <20 tokens/sec?
        
       | asolidtime1 wrote:
       | https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen/b...
       | 
       | Holy shit, this is some clever marketing.
       | 
       | Kinda wonder if any of their employees were part of the warez
       | scene at some point.
        
       | poulpy123 wrote:
       | is it eight 7b models in a trench coat ?
        
       | fortunefox wrote:
       | Releasing a model with a magnet link and some ascii art gives me
       | way more confidence in the product than any OpenAI blog post ever
       | could.
       | 
       | Excited to play with this once it's somewhat documented on how to
       | get it running on a dual 4090 Setup.
        
       | smlacy wrote:
       | https://nitter.net/MistralAI/status/1733150512395038967
        
       | leobg wrote:
       | I love Mistral.
       | 
       | It's crazy what can be done with this small model and 2 hours of
       | fine tuning.
       | 
       | Chatbot with function calling? Check.
       | 
       | 90 +% accuracy multi label classifier, even when you only have 15
       | examples for each label? Check.
       | 
       | Craaaazy powerful.
        
       | _fizz_buzz_ wrote:
        | Does anybody have a tutorial or documentation on how I can run this
        | and play around with it locally? A "getting started" guide of
        | sorts?
        
         | 0cf8612b2e1e wrote:
         | Even better if a llamafile gets released.
        
       ___________________________________________________________________
       (page generated 2023-12-08 23:00 UTC)