[HN Gopher] Mistral NeMo
___________________________________________________________________
Mistral NeMo
Author : bcatanzaro
Score : 353 points
Date : 2024-07-18 14:45 UTC (8 hours ago)
(HTM) web link (mistral.ai)
(TXT) w3m dump (mistral.ai)
| pantulis wrote:
| Does it have any relation to Nvidia's Nemo? Otherwise, it's
| unfortunate naming
| markab21 wrote:
| It looks like it was built jointly with nvidia:
| https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct
| refulgentis wrote:
| Click the link, read the first sentence.
| pantulis wrote:
| Yeah, not my brightest HN moment, to be honest.
| SubiculumCode wrote:
| At least you didn't ask about finding a particular fish.
| yjftsjthsd-h wrote:
| > Today, we are excited to release Mistral NeMo, a 12B model
| built in collaboration with NVIDIA. Mistral NeMo offers a large
| context window of up to 128k tokens. Its reasoning, world
| knowledge, and coding accuracy are state-of-the-art in its size
| category. As it relies on standard architecture, Mistral NeMo is
| easy to use and a drop-in replacement in any system using Mistral
| 7B.
|
| > We have released pre-trained base and instruction-tuned
| checkpoints under the Apache 2.0 license to promote
| adoption for researchers and enterprises. Mistral NeMo was
| trained with quantisation awareness, enabling FP8 inference
| without any performance loss.
|
| So that's... uniformly an improvement at just about everything,
| right? Large context, permissive license, should have good perf.
| The one thing I can't tell is how big 12B is going to be (read:
| how much VRAM/RAM is this thing going to need). Annoyingly and
| rather confusingly for a model under Apache 2.0,
| https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
| refuses to show me files unless I login and "You need to agree to
| share your contact information to access this model"... though if
| it's actually as good as it looks, I give it hours before it's
| reposted without that restriction, which Apache 2.0 allows.
| xena wrote:
| Easy head math: parameter count times parameter size, plus
| 20-40% for inference slop space. Anywhere from 8-40GB of VRAM
| required depending on the quantization level being used.
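|
| Concretely, a minimal sketch in Python (the 25% overhead is just
| a midpoint of that 20-40% range, not a measured number):
|
|     # Rough VRAM estimate: weights (params * bytes per param)
|     # plus a fudge factor for activations and runtime overhead.
|     def vram_estimate_gb(params_b, bits_per_param, overhead=0.25):
|         weights_gb = params_b * bits_per_param / 8
|         return weights_gb * (1 + overhead)
|
|     for bits in (16, 8, 4):
|         print(f"12B at {bits}-bit: ~{vram_estimate_gb(12, bits):.1f} GB")
|     # 12B at 16-bit: ~30.0 GB
|     # 12B at 8-bit: ~15.0 GB
|     # 12B at 4-bit: ~7.5 GB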
| imtringued wrote:
| They did quantization aware training for fp8 so you won't get
| any benefits from using more than 12GB of RAM for the
| parameters. What might use more RAM is the much bigger
| context window.
| exe34 wrote:
| tensors look about 20gb. not sure what that's like in vram.
| kelsey98765431 wrote:
| same size
| renewiltord wrote:
| According to nvidia https://blogs.nvidia.com/blog/mistral-
| nvidia-ai-model/ it was made to fit on a 4090 so it should work
| with 24 GB.
| bernaferrari wrote:
| if you want to be lazy, 7B = 7GB of VRAM, 12B = 12GB of VRAM,
| but with quantization you might be able to get by with ~6-8GB.
| So any 16GB MacBook could run it (but not much else).
| hislaziness wrote:
| isn't it 2 bytes (fp16) per param. so 7b = 14 GB+some for
| inference?
| ancientworldnow wrote:
| This was trained to be run at FP8 with no quality loss.
| fzzzy wrote:
| it's very common to run local models in 8 bit int.
| qwertox wrote:
| Yes, but it's not common for the original model to be 8
| bit int. The community can downgrade any model to 8 bit
| int, but it's always linked to quality loss.
| Bumblonono wrote:
| It fits a 4090. Nvidia lists the models and therefore I assume
| 24GB is the minimum.
| michaelt wrote:
| A 4090 will _just narrowly_ fit a 34B parameter model at
| 4-bit quantisation.
|
| A 12B model will run on a 4090 with plenty room to spare,
| even with 8-bit quantisation.
| wongarsu wrote:
| You could consider the improvement in model performance a bit
| of a cheat - they beat other models "in the same size category"
| that have 30% fewer parameters.
|
| I still welcome this approach. 7B seems like a dead end in
| terms of reasoning and generalization. They are annoyingly
| close to statistical parrots, a world away from the moderate
| reasoning you get in 70B models. Any use case where that's
| useful can increasingly be filled by even smaller models, so
| chasing slightly larger models to get a bit more "intelligence"
| might be the right move
| mistercheph wrote:
| I strongly disagree, have you used fp16 or q8 llama3 8b?
| amrrs wrote:
| >reasoning and generalization
|
| any example use-cases or prompts? how do you define those?
| yjftsjthsd-h wrote:
| I actually meant execution speed from quantisation awareness
| - agreed that comparing against smaller models is a bit
| cheating.
| qwertox wrote:
| Aren't small models useful for providing a language-based
| interface - spoken or in writing - to any app? Tuned
| specifically for that app or more likely enriched via RAG and
| possibly also by using function calling?
|
| It doesn't have to be intelligent like we expect it from the
| top-tier, huge models, just capable of understanding some
| words in sentences, mostly commands, and how to react to
| them.
| imtringued wrote:
| Except Llama 3 8B is a significant improvement over Llama 2,
| which was so weak that a whole community was building fine-tunes
| on much smaller budgets that beat what the multi-billion-dollar
| company could do. With Llama 3 8B, things have shifted: far
| fewer community fine-tunes actually beat it. The fact that
| Mistral AI can still build models that beat it means the company
| isn't falling too far behind a significantly better-equipped
| competitor.
|
| What's more irritating is that they decided to do
| quantization aware training for fp8. int8 quantization
| results in an imperceptible loss of quality that is difficult
| to pick up in benchmarks. They should have gone for something
| more aggressive like 4-bit, where quantization leads to a
| significant loss in quality.
| Workaccount2 wrote:
| Is "Parameter Creep" going to become a thing? They hold up
| Llama-8b as a competitor despite NeMo having 50% more parameters.
|
| The same thing happened with gemma-27b, where they compared it to
| all the 7-9b models.
|
| It seems like an easy way to boost benchmarks while coming off as
| "small" at first glance.
| eyeswideopen wrote:
| As written here: https://huggingface.co/nvidia/Mistral-
| NeMo-12B-Instruct
|
| "It significantly outperforms existing models smaller or
| similar in size." is a statement that goes in that direction
| and would allow the comparison of a 1.7T param model with a 7b
| one
| voiper1 wrote:
| Oddly, they are only charging slightly more for their hosted
| version:
|
| open-mistral-7b is 25c/m tokens
|
| open-mistral-nemo-2407 is 30c/m tokens
|
| https://mistral.ai/technology/#pricing
| dannyw wrote:
| Possibly an NVIDIA subsidy. You run NeMo models, you get
| cheaper GPUs.
| Palmik wrote:
| They specifically call out FP8-aware training, and TensorRT-LLM
| is really good (efficient) with FP8 inference on H100 and
| other Hopper cards. It's possible that they run the 7B
| natively in FP16, as smaller models suffer more from even
| "modest" quantization like this.
| causal wrote:
| Yeah it will be interesting to see if we ever settle on
| standard sizes here. My preference would be:
|
| - 3B for CPU inference or running on edge devices.
|
| - 20-30B for maximizing single consumer GPU potential.
|
| - 70B+ for those who can afford it.
|
| 7-9B never felt like an ideal size.
| marci wrote:
| For the benchmarks, it depends on how you interpret them. The
| other models are quite popular, so many people have a starting
| point. Now, if you regularly use them you can assess: "just 3%
| better on some benchmark, 80 to 83, at the cost of almost
| twice the inference time and base RAM requirement, but with a
| 16x context window, and for commercial usage..." and at the end,
| "for my use case, is it worth it?"
| pants2 wrote:
| Exciting, I think 12B is the sweet spot for running locally -
| large enough to be useful, fast enough to run on a decent laptop.
| _flux wrote:
| How much memory does employing the complete 128k window take,
| though? I've sadly noticed that it can take a significant
| amount of VRAM to use a larger context window.
|
| edit: e.g. I wouldn't know the correct parameters for this
| calculator, but going from 8k window to 128k window goes from
| 1.5 GB to 23 GB: https://huggingface.co/spaces/NyxKrage/LLM-
| Model-VRAM-Calcul...
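|
| For a rough sense of where numbers like that come from: the KV
| cache grows linearly with context length. A minimal sketch, where
| the layer/head counts are illustrative assumptions for a ~12B GQA
| model rather than confirmed Mistral NeMo specs:
|
|     # KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
|     #                  * bytes per element * tokens
|     def kv_cache_gib(tokens, n_layers=40, n_kv_heads=8,
|                      head_dim=128, bytes_per_elem=2):
|         per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
|         return per_token * tokens / 2**30
|
|     for ctx in (8_192, 32_768, 131_072):
|         print(f"{ctx:>7} tokens: ~{kv_cache_gib(ctx):.1f} GiB")
|     #    8192 tokens: ~1.2 GiB
|     #   32768 tokens: ~5.0 GiB
|     #  131072 tokens: ~20.0 GiB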
| azeirah wrote:
| In practice, it's fine to stick with "just" 8k or 16k or 32k.
| If you're working with data of over 128k tokens I'd
| personally not recommend using an open model anyway unless
| you know what you're doing. The models are kinda there, but
| the hardware mostly isn't.
|
| This is only realistic right now for people with those
| unified-memory MacBooks or for enthusiasts with Epyc servers
| or a very high-end workstation built for inference.
|
| Anything above that I don't consider "consumer" inference.
| mythz wrote:
| IMO Google's Gemma2 27B [1] is the sweet spot for running
| locally on commodity 16GB VRAM cards.
|
| [1] https://ollama.com/library/gemma2:27b
| Raed667 wrote:
| If I "only" have 16GB of ram on a macbook pro, would that
| still work ?
| sofixa wrote:
| If it's an M-series one with "unified memory" (shared RAM
| between the CPU, GPU and NPU on the same chip), yes.
| mysteria wrote:
| Keep in mind that Gemma is a larger model but it only has 8k
| context. The Mistral 12B will need less VRAM to store the
| weights but you'll need a much larger KV cache if you intend
| to use the full 128k context, especially if the KV is
| unquantized. Not sure if this new model has GQA but those
| without it absolutely eat memory when you increase the
| context size (looking at you Command R).
| minimaxir wrote:
| > Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken,
| that was trained on over more than 100 languages, and compresses
| natural language text and source code more efficiently than the
| SentencePiece tokenizer used in previous Mistral models.
|
| Does anyone have a good answer why everyone went back to
| SentencePiece in the first place? Byte-pair encoding (which is
| what tiktoken uses: https://github.com/openai/tiktoken) was shown
| to be a more efficient encoding as far back as GPT-2 in 2019.
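|
| For reference, a minimal BPE example with the tiktoken library
| (using the cl100k_base vocabulary purely as a stand-in, since
| Tekken's actual vocabulary isn't shipped with tiktoken):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     text = "def greet(name):\n    return f'Hello, {name}!'"
|     tokens = enc.encode(text)
|     print(len(tokens), tokens[:8])     # token count plus first few ids
|     print(enc.decode(tokens) == text)  # BPE round-trips losslessly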
| rockinghigh wrote:
| The SentencePiece library also implements Byte-pair-encoding.
| That's what the LLaMA models use and the original Mistral
| models were essentially a copy of LLaMA2.
| zwaps wrote:
| SentencePiece is not a different algorithm to WordPiece or BPE,
| despite its naming.
|
| One of the main pulls of the SentencePiece library was the pre-
| tokenization being less reliant on white space and therefore
| more adaptable to non Western languages.
| p1esk wrote:
| Interesting how it will compete with 4o mini.
| pixelatedindex wrote:
| Pardon me if this is a dumb question, but is it possible for me
| to download these models into my computer (I have a 1080ti and a
| [2|3]070ti) and generate some sort of API interface? That way I
| can write programs that call this API, and I find this
| appealing.
|
| EDIT: This is a 1W light bulb moment for me, thank you!
| bezbac wrote:
| AFAIK, Ollama supports most of these models locally and will
| expose a REST API[0]
|
| [0]: https://github.com/ollama/ollama/blob/main/docs/api.md
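|
| A minimal sketch of calling that API from Python (the
| "mistral-nemo" tag is a guess at what the registry entry will be
| named once it lands; use whatever `ollama list` shows for you):
|
|     import requests
|
|     resp = requests.post(
|         "http://localhost:11434/api/generate",
|         json={
|             "model": "mistral-nemo",
|             "prompt": "Explain GQA in one sentence.",
|             "stream": False,  # one JSON object instead of a stream
|         },
|         timeout=300,
|     )
|     print(resp.json()["response"])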
| kanwisher wrote:
| llama.cpp or ollama both have apis for most models
| codetrotter wrote:
| I'd probably check https://ollama.com/library?q=Nemo in a
| couple of days. My guess is that by then ollama will have
| support for it. And you can then run the model locally on your
| machine with ollama.
| hedgehog wrote:
| Adding to this: If the default is too slow look at the more
| heavily quantized versions of the model, they are smaller at
| moderate cost in output quality. Ollama can split models
| between GPU and host memory but the throughput dropoff tends
| to be pretty severe.
| andrethegiant wrote:
| Why would it take a couple days? Is it not a matter of
| uploading the model to their registry, or are there more
| steps involved than that?
| HanClinto wrote:
| Ollama depends on llama.cpp as its backend, so if there are
| any changes that need to be made to support anything new in
| this model architecture or tokenizer, then it will need to
| be added there first.
|
| Then the model needs to be properly quantized and formatted
| for GGUF (the model format that llama.cpp uses), tested,
| and uploaded to the model registry.
|
| So there's some length to the pipeline that things need to
| go through, but overall the devs in both projects generally
| have things running pretty smoothly, and I'm regularly
| impressed at how quickly both projects get updated to
| support such things.
| codetrotter wrote:
| > I'm regularly impressed at how quickly both projects
| get updated to support such things.
|
| Same! Big kudos to all involved
| HanClinto wrote:
| Issue to track Mistral NeMo support in llama.cpp:
| https://github.com/ggerganov/llama.cpp/issues/8577
| Patrick_Devine wrote:
| We're working on it, except that there is a change to the
| tokenizer which we're still working through in our conversion
| scripts. Unfortunately we don't get a heads up from Mistral
| when they drop a model, so sometimes it takes a little bit of
| time to sort out the differences.
|
| Also, I'm not sure if we'll call it mistral-nemo or nemo yet.
| :-D
| simpaticoder wrote:
| Justine Tunney (of redbean fame) is actively working on getting
| LLMs to run well on CPUs, where RAM is cheap. If successful
| this would eliminate an enormous bottleneck to running local
| models. If anyone can do this, she can. (And thank you to
| Mozilla for financially supporting her work). See
| https://justine.lol/matmul/ and https://github.com/mozilla-
| Ocho/llamafile
| illusive4080 wrote:
| I love that domain name.
| wkat4242 wrote:
| I think it's mostly the memory bandwidth though that makes
| the GPUs so fast with LLMs. My card does about 1TB/s. CPU RAM
| won't come near that. I'm sure a lot of optimisations can be
| had but I think GPUs will still be significantly ahead.
|
| Macs are so good at it because Apple solders the memory on top
| of the SoC for a really wide and low-latency connection.
| simpaticoder wrote:
| This is a good and valid comment. It is difficult to
| predict the future, but I would be curious what the best-case
| theoretical performance of an LLM would be on a typical x86 or
| ARM system with DDR4 or DDR5 RAM. My uneducated guess is
| that it can be very good, perhaps 50% the speed of a
| specialized GPU/RAM device. In practical terms, the CPU
| approach is _required_ for very large contexts, up to as
| large as the lifetime of all interactions you have with
| your LLM.
| rustcleaner wrote:
| There's no good reason for consumer nvidia cards to lack
| SODIMM-like slots for video RAM, except to rake in big bucks
| and induce more hasty planned obsolescence.
| Raed667 wrote:
| First thing I did when I saw the headline was to look for it on
| ollama, but it didn't land there yet:
| https://ollama.com/library?sort=newest&q=NeMo
| Patrick_Devine wrote:
| We're working on it!
| RockyMcNuts wrote:
| You will need enough VRAM, 1080ti is not going to work very
| well, maybe get a 3090 with 24GB VRAM.
|
| I think it should also run well on a 36GB MacBook Pro or
| probably a 24GB Macbook Air
| nostromo wrote:
| Yes.
|
| If you're on a Mac, check out LM Studio.
|
| It's a UI that lets you load and interact with models locally.
| You can also wrap your model in an OpenAI-compatible API and
| interact with it programmatically.
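|
| A minimal sketch of the programmatic route with the openai
| client (port 1234 is LM Studio's usual default for its local
| server; adjust base_url and the model name to whatever the app
| shows):
|
|     from openai import OpenAI
|
|     # The local server ignores the API key, but the client needs one.
|     client = OpenAI(base_url="http://localhost:1234/v1",
|                     api_key="not-needed")
|     reply = client.chat.completions.create(
|         model="local-model",  # LM Studio largely ignores this field
|         messages=[{"role": "user",
|                    "content": "Summarize Mistral NeMo in one line."}],
|     )
|     print(reply.choices[0].message.content)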
| d13 wrote:
| Try Lm Studio or Ollama. Load up the model, and there you go.
| homarp wrote:
| llama.cpp supports multi gpu across local network
| https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...
|
| and exposes an OpenAI-compatible server, or you can use its
| python bindings
| saberience wrote:
| Two questions:
|
| 1) Anyone have any idea of VRAM requirements?
|
| 2) When will this be available on ollama?
| causal wrote:
| 1) Rule of thumb is # of params = GB at Q8. So a 12B model
| generally takes up 12GB of VRAM at 8 bit precision.
|
| But 4bit precision is still pretty good, so 6GB VRAM is viable,
| not counting additional space for context. Usually about an
| extra 20% is needed, but 128K is a pretty huge context so more
| will be needed if you need the whole space.
| alecco wrote:
| The model has 12 billion parameters and uses FP8, so 1 byte
| each. With some working memory I'd bet you can run it on 24GB.
|
| > Designed to fit on the memory of a single NVIDIA L40S, NVIDIA
| GeForce RTX 4090 or NVIDIA RTX 4500 GPU
| jorgesborges wrote:
| I'm AI stupid. Does anyone know if training on multiple languages
| provides "cross-over" -- so training done in German can be
| utilized when answering a prompt in English? I once went through
| various Wikipedia articles in a couple languages and the
| differences were interesting. For some reason I thought they'd be
| almost verbatim (forgetting that's not how Wikipedia works!) and
| while I can't remember exactly I felt they were sometimes starkly
| different in tone and content.
| bernaferrari wrote:
| no, it is basically like an 'auto-correct' spell checker on a
| phone. It only knows what it was trained on. But it has been
| shown that a coding LLM that has never seen a programming
| language or a library can "learn" a new one faster than, say, a
| generic LLM.
| StevenWaterman wrote:
| That's not true, LLMs can answer questions in one language
| even if they were only trained on that data in another
| language.
|
| I.e., if you train an LLM on both English and French in general,
| but only teach it a specific fact in French, it can give you
| that fact in English.
| hdhshdhshdjd wrote:
| Yes, you can write a prompt in English, give it French, and
| get an accurate answer in English even with the original
| Mistral.
|
| Still blows my mind we came so far so fast.
| miki123211 wrote:
| Generally yes, with caveats.
|
| There was some research showing that training a model on facts
| like "the mother of John Smith is Alice" but in German allowed
| it to answer questions like "who's the mother of John Smith",
| but not questions like "what's the name of Alice's child",
| regardless of language. Not sure if this holds at larger model
| sizes though, it's the sort of problem that's usually fixable
| by throwing more parameters at it.
|
| Language models definitely do generalize to some extent and
| they're not "stochastic parrots" as previously thought, but
| there are some weird ways in which we expect them to generalize
| but they don't.
| planb wrote:
| > Language models definitely do generalize to some extent and
| they're not "stochastic parrots" as previously thought, but
| there are some weird ways in which we expect them to
| generalize but they don't.
|
| Do you have any good sources that explain this? I was always
| thinking LLMs are indeed stochastic parrots, but language
| (that is the unified corpus of all languages in the training
| data) already inherently contains the ,,generalization". So
| the intelligence is encoded in the language humans speak.
| michaelt wrote:
| I don't have _explanations_ but I can point you to one of
| the papers: https://arxiv.org/pdf/2309.12288 which calls it
| "the reversal curse" and does a bunch of experiments
| showing models that are successful at questions like "Who
| is Tom Cruise's mother?" (Mary Lee Pfeiffer) will not be
| equally successful at answering "Who is Mary Lee Pfeiffer's
| son?"
| spookie wrote:
| Isn't that specific case just a matter of not having
| enough data _explicitly_ stating the reverse? Seems as if
| they are indeed stochastic parrots from that perspective.
| layer8 wrote:
| Yes, and your conclusion is correct.
| moffkalast wrote:
| > language already inherently contains the
| ,,generalization"
|
| The mental gymnastics required to handwave language model
| capabilities are getting funnier and funnier every day.
| dannyw wrote:
| Anecdata, but I did some continued pretraining on a toy LLM
| using a machine-translated version of the original dataset.
|
| Performance improved across all benchmarks in English (the
| original language).
| benmanns wrote:
| Am I understanding correctly? You took an English dataset,
| trained an LLM, machine translated the English dataset to
| e.g. Spanish, continued training the model, and performance
| for queries in English improved? That's really interesting.
| bionhoward wrote:
| There is evidence that code training helps with reasoning, so if
| you count code as another language, then this makes sense.
|
| https://openreview.net/forum?id=KIPJKST4gw
|
| Is symbolic language a fuzzy sort of code? Absolutely, because
| it conveys logic and information. TLDR: yes!
| madeofpalk wrote:
| I find it interesting how coding/software development still
| appears to be the one category that the most popular model
| makers release specialised models for. Where are the finance or
| legal models from Mistral or Meta or OpenAI?
|
| Perhaps it's just confirmation bias, but programming really does
| seem to be the ideal usecase for LLMs in a way that other
| professions just haven't been able to crack. Compared to other
| types of work, it's relatively more straightforward to tell if
| code is "correct" or not.
| a2128 wrote:
| Coding models solve a clear problem and have a clear
| integration into a developer's workflow - it's like your own
| personal StackOverflow and it can autocomplete code for you.
| It's not as clear when it comes to finance or legal, you
| wouldn't want to rely on an AI that may hallucinate financial
| numbers or laws. These other professions are also a lot slower
| to react to change, compared to software development where
| people are already used to learning new frameworks every year
| MikeKusold wrote:
| Those are regulated industries, whereas software development
| is not.
|
| An AI spitting back bad code won't compile. An AI spitting back
| bad financial/legal advice bankrupts people.
| knicholes wrote:
| Generally I agree! I saw a guy shamefully admit he didn't
| read the output carefully enough when using generated code
| (that ran), but there was a min() instead of a max(), and it
| messed up a month of his metrics!
| troupo wrote:
| The explanation is easier, I think. Consider what data these
| models are trained on, and who are the immediate developers of
| these models.
|
| The models are trained on a vast set of whatever is available
| on the internet. They are developed by tech people/programmers
| who are surprisingly blind to their own biases and interests.
| There's no surprise that one of the main things they want to
| try and do is programming, using vast open quantities of Stack
| Overflow, GitHub and various programming forums.
|
| For finance and legal you need to:
|
| - think a bit outside the box
|
| - be interested in finance and legal
|
| - be prepared to carry actual legal liability for the output of
| your models
| dannyw wrote:
| > - be prepared to carry actual legal liability for the
| output of your models
|
| Section 230.
|
| It's been argued that a response by a LLM, to user input, is
| "user-generated content" and hence the platform has generally
| no liability (except CSAM).
|
| Nobody has successfully sued.
| troupo wrote:
| No one has challenged this. Because LLMs haven't been
| (widely) used in legal or legally binding contexts
| moffkalast wrote:
| Then again, we just had this on the front page:
| https://news.ycombinator.com/item?id=40957990
|
| > We first document a significant decline in stock trading
| volume during ChatGPT outages and find that the effect is
| stronger for firms with corporate news released immediately
| before or during the outages. We further document similar
| declines in the short-run price impact, return variance, and
| bid-ask spreads, consistent with a reduction in informed
| trading during the outage periods. Lastly, we use trading
| volume changes during outages to construct a firm-level
| measure of the intensity of GAI-assisted trading and provide
| early evidence of a positive effect of GAI-assisted trading
| on long-run stock price informativeness.
|
| They're being used, but nobody is really saying anything
| because the stock market is a zero sum game these days and
| letting anyone else know that this holds water is a recipe
| for competition. Programming is about the opposite, the more
| you give, the more you get, so it makes sense to popularize
| it as a feature.
| troupo wrote:
| Stock trading is indistinguishable from gambling :)
|
| But true, I forgot that this, too, is part of finance
| sakesun wrote:
| Generating code has a significant economic benefit. Once
| generated, the code can be executed many times without requiring
| heavy computing resources, unlike an AI model.
| miki123211 wrote:
| > Where's the finance or legal models from Mistral or Meta or
| OpenAI?
|
| Programming is "weird" in that it requires both specialized
| knowledge and specialized languages, and the languages are very
| different from any language that humans speak.
|
| Legal requires specialized knowledge, but legal writing is
| still just English and it follows English grammar rules,
| although it's sometimes a very strange "dialect" of English.
|
| Finance is weird in its own way, as that requires a lot more
| boring, highly-precise calculations, and LLMs are notoriously
| bad at those. I suspect that finance is always going to be some
| hybrid of an LLM driving an "old school" computer to do the
| hard math, via a programming language or some other, yet-
| unenvisioned protocol.
|
| > programming really does seem to be the ideal usecase for LLMs
| in a way that other professions just haven't been able to
| crack.
|
| This is true, mostly because of programmers' love of textual
| languages, textual protocols, CLI interfaces and generally all
| things text. If we were all coding in Scratch, this would be a
| lot harder.
| madeofpalk wrote:
| Yes, it appears to be the clear successful usecase for the
| technology, in a way that hasn't been replicated for other
| professions.
|
| I remain very sceptical that a chat-like interface is the
| ideal form for LLMs, yet it seems very optimal for
| programming specifically, along with Copilot-like interfaces
| of just outputting text.
| sofixa wrote:
| Finance already has their own models and has had them for
| decades. Market predictions and high frequency trading is
| literally what all the hedge funds and the like have been doing
| for a few decades now. Including advanced sources of
| information like (take with a grain of salt, I've heard it on
| the internet) using satellite images to measure factory
| activity and thus predict results.
|
| Understandably they're all quite secretive about their tooling
| because they don't want the competition to have access to the
| same competitive advantages, and an open source model / third
| party developing a model doesn't really make sense.
| madeofpalk wrote:
| I guess finance is not in need of a large _language_ model?
| Foobar8568 wrote:
| It does but everything is a joke...
| 317070 wrote:
| I work in the field. The reason has not been mentioned yet.
|
| It's because (for an unknown reason), having coding and
| software development in the training mix is really helpful at
| most other tasks. It improves everything to do with logical
| thinking by a large margin, and that seems to help with many
| other downstream tasks.
|
| Even if you don't need the programming, you want it in the
| training mix to get that logical thinking, which is hard to get
| from other resources.
|
| I don't know how much that is true for legal or financial
| resources.
| drewmate wrote:
| It's just easier to iterate and improve on a coding specialist
| AI when that is also the skill required to iterate on said AI.
|
| Products that build on general LLM tech are already being used
| in other fields. For example, my lawyer friend has started
| using one by LexisNexis[0] and is duly impressed by how it
| works. It's only a matter of time before models like that get
| increasingly specialized for that kind of work, it's just
| harder for lawyers to drive that kind of change alone. Plus,
| there's a lot more resistance in 'legacy' professions to any
| kind of change, much less one that is perceived to threaten the
| livelihoods of established professionals.
|
| Current LLMs are already not bad at a lot of things, but lawyer
| bots, accountant bots and more are likely coming.
|
| [0] https://www.lexisnexis.com/en-us/products/lexis-plus-
| ai.page
| simonw wrote:
| I wonder why Mistral et al don't prepare GGUF versions of these
| for launch day?
|
| If I were them I'd want to be the default source of the versions
| of my models that people use, rather than farming that out to
| whichever third party races to publish the GGUF (and other
| formats) first.
| dannyw wrote:
| I think it's actually reasonable to leave some opportunities to
| the community. It's an Apache 2.0 model. It's meant for
| everyone to build upon freely.
| a2128 wrote:
| llama.cpp is still under development and they sometimes come
| out with breaking changes or new quantization methods, and it
| can be a lot of work to keep up with these changes as you
| publish more models over time. It's easier to just publish a
| standard float32 safetensors that works with PyTorch, and let
| the community deal with other runtimes and file formats.
|
| If it's a new architecture, then there's also additional work
| needed to add support in llama.cpp, which means more dev time,
| more testing, and potentially loss of surprise model release if
| the development work has to be done out in the open
| sroussey wrote:
| Same could be said for onnx.
|
| Depends on which community you are in as to what you want.
| simonw wrote:
| Right - imagine how much of an impact a model release could
| have if it included GGUF and ONNX and MLX along with PyTorch.
| Patrick_Devine wrote:
| Some of the major vendors _do_ create the GGUFs for their
| models, but often they have the wrong parameter settings, need
| changes in the inference code, or don't include the correct
| prompt template. We (i.e. Ollama) have our own conversion
| scripts and we try to work with the model vendors to get
| everything working ahead of time, but unfortunately Mistral
| doesn't usually give us a heads up before they release.
| mcemilg wrote:
| I believe that if Mistral is serious about advancing in open
| source, they should consider sharing the corpus used for training
| their models, at least the base models pretraining data.
| wongarsu wrote:
| I doubt they could. Their corpus almost certainly is mostly
| composed of copyrighted material they don't have a license for.
| It's an open question whether that's an issue for using it for
| model training, but it's obvious they wouldn't be allowed to
| distribute it as a corpus. That'd just be regular copyright
| infringement.
|
| Maybe they could share a list of the content of their corpus.
| But that wouldn't be too helpful and makes it much easier for
| all affected parties to sue them for using their content in
| model training.
| gooob wrote:
| no, not the actual content, just the titles of the content.
| like "book title" by "author". the tool just simply can't be
| taken seriously by anyone until they release that
| information. this is the case for all these models. it's
| ridiculous, almost insulting.
| candiddevmike wrote:
| They can't release it without admitting to copyright
| infringement.
| regularfry wrote:
| They can't do it without getting sued for copyright
| infringement. That's not _quite_ the same.
| bilbo0s wrote:
| Uh..
|
| That would almost be worse. All copyright holders would
| need to do is search a list of titles if I'm understanding
| your proposal correctly.
|
| The idea is _not_ to get sued.
| alecco wrote:
| Nvidia has a blogpost about Mistral Nemo, too.
| https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/
|
| > Mistral NeMo comes packaged as an NVIDIA NIM inference
| microservice, offering performance-optimized inference with
| NVIDIA TensorRT-LLM engines.
|
| > *Designed to fit on the memory of a single NVIDIA L40S, NVIDIA
| GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM
| offers high efficiency, low compute cost, and enhanced security
| and privacy.
|
| > The model was trained using Megatron-LM, part of NVIDIA NeMo,
| with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of
| NVIDIA AI architecture, including accelerated computing, network
| fabric and software to increase training efficiency.
| k__ wrote:
| What's the reason for measuring the model size in context window
| length and not GB?
|
| Also, are these small models OSS? Easier self-hosting seems to be
| the main benefit of small models.
| simion314 wrote:
| >What's the reason for measuring the model size in context
| window length and not GB?
|
| Those are 2 different things.
|
| The context window is how many tokens its context can contain,
| so on a big-context model you could put a few books and articles
| in the context and then start your questions; on a small-context
| model you can start a conversation and after a short time it
| will start forgetting the first prompts. A big context will use
| more memory and will cost performance, but imagine you could
| give it your entire code project and then ask it questions;
| often I know there is some function already in there that does
| something, but I can't remember the name.
| kaoD wrote:
| I suspect you might be confusing the numbers: 12B (which is the
| very first number they give) is not context length, it's
| parameter count.
|
| The reason to use parameter count is because final size in GB
| depends on quantization. A 12B model at 8 bit parameter width
| would be 12Gbytes (plus some % overhead), while at 16 bit would
| be 24Gbytes.
|
| Context length here is 128k which is orthogonal to model size.
| You can notice they specify both parameters and context size
| because you need both to characterize an LLM.
|
| It's also interesting to know what parameter width it was
| trained on because you cannot get more information by
| "quantizing wider" -- it only makes sense to quantize into a
| narrower parameter width to save space.
| k__ wrote:
| Ah, yes.
|
| Thanks, I confused those numbers!
| yjftsjthsd-h wrote:
| > Also, are these small models OSS?
|
| From the very first paragraph on the page:
|
| > released under the Apache 2.0 license.
| bugglebeetle wrote:
| Interested in the new base model for fine tuning. Despite Llama3
| being a better instruct model overall, it's been highly resistant
| to fine-tuning, either owing to some bugs or being trained on so
| much data (ongoing debate about this in the community). Mistral's
| base models are still best in class for small models you can
| specialize.
| obblekk wrote:
| Worth noting this model has 50% more parameters than llama3.
| There are performance gains, but some of them might come from
| using more compute rather than better performance per unit of
| compute.
| LoganDark wrote:
| Is the base model unaligned? Disappointing to see alignment from
| allegedly "open" models.
| xena wrote:
| The reason that companies align models is so that they don't
| get on the front page of the new york times with a headline
| like "Techaro's AI model used by terrorists to build a pipe
| bomb that destroyed the New York Stock Exchange datacentre".
| dpflan wrote:
| These big models are getting pumped out like crazy; that is the
| business of these companies. But basically, it feels like
| private industry just figured out how to scale up a scalable
| process (deep learning), and it required not $M research grants
| but $BB "research grants"/funding. The scaling laws seem fun to
| play with, tweaking more interesting things out of these models
| and finding cool "emergent" behavior as billions of data
| points get correlated.
|
| But pumping out models and putting artifacts on HuggingFace, is
| that a business? What are these models being used for? There is a
| new one at a decent clip.
| hdhshdhshdjd wrote:
| I don't see any indication this beats Llama3 70B, but still
| requires a beefy GPU, so I'm not sure the use case. I have an
| A6000 which I use for a lot of things, Mixtral was my go-to
| until Llama3, then I switched over.
|
| If you could run this on say, stock CPU that would increase the
| use cases dramatically, but if you still need a 4090 I'm either
| missing something or this is useless.
| azeirah wrote:
| You don't need a 4090 at all. 16-bit requires about 24GB of
| VRAM; 8-bit quants (99% the same performance) require only 12GB
| of VRAM.
|
| That's without the context window, so depending on how much
| context you want to use you'll need some more GB.
|
| That is, assuming you'll be using llama.cpp (which is
| standard for consumer inference. Ollama is also llama.cpp, as
| is kobold)
|
| This thing will run fine on a 16GB card, and a q6
| quantization will run fine on a 12GB card.
|
| You'll still get good performance on an 8GB card with
| offloading, since you'll be running most of it on the gpu
| anyway.
| eigenvalue wrote:
| There are a lot of models coming out, but in my view, most
| don't really matter or move the needle. There are the frontier
| models which aren't open (like GPT-4o) and then there are the
| small "elite" local LLMs like Llama3 8B. The rest seem like
| they are mostly about manipulating benchmarks. Whenever I try
| them, they are worse in actual practice than the Llama3 models.
| adt wrote:
| That's 3 releases for Mistral in 24 hours.
|
| https://lifearchitect.ai/models-table/
| ofermend wrote:
| Congrats. Very exciting to see continued innovation around
| smaller models, that can perform much better than larger models.
| This enables faster inference and makes them more ubiquitous.
| andrethegiant wrote:
| I still don't understand the business model of releasing open
| source gen AI models. If this took 3072 H100s to train, why are
| they releasing it for free? I understand they charge people when
| renting from their platform, but why permit people to run it
| themselves?
| kaoD wrote:
| > but why permit people to run it themselves?
|
| I wouldn't worry about that if I were them: it's been shown
| again and again that people will pay for convenience.
|
| What I'd worry about is Amazon/Cloudflare repackaging my model
| and outcompeting my platform.
| andrethegiant wrote:
| > What I'd worry about is Amazon/Cloudflare repackaging my
| model and outcompeting my platform.
|
| Why let Amazon/Cloudflare repackage it?
| bilbo0s wrote:
| How would you stop them?
|
| The license is Apache 2.
| andrethegiant wrote:
| That's my question -- why license as Apache 2
| bilbo0s wrote:
| What license would allow complete freedom for everyone
| else, but constrain Amazon and Cloudflare?
| supriyo-biswas wrote:
| The LLaMa license is a good start.
| abdullahkhalids wrote:
| They could just create a custom license based on Apache
| 2.0 that allows sharing but constrains some specific
| behavior. It won't be formally Open Source, but will have
| enough open source spirit that academics or normal people
| will be happy to use it.
| davidzweig wrote:
| Did anyone try to check how its multilingual skills compare vs.
| Gemma 2? On the page, it's compared with Llama 3 only.
| moffkalast wrote:
| Well it's not on Le Chat, it's not on LMSys, it has a new
| tokenizer that breaks llama.cpp compatibility, and I'm sure as
| hell not gonna run it with Crapformers at 0.1x speed which as
| of right now seems to be the only way to actually test it out.
| I_am_tiberius wrote:
| The last time I tried a Mistral model, it didn't answer most of
| my questions, because of "policy" reasons. I hope they fixed
| that. OpenAI at least only tells me that it's a policy issue but
| still answers most of the time.
| zone411 wrote:
| Interesting that the benchmarks they show have it outperforming
| Gemma 2 9B and Llama 3 8B, but it does a lot worse on my NYT
| Connections benchmark (5.1 vs 16.3 and 12.3). The new GPT-4o mini
| also does better at 14.3. It's just one benchmark though, so
| looking forward to additional scores.
| chant4747 wrote:
| Can you help me understand why people seem to think of
| Connections as a more robust indicator of (general) performance
| than benchmarks typically used for eval?
|
| It seems to me that while the game is very challenging for
| people it's not necessarily an indicator of generalization. I
| can see how it's useful - but I have trouble seeing how a low
| score on it would indicate low performance on most tasks.
|
| Thanks and hopefully this isn't perceived as offensive. Just
| trying to learn more about it.
|
| edit: I realize you yourself indicate that it's "just one
| benchmark" - I am more asking about the broader usage I have
| seen here on HN comments from several people.
| lostmsu wrote:
| Gonna wait for LMSYS benchmarks. The "standard" benchmarks all
| seem unreliable.
| wkcheng wrote:
| Does anyone know whether the 128K is input tokens only? There are
| a lot of models that have a large context window for input but a
| small output context. If this actually has 128k tokens shared
| between input and output, that would be a game changer.
| eigenvalue wrote:
| I have to say, the experience of trying to sign up for Nvidia
| Enterprise so you can try the "NIM" packaged version of this
| model, is just icky and awful now that I've gotten used to
| actually free and open models and software. It feels much nicer
| and more free to be able to clone llama.cpp and wget a .gguf
| model file from huggingface without any registration at all.
| Especially since it has now been several hours since I signed up
| for the Nvidia account and it still says on the website "Your
| License Should be Active Momentarily | We're setting up your
| credentials to download NIMs."
|
| I really don't get Nvidia's thinking with this. They basically
| have a hardware monopoly. I shelled out the $4,000 or so to buy
| two of their 4090 GPUs. Why are they still insisting on torturing
| me with jumping through these awful hoops? They should just be
| glad that they're winning and embrace freedom.
___________________________________________________________________
(page generated 2024-07-18 23:09 UTC)