[HN Gopher] Mistral NeMo
       ___________________________________________________________________
        
       Mistral NeMo
        
       Author : bcatanzaro
       Score  : 353 points
       Date   : 2024-07-18 14:45 UTC (8 hours ago)
        
 (HTM) web link (mistral.ai)
 (TXT) w3m dump (mistral.ai)
        
       | pantulis wrote:
       | Does it have any relation to Nvidia's Nemo? Otherwise, it's
       | unfortunate naming
        
         | markab21 wrote:
         | It looks like it was built jointly with nvidia:
         | https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct
        
         | refulgentis wrote:
         | Click the link, read the first sentence.
        
           | pantulis wrote:
           | Yeah, not my brightest HN moment, to be honest.
        
             | SubiculumCode wrote:
             | At least you didn't ask about finding a particular fish.
        
       | yjftsjthsd-h wrote:
       | > Today, we are excited to release Mistral NeMo, a 12B model
       | built in collaboration with NVIDIA. Mistral NeMo offers a large
       | context window of up to 128k tokens. Its reasoning, world
       | knowledge, and coding accuracy are state-of-the-art in its size
       | category. As it relies on standard architecture, Mistral NeMo is
       | easy to use and a drop-in replacement in any system using Mistral
       | 7B.
       | 
       | > We have released pre-trained base and instruction-tuned
        | checkpoints under the Apache 2.0 license to promote
       | adoption for researchers and enterprises. Mistral NeMo was
       | trained with quantisation awareness, enabling FP8 inference
       | without any performance loss.
       | 
       | So that's... uniformly an improvement at just about everything,
       | right? Large context, permissive license, should have good perf.
       | The one thing I can't tell is how big 12B is going to be (read:
       | how much VRAM/RAM is this thing going to need). Annoyingly and
       | rather confusingly for a model under Apache 2.0,
       | https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
        | refuses to show me files unless I log in and "You need to agree to
       | share your contact information to access this model"... though if
       | it's actually as good as it looks, I give it hours before it's
       | reposted without that restriction, which Apache 2.0 allows.
        
         | xena wrote:
         | Easy head math: parameter count times parameter size plus
         | 20-40% for inference slop space. Anywhere from 8-40GB of vram
         | required depending on quantization levels being used.
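          | 
          | A rough sketch of that head math in Python (30% here is just the
          | midpoint of the 20-40% slop guess above):
          | 
          |     def vram_gb(params_b, bytes_per_param, overhead=0.3):
          |         # params (billions) x bytes per param, plus slop space
          |         # for activations / KV cache during inference
          |         return params_b * bytes_per_param * (1 + overhead)
          | 
          |     print(vram_gb(12, 0.5))  # 4-bit:    ~7.8 GB
          |     print(vram_gb(12, 1))    # fp8/int8: ~15.6 GB
          |     print(vram_gb(12, 2))    # fp16:     ~31.2 GB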
        
           | imtringued wrote:
            | They did quantization-aware training for fp8, so you won't
            | get any benefit from using more than 12GB of RAM for the
            | parameters. What might use more RAM is the much bigger context
            | window.
        
         | exe34 wrote:
         | tensors look about 20gb. not sure what that's like in vram.
        
           | kelsey98765431 wrote:
           | same size
        
         | renewiltord wrote:
         | According to nvidia https://blogs.nvidia.com/blog/mistral-
         | nvidia-ai-model/ it was made to fit on a 4090 so it should work
         | with 24 GB.
        
         | bernaferrari wrote:
          | if you want to be lazy, 7B = 7GB of VRAM, 12B = 12GB of VRAM,
          | but with quantization you might be able to do it with ~6-8GB. So
          | any 16GB MacBook could run it (but not much else).
        
           | hislaziness wrote:
            | isn't it 2 bytes (fp16) per param? So 7B = 14 GB + some for
            | inference?
        
             | ancientworldnow wrote:
             | This was trained to be run at FP8 with no quality loss.
        
             | fzzzy wrote:
             | it's very common to run local models in 8 bit int.
        
               | qwertox wrote:
               | Yes, but it's not common for the original model to be 8
               | bit int. The community can downgrade any model to 8 bit
               | int, but it's always linked to quality loss.
        
         | Bumblonono wrote:
          | It fits a 4090. Nvidia lists the models and therefore I assume
          | 24GB is the minimum.
        
           | michaelt wrote:
           | A 4090 will _just narrowly_ fit a 34B parameter model at
           | 4-bit quantisation.
           | 
           | A 12B model will run on a 4090 with plenty room to spare,
           | even with 8-bit quantisation.
        
         | wongarsu wrote:
         | You could consider the improvement in model performance a bit
         | of a cheat - they beat other models "in the same size category"
         | that have 30% fewer parameters.
         | 
         | I still welcome this approach. 7B seems like a dead end in
         | terms of reasoning and generalization. They are annoyingly
         | close to statistical parrots, a world away from the moderate
         | reasoning you get in 70B models. Any use case where that's
         | useful can increasingly be filled by even smaller models, so
         | chasing slightly larger models to get a bit more "intelligence"
         | might be the right move
        
           | mistercheph wrote:
           | I strongly disagree, have you used fp16 or q8 llama3 8b?
        
           | amrrs wrote:
           | >reasoning and generalization
           | 
           | any example use-cases or prompts? how do you define those?
        
           | yjftsjthsd-h wrote:
           | I actually meant execution speed from quantisation awareness
           | - agreed that comparing against smaller models is a bit
           | cheating.
        
           | qwertox wrote:
           | Aren't small models useful for providing a language-based
           | interface - spoken or in writing - to any app? Tuned
           | specifically for that app or more likely enriched via RAG and
           | possibly also by using function calling?
           | 
           | It doesn't have to be intelligent like we expect it from the
           | top-tier, huge models, just capable of understanding some
           | words in sentences, mostly commands, and how to react to
           | them.
        
           | imtringued wrote:
            | Except Llama 3 8B is a significant improvement over Llama 2,
            | which was bad enough that a whole community was building fine-
            | tunes, on much smaller budgets, that beat what the multi-
            | billion-dollar company could do. With Llama 3 8B things have
            | shifted: far fewer community fine-tunes actually beat it. The
            | fact that Mistral AI can still build models that beat it means
            | the company isn't falling too far behind a significantly
            | better-equipped competitor.
            | 
            | What's more irritating is that they did quantization-aware
            | training for fp8. int8 quantization results in an
            | imperceptible loss of quality that is difficult to pick up in
            | benchmarks. They should have gone for something more
            | aggressive like 4-bit, where quantization leads to a
            | significant loss in quality.
        
       | Workaccount2 wrote:
       | Is "Parameter Creep" going to becomes a thing? They hold up
       | Llama-8b as a competitor despite NeMo having 50% more parameters.
       | 
       | The same thing happened with gemma-27b, where they compared it to
       | all the 7-9b models.
       | 
       | It seems like an easy way to boost benchmarks while coming off as
       | "small" at first glance.
        
         | eyeswideopen wrote:
         | As written here: https://huggingface.co/nvidia/Mistral-
         | NeMo-12B-Instruct
         | 
         | "It significantly outperforms existing models smaller or
         | similar in size." is a statement that goes in that direction
         | and would allow the comparison of a 1.7T param model with a 7b
         | one
        
         | voiper1 wrote:
         | Oddly, they are only charging slightly more for their hosted
         | version:
         | 
          | open-mistral-7b is 25c/M tokens
          | 
          | open-mistral-nemo-2407 is 30c/M tokens
         | 
         | https://mistral.ai/technology/#pricing
        
           | dannyw wrote:
           | Possibly a NVIDIA subsidy. You run NEMO models, you get
           | cheaper GPUs.
        
           | Palmik wrote:
            | They specifically call out fp8-aware training, and TensorRT-
            | LLM is really good (efficient) with fp8 inference on H100 and
           | other hopper cards. It's possible that they run the 7b
           | natively in fp16 as smaller models suffer more from even
           | "modest" quantization like this.
        
         | causal wrote:
         | Yeah it will be interesting to see if we ever settle on
         | standard sizes here. My preference would be:
         | 
         | - 3B for CPU inference or running on edge devices.
         | 
         | - 20-30B for maximizing single consumer GPU potential.
         | 
         | - 70B+ for those who can afford it.
         | 
         | 7-9B never felt like an ideal size.
        
         | marci wrote:
         | For the benchmarks, it depends on how you interpret it. The
         | other models are quite popular so many can have a starting
          | point. Now, if you regularly use them you can assess: "just 3%
          | better on some benchmark, 80% to 83, at the cost of almost twice
          | the inference time and base RAM requirement, but with a 16x
          | context window, and usable commercially..." and at the end, "for
          | my use case, is it worth it?"
        
       | pants2 wrote:
       | Exciting, I think 12B is the sweet spot for running locally -
       | large enough to be useful, fast enough to run on a decent laptop.
        
         | _flux wrote:
         | How much memory does employing the complete 128k window take,
         | though? I've sadly noticed that it can take a significant
         | amount of VRAM to use a larger context window.
         | 
         | edit: e.g. I wouldn't know the correct parameters for this
         | calculator, but going from 8k window to 128k window goes from
         | 1.5 GB to 23 GB: https://huggingface.co/spaces/NyxKrage/LLM-
         | Model-VRAM-Calcul...
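          | 
          | For a rough feel for where that number comes from, the KV cache
          | grows linearly with context; a minimal sketch, with layer/head
          | counts that are illustrative guesses for a 12B GQA model (not
          | taken from the model card) and an fp16 cache:
          | 
          |     def kv_gb(ctx, layers, kv_heads, hdim, bytes_el=2):
          |         # K and V each hold layers * kv_heads * hdim values
          |         # per token of context
          |         return 2 * layers * kv_heads * hdim * ctx * bytes_el / 1e9
          | 
          |     print(kv_gb(8_000, 40, 8, 128))    # ~1.3 GB
          |     print(kv_gb(128_000, 40, 8, 128))  # ~21 GB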
        
           | azeirah wrote:
           | In practice, it's fine to stick with "just" 8k or 16k or 32k.
           | If you're working with data of over 128k tokens I'd
           | personally not recommend using an open model anyway unless
           | you know what you're doing. The models are kinda there, but
           | the hardware mostly isn't.
           | 
            | This is only realistic right now for people with those
            | unified-memory MacBooks or for enthusiasts with Epyc servers
            | or a very high-end workstation built for inference.
            | 
            | Anything above that I don't consider "consumer" inference.
        
         | mythz wrote:
         | IMO Google's Gemma2 27B [1] is the sweet spot for running
         | locally on commodity 16GB VRAM cards.
         | 
         | [1] https://ollama.com/library/gemma2:27b
        
           | Raed667 wrote:
           | If I "only" have 16GB of ram on a macbook pro, would that
           | still work ?
        
             | sofixa wrote:
             | If it's an M-series one with "unified memory" (shared RAM
             | between the CPU, GPU and NPU on the same chip), yes.
        
           | mysteria wrote:
           | Keep in mind that Gemma is a larger model but it only has 8k
           | context. The Mistral 12B will need less VRAM to store the
           | weights but you'll need a much larger KV cache if you intend
           | to use the full 128k context, especially if the KV is
            | unquantized. Not sure if this new model has GQA but those
           | without it absolutely eat memory when you increase the
           | context size (looking at you Command R).
        
       | minimaxir wrote:
       | > Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken,
        | that was trained on more than 100 languages, and compresses
       | natural language text and source code more efficiently than the
       | SentencePiece tokenizer used in previous Mistral models.
       | 
       | Does anyone have a good answer why everyone went back to
       | SentencePiece in the first place? Byte-pair encoding (which is
       | what tiktoken uses: https://github.com/openai/tiktoken) was shown
       | to be a more efficient encoding as far back as GPT-2 in 2019.
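        | 
        | For reference, tiktoken's BPE interface looks like this
        | (cl100k_base is OpenAI's encoding, used purely as a stand-in
        | here; Tekken itself isn't shipped in tiktoken):
        | 
        |     import tiktoken
        | 
        |     enc = tiktoken.get_encoding("cl100k_base")  # stand-in vocab
        |     ids = enc.encode("Mistral NeMo uses a new tokenizer, Tekken.")
        |     print(len(ids), enc.decode(ids))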
        
         | rockinghigh wrote:
         | The SentencePiece library also implements Byte-pair-encoding.
         | That's what the LLaMA models use and the original Mistral
         | models were essentially a copy of LLaMA2.
        
         | zwaps wrote:
         | SentencePiece is not a different algorithm to WordPiece or BPE,
         | despite its naming.
         | 
         | One of the main pulls of the SentencePiece library was the pre-
         | tokenization being less reliant on white space and therefore
          | more adaptable to non-Western languages.
        
       | p1esk wrote:
       | Interesting how it will compete with 4o mini.
        
       | pixelatedindex wrote:
        | Pardon me if this is a dumb question, but is it possible for me
        | to download these models onto my computer (I have a 1080ti and a
        | [2|3]070ti) and expose some sort of API interface? That way I can
        | write programs that call this API, which I find appealing.
       | 
       | EDIT: This a 1W light bulb moment for me, thank you!
        
         | bezbac wrote:
         | AFAIK, Ollama supports most of these models locally and will
         | expose a REST API[0]
         | 
         | [0]: https://github.com/ollama/ollama/blob/main/docs/api.md
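          | 
          | Once the model lands in the Ollama library, something along
          | these lines should work (the model tag is a guess until it's
          | actually published):
          | 
          |     import requests
          | 
          |     r = requests.post(
          |         "http://localhost:11434/api/generate",
          |         json={
          |             "model": "mistral-nemo",  # tag is a guess for now
          |             "prompt": "Say hello in five languages.",
          |             "stream": False,
          |         },
          |     )
          |     print(r.json()["response"])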
        
         | kanwisher wrote:
         | llama.cpp or ollama both have apis for most models
        
         | codetrotter wrote:
         | I'd probably check https://ollama.com/library?q=Nemo in a
         | couple of days. My guess is that by then ollama will have
         | support for it. And you can then run the model locally on your
         | machine with ollama.
        
           | hedgehog wrote:
            | Adding to this: if the default is too slow, look at the more
            | heavily quantized versions of the model; they are smaller, at
            | a moderate cost in output quality. Ollama can split models
           | between GPU and host memory but the throughput dropoff tends
           | to be pretty severe.
        
           | andrethegiant wrote:
           | Why would it take a couple days? Is it not a matter of
           | uploading the model to their registry, or are there more
           | steps involved than that?
        
             | HanClinto wrote:
             | Ollama depends on llama.cpp as its backend, so if there are
             | any changes that need to be made to support anything new in
             | this model architecture or tokenizer, then it will need to
             | be added there first.
             | 
             | Then the model needs to be properly quantized and formatted
             | for GGUF (the model format that llama.cpp uses), tested,
             | and uploaded to the model registry.
             | 
             | So there's some length to the pipeline that things need to
             | go through, but overall the devs in both projects generally
             | have things running pretty smoothly, and I'm regularly
             | impressed at how quickly both projects get updated to
             | support such things.
        
               | codetrotter wrote:
               | > I'm regularly impressed at how quickly both projects
               | get updated to support such things.
               | 
               | Same! Big kudos to all involved
        
               | HanClinto wrote:
               | Issue to track Mistral NeMo support in llama.cpp:
               | https://github.com/ggerganov/llama.cpp/issues/8577
        
           | Patrick_Devine wrote:
           | We're working on it, except that there is a change to the
           | tokenizer which we're still working through in our conversion
           | scripts. Unfortunately we don't get a heads up from Mistral
           | when they drop a model, so sometimes it takes a little bit of
           | time to sort out the differences.
           | 
           | Also, I'm not sure if we'll call it mistral-nemo or nemo yet.
           | :-D
        
         | simpaticoder wrote:
         | Justine Tunney (of redbean fame) is actively working on getting
         | LLMs to run well on CPUs, where RAM is cheap. If successful
         | this would eliminate an enormous bottleneck to running local
         | models. If anyone can do this, she can. (And thank you to
         | Mozilla for financially supporting her work). See
         | https://justine.lol/matmul/ and https://github.com/mozilla-
         | Ocho/llamafile
        
           | illusive4080 wrote:
           | I love that domain name.
        
           | wkat4242 wrote:
           | I think it's mostly the memory bandwidth though that makes
           | the GPUs so fast with LLMs. My card does about 1TB/s. CPU RAM
           | won't come near that. I'm sure a lot of optimisations can be
           | had but I think GPUs will still be significantly ahead.
           | 
            | Macs are so good at it because Apple solders the memory onto
            | the SoC package for a really wide and low-latency connection.
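            | 
            | Back-of-envelope for why bandwidth dominates: decoding is
            | roughly memory-bound, so tokens/s is on the order of bandwidth
            | divided by the bytes streamed per token (very rough, ignores
            | KV cache and batching):
            | 
            |     def tokens_per_sec(bandwidth_gb_s, model_gb):
            |         # each generated token reads ~all weights once
            |         return bandwidth_gb_s / model_gb
            | 
            |     print(tokens_per_sec(1000, 12))  # ~83 tok/s at ~1 TB/s
            |     print(tokens_per_sec(80, 12))    # ~7 tok/s, dual-ch DDR5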
        
             | simpaticoder wrote:
             | This is a good and valid comment. It is difficult to
              | predict the future, but I would be curious what the best-
              | case theoretical performance of an LLM would be on a typical
              | x86 or ARM system with DDR4 or DDR5 RAM. My uneducated guess
             | that it can be very good, perhaps 50% the speed of a
             | specialized GPU/RAM device. In practical terms, the CPU
             | approach is _required_ for very large contexts, up to as
             | large as the lifetime of all interactions you have with
             | your LLM.
        
           | rustcleaner wrote:
           | There's no good reason for consumer nvidia cards to lack
           | SODIMM-like slots for video RAM, except to rake in big bucks
           | and induce more hasty planned obsolescence.
        
         | Raed667 wrote:
          | First thing I did when I saw the headline was to look for it on
          | ollama, but it didn't land there yet:
         | https://ollama.com/library?sort=newest&q=NeMo
        
           | Patrick_Devine wrote:
           | We're working on it!
        
         | RockyMcNuts wrote:
         | You will need enough VRAM, 1080ti is not going to work very
         | well, maybe get a 3090 with 24GB VRAM.
         | 
          | I think it should also run well on a 36GB MacBook Pro or
          | probably a 24GB MacBook Air.
        
         | nostromo wrote:
         | Yes.
         | 
         | If you're on a Mac, check out LM Studio.
         | 
         | It's a UI that lets you load and interact with models locally.
         | You can also wrap your model in an OpenAI-compatible API and
         | interact with it programmatically.
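          | 
          | The OpenAI-compatible part means the stock client just needs a
          | different base_url; the port and model name below are whatever
          | your local server shows (1234 is LM Studio's usual default):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI(base_url="http://localhost:1234/v1",
          |                     api_key="lm-studio")
          |     r = client.chat.completions.create(
          |         model="local-model",  # whatever the server lists
          |         messages=[{"role": "user", "content": "Hello!"}],
          |     )
          |     print(r.choices[0].message.content)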
        
         | d13 wrote:
         | Try Lm Studio or Ollama. Load up the model, and there you go.
        
         | homarp wrote:
         | llama.cpp supports multi gpu across local network
         | https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...
         | 
          | and exposes an OpenAI-compatible server, or you can use their
          | Python bindings.
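          | 
          | With the Python bindings (llama-cpp-python) that looks roughly
          | like this, once a GGUF of the model exists (the filename is a
          | placeholder):
          | 
          |     from llama_cpp import Llama
          | 
          |     llm = Llama(
          |         model_path="Mistral-Nemo-12B-Instruct.Q8_0.gguf",
          |         n_ctx=8192,       # context window to allocate
          |         n_gpu_layers=-1,  # offload as many layers as fit
          |     )
          |     out = llm("Q: What is the capital of France? A:",
          |               max_tokens=32)
          |     print(out["choices"][0]["text"])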
        
       | saberience wrote:
       | Two questions:
       | 
       | 1) Anyone have any idea of VRAM requirements?
       | 
       | 2) When will this be available on ollama?
        
         | causal wrote:
         | 1) Rule of thumb is # of params = GB at Q8. So a 12B model
         | generally takes up 12GB of VRAM at 8 bit precision.
         | 
         | But 4bit precision is still pretty good, so 6GB VRAM is viable,
         | not counting additional space for context. Usually about an
         | extra 20% is needed, but 128K is a pretty huge context so more
         | will be needed if you need the whole space.
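          | 
          | That rule of thumb as a two-liner (the ~20% context overhead is
          | the rough figure above; the real number depends on how much of
          | the 128K window you actually use):
          | 
          |     def weights_gb(params_b, bits):
          |         return params_b * bits / 8  # billions of params -> GB
          | 
          |     for bits in (16, 8, 4):
          |         print(bits, "bit:", weights_gb(12, bits), "GB + context")
          |     # 16 bit: 24.0 GB, 8 bit: 12.0 GB, 4 bit: 6.0 GB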
        
         | alecco wrote:
         | The model has 12 billion parameters and uses FP8, so 1 byte
         | each. With some working memory I'd bet you can run it on 24GB.
         | 
         | > Designed to fit on the memory of a single NVIDIA L40S, NVIDIA
         | GeForce RTX 4090 or NVIDIA RTX 4500 GPU
        
       | jorgesborges wrote:
       | I'm AI stupid. Does anyone know if training on multiple languages
       | provides "cross-over" -- so training done in German can be
       | utilized when answering a prompt in English? I once went through
       | various Wikipedia articles in a couple languages and the
       | differences were interesting. For some reason I thought they'd be
       | almost verbatim (forgetting that's not how Wikipedia works!) and
       | while I can't remember exactly I felt they were sometimes starkly
       | different in tone and content.
        
         | bernaferrari wrote:
          | no, it is basically like the 'auto-correct' spell checker on
          | your phone. It only knows what it was trained on. But it has
          | been shown that a coding LLM that has never seen a programming
          | language or a library can "learn" a new one faster than, say, a
          | generic LLM.
        
           | StevenWaterman wrote:
           | That's not true, LLMs can answer questions in one language
           | even if they were only trained on that data in another
           | language.
           | 
            | I.e. if you train an LLM on both English and French in
            | general, but only teach it a specific fact in French, it can
            | give you that fact in English.
        
             | hdhshdhshdjd wrote:
              | Yes, you can write a prompt in English, give it French, and
             | get an accurate answer in English even with the original
             | Mistral.
             | 
             | Still blows my mind we came so far so fast.
        
         | miki123211 wrote:
         | Generally yes, with caveats.
         | 
         | There was some research showing that training a model on facts
         | like "the mother of John Smith is Alice" but in German allowed
         | it to answer questions like "who's the mother of John Smith",
         | but not questions like "what's the name of Alice's child",
         | regardless of language. Not sure if this holds at larger model
         | sizes though, it's the sort of problem that's usually fixable
         | by throwing more parameters at it.
         | 
          | Language models definitely do generalize to some extent and
         | they're not "stochastic parrots" as previously thought, but
         | there are some weird ways in which we expect them to generalize
         | but they don't.
        
           | planb wrote:
            | > Language models definitely do generalize to some extent
            | and they're not "stochastic parrots" as previously thought,
            | but there are some weird ways in which we expect them to
            | generalize but they don't.
            | 
            | Do you have any good sources that explain this? I was always
            | thinking LLMs are indeed stochastic parrots, but language
            | (that is, the unified corpus of all languages in the training
            | data) already inherently contains the "generalization". So
            | the intelligence is encoded in the language humans speak.
        
             | michaelt wrote:
             | I don't have _explanations_ but I can point you to one of
             | the papers: https://arxiv.org/pdf/2309.12288 which calls it
             | "the reversal curse" and does a bunch of experiments
             | showing models that are successful at questions like "Who
             | is Tom Cruise's mother?" (Mary Lee Pfeiffer) will not be
             | equally successful at answering "Who is Mary Lee Pfeiffer's
             | son?"
        
               | spookie wrote:
               | Isn't that specific case just a matter of not having
               | enough data _explicitly_ stating the reverse? Seems as if
               | they are indeed stochastic parrots from that perspective.
        
               | layer8 wrote:
               | Yes, and your conclusion is correct.
        
             | moffkalast wrote:
             | > language already inherently contains the
             | ,,generalization"
             | 
             | The mental gymnastics required to handwave language model
             | capabilities are getting funnier and funnier every day.
        
         | dannyw wrote:
          | Anecdata, but I did some continued pretraining on a toy LLM
          | using machine-translated versions of the original dataset.
          | 
          | Performance improved across all benchmarks, in English (the
          | original language).
        
           | benmanns wrote:
            | Am I understanding correctly? You took an English dataset,
           | trained an LLM, machine translated the English dataset to
           | e.g. Spanish, continued training the model, and performance
           | for queries in English improved? That's really interesting.
        
         | bionhoward wrote:
          | There is evidence that code training helps with reasoning, so
          | if you count code as another language, this makes sense.
         | 
         | https://openreview.net/forum?id=KIPJKST4gw
         | 
         | Is symbolic language a fuzzy sort of code? Absolutely, because
         | it conveys logic and information. TLDR: yes!
        
       | madeofpalk wrote:
       | I find it interesting how coding/software development still
       | appears to be the one category that these most popular models
       | release specialised models for. Where's the finance or legal
       | models from Mistral or Meta or OpenAI?
       | 
       | Perhaps it's just confirmation bias, but programming really does
       | seem to be the ideal usecase for LLMs in a way that other
       | professions just haven't been able to crack. Compared to other
        | types of work, it's relatively straightforward to tell if code
        | is "correct" or not.
        
         | a2128 wrote:
         | Coding models solve a clear problem and have a clear
         | integration into a developer's workflow - it's like your own
         | personal StackOverflow and it can autocomplete code for you.
         | It's not as clear when it comes to finance or legal, you
         | wouldn't want to rely on an AI that may hallucinate financial
         | numbers or laws. These other professions are also a lot slower
         | to react to change, compared to software development where
         | people are already used to learning new frameworks every year
        
         | MikeKusold wrote:
          | Those are regulated industries, whereas software development
          | is not.
         | 
         | An AI spitting back bad code won't compile. An AI spitting back
         | bad financial/legal advice bankrupts people.
        
           | knicholes wrote:
           | Generally I agree! I saw a guy shamefully admit he didn't
           | read the output carefully enough when using generated code
           | (that ran), but there was a min() instead of a max(), and it
           | messed up a month of his metrics!
        
         | troupo wrote:
         | The explanation is easier, I think. Consider what data these
         | models are trained on, and who are the immediate developers of
         | these models.
         | 
         | The models are trained on a vast set of whatever is available
         | on the internet. They are developed by tech people/programmers
         | who are surprisingly blind to their own biases and interests.
         | There's no surprise that one of the main things they want to
         | try and do is programming, using vast open quantities of Stack
         | Overflow, GitHub and various programming forums.
         | 
         | For finance and legal you need to:
         | 
         | - think a bit outside the box
         | 
         | - be interested in finance and legal
         | 
         | - be prepared to carry actual legal liability for the output of
         | your models
        
           | dannyw wrote:
           | > - be prepared to carry actual legal liability for the
           | output of your models
           | 
           | Section 230.
           | 
           | It's been argued that a response by a LLM, to user input, is
           | "user-generated content" and hence the platform has generally
           | no liability (except CSAM).
           | 
           | Nobody has successfully sued.
        
             | troupo wrote:
             | No one has challenged this. Because LLMs haven't been
             | (widely) used in legal or legally binding contexts
        
           | moffkalast wrote:
           | Then again, we just had this on the front page:
           | https://news.ycombinator.com/item?id=40957990
           | 
           | > We first document a significant decline in stock trading
           | volume during ChatGPT outages and find that the effect is
           | stronger for firms with corporate news released immediately
           | before or during the outages. We further document similar
           | declines in the short-run price impact, return variance, and
           | bid-ask spreads, consistent with a reduction in informed
           | trading during the outage periods. Lastly, we use trading
           | volume changes during outages to construct a firm-level
           | measure of the intensity of GAI-assisted trading and provide
           | early evidence of a positive effect of GAI-assisted trading
           | on long-run stock price informativeness.
           | 
           | They're being used, but nobody is really saying anything
           | because the stock market is a zero sum game these days and
           | letting anyone else know that this holds water is a recipe
           | for competition. Programming is about the opposite, the more
           | you give, the more you get, so it makes sense to popularize
           | it as a feature.
        
             | troupo wrote:
             | Stock trading is indistinguishable from gambling :)
             | 
             | But true, I forgot that this, too, is part of finance
        
         | sakesun wrote:
          | Generating code has a significant economic benefit. Code, once
          | generated, can be executed many times without requiring high
          | computing resources, unlike an AI model.
        
         | miki123211 wrote:
         | > Where's the finance or legal models from Mistral or Meta or
         | OpenAI?
         | 
         | Programming is "weird" in that it requires both specialized
         | knowledge and specialized languages, and the languages are very
         | different from any language that humans speak.
         | 
         | Legal requires specialized knowledge, but legal writing is
         | still just English and it follows English grammar rules,
         | although it's sometimes a very strange "dialect" of English.
         | 
         | Finance is weird in its own way, as that requires a lot more
         | boring, highly-precise calculations, and LLMs are notoriously
         | bad at those. I suspect that finance is always going to be some
         | hybrid of an LLM driving an "old school" computer to do the
         | hard math, via a programming language or some other, yet-
         | unenvisioned protocol.
         | 
         | > programming really does seem to be the ideal usecase for LLMs
         | in a way that other professions just haven't been able to
         | crack.
         | 
         | This is true, mostly because of programmers' love of textual
         | languages, textual protocols, CLI interfaces and generally all
         | things text. If we were all coding in Scratch, this would be a
         | lot harder.
        
           | madeofpalk wrote:
           | Yes, it appears to be the clear successful usecase for the
           | technology, in a way that hasn't been replicated for other
           | professions.
           | 
           | I remain very sceptical that a chat-like interface is the
           | ideal form for LLMs, yet it seems very optimal for
           | programming specifically, along with Copilot-like interfaces
           | of just outputting text.
        
         | sofixa wrote:
         | Finance already has their own models and has had them for
         | decades. Market predictions and high frequency trading is
         | literally what all the hedge funds and the like have been doing
         | for a few decades now. Including advanced sources of
         | information like (take with a grain of salt, I've heard it on
         | the internet) using satellite images to measure factory
         | activity and thus predict results.
         | 
         | Understandably they're all quite secretive about their tooling
         | because they don't want the competition to have access to the
         | same competitive advantages, and an open source model / third
         | party developing a model doesn't really make sense.
        
           | madeofpalk wrote:
           | I guess finance is not in need of a large _language_ model?
        
             | Foobar8568 wrote:
             | It does but everything is a joke...
        
         | 317070 wrote:
         | I work in the field. The reason has not been mentioned yet.
         | 
         | It's because (for an unknown reason), having coding and
         | software development in the training mix is really helpful at
         | most other tasks. It improves everything to do with logical
         | thinking by a large margin, and that seems to help with many
         | other downstream tasks.
         | 
         | Even if you don't need the programming, you want it in the
         | training mix to get that logical thinking, which is hard to get
         | from other resources.
         | 
         | I don't know how much that is true for legal or financial
         | resources.
        
         | drewmate wrote:
         | It's just easier to iterate and improve on a coding specialist
         | AI when that is also the skill required to iterate on said AI.
         | 
         | Products that build on general LLM tech are already being used
         | in other fields. For example, my lawyer friend has started
         | using one by LexisNexis[0] and is duly impressed by how it
         | works. It's only a matter of time before models like that get
         | increasingly specialized for that kind of work, it's just
         | harder for lawyers to drive that kind of change alone. Plus,
         | there's a lot more resistance in 'legacy' professions to any
         | kind of change, much less one that is perceived to threaten the
         | livelihoods of established professionals.
         | 
         | Current LLMs are already not bad at a lot of things, but lawyer
         | bots, accountant bots and more are likely coming.
         | 
         | [0] https://www.lexisnexis.com/en-us/products/lexis-plus-
         | ai.page
        
       | simonw wrote:
       | I wonder why Mistral et al don't prepare GGUF versions of these
       | for launch day?
       | 
       | If I were them I'd want to be the default source of the versions
       | of my models that people use, rather than farming that out to
       | whichever third party races to publish the GGUF (and other
       | formats) first.
        
         | dannyw wrote:
         | I think it's actually reasonable to leave some opportunities to
         | the community. It's an Apache 2.0 model. It's meant for
         | everyone to build upon freely.
        
         | a2128 wrote:
         | llama.cpp is still under development and they sometimes come
         | out with breaking changes or new quantization methods, and it
         | can be a lot of work to keep up with these changes as you
         | publish more models over time. It's easier to just publish a
         | standard float32 safetensors that works with PyTorch, and let
         | the community deal with other runtimes and file formats.
         | 
         | If it's a new architecture, then there's also additional work
         | needed to add support in llama.cpp, which means more dev time,
          | more testing, and potentially losing the element of surprise in
          | the model release if the development work has to be done out in
          | the open.
        
         | sroussey wrote:
         | Same could be said for onnx.
         | 
         | Depends on which community you are in as to what you want.
        
           | simonw wrote:
           | Right - imagine how much of an impact a model release could
           | have if it included GGUF and ONNX and MLX along with PyTorch.
        
         | Patrick_Devine wrote:
         | Some of the major vendors _do_ create the GGUFs for their
         | models, but often they have the wrong parameter settings, need
         | changes in the inference code, or don't include the correct
         | prompt template. We (i.e. Ollama) have our own conversion
         | scripts and we try to work with the model vendors to get
         | everything working ahead of time, but unfortunately Mistral
         | doesn't usually give us a heads up before they release.
        
       | mcemilg wrote:
       | I believe that if Mistral is serious about advancing in open
       | source, they should consider sharing the corpus used for training
        | their models, at least the base models' pretraining data.
        
         | wongarsu wrote:
         | I doubt they could. Their corpus almost certainly is mostly
         | composed of copyrighted material they don't have a license for.
         | It's an open question whether that's an issue for using it for
         | model training, but it's obvious they wouldn't be allowed to
         | distribute it as a corpus. That'd just be regular copyright
         | infringement.
         | 
         | Maybe they could share a list of the content of their corpus.
         | But that wouldn't be too helpful and makes it much easier for
         | all affected parties to sue them for using their content in
         | model training.
        
           | gooob wrote:
           | no, not the actual content, just the titles of the content.
           | like "book title" by "author". the tool just simply can't be
           | taken seriously by anyone until they release that
           | information. this is the case for all these models. it's
           | ridiculous, almost insulting.
        
             | candiddevmike wrote:
             | They can't release it without admitting to copyright
             | infringement.
        
               | regularfry wrote:
               | They can't do it without getting sued for copyright
               | infringement. That's not _quite_ the same.
        
             | bilbo0s wrote:
             | Uh..
             | 
             | That would almost be worse. All copyright holders would
             | need to do is search a list of titles if I'm understanding
             | your proposal correctly.
             | 
             | The idea is _not_ to get sued.
        
       | alecco wrote:
       | Nvidia has a blogpost about Mistral Nemo, too.
       | https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/
       | 
       | > Mistral NeMo comes packaged as an NVIDIA NIM inference
       | microservice, offering performance-optimized inference with
       | NVIDIA TensorRT-LLM engines.
       | 
       | > *Designed to fit on the memory of a single NVIDIA L40S, NVIDIA
       | GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM
       | offers high efficiency, low compute cost, and enhanced security
       | and privacy.
       | 
       | > The model was trained using Megatron-LM, part of NVIDIA NeMo,
       | with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of
       | NVIDIA AI architecture, including accelerated computing, network
       | fabric and software to increase training efficiency.
        
       | k__ wrote:
       | What's the reason for measuring the model size in context window
       | length and not GB?
       | 
        | Also, are these small models OSS? Easier self-hosting seems to
        | be the main benefit of small models.
        
         | simion314 wrote:
          | >What's the reason for measuring the model size in context
          | window length and not GB?
          | 
          | Those are 2 different things.
          | 
          | The context window is how many tokens its context can contain,
          | so with a big-context model you could put a few books and
          | articles into the context and then start your questions; with a
          | small-context model you can start a conversation and after a
          | short time it will start forgetting the first prompts. A big
          | context will use more memory and will cost performance, but
          | imagine you could give it your entire code project and then ask
          | it questions; often I know there is some function already there
          | that does something but I can't remember the name.
        
         | kaoD wrote:
         | I suspect you might be confusing the numbers: 12B (which is the
         | very first number they give) is not context length, it's
         | parameter count.
         | 
         | The reason to use parameter count is because final size in GB
         | depends on quantization. A 12B model at 8 bit parameter width
         | would be 12Gbytes (plus some % overhead), while at 16 bit would
         | be 24Gbytes.
         | 
         | Context length here is 128k which is orthogonal to model size.
          | You can notice they specify both parameter count and context
          | size because you need both to characterize an LLM.
         | 
         | It's also interesting to know what parameter width it was
         | trained on because you cannot get more information by
         | "quantizing wider" -- it only makes sense to quantize into a
         | narrower parameter width to save space.
        
           | k__ wrote:
           | Ah, yes.
           | 
           | Thanks, I confused those numbers!
        
         | yjftsjthsd-h wrote:
         | > Also, are these small models OSS?
         | 
         | From the very first paragraph on the page:
         | 
         | > released under the Apache 2.0 license.
        
       | bugglebeetle wrote:
       | Interested in the new base model for fine tuning. Despite Llama3
       | being a better instruct model overall, it's been highly resistant
       | to fine-tuning, either owing to some bugs or being trained on so
       | much data (ongoing debate about this in the community). Mistral's
        | base models are still best in class for small models you can
        | specialize.
        
       | obblekk wrote:
       | Worth noting this model has 50% more parameters than llama3.
        | There are performance gains, but some of them might come from
        | using more compute rather than from better performance per unit
        | of compute.
        
       | LoganDark wrote:
       | Is the base model unaligned? Disappointing to see alignment from
       | allegedly "open" models.
        
         | xena wrote:
         | The reason that companies align models is so that they don't
         | get on the front page of the new york times with a headline
         | like "Techaro's AI model used by terrorists to build a pipe
         | bomb that destroyed the New York Stock Exchange datacentre".
        
       | dpflan wrote:
       | These big models are getting pumped out like crazy, that is the
       | business of these companies. But basically, it feels like
       | private/industry just figured out how to scale up a scalable
       | process (deep learning), and it required not $M research grants
        | but $BB "research grants"/funding, and the scaling laws seem fun
        | to play with, tweaking more interesting things out of these models
        | and finding cool "emergent" behavior as billions of data points
        | get correlated.
       | 
       | But pumping out models and putting artifacts on HuggingFace, is
       | that a business? What are these models being used for? There is a
       | new one at a decent clip.
        
         | hdhshdhshdjd wrote:
         | I don't see any indication this beats Llama3 70B, but still
         | requires a beefy GPU, so I'm not sure the use case. I have an
         | A6000 which I use for a lot of things, Mixtral was my go-to
         | until Llama3, then I switched over.
         | 
         | If you could run this on say, stock CPU that would increase the
         | use cases dramatically, but if you still need a 4090 I'm either
         | missing something or this is useless.
        
           | azeirah wrote:
            | You don't need a 4090 at all. 16-bit requires about 24GB of
            | VRAM; 8-bit quants (99% the same performance) require only
            | 12GB of VRAM.
           | 
           | That's without the context window, so depending on how much
           | context you want to use you'll need some more GB.
           | 
           | That is, assuming you'll be using llama.cpp (which is
           | standard for consumer inference. Ollama is also llama.cpp, as
           | is kobold)
           | 
           | This thing will run fine on a 16GB card, and a q6
           | quantization will run fine on a 12GB card.
           | 
           | You'll still get good performance on an 8GB card with
           | offloading, since you'll be running most of it on the gpu
           | anyway.
        
         | eigenvalue wrote:
         | There are a lot of models coming out, but in my view, most
         | don't really matter or move the needle. There are the frontier
         | models which aren't open (like GPT-4o) and then there are the
         | small "elite" local LLMs like Llama3 8B. The rest seem like
         | they are mostly about manipulating benchmarks. Whenever I try
         | them, they are worse in actual practice than the Llama3 models.
        
       | adt wrote:
       | That's 3 releases for Mistral in 24 hours.
       | 
       | https://lifearchitect.ai/models-table/
        
       | ofermend wrote:
       | Congrats. Very exciting to see continued innovation around
       | smaller models, that can perform much better than larger models.
       | This enables faster inference and makes them more ubiquitous.
        
       | andrethegiant wrote:
       | I still don't understand the business model of releasing open
       | source gen AI models. If this took 3072 H100s to train, why are
       | they releasing it for free? I understand they charge people when
       | renting from their platform, but why permit people to run it
       | themselves?
        
         | kaoD wrote:
         | > but why permit people to run it themselves?
         | 
         | I wouldn't worry about that if I were them: it's been shown
         | again and again that people will pay for convenience.
         | 
         | What I'd worry about is Amazon/Cloudflare repackaging my model
         | and outcompeting my platform.
        
           | andrethegiant wrote:
           | > What I'd worry about is Amazon/Cloudflare repackaging my
           | model and outcompeting my platform.
           | 
           | Why let Amazon/Cloudflare repackage it?
        
             | bilbo0s wrote:
             | How would you stop them?
             | 
             | The license is Apache 2.
        
               | andrethegiant wrote:
               | That's my question -- why license as Apache 2
        
               | bilbo0s wrote:
               | What license would allow complete freedom for everyone
               | else, but constrain Amazon and Cloudflare?
        
               | supriyo-biswas wrote:
               | The LLaMa license is a good start.
        
               | abdullahkhalids wrote:
                | They could just create a custom license based on Apache
                | 2.0 that allows sharing but constrains some specific
                | behavior. It won't be formally Open Source, but will have
               | enough open source spirit that academics or normal people
               | will be happy to use it.
        
       | davidzweig wrote:
        | Did anyone try to check how its multilingual skills compare to
        | Gemma 2's? On the page, it's compared with Llama 3 only.
        
         | moffkalast wrote:
         | Well it's not on Le Chat, it's not on LMSys, it has a new
         | tokenizer that breaks llama.cpp compatibility, and I'm sure as
         | hell not gonna run it with Crapformers at 0.1x speed which as
         | of right now seems to be the only way to actually test it out.
        
       | I_am_tiberius wrote:
       | The last time I tried a Mistral model, it didn't answer most of
       | my questions, because of "policy" reasons. I hope they fixed
       | that. OpenAI at least only tells me that it's a policy issue but
       | still answers most of the time.
        
       | zone411 wrote:
       | Interesting that the benchmarks they show have it outperforming
       | Gemma 2 9B and Llama 3 8B, but it does a lot worse on my NYT
       | Connections benchmark (5.1 vs 16.3 and 12.3). The new GPT-4o mini
       | also does better at 14.3. It's just one benchmark though, so
       | looking forward to additional scores.
        
         | chant4747 wrote:
         | Can you help me understand why people seem to think of
         | Connections as a more robust indicator of (general) performance
         | than benchmarks typically used for eval?
         | 
         | It seems to me that while the game is very challenging for
         | people it's not necessarily an indicator of generalization. I
         | can see how it's useful - but I have trouble seeing how a low
         | score on it would indicate low performance on most tasks.
         | 
         | Thanks and hopefully this isn't perceived as offensive. Just
         | trying to learn more about it.
         | 
         | edit: I realize you yourself indicate that it's "just one
         | benchmark" - I am more asking about the broader usage I have
         | seen here on HN comments from several people.
        
       | lostmsu wrote:
       | Gonna wait for LMSYS benchmarks. The "standard" benchmarks all
       | seem unreliable.
        
       | wkcheng wrote:
       | Does anyone know whether the 128K is input tokens only? There are
       | a lot of models that have a large context window for input but a
       | small output context. If this actually has 128k tokens shared
       | between input and output, that would be a game changer.
        
       | eigenvalue wrote:
       | I have to say, the experience of trying to sign up for Nvidia
       | Enterprise so you can try the "NIM" packaged version of this
        | model, is just icky and awful now that I've gotten used to
       | actually free and open models and software. It feels much nicer
       | and more free to be able to clone llama.cpp and wget a .gguf
       | model file from huggingface without any registration at all.
       | Especially since it has now been several hours since I signed up
       | for the Nvidia account and it still says on the website "Your
       | License Should be Active Momentarily | We're setting up your
       | credentials to download NIMs."
       | 
       | I really don't get Nvidia's thinking with this. They basically
       | have a hardware monopoly. I shelled out the $4,000 or so to buy
       | two of their 4090 GPUs. Why are they still insisting on torturing
       | me with jumping through these awful hoops? They should just be
       | glad that they're winning and embrace freedom.
        
       ___________________________________________________________________
       (page generated 2024-07-18 23:09 UTC)