[HN Gopher] Codestral Mamba
       ___________________________________________________________________
        
       Codestral Mamba
        
       Author : tosh
       Score  : 334 points
       Date   : 2024-07-16 14:44 UTC (8 hours ago)
        
 (HTM) web link (mistral.ai)
 (TXT) w3m dump (mistral.ai)
        
       | sa-code wrote:
       | It's great to see a high-profile model using Mamba2!
        
       | culopatin wrote:
       | Does anyone have a video or written article that would get one up
       | to speed with a bit of the history/progression and current
       | products that are out there for one to try locally?
       | 
       | This is coming from someone that understands the general concepts
       | of how LLMs work but only used the general publicly available
       | tools like ChatGPT, Claude, etc.
       | 
       | I want to see if I have any hardware I can stress and run
       | something locally, but don't know where to start or even what are
       | the available options.
        
         | Kydlaw wrote:
          | If I understand correctly what you are looking for, Ollama
          | might be a solution (https://ollama.com/). I have no
          | affiliation, but I lazily use it when I want to run a quick
          | model locally.
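          | 
          | For reference, a minimal sketch of getting it going on a Linux
          | box (the install script is the one from ollama.com; the model
          | tag is just an example, pick whatever fits your hardware):
          | 
          | # install the ollama CLI/daemon
          | curl -fsSL https://ollama.com/install.sh | sh
          | 
          | # pull a small model and chat with it locally
          | ollama run mistral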
        
           | TechDebtDevin wrote:
            | Better yet, install Open WebUI and Ollama at the same time
            | via Docker. Most people will want a familiar GUI rather than
            | the terminal.
            | 
            | https://github.com/open-webui/open-webui
            | 
            | The bundled image below installs both Ollama and Open WebUI.
            | 
            | For GPU support run:
            | 
            | docker run -d -p 3000:8080 --gpus=all \
            |   -v ollama:/root/.ollama \
            |   -v open-webui:/app/backend/data \
            |   --name open-webui --restart always \
            |   ghcr.io/open-webui/open-webui:ollama
            | 
            | For CPU-only support run:
            | 
            | docker run -d -p 3000:8080 \
            |   -v ollama:/root/.ollama \
            |   -v open-webui:/app/backend/data \
            |   --name open-webui --restart always \
            |   ghcr.io/open-webui/open-webui:ollama
        
             | Der_Einzige wrote:
             | Why do people recommend this instead of the much better
             | oobabooga text-gen-webui?
             | 
             | https://github.com/oobabooga/text-generation-webui
             | 
             | It's like you hate settings, features, and access to many
             | backends!
        
               | TechDebtDevin wrote:
                | To each their own. How are you using these extra
                | features? I personally am not looking to spend a bunch on
                | API credits and don't have the hardware to run models
                | larger than 7-8b parameters. I use local LLMs almost
                | exclusively for formatting notes and as a reading
                | assistant/summarizer, and therefore don't need these
                | features.
        
         | sva_ wrote:
         | If you mean LLM in general, maybe try llamafile first
         | 
         | https://github.com/Mozilla-Ocho/llamafile
        
         | currycurry16 wrote:
         | Find good models here: https://huggingface.co/spaces/open-llm-
         | leaderboard/open_llm_...
         | 
         | Check hardware requirements here:
         | https://rahulschand.github.io/gpu_poor/
        
         | _kidlike wrote:
         | not sure about the history/progression part, but there's ollama
         | which makes it possible to run models locally. The UX of ollama
         | is similar to docker.
        
         | TechDebtDevin wrote:
          | Most of the 7b instruct models are very bad outside of very
          | simple queries.
          | 
          | You can run a 7b on most modern hardware. How fast will vary.
          | 
          | To run 30-70b models you're getting into the realm of needing
          | 24GB or more of VRAM.
        
           | Agentus wrote:
           | I'm looking to run something on a 24gb GPU for the purpose of
           | running wild with agentic use of LLMs. Is there anything
           | worth trying that would fit on that amount of vRAM? Or are
           | all the open-source PC-sized LLMs laughable still?
        
             | TechDebtDevin wrote:
             | You can run the llama 70b based models faster than 10 tkn/s
             | on 24gb vram. I've found that the quality of this class of
             | LLMs is heavily swayed by your configuration and system
             | prompting and results may vary. This Reddit post seems to
             | have some input on the topic:
             | 
             | https://www.reddit.com/r/LocalLLaMA/comments/1cj4det/llama_
             | 3...
             | 
              | I haven't used any agent frameworks other than messing
              | around with langchain a bit so I can't speak to how that
              | would affect things.
        
           | dTal wrote:
            | >Most of the 7b instruct models are very bad outside of very
            | simple queries.
           | 
           | I can't agree with "very bad". Maybe your standards are set
           | by the best, largest models, but have a little perspective: a
           | modern 7b model is a friggin _magical_ piece of software.
           | Fully in the realm of sci-fi until basically last Tuesday. It
           | can reliably summarize documents, bash a 30 minute rambling
           | voice note into a terse proposal, and give you social
            | counseling at least on par with r/Relationship_Advice. It
           | might not always get facts exactly right but it is _smart_ in
           | a way that computers have never been before. And for all this
           | capability, you can get it running on a computer a decade
           | old, maybe even a Raspberry Pi or a smartphone.
           | 
            | To answer the parent: download a "gguf" file (a blob of
            | weights) of a popular model like Mistral from Hugging Face,
            | git pull and compile llama.cpp, then run ./main -m
            | path/to/gguf -p "prompt"
        
         | derefr wrote:
         | For running LLMs, I think most people just dive into
         | https://www.reddit.com/r/LocalLLaMA/ and start reading.
         | 
         | Not sure what the equivalent is for image generation; it's
         | either https://www.reddit.com/r/StableDiffusion/ or one of the
         | related subreddits it links to.
         | 
         | Sadly, I've yet to find anyone doing "daily ML-hobbyist news"
         | content creation, summarizing the types of articles that appear
         | on these subreddits. (Which is a surprise to me, as it's really
         | easy to find e.g. "daily homelab news" content creators.
         | Please, someone, start a "daily ML-hobbyist news" blog/channel!
         | Given that the target audience would essentially be "people who
         | will get an itch to buy a better GPU soon", the CPM you'd earn
         | on ad impressions would be really high...)
         | 
         | ---
         | 
         | That being said, just to get you started, here's a few things
         | to know at present about "what you can run locally":
         | 
          | 1. Most models (of the architectures people care about today)
          | will _probably_ fit on a GPU which has something like 1.5x the
          | VRAM of the model's parameter-weights size. So e.g. a "7B" (7
          | billion parameter-weights) model will fit on a GPU that has
          | 12GB of VRAM; see the rough arithmetic sketch after this list.
          | (You can potentially squeeze even tighter if you have a machine
          | with integrated graphics + dedicated GPU, and you're using the
          | integrated graphics as graphics, leaving the GPU's VRAM free to
          | _only_ hold the model.)
         | 
         | 2. There are models that come in all sorts of sizes. Many open-
         | source ML models are huge (70B, 120B, 144B -- things you'd need
         | datacenter-class GPUs to run), but then versions of these same
         | models get released which have been heavily cut down (pruned
         | and/or quantized), to force them to fit into smaller VRAM
         | sizes. There are 5B, 3B, 1B, even 0.5B models (although the
         | last two are usually special-purpose models.)
         | 
         | 3. Surprisingly, depending on your use-case, smaller models (or
         | small quants of larger models) can "mostly" work perfectly
         | well! They just have more edge-cases where something will send
         | them off the rails spiralling into nonsense -- so they're less
         | _reliable_ than their larger cousins. You might have to give
         | them more prompting, and try regenerating their output from the
         | same prompt several times, to get good results.
         | 
          | 4. Apple Silicon Macs have a GPU and TPU that read from/write
          | to the same unified memory that the CPU does. While this makes
          | these devices _slower_ for inference than "real" GPUs with
          | dedicated VRAM, it means that if you happen to own a Mac with
          | 16GB of RAM, then you own something that can run 7B models. AS
          | Macs are, oddly enough, the "cheapest" things you can buy in
          | terms of model-capacity-per-dollar. (Unlike a "real" GPU, they
          | won't be especially _quick_ and won't have any capacity for
          | _concurrent_ model inference, so you'd never use one as a
          | server backing an Inference-as-a-Service business. But for home
          | use? No real downsides.)
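          | 
          | The rule of thumb from point 1, as a throwaway shell sketch
          | (the bytes-per-parameter figures are rough assumptions: ~2 for
          | fp16, ~1 for 8-bit quants, ~0.5 for 4-bit):
          | 
          | params_b=7          # model size, billions of parameters
          | bytes_per_param=1   # ~2 fp16, ~1 Q8, ~0.5 Q4
          | awk -v p="$params_b" -v b="$bytes_per_param" 'BEGIN {
          |   w = p * b
          |   printf "~%.1f GB weights -> ~%.1f GB VRAM\n", w, w * 1.5
          | }'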
        
         | nabakin wrote:
         | Here's a summary of what's happened the past couple of years
         | and what tools are out there.
         | 
          | After ChatGPT's release, there was a lot of hype in the space
         | but open source was far behind. Iirc the best open foundation
         | LLM that existed was GPT-2 but it was two generations behind.
         | 
          | A while later, Meta released LLaMA[1], a well-trained base
         | foundation model, which brought an explosion to open source. It
         | was soon implemented in the Hugging Face Transformers
         | library[2] and the weights were spread across the Hugging Face
         | website for anyone to use.
         | 
          | At first, it was difficult to run locally. Few developers had
          | the system or money to run it. It required too much RAM, and
          | iirc
         | Meta's original implementation didn't support running on the
         | CPU but developers soon came up with methods to make it smaller
         | via quantization. The biggest project for this was Llama.cpp[3]
         | which probably is still the biggest open source project today
         | for running LLMs locally. Hugging Face Transformers also added
         | quantization support through bitsandbytes[4].
         | 
         | Over the next months there was rapid development in open
         | source. Quantization techniques improved which meant LLaMA was
         | able to run with less and less RAM with greater and greater
         | accuracy on more and more systems. Tools came out that were
         | capable of finetuning LLaMA and there were hundreds of LLaMA
         | finetunes that came out which finetuned LLaMA on instruction
         | following, RLHF, and chat datasets which drastically increased
          | accuracy even further. During this time, Stanford's Alpaca,
          | Lmsys's Vicuna, Microsoft's Wizard, 01ai's Yi, Mistral, and a
          | few others made their way onto the open LLM scene, several of
          | them with very good LLaMA finetunes.
         | 
         | A new inference engine (software for running LLMs like
         | Llama.cpp, Transformers, etc) called vLLM[5] came out which was
         | capable of running LLMs in a more efficient way than was
         | previously possible in open source. Soon it would even get good
         | AMD support, making it possible for those with AMD GPUs to run
         | open LLMs locally and with relative efficiency.
         | 
         | Then Meta released Llama 2[6]. Llama 2 was by far the best open
          | LLM of its time, released with RLHF instruction finetunes for
          | chat and with human evaluation data that put its open LLM
          | leadership beyond doubt. Existing tools like Llama.cpp and
         | Hugging Face Transformers quickly added support and users had
         | access to the best LLM open source had to offer.
         | 
         | At this point in time, despite all the advancements, it was
         | still difficult to run LLMs. Llama.cpp and Transformers were
         | great engines for running LLMs but the setup process was
         | difficult and required a lot of time. You had to find the best
         | LLM, quantize it in the best way for your computer (or figure
         | out how to identify and download one from Hugging Face), setup
         | whatever engine you wanted, figure out how to use your
         | quantized LLM with the engine, fix any bugs you made along the
         | way, and finally figure out how to prompt your specific LLM in
         | a chat-like format.
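          | 
          | Concretely, the llama.cpp leg of that workflow looked roughly
          | like this (script and binary names have shifted between
          | llama.cpp versions, and the prompt template depends on the
          | model, so treat this as a sketch):
          | 
          | # convert Hugging Face weights to gguf, then quantize
          | python convert-hf-to-gguf.py /path/to/hf-model \
          |   --outfile model-f16.gguf
          | ./quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
          | ./main -m model-Q4_K_M.gguf -p "[INST] Hello [/INST]"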
         | 
         | However, tools started coming out to make this process
         | significantly easier. The first one of these that I remember
         | was GPT4All[7]. GPT4All was a wrapper around Llama.cpp which
         | made it easy to install, easy to select the LLM that you want
         | (pre-quantized options for easy download from a download
         | manager), and a chat UI which made LLMs easy to use. This
         | significantly reduced the barrier to entry for those who were
         | interested in using LLMs.
         | 
         | The second project that I remember was Ollama[8]. Also a
         | wrapper around Llama.cpp, Ollama gave most of what GPT4All had
         | to offer but in an even simpler way. Today, I believe Ollama is
         | bigger than GPT4All although I think it's missing some of the
         | higher-level features of GPT4All.
         | 
         | Another important tool that came out during this time is called
         | Exllama[9]. Exllama is an inference engine with a focus on
         | modern consumer Nvidia GPUs and advanced quantization support
         | based on GPTQ. It is probably the best inference engine for
         | squeezing performance out of consumer Nvidia GPUs.
         | 
         | Months later, Nvidia came out with another new inference engine
         | called TensorRT-LLM[10]. TensorRT-LLM is capable of running
         | most LLMs and does so with extreme efficiency. It is the most
         | efficient open source inferencing engine that exists for Nvidia
         | GPUs. However, it also has the most difficult setup process of
         | any inference engine and is made primarily for production use
         | cases and Nvidia AI GPUs so don't expect it to work on your
         | personal computer.
         | 
         | With the rumors of GPT-4 being a Mixture of Experts LLM,
         | research breakthroughs in MoE, and some small MoE LLMs coming
          | out, interest in MoE LLMs was at an all-time high. Mistral,
          | which had already proven itself with the very impressive
          | Mistral 7b, capitalized on this interest by releasing
          | Mixtral 8x7b[11], the best accuracy-for-its-size LLM the
          | local LLM community had seen to date. Eventually MoE support
         | was added to all inference engines and it was a very popular
         | mid-to-large sized LLM.
         | 
         | Cohere released their own LLM as well called Command R+[12]
         | built specifically for RAG-related tasks with a context length
         | of 128k. It's quite large and doesn't have notable performance
         | on many metrics, but it has some interesting RAG features no
         | other LLM has.
         | 
          | More recently, Llama 3[13] was released which, similar to
          | previous Llama releases, blew every other open LLM out of the
          | water. The smallest version of Llama 3 (Llama 3 8b) has the
          | greatest accuracy for its size of any open LLM and the
         | largest version of Llama 3 released so far (Llama 3 70b) beats
         | every other open LLM on almost every metric.
         | 
          | Less than a month ago, Google released Gemma 2[14], the largest
          | of which performs very well under human evaluation despite
          | being less than half the size of Llama 3 70b, but only
          | decently on automated benchmarks.
         | 
         | If you're looking for a tool to get started running LLMs
         | locally, I'd go with either Ollama or GPT4All. They make the
         | process about as painless as possible. I believe GPT4All has
         | more features like using your local documents for RAG, but you
         | can also use something like Open WebUI[15] with Ollama to get
         | the same functionality.
         | 
         | If you want to get into the weeds a bit and extract some more
         | performance out of your machine, I'd go with using Llama.cpp,
         | Exllama, or vLLM depending upon your system. If you have a
         | normal, consumer Nvidia GPU, I'd go with Exllama. If you have
         | an AMD GPU that supports ROCm 5.7 or 6.0, I'd go with vLLM. For
         | anything else, including just running it on your CPU or
         | M-series Mac, I'd go with Llama.cpp. TensorRT-LLM only makes
         | sense if you have an AI Nvidia GPU like the A100, V100, A10,
         | H100, etc.
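          | 
          | As an example of the vLLM route, something like this should
          | stand up an OpenAI-compatible server on port 8000 (the model
          | id is just an illustration; substitute whatever fits your
          | GPU):
          | 
          | pip install vllm
          | python -m vllm.entrypoints.openai.api_server \
          |   --model mistralai/Mistral-7B-Instruct-v0.2
          | 
          | # then, from another terminal:
          | curl http://localhost:8000/v1/completions \
          |   -H "Content-Type: application/json" \
          |   -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2",
          |        "prompt": "def quicksort(arr):", "max_tokens": 64}'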
         | 
         | [1] https://ai.meta.com/blog/large-language-model-llama-meta-
         | ai/
         | 
         | [2] https://github.com/huggingface/transformers
         | 
         | [3] https://github.com/ggerganov/llama.cpp
         | 
         | [4] https://github.com/bitsandbytes-foundation/bitsandbytes
         | 
         | [5] https://github.com/vllm-project/vllm
         | 
         | [6] https://ai.meta.com/blog/llama-2/
         | 
         | [7] https://www.nomic.ai/gpt4all
         | 
         | [8] http://ollama.ai/
         | 
         | [9] https://github.com/turboderp/exllamav2
         | 
         | [10] https://github.com/NVIDIA/TensorRT-LLM
         | 
         | [11] https://mistral.ai/news/mixtral-of-experts/
         | 
         | [12] https://cohere.com/blog/command-r-plus-microsoft-azure
         | 
         | [13] https://ai.meta.com/blog/meta-llama-3/
         | 
         | [14] https://blog.google/technology/developers/google-gemma-2/
         | 
         | [15] https://github.com/open-webui/open-webui
        
           | holoduke wrote:
           | Great info. Do you also know the state of the code
           | assistants? Any thoughts on copilot versus others?
        
             | nabakin wrote:
             | I've been following the state of things, but I'm not sure
             | which ones are the best. There's Meta's CodeLlama[1],
             | Mistral's Codestral[2], DeepSeek AI's DeepSeek-
             | Coder-V2-Instruct[3], CodeGemma[4], Alibaba's CodeQwen[5],
             | and Microsoft's WizardCoder[6].
             | 
             | I'm pretty sure CodeLlama is out of date now. I've heard
             | DeepSeek LLMs are good and DeepSeek-Coder-V2-Instruct was
              | released recently. Given its good reputation and massive
              | size (236b) I'd guess it is the best coding LLM, but if
              | you can't run it efficiently, maybe Codestral and
              | Codestral Mamba come close.
             | 
             | I don't think the best coding LLMs are close to GitHub
             | Copilot but I could be wrong since I'm just relaying
             | information that I've heard secondhand.
             | 
             | [1] https://ai.meta.com/blog/code-llama-large-language-
             | model-cod...
             | 
             | [2] https://mistral.ai/news/codestral/
             | 
             | [3] https://github.com/deepseek-ai/DeepSeek-Coder-V2
             | 
             | [4] https://developers.googleblog.com/en/gemma-family-
             | expands-wi...
             | 
             | [5] https://qwenlm.github.io/blog/codeqwen1.5/
             | 
             | [6] https://github.com/nlpxucan/WizardLM
        
             | hobofan wrote:
             | All the main IDE-integrated ones seem very much on par
             | (Copilot, Sourcegraph Cody, Continue.dev), with cursor.sh
              | liked by some as it has a code-assistant-first UI.
             | 
              | I've personally gone back to the browser with Claude 3.5
             | Sonnet (and the projects + artifacts feature), as it is one
             | of the most industrious ones, and I really like the UX of
             | artifacts + it integrates new code well into existing code
             | you paste into it.
             | 
             | In the end I think it also often comes down to what
             | languages/frameworks you are using and how well the
             | LLM/product handles it, so I'd still recommend to test
             | around. E.g. some of the main frameworks I'm working with
             | on a daily basis went through big refactors/interface
             | changes 1-2 years ago, and I stopped using ChatGPT because
             | it had a strong tendency to produce code based on the old
             | interfaces/paradigms.
             | 
             | Aider[0] is also quite interesting, especially when it
             | comes to more significant refactorings in the codebase and
              | has gotten quite good at that with the last few bigger
              | model releases, but it takes some time to get used to and
              | doesn't have good IDE integration.
             | 
             | [0]: https://github.com/paul-gauthier/aider
        
           | iAmAPencilYo wrote:
           | Thank you! Very helpful as a newbie coming in.
        
           | psychoslave wrote:
           | This is one one the most useful and informative comment I
           | ever faced on HN. Thank you very much.
        
       | bhouston wrote:
       | What are the steps required to get this running in VS Code?
       | 
       | If they had linked to the instructions in their post (or better
       | yet a link to a one click install of a VS Code Extension), it
       | would help a lot with adoption.
       | 
        | (BTW I consider it malpractice that they are at the top of Hacker
        | News with a model that is of great interest to a large portion of
        | the users here, and they do not have a monetizable call to action
        | on the featured page.)
        
         | leourbina wrote:
         | If you can run this using ollama, then you should be able to
         | use https://www.continue.dev/ with both IntelliJ and VSCode.
         | Haven't tried this model yet - but overall this plugin works
         | well.
        
           | scosman wrote:
           | They say no llama.cpp support yet, so no ollama yet (which
           | uses llama.cpp)
        
             | sadeshmukh wrote:
             | Ollama is supported:
             | https://docs.continue.dev/setup/select-provider
        
               | trsohmers wrote:
               | They meant that there is no support for Codestral Mamba
               | for llama.cpp yet.
        
             | HanClinto wrote:
             | Correct. The only back-end that Ollama uses is llama.cpp,
             | and llama.cpp does not yet have Mamba2 support. The issues
             | to track Mamba2 and Codestral Mamba support are here:
             | 
             | https://github.com/ggerganov/llama.cpp/issues/8519
             | 
             | https://github.com/ggerganov/llama.cpp/issues/7727
             | 
             | Mamba support was added in March of this year:
             | 
             | https://github.com/ggerganov/llama.cpp/pull/5328
             | 
             | I have not yet seen a PR to address Mamba2.
        
           | osmano807 wrote:
           | Unrelated, all my devices freeze when accessing this page,
           | desktop Firefox and Chrome, mobile Firefox and Brave. Is this
            | the best alternative for code AI helpers in VS Code, besides
            | GitHub Copilot and Google Gemini?
        
             | raphaelj wrote:
             | I've been using it for a few months (with Starcoder 2 for
             | code, and GPT-4o for chat). I find the code completion
             | actually better than Github Copilot.
             | 
              | My main complaint is that the chat sometimes fails to
             | correctly render some GPT-4o output (e.g. LaTeX
             | expressions), but it's mostly fixed with a custom system
             | prompt. It also significantly reduces the battery life of
             | my Macbook M1, but that's expected.
        
             | oliverulerich wrote:
             | I'm quite happy with Cody from Sourcegraph https://marketpl
             | ace.visualstudio.com/items?itemName=sourcegr...
        
         | sleepytimetea wrote:
         | Looking through the Quickstart docs, they have an API that can
         | generate code. However, I don't think they have a way to do
         | "Day 2" code editing.
         | 
          | Also, it doesn't seem to have a freemium tier... you need to
          | start paying even before trying it out?
         | 
         | "Our API is currently available through La Plateforme. You need
         | to activate payments on your account to enable your API keys."
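          | 
          | Once payments are enabled, it's a plain OpenAI-style chat
          | endpoint; something like the following should work (the model
          | name "open-codestral-mamba" is my assumption and may differ,
          | so check the docs):
          | 
          | curl https://api.mistral.ai/v1/chat/completions \
          |   -H "Authorization: Bearer $MISTRAL_API_KEY" \
          |   -H "Content-Type: application/json" \
          |   -d '{"model": "open-codestral-mamba",
          |        "messages": [{"role": "user",
          |          "content": "Write a Python quicksort."}]}'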
        
           | sv123 wrote:
           | I signed up when codestral was first available and put my
           | payment details in. Been using it daily since then with
           | continue.dev but my usage dashboard shows 0 tokens, and so
           | far have not been billed for anything... Definitely not clear
           | anywhere, but it seems to be free for now? Or some sort of
           | free limit that I am not hitting.
        
             | sunaookami wrote:
             | Through codestral.mistral.ai? It's free until August 1st:
             | https://docs.mistral.ai/capabilities/code_generation/
             | 
             | >Monthly subscription based, free until 1st of August
        
         | refulgentis wrote:
         | "All you need is users" doesn't seem optimal IMHO, Stability.ai
         | providing an object lesson in that.
         | 
         | They just released weights, and being a for profit, need to
         | optimize for making money, not eyeballs. It seems wise to guide
         | people to the API offering.
        
           | bhouston wrote:
           | On top of Hacker News (the target demographic for coders)
           | without an effective monetizable call to action? What a
           | missed opportunity.
           | 
            | GitHub Copilot makes $100M+/year, if not way, way more.
           | 
           | Having a VS Code extension for Mistral would be a revenue
           | stream if it was one-click and better or cheaper than Github
           | Copilot. It is malpractice in my mind to not be doing this if
           | you are investing in creating coding models.
        
             | refulgentis wrote:
             | I see, that makes sense: make an extension and charge for
             | it.
             | 
              | I assumed they meant free and local. It doesn't seem
              | rational to make this one paid: it's significantly smaller
              | than their better model, and even more so than Copilot's.
        
           | passion__desire wrote:
            | But they also signal competence in the space, which matters
            | for M&A. Or big nation states could hire them in the future
            | to produce country models once the space matures, as was
            | Emad's vision.
        
             | refulgentis wrote:
              | Did Emad's vision ever manifest? E.g. did a nation-state
              | end up paying Stability for a country model?
             | 
             | Would it help signal competency? They're a small team
             | focused on making models, not VS Code extensions.
             | 
             | Would they do M&A? The founding team is ex-Googlers and has
             | found significant attention in the MBA world via being an
             | EU champion.
        
       | monkeydust wrote:
        | Any recommended primers on Mamba vs Transformers - pros/cons,
        | etc?
        
         | ertgbnm wrote:
         | https://www.youtube.com/watch?v=X5F2X4tF9iM
         | 
         | This is what introduced me to them. May be a bit outdated at
         | this point.
        
         | bhouston wrote:
         | This video is good:
         | https://www.youtube.com/watch?v=N6Piou4oYx8. As are the other
         | videos on the same YouTube account.
        
         | red2awn wrote:
          | A very good primer on state-space models (on which Mamba is
          | based) is The Annotated S4 [1]. If you want to dive into the
         | code I wrote a minimal single-file implementation of Mamba-2
         | here [2].
         | 
         | [1]: https://srush.github.io/annotated-s4/
         | 
         | [2]: https://github.com/tommyip/mamba2-minimal
        
       | croemer wrote:
       | The first sentence is wrong. The website says:
       | 
       | > As a tribute to Cleopatra, whose glorious destiny ended in
       | tragic snake circumstances
       | 
       | but according to Wikipedia this is not true:
       | 
       | > When Cleopatra learned that Octavian planned to bring her to
       | his Roman triumphal procession, she killed herself by poisoning,
       | contrary to the popular belief that she was bitten by an asp.
        
         | rjurney wrote:
         | I believe this is in dispute among sources.
        
         | skybrian wrote:
         | Yes, that seems to be a myth, but exact circumstances seem
         | rather uncertain according to the Wikipedia article [1]:
         | 
         | > [A]ccording to the Roman-era writers Strabo, Plutarch, and
         | Cassius Dio, Cleopatra poisoned herself using either a toxic
         | ointment or by introducing the poison with a sharp implement
         | such as a hairpin. Modern scholars debate the validity of
         | ancient reports involving snakebites as the cause of death and
         | whether she was murdered. Some academics hypothesize that her
         | Roman political rival Octavian forced her to kill herself in a
         | manner of her choosing. The location of Cleopatra's tomb is
         | unknown. It was recorded that Octavian allowed for her and her
         | husband, the Roman politician and general Mark Antony, who
         | stabbed himself with a sword, to be buried together properly.
         | 
         | I think this rounds to "nobody really knows."
         | 
         | The "glorious destiny" seems kind of shaky, too. It's just a
         | throwaway line anyway.
         | 
         | [1] https://en.m.wikipedia.org/wiki/Death_of_Cleopatra
        
         | ljsprague wrote:
         | What bothers me more is that the legend is that she was killed
         | by an asp, not a mamba.
        
         | dghlsakjg wrote:
         | Maybe Octavian was the snake?
        
       | rjurney wrote:
       | But I JUST switched from GPT4o to Claude! :( Kidding, but it
       | isn't clear how to use this thing, as others have pointed out.
        
         | ukuina wrote:
         | What made you switch?
        
           | pelagicAustral wrote:
           | I'm using both, been doing that for months now. I can
           | confidently assert that while Claude is getting better and
            | better, GPT 4 and 4o seem to be getting dumbed down for some
            | unexplained reason. Claude is now my go-to for anything code.
            | (I do Ruby and C#, btw; others might have a different
            | experience.)
        
             | marcyb5st wrote:
             | I guess they are distilling the models so that they can
             | save $$$ on serving.
        
           | ldjkfkdsjnv wrote:
           | GPT4o is way behind sonnet 3.5
        
             | mountainriver wrote:
             | Huh I guess all the benchmarks are wrong then
        
               | causal wrote:
               | Agreed.
        
           | rjurney wrote:
           | Claude is much better. Overwhelmingly better. It not only
           | implements deep learning models for me, it has great
           | suggestions on evolving them to actually work.
        
             | mountainriver wrote:
             | lol no it's not, the benchmarks don't show that at all.
             | Both have issues in different ways
        
               | causal wrote:
               | Benchmarks are pretty flawed IMO, in particular their
               | weakness here seems to be that they are poor at
               | evaluating long-tail multiturn conversations. 4o often
               | gives a great first response, then spirals into a
               | repetition. Sonnet 3.5 is much better at seeing the big
               | picture in a longer conversation IMO.
        
               | orbital-decay wrote:
               | Repetition in multiturn conversations is actually
               | Sonnet's fatal flaw, both 3 and 3.5. 4o is also
               | repetitive to an extent. Opus is _way_ better than both
               | at being non-repetitive.
        
               | stavros wrote:
               | I made a mobile app the other day using LLMs (I had never
               | used React or TypeScript before, and I built an app with
               | React Native). I was pretty disappointed, both Sonnet 3.5
                | and gpt-4-turbo performed pretty poorly, making mistakes
                | like missing a closing bracket somewhere, which meant I
                | had to revert because I had no idea where they meant to
               | put it.
               | 
               | Also they did the thing that junior developers tend to
               | do, where you have a race condition of some sort, and
               | they just work around it by adding some if checks. The
               | app is at around 400 lines right now, it works but feels
               | pretty brittle. Adding a tiny feature here or there
               | breaks something else, and GPT does the wrong thing half
               | the time.
               | 
               | All in all, I'm not complaining, because I made an app in
               | two days, but it won't replace a developer yet, no matter
               | how much I want it to.
        
           | throwup238 wrote:
           | Claude Projects which allow attaching a bunch of files to
           | fill up the 200k context. I wrote up a script to dump a bunch
           | of code and documentation files to markdown as context and I
            | add them to a bunch of Claude projects on a per-topic basis.
           | 
           | For example, I'm currently working on a Rust/Qt desktop app
           | so I have a project with the whole Qt6 book attached to ask
           | questions about Qt, a project with my SQL schema and
           | ORM/Sqlite docs to ask questions about the app's data and
           | generate models without dealing with hallucinations, a
           | project with all my QML files and Rust QML element code, a
           | project with a bunch of Rust crate docs, and so on and on.
           | 
           | GPTs allow attaching files too but Claude Projects dump the
           | entire contents of the files into the context rather than
           | trying to do some hacky RAG that never works like I want it
           | to.
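            | 
            | A minimal sketch of the idea (not the exact script, and
            | assuming the Rust/QML sources live under src/ and qml/):
            | 
            | # dump code into one markdown file to attach as context
            | find src qml -name '*.rs' -o -name '*.qml' |
            |   while read -r f; do
            |     echo "## $f"; echo '```'; cat "$f"; echo '```'
            |   done > project-context.md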
        
             | funnygiraffe wrote:
             | I was under the impression that with LLMs, in order to get
             | high-quality answers, it's always best to keep context
             | short. Is that not the case anymore? Does Claude under this
             | usage paradigm not struggle with very long contexts in ways
             | as for example described in the "lost in the middle" paper
             | (https://arxiv.org/abs/2307.03172)?
        
               | throwup238 wrote:
               | I don't have the time to evaluate the effects of context
               | length on my use cases so I have no idea. There might be
               | some degradation when I attach the Qt book which is
               | probably already in Claude's training data but when using
               | it against my private code base, it's not like I have any
               | other choice.
               | 
               | The UX of drag and dropping a few monolithic markdown
               | files to include entire chunks of a large project
               | outweighs the downsides of including irrelevant context
               | in my experience.
        
               | inciampati wrote:
               | No, you need to provide as much information in context as
               | possible. Otherwise you are sampling from the mode.
               | "Write me an essay about cows" = garbage boring and
               | probably 200 words. "here are twenty papers about cow
               | evolution, write me an overview of findings" = yes
        
               | azeirah wrote:
               | The conclusion you walked away with is the opposite of
               | what usually works in practice.
               | 
               | The more context you give the llm, the better.
               | 
               | The key takeaway from that paper is to keep your
               | instructions/questions/direction in the beginning or at
               | the end of the context. Any information can go anywhere.
               | 
               | Not to be too dismissive, it's a good paper, but we're
               | one year further and in practice this issue seems to have
               | been tackled by training on better data.
               | 
               | This can differ a lot depending on what model you're
               | using, but in the case of claude sonnet 3.5, more
               | relevant context is generally better for anything except
               | for speed.
               | 
               | It does remain true that you need to keep your most
               | important instructions at the beginning or at the end
               | however.
        
       | magnio wrote:
       | They announce the model is on HuggingFace but don't link to it.
       | Here it is: https://huggingface.co/mistralai/mamba-
       | codestral-7B-v0.1
        
         | dvfjsdhgfv wrote:
         | The link is already there in the text, they probably just fixed
         | it.
        
       | imjonse wrote:
       | The MBPP column should bold DeepSeek as it has a better score
       | than Codestral.
        
         | smith7018 wrote:
         | Which means Codestral Mamba and DeepSeek both lead four
            | benchmarks. Kinda takes the air out of the announcement a bit.
        
           | causal wrote:
           | It should be corrected but the interesting aspect of this
           | release is the architecture. To stay competitive while only
           | needing linear inference time and supporting 256k context is
           | pretty neat.
        
             | mbowcut2 wrote:
             | THIS. People don't realize the importance of Mamba
             | competing on par with transformers.
        
           | ed wrote:
           | They're in roughly the same class but totally different
           | architectures
           | 
           | Deepseek uses a 4k sliding window compared to Codestral
           | Mamba's 256k+ tokens
        
       | localfirst wrote:
        | Any evals on how it compares to closed models like GPT-4 or open
        | ones like WizardLM?
        
       | pzo wrote:
        | Weird that they compare to deepseek-coder v1.5 when we already
        | have v2.0. Any advantage to using Codestral Mamba apart from it
        | being lighter in weights?
        
         | kz919 wrote:
         | obviously because they can't beat it... There will be zero
         | reason to use it when you have better transformer based models
         | that can fit the existing infrastructure.
        
       | sam_goldman_ wrote:
       | You can try this model out using OpenAI's API format with this
       | TypeScript SDK: https://github.com/token-js/token.js
       | 
       | You just need a Mistral API key: https://console.mistral.ai/api-
       | keys/
        
       | Kinrany wrote:
       | Is there a good explanation of the Mamba architecture?
        
         | simonw wrote:
         | There's a paper: https://arxiv.org/abs/2312.00752
         | 
         | I haven't seen any good non-paper explainers yet.
        
         | alecco wrote:
         | https://thegradient.pub/mamba-explained/
         | 
         | https://jackcook.com/2024/02/23/mamba.html
         | 
         | https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html
        
       | thot_experiment wrote:
       | Does anyone have a favorite FIM capable model? I've been using
        | codellama-13b through ollama w/ a vim extension I wrote and it's
        | okay but not amazing. I definitely get better code most of the
       | time out of Gemma-27b but no FIM (and for some reason
       | codellama-34b has broken inference for me)
        
       | taf2 wrote:
       | How does this work in vim?
        
       | flakiness wrote:
       | So Mamba is supposed to be faster and the article claims that.
       | But they don't have any latency numbers.
       | 
       | Has anyone tried this? And then, is it fast(er)?
        
       ___________________________________________________________________
       (page generated 2024-07-16 23:00 UTC)