[HN Gopher] Codestral Mamba
___________________________________________________________________
Codestral Mamba
Author : tosh
Score : 334 points
Date : 2024-07-16 14:44 UTC (8 hours ago)
(HTM) web link (mistral.ai)
(TXT) w3m dump (mistral.ai)
| sa-code wrote:
| It's great to see a high-profile model using Mamba2!
| culopatin wrote:
| Does anyone have a video or written article that would get one up
| to speed with a bit of the history/progression and current
| products that are out there for one to try locally?
|
| This is coming from someone who understands the general
| concepts of how LLMs work but has only used publicly
| available tools like ChatGPT, Claude, etc.
|
| I want to see if I have any hardware I can stress and run
| something locally, but don't know where to start or even what
| the available options are.
| Kydlaw wrote:
| If I understand correctly what you are looking for, Ollama
| (https://ollama.com/) might be a solution? I have no
| affiliation, but I lazily use it when I want to run a quick
| model locally.
| TechDebtDevin wrote:
| Better yet, install Open WebUI and Ollama at the same time
| via Docker. Most people will want a familiar GUI rather than
| the terminal.
|
| https://github.com/open-webui/open-webui
|
| This will install Ollama and Open WebUI.
|
| For GPU support run:
|
| docker run -d -p 3000:8080 --gpus=all \
|   -v ollama:/root/.ollama \
|   -v open-webui:/app/backend/data \
|   --name open-webui --restart always \
|   ghcr.io/open-webui/open-webui:ollama
|
| For CPU-only support run:
|
| docker run -d -p 3000:8080 \
|   -v ollama:/root/.ollama \
|   -v open-webui:/app/backend/data \
|   --name open-webui --restart always \
|   ghcr.io/open-webui/open-webui:ollama
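|
| Once the container is up, the Open WebUI front end should be
| reachable at http://localhost:3000 (the -p 3000:8080 flag maps
| the container's port 8080 to port 3000 on the host).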
| Der_Einzige wrote:
| Why do people recommend this instead of the much better
| oobabooga text-gen-webui?
|
| https://github.com/oobabooga/text-generation-webui
|
| It's like you hate settings, features, and access to many
| backends!
| TechDebtDevin wrote:
| To each their own. How are you using these extra features?
| I personally am not looking to spend a bunch on API credits
| and don't have the hardware to run models larger than 7-8b
| parameters. I use local LLMs almost exclusively for
| formatting notes and as a reading assistant/summarizer and
| therefore don't need these features.
| sva_ wrote:
| If you mean LLM in general, maybe try llamafile first
|
| https://github.com/Mozilla-Ocho/llamafile
| currycurry16 wrote:
| Find good models here: https://huggingface.co/spaces/open-llm-
| leaderboard/open_llm_...
|
| Check hardware requirements here:
| https://rahulschand.github.io/gpu_poor/
| _kidlike wrote:
| Not sure about the history/progression part, but there's
| Ollama, which makes it possible to run models locally. The UX
| of Ollama is similar to Docker's.
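|
| For instance (a minimal sketch; the curl one-liner is Ollama's
| Linux installer, and "mistral" is just an example model tag
| from their library):
|
| curl -fsSL https://ollama.com/install.sh | sh
| ollama pull mistral
| ollama run mistral "Explain state-space models in two sentences."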
| TechDebtDevin wrote:
| Most of the 7b instruct models are very bad outside of very
| simple queries.
|
| You can run a 7b on most modern hardware; how fast will vary.
|
| To run 30-70b models you're getting into the realm of needing
| 24gb or more of VRAM.
| Agentus wrote:
| I'm looking to run something on a 24gb GPU for the purpose of
| running wild with agentic use of LLMs. Is there anything
| worth trying that would fit on that amount of vRAM? Or are
| all the open-source PC-sized LLMs laughable still?
| TechDebtDevin wrote:
| You can run the Llama 70b-based models at faster than 10
| tkn/s on 24gb of VRAM. I've found that the quality of this
| class of LLMs is heavily swayed by your configuration and
| system prompting, and results may vary. This Reddit post
| seems to have some input on the topic:
|
| https://www.reddit.com/r/LocalLLaMA/comments/1cj4det/llama_
| 3...
|
| I haven't used any agent frameworks other than messing
| around with langchain a bit, so I can't speak to how that
| would affect things.
| dTal wrote:
| >Most of the 7b instruct models are very bad outside of very
| simple queries.
|
| I can't agree with "very bad". Maybe your standards are set
| by the best, largest models, but have a little perspective: a
| modern 7b model is a friggin _magical_ piece of software.
| Fully in the realm of sci-fi until basically last Tuesday. It
| can reliably summarize documents, bash a 30 minute rambling
| voice note into a terse proposal, and give you social
| counseling at least on par with r/Relationship_Advice. It
| might not always get facts exactly right but it is _smart_ in
| a way that computers have never been before. And for all this
| capability, you can get it running on a computer a decade
| old, maybe even a Raspberry Pi or a smartphone.
|
| To answer the parent: download a "gguf" file (a blob of
| weights) for a popular model like Mistral from Hugging Face,
| git pull and compile llama.cpp, then run ./main -m
| path/to/gguf -p "prompt".
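|
| Spelled out as shell (a minimal sketch of those steps; the
| .gguf path and the prompt are placeholders to substitute):
|
| git clone https://github.com/ggerganov/llama.cpp
| cd llama.cpp && make
| # -m points at whichever quantized .gguf file you downloaded
| ./main -m path/to/model.gguf -p "Summarize these notes: ..."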
| derefr wrote:
| For running LLMs, I think most people just dive into
| https://www.reddit.com/r/LocalLLaMA/ and start reading.
|
| Not sure what the equivalent is for image generation; it's
| either https://www.reddit.com/r/StableDiffusion/ or one of the
| related subreddits it links to.
|
| Sadly, I've yet to find anyone doing "daily ML-hobbyist news"
| content creation, summarizing the types of articles that appear
| on these subreddits. (Which is a surprise to me, as it's really
| easy to find e.g. "daily homelab news" content creators.
| Please, someone, start a "daily ML-hobbyist news" blog/channel!
| Given that the target audience would essentially be "people who
| will get an itch to buy a better GPU soon", the CPM you'd earn
| on ad impressions would be really high...)
|
| ---
|
| That being said, just to get you started, here are a few
| things to know at present about "what you can run locally":
|
| 1. Most models (of the architectures people care about today)
| will _probably_ fit on a GPU which has something like 1.5x the
| VRAM of the model's parameter-weights size. So e.g. a "7B" (7
| billion parameter-weights) model will fit on a GPU that has
| 12GB of VRAM; see the tiny worked example after this list.
| (You can potentially squeeze even tighter if you have a
| machine with integrated graphics + dedicated GPU, and you're
| using the integrated graphics as graphics, leaving the GPU's
| VRAM free to _only_ hold the model.)
|
| 2. There are models that come in all sorts of sizes. Many open-
| source ML models are huge (70B, 120B, 144B -- things you'd need
| datacenter-class GPUs to run), but then versions of these same
| models get released which have been heavily cut down (pruned
| and/or quantized), to force them to fit into smaller VRAM
| sizes. There are 5B, 3B, 1B, even 0.5B models (although the
| last two are usually special-purpose models.)
|
| 3. Surprisingly, depending on your use-case, smaller models (or
| small quants of larger models) can "mostly" work perfectly
| well! They just have more edge-cases where something will send
| them off the rails spiralling into nonsense -- so they're less
| _reliable_ than their larger cousins. You might have to give
| them more prompting, and try regenerating their output from the
| same prompt several times, to get good results.
|
| 4. Apple Silicon Macs have a GPU and TPU that read from/write
| to the same unified memory that the CPU does. While this makes
| these devices _slower_ for inference than "real" GPUs with
| dedicated VRAM, it means that if you happen to own a Mac with
| 16GB of RAM, then you own something that can run 7B models. AS
| Macs are, oddly enough, the "cheapest" things you can buy in
| terms of model-capacity-per-dollar. (Unlike a "real" GPU, they
| won't be especially _quick_ and won't have any capacity for
| _concurrent_ model inference, so you'd never use one as a
| server backing an Inference-as-a-Service business. But for
| home use? No real downsides.)
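|
| A tiny worked example of point 1 (an assumption-laden sketch:
| it treats the downloaded weights file size as a proxy for the
| in-VRAM weights size):
|
| WEIGHTS_GB=8   # e.g. a 7B model stored at roughly a byte per weight
| echo "budget about $(( WEIGHTS_GB * 3 / 2 )) GB of VRAM"   # ~12 GB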
| nabakin wrote:
| Here's a summary of what's happened the past couple of years
| and what tools are out there.
|
| After ChatGPT was released, there was a lot of hype in the
| space but open source was far behind. IIRC the best open
| foundation LLM that existed was GPT-2, but it was two
| generations behind.
|
| A while later Meta released LLaMA[1], a well-trained base
| foundation model, which brought an explosion to open source.
| It was soon implemented in the Hugging Face Transformers
| library[2] and the weights were spread across the Hugging Face
| website for anyone to use.
|
| At first, it was difficult to run locally. Few developers had
| the hardware or money to run it. It required too much RAM and
| IIRC Meta's original implementation didn't support running on
| the CPU, but developers soon came up with methods to make it
| smaller via quantization. The biggest project for this was
| Llama.cpp[3], which is probably still the biggest open source
| project today for running LLMs locally. Hugging Face
| Transformers also added quantization support through
| bitsandbytes[4].
|
| Over the next months there was rapid development in open
| source. Quantization techniques improved, which meant LLaMA
| was able to run with less and less RAM, with greater and
| greater accuracy, on more and more systems. Tools came out
| that were capable of finetuning LLaMA, and hundreds of LLaMA
| finetunes followed, trained on instruction-following, RLHF,
| and chat datasets, which drastically increased accuracy even
| further. During this time, Stanford's Alpaca, LMSYS's Vicuna,
| Microsoft's Wizard, 01.ai's Yi, Mistral, and a few others made
| their way onto the open LLM scene with some very good LLaMA
| finetunes.
|
| A new inference engine (software for running LLMs like
| Llama.cpp, Transformers, etc) called vLLM[5] came out which was
| capable of running LLMs in a more efficient way than was
| previously possible in open source. Soon it would even get good
| AMD support, making it possible for those with AMD GPUs to run
| open LLMs locally and with relative efficiency.
|
| Then Meta released Llama 2[6]. Llama 2 was by far the best
| open LLM for its time. It was released with RLHF instruction
| finetunes for chat and with human evaluation data that put its
| open-LLM leadership beyond doubt. Existing tools like
| Llama.cpp and Hugging Face Transformers quickly added support,
| and users had access to the best LLM open source had to offer.
|
| At this point in time, despite all the advancements, it was
| still difficult to run LLMs. Llama.cpp and Transformers were
| great engines for running LLMs but the setup process was
| difficult and required a lot of time. You had to find the best
| LLM, quantize it in the best way for your computer (or figure
| out how to identify and download one from Hugging Face), set
| up whatever engine you wanted, figure out how to use your
| quantized LLM with the engine, fix any bugs you made along the
| way, and finally figure out how to prompt your specific LLM in
| a chat-like format.
|
| However, tools started coming out to make this process
| significantly easier. The first one of these that I remember
| was GPT4All[7]. GPT4All was a wrapper around Llama.cpp which
| made it easy to install, easy to select the LLM that you want
| (pre-quantized options for easy download from a download
| manager), and a chat UI which made LLMs easy to use. This
| significantly reduced the barrier to entry for those who were
| interested in using LLMs.
|
| The second project that I remember was Ollama[8]. Also a
| wrapper around Llama.cpp, Ollama gave most of what GPT4All had
| to offer but in an even simpler way. Today, I believe Ollama is
| bigger than GPT4All although I think it's missing some of the
| higher-level features of GPT4All.
|
| Another important tool that came out during this time is called
| Exllama[9]. Exllama is an inference engine with a focus on
| modern consumer Nvidia GPUs and advanced quantization support
| based on GPTQ. It is probably the best inference engine for
| squeezing performance out of consumer Nvidia GPUs.
|
| Months later, Nvidia came out with another new inference engine
| called TensorRT-LLM[10]. TensorRT-LLM is capable of running
| most LLMs and does so with extreme efficiency. It is the most
| efficient open source inferencing engine that exists for Nvidia
| GPUs. However, it also has the most difficult setup process of
| any inference engine and is made primarily for production use
| cases and Nvidia AI GPUs so don't expect it to work on your
| personal computer.
|
| With the rumors of GPT-4 being a Mixture of Experts LLM,
| research breakthroughs in MoE, and some small MoE LLMs coming
| out, interest in MoE LLMs was at an all-time high. The company
| Mistral, which had proven itself in the past with very
| impressive LLaMA finetunes, capitalized on this interest by
| releasing Mixtral 8x7b[11], the best accuracy-for-its-size LLM
| that the local LLM community had seen to date. Eventually MoE
| support was added to all inference engines, and it became a
| very popular mid-to-large-sized LLM.
|
| Cohere released their own LLM as well, called Command R+[12],
| built specifically for RAG-related tasks with a context length
| of 128k. It's quite large and doesn't have notable performance
| on many metrics, but it has some interesting RAG features no
| other LLM has.
|
| More recently, Llama 3[13] was released, which, similar to
| previous Llama releases, blew every other open LLM out of the
| water. The smallest version of Llama 3 (Llama 3 8b) has the
| greatest accuracy for its size of any open LLM, and the
| largest version of Llama 3 released so far (Llama 3 70b) beats
| every other open LLM on almost every metric.
|
| Less than a month ago, Google released Gemma 2[14], the
| largest of which performs very well under human evaluation
| despite being less than half the size of Llama 3 70b, but only
| decently on automated benchmarks.
|
| If you're looking for a tool to get started running LLMs
| locally, I'd go with either Ollama or GPT4All. They make the
| process about as painless as possible. I believe GPT4All has
| more features like using your local documents for RAG, but you
| can also use something like Open WebUI[15] with Ollama to get
| the same functionality.
|
| If you want to get into the weeds a bit and extract some more
| performance out of your machine, I'd go with using Llama.cpp,
| Exllama, or vLLM depending upon your system. If you have a
| normal, consumer Nvidia GPU, I'd go with Exllama. If you have
| an AMD GPU that supports ROCm 5.7 or 6.0, I'd go with vLLM. For
| anything else, including just running it on your CPU or
| M-series Mac, I'd go with Llama.cpp. TensorRT-LLM only makes
| sense if you have an AI Nvidia GPU like the A100, V100, A10,
| H100, etc.
|
| [1] https://ai.meta.com/blog/large-language-model-llama-meta-ai/
|
| [2] https://github.com/huggingface/transformers
|
| [3] https://github.com/ggerganov/llama.cpp
|
| [4] https://github.com/bitsandbytes-foundation/bitsandbytes
|
| [5] https://github.com/vllm-project/vllm
|
| [6] https://ai.meta.com/blog/llama-2/
|
| [7] https://www.nomic.ai/gpt4all
|
| [8] http://ollama.ai/
|
| [9] https://github.com/turboderp/exllamav2
|
| [10] https://github.com/NVIDIA/TensorRT-LLM
|
| [11] https://mistral.ai/news/mixtral-of-experts/
|
| [12] https://cohere.com/blog/command-r-plus-microsoft-azure
|
| [13] https://ai.meta.com/blog/meta-llama-3/
|
| [14] https://blog.google/technology/developers/google-gemma-2/
|
| [15] https://github.com/open-webui/open-webui
| holoduke wrote:
| Great info. Do you also know the state of the code
| assistants? Any thoughts on Copilot versus others?
| nabakin wrote:
| I've been following the state of things, but I'm not sure
| which ones are the best. There's Meta's CodeLlama[1],
| Mistral's Codestral[2], DeepSeek AI's DeepSeek-
| Coder-V2-Instruct[3], CodeGemma[4], Alibaba's CodeQwen[5],
| and Microsoft's WizardCoder[6].
|
| I'm pretty sure CodeLlama is out of date now. I've heard
| DeepSeek LLMs are good and DeepSeek-Coder-V2-Instruct was
| released recently. With its good reputation and massive size
| (236b), I'd guess it is the best coding LLM, but if it's not
| being trained efficiently, maybe Codestral and Codestral
| Mamba come close.
|
| I don't think the best coding LLMs are close to GitHub
| Copilot but I could be wrong since I'm just relaying
| information that I've heard secondhand.
|
| [1] https://ai.meta.com/blog/code-llama-large-language-
| model-cod...
|
| [2] https://mistral.ai/news/codestral/
|
| [3] https://github.com/deepseek-ai/DeepSeek-Coder-V2
|
| [4] https://developers.googleblog.com/en/gemma-family-
| expands-wi...
|
| [5] https://qwenlm.github.io/blog/codeqwen1.5/
|
| [6] https://github.com/nlpxucan/WizardLM
| hobofan wrote:
| All the main IDE-integrated ones seem very much on par
| (Copilot, Sourcegraph Cody, Continue.dev), with cursor.sh
| liked by some as it has a code-assistant-first UI.
|
| I've personally gone back to the browser with Claude 3.5
| Sonnet (and the projects + artifacts feature), as it is one
| of the most industrious ones, and I really like the UX of
| artifacts + it integrates new code well into existing code
| you paste into it.
|
| In the end I think it also often comes down to what
| languages/frameworks you are using and how well the
| LLM/product handles it, so I'd still recommend testing
| around. E.g. some of the main frameworks I'm working with
| on a daily basis went through big refactors/interface
| changes 1-2 years ago, and I stopped using ChatGPT because
| it had a strong tendency to produce code based on the old
| interfaces/paradigms.
|
| Aider[0] is also quite interesting, especially when it comes
| to more significant refactorings in the codebase; it has
| gotten quite good at that over the last few bigger model
| releases, but it takes some time to get used to and doesn't
| have good IDE integration.
|
| [0]: https://github.com/paul-gauthier/aider
| iAmAPencilYo wrote:
| Thank you! Very helpful as a newbie coming in.
| psychoslave wrote:
| This is one of the most useful and informative comments I
| have ever come across on HN. Thank you very much.
| bhouston wrote:
| What are the steps required to get this running in VS Code?
|
| If they had linked to the instructions in their post (or,
| better yet, to a one-click install of a VS Code extension), it
| would help a lot with adoption.
|
| (BTW, I consider it malpractice that they are at the top of
| Hacker News with a model that is of great interest to a large
| portion of the users here, and they do not have a monetizable
| call to action on the featured page.)
| leourbina wrote:
| If you can run this using ollama, then you should be able to
| use https://www.continue.dev/ with both IntelliJ and VSCode.
| Haven't tried this model yet - but overall this plugin works
| well.
| scosman wrote:
| They say no llama.cpp support yet, so no ollama yet (which
| uses llama.cpp)
| sadeshmukh wrote:
| Ollama is supported:
| https://docs.continue.dev/setup/select-provider
| trsohmers wrote:
| They meant that there is no support for Codestral Mamba in
| llama.cpp yet.
| HanClinto wrote:
| Correct. The only back-end that Ollama uses is llama.cpp,
| and llama.cpp does not yet have Mamba2 support. The issues
| to track Mamba2 and Codestral Mamba support are here:
|
| https://github.com/ggerganov/llama.cpp/issues/8519
|
| https://github.com/ggerganov/llama.cpp/issues/7727
|
| Mamba support was added in March of this year:
|
| https://github.com/ggerganov/llama.cpp/pull/5328
|
| I have not yet seen a PR to address Mamba2.
| osmano807 wrote:
| Unrelated: all my devices freeze when accessing this page
| (desktop Firefox and Chrome, mobile Firefox and Brave). Is
| this the best alternative for code AI helpers besides GitHub
| Copilot and Google Gemini in VS Code?
| raphaelj wrote:
| I've been using it for a few months (with Starcoder 2 for
| code, and GPT-4o for chat). I find the code completion
| actually better than Github Copilot.
|
| My main complaint is that the chat sometimes fails to
| correctly render some GPT-4o output (e.g. LaTeX
| expressions), but it's mostly fixed with a custom system
| prompt. It also significantly reduces the battery life of
| my MacBook M1, but that's expected.
| oliverulerich wrote:
| I'm quite happy with Cody from Sourcegraph https://marketpl
| ace.visualstudio.com/items?itemName=sourcegr...
| sleepytimetea wrote:
| Looking through the Quickstart docs, they have an API that can
| generate code. However, I don't think they have a way to do
| "Day 2" code editing.
|
| Also, it doesn't seem to have a freemium tier... do you need
| to start paying even before trying it out?
|
| "Our API is currently available through La Plateforme. You need
| to activate payments on your account to enable your API keys."
| sv123 wrote:
| I signed up when codestral was first available and put my
| payment details in. Been using it daily since then with
| continue.dev, but my usage dashboard shows 0 tokens, and so
| far I have not been billed for anything... Definitely not clear
| anywhere, but it seems to be free for now? Or some sort of
| free limit that I am not hitting.
| sunaookami wrote:
| Through codestral.mistral.ai? It's free until August 1st:
| https://docs.mistral.ai/capabilities/code_generation/
|
| >Monthly subscription based, free until 1st of August
| refulgentis wrote:
| "All you need is users" doesn't seem optimal IMHO, Stability.ai
| providing an object lesson in that.
|
| They just released weights, and, being a for-profit, they
| need to optimize for making money, not eyeballs. It seems wise
| to guide people to the API offering.
| bhouston wrote:
| On top of Hacker News (the target demographic for coders)
| without an effective monetizable call to action? What a
| missed opportunity.
|
| GitHub Copilot makes $100M+/year, if not way, way more.
|
| Having a VS Code extension for Mistral would be a revenue
| stream if it were one-click and better or cheaper than GitHub
| Copilot. It is malpractice in my mind to not be doing this if
| you are investing in creating coding models.
| refulgentis wrote:
| I see, that makes sense: make an extension and charge for
| it.
|
| I assumed they meant free x local. It doesn't seem rational
| to make this one paid: it's significantly smaller than their
| better model, and even more so than Copilot's.
| passion__desire wrote:
| But they also signal competence in the space, which means
| M&A. Or big nation-states might in the future hire them to
| produce country models once the space matures, as was Emad's
| vision.
| refulgentis wrote:
| Did Emad's vision end up manifesting? E.g. did a nation-state
| end up paying Stability for a country model?
|
| Would it help signal competency? They're a small team
| focused on making models, not VS Code extensions.
|
| Would they do M&A? The founding team is ex-Googlers and has
| attracted significant attention in the MBA world by being an
| EU champion.
| monkeydust wrote:
| Any recommended primers on Mamba vs Transformers - pros/cons,
| etc.?
| ertgbnm wrote:
| https://www.youtube.com/watch?v=X5F2X4tF9iM
|
| This is what introduced me to them. May be a bit outdated at
| this point.
| bhouston wrote:
| This video is good:
| https://www.youtube.com/watch?v=N6Piou4oYx8. As are the other
| videos on the same YouTube account.
| red2awn wrote:
| A very good primer to state-space models (on which Mamba is
| based) is The Annotated S4 [1]. If you want to dive into the
| code I wrote a minimal single-file implementation of Mamba-2
| here [2].
|
| [1]: https://srush.github.io/annotated-s4/
|
| [2]: https://github.com/tommyip/mamba2-minimal
| croemer wrote:
| The first sentence is wrong. The website says:
|
| > As a tribute to Cleopatra, whose glorious destiny ended in
| tragic snake circumstances
|
| but according to Wikipedia this is not true:
|
| > When Cleopatra learned that Octavian planned to bring her to
| his Roman triumphal procession, she killed herself by poisoning,
| contrary to the popular belief that she was bitten by an asp.
| rjurney wrote:
| I believe this is in dispute among sources.
| skybrian wrote:
| Yes, that seems to be a myth, but exact circumstances seem
| rather uncertain according to the Wikipedia article [1]:
|
| > [A]ccording to the Roman-era writers Strabo, Plutarch, and
| Cassius Dio, Cleopatra poisoned herself using either a toxic
| ointment or by introducing the poison with a sharp implement
| such as a hairpin. Modern scholars debate the validity of
| ancient reports involving snakebites as the cause of death and
| whether she was murdered. Some academics hypothesize that her
| Roman political rival Octavian forced her to kill herself in a
| manner of her choosing. The location of Cleopatra's tomb is
| unknown. It was recorded that Octavian allowed for her and her
| husband, the Roman politician and general Mark Antony, who
| stabbed himself with a sword, to be buried together properly.
|
| I think this rounds to "nobody really knows."
|
| The "glorious destiny" seems kind of shaky, too. It's just a
| throwaway line anyway.
|
| [1] https://en.m.wikipedia.org/wiki/Death_of_Cleopatra
| ljsprague wrote:
| What bothers me more is that the legend is that she was killed
| by an asp, not a mamba.
| dghlsakjg wrote:
| Maybe Octavian was the snake?
| rjurney wrote:
| But I JUST switched from GPT4o to Claude! :( Kidding, but it
| isn't clear how to use this thing, as others have pointed out.
| ukuina wrote:
| What made you switch?
| pelagicAustral wrote:
| I'm using both, been doing that for months now. I can
| confidently assert that while Claude is getting better and
| better, GPT 4 and 4o seem the be getting dumbed down for some
| unexplained reason. Claude is now my go-to for anything code.
| (I do Ruby and C#, btw, other might have a different
| experience)
| marcyb5st wrote:
| I guess they are distilling the models so that they can
| save $$$ on serving.
| ldjkfkdsjnv wrote:
| GPT4o is way behind sonnet 3.5
| mountainriver wrote:
| Huh I guess all the benchmarks are wrong then
| causal wrote:
| Agreed.
| rjurney wrote:
| Claude is much better. Overwhelmingly better. It not only
| implements deep learning models for me, it has great
| suggestions on evolving them to actually work.
| mountainriver wrote:
| lol no it's not, the benchmarks don't show that at all.
| Both have issues in different ways
| causal wrote:
| Benchmarks are pretty flawed IMO, in particular their
| weakness here seems to be that they are poor at
| evaluating long-tail multiturn conversations. 4o often
| gives a great first response, then spirals into
| repetition. Sonnet 3.5 is much better at seeing the big
| picture in a longer conversation IMO.
| orbital-decay wrote:
| Repetition in multiturn conversations is actually
| Sonnet's fatal flaw, both 3 and 3.5. 4o is also
| repetitive to an extent. Opus is _way_ better than both
| at being non-repetitive.
| stavros wrote:
| I made a mobile app the other day using LLMs (I had never
| used React or TypeScript before, and I built an app with
| React Native). I was pretty disappointed: both Sonnet 3.5
| and gpt-4-turbo performed pretty poorly, making mistakes
| like missing a closing bracket somewhere, which meant I
| had to revert because I had no idea where they meant to
| put it.
|
| Also they did the thing that junior developers tend to
| do, where you have a race condition of some sort, and
| they just work around it by adding some if checks. The
| app is at around 400 lines right now, it works but feels
| pretty brittle. Adding a tiny feature here or there
| breaks something else, and GPT does the wrong thing half
| the time.
|
| All in all, I'm not complaining, because I made an app in
| two days, but it won't replace a developer yet, no matter
| how much I want it to.
| throwup238 wrote:
| Claude Projects, which allow attaching a bunch of files to
| fill up the 200k context. I wrote up a script to dump a bunch
| of code and documentation files to markdown as context and I
| add them to a bunch of Claude projects on a per topic basis.
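|
| Such a dump script can be as simple as this rough sketch
| (assuming a Rust project under src/ and a project-context.md
| output name, both placeholders):
|
| find src -name '*.rs' | sort | while read -r f; do
|     printf '\n## %s\n\n```rust\n' "$f"
|     cat "$f"
|     printf '```\n'
| done > project-context.md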
|
| For example, I'm currently working on a Rust/Qt desktop app
| so I have a project with the whole Qt6 book attached to ask
| questions about Qt, a project with my SQL schema and
| ORM/Sqlite docs to ask questions about the app's data and
| generate models without dealing with hallucinations, a
| project with all my QML files and Rust QML element code, a
| project with a bunch of Rust crate docs, and so on and on.
|
| GPTs allow attaching files too but Claude Projects dump the
| entire contents of the files into the context rather than
| trying to do some hacky RAG that never works like I want it
| to.
| funnygiraffe wrote:
| I was under the impression that with LLMs, in order to get
| high-quality answers, it's always best to keep context
| short. Is that not the case anymore? Does Claude under this
| usage paradigm not struggle with very long contexts in the
| ways described, for example, in the "lost in the middle" paper
| (https://arxiv.org/abs/2307.03172)?
| throwup238 wrote:
| I don't have the time to evaluate the effects of context
| length on my use cases so I have no idea. There might be
| some degradation when I attach the Qt book which is
| probably already in Claude's training data but when using
| it against my private code base, it's not like I have any
| other choice.
|
| The UX of drag and dropping a few monolithic markdown
| files to include entire chunks of a large project
| outweighs the downsides of including irrelevant context
| in my experience.
| inciampati wrote:
| No, you need to provide as much information in context as
| possible. Otherwise you are sampling from the mode.
| "Write me an essay about cows" = boring garbage, probably
| 200 words. "Here are twenty papers about cow evolution, write
| me an overview of findings" = yes.
| azeirah wrote:
| The conclusion you walked away with is the opposite of
| what usually works in practice.
|
| The more context you give the llm, the better.
|
| The key takeaway from that paper is to keep your
| instructions/questions/direction in the beginning or at
| the end of the context. Any information can go anywhere.
|
| Not to be too dismissive (it's a good paper), but we're a
| year further on, and in practice this issue seems to have
| been tackled by training on better data.
|
| This can differ a lot depending on what model you're
| using, but in the case of claude sonnet 3.5, more
| relevant context is generally better for anything except
| for speed.
|
| It does remain true that you need to keep your most
| important instructions at the beginning or at the end
| however.
| magnio wrote:
| They announce the model is on HuggingFace but don't link to it.
| Here it is: https://huggingface.co/mistralai/mamba-codestral-7B-v0.1
| dvfjsdhgfv wrote:
| The link is already there in the text, they probably just fixed
| it.
| imjonse wrote:
| The MBPP column should bold DeepSeek as it has a better score
| than Codestral.
| smith7018 wrote:
| Which means Codestral Mamba and DeepSeek both lead four
| benchmarks. Kinda takes the air out of the announcement a bit.
| causal wrote:
| It should be corrected but the interesting aspect of this
| release is the architecture. To stay competitive while only
| needing linear inference time and supporting 256k context is
| pretty neat.
| mbowcut2 wrote:
| THIS. People don't realize the importance of Mamba
| competing on par with transformers.
| ed wrote:
| They're in roughly the same class but totally different
| architectures
|
| Deepseek uses a 4k sliding window compared to Codestral
| Mamba's 256k+ tokens
| localfirst wrote:
| Any sort of evals on how it compares to closed models like
| GPT-4 or open ones like WizardLM?
| pzo wrote:
| Weird that they compare to deepseek-coder v1.5 when we
| already have v2.0. Any advantage to using Codestral Mamba
| apart from it being lighter in weights?
| kz919 wrote:
| Obviously because they can't beat it... There will be zero
| reason to use it when you have better transformer-based
| models that fit the existing infrastructure.
| sam_goldman_ wrote:
| You can try this model out using OpenAI's API format with this
| TypeScript SDK: https://github.com/token-js/token.js
|
| You just need a Mistral API key: https://console.mistral.ai/api-
| keys/
| Kinrany wrote:
| Is there a good explanation of the Mamba architecture?
| simonw wrote:
| There's a paper: https://arxiv.org/abs/2312.00752
|
| I haven't seen any good non-paper explainers yet.
| alecco wrote:
| https://thegradient.pub/mamba-explained/
|
| https://jackcook.com/2024/02/23/mamba.html
|
| https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html
| thot_experiment wrote:
| Does anyone have a favorite FIM-capable model? I've been
| using codellama-13b through ollama w/ a vim extension I wrote
| and it's okay but not amazing. I definitely get better code
| most of the time out of Gemma-27b, but no FIM (and for some
| reason codellama-34b has broken inference for me).
| taf2 wrote:
| How does this work in vim?
| flakiness wrote:
| So Mamba is supposed to be faster, and the article claims as
| much, but they don't give any latency numbers.
|
| Has anyone tried this? And if so, is it fast(er)?
___________________________________________________________________
(page generated 2024-07-16 23:00 UTC)