[HN Gopher] Llama.vim - Local LLM-assisted text completion
       ___________________________________________________________________
        
       Llama.vim - Local LLM-assisted text completion
        
       Author : kgwgk
       Score  : 252 points
       Date   : 2025-01-23 18:06 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dingnuts wrote:
       | Is anyone actually getting value out of these models? I wired one
       | up to Emacs and the local models all produce a huge volume of
       | garbage output.
       | 
       | Occasionally I find a hosted LLM useful but I haven't found any
       | output from the models I can run in Ollama on my gaming PC to be
       | useful.
       | 
       | It's all plausible-looking but incorrect. I feel like I'm taking
       | crazy pills when I read about others' experiences. Surely I am
       | not alone?
        
         | remexre wrote:
         | I work on compilers. A friend of mine works on webapps. I've
         | seen Cursor give him lots of useful code, but it's never been
         | particularly useful on any of the code of mine that I've tried
         | it on.
         | 
         | It seems very logical to me that there'd be orders of magnitude
         | more training data for some domains than others, and that
         | existing models' skill is not evenly distributed cross-domain.
        
           | dkga wrote:
            | This. Also across languages: I suppose there is a lot more
            | content in Python and JavaScript than in AppleScript, for
            | example. (And to be fair, not a lot of the Python
            | suggestions I receive are actually mind-blowingly good.)
        
           | q0uaur wrote:
            | I'm still patiently waiting for an easy way to point a
            | model at some documentation and make it actually use it.
           | 
            | My use case is GDScript for Godot games, and all the
            | models I've tried so far use Godot 2 stuff that's just not
            | around anymore. Even if you tell them to use Godot 4, they
            | give way too much wrong output to be useful.
           | 
            | I wish I could just point one at the latest Godot docs and
            | have it give up-to-date answers. But seeing as that's
            | still not a thing, I guess it's more complicated than I
            | expect.
        
             | doctoboggan wrote:
              | It's definitely a thing already. Look up "RAG"
              | (Retrieval Augmented Generation). Most of the popular
              | closed-source companies offer RAG services via their
              | APIs, and you can also do it with local LLMs using
              | open-webui and probably many other local UIs.
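              | 
              | The retrieval step itself is simple enough to sketch.
              | Here's a toy, dependency-free Python version (bag-of-
              | words cosine similarity standing in for a real embedding
              | model, and the doc snippets are made up):
              | 
              |   from collections import Counter
              |   import math
              | 
              |   docs = [
              |       "Node2D.position moves a node in Godot 4.",
              |       "get_node() fetches a child node by path.",
              |   ]
              | 
              |   def vec(text):
              |       return Counter(text.lower().split())
              | 
              |   def cosine(a, b):
              |       dot = sum(a[t] * b[t] for t in a)
              |       na = math.sqrt(sum(v * v for v in a.values()))
              |       nb = math.sqrt(sum(v * v for v in b.values()))
              |       return dot / (na * nb) if na and nb else 0.0
              | 
              |   def retrieve(query, k=1):
              |       q = vec(query)
              |       ranked = sorted(docs,
              |                       key=lambda d: cosine(q, vec(d)),
              |                       reverse=True)
              |       return ranked[:k]
              | 
              |   question = "How do I move a node in Godot 4?"
              |   context = "\n".join(retrieve(question))
              |   # The augmented prompt is what actually goes to the LLM.
              |   prompt = (f"Use only this documentation:\n{context}\n\n"
              |             f"Question: {question}")
              |   print(prompt)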
        
             | mohsen1 wrote:
             | Cursor can follow links
        
             | psytrx wrote:
             | There's llms.txt [0], but it's not gaining much popularity.
             | 
              | My web framework of choice provides these [1], but they
              | can't be injected into the LLM context without a fair
              | bit of fuss. It would be a game changer if more LLM
              | tools implemented them.
             | 
             | [0] https://llmstxt.org/ [1] https://svelte.dev/docs/llms
        
         | codingdave wrote:
         | Yep - I don't get a ton of value out of autocompletions, but I
         | get decent value from asking an LLM how they would approach
         | more complex functions or features. I rarely get code back that
         | I can copy/paste, but reading their output is something I can
         | react to - whether it is good or bad, just having a starting
         | point speeds up the design of new features vs. me burning time
         | creating my first/worst draft. And that is the goal here, isn't
         | it? To get some productivity gains?
         | 
         | So maybe it is just a difference in perspective? Even incorrect
         | code and bad ideas can still be helpful. It is only useless if
         | you expect them to hand you working code.
        
         | coder543 wrote:
         | > I wired one up
         | 
         | "One"? Wired up _how_? There is a huge difference between the
         | best and worst. They aren't fungible. _Which_ one? How long
         | ago? Did it even support FIM (fill in middle), or was it
         | blindly guessing from the left side? Did the plugin even gather
         | appropriate context from related files, or was it only looking
         | at the current file?
         | 
         | If you try Copilot or Cursor today, you can experience what
         | "the best" looks like, which gives you a benchmark to measure
         | smaller, dumber models and plugins against. No, Copilot and
         | Cursor are not available for emacs, as far as I know... but if
         | you want to understand if a technology is useful, you don't
         | start with the worst version and judge from that. (Not saying
          | emacs itself is the worst... just that _without more
          | context_, my assumption is that whatever plugin you
          | encountered was probably using a bottom-tier model, and I
          | doubt the plugin itself was helping that model do its best.)
         | 
         | There are some local code completion models that I think are
         | perfectly fine, but I don't know where you will draw the line
         | on how good is good enough. If you can prove to yourself that
         | the best models are good enough, then you can try out different
         | local models and see if one of those works for you.
        
           | yoyohello13 wrote:
           | https://github.com/copilot-emacs/copilot.el
        
           | Lanedo wrote:
           | There is https://github.com/copilot-emacs/copilot.el that
           | gets copilot to work in emacs via JS glue code and binaries
           | provided by copilot.vim.
           | 
           | I hacked up a slim alternative localpilot.js layer that uses
           | llama-server instead of the copilot API, so copilot.el can be
           | used with local LLMs, but I find the copilot.el overlays
           | kinda buggy... It'd probably be better to instead write a
           | llamapilot.el for local LLMs from scratch for emacs.
        
           | b5n wrote:
            | Emacs has had multiple LLM integration packages available
            | for quite a while (relative to the rise of LLMs). `gptel`
            | supports multiple providers including Anthropic, OpenAI,
            | Ollama, etc.
           | 
           | https://github.com/karthink/gptel
        
         | fovc wrote:
         | > I feel like I'm taking crazy pills when I read about others'
         | experiences. Surely I am not alone?
         | 
         | You're not alone :-) I asked a very similar question about a
         | month ago: https://news.ycombinator.com/item?id=42552653 and
         | have continued researching since.
         | 
          | My takeaway was that autocomplete, boilerplate, and one-off
          | scripts are the main use cases. To use an analogy, I think
          | the code assistants are more like an upgrade from handsaw to
          | power tools and less like hiring a carpenter. (Which is not
          | what the hype engine will claim.)
         | 
         | For me, only the one-off script (write-only code) use-case is
         | useful. I've had the best results on this with Claude.
         | 
          | Emacs abbrevs/snippets (+ choice of language) virtually
          | eliminate the boilerplate problem, so I don't have a use for
          | assistants there.
         | 
          | For autocomplete, I find that LSP completion engines provide
          | 95% of the value for 1% of the latency. Physically typing
          | the code is a small % of my time/energy, so the value is
          | more about getting the right names, argument order, and
          | other fiddly details I may not remember exactly. But I find
          | that LSP-powered autocomplete and tooltips largely solve
          | those challenges.
        
           | barrell wrote:
           | I think you make a very good point about your existing
           | devenv. I recently turned off GitHub copilot after maybe 2
           | years of use -- I didn't realize how often I was using its
           | completions over LSPs.
           | 
           | Quality of Life went up massively. LSPs and nvim-cmp have
           | come a long way (although one of these days I'll try
           | blink.cmp)
        
         | tomr75 wrote:
         | try cursor
        
         | sangnoir wrote:
         | > Is anyone actually getting value out of these models?
         | 
         | I've found incredible value in having LLMs help me write unit
         | tests. The quality of the test code is far from perfect, but AI
         | tooling - Claude Sonnet specifically - is good at coming up
         | with reasonable unit test cases after I've written the code
         | under test (sue me, TDD zealots). I probably have to fix 30% of
         | the tests and expand the test cases, but I'd say it cuts the
          | number of test code lines I author by more than 80%. This has
         | decreased the friction so much, I've added Continuous
         | Integration to small, years-old personal projects that had no
         | tests before.
         | 
         | I've found lesser value with refactoring and adding code docs,
         | but that's more of autocomplete++ using natural language rather
         | than AST-derived code.
        
         | righthand wrote:
          | Honestly, I just disabled my TabNine plugin and have found
          | that the LSP server is good enough for 99% of what I do. I
          | really don't need hypothetical output suggested to me. I'm
          | comfortable reading docs, though, so others may feel
          | different.
        
       | jerpint wrote:
        | It's funny because I actually use vim mostly when I don't want
        | LLM-assisted code. Sometimes it just gets in the way.
       | 
       | If I do, I load up cursor with vim bindings.
        
         | renewiltord wrote:
          | Funny. The most common use I have is at the command line,
          | writing commands with vim set as my EDITOR, so the AI
          | completion really helps.
         | 
         | This will help for offline support (on planes and such).
        
           | qup wrote:
              | Can you talk more specifically about how you use this,
              | maybe with a small example?
        
             | renewiltord wrote:
              | Yes, I mentioned it here first and haven't changed it
              | since (the Twitter link and the links in it include a
              | video of it in use):
             | 
             | https://news.ycombinator.com/item?id=34769611
             | 
              | Which leads (used to lead?) here:
              | https://wiki.roshangeorge.dev/index.php/AI_Completion_In_The...
        
             | VMG wrote:
             | - composing a commit message
             | 
             | - anything bash script related
        
       | ggerganov wrote:
       | Hi HN, happy to see this here!
       | 
        | I highly recommend taking a look at the technical details of
        | the server implementation that enables large context usage
        | with this plugin - I think it is interesting and has some cool
        | ideas [0].
       | 
       | Also, the same plugin is available for VS Code [1].
       | 
       | Let me know if you have any questions about the plugin - happy to
       | explain. Btw, the performance has improved compared to what is
       | seen in the README videos thanks to client-side caching.
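        | 
        | For reference, the plugin talks to llama-server's /infill
        | endpoint. A rough Python sketch of such a request (simplified
        | - not the exact payload the plugin builds, and the accepted
        | fields can differ between server versions):
        | 
        |   import json, urllib.request
        | 
        |   payload = {
        |       "input_prefix": "def fib(n):\n    ",    # before cursor
        |       "input_suffix": "\n\nprint(fib(10))\n", # after cursor
        |       "n_predict": 64,                        # cap on new tokens
        |   }
        |   req = urllib.request.Request(
        |       "http://127.0.0.1:8012/infill",  # port from the README examples
        |       data=json.dumps(payload).encode(),
        |       headers={"Content-Type": "application/json"},
        |   )
        |   with urllib.request.urlopen(req) as resp:
        |       # the suggestion comes back in the "content" field
        |       print(json.loads(resp.read())["content"])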
       | 
       | [0] - https://github.com/ggerganov/llama.cpp/pull/9787
       | 
       | [1] - https://github.com/ggml-org/llama.vscode
        
         | jerpint wrote:
         | Thank you for all of your incredible contributions!
        
         | amrrs wrote:
            | For those who don't know, he is the "gg" of `gguf`. Thank
            | you for all your contributions! Literally the core of
            | Ollama, LM Studio, Jan, and multiple other apps!
        
           | sergiotapia wrote:
           | well hot damn! killing it!
        
           | halyconWays wrote:
           | But is he the jt, the developer who reduced memory
           | utilization by 50%?
        
             | madeforhnyo wrote:
             | Someone did? Could you pls share a link?
        
             | kamranjon wrote:
              | They collaborate together! Her name is Justine Tunney -
              | she took her "execute everywhere" work with Cosmopolitan
              | to make Llamafile using the llama.cpp work that Georgi
              | has done.
        
         | nancyp wrote:
            | TIL: Vim has its own language. Thanks Georgi for llama.cpp!
        
           | nacs wrote:
           | Vim is incredibly extensible.
           | 
            | You can use C or Vimscript, but programs like Neovim
            | support Lua as well, which makes it really easy to make
            | plugins.
        
         | liuliu wrote:
         | KV cache shifting is interesting!
         | 
          | Just curious: how much of your code nowadays is completed by
          | an LLM?
        
           | ggerganov wrote:
           | Yes, I think it is surprising that it works.
           | 
           | I think a fairly large amount, though can't give a good
           | number. I have been using Github Copilot from the very early
           | days and with the release of Qwen Coder last year have fully
           | switched to using local completions. I don't use the chat
           | workflow to code though, only FIM.
        
             | gloflo wrote:
             | What is FIM?
        
               | jjnoakes wrote:
               | Fill-in-the-middle. If your cursor is in the middle of a
               | file instead of at the end, then the LLM will consider
               | text after the cursor in addition to the text before the
                | cursor. Some LLMs can only look before the cursor; for
                | coding, ones that can FIM work better (for me at
                | least).
        
               | rav wrote:
               | FIM is "fill in middle", i.e. completion in a text editor
               | using context on both sides of the cursor.
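                | 
                | Concretely, FIM-capable models are trained with
                | special sentinel tokens and the editor stitches the
                | prompt together around them. A tiny sketch (token
                | names vary per model; these are the ones Qwen2.5-Coder
                | uses, if I remember right):
                | 
                |   prefix = "def add(a, b):\n    return "  # before cursor
                |   suffix = "\n\nprint(add(1, 2))\n"       # after cursor
                | 
                |   # The model generates the missing "middle" part.
                |   prompt = (f"<|fim_prefix|>{prefix}"
                |             f"<|fim_suffix|>{suffix}<|fim_middle|>")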
        
         | halyconWays wrote:
          | Please make one for JetBrains' IDEs!
        
         | bangaladore wrote:
            | Quick testing on VS Code to see if I'd consider replacing
            | Copilot with this. The biggest showstopper for me right
            | now is that the output length is quite short. The default
            | length is set to 256, but even if I up it to 4096, I'm not
            | getting any larger chunks of code.
            | 
            | Is this because of a max latency setting, or the internal
            | prompt, or am I doing something wrong? Or is it only
            | really meant to autocomplete lines and not blocks like
            | Copilot will?
         | 
         | Thanks :)
        
           | ggerganov wrote:
           | There are 4 stopping criteria atm:
           | 
           | - Generation time exceeded (configurable in the plugin
           | config)
           | 
           | - Number of tokens exceeded (not the case since you increased
           | it)
           | 
           | - Indentation - stops generating if the next line has shorter
           | indent than the first line
           | 
           | - Small probability of the sampled token
           | 
            | Most likely you are hitting the last criterion. It's something
           | that should be improved in some way, but I am not very sure
           | how. Currently, it is using a very basic token sampling
           | strategy with a custom threshold logic to stop generating
           | when the token probability is too low. Likely this logic is
           | too conservative.
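            | 
            | In pseudocode, that last criterion amounts to roughly this
            | (a simplified Python sketch, not the actual server logic;
            | sample_next is a stand-in for the sampler):
            | 
            |   def generate(sample_next, prob_threshold=0.1,
            |                max_tokens=256):
            |       out = []
            |       for _ in range(max_tokens):
            |           # sampled token plus its probability
            |           token, prob = sample_next(out)
            |           if prob < prob_threshold:
            |               break    # model is too unsure - stop here
            |           out.append(token)
            |       return out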
        
             | bangaladore wrote:
             | Hmm, interesting.
             | 
             | I didn't catch T_max_predict_ms and upped that to 5000ms
             | for fun. Doesn't seem to make a difference, so I'm guessing
             | you are right.
        
         | attentive wrote:
         | Is it correct to assume this plugin won't work with ollama?
         | 
         | If so, what's ollama missing?
        
       | msoloviev wrote:
       | I wonder how the "ring context" works under the hood. I have
       | previously had (and recently messed around with again) a somewhat
       | similar project designed for a more toy/exploratory setting
       | (https://github.com/blackhole89/autopen - demo video at
       | https://www.youtube.com/watch?v=1O1T2q2t7i4), and one of the main
       | problems to address definitively is the question of how to manage
       | your KV cache cleverly so you don't have to constantly perform
       | too much expensive recomputation whenever the buffer undergoes
       | local changes.
       | 
       | The solution I came up with involved maintaining a tree of tokens
       | branching whenever an alternative next token was explored, with
       | full LLM state snapshots at fixed depth intervals so that the
       | buffer would only have to be "replayed" for a few tokens when
       | something changed. I wonder if there are some mathematical
       | properties of how the important parts of the state (really, the
       | KV cache, which can be thought of as a partial precomputation of
       | the operation that one LLM iteration performs on the context)
       | work that could have made this more efficient, like to avoid
       | saving full snapshots or perhaps to be able to prune the "oldest"
       | tokens out of a state efficiently.
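        | 
        | In pseudocode, the snapshot scheme looks roughly like this
        | (heavily simplified Python, ignoring the branching; "llm" is a
        | hypothetical backend exposing eval/save_state/load_state, and
        | the saved state is essentially the KV cache):
        | 
        |   SNAPSHOT_EVERY = 16
        | 
        |   class Session:
        |       def __init__(self, llm):
        |           self.llm = llm
        |           self.tokens = []
        |           # position -> saved state (initial state at 0)
        |           self.snapshots = {0: llm.save_state()}
        | 
        |       def append(self, token):
        |           self.llm.eval(token)
        |           self.tokens.append(token)
        |           n = len(self.tokens)
        |           if n % SNAPSHOT_EVERY == 0:
        |               self.snapshots[n] = self.llm.save_state()
        | 
        |       def edit(self, position, new_tokens):
        |           # Rewind to the nearest snapshot at or before the
        |           # edit, then replay only the few tokens after it
        |           # instead of re-evaluating the whole buffer.
        |           base = max(p for p in self.snapshots
        |                      if p <= position)
        |           self.llm.load_state(self.snapshots[base])
        |           self.snapshots = {p: s for p, s in
        |                             self.snapshots.items()
        |                             if p <= base}
        |           self.tokens = (self.tokens[:position]
        |                          + list(new_tokens))
        |           for token in self.tokens[base:]:
        |               self.llm.eval(token)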
       | 
       | (edit: Georgi's comment that beat me by 3 minutes appears to be
       | pointing at information that would go some way to answer my
       | questions!)
        
       | entelechy0 wrote:
        | I use this on and off. It is nice that I can flip between
        | this and Copilot by commenting out one line in my init.lua.
        
       | eigenvalue wrote:
       | This guy is a national treasure and has contributed so much value
       | to the open source AI ecosystem. I hope he's able to attract
       | enough funding to continue making software like this and
       | releasing it as true "no strings attached" open source.
        
         | frankfrank13 wrote:
         | Hard agree. This alone replaces GH Copilot/Cursor ($10+ a
         | month)
        
         | nacs wrote:
         | > This guy is a national treasure
         | 
         | Agreed but he's an _international_ treasure (his Github profile
         | states Bulgaria).
        
         | feznyng wrote:
         | They have: https://ggml.ai/ under the Company heading.
        
       | frankfrank13 wrote:
       | Is this more or less the same as your VSCode version?
       | (https://github.com/ggml-org/llama.vscode)
        
       | mohsen1 wrote:
       | Terminal coding FTW!
       | 
       | And when you're really stuck you can use DeepSeek R1 for a deeper
       | analysis in your terminal using `askds`
       | 
       | https://github.com/bodo-run/askds
        
       | opk wrote:
       | Has anyone actually got this llama stuff to be usable on even
       | moderate hardware? I find it just crashes because it doesn't find
       | enough RAM. I've got 2G of VRAM on an AMD graphics card and 16G
       | of system RAM and that doesn't seem to be enough. The impression
       | I got from reading up was that it worked for most Apple stuff
       | because the memory is unified and other than that, you need very
       | expensive Nvidia GPUs with lots of VRAM. Are there any affordable
       | options?
        
         | basilgohar wrote:
         | I can run 7B models with Q4 quantization on a 7000 series AMD
         | APU without GPU acceleration quite acceptably fast. This is
          | with DDR5-5600 RAM, which is the current roadblock for
          | performance.
         | 
         | Larger models work but slow down. I do have 64GB of RAM but I
          | think 32 could work. 16GB is pushing it, but should be possible
         | if you don't have anything else open.
         | 
         | Memory requirements depend on numerous factors. 2GB VRAM is not
         | enough for most GenAI stuff today.
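          | 
          | Rough back-of-the-envelope numbers, if it helps (theoretical
          | peaks; real throughput is lower, and scales plus the KV
          | cache add to the memory figure):
          | 
          |   params = 7e9
          |   bytes_per_weight = 0.5     # ~4 bits/weight for Q4
          |   model_gb = params * bytes_per_weight / 1e9
          |   print(f"~{model_gb:.1f} GB for the weights")   # ~3.5 GB
          | 
          |   # Generation is roughly memory-bandwidth bound: each
          |   # token reads (about) the whole model once.
          |   bw_gbs = 89.6              # dual-channel DDR5-5600 peak
          |   print(f"~{bw_gbs / model_gb:.0f} tokens/s upper bound")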
        
         | zamadatix wrote:
          | 2G is pretty low, and the sizes of models you can get to run
          | fast on that setup probably aren't particularly attractive.
          | "Moderate hardware" varies, but you can grab a 12 GB RTX
          | 3060 on eBay for ~$200. You can get a lot more RAM for $200,
          | but it'll be so slow compared to the GPU that I'm not sure
          | I'd recommend it if you actually want to use things like
          | this interactively.
         | 
         | If "moderate hardware" is your average office PC then it's
         | unlikely to be very usable. Anyone with a gaming GPU from the
         | last several years should be workable though.
        
           | horsawlarway wrote:
           | I'll second this, actually - $250 for a 12gb rtx 3060 is
           | probably a better buy than $400 for 2xp40s for 16gb.
           | 
            | It's been a minute since I checked refurb prices, and $250
            | for the rtx 3060 12gb is a good price.
           | 
           | Easier on the rest of the system than a 2x card setup, and is
           | probably a drop in replacement.
        
         | bhelkey wrote:
          | Have you tried Ollama [1]? You should be able to run an 8b
          | model in RAM and a 1b model in VRAM.
         | 
         | [1] https://news.ycombinator.com/item?id=42069453
        
         | horsawlarway wrote:
         | Yes. Although I suspect my definition of "moderate hardware"
         | doesn't really match yours.
         | 
         | I can run 2b-14b models just fine on the CPU on my laptop
         | (framework 13 with 32gb ram). They aren't super fast, and the
         | 14b models have limited context length unless I run a quantized
         | version, but they run.
         | 
         | If you just want generation and it doesn't need to be fast...
         | drop the $200 for 128gb of system ram, and you can run the vast
         | majority of the available models (up to ~70b quantized). Note -
         | it won't be quick (expect 1-2 tokens/second, sometimes less).
         | 
         | If you want something faster in the "low end" range still -
         | look at picking up a pair of Nvidia p40s (~$400) which will
         | give you 16gb of ram and be faster for 2b to 7b models.
         | 
         | If you want to hit my level for "moderate", I use 2x3090 (I
         | bought refurbed for ~$1600 a couple years ago) and they do
         | quite a bit of work. Ex - I get ~15t/s generation for 70b 4
         | quant models, and 50-100t/s for 7b models. That's plenty usable
         | for basically everything I want to run at home. They're faster
         | than the m2 pro I was issued for work, and a good chunk cheaper
         | (the m2 was in the 3k range).
         | 
         | That said - the m1/m2 macs are generally pretty zippy here, I
         | was quite surprised at how well they perform.
         | 
         | Some folks claim to have success with the k80s, but I haven't
         | tried and while 24g vram for under $100 seems nice (even if
         | it's slow), the linux compatibility issues make me inclined to
         | just go for the p40s right now.
         | 
         | I run some tasks on much older hardware (ex - willow inference
         | runs on an old 4gb gtx 970 just fine)
         | 
         | So again - I'm not really sure we'd agree on moderate (I
         | generally spend ~$1000 every 4-6 years to build a machine to
         | play games, and the machine you're describing would match the
         | specs for a machine I would have built 12+ years ago)
         | 
          | But you just need literal memory. Bumping to 32gb of system
          | ram would unlock a lot of stuff for you (at low speeds) and
          | costs $50. Bumping to 128gb only costs a couple hundred, and
          | lets you run basically all of them (again - slowly).
        
       | colordrops wrote:
       | I've seen several posts and projects like this. Is there a
       | summary/comparison somewhere of the various ways of running local
       | completion/copilot?
        
       | h14h wrote:
       | A little bit of a tangent, but I'm really curious what benefits
       | could come from integrating these LLM tools more closely with
       | data from LSPs, compilers, and other static analysis tools.
       | 
       | Intuitively, it seems like you could provide much more context
       | and better output as a result. Even better would be if you could
       | fine-tune LLMs on a per-language basis and ship them alongside
       | typical editor tooling.
       | 
        | A problem I see w/ these AI tools is that they work much
        | better with old, popular languages, and I worry that this
        | will become a significant factor when choosing a language.
        | Anecdotally, I see far better results when using TypeScript
        | than Gleam, for example.
       | 
       | It would be very cool to be able to install a Gleam-specific
       | model that could be fed data from the LSP and compiler, and
       | wouldn't constantly hallucinate invalid syntax. I also wonder if,
       | with additional context & fine-tuning, you could make these
       | models smaller and more feasible to run locally on modest
       | hardware.
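        | 
        | Even without fine-tuning, just folding tool output into the
        | completion context would be a start. A hypothetical sketch of
        | the shape I have in mind (the function and field names are
        | made up, purely to illustrate):
        | 
        |   def build_context(prefix, suffix, diagnostics, signatures,
        |                     budget_chars=4000):
        |       """Prepend machine-derived facts to the FIM prefix."""
        |       facts = []
        |       for sig in signatures:    # e.g. from LSP hover info
        |           facts.append(f"// signature: {sig}")
        |       for diag in diagnostics:  # e.g. from the last compile
        |           facts.append(f"// diagnostic: {diag}")
        |       header = "\n".join(facts)[:budget_chars]
        |       return header + "\n" + prefix, suffix
        | 
        |   prefix, suffix = build_context(
        |       "pub fn main() {\n  io.pri",
        |       "\n}",
        |       diagnostics=["io.printline does not exist"],
        |       signatures=["gleam/io.println(string: String) -> Nil"],
        |   )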
        
       | amelius wrote:
       | This looks very interesting. Can this be trained on the user's
       | codebase, or is the idea that everything must fit inside the
       | context buffer?
        
       | binary132 wrote:
       | I am curious to see what will be possible with consumer grade
       | hardware and more improvements to quantization over the next
       | decade. Right now, even a 24GB gpu with the best models isn't
       | able to match the barely acceptable performance of hosted
        | services that I'm not willing to pay even $20 a month for.
        
       | morcus wrote:
        | Looking for advice from someone who knows the space: suppose
        | I'm willing to go out and buy a card for this purpose, what's
        | a modestly priced graphics card with which I can get somewhat
        | usable results running a local LLM?
        
       ___________________________________________________________________
       (page generated 2025-01-23 23:00 UTC)