[HN Gopher] Llama.vim - Local LLM-assisted text completion
___________________________________________________________________
Llama.vim - Local LLM-assisted text completion
Author : kgwgk
Score : 252 points
Date : 2025-01-23 18:06 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dingnuts wrote:
| Is anyone actually getting value out of these models? I wired one
| up to Emacs and the local models all produce a huge volume of
| garbage output.
|
| Occasionally I find a hosted LLM useful but I haven't found any
| output from the models I can run in Ollama on my gaming PC to be
| useful.
|
| It's all plausible-looking but incorrect. I feel like I'm taking
| crazy pills when I read about others' experiences. Surely I am
| not alone?
| remexre wrote:
| I work on compilers. A friend of mine works on webapps. I've
| seen Cursor give him lots of useful code, but it's never been
| particularly useful on any of the code of mine that I've tried
| it on.
|
| It seems very logical to me that there'd be orders of magnitude
| more training data for some domains than others, and that
| existing models' skill is not evenly distributed cross-domain.
| dkga wrote:
| This. Also across languages. For example, I suppose there is
| a lot more content in Python and JavaScript than in
| AppleScript. (And to be fair, not many of the Python
| suggestions I receive are actually mindblowingly good.)
| q0uaur wrote:
| I'm still patiently waiting for an easy way to point a model
| at some documentation and have it actually use that.
|
| My use case is GDScript for Godot games, and all the models
| I've tried so far use Godot 2 stuff that's just not around
| anymore. Even if you tell them to use Godot 4, they give far
| too much wrong output to be useful.
|
| I wish I could just point them at the latest Godot docs and
| get up-to-date answers, but seeing as that's still not a
| thing, I guess it's more complicated than I expect.
| doctoboggan wrote:
| It's definitely a thing already. Look up "RAG" (Retrieval
| Augmented Generation). Most of the popular closed source
| companies offer RAG services via their APIs, and you can
| also do it with local llms using open-webui and probably
| many other local UIs.
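|
| A rough sketch of the idea (not any particular product's API;
| it assumes a local OpenAI-compatible server such as
| llama-server on localhost:8080, and uses naive keyword overlap
| in place of a real embedding index):
|
|     # Minimal RAG sketch: pick the most relevant doc snippets
|     # and prepend them to the prompt for a local model.
|     import requests
|
|     DOCS = [
|         "Godot 4: use @export instead of the export keyword.",
|         "Godot 4: yield was replaced by await on signals.",
|     ]
|
|     question = "How do I export a variable in Godot 4?"
|
|     # naive retrieval: rank docs by keyword overlap
|     words = set(question.lower().split())
|     ranked = sorted(
|         DOCS, key=lambda d: -len(words & set(d.lower().split()))
|     )
|     context = "\n".join(ranked[:2])
|
|     messages = [
|         {"role": "system",
|          "content": "Answer only from these docs:\n" + context},
|         {"role": "user", "content": question},
|     ]
|     resp = requests.post(
|         "http://localhost:8080/v1/chat/completions",
|         json={"model": "local", "messages": messages},
|         timeout=120,
|     )
|     print(resp.json()["choices"][0]["message"]["content"])
|
| A real setup swaps the keyword match for an embedding index,
| but the shape is the same: retrieve, stuff into the prompt,
| generate.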
| mohsen1 wrote:
| Cursor can follow links
| psytrx wrote:
| There's llms.txt [0], but it's not gaining much popularity.
|
| My web framework of choice provides these [1], but they can't
| be injected into the LLM context without a fair amount of
| fuss. It would be a game changer if more LLM tools implemented
| them.
|
| [0] https://llmstxt.org/ [1] https://svelte.dev/docs/llms
| codingdave wrote:
| Yep - I don't get a ton of value out of autocompletions, but I
| get decent value from asking an LLM how they would approach
| more complex functions or features. I rarely get code back that
| I can copy/paste, but reading their output is something I can
| react to - whether it is good or bad, just having a starting
| point speeds up the design of new features vs. me burning time
| creating my first/worst draft. And that is the goal here, isn't
| it? To get some productivity gains?
|
| So maybe it is just a difference in perspective? Even incorrect
| code and bad ideas can still be helpful. It is only useless if
| you expect them to hand you working code.
| coder543 wrote:
| > I wired one up
|
| "One"? Wired up _how_? There is a huge difference between the
| best and worst. They aren't fungible. _Which_ one? How long
| ago? Did it even support FIM (fill in middle), or was it
| blindly guessing from the left side? Did the plugin even gather
| appropriate context from related files, or was it only looking
| at the current file?
|
| If you try Copilot or Cursor today, you can experience what
| "the best" looks like, which gives you a benchmark to measure
| smaller, dumber models and plugins against. No, Copilot and
| Cursor are not available for emacs, as far as I know... but if
| you want to understand if a technology is useful, you don't
| start with the worst version and judge from that. (Not saying
| emacs itself is the worst... just that _without more context_,
| my assumption is that whatever plugin you encountered was
| probably using a bottom-tier model, and I doubt the plugin
| itself was helping that model do its best.)
|
| There are some local code completion models that I think are
| perfectly fine, but I don't know where you will draw the line
| on how good is good enough. If you can prove to yourself that
| the best models are good enough, then you can try out different
| local models and see if one of those works for you.
| yoyohello13 wrote:
| https://github.com/copilot-emacs/copilot.el
| Lanedo wrote:
| There is https://github.com/copilot-emacs/copilot.el that
| gets copilot to work in emacs via JS glue code and binaries
| provided by copilot.vim.
|
| I hacked up a slim alternative localpilot.js layer that uses
| llama-server instead of the copilot API, so copilot.el can be
| used with local LLMs, but I find the copilot.el overlays
| kinda buggy... It'd probably be better to instead write a
| llamapilot.el for local LLMs from scratch for emacs.
| b5n wrote:
| Emacs has had multiple LLM integration packages available for
| quite a while (relative to the rise of LLMs). `gptel` supports
| multiple providers, including Anthropic, OpenAI, Ollama, etc.
|
| https://github.com/karthink/gptel
| fovc wrote:
| > I feel like I'm taking crazy pills when I read about others'
| experiences. Surely I am not alone?
|
| You're not alone :-) I asked a very similar question about a
| month ago: https://news.ycombinator.com/item?id=42552653 and
| have continued researching since.
|
| My takeaway was that autocomplete, boilerplate, and one-off
| scripts are the main use cases. To use an analogy, I think the
| code assistants are more like an upgrade from handsaw to power
| tools and less like hiring a carpenter. (Which is not what the
| hype engine will claim).
|
| For me, only the one-off script (write-only code) use-case is
| useful. I've had the best results on this with Claude.
|
| Emacs abbrevs/snippets (+ choice of language) virtually
| eliminate the boilerplate problem, so I don't have a use for
| assistants there.
|
| For autocomplete, I find that LSP completion engines provide
| 95% of the value for 1% of the latency. Physically typing the
| code is a small % of my time/energy, so the value is more about
| getting the right names, argument order, and other fiddly
| details I may not remember exactly. But I find that LSP-
| powered autocomplete and tooltips largely solve those
| challenges.
| barrell wrote:
| I think you make a very good point about your existing
| devenv. I recently turned off GitHub copilot after maybe 2
| years of use -- I didn't realize how often I was using its
| completions over LSPs.
|
| Quality of Life went up massively. LSPs and nvim-cmp have
| come a long way (although one of these days I'll try
| blink.cmp)
| tomr75 wrote:
| try cursor
| sangnoir wrote:
| > Is anyone actually getting value out of these models?
|
| I've found incredible value in having LLMs help me write unit
| tests. The quality of the test code is far from perfect, but AI
| tooling - Claude Sonnet specifically - is good at coming up
| with reasonable unit test cases after I've written the code
| under test (sue me, TDD zealots). I probably have to fix 30% of
| the tests and expand the test cases, but I'd say it cuts the
| number of test code lines I author by more than 80%. This has
| decreased the friction so much, I've added Continuous
| Integration to small, years-old personal projects that had no
| tests before.
|
| I've found lesser value with refactoring and adding code docs,
| but that's more of autocomplete++ using natural language rather
| than AST-derived code.
| righthand wrote:
| Honestly, I just disabled my TabNine plugin and found that the
| LSP server is good enough for 99% of what I do. I really don't
| need hypothetical output suggested to me. I'm comfortable
| reading docs though so others may feel different.
| jerpint wrote:
| It's funny because I actually use vim mostly when I don't want
| LLM assisted code. Sometimes it just gets in the way.
|
| If I do, I load up cursor with vim bindings.
| renewiltord wrote:
| Funny. I think the most common usage I have is using it at the
| command line to write commands, with vim set as my EDITOR, so
| the AI completion really helps.
|
| This will help for offline support (on planes and such).
| qup wrote:
| Can you more specifically talk about how you use this, like
| with a small example?
| renewiltord wrote:
| Yes, I mentioned it here first and haven't changed it since
| (Twitter link and links include video of use)
|
| https://news.ycombinator.com/item?id=34769611
|
| Which leads (used to lead?) here:
| https://wiki.roshangeorge.dev/index.php/AI_Completion_In_The...
| VMG wrote:
| - composing a commit message
|
| - anything bash script related
| ggerganov wrote:
| Hi HN, happy to see this here!
|
| I highly recommend taking a look at the technical details of the
| server implementation that enables large context usage with this
| plugin - I think it is interesting and has some cool ideas [0].
|
| Also, the same plugin is available for VS Code [1].
|
| Let me know if you have any questions about the plugin - happy to
| explain. Btw, the performance has improved compared to what is
| seen in the README videos thanks to client-side caching.
|
| [0] - https://github.com/ggerganov/llama.cpp/pull/9787
|
| [1] - https://github.com/ggml-org/llama.vscode
| jerpint wrote:
| Thank you for all of your incredible contributions!
| amrrs wrote:
| For those who don't know, he is the gg of `gguf`. Thank you for
| all your contributions! Literally the core of Ollama, LMStudio,
| Jan and multiple other apps!
| sergiotapia wrote:
| well hot damn! killing it!
| halyconWays wrote:
| But is he the jt, the developer who reduced memory
| utilization by 50%?
| madeforhnyo wrote:
| Someone did? Could you pls share a link?
| kamranjon wrote:
| They collaborate together! Her name is Justine Tunney - she
| took her "execute everywhere" work with Cosmopolitan to
| make Llamafile using the llama.cpp work that Georgi has
| done.
| nancyp wrote:
| TIL: Vim has its own language. Thanks Georgi for llama.cpp!
| nacs wrote:
| Vim is incredibly extensible.
|
| You can use C or Vimscript, but programs like Neovim support
| Lua as well which makes it really easy to make plugins.
| liuliu wrote:
| KV cache shifting is interesting!
|
| Just curious: how much of your code nowadays is completed by LLM?
| ggerganov wrote:
| Yes, I think it is surprising that it works.
|
| I think a fairly large amount, though can't give a good
| number. I have been using Github Copilot from the very early
| days and with the release of Qwen Coder last year have fully
| switched to using local completions. I don't use the chat
| workflow to code though, only FIM.
| gloflo wrote:
| What is FIM?
| jjnoakes wrote:
| Fill-in-the-middle. If your cursor is in the middle of a
| file instead of at the end, then the LLM will consider
| text after the cursor in addition to the text before the
| cursor. Some LLMs can only look before the cursor; for
| coding, ones that can FIM work better (for me at least).
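|
| For instance, llama-server exposes an /infill endpoint for
| FIM. A rough sketch (the port and field names are from memory,
| so check the server docs; you also need a FIM-capable model
| such as Qwen2.5-Coder loaded):
|
|     # FIM sketch: the model sees code on both sides of the
|     # cursor and fills the gap between them.
|     import requests
|
|     prefix = "def mean(xs):\n    total = "
|     suffix = "\n    return total / len(xs)\n"
|
|     resp = requests.post(
|         # 8012 is llama.vim's default port, if memory serves
|         "http://127.0.0.1:8012/infill",
|         json={
|             "input_prefix": prefix,   # code before the cursor
|             "input_suffix": suffix,   # code after the cursor
|             "n_predict": 64,
|         },
|         timeout=60,
|     )
|     print(resp.json()["content"])   # e.g. "sum(xs)"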
| rav wrote:
| FIM is "fill in middle", i.e. completion in a text editor
| using context on both sides of the cursor.
| halyconWays wrote:
| Please make one for Jetbrains' IDEs!
| bangaladore wrote:
| Quick testing on vscode to see if I'd consider replacing
| Copilot with this. The biggest showstopper for me right now is
| that the output length is quite short. The default length is set
| to 256, but even if I up it to 4096, I'm not getting any larger
| chunks of code.
|
| Is this because of a max latency setting, or the internal
| prompt, or am I doing something wrong? Or is it only really
| meant to autocomplete lines and not blocks, like Copilot will?
|
| Thanks :)
| ggerganov wrote:
| There are 4 stopping criteria atm:
|
| - Generation time exceeded (configurable in the plugin
| config)
|
| - Number of tokens exceeded (not the case since you increased
| it)
|
| - Indentation - stops generating if the next line has shorter
| indent than the first line
|
| - Small probability of the sampled token
|
| Most likely you are hitting the last criterion. It's something
| that should be improved in some way, but I am not very sure
| how. Currently, it is using a very basic token sampling
| strategy with a custom threshold logic to stop generating
| when the token probability is too low. Likely this logic is
| too conservative.
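|
| To illustrate (not the actual llama.cpp code - just a sketch
| of how the four checks compose; generate_one() is a stand-in
| for sampling one token, and the thresholds are made up):
|
|     import time
|
|     def indent_of(line):
|         return len(line) - len(line.lstrip(" \t"))
|
|     def complete(first_indent,
|                  t_max_ms=500, n_max=256, p_min=0.05):
|         text, start = "", time.monotonic()
|         for i in range(n_max):            # 2. token count cap
|             elapsed = (time.monotonic() - start) * 1000
|             if elapsed > t_max_ms:        # 1. generation time cap
|                 break
|             tok, prob = generate_one(i)   # stand-in sampler
|             if prob < p_min:              # 4. token too unlikely
|                 break
|             text += tok
|             lines = text.split("\n")
|             if len(lines) > 1 and lines[-1].strip():
|                 if indent_of(lines[-1]) < first_indent:
|                     # 3. next line dedented past the first line
|                     return "\n".join(lines[:-1])
|         return text
|
|     # Stub sampler so the sketch runs; real code asks the model.
|     CANNED = [("    total", 0.9), (" += x", 0.8), ("\n", 0.7),
|               ("return", 0.02)]
|
|     def generate_one(i):
|         return CANNED[min(i, len(CANNED) - 1)]
|
|     print(complete(first_indent=4))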
| bangaladore wrote:
| Hmm, interesting.
|
| I didn't catch T_max_predict_ms and upped that to 5000ms
| for fun. Doesn't seem to make a difference, so I'm guessing
| you are right.
| attentive wrote:
| Is it correct to assume this plugin won't work with ollama?
|
| If so, what's ollama missing?
| msoloviev wrote:
| I wonder how the "ring context" works under the hood. I have
| previously had (and recently messed around with again) a somewhat
| similar project designed for a more toy/exploratory setting
| (https://github.com/blackhole89/autopen - demo video at
| https://www.youtube.com/watch?v=1O1T2q2t7i4), and one of the main
| problems to address definitively is the question of how to manage
| your KV cache cleverly so you don't have to constantly perform
| too much expensive recomputation whenever the buffer undergoes
| local changes.
|
| The solution I came up with involved maintaining a tree of tokens
| branching whenever an alternative next token was explored, with
| full LLM state snapshots at fixed depth intervals so that the
| buffer would only have to be "replayed" for a few tokens when
| something changed. I wonder if there are some mathematical
| properties of how the important parts of the state (really, the
| KV cache, which can be thought of as a partial precomputation of
| the operation that one LLM iteration performs on the context)
| work that could have made this more efficient, like to avoid
| saving full snapshots or perhaps to be able to prune the "oldest"
| tokens out of a state efficiently.
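|
| Something like this toy sketch (not the actual autopen code;
| advance() just stands in for one model step updating the KV
| cache, and here the "state" is only a tuple of tokens):
|
|     SNAPSHOT_EVERY = 4          # keep full state every 4 levels
|
|     def advance(state, token):  # stand-in for one LLM step
|         return state + (token,)
|
|     class Node:
|         def __init__(self, token, parent=None):
|             self.token, self.parent = token, parent
|             self.depth = (parent.depth + 1) if parent else 0
|             self.snapshot = None
|             if self.depth % SNAPSHOT_EVERY == 0:
|                 self.snapshot = state_of(self)
|
|     def state_of(node):
|         if node is None:
|             return ()           # empty context
|         if node.snapshot is not None:
|             return node.snapshot
|         # replay from the nearest snapshotted ancestor
|         return advance(state_of(node.parent), node.token)
|
|     root = Node("<s>")
|     a = Node("fn", root)
|     b = Node(" main", a)        # one continuation...
|     alt = Node(" foo", a)       # a branch off the same parent
|     print(state_of(b), state_of(alt))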
|
| (edit: Georgi's comment that beat me by 3 minutes appears to be
| pointing at information that would go some way to answer my
| questions!)
| entelechy0 wrote:
| I use this on-and-off again. It is nice that I can flip between
| this and Copilot by commenting out one line in my init.lua
| eigenvalue wrote:
| This guy is a national treasure and has contributed so much value
| to the open source AI ecosystem. I hope he's able to attract
| enough funding to continue making software like this and
| releasing it as true "no strings attached" open source.
| frankfrank13 wrote:
| Hard agree. This alone replaces GH Copilot/Cursor ($10+ a
| month)
| nacs wrote:
| > This guy is a national treasure
|
| Agreed but he's an _international_ treasure (his Github profile
| states Bulgaria).
| feznyng wrote:
| They have: https://ggml.ai/ under the Company heading.
| frankfrank13 wrote:
| Is this more or less the same as your VSCode version?
| (https://github.com/ggml-org/llama.vscode)
| mohsen1 wrote:
| Terminal coding FTW!
|
| And when you're really stuck you can use DeepSeek R1 for a deeper
| analysis in your terminal using `askds`
|
| https://github.com/bodo-run/askds
| opk wrote:
| Has anyone actually got this llama stuff to be usable on even
| moderate hardware? I find it just crashes because it doesn't find
| enough RAM. I've got 2G of VRAM on an AMD graphics card and 16G
| of system RAM and that doesn't seem to be enough. The impression
| I got from reading up was that it worked for most Apple stuff
| because the memory is unified and other than that, you need very
| expensive Nvidia GPUs with lots of VRAM. Are there any affordable
| options?
| basilgohar wrote:
| I can run 7B models with Q4 quantization on a 7000 series AMD
| APU without GPU acceleration quite acceptably fast. This is
| with DDR5-5600 RAM, which is the current bottleneck for
| performance.
|
| Larger models work but slow down. I do have 64GB of RAM but I
| think 32 could work. 16GB is pushing it, but should be possible
| if you don't have anything else open.
|
| Memory requirements depend on numerous factors. 2GB VRAM is not
| enough for most GenAI stuff today.
| zamadatix wrote:
| 2G is pretty low, and the sizes of models you can get to run
| fast on that setup probably aren't particularly attractive.
| "Moderate hardware" varies, but you can grab a 12 GB RTX 3060
| on eBay for ~$200. You can get a lot more RAM for $200, but
| it'll be so slow compared to the GPU that I'm not sure I'd
| recommend it if you actually want to use things like this
| interactively.
|
| If "moderate hardware" is your average office PC then it's
| unlikely to be very usable. Anyone with a gaming GPU from the
| last several years should find it workable, though.
| horsawlarway wrote:
| I'll second this, actually - $250 for a 12gb rtx 3060 is
| probably a better buy than $400 for 2xp40s for 16gb.
|
| It'd been a minute since I checked refurb prices and $250 for
| the rtx 3060 12gb is a good price.
|
| Easier on the rest of the system than a 2x card setup, and is
| probably a drop in replacement.
| bhelkey wrote:
| Have you tried Ollama [1]? You should be able to run a 8b model
| in RAM and a 1b model in VRAM.
|
| [1] https://news.ycombinator.com/item?id=42069453
| horsawlarway wrote:
| Yes. Although I suspect my definition of "moderate hardware"
| doesn't really match yours.
|
| I can run 2b-14b models just fine on the CPU on my laptop
| (framework 13 with 32gb ram). They aren't super fast, and the
| 14b models have limited context length unless I run a quantized
| version, but they run.
|
| If you just want generation and it doesn't need to be fast...
| drop the $200 for 128gb of system ram, and you can run the vast
| majority of the available models (up to ~70b quantized). Note -
| it won't be quick (expect 1-2 tokens/second, sometimes less).
|
| If you want something faster in the "low end" range still -
| look at picking up a pair of Nvidia p40s (~$400) which will
| give you 16gb of ram and be faster for 2b to 7b models.
|
| If you want to hit my level for "moderate", I use 2x3090 (I
| bought refurbed for ~$1600 a couple years ago) and they do
| quite a bit of work. Ex - I get ~15 t/s generation for 70b Q4
| models, and 50-100 t/s for 7b models. That's plenty usable
| for basically everything I want to run at home. They're faster
| than the m2 pro I was issued for work, and a good chunk cheaper
| (the m2 was in the 3k range).
|
| That said - the m1/m2 macs are generally pretty zippy here, I
| was quite surprised at how well they perform.
|
| Some folks claim to have success with the k80s, but I haven't
| tried them, and while 24gb of vram for under $100 seems nice
| (even if it's slow), the linux compatibility issues make me
| inclined to just go for the p40s right now.
|
| I run some tasks on much older hardware (ex - willow inference
| runs on an old 4gb gtx 970 just fine)
|
| So again - I'm not really sure we'd agree on moderate (I
| generally spend ~$1000 every 4-6 years to build a machine to
| play games, and the machine you're describing would match the
| specs for a machine I would have built 12+ years ago)
|
| But you just need literal memory. Bumping to 32gb of system
| ram would unlock a lot of stuff for you (at low speeds) and
| costs $50. Bumping to 128gb only costs a couple hundred, and
| lets you run basically all of them (again - slowly).
| colordrops wrote:
| I've seen several posts and projects like this. Is there a
| summary/comparison somewhere of the various ways of running local
| completion/copilot?
| h14h wrote:
| A little bit of a tangent, but I'm really curious what benefits
| could come from integrating these LLM tools more closely with
| data from LSPs, compilers, and other static analysis tools.
|
| Intuitively, it seems like you could provide much more context
| and better output as a result. Even better would be if you could
| fine-tune LLMs on a per-language basis and ship them alongside
| typical editor tooling.
|
| A problem I see w/ these AI tools is that they work much better
| with old, popular languages, and I worry that this will grow as a
| significant factor when choosing a language. Anecdotally, I see
| far better results when using TypeScript than Gleam, for example.
|
| It would be very cool to be able to install a Gleam-specific
| model that could be fed data from the LSP and compiler, and
| wouldn't constantly hallucinate invalid syntax. I also wonder if,
| with additional context & fine-tuning, you could make these
| models smaller and more feasible to run locally on modest
| hardware.
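|
| As a toy example of what I mean, the editor could collect type
| signatures from the language server for identifiers near the
| cursor and prepend them to the completion prompt (HOVER below
| just fakes LSP textDocument/hover results):
|
|     import re
|
|     HOVER = {  # fake hover results, keyed by identifier
|         "list.wrap": "pub fn wrap(item: a) -> List(a)",
|         "result.map": "pub fn map(over: Result(a, e), "
|                       "with: fn(a) -> b) -> Result(b, e)",
|     }
|
|     def lsp_context(code_near_cursor):
|         idents = set(re.findall(r"[\w.]+", code_near_cursor))
|         sigs = [s for n, s in HOVER.items() if n in idents]
|         return "".join("// " + s + "\n" for s in sigs)
|
|     prefix = "fn wrapped(x) {\n  x |> result.map(list.wrap"
|     # the combined text would become the FIM prefix for the model
|     print(lsp_context(prefix) + prefix)
|
| Even a small local model should have a much easier time when
| the real signatures are sitting right there in its context.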
| amelius wrote:
| This looks very interesting. Can this be trained on the user's
| codebase, or is the idea that everything must fit inside the
| context buffer?
| binary132 wrote:
| I am curious to see what will be possible with consumer grade
| hardware and more improvements to quantization over the next
| decade. Right now, even a 24GB gpu with the best models isn't
| able to match the barely acceptable performance of hosted
| services that I'm not willing to pay even $20 a month for.
| morcus wrote:
| Looking for advice from someone who knows about the space -
| Suppose I'm willing to go out and buy a card for this purpose,
| what's a modestly priced graphics card with which I can get
| somewhat usable results running a local LLM?
___________________________________________________________________
(page generated 2025-01-23 23:00 UTC)