[HN Gopher] Show HN: Repogather - copy relevant files to clipboa...
___________________________________________________________________
Show HN: Repogather - copy relevant files to clipboard for LLM
coding workflows
Hey HN, I wanted to share a simple command line tool I made that
has sped up and simplified my LLM-assisted coding workflow.

Whenever possible, I've been trying to use Claude as a first pass
when implementing new features or changes. But I found that,
depending on the type of change I was making, I was spending a lot
of thought finding and deciding which source files should be
included in the prompt. The need to copy/paste each file
individually also becomes a mild annoyance.

First, I implemented `repogather --all`, which unintelligently
copies _all_ source files in your repository to the clipboard
(delimited by their relative filepaths). To my surprise, for less
complex repositories, this alone is often completely workable for
Claude -- much better than pasting in just the few files you are
looking to update. But I never would have done it if I had to
copy/paste everything individually. 200k is quite a lot of tokens!
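
To give a sense of the mechanics, here's a minimal sketch of the
`--all` idea (not the actual implementation -- it assumes the
pyperclip package for clipboard access and makes up a simple
path-header delimiter):

    import os

    import pyperclip  # assumed third-party clipboard dependency

    SKIP_DIRS = {".git", "node_modules", "__pycache__"}

    def gather_all(root="."):
        chunks = []
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                try:
                    text = open(path, encoding="utf-8").read()
                except (UnicodeDecodeError, OSError):
                    continue  # skip binaries and unreadable files
                # delimit each file by its relative filepath
                rel = os.path.relpath(path, root)
                chunks.append(f"# --- {rel} ---\n{text}")
        return "\n\n".join(chunks)

    if __name__ == "__main__":
        pyperclip.copy(gather_all())
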
But as soon as the repository grows to a certain complexity level
(even if it is under the input token limit), I've found that Claude
can get confused by unrelated parts and concepts across the code.
It performs much better if you make an attempt to exclude logic
that is irrelevant to your current change.

So I implemented `repogather "<query here>"`, e.g. `repogather
"only files related to authentication"`. This uses gpt-4o-mini with
structured outputs to provide a relevance score for each source
file (with automatic exclusions for .gitignore patterns, tests, and
configuration, plus manual exclusions with `--exclude <pattern>`).
gpt-4o-mini is so cheap and fast that, for my ~8-dev startup's
repo, it takes under 5 seconds and costs 3-4 cents (with
appropriate exclusions). Plus, you get to watch the output stream
while you wait, which always feels fun.
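
Roughly, the relevance pass looks like the sketch below (a
simplified illustration rather than the real code; it assumes the
openai Python client's structured-output `parse` helper and
pydantic):

    from openai import OpenAI
    from pydantic import BaseModel

    class FileScore(BaseModel):
        path: str
        relevance: int  # 0-100

    class RelevanceReport(BaseModel):
        files: list[FileScore]

    client = OpenAI()

    def score_files(query: str, files: dict[str, str]) -> RelevanceReport:
        # concatenate every candidate file under a path header
        corpus = "\n\n".join(f"# --- {p} ---\n{t}" for p, t in files.items())
        completion = client.beta.chat.completions.parse(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "Score each file's relevance (0-100) to the query."},
                {"role": "user", "content": f"Query: {query}\n\n{corpus}"},
            ],
            response_format=RelevanceReport,  # structured outputs
        )
        return completion.choices[0].message.parsed
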
The retrieval isn't always perfect the first time -- but it is
fast, which lets you see what files it returned and iterate quickly
on your command. I've found this to be much more satisfying than
the embedding-search-based solutions I've used, which seem to fail
in pretty opaque ways.

https://github.com/gr-b/repogather

Let me know if it is useful to you! Always love to talk about how
to better integrate LLMs into coding workflows.
Author : grbsh
Score : 42 points
Date : 2024-09-12 14:03 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| reacharavindh wrote:
| Do you literally paste a wall of text (the source code of the
| whole filtered repo) into the prompt and ask the LLM to give you
| a diff patch as an answer to your question?
|
| Example,
|
| Here is my whole project, now implement user authentication with
| plain username/password?
| punkpeye wrote:
| Yes
| ukuina wrote:
| Yes? I mean, it works for small projects.
| grbsh wrote:
| Yes! And I fought the urge to do this for so long, I think
| because it _feels_ wasteful for some reason? But Claude handles
| it like a champ*, and gets me significantly better and easier
| results than if I manually pasted a file in and described the
| rest of the context it needs by hand.
|
| * Until the repository gets more complicated, which is why we
| need the intelligent relevance filtering features of
| repogather, e.g. `repogather "Only files related to
| authentication and avatar uploads"`
| ukuina wrote:
| It's fascinating to see how different frameworks are dealing with
| the problem of populating context correctly. Aider, for example,
| asks users to manually add files to context. Claude Dev attempts
| to grep files based on LLM intent. And Continue.dev uses vector
| embeddings to find relevant chunks and files.
|
| I wonder if an increase in usable (not advertised) context tokens
| may obviate many of these approaches.
| punkpeye wrote:
| Continue.dev's approach sounds like it would provide the most
| relevant code?
| morgante wrote:
| Embeddings are actually generally not that effective for
| code.
| grbsh wrote:
| This is what I've found. It's so hard to do embeddings
| correctly, while it's so easy to search over a large corpus
| with a cheap LLM! Embeddings are also really inscrutable
| when they fail, whereas I find myself easily iterating if
| repogather fails to return the right group of files.
|
| Of course, 'cheap' is relative -- on a large repository,
| embeddings are 99%+ cheaper than even gpt-4o-mini.
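|
| For a rough sense of the gap (assuming 2024 list prices of about
| $0.02 per 1M tokens for text-embedding-3-small and $0.15 per 1M
| input tokens for gpt-4o-mini -- ballpark figures only): embeddings
| are computed once, while the LLM pass re-reads the repo on every
| query.
|
|     REPO_TOKENS = 1_000_000        # hypothetical 1M-token repository
|     EMBED_PRICE = 0.02 / 1e6       # $/token, embed once and reuse
|     MINI_PRICE = 0.15 / 1e6        # $/token, paid again on every query
|     QUERIES = 50
|
|     embed_total = REPO_TOKENS * EMBED_PRICE          # ~$0.02, one-time
|     mini_total = QUERIES * REPO_TOKENS * MINI_PRICE  # ~$7.50 over 50 queries
|     print(f"embeddings ~{100 * (1 - embed_total / mini_total):.1f}% cheaper")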
| avernet wrote:
| And how does repogather do it? From the README, it looks to me
| like it might provide the content of each file to the LLM to
| gauge its relevance. But this would seem prohibitively
| expensive on anything that isn't a very small codebase (the
| project I'm working on has on the order of 400k SLOC), even
| with gpt-4o-mini, wouldn't it?
| grbsh wrote:
| repogather does indeed, as a last step, stuff everything not
| otherwise excluded by cheap heuristics into gpt-4o-mini to gauge
| relevance, so it will get expensive for large projects. On my
| small 8-dev startup's repo, this operation costs 2-4 cents. I was
| considering adding an `--intelligence` option, where you could
| trade off different methods between cost, speed, and accuracy.
| But I've been very unsatisfied with both embedding search methods
| and agentic file search methods. They seem to regularly fail in
| very unpredictable ways. In contrast, this method works quite
| well for the projects I tend to work on.
|
| I think in the future as the cost of gpt-4o-mini level
| intelligence decreases, it will become increasingly worth it,
| even for larger repositories, to simply attend to every token
| for certain coding subtasks. I'm assuming here that relevance
| filtering is a much easier task than coding itself, otherwise
| you could just copy/paste everything into the final coding
| model's context. What I think would make much more sense for
| this project is to optimize the cost / performance of a small
| LLM fine-tuned for this source relevance task. I suspect I
| could do much better than gpt-4o-mini, but it would be
| difficult to deploy this for free.
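|
| As a rough back-of-the-envelope for a 400k SLOC codebase (assuming
| ~10 tokens per line and gpt-4o-mini's 2024 input price of roughly
| $0.15 per 1M tokens -- both ballpark assumptions):
|
|     sloc = 400_000
|     tokens = sloc * 10              # ~4M input tokens
|     cost = tokens * 0.15 / 1e6      # ~$0.60 per relevance pass
|     print(f"~{tokens / 1e6:.0f}M tokens, ~${cost:.2f} per query")
|
| And ~4M tokens would also have to be split across many requests,
| since gpt-4o-mini's context window is 128k tokens.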
| taneq wrote:
| I haven't looked into this but do any of them use modern IDE
| code inspection tools? I'd think you would dump as much "find
| references" and "show definition" output for relevant variables
| into the context as possible.
| faangguyindia wrote:
| Aider also uses an AST (tree-sitter); it creates a repo map
| using it and sends that to the LLM.
| grbsh wrote:
| I like this approach a lot, especially because it's not opaque
| like embeddings. Maybe I can add an option for repogather to use
| this approach instead, if you are cost sensitive.
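|
| A very rough sketch of what such a repo map could look like, using
| Python's built-in ast module as a single-language stand-in for
| tree-sitter (Aider's actual implementation differs):
|
|     import ast, os
|
|     def repo_map(root="."):
|         """List top-level functions/classes per Python file."""
|         lines = []
|         for dirpath, dirnames, filenames in os.walk(root):
|             dirnames[:] = [d for d in dirnames if d != ".git"]
|             for name in filenames:
|                 if not name.endswith(".py"):
|                     continue
|                 path = os.path.join(dirpath, name)
|                 try:
|                     tree = ast.parse(open(path, encoding="utf-8").read())
|                 except (SyntaxError, UnicodeDecodeError):
|                     continue
|                 defs = [n.name for n in tree.body if isinstance(
|                     n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
|                 if defs:
|                     rel = os.path.relpath(path, root)
|                     lines.append(f"{rel}: {', '.join(defs)}")
|         return "\n".join(lines)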
| jadbox wrote:
| How does Aider (cli) compare to Claude Dev (VSCode plugin)?
| Anyone have a subjective analysis?
| grbsh wrote:
| I've been frustrated with embedding search approaches, because
| when they fail, they fail opaquely -- I don't know how to
| iterate on my query in order to get close to what I expected.
| In contrast, since repogather merely wraps your query in a
| simple prompt, it's easier to intuit what went wrong if the
| results weren't as you expected.
|
| > I wonder if an increase in usable (not advertised) context
| tokens may obviate many of these approaches.
|
| I've been extremely interested in this question! Will be
| interesting to see how things develop, but I suspect that
| relevance filtering is not as difficult as coding, so small,
| cheap LLMs will make the former a solved, inexpensive problem,
| while we will continue to build larger and more expensive LLMs
| to solve the latter.
|
| That said, you can buy a lot of tokens for $150k, so this could
| be short sighted.
| doctoboggan wrote:
| I am really happy with how Aider does it as it feels like a
| happy medium. The state of LLMs these days means you really have
| to break the problem down into digestible parts, and if you are
| doing that, it's not much more work to specify the files that
| need to be edited in your request. Aider can also prompt
| you to add a file if it thinks that file needs to be edited.
| faangguyindia wrote:
| LLMs for coding are a bit meh after the novelty wears off.
|
| I've had problems where the LLM doesn't know which library
| version I am using. It keeps suggesting methods which do not
| exist, etc.
|
| It's as if LLMs are unaware of library versions.
|
| The place where I've found LLMs to be most effective and
| effortless is the CLI.
|
| My brother made this, but I use it every day:
| https://github.com/zerocorebeta/Option-K
| grbsh wrote:
| I agree - it's exciting at first, but then you have experiences
| where you go down a rabbit hole for an hour trying to fix or
| make use of LLM-generated code.
|
| So you really have to know when the LLM will be able to cleanly
| and neatly solve your problem, and when it's going to be
| frustrating, and simpler to just do it yourself character by
| character.
| That's why I'm exploring building tools like this, to try to
| iron out annoyances and improve quality of life for new LLM
| workflows.
|
| Option-K looks promising! I'll try it out.
| hereme888 wrote:
| When I get frustrated with GPT-4o, I then switch to Sonnet 3.5,
| usually with good results.
|
| In my limited experience Sonnet 3.5 is more elegant at coding
| and making use of different frameworks.
| faangguyindia wrote:
| I usually only edit one function at a time with an LLM on an old
| codebase.
|
| On greenfield projects, I ask Claude Sonnet to write all the
| function signatures, with return values etc.
|
| Then I have a script which sends these signatures to Google
| Flash, which writes all the functions for me.
|
| All this happens in parallel.
|
| I've found that if you limit the scope, Google Flash writes the
| best code, and it's ultra fast and cheap.
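|
| A skeleton of that fan-out might look something like this (a
| hypothetical sketch, not my actual script; it assumes the
| google-generativeai client and makes up the prompt format):
|
|     from concurrent.futures import ThreadPoolExecutor
|
|     import google.generativeai as genai  # assumed client library
|
|     genai.configure(api_key="...")
|     model = genai.GenerativeModel("gemini-1.5-flash")
|
|     def implement(signature: str) -> str:
|         prompt = ("Implement this function. Return only the code.\n\n"
|                   + signature)
|         return model.generate_content(prompt).text
|
|     def implement_all(signatures: list[str]) -> list[str]:
|         # fan the signatures out to Flash in parallel
|         with ThreadPoolExecutor(max_workers=8) as pool:
|             return list(pool.map(implement, signatures))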
| grbsh wrote:
| Interesting - isn't Google Flash worse at coding than Sonnet
| 3.5? I subscribe to Claude for $20/month, but even if the API
| were free, I'd still want to use the Claude interface for
| flexibility, artifacts and just understandability, which is why
| I don't use available coding assistants like Plandex or Aider.
|
| What if you need to iterate on the functions it gives? Do you
| just start over with a different prompt, or do you have the
| ability to do a refinement with Google Flash on existing
| functions?
| fellowniusmonk wrote:
| This looks very cool for complex queries!
|
| If your codebase is structured in a very modular way, then this
| one-liner mostly just works:
|
| find . -type f -exec echo {} \; -exec cat {} \; | pbcopy
| reidbarber wrote:
| Nice! I built something similar, but in the browser with drag-
| and-drop at https://files2prompt.com
|
| It doesn't have all the fancy LLM integration though.
___________________________________________________________________
(page generated 2024-09-12 23:00 UTC)