[HN Gopher] Show HN: Repogather - copy relevant files to clipboard for LLM coding workflows
       ___________________________________________________________________
        
       Show HN: Repogather - copy relevant files to clipboard for LLM
       coding workflows
        
       Hey HN, I wanted to share a simple command-line tool I made
       that has sped up and simplified my LLM-assisted coding
       workflow. Whenever possible, I've been trying to use Claude as
       a first pass when implementing new features / changes. But I
       found that, depending on the type of change, I was spending a
       lot of thought finding and deciding which source files should
       be included in the prompt. Having to copy/paste each file
       individually also becomes a mild annoyance.
        
       First, I implemented `repogather --all`, which unintelligently
       copies _all_ source files in your repository to the clipboard
       (delimited by their relative filepaths). To my surprise, for
       less complex repositories this alone is often completely
       workable for Claude -- much better than pasting in just the
       few files you are looking to update. But I never would have
       done it if I had to copy/paste everything individually. 200k
       is quite a lot of tokens!
        
       But as soon as the repository grows to a certain complexity
       level (even if it is under the input token limit), I've found
       that Claude can get confused by unrelated parts and concepts
       scattered across the code. It performs much better if you make
       an attempt to exclude logic that is irrelevant to your current
       change. So I implemented `repogather "<query here>"`, e.g.
       `repogather "only files related to authentication"`. This uses
       gpt-4o-mini with structured outputs to provide a relevance
       score for each source file (with automatic exclusions for
       .gitignore patterns, tests, and configuration, plus manual
       exclusions with `--exclude <pattern>`).
        
       gpt-4o-mini is so cheap and fast that for my ~8-dev startup's
       repo, the whole pass takes under 5 seconds and costs 3-4 cents
       (with appropriate exclusions). Plus, you get to watch the
       output stream while you wait, which always feels fun.
        
       The retrieval isn't always perfect the first time -- but it is
       fast, which lets you see which files it returned and iterate
       quickly on your command. I've found this much more satisfying
       than the embedding-search-based solutions I've used, which
       seem to fail in pretty opaque ways.
        
       https://github.com/gr-b/repogather
        
       Let me know if it is useful to you! Always love to talk about
       how to better integrate LLMs into coding workflows.
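       A minimal sketch of what the `--all` behavior boils down to
       (not repogather's actual code -- the delimiter format and the
       exclusion set here are assumptions for illustration):

```python
import os

# Stand-in for repogather's .gitignore/test/config exclusions
# (assumed, not the tool's real list).
EXCLUDED_DIRS = {".git", "node_modules", "__pycache__"}

def gather_all(root="."):
    """Concatenate every readable source file under root, each
    preceded by a delimiter line containing its relative filepath."""
    chunks = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            try:
                with open(path, encoding="utf-8") as f:
                    content = f.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            chunks.append(f"--- {rel} ---\n{content}")
    return "\n".join(chunks)

# A real tool would pipe the result to the clipboard (pbcopy/xclip);
# here we only build the string.
```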
        
       Author : grbsh
       Score  : 42 points
       Date   : 2024-09-12 14:03 UTC (8 hours ago)
        
        
       | reacharavindh wrote:
       | Do you literally paste a wall of text (source code of the
       | filtered whole repo) into the prompt and ask the LLM to give you
       | a diff patch as an answer to your question?
       | 
       | Example,
       | 
       | Here is my whole project, now implement user authentication with
       | plain username/password?
        
         | punkpeye wrote:
         | Yes
        
         | ukuina wrote:
         | Yes? I mean, it works for small projects.
        
         | grbsh wrote:
         | Yes! And I fought the urge to do this for so long, I think
         | because it _feels_ wasteful for some reason? But Claude handles
          | it like a champ*, and gets me significantly better and
          | easier results than if I manually pasted a file in and
          | described the rest of the context it needs by hand.
         | 
         | * Until the repository gets more complicated, which is why we
         | need the intelligent relevance filtering features of
         | repogather, e.g. `repogather "Only files related to
         | authentication and avatar uploads"`
        
       | ukuina wrote:
       | It's fascinating to see how different frameworks are dealing with
       | the problem of populating context correctly. Aider, for example,
       | asks users to manually add files to context. Claude Dev attempts
       | to grep files based on LLM intent. And Continue.dev uses vector
       | embeddings to find relevant chunks and files.
       | 
       | I wonder if an increase in usable (not advertised) context tokens
       | may obviate many of these approaches.
        
         | punkpeye wrote:
          | Continue.dev's approach sounds like it would provide the
          | most relevant code?
        
           | morgante wrote:
           | Embeddings are actually generally not that effective for
           | code.
        
             | grbsh wrote:
             | This is what I've found. It's so hard to do embeddings
             | correctly, while it's so easy to search over a large corpus
             | with a cheap LLM! Embeddings are also really inscrutable
             | when they fail, whereas I find myself easily iterating if
             | repogather fails to return the right group of files.
             | 
             | Of course, 'cheap' is relative -- on a large repository,
             | embeddings are 99%+ cheaper than even gpt-4o-mini.
        
         | avernet wrote:
         | And how does repogather do it? From the README, it looks to me
         | like it might provide the content of each file to the LLM to
         | gauge its relevance. But this would seem prohibitively
         | expensive on anything that isn't a very small codebase (the
         | project I'm working on has on the order of 400k SLOC), even
         | with gpt-4o-mini, wouldn't it?
        
           | grbsh wrote:
            | repogather does indeed, as a last step, stuff everything
            | not already excluded by cheap heuristics into gpt-4o-mini
            | to gauge relevance, so it will get expensive for large
            | projects. On my small 8-dev startup's repo, this operation
           | costs 2-4 cents. I was considering adding an `--intelligence`
           | option, where you could trade off different methods between
           | cost, speed, and accuracy. But, I've been very unsatisfied
           | with both embedding search methods, and agentic file search
           | methods. They seem to regularly fail in very unpredictable
           | ways. In contrast, this method works quite well for the
           | projects I tend to work on.
           | 
           | I think in the future as the cost of gpt-4o-mini level
           | intelligence decreases, it will become increasingly worth it,
           | even for larger repositories, to simply attend to every token
           | for certain coding subtasks. I'm assuming here that relevance
           | filtering is a much easier task than coding itself, otherwise
           | you could just copy/paste everything into the final coding
           | model's context. What I think would make much more sense for
           | this project is to optimize the cost / performance of a small
           | LLM fine-tuned for this source relevance task. I suspect I
           | could do much better than gpt-4o-mini, but it would be
           | difficult to deploy this for free.
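            A rough sketch of that last step's shape (my guess, not
            repogather's actual implementation -- the prompt wording,
            0-100 score scale, and threshold are all invented): pack
            the candidate files into one prompt for the cheap model,
            then keep the files whose returned score clears a cutoff:

```python
def build_prompt(query, files):
    """Pack every candidate file into one prompt asking a cheap model
    (e.g. gpt-4o-mini with structured outputs) for per-file scores."""
    parts = [f"Rate each file 0-100 for relevance to: {query}"]
    for path, content in files.items():
        parts.append(f"--- {path} ---\n{content}")
    return "\n\n".join(parts)

def filter_relevant(scores, threshold=50):
    """Keep files whose model-assigned relevance clears the
    threshold, most relevant first."""
    kept = [(score, path) for path, score in scores.items()
            if score >= threshold]
    return [path for score, path in sorted(kept, reverse=True)]

# Hand-written scores standing in for the model's structured output:
scores = {"auth/login.py": 95, "auth/tokens.py": 80,
          "billing/invoice.py": 10}
print(filter_relevant(scores))  # ['auth/login.py', 'auth/tokens.py']
```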
        
         | taneq wrote:
         | I haven't looked into this but do any of them use modern IDE
         | code inspection tools? I'd think you would dump as much "find
         | references" and "show definition" outputs for relevant
         | variables into context as possible.
        
         | faangguyindia wrote:
          | Aider also uses the AST (via tree-sitter): it creates a
          | repo map and sends it to the LLM.
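          A toy illustration of the repo-map idea, using Python's
          stdlib `ast` module rather than the tree-sitter parser Aider
          actually uses (and with an invented output format): instead
          of full file contents, the LLM sees only a skeleton of
          definitions:

```python
import ast

def repo_map(source, filename):
    """List top-level function/class signatures from one file -- the
    kind of compressed skeleton a repo map sends to the LLM instead
    of full source code."""
    lines = [filename + ":"]
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"  def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
    return "\n".join(lines)

src = "def login(user, password):\n    ...\n\nclass Session:\n    pass\n"
print(repo_map(src, "auth.py"))
# auth.py:
#   def login(user, password)
#   class Session
```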
        
           | grbsh wrote:
           | I like this approach a lot, especially because it's not
           | opaque like embeddings. Maybe I can add an option to use this
           | approach instead with repogather, if you are cost sensitive.
        
           | jadbox wrote:
           | How does Aider (cli) compare to Claude Dev (VSCode plugin)?
           | Anyone have a subjective analysis?
        
         | grbsh wrote:
         | I've been frustrated with embedding search approaches, because
         | when they fail, they fail opaquely -- I don't know how to
         | iterate on my query in order to get close to what I expected.
         | In contrast, since repogather merely wraps your query in a
         | simple prompt, it's easier to intuit what went wrong, if the
         | results weren't as you expected.
         | 
         | > I wonder if an increase in usable (not advertised) context
         | tokens may obviate many of these approaches.
         | 
         | I've been extremely interested in this question! Will be
         | interesting to see how things develop, but I suspect that
         | relevance filtering is not as difficult as coding, so small,
         | cheap LLMs will make the former a solved, inexpensive problem,
         | while we will continue to build larger and more expensive LLMs
         | to solve the latter.
         | 
         | That said, you can buy a lot of tokens for $150k, so this could
         | be short sighted.
        
         | doctoboggan wrote:
          | I am really happy with how Aider does it, as it feels like
          | a happy medium. The state of LLMs these days means you
          | really have to break the problem down into digestible
          | parts, and if you are doing that, it's not much more work
          | to specify the files that need to be edited in your
          | request. Aider can also prompt you to add a file if it
          | thinks that file needs to be edited.
        
       | faangguyindia wrote:
        | LLMs for coding are a bit meh after the novelty wears off.
        | 
        | I've had problems where the LLM doesn't know which library
        | version I am using. It keeps suggesting methods which do not
        | exist, etc...
        | 
        | As if LLMs are unaware of library versions.
        | 
        | The place where I've found LLMs most effective and
        | effortless is the CLI.
        | 
        | My brother made this but I use it every day:
        | https://github.com/zerocorebeta/Option-K
        
         | grbsh wrote:
          | I agree - it's exciting at first, but then you have
          | experiences where you go down a rabbit hole for an hour
          | trying to fix / make use of LLM-generated code.
         | 
          | So you really have to know when the LLM will be able to
          | cleanly and neatly solve your problem, and when it's going
          | to be frustrating and simpler just to write it yourself,
          | character by character.
         | That's why I'm exploring building tools like this, to try to
         | iron out annoyances and improve quality of life for new LLM
         | workflows.
         | 
         | Option-K looks promising! I'll try it out.
        
         | hereme888 wrote:
         | When I get frustrated with GPT-4o, I then switch to Sonnet 3.5,
         | usually with good results.
         | 
         | In my limited experience Sonnet 3.5 is more elegant at coding
         | and making use of different frameworks.
        
       | faangguyindia wrote:
        | I usually only edit one function at a time with an LLM on an
        | old code base.
        | 
        | On greenfield projects, I ask Claude Sonnet to write all the
        | function signatures, with return values etc.
        | 
        | Then I have a script which sends these signatures to Google
        | Flash, which writes all the functions for me.
        | 
        | All this happens in parallel.
        | 
        | I've found that if you limit the scope, Google Flash writes
        | the best code, and it's ultra fast and cheap.
        
         | grbsh wrote:
         | Interesting - isn't Google Flash worse at coding than Sonnet
         | 3.5? I subscribe to to Claude for $20/m, but even if the API
         | were free, I'd still want to use the Claude interface for
         | flexibility, artifacts and just understandability, which is why
         | I don't use available coding assistants like Plandex or Aider.
         | 
         | What if you need to iterate on the functions it gives? Do you
         | just start over a with a different prompt, or do you have the
         | ability to do a refinement with Google Flash on existing
         | functions?
        
       | fellowniusmonk wrote:
       | This looks very cool for complex queries!
       | 
        | If your codebase is structured in a very modular way, then
        | this one-liner mostly just works:
       | 
       | find . -type f -exec echo {} \; -exec cat {} \; | pbcopy
        
       | reidbarber wrote:
       | Nice! I built something similar, but in the browser with drag-
       | and-drop at https://files2prompt.com
       | 
       | It doesn't have all the fancy LLM integration though.
        
       ___________________________________________________________________
       (page generated 2024-09-12 23:00 UTC)