[HN Gopher] Yek: Serialize your code repo (or part of it) to fee...
       ___________________________________________________________________
        
       Yek: Serialize your code repo (or part of it) to feed into any LLM
        
       Author : mohsen1
       Score  : 167 points
       Date   : 2025-01-19 03:24 UTC (19 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | TheTaytay wrote:
       | This has some interesting ideas that I hadn't seen in the other
       | similar projects, especially around trying to sort files
       | according to importance.
       | 
       | (I've been using RepoPrompt for this sort of thing lately.)
        
       | msoad wrote:
       | This is really fast! Serialized 50k lines in 500ms on my Mac
        
       | linschn wrote:
        | That's neat! I've built a transient UI to do this manually[0]
        | within emacs, but with the context windows getting bigger and
        | bigger, being more systematic may be the way to go.
       | 
        | The prioritization mentioned in the readme is especially
        | interesting.
       | 
       | [0] https://rdklein.fr/bites/MyTransientUIForLocalLLMs.html
        
       | pagekicker wrote:
        | Error:
        | 
        |     yek: SHA256 mismatch
        |     Expected:
        |     34896ad65e8ae7c5e93d90e87f15656b67ed5b7596492863d1da80e548ba7301
        |     Actual:
        |     353f4f7467af25b5bceb66bb29d9591ffe8d620d17bf40f6e0e4ec16cd4bd7e7
        |     File: /Users/... Library/Caches/Homebrew/downloads/0308e13c
        |     088cb787ece0e33a518cd211773daab9b427649303d79e27bf723e0d--
        |     yek-x86_64-apple-darwin.tar.gz
        |     To retry an incomplete download, remove the file above.
        | 
        | Removed & tried again; this was the result. Is the SHA256
        | mismatch a security concern?
        
         | mohsen1 wrote:
          | Oh, totally forgot about the Homebrew installer. I'll fix it
          | ASAP. Sorry about that.
         | 
          | Edit: Working on a fix here:
          | https://github.com/bodo-run/yek/pull/14
         | 
          | You can use the bash installer on macOS for now. You can
          | read the installer file before executing it if you're not
          | sure it is safe.
        
       | hbornfree wrote:
       | Thanks for this! I have the exact use-case and have been using a
       | Python script to do this for a while.
        
       | awestroke wrote:
       | This looks promising. Hopefully much faster and less naive than
       | Repomix
        
       | mkagenius wrote:
       | I am doing something similar for my gitpodcast project:
        | 
        |     def get_important_files(self, file_tree):
        |         # file_tree = "api/backend/main.py  api.py"
        |         # Send the prompt to Azure OpenAI for processing
        |         response = openai.beta.chat.completions.parse(
        |             model=self.model_name,
        |             messages=[
        |                 # Initial system prompt
        |                 {"role": "system", "content": (
        |                     "Can you give the list of up to 10 most "
        |                     "important file paths in this file tree "
        |                     "to understand code architecture and high "
        |                     "level decisions and overall what the "
        |                     "repository is about to include in the "
        |                     "podcast I am creating, as a list, do not "
        |                     "write any unknown file paths not listed "
        |                     "below")},
        |                 {"role": "user", "content": file_tree},
        |             ],
        |             response_format=FileListFormat,
        |         )
        |         try:
        |             response = response.choices[0].message.parsed
        |             print(type(response), " resp ")
        |             return response.file_list
        |         except Exception as e:
        |             print("Error processing file tree:", e)
        |             return []
       | 
       | 1. https://gitpodcast.com - Convert any GitHub repo into a
       | podcast.
        
       | mg wrote:
       | I think this is where the future of coding is. It is still useful
       | to be a coder, the more experienced the better. But you will not
       | write or edit a lot of lines anymore. You will organize the
       | codebase in a way AI can handle it, make architectural decisions
       | and organize the workflow around AI doing the actual coding.
       | 
        | The way I currently do this is that I wrote a small Python
        | script that I can start with
        | 
        |     llmcode.py /path/to/repo
        | 
        | which then offers a simple web interface at localhost:8080
        | where I can select the files to serialize and describe a task.
       | 
        | It then creates a prompt like this:
        | 
        |     Look at the code files below and do the following:
        | 
        |     {task_description}
        | 
        |     Output all files that you need to change in full again,
        |     including your changes. In the same format as I provide
        |     the files below, that means each file starts with
        |     filename: and ends with :filename
        | 
        |     Under no circumstances output any other text, no
        |     additional infos, no code formatting chars. Only the code
        |     in the given format.
        | 
        |     Here are the files:
        | 
        |     somefile.py:
        |     ...code of somefile.py...
        |     :somefile.py
        | 
        |     someotherfile.py:
        |     ...code of someotherfile.py...
        |     :someotherfile.py
        | 
        |     assets/css/somestyles.css:
        |     ...code of somestyles.css...
        |     :assets/css/somestyles.css
        | 
        |     etc
       | 
       | Then llmcode.py sends it to an LLM, parses the output and writes
       | the files back to disk.
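        | 
        | Parsing is easy because the format is self-delimiting. A
        | minimal sketch of that step (hypothetical; my actual script
        | differs in details):
        | 
        |     import re
        |     from pathlib import Path
        | 
        |     def parse_llm_output(text):
        |         # match "name:\n...code...\n:name" blocks
        |         pattern = re.compile(r"^(\S+):\n(.*?)\n:\1$",
        |                              re.MULTILINE | re.DOTALL)
        |         return {m.group(1): m.group(2)
        |                 for m in pattern.finditer(text)}
        | 
        |     def write_files(files, repo):
        |         for name, code in files.items():
        |             path = Path(repo) / name
        |             path.parent.mkdir(parents=True, exist_ok=True)
        |             path.write_text(code + "\n")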
       | 
       | I then look at the changes via "git diff".
       | 
       | It's quite fascinating. I often only make minor changes before
       | accepting the "pull request" the llm made. Sometimes I have to
       | make no changes at all.
        
         | shatrov1 wrote:
         | Would you be kind to share your script? Thanks!
        
           | mike_hearn wrote:
           | There is an open source project called Aider that does that.
        
         | KronisLV wrote:
         | > You will organize the codebase in a way AI can handle it,
         | make architectural decisions and organize the workflow around
         | AI doing the actual coding.
         | 
         | This might sound silly, but I feel like it has the potential of
         | resulting in more readable code.
         | 
         | There have been times where I split up a 300 line function just
         | so it's easier to feed into an LLM. Same for extracting things
         | into smaller files and classes that individually do more
         | limited things, so they're easier to change.
         | 
         | There have been times where I pay attention to the grouping of
         | code blocks more or even leave a few comments along the way
         | explaining the intent so LLM autocomplete would work better.
         | 
         | I also pay more attention to naming (which does sometimes end
         | up more Java-like but is clear, even if verbose) and try to
         | make the code simple enough to generate tests with less manual
         | input.
         | 
          | Somehow, when you understand the code yourself and so can
          | your colleagues (for the most part), a lot of people won't
          | care that much. But when the AI tools stumble and actually
          | start slowing you down instead of speeding you up, and the
          | readability of your code results in a (subjectively) more
          | positive experience, then suddenly it's a no-brainer.
        
           | DJBunnies wrote:
           | You could have done all that for your peers instead.
        
             | KronisLV wrote:
             | I already do when it makes sense... except if you look at
             | messy code and nobody else seems to care, there might be
             | better things to spend your time on (some of which might
             | involve finding an environment where people care about all
             | of that by default).
             | 
             | But now, to actually improve my own productivity a lot?
             | I'll dig in more often, even in messy legacy code. Of
             | course, if some convoluted LoginView breaks due to
             | refactoring gone wrong, that is still my responsibility.
        
         | croes wrote:
         | Then you aren't a coder, you are an organizer or manager
        
           | msoad wrote:
            | I'm sure a few decades ago people said the same about
            | those not fiddling with actual binary to make things work.
        
           | maccard wrote:
            | I disagree. If I hook a library into a framework (e.g.
            | laminar into rails), that doesn't make me an organizer.
        
             | croes wrote:
             | But this does
             | 
             | > You will organize the codebase in a way AI can handle it,
             | make architectural decisions and organize the workflow
             | around AI doing the actual coding.
        
               | maccard wrote:
               | "Anyone" can implement a class that does something - the
               | mark of a good engineer is someone who understands the
               | context it's going to be used in and modified , be that a
               | one shot method, a core library function or a user facing
               | api call that needs to be resilient against malicious
               | inputs.
        
           | fifilura wrote:
            | Many people identify themselves with being "a coder".
            | Surely there are jobs for "coders" and will be in the
            | future too. But not everyone writing programs today would
            | qualify to do the work defined as what "a coder" does.
            | 
            | I like to be a "builder of systems", a "solver of
            | problems". "Organizer or manager" would also fit in that
            | description. And then what tool you use to get stuff done
            | is not relevant.
        
           | Art9681 wrote:
           | No one cares what badge of pride you choose to wear. Only the
           | output. If an LLM produces working solutions, that's what
           | your audience will appreciate. Not a self imposed title you
           | chose for the project.
        
         | flessner wrote:
         | Even just "organizing" the code requires great amounts of
         | knowledge and intuition from prior experiences.
         | 
          | I am personally torn about the future of LLMs in this
          | regard. Right now, even with Copilot, the benefit they give
          | fundamentally depends on the coder that directs them - as
          | you have noted.
          | 
          | What if that's no longer true in a couple of years? How
          | would that even be different from e.g. no-code tools or
          | website builders today? In other words, will handwritten
          | code stay valuable?
         | 
         | I personally enjoy coding so I can always keep doing it for
         | entertainment, even if I am vastly surpassed by the machine
         | eventually.
        
           | maccard wrote:
           | > Even just "organizing" the code requires great amounts of
           | knowledge and intuition from prior experiences.
           | 
           | > I personally enjoy coding so I can always keep doing it for
           | entertainment, even if I am vastly surpassed by the machine
           | eventually.
           | 
           | I agree with both these takes, and I think they're far more
           | important than wondering if hand written code is valuable.
           | 
           | I do some DIY around the house. I can make a moderately
           | straight cut (within tolerances for joinery use cases). A jig
           | or circular saw makes that skill moot, but knowing I need a
           | straight clean cut is a transferable skill. There's also two
           | separate skills - being able to break down and understand the
           | problem and being able to implement the details of the
           | problem. In trade skills we don't expect any one person to
           | design, analyze, estimate, build, install and decorate
           | anything larger than a small piece of furniture and I think
           | the same can be said of programming.
           | 
            | It's similar to using libraries/frameworks - there will
            | always be people who will write shitty apps with shitty,
            | unmaintainable code - we've been complaining about that
            | since I've been programming. Those people are going to
            | move on from not understanding their Wix websites to not
            | understanding their AI-generated code. But it's another
            | tool in the belt of a professional programmer.
        
         | maccard wrote:
         | I disagree that you won't edit lines, but I think you're right.
         | 
          | At work this week I was investigating how we could autoscale
          | our CI. I know enough Jenkins, AWS, Perforce, PowerShell,
          | Packer, Terraform and C++ to be able to do this, but having
          | the time to implement and flesh everything out is a
          | struggle. I asked Claude to create an AMI with our workspace
          | preloaded on it, and a user data script that set up a
          | Perforce workspace without syncing it, all on Windows, with
          | the tools I mentioned. I had to make some small edits to the
          | code to get it to match what I wanted, but what would
          | otherwise have been 2-3 days of screwing around with a
          | pretty clear concept in my head turned into a prototype
          | running in 30 minutes. Turns out it's quicker to sync from
          | Perforce than it is to boot the custom AMI, but I learned
          | that within an hour in total rather than after building out
          | more, and we got to look at alternatives. That's the future
          | to me.
        
           | senorrib wrote:
           | I don't understand what you disagreed with. The OP said
           | 
           | > edit a lot of lines
           | 
           | "a lot" being the keywords here.
        
         | csmpltn wrote:
         | > "But you will not write or edit a lot of lines anymore"
         | 
         | > "I wrote a small python file that I can start with"
         | 
         | Which one is it, chief?
        
           | jaapbadlands wrote:
            | > not a lot of lines
            | 
            | > small python file
           | 
           | They mean the same thing, chief.
        
         | dark_star wrote:
         | Would you be willing to post the llm.py code? (Perhaps in a
         | Github gist?)
        
       | wiradikusuma wrote:
       | Sorry if it's not very obvious, where does Yek fit with existing
       | coding assistants such as Copilot or Continue.dev?
       | 
       | Is it purpose-built for code, or any text (e.g., Obsidian vault)
       | would work?
        
         | mohsen1 wrote:
          | This can be a piece of your own AI automation. Every task
          | has a different need, so being able to program your own AI
          | automation is great for programmers. Any text-based document
          | works with this tool. It's rather simple: just stitching
          | files together with a dash of priority sorting.
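          | 
          | In spirit it is something like this (a toy Python sketch,
          | not the actual Rust implementation; the real output format
          | differs):
          | 
          |     from pathlib import Path
          | 
          |     def serialize(repo, priority):
          |         # sort low-priority first so important files end
          |         # up last, closest to the user's question
          |         paths = sorted(
          |             (p for p in Path(repo).rglob("*")
          |              if p.is_file()),
          |             key=priority)
          |         parts = []
          |         for p in paths:
          |             try:
          |                 parts.append(f">>>> {p}\n{p.read_text()}")
          |             except UnicodeDecodeError:
          |                 pass  # skip binary files
          |         return "\n".join(parts)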
        
       | CGamesPlay wrote:
       | What is the use-case here? What is a "chunk"? It looks like it's
       | just an arbitrary group of files, where "more important" files
       | get put at the end. Why is that useful for LLMs? Also, I see it
       | can chunk based on token count but... what's a token? ChatGPT?
       | Llama?
       | 
       | Note, I understand why code context is important for LLMs. I
       | don't understand what this chunking is or how it helps me get
       | better code context.
        
         | mohsen1 wrote:
          | Token counting is done by a crate that I'm using. I agree
          | that not all LLMs use the same tokenizer, but they are
          | mostly similar.
          | 
          | Chunking is useful because in chat mode you can feed more
          | than the max context size if you feed in multiple USER
          | messages.
          | 
          | LLMs pay more attention to the last part of the
          | conversation/message. This is why sorting is very important:
          | your last sentence in a very long prompt is much more
          | important than the first.
         | 
          | Use case: I use this to run an "AI loop" with DeepSeek to
          | fix bugs or implement features. The loop steers the LLM by
          | not letting it stray down rabbit holes. Every prompt
          | reiterates what the objective is. By loop I mean: serialize
          | the repo, run tests, feed the test failure and repo to the
          | LLM, get a diff, apply the diff, and repeat until the
          | objective is achieved.
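          | 
          | In pseudocode, the loop is roughly this (ask_llm and
          | apply_diff are hypothetical stand-ins for the real pieces;
          | serialize is the sketch above):
          | 
          |     import subprocess
          | 
          |     def ai_loop(repo, objective, max_iters=10):
          |         for _ in range(max_iters):
          |             tests = subprocess.run(
          |                 ["cargo", "test"],  # your test command
          |                 cwd=repo, capture_output=True, text=True)
          |             if tests.returncode == 0:
          |                 return True  # objective achieved
          |             prompt = (
          |                 f"Objective: {objective}\n\n"  # reiterated
          |                 f"Failures:\n{tests.stderr}\n\n"
          |                 f"Repo:\n{serialize(repo)}\n\n"
          |                 "Reply with a unified diff only.")
          |             # both hypothetical: call model, git-apply diff
          |             apply_diff(repo, ask_llm(prompt))
          |         return False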
        
           | CGamesPlay wrote:
           | Got it, thanks.
           | 
           | > in chat mode you can feed more than context max size if you
           | feed in multiple USER messages
           | 
           | Just so you know, this is false. You might be using a system
           | that automatically deletes or summarizes older messages,
           | which would make you feel like that, and would also indicate
           | why you feel that the sorting is so important (It is
           | important! But possibly not _critically_ important).
           | 
           | For future work, you might be interested in seeing how tools
           | like Aider do their "repo serializing" (they call it a
           | repomap), which tries to be more intelligent by only
           | including "important lines" (like function definitions but
           | not bodies).
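            | 
            | For Python files the gist is a few lines with the ast
            | module (a toy sketch; Aider's actual repomap is
            | language-aware and ranks symbols):
            | 
            |     import ast
            | 
            |     def signatures(source):
            |         # keep def/class headers, drop the bodies
            |         lines = source.splitlines()
            |         out = []
            |         for node in ast.walk(ast.parse(source)):
            |             if isinstance(node, (ast.FunctionDef,
            |                                  ast.AsyncFunctionDef,
            |                                  ast.ClassDef)):
            |                 end = max(node.body[0].lineno - 1,
            |                           node.lineno)
            |                 out += lines[node.lineno - 1:end]
            |         return "\n".join(out)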
        
       | ycombiredd wrote:
        | I guess I shouldn't be surprised that many of us have
        | approached this in different ways. It's neat to see already
        | multiple replies of the sort I'm going to make too, which is
        | to share the approach I've been taking: concatenating or
        | "summarizing" the code, with particular attention to
        | dependency resolution.
       | [chimeracat](https://github.com/scottvr/chimeracat)
       | 
        | It took the shape it has because it started as a tool to
        | concatenate a library I had been working on into a single
        | ipynb file, so that I didn't need to install the library on
        | the remote Colab. Thus the dependency graph was born (as was
        | the ascii graph plotter 'phart' that it uses). Then, as I
        | realized this could be useful for sharing code with an LLM, I
        | started adding the summarization capabilities and, in some
        | sort of meta-recursive irony, worked with Claude to do so. :-)
       | 
        | I've put a collection of ancillary tools I use to aid in the
        | LLM-pairing process up at
        | https://github.com/scottvr/LLMental
        
         | sitkack wrote:
         | This is hilarious https://github.com/scottvr/retree
        
           | ycombiredd wrote:
            | While I _hope_ you mean it is hilarious in the same
            | spirit in which I write most of my stuff ("ludicrous" is
            | a common word even in my documentation), I did want to
            | ask, if you meant it in a more disparaging way, that you
            | flesh out the criticism.
           | 
           | Of course, if you meant "hilarious" similarly to how I mean
           | "ludicrous", thanks! And thank you for taking the time to
           | look at it. :-)
        
             | sitkack wrote:
             | Not disparaging in anyway, in fact, it made made me think
             | of bijective lenses and what other tools could have an
             | inverse, even if lossy.
             | 
             | Your project is amazing, you are ahead of me in your
             | thinking.
        
               | ycombiredd wrote:
               | "lessthan three" ;-)
               | 
               | Thanks again.
        
       | mohsen1 wrote:
       | Added some benchmarking to show how fast it is:
       | 
        | Here is a benchmark comparing it to Repomix [1], serializing
        | the Next.js project:
        | 
        |     time yek
        |     Executed in    5.19 secs    fish         external
        |        usr time    2.85 secs   54.00 micros   2.85 secs
        |        sys time    6.31 secs  629.00 micros   6.31 secs
        | 
        |     time repomix
        |     Executed in   22.24 mins    fish         external
        |        usr time   21.99 mins    0.18 millis  21.99 mins
        |        sys time    0.23 mins    1.72 millis   0.23 mins
       | 
       | yek is 230x faster than repomix
       | 
       | [1] https://github.com/jxnl/repomix
        
         | amelius wrote:
          | Maybe I don't understand the use case, but I'm curious why
          | speed matters given that LLMs are so slow (?)
        
           | BigRedEye wrote:
           | 22 minutes for a medium-sized repo is probably slow enough to
           | optimize.
        
             | elashri wrote:
              | However, for such large repositories I'm not sure you
              | would fit in the effective context window anyway. I
              | know there is an option to limit the token count, but
              | then that would be your realistic limit.
        
       | yani wrote:
       | You can do this all in the browser: https://dropnread.io/
        
       | Alifatisk wrote:
        | There is also https://repo2txt.simplebasedomain.com/local.html
        | which doesn't require downloading anything.
        
       | endofreach wrote:
        | I have a very simple bash function for this (filecontents),
        | including ignoring files based on .gitignore, binary files,
        | etc. Piped to clipboard and done.
       | 
       | All these other ways seem unnecessarily complicated...
        
         | imiric wrote:
         | I also feel like this can be done in a few lines of shell
         | script.
         | 
         | Can you share your function, please?
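          | 
          | I imagine it's something like this (a Python guess at the
          | equivalent, using git to honor .gitignore and a null-byte
          | check to skip binaries; not the actual bash function):
          | 
          |     import subprocess
          |     from pathlib import Path
          | 
          |     def filecontents(repo="."):
          |         # git ls-files already respects .gitignore
          |         names = subprocess.run(
          |             ["git", "ls-files"], cwd=repo, check=True,
          |             capture_output=True,
          |             text=True).stdout.splitlines()
          |         parts = []
          |         for name in names:
          |             data = (Path(repo) / name).read_bytes()
          |             if b"\0" in data:
          |                 continue  # crude binary check
          |             parts.append(f"=== {name} ===\n"
          |                          + data.decode(errors="replace"))
          |         return "\n\n".join(parts)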
        
       | verghese wrote:
       | How does this compare to a tool like RepoPrompt?
       | 
       | https://repoprompt.com
        
       | lordofgibbons wrote:
       | Does anyone know of a more semantically meaningful way of
       | chunking code in a generalizable way? Token count seems like it'd
       | leave out meaningful context, or include unrelated context.
        
         | energy123 wrote:
         | I have an LLM summarize each file, and feed in a list of
         | summaries categorized by theme, along with the function
         | signatures and a tree of the directory. I only paste the full
         | code of the files that are very important for the task.
        
       | sitkack wrote:
        | I have to add https://github.com/simonw/files-to-prompt as a
        | marker guide.
       | 
       | I think "the part of it" is key here. For packaging a codebase,
       | I'll select a collection of files using rg/fzf and then
       | concatenate them into a markdown document, # headers for paths
       | ```filetype <data>``` for the contents.
       | 
       | The selection of the files is key to let the LLM focus on what is
       | important for the immediate task. I'll also give it the full file
       | list and have the LLM request files as needed.
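        | 
        | The packaging step is only a few lines (a sketch of the
        | format I use, not a published tool):
        | 
        |     from pathlib import Path
        | 
        |     LANG = {".py": "python", ".rs": "rust",
        |             ".ts": "typescript"}
        | 
        |     def pack(paths):
        |         # one "# path" header plus fenced block per file
        |         chunks = []
        |         for name in paths:
        |             lang = LANG.get(Path(name).suffix, "")
        |             chunks.append(f"# {name}\n```{lang}\n"
        |                           f"{Path(name).read_text()}\n```")
        |         return "\n\n".join(chunks)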
        
       | zurfer wrote:
        | Has anyone built a linter that optimizes code for an LLM?
       | 
        | The idea would be to make it more token-efficient (and lower
        | accidental perplexity), e.g. by renaming variables, fixing
        | typos and shortening comments.
       | 
       | It should probably run after a normal linter like black.
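        | 
        | A first pass is easy to prototype; e.g. dropping docstrings
        | alone saves a fair number of tokens (a toy sketch, Python
        | 3.9+ for ast.unparse; safe variable renaming would need real
        | scope analysis):
        | 
        |     import ast
        | 
        |     def strip_docstrings(source):
        |         # unparse also normalizes formatting as a side effect
        |         tree = ast.parse(source)
        |         for node in ast.walk(tree):
        |             if isinstance(node, (ast.Module, ast.ClassDef,
        |                                  ast.FunctionDef,
        |                                  ast.AsyncFunctionDef)):
        |                 b = node.body
        |                 if (b and isinstance(b[0], ast.Expr) and
        |                         isinstance(b[0].value, ast.Constant)
        |                         and isinstance(b[0].value.value,
        |                                        str)):
        |                     node.body = b[1:] or [ast.Pass()]
        |         return ast.unparse(tree)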
        
         | grajaganDev wrote:
         | Good idea - couldn't the linter also be an LLM?
        
         | layer8 wrote:
         | > accidental perplexity
         | 
         | I have to remember this.
        
       | foxhop wrote:
        | `tree --gitignore && cat *.py && cat templates/*`
        
         | msoad wrote:
         | why Dropbox when you can rsync, huh? ;)
        
       | claireGSB wrote:
       | Adding my take to the mix, which has been working well for me:
       | https://github.com/ClaireGSB/project-context.
       | 
        | It outputs a file tree of your repo, a list of the
        | dependencies, and a select list of files you want to include
        | in your prompt for the LLM, in a single xml file. The first
        | time you run it, it generates a .project-context.toml config
        | file in your repo with all your files commented out, and you
        | can just uncomment the ones you want written in full in the
        | context file. I've found this helps when iterating on a
        | specific part of the codebase, while keeping the full file
        | tree gives the LLM the broader context; I always ask the LLM
        | to request more files if needed, as it can see the full list.
       | 
       | The files are not sorted by priority in the output though,
       | curious what the impact would be / how much room for manual
       | config to leave (might want to order differently depending on the
       | objective of the prompt).
        
       | jamesponddotco wrote:
        | Like many people, I built something similar, llmctx[1]. The
        | chunking feature seems really interesting; I have to look
        | into that for llmctx.
        | 
        | One thing I have in llmctx that I think is missing from yek
        | is a "Claude mode", which outputs the concatenation in a
        | format more suitable for providing context to the Anthropic
        | LLM.
       | 
       | [1]: https://sr.ht/~jamesponddotco/llmctx/
        
       | kauegimenes wrote:
        | I have a feeling there are some native Unix commands that
        | should cover this. So looking at the current scope of this
        | tool, I think it would need more features.
        | 
        | Anyone have a bash script that covers this use case?
        
         | HarHarVeryFunny wrote:
         | Sure, find, cat and split would do the job.
        
       | paradite wrote:
        | There are a lot of them. I collected a list of CLI tools that
        | do this:
       | 
       | https://prompt.16x.engineer/cli-tools
       | 
       | I also built a GUI tool that does this:
       | 
       | https://prompt.16x.engineer/
        
       | gatienboquet wrote:
       | https://gitingest.com/
        
       | frays wrote:
       | Interesting thread with other links.
        
       | johnisgood wrote:
        | https://github.com/simonw/files-to-prompt/ seems to work
        | fine, especially for Claude, as it follows the
        | recommendations and format.
        
       | neuraldenis wrote:
        | I made a similar web tool some time ago; client-side, no
        | registration required: https://shir-man.com/txt-merge/
        
       | dwrtz wrote:
       | i wrote something similar https://github.com/dwrtz/sink
       | 
       | it has some nice features:
       | 
       | - follows .gitignore patterns
       | 
       | - optional file watcher to regenerate the snapshot when files are
       | changed
       | 
       | - supports glob patterns for filter/exclude
       | 
       | - can report token usage
       | 
       | - can use a config yaml file if you have complicated
       | filter/exclude patterns
       | 
        | I find myself using it all the time now. It's great with
        | Claude Projects.
        
       | __chefski__ wrote:
       | The token estimator of this package is quite inaccurate, since it
       | appears to just take the number of characters, minus whitespace.
       | This can lead to it being overly conservative, which would
       | degrade an LLM's performance. That said, it can be improved in
       | subsequent versions to properly integrate with a model's
       | tokenizer so it can know the true token count.
       | 
       | https://github.com/bodo-run/yek/blob/17fe37fbd461a8194ff612e...
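        | 
        | The gap is easy to measure against a real tokenizer, e.g.
        | with tiktoken (other models' tokenizers differ, but are in
        | the same ballpark):
        | 
        |     import tiktoken
        | 
        |     def naive_estimate(text):
        |         # characters minus whitespace, as yek appears to do
        |         return sum(not c.isspace() for c in text)
        | 
        |     enc = tiktoken.get_encoding("cl100k_base")
        |     text = open("some_file.py").read()  # any sample file
        |     print(naive_estimate(text), len(enc.encode(text)))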
        
       ___________________________________________________________________
       (page generated 2025-01-19 23:01 UTC)