[HN Gopher] Yek: Serialize your code repo (or part of it) to feed into any LLM
___________________________________________________________________
Yek: Serialize your code repo (or part of it) to feed into any LLM
Author : mohsen1
Score : 167 points
Date : 2025-01-19 03:24 UTC (19 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| TheTaytay wrote:
| This has some interesting ideas that I hadn't seen in the other
| similar projects, especially around trying to sort files
| according to importance.
|
| (I've been using RepoPrompt for this sort of thing lately.)
| msoad wrote:
| This is really fast! Serialized 50k lines in 500ms on my Mac
| linschn wrote:
| That's neat! I've built a transient UI to do this manually[0]
| within emacs, but with the context windows getting bigger and
| bigger, being more systematic may be the way to go.
|
| The prioritization mentioned in the readme is especially
| interesting.
|
| [0] https://rdklein.fr/bites/MyTransientUIForLocalLLMs.html
| pagekicker wrote:
| Error: yek: SHA256 mismatch
| Expected: 34896ad65e8ae7c5e93d90e87f15656b67ed5b7596492863d1da80e548ba7301
| Actual: 353f4f7467af25b5bceb66bb29d9591ffe8d620d17bf40f6e0e4ec16cd4bd7e7
| File: /Users/... Library/Caches/Homebrew/downloads/0308e13c088cb787ece0e33a518cd211773daab9b427649303d79e27bf723e0d--yek-x86_64-apple-darwin.tar.gz
| To retry an incomplete download, remove the file above.
|
| Removed it & tried again; this was the result. Is the SHA256
| mismatch a security concern?
| mohsen1 wrote:
| Oh totally forgot about homebrew installer. I'll fix it ASAP.
| Sorry about that.
|
| Edit: Working on a fix here:
| https://github.com/bodo-run/yek/pull/14
|
| You can use the bash installer on macOS for now. You can read
| the installer file before executing it if you're not sure it
| is safe.
| hbornfree wrote:
| Thanks for this! I have the exact use-case and have been using a
| Python script to do this for a while.
| awestroke wrote:
| This looks promising. Hopefully much faster and less naive than
| Repomix
| mkagenius wrote:
| I am doing something similar for my gitpodcast project:
|     def get_important_files(self, file_tree):
|         # file_tree = "api/backend/main.py api.py"
|         # Send the prompt to Azure OpenAI for processing
|         response = openai.beta.chat.completions.parse(
|             model=self.model_name,
|             messages=[
|                 # Initial system prompt
|                 {"role": "system", "content": (
|                     "Can you give the list of up to 10 most important "
|                     "file paths in this file tree to understand code "
|                     "architecture and high level decisions and overall "
|                     "what the repository is about to include in the "
|                     "podcast I am creating, as a list; do not write "
|                     "any unknown file paths not listed below")},
|                 {"role": "user", "content": file_tree},
|             ],
|             response_format=FileListFormat,
|         )
|         try:
|             response = response.choices[0].message.parsed
|             print(type(response), " resp ")
|             return response.file_list
|         except Exception as e:
|             print("Error processing file tree:", e)
|             return []
|
| 1. https://gitpodcast.com - Convert any GitHub repo into a
| podcast.
| mg wrote:
| I think this is where the future of coding is. It is still useful
| to be a coder, the more experienced the better. But you will not
| write or edit a lot of lines anymore. You will organize the
| codebase in a way AI can handle it, make architectural decisions
| and organize the workflow around AI doing the actual coding.
|
| The way I currently do this is that I wrote a small python
| file that I can start with:
|
|     llmcode.py /path/to/repo
|
| Which then offers a simple web interface at localhost:8080 where
| I can select the files to serialize and describe a task.
|
| It then creates a prompt like this:
|
|     Look at the code files below and do the following:
|
|     {task_description}
|
|     Output all files that you need to change in full again,
|     including your changes. In the same format as I provide
|     the files below, that means each file starts with
|     filename: and ends with :filename
|     Under no circumstances output any other text, no
|     additional infos, no code formatting chars. Only the
|     code in the given format.
|
|     Here are the files:
|
|     somefile.py:
|     ...code of somefile.py...
|     :somefile.py
|     someotherfile.py:
|     ...code of someotherfile.py...
|     :someotherfile.py
|     assets/css/somestyles.css:
|     ...code of somestyles.css...
|     :assets/css/somestyles.css
|     etc
|
| Then llmcode.py sends it to an LLM, parses the output and writes
| the files back to disk.
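|
| The parse-and-write step is essentially a handful of lines (a
| minimal sketch, assuming well-formed LLM output):
|
|     import pathlib
|     import re
|
|     # Matches blocks of the form "path:\n<code>\n:path"
|     BLOCK = re.compile(r"^(?P<name>\S+):\n(?P<body>.*?)\n:(?P=name)$",
|                        re.MULTILINE | re.DOTALL)
|
|     def apply_llm_output(text: str, root: str = ".") -> None:
|         for m in BLOCK.finditer(text):
|             path = pathlib.Path(root) / m.group("name")
|             path.parent.mkdir(parents=True, exist_ok=True)
|             path.write_text(m.group("body"))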
|
| I then look at the changes via "git diff".
|
| It's quite fascinating. I often only make minor changes before
| accepting the "pull request" the llm made. Sometimes I have to
| make no changes at all.
| shatrov1 wrote:
| Would you be so kind as to share your script? Thanks!
| mike_hearn wrote:
| There is an open source project called Aider that does that.
| KronisLV wrote:
| > You will organize the codebase in a way AI can handle it,
| make architectural decisions and organize the workflow around
| AI doing the actual coding.
|
| This might sound silly, but I feel like it has the potential of
| resulting in more readable code.
|
| There have been times where I split up a 300 line function just
| so it's easier to feed into an LLM. Same for extracting things
| into smaller files and classes that individually do more
| limited things, so they're easier to change.
|
| There have been times where I pay attention to the grouping of
| code blocks more or even leave a few comments along the way
| explaining the intent so LLM autocomplete would work better.
|
| I also pay more attention to naming (which does sometimes end
| up more Java-like but is clear, even if verbose) and try to
| make the code simple enough to generate tests with less manual
| input.
|
| Somehow, when you understand the code yourself, and so can your
| colleagues (for the most part), a lot of people won't care that
| much. But when the AI tools stumble and actually start slowing
| you down instead of speeding you up, and the readability of your
| code results in a (subjectively) more positive experience, then
| suddenly it's a no-brainer.
| DJBunnies wrote:
| You could have done all that for your peers instead.
| KronisLV wrote:
| I already do when it makes sense... except if you look at
| messy code and nobody else seems to care, there might be
| better things to spend your time on (some of which might
| involve finding an environment where people care about all
| of that by default).
|
| But now, to actually improve my own productivity a lot?
| I'll dig in more often, even in messy legacy code. Of
| course, if some convoluted LoginView breaks due to
| refactoring gone wrong, that is still my responsibility.
| croes wrote:
| Then you aren't a coder, you are an organizer or manager
| msoad wrote:
| I'm sure a few decades ago people would have said the same
| about not fiddling with actual binary to make things work.
| maccard wrote:
| I disagree. If I hook a library into a framework (e.g.
| laminar into rails), that doesn't make me an organizer.
| croes wrote:
| But this does
|
| > You will organize the codebase in a way AI can handle it,
| make architectural decisions and organize the workflow
| around AI doing the actual coding.
| maccard wrote:
| "Anyone" can implement a class that does something - the
| mark of a good engineer is someone who understands the
| context it's going to be used in and modified , be that a
| one shot method, a core library function or a user facing
| api call that needs to be resilient against malicious
| inputs.
| fifilura wrote:
| Many people identify themselves with being "a coder". Surely
| there are jobs for "coders" and will be in the future too.
| But not everyone writing programs today would qualify as
| doing the work defined as what "a coder" does.
|
| I like to be a "builder of systems", a "solver of problems".
| "Organizer or manager" would also fit that description.
| And then what tool you use to get stuff done is not relevant.
| Art9681 wrote:
| No one cares what badge of pride you choose to wear. Only the
| output. If an LLM produces working solutions, that's what
| your audience will appreciate. Not a self-imposed title you
| chose for the project.
| flessner wrote:
| Even just "organizing" the code requires great amounts of
| knowledge and intuition from prior experiences.
|
| I am personally torn between the future of LLMs in this regard.
| Right now, even with Copilot, the benefit they give
| fundamentally depends on the coder that directs them - as you
| have noted.
|
| What if that's no longer true in a couple of years? How would
| that even be different from e.g. no-code tools or website
| builders today? In other words, will handwritten code stay
| valuable?
|
| I personally enjoy coding so I can always keep doing it for
| entertainment, even if I am vastly surpassed by the machine
| eventually.
| maccard wrote:
| > Even just "organizing" the code requires great amounts of
| knowledge and intuition from prior experiences.
|
| > I personally enjoy coding so I can always keep doing it for
| entertainment, even if I am vastly surpassed by the machine
| eventually.
|
| I agree with both these takes, and I think they're far more
| important than wondering if hand written code is valuable.
|
| I do some DIY around the house. I can make a moderately
| straight cut (within tolerances for joinery use cases). A jig
| or circular saw makes that skill moot, but knowing I need a
| straight clean cut is a transferable skill. There are also two
| separate skills: being able to break down and understand the
| problem and being able to implement the details of the
| problem. In trade skills we don't expect any one person to
| design, analyze, estimate, build, install and decorate
| anything larger than a small piece of furniture and I think
| the same can be said of programming.
|
| It's similar to using libraries/frameworks - there will
| always be people who will write shitty apps with shitty
| unmaintainable code - we've been complaining about that since
| I've been programming. Those people are going to move on from
| not understanding their wix websites to not understanding
| their AI generated code. But it's another tool in the belt of
| a professional programmer
| maccard wrote:
| I disagree that you won't edit lines, but I think you're right.
|
| At work this week I was investigating how we could auto scale
| our CI. I know enough Jenkins, AWS, Perforce, PowerShell,
| Packer, Terraform, C++ to be able to do this, but having the
| time to implement and flesh everything out is a struggle. I
| asked Claude to create an AMI with our workspace preloaded on
| it, and a user data script that set up a perforce workspace
| without syncing it, all on windows, with the tools I mentioned.
| I had to make some small edits to the code to get it to match
| what I wanted, but what would have taken 2-3 days of screwing
| around, even with a pretty clear concept in my head, had a
| prototype running in 30 minutes. Turns out it's quicker to sync
| from perforce than it is to boot the custom AMI, but I learned
| that with an hour in total rather than building out more, and
| we got to look at alternatives. That's the future to me.
| senorrib wrote:
| I don't understand what you disagreed with. The OP said
|
| > edit a lot of lines
|
| "a lot" being the keywords here.
| csmpltn wrote:
| > "But you will not write or edit a lot of lines anymore"
|
| > "I wrote a small python file that I can start with"
|
| Which one is it, chief?
| jaapbadlands wrote:
| > not a lot of lines
|
| > small python file
|
| They mean the same thing, chief.
| dark_star wrote:
| Would you be willing to post the llm.py code? (Perhaps in a
| Github gist?)
| wiradikusuma wrote:
| Sorry if it's not very obvious, where does Yek fit with existing
| coding assistants such as Copilot or Continue.dev?
|
| Is it purpose-built for code, or any text (e.g., Obsidian vault)
| would work?
| mohsen1 wrote:
| This can be a piece of your own AI automation. Every task has a
| different need, so being able to program your own AI automation
| is great for programmers. Any text-based document works with
| this tool. It's rather simple: just stitching files together
| with a dash of priority sorting.
| CGamesPlay wrote:
| What is the use-case here? What is a "chunk"? It looks like it's
| just an arbitrary group of files, where "more important" files
| get put at the end. Why is that useful for LLMs? Also, I see it
| can chunk based on token count but... what's a token? ChatGPT?
| Llama?
|
| Note, I understand why code context is important for LLMs. I
| don't understand what this chunking is or how it helps me get
| better code context.
| mohsen1 wrote:
| Token counting is done by a crate that I'm using. I agree that
| not all LLMs use the same tokenizer, but they are mostly
| similar.
|
| Chunking is useful because in chat mode you can feed more than
| the context max size if you feed in multiple USER messages.
|
| LLMs pay more attention to the last part of the
| conversation/message. This is why sorting is very important:
| your last sentence in a very long prompt is much more important
| than the first.
|
| Use case: I use this to run an "AI Loop" with Deepseek to fix
| bugs or implement features. The loop steers the LLM by not
| letting it stray down various rabbit holes. Every prompt
| reiterates what the objective is. By loop I mean: serialize the
| repo, run tests, feed the test failure and repo to the LLM, get
| a diff, apply the diff, and repeat until the objective is
| achieved.
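|
| A minimal sketch of such a loop (the test command is project-
| specific, and ask_llm/apply_diff are hypothetical stand-ins
| for the LLM call and diff application):
|
|     import subprocess
|
|     def ai_loop(objective: str, max_iters: int = 10) -> bool:
|         for _ in range(max_iters):
|             tests = subprocess.run(["cargo", "test"],  # project-specific
|                                    capture_output=True, text=True)
|             if tests.returncode == 0:
|                 return True  # objective achieved
|             repo = subprocess.run(["yek"], capture_output=True,
|                                   text=True).stdout  # serialized repo
|             prompt = (f"Objective: {objective}\n\n"
|                       f"Failing tests:\n{tests.stdout}\n\n{repo}")
|             apply_diff(ask_llm(prompt))  # hypothetical helpers
|         return False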
| CGamesPlay wrote:
| Got it, thanks.
|
| > in chat mode you can feed more than context max size if you
| feed in multiple USER messages
|
| Just so you know, this is false. You might be using a system
| that automatically deletes or summarizes older messages,
| which would make you feel like that, and would also indicate
| why you feel that the sorting is so important (It is
| important! But possibly not _critically_ important).
|
| For future work, you might be interested in seeing how tools
| like Aider do their "repo serializing" (they call it a
| repomap), which tries to be more intelligent by only
| including "important lines" (like function definitions but
| not bodies).
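|
| For Python source, the "definitions but not bodies" idea can
| be approximated in a few lines (Aider's actual repo map is
| language-aware; this crude sketch is not):
|
|     import ast
|
|     def signature_lines(source: str) -> list[str]:
|         # Keep only def/class header lines, dropping the bodies.
|         lines = source.splitlines()
|         return [lines[node.lineno - 1].strip()
|                 for node in ast.walk(ast.parse(source))
|                 if isinstance(node, (ast.FunctionDef,
|                                      ast.AsyncFunctionDef,
|                                      ast.ClassDef))]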
| ycombiredd wrote:
| I guess I shouldn't be surprised that many of us have approached
| this in different ways. It's neat to see multiple replies of the
| sort I'm going to make too, which is to share the approach I've
| been taking: concatenating or "summarizing" the code, with
| particular attention to dependency resolution.
|
| [chimeracat](https://github.com/scottvr/chimeracat)
|
| It took the shape it has because it started as a tool to
| concatenate a library I had been working on into a single ipynb
| file, so that I didn't need to install the library on the remote
| colab. Thus the dependency graph was born (as was the ascii
| graph plotter 'phart' that it uses). Then, as I realized this
| could be useful for sharing code with an LLM, I started adding
| the summarization capabilities, and in some sort of meta-
| recursive irony, worked with Claude to do so. :-)
|
| I've put a collection of ancillary tools I use to aid in the
| LLM-pairing process up at
| https://github.com/scottvr/LLMental
| sitkack wrote:
| This is hilarious https://github.com/scottvr/retree
| ycombiredd wrote:
| While I _hope_ you mean it is hilarious in the same spirit
| that I write most of my stuff ("ludicrous" is a common phrase
| even in my documentation), I did want to ask: if you meant
| that in more of a disparaging way, could you flesh out any
| criticism?
|
| Of course, if you meant "hilarious" similarly to how I mean
| "ludicrous", thanks! And thank you for taking the time to
| look at it. :-)
| sitkack wrote:
| Not disparaging in any way. In fact, it made me think
| of bijective lenses and what other tools could have an
| inverse, even if lossy.
|
| Your project is amazing, you are ahead of me in your
| thinking.
| ycombiredd wrote:
| "lessthan three" ;-)
|
| Thanks again.
| mohsen1 wrote:
| Added some benchmarking to show how fast it is:
|
| Here is a benchmark comparing it to [Repomix][1], serializing
| the Next.js project:
|
|     time yek
|     Executed in    5.19 secs    fish    external
|        usr time    2.85 secs    54.00 micros    2.85 secs
|        sys time    6.31 secs   629.00 micros    6.31 secs
|
|     time repomix
|     Executed in   22.24 mins    fish    external
|        usr time   21.99 mins     0.18 millis   21.99 mins
|        sys time    0.23 mins     1.72 millis    0.23 mins
|
| yek is roughly 257x faster than repomix (22.24 min vs 5.19 s).
|
| [1] https://github.com/jxnl/repomix
| amelius wrote:
| Maybe I don't understand the use case, but I'm curious why
| speed matters given that LLMs are so slow (?)
| BigRedEye wrote:
| 22 minutes for a medium-sized repo is probably slow enough to
| optimize.
| elashri wrote:
| However, for these large repositories I'm not sure
| that you can fit in the effective context window. I
| know there is an option to limit the token count, but
| then this would be your realistic limit.
| yani wrote:
| You can do this all in the browser: https://dropnread.io/
| Alifatisk wrote:
| There is also https://repo2txt.simplebasedomain.com/local.html
| which doesn't require you to download anything.
| endofreach wrote:
| I have a very simple bash function for this (filecontens),
| including ignoring files based on .gitignore, binary files,
| etc. Piped to clipboard and done.
|
| All these other ways seem unnecessarily complicated...
| imiric wrote:
| I also feel like this can be done in a few lines of shell
| script.
|
| Can you share your function, please?
| verghese wrote:
| How does this compare to a tool like RepoPrompt?
|
| https://repoprompt.com
| lordofgibbons wrote:
| Does anyone know of a more semantically meaningful way of
| chunking code in a generalizable way? Token count seems like it'd
| leave out meaningful context, or include unrelated context.
| energy123 wrote:
| I have an LLM summarize each file, and feed in a list of
| summaries categorized by theme, along with the function
| signatures and a tree of the directory. I only paste the full
| code of the files that are very important for the task.
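|
| Schematically (summarize stands in for the per-file LLM call;
| this is a sketch of the workflow, not a library):
|
|     def build_context(files: dict[str, str],
|                       important: set[str]) -> str:
|         # files maps path -> source; important paths go in full.
|         parts = []
|         for path, src in files.items():
|             if path in important:
|                 parts.append(f"{path}:\n{src}")
|             else:
|                 parts.append(f"{path}: {summarize(src)}")  # hypothetical
|         return "\n\n".join(parts)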
| sitkack wrote:
| I have to add https://github.com/simonw/files-to-prompt as a
| marker guide.
|
| I think the "part of it" is key here. For packaging a codebase,
| I'll select a collection of files using rg/fzf and then
| concatenate them into a markdown document: # headers for paths,
| ```filetype <data>``` fences for the contents.
|
| The selection of the files is key to let the LLM focus on what is
| important for the immediate task. I'll also give it the full file
| list and have the LLM request files as needed.
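|
| The concatenation step itself is tiny; a sketch of the format
| described above (the extension-to-language mapping is naive):
|
|     import pathlib
|
|     def to_markdown(paths: list[str]) -> str:
|         parts = []
|         for p in paths:
|             path = pathlib.Path(p)
|             lang = path.suffix.lstrip(".")  # naive filetype guess
|             parts.append(f"# {p}\n\n```{lang}\n{path.read_text()}\n```")
|         return "\n\n".join(parts)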
| zurfer wrote:
| Has anyone built a linter that optimizes code for an LLM?
|
| The idea would be to make it more token efficient (and lower
| accidental perplexity), e.g. by renaming variables, fixing
| typos and shortening comments.
|
| It should probably run after a normal formatter like black.
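|
| One such rule is easy to sketch: strip docstrings before
| serializing (Python-only; whether this helps or hurts the LLM
| is exactly what I'd want such a linter to measure):
|
|     import ast
|
|     def strip_docstrings(source: str) -> str:
|         tree = ast.parse(source)
|         for node in ast.walk(tree):
|             if isinstance(node, (ast.Module, ast.FunctionDef,
|                                  ast.AsyncFunctionDef, ast.ClassDef)):
|                 body = node.body
|                 # Drop a leading docstring, but never empty the body.
|                 if (len(body) > 1 and isinstance(body[0], ast.Expr)
|                         and isinstance(body[0].value, ast.Constant)
|                         and isinstance(body[0].value.value, str)):
|                     node.body = body[1:]
|         return ast.unparse(tree)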
| grajaganDev wrote:
| Good idea - couldn't the linter also be an LLM?
| layer8 wrote:
| > accidental perplexity
|
| I have to remember this.
| foxhop wrote:
| `tree --gitignore && cat *.py && cat templates/*`
| msoad wrote:
| why Dropbox when you can rsync, huh? ;)
| claireGSB wrote:
| Adding my take to the mix, which has been working well for me:
| https://github.com/ClaireGSB/project-context.
|
| It outputs a file tree of your repo, a list of the
| dependencies, and a select list of files you want to include in
| your prompt for the LLM, in a single xml file. The first time
| you run it, it generates a .project-context.toml config file in
| your repo with all your files commented out, and you can just
| uncomment the ones you want written in full in the context
| file. I've found this helps when iterating on a specific part
| of the codebase, while keeping the full file tree gives the LLM
| the broader context; I always ask the LLM to request more files
| if needed, as it can see the full list.
|
| The files are not sorted by priority in the output though,
| curious what the impact would be / how much room for manual
| config to leave (might want to order differently depending on the
| objective of the prompt).
| jamesponddotco wrote:
| Like many people, I built something similar, llmctx[1], but the
| chunking feature seems really interesting; I have to look into
| that for llmctx.
|
| One thing I have with llmctx that I think is missing in yek is a
| "Claude mode", as it outputs the concatenation in a format more
| suitable to provide context for the Anthropic LLM.
|
| [1]: https://sr.ht/~jamesponddotco/llmctx/
| kauegimenes wrote:
| I have a feeling there are some native unix commands that
| should cover this. So looking at the current scope of this
| tool, I think it would need more features.
|
| Does anyone have a bash script that covers this use case?
| HarHarVeryFunny wrote:
| Sure, find, cat and split would do the job.
| paradite wrote:
| There are a lot of them. I collected a list of CLI tools that
| do this:
|
| https://prompt.16x.engineer/cli-tools
|
| I also built a GUI tool that does this:
|
| https://prompt.16x.engineer/
| gatienboquet wrote:
| https://gitingest.com/
| frays wrote:
| Interesting thread with other links.
| johnisgood wrote:
| https://github.com/simonw/files-to-prompt/ seems to work fine
| especially for Claude as it follows the recommendations and
| format.
| neuraldenis wrote:
| I made a quite similar web tool some time ago: client side, no
| registration required: https://shir-man.com/txt-merge/
| dwrtz wrote:
| I wrote something similar: https://github.com/dwrtz/sink
|
| it has some nice features:
|
| - follows .gitignore patterns
|
| - optional file watcher to regenerate the snapshot when files are
| changed
|
| - supports glob patterns for filter/exclude
|
| - can report token usage
|
| - can use a config yaml file if you have complicated
| filter/exclude patterns
|
| I find myself using it all the time now. It's great with
| Claude Projects.
| __chefski__ wrote:
| The token estimator of this package is quite inaccurate, since it
| appears to just take the number of characters, minus whitespace.
| This can lead to it being overly conservative, which would
| degrade an LLM's performance. That said, it can be improved in
| subsequent versions to properly integrate with a model's
| tokenizer so it can know the true token count.
|
| https://github.com/bodo-run/yek/blob/17fe37fbd461a8194ff612e...
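|
| For reference, an exact count for the OpenAI tokenizer family
| is only a few lines in Python with tiktoken (other model
| families tokenize differently, so it remains per-model):
|
|     import tiktoken
|
|     def token_count(text: str) -> int:
|         enc = tiktoken.get_encoding("cl100k_base")  # GPT-4/3.5 family
|         return len(enc.encode(text))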
___________________________________________________________________
(page generated 2025-01-19 23:01 UTC)