[HN Gopher] I made my own Git
       ___________________________________________________________________
        
       I made my own Git
        
       Author : TonyStr
       Score  : 350 points
       Date   : 2026-01-27 10:55 UTC (20 hours ago)
        
 (HTM) web link (tonystr.net)
 (TXT) w3m dump (tonystr.net)
        
       | kgeist wrote:
       | >The hardest part about this project was actually just parsing.
       | 
       | How about using sqlite for this? Then you wouldn't need to parse
       | anything, just read/update tables. Fast indexing out of the box,
       | too.
        
         | grenran wrote:
         | that would be what https://fossil-scm.org/ is
        
           | TonyStr wrote:
           | Very interesting. Looks like fossil has made some unique
           | design choices that differ from git[0]. Has anyone here used
           | it? I'd love to hear how it compares.
           | 
           | [0] https://fossil-scm.org/home/doc/trunk/www/fossil-v-
           | git.wiki#...
        
             | embedding-shape wrote:
             | Used it on and off mainly to check it out, but always in a
             | personal/experimental capacity. Never managed to convince
              | any teams to give it a try, mostly because git doesn't
              | tend to get in the way, so it's hard to justify learning
              | something completely new.
             | 
             | I really enjoy how local-first it is, as someone who
              | sometimes works without an internet connection. That the data
             | around "work" is part of the SCM as well, not just the
             | code, makes a lot of sense to me at a high-level, and many
             | times I wish git worked the same...
        
               | usrbinbash wrote:
               | I mean, git is just as "local-first" (a git repo is just
               | a directory after all), and the standard git-toolchain
               | includes a server, so...
               | 
               | But yeah, fossil is interesting, and it's a crying shame
                | it's not more well known, for the exact reasons you point
               | out.
        
               | embedding-shape wrote:
               | > I mean, git is just as "local-first" (a git repo is
               | just a directory after all), and the standard git-
               | toolchain includes a server, so...
               | 
                | It isn't, though: Fossil integrates all the data around
               | the code too in the "repository", so issues, wiki,
               | documentation, notes and so on are all together, not like
               | in git where most commonly you have those things on
               | another platform, or you use something like `git notes`
                | which has maybe 10% of the functionality of the respective
               | Fossil feature.
               | 
               | It might be useful to scan through the list of features
               | of Fossil and dig into it, because it does a lot more
               | than you seem to think :) https://fossil-
               | scm.org/home/doc/trunk/www/index.wiki
        
               | adastra22 wrote:
                | Those things exist for git too, e.g. git-bug. But the
                | first-class way to do it in git is email.
        
               | embedding-shape wrote:
               | Email isn't a wiki, bug tracking, documentation and all
               | the other stuff Fossil offers as part of their core
               | design. The point is for it to be in one place, and
               | local-first.
               | 
               | If you don't trust me, read the list of features and give
               | it a try yourself: https://fossil-
               | scm.org/home/doc/trunk/www/index.wiki
        
               | adastra22 wrote:
               | I am aware of fossil. Did you look up git-bug?
        
             | smartmic wrote:
             | I use Fossil extensively, but only for personal projects.
              | There are specific design decisions, such as no rebasing
             | [0], and overall, it is simpler yet more useful to me.
             | However, I think Fossil is better suited for projects
             | governed under the cathedral model than the bazaar model.
             | It's great for self-hosting, and the web UI is excellent
             | not only for version control, but also for managing a
             | software development project. However, if you want a low
             | barrier to integrating contributions, Fossil is not as good
             | as the various Git forges out there. You have to either
             | receive patches or Fossil bundles via email or forum, or
             | onboard/register contributors as developers with quite wide
             | repo permissions.
             | 
             | [0]: https://fossil-
             | scm.org/home/doc/trunk/www/rebaseharm.md
        
               | toyg wrote:
               | Sounds like a more modern cvs/Subversion
        
               | chungy wrote:
               | It was developed primarily to replace SQLite's CVS
               | repository, after all. They used CVSTrac as the forge and
               | Fossil was designed to replace that component too.
        
             | graemep wrote:
             | I like it but the problem is everyone else already knows
             | git and everything integrates with git.
             | 
             | It is very easy to self host.
             | 
             | Not having staging is awkward at first but works well once
             | you get used to it.
             | 
              | I prefer it for personal projects. I think it's better
              | for small teams if people are willing to adjust, but I
              | have not had enough opportunities to try it.
        
               | TonyStr wrote:
               | Is it possible to commit individual files, or specific
               | lines, without a staging area? I guess this might be
               | against Fossil's ethos, and you're supposed to just
               | commit everything every time?
        
               | jact wrote:
               | You can commit individual files.
        
               | graemep wrote:
               | Yes you can list specific files, but you have to list
               | them all in the commit command.
               | 
               | I think the ethos is to discourage it.
               | 
               | It does not seem to be possible to commit just specific
               | lines.
        
             | jact wrote:
             | I use Fossil extensively for all my personal projects and
             | find it superior for the general case. As others said it's
             | more suited for small projects.
             | 
             | I also use Fossil for lots of weird things. I created a
             | forum game using Fossil's ticket and forum features because
             | it's so easy to spin up and for my friends to sign in to.
             | 
             | At work we ended up using Fossil in production to manage
             | configuration and deployment in a highly locked down
             | customer environment where its ability to run as a single
             | static binary, talk over HTTP without external
             | dependencies, etc. was essential. It was a poor man's
             | deployment tool, but it performed admirably.
             | 
             | Fossil even works well as a blogging platform.
        
           | dchest wrote:
           | While Fossil uses SQLite for underlying storage (instead of
           | the filesystem directly) and various support infrastructure,
           | its actual format is not based on SQLite: https://fossil-
           | scm.org/home/doc/trunk/www/fileformat.wiki
           | 
           | It's basically plaintext. Even deltas are plaintext for text
           | files.
           | 
           | Reason: "The global state of a fossil repository is kept
           | simple so that it can endure in useful form for decades or
           | centuries. A fossil repository is intended to be readable,
           | searchable, and extensible by people not yet born."
        
         | storystarling wrote:
         | SQLite solves the storage layer but I suspect you run into a
         | pretty big impedance mismatch on the graph traversals. For
         | heavy DAG operations like history rewriting, a custom structure
         | seems way more efficient than trying to model that
         | relationally.
        
           | SQLite wrote:
           | The Common Table Expression feature of SQL is very good at
           | walking graphs. See, for example
           | <https://sqlite.org/lang_with.html#queries_against_a_graph>.
        
       | prakhar1144 wrote:
       | I was also playing around with the ".git" directory - ended up
       | writing:
       | 
       | "What's inside .git ?" - https://prakharpratyush.com/blog/7/
        
       | sluongng wrote:
       | Zstd dictionary compression is essentially how Meta's Mercurial
       | fork (Sapling VCS) stores blobs https://sapling-
       | scm.com/docs/dev/internals/zstdelta. The source code is available
        | on GitHub if folks want to study the tradeoffs vs git delta-
       | compressed packfiles.
       | 
        | I think theoretically, Git delta-compression is still a lot
        | more optimized for smaller repos. But for bigger repos where
        | sharded storage is required, path-based delta dictionary
        | compression does much better. Git recently (in the last year)
        | got something called "path-walk" which is fairly similar,
        | though.
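(zstd itself isn't in the Python standard library, but zlib's preset-dictionary support illustrates the same trick at small scale; this is a sketch of the idea, not how Sapling or Git actually store blobs. Seed the compressor with a related blob, such as the previous revision of the same path, and near-duplicate data compresses far better than it would cold.)

```python
import zlib

# Two revisions of a file: the new one shares almost all bytes with the old.
old = b"".join(b"line %03d: fairly distinctive content\n" % i for i in range(60))
new = old.replace(b"line 030:", b"line 030 (edited):")

def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress data, optionally seeding zlib with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

cold = len(deflate(new))             # compressed with no prior knowledge
warm = len(deflate(new, zdict=old))  # seeded with the previous revision
print(cold, warm)
```

Decompression needs the same dictionary (`zlib.decompressobj(zdict=old)`), which is why dictionary schemes pick dictionaries per path or per shard: the reader has to be able to find the dictionary again.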
        
       | darkryder wrote:
       | Great writeup! It's always fun to learn the details of the tools
       | we use daily.
       | 
       | For others, I highly recommend Git from the Bottom Up[1]. It is a
       | very well-written piece on internal data structures and does a
       | great job of demystifying the opaque git commands that most
       | beginners blindly follow. Best thing you'll learn in 20ish
       | minutes.
       | 
       | 1. https://jwiegley.github.io/git-from-the-bottom-up/
        
         | spuz wrote:
         | Thanks - I think this is the article I was thinking of that
         | really helped me to understand git when I first started using
         | it back in the day. I tried to find it again and couldn't.
        
         | MarsIronPI wrote:
         | Oh, I hadn't ever seen that one. I "grokked" Git thanks to The
         | Git Parable[0] several years ago.
         | 
         | [0]: https://tom.preston-werner.com/2009/05/19/the-git-parable
        
         | sanufar wrote:
         | Ooh, this looks fun! I didn't know you could cat-file on a hash
         | id, that's actually quite cool.
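(For anyone curious what `git cat-file -p` does for a loose object: roughly, zlib-decompress the file named by the hash and strip the `<type> <size>\0` header. A standalone Python sketch, fed a synthetic object so it runs without a repo.)

```python
import zlib

def cat_file(raw: bytes):
    """Decode one zlib-compressed loose object into (type, payload)."""
    store = zlib.decompress(raw)
    header, _, payload = store.partition(b"\x00")  # header is e.g. b"blob 6"
    obj_type, size = header.decode().split()
    assert int(size) == len(payload)               # sanity-check the header
    return obj_type, payload

# On disk a loose object is "<type> <size>\0<content>", zlib-compressed.
raw = zlib.compress(b"blob 6\x00hello\n")
print(cat_file(raw))  # → ('blob', b'hello\n')
```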
        
       | heckelson wrote:
       | gentle reminder to set your website's `<title>` to something
       | descriptive :)
        
         | TonyStr wrote:
         | haha, thank you. Added now :-)
        
       | teiferer wrote:
       | If you ever wonder how coding agents know how to plan things etc,
       | this is the kind of article they get this training from.
       | 
       | Ends up being circular if the author used LLM help for this
       | writeup though there are no obvious signs of that.
        
         | wasmainiac wrote:
          | Maybe we can poison LLMs with loops of 2 or more self-
          | referencing blogs.
        
           | jdiff wrote:
           | Only need one, they're not thinking critically about the
           | media they consume during training.
        
             | falcor84 wrote:
             | Here's a sad prediction: over the coming few years, AIs
             | will get significantly better at critical evaluation of
             | sources, while humans will get even worse at it.
        
               | topaz0 wrote:
               | My sad prediction is that LLMs and humans will both get
               | worse. Humans might get worse faster though.
        
               | whstl wrote:
               | I wish I could disagree with you, but what I'm seeing on
               | average (especially at work) is exactly that: people
               | asking stuff to ChatGPT and accepting hallucinations as
               | fact, and then fighting me when I say it's not true.
        
               | prmoustache wrote:
               | There is "death by GPS" for people dying after blindly
               | following their GPS instruction. There will definitely be
               | a "death by AI" expression very soon.
        
               | stevekemp wrote:
               | Tesla-related fatalities probably count already, albeit
               | without that label/name.
        
               | sailfast wrote:
               | Hot take: Humans have always been bad at this (in the
               | aggregate, without training). Only a certain percentage
               | of the population took the time to investigate.
               | 
                | For most people throughout history, whatever is
                | presented to them is the right answer. AI just brings them
               | source information faster so what you're seeing is mostly
               | just the usual behavior, but faster. Before AI people
               | would not have bothered to try and figure out an answer
               | to some of these questions. It would've been too much
               | work.
        
               | keybored wrote:
                | HN commenters will be techno-optimistic misanthropes.
               | Status quo ante bellum.
        
             | andy_ppp wrote:
             | The secret sauce about having good understanding, taste and
             | style (both for coding and writing) has always been in the
              | fine-tuning and RLHF steps. I'd be skeptical that the signals
             | a few GitHub repos or blogs generate at the initial stages
             | of the learning are that critical. There's probably a
             | filter also for good taste on the initial training set and
             | these are so large not even a single full epoch is done on
             | the data these days.
        
             | jama211 wrote:
             | It wouldn't work at all.
        
           | jama211 wrote:
           | I see the AI hating part of HN has come out again
        
         | mexicocitinluez wrote:
         | > Ends up being circular if the author used LLM help for this
         | writeup though there are no obvious signs of that.
         | 
         | Great argument for not using AI-assisted tools to write blog
         | posts (especially if you DO use these tools). I wonder how much
         | we're taking for granted in these early phases before it starts
         | to eat itself.
        
           | jama211 wrote:
           | What does eating itself even look like? It doesn't take much
           | salt to change a hash.
        
             | mexicocitinluez wrote:
              | Being trained on its own results?
        
         | anu7df wrote:
         | I understand model output put back into training would be an
         | issue, but if model output is guided by multiple prompts and
          | edited by the author to his/her liking, wouldn't that at least
         | be marginally useful?
        
         | TonyStr wrote:
         | Interestingly, I looked at github insights and found that this
         | repo had 49 clones, and 28 unique cloners, before I published
         | this article. I definitely did not clone it 49 times, and
         | certainly not with 28 unique users. It's unlikely that the
         | handful of friends who follow me on github all cloned the repo.
         | So I can only speculate that there are bots scraping new public
         | github repos and training on everything.
         | 
         | Maybe that's obvious to most people, but it was a bit
         | surprising to see it myself. It feels weird to think that LLMs
         | are being trained on my code, especially when I'm painfully
         | aware of every corner I'm cutting.
         | 
         | The article doesn't contain any LLM output. I use LLMs to ask
         | for advice on coding conventions (especially in rust, since I'm
         | bad at it), and sometimes as part of research (zstd was
         | suggested by chatgpt along with comparisons to similar
         | algorithms).
        
           | nerdponx wrote:
           | Time to start including deliberate bugs. The correct version
           | is in a private repository.
        
             | teiferer wrote:
             | And what purpose would this serve, exactly?
        
               | adastra22 wrote:
               | Spite.
        
             | program_whiz wrote:
             | while I think this is a fun idea -- we are in such a
             | dystopian timeline that I fear you will end up being
             | prosecuted under a digital equivalent of various laws like
             | "why did you attack the intruder instead of fleeing" or
              | "you can't simply remove a squatter because it's your house,
             | therefore you get an assault charge."
             | 
             | A kind of "they found this code, therefore you have a duty
             | not to poison their model as they take it." Meanwhile if I
             | scrape a website and discover data I'm not supposed to see
             | (e.g. bank details being publicly visible) then I will go
             | to jail for pointing it out. :(
        
               | wredcoll wrote:
               | Look, I get the fantasy of someday pulling out my
               | musket^W ar15 and rushing downstairs to blow away my
               | wife^W an evil intruder, but, like, we live in a society.
               | And it has a lot of benefits, but it does mean you don't
               | get to be "king of your castle" any more.
               | 
               | Living in a country with hundreds of millions of other
               | civilians or a city with tens of thousands means
               | compromising what you're allowed to do when it affects
               | other people.
               | 
               | There's a reason we have attractive nuisance laws and you
               | aren't allowed to put a slide on your yard that
               | electrocutes anyone who touches it.
               | 
               | None of this, of course, applies to "poisoning" llms,
               | that's whatever. But all your examples involved actual
               | humans being attacked, not some database.
        
               | program_whiz wrote:
               | Thanks that was the term I was looking for "attractive
               | nuisance". I wouldn't be surprised if a tech company
               | could make that case -- this user caused us tangible harm
               | and cost (training, poisoned models) and left their data
                | out for us to consume. It's the equivalent of putting
                | poisoned candy on a park table, your honor!
        
               | teo_zero wrote:
               | That reminds me of the protagonist of Charles Stross's
               | novel "Accelerando", a prolific inventor who is accused
                | by the IRS of having caused millions in losses because he
               | releases all his ideas in the public domain instead of
               | profiting from them and paying taxes on such profits.
        
               | nerdponx wrote:
               | I think if we're at the point where posting deliberate
               | mistakes to poison training data is considered a crime,
               | we would be far far far down the path of authoritarian
               | corporate regulatory capture, much farther than we are
               | now (fortunately).
        
             | below43 wrote:
              | They used to do this with maps - e.g. fake islands - to pick
             | up when they were copied.
        
           | Phelinofist wrote:
           | I selfhost Gitea. The instance is crawled by AI crawlers
           | (checked the IPs). They never cloned, they just browse and
           | take it directly from there.
        
             | Zambyte wrote:
             | i run a cgit server on an r720 in my apartment with my code
             | on it and that puppy screams whenever sam wants his code
             | 
             | blocking openai ips did wonders for the ambient noise
             | levels in my apartment. they're not the only ones
             | obviously, but they're they only ones i had to block to
             | stay sane
        
               | MarsIronPI wrote:
               | Have you considered putting it behind Anubis or an
               | equivalent?
        
               | Zambyte wrote:
               | Yes, but I haven't and would prefer not to
        
               | MarsIronPI wrote:
               | Understandable. It's an outrage that we even have to
               | consider such measures.
        
             | Phelinofist wrote:
              | For reference, this is how I do it in my Caddyfile:
              | 
              |     (block_ai) {
              |         @ai_bots {
              |             header_regexp User-Agent (?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)
              |         }
              |         abort @ai_bots
              |     }
              | 
              | Then, in a specific app block, include it via:
              | 
              |     import block_ai
        
               | zaphar wrote:
               | I have almost exactly this in my own caddyfile :-D The
               | order of the items in the regex is a little different but
               | mostly the same items. I just pulled them from my web
               | access logs over time and update it every once in a
               | while.
        
               | seba_dos1 wrote:
                | Most of them pretend to be real users though and don't
               | identify themselves with their user agent strings.
        
           | tonnydourado wrote:
           | Particularly on GitHub, might not even be LLMs, just regular
           | bots looking for committed secrets (AWS keypairs, passwords,
           | etc.)
        
           | 0x696C6961 wrote:
           | This has been happening before LLMs too.
        
           | teiferer wrote:
           | I don't really get why they need to clone in order to scrape
           | ...?
           | 
           | > It feels weird to think that LLMs are being trained on my
           | code, especially when I'm painfully aware of every corner I'm
           | cutting.
           | 
           | That's very much expected. That's why the quality of LLM
           | coding agents is like it is. (No offense.)
           | 
           | The "asking LLMs for advice" part is where the circular
           | aspect starts to come into the picture. Not worse than
           | looking at StackOverflow though which then links to other
           | people who in turn turned to StackOverflow for advice.
        
             | adastra22 wrote:
             | The quality of LLM coding agents is pretty good now.
        
             | storystarling wrote:
             | Cloning gets you the raw text objects directly. If you
             | scrape the web UI you're dealing with a lot of markup
             | overhead that just burns compute during ingestion. For
             | training data you usually want the structure to be as clean
             | as possible from the start.
        
               | teiferer wrote:
               | Sure, cloning a local copy. But why clone _on github_?
        
         | prodigycorp wrote:
         | Random aside about training data:
         | 
         | One of the funniest things I've started to notice from Gemini
          | in particular is that in random situations, it talks in
          | English with an agreeable affect that I can only describe as...
         | Indian? I've never noticed such a thing leak through before.
         | There must be a _ton_ of people in India who are generating new
         | datasets for training.
        
           | blenderob wrote:
            | That's very interesting. Any examples you can share that
            | have those agreeable effects?
        
             | prodigycorp wrote:
             | I'm going to do a cursory look through my antigrav history,
              | I want to find it too. I remember it's primarily in the
             | exclamations of agreement/revelation, and one time
              | expressing concern, which I remember being slightly off
              | from natural for an American English speaker.
        
               | prodigycorp wrote:
                | Can't find anything, too many messages telling the agent
               | "please do NOT _those_ c changes". I'm going to remember
               | to save them going forward.
        
           | evntdrvn wrote:
           | There was a really great article or blog post published in
           | the last few months about the author's very personal
           | experience whose gist was "People complain that I sound/write
           | like an LLM, but it's actually the inverse because I grew up
           | in X where people are taught formal English to sound
           | educated/western, and those areas are now heavily used for
           | LLM training."
           | 
           | I wish I could find it again, if someone else knows the link
           | please post it!
        
             | gxnxcxcx wrote:
              | _I'm Kenyan. I don't write like ChatGPT, ChatGPT writes
              | like me_
             | 
             | https://news.ycombinator.com/item?id=46273466
        
               | tverbeure wrote:
               | Thanks for that link.
               | 
               | This part made me laugh though:
               | 
               | > These detectors, as I understand them, often work by
               | measuring two key things: 'Perplexity' and 'burstiness'.
               | Perplexity gauges how predictable a text is. If I start a
               | sentence, "The cat sat on the...", your brain, and the
               | AI, will predict the word "floor."
               | 
                | I can't be the only one whose brain predicted "mat"?
        
               | cozzyd wrote:
               | And I thought it would be a hat...
        
             | awesome_dude wrote:
             | I've been critical of people that default to "an em dash
             | being used means the content is generated by an LLM", or,
             | "they've numbered their points, must be an LLM"
             | 
             | I do know that LLMs generate content heavy with those
             | constructs, but they didn't create the ideas out of thin
             | air, it was in the training set, and existed strongly
              | enough that LLMs saw it as commonplace/best practice.
        
       | sneela wrote:
       | > If you want to look at the code, it's available on github.
       | 
       | Why not tvc-hub :P
       | 
       | Jokes aside, great write up!
        
         | TonyStr wrote:
         | haha, maybe that's the next project. It did feel weird to make
         | git commits at the same time as I was making tvc commits
        
       | igorw wrote:
       | Random but y'all might enjoy. Git client in PHP, supports reading
       | packfiles, reftables, diff via LCS. Written by hand.
       | 
       | https://github.com/igorwwwwwwwwwwwwwwwwwwww/gipht-horse
        
         | nasretdinov wrote:
         | Nice! This repo is a huge W for PHP I'd say.
         | 
         | P.S. Didn't know that plain '@' can be used instead of HEAD,
         | but I guess it makes sense since you can omit both left and
         | right parts of the expressions separated by '@'
        
       | h1fra wrote:
       | Learning git internals was definitely the moment it became clear
       | to me how efficient and smart git is.
       | 
        | And this way of versioning can be reused in other fields: as
        | soon as you have some kind of graph of data that can be
        | modified independently but read all together, it makes sense.
        
       | p4bl0 wrote:
       | Nice post :). It made me think of _ugit: DIY Git in Python_ [1]
        | which is still by far my favorite post of this kind. It
       | really goes deep into Git internals while managing to stay easy
       | to follow along the way.
       | 
       | [1] https://www.leshenko.net/p/ugit/
        
         | TonyStr wrote:
         | This page is beautiful!
         | 
         | Bookmarked for later
        
         | mfashby wrote:
         | in a similar vein; Write yourself a Git was fun to follow
         | https://wyag.thb.lt/
        
         | UltraSane wrote:
         | I mapped git operations to Neo4j and it really helped me
         | understand how it works.
        
       | eru wrote:
       | > These objects are also compressed to save space, so writing to
       | and reading from .git/objects/ will always involve running a
       | compression algoritm. Git uses zlib to compress objects, but
       | looking at competitors, zstd seemed more promising:
       | 
       | That's a weird thing to put so close to the start. Compression is
       | about the least interesting aspect of Git's design.
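(For anyone who hasn't poked at it: the loose-object path the quoted passage describes is only a few lines. A rough standalone sketch of git's blob storage, writing into a temp directory instead of a real .git/objects:)

```python
import hashlib, os, tempfile, zlib

def hash_object(objects_dir: str, content: bytes) -> str:
    """Store content as a git-style loose blob and return its object id."""
    store = b"blob %d\x00" % len(content) + content  # header + payload
    oid = hashlib.sha1(store).hexdigest()            # hash covers the header too
    # Loose objects live at objects/<first two hex chars>/<remaining 38>.
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))                # zlib, as the quote says
    return oid

objects = tempfile.mkdtemp()
oid = hash_object(objects, b"hello world\n")
print(oid)  # identical to what `git hash-object` prints for the same bytes
```

The hash covers the header as well as the content, which is why a blob's id depends only on its bytes, never on its filename or timestamp.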
        
         | alphabetag675 wrote:
         | When you are learning, everything is important. I think it is
         | okay to cut the person some slack regarding this.
        
       | nasretdinov wrote:
       | Nice work! On a complete tangent, Git is the only SCM known to me
        | that supports the recursive merge strategy [1] (instead of the
       | regular 3-way merge), which essentially always remembers resolved
       | conflicts without you needing to do anything. This is a very
       | underrated feature of Git and somehow people still manage to
       | choose rebase over it. If you ever get to implementing merges,
       | please make sure you have a mechanism for remembering the
       | conflict resolution history :).
       | 
       | [1] https://stackoverflow.com/questions/55998614/merge-made-
       | by-r...
        
         | arunix wrote:
         | I remember in a previous job having to enable git rerere,
         | otherwise it wouldn't remember previously resolved conflicts.
         | 
         | https://git-scm.com/book/en/v2/Git-Tools-Rerere
        
           | nasretdinov wrote:
           | I believe rerere is a local cache, so you'd still have to
           | resolve the conflicts again on another machine. The recursive
           | merge doesn't have this issue -- the conflict resolution
           | inside the merge commits is effectively remembered (although
           | due to how Git operates it actually never even considers it a
           | conflict to be remembered -- just a snapshot of the closest
           | state to the merged branches)
        
             | Guvante wrote:
             | Are people repeatedly handling merge conflicts on multiple
             | machines?
             | 
             | If there was a better way to handle "I needed to merge in
             | the middle of my PR work" without introducing reverse
             | merged permanently in the history I wouldn't mind merge
             | commits.
             | 
              | But tools will sometimes skip over others' work if you `git
             | pull` a change into your local repo due to getting confused
             | which leg of the merge to follow.
        
           | direwolf20 wrote:
           | The recursive merge is about merging branches that already
           | have merges in them, while rerere is about repeating the same
           | merge several times.
        
           | pyrolistical wrote:
           | Would be nice if centralized git platforms shared rerere
           | caches
        
           | lmm wrote:
           | Rerere is dangerous and counterproductive - it tries to give
           | rebase the same functionality that merge has, but since
           | rebase is fundamentally wrong it only stacks the wrongness.
        
             | seba_dos1 wrote:
             | Cherry-picks being "fundamentally wrong" is certainly an
             | interesting git take.
        
         | mkleczek wrote:
         | Much more principled (and hence less of a foot-gun) way of
         | handling conflicts is making them first class objects in the
         | repository, like https://pijul.org does.
        
           | jcgl wrote:
           | Jujutsu too[0]:
           | 
           | > Jujutsu keeps track of conflicts as first-class objects in
           | its model; they are first-class in the same way commits are,
           | while alternatives like Git simply think of conflicts as
           | textual diffs. While not as rigorous as systems like Darcs
           | (which is based on a formalized theory of patches, as opposed
           | to snapshots), the effect is that many forms of conflict
           | resolution can be performed and propagated automatically.
           | 
           | [0] https://github.com/jj-vcs/jj
        
           | PunchyHamster wrote:
           | I feel like people making new VCSes should just reuse Git's
           | storage/network layer and innovate on top of that. Git
           | storage is flexible enough for that, and that way you can
           | just... use it on existing repos, with a very easy migration
           | path for both workflows (CI/CD never needs to care about
           | which frontend you use) and users.
        
             | zaphar wrote:
             | Git storage is just a merkle tree. It's a technology that's
             | been around forever and was simultaneously chosen by more
             | than one vcs technology around the same time. It's
             | incredibly effective so it makes sense that it would get
             | used.
        
             | storystarling wrote:
             | The bottleneck with git is actually the on-the-fly packfile
             | generation. The server has to burn CPU calculating deltas
             | for every clone. For a distributed system it seems much
             | better to use a simple content-addressable store where you
             | just serve static blobs.
        
             | 3eb7988a1663 wrote:
             | It is my understanding that under the hood, the repository
             | has quite a bit of state that can get mangled. That is why
             | naively syncing a git repo with say Dropbox is not a
             | surefire operation.
        
           | theLiminator wrote:
           | It's very cool, though I imagine it's DOA due to lack of git
           | compatibility...
        
             | speed_spread wrote:
             | Lack of compatibility with the incumbent SCM can be an
             | advantage. Like how Linus decided to explicitly do the
             | reverse of every SVN decision when designing git. He even
             | reversed CLI usability!
        
               | theLiminator wrote:
               | I think the network effects of git are too large to
               | overcome now. Hence why we see jj get a lot more adoption
               | than pijul.
        
               | rob74 wrote:
               | Pssst! I think Linus didn't as much design Git as he
               | cloned BitKeeper (or at least the parts of it he liked).
               | I have never used it, but if you look at the BitKeeper
               | documentation, it sounds strangely familiar:
               | https://www.bitkeeper.org/testdrive.html . Of course,
               | that made sense for him and for the rest of the Linux
               | developers, as they were already familiar with BitKeeper.
               | Not so much for the rest of us though, who are now stuck
               | with the usability (or lack thereof) you mentioned...
        
         | p0w3n3d wrote:
         | That's something new to me (using git for 10 years, always
         | rebased)
        
           | iberator wrote:
           | I'm even more lazy. I almost always clone from scratch after
           | merging or after not touching the project for some time. So
           | easy and silly :)
           | 
           | I always forget all the flags and I work with literally just:
           | clone, branch, checkout, push.
           | 
           | (Each feature is a fresh branch tho)
        
         | chungy wrote:
         | as far as I understand the problem (sorry, the SO isn't the
         | clearest around), Fossil should support this operation. It does
         | one better, since it even tracks exactly where merges come
         | from. In Git, you have a merge commit that shows up with more
         | than one parent, but Fossil will show you where it branched off
         | too.
         | 
         | Take out the last "/timeline" component of the URL to clone via
         | Fossil:
         | https://chiselapp.com/user/chungy/repository/test/timeline
         | 
         | See also, the upstream documentation on branches and merging:
         | https://fossil-scm.org/home/doc/trunk/www/branching.wiki
        
         | ezst wrote:
         | On recursive merging, by the author of mercurial
         | 
         | https://www.mercurial-scm.org/pipermail/mercurial/2012-Janua...
        
         | pwdisswordfishy wrote:
         | New to me was discovering within the last month that git-merge
         | doesn't have a merge strategy of "null": don't try to resolve
         | any merge conflicts, because I've already taken care of them;
         | just know that this is a merge between the current branch and
         | the one specified on the command-line, so be a dutiful little
         | tool and just add it to your records. Don't try to "help".
         | Don't fuck with the index or the worktree. Just record that
         | this is happening. That's it. Nothing else.
        
           | Brian_K_White wrote:
           | What does that even mean? There already is reset hard.
        
             | pwdisswordfishy wrote:
             | What do you mean, "What does it mean?" It means what I
             | wrote.
             | 
             | > There already is reset hard.
             | 
             | That's not... remotely relevant? What does that have to do
             | with merging? We're talking about merging.
        
               | Brian_K_White wrote:
                | Neither of these are answers or explanations. So you
                | said nothing, and then said nothing again.
                | 
                | I also "mean what I wrote". Man, that was sure easy to
                | say. It's almost like saying nothing at all. Which is
                | anyone's right to do, but it's not an argument, nor a
                | definition of terms, nor communication at all. Well, it
                | does communicate one thing.
        
             | kbolino wrote:
             | The name "null" is confusing; you have to pick something.
             | However, I think what is desired here is the "theirs"
             | strategy, i.e. to replace the current branch's tree
             | entirely with the incoming branch's tree. The end result
             | would be similar to a hard reset onto the incoming branch,
             | except that it would also create a merge commit.
             | Unfortunately, the "theirs" _strategy_ does not exist, even
             | though the  "ours" strategy does exist, apparently to avoid
             | confusion with the "theirs" _option_ [1], but it is
             | possible to emulate it with a sequence of commands [2].
             | 
             | [1]: https://git-scm.com/docs/merge-
             | strategies#Documentation/merg...
             | 
             | [2]: https://stackoverflow.com/a/4969679/814422
        
         | giancarlostoro wrote:
         | I hate git squash; it only goes one direction, and personally I
         | don't give a crap if it took you 100 commits to do one thing; at
         | least now we can see what you may have tried so we don't repeat
         | your mistakes. With git squash it all turns into "this is what
         | they last did that mattered", and btw we can't merge it backwards
         | without it being weird; you have to check out an entirely new
         | branch. I like to continue adding changes to branches I have
         | already merged. Not every PR is the full solution, but a piece
         | of the puzzle. No one can tell me that they only need 1 PR per
         | task because they never have a bug, ever.
         | 
         | Give me normal boring git merges over git squash merges.
        
       | jrockway wrote:
       | sha256 is a very slow algorithm, even with hardware acceleration.
       | BLAKE3 would probably make a noticeable performance difference.
       | 
       | Some reading from 2021:
       | https://jolynch.github.io/posts/use_fast_data_algorithms/
       | 
       | It is really hard to describe how slow sha256 is. Go sha256 some
       | big files. Do you think it's disk IO that's making it take so
       | long? It's not, you have a super fast SSD. It's sha256 that's
       | slow.
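       | A rough single-core illustration in Python (BLAKE3 isn't in the
       | standard library, so BLAKE2b stands in here; real BLAKE3 with
       | SIMD and multithreading widens the gap further):

```python
import hashlib
import time

# Hash 32 MiB of zeros with each algorithm and report throughput.
# This is only a ballpark comparison on one thread.
data = b"\x00" * (32 * 1024 * 1024)

for algo in ("sha256", "blake2b"):
    start = time.perf_counter()
    hashlib.new(algo, data).hexdigest()
    elapsed = time.perf_counter() - start
    print(f"{algo}: {elapsed:.3f}s ({len(data) / elapsed / 1e6:.0f} MB/s)")
```

       | Whether sha256 wins or loses here depends heavily on whether the
       | CPU has SHA extensions, as discussed below.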
        
         | grumbelbart2 wrote:
         | Is that even when using the SHA256 hardware extensions?
         | https://en.wikipedia.org/wiki/SHA_instruction_set
        
           | oconnor663 wrote:
           | It's mixed. You get something in the neighborhood of a 3-4x
           | speedup with SHA-NI, but the algorithm is fundamentally
           | serial. Fully parallel algorithms like BLAKE3 and K12, which
           | can use wide vector extensions like AVX-512, can be
           | substantially faster (10x+) even on one core. And
           | multithreading compounds with that, if you have enough input
           | to keep a lot of cores occupied. On the other hand, if you're
           | limited to one thread and older/smaller vector extensions
           | (SSE, NEON), hardware-accelerated SHA-256 can win. It can
           | also win in the short input regime where parallelism isn't
           | possible (< 4 KiB for BLAKE3).
        
         | EdSchouten wrote:
         | It depends on the architecture. On ARM64, SHA-256 tends to be
         | faster than BLAKE3. The reasons being that most modern ARM64
         | CPUs have native SHA-256 instructions, and lack an equivalent
         | of AVX-512.
         | 
         | Furthermore, if your input files are large enough that
         | parallelizing across multiple cores makes sense, then it's
         | generally better to change your data model to eliminate the
         | existence of the large inputs altogether.
         | 
         | For example, Git is somewhat primitive in that every file is a
         | single object. In retrospect it would have been smarter to
         | decompose large files into chunks using a Content Defined
         | Chunking (CDC) algorithm, and model large files as a manifest
         | of chunks. That way you get better deduplication. The resulting
         | chunks can then be hashed in parallel, using a single-threaded
         | algorithm.
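         | For illustration, a toy content-defined chunker using a
         | gear-style rolling hash (the parameters and table here are
         | arbitrary choices for the sketch, not what FastCDC, rsync, or
         | any real system uses):

```python
import random

# Random per-byte "gear" table; the rolling hash mixes one entry per byte.
random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK = (1 << 12) - 1            # cut when low 12 bits are zero (~4 KiB avg)
MIN_SIZE, MAX_SIZE = 1024, 16384

def chunk(data: bytes) -> list[bytes]:
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & (2**64 - 1)
        size = i - start + 1
        # Cut on a content-defined boundary, or force a cut at MAX_SIZE.
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

         | Because a boundary depends only on recently seen bytes, an
         | insertion in the middle of a file changes the chunks around the
         | edit while later boundaries typically resynchronize, which is
         | what enables deduplication.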
        
           | oconnor663 wrote:
           | As far as I know, most CDC schemes requires a single-threaded
           | pass over the whole file to find the chunk boundaries? (You
           | can try to "jump to the middle", but usually there's an upper
           | bound on chunk length, so you might need to backtrack
           | depending on what you learn later about the last chunk you
           | skipped?) The more cores you have, the more of a bottleneck
           | that becomes.
        
             | EdSchouten wrote:
             | You can always use a divide and conquer strategy to compute
             | the chunks. Chunk both halves of the file independently.
             | Once that's done, you redo the chunking around the midpoint
             | of the file forward, until it starts to match the chunks
             | obtained previously.
        
       | mg794613 wrote:
       | "Though I suck at it, my go-to language for side-projects is
       | always Rust"
       | 
       | Hmm, don't be so hard on yourself!
       | 
       |  _proceeds to call ls from rust_
       | 
       | Ok, never mind, although I don't think Rust is the issue here.
       | 
       | (Tony I'm joking, thanks for the article)
        
       | sublinear wrote:
       | > If I were to do this again, I would probably use a well-defined
       | language like yaml or json to store object information.
       | 
       | I know this is only meant to be an educational project, but
       | please avoid yaml (especially for anything generated). It may be
       | a superset of json, but that should strongly suggest that json is
       | enough.
       | 
       | I am aware I'm making a decade old complaint now, but we already
       | have such an absurd mess with every tool that decided to prefer
       | yaml (docker/k8s, swagger, etc.) and it never got any better.
       | Let's not make that mistake again.
       | 
       | People just learned to cope or avoid yaml where they can, and
       | luckily these are such widely used tools that we have plenty of
       | boilerplate examples to cheat from. A new tool lacking docs or
       | examples that only accepts yaml would be anywhere from mildly
       | frustrating to borderline unusable.
        
       | holoduke wrote:
       | I wonder if in the near future there will be no tools anymore in
       | the sense we know them. Maybe you will describe the tool you need
       | and it's created on the fly.
        
       | ofou wrote:
       | btw, you can change the hashing algorithm in git easily
        
       | smangold wrote:
       | Tony nice work!
        
       | b1temy wrote:
       | Nice work, it's always interesting to see how one would design
       | their own VCS from scratch, and see if they fall into problems
       | existing implementations fell into in the past and if the same
       | solution was naturally reached.
       | 
       | The `tvc ls` command seems to always recompute the hash for every
       | non-ignored file in the directory and its children. Based on the
       | description in the blog post, it seems the same/similar thing is
       | happening during commits as well. I imagine such an operation
       | would become expensive in a giant monorepo with many many files,
       | and perhaps a few large binary files thrown in.
       | 
       | I'm not sure how git handles it (if it even does, but I'm sure it
       | must). Perhaps it caches the hash somewhere in the
       | `.git` directory, and only updates it if it senses the file hash
       | changed (Hm... If it can't detect this by re-hashing the file and
       | comparing it with a known value, perhaps by the timestamp the
       | file was last edited?).
       | 
       | > Git uses SHA-1, which is an old and cryptographically broken
       | algorithm. This doesn't actually matter to me though, since I'll
       | only be using hashes to identify files by their content; not to
       | protect any secrets
       | 
       | This _should_ matter to you in any case, even if it is "just to
       | identify files". If hash collisions (See: SHAttered, dating back
       | to 2017) were to occur, an attacker could, for example, have two
       | scripts uploaded in a repository, one a clean benign script, and
       | another malicious script with the same hash, perhaps hidden away
       | in some deeply nested directory, and a user pulling the script
       | might see the benign script but actually pull in the malicious
       | script. In practice, I don't think this attack has ever happened
       | in git, even with SHA-1. Interestingly, it seems that git itself
       | is considering switching to SHA-256 as of a few months ago
       | https://lwn.net/Articles/1042172/
       | 
       | I've not personally heard the process of hashing also called
       | digesting, though I don't doubt that it is the case. I'm mostly
       | familiar with the resulting hash being referred to as the message
       | digest. Perhaps it's to differentiate the verb 'hash' (the
       | process of hashing) from the output 'hash' (the result of
       | hashing). And naming the function `sha256::try_digest` makes it
       | more explicit that it is returning the hash/digest. But that is a
       | bit of a reach; perhaps they are just synonyms to be used
       | interchangeably, as you said.
       | 
       | On a tangent, why were TOML files not considered at the end? I've
       | no skin in the game and don't really mind either way, but I'm
       | just curious since I often see Rust developers gravitate to that
       | over YAML or JSON, presumably because it is what Cargo uses for
       | its manifest.
       | 
       | --
       | 
       | Also, obligatory mention of jujutsu/jj since it seems to always
       | be mentioned when talking of a VCS in HN.
        
         | TonyStr wrote:
         | You are completely right about tvc ls recomputing each hash,
         | but I think it has to do this? A timestamp wouldn't be
         | reliable, so the only reliable way to verify a file's contents
         | would be to generate a hash.
         | 
         | In my lazy implementation, I don't even check if the hashes
         | match, the program reads, compresses and tries to write the
         | unchanged files. This is an obvious area to improve performance
         | on. I've noticed that git speeds up object lookups by
         | generating two-letter directories from the first two letters in
         | hashes, so objects aren't actually stored as
         | `.git/objects/asdf12ha89k9fhs98...`, but as
         | `.git/objects/as/df12ha89k9fhs98...`.
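         | That fan-out is easy to sketch (a hypothetical helper for
         | illustration, not tvc's actual code):

```python
import os

def loose_object_path(obj_hash: str, objects_dir: str = ".git/objects") -> str:
    """Map a hash to git's fan-out layout: the first two hex characters
    become a subdirectory, the rest the filename."""
    return os.path.join(objects_dir, obj_hash[:2], obj_hash[2:])
```

         | With at most 256 two-character buckets, no single directory
         | has to hold every loose object, which keeps directory lookups
         | fast on filesystems that scan entries linearly.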
         | 
         | > why were TOML files not considered at the end
         | 
         | I'm just not that familiar with TOML. Maybe that would be a
         | better choice! I
         | saw another commenter who complained about yaml. Though I would
         | argue that the choice doesn't really matter to the user, since
         | you would never actually write a commit object or a tree object
         | by hand. These files are generated by git (or tvc), and only
         | ever read by git/tvc. When you run `git cat-file <hash>`,
         | you'll have to add the `-p` flag (--pretty) to render it in a
         | human-readable format, and at that point it's just a matter of
         | taste whether it's shown in yaml/toml/json/xml/special format.
        
           | b1temy wrote:
           | > A timestamp wouldn't be reliable
           | 
           | I agree, but I'm still iffy on reading all files (already an
           | expensive operation) in the repository, then hashing every
           | one of them, every time you do an ls or a commit. I took a
           | quick look and git seems to check whether it needs to
           | recalculate the hash based on a combination of the
           | modification timestamp and if the filesize has changed, which
           | is not foolproof either since the timestamp can be modified,
           | and the filesize can remain the same and just have different
           | contents.
           | 
           | I'm not too sure how to solve this myself. Apparently this is
           | a known thing in git and is called the "racy git" problem
           | https://git-scm.com/docs/racy-git/ But to be honest, perhaps
           | I'm biased from working in a large repository, but I'd rather
           | the tradeoff of not rehashing often, rather than suffer the
           | rare case of a file being changed without modifying its
           | timestamp, whilst remaining the same size. (I suppose this
           | might have security implications if an attacker were to place
           | such a file into my local repository, but at that point,
           | having them have access to my filesystem is a far larger
           | problem...)
           | 
           | > I'm just not that familiar with toml... Though I would
           | argue that the choice doesn't really matter to the user,
           | since you would never actually write...
           | 
           | Again, I agree. At best, _maybe_ it would be slightly nicer
           | for a developer or a power user debugging an issue, if they
           | prefer the toml syntax, but ultimately, it does not matter
           | much what format it is in. I mainly asked out of curiosity
           | since your first thoughts were to use yaml or json, when I
           | see (completely empirically) most Rust devs prefer toml,
           | probably because of familiarity with Cargo.toml. Which, by
           | the way, I see you use too in your repository (As to be
           | expected with most Rust projects), so I suppose you must be
           | at least a little bit familiar with it, at least from a user
           | perspective. But I suppose you likely have even more
           | experience with yaml and json, which is why it came to mind
           | first.
        
             | TonyStr wrote:
             | > ...based on a combination of the modification timestamp
             | and if the filesize has changed
             | 
             | Oh that is interesting. I feel like the only way to get a
             | better and more reliable solution to this would be to have
             | the OS generate a hash each time the file changes, and
             | store that in file metadata. This seems like a reasonable
             | feature for an OS to me, but I don't think any OS does
             | this. Also, it would force programs to rely on whichever
             | hashing algorithm the OS uses.
        
               | b1temy wrote:
               | >... have the OS generate a hash each time the file
               | changes...
               | 
               | I'm not sure I would want this either tbh. If I have a
               | 10GB file on my filesystem, and I want to fseek to a
               | specific position in the file and just change a single
               | byte, I would probably not want it to re-hash the entire
               | file, which will probably take a minute longer compared
               | to not hashing the file. (Or maybe it's fine and it's
               | fast enough on modern systems to do this every time a
                | file is modified by any program running; I don't know
                | how much this would impact performance.)
               | 
               | Perhaps a higher resolution timestamp by the OS might
               | help though, for decreasing the chance of a file having
               | the exact same timestamp (unless it was specifically
               | crafted to have been so).
        
       | athrowaway3z wrote:
       | I do wonder if the compression step makes sense at this layer
       | instead of the filesystem layer.
        
         | aabbcc1241 wrote:
         | Interesting take. I'm using btrfs (instead of ext4) with
         | compression enabled (using zstd), so most of the files are
         | compressed "transparently" - the files appear as normal files
         | to the applications, but on disk they are compressed, and the
         | applications don't need to do the compression/decompression.
        
       | quijoteuniv wrote:
       | Now... if you reinvent Linux, you are closer to being compared to LT
        
       | oldestofsports wrote:
       | Nice job, great article!
       | 
       | I had a go at it as well a while back, I call it "shit"
       | https://github.com/emanueldonalds/shit
        
         | tpoacher wrote:
         | THE shit, in fact.
        
         | hahahahhaah wrote:
         | Fast Useful Change Keeper
        
       | astinashler wrote:
       | Does this git include empty folders? It always annoys me that
       | git doesn't track empty folders.
        
         | TonyStr wrote:
         | yep! Had to check to be sure:
         | 
         |     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
         |     Running `target/debug/tvc decompress f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5`
         | 
         |     === args ===
         |     decompress f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5
         | 
         |     === tvcignore ===
         |     ./target
         |     ./.git
         |     ./.tvc
         | 
         |     === subcommand ===
         |     decompress
         |     ------------------
         |     tree ./src/empty-folder e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
         |     blob ./src/main.rs fdc4ccaa3a6dcc0d5451f8e5ca8aeac0f5a6566fe32e76125d627af4edf2db97
        
           | woodrowbarlow wrote:
           | huh, cool. what happens if you use vanilla-git to clone a
           | repo that contains empty folders? and do forges like github
           | display them properly?
        
         | lucasoshiro wrote:
         | Actually, the Git data model supports empty directories,
         | however, the index doesn't since it only maps names to files
         | but not to directories. You can even create a commit with an
         | empty root directory using --allow-empty, and it will use the
         | hardcoded empty tree object
         | (4b825dc642cb6eb9a060e54bf8d69288fbee4904).
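         | That hardcoded ID falls straight out of git's object encoding:
         | an object is hashed as "<type> <size>\0" followed by the body,
         | and the empty tree has no body, so its ID is just the hash of
         | the header:

```python
import hashlib

# SHA-1 of the bare header "tree 0\x00" (empty body) yields git's
# well-known empty tree object ID.
empty_tree = hashlib.sha1(b"tree 0\x00").hexdigest()
print(empty_tree)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```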
        
       | brendoncarroll wrote:
       | Me too. Version control is great, it should get more use outside
       | of software.
       | 
       | https://github.com/gotvc/got
       | 
       | Notable differences: E2E encryption, parallel imports (Got will
       | light up all your cores), and a data structure that supports
       | large files and directories.
        
         | DASD wrote:
         | Nice! Not sure if you're aware of Got(Game of Trees) that
         | appears to pre-date your Got.
         | 
         | https://gameoftrees.org/index.html
        
           | brendoncarroll wrote:
            | Yes, the author reached out. There has not yet been any
            | confusion among real users that I am aware of.
           | 
           | https://github.com/gotvc/got/issues/20
        
         | rtkwe wrote:
         | The problem is when you move beyond text files it gets hard to
         | tell what changes between two versions without opening both
         | versions in whatever program they come from and comparing.
        
           | brendoncarroll wrote:
           | > The problem is when you move beyond text files it gets hard
           | to tell what changes between two versions without opening
           | both versions in whatever program they come from and
           | comparing.
           | 
           | Yeah, totally agree. Got has not solved conflict resolution
           | for arbitrary files. However, we can tell the user where the
           | files differ, and that the file has changed.
           | 
           | There is still value in being able to import files and
           | directories of arbitrary sizes, and having the data
           | encrypted. This is the necessary infrastructure to be able to
           | do distributed version control on large amounts of private
           | data. You can't do that easily with Git. It's very clunky
           | even with remote helpers and LFS.
           | 
           | I talk about that in the Why Got? section of the docs.
           | 
           | https://github.com/gotvc/got/blob/master/doc/1.1_Why_Got.md
        
       | direwolf20 wrote:
       | Cool. When you reimplement something, it forces you to see the
       | fractal complexity of it.
        
       | temporallobe wrote:
       | Reminds me of when I tried to invent a SPA framework. So much
       | hidden complexity I hadn't thought of and I found myself going
       | down rabbit holes that I am sure the creators of React and
       | Angular went down. Git seems to be like this and I am often
       | reminded of how impressive it is at hiding underlying complexity.
        
         | alsetmusic wrote:
         | > at hiding underlying complexity.
         | 
         | It's only in the context of recreating Git that this comment
         | makes sense.
        
       | lasgawe wrote:
       | nice work! This is one of the best ways to deeply learn
       | something, reinvent the wheel yourself.
        
       | gkbrk wrote:
       | CodeCrafters has an amazing "Build your own Git" [1] tutorial
       | too. Jon Gjengset has a nice video [2] doing this challenge live
       | with Rust.
       | 
       | [1]: https://app.codecrafters.io/courses/git/overview
       | 
       | [2]: https://www.youtube.com/watch?v=u0VotuGzD_w
        
       | smekta wrote:
       | ...with blackjacks, and hookers
        
       | KolmogorovComp wrote:
       | It's really a shame that git's storage uses files as the unit
       | of storage. That's what makes it a poor fit for many small
       | files, or for large files.
       | 
       | Content-based chunking like Xethub uses really should become the
       | default. It's not like it's new either, rsync is based on it.
       | 
       | https://huggingface.co/blog/xethub-joins-hf
        
       | jonny_eh wrote:
       | Why introduce yet another ignore file? Can you have it read
       | .gitignore if .tvcignore is missing?
        
       | bryan2 wrote:
       | Ftr you can make repos with sha256 now.
       | 
       | I wonder if signing sha-1 mitigates the threat of using an
       | outdated hash.
        
       ___________________________________________________________________
       (page generated 2026-01-28 07:01 UTC)