[HN Gopher] I made my own Git
___________________________________________________________________
I made my own Git
Author : TonyStr
Score : 350 points
Date : 2026-01-27 10:55 UTC (20 hours ago)
(HTM) web link (tonystr.net)
(TXT) w3m dump (tonystr.net)
| kgeist wrote:
| >The hardest part about this project was actually just parsing.
|
| How about using sqlite for this? Then you wouldn't need to parse
| anything, just read/update tables. Fast indexing out of the box,
| too.
| grenran wrote:
| that would be what https://fossil-scm.org/ is
| TonyStr wrote:
| Very interesting. Looks like fossil has made some unique
| design choices that differ from git[0]. Has anyone here used
| it? I'd love to hear how it compares.
|
| [0] https://fossil-scm.org/home/doc/trunk/www/fossil-v-
| git.wiki#...
| embedding-shape wrote:
| Used it on and off mainly to check it out, but always in a
| personal/experimental capacity. Never managed to convince
| any teams to give it a try, mostly because git doesn't tend
| to get in the way, so it's hard to justify learning something
| completely new.
|
| I really enjoy how local-first it is, as someone who
| sometimes works without an internet connection. That the data
| around "work" is part of the SCM as well, not just the
| code, makes a lot of sense to me at a high-level, and many
| times I wish git worked the same...
| usrbinbash wrote:
| I mean, git is just as "local-first" (a git repo is just
| a directory after all), and the standard git-toolchain
| includes a server, so...
|
| But yeah, fossil is interesting, and it's a crying shame
| it's not better known, for the exact reasons you point
| out.
| embedding-shape wrote:
| > I mean, git is just as "local-first" (a git repo is
| just a directory after all), and the standard git-
| toolchain includes a server, so...
|
| It isn't though, Fossil integrates all the data around
| the code too in the "repository", so issues, wiki,
| documentation, notes and so on are all together, not like
| in git where most commonly you have those things on
| another platform, or you use something like `git notes`
| which has maybe 10% of the features of the respective
| Fossil feature.
|
| It might be useful to scan through the list of features
| of Fossil and dig into it, because it does a lot more
| than you seem to think :) https://fossil-
| scm.org/home/doc/trunk/www/index.wiki
| adastra22 wrote:
| Those things exist for git too, e.g. git-bug. But the
| first-class way to do it in git is email.
| embedding-shape wrote:
| Email isn't a wiki, bug tracking, documentation and all
| the other stuff Fossil offers as part of their core
| design. The point is for it to be in one place, and
| local-first.
|
| If you don't trust me, read the list of features and give
| it a try yourself: https://fossil-
| scm.org/home/doc/trunk/www/index.wiki
| adastra22 wrote:
| I am aware of fossil. Did you look up git-bug?
| smartmic wrote:
| I use Fossil extensively, but only for personal projects.
| There are specific design decisions, such as no rebasing
| [0], and overall, it is simpler yet more useful to me.
| However, I think Fossil is better suited for projects
| governed under the cathedral model than the bazaar model.
| It's great for self-hosting, and the web UI is excellent
| not only for version control, but also for managing a
| software development project. However, if you want a low
| barrier to integrating contributions, Fossil is not as good
| as the various Git forges out there. You have to either
| receive patches or Fossil bundles via email or forum, or
| onboard/register contributors as developers with quite wide
| repo permissions.
|
| [0]: https://fossil-
| scm.org/home/doc/trunk/www/rebaseharm.md
| toyg wrote:
| Sounds like a more modern cvs/Subversion
| chungy wrote:
| It was developed primarily to replace SQLite's CVS
| repository, after all. They used CVSTrac as the forge and
| Fossil was designed to replace that component too.
| graemep wrote:
| I like it but the problem is everyone else already knows
| git and everything integrates with git.
|
| It is very easy to self host.
|
| Not having staging is awkward at first but works well once
| you get used to it.
|
| I prefer it for personal projects. I think it's better for
| small teams if people are willing to adjust, but I have not
| had enough opportunities to try it.
| TonyStr wrote:
| Is it possible to commit individual files, or specific
| lines, without a staging area? I guess this might be
| against Fossil's ethos, and you're supposed to just
| commit everything every time?
| jact wrote:
| You can commit individual files.
| graemep wrote:
| Yes you can list specific files, but you have to list
| them all in the commit command.
|
| I think the ethos is to discourage it.
|
| It does not seem to be possible to commit just specific
| lines.
| jact wrote:
| I use Fossil extensively for all my personal projects and
| find it superior for the general case. As others said it's
| more suited for small projects.
|
| I also use Fossil for lots of weird things. I created a
| forum game using Fossil's ticket and forum features because
| it's so easy to spin up and for my friends to sign in to.
|
| At work we ended up using Fossil in production to manage
| configuration and deployment in a highly locked down
| customer environment where its ability to run as a single
| static binary, talk over HTTP without external
| dependencies, etc. was essential. It was a poor man's
| deployment tool, but it performed admirably.
|
| Fossil even works well as a blogging platform.
| dchest wrote:
| While Fossil uses SQLite for underlying storage (instead of
| the filesystem directly) and various support infrastructure,
| its actual format is not based on SQLite: https://fossil-
| scm.org/home/doc/trunk/www/fileformat.wiki
|
| It's basically plaintext. Even deltas are plaintext for text
| files.
|
| Reason: "The global state of a fossil repository is kept
| simple so that it can endure in useful form for decades or
| centuries. A fossil repository is intended to be readable,
| searchable, and extensible by people not yet born."
| storystarling wrote:
| SQLite solves the storage layer but I suspect you run into a
| pretty big impedance mismatch on the graph traversals. For
| heavy DAG operations like history rewriting, a custom structure
| seems way more efficient than trying to model that
| relationally.
| SQLite wrote:
| The Common Table Expression feature of SQL is very good at
| walking graphs. See, for example
| <https://sqlite.org/lang_with.html#queries_against_a_graph>.
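|
| As a rough sketch (my own toy schema here, not anything from
| the linked page): a commit DAG stored as an edge table can be
| walked with a single recursive query, e.g. via the rusqlite
| crate:
|
|     use rusqlite::Connection;
|
|     fn main() -> rusqlite::Result<()> {
|         let conn = Connection::open_in_memory()?;
|         conn.execute_batch(
|             "CREATE TABLE parents(commit_id TEXT, parent_id TEXT);
|              INSERT INTO parents VALUES ('c3','c2'),('c2','c1');",
|         )?;
|         // Walk all ancestors of 'c3', like a bare-bones `git log`.
|         let mut stmt = conn.prepare(
|             "WITH RECURSIVE ancestors(id) AS (
|                SELECT 'c3'
|                UNION
|                SELECT parent_id FROM parents, ancestors
|                 WHERE parents.commit_id = ancestors.id)
|              SELECT id FROM ancestors",
|         )?;
|         for id in stmt.query_map([], |r| r.get::<_, String>(0))? {
|             println!("{}", id?);
|         }
|         Ok(())
|     }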
| prakhar1144 wrote:
| I was also playing around with the ".git" directory - ended up
| writing:
|
| "What's inside .git ?" - https://prakharpratyush.com/blog/7/
| sluongng wrote:
| Zstd dictionary compression is essentially how Meta's Mercurial
| fork (Sapling VCS) stores blobs https://sapling-
| scm.com/docs/dev/internals/zstdelta. The source code is available
| in GitHub if folks want to study the tradeoffs vs git delta-
| compressed packfiles.
|
| I think that, theoretically, Git delta-compression is still a lot
| more optimized for smaller repos. But for bigger repos where
| sharded storage is required, path-based delta dictionary compression
| does much better. Git recently (in the last 1 year) got something
| called "path-walk" which is fairly similar though.
| darkryder wrote:
| Great writeup! It's always fun to learn the details of the tools
| we use daily.
|
| For others, I highly recommend Git from the Bottom Up[1]. It is a
| very well-written piece on internal data structures and does a
| great job of demystifying the opaque git commands that most
| beginners blindly follow. Best thing you'll learn in 20ish
| minutes.
|
| 1. https://jwiegley.github.io/git-from-the-bottom-up/
| spuz wrote:
| Thanks - I think this is the article I was thinking of that
| really helped me to understand git when I first started using
| it back in the day. I tried to find it again and couldn't.
| MarsIronPI wrote:
| Oh, I hadn't ever seen that one. I "grokked" Git thanks to The
| Git Parable[0] several years ago.
|
| [0]: https://tom.preston-werner.com/2009/05/19/the-git-parable
| sanufar wrote:
| Ooh, this looks fun! I didn't know you could cat-file on a hash
| id, that's actually quite cool.
| heckelson wrote:
| gentle reminder to set your website's `<title>` to something
| descriptive :)
| TonyStr wrote:
| haha, thank you. Added now :-)
| teiferer wrote:
| If you ever wonder how coding agents know how to plan things etc,
| this is the kind of article they get this training from.
|
| Ends up being circular if the author used LLM help for this
| writeup, though there are no obvious signs of that.
| wasmainiac wrote:
| Maybe we can poison LLMs with loops of 2 or more
| self-referencing blogs.
| jdiff wrote:
| Only need one, they're not thinking critically about the
| media they consume during training.
| falcor84 wrote:
| Here's a sad prediction: over the coming few years, AIs
| will get significantly better at critical evaluation of
| sources, while humans will get even worse at it.
| topaz0 wrote:
| My sad prediction is that LLMs and humans will both get
| worse. Humans might get worse faster though.
| whstl wrote:
| I wish I could disagree with you, but what I'm seeing on
| average (especially at work) is exactly that: people
| asking ChatGPT stuff and accepting hallucinations as
| fact, and then fighting me when I say it's not true.
| prmoustache wrote:
| There is "death by GPS" for people dying after blindly
| following their GPS instructions. There will definitely be
| a "death by AI" expression very soon.
| stevekemp wrote:
| Tesla-related fatalities probably count already, albeit
| without that label/name.
| sailfast wrote:
| Hot take: Humans have always been bad at this (in the
| aggregate, without training). Only a certain percentage
| of the population took the time to investigate.
|
| For most throughout history, whatever is presented to you
| that you believe is the right answer. AI just brings them
| source information faster so what you're seeing is mostly
| just the usual behavior, but faster. Before AI people
| would not have bothered to try and figure out an answer
| to some of these questions. It would've been too much
| work.
| keybored wrote:
| HN commenters will be techno-optimistic misanthropes.
| Status quo ante bellum.
| andy_ppp wrote:
| The secret sauce about having good understanding, taste and
| style (both for coding and writing) has always been in the
| fine-tuning and RLHF steps. I'd be skeptical that the signals
| a few GitHub repos or blogs generate at the initial stages
| of the learning are that critical. There's probably a
| filter also for good taste on the initial training set and
| these are so large not even a single full epoch is done on
| the data these days.
| jama211 wrote:
| It wouldn't work at all.
| jama211 wrote:
| I see the AI hating part of HN has come out again
| mexicocitinluez wrote:
| > Ends up being circular if the author used LLM help for this
| writeup though there are no obvious signs of that.
|
| Great argument for not using AI-assisted tools to write blog
| posts (especially if you DO use these tools). I wonder how much
| we're taking for granted in these early phases before it starts
| to eat itself.
| jama211 wrote:
| What does eating itself even look like? It doesn't take much
| salt to change a hash.
| mexicocitinluez wrote:
| Being trained on its own results?
| anu7df wrote:
| I understand model output put back into training would be an
| issue, but if model output is guided by multiple prompts and
| edited by the author to his/her liking, wouldn't that at least
| be marginally useful?
| TonyStr wrote:
| Interestingly, I looked at github insights and found that this
| repo had 49 clones, and 28 unique cloners, before I published
| this article. I definitely did not clone it 49 times, and
| certainly not with 28 unique users. It's unlikely that the
| handful of friends who follow me on github all cloned the repo.
| So I can only speculate that there are bots scraping new public
| github repos and training on everything.
|
| Maybe that's obvious to most people, but it was a bit
| surprising to see it myself. It feels weird to think that LLMs
| are being trained on my code, especially when I'm painfully
| aware of every corner I'm cutting.
|
| The article doesn't contain any LLM output. I use LLMs to ask
| for advice on coding conventions (especially in rust, since I'm
| bad at it), and sometimes as part of research (zstd was
| suggested by chatgpt along with comparisons to similar
| algorithms).
| nerdponx wrote:
| Time to start including deliberate bugs. The correct version
| is in a private repository.
| teiferer wrote:
| And what purpose would this serve, exactly?
| adastra22 wrote:
| Spite.
| program_whiz wrote:
| while I think this is a fun idea -- we are in such a
| dystopian timeline that I fear you will end up being
| prosecuted under a digital equivalent of various laws like
| "why did you attack the intruder instead of fleeing" or
| "you can't simply remove a squatter because its your house,
| therefore you get an assault charge."
|
| A kind of "they found this code, therefore you have a duty
| not to poison their model as they take it." Meanwhile if I
| scrape a website and discover data I'm not supposed to see
| (e.g. bank details being publicly visible) then I will go
| to jail for pointing it out. :(
| wredcoll wrote:
| Look, I get the fantasy of someday pulling out my
| musket^W ar15 and rushing downstairs to blow away my
| wife^W an evil intruder, but, like, we live in a society.
| And it has a lot of benefits, but it does mean you don't
| get to be "king of your castle" any more.
|
| Living in a country with hundreds of millions of other
| civilians or a city with tens of thousands means
| compromising what you're allowed to do when it affects
| other people.
|
| There's a reason we have attractive nuisance laws and you
| aren't allowed to put a slide on your yard that
| electrocutes anyone who touches it.
|
| None of this, of course, applies to "poisoning" llms,
| that's whatever. But all your examples involved actual
| humans being attacked, not some database.
| program_whiz wrote:
| Thanks that was the term I was looking for "attractive
| nuisance". I wouldn't be surprised if a tech company
| could make that case -- this user caused us tangible harm
| and cost (training, poisoned models) and left their data
| out for us to consume. It's the equivalent of putting
| poison candy on a park table your honor!
| teo_zero wrote:
| That reminds me of the protagonist of Charles Stross's
| novel "Accelerando", a prolific inventor who is accused
| by the IRS of having caused millions in losses because he
| releases all his ideas in the public domain instead of
| profiting from them and paying taxes on such profits.
| nerdponx wrote:
| I think if we're at the point where posting deliberate
| mistakes to poison training data is considered a crime,
| we would be far far far down the path of authoritarian
| corporate regulatory capture, much farther than we are
| now (fortunately).
| below43 wrote:
| They used to do this with maps - e.g. fake islands - to pick
| up when they were copied.
| Phelinofist wrote:
| I selfhost Gitea. The instance is crawled by AI crawlers
| (checked the IPs). They never cloned, they just browse and
| take it directly from there.
| Zambyte wrote:
| i run a cgit server on an r720 in my apartment with my code
| on it and that puppy screams whenever sam wants his code
|
| blocking openai ips did wonders for the ambient noise
| levels in my apartment. they're not the only ones
| obviously, but they're the only ones i had to block to
| stay sane
| MarsIronPI wrote:
| Have you considered putting it behind Anubis or an
| equivalent?
| Zambyte wrote:
| Yes, but I haven't and would prefer not to
| MarsIronPI wrote:
| Understandable. It's an outrage that we even have to
| consider such measures.
| Phelinofist wrote:
| For reference, this is how I do it in my Caddyfile:
|
|     (block_ai) {
|         @ai_bots {
|             header_regexp User-Agent (?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)
|         }
|         abort @ai_bots
|     }
|
| Then, in a specific app block, include it via:
|
|     import block_ai
| zaphar wrote:
| I have almost exactly this in my own caddyfile :-D The
| order of the items in the regex is a little different but
| mostly the same items. I just pulled them from my web
| access logs over time and update it every once in a
| while.
| seba_dos1 wrote:
| Most of them pretend to be real users though and don't
| identify themselves with their user agent strings.
| tonnydourado wrote:
| Particularly on GitHub, might not even be LLMs, just regular
| bots looking for committed secrets (AWS keypairs, passwords,
| etc.)
| 0x696C6961 wrote:
| This has been happening before LLMs too.
| teiferer wrote:
| I don't really get why they need to clone in order to scrape
| ...?
|
| > It feels weird to think that LLMs are being trained on my
| code, especially when I'm painfully aware of every corner I'm
| cutting.
|
| That's very much expected. That's why the quality of LLM
| coding agents is like it is. (No offense.)
|
| The "asking LLMs for advice" part is where the circular
| aspect starts to come into the picture. Not worse than
| looking at StackOverflow though which then links to other
| people who in turn turned to StackOverflow for advice.
| adastra22 wrote:
| The quality of LLM coding agents is pretty good now.
| storystarling wrote:
| Cloning gets you the raw text objects directly. If you
| scrape the web UI you're dealing with a lot of markup
| overhead that just burns compute during ingestion. For
| training data you usually want the structure to be as clean
| as possible from the start.
| teiferer wrote:
| Sure, cloning a local copy. But why clone _on github_?
| prodigycorp wrote:
| Random aside about training data:
|
| One of the funniest things I've started to notice from Gemini
| in particular is that in random situations, it talks
| English with an agreeable affect that I can only describe as...
| Indian? I've never noticed such a thing leak through before.
| There must be a _ton_ of people in India who are generating new
| datasets for training.
| blenderob wrote:
| That's very interesting. Any examples you can share which
| have that agreeable affect?
| prodigycorp wrote:
| I'm going to do a cursory look through my antigrav history;
| I want to find it too. I remember it's primarily in the
| exclamations of agreement/revelation, and one time
| expressing concern, which I remember were slightly off from
| natural for an American English speaker.
| prodigycorp wrote:
| Can't find anything, too many messages telling the agent
| "please do NOT _those_ c changes". I'm going to remember
| to save them going forward.
| evntdrvn wrote:
| There was a really great article or blog post published in
| the last few months about the author's very personal
| experience whose gist was "People complain that I sound/write
| like an LLM, but it's actually the inverse because I grew up
| in X where people are taught formal English to sound
| educated/western, and those areas are now heavily used for
| LLM training."
|
| I wish I could find it again, if someone else knows the link
| please post it!
| gxnxcxcx wrote:
| _I 'm Kenyan. I don't write like ChatGPT, ChatGPT writes
| like me_
|
| https://news.ycombinator.com/item?id=46273466
| tverbeure wrote:
| Thanks for that link.
|
| This part made me laugh though:
|
| > These detectors, as I understand them, often work by
| measuring two key things: 'Perplexity' and 'burstiness'.
| Perplexity gauges how predictable a text is. If I start a
| sentence, "The cat sat on the...", your brain, and the
| AI, will predict the word "floor."
|
| I can't be the only one whose brain predicted "mat"?
| cozzyd wrote:
| And I thought it would be a hat...
| awesome_dude wrote:
| I've been critical of people that default to "an em dash
| being used means the content is generated by an LLM", or,
| "they've numbered their points, must be an LLM"
|
| I do know that LLMs generate content heavy with those
| constructs, but they didn't create the ideas out of thin
| air, it was in the training set, and existed strongly
| enough that LLMs saw it as commonplace/best practice.
| sneela wrote:
| > If you want to look at the code, it's available on github.
|
| Why not tvc-hub :P
|
| Jokes aside, great write up!
| TonyStr wrote:
| haha, maybe that's the next project. It did feel weird to make
| git commits at the same time as I was making tvc commits
| igorw wrote:
| Random but y'all might enjoy. Git client in PHP, supports reading
| packfiles, reftables, diff via LCS. Written by hand.
|
| https://github.com/igorwwwwwwwwwwwwwwwwwwww/gipht-horse
| nasretdinov wrote:
| Nice! This repo is a huge W for PHP I'd say.
|
| P.S. Didn't know that plain '@' can be used instead of HEAD,
| but I guess it makes sense since you can omit both left and
| right parts of the expressions separated by '@'
| h1fra wrote:
| Learning git internals was definitely the moment it became clear
| to me how efficient and smart git is.
|
| And this way of versioning can be reused in other fields: as
| soon as you have some kind of graph of data that can be
| modified independently but read all together, it makes sense.
| p4bl0 wrote:
| Nice post :). It made me think of _ugit: DIY Git in Python_ [1]
| which is still by far my favorite post of this kind. It
| really goes deep into Git internals while managing to stay easy
| to follow along the way.
|
| [1] https://www.leshenko.net/p/ugit/
| TonyStr wrote:
| This page is beautiful!
|
| Bookmarked for later
| mfashby wrote:
| In a similar vein: Write yourself a Git was fun to follow
| https://wyag.thb.lt/
| UltraSane wrote:
| I mapped git operations to Neo4j and it really helped me
| understand how it works.
| eru wrote:
| > These objects are also compressed to save space, so writing to
| and reading from .git/objects/ will always involve running a
| compression algoritm. Git uses zlib to compress objects, but
| looking at competitors, zstd seemed more promising:
|
| That's a weird thing to put so close to the start. Compression is
| about the least interesting aspect of Git's design.
| alphabetag675 wrote:
| When you are learning, everything is important. I think it is
| okay to cut the person some slack regarding this.
| nasretdinov wrote:
| Nice work! On a complete tangent, Git is the only SCM known to me
| that supports recursive merge strategy [1] (instead of the
| regular 3-way merge), which essentially always remembers resolved
| conflicts without you needing to do anything. This is a very
| underrated feature of Git and somehow people still manage to
| choose rebase over it. If you ever get to implementing merges,
| please make sure you have a mechanism for remembering the
| conflict resolution history :).
|
| [1] https://stackoverflow.com/questions/55998614/merge-made-
| by-r...
| arunix wrote:
| I remember in a previous job having to enable git rerere,
| otherwise it wouldn't remember previously resolved conflicts.
|
| https://git-scm.com/book/en/v2/Git-Tools-Rerere
| nasretdinov wrote:
| I believe rerere is a local cache, so you'd still have to
| resolve the conflicts again on another machine. The recursive
| merge doesn't have this issue -- the conflict resolution
| inside the merge commits is effectively remembered (although
| due to how Git operates it actually never even considers it a
| conflict to be remembered -- just a snapshot of the closest
| state to the merged branches)
| Guvante wrote:
| Are people repeatedly handling merge conflicts on multiple
| machines?
|
| If there was a better way to handle "I needed to merge in
| the middle of my PR work" without introducing reverse
| merges permanently into the history, I wouldn't mind merge
| commits.
|
| But tools will sometimes skip over others' work if you `git
| pull` a change into your local repo, due to getting confused
| about which leg of the merge to follow.
| direwolf20 wrote:
| The recursive merge is about merging branches that already
| have merges in them, while rerere is about repeating the same
| merge several times.
| pyrolistical wrote:
| Would be nice if centralized git platforms shared rerere
| caches
| lmm wrote:
| Rerere is dangerous and counterproductive - it tries to give
| rebase the same functionality that merge has, but since
| rebase is fundamentally wrong it only stacks the wrongness.
| seba_dos1 wrote:
| Cherry-picks being "fundamentally wrong" is certainly an
| interesting git take.
| mkleczek wrote:
| A much more principled (and hence less of a foot-gun) way of
| handling conflicts is making them first-class objects in the
| repository, like https://pijul.org does.
| jcgl wrote:
| Jujutsu too[0]:
|
| > Jujutsu keeps track of conflicts as first-class objects in
| its model; they are first-class in the same way commits are,
| while alternatives like Git simply think of conflicts as
| textual diffs. While not as rigorous as systems like Darcs
| (which is based on a formalized theory of patches, as opposed
| to snapshots), the effect is that many forms of conflict
| resolution can be performed and propagated automatically.
|
| [0] https://github.com/jj-vcs/jj
| PunchyHamster wrote:
| I feel like people making new VCSes should just re-use git's
| storage/network layer and innovate on top of that. Git
| storage is flexible enough for that, and that way you can
| just... use it on existing repos, with a very easy migration
| path for both workflows (CI/CD never needs to care about what
| frontend you use) and users.
| zaphar wrote:
| Git storage is just a merkle tree. It's a technology that's
| been around forever and was simultaneously chosen by more
| than one vcs technology around the same time. It's
| incredibly effective so it makes sense that it would get
| used.
| storystarling wrote:
| The bottleneck with git is actually the on-the-fly packfile
| generation. The server has to burn CPU calculating deltas
| for every clone. For a distributed system it seems much
| better to use a simple content-addressable store where you
| just serve static blobs.
| 3eb7988a1663 wrote:
| It is my understanding that under the hood, the repository
| has quite a bit of state that can get mangled. That is why
| naively syncing a git repo with say Dropbox is not a
| surefire operation.
| theLiminator wrote:
| It's very cool, though I imagine it's DOA due to lack of git
| compatibility...
| speed_spread wrote:
| Lack of current-SCM incumbent compatibility can be an
| advantage. Like Linus decided to explicitly do the reverse
| of every SVN decision when designing git. He even reversed
| CLI usability!
| theLiminator wrote:
| I think the network effects of git are too large to
| overcome now. Hence why we see jj get a lot more adoption
| than pijul.
| rob74 wrote:
| Pssst! I think Linus didn't so much design Git as
| clone BitKeeper (or at least the parts of it he liked).
| I have never used it, but if you look at the BitKeeper
| documentation, it sounds strangely familiar:
| https://www.bitkeeper.org/testdrive.html . Of course,
| that made sense for him and for the rest of the Linux
| developers, as they were already familiar with BitKeeper.
| Not so much for the rest of us though, who are now stuck
| with the usability (or lack thereof) you mentioned...
| p0w3n3d wrote:
| That's something new to me (using git for 10 years, always
| rebased)
| iberator wrote:
| I'm even more lazy. I almost always clone from scratch after
| merging or after not touching the project for some time. So
| easy and silly :)
|
| I always forget all the flags and I work with literally just:
| clone, branch, checkout, push.
|
| (Each feature is a fresh branch tho)
| chungy wrote:
| As far as I understand the problem (sorry, the SO link isn't the
| clearest around), Fossil should support this operation. It does
| one better, since it even tracks exactly where merges come
| from. In Git, you have a merge commit that shows up with more
| than one parent, but Fossil will show you where it branched off
| too.
|
| Take out the last "/timeline" component of the URL to clone via
| Fossil:
| https://chiselapp.com/user/chungy/repository/test/timeline
|
| See also, the upstream documentation on branches and merging:
| https://fossil-scm.org/home/doc/trunk/www/branching.wiki
| ezst wrote:
| On recursive merging, by the author of mercurial
|
| https://www.mercurial-scm.org/pipermail/mercurial/2012-Janua...
| pwdisswordfishy wrote:
| New to me was discovering within the last month that git-merge
| doesn't have a merge strategy of "null": don't try to resolve
| any merge conflicts, because I've already taken care of them;
| just know that this is a merge between the current branch and
| the one specified on the command-line, so be a dutiful little
| tool and just add it to your records. Don't try to "help".
| Don't fuck with the index or the worktree. Just record that
| this is happening. That's it. Nothing else.
| Brian_K_White wrote:
| What does that even mean? There already is reset hard.
| pwdisswordfishy wrote:
| What do you mean, "What does it mean?" It means what I
| wrote.
|
| > There already is reset hard.
|
| That's not... remotely relevant? What does that have to do
| with merging? We're talking about merging.
| Brian_K_White wrote:
| Neither of these are answers or explanations. So you
| said nothing, and then said nothing again.
|
| I also "mean what I wrote". Man, that was sure easy to
| say. It's almost like saying nothing at all. Which is
| anyone's right to do, but it's not an argument, nor a
| definition of terms, nor communication at all. Well, it
| does communicate one thing.
| kbolino wrote:
| The name "null" is confusing; you have to pick something.
| However, I think what is desired here is the "theirs"
| strategy, i.e. to replace the current branch's tree
| entirely with the incoming branch's tree. The end result
| would be similar to a hard reset onto the incoming branch,
| except that it would also create a merge commit.
| Unfortunately, the "theirs" _strategy_ does not exist, even
| though the "ours" strategy does exist, apparently to avoid
| confusion with the "theirs" _option_ [1], but it is
| possible to emulate it with a sequence of commands [2].
|
| [1]: https://git-scm.com/docs/merge-
| strategies#Documentation/merg...
|
| [2]: https://stackoverflow.com/a/4969679/814422
| giancarlostoro wrote:
| I hate git squash; it only goes one direction, and personally
| I don't give a crap if it took you 100 commits to do one
| thing: at least now we can see what you may have tried so we
| don't repeat your mistakes. With git squash it all turns
| into "this is what they last did that mattered", and by the
| way, we can't merge it backwards without it being weird; you
| have to check out an entirely new branch. I like to continue
| adding changes to branches I have already merged. Not every
| PR is the full solution, but a piece of the puzzle. No one
| can tell me that they only need 1 PR per task because they
| never have a bug, ever.
|
| Give me normal boring git merges over git squash merges.
| jrockway wrote:
| sha256 is a very slow algorithm, even with hardware acceleration.
| BLAKE3 would probably make a noticeable performance difference.
|
| Some reading from 2021:
| https://jolynch.github.io/posts/use_fast_data_algorithms/
|
| It is really hard to describe how slow sha256 is. Go sha256 some
| big files. Do you think it's disk IO that's making it take so
| long? It's not, you have a super fast SSD. It's sha256 that's
| slow.
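|
| A quick self-contained way to see it (a sketch using the sha2
| and blake3 crates; the 256 MiB buffer is an arbitrary size):
|
|     use sha2::{Digest, Sha256};
|     use std::time::Instant;
|
|     fn main() {
|         // In-memory data, so disk I/O can't be blamed.
|         let data = vec![0xABu8; 256 * 1024 * 1024];
|
|         let t = Instant::now();
|         let sha = Sha256::digest(&data);
|         println!("sha256: {:?} ({} byte digest)", t.elapsed(), sha.len());
|
|         let t = Instant::now();
|         let b3 = blake3::hash(&data);
|         println!("blake3: {:?} ({})", t.elapsed(), b3);
|     }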
| grumbelbart2 wrote:
| Is that even when using the SHA256 hardware extensions?
| https://en.wikipedia.org/wiki/SHA_instruction_set
| oconnor663 wrote:
| It's mixed. You get something in the neighborhood of a 3-4x
| speedup with SHA-NI, but the algorithm is fundamentally
| serial. Fully parallel algorithms like BLAKE3 and K12, which
| can use wide vector extensions like AVX-512, can be
| substantially faster (10x+) even on one core. And
| multithreading compounds with that, if you have enough input
| to keep a lot of cores occupied. On the other hand, if you're
| limited to one thread and older/smaller vector extensions
| (SSE, NEON), hardware-accelerated SHA-256 can win. It can
| also win in the short input regime where parallelism isn't
| possible (< 4 KiB for BLAKE3).
| EdSchouten wrote:
| It depends on the architecture. On ARM64, SHA-256 tends to be
| faster than BLAKE3. The reasons being that most modern ARM64
| CPUs have native SHA-256 instructions, and lack an equivalent
| of AVX-512.
|
| Furthermore, if your input files are large enough that
| parallelizing across multiple cores makes sense, then it's
| generally better to change your data model to eliminate the
| existence of the large inputs altogether.
|
| For example, Git is somewhat primitive in that every file is a
| single object. In retrospect it would have been smarter to
| decompose large files into chunks using a Content Defined
| Chunking (CDC) algorithm, and model large files as a manifest
| of chunks. That way you get better deduplication. The resulting
| chunks can then be hashed in parallel, using a single-threaded
| algorithm.
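|
| A toy sketch of CDC (illustrative parameters, not any real
| system's): cut wherever a rolling hash of the bytes hits a
| mask, so a small edit only disturbs the chunks it touches:
|
|     /// Toy content-defined chunker: gear-style rolling hash,
|     /// cutting where the low 13 bits are zero (~8 KiB average).
|     fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
|         const MASK: u64 = (1 << 13) - 1;
|         const MIN: usize = 2048; // avoid degenerate tiny chunks
|         let mut cuts = Vec::new();
|         let (mut h, mut start) = (0u64, 0usize);
|         for (i, &b) in data.iter().enumerate() {
|             // Left-shifting ages old bytes out of the hash.
|             h = (h << 1).wrapping_add((b as u64).wrapping_mul(0x9E3779B97F4A7C15));
|             if i - start >= MIN && (h & MASK) == 0 {
|                 cuts.push(i + 1);
|                 start = i + 1;
|             }
|         }
|         if cuts.last() != Some(&data.len()) {
|             cuts.push(data.len());
|         }
|         cuts
|     }
|
| Each chunk can then be hashed independently (and in parallel),
| and a large file becomes a manifest of chunk hashes.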
| oconnor663 wrote:
| As far as I know, most CDC schemes require a single-threaded
| pass over the whole file to find the chunk boundaries? (You
| can try to "jump to the middle", but usually there's an upper
| bound on chunk length, so you might need to backtrack
| depending on what you learn later about the last chunk you
| skipped?) The more cores you have, the more of a bottleneck
| that becomes.
| EdSchouten wrote:
| You can always use a divide and conquer strategy to compute
| the chunks. Chunk both halves of the file independently.
| Once that's done, you redo the chunking around the midpoint
| of the file forward, until it starts to match the chunks
| obtained previously.
| mg794613 wrote:
| "Though I suck at it, my go-to language for side-projects is
| always Rust"
|
| Hmm, don't be so hard on yourself!
|
| _proceeds to call ls from rust_
|
| Ok, never mind. Although I don't think rust is the issue here.
|
| (Tony, I'm joking, thanks for the article)
| sublinear wrote:
| > If I were to do this again, I would probably use a well-defined
| language like yaml or json to store object information.
|
| I know this is only meant to be an educational project, but
| please avoid yaml (especially for anything generated). It may be
| a superset of json, but that should strongly suggest that json is
| enough.
|
| I am aware I'm making a decade old complaint now, but we already
| have such an absurd mess with every tool that decided to prefer
| yaml (docker/k8s, swagger, etc.) and it never got any better.
| Let's not make that mistake again.
|
| People just learned to cope or avoid yaml where they can, and
| luckily these are such widely used tools that we have plenty of
| boilerplate examples to cheat from. A new tool lacking docs or
| examples that only accepts yaml would be anywhere from mildly
| frustrating to borderline unusable.
| holoduke wrote:
| I wonder if in the near future there will be no tools anymore in
| the sense we know them. You will maybe describe the tool you need
| and it's created on the fly.
| ofou wrote:
| btw, you can change the hashing algorithm in git easily
| smangold wrote:
| Tony nice work!
| b1temy wrote:
| Nice work, it's always interesting to see how one would design
| their own VCS from scratch, and see if they fall into problems
| existing implementations fell into in the past and if the same
| solution was naturally reached.
|
| The `tvc ls` command seems to always recompute the hash for every
| non-ignored file in the directory and its children. Based on the
| description in the blog post, it seems the same/similar thing is
| happening during commits as well. I imagine such an operation
| would become expensive in a giant monorepo with many many files,
| and perhaps a few large binary files thrown in.
|
| I'm not sure how git handles it (if it even does, but I'm sure it
| must). Perhaps it caches the hash somewhere in the
| `.git` directory, and only updates it if it senses the file hash
| has changed (hm... if it can't detect this by re-hashing the file and
| comparing it with a known value, perhaps by the timestamp the
| file was last edited?).
|
| > Git uses SHA-1, which is an old and cryptographically broken
| algorithm. This doesn't actually matter to me though, since I'll
| only be using hashes to identify files by their content; not to
| protect any secrets
|
| This _should_ matter to you in any case, even if it is "just to
| identify files". If hash collisions (See: SHAttered, dating back
| to 2017) were to occur, an attacker could, for example, have two
| scripts uploaded in a repository, one a clean benign script, and
| another malicious script with the same hash, perhaps hidden away
| in some deeply nested directory, and a user pulling the script
| might see the benign script but actually pull in the malicious
| script. In practice, I don't think this attack has ever happened
| in git, even with SHA-1. Interestingly, it seems that git itself
| is considering switching to SHA-256 as of a few months ago
| https://lwn.net/Articles/1042172/
|
| I've not personally heard the process of hashing also being
| called digesting, though I don't doubt that it is the case.
| I'm mostly familiar with the resulting hash being referred to
| as the message digest. Perhaps it's to differentiate the
| verb 'hash' (the process of hashing) from the output 'hash'
| (the result of hashing), and naming the function
| `sha256::try_digest` makes it more explicit that it is
| returning the hash/digest. But it is a bit of a reach;
| perhaps they are just synonyms to be used interchangeably, as
| you said.
|
| On a tangent, why were TOML files not considered at the end? I've
| no skin in the game and don't really mind either way, but I'm
| just curious since I often see Rust developers gravitate to that
| over YAML or JSON, presumably because it is what Cargo uses for
| its manifest.
|
| --
|
| Also, obligatory mention of jujutsu/jj since it seems to always
| be mentioned when talking of a VCS in HN.
| TonyStr wrote:
| You are completely right about tvc ls recomputing each hash,
| but I think it has to do this? A timestamp wouldn't be
| reliable, so the only reliable way to verify a file's contents
| would be to generate a hash.
|
| In my lazy implementation, I don't even check if the hashes
| match, the program reads, compresses and tries to write the
| unchanged files. This is an obvious area to improve performance
| on. I've noticed that git speeds up object lookups by
| generating two-letter directories from the first two letters in
| hashes, so objects aren't actually stored as
| `.git/objects/asdf12ha89k9fhs98...`, but as
| `.git/objects/as/df12ha89k9fhs98...`.
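|
| That fan-out path is a one-liner to build (a sketch; this
| `object_path` helper is hypothetical, not git's or tvc's code):
|
|     use std::path::PathBuf;
|
|     /// "d670460b..." -> ".git/objects/d6/70460b..."
|     fn object_path(hash: &str) -> PathBuf {
|         let (dir, file) = hash.split_at(2);
|         PathBuf::from(".git/objects").join(dir).join(file)
|     }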
|
| > why were TOML files not considered at the end
|
| I'm just not that familiar with toml. Maybe that would be a
| better choice! I saw another commenter who complained about
| yaml. Though I would
| argue that the choice doesn't really matter to the user, since
| you would never actually write a commit object or a tree object
| by hand. These files are generated by git (or tvc), and only
| ever read by git/tvc. When you run `git cat-file <hash>`,
| you'll have to add the `-p` flag (--pretty) to render it in a
| human-readable format, and at that point it's just a matter of
| taste whether it's shown in yaml/toml/json/xml/special format.
| b1temy wrote:
| > A timestamp wouldn't be reliable
|
| I agree, but I'm still iffy on reading all files (already an
| expensive operation) in the repository, then hashing every
| one of them, every time you do an ls or a commit. I took a
| quick look and git seems to check whether it needs to
| recalculate the hash based on a combination of the
| modification timestamp and whether the filesize has changed, which
| is not foolproof either since the timestamp can be modified,
| and the filesize can remain the same and just have different
| contents.
|
| I'm not too sure how to solve this myself. Apparently this is
| a known thing in git and is called the "racy git" problem
| https://git-scm.com/docs/racy-git/ But to be honest, perhaps
| I'm biased from working in a large repository, but I'd rather
| the tradeoff of not rehashing often, rather than suffer the
| rare case of a file being changed without modifying its
| timestamp, whilst remaining the same size. (I suppose this
| might have security implications if an attacker were to place
| such a file into my local repository, but at that point,
| having them have access to my filesystem is a far larger
| problem...)
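|
| The stat-cache check itself is tiny (a hedged sketch, not
| git's actual index code; this `CacheEntry` is hypothetical):
|
|     use std::{fs, io, time::SystemTime};
|
|     struct CacheEntry {
|         mtime: SystemTime,
|         size: u64,
|     }
|
|     // Rehash only when the stat data disagrees with the cache;
|     // this inherits the "racy git" caveat described above.
|     fn needs_rehash(path: &str, c: &CacheEntry) -> io::Result<bool> {
|         let m = fs::metadata(path)?;
|         Ok(m.len() != c.size || m.modified()? != c.mtime)
|     }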
|
| > I'm just not that familiar with toml... Though I would
| argue that the choice doesn't really matter to the user,
| since you would never actually write...
|
| Again, I agree. At best, _maybe_ it would be slightly nicer
| for a developer or a power user debugging an issue, if they
| prefer the toml syntax, but ultimately, it does not matter
| much what format it is in. I mainly asked out of curiosity
| since your first thoughts were to use yaml or json, when I
| see (completely empirically) most Rust devs prefer toml,
| probably because of familiarity with Cargo.toml. Which, by
| the way, I see you use too in your repository (As to be
| expected with most Rust projects), so I suppose you must be
| at least a little bit familiar with it, at least from a user
| perspective. But I suppose you likely have even more
| experience with yaml and json, which is why it came to mind
| first.
| TonyStr wrote:
| > ...based on a combination of the modification timestamp
| and if the filesize has changed
|
| Oh that is interesting. I feel like the only way to get a
| better and more reliable solution to this would be to have
| the OS generate a hash each time the file changes, and
| store that in file metadata. This seems like a reasonable
| feature for an OS to me, but I don't think any OS does
| this. Also, it would force programs to rely on whichever
| hashing algorithm the OS uses.
| b1temy wrote:
| >... have the OS generate a hash each time the file
| changes...
|
| I'm not sure I would want this either tbh. If I have a
| 10GB file on my filesystem, and I want to fseek to a
| specific position in the file and just change a single
| byte, I would probably not want it to re-hash the entire
| file, which will probably take a minute longer compared
| to not hashing the file. (Or maybe it's fine and it's
| fast enough on modern systems to do this every time a
| file is modified by any program running, I don't know how
| much this would impact the performance.).
|
| Perhaps a higher resolution timestamp by the OS might
| help though, for decreasing the chance of a file having
| the exact same timestamp (unless it was specifically
| crafted to have been so).
| athrowaway3z wrote:
| I do wonder if the compression step makes sense at this layer
| instead of the filesystem layer.
| aabbcc1241 wrote:
| Interesting take. I'm using btrfs (instead of ext4) with
| compression enabled (using zstd), so most of the files are
| compressed "transparently" - the files appear as normal files
| to the applications, but on disk they are compressed, and the
| applications don't need to do the compression/decompression.
| quijoteuniv wrote:
| Now... if you reinvent Linux, you are closer to being compared to LT
| oldestofsports wrote:
| Nice job, great article!
|
| I had a go at it as well a while back, I call it "shit"
| https://github.com/emanueldonalds/shit
| tpoacher wrote:
| THE shit, in fact.
| hahahahhaah wrote:
| Fast Useful Change Keeper
| astinashler wrote:
| Does this git include empty folders? It has always annoyed me
| that git doesn't track empty folders.
| TonyStr wrote:
| yep! Had to check to be sure:
|
|     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
|     Running `target/debug/tvc decompress f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5`
|
|     === args ===
|     decompress f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5
|
|     === tvcignore ===
|     ./target
|     ./.git
|     ./.tvc
|
|     === subcommand ===
|     decompress
|     ------------------
|     tree ./src/empty-folder e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
|     blob ./src/main.rs fdc4ccaa3a6dcc0d5451f8e5ca8aeac0f5a6566fe32e76125d627af4edf2db97
| woodrowbarlow wrote:
| huh, cool. what happens if you use vanilla-git to clone a
| repo that contains empty folders? and do forges like github
| display them properly?
| lucasoshiro wrote:
| Actually, the Git data model supports empty directories,
| however, the index doesn't since it only maps names to files
| but not to directories. You can even create a commit with an
| empty root directory using --allow-empty, and it will use the
| hardcoded empty tree object
| (4b825dc642cb6eb9a060e54bf8d69288fbee4904).
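|
| You can reproduce that hardcoded id yourself (a sketch using
| the sha1 crate): git hashes "<type> <size>\0<body>", and the
| empty tree has no body:
|
|     use sha1::{Digest, Sha1};
|
|     fn main() {
|         let digest = Sha1::digest(b"tree 0\0");
|         let hex: String = digest.iter().map(|b| format!("{b:02x}")).collect();
|         println!("{hex}"); // 4b825dc642cb6eb9a060e54bf8d69288fbee4904
|     }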
| brendoncarroll wrote:
| Me too. Version control is great, it should get more use outside
| of software.
|
| https://github.com/gotvc/got
|
| Notable differences: E2E encryption, parallel imports (Got will
| light up all your cores), and a data structure that supports
| large files and directories.
| DASD wrote:
| Nice! Not sure if you're aware of Got(Game of Trees) that
| appears to pre-date your Got.
|
| https://gameoftrees.org/index.html
| brendoncarroll wrote:
| Yes the author reached out. There has not yet been a
| confusion among real users that I am aware of.
|
| https://github.com/gotvc/got/issues/20
| rtkwe wrote:
| The problem is when you move beyond text files it gets hard to
| tell what changes between two versions without opening both
| versions in whatever program they come from and comparing.
| brendoncarroll wrote:
| > The problem is when you move beyond text files it gets hard
| to tell what changes between two versions without opening
| both versions in whatever program they come from and
| comparing.
|
| Yeah, totally agree. Got has not solved conflict resolution
| for arbitrary files. However, we can tell the user where the
| files differ, and that the file has changed.
|
| There is still value in being able to import files and
| directories of arbitrary sizes, and having the data
| encrypted. This is the necessary infrastructure to be able to
| do distributed version control on large amounts of private
| data. You can't do that easily with Git. It's very clunky
| even with remote helpers and LFS.
|
| I talk about that in the Why Got? section of the docs.
|
| https://github.com/gotvc/got/blob/master/doc/1.1_Why_Got.md
| direwolf20 wrote:
| Cool. When you reimplement something, it forces you to see the
| fractal complexity of it.
| temporallobe wrote:
| Reminds me of when I tried to invent a SPA framework. So much
| hidden complexity I hadn't thought of and I found myself going
| down rabbit holes that I am sure the creators of React and
| Angular went down. Git seems to be like this and I am often
| reminded of how impressive it is at hiding underlying complexity.
| alsetmusic wrote:
| > at hiding underlying complexity.
|
| It's only in the context of recreating Git that this comment
| makes sense.
| lasgawe wrote:
| nice work! This is one of the best ways to deeply learn
| something: reinventing the wheel yourself.
| gkbrk wrote:
| CodeCrafters has an amazing "Build your own Git" [1] tutorial
| too. Jon Gjengset has a nice video [2] doing this challenge live
| with Rust.
|
| [1]: https://app.codecrafters.io/courses/git/overview
|
| [2]: https://www.youtube.com/watch?v=u0VotuGzD_w
| smekta wrote:
| ...with blackjacks, and hookers
| KolmogorovComp wrote:
| It's really a shame git's storage uses files as the unit of
| storage. That's what makes it a poor fit for many small
| files, or for large files.
|
| Content-based chunking like Xethub uses really should become the
| default. It's not like it's new either, rsync is based on it.
|
| https://huggingface.co/blog/xethub-joins-hf
| jonny_eh wrote:
| Why introduce yet another ignore file? Can you have it read
| .gitignore if .tvcignore is missing?
| bryan2 wrote:
| FTR, you can make repos with sha256 now.
|
| I wonder if signing sha-1 mitigates the threat of using an
| outdated hash.
___________________________________________________________________
(page generated 2026-01-28 07:01 UTC)