[HN Gopher] Analogies for Git (2016)
___________________________________________________________________
Analogies for Git (2016)
Author : luu
Score : 22 points
Date : 2022-09-17 05:19 UTC (17 hours ago)
(HTM) web link (novalis.org)
(TXT) w3m dump (novalis.org)
| klabb3 wrote:
| Can someone explain the change set data structure, specifically
| how you can re-apply it to different revisions of the repo?
|
| - Is it just a diff with paths and specific line numbers?
|
| - Does it contain a few lines of context to heuristically map the
| diff onto the right place in case unrelated lines are
| added/removed?
|
| - What other data structures are needed to produce a snapshot? Is
| it just a graph of change sets?
| kevincox wrote:
| The logical model of Git doesn't store changesets or diffs at
| all. The logical structure of a commit is:
|
| 1. Some "human" metadata such as message, author, timestamps
|
| 2. An ordered list of parents (0 to infinity parents)
|
| 3. A complete directory tree of the repository at that version.
|
| So when you reference a commit in Git you aren't referencing a
| diff/changeset. You are referencing an exact snapshot of the
| files in the repository at that time. This is sometimes
| obscured because to a human we often find it useful to see what
| changes a commit introduced. For example if you run `git show
| $commit` in the simple case where $commit has exactly one
| parent Git will compute a diff between $commit and its parent
| ($commit~1). However logically this diff isn't stored as part
| of the repository, it is computed on the fly from the snapshots
| of the two directory trees.
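|
| A minimal sketch of that model, assuming a toy `Commit` type whose
| tree is a plain dict of path -> content (real Git stores trees as
| hashed objects; the names `Commit`, `diff` and `show` here are
| illustrative only). The diff is never stored, only derived:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Commit:
    message: str              # "human" metadata
    parents: List["Commit"]   # ordered list of parents, possibly empty
    tree: Dict[str, str]      # complete snapshot: path -> file content


def diff(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Compare two snapshots and return path -> (before, after) for changed paths."""
    changes = {}
    for path in sorted(set(old) | set(new)):
        before, after = old.get(path), new.get(path)
        if before != after:
            changes[path] = (before, after)
    return changes


def show(commit: Commit) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Toy analogue of `git show` for a commit with at most one parent."""
    parent_tree = commit.parents[0].tree if commit.parents else {}
    return diff(parent_tree, commit.tree)


root = Commit("initial", [], {"README": "hello\n"})
child = Commit("add main", [root], {"README": "hello\n", "main.py": "print('hi')\n"})
```

| Note that `child` carries the unchanged README as well; only `show`
| turns the pair of snapshots into something that looks like a change.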
|
| If you want to re-apply it to a different version of the repo
| (usually via `git cherry-pick`) Git basically does two steps:
|
| 1. Compute the diff between the commit and its parent.
|
| 2. Apply that diff to the target revision.
|
| Of course this all gets more complicated in the case of merge
| commits (commits with >1 parent) but the model is similar.
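|
| The two steps can be sketched as follows. This is a simplification
| under stated assumptions: snapshots are dicts of path -> content,
| changes are tracked per whole file, and any mismatch is reported as
| a conflict. Real Git works at line level and uses its merge
| machinery, so treat `cherry_pick` below as a hypothetical helper:

```python
from typing import Dict, Optional, Tuple

Tree = Dict[str, str]
Changes = Dict[str, Tuple[Optional[str], Optional[str]]]


def diff(old: Tree, new: Tree) -> Changes:
    """Step 1: derive path -> (before, after) from two snapshots."""
    return {
        path: (old.get(path), new.get(path))
        for path in set(old) | set(new)
        if old.get(path) != new.get(path)
    }


def apply(changes: Changes, target: Tree) -> Tree:
    """Step 2: replay the changes onto another snapshot."""
    result = dict(target)
    for path, (before, after) in changes.items():
        if result.get(path) != before:
            # The target no longer looks like the commit's parent here.
            raise ValueError(f"conflict in {path}")
        if after is None:
            result.pop(path, None)
        else:
            result[path] = after
    return result


def cherry_pick(parent_tree: Tree, commit_tree: Tree, target_tree: Tree) -> Tree:
    return apply(diff(parent_tree, commit_tree), target_tree)


picked = cherry_pick(
    {"a.txt": "1"},                  # parent of the commit
    {"a.txt": "1", "b.txt": "2"},    # the commit being picked
    {"a.txt": "1", "c.txt": "3"},    # the revision we apply it to
)
```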
|
| (For performance reasons Git does use delta compression
| internally but that is just an optimization and doesn't really
| reflect the diffs that humans are interested in).
| fuckstick wrote:
| Git internally doesn't deal in diffs - the conceptual objects
| are blobs, trees and commits. A tree is like a directory
| structure of blobs (basically whole files) and commits link
| trees in a graph. It's a graph of filesystem-like trees
| basically - not changesets. Very conceptually simple - hence
| the stupid "git" content tracker.
|
| Git has a pack file format that stores deltas but that is just
| for space saving.
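|
| A rough sketch of blobs, trees and commits as a content-addressed
| store. The blob hashing (SHA-1 over a "blob <size>\0" header plus the
| content) follows what `git hash-object` does in SHA-1 repositories;
| the tree and commit encodings below are simplified text stand-ins,
| not Git's real formats, so only the blob IDs would match real Git:

```python
import hashlib
from typing import Dict, List, Tuple

store: Dict[str, Tuple[str, bytes]] = {}  # object id -> (kind, payload)


def put(kind: str, payload: bytes) -> str:
    """Store an object under the hash of its header plus payload."""
    header = f"{kind} {len(payload)}\0".encode()
    oid = hashlib.sha1(header + payload).hexdigest()
    store[oid] = (kind, payload)
    return oid


def put_blob(content: bytes) -> str:
    return put("blob", content)


def put_tree(entries: Dict[str, str]) -> str:
    # Simplified encoding: one "name id" line per entry, sorted by name.
    lines = [f"{name} {oid}" for name, oid in sorted(entries.items())]
    return put("tree", "\n".join(lines).encode())


def put_commit(tree: str, parents: List[str], message: str) -> str:
    # Simplified: real commits also carry author and committer lines.
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents] + ["", message]
    return put("commit", "\n".join(lines).encode())


readme = put_blob(b"hello\n")
first = put_commit(put_tree({"README": readme}), [], "initial")
second = put_commit(
    put_tree({"README": readme, "main.py": put_blob(b"print('hi')\n")}),
    [first],
    "add main",
)
```

| Both commits point at complete trees, yet the unchanged README is
| stored once, because identical content hashes to the same ID.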
| philipwhiuk wrote:
| Call me crazy but I don't think equating a VCS to something that
| people 100% get confused by (time travel) helps many people
| understand a VCS.
| kickaha wrote:
| Came here to say: impressive how these metaphors are
| simultaneously so fascinating, presumably very insightful, and
| so _unhelpful to a newbie trying to understand_ git.
| djhaskin987 wrote:
| It is precisely because git stores a graph of snapshots that it is
| so hard to scale it to store large monorepos with
| thousands of files. Every single commit stores a reference to the
| content of every single file. Using the duality mentioned in the
| article and storing change sets instead is an interesting
| trade-off. Instead of having to compute the diffs, you have to
| compute the snapshots. This is a good trade-off if you want only a
| small portion of the snapshot checked out to your machine. This is
| why Perforce does better with monorepos.
|
| Storing changesets can handle very, very large sets of files much
| more easily, but you pay the price of having to compute which
| file content is stored when, which lengthens checkout time even
| for small checkouts. It is not a good trade-off if you have to
| check out the entire thing anyway. This is why git is more popular
| with the open source community, which is more like a bazaar than a
| cathedral.
| kevincox wrote:
| I don't think this makes large repos fundamentally hard. You
| just need good support for working on incomplete graphs. For
| example you just need to know the tree IDs/hashes of the non-
| checked-out trees in the directories that you have checked out.
| Then you graft your checked out directories onto the tree of
| the parent.
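|
| A sketch of that grafting idea, assuming a toy hashing scheme (not
| Git's real tree format) and hypothetical helpers `blob_id`,
| `tree_id` and `graft`. The point is that recomputing the root ID
| needs only the IDs of sibling trees, never their contents:

```python
import hashlib
from typing import Dict


def blob_id(content: bytes) -> str:
    return hashlib.sha1(content).hexdigest()


def tree_id(entries: Dict[str, str]) -> str:
    """ID of a directory, derived only from the names and IDs of its children."""
    body = "\n".join(f"{name} {oid}" for name, oid in sorted(entries.items()))
    return hashlib.sha1(body.encode()).hexdigest()


def graft(parent_entries: Dict[str, str], name: str, new_subtree_id: str) -> str:
    """New parent tree ID after replacing one checked-out subdirectory."""
    updated = dict(parent_entries)
    updated[name] = new_subtree_id
    return tree_id(updated)


# What a partial checkout knows about the root: child names and IDs only.
docs_id = tree_id({"guide.txt": blob_id(b"a large guide")})  # never checked out
src_id = tree_id({"main.py": blob_id(b"v1")})
root_entries = {"docs": docs_id, "src": src_id}

# Edit src/main.py locally, then graft the new subtree onto the root.
new_src_id = tree_id({"main.py": blob_id(b"v2")})
new_root_id = graft(root_entries, "src", new_src_id)
```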
| slondr wrote:
| Mercurial has the obvious correct answer to this problem: Store
| diffs with snapshot checkpoints. That is, store diffs, but once
| you reach a critical number of diffs, store a snapshot instead
| of a diff so you can efficiently compute any specific commit
| state without storing snapshots for every commit.
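|
| A minimal sketch of diffs with snapshot checkpoints. The fixed
| threshold `SNAPSHOT_EVERY` and the `RevLog` class are assumptions
| for illustration; this is not Mercurial's actual storage format or
| heuristic. States are dicts of path -> content, and a delta value
| of None means the path was deleted:

```python
from typing import Dict, List, Optional, Tuple

SNAPSHOT_EVERY = 3  # hypothetical "critical number" of entries per chain


class RevLog:
    def __init__(self) -> None:
        # Each entry is ("snapshot", full_state) or ("delta", changed_paths).
        self.entries: List[Tuple[str, Dict[str, Optional[str]]]] = []

    def append(self, state: Dict[str, str]) -> None:
        if len(self.entries) % SNAPSHOT_EVERY == 0:
            self.entries.append(("snapshot", dict(state)))
            return
        prev = self.read(len(self.entries) - 1)
        delta = {
            path: state.get(path)
            for path in set(prev) | set(state)
            if prev.get(path) != state.get(path)
        }
        self.entries.append(("delta", delta))

    def read(self, rev: int) -> Dict[str, str]:
        # Walk back to the nearest snapshot, then replay deltas forward.
        base = rev
        while self.entries[base][0] != "snapshot":
            base -= 1
        state = dict(self.entries[base][1])
        for _, delta in self.entries[base + 1 : rev + 1]:
            for path, content in delta.items():
                if content is None:
                    state.pop(path, None)
                else:
                    state[path] = content
        return state


states = [
    {"a": "1"},
    {"a": "1", "b": "2"},
    {"a": "9", "b": "2"},
    {"b": "2"},
    {"b": "2", "c": "3"},
]
log = RevLog()
for s in states:
    log.append(s)
```

| Reading any revision replays at most SNAPSHOT_EVERY - 1 deltas, which
| is the bound the checkpoints are there to provide.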
| Dylan16807 wrote:
| > Mercurial has the obvious correct answer to this problem:
| Store diffs with snapshot checkpoints.
|
| That's how the git backend works for most of its storage.
|
| It's a fool's errand to look at the conceptual model and
| start making claims about performance.
| actionfromafar wrote:
| Ah, so CVS is like _GIF_, git is like MPEG without
| keyframes, and Mercurial is like MPEG _with_ keyframes.
| agumonkey wrote:
| Quite an apt metaphor :)
|
| Now where are the HEVC and AV1 of versioning systems?
| teloli wrote:
| At first I read this as "apologies for git"; that would have been
| much more interesting.
| mikewarot wrote:
| Git is a tool that makes snapshots, and provides a name for each
| (its checksum) and a graph that connects those names.
|
| It doesn't _usually_ compute differences. It doesn't store
| deltas, although it seems to do so to the newly acquainted user.
|
| Git is a highly configurable folder backup program, that allows
| for sharing that folder with the world.
___________________________________________________________________
(page generated 2022-09-17 23:01 UTC)