[HN Gopher] Analogies for Git (2016)
       ___________________________________________________________________
        
       Author : luu
       Score  : 22 points
       Date   : 2022-09-17 05:19 UTC (17 hours ago)
        
 (HTM) web link (novalis.org)
 (TXT) w3m dump (novalis.org)
        
       | klabb3 wrote:
       | Can someone explain the change set data structure, specifically
       | how you can re-apply it to different revisions of the repo?
       | 
       | - Is it just a diff with paths and specific line numbers?
       | 
       | - Does it contain a few lines of context to heuristically map the
       | diff onto the right place in case unrelated lines are
       | added/removed?
       | 
       | - What other data structures are needed to produce a snapshot? Is
       | it just a graph of change sets?
        
         | kevincox wrote:
         | The logical model of Git doesn't store changesets or diffs at
         | all. The logical structure of a commit is:
         | 
         | 1. Some "human" metadata such as message, author, timestamps
         | 
         | 2. An ordered list of parents (0 to infinity parents)
         | 
         | 3. A complete directory tree of the repository at that version.
         | 
         | So when you reference a commit in Git you aren't referencing a
         | diff/changeset. You are referencing an exact snapshot of the
         | files in the repository at that time. This is sometimes
         | obscured because to a human we often find it useful to see what
         | changes a commit introduced. For example if you run `git show
         | $commit` in the simple case where $commit has exactly one
         | parent Git will compute a diff between $commit and its parent
         | ($commit~1). However logically this diff isn't stored as part
         | of the repository, it is computed on the fly from the snapshots
         | of the two directory trees.
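         | 
         | A minimal Python sketch of that model, not Git's actual
         | implementation. It assumes a flat path -> content mapping in
         | place of a real directory tree; `Commit`, `diff_trees` and
         | `show` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumption: a "tree" is flattened to a mapping of path -> file content.
Tree = Dict[str, str]
# A change is (content before, content after); None means "file absent".
Change = Tuple[Optional[str], Optional[str]]


@dataclass
class Commit:
    message: str                    # "human" metadata
    author: str
    parents: Tuple["Commit", ...]   # ordered, zero or more
    tree: Tree                      # complete snapshot at this version


def diff_trees(old: Tree, new: Tree) -> Dict[str, Change]:
    """Return every path whose content differs between two snapshots."""
    changes: Dict[str, Change] = {}
    for path in sorted(set(old) | set(new)):
        before, after = old.get(path), new.get(path)
        if before != after:
            changes[path] = (before, after)
    return changes


def show(commit: Commit) -> Dict[str, Change]:
    """Rough analogue of `git show` for a commit with exactly one parent."""
    (parent,) = commit.parents
    return diff_trees(parent.tree, commit.tree)


root = Commit("initial", "alice", (), {"README": "hello\n"})
child = Commit("add main", "alice", (root,),
               {"README": "hello\n", "main.py": "print('hi')\n"})
print(show(child))
```

         | Nothing in `Commit` records a diff; `show` derives one from
         | the two snapshots each time it is called.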
         | 
         | If you want to re-apply it to a different version of the repo
         | (usually via `git cherry-pick`) Git basically does two steps:
         | 
         | 1. Compute the diff between the commit and its parent.
         | 
         | 2. Apply that diff to the target revision.
         | 
         | Of course this all gets more complicated in the case of merge
         | commits (commits with >1 parent) but the model is similar.
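         | 
         | The two steps above can be sketched in Python as follows.
         | This is a simplification under stated assumptions: snapshots
         | are flat path -> content mappings, changes are whole-file
         | replacements, and any mismatch is reported as a conflict.
         | Real Git works at line granularity through its merge
         | machinery; `diff_trees`, `apply_changes` and `cherry_pick`
         | are hypothetical names.

```python
from typing import Dict, Optional, Tuple

Tree = Dict[str, str]                          # path -> content (assumed flat)
Change = Tuple[Optional[str], Optional[str]]   # (before, after); None = absent


def diff_trees(old: Tree, new: Tree) -> Dict[str, Change]:
    """Step 1: compute the diff between the commit and its parent."""
    changes: Dict[str, Change] = {}
    for path in set(old) | set(new):
        before, after = old.get(path), new.get(path)
        if before != after:
            changes[path] = (before, after)
    return changes


def apply_changes(target: Tree, changes: Dict[str, Change]) -> Tree:
    """Step 2: apply that diff to the target snapshot, returning a new one."""
    result = dict(target)
    for path, (before, after) in changes.items():
        # The change only applies cleanly if the target still has the
        # content the diff was computed against.
        if result.get(path) != before:
            raise ValueError(f"conflict at {path!r}")
        if after is None:
            del result[path]
        else:
            result[path] = after
    return result


def cherry_pick(parent_tree: Tree, commit_tree: Tree, target_tree: Tree) -> Tree:
    return apply_changes(target_tree, diff_trees(parent_tree, commit_tree))
```

         | Files the picked commit did not touch are carried over from
         | the target unchanged, which is why the result is a new
         | snapshot rather than a copy of the original commit.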
         | 
         | (For performance reasons Git does use delta compression
         | internally but that is just an optimization and doesn't really
         | reflect the diffs that humans are interested in).
        
         | fuckstick wrote:
         | Git internally doesn't deal in diffs - the conceptual objects
         | are blobs, trees and commits. A tree is like a directory
         | structure of blobs (basically whole files) and commits link
         | trees in a graph. It's basically a graph of filesystem-like
         | trees - not changesets. Very conceptually simple - hence
         | the stupid "git" content tracker.
         | 
         | Git has a pack file format that stores deltas but that is just
         | for space saving.
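         | 
         | A rough Python sketch of such a content-addressed store,
         | ignoring pack files entirely. Objects are named by the SHA-1
         | of a "<kind> <size>" header, a NUL byte and the payload. The
         | assumptions here: every tree entry is a regular file (mode
         | 100644), directories are not nested, and the commit payload
         | omits the author and committer lines. `store` and the `put_*`
         | helpers are hypothetical names.

```python
import hashlib
from typing import Dict, List

# Object id -> raw payload; a stand-in for the on-disk object database.
store: Dict[str, bytes] = {}


def put(kind: str, payload: bytes) -> str:
    """Store a payload under the SHA-1 of its header plus content."""
    header = f"{kind} {len(payload)}\0".encode()
    oid = hashlib.sha1(header + payload).hexdigest()
    store[oid] = payload
    return oid


def put_blob(content: bytes) -> str:
    """A blob is just the bytes of one file."""
    return put("blob", content)


def put_tree(entries: Dict[str, str]) -> str:
    """entries maps file name -> blob id; all entries assumed mode 100644."""
    payload = b"".join(
        b"100644 " + name.encode() + b"\0" + bytes.fromhex(oid)
        for name, oid in sorted(entries.items())
    )
    return put("tree", payload)


def put_commit(tree_id: str, parent_ids: List[str], message: str) -> str:
    """Simplified commit: a tree id, parent ids and a message only."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += ["", message]
    return put("commit", "\n".join(lines).encode())


readme = put_blob(b"hello\n")
first = put_commit(put_tree({"README": readme}), [], "initial")
second = put_commit(put_tree({"README": readme, "LICENSE": put_blob(b"MIT\n")}),
                    [first], "add license")
```

         | Both commits point at complete trees, and both trees point at
         | the same README blob id, so unchanged files are stored once.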
        
       | philipwhiuk wrote:
       | Call me crazy but I don't think equating a VCS to something that
       | people 100% get confused by (time travel) helps many people
       | understand a VCS.
        
         | kickaha wrote:
         | Came here to say: impressive how these metaphors are
         | simultaneously so fascinating, presumably very insightful, and
         | so _unhelpful to a newbie trying to understand_ git.
        
       | djhaskin987 wrote:
       | It is precisely because git stores a graph of snapshots that it
       | is so hard to scale it to store large monorepos with thousands
       | of files. Every single commit stores a reference to the content
       | in every single file. Using the duality mentioned in the article
       | and storing change sets instead is an interesting trade-off.
       | Instead of having to compute the diffs, you have to compute the
       | snapshots. This is a good trade-off if you want a small portion
       | of the snapshot checked out to your machine. This is why
       | Perforce does better with monorepos.
       | 
       | Storing changesets can handle very, very large sets of files much
       | more easily, but you pay the price of having to compute which
       | file content is stored when, which lengthens checkout time even
       | for small checkouts. It is not a good trade-off if you have to
       | check out the entire thing anyway. This is why git is more
       | popular with the open source community, which is more like a
       | bazaar than a cathedral.
        
         | kevincox wrote:
         | I don't think this makes large repos fundamentally hard. You
         | just need good support for working on incomplete graphs. For
         | example you just need to know the tree IDs/hashes of the non-
         | checked-out trees in the directories that you have checked out.
         | Then you graft your checked out directories onto the tree of
         | the parent.
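         | 
         | A small Python sketch of that idea, under loud assumptions:
         | the hashing scheme below is illustrative and is not Git's
         | object format, and `Node` and `node_id` are hypothetical
         | names. A directory that is not checked out is represented
         | only by its recorded id, yet the root id can still be
         | recomputed after editing a directory that is checked out.

```python
import hashlib
from typing import Dict, Union

# A node is file content (bytes), a checked-out directory (dict), or an
# opaque id (str) standing in for a subtree that is not checked out.
Node = Union[bytes, Dict[str, "Node"], str]


def node_id(node: Node) -> str:
    """Compute a Merkle-style id, trusting recorded ids for absent subtrees."""
    if isinstance(node, str):
        return node                      # not checked out: reuse the known id
    if isinstance(node, bytes):
        return hashlib.sha1(b"blob\0" + node).hexdigest()
    listing = "".join(
        f"{name} {node_id(child)}\n" for name, child in sorted(node.items())
    )
    return hashlib.sha1(b"tree\0" + listing.encode()).hexdigest()


full = {"docs": {"guide.md": b"# Guide\n"}, "src": {"main.c": b"int main(){}\n"}}

# Sparse view: only src/ is materialized, docs/ is known by id alone.
sparse = {"docs": node_id(full["docs"]), "src": {"main.c": b"int main(){}\n"}}

same_root = node_id(full) == node_id(sparse)
```

         | Committing from the sparse view only needs to rehash `src`
         | and the root; `docs` is grafted in by id without ever reading
         | its contents.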
        
         | slondr wrote:
         | Mercurial has the obvious correct answer to this problem: Store
         | diffs with snapshot checkpoints. That is, store diffs, but once
         | you reach a critical number of diffs, store a snapshot instead
         | of a diff so you can efficiently compute any specific commit
         | state without storing snapshots for every commit.
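         | 
         | The scheme described above can be sketched in Python as
         | follows. This illustrates the idea only and is not a
         | description of Mercurial's actual on-disk format. It assumes
         | a linear history, flat path -> content snapshots, and a fixed
         | checkpoint interval; `RevLog` and `snapshot_every` are
         | hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

Tree = Dict[str, str]                 # path -> content (assumed flat)
Delta = Dict[str, Optional[str]]      # path -> new content, None = deleted


class RevLog:
    """Linear history stored as deltas with periodic full snapshots."""

    def __init__(self, snapshot_every: int = 4) -> None:
        self.snapshot_every = snapshot_every
        # Each entry is ("snapshot", tree) or ("delta", delta).
        self.entries: List[Tuple[str, dict]] = []

    def append(self, tree: Tree) -> int:
        rev = len(self.entries)
        if rev % self.snapshot_every == 0:
            self.entries.append(("snapshot", dict(tree)))
        else:
            prev = self.checkout(rev - 1)
            delta: Delta = {
                path: tree.get(path)
                for path in set(prev) | set(tree)
                if prev.get(path) != tree.get(path)
            }
            self.entries.append(("delta", delta))
        return rev

    def checkout(self, rev: int) -> Tree:
        # Start from the nearest snapshot at or before rev, then replay
        # at most snapshot_every - 1 deltas on top of it.
        base = rev - rev % self.snapshot_every
        tree = dict(self.entries[base][1])
        for _kind, delta in self.entries[base + 1 : rev + 1]:
            for path, content in delta.items():
                if content is None:
                    tree.pop(path, None)
                else:
                    tree[path] = content
        return tree
```

         | The checkpoint interval bounds how many deltas a checkout
         | has to replay, at the cost of storing a full snapshot every
         | `snapshot_every` revisions.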
        
           | Dylan16807 wrote:
           | > Mercurial has the obvious correct answer to this problem:
           | Store diffs with snapshot checkpoints.
           | 
           | That's how the git backend works for most of its storage.
           | 
           | It's a fool's errand to look at the conceptual model and
           | start making claims about performance.
        
           | actionfromafar wrote:
           | Ah, so CVS is like _GIF_, git is like MPEG without
           | keyframes, and Mercurial is like MPEG _with_ keyframes.
        
             | agumonkey wrote:
             | Quite an apt metaphor :)
             | 
             | Now where are the HEVC and AV1 of version control systems?
        
       | teloli wrote:
       | At first I read this as "apologies for git", that would have been
       | much more interesting
        
       | mikewarot wrote:
       | Git is a tool that makes snapshots, and provides a name for each
       | (its checksum) and a graph that connects those names.
       | 
       | It doesn't _usually_ compute differences. It doesn't store
       | deltas, although it seems to do so to the newly acquainted user.
       | 
       | Git is a highly configurable folder backup program, that allows
       | for sharing that folder with the world.
        
       ___________________________________________________________________
       (page generated 2022-09-17 23:01 UTC)