[HN Gopher] Git as a Storage
       ___________________________________________________________________
        
       Git as a Storage
        
       Author : todsacerdoti
       Score  : 85 points
       Date   : 2021-10-08 15:17 UTC (7 hours ago)
        
 (HTM) web link (bronevichok.ru)
 (TXT) w3m dump (bronevichok.ru)
        
       | earthscienceman wrote:
        | I have a technical question that I'm not at all equipped to
        | answer, and that might be stupid, like all questions outside
        | one's domain:
       | 
       | I recently discovered the joy that is ZFS and everything that
       | comes with it. I understand that the technical underpinnings of
       | git are actually extremely different (and mathematical) _but_
       | just how far is a ZFS snapshot from a git commit _really_? It
       | seems like the gap between the two might not need a huge bridge.
       | Could a copy-on-write filesystem benefit from more metadata that
       | would come from being implemented in a more git-like way?
        
         | gmueckl wrote:
          | Conceptually, the two things are very much related, and a
          | bird's-eye view shows a lot of similarities. But when you get
          | into the
         | weeds, there are some significant differences. git is optimized
         | to store a great many historic states of files with minor
         | differences between consecutive ones, and it assumes that these
         | are essentially static, immutable snapshots. A COW file system
         | that allows for snapshots is optimized more for allowing
         | mutation of these snapshots (i.e. updating files one way in one
         | snapshot and another way in another one). This, combined with
         | the additional housekeeping required for a file system (disk
         | block allocation, etc. - the actual core features) makes the
         | implementations of the two things very different.
        
         | mattnewton wrote:
          | At Google, people have built both, and we use a version control
          | system on top of a snapshotting filesystem. The snapshotting is
          | there so you never lose code/state on your machines, and the
          | version control system is for interfacing with others (code
          | review, merging, etc.). While you could use one system for
          | both, layering them makes it easier to adapt each to its
          | specific workflow.
        
         | crubier wrote:
         | Very close. Actually companies such as https://postgres.ai/ use
         | ZFS storage to provide git-like features on top of Postgres:
         | Using copy-on-write on the underlying ZFS, you can "fork" a new
         | branch of your DB with all the data, instantly. Then both
         | branches can live their lives independently.
         | 
          | I don't think ZFS has an equivalent of git merge, though.
        
           | shakow wrote:
           | Naive question, but what is the advantage compared to a
           | classical DB dump? Faster?
        
             | withinboredom wrote:
             | I think the keyword was "instantly."
        
               | qwertox wrote:
               | Is it correct that then the original DB and the
               | snapshotted DB share those blocks on the file system
               | which are unmodified?
               | 
               | Assume 1 row per block: Original DB "A" has 2 rows, a
               | snapshot "B" is created, "B" deletes one row and adds a
               | new one.
               | 
               | Is it true that the row which "B" took over from "A" and
               | left unmodified resides on the same block for "A" and
               | "B", so that if the block gets corrupted, both databases
               | will have to deal with that corrupt row?
        
               | Dylan16807 wrote:
               | Yes, that's one of the core parts of copy-on-write.
               | 
               | It shouldn't matter if you have a reasonable setup. If
               | you depend on other files on the drive to continue
               | working after blocks have started to go corrupt, that's
               | not a good system.
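To make the shared-block picture concrete, here is a toy Python model (the block granularity and the `Dataset` API are invented for illustration): taking a snapshot copies the block map but not the blocks, so an unmodified "row" is literally the same object on both sides, and a write diverges only the touched entry.

```python
class Dataset:
    """Toy copy-on-write block map: a snapshot copies references, not data."""

    def __init__(self, blocks):
        # Shallow copy of the mapping; the block values stay shared.
        self.blocks = dict(blocks)

    def snapshot(self):
        # Cost is proportional to the size of the block map,
        # not the size of the data: the blocks themselves are not copied.
        return Dataset(self.blocks)

    def write(self, key, value):
        # Only the written entry diverges; everything else stays shared.
        self.blocks[key] = value


a = Dataset({"row1": "alice", "row2": "bob"})
b = a.snapshot()
b.write("row2", "carol")   # B replaces a row; A is untouched

# row1 is the same object in both datasets, so corruption of a shared
# block would indeed hit both sides, as noted above.
```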
        
         | Dylan16807 wrote:
         | It would be nice if ZFS snapshots were more flexible. And you
         | could say "like git" when talking about the user experience.
         | But it would not be like git in terms of implementation. Git's
         | implementation is not really copy-on-write. It's deduplication.
         | 
         | I'd say the git method is actually pretty low in metadata, and
         | the way you'd improve ZFS snapshots doesn't involve making them
         | more like git.
         | 
         | If you did get that huge amount of work done, you could then
         | approximate git with snapshots alone. Right now, you'd probably
         | want snapshots and dedup to work together to approximate git
         | using ZFS.
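The deduplication point can be sketched in a few lines: git's object store is content-addressed, so identical content maps to a single object no matter how many paths or commits reference it. (Real git hashes a `blob <size>\0` header plus the content; this toy skips the header.)

```python
import hashlib

store = {}

def put(data: bytes) -> str:
    # Content-addressed: the id is a hash of the bytes themselves,
    # so identical content always lands on the same key.
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = data
    return oid

id1 = put(b"the same file contents")
id2 = put(b"the same file contents")   # second store is a no-op
```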
        
         | jhoechtl wrote:
          | Came here just to mention Btrfs, which, like ZFS, is
          | copy-on-write by default.
        
         | [deleted]
        
         | deepspace wrote:
         | I think you are correct. It would not be a huge stretch to turn
         | a snapshotting file system into a VCS.
         | https://en.wikipedia.org/wiki/Versioning_file_system
         | 
         | The IBM/Rational Clearcase version control system is an example
         | of building a VCS on top of a versioning file system (MVFS),
         | though MVFS uses an underlying database instead of a copy-on-
         | write snapshot mechanism.
         | https://www.ibm.com/support/pages/about-multiversion-file-sy...
        
       | LukeEF wrote:
        | There are a few git-inspired version-controlled databases out
        | there if performance becomes an issue. Dolt and TerminusDB are
        | the most prominent.
       | 
       | https://github.com/terminusdb/terminusdb
       | https://github.com/dolthub/dolt
        
       | axiomdata316 wrote:
        | If you are using Restic for backup, aren't you coming close to
        | what's being recommended here?
        
       | asperous wrote:
       | I thought it was a neat article, I assumed it was talking about
       | git lfs.
       | 
        | It would be neat if GitHub could store all its data in git,
        | similar to Fossil SCM. But I suppose Microsoft would not want to
        | lose the lock-in.
        
         | bastardoperator wrote:
         | All commit data is stored in git and the beauty of git outside
         | of platform metadata is that you can add a new remote and never
         | be locked in.
        
         | gopalv wrote:
         | > I thought it was a neat article
         | 
         | I think the article talks about the "What" part of the problem,
         | but the actual code is much more interesting in the "How"
         | sense.
         | 
          | The git-ref stuff, for example, makes sense as you read the
          | code:
         | 
         | https://github.com/ligurio/git-test/blob/master/bin/git-test...
         | 
         | There was a similar set of additions to svn in the past with
         | "svn propedit" in the workflows which I used in a previous
         | workplace.
         | 
          | It was not pretty, because it was like embedding JIRA into
          | svn. But it meant machines could flip from state to state with
          | commits during build+test and restart from that point without
          | an independent DB to track the "current state", and people
          | with commit access could nudge a stuck build along without
          | losing "who did what".
        
         | rsync wrote:
         | "It would be neat if github could store all its data in git,
         | similar to fossil scm."
         | 
         | Yes, that would be very nice - it is unfortunate that you have
         | to make API calls (over http) to get things like issues ...
         | 
         | I _think_ you can get the wiki with plain old  'git' ? I forget
         | ...
        
           | codetrotter wrote:
           | > I think you can get the wiki with plain old 'git' ? I
           | forget ...
           | 
           | This is correct. The wiki for a repo is accessible as a
           | separate repository named with a suffix of ".wiki".
           | 
           | So if user foo has a repo bar with an associated wiki, and
            | the repo URL is https://github.com/foo/bar then you can
            | clone the repo and the wiki respectively over SSH with:
            | 
            |     git clone git@github.com:foo/bar.git
            | 
            | and
            | 
            |     git clone git@github.com:foo/bar.wiki.git
           | 
            | I wish they'd do the same for all other repository metadata,
            | including issues, repository description, etc.
        
             | rsync wrote:
             | I believe this is true of gitlab and other providers as
             | well, correct ?
             | 
             | That is, you need API calls to get things like issues.
             | 
             | Is there a single tool that will handle downloading (and
             | the associated API calls) from all of the major providers ?
             | Or is each API tool specifically for either github or
             | gitlab or sr.ht or whatever ?
        
               | WorldMaker wrote:
               | Every issue system has its own API and today there's no
               | standard for interchange. It would be interesting to see
               | an attempt to try to build a reusable standard, but I
               | don't know what sort of standards agency exists with the
               | guts to try something like, I don't envy the political
               | battle that would entail, and having seen some of the
               | horrors of bespoke Jira and TFS configurations I'm mostly
               | such a standard would either be too minimalist and
               | disappoint too many people or too maximalist and
               | impossible to build.
        
               | rsync wrote:
                | Yes, I understand each provider's standard is different,
                | and I agree that would be a real mess to wrangle.
                | 
                |  _What I am wondering is:_ are there any tools that use
                | these APIs and have built-in support for multiple
                | provider APIs? Or is every tool that helps you manage or
                | download issues, etc., just built for a particular
                | provider?
               | 
               | Thanks.
        
               | WorldMaker wrote:
               | git-bug, the one mentioned in the article here, has some
               | documentation on its README of how well its
               | importer/exporter tools support Github, Gitlab, Jira, and
               | Launchpad: https://github.com/MichaelMure/git-bug
               | 
               | Most of the other such tools I've seen barely have the
               | resources to import/export a single such API. git-issue
               | only has Github import it looks like.
               | https://github.com/dspinellis/git-issue
               | 
                | There's perceval, which is designed to be a generic
                | archival tool and supports lots of APIs, but it only
                | dumps them to source-specific formats and would still
                | need a lot of work if you tried to use issues from
                | different APIs together:
                | https://github.com/chaoss/grimoirelab-perceval
        
       | GauntletWizard wrote:
       | I've been toying with the idea of writing a protocol for Git as a
       | "Blockchain" for bank interchange. Require signatures on all
        | commits, include a protocol for pushing commit proposals to
        | other peers for signing, verify commits before they're merged,
        | etc. No mining, just a distributed transaction log via git.
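The tamper-evidence half of this idea can be sketched without git at all: like a chain of git commits, each log entry names its parent by hash, and a signature covers the entry body. This toy uses an HMAC with a hypothetical shared key as a stand-in for the asymmetric signatures (e.g. GPG/SSH-signed commits) a real interchange protocol would need.

```python
import hashlib
import hmac
import json

KEY = b"shared-demo-key"  # hypothetical; real peers would use asymmetric keys

def append_entry(log, payload):
    # Each entry names its parent by hash, git-commit style, so the log
    # is tamper-evident: changing any entry breaks every later link.
    parent = log[-1]["id"] if log else None
    body = json.dumps({"parent": parent, "payload": payload}, sort_keys=True)
    log.append({
        "id": hashlib.sha1(body.encode()).hexdigest(),
        "body": body,
        # HMAC stands in for a commit signature in this sketch.
        "sig": hmac.new(KEY, body.encode(), hashlib.sha256).hexdigest(),
    })

def verify(log):
    parent = None
    for entry in log:
        expected = hmac.new(KEY, entry["body"].encode(),
                            hashlib.sha256).hexdigest()
        if not hmac.compare_digest(entry["sig"], expected):
            return False
        if json.loads(entry["body"])["parent"] != parent:
            return False
        parent = entry["id"]
    return True

ledger = []
append_entry(ledger, {"from": "A", "to": "B", "amount": 10})
append_entry(ledger, {"from": "B", "to": "C", "amount": 5})
```

Verification walks the chain from the root, so silently rewriting one transaction invalidates it.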
        
         | 8eye wrote:
         | i like this idea, i think you should look further into it.
          | there might be a market for it
        
       | carapace wrote:
       | Kind of an aside, but I've been toying with a simple functional
       | language based on Joy and when it came to exposing the filesystem
       | it seemed too fraught with impurity, so instead I'm just using
       | git as the data storage system. Instead of strings or blobs you
       | have handles that are essentially three-tuples of (git object
       | hash, offset, length). It's early days yet, but so far the
       | approach seems promising. (In re: string literals, well, your
       | literal is in a source file, and your source file is in git, so
       | each literal has its tuple already, eh?)
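The handle scheme described above can be sketched as a content-addressed object store plus a dereference step (names and API here are invented for illustration): a "string" is never materialized, only viewed through its (hash, offset, length) tuple.

```python
import hashlib

objects = {}

def store(data: bytes) -> str:
    # Immutable, content-addressed storage: the hash is the name.
    oid = hashlib.sha1(data).hexdigest()
    objects[oid] = data
    return oid

def deref(handle) -> bytes:
    # A handle is a three-tuple (object hash, offset, length)
    # resolved lazily against the store.
    oid, offset, length = handle
    return objects[oid][offset:offset + length]

oid = store(b"hello, immutable world")
handle = (oid, 7, 9)   # a view onto the word "immutable"
```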
        
       | rectang wrote:
        | How feasible is it to store raw content in the Git content-
        | addressable store (CAS)? Git blobs are zlib-compressed.
       | 
       | I'd like to be able to store audio files uncompressed, so that
       | they could be read directly from the CAS, rather than having to
       | be expanded out into a checkout directory.
        
         | u801e wrote:
          | IIRC, a git blob starts with a short header encoding the size
          | of the data, with the data itself appended to it. The data
          | could be stored uncompressed, but I don't think there's
          | anything in the git plumbing layer that could deal with that
          | directly.
         | 
         | That said, even if it is compressed, a command like git cat-
         | file could be used to pipe the contents of the file to stdout
         | or any other program that could use them as input without
         | having to create a file on disk.
        
           | rectang wrote:
           | The header for a blob file is "blob", a space, the length of
           | the content as ASCII integer representation, then a null
            | byte.
            | 
            |     $ echo "hello world" > HELLO.txt
            |     $ git add HELLO.txt
            |     $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \
            |     > zpipe -d | \
            |     > hexdump -e '"|"24/1 "%_p" "|\n"'
            |     |blob 12.hello world.|
            |     $
           | 
           | The header and the content get concatenated together, and the
           | whole thing gets Zlib compressed. The SHA1 is calculated from
           | the header-plus-content _before_ it gets Zlib compressed.
            | 
            |     $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \
            |     > zpipe -d | \
            |     > shasum
            |     3b18e512dba79e4c8300dd08aeb37f8e728b8dad  -
            |     $
           | 
           | What I would like to do is record an audio file (e.g. LPCM
           | BWF), take its SHA1 and store it in the CAS as raw content,
           | then reference it somehow from a Git commit. That way it will
           | be part of the history and will travel with `push` and
           | `clone`, won't get gc'd, etc.
           | 
           | > _That said, even if it is compressed, a command like git
           | cat-file could be used to pipe the contents of the file to
           | stdout or any other program that could use them as input
           | without having to create a file on disk._
           | 
           | That's a neat suggestion! However, I don't see how it would
           | be compatible with random access, which is important for my
           | application.
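The blob format described in the comment above is easy to reproduce in a few lines of Python: the object id is the SHA-1 of `blob <size>\0` plus the content, computed before compression, and what lands on disk is the zlib-compressed header-plus-content.

```python
import hashlib
import zlib

content = b"hello world\n"   # what `echo "hello world"` produces
header = b"blob " + str(len(content)).encode() + b"\x00"

# The object id: SHA-1 over the *uncompressed* header + content,
# matching the shasum output in the shell session above.
oid = hashlib.sha1(header + content).hexdigest()

# What actually lands under .git/objects is the zlib-compressed bytes.
on_disk = zlib.compress(header + content)
```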
        
             | GauntletWizard wrote:
             | Basically that's what Git-LFS does; it takes the SHA of the
             | file, stores it in the git version of the file, and then
             | stores the contents next to it. It's all transparent and
             | works pretty well.
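The committed stand-in file Git-LFS uses is just a few lines of text. This sketch builds one (field layout follows the Git LFS pointer format; treat the exact strings as illustrative): the repository tracks the pointer, while the large payload lives elsewhere, keyed by its hash.

```python
import hashlib

def make_pointer(data: bytes) -> str:
    # The small text file that gets committed in place of the big one;
    # the payload itself is stored elsewhere, keyed by this hash.
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )

audio = b"\x00" * 1024            # stand-in for a large media file
pointer = make_pointer(audio)
```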
        
             | rektide wrote:
              | The core of cat-file.c is quite short; I think you could
              | get the random access you want with minimal effort.
              | Ideally, upstream support for --offset and --count (or
              | the like) would land in git; a lot of people would
              | benefit.
             | 
             | https://github.com/git/git/blob/master/builtin/cat-file.c
             | 
              | You can absolutely make tools to expand out and load git
              | repos into content stores; how you do that will depend on
              | the content store.
        
       | haberman wrote:
       | I've recently had a similar idea for when you want to track
       | metrics of a Git repository over time (code size, line count,
       | etc).
       | 
        | I would love to write a script that takes a measurement of the
        | current tree, then run a tool that runs my script at ~every
        | commit, so I can draw a graph of how the metric changes over
        | time.
       | 
       | It's a bit tricky: if you change the script, you need to re-run
       | the analysis at every commit. It starts looking a little bit like
       | a build system, but integrated over time.
       | 
       | I've thought of calling this GitReduce or similar, since it has
       | some similarity to MapReduce: first a "map" step runs at every
       | commit, then the "reduce" step combines all of the individual
       | outputs into a single graph or whatever.
       | 
       | Ideally Git itself could be the only storage engine, so you can
       | trivially serve the results from GitHub.
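The map-with-caching half of this idea can be sketched abstractly (commits and the metric are stand-ins; a real version would check out each commit and shell out to the measurement script): keying the cache on a hash of the script itself gives exactly the invalidation behavior described, where editing the script forces a re-run at every commit.

```python
import hashlib

cache = {}

def measure(commits, script_src, metric):
    # Cache key includes a hash of the script, so editing the script
    # invalidates every old measurement.
    script_id = hashlib.sha1(script_src.encode()).hexdigest()
    results = []
    for commit_id, tree in commits:
        key = (script_id, commit_id)
        if key not in cache:
            cache[key] = metric(tree)   # the per-commit "map" step
        results.append((commit_id, cache[key]))
    return results                       # input to the "reduce"/graph step

# Two fake commits, each a snapshot of a text tree; the metric is a
# line count, one of the examples mentioned above.
commits = [("c1", "a\nb\n"), ("c2", "a\nb\nc\n")]
line_count = lambda tree: tree.count("\n")
out = measure(commits, "line_count v1", line_count)
```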
        
         | sulZ wrote:
          | git provides a couple of options for running a script against
          | the code at every commit, e.g. `git rebase --exec` or `git
          | bisect run`.
        
       | rectang wrote:
       | What I really want to see is a blog post on Git as an undo
       | engine.
       | 
       | This is related to the idea of using Git as general storage, in
       | that the undo history can be persisted, and then reconstituted by
       | a new process. The trick would be to make all actions compatible
       | with conversion to and from a commit.
        
         | yepguy wrote:
         | Emacs kind of has this in the form of magit-wip-mode. It
         | doesn't sync with the undo system but it does persist every
         | file save event since the last "real" commit.
        
           | muixoozie wrote:
           | I need to look into this.
           | 
           | Just yesterday I lost some work (only maybe 10 minutes worth)
           | when I was updating my org notes. I staged some files,
           | committed them, but made a typo in the commit message. I
           | ended up reverting the commit when I meant to amend. Then I
            | discarded the staged reverted changes and noticed the
           | status said I was still reverting. So I looked up the command
           | to get me out of that state. I ran 'git revert --abort' and
           | it blew away my unstaged changes. Ah well, those versioned
           | backups I have Emacs do are going to save me this time, or so
           | I thought.
        
         | azhenley wrote:
         | I've seen programs that serialize the undo manager, but is
         | there an advantage to using git instead? I suppose it does most
         | of the work for you, but you'll still need to manage actions
         | that can be undone but don't directly modify a file (these are
         | entirely app dependent, but could be something like changing
            | which document tab is selected).
        
           | rectang wrote:
           | The advantage is that Git's data structure is an open design,
           | with existing tools able to introspect and possibly
           | manipulate it, with potential users and contributors able to
           | leverage their existing knowledge, and with
           | documents/projects still being parseable decades from now.
           | 
            | I have dreams of implementing a music composition tool /
            | audio editor with a line-oriented edit-decision-list (EDL)
            | text file format where changes could map coherently onto
           | git history. Ideally, that EDL format would be an open
           | standard as well: I've contemplated the Pure Data file format
           | and AES31 as possible candidates. This is just at the
           | conceptual stage, though.
        
             | dTal wrote:
             | One problem with git is that it assumes you have some sort
             | of "diff" utility for before and after snapshots. For some
             | types of things, that's not feasible - you can't easily
             | diff two images and work out what filter was used.
              | Sometimes you instead want to _create_ the diff and
              | _generate_ the "after" snapshot. You could emulate this by
             | stowing the information somewhere hidden and writing a fake
             | "diff" that simply retrieves it - but saving the edit tree
             | and reconstituting it is supposed to be git's job! You end
             | up reimplementing bits of git, just to make git work.
        
               | rectang wrote:
               | I suppose the "diff" limitation would constrain the
               | potential use cases. For my purposes, I don't think that
               | would be a problem, because the project format would
               | consist of two types of files:
               | 
               | * PCM audio files which are captured once and then never
               | modified.
               | 
               | * Line oriented text files for which the traditional
               | "diff" functionality suffices.
        
         | rektide wrote:
          | One of these years I'll have something fun to show you.
        
       ___________________________________________________________________
       (page generated 2021-10-08 23:00 UTC)