[HN Gopher] Git as a Storage
___________________________________________________________________
Git as a Storage
Author : todsacerdoti
Score : 85 points
Date : 2021-10-08 15:17 UTC (7 hours ago)
(HTM) web link (bronevichok.ru)
(TXT) w3m dump (bronevichok.ru)
| earthscienceman wrote:
| I have a technical question that I'm not at all poised to answer,
| that might be stupid like all questions not in one's domain:
|
| I recently discovered the joy that is ZFS and everything that
| comes with it. I understand that the technical underpinnings of
| git are actually extremely different (and mathematical) _but_
| just how far is a ZFS snapshot from a git commit _really_? It
| seems like the gap between the two might not need a huge bridge.
| Could a copy-on-write filesystem benefit from more metadata that
| would come from being implemented in a more git-like way?
| gmueckl wrote:
| Conceptually, the two things are very much related, and a
| bird's-eye view shows a lot of similarities. But when you get into the
| weeds, there are some significant differences. git is optimized
| to store a great many historic states of files with minor
| differences between consecutive ones, and it assumes that these
| are essentially static, immutable snapshots. A COW file system
| that allows for snapshots is optimized more for allowing
| mutation of these snapshots (i.e. updating files one way in one
| snapshot and another way in another one). This, combined with
| the additional housekeeping required for a file system (disk
| block allocation, etc. - the actual core features) makes the
| implementations of the two things very different.
| mattnewton wrote:
| At Google, people have built both, and we use a version control
| system on top of a snapshotting filesystem. The snapshotting is
| for never losing code/state on your machines, and the version
| control system is for interfacing with others (code review,
| merging, etc). While you could use one system for both, having
| the two layered makes it easier to tailor each to its specific
| workflow.
| crubier wrote:
| Very close. Actually companies such as https://postgres.ai/ use
| ZFS storage to provide git-like features on top of Postgres:
| Using copy-on-write on the underlying ZFS, you can "fork" a new
| branch of your DB with all the data, instantly. Then both
| branches can live their lives independently.
|
| But I don't think ZFS has the equivalent of git merge though.
| shakow wrote:
| Naive question, but what is the advantage compared to a
| classical DB dump? Faster?
| withinboredom wrote:
| I think the keyword was "instantly."
| qwertox wrote:
| Is it correct that then the original DB and the
| snapshotted DB share those blocks on the file system
| which are unmodified?
|
| Assume 1 row per block: Original DB "A" has 2 rows, a
| snapshot "B" is created, "B" deletes one row and adds a
| new one.
|
| Is it true that the row which "B" took over from "A" and
| left unmodified resides on the same block for "A" and
| "B", so that if the block gets corrupted, both databases
| will have to deal with that corrupt row?
| Dylan16807 wrote:
| Yes, that's one of the core parts of copy-on-write.
|
| It shouldn't matter if you have a reasonable setup. If
| you depend on other files on the drive to continue
| working after blocks have started to go corrupt, that's
| not a good system.
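
qwertox's two-row example can be played through in a toy model. This is only a sketch of the copy-on-write idea, not of how ZFS is implemented: `Block` and `Volume` are hypothetical names, and a plain Python list of references stands in for the filesystem's block pointers.

```python
class Block:
    """One storage block holding one row (the '1 row per block' assumption)."""

    def __init__(self, data):
        self.data = data


class Volume:
    """Toy copy-on-write volume: a list of references to Block objects."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def snapshot(self):
        # Only the references are copied, so the clone is cheap and
        # both volumes initially point at the very same blocks.
        return Volume(self.blocks)

    def write(self, index, data):
        # Copy-on-write: never mutate a shared block, point at a new one.
        self.blocks[index] = Block(data)

    def rows(self):
        return [block.data for block in self.blocks]


a = Volume([Block("row1"), Block("row2")])
b = a.snapshot()
b.write(1, "row3")  # "B" deletes one row and adds a new one

shared = a.blocks[0] is b.blocks[0]        # untouched row: same block
diverged = a.blocks[1] is not b.blocks[1]  # rewritten row: separate blocks

# Simulate corruption of the shared block: both A and B see it.
a.blocks[0].data = "garbage"
```

After the last line, `a.rows()` and `b.rows()` both start with the corrupted row, which is the behaviour Dylan16807 confirms.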
| Dylan16807 wrote:
| It would be nice if ZFS snapshots were more flexible. And you
| could say "like git" when talking about the user experience.
| But it would not be like git in terms of implementation. Git's
| implementation is not really copy-on-write. It's deduplication.
|
| I'd say the git method is actually pretty low in metadata, and
| the way you'd improve ZFS snapshots doesn't involve making them
| more like git.
|
| If you did get that huge amount of work done, you could then
| approximate git with snapshots alone. Right now, you'd probably
| want snapshots and dedup to work together to approximate git
| using ZFS.
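
The "deduplication, not copy-on-write" point can be illustrated with a toy content-addressed store. This is a sketch under simplifying assumptions: `DedupStore` is a hypothetical name, and the key is the SHA-1 of the raw bytes (real git hashes a small header plus the content).

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        # Storing the same bytes twice is a no-op: that is the dedup.
        self.objects.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]


store = DedupStore()

# Two "snapshots" written independently. Nothing records that the second
# was derived from the first; sharing falls out of the hashing alone.
snap1 = {"a.txt": store.put(b"alpha"), "b.txt": store.put(b"beta")}
snap2 = {"a.txt": store.put(b"alpha"), "b.txt": store.put(b"gamma")}
```

Four `put` calls leave three objects in the store, because `a.txt` is identical in both snapshots. A copy-on-write snapshot gets its sharing from ancestry instead, so two identical files written independently would not be shared without a separate dedup feature.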
| jhoechtl wrote:
| Came here just to mention Btrfs which does the same as ZFS in
| the sense that it is also COW by default.
| [deleted]
| deepspace wrote:
| I think you are correct. It would not be a huge stretch to turn
| a snapshotting file system into a VCS.
| https://en.wikipedia.org/wiki/Versioning_file_system
|
| The IBM/Rational Clearcase version control system is an example
| of building a VCS on top of a versioning file system (MVFS),
| though MVFS uses an underlying database instead of a copy-on-
| write snapshot mechanism.
| https://www.ibm.com/support/pages/about-multiversion-file-sy...
| LukeEF wrote:
| Few git-inspired version controlled databases out there if
| performance becomes an issue. Dolt & TerminusDB are the most
| prominent.
|
| https://github.com/terminusdb/terminusdb
| https://github.com/dolthub/dolt
| axiomdata316 wrote:
| If you are using Restic Backup aren't you coming close to what's
| being recommended here?
| asperous wrote:
| I thought it was a neat article, I assumed it was talking about
| git lfs.
|
| It would be neat if github could store all its data in git,
| similar to fossil scm. But I suppose microsoft would not want to
| lose lockin.
| bastardoperator wrote:
| All commit data is stored in git and the beauty of git outside
| of platform metadata is that you can add a new remote and never
| be locked in.
| gopalv wrote:
| > I thought it was a neat article
|
| I think the article talks about the "What" part of the problem,
| but the actual code is much more interesting in the "How"
| sense.
|
| Like the git-ref stuff makes sense as you read the code
|
| https://github.com/ligurio/git-test/blob/master/bin/git-test...
|
| There was a similar set of additions to svn in the past with
| "svn propedit" in the workflows which I used in a previous
| workplace.
|
| It was not pretty, because it was like embedding JIRA into svn
| - but it meant machines could flip state to state with commits
| during build+test and restart from that point without an
| independent DB to track the "current state" & people with
| commit access could nudge a stuck build out without losing "who
| did what".
| rsync wrote:
| "It would be neat if github could store all its data in git,
| similar to fossil scm."
|
| Yes, that would be very nice - it is unfortunate that you have
| to make API calls (over http) to get things like issues ...
|
| I _think_ you can get the wiki with plain old 'git' ? I forget
| ...
| codetrotter wrote:
| > I think you can get the wiki with plain old 'git' ? I
| forget ...
|
| This is correct. The wiki for a repo is accessible as a
| separate repository named with a suffix of ".wiki".
|
| So if user foo has a repo bar with an associated wiki, and
| the repo URL is https://github.com/foo/bar then you can clone
| the repo and the wiki respectively over SSH by:
|
|     git clone git@github.com:foo/bar.git
|
| and
|
|     git clone git@github.com:foo/bar.wiki.git
|
| I wish they'd do the same for all other repository meta data
| including issues, repository description, etc
| rsync wrote:
| I believe this is true of gitlab and other providers as
| well, correct ?
|
| That is, you need API calls to get things like issues.
|
| Is there a single tool that will handle downloading (and
| the associated API calls) from all of the major providers ?
| Or is each API tool specifically for either github or
| gitlab or sr.ht or whatever ?
| WorldMaker wrote:
| Every issue system has its own API and today there's no
| standard for interchange. It would be interesting to see
| an attempt to build a reusable standard, but I don't know
| what sort of standards agency exists with the guts to try
| something like that. I don't envy the political battle
| that would entail, and having seen some of the horrors of
| bespoke Jira and TFS configurations I'm mostly sure such a
| standard would either be too minimalist and disappoint too
| many people or too maximalist and impossible to build.
| rsync wrote:
| Yes, I understand each provider's standard is different -
| and I agree that would be a real mess to wrangle.
|
| _What I am wondering is_ are there any tools that use
| these APIs that have built-in support for multiple
| provider APIs ? Or is every tool (that helps you manage
| or download issues, etc.) just built for a particular
| provider ?
|
| Thanks.
| WorldMaker wrote:
| git-bug, the one mentioned in the article here, has some
| documentation on its README of how well its
| importer/exporter tools support Github, Gitlab, Jira, and
| Launchpad: https://github.com/MichaelMure/git-bug
|
| Most of the other such tools I've seen barely have the
| resources to import/export a single such API. git-issue
| only has Github import it looks like.
| https://github.com/dspinellis/git-issue
|
| There's perceval which is designed to be a generic
| archival tool and supports lots of APIs, but only dumps
| them to source-specific formats and would still need a
| lot of work if you tried to use issues from different
| APIs together:
| https://github.com/chaoss/grimoirelab-perceval
| GauntletWizard wrote:
| I've been toying with the idea of writing a protocol for Git as a
| "Blockchain" for bank interchange. Require signatures on all
| commits, include a protocol for how to push commit proposals to
| other peers for signing, verifying commits before they're merged,
| etc. No mining, just a distributed transaction log via git.
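
A minimal sketch of the "signed transaction log, no mining" idea, with loud assumptions: every name here is hypothetical, the entries are plain dicts rather than git commits, and HMAC with a per-bank secret stands in for real commit signatures (which would use public keys, e.g. GPG).

```python
import hashlib
import hmac
import json


def sign(key: bytes, payload: bytes) -> str:
    # Stand-in for a real signature; HMAC needs a shared secret.
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def make_entry(parent_id, transaction, signer, key):
    """Build a log entry whose ID covers its parent, like a commit."""
    body = json.dumps(
        {"parent": parent_id, "tx": transaction, "signer": signer},
        sort_keys=True,
    ).encode()
    return {
        "id": hashlib.sha256(body).hexdigest(),
        "body": body,
        "sig": sign(key, body),
    }


def verify_log(entries, keys):
    """Check hashes, parent links and signatures, oldest entry first."""
    parent = None
    for entry in entries:
        fields = json.loads(entry["body"])
        if hashlib.sha256(entry["body"]).hexdigest() != entry["id"]:
            return False  # content was altered after hashing
        if fields["parent"] != parent:
            return False  # chain is broken or reordered
        expected = sign(keys[fields["signer"]], entry["body"])
        if not hmac.compare_digest(expected, entry["sig"]):
            return False  # not signed by the claimed peer
        parent = entry["id"]
    return True


keys = {"bank_a": b"secret-a", "bank_b": b"secret-b"}
e1 = make_entry(None, {"from": "x", "to": "y", "amount": 10},
                "bank_a", keys["bank_a"])
e2 = make_entry(e1["id"], {"from": "y", "to": "z", "amount": 4},
                "bank_b", keys["bank_b"])
log = [e1, e2]
```

`verify_log(log, keys)` accepts the untouched log and rejects it if an earlier entry is edited, because each entry's ID depends on its parent's ID. The proposal and co-signing protocol the comment describes is not modelled here.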
| 8eye wrote:
| i like this idea, i think you should look further into it.
| there might be a market for it
| carapace wrote:
| Kind of an aside, but I've been toying with a simple functional
| language based on Joy and when it came to exposing the filesystem
| it seemed too fraught with impurity, so instead I'm just using
| git as the data storage system. Instead of strings or blobs you
| have handles that are essentially three-tuples of (git object
| hash, offset, length). It's early days yet, but so far the
| approach seems promising. (In re: string literals, well, your
| literal is in a source file, and your source file is in git, so
| each literal has its tuple already, eh?)
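
The three-tuple handle could look something like the sketch below. It is not carapace's implementation: `Handle`, `store`, `resolve` and `sub_handle` are hypothetical names, and an in-memory dict keyed by SHA-1 stands in for git's object database.

```python
import hashlib
from typing import NamedTuple


class Handle(NamedTuple):
    """(git object hash, offset, length), as described in the comment."""
    object_hash: str
    offset: int
    length: int


objects = {}  # in-memory stand-in for the git object store


def store(content: bytes) -> str:
    key = hashlib.sha1(content).hexdigest()
    objects[key] = content
    return key


def resolve(handle: Handle) -> bytes:
    """Fetch the bytes a handle refers to."""
    data = objects[handle.object_hash]
    return data[handle.offset:handle.offset + handle.length]


def sub_handle(handle: Handle, start: int, length: int) -> Handle:
    """Slice a value without copying bytes: only the tuple changes."""
    if start < 0 or length < 0 or start + length > handle.length:
        raise ValueError("slice out of range")
    return Handle(handle.object_hash, handle.offset + start, length)


# A string literal already lives inside a stored source file,
# so it already has a handle.
source = b'greet = "hello world"\n'
source_id = store(source)
literal = Handle(source_id, source.index(b"hello"), len(b"hello world"))
word = sub_handle(literal, 6, 5)
```

Here `resolve(literal)` yields the literal's bytes and `resolve(word)` yields `b"world"`, with both handles pointing into the same stored object.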
| rectang wrote:
| How feasible is it to store raw content in the Git content-
| addressable-store (CAS)? Git blobs are Zlib compressed.
|
| I'd like to be able to store audio files uncompressed, so that
| they could be read directly from the CAS, rather than having to
| be expanded out into a checkout directory.
| u801e wrote:
| IIRC, a git blob has the size of the data encoded in the first
| 4 bytes of the file, and the data itself appended to it. It
| could be stored uncompressed, but I don't think there's
| anything in the git plumbing layer that could deal with it
| directly.
|
| That said, even if it is compressed, a command like
| git cat-file could be used to pipe the contents of the file to
| stdout or any other program that could use them as input
| without having to create a file on disk.
| rectang wrote:
| The header for a blob file is "blob", a space, the length of
| the content as ASCII integer representation, then a null
| byte.
|
|     $ echo "hello world" > HELLO.txt
|     $ git add HELLO.txt
|     $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \
|     > zpipe -d | \
|     > hexdump -e '"|"24/1 "%_p" "|\n"'
|     |blob 12.hello world.|
|     $
|
| The header and the content get concatenated together, and the
| whole thing gets Zlib compressed. The SHA1 is calculated from
| the header-plus-content _before_ it gets Zlib compressed.
|
|     $ cat .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad | \
|     > zpipe -d | \
|     > shasum
|     3b18e512dba79e4c8300dd08aeb37f8e728b8dad -
|     $
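
The same header-plus-content rule can be reproduced without git at all. A minimal sketch using only the Python standard library; the function names are made up for illustration, and it assumes a SHA-1 repository and a loose (unpacked) object.

```python
import hashlib
import zlib


def blob_object(content: bytes) -> bytes:
    """Header ("blob", space, ASCII length, NUL) followed by the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return header + content


def blob_id(content: bytes) -> str:
    # The ID is the SHA-1 of header plus content, before compression.
    return hashlib.sha1(blob_object(content)).hexdigest()


def loose_object_path(object_id: str) -> str:
    # First two hex digits name the directory, the rest name the file.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


content = b"hello world\n"  # what `echo "hello world"` writes
object_id = blob_id(content)
path = loose_object_path(object_id)

# What lands on disk is the zlib-compressed header plus content.
compressed = zlib.compress(blob_object(content))
```

For this input `object_id` matches the `3b18e512...` hash in the shell session above. The compressed bytes may differ from what git writes, since that depends on the compression level, but they decompress to the same object.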
|
| What I would like to do is record an audio file (e.g. LPCM
| BWF), take its SHA1 and store it in the CAS as raw content,
| then reference it somehow from a Git commit. That way it will
| be part of the history and will travel with `push` and
| `clone`, won't get gc'd, etc.
|
| > _That said, even if it is compressed, a command like git
| cat-file could be used to pipe the contents of the file to
| stdout or any other program that could use them as input
| without having to create a file on disk._
|
| That's a neat suggestion! However, I don't see how it would
| be compatible with random access, which is important for my
| application.
| GauntletWizard wrote:
| Basically that's what Git-LFS does; it takes the SHA of the
| file, stores it in the git version of the file, and then
| stores the contents next to it. It's all transparent and
| works pretty well.
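
For reference, the small file that Git LFS keeps in the repository in place of the real content can be sketched as follows. This follows the LFS pointer format as I understand it (a `version` line, then `oid` with a SHA-256 digest rather than SHA-1, then `size`); `lfs_pointer` and `parse_pointer` are hypothetical helper names, not part of any LFS tooling.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Text of the pointer file that stands in for a large file."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


def parse_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


audio = b"hello world\n"  # stand-in for a large audio file
pointer = lfs_pointer(audio)
fields = parse_pointer(pointer)
```

The real content is stored outside the normal object database and fetched by its `oid`, so it is not zlib-compressed inside a blob. Whether that gives the random access rectang wants would depend on how the LFS storage is laid out locally.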
| rektide wrote:
| the core of cat-file.c is quite short. i think you could
| get the random access you want with minimal effort.
| ideally, upstream support for --offset and --count or what
| not to git; a lot of people would benefit.
|
| https://github.com/git/git/blob/master/builtin/cat-file.c
|
| you can absolutely make tools to expand out & load git
| repos into content stores. it's going to depend on the
| content store how you do that.
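
To make the random-access concern concrete, here is a sketch of reading a byte range out of a loose blob's compressed bytes. The `--offset` and `--count` flags mentioned above are rektide's wish, not existing git options as far as I know; `read_range` is a hypothetical helper, and it assumes a loose object (packed objects may be delta-encoded and need more work).

```python
import zlib


def iter_decompressed(compressed: bytes, chunk_size: int = 4096):
    """Yield decompressed data in bounded chunks to keep memory flat."""
    decomp = zlib.decompressobj()
    data = compressed
    while data:
        chunk = decomp.decompress(data, chunk_size)
        data = decomp.unconsumed_tail
        if chunk:
            yield chunk
    tail = decomp.flush()
    if tail:
        yield tail


def read_range(compressed: bytes, offset: int, count: int) -> bytes:
    """Return up to `count` content bytes starting at `offset`.

    A zlib stream cannot be seeked, so everything before the range is
    still decompressed, just discarded chunk by chunk.
    """
    header = bytearray()
    in_header = True
    pos = 0  # position within the blob content (header excluded)
    out = bytearray()
    for chunk in iter_decompressed(compressed):
        if in_header:
            header += chunk
            nul = header.find(b"\x00")
            if nul < 0:
                continue  # header not complete yet
            chunk = bytes(header[nul + 1:])
            in_header = False
        start = max(offset - pos, 0)
        end = min(offset + count - pos, len(chunk))
        if start < end:
            out += chunk[start:end]
        pos += len(chunk)
        if pos >= offset + count:
            break  # no need to decompress the rest
    return bytes(out)


content = bytes(range(256)) * 100
compressed = zlib.compress(b"blob %d\x00" % len(content) + content)
piece = read_range(compressed, 1000, 16)
```

The cost of a read grows with the offset, which is the practical limit on random access into zlib-compressed blobs and presumably why rectang wants the content stored raw.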
| haberman wrote:
| I've recently had a similar idea for when you want to track
| metrics of a Git repository over time (code size, line count,
| etc).
|
| I would love to create a script to take a measurement of the
| current tree, then run a tool that runs my script at ~every
| commit so I can draw a graph of how the metric changes over time.
|
| It's a bit tricky: if you change the script, you need to re-run
| the analysis at every commit. It starts looking a little bit like
| a build system, but integrated over time.
|
| I've thought of calling this GitReduce or similar, since it has
| some similarity to MapReduce: first a "map" step runs at every
| commit, then the "reduce" step combines all of the individual
| outputs into a single graph or whatever.
|
| Ideally Git itself could be the only storage engine, so you can
| trivially serve the results from GitHub.
| sulZ wrote:
| git provides a couple of options for running a script against
| the code for every commit
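
The "GitReduce" idea can be sketched in a few lines. Assumptions: all names are hypothetical, a dict stands in for wherever results would be stored (haberman would prefer git itself), and `list_commits` simply shells out to `git rev-list`. Keying the cache by script version as well as commit covers the "re-run when the script changes" problem.

```python
import subprocess


def list_commits(repo="."):
    """Commit IDs reachable from HEAD, oldest first (needs git installed)."""
    result = subprocess.run(
        ["git", "-C", repo, "rev-list", "--reverse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def git_reduce(commits, measure, script_version, cache, reduce):
    """Run `measure` once per (script version, commit), then combine."""
    results = []
    for commit in commits:
        key = (script_version, commit)
        if key not in cache:
            # "map" step: only runs for commits not measured yet, or
            # for all of them again when the script version changes.
            cache[key] = measure(commit)
        results.append((commit, cache[key]))
    # "reduce" step: fold the per-commit values into one output.
    return reduce(results)


# Demo with fake commits so the sketch runs without a repository.
calls = []


def fake_measure(commit):
    calls.append(commit)
    return len(commit)


cache = {}
series = git_reduce(["aaa", "bbbb"], fake_measure, "v1", cache, reduce=list)
```

In real use, `commits` would come from `list_commits()` and `measure` would check out or inspect the tree at that commit. Calling `git_reduce` again with `"v1"` does no new work, while bumping to `"v2"` re-measures every commit.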
| rectang wrote:
| What I really want to see is a blog post on Git as an undo
| engine.
|
| This is related to the idea of using Git as general storage, in
| that the undo history can be persisted, and then reconstituted by
| a new process. The trick would be to make all actions compatible
| with conversion to and from a commit.
| yepguy wrote:
| Emacs kind of has this in the form of magit-wip-mode. It
| doesn't sync with the undo system but it does persist every
| file save event since the last "real" commit.
| muixoozie wrote:
| I need to look into this.
|
| Just yesterday I lost some work (only maybe 10 minutes worth)
| when I was updating my org notes. I staged some files,
| committed them, but made a typo in the commit message. I
| ended up reverting the commit when I meant to amend. Then I
| discarded the staged reverted changes and noticed the
| status said I was still reverting. So I looked up the command
| to get me out of that state. I ran 'git revert --abort' and
| it blew away my unstaged changes. Ah well, those versioned
| backups I have Emacs do are going to save me this time, or so
| I thought.
| azhenley wrote:
| I've seen programs that serialize the undo manager, but is
| there an advantage to using git instead? I suppose it does most
| of the work for you, but you'll still need to manage actions
| that can be undone but don't directly modify a file (these are
| entirely app dependent, but could be something like changing
| which tab document is selected).
| rectang wrote:
| The advantage is that Git's data structure is an open design,
| with existing tools able to introspect and possibly
| manipulate it, with potential users and contributors able to
| leverage their existing knowledge, and with
| documents/projects still being parseable decades from now.
|
| I have dreams of implementing a music composition tool /
| audio editor with a line-oriented edit-decision-list (EDL)
| text file format where changes could map coherently onto
| git history. Ideally, that EDL format would be an open
| standard as well: I've contemplated the Pure Data file format
| and AES31 as possible candidates. This is just at the
| conceptual stage, though.
| dTal wrote:
| One problem with git is that it assumes you have some sort
| of "diff" utility for before and after snapshots. For some
| types of things, that's not feasible - you can't easily
| diff two images and work out what filter was used.
| Sometimes you instead want to _create_ the diff, and
| _generate_ the "after" snapshot. You could emulate this by
| stowing the information somewhere hidden and writing a fake
| "diff" that simply retrieves it - but saving the edit tree
| and reconstituting it is supposed to be git's job! You end
| up reimplementing bits of git, just to make git work.
| rectang wrote:
| I suppose the "diff" limitation would constrain the
| potential use cases. For my purposes, I don't think that
| would be a problem, because the project format would
| consist of two types of files:
|
| * PCM audio files which are captured once and then never
| modified.
|
| * Line oriented text files for which the traditional
| "diff" functionality suffices.
| rektide wrote:
| one of these years i'll have something fun to show you
___________________________________________________________________
(page generated 2021-10-08 23:00 UTC)