[HN Gopher] Git as a NoSql Database (2016)
       ___________________________________________________________________
        
       Git as a NoSql Database (2016)
        
       Author : EntICOnc
       Score  : 73 points
       Date   : 2021-04-05 19:55 UTC (3 hours ago)
        
 (HTM) web link (www.kenneth-truyers.net)
 (TXT) w3m dump (www.kenneth-truyers.net)
        
       | math-dev wrote:
       | Thank you, was a nice read (particularly the conclusion, which
       | was insightful and made sense)
        
       | ioquatix wrote:
       | For the lolz: https://github.com/ioquatix/relaxo
        
         | pokstad wrote:
         | Looks like it is inspired by CouchDB
        
         | jaredcwhite wrote:
         | No lolz here! I actually need _exactly_ this as I'll soon be
         | starting work on a CMS-style editor for Git-based static sites,
         | and it's all Ruby to boot. Awesome sauce!
        
           | 0xbadcafebee wrote:
           | lolz!
        
       | lisper wrote:
       | > The reason why it says stupid in the man-pages is that it makes
       | no assumptions about what content you store in it.
       | 
       | That's not true. The assumption that you are storing text files
       | is very much built in to the design of git, and specifically,
       | into the design of its diff and merge algorithm. Those algorithms
       | treat line breaks as privileged markers of the structure of the
       | underlying data. This is the reason git does not play well with
       | large binary files.
        
         | mediocregopher wrote:
         | The diffing/merging stuff that you see when you use the git
         | tool is really just some sugar on top of the underlying data
         | which is being stored. If you look at what a commit actually
         | contains, it's just a hash of the tree root, one or more parent
         | commits identified by their hash, and some other metadata
         | (author, date, etc). There's nothing about the commit that
         | cares about the contents of its tree or its parents' trees.
         | It's the git tooling on top of those structures which does.
         | 
         | In git world this distinction is referred to as the "porcelain"
         | vs "plumbing", the plumbing being the underlying structures and
         | porcelain being the stuff most people actually use (diffs,
         | merges, rebases, etc...)
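
To make the plumbing view concrete, here is a minimal Python sketch of how an
object ID is derived and what a commit body holds. The helper names
(git_object_id, commit_body) and the author string are made up for
illustration; only the "<type> <size>\0<content>" header and the SHA-1 hash
reflect git's object format.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way git does: SHA-1 over '<type> <size>\\0' + body."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree_id: str, parent_ids: list[str], author: str, message: str) -> bytes:
    """Assemble a commit body: a tree hash, parent hashes, metadata, message."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {author}", f"committer {author}", "", message]
    return ("\n".join(lines) + "\n").encode()


# Hypothetical author line; the timestamp is fixed so the IDs are reproducible.
AUTHOR = "A U Thor <author@example.com> 0 +0000"

empty_tree = git_object_id("tree", b"")
root = commit_body(empty_tree, [], AUTHOR, "initial")
root_id = git_object_id("commit", root)
child = commit_body(empty_tree, [root_id], AUTHOR, "second")

print(git_object_id("blob", b"hello world\n"))
print(child.decode())
```

The commit refers to its tree and parent only by hash, so nothing in it
depends on whether the blobs underneath are text or binary.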
        
           | lisper wrote:
           | Yes, I get that. But:
           | 
           | > one or more parent commits identified by their hash
           | 
           | The reason that matters is because you need that information
           | to do merges. So merges are integral to git. They are not
           | "porcelain". They are, arguably, the whole point.
        
       | bachmeier wrote:
       | > You can query by key ... and that's about it. The only piece of
       | good news here is that you can structure your data in folders in
       | such a way that you can easily get content by prefix, but that's
       | about it. Any other query is off limits, unless you want to do a
       | full recursive search. The only option here is to build indices
       | specifically for querying. You can do this on a scheduled basis
       | if staleness is of no concern or you can use git hooks to update
       | indices as soon as a commit happens.
       | 
       | Isn't the point of a database the ability to query? Why else
       | would you want "Git as a NoSql database?" If this is really what
       | you're after, maybe you should be using Fossil, for which the
       | repo is an sqlite database that you can query like any sqlite
       | database.
        
         | js2 wrote:
         | The point of a database is to hold data, and there are many
         | key/value databases for which you can only query by key unless
         | you add a secondary index. (e.g. Bigtable, DynamoDB, Riak,
         | Redis, etc).
        
         | MuffinFlavored wrote:
         | > Isn't the point of a database the ability to query?
         | 
         | I was thinking about this in terms of the new CloudFlare
         | Durable Objects open beta:
         | https://blog.cloudflare.com/durable-objects-open-beta/ Storing
         | data is really only half the battle.
         | 
         | How could you reimplement the poorest of a poor man's SQL on
         | top of a key/value store? The idea I came up with is: whatever
         | fields you want to query by need an index as a separate key.
         | Maybe this would allow JOINs? Probably not?
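
The "index as a separate key" idea above can be sketched in a few lines of
Python. This is a toy under stated assumptions: the IndexedStore class, the
"record/" and "index/" key prefixes, and the sample data are all invented
here, and a plain dict stands in for the key/value store.

```python
class IndexedStore:
    """Toy key/value store that maintains one extra key per indexed field value."""

    def __init__(self, indexed_fields):
        self.kv = {}  # the only storage: plain key -> value
        self.indexed_fields = tuple(indexed_fields)

    def _index_key(self, field, value):
        return f"index/{field}/{value}"

    def put(self, key, record):
        self.delete(key)  # drop stale index entries before rewriting
        self.kv[f"record/{key}"] = dict(record)
        for field in self.indexed_fields:
            if field in record:
                idx = self._index_key(field, record[field])
                self.kv.setdefault(idx, set()).add(key)

    def delete(self, key):
        old = self.kv.pop(f"record/{key}", None)
        if old is None:
            return
        for field in self.indexed_fields:
            if field in old:
                idx = self._index_key(field, old[field])
                self.kv[idx].discard(key)
                if not self.kv[idx]:
                    del self.kv[idx]

    def get(self, key):
        return self.kv.get(f"record/{key}")

    def find(self, field, value):
        """Equality query: one lookup on the index key, then one get per hit."""
        keys = self.kv.get(self._index_key(field, value), set())
        return [self.kv[f"record/{k}"] for k in sorted(keys)]


store = IndexedStore(indexed_fields=["city"])
store.put("u1", {"name": "Ann", "city": "Ghent"})
store.put("u2", {"name": "Bob", "city": "Ghent"})
store.put("u1", {"name": "Ann", "city": "Leuven"})  # moves u1 between index keys
print(store.find("city", "Ghent"))
```

A join can be emulated by chaining lookups (find on one index, then get by
the key stored in each hit), at the cost of one extra read per row.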
        
       | 0xbadcafebee wrote:
       | How about StackExchangeQL? Write your data as comments on
       | StackExchange questions. To update the data, reply to your
       | comment. It's like NoSQL because somebody else is hosting it for
       | you and there's no schema.
        
       | simonw wrote:
       | In almost every database-backed application I've ever built
       | someone, at some point, inevitably asks for the ability to see
       | what changes were made when and by whom.
       | 
       | My current preferred strategy for dealing with this (at least for
       | any table smaller than a few GBs) is to dump the entire table
       | contents to a git repository on a schedule.
       | 
       | I've run this for a few toy projects and it seems to work really
       | well. I'm ready to try it with something production-scale the
       | next time the opportunity presents itself.
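
A rough Python sketch of the scheduled-dump approach, assuming a SQLite
source and an existing git work tree with user.name and user.email already
configured. The function names and the CSV layout are choices made for this
example, not something described in the comment above. Ordering by rowid
keeps the output stable between runs so that git diffs show only real
changes (this does not apply to WITHOUT ROWID tables).

```python
import csv
import sqlite3
import subprocess
import tempfile
from pathlib import Path


def dump_table(conn: sqlite3.Connection, table: str, out_path: Path) -> None:
    """Write the whole table as CSV, ordered by rowid so diffs stay stable."""
    # `table` is interpolated directly; only pass trusted, hard-coded names.
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid')
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(col[0] for col in cursor.description)
        writer.writerows(cursor)


def commit_snapshot(repo: Path, message: str) -> None:
    """Stage everything and commit; assumes `repo` is an existing git work tree."""
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    # `git diff --cached --quiet` exits with 1 when something is staged.
    staged = subprocess.run(["git", "diff", "--cached", "--quiet"], cwd=repo)
    if staged.returncode == 1:
        subprocess.run(["git", "commit", "-m", message], cwd=repo, check=True)


# Demo of the dump step only; commit_snapshot needs a real repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO people (name) VALUES (?)", [("Ann",), ("Bob",)])
snapshot = Path(tempfile.mkdtemp()) / "people.csv"
dump_table(conn, "people", snapshot)
print(snapshot.read_text())
```

Run from cron or a similar scheduler, each commit then records the state of
the table at that moment, and `git log -p` on the CSV shows what changed
between snapshots.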
        
         | lisper wrote:
         | > My current preferred strategy for dealing with this (at least
         | for any table smaller than a few GBs) is to dump the entire
         | table contents to a git repository on a schedule.
         | 
         | How does that solve the problem? All this would do is give you
         | snapshots of the state of the DB at particular points in time.
         | 
         | The only way to know who changed what when is to keep track of
         | those changes as they are made, and the best way to do that is
         | to stick the changes in an eventlog table.
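
For comparison, a minimal sketch of the eventlog-table approach in Python
with SQLite. The schema, column names, and helper functions are assumptions
made for this example. The point it illustrates is that the change and its
log entry are written in one transaction, so each change carries its own
author instead of being inferred from snapshots.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
    CREATE TABLE eventlog (
        seq         INTEGER PRIMARY KEY AUTOINCREMENT,
        changed_at  TEXT DEFAULT CURRENT_TIMESTAMP,
        changed_by  TEXT NOT NULL,
        record_id   INTEGER NOT NULL,
        column_name TEXT NOT NULL,
        old_value   TEXT,
        new_value   TEXT
    );
    INSERT INTO records (id, col1, col2) VALUES (1, 'a', 'b');
""")

ALLOWED_COLUMNS = {"col1", "col2"}  # whitelist, since column names cannot be bound


def update_record(conn, user, record_id, column, new_value):
    """Apply one change and log it in the same transaction."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            f"SELECT {column} FROM records WHERE id = ?", (record_id,)
        ).fetchone()
        if row is None:
            raise KeyError(record_id)
        conn.execute(
            f"UPDATE records SET {column} = ? WHERE id = ?", (new_value, record_id)
        )
        conn.execute(
            "INSERT INTO eventlog"
            " (changed_by, record_id, column_name, old_value, new_value)"
            " VALUES (?, ?, ?, ?, ?)",
            (user, record_id, column, row[0], new_value),
        )


def who_changed(conn, record_id, column):
    """Return the authors of every change to one column of one record, in order."""
    rows = conn.execute(
        "SELECT changed_by FROM eventlog"
        " WHERE record_id = ? AND column_name = ? ORDER BY seq",
        (record_id, column),
    )
    return [r[0] for r in rows]


update_record(conn, "sally", 1, "col1", "x")
update_record(conn, "billy", 1, "col2", "y")
print(who_changed(conn, 1, "col1"))
```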
        
         | acidbaseextract wrote:
         | Completely agreed on the inevitability of that ask.
         | 
         | At risk of extreme overkill the other way, something like
         | Debezium [1] doing change monitoring dumping into S3 might be a
         | viable industrial strength approach. I haven't used it in
         | production, but have been looking for an appropriate time to try
         | it.
         | 
         | [1] https://debezium.io/
        
         | irrational wrote:
         | So you commit to Git.
         | 
         | Sally makes a change to column 1 of record 1.
         | 
         | Billy makes a change to column 2 of record 1 a nanosecond
         | later.
         | 
         | You commit to Git again.
         | 
         | Your boss wants to know who changed column 1 of record 1.
         | 
         | You report it was Billy.
         | 
         | Billy is fired.
        
         | sedeki wrote:
         | So each commit is exactly one change to a relevant table?
         | 
         | I guess you can't just plug this into a common system of
         | rotating logs easily, as there might be several changes between
         | the log rotation.
         | 
         | Also, I guess you'd need a user friendlier interface to
         | actually display who made the change from the git repo.
         | 
         | Anyway, interesting solution.
        
       | Spivak wrote:
       | I don't know the name of this type of learning style but I would
       | eat up a tutorial on Git that started with the plumbing commands
       | like this and worked their way up. I think this is oddly one of
       | the clearest explanations of Git I've ever read. I know that
       | wasn't really the point of the post but still.
        
         | bspammer wrote:
         | There's nothing like writing it yourself for understanding
         | something. This tutorial goes through reimplementing the basics
         | of git in a few lines of Python, it's surprisingly simple and
         | doable in an afternoon: https://wyag.thb.lt/
        
         | bacon_waffle wrote:
         | This doesn't start with the plumbing commands, but it's along
         | those lines and seemed helpful to some new-to-git folks at work
         | when we started switching to it last year:
         | https://eagain.net/articles/git-for-computer-scientists/
         | 
         | If you're already comfortable with git, the Git Internals
         | section of the book may be better:
         | https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
        
         | weaksauce wrote:
         | many people have done it but here's another one
         | https://jwiegley.github.io/git-from-the-bottom-up/
        
         | finnthehuman wrote:
         | Write yourself a git https://wyag.thb.lt/
        
         | chmaynard wrote:
         | This article/video by Mary Rose Cook is helpful:
         | 
         | https://maryrosecook.com/blog/post/git-from-the-inside-out
        
         | cryptonector wrote:
         | My attempt:
         | https://gist.github.com/nicowilliams/a6e5c9131767364ce2f4b39...
        
       | skybrian wrote:
       | I haven't used it myself, but based on their blog, Dolt looks
       | like a nicer way to share tables of data in a git-like way, since
       | you get both branches and merges and SQL support.
       | 
       | https://www.dolthub.com/
        
       | chubot wrote:
       | I just started using git annex, which is an extension for large
       | files.
       | 
       | https://git-annex.branchable.com/
       | 
       | I like it a lot so far, and I think it could be used for "cloud"
       | stuff, not just backups.
       | 
       | I'd like to see Debian/Docker/PyPI/npm repositories in git annex,
       | etc.
       | 
       | It has lazy checkouts which fits that use case. By default you
       | just sync the metadata with 'git annex sync', and then you can
       | get content with 'git annex get FILE', or git annex sync
       | --content.
       | 
       | Can anyone see a reason why not? Those kinds of repositories all
       | seem to have weird custom protocols. I'd rather just sync the
       | metadata and do the query/package resolution locally. It might be
       | a bit bigger that way, but you only have to fully sync it once, and the
       | rest are incremental.
        
         | GordonS wrote:
         | I'm a little confused about what git annex is - I think it's
         | perhaps a basic file sync utility that uses git behind the
         | scenes?
        
           | kzrdude wrote:
           | In one way it's a very distributed/decentralized alternative
           | to git-lfs.
        
           | rakoo wrote:
           | The website (https://git-annex.branchable.com/) has many
           | details, including scenarios to explain why it can be useful.
           | git-annex is not so much a backup/sync utility, it's more a
           | tool to track your files if they exist on multiple
           | repositories. Instead of having remote storages holding files
           | that happen to be the same, with git-annex the relationship
           | is inverted: you have files, and each one can be stored on
           | multiple storages. You can follow where they are, push
           | them/get them from any storage that has it, remove them from
           | one place if it's short on free space knowing that other
           | storages still have it...
           | 
           | There was a project of backing up the internet archive by
           | using git-annex
           | (https://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE...).
           | Basically the source project would
           | create repositories of files and users like you and I would
           | be remote repositories; we would get content and claim that
           | we have it, so that everyone would know this repository has a
           | valid copy on our server.
        
           | chubot wrote:
           | Yeah it solves the problem of putting big binaries that don't
           | compress well inside git.
           | 
           | If you've ever tried that, then git will start to choke
           | around a few gigabytes (I think it's the packing/diffing
           | algorithms). Github recommends that you keep repos less than
           | 1 GB and definitely less than 5 GB, and they probably have a
           | hard limit.
           | 
           | So what git annex does is simply store symlinks to big files
           | inside .git/annex, and then it has algorithms for managing
           | and syncing the big files. I don't love symlinks and neither
           | does the author, but it seems to work fine. I just do ls -L
           | -l instead of ls -l to follow the symlinks.
           | 
           | I think package repos are something like 300 GB, which should
           | be easily manageable by git annex. And again you don't have
           | to check out everything eagerly. I'm also pretty certain that
           | git annex could support 3TB or 30TB repos if the file system
           | has enough space.
           | 
           | For container images, I think you could simply store layers
           | as files which will save some space for many versions.
           | 
           | There's also git LFS, which github supports, but git annex
           | seems more truly distributed, which I like.
        
             | warp wrote:
             | With git-annex the size of the repo isn't as important,
             | it's more about how many files you have stored in it.
             | 
             | I find git-annex to become a bit unwieldy at around 20k to
             | 30k files, at least on modest hardware like a Raspberry Pi
             | or a core i3.
             | 
             | (This hasn't been a problem for my use case, I've just
             | split things up into a couple of annex repos)
        
           | jpeloquin wrote:
           | Git annex is pretty flexible, more of a framework for storing
           | large files in git than a basic sync utility. ("Large"
           | meaning larger than you'd want to directly commit to git.) If
           | you're running Git Annex Assistant, it does pretty much work
           | as basic file sync of a directory. But you can also use it
           | with normal git commits, like you would Git LFS. Or as a file
           | repository as chubot suggested. The flexibility makes it a
           | little difficult to get started.
           | 
           | The basic idea is that each file targeted by `git annex add`
           | gets replaced by a symlink pointing to its content. The
           | content is managed by git annex and lives as a checksum-
           | addressable blob in .git/annex. The symlink is staged in git
           | to be committed and tracked by the usual git mechanisms. Git
           | annex keeps a log of which host has (had) which file in a
           | branch named "git-annex". (There is an alternate non-symlink
           | mechanism for Windows that I don't use and know little
           | about.)
           | 
           | I use git annex in the git LFS-like fashion to store
           | experimental data (microscope images, etc.) in the same
           | repository as the code used to analyze it. The main downside
           | is that you have to remember to sync (push) the git annex
           | branch _and_ copy the annexed content, as well as pushing
           | your main branch. It can take a very long time to sync
           | content when the other repository is not guaranteed to have
           | all the content it's supposed to have, since in that scenario
           | the existence and checksum of each annexed file has to be
           | checked. (You can skip this check if you're feeling lucky.)
           | Also, because partial content syncs are allowed, you do need
           | to run `git annex fsck` periodically and pay attention to the
           | number of verified file copies across repos.
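
The symlink mechanism described above can be imitated in a few lines of
Python. This is only a sketch of the general idea (content moved into a
checksum-addressed store, a symlink left in its place); it does not
reproduce git-annex's actual key format, directory layout, or location
tracking, and the names annex_add, verify, and .store are invented here.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def annex_add(path: Path, store: Path) -> Path:
    """Move a file's content into `store` under its SHA-256 and leave a symlink."""
    # Reads the whole file into memory; a real tool would hash in chunks.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    blob = store / digest
    if blob.exists():
        path.unlink()  # identical content is already stored; keep one copy
    else:
        os.replace(path, blob)
    # A relative link keeps working if the whole directory tree is moved.
    path.symlink_to(os.path.relpath(blob, start=path.parent))
    return blob


def verify(path: Path) -> bool:
    """Rough analogue of an fsck: does the content still match its name?"""
    blob = path.resolve()
    return hashlib.sha256(blob.read_bytes()).hexdigest() == blob.name


work = Path(tempfile.mkdtemp())
image = work / "scan.tif"
image.write_bytes(b"pretend this is a large microscope image")
blob = annex_add(image, work / ".store")
print(image.is_symlink(), verify(image))
```

Only the small symlink would be committed to git; the blob itself is what a
tool like git annex copies between repositories and checks for integrity.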
        
       | zachmu wrote:
       | Or for a SQL database with Git versioning semantics:
       | 
       | https://github.com/dolthub/dolt
        
       | simon_acca wrote:
       | Interesting exercise!
       | 
       | Some pointers to related ideas:
       | 
       | https://en.m.wikipedia.org/wiki/Persistent_data_structure -
       | https://docs.datomic.com/cloud/index.html - https://opencrux.com/
       | - https://github.com/attic-labs/noms -
       | https://researcher.watson.ibm.com/researcher/files/us-leejin...
        
       | fiddlerwoaroof wrote:
       | Tangentially I started a re-implementation of git in Common
       | Lisp[1], and have completed parsers for most of the file formats
       | except delta encoded objects.
       | 
       | Does anyone happen to know of an implementation or tests for
       | delta encoding I could consult that is available under an MIT-
       | like license? (BSD, Apache v2, etc.)
       | 
       | [1]: https://github.com/fiddlerwoaroof/cl-git
        
         | ori_b wrote:
         | I just used git-fsck, and my implementation is here (both read
         | and write, hosted using git9 on 9front):
         | 
         | http://shithub.us/ori/git9/724c516a6eda0063439457a6701ef0d7e...
         | 
         | http://shithub.us/ori/git9/724c516a6eda0063439457a6701ef0d7e...
         | 
         | as far as I'm aware, the OpenBSD implementation of git, Game of
         | trees, is adapting this code too.
        
           | fiddlerwoaroof wrote:
           | Thanks, delta encoding stumped me a bit because the relevant
           | git documentation was a bit ambiguous.
           | 
           | Maybe I'll implement this tonight.
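
For readers following along, here is a compact Python sketch of applying a
git pack delta, written from the pack-format description as I understand
it: two variable-length sizes, then a stream of copy and insert
instructions. It is not taken from any of the implementations linked above,
the function names are invented, and it should be treated as a starting
point for tests rather than a vetted decoder.

```python
def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Read a little-endian base-128 size; return (value, next position)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the target object from `base` and a git pack delta."""
    source_size, pos = read_varint(delta, 0)
    target_size, pos = read_varint(delta, pos)
    if source_size != len(base):
        raise ValueError("delta was made against a different base")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:  # copy from base; low 7 bits say which operand bytes follow
            offset = size = 0
            for bit in range(4):  # up to 4 offset bytes, least significant first
                if cmd & (1 << bit):
                    offset |= delta[pos] << (8 * bit)
                    pos += 1
            for bit in range(3):  # up to 3 size bytes, least significant first
                if cmd & (0x10 << bit):
                    size |= delta[pos] << (8 * bit)
                    pos += 1
            if size == 0:
                size = 0x10000  # an omitted size means 64 KiB
            if offset + size > len(base):
                raise ValueError("copy runs past the end of the base")
            out += base[offset:offset + size]
        elif cmd:  # insert the next `cmd` literal bytes
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("opcode 0 is reserved")
    if len(out) != target_size:
        raise ValueError("result does not match the declared target size")
    return bytes(out)


# Hand-built delta: sizes 11 and 11, copy 6 bytes from offset 0, insert "there".
base = b"hello world"
delta = bytes([11, 11, 0x90, 6, 5]) + b"there"
print(apply_delta(base, delta))
```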
        
       | hungryhobo wrote:
       | Most noSQL can guarantee at least two of consistency,
       | availability and partition tolerance, this seems to drop the ball
       | on all 3.
        
         | ses1984 wrote:
         | I can't tell if this is sarcasm.
        
           | hungryhobo wrote:
           | In what ways does git implement CAP better than a NoSQL
           | database?
        
       | chmaynard wrote:
       | Fascinating article. Looks like it was posted to HN at least six
       | times before today and got no traction. In my browser, the
       | examples are sometimes a concatenation of the command and the
       | output. Does anyone else see this?
        
         | escanor wrote:
         | yes, i see this as well. at first it was confusing, but later i
         | understood it was the output.
        
       | cschulee wrote:
       | "Your scientists were so preoccupied with whether or not they
       | could, they didn't stop to think if they should."
        
       | cryptonector wrote:
       | I should write a SQL as a NoSQL database blog post, showing a
       | trivial one-table schema that does all you could ever want in a
       | NoSQL, especially after choosing to ignore that it still has SQL.
        
         | munk-a wrote:
         | This may or may not be the case anymore[1] - but for a long
         | while Postgres managed to outperform MongoDB when restricted
         | to just using single JSON column tables.
         | 
         | One of the larger relational DBs beating the OG NoSQL will
         | always fill me with amusement.
         | 
         | 1. Ed. looks like this is still the case, but please remember
         | this is MongoDB specifically. AFAICT MongoDB is mostly a dead
         | branch at this point and a comparison against Redis or
         | something else that's had continued investment might be more
         | fair. I'm just happy chilling in my SQL world - I'm not super
         | up-to-date on the NoSQL market.
        
           | Cir0X wrote:
           | I can't find any sources for this. Can you provide one?
        
             | munk-a wrote:
             | Here's one (read the article in detail, apparently this is
             | up to pretty recently...)
             | https://www.enterprisedb.com/news/new-benchmarks-show-postgr...
             | 
             | Also, here's SiSense sensationalizing it for easy digest:
             | https://www.sisense.com/blog/postgres-vs-mongodb-for-storing...
        
         | Lorin wrote:
         | check out RedBeanPHP - it does some interesting stuff for
         | schema liquidity
        
         | devoutsalsa wrote:
         | Someone wrote a SQL wrapper around InnoDB & called it MySQL XD
        
           | 10000truths wrote:
           | They should sell that to a large tech corp, I bet they'd make
           | a killing.
        
         | richardwhiuk wrote:
         | sqlite and json1 gets you pretty close to emulating a more
         | powerful nosql database.
        
           | cryptonector wrote:
           | I'm aware. My comment was a joke, but evidently it's been
           | well-received.
        
       ___________________________________________________________________
       (page generated 2021-04-05 23:01 UTC)