[HN Gopher] Git as a NoSql Database (2016)
___________________________________________________________________
Git as a NoSql Database (2016)
Author : EntICOnc
Score : 73 points
Date : 2021-04-05 19:55 UTC (3 hours ago)
(HTM) web link (www.kenneth-truyers.net)
(TXT) w3m dump (www.kenneth-truyers.net)
| math-dev wrote:
| Thank you, was a nice read (particularly the conclusion, which
| was insightful and made sense)
| ioquatix wrote:
| For the lolz: https://github.com/ioquatix/relaxo
| pokstad wrote:
| Looks like it is inspired by CouchDB
| jaredcwhite wrote:
| No lolz here! I actually need _exactly_ this as I'll soon be
| starting work on a CMS-style editor for Git-based static sites,
| and it's all Ruby to boot. Awesome sauce!
| 0xbadcafebee wrote:
| lolz!
| lisper wrote:
| > The reason why it says stupid in the man-pages is that it makes
| no assumptions about what content you store in it.
|
| That's not true. The assumption that you are storing text files
| is very much built in to the design of git, and specifically,
| into the design of its diff and merge algorithm. Those algorithms
| treat line breaks as privileged markers of the structure of the
| underlying data. This is the reason git does not play well with
| large binary files.
| mediocregopher wrote:
| The diffing/merging stuff that you see when you use the git
| tool is really just some sugar on top of the underlying data
| which is being stored. If you look at what a commit actually
| contains, it's just a hash of the tree root, one or more parent
| commits identified by their hash, and some other metadata
| (author, date, etc). There's nothing about the commit that
| cares about the contents of its tree or its parents' trees.
| It's the git tooling on top of those structures which does.
|
| In git world this distinction is referred to as the "porcelain"
| vs "plumbing", the plumbing being the underlying structures and
| porcelain being the stuff most people actually use (diffs,
| merges, rebases, etc...)
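
To make the plumbing view concrete, here is a minimal Python sketch of how git names an object and what a commit object contains. It assumes the classic SHA-1 object format; the helper names (`git_object_id`, `commit_body`) and the author string are made up for illustration and are not git APIs.

```python
import hashlib
from typing import Sequence


def git_object_id(kind: str, body: bytes) -> str:
    """Hash an object the way git does: SHA-1 over '<kind> <size>', a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree_id: str, parent_ids: Sequence[str], author: str, message: str) -> bytes:
    """Build the text of a commit: one tree hash, zero or more parent hashes, metadata, message."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {author}", f"committer {author}", "", message]
    return ("\n".join(lines) + "\n").encode()


# A blob is just bytes; nothing here looks at line breaks or file types.
empty_blob_id = git_object_id("blob", b"")
print(empty_blob_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# A root commit pointing at the empty tree (hypothetical author and timestamp).
empty_tree_id = git_object_id("tree", b"")
root = commit_body(empty_tree_id, [], "Example <ex@example.com> 0 +0000", "initial")
print(git_object_id("commit", root))
```

Nothing in the commit text refers to the contents of the tree; diffing and merging are computed later by the tooling from these hashes.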
| lisper wrote:
| Yes, I get that. But:
|
| > one or more parent commits identified by their hash
|
| The reason that matters is because you need that information
| to do merges. So merges are integral to git. They are not
| "porcelain". They are, arguably, the whole point.
| bachmeier wrote:
| > You can query by key ... and that's about it. The only piece of
| good news here is that you can structure your data in folders in
| such a way that you can easily get content by prefix, but that's
| about it. Any other query is off limits, unless you want to do a
| full recursive search. The only option here is to build indices
| specifically for querying. You can do this on a scheduled basis
| if staleness is of no concern or you can use git hooks to update
| indices as soon as a commit happens.
|
| Isn't the point of a database the ability to query? Why else
| would you want "Git as a NoSql database?" If this is really what
| you're after, maybe you should be using Fossil, for which the
| repo is an sqlite database that you can query like any sqlite
| database.
| js2 wrote:
| The point of a database is to hold data, and there are many
| key/value databases for which you can only query by key unless
| you add a secondary index. (e.g. Bigtable, DynamoDB, Riak,
| Redis, etc).
| MuffinFlavored wrote:
| > Isn't the point of a database the ability to query?
|
| I was thinking about this in terms of the new CloudFlare
| Durable Objects open beta:
| https://blog.cloudflare.com/durable-objects-open-beta/
| Storing data is really only half the battle.
|
| How could you reimplement the poorest of a poor man's SQL on
| top of a key/value store? The idea I came up with is: whatever
| fields you want to query by need an index as a separate key.
| Maybe this would allow JOINs? Probably not?
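
The "index as a separate key" idea can be sketched in a few lines of Python. This is only an illustration: a plain dict stands in for the key/value store, and the key layout (`user/<id>`, `idx/<field>/<value>/<id>`) and the names `put_user` and `find_by` are assumptions, not any real product's API. Values containing `/` would need escaping in a real design.

```python
import json

INDEXED_FIELDS = ("name", "team")  # hypothetical: the fields we want to query by
store = {}                         # stand-in for a key/value store with prefix scans


def _index_key(field, value, user_id):
    return f"idx/{field}/{value}/{user_id}"


def put_user(user_id, record):
    """Write the record under its primary key plus one index key per queryable field."""
    primary = f"user/{user_id}"
    if primary in store:
        # Remove index entries that pointed at the old field values.
        old = json.loads(store[primary])
        for field in INDEXED_FIELDS:
            store.pop(_index_key(field, old.get(field), user_id), None)
    store[primary] = json.dumps(record)
    for field in INDEXED_FIELDS:
        store[_index_key(field, record.get(field), user_id)] = ""


def find_by(field, value):
    """Prefix-scan the index keys, then fetch each matching record by primary key."""
    prefix = f"idx/{field}/{value}/"
    ids = sorted(k[len(prefix):] for k in store if k.startswith(prefix))
    return [json.loads(store[f"user/{i}"]) for i in ids]


put_user("1", {"name": "sally", "team": "db"})
put_user("2", {"name": "billy", "team": "db"})
put_user("1", {"name": "sally", "team": "ops"})  # update moves the index entry
print([u["name"] for u in find_by("team", "db")])
```

A JOIN under this scheme would presumably be an index lookup followed by one primary-key read per row, all done in application code, which is roughly what the question is getting at.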
| 0xbadcafebee wrote:
| How about StackExchangeQL? Write your data as comments on
| StackExchange questions. To update the data, reply to your
| comment. It's like NoSQL because somebody else is hosting it for
| you and there's no schema.
| simonw wrote:
| In almost every database-backed application I've ever built
| someone, at some point, inevitably asks for the ability to see
| what changes were made when and by whom.
|
| My current preferred strategy for dealing with this (at least for
| any table smaller than a few GBs) is to dump the entire table
| contents to a git repository on a schedule.
|
| I've run this for a few toy projects and it seems to work really
| well. I'm ready to try it with something production-scale the
| next time the opportunity presents itself.
| lisper wrote:
| > My current preferred strategy for dealing with this (at least
| for any table smaller than a few GBs) is to dump the entire
| table contents to a git repository on a schedule.
|
| How does that solve the problem? All this would do is give you
| snapshots of the state of the DB at particular points in time.
|
| The only way to know who changed what when is to keep track of
| those changes as they are made, and the best way to do that is
| to stick the changes in an eventlog table.
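
For comparison with periodic snapshots, here is a minimal sketch of the eventlog approach using Python's built-in sqlite3 module. The table layout, column names, and the `update_column` helper are hypothetical; the point is only that each change is recorded, with its author, in the same transaction as the change itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
CREATE TABLE eventlog (
    seq         INTEGER PRIMARY KEY AUTOINCREMENT,
    changed_at  TEXT DEFAULT CURRENT_TIMESTAMP,
    changed_by  TEXT NOT NULL,
    record_id   INTEGER NOT NULL,
    column_name TEXT NOT NULL,
    old_value   TEXT,
    new_value   TEXT
);
INSERT INTO records (id, col1, col2) VALUES (1, 'a', 'b');
""")

ALLOWED_COLUMNS = {"col1", "col2"}  # whitelist, since column names are interpolated below


def update_column(user, record_id, column, new_value):
    """Apply one change and log who made it, atomically."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            f"SELECT {column} FROM records WHERE id = ?", (record_id,)
        ).fetchone()
        if row is None:
            raise KeyError(record_id)
        conn.execute(
            f"UPDATE records SET {column} = ? WHERE id = ?", (new_value, record_id)
        )
        conn.execute(
            "INSERT INTO eventlog (changed_by, record_id, column_name, old_value, new_value)"
            " VALUES (?, ?, ?, ?, ?)",
            (user, record_id, column, row[0], new_value),
        )


def who_changed(record_id, column):
    """Return the users who changed a given column of a record, oldest first."""
    rows = conn.execute(
        "SELECT changed_by FROM eventlog WHERE record_id = ? AND column_name = ? ORDER BY seq",
        (record_id, column),
    )
    return [r[0] for r in rows]


update_column("sally", 1, "col1", "a2")
update_column("billy", 1, "col2", "b2")
print(who_changed(1, "col1"))
```

Two changes made between snapshots stay attributable to the right people, which a scheduled dump cannot guarantee.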
| acidbaseextract wrote:
| Completely agreed on the inevitability of that ask.
|
| At risk of extreme overkill the other way, something like
| Debezium [1] doing change monitoring dumping into S3 might be a
| viable industrial strength approach. I haven't used it in
| production, but have been looking for appropriate time to try
| it.
|
| [1] https://debezium.io/
| irrational wrote:
| So you commit to Git.
|
| Sally makes a change to column 1 of record 1.
|
| Billy makes a change to column 2 of record 1 a nanosecond
| later.
|
| You commit to Git again.
|
| Your boss wants to know who changed column 1 of record 1.
|
| You report it was Billy.
|
| Billy is fired.
| sedeki wrote:
| So each commit is exactly one change to a relevant table?
|
| I guess you can't just plug this into a common system of
| rotating logs easily, as there might be several changes between
| the log rotation.
|
| Also, I guess you'd need a user friendlier interface to
| actually display who made the change from the git repo.
|
| Anyway, interesting solution.
| Spivak wrote:
| I don't know the name of this type of learning style but I would
| eat up a tutorial on Git that started with the plumbing commands
| like this and worked their way up. I think this is oddly one of
| the clearest explanations of Git I've ever read. I know that
| wasn't really the point of the post but still.
| bspammer wrote:
| There's nothing like writing it yourself for understanding
| something. This tutorial goes through reimplementing the basics
| of git in a few lines of Python, it's surprisingly simple and
| doable in an afternoon: https://wyag.thb.lt/
| bacon_waffle wrote:
| This doesn't start with the plumbing commands, but it's along
| those lines and seemed helpful to some new-to-git folks at work
| when we started switching to it last year:
| https://eagain.net/articles/git-for-computer-scientists/
|
| If you're already comfortable with git, the Git Internals
| section of the book may be better:
| https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
| weaksauce wrote:
| many people have done it but here's another one
| https://jwiegley.github.io/git-from-the-bottom-up/
| finnthehuman wrote:
| Write yourself a git https://wyag.thb.lt/
| chmaynard wrote:
| This article/video by Mary Rose Cook is helpful:
|
| https://maryrosecook.com/blog/post/git-from-the-inside-out
| cryptonector wrote:
| My attempt:
| https://gist.github.com/nicowilliams/a6e5c9131767364ce2f4b39...
| skybrian wrote:
| I haven't used it myself, but based on their blog, Dolt looks
| like a nicer way to share tables of data in a git-like way, since
| you get both branches and merges and SQL support.
|
| https://www.dolthub.com/
| chubot wrote:
| I just started using git annex, which is an extension for large
| files.
|
| https://git-annex.branchable.com/
|
| I like it a lot so far, and I think it could be used for "cloud"
| stuff, not just backups.
|
| I'd like to see Debian/Docker/PyPI/npm repositories in git annex,
| etc.
|
| It has lazy checkouts which fits that use case. By default you
| just sync the metadata with 'git annex sync', and then you can
| get content with 'git annex get FILE' or 'git annex sync
| --content'.
|
| Can anyone see a reason why not? Those kinds of repositories all
| seem to have weird custom protocols. I'd rather just sync the
| metadata and do the query/package resolution locally. It might be
| a bit bigger that way, but you only have to fully sync it once, and
| the rest are incremental.
| GordonS wrote:
| I'm a little confused about what git annex is - I think it's
| perhaps a basic file sync utility that uses git behind the
| scenes?
| kzrdude wrote:
| In one way it's a very distributed/decentralized alternative
| to git-lfs.
| rakoo wrote:
| The website (https://git-annex.branchable.com/) has many
| details, including scenarios to explain why it can be useful.
| git-annex is not so much a backup/sync utility, it's more a
| tool to track your files if they exist on multiple
| repositories. Instead of having remote storages holding files
| that happen to be the same, with git-annex the relationship
| is inverted: you have files, and each one can be stored on
| multiple storages. You can follow where they are, push
| them/get them from any storage that has it, remove them from
| one place if it's short on free space knowing that other
| storages still have it...
|
| There was a project of backing up the Internet Archive by
| using git-annex
| (https://wiki.archiveteam.org/index.php?title=INTERNETARCHIVE...).
| Basically the source project would
| create repositories of files and users like you and I would
| be remote repositories; we would get content and claim that
| we have it, so that everyone would know this repository has a
| valid copy on our server.
| chubot wrote:
| Yeah it solves the problem of putting big binaries that don't
| compress well inside git.
|
| If you've ever tried that, then git will start to choke
| around a few gigabytes (I think it's the packing/diffing
| algorithms). Github recommends that you keep repos less than
| 1 GB and definitely less than 5 GB, and they probably have a
| hard limit.
|
| So what git annex does is simply store symlinks to big files
| inside .git/annex, and then it has algorithms for managing
| and syncing the big files. I don't love symlinks and neither
| does the author, but it seems to work fine. I just do ls -L
| -l instead of ls -l to follow the symlinks.
|
| I think package repos are something like 300 GB, which should
| be easily manageable by git annex. And again you don't have
| to check out everything eagerly. I'm also pretty certain that
| git annex could support 3TB or 30TB repos if the file system
| has enough space.
|
| For container images, I think you could simply store layers
| as files which will save some space for many versions.
|
| There's also git LFS, which github supports, but git annex
| seems more truly distributed, which I like.
| warp wrote:
| With git-annex the size of the repo isn't as important,
| it's more about how many files you have stored in it.
|
| I find git-annex to become a bit unwieldy at around 20k to
| 30k files, at least on modest hardware like a Raspberry Pi
| or a core i3.
|
| (This hasn't been a problem for my use case, I've just
| split things up into a couple of annex repos)
| jpeloquin wrote:
| Git annex is pretty flexible, more of a framework for storing
| large files in git than a basic sync utility. ("Large"
| meaning larger than you'd want to directly commit to git.) If
| you're running Git Annex Assistant, it does pretty much work
| as basic file sync of a directory. But you can also use it
| with normal git commits, like you would Git LFS. Or as a file
| repository as chubot suggested. The flexibility makes it a
| little difficult to get started.
|
| The basic idea is that each file targeted by `git annex add`
| gets replaced by a symlink pointing to its content. The
| content is managed by git annex and lives as a checksum-
| addressable blob in .git/annex. The symlink is staged in git
| to be committed and tracked by the usual git mechanisms. Git
| annex keeps a log of which host has (had) which file in a
| branch named "git-annex". (There is an alternate non-symlink
| mechanism for Windows that I don't use and know little
| about.)
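
The symlink idea described above can be sketched in Python. This is not git-annex's actual on-disk layout or key format: the `.store` directory, the use of a bare SHA-256 hex digest as the blob name, and the `annex_add` helper are all assumptions made for illustration. It needs a platform with symlink support.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def annex_add(repo: Path, relpath: str) -> str:
    """Move a file's content into a checksum-named blob and leave a symlink in its place."""
    path = repo / relpath
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    store = repo / ".store"  # hypothetical; git-annex keeps its content under .git/annex
    store.mkdir(exist_ok=True)
    blob = store / digest
    if blob.exists():
        path.unlink()        # same content already stored: drop the duplicate
    else:
        path.replace(blob)   # first copy of this content: move it into the store
    # A relative link keeps working if the whole repository directory is moved.
    path.symlink_to(os.path.relpath(blob, path.parent))
    return digest


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "image.raw").write_bytes(b"pretend this is a large microscope image")
    key = annex_add(repo, "image.raw")
    link = repo / "image.raw"
    print(link.is_symlink(), os.readlink(link))
```

Only the small symlink would be committed to version control; the blob itself can be copied to, or dropped from, any number of storage locations and looked up again by its checksum.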
|
| I use git annex in the git LFS-like fashion to store
| experimental data (microscope images, etc.) in the same
| repository as the code used to analyze it. The main downside
| is that you have to remember to sync (push) the git annex
| branch _and_ copy the annexed content, as well as pushing
| your main branch. It can take a very long time to sync
| content when the other repository is not guaranteed to have
| all the content it's supposed to have, since in that scenario
| the existence and checksum of each annexed file has to be
| checked. (You can skip this check if you're feeling lucky.)
| Also, because partial content syncs are allowed, you do need
| to run `git annex fsck` periodically and pay attention to the
| number of verified file copies across repos.
| zachmu wrote:
| Or for a SQL database with Git versioning semantics:
|
| https://github.com/dolthub/dolt
| simon_acca wrote:
| Interesting exercise!
|
| Some pointers to related ideas:
|
| https://en.m.wikipedia.org/wiki/Persistent_data_structure -
| https://docs.datomic.com/cloud/index.html - https://opencrux.com/
| - https://github.com/attic-labs/noms -
| https://researcher.watson.ibm.com/researcher/files/us-leejin...
| fiddlerwoaroof wrote:
| Tangentially I started a re-implementation of git in Common
| Lisp[1], and have completed parsers for most of the file formats
| except delta encoded objects.
|
| Does anyone happen to know of an implementation or tests for
| delta encoding I could consult that is available under an MIT-
| like license? (BSD, Apache v2, etc.)
|
| [1]: https://github.com/fiddlerwoaroof/cl-git
| ori_b wrote:
| I just used git-fsck, and my implementation is here (both read
| and write, hosted using git9 on 9front):
|
| http://shithub.us/ori/git9/724c516a6eda0063439457a6701ef0d7e...
|
| http://shithub.us/ori/git9/724c516a6eda0063439457a6701ef0d7e...
|
| as far as I'm aware, the OpenBSD implementation of git, Game of
| Trees, is adapting this code too.
| fiddlerwoaroof wrote:
| Thanks, delta encoding stumped me a bit because the relevant
| git documentation was a bit ambiguous.
|
| Maybe I'll implement this tonight.
| hungryhobo wrote:
| Most noSQL can guarantee at least two of consistency,
| availability and partition tolerance, this seems to drop the ball
| on all 3.
| ses1984 wrote:
| I can't tell if this is sarcasm.
| hungryhobo wrote:
| In what ways does git implement CAP better than a NoSQL
| database?
| chmaynard wrote:
| Fascinating article. Looks like it was posted to HN at least six
| times before today and got no traction. In my browser, the
| examples are sometimes a concatenation of the command and the
| output. Does anyone else see this?
| escanor wrote:
| yes, i see this as well. at first it was confusing, but later i
| understood it was the output.
| cschulee wrote:
| "Your scientists were so preoccupied with whether or not they
| could, they didn't stop to think if they should."
| cryptonector wrote:
| I should write a SQL as a NoSQL database blog post, showing a
| trivial one-table schema that does all you could ever want in a
| NoSQL, especially after choosing to ignore that it still has SQL.
| munk-a wrote:
| This may or may not be the case anymore[1] - but for a long
| while Postgres managed to outperform MongoDB when restricted
| to just using single JSON column tables.
|
| One of the larger relational DBs beating the OG NoSQL will
| always fill me with amusement.
|
| 1. Ed. looks like this is still the case, but please remember
| this is MongoDB specifically. AFAICT MongoDB is mostly a dead
| branch at this point and a comparison against Redis or
| something else that's had continued investment might be more
| fair. I'm just happy chilling in my SQL world - I'm not super
| up-to-date on the NoSQL market.
| Cir0X wrote:
| I can't find any sources for this. Can you provide one?
| munk-a wrote:
| Here's one (read the article in detail, apparently this is
| up to pretty recently...)
| https://www.enterprisedb.com/news/new-benchmarks-show-postgr...
|
| Also, here's SiSense sensationalizing it for easy digest:
| https://www.sisense.com/blog/postgres-vs-mongodb-for-storing...
| Lorin wrote:
| check out RedBeanPHP - it does some interesting stuff for
| schema liquidity
| devoutsalsa wrote:
| Someone wrote a SQL wrapper around InnoDB & called it MySQL XD
| 10000truths wrote:
| They should sell that to a large tech corp, I bet they'd make
| a killing.
| richardwhiuk wrote:
| sqlite and its JSON1 extension get you pretty close to emulating
| a more powerful nosql database.
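
A minimal sketch of that approach with Python's sqlite3 module, assuming the bundled SQLite was built with the JSON functions (true of most current builds, but not guaranteed). The `docs` table and the sample documents are made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# One table, one JSON column: the "document store".
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

docs = [
    {"kind": "post", "title": "Git as a database", "tags": ["git"]},
    {"kind": "comment", "title": "lolz"},
    {"kind": "post", "title": "SQL as NoSQL", "tags": ["sql", "joke"]},
]
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)", [(json.dumps(d),) for d in docs]
)

# An expression index lets queries on a JSON field avoid a full table scan.
conn.execute("CREATE INDEX docs_kind ON docs (json_extract(body, '$.kind'))")

rows = conn.execute(
    "SELECT json_extract(body, '$.title') FROM docs"
    " WHERE json_extract(body, '$.kind') = ? ORDER BY id",
    ("post",),
).fetchall()
titles = [r[0] for r in rows]
print(titles)
```

Documents with different shapes sit in the same table, and any field can be queried or indexed later without a schema migration.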
| cryptonector wrote:
| I'm aware. My comment was a joke, but evidently it's been
| well-received.
___________________________________________________________________
(page generated 2021-04-05 23:01 UTC)