[HN Gopher] Dolt is Git for Data: a SQL database that you can fo...
___________________________________________________________________
Dolt is Git for Data: a SQL database that you can fork, clone,
branch, merge
Author : crazypython
Score : 144 points
Date : 2021-03-06 21:15 UTC (1 hour ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| twobitshifter wrote:
| This is cool, but the parent dolthub project is even cooler!
| Dolthub.com
| laurent92 wrote:
| Free for public repos!
| zachmu wrote:
| Also private repos under a gig, but you have to give us a
| credit card to be private.
| justincormack wrote:
| I collected all the git for data open source projects I could
| find a few months back; there have been a bunch of interesting
| approaches:
| https://docs.google.com/spreadsheets/d/1jGQY_wjj7dYVne6toyzm...
| chub500 wrote:
| I've had a fairly long-term side project working on git for
| chronological data (data is a cause and effect DAG), know of
| anybody doing that?
| michaelmure wrote:
| It might not be exactly what you are looking for, but
| git-bug[1] encodes data into regular git objects, with merges
| and conflict resolution. I'm mentioning this because the hard
| part is providing an ordering of events. Once you have that,
| you can store and recreate whatever state you want.
|
| This branch[2], which I'm almost done with, removes the purely
| linear branch constraint and allows full DAGs (that is,
| concurrent editing) while still providing a good ordering.
|
| [1]: https://github.com/MichaelMure/git-bug
| [2]: https://github.com/MichaelMure/git-bug/pull/532
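|
| A minimal sketch of the "ordering of events" idea, in Python. This
| is not git-bug's actual algorithm: the operation ids, the Lamport
| timestamps and the (timestamp, id) tiebreak are all assumptions
| made for illustration. It shows one way to turn a DAG of
| concurrent edits into a single order that every replica agrees on.

```python
import heapq


def linearize(ops):
    """Return op ids in a deterministic order that respects the DAG.

    ops maps an op id to (lamport_time, list_of_parent_ids). This layout
    is hypothetical and chosen only for illustration.
    """
    pending = {op: len(parents) for op, (_, parents) in ops.items()}
    children = {op: [] for op in ops}
    for op, (_, parents) in ops.items():
        for parent in parents:
            children[parent].append(op)

    # Ops with no unprocessed parents are ready. The heap breaks ties
    # between concurrent ops by (lamport_time, id), so the result does
    # not depend on the order in which ops were received.
    ready = [(ops[op][0], op) for op, count in pending.items() if count == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, op = heapq.heappop(ready)
        order.append(op)
        for child in children[op]:
            pending[child] -= 1
            if pending[child] == 0:
                heapq.heappush(ready, (ops[child][0], child))
    return order


# Two concurrent edits ("b" and "c") on top of "a", later merged by "d".
history = {
    "a": (1, []),
    "b": (2, ["a"]),
    "c": (2, ["a"]),
    "d": (3, ["b", "c"]),
}
print(linearize(history))
```

| Once such an order exists, replaying the operations in that order
| rebuilds the state, which is the point made in the comment above.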
| glogla wrote:
| This one seems to be missing: https://projectnessie.org/
| bitslayer wrote:
| Is it for versions of the database design or versions of the
| data?
| skybrian wrote:
| Both. Schema changes are versioned like everything else. But
| depending on what the change is, it might make merges
| difficult.
|
| (I haven't used it; I just read the blog.)
| [deleted]
| kyrieeschaton wrote:
| People interested in this approach should compare Rich Hickey's
| Datomic.
| einpoklum wrote:
| But do you really need this functionality, if you already have an
| SQL database?
|
| That is, you can:
|
| 1. Create a table with an extra changeset id column and a branch
| id column, so that you can keep historical values.
|
| 2. Have a view on that table with the latest version of each
| record on the master branch.
|
| 3. Express branching-related actions as actions on the main table
| with different record versions and branch names.
|
| 4. For the chocolate sprinkles, have tables with changeset info
| and branch info
|
| and that gives you a poor man's git already - doesn't it?
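|
| The four steps above can be sketched with Python's built-in
| sqlite3 module. The table, column and branch names here
| (item_history, changeset, "experiment") are made up for the
| example, and this is only a rough sketch of the idea, not how Dolt
| works internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Steps 1 and 4: every write appends a row tagged with its branch
    -- and changeset, and changesets get their own table.
    CREATE TABLE changeset (id INTEGER PRIMARY KEY, branch TEXT, message TEXT);
    CREATE TABLE item_history (
        item_id      INTEGER NOT NULL,
        name         TEXT,
        branch       TEXT NOT NULL,
        changeset_id INTEGER NOT NULL REFERENCES changeset(id)
    );
    -- Step 2: the latest version of each record on the master branch.
    CREATE VIEW item AS
        SELECT h.item_id, h.name
        FROM item_history AS h
        WHERE h.branch = 'master'
          AND h.changeset_id = (
              SELECT MAX(changeset_id) FROM item_history
              WHERE branch = 'master' AND item_id = h.item_id
          );
""")


def commit(branch, message, rows):
    """Step 3: record one changeset and append the row versions it contains."""
    cur = conn.execute(
        "INSERT INTO changeset (branch, message) VALUES (?, ?)", (branch, message)
    )
    conn.executemany(
        "INSERT INTO item_history VALUES (?, ?, ?, ?)",
        [(item_id, name, branch, cur.lastrowid) for item_id, name in rows],
    )


commit("master", "initial import", [(1, "apple"), (2, "pear")])
commit("experiment", "try a rename", [(1, "green apple")])  # not visible on master
commit("master", "fix typo", [(2, "pears")])

print(conn.execute("SELECT * FROM item ORDER BY item_id").fetchall())
```

| The sketch leaves out deletes (which would need tombstone rows),
| reading a non-master branch, and merging with conflict detection.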
| andrewmcwatters wrote:
| Reminds me a bit of datahub.io, but potentially more useful.
| pizzabearman wrote:
| Is this MySQL only?
| zachmu wrote:
| It uses the MySQL SQL dialect for queries. But it's its own
| database.
| joshspankit wrote:
| I never understood why we don't have SQL databases that track all
| changes in a "third dimension" (column being one dimension, row
| being the second dimension).
|
| It might be a bit slower to write, but hook the logic into
| write/delete, and suddenly you can see _exactly_ when a field was
| changed to break everything. The right middleware and you could
| see the user, IP, and query that changed it (along with any other
| queries before or after).
| kenniskrag wrote:
| Because you can do that with AFTER UPDATE triggers, or
| server-side in software.
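|
| A small sketch of the trigger approach, using SQLite through
| Python's sqlite3 module. The account and account_audit tables are
| hypothetical. Capturing the user, IP or originating query, as the
| parent comment asks for, would need the application to supply
| them; this example records only old value, new value and time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE account_audit (
        account_id INTEGER,
        old_email  TEXT,
        new_email  TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Record every update of the email column as a new audit row.
    CREATE TRIGGER account_email_audit
    AFTER UPDATE OF email ON account
    BEGIN
        INSERT INTO account_audit (account_id, old_email, new_email)
        VALUES (OLD.id, OLD.email, NEW.email);
    END;
""")

conn.execute("INSERT INTO account (id, email) VALUES (1, 'a@example.com')")
conn.execute("UPDATE account SET email = 'b@example.com' WHERE id = 1")

audit = conn.execute(
    "SELECT account_id, old_email, new_email FROM account_audit"
).fetchall()
print(audit)
```

| This gives a history of changes, but no branching or merging.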
| iamwil wrote:
| which db does it use?
| zachmu wrote:
| It is a database. It implements the MySQL dialect and binary
| protocol, but it isn't MySQL. Totally separate storage engine
| and implementation.
| jrumbut wrote:
| It's amazing this isn't a standard feature. The database world
| seems to have focused on large, high volume, globally distributed
| databases. Presumably you wouldn't version clickstream or IoT
| sensor data.
|
| Features like this that are only feasible below a certain scale
| are underdeveloped and I think there's opportunity there.
| fiddlerwoaroof wrote:
| Datomic has some sort of zero-cost forking of the database:
| its "add-only" design makes this cheap.
| qbasic_forever wrote:
| Every DB engine used at scale has a concept of snapshots and
| backups. This just looks like someone making a git-like
| porcelain for the same kind of DB management constructs.
| 101008 wrote:
| Isn't the MySQL log journal* what you are looking for?
|
| * I don't remember the exact name, but I'm referring to the
| feature that is used to replicate actions if there was an error.
| strogonoff wrote:
| You can also just use Git for data!
|
| It's a bit slower, but smart use of partial/shallow clones can
| address performance degradation on large repositories over time.
| You just need to take care of the transformation between
| "physical" trees/blobs and "logical" objects in your dataset
| (which may not have a 1:1 mapping, as making the physical layer
| more granular reduces the likelihood of merge conflicts).
|
| In this regard (versioning data) I think Pijul is promising: it
| looks like they might introduce primitives for operating on
| changes in actual data structures rather than between lines in
| files, as with Git.
| teej wrote:
| The fact that I can use git for data if I carefully avoid all
| the footguns is exactly why I don't use git for data.
| pradn wrote:
| Git is too complicated. It's barely usable for daily tasks.
| Look at how many people have to Google for basic things like
| uncommitting a commit, or cleaning your local repo to mirror a
| remote one. Complexity is a liability. Mercurial has a nicer
| interface. And now I see the real simplicity of non-distributed
| source control systems. I have never actually needed to work in
| a distributed manner, just client-server. I have never sent a
| patch to another dev to patch into their local repo or whatnot.
| All this complexity seems like a solution chasing after a
| problem - at least for most developers. What works for Linux
| isn't necessary for most teams.
| yarg wrote:
| Merging is hard, but the rest can be done with copy-on-write
| cloning (or am I missing something?).
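|
| A toy sketch of copy-on-write branching, in Python. The Table
| class and its methods are invented for this example and say
| nothing about how Dolt stores data. A branch starts as an empty
| overlay on a frozen parent, so creating it copies no rows.

```python
class Table:
    """A keyed table whose branches share unchanged rows with their parent."""

    _DELETED = object()  # tombstone marking a row deleted on this branch

    def __init__(self, parent=None):
        self.parent = parent
        self.local = {}      # only the rows written on this branch
        self.frozen = False  # set once something has branched from us

    def branch(self):
        # O(1): no rows are copied. The parent becomes read-only so that
        # later writes cannot leak into existing branches.
        self.frozen = True
        return Table(parent=self)

    def _check_writable(self):
        if self.frozen:
            raise RuntimeError("snapshot is frozen; write to a branch instead")

    def put(self, key, row):
        self._check_writable()
        self.local[key] = row

    def delete(self, key):
        self._check_writable()
        self.local[key] = Table._DELETED

    def get(self, key):
        # Walk up the parent chain until some level knows about the key.
        node = self
        while node is not None:
            if key in node.local:
                row = node.local[key]
                return None if row is Table._DELETED else row
            node = node.parent
        return None


base = Table()
base.put(1, "alice")
base.put(2, "bob")

main = base.branch()
dev = base.branch()
dev.put(2, "robert")
dev.delete(1)

print(main.get(2), dev.get(2), dev.get(1))
```

| Branching and reading are covered by this; merging main and dev
| back together is the part that stays hard.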
| laurent92 wrote:
| WordPress would have benefited from this.
|
| What a lot of webmasters want is to test the site locally, then
| merge it back. A lot of people turned to Jekyll or Hugo for the
| very reason that it can be checked into git, and git is reliable.
| A static website can't get hacked, whereas anyone who has been
| burned by a WordPress security failure knows they'd prefer a
| static site.
|
| And even more: people would like to pass the new website from the
| designer to the customer to managers -- WordPress might not have
| needed to develop their approval workflows (permission schemes,
| draft/preview/publish) if they had had a forkable database.
| Klwohu wrote:
| Problematic name, could become a millstone on the neck of the
| developer far into the future.
| ademarre wrote:
| Agreed. I couldn't immediately see if it was "DOLT" or "do it",
| as in "just do it". It's the former.
| rapnie wrote:
| I was going back and forth between the two until seeing doLt
| in terminal font.
| zachmu wrote:
| This ambiguity in sans serif fonts has actually been pretty
| annoying. Especially since GitHub doesn't let you choose
| your font on readmes and stuff.
| TedDoesntTalk wrote:
| Already I would not use this project because of its name. I'm
| not offended by it, but I know others will be, and it will only
| be a matter of time before we have to replace it with something
| else. So why bother in the first place?
|
| I know the name is not DOLT but it is close enough to cause
| offense. Imagine the N-word with one typo. Would it still be
| offensive? Probably to some.
| maest wrote:
| It was most likely picked as an analogy to "git".
| gerdesj wrote:
| Dolt and git are closer to synonymous than analogous.
| zachmu wrote:
| This is correct. Specifically, to pay homage to git and how
| Linus named it.
| Ericson2314 wrote:
| What people usually miss about these things is that normal
| version control benefits hugely from content addressing and
| normal forms.
|
| The salient aspect of relational data is that it's cyclic. This
| makes content addressing unable to provide normal forms on its
| own (unless someone figures out how to Merkle cyclic graphs!), but
| the normal form can still be made in other ways.
|
| The first part is easy enough: store rows in some order.
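|
| A minimal sketch of that first part, in Python: hash a table so
| that the digest depends only on its contents, not on the order in
| which rows were inserted. The table_digest name and the choice of
| JSON plus SHA-256 are assumptions made for this example.

```python
import hashlib
import json


def table_digest(rows):
    """Hash a list of row dicts so that row order does not affect the result."""
    # Serialize each row canonically (sorted column names), then sort the
    # serialized rows. JSON escapes newlines inside strings, so a newline
    # is an unambiguous separator between rows.
    encoded = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in encoded:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


first = [{"a": 0, "b": "Alice"}, {"a": 1, "b": "Bob"}]
second = [{"b": "Bob", "a": 1}, {"b": "Alice", "a": 0}]
print(table_digest(first) == table_digest(second))
```

| This still treats the key values themselves as content, which is
| exactly what the second part of the comment wants to get rid of.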
|
| The second part is more interesting: making the choice of
| surrogate keys not matter (quotienting it away). Sorting table
| rows containing surrogate keys depending on the sorting of table
| rows makes for some interesting bags of constraints, for which
| there may be more than one fixed point.
|
| Example:
|
|     CREATE TABLE Foo (
|         a uuid PRIMARY KEY,
|         b text,
|         best_friend uuid REFERENCES Foo(a)
|     );
|
| DB 0:
|
|     0  Alice  0
|
| 1 reclusive Alice, best friends with herself. Just fine.
|
|     0  Alice  1
|     1  Alice  1
|
| 2 reclusive Alices, both best friends with the second one. The
| Alices are the same up to primary keys, but while primary keys
| are to be quotiented out, primary key equality isn't, so this is
| valid. And we have an asymmetry by which to sort.
|
|     0  Alice  1
|     1  Alice  0
|
| 2 reclusive Alices, each best friends with the other. The Alices
| are completely isomorphic, and one notion of normal forms would
| say this is exactly the same as DB 0: as if this is reclusive
| Alice in a fun house of mirrors.
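|
| A brute-force sketch of "quotienting away" surrogate keys, in
| Python. It tries every renaming of the keys to 0..n-1 and keeps the
| smallest sorted result, so two tables get the same form exactly
| when they differ only in key choice. The canonical_form name and
| the (key, name, best_friend) row layout are assumptions for this
| example, and the search is factorial in the number of rows, so it
| is only usable on tiny tables. It implements the key-renaming
| notion of normal form, not the "fun house of mirrors" one.

```python
from itertools import permutations


def canonical_form(rows):
    """Smallest relabeling of (key, name, best_friend_key) rows over all key renamings."""
    keys = [key for key, _, _ in rows]
    best = None
    for perm in permutations(range(len(keys))):
        rename = dict(zip(keys, perm))
        candidate = sorted((rename[k], name, rename[f]) for k, name, f in rows)
        if best is None or candidate < best:
            best = candidate
    return best


# Both Alices are best friends with the same one, under different keys.
one = [("x", "Alice", "y"), ("y", "Alice", "y")]
two = [("p", "Alice", "p"), ("q", "Alice", "p")]
print(canonical_form(one) == canonical_form(two))

# Mutual best friends versus a single self-befriending Alice.
mutual = [(0, "Alice", 1), (1, "Alice", 0)]
single = [(0, "Alice", 0)]
print(canonical_form(mutual) == canonical_form(single))
```

| Under this notion the mutual pair and the single Alice stay
| distinct; treating them as equal would need a coarser equivalence
| than key renaming.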
___________________________________________________________________
(page generated 2021-03-06 23:00 UTC)