[HN Gopher] Dolt is Git for Data
       ___________________________________________________________________
        
       Dolt is Git for Data
        
       Author : gjvc
       Score  : 292 points
       Date   : 2022-06-23 11:04 UTC (11 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | pdpi wrote:
       | Given how many databases already implement MVCC, some form of
       | versioning is already baked into DB systems. It seems obvious in
       | hindsight that this could be reified into its own feature instead
       | of a book-keeping detail. Cool to see a project actually tackle
       | that challenge.
        
       | adamw2k wrote:
       | Was going to say it reminded me of noms, only to realize it's a
       | fork...
       | 
       | https://github.com/attic-labs/noms
        
         | timsehn wrote:
         | Creator here...
         | 
          | We are a fork of Noms. Lots of incremental changes, and
          | we're in the process of a major overhaul of the storage
          | engine (the part we use Noms for) for performance as we
          | speak.
        
       | mriet wrote:
       | Dolt is git for _databases_.
       | 
       | Nessie is git for _data_.
        
       | magicalhippo wrote:
        | Somewhat unrelated but...
        | 
        |     docs $ dolt conflicts cat docs
        |     +-----+--------+----+----+
        |     |     |        | pk | c1 |
        |     +-----+--------+----+----+
        |     |     | base   | 1  | 1  |
        |     |  *  | ours   | 1  | 10 |
        |     |  *  | theirs | 1  | 0  |
        |     +-----+--------+----+----+
       | 
       | For some reason my brain just can't seem to remember what's
       | "ours" and what's "theirs" when resolving merge conflicts. Like
       | which one represents what was in the working copy and which one
       | represents the data I'm trying to merge. I'd much prefer absolute
       | terms rather than those relative ones.
        
         | zachmu wrote:
         | Yeah I'm kinda the same way. Ultimately we decided to copy
         | git's terminology for this rather than making users learn new
         | terms.
         | 
         | (ours is the working copy, theirs is the thing you're merging
         | in)
        
           | magicalhippo wrote:
           | Yeah I totally get keeping the nomenclature given Dolt tries
           | to be Git-like.
        
       | memorable wrote:
       | Seems interesting! Will keep an eye on this.
        
       | amelius wrote:
       | The nice thing about Git is that it is batteries-included.
       | 
        | I don't want to deal with databases and everything that comes
        | with them. Just give me something that works with the
        | filesystem directly.
        
       | MontyCarloHall wrote:
       | No mention of performance on their GitHub homepage. Anyone know
       | how this compares to a standard MySQL distribution?
        
         | jpnc wrote:
         | They say it's slower but are trying to close the gap [1]
         | 
         | [1] - https://docs.dolthub.com/sql-reference/benchmarks/latency
        
         | NickHoff wrote:
         | In this [1] blog post, they say it's "between 2 and 20 times
         | slower than MySQL (but committed to getting down to 4x
         | slower)".
         | 
         | [1]: https://www.dolthub.com/blog/2021-03-09-dolt-use-cases-in-
         | th...
        
       | chirau wrote:
       | They have a pretty interesting data bounty program as well.
       | 
       | https://www.dolthub.com/bounties
       | 
        | I participated in one a few months back. I had to stop midway
        | because work was calling, but it was a lot of fun. I know I
        | earned some dollars, lol, but probably too few to count for
        | the competition.
       | 
       | I am actually trying to convince my manager to put up a few
       | bounties later this year.
       | 
       | Their Discord support is also super impressive.
        
       | gniap wrote:
        | It's quite similar to https://qri.io in functionality, but
        | unfortunately Qri stopped providing their Cloud service
        | (https://qri.io/winding_down).
        
       | remram wrote:
       | What kind of merge actually happens, e.g. how does it deal with
       | conflicts? Does it merge at the SQL command level or at the
       | changed tuple level?
       | 
        | If I have
        | 
        |     name   job age
        |     remram dev 32
        | 
        | and I concurrently do
        | 
        |     UPDATE table SET age = age + 2 WHERE job = 'dev';
        |     UPDATE table SET job = 'mgr' WHERE age = 32;
        | 
        | Do I get a conflict? Do I get something consistent with the
        | two transactions serialized, e.g. dev 34 or mgr 32? Can I get
        | something that no ordering of transactions could have given
        | me, like mgr 34?
        
         | zachmu wrote:
         | Merge is cell-wise. If the two heads being merged edited
         | disjoint cells, there's no conflict, they merge together. If
         | they touched the same cell, or if one branch deleted a row
         | another added, then you get a conflict.
         | 
         | Merge conflicts are stored in a special system table so you can
         | decide what to do with them, kind of analogous to conflict
         | markers in a source file. More details here:
         | 
         | https://docs.dolthub.com/sql-reference/version-control/merge...
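          | 
          | For instance, resolving might look roughly like this (a
          | sketch; per-table conflict tables follow the
          | dolt_conflicts_<table> naming from those docs, using the
          | 'docs' table from the example upthread):
          | 
          |     -- inspect base / ours / theirs values per conflict
          |     SELECT * FROM dolt_conflicts_docs;
          |     -- keep our side for every conflicted row
          |     CALL DOLT_CONFLICTS_RESOLVE('--ours', 'docs');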
         | 
         | The situation you're talking about with two transactions isn't
         | a merge operation, it's just normal database transaction
         | isolation level stuff. Dolt supports REPEATABLE_READ right now,
         | with others coming in future releases. So in the example above,
         | whichever transaction committed last would fail and get rolled
         | back (assuming they touched the same rows).
        
           | remram wrote:
           | > The situation you're talking about with two transactions
           | isn't a merge operation, it's just normal database
           | transaction isolation level stuff
           | 
           | I mean, arguably. It's not like there is a standard
           | definition for "merge operation" on data. Even Git tries to
           | do more than line-level changes, taking context into account,
           | and turning one line change into two line changes if merging
           | across a file copy for example.
           | 
           | Dolt markets itself as "a version controlled SQL database",
           | so I think it is perfectly reasonable to consider the
           | standard that already exists for concurrent changes to a SQL
           | database, and that's transaction isolation.
           | 
            | I guess anything more complex than this would be pretty
            | unwieldy though, with probably little benefit. I am
            | struggling to come up with a good example of the kind of
            | anomaly I imagine.
        
             | zachmu wrote:
              | You're thinking too small. The transaction is generally
              | not the unit you want to apply version control to;
              | databases already have robust concurrency support at
              | that level.
             | 
             | What you want is to have branches on your data that have
             | many transactions applied to them, for days or weeks. Then
             | you merge the branch back to main when it's ready to ship.
             | 
             | An early customer built pull requests for their
             | configuration application using Dolt, read more here:
             | 
             | https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/
        
               | remram wrote:
               | Please don't tell me what I'm thinking or what I want to
               | do ;-)
               | 
                | The use case I had in mind is closer to a data.world
                | or dbhub: collaborative data cleaning (which often
                | includes bulk operations) rather than merging OLTP
                | databases after days or weeks of changes.
               | 
                | What I would use now is a script or Jupyter Notebook
                | checked into Git, where cells are strongly ordered
                | and if someone sends me a pull request (changing the
                | code) I have to re-run the notebook to obtain the
                | "merged dataset". I can't say that I have a use for
                | Dolt but it is definitely cool tech.
        
               | zachmu wrote:
               | If you're a data cleaning guy, you might be interested in
               | our data bounties program. It's collaborative data
               | importing / cleaning. The current one is building a
               | catalog of museum collections. We launch a new one every
               | month.
               | 
               | https://www.dolthub.com/blog/2022-06-14-announcing-
               | museums-b...
        
       | melvinroest wrote:
       | Shoutout to Smalltalk [1] for using doit [2] (do it) to execute
       | code! Name collisions happen, it's handy to know about them ;-)
       | 
       | [1] https://en.wikipedia.org/wiki/Smalltalk
       | 
        | [2] https://twitter.com/worrydream/status/1424952628207702022 <--
        | Bret Victor talking about doit
        
         | allknowingfrog wrote:
         | I think you mistook the "l" for an "i". This is "dolt", not
         | "doit".
        
           | melvinroest wrote:
           | dolt
           | 
           | DoIt
           | 
           | Ah... whoops, my bad
        
       | arealaccount wrote:
        | I've been looking for something like this where you can bring
        | your own DB. The ability to version schema and metadata is
        | much more interesting than a SaaS DB host.
        
       | [deleted]
        
       | Hakeemmidan wrote:
       | this looks really fun
        
       | pgt wrote:
       | XTDB is a general-purpose bitemporal database with graph query
       | capabilities: https://xtdb.com/
        
         | infogulch wrote:
         | There was a short talk about xtdb in the recent 'rubbing a
         | database on it' conference:
         | https://www.hytradboi.com/2022/baking-in-time-at-the-bottom-...
         | 
         | The talk expands on what "bitemporal" means in this context,
         | namely, separate "valid" and "audit" time dimensions. The
         | discussion upthread is a good example: if you accidentally mark
         | down inventory in the wrong location and you find out later,
         | you should backdate the fix to when you did the count -- this
         | is the 'valid' dimension, but the fact that you changed it
         | today should also be recorded so it can be reviewed later --
         | this is the 'audit' dimension.
        
       | anyfactor wrote:
       | I use dolthub's public database to practice queries.
       | 
        | I gave their bounty program a shot but I quit when I was
        | prompted to download/clone an 8 GB database. So the best
        | option would be to have a dolthub account and work on the
        | data from there.
       | 
       | I think if you do a little bit of git hacking you can create a
       | branch with just an empty or a dummy database based on the
       | schema, clone only that branch and make a PR to append to the
       | original database. But I couldn't figure that out.
        
         | ascar wrote:
          | Could you elaborate on what's so bad about 8 GB of test
          | data? This is a database project, after all, and many
          | behaviors only show up with a certain amount of data.
        
           | anyfactor wrote:
            | I tried downloading the data and my network connection
            | glitched at 6 GB; an error was thrown and the process
            | exited. As far as I could tell I had to restart the
            | entire download, as there was no way to resume where it
            | left off. I didn't try again.
           | 
            | I think there could be a better design for cloning or
            | downloading the database onto a local machine. I
            | understand people don't usually download databases to
            | their local machines, but I expected a better solution.
        
             | zachmu wrote:
             | Definitely. We have plans to allow clone to be a shallow
             | operation, so that the original data stays remote and only
             | your changes are local. Will make queries slower obviously,
             | but that's the right trade-off for some people.
        
       | alex-treebeard wrote:
       | great name
        
         | rwoerz wrote:
         | I misread it as "do it" and wonder if that was intended too.
        
           | obliquely wrote:
           | DOLT vs DO-IT:
           | https://www.folklore.org/StoryView.py?story=Do_It.txt
        
       | [deleted]
        
       | [deleted]
        
       | patrickdevivo wrote:
        | A very cool project they also maintain is a MySQL server
        | framework for arbitrary backends (in Go):
        | https://github.com/dolthub/go-mysql-server
       | 
       | You can define a "virtual" table (schema, how to retrieve
       | rows/columns) and then a MySQL client can connect and execute
       | arbitrary queries on your table (which could just be an API or
       | other source)
        
       | aratob wrote:
       | Previous discussions:
       | 
       | - Dolt is Git for Data: a SQL database that you can fork, clone,
       | branch, merge (2021) - 177 comments
       | https://news.ycombinator.com/item?id=26370572
       | 
       | - Dolt is Git for data (2020) - 191 comments
       | https://news.ycombinator.com/item?id=22731928
        
       | nailer wrote:
       | I thought dat was git for data. But it looks like it's dead now.
       | 
       | Docs site is still up: https://docs.dat.foundation/
        
       | cosmic_quanta wrote:
       | That looks awesome. One of the listed use-cases is 'time-travel':
       | https://dolthub.com/blog/2021-03-09-dolt-use-cases-in-the-wi...
       | 
        | I wish we could use this at work. We're trying to predict
        | time-series stuff. However, there's a lot of infrastructure
        | complexity whose sole purpose is to ensure that when we're
        | training on data from years ago, we're not using data that
        | would have been in the future from that point (future data
        | leaking into the past).
       | 
       | Using Dolt, as far as I understand it, we could simply set the DB
       | to a point in the past where the 'future' data wasn't available.
       | Very cool
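        | 
        | From their docs, it looks like a single query could replace
        | that infrastructure (a sketch using Dolt's AS OF syntax;
        | 'measurements' and the commit ref are hypothetical):
        | 
        |     -- read the table as it existed at a past commit,
        |     -- so no rows from after that point can leak in
        |     SELECT * FROM measurements AS OF 'commit-at-cutoff';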
        
         | wodenokoto wrote:
         | For at least your use-case, this is a fairly common feature in
         | enterprise databases.
         | 
         | For a lot of types of data, auditing requires history.
         | 
          | Have a look at MSSQL's history tables, for example:
         | https://docs.microsoft.com/en-us/sql/relational-databases/ta...
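          | 
          | With a system-versioned temporal table there, a point-in-
          | time read looks roughly like this (a sketch of the FOR
          | SYSTEM_TIME syntax; table name hypothetical):
          | 
          |     SELECT * FROM prices
          |     FOR SYSTEM_TIME AS OF '2020-01-01 00:00:00';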
        
         | yread wrote:
         | We use DataVault for that. And perhaps Databricks at some point
         | in the future
        
         | kortex wrote:
          | Have you looked at dvc (www.dvc.org)? It takes a little
          | while to figure out how you want to handle the backing
          | store (usually S3), but then it's very straightforward.
          | You could do a similar pattern: have a data repository and
          | simply move the git HEAD to the desired spot, and dvc
          | automatically adds/removes the data files based on what's
          | in the commit. You can even version binaries without
          | blowing up your .git tree.
        
           | nerdponx wrote:
           | DVC is _great_ for tracking locally-stored data and artifacts
           | generated in the course of a research project, and for
            | sharing those artifacts across a team of collaborators
            | (and/or future users).
           | 
           | However DVC is fundamentally limited because you can only
           | have dependencies and outputs that are files on the
           | filesystem. Theoretically they could start supporting
           | pluggable non-file-but-file-like artifacts, but for now it's
           | just a feature request and I don't know if it's on their
           | roadmap at all.
           | 
           | This is fine, of course, but it kind of sucks for when your
           | data is "big"-ish and you can't or don't want to keep it on
           | your local machine, e.g. generating intermediate datasets
           | that live in some kind of "scratch" workspace within your
           | data lake/warehouse. You can use DBT for that in some cases,
           | but that's not really what DBT is for and then you have _two_
            | incompatible workflow graphs within your project and a whole
           | other set of CLI touch points and program semantics to learn.
           | 
           | The universal solution is something like Airflow, but it's
           | way too verbose for use during a research project, and
           | running it is way too complicated. It's an industrial-
           | strength data engineering tool, not a research workflow-and-
           | artifact-tracking tool.
           | 
           | I think my ideal tool would be "DVC, but pluggable/extensible
           | with an Airflow-like API."
        
             | henrydark wrote:
              | I have dvc pipelines whose inputs/outputs are iceberg
              | snapshot files. The data gets medium-big and it works
              | well.
        
               | nerdponx wrote:
               | I never heard of Apache Iceberg before. I've used
               | Databricks Delta Lake; is it similar? What is a snapshot
               | file in this case?
        
               | henrydark wrote:
               | It's basically the same, I just went with iceberg because
               | the specification is a bit more transparent
        
               | nerdponx wrote:
               | Interesting. So the snapshot file acts much in the same
               | way as a manual "sentinel" file? I generally try to avoid
                | such things because they are brittle and it's easy to make a
               | mistake and get the "ad hoc database on your filesystem"
               | out of sync with the actual data.
        
           | isolli wrote:
           | I'm looking into DVC right now, and I feel like the code
           | history (in git) and the data history are too intertwined. If
           | you move the git HEAD back, then you get the old data back,
           | but you also get the old code back. I wish there was a way to
           | move the two "heads" independently. Or is there?
           | 
           | Edit: I can always revert the contents of the .dvc folder to
           | a previous commit, but I wonder if there's a more natural way
           | of doing it.
        
             | arjvik wrote:
              | If you want the dataset to be independent, I would
              | recommend having a separate repository for the dataset,
              | and using Git Submodules to pull it in. That way you can
              | checkout different versions of the dataset and code
              | because they are essentially in separate working trees.
        
             | george_ciobanu wrote:
              | Also check out Datomic.
        
         | lichtenberger wrote:
          | Basically, the research project[1] I'm working on in my
          | spare time is all about versioning and efficiently storing
          | small-sized revisions of the data, as well as allowing
          | sophisticated time travel queries for audits and analysis.
         | 
         | Of course all secondary user-defined, typed indexes are also
         | versioned.
         | 
          | Basically the technical idea is to map a huge tree of index
          | tries (with revisions as indexed leaf pages at the top
          | level, and a document index as well as secondary indexes on
          | the second level) to an append-only file. To reduce write
          | amplification and the size of each snapshot, data pages are
          | first compressed and then versioned through a sliding
          | snapshot algorithm. Thus, Sirix does not simply do a copy
          | on write per page. Instead it writes the nodes which have
          | been changed in the current revision, plus the nodes which
          | fall out of the sliding window (therefore it needs a fast
          | random-read drive).
         | 
         | [1] https://github.com/sirixdb/sirix
        
           | awmarthur wrote:
           | That sounds somewhat similar to Dolt's storage index
           | structure: Prolly Trees
           | https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-
           | tabl...
        
       | cgio wrote:
        | I have been interested in this space, but have failed to
        | understand how these versioning solutions for data work in
        | the context of environments. Some aspects of time travel line
        | up better with data modelling approaches (such as bitemporal
        | DBs, xtdb, etc.), others more with git-like use cases (e.g.
        | schema evolution, backfilling), and some with combinations of
        | the two. The challenge is that with data I don't see how
        | you'd want all environments in the same place/repo, and there
        | may be additional considerations coupled with the
        | directionality of moves, such as anonymisation when moving
        | from prod to non-prod, or backfilling when moving from non-
        | prod to prod. Keen to read more about other people's
        | experiences in this space and how they might be combining
        | different solutions.
        
         | nerdponx wrote:
         | These tools aren't really meant for developers. They are meant
         | for researchers, analysts, and other "offline" users and
         | managers of data sources. Data science research workflows
         | generally don't need the same "dev/test/prod" kind of
         | environment setup.
        
           | zachmu wrote:
           | I won't speak for other data versioning products, but Dolt is
           | definitely for developers. Our customers are software
           | engineers writing database applications that need version
           | control features.
        
         | ollien wrote:
          | I've only used Dolt once, but it was very helpful when I
          | was working on a group project in an ML class. Previously
          | we would commit gigantic CSVs to git, which sucked. Putting
          | it in Dolt made a lot of the data exploration and sharing
          | easier, and separated our data from the code.
        
       | dmd wrote:
       | See also: https://www.datalad.org/
        
       | henrydark wrote:
       | I get "git for data" by putting iceberg snapshot files _in_ git.
       | That's it. It's literally git for data.
        
       | throwaway787544 wrote:
       | Aw man, why did they have to saddle themselves with Oracle? :(
        
         | timsehn wrote:
         | Creator here...
         | 
          | You only need Oracle's clients, but we will eventually have
          | a Postgres foreign data wrapper for the storage engine. I
          | explain how we ended up with MySQL in this blog post, if
          | you're curious.
         | 
         | https://www.dolthub.com/blog/2022-03-28-have-postgres-want-d...
        
       | wonderwonder wrote:
        | Very cool. Would love to know the size of the team that built
        | this. I like learning about the people who create interesting
        | new products.
        
         | zachmu wrote:
         | https://www.dolthub.com/team
        
           | wonderwonder wrote:
           | lol, that would probably be the best place for me to have
           | looked. Appreciate it!
        
       | bjarneh wrote:
       | This looks super handy, probably worth the performance penalty in
       | most cases to make every command undo-able (except drop
       | database).
        
         | timsehn wrote:
         | We're trying to close the performance gap. Current performance
         | on sysbench can be found here:
         | 
         | https://docs.dolthub.com/sql-reference/benchmarks/latency
         | 
         | We're even slower on heavy transactional use cases.
         | 
          | We're dedicating the year to it and we think we can get
          | under 2X MySQL. We have customers who switched from NoSQL
          | DBs and actually got faster, so getting close to MySQL is
          | a great accomplishment.
        
       | iRomain wrote:
       | Also in the same vein, check out https://lakefs.io/
        
       | layer8 wrote:
       | Only MySQL-compatible, apparently?
        
         | sitkack wrote:
         | https://github.com/dolthub/go-mysql-server 170kloc of pure Go
        
         | timsehn wrote:
         | Creator here.
         | 
          | You only connect with a MySQL client; there is no MySQL
          | code in it. Here's the story, if you're curious.
         | 
         | https://www.dolthub.com/blog/2022-03-28-have-postgres-want-d...
         | 
         | We will eventually build a Postgres Foreign Data Wrapper.
        
       | bestouff wrote:
        | I don't find the name very nice: there's too much potential
        | for mixing it up with "doit" written with a capital "d" and a
        | capital "i".
        
         | LandStander wrote:
         | I blame font designers.
        
           | cal85 wrote:
           | The font designer designed a font, not a logo.
        
         | amelius wrote:
         | At first I thought it was d01t
        
         | lyso wrote:
         | https://www.folklore.org/StoryView.py?story=Do_It.txt
        
         | CPLX wrote:
         | I mean forget about mixing it up, what about the actual meaning
         | of the word:
         | 
          | dolt (noun) \ 'dolt \ : a stupid person
         | 
         | https://www.merriam-webster.com/dictionary/dolt
        
           | [deleted]
        
           | graedus wrote:
           | I thought this was an odd name too, but I'm guessing it's
           | intentional, as _git_ also is a derogatory word with a
           | similar meaning.
           | 
           | : a foolish or worthless person
           | 
           | https://www.merriam-webster.com/dictionary/git
        
             | timsehn wrote:
             | One of the creators here... Indeed. Intentional.
             | 
             | https://docs.dolthub.com/other/faq#why-is-it-called-dolt-
             | are...
        
           | [deleted]
        
           | ghkbrew wrote:
           | Are you aware of the meaning of the word "git"?
           | 
           | https://www.merriam-webster.com/dictionary/git
        
             | CPLX wrote:
             | Without ever thinking hard about it I always assumed git
             | came from American cowboy/western slang:
             | 
             | Specifically this:
             | https://en.wikipedia.org/wiki/Git_Along,_Little_Dogies
             | 
             | Every time I type it in I am visualizing my little code
             | changes as sickly orphaned calves being herded into the
             | main pen with all the mature cows currently working away at
             | their various functions.
             | 
             | Of course given the actual author the British/European
             | meaning makes a lot more sense.
        
           | bravogamma wrote:
           | https://www.folklore.org/StoryView.py?story=Do_It.txt
        
         | evanb wrote:
         | I'd have gone with digit.
        
         | moffkalast wrote:
         | Dew it.
         | 
         | Good, good. It was too dangerous to not keep a backup.
        
       | smm11 wrote:
       | Data? rsync
        
       | throwaway892238 wrote:
       | This is the future of databases, but nobody seems to realize it
       | yet.
       | 
       | One of the biggest problems with databases (particularly SQL
       | ones) is they're a giant pile of mutable state. The whole idea of
       | "migrations" exists because it is impossible to "just" revert any
       | arbitrary change to a database, diff changes automatically, merge
       | changes automatically. You need some kind of intelligent tool or
       | framework to generate DDL, DML, DCL, they have to be applied in
       | turn, something has to check if they've already been applied,
       | etc. And of course you can't roll back a change once it's been
       | applied, unless you create even more program logic to figure out
       | how to do that. It's all a big hack.
       | 
       | By treating a database as version-controlled, you can treat any
       | operation as immutable. Make any change you want and don't worry
       | about conflicts. You can always just go back to the last working
       | version, revert a specific change, merge in one or more changes
       | from different working databases. Make a thousand changes a day,
       | and when one breaks, revert it. No snapshotting and slowly
       | restoring the whole database due to a non-reversible change.
       | Somebody dropped the main table in prod? Just revert the drop.
       | Need to make a change to the prod database but the staging
       | database is different? Branch the prod database, make a change,
       | test it, merge back into prod.
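        | 
        | With Dolt's SQL procedures, that branch-test-merge flow could
        | be plain SQL (a sketch; procedure names per Dolt's docs,
        | branch/table names hypothetical):
        | 
        |     CALL DOLT_CHECKOUT('-b', 'staging-fix');
        |     UPDATE settings SET value = 'new' WHERE id = 42;
        |     CALL DOLT_COMMIT('-am', 'Tested change on a branch');
        |     CALL DOLT_CHECKOUT('main');
        |     CALL DOLT_MERGE('staging-fix');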
       | 
       | The effect is going to be as radical as the popularization of
       | containers. Whether you like them or not, they are
       | revolutionizing an industry and are a productivity force
       | multiplier.
        
         | LukeEF wrote:
         | This has to be imagined in the context of a post-SQL future.
         | Unless of course you are a 'SQL is the end of history' person!
        
         | jandrewrogers wrote:
         | This is how relational databases have commonly worked since at
         | least the 1990s and is called multi-version concurrency control
         | (MVCC). Welcome to the future, it is called PostgreSQL. There
         | are at least two reasons no sensible database designer would
         | allow users to operate a database in this way even though they
         | are technically capable of it.
         | 
         | First, keeping every version of every piece of data forever is
         | an excellent way to consume non-intuitively vast amounts of
         | storage even if your data model is tiny. Every time this
         | feature has been offered by databases, it immediately causes a
         | rash of "out of storage" errors that force the user to manually
         | and permanently delete large numbers of old versions. This is
         | extremely user-unfriendly, so the feature is almost immediately
         | removed in subsequent versions because the pain it causes far
         | outweighs the benefits even when used carefully. In typical
         | MVCC systems, old versions are aggressively garbage collected
         | automatically to limit out-of-storage errors.
         | 
         | Second, finding or reconstructing an arbitrary number of old
         | versions of data is unavoidably expensive. Much of the
         | architectural difference between various MVCC implementations
         | are trying to manage the rather severe performance tradeoffs of
         | maintaining multiple versions of data and navigating to the
         | version you need, with the understanding that all of these
         | versions live on storage and rarely in a single place. There is
         | no optimal way, and keeping version chains short is critical
         | for good performance.
         | 
         | There is very deep literature around MVCC-style databases. The
         | challenges of generalizing and maximally exploiting MVCC as a
         | user feature while having performance that is not poor to the
         | point of unusability are thoroughly documented.
        
           | zachmu wrote:
           | MVCC is not version control, and time travel / historical
           | querying is not version control.
           | 
           | Dolt's unique functionality isn't time travel, although it
           | has that. It's version control, i.e. branch and merge, push
           | and pull, fork and clone. A bunch of database products give
           | you some of this for schema migrations, but Dolt is the only
           | one that does it for table data as well.
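            | 
            | In SQL, those operations map onto stored procedures (a
            | sketch; remote and branch names are hypothetical):
            | 
            |     -- pull teammates' changes, push your branch
            |     CALL DOLT_PULL('origin');
            |     CALL DOLT_PUSH('origin', 'feature');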
        
             | jandrewrogers wrote:
             | The conceit here is the assumption that this has not been
             | built many times by very clever software engineers. It is
             | not a new idea. True git-like version control systems for
             | managing large volumes of data have been built on MVCC
              | kernels for _decades_ -- branch and merge, push and pull,
             | fork and clone.
             | 
             | There are fundamental computer science and technical issues
             | that make scaling these systems for arbitrary data models
             | extremely difficult. The platforms always had the ambition
             | to be general but the design tradeoffs required to make
             | them scale requires narrowly overfitting for a particular
             | type of data model such that they can _only_ be used for
             | the original use case. And even then, the performance ends
             | up being not good.
             | 
             | I've never designed one from scratch but I've worked on a
             | few at large companies. All of them started with the vision
             | you are proposing, all of them failed at achieving that
             | vision because of the technical tradeoffs required to
             | enable something resembling scalability. Unless you are
             | proposing some novel computer science that renders these
             | issues moot, you aren't presenting a credible defense that
             | this hasn't been done before.
        
               | zachmu wrote:
               | Git-like version control requires a Merkle DAG. Unless
               | you know something I don't, there are no RDBMS products
               | that incorporate a Merkle DAG for storage. Dolt is the
               | first.
               | 
               | Table data is stored in a cross between a Merkle DAG and
               | a B Tree (a prolly tree), which is what makes diff /
               | merge performant and scalable. We didn't invent these
               | data structures but we believe we are the first to build
               | a SQL database on them.
               | 
               | https://docs.dolthub.com/architecture/storage-
               | engine/prolly-...
        
               | jandrewrogers wrote:
               | > Git-like version control requires a Merkle DAG.
               | 
               | This is false, you are conflating the abstract algorithm
               | with a narrow implementation. That's like saying the only
               | possible sorting algorithm is quicksort.
               | 
               | With all due respect, you seem to be only loosely
               | familiar with database architecture, both theory and
               | historical practice. Nothing you've described is actually
               | novel. That you are unfamiliar with _why_ no one builds
               | things this way, despite many attempts, does not lend
               | confidence.
               | 
               | I am actually a big fan of people trying unorthodox
               | approaches to databases that have never been tried
               | before, this just isn't such an example. Which doesn't
               | make your approach wrong per se, but it leaves you
               | exposed to learning why other people tried and abandoned
               | it.
               | 
               | Tangentially, the "prolly tree" is intrinsically not a
               | scalable data structure. That may satisfy your design
               | requirements but I can't tell.
        
         | simlevesque wrote:
         | QLDB seems like something that goes in this direction. What's
          | your opinion of it?
        
           | zachmu wrote:
            | QLDB isn't version controlled; it's an immutable ledger
            | database. If that's what you're looking for, definitely
            | choose it; it's a great product.
           | 
           | More discussion of the immutable DB product space:
           | 
           | https://www.dolthub.com/blog/2022-03-21-immutable-database/
        
         | qaq wrote:
        | You do realize that's how MVCC works, right? Turn off the GC
        | process that collects old versions on a production DB and see
        | what happens. Reverting changes out of order is not possible
        | in many cases.
        
           | zachmu wrote:
           | Right, but it is with Dolt. That's the point.
           | 
           | CALL DOLT_REVERT('bad-commit-hash')
           | 
           | Works just like git revert: creates an inverse of the changes
           | in the commit given and applies it as a patch.
        
         | blowski wrote:
         | It looks incredible, but somehow seems too good to be true.
         | 
         | What are the tradeoffs here? When wouldn't I want to use this?
        
           | timsehn wrote:
           | Creator here.
           | 
            | It's slower. This is `sysbench`, Dolt vs MySQL:
           | 
           | https://docs.dolthub.com/sql-reference/benchmarks/latency
           | 
           | We've dedicated this year to performance with a storage
           | engine rewrite. We'll have some performance wins coming in
           | the back half of the year. We think we can get under 2X
           | MySQL.
           | 
            | It also requires more disk: each change takes, on
            | average, at least 4K on disk. So you might need
            | more/bigger hard drives.
        
             | kragen wrote:
             | (Disclaimer: I haven't tried Dolt.)
             | 
              | In your benchmark it's only 2.1-7.4 times slower than
              | MySQL, average 4.4x. And any database someone could fit
              | on a disk 20 years ago (I forget, maybe 8 GB?) fits in
              | RAM now, which makes it about 256 times faster, which
              | is a lot bigger than 4.4. And you can get a 20 TB disk
              | now, which is plenty of space. So anything that could
              | be done with MySQL 20 years ago can be done faster and
              | cheaper with Dolt now, which covers, I think the
              | technical term is, a fucking shitload of applications.
              | It probably includes _literally every 20th-century
              | application of relational databases_.
             | 
             | Well, except for things that have over 5 billion
             | transactions (20 TB / 4 kB/txn) over their lifetime, I
             | guess, so it might be important to find a way to compact
             | that 4K. 5 billion transactions is 19 months at 100 TPS. If
             | you could get that down to 256 bytes it would be almost 25
             | years of 100 TPS.
             | 
             | Also, as I understand it, and correct me if I'm wrong here,
             | that 4.4x slowdown buys you a bulletproof and highly
             | performant and scalable strategy for backups (with PITR),
             | staging servers, data warehousing, readslaves, disk error
             | detection and recovery, cryptographically secure audit
             | logs, bug reproduction, and backtesting. Along with the
             | _legal_ security the Apache 2 license gives you, which you
              | don't have with Datomic.
             | 
             | Sounds fantastic! It sounds like you're selling its
             | performance a bit short. If someone is really concerned
             | about such a small performance loss they probably aren't
             | really in the market for a new RDBMS.
        
             | EarthLaunch wrote:
             | Another commenter noted a need for migrations in order to
             | handle rollbacks without data loss.
        
         | hinkley wrote:
         | In a similar vein, I am trying to work on a tool for a
         | volunteer group, and one of the problems I'm trying to figure
         | out how to model is not just tracking when the data changed,
         | but why it changed.
         | 
         | Sometimes when you're first entering data you just get it
         | wrong. You said the wrench was in storeroom A1 or the customer
         | lives on 3rd St or the tree was five feet from the sidewalk. If
         | the state of the asset changes due to human action, that's a
         | new thing. But if you opened the wrong list taking inventory,
         | or you missed a keystroke for 33rd St or you just eyeballed the
         | coordinates, then that row was simply wrong, and the correction
         | should be back-dated for most things.
         | 
         | But if I emptied out A1 because we aren't renting that space
         | anymore, the customer moves or the tree got replanted, then it
         | was here and now it's over there. Which might be important for
         | figuring out things like overstock, taxes or success rates.
         | 
          | Similarly, if a migration introduces derived data, then the
          | migrated data is assumed/inferred, whereas if we took that
          | data from a user, that information is confirmed, which
          | might introduce subtle differences in how best to relate to
          | the user. These are things a mediocre business could easily
          | ignore, but that a quality establishment would want to get
          | right.
        
         | 411111111111111 wrote:
         | > _This is the future of databases, but nobody seems to realize
         | it yet_
         | 
         | It's a pipedream, not the future.
         | 
          | Either your database is too big / has too much throughput,
          | or migrations just don't matter. And it's not like you
          | wouldn't need migrations with a versioned schema, as
          | otherwise a rollback would mean data loss.
        
           | whazor wrote:
           | With big data such a model is even more promising. One of the
           | big problems is that people keep copying data sets, which
           | does not scale. Just syncing newer versions is much more
            | efficient; look at delta.io.
        
           | throwaway892238 wrote:
           | You're thinking in terms of the present, but I'm saying it's
            | the future. _At present_ it doesn't make sense, because
           | nobody has yet made a database which does version control on
           | very big datasets with a lot of throughput. But when somebody
           | _does_ make it, it will be completely obvious that this was
           | something we always needed.
           | 
           | It's all just software. There is essentially no limit to what
           | we can make software do as long as the hardware supports it.
           | And there's no hardware limit I know of that says version-
           | controlled databases can't work. We just need to figure out
           | how they will work, and then make 'em (or try to make 'em and
           | in the process figure it out).
           | 
           | > And it's not like you wouldn't need migrations with a
           | versioned schema, as otherwise a rollback would mean data
           | loss.
           | 
           | When you roll back a code change, you don't lose code, as
           | it's still in history. If you need to revert but keep some
           | code, you branch the code, copy the needed code into some
           | other part of the app, revert the old change, merge
           | everything. If on merge there is a conflict (let's presume
           | losing data is a conflict), it can prompt you to issue a set
           | of commands to resolve the conflict before merge. You could
           | do all of that in a branch, test it, merge into prod, and on
           | merge it could perform the same operations. The database does
           | all the heavy lifting and the user just uses a console the
           | way they use Git today.
           | 
           | It's probably going to be required to lock the version of
           | software and the version of the database together, such that
           | both are changed/reverted at the same time. But because this
           | is version control, we could actually serve multiple versions
           | of the same database at the same time. You could have the
           | database present two different versions of itself with the
           | same data COW-overlayed for each version, and two different
           | versions of an application. You could then blue/green deploy
           | both the application and database, each modifying only its
           | version. If you need to revert, you can diff and merge
           | changes from one version to another.
        
             | hinkley wrote:
             | Do we make databases smarter, or start asking for database-
             | like behavior from version control systems?
             | 
             | I can't help thinking how much time the git team or
             | Jetbrains or Chrome or the Confluence team have spent on
             | backing store implementation/migration and file formats
             | instead of using sqlite (like why aren't V8 heap and perf
             | snapshots just a sqlite database?) but then many of these
             | things operate in that gap. So do we keep improving change
             | over time functionality in databases, or make VCS backends
             | with more formal database-like behavior?
             | 
             | IIRC, Trac stores its wiki history in a subversion
             | repository. Since it already had to understand commit
             | histories and show diffs, that was a sensible choice. Of
             | course it is easier to live with such a decision if the API
             | is good, but I haven't heard anyone say that about any
             | version control system yet.
        
               | throwaway892238 wrote:
               | Well, they're discrete problems. Version control of
               | source code, packaged applications, container images,
               | databases are all quite different.
               | 
               | Git is a distributed file manager that operates on files
               | where every change is a commit, and a commit is a set of
               | operations on files, and/or a change to a block of text
               | strings terminated by newlines. Versions are merkle trees
               | of commits.
               | 
               | RPM/Deb/etc is a semi-centralized file manager that
               | operates on files assuming each change is a collection of
               | files with executable stages before and after
               | copying/linking/unlinking. Versions are arbitrary
               | key=value pairs which optionally depend on other
               | versions, with extra logic to resolve relative versions.
               | 
               | Docker/OCI is a distributed file manager that operates on
               | layers assuming every layer is a collection of files
               | overlaid on other layers, with extra logic to do extra
               | things with the layers at runtime. Versions are (I
               | think?) merkle trees of layers.
               | 
               | The database is going to need a helluva lot of custom
               | heuristics and operations to do version-control, because
               | how you use it is so much different than the above.
               | Databases are much more complex beasts, require higher
               | performance, higher reliability, tons more functionality.
        
             | iamnafets wrote:
             | I think the problem is that the tradeoffs already exist.
             | Most users would prefer more usable space or less money to
             | a full history of their data.
             | 
             | You might be making the argument that the _usability_ of
             | immutable data is not there yet, but there are well-
             | established theoretical costs of maintaining full history
              | and I don't think they're within bounds of many real-world
             | use-cases.
        
               | CPLX wrote:
               | As a guy who's been doing technical stuff of one kind or
               | another since the mid 90's I would say that any analysis
               | that insists that a specific use case has tradeoffs due
               | to lack of memory or processing speed has an inevitable
               | expiration date.
        
               | throwaway892238 wrote:
               | If the user doesn't want full history they could
               | configure the database to expunge it with a lifecycle
               | policy, though I think keeping deltas of the changes
               | would make any excess file storage negligible, as most
               | people don't seem to ever get rid of data anyway.
        
           | zachmu wrote:
           | You're suffering from a failure of imagination.
           | 
            | Consider a CMS, one of the most common forms of database-
            | backed applications. What if you could give your customer
            | a "dev" branch of all their data to make changes on and
            | test out new content, which you could then merge back to
            | prod after somebody reviews it in a standard PR workflow?
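            | 
            | Reviewing that "PR" could then be a diff query (a sketch;
            | dolt_diff_<table> system tables per Dolt's docs, with a
            | hypothetical 'pages' table that has a 'title' column):
            | 
            |     SELECT diff_type, from_title, to_title
            |     FROM dolt_diff_pages
            |     WHERE to_commit = HASHOF('dev');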
           | 
           | This is the workflow one of our earliest customers built.
           | They run network configuration software, and they use Dolt to
           | implement a PR workflow for all changes their customers make.
           | 
           | More details here:
           | 
           | https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/
        
             | password4321 wrote:
             | See also: https://versionpress.com (not being actively
             | developed)
        
               | zachmu wrote:
               | If we were less busy we would adopt versionpress and port
               | it to dolt, such a cool product. Some day.
        
             | Johannesbourg wrote:
              | Personally, working with timeseries data, my experience
              | is that clients typically underestimate how much storage
              | they need for a single state, let alone with historic
              | versions included. The decision-makers want more data,
              | not more snapshots, for a given storage spend. But
              | that's timeseries.
        
               | hinkley wrote:
               | They want more data but they don't want to pay for it.
               | People want lots of things, doesn't mean they get it or
               | deserve it.
               | 
               | I can't recall which it was but one of the timeseries
               | databases was bragging on the fact that there are certain
               | situations where scanning a block of data is as cheap as
               | trying to add finer grained indexes to it, especially
               | with ad hoc queries. They did a bunch of benchmarks that
               | said block scanning with compression and parallelism was
               | workable.
               | 
               | And while compression typically leads to write
               | amplification (or very poor compression ratios), in a
               | timeseries database, or a regular database architected in
               | a timeseries-like fashion, modifying the old data is
               | deeply frowned upon (and in fact I've heard people argue
               | for quasi-timeseries behavior _because_ modifying old
               | records is so punishing, especially as the application
               | scales), so as long as you can decide not to compress
               | some pages - new pages - this is not a problem.
        
           | packetlost wrote:
           | I think the problem is relational datasets like that don't
           | fit well into that model. In reality, it's very possible.
           | Look at Datomic. While I agree, for high-throughput systems,
           | storage is a concern, but the general trends seem to be
           | towards streaming data and retention policies _anyways_.
        
         | jahnu wrote:
          | Hasn't Datomic done all this for some years now?
        
           | zachmu wrote:
           | Lots of databases offer time travel / historical querying,
            | including Datomic, MySQL, Postgres, etc. (plugins required in
           | some cases).
           | 
           | Dolt's unique functionality isn't time travel, although it
           | has that. It's version control, i.e. branch and merge, push
           | and pull, fork and clone. A bunch of database products give
           | you some of this for schema migrations, but Dolt is the only
           | one that does it for table data as well.
        
             | rapnie wrote:
             | I think TerminusDB does that as well.
        
               | zachmu wrote:
               | Yup, TerminusDB has a very similar set of capabilities,
               | just for a graph DB instead of SQL / relational. Very
               | cool product if you're in the market for a graph DB.
        
         | dizhn wrote:
         | How would you revert, merge, rollback or otherwise make sense
         | of real world data that necessarily reflects "state"? Or is
         | this only for development?
        
       ___________________________________________________________________
       (page generated 2022-06-23 23:02 UTC)