[HN Gopher] Dolt is Git for Data
___________________________________________________________________
Dolt is Git for Data
Author : gjvc
Score : 292 points
Date : 2022-06-23 11:04 UTC (11 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| pdpi wrote:
| Given how many databases already implement MVCC, some form of
| versioning is already baked into DB systems. It seems obvious in
| hindsight that this could be reified into its own feature instead
| of a book-keeping detail. Cool to see a project actually tackle
| that challenge.
| adamw2k wrote:
| Was going to say it reminded me of noms, only to realize it's a
| fork...
|
| https://github.com/attic-labs/noms
| timsehn wrote:
| Creator here...
|
| We are a fork of Noms. Lots of incremental changes, and we're
| in the middle of a major overhaul of the storage engine (the
| part we use Noms for) to improve performance as we speak.
| mriet wrote:
| Dolt is git for _databases_.
|
| Nessie is git for _data_.
| magicalhippo wrote:
| Somewhat unrelated but...
|
|       docs $ dolt conflicts cat docs
|       +-----+--------+----+----+
|       |     |        | pk | c1 |
|       +-----+--------+----+----+
|       |     | base   | 1  | 1  |
|       |  *  | ours   | 1  | 10 |
|       |  *  | theirs | 1  | 0  |
|       +-----+--------+----+----+
|
| For some reason my brain just can't seem to remember what's
| "ours" and what's "theirs" when resolving merge conflicts. Like
| which one represents what was in the working copy and which one
| represents the data I'm trying to merge. I'd much prefer absolute
| terms rather than those relative ones.
| zachmu wrote:
| Yeah I'm kinda the same way. Ultimately we decided to copy
| git's terminology for this rather than making users learn new
| terms.
|
| (ours is the working copy, theirs is the thing you're merging
| in)
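|
| The same distinction shows up on the SQL side. A rough sketch
| using the docs table from above (column names assumed from the
| dolt_conflicts_<table> convention in our docs):
|
|       -- each conflict row carries base/ours/theirs side by side
|       SELECT base_c1, our_c1, their_c1 FROM dolt_conflicts_docs;
|
|       -- keep "ours" by leaving the table as-is, then clear the
|       -- conflict markers
|       DELETE FROM dolt_conflicts_docs;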
| magicalhippo wrote:
| Yeah I totally get keeping the nomenclature given Dolt tries
| to be Git-like.
| memorable wrote:
| Seems interesting! Will keep an eye on this.
| amelius wrote:
| The nice thing about Git is that it is batteries-included.
|
| I don't want to deal with databases and everything that comes
| with it. Just give me something that works with the filesystem
| directly.
| MontyCarloHall wrote:
| No mention of performance on their GitHub homepage. Anyone know
| how this compares to a standard MySQL distribution?
| jpnc wrote:
| They say it's slower but are trying to close the gap [1]
|
| [1] - https://docs.dolthub.com/sql-reference/benchmarks/latency
| NickHoff wrote:
| In this [1] blog post, they say it's "between 2 and 20 times
| slower than MySQL (but committed to getting down to 4x
| slower)".
|
| [1]: https://www.dolthub.com/blog/2021-03-09-dolt-use-cases-in-
| th...
| chirau wrote:
| They have a pretty interesting data bounty program as well.
|
| https://www.dolthub.com/bounties
|
| I participated in one a few months back. I had to stop midway
| because work was calling, but it was a lot of fun. I know I
| earned some dollars, lol, but probably too few to count for
| the competition.
|
| I am actually trying to convince my manager to put up a few
| bounties later this year.
|
| Their Discord support is also super impressive.
| gniap wrote:
| It's quite similar to https://qri.io in functionality, but
| unfortunately Qri stopped providing their Cloud service
| (https://qri.io/winding_down).
| remram wrote:
| What kind of merge actually happens, e.g. how does it deal with
| conflicts? Does it merge at the SQL command level or at the
| changed tuple level?
|
| If I have
|
|       name    job  age
|       remram  dev  32
|
| and I concurrently do
|
|       UPDATE table SET age = age + 2 WHERE job = 'dev';
|       UPDATE table SET job = 'mgr' WHERE age = 32;
|
| Do I get a conflict? Do I get something consistent with the
| two transactions serialized (e.g. dev 34 or mgr 32)? Can I get
| something that no ordering of transactions could have given
| me, like mgr 34?
| zachmu wrote:
| Merge is cell-wise. If the two heads being merged edited
| disjoint cells, there's no conflict, they merge together. If
| they touched the same cell, or if one branch deleted a row
| another added, then you get a conflict.
|
| Merge conflicts are stored in a special system table so you can
| decide what to do with them, kind of analogous to conflict
| markers in a source file. More details here:
|
| https://docs.dolthub.com/sql-reference/version-control/merge...
|
| The situation you're talking about with two transactions isn't
| a merge operation, it's just normal database transaction
| isolation level stuff. Dolt supports REPEATABLE_READ right now,
| with others coming in future releases. So in the example above,
| whichever transaction committed last would fail and get rolled
| back (assuming they touched the same rows).
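|
| A minimal sketch of what that looks like in SQL (branch and
| table names invented for illustration):
|
|       CALL DOLT_MERGE('feature-branch');
|       -- which tables conflicted, and how many rows each
|       SELECT * FROM dolt_conflicts;
|       -- inspect/fix rows, then clear the markers to resolve
|       DELETE FROM dolt_conflicts_mytable;
|       CALL DOLT_COMMIT('-am', 'Merge feature-branch');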
| remram wrote:
| > The situation you're talking about with two transactions
| isn't a merge operation, it's just normal database
| transaction isolation level stuff
|
| I mean, arguably. It's not like there is a standard
| definition for "merge operation" on data. Even Git tries to
| do more than line-level changes, taking context into account,
| and turning one line change into two line changes if merging
| across a file copy for example.
|
| Dolt markets itself as "a version controlled SQL database",
| so I think it is perfectly reasonable to consider the
| standard that already exists for concurrent changes to a SQL
| database, and that's transaction isolation.
|
| I guess anything more complex than this would be pretty
| unwieldy though, with probably little benefits. I am
| struggling to come up with a good example for the kind of
| anomaly I imagine.
| zachmu wrote:
| You're thinking too small. The transaction is generally not
| the unit you want to apply version control to, databases
| already have robust concurrency support at that level.
|
| What you want is to have branches on your data that have
| many transactions applied to them, for days or weeks. Then
| you merge the branch back to main when it's ready to ship.
|
| An early customer built pull requests for their
| configuration application using Dolt, read more here:
|
| https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/
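|
| A rough sketch of that loop in SQL (names invented):
|
|       CALL DOLT_CHECKOUT('-b', 'config-updates');
|       UPDATE devices SET vlan = 300 WHERE site = 'nyc';
|       CALL DOLT_COMMIT('-am', 'Move NYC devices to VLAN 300');
|       -- days or weeks later, once reviewed:
|       CALL DOLT_CHECKOUT('main');
|       CALL DOLT_MERGE('config-updates');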
| remram wrote:
| Please don't tell me what I'm thinking or what I want to
| do ;-)
|
| The use case I had in mind is closer to a data.world or
| dbhub, collaborative data cleaning (which often include
| bulk operations) rather than merging OLTP databases after
| days or weeks of changes.
|
| What I would use now is a script or Jupyter Notebook
| checked into Git, where cells are strongly ordered and if
| someone sends me a pull request (changing the code) I
| have to re-run the notebook to obtain the "merged
| dataset". I can't say that I have a use for Dolt but it
| is definitely cool tech.
| zachmu wrote:
| If you're a data cleaning guy, you might be interested in
| our data bounties program. It's collaborative data
| importing / cleaning. The current one is building a
| catalog of museum collections. We launch a new one every
| month.
|
| https://www.dolthub.com/blog/2022-06-14-announcing-
| museums-b...
| melvinroest wrote:
| Shoutout to Smalltalk [1] for using doit [2] (do it) to execute
| code! Name collisions happen, it's handy to know about them ;-)
|
| [1] https://en.wikipedia.org/wiki/Smalltalk
|
| [2] https://twitter.com/worrydream/status/1424952628207702022 <--
| Bret Victor talking about doit
| allknowingfrog wrote:
| I think you mistook the "l" for an "i". This is "dolt", not
| "doit".
| melvinroest wrote:
| dolt
|
| DoIt
|
| Ah... whoops, my bad
| arealaccount wrote:
| I've been looking for something like this where you can bring
| your own DB. The ability to version schema and metadata is
| much more interesting than a SaaS DB host.
| [deleted]
| Hakeemmidan wrote:
| this looks really fun
| pgt wrote:
| XTDB is a general-purpose bitemporal database with graph query
| capabilities: https://xtdb.com/
| infogulch wrote:
| There was a short talk about xtdb in the recent 'rubbing a
| database on it' conference:
| https://www.hytradboi.com/2022/baking-in-time-at-the-bottom-...
|
| The talk expands on what "bitemporal" means in this context,
| namely, separate "valid" and "audit" time dimensions. The
| discussion upthread is a good example: if you accidentally mark
| down inventory in the wrong location and you find out later,
| you should backdate the fix to when you did the count -- this
| is the 'valid' dimension, but the fact that you changed it
| today should also be recorded so it can be reviewed later --
| this is the 'audit' dimension.
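|
| In plain SQL terms (a generic illustration, not xtdb's actual
| API) bitemporality is just two time dimensions per fact:
|
|       CREATE TABLE inventory (
|         item        TEXT,
|         location    TEXT,
|         valid_from  TIMESTAMP,  -- when it was true in the world
|         valid_to    TIMESTAMP,
|         recorded_at TIMESTAMP   -- when the database learned it
|       );
|
|       -- "what did we believe on June 1st about May 1st?"
|       SELECT * FROM inventory
|       WHERE valid_from <= '2022-05-01' AND valid_to > '2022-05-01'
|         AND recorded_at <= '2022-06-01';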
| anyfactor wrote:
| I use dolthub's public database to practice queries.
|
| I gave their bounty program a shot but I quit when I was
| prompted to download/clone an 8 GB database. So the best thing
| would be to have a DoltHub account and develop the data from
| there.
|
| I think if you do a little bit of git hacking you can create a
| branch with just an empty or a dummy database based on the
| schema, clone only that branch and make a PR to append to the
| original database. But I couldn't figure that out.
| ascar wrote:
| Could you elaborate on what's so bad about 8 GB of test data?
| This is a database project after all, and many things only
| show up with a certain amount of data.
| anyfactor wrote:
| I tried downloading the data and my network connection
| glitched at 6 GB; I was thrown an error and the process
| exited. I believe I was told to restart the entire process as
| there was no way to continue from where I left off. I didn't
| try again.
|
| I think there could be a better design for cloning/downloading
| the database onto a local machine. I understand people don't
| actually download databases to their local machine but I
| expected a better solution.
| zachmu wrote:
| Definitely. We have plans to allow clone to be a shallow
| operation, so that the original data stays remote and only
| your changes are local. Will make queries slower obviously,
| but that's the right trade-off for some people.
| alex-treebeard wrote:
| great name
| rwoerz wrote:
| I misread it as "do it" and wonder if that was intended too.
| obliquely wrote:
| DOLT vs DO-IT:
| https://www.folklore.org/StoryView.py?story=Do_It.txt
| [deleted]
| [deleted]
| patrickdevivo wrote:
| a very cool project they also maintain is a MySQL server
| framework for arbitrary backends (in Go):
| https://github.com/dolthub/go-mysql-server
|
| You can define a "virtual" table (schema, how to retrieve
| rows/columns) and then a MySQL client can connect and execute
| arbitrary queries on your table (which could just be an API or
| other source)
| aratob wrote:
| Previous discussions:
|
| - Dolt is Git for Data: a SQL database that you can fork, clone,
| branch, merge (2021) - 177 comments
| https://news.ycombinator.com/item?id=26370572
|
| - Dolt is Git for data (2020) - 191 comments
| https://news.ycombinator.com/item?id=22731928
| nailer wrote:
| I thought dat was git for data. But it looks like it's dead now.
|
| Docs site is still up: https://docs.dat.foundation/
| cosmic_quanta wrote:
| That looks awesome. One of the listed use-cases is 'time-travel':
| https://dolthub.com/blog/2021-03-09-dolt-use-cases-in-the-wi...
|
| I wish we could use this at work. We're trying to predict
| time-series stuff. However, there's a lot of infrastructure
| complexity whose sole purpose is to ensure that when we're
| training on data from years ago, we're not using data that
| would be in the future from that point (future data leaking
| into the past).
|
| Using Dolt, as far as I understand it, we could simply set the DB
| to a point in the past where the 'future' data wasn't available.
| Very cool
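|
| (From what I can tell from the docs, the query side is
| something like
|
|       SELECT * FROM prices AS OF '2020-01-01';
|       SELECT * FROM prices AS OF 'HEAD~100';
|
| where the argument can be a timestamp, branch, or commit --
| though I haven't tried it myself.)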
| wodenokoto wrote:
| For at least your use-case, this is a fairly common feature in
| enterprise databases.
|
| For a lot of types of data, auditing requires history.
|
| Have a look at MSSQLs history table, for example:
| https://docs.microsoft.com/en-us/sql/relational-databases/ta...
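|
| Roughly, from memory (T-SQL; check the docs for details):
|
|       CREATE TABLE dbo.Prices (
|         Id       INT PRIMARY KEY,
|         Price    DECIMAL(10,2),
|         SysStart DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
|         SysEnd   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
|         PERIOD FOR SYSTEM_TIME (SysStart, SysEnd)
|       ) WITH (SYSTEM_VERSIONING = ON);
|
|       -- the table as it was at a point in time
|       SELECT * FROM dbo.Prices
|       FOR SYSTEM_TIME AS OF '2020-01-01';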
| yread wrote:
| We use DataVault for that. And perhaps Databricks at some point
| in the future
| kortex wrote:
| Have you looked at DVC (www.dvc.org)? It takes a little bit to
| figure out how you want to handle the backing store (usually
| s3) but then it's very straightforward. You could do a similar
| pattern: have a data repository and simply move the git HEAD to
| the desired spot and dvc automatically adds/removes the data
| files based on what's in the commit. You can even version
| binaries, without blowing up your .git tree.
| nerdponx wrote:
| DVC is _great_ for tracking locally-stored data and artifacts
| generated in the course of a research project, and for
| sharing those artifacts across a team of collaborators
| (and/or future users).
|
| However DVC is fundamentally limited because you can only
| have dependencies and outputs that are files on the
| filesystem. Theoretically they could start supporting
| pluggable non-file-but-file-like artifacts, but for now it's
| just a feature request and I don't know if it's on their
| roadmap at all.
|
| This is fine, of course, but it kind of sucks for when your
| data is "big"-ish and you can't or don't want to keep it on
| your local machine, e.g. generating intermediate datasets
| that live in some kind of "scratch" workspace within your
| data lake/warehouse. You can use DBT for that in some cases,
| but that's not really what DBT is for and then you have _two_
| incompatible workflow graphs within your project and a whole
| other set of CLI touch points and program semantics to learn.
|
| The universal solution is something like Airflow, but it's
| way too verbose for use during a research project, and
| running it is way too complicated. It's an industrial-
| strength data engineering tool, not a research workflow-and-
| artifact-tracking tool.
|
| I think my ideal tool would be "DVC, but pluggable/extensible
| with an Airflow-like API."
| henrydark wrote:
| I have DVC pipelines whose inputs/outputs are Iceberg snapshot
| files. The data gets medium-big and it works well.
| nerdponx wrote:
| I never heard of Apache Iceberg before. I've used
| Databricks Delta Lake; is it similar? What is a snapshot
| file in this case?
| henrydark wrote:
| It's basically the same, I just went with iceberg because
| the specification is a bit more transparent
| nerdponx wrote:
| Interesting. So the snapshot file acts much in the same
| way as a manual "sentinel" file? I generally try to avoid
| such things because they are brittle and it's easy to make a
| mistake and get the "ad hoc database on your filesystem"
| out of sync with the actual data.
| isolli wrote:
| I'm looking into DVC right now, and I feel like the code
| history (in git) and the data history are too intertwined. If
| you move the git HEAD back, then you get the old data back,
| but you also get the old code back. I wish there was a way to
| move the two "heads" independently. Or is there?
|
| Edit: I can always revert the contents of the .dvc folder to
| a previous commit, but I wonder if there's a more natural way
| of doing it.
| arjvik wrote:
| If you want the dataset to be independent, I would
| recommend having a separate repository for the dataset, and
| using Git Submodules to pull it in. That way you can
| check out different versions of the dataset and code because
| they are essentially in separate working trees.
| george_ciobanu wrote:
| also check out Datomic.
| lichtenberger wrote:
| The research project [1] I'm working on in my spare time is
| all about versioning: efficiently storing small-sized
| revisions of the data, as well as allowing sophisticated time
| travel queries for audits and analysis.
|
| Of course all secondary user-defined, typed indexes are also
| versioned.
|
| The technical idea is to map a huge tree of index tries (with
| revisions as indexed leaf pages at the top level, and a
| document index as well as secondary indexes on the second
| level) to an append-only file. To reduce write amplification
| and the size of each snapshot, data pages are first compressed
| and then versioned through a sliding snapshot algorithm. Thus,
| Sirix does not simply do a copy-on-write per page. Instead it
| writes the nodes which have changed in the current revision,
| plus the nodes which fall out of the sliding window (therefore
| it needs a fast random-read drive).
|
| [1] https://github.com/sirixdb/sirix
| awmarthur wrote:
| That sounds somewhat similar to Dolt's storage index
| structure: Prolly Trees
| https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-
| tabl...
| cgio wrote:
| I have been interested in this space, but have failed to
| understand how these versioning solutions for data work in the
| context of environments. Some aspects of time travel line up
| better with data modelling approaches (such as bitemporal DBs,
| xtdb etc.), others with git-like use cases (e.g. schema
| evolution, backfilling), and some with combinations of the
| two. The challenge is that with data I don't see how you'd
| want all environments in the same place/repo, and there may be
| additional considerations coupled with the directionality of
| moves, such as anonymisation when moving from prod to
| non-prod, or backfilling when moving from non-prod to prod.
| Keen to read more about other people's experiences in this
| space and how they might be combining different solutions.
| nerdponx wrote:
| These tools aren't really meant for developers. They are meant
| for researchers, analysts, and other "offline" users and
| managers of data sources. Data science research workflows
| generally don't need the same "dev/test/prod" kind of
| environment setup.
| zachmu wrote:
| I won't speak for other data versioning products, but Dolt is
| definitely for developers. Our customers are software
| engineers writing database applications that need version
| control features.
| ollien wrote:
| I've only used Dolt once, but it was very helpful when I was
| working in an ML class on a group project. Previously we would
| commit gigantic CSVs to git, which sucked. Putting it in Dolt
| made a lot of the data exploration and sharing easier, and
| separated our data from the code.
| dmd wrote:
| See also: https://www.datalad.org/
| henrydark wrote:
| I get "git for data" by putting iceberg snapshot files _in_ git.
| That's it. It's literally git for data.
| throwaway787544 wrote:
| Aw man, why did they have to saddle themselves with Oracle? :(
| timsehn wrote:
| Creator here...
|
| You only need an Oracle client (i.e. a MySQL client), but we
| will eventually have a Postgres foreign data wrapper for the
| storage engine. I explain how we ended up with MySQL in this
| blog post if you're curious.
|
| https://www.dolthub.com/blog/2022-03-28-have-postgres-want-d...
| wonderwonder wrote:
| Very cool. Would love to know the size of the team that built
| this. I like learning about the people who create interesting
| new products.
| zachmu wrote:
| https://www.dolthub.com/team
| wonderwonder wrote:
| lol, that would probably be the best place for me to have
| looked. Appreciate it!
| bjarneh wrote:
| This looks super handy, probably worth the performance penalty in
| most cases to make every command undo-able (except drop
| database).
| timsehn wrote:
| We're trying to close the performance gap. Current performance
| on sysbench can be found here:
|
| https://docs.dolthub.com/sql-reference/benchmarks/latency
|
| We're even slower on heavy transactional use cases.
|
| We're dedicating the year to it and we think we can get under
| 2X MySQL. We have customers who switched from NoSQL DBs and
| actually got faster, so getting close to MySQL would be a
| great accomplishment.
| iRomain wrote:
| Also in the same vein, check out https://lakefs.io/
| layer8 wrote:
| Only MySQL-compatible, apparently?
| sitkack wrote:
| https://github.com/dolthub/go-mysql-server 170kloc of pure Go
| timsehn wrote:
| Creator here.
|
| You only connect with a MySQL client. There is no MySQL code in
| it. Here's the story if you're curious.
|
| https://www.dolthub.com/blog/2022-03-28-have-postgres-want-d...
|
| We will eventually build a Postgres Foreign Data Wrapper.
| bestouff wrote:
| I don't find the name very nice: there's too much potential
| for mixing it up with "doit" with a capital "d" and a capital
| "i".
| LandStander wrote:
| I blame font designers.
| cal85 wrote:
| The font designer designed a font, not a logo.
| amelius wrote:
| At first I thought it was d01t
| lyso wrote:
| https://www.folklore.org/StoryView.py?story=Do_It.txt
| CPLX wrote:
| I mean forget about mixing it up, what about the actual meaning
| of the word:
|
|       dolt  noun  \ 'dolt \
|       : a stupid person
|
| https://www.merriam-webster.com/dictionary/dolt
| [deleted]
| graedus wrote:
| I thought this was an odd name too, but I'm guessing it's
| intentional, as _git_ is also a derogatory word with a
| similar meaning.
|
| : a foolish or worthless person
|
| https://www.merriam-webster.com/dictionary/git
| timsehn wrote:
| One of the creators here... Indeed. Intentional.
|
| https://docs.dolthub.com/other/faq#why-is-it-called-dolt-
| are...
| [deleted]
| ghkbrew wrote:
| Are you aware of the meaning of the word "git"?
|
| https://www.merriam-webster.com/dictionary/git
| CPLX wrote:
| Without ever thinking hard about it I always assumed git
| came from American cowboy/western slang:
|
| Specifically this:
| https://en.wikipedia.org/wiki/Git_Along,_Little_Dogies
|
| Every time I type it in I am visualizing my little code
| changes as sickly orphaned calves being herded into the
| main pen with all the mature cows currently working away at
| their various functions.
|
| Of course given the actual author the British/European
| meaning makes a lot more sense.
| bravogamma wrote:
| https://www.folklore.org/StoryView.py?story=Do_It.txt
| evanb wrote:
| I'd have gone with digit.
| moffkalast wrote:
| Dew it.
|
| Good, good. It was too dangerous to not keep a backup.
| smm11 wrote:
| Data? rsync
| throwaway892238 wrote:
| This is the future of databases, but nobody seems to realize it
| yet.
|
| One of the biggest problems with databases (particularly SQL
| ones) is they're a giant pile of mutable state. The whole idea of
| "migrations" exists because it is impossible to "just" revert any
| arbitrary change to a database, diff changes automatically, merge
| changes automatically. You need some kind of intelligent tool or
| framework to generate DDL, DML, DCL, they have to be applied in
| turn, something has to check if they've already been applied,
| etc. And of course you can't roll back a change once it's been
| applied, unless you create even more program logic to figure out
| how to do that. It's all a big hack.
|
| By treating a database as version-controlled, you can treat any
| operation as immutable. Make any change you want and don't worry
| about conflicts. You can always just go back to the last working
| version, revert a specific change, merge in one or more changes
| from different working databases. Make a thousand changes a day,
| and when one breaks, revert it. No snapshotting and slowly
| restoring the whole database due to a non-reversible change.
| Somebody dropped the main table in prod? Just revert the drop.
| Need to make a change to the prod database but the staging
| database is different? Branch the prod database, make a change,
| test it, merge back into prod.
|
| The effect is going to be as radical as the popularization of
| containers. Whether you like them or not, they are
| revolutionizing an industry and are a productivity force
| multiplier.
| LukeEF wrote:
| This has to be imagined in the context of a post-SQL future.
| Unless of course you are a 'SQL is the end of history' person!
| jandrewrogers wrote:
| This is how relational databases have commonly worked since at
| least the 1990s and is called multi-version concurrency control
| (MVCC). Welcome to the future, it is called PostgreSQL. There
| are at least two reasons no sensible database designer would
| allow users to operate a database in this way even though they
| are technically capable of it.
|
| First, keeping every version of every piece of data forever is
| an excellent way to consume non-intuitively vast amounts of
| storage even if your data model is tiny. Every time this
| feature has been offered by databases, it immediately causes a
| rash of "out of storage" errors that force the user to manually
| and permanently delete large numbers of old versions. This is
| extremely user-unfriendly, so the feature is almost immediately
| removed in subsequent versions because the pain it causes far
| outweighs the benefits even when used carefully. In typical
| MVCC systems, old versions are aggressively garbage collected
| automatically to limit out-of-storage errors.
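|
| (Postgres makes this concrete: its VACUUM is exactly that
| garbage collection, e.g.
|
|       VACUUM VERBOSE mytable;
|
| reports how many dead row versions it removed.)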
|
| Second, finding or reconstructing an arbitrary number of old
| versions of data is unavoidably expensive. Much of the
| architectural difference between various MVCC implementations
| are trying to manage the rather severe performance tradeoffs of
| maintaining multiple versions of data and navigating to the
| version you need, with the understanding that all of these
| versions live on storage and rarely in a single place. There is
| no optimal way, and keeping version chains short is critical
| for good performance.
|
| There is very deep literature around MVCC-style databases. The
| challenges of generalizing and maximally exploiting MVCC as a
| user feature while having performance that is not poor to the
| point of unusability are thoroughly documented.
| zachmu wrote:
| MVCC is not version control, and time travel / historical
| querying is not version control.
|
| Dolt's unique functionality isn't time travel, although it
| has that. It's version control, i.e. branch and merge, push
| and pull, fork and clone. A bunch of database products give
| you some of this for schema migrations, but Dolt is the only
| one that does it for table data as well.
| jandrewrogers wrote:
| The conceit here is the assumption that this has not been
| built many times by very clever software engineers. It is
| not a new idea. True git-like version control systems for
| managing large volumes of data have been built on MVCC
| kernels for _decades_ -- branch and merge, push and pull,
| fork and clone.
|
| There are fundamental computer science and technical issues
| that make scaling these systems for arbitrary data models
| extremely difficult. The platforms always had the ambition
| to be general but the design tradeoffs required to make
| them scale requires narrowly overfitting for a particular
| type of data model such that they can _only_ be used for
| the original use case. And even then, the performance ends
| up being not good.
|
| I've never designed one from scratch but I've worked on a
| few at large companies. All of them started with the vision
| you are proposing, all of them failed at achieving that
| vision because of the technical tradeoffs required to
| enable something resembling scalability. Unless you are
| proposing some novel computer science that renders these
| issues moot, you aren't presenting a credible defense that
| this hasn't been done before.
| zachmu wrote:
| Git-like version control requires a Merkle DAG. Unless
| you know something I don't, there are no RDBMS products
| that incorporate a Merkle DAG for storage. Dolt is the
| first.
|
| Table data is stored in a cross between a Merkle DAG and
| a B Tree (a prolly tree), which is what makes diff /
| merge performant and scalable. We didn't invent these
| data structures but we believe we are the first to build
| a SQL database on them.
|
| https://docs.dolthub.com/architecture/storage-
| engine/prolly-...
| jandrewrogers wrote:
| > Git-like version control requires a Merkle DAG.
|
| This is false, you are conflating the abstract algorithm
| with a narrow implementation. That's like saying the only
| possible sorting algorithm is quicksort.
|
| With all due respect, you seem to be only loosely
| familiar with database architecture, both theory and
| historical practice. Nothing you've described is actually
| novel. That you are unfamiliar with _why_ no one builds
| things this way, despite many attempts, does not lend
| confidence.
|
| I am actually a big fan of people trying unorthodox
| approaches to databases that have never been tried
| before, this just isn't such an example. Which doesn't
| make your approach wrong per se, but it leaves you
| exposed to learning why other people tried and abandoned
| it.
|
| Tangentially, the "prolly tree" is intrinsically not a
| scalable data structure. That may satisfy your design
| requirements but I can't tell.
| simlevesque wrote:
| QLDB seems like something that goes in this direction. What's
| your opinion of it?
| zachmu wrote:
| QLDB isn't version controlled, it's an immutable ledger
| database. If that's what you're looking for, definitely
| choose it, it's a great product.
|
| More discussion of the immutable DB product space:
|
| https://www.dolthub.com/blog/2022-03-21-immutable-database/
| qaq wrote:
| You do realize that's how MVCC works, right? Turn off the GC
| process that collects old versions on a production DB and see
| what happens. Reverting changes out of order is not possible
| in many cases.
| zachmu wrote:
| Right, but it is with Dolt. That's the point.
|
| CALL DOLT_REVERT('bad-commit-hash')
|
| Works just like git revert: creates an inverse of the changes
| in the commit given and applies it as a patch.
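|
| You can find the hash to revert in the commit log system
| table, something like:
|
|       SELECT commit_hash, message FROM dolt_log;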
| blowski wrote:
| It looks incredible, but somehow seems too good to be true.
|
| What are the tradeoffs here? When wouldn't I want to use this?
| timsehn wrote:
| Creator here.
|
| It's slower. This is `sysbench` Dolt vs MySQL.
|
| https://docs.dolthub.com/sql-reference/benchmarks/latency
|
| We've dedicated this year to performance with a storage
| engine rewrite. We'll have some performance wins coming in
| the back half of the year. We think we can get under 2X
| MySQL.
|
| It also requires more disk. Each change is on average at
| least 4K on disk. So, you might need more/bigger hard
| drives.
| kragen wrote:
| (Disclaimer: I haven't tried Dolt.)
|
| In your benchmark it's only 2.1-7.4 times slower than
| MySQL, average 4.4. And any database someone could fit on a
| disk 20 years ago (I forget, maybe 8 GB?) fits in RAM now,
| which makes it about 256 times faster, which is a lot
| bigger than 4.4. You can get a 20 TB disk now, which is
| enough space. So anything that could be done with MySQL 20
| years ago can be done faster and cheaper with Dolt now,
| which covers, I think the technical term is, a fucking
| shitload of applications. It probably includes _literally
| every 20th-century application of relational databases_.
|
| Well, except for things that have over 5 billion
| transactions (20 TB / 4 kB/txn) over their lifetime, I
| guess, so it might be important to find a way to compact
| that 4K. 5 billion transactions is 19 months at 100 TPS. If
| you could get that down to 256 bytes it would be almost 25
| years of 100 TPS.
|
| Also, as I understand it, and correct me if I'm wrong here,
| that 4.4x slowdown buys you a bulletproof and highly
| performant and scalable strategy for backups (with PITR),
| staging servers, data warehousing, readslaves, disk error
| detection and recovery, cryptographically secure audit
| logs, bug reproduction, and backtesting. Along with the
| _legal_ security the Apache 2 license gives you, which you
| don't have with Datomic.
|
| Sounds fantastic! It sounds like you're selling its
| performance a bit short. If someone is really concerned
| about such a small performance loss they probably aren't
| really in the market for a new RDBMS.
| EarthLaunch wrote:
| Another commenter noted a need for migrations in order to
| handle rollbacks without data loss.
| hinkley wrote:
| In a similar vein, I am trying to work on a tool for a
| volunteer group, and one of the problems I'm trying to figure
| out how to model is not just tracking when the data changed,
| but why it changed.
|
| Sometimes when you're first entering data you just get it
| wrong. You said the wrench was in storeroom A1 or the customer
| lives on 3rd St or the tree was five feet from the sidewalk. If
| the state of the asset changes due to human action, that's a
| new thing. But if you opened the wrong list taking inventory,
| or you missed a keystroke for 33rd St or you just eyeballed the
| coordinates, then that row was simply wrong, and the correction
| should be back-dated for most things.
|
| But if I emptied out A1 because we aren't renting that space
| anymore, the customer moves or the tree got replanted, then it
| was here and now it's over there. Which might be important for
| figuring out things like overstock, taxes or success rates.
|
| Similarly if the migration introduces derived data, then the
| migrated data is assumed/inferred, whereas if we took that data
| from a user, that information is confirmed, which might
| introduce subtle differences in how best to relate to the user.
| These are things a mediocre business could easily ignore, but
| that a quality establishment would want to get right.
| 411111111111111 wrote:
| > _This is the future of databases, but nobody seems to realize
| it yet_
|
| It's a pipedream, not the future.
|
| Your database is either too big / has too much throughput or
| migrations just don't matter. And it's not like you wouldn't
| need migrations with a versioned schema, as otherwise a
| rollback would mean data loss.
| whazor wrote:
| With big data such a model is even more promising. One of the
| big problems is that people keep copying data sets, which
| does not scale. Just syncing newer versions is much more
| efficient, look at delta.io.
| throwaway892238 wrote:
| You're thinking in terms of the present, but I'm saying it's
| the future. _At present_ it doesn 't make sense, because
| nobody has yet made a database which does version control on
| very big datasets with a lot of throughput. But when somebody
| _does_ make it, it will be completely obvious that this was
| something we always needed.
|
| It's all just software. There is essentially no limit to what
| we can make software do as long as the hardware supports it.
| And there's no hardware limit I know of that says version-
| controlled databases can't work. We just need to figure out
| how they will work, and then make 'em (or try to make 'em and
| in the process figure it out).
|
| > And it's not like you wouldn't need migrations with a
| versioned schema, as otherwise a rollback would mean data
| loss.
|
| When you roll back a code change, you don't lose code, as
| it's still in history. If you need to revert but keep some
| code, you branch the code, copy the needed code into some
| other part of the app, revert the old change, merge
| everything. If on merge there is a conflict (let's presume
| losing data is a conflict), it can prompt you to issue a set
| of commands to resolve the conflict before merge. You could
| do all of that in a branch, test it, merge into prod, and on
| merge it could perform the same operations. The database does
| all the heavy lifting and the user just uses a console the
| way they use Git today.
|
| It's probably going to be required to lock the version of
| software and the version of the database together, such that
| both are changed/reverted at the same time. But because this
| is version control, we could actually serve multiple versions
| of the same database at the same time. You could have the
| database present two different versions of itself with the
| same data COW-overlayed for each version, and two different
| versions of an application. You could then blue/green deploy
| both the application and database, each modifying only its
| version. If you need to revert, you can diff and merge
| changes from one version to another.
| hinkley wrote:
| Do we make databases smarter, or start asking for database-
| like behavior from version control systems?
|
| I can't help thinking how much time the git team or
| Jetbrains or Chrome or the Confluence team have spent on
| backing store implementation/migration and file formats
| instead of using sqlite (like why aren't V8 heap and perf
| snapshots just a sqlite database?) but then many of these
| things operate in that gap. So do we keep improving change
| over time functionality in databases, or make VCS backends
| with more formal database-like behavior?
|
| IIRC, Trac stores its wiki history in a subversion
| repository. Since it already had to understand commit
| histories and show diffs, that was a sensible choice. Of
| course it is easier to live with such a decision if the API
| is good, but I haven't heard anyone say that about any
| version control system yet.
| throwaway892238 wrote:
| Well, they're discrete problems. Version control of
| source code, packaged applications, container images,
| databases are all quite different.
|
| Git is a distributed file manager that operates on files
| where every change is a commit, and a commit is a set of
| operations on files, and/or a change to a block of text
| strings terminated by newlines. Versions are merkle trees
| of commits.
|
| RPM/Deb/etc is a semi-centralized file manager that
| operates on files assuming each change is a collection of
| files with executable stages before and after
| copying/linking/unlinking. Versions are arbitrary
| key=value pairs which optionally depend on other
| versions, with extra logic to resolve relative versions.
|
| Docker/OCI is a distributed file manager that operates on
| layers assuming every layer is a collection of files
| overlaid on other layers, with extra logic to do extra
| things with the layers at runtime. Versions are (I
| think?) merkle trees of layers.
|
| The database is going to need a helluva lot of custom
| heuristics and operations to do version-control, because
| how you use it is so much different than the above.
| Databases are much more complex beasts, require higher
| performance, higher reliability, tons more functionality.
| iamnafets wrote:
| I think the problem is that the tradeoffs already exist.
| Most users would prefer more usable space or less money to
| a full history of their data.
|
| You might be making the argument that the _usability_ of
| immutable data is not there yet, but there are well-
| established theoretical costs of maintaining full history
| and I don 't think they're within bounds of many real-world
| use-cases.
| CPLX wrote:
| As a guy who's been doing technical stuff of one kind or
| another since the mid 90's I would say that any analysis
| that insists that a specific use case has tradeoffs due
| to lack of memory or processing speed has an inevitable
| expiration date.
| throwaway892238 wrote:
| If the user doesn't want full history they could
| configure the database to expunge it with a lifecycle
| policy, though I think keeping deltas of the changes
| would make any excess file storage negligible, as most
| people don't seem to ever get rid of data anyway.
| zachmu wrote:
| You're suffering from a failure of imagination.
|
| Consider a CMS, one of the most common forms of database
| backed applications. What if you could give your customer a
| "dev" branch of all their data to make their changes on and
| test out new content, that you could then merge with back to
| prod after somebody reviews it in a standard PR workflow?
|
| This is the workflow one of our earliest customers built.
| They run network configuration software, and they use Dolt to
| implement a PR workflow for all changes their customers make.
|
| More details here:
|
| https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/
| password4321 wrote:
| See also: https://versionpress.com (not being actively
| developed)
| zachmu wrote:
| If we were less busy we would adopt versionpress and port
| it to dolt, such a cool product. Some day.
| Johannesbourg wrote:
| Personally, working with timeseries data, my experience is
| that clients typically underestimate how much storage they
| need for a single state, let alone including historic
| versions. The decision makers want more data, not more
| snapshots for a given storage spend. But that's timeseries.
| hinkley wrote:
| They want more data but they don't want to pay for it.
| People want lots of things, doesn't mean they get it or
| deserve it.
|
| I can't recall which it was but one of the timeseries
| databases was bragging on the fact that there are certain
| situations where scanning a block of data is as cheap as
| trying to add finer grained indexes to it, especially
| with ad hoc queries. They did a bunch of benchmarks that
| said block scanning with compression and parallelism was
| workable.
|
| And while compression typically leads to write
| amplification (or very poor compression ratios), in a
| timeseries database, or a regular database architected in
| a timeseries-like fashion, modifying the old data is
| deeply frowned upon (and in fact I've heard people argue
| for quasi-timeseries behavior _because_ modifying old
| records is so punishing, especially as the application
| scales), so as long as you can decide not to compress
| some pages - new pages - this is not a problem.
| packetlost wrote:
| I think the problem is that relational datasets like that
| don't fit well into that model. In reality, it's very
| possible; look at Datomic. While I agree that for
| high-throughput systems storage is a concern, the general
| trend seems to be towards streaming data and retention
| policies _anyways_.
| jahnu wrote:
| Hasn't Datomic done all this for some years now?
| zachmu wrote:
| Lots of databases offer time travel / historical querying,
| including datomic, MySQL, Postgres, etc (plugins required in
| some cases).
|
| Dolt's unique functionality isn't time travel, although it
| has that. It's version control, i.e. branch and merge, push
| and pull, fork and clone. A bunch of database products give
| you some of this for schema migrations, but Dolt is the only
| one that does it for table data as well.
| rapnie wrote:
| I think TerminusDB does that as well.
| zachmu wrote:
| Yup, TerminusDB has a very similar set of capabilities,
| just for a graph DB instead of SQL / relational. Very
| cool product if you're in the market for a graph DB.
| dizhn wrote:
| How would you revert, merge, rollback or otherwise make sense
| of real world data that necessarily reflects "state"? Or is
| this only for development?
___________________________________________________________________
(page generated 2022-06-23 23:02 UTC)