[HN Gopher] Dolt is Git for Data: a SQL database that you can fo...
       ___________________________________________________________________
        
       Dolt is Git for Data: a SQL database that you can fork, clone,
       branch, merge
        
       Author : crazypython
       Score  : 687 points
       Date   : 2021-03-06 21:15 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | [deleted]
        
       | tjkrusinski wrote:
       | Any info on how this differs from Dat?
        
       | rymurr wrote:
       | I've been working on https://github.com/projectnessie/nessie for
       | about a year now. It's similar to Dolt in spirit but aimed at
       | big data/data lakes. Would welcome feedback from the community.
       | data/data lakes. Would welcome feedback from the community.
       | 
       | It's very exciting to see this field picking up speed. Tons of
       | interesting problems to be solved :-)
        
       | Aeolun wrote:
       | What does this look like when merging? Does it need a
       | specialized tool?
       | 
       | I'm a bit sad it doesn't seem to have git-style syncing,
       | though.
        
         | zachmu wrote:
         | Merge is built in, same as with git. Same syntax too:
         |     dolt checkout -b <branch>
         |     dolt merge <branch>
         | 
         | What do you mean by git-style syncing? It has `push`, `pull`,
         | and
         | `fetch`.
        
           | Aeolun wrote:
           | Regarding merging, I understand that part, but say I have
           | a conflict: how does it get resolved? Does it open some
           | sort of text editor where I delete the unwanted lines?
           | 
           | I mean, if I have a repository on one PC, I can clone from
           | there on a different PC.
           | 
           | As far as I can see this only allows S3/GC or Dolthub
           | remotes.
        
             | zachmu wrote:
             | Merge conflicts get put into a special conflicts table. You
             | have to resolve them one by one just like text conflicts
             | from a git merge. You can do this with the command line or
             | with SQL.
             | 
             | It's true we don't support SSH remotes yet. Wouldn't be too
             | hard to add though. File an issue if this is preventing you
             | from using the product. Or use sshfs, which is a pretty
             | simple workaround.
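             | 
             | Roughly, the SQL route looks like this (a sketch;
             | "employees" is a stand-in table name, and the
             | dolt_conflicts_<table> naming may vary by version):
             | 
             |     -- list the conflicting rows for a table, with
             |     -- base, ours, and theirs column sets side by side
             |     SELECT * FROM dolt_conflicts_employees;
             | 
             |     -- fix up rows in the base table as needed, then
             |     -- clear the conflicts to mark them resolved
             |     DELETE FROM dolt_conflicts_employees;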
        
       | twobitshifter wrote:
       | This is cool, but the parent dolthub project is even cooler!
       | Dolthub.com
        
         | laurent92 wrote:
         | Free for public repos!
        
           | zachmu wrote:
           | Also private repos under a gig, but you have to give us a
           | credit card to be private.
        
       | mayureshkathe wrote:
       | Since you've hosted it on GitHub, would you also consider
       | providing a JavaScript client library for read access?
       | 
       | Could pave the way for single page web applications (hosted via
       | GitHub Pages) working with Dolt as their database of choice.
        
       | efikoman wrote:
       | ?
        
       | justincormack wrote:
       | I collected all the git for data open source projects I could
       | find a few months back; there have been a bunch of
       | interesting approaches:
       | https://docs.google.com/spreadsheets/d/1jGQY_wjj7dYVne6toyzm...
        
         | chub500 wrote:
         | I've had a fairly long-term side project working on git for
         | chronological data (the data is a cause-and-effect DAG).
         | Know of anybody doing that?
        
           | michaelmure wrote:
           | It might not be exactly what you are looking for, but
           | git-bug[1] encodes data into regular git objects, with
           | merges and conflict resolution. I'm mentioning this
           | because the hard part is providing an ordering of events.
           | Once you have that, you can store and recreate whatever
           | state you want.
           | 
           | This branch[2], which I'm almost done with, removes the
           | purely linear branch constraint and allows full DAGs (that
           | is, concurrent editing) while still providing a good
           | ordering.
           | 
           | [1]: https://github.com/MichaelMure/git-bug [2]:
           | https://github.com/MichaelMure/git-bug/pull/532
        
         | glogla wrote:
         | This one seems to be missing: https://projectnessie.org/
        
           | justinclift wrote:
           | It's also missing DBHub.io. ;)
        
       | herpderperator wrote:
       | This is pretty cool. I wish `dolt diff` would use + and - though
       | (isn't that standard?) rather than > and <, which are harder
       | to distinguish.
        
         | guerrilla wrote:
         | Both are "standard" from the UNIX/GNU diff(1) tool. The default
         | behavior gives you the '>' and '<' format and using -u gives
         | you the "unified" '+' and '-' format.
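         | 
         | For example, for a one-line change (sketch):
         | 
         |     $ diff a.txt b.txt
         |     2c2
         |     < old line
         |     ---
         |     > new line
         | 
         |     $ diff -u a.txt b.txt
         |     --- a.txt
         |     +++ b.txt
         |     @@ -1,3 +1,3 @@
         |      first line
         |     -old line
         |     +new line
         |      last line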
        
       | bitslayer wrote:
       | Is it for versions of the database design or versions of the
       | data?
        
         | skybrian wrote:
         | Both. Schema changes are versioned like everything else. But
         | depending on what the change is, it might make merges
         | difficult.
         | 
         | (I haven't used it; I just read the blog.)
        
           | fiedzia wrote:
             | BTW, I wish all databases versioned their schema and kept
           | full history. This should be a standard feature.
        
             | codesnik wrote:
               | Keep history just as a changelog to see when and what
               | was changed? Or something with versions you can
               | actively revert to? I suppose one of the problems is
               | that changing schema is usually interleaved with some
               | sort of data migrations and conversions, and I have no
               | idea how to track those without using some general
               | scripting language, like the migrations in many
               | frameworks which already exist.
        
               | fiedzia wrote:
               | Just storing a changelog would be very useful. I
               | could, for example, compare the db version some app
               | was developed for with the current state. Most
               | companies store schema migrations in git, but doing
               | anything with that information is difficult. Having it
               | in the db would make automated checks easier.
        
         | [deleted]
        
       | jonnycomputer wrote:
       | Naive question here: aside from it being MySQL, what is
       | different here from just using git + SQLite?
       | 
       | Update: When I posted, I'd forgotten that a SQLite db file is
       | binary. Not sure what I was thinking.
        
         | bargle0 wrote:
         | Merging a SQLite database is challenging.
        
           | [deleted]
        
           | justinclift wrote:
           | We do it on DBHub.io. :)
           | 
           | If you're into Go, these are the commits where the pieces
           | were hooked together.
           | 
           | * https://github.com/sqlitebrowser/dbhub.io/commit/6c40e051ff
           | 7...
           | 
           | * https://github.com/sqlitebrowser/dbhub.io/commit/989ee0d08e
           | 6...
        
           | unnouinceput wrote:
           | I'd venture further and say that merging any DB is
           | challenging, SQLite or not.
        
       | kyrieeschaton wrote:
       | People interested in this approach should compare Rich Hickey's
       | Datomic.
        
       | einpoklum wrote:
       | But do you really need this functionality, if you already have an
       | SQL database?
       | 
       | That is, you can:
       | 
       | 1. Create a table with an extra changeset id column and a branch
       | id column, so that you can keep historical values.
       | 
       | 2. Have a view on that table with the latest version of each
       | record on the master branch.
       | 
       | 3. Express branching-related actions as actions on the main
       | table with different record versions and branch names.
       | 
       | 4. For the chocolate sprinkles, have tables with changeset
       | info and branch info.
       | 
       | and that gives you a poor man's git already - doesn't it?
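       | 
       | Sketching points 1 and 2 in MySQL-ish SQL (all names made
       | up):
       | 
       |     CREATE TABLE item_versions (
       |         item_id      INT,
       |         changeset_id INT,
       |         branch       VARCHAR(64),
       |         payload      TEXT,
       |         deleted      BOOLEAN DEFAULT FALSE,
       |         PRIMARY KEY (item_id, changeset_id, branch)
       |     );
       | 
       |     -- latest version of each record on master
       |     CREATE VIEW items AS
       |     SELECT iv.*
       |     FROM item_versions iv
       |     JOIN (SELECT item_id, MAX(changeset_id) AS changeset_id
       |           FROM item_versions
       |           WHERE branch = 'master'
       |           GROUP BY item_id) latest
       |       USING (item_id, changeset_id)
       |     WHERE iv.branch = 'master' AND NOT iv.deleted;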
        
         | resonantjacket5 wrote:
         | That doesn't let you merge in someone else's changes
         | easily. Say two team members make changes to the (local copy
         | of the) database and now you want to merge them.
         | 
         | I mean, sure, you can have another database tracking their
         | changes and a merging algorithm, but that's what dolt is
         | doing for you.
        
       | zomglings wrote:
       | If anyone from the dolt team is reading this, I'd like to make an
       | enquiry:
       | 
       | At bugout.dev, we have an ongoing crawl of public GitHub. We just
       | created a dataset of code snippets crawled from popular GitHub
       | repositories, listed by language, license, github repo, and
       | commit hash and are looking to release it publicly and keep it
       | up-to-date with our GitHub crawl.
       | 
       | The dataset for a single crawl comes in at about 60GB. We
       | uploaded the data to Kaggle because we thought it would be a good
       | place for people to work with the data. Unfortunately, the Kaggle
       | notebook experience is not tailored to such large datasets. Our
       | dataset is in a SQLite database. It takes a long time for the
       | dataset to load into Kaggle notebooks, and I don't think they are
       | provisioned with SSDs, as queries take a long time. Our best
       | workaround to this is to partition into 3 datasets on Kaggle -
       | train, eval, and development, but it will be a pain to manage
       | this for every update, especially as we enrich the dataset with
       | results of static analysis, etc.
       | 
       | I'd like to explore hosting the public dataset on Dolthub. If
       | this sounds interesting to you, please reach out to me - my
       | email is in my HN profile.
        
         | zomglings wrote:
         | This is the dataset on Kaggle -
         | https://www.kaggle.com/simiotic/github-code-snippets
        
           | justinclift wrote:
           | Yeah, a database that size is likely to be a challenge
           | unless the computer system it's running on has scads of
           | memory.
           | 
           | One of my projects (DBHub.io) is putting effort towards
           | working through the problem with larger SQLite databases
           | (~10GB), and that's mainly through using bare metal hosts
           | with lots of memory, e.g. 64GB, 128GB, etc.
           | 
           | Putting the same data into PostgreSQL, or even MySQL, would
           | likely be much more efficient memory wise. :)
        
             | zomglings wrote:
             | Can't beat SQLite for distribution as a public dataset,
             | though.
        
               | zachmu wrote:
               | We think dolt can :)
        
               | justinclift wrote:
               | How do you send someone a dolt database as a file?
        
               | zachmu wrote:
               | Push it to DoltHub, tell them to clone it. Just like with
               | source code.
        
               | acidbaseextract wrote:
               | How do you send someone a git repository as a file? Why
               | would a tarball not work?
        
               | touisteur wrote:
               | Git bundles. Amazingly simple tool. I use them all the
               | time via sneakernet or mail/chat with attachments;
               | better than sending patches around, IME.
        
               | 411111111111111 wrote:
               | Probably by first pushing it into a file. This command
               | is in the readme:
               | 
               |     dolt remote add <remote> file:///Users/xyz/abs/path/
        
             | zachmu wrote:
             | We have 200 GB databases in dolt format that are totally
             | queryable. They don't work well querying on the web though
             | - you need a local copy to query it effectively. Making web
             | query as fast as local is an ongoing project.
        
               | justinclift wrote:
               | Yeah the "on the web" piece is the thing we're talking
               | about. :)
               | 
               | A 200GB database for PostgreSQL (etc.) isn't any kind
               | of amazing.
        
             | tomcam wrote:
             | Uh, personal question here. Where does your ~10G number
             | come from? I pretty much run my life on the Apple Notes
             | app. My Notes database is about 12G and now I'm scared.
        
         | zachmu wrote:
         | We'll be in touch :)
        
         | StreamBright wrote:
         | You have other options too. If I have time, I can try to
         | reduce the size with a columnar format that is designed for
         | this use case (repeated values, static dataset).
        
           | zomglings wrote:
           | That would be really great. Let me know if there's any way we
           | can help. Maybe if we released a small version of the dataset
           | for testing/benchmarking and then I could take care of
           | running the final processing on the full dataset?
        
             | StreamBright wrote:
             | That would be amazing. I get my internet back tomorrow,
             | and I can play with the dataset to see how much we could
             | optimize.
        
               | zomglings wrote:
               | Hi StreamBright - just published the development version
               | of the dataset also to Kaggle:
               | https://www.kaggle.com/simiotic/github-code-snippets-
               | develop...
               | 
               | Compressed, it's 471 MB. Uncompressed, just a little more
               | than 3 GB.
               | 
               | If you want to get in touch with me in a better way than
               | HN comments two good options:
               | 
               | 1. My email is in my profile
               | 
               | 2. You can direct message me (@zomglings) on the Bugout
               | community Slack: https://join.slack.com/t/bugout-
               | dev/shared_invite/zt-fhepyt8...
               | 
               | Looking forward to collaborating with you. :)
        
       | px43 wrote:
       | Is there a nice way to distribute a database across many systems?
       | 
       | Would this be useful for a data-set like Wikipedia?
       | 
       | It would be really nice if anyone could easily follow
       | Wikipedia in such a way that they could fork it, and merge
       | some changes but not others, etc. This is something that
       | really needs to happen for large, distributed data sets.
       | 
       | I really want to see large corpuses of data that can have a ton
       | of contributors sending pull requests to build out some great
       | collaborative bastion of knowledge.
        
         | metaodi wrote:
         | This is basically Wikidata [1], where the distribution is
         | linked data/RDF with a triple store and federation to query
         | different triple stores at once using SPARQL.
         | 
         | [1] https://wikidata.org
        
       | andrewmcwatters wrote:
       | Reminds me a bit of datahub.io, but potentially more useful.
        
       | Nican wrote:
       | I love the idea of this project so much. Being able to more
       | easily share comprehensible data is an interest of mine.
       | 
       | It is not the first time I have seen immutable B-trees being used
       | as a method for being able to query a dataset on a different
       | point in time.
       | 
       | Spanner (and its derivatives) uses a similar technique to ensure
       | backup consistency. Solutions such as CockroachDB also allow
       | you to query data in the past [1], and then use a garbage
       | collector to delete older, unused data. The time-to-live of
       | history data is configurable.
       | 
       | [1] https://www.cockroachlabs.com/docs/stable/as-of-system-
       | time....
        
         | Nican wrote:
         | It now makes me wonder, though, how much of dolt's
         | functionality could be replicated with CockroachDB. The
         | internal data structures of both databases are mostly
         | similar.
         | 
         | You can have infinite point-in-time querying by setting the
         | TTL of data to forever.
         | 
         | You have the ability to do distributed BACKUP/IMPORT of
         | data (which is mostly a file copy, and also includes
         | historical data).
         | 
         | A transaction would be the equivalent of a commit, but I do not
         | think there is a way to list out all transactions on CRDB, so
         | that would have to be done separately.
         | 
         | And you gain other benefits, such as distributed querying
         | and high availability.
         | 
         | I just find it interesting that both databases (CockroachDB
         | and Dolt) share the same principle of immutable B-trees.
        
       | devwastaken wrote:
       | Something like this but for SQLite would be great for
       | building small git-enabled applications/servers that can
       | benefit from the features git provides but only need a
       | database and a library to do it.
        
         | lifty wrote:
         | Noms might be what you're looking for
         | (https://github.com/attic-labs/noms). Dolt is actually a fork
         | of Noms.
        
       | 0xbadcafebee wrote:
       | If this could somehow work with existing production
       | MariaDB/PostgreSQL databases, this would be the next Docker. Sad
       | that it requires its own proprietary DB (SQL compatible though it
       | may be, it's not my existing database).
       | 
       | I wish something completely free like this existed just to manage
       | database migrations. I don't think anything is quite this
       | powerful/user-friendly (if you can call Git that) and free.
        
       | anasbarg wrote:
       | I like this. A while ago, I was asking my co-founder the question
       | "Why isn't there Git for data like there is for code?" while
       | working on a database migrations engine that aims to provide
       | automatic database migrations (data & schema migrations). After
       | all, code is data.
        
       | pizzabearman wrote:
       | Is this mysql only?
        
         | zachmu wrote:
         | It uses the mysql SQL dialect for queries. But it's its own
         | database.
        
       | pgt wrote:
       | Datomic: https://www.youtube.com/watch?v=Cym4TZwTCNU
        
       | smarx007 wrote:
       | If you are interested in going one step beyond, you can check
       | out https://terminusdb.com/docs/terminusdb/#/ which uses RDF
       | to represent (knowledge) graphs (and has revision control).
        
       | joshspankit wrote:
       | I never understood why we don't have SQL databases that track all
       | changes in a "third dimension" (column being one dimension, row
       | being the second dimension).
       | 
       | It might be a bit slower to write, but hook the logic into
       | write/delete, and suddenly you can see _exactly_ when a field was
       | changed to break everything. The right middleware and you could
       | see the user, IP, and query that changed it (along with any other
       | queries before or after).
        
         | mcrutcher wrote:
         | This has existed for a very long time as a data modeling
         | strategy (most commonly, a "type 2 dimension") and is the
         | way that all MVCC databases work under the covers. You don't
         | need a special database to do this; just add another column
         | to your table and populate it with a trigger on update.
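         | 
         | A minimal version of the trigger approach (MySQL-flavored
         | sketch; table and column names are made up):
         | 
         |     CREATE TABLE accounts (
         |         id      INT PRIMARY KEY,
         |         balance DECIMAL(12,2)
         |     );
         | 
         |     CREATE TABLE accounts_history (
         |         id         INT,
         |         balance    DECIMAL(12,2),
         |         changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
         |     );
         | 
         |     -- copy the old row into history on every update
         |     CREATE TRIGGER accounts_audit
         |     BEFORE UPDATE ON accounts
         |     FOR EACH ROW
         |     INSERT INTO accounts_history (id, balance)
         |     VALUES (OLD.id, OLD.balance);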
        
         | tthun wrote:
         | MS SQL Server 2016 onwards has temporal tables that support
         | this (point-in-time data).
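         | 
         | A minimal T-SQL sketch (names are made up):
         | 
         |     CREATE TABLE dbo.Price (
         |         id         INT PRIMARY KEY,
         |         amount     DECIMAL(10,2),
         |         valid_from DATETIME2 GENERATED ALWAYS AS ROW START
         |                    NOT NULL,
         |         valid_to   DATETIME2 GENERATED ALWAYS AS ROW END
         |                    NOT NULL,
         |         PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
         |     ) WITH (SYSTEM_VERSIONING = ON
         |             (HISTORY_TABLE = dbo.PriceHistory));
         | 
         |     -- the table as it looked at a point in time
         |     SELECT * FROM dbo.Price
         |     FOR SYSTEM_TIME AS OF '2020-08-15';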
        
           | joshspankit wrote:
           | Huh. I just read the spec. Not quite a third 'dimension',
           | but it looks like exactly what I was asking for: a
           | (reasonably) automatic and transparent record of previous
           | values, as well as timestamps for when they changed.
           | 
           | I'll call this a "you learn something every day" and a "hey
           | thanks @tthun (and @predakanga)"
        
         | predakanga wrote:
         | This does exist, though support for it is pretty sparse; it's
         | called "Temporal Tables" in the SQL:2011 standard -
         | https://sigmodrecord.org/publications/sigmodRecord/1209/pdfs...
         | 
         | Last time I checked, it was supported in SQL Server and
         | MariaDB, and in Postgres via an extension.
        
         | kenniskrag wrote:
         | Because you can do that with after-update triggers, or
         | server-side in software.
        
       | iamwil wrote:
       | which db does it use?
        
         | zachmu wrote:
         | It is a database. It implements the MySQL dialect and binary
         | protocol, but it isn't MySQL. Totally separate storage engine
         | and implementation.
        
       | jrumbut wrote:
       | It's amazing this isn't a standard feature. The database world
       | seems to have focused on large, high volume, globally distributed
       | databases. Presumably you wouldn't version clickstream or IoT
       | sensor data.
       | 
       | Features like this that are only feasible below a certain scale
       | are underdeveloped and I think there's opportunity there.
        
         | fiddlerwoaroof wrote:
         | Datomic has some sort of zero-cost forking of the database:
         | its "add-only" design makes this cheap.
        
         | qbasic_forever wrote:
         | Every DB engine used at scale has a concept of snapshots and
         | backups. This just looks like someone making a git-like
         | porcelain for the same kind of DB management constructs.
        
           | zachmu wrote:
           | It's not just snapshots though.
           | 
           | Dolt actually stores the tables, rows, and commits in a
           | Merkle DAG, like Git. So you get branch and merge. You can't
           | do branch and merge with snapshots.
           | 
           | (You also get the full git / github toolbox: push and pull,
           | fork and clone, rebase, and most other git commands)
        
             | qbasic_forever wrote:
             | Yeah, it's a neat idea, but I struggle to think of good
             | use-cases for merge other than toy datasets. If I'm
             | working on a service that's sharding millions of users
             | across dozens of DB instances, a merge is going to be
             | incomprehensible to reason about.
        
               | MyNameIsFred wrote:
               | > Yeah it's a neat idea but I struggle to think of good
               | use-cases [...] If I'm working on a service [...]
               | 
               | I suspect that's simply not the use-case they're
               | targeting. You're thinking of a database as simply the
               | persistence component for your service, a means to an
               | end. For you, the service/app/software is the thing
               | you're trying to deliver. Where this looks useful is the
               | cases where the data itself is the thing you're trying to
               | deliver.
        
         | 101008 wrote:
         | Isn't the MySQL log journal* what you are looking for?
         | 
         | * I don't remember the exact name, but I'm referring to the
         | feature that is used to replicate actions if there was an
         | error.
        
           | fiedzia wrote:
           | No. Relational DB logs are kept for a short time and do
           | not allow for branching/merging. And even if you stored
           | the full log since day 1, the only way to check the state
           | of the database at some point would be to start from
           | scratch and replay everything up to the desired point. For
           | any sizeable db, that's not practical.
        
           | [deleted]
        
       | [deleted]
        
       | strogonoff wrote:
       | You can also use Git for data!
       | 
       | It's a bit slower, but smart use of partial/shallow clones
       | can address performance degradation on large repositories
       | over time. You just need to take care of the transformation
       | between "physical" trees/blobs and "logical" objects in your
       | dataset (which may not have a 1:1 mapping, as making the
       | physical layer more granular reduces the likelihood of merge
       | conflicts).
       | 
       | I'm also following Pijul, which seems very promising with
       | regard to versioning data--I believe they might introduce
       | primitives allowing one to operate on changes in actual data
       | structures rather than between lines in files, as with Git.
       | 
       | Add to that a sound theory of patches, and that's a definite
       | win over Git (or Dolt, for that matter, which seems to be the
       | same old Git but for SQL).
        
         | vvanders wrote:
         | Nope, been there, done that, no thanks.
         | 
         | Lack of locking for binary files, overhead past 1GB, and
         | all the shenanigans you need to do for proxy servers. There
         | are better solutions out there, but they aren't free.
        
           | strogonoff wrote:
           | Would be very curious to hear more about issues with proxy
           | servers (where were they required?), overheads (do you mean
           | RAM usage?) and locking.
        
             | vvanders wrote:
             | Sure; keep in mind that my data is a little old, but
             | the last time I peeked into the git LFS space it seemed
             | like there were still a few gaps.
             | 
             | First, most of my background in this area comes from
             | gamedev so YMMV if the same applies in your use cases.
             | 
             | For our usage we'd usually have a repo history size that
             | crossed the 1TB mark and even upwards of 2-3TB in some
             | cases. The developer sync was 150-200GB, the art sync was
             | closer to 500-600GB and the teams were regularly churning
             | through 50-100GB/week depending on where we were in
             | production.
             | 
             | You need discipline-specific views into the repo. It
             | just speeds everything up and means that only the teams
             | that need to take the pain have to. From a performance
             | perspective, Perforce blows the pants off anything else
             | I've seen; SVN tries, but P4 was easily an order of
             | magnitude faster to sync or do a clean fetch.
             | 
             | I've seen proxy servers done with git, but it's usually
             | some really hacky thing scripted together with a ton of
             | duct tape and client-specific host overrides. When you
             | have a team split across East Coast/West Coast (or
             | another country) you _need_ that proxy so that history
             | is cached in a way that it only gets pulled in locally
             | once. Having a split push/pull model is asking for
             | trouble, and last I checked it wasn't clear to me
             | whether stuff like git LFS actually handles locking
             | cleanly across it.
             | 
             | From an overhead perspective, git just falls over at
             | ~1GB (hence git LFS, which I've seen teams use to
             | varying degrees of success based on project size). The
             | need to do shallow history and sidestep resolving deltas
             | is a ton of complexity that isn't adding anything.
             | 
             | With a lot of assets, merging just doesn't exist, and a
             | DVCS totally falls over here. I've seen fights nearly
             | break out in the hallway multiple times when two
             | artists/animators both forgot to check out a file
             | (usually because someone missed the metadata to say it's
             | an exclusive-access file). With unmergeable binary files
             | that don't get locked, your choice is who gets to drop
             | 1-3 days of work on the floor when the other person
             | blows away their changes to commit. If those changes
             | span multiple interconnected packages/formats/etc. you
             | have a hard fork that you can never bring back together.
             | 
             | There are a couple of other details, but those are the
             | large ones. Perforce worked incredibly well in this
             | space, but it is not cheap, and so I've seen teams try
             | to go their own way with mixed success. I'll admit that
             | you can't do a monorepo in P4 (and even tools like repo
             | in the Android world have their problems too), but if
             | you segregate your large business/product lines across
             | P4 repos it scales surprisingly well.
             | 
             | Anyway, you may or may not hit any or all of this, but
             | I've yet to see git tackle a 1TB+ repo history well (and
             | things like repo that use many mini-repos don't count in
             | my book, due to the lack of atomicity across submissions
             | that span multiple repos).
        
               | strogonoff wrote:
               | This is super informative!
               | 
               | In my case it's different, since Git isn't accessed
               | by users directly; rather, I'm working on some tools
               | that work on top of Git (on the user's machine). Data
               | is primarily text-based, though sometimes binary
               | assets come up (options for offloading them out of Git
               | are being investigated).
               | 
               | So far there have been no major issues. I predict
               | degradation over time as repos grow in size and
               | history (Git is not unique in this regard, but it'll
               | probably be more rapid and easier to observe with
               | Git), so we might start using partial cloning.
               | 
               | (I stand by the idea that using straight up Git for data
               | is something to consider, but with an amendment that it's
               | predominantly text data, not binary assets.)
        
               | vvanders wrote:
               | Yeah, my experience has been that you start seeing
               | issues with long delta decompression times around the
               | 1-2GB mark. That climbs quicker if you have binary
               | formats that push the delta compression algorithm into
               | cases where it does poorly (which makes sense, since
               | it was optimized for source code).
               | 
               | If you have binary assets and they don't support
               | merging or regeneration from source artifacts, that
               | mandates locking (ideally built into the SCM, but I've
               | seen wiki pages used in a pinch at small scale).
        
         | graderjs wrote:
         | I strike a balance here by using git on JSON files, and I
         | build the JSON files into a database (1 file per record, 1
         | directory per table, subdirectories for indexes). The whole
         | thing is pretty beautiful, and it's functioning well for a
         | user-account, access-management database I'm running in
         | production. I like that I can go back and do:
         | 
         | `git diff -p` to see the users who have signed up recently, for
         | example.
         | 
         | You can get the code, over at: https://github.com/i5ik/sirdb
         | 
         | The advantages of this approach are using existing unix
         | tooling for text files, solid versioning, easy
         | inspectability, and leveraging the filesystem's B-tree
         | indexing as a fast index structure (rather than having to
         | write my own B-trees). Another advantage is hardware-linked
         | scaling. For example, if I use regular hard disks, it's
         | slower, but if I use SSDs it's faster. And it should also be
         | possible to mount the DB as a RAM disk and make it super
         | fast.
         | 
         | The disadvantages are that the database side still only
         | supports a couple of operations (like exact multikey
         | searches, lookup by ID, and so on) rather than a rich query
         | language. I'm OK with that for now, and I'm also thinking of
         | using skip lists in the future to get a nice ordering
         | property for the keys in an index, so I can easily iterate
         | and page over those.
        
         | teej wrote:
         | The fact that I can use git for data if I carefully avoid all
         | the footguns is exactly why I don't use git for data.
        
         | rapjr9 wrote:
         | We used git to store and manage data sets for a machine
         | learning project involving chewing detection with audio data
         | used in training. It was cumbersome, and the huge datasets
         | caused some problems with git (e.g., searches of our code
         | base got really slow because the data was being searched
         | too, until we moved the data to a different repo). Something
         | easier to use that could manage large datasets would be
         | useful.
         | 
         | I wonder if Dolt could be used to create a clone of Apple's
         | Time Machine. Seems like the basics are there.
        
         | pradn wrote:
         | Git is too complicated. It's barely usable for daily tasks.
         | Look at how many people have to Google for basic things like
         | uncommitting a commit, or cleaning your local repo to mirror a
         | remote one. Complexity is a liability. Mercurial has a nicer
         | interface. And now I see the real simplicity of non-distributed
         | source control systems. I have never actually needed to work in
         | a distributed manner, just client-server. I have never sent a
         | patch to another dev to patch into their local repo or whatnot.
         | All this complexity seems like a solution chasing after a
         | problem - at least for most developers. What works for Linux
         | isn't necessary for most teams.
        
           | ttz wrote:
           | Git is used prolifically in the tech industry. What on earth
           | are you talking about?
        
             | detaro wrote:
             | Being needlessly complicated seldom stops the tech
             | industry from using something, as long as the complexity
             | is slightly out of the way.
        
               | ttz wrote:
               | "Barely usable for daily tasks"
               | 
               | Is a pretty strong statement, especially given many tech
               | companies use it exactly for this purpose.
               | 
               | Git might have a learning curve, and sure, it's not the
               | simplest. But "barely usable" is hyperbole in the face of
               | actual evidence.
               | 
               | I'm not defending Git specifically; other VCSs are
               | just as viable. The quoted statement seems a bit
               | ridiculous.
        
             | strogonoff wrote:
             | To me there's some irony in that all insta-criticism of Git
             | in responses to my comment presumably applies to a project
             | that describes itself as "Git for data" and promises exact
             | reproduction of all Git command behaviour--therefore
             | suffering from the same shortcomings.
        
           | Hendrikto wrote:
           | > Git is too complicated. It's barely usable for daily tasks.
           | Look at how many people have to Google for basic things like
           | uncommitting a commit, or cleaning your local repo to mirror
           | a remote one.
           | 
           | Cars are too complicated. They are barely usable for
           | daily tasks. Look at how many people have to Google for
           | basic things like changing a fan belt or fixing a
           | cylinder head gasket.
           | 
           | You can fill in almost anything here. Most tools are
           | complicated. Yet you don't need to know their ins and
           | outs for them to be useful to you.
        
             | yoavm wrote:
             | To me it sounds like you're proving the exact opposite. I'd
             | assume most car owners never need to change a fan belt
             | themselves, while everyone who uses git daily needed at
             | some point to revert a commit. "How to turn right" isn't
             | huge on stackoverflow last time I checked...
        
           | strogonoff wrote:
           | Dolt boasts its likeness to Git as a feature. Does this
           | mean it'll also be barely usable for daily tasks? Is it
           | possible for a project to faithfully reproduce the
           | entirety of Git's command interface _and_ be less
           | complicated than Git / not suffer from the same
           | shortcomings?
           | 
           | I personally think Git isn't that bad, once it's
           | understood. It can be counter-intuitive sometimes, though
           | (as an example, for the longest time I used Git without
           | realizing it stores a snapshot of each file, and
           | diffs/deltas are only computed when required). I'm just
           | trying to be pragmatic and not expecting a tool like Git
           | to be entirely free of leaky abstractions.
        
       | scottmcdot wrote:
       | Dolt might be good, but never underestimate the power of Type
       | 2 Slowly Changing Dimension tables [1]. For example, if you
       | had an SSIS package that took CSVs and imported them into a
       | database, and one day you noticed it accidentally rounded a
       | value incorrectly, you could fix the data and retain
       | traceability of the data which was there originally.
       | 
       | E.g., SSIS package writes row of data: https://imgur.com/DClXAi5
       | 
       | Then a few months later (on 2020-08-15) we identify that
       | trans_value was imported incorrectly so we update it:
       | https://imgur.com/wdQJWm4
       | 
       | Then whenever we SELECT from the table, we always ensure we
       | are extracting "today's" version of the data:
       | 
       |     SELECT * FROM trans
       |     WHERE CURRENT_DATE BETWEEN effective_from AND effective_to;
       | 
       | [1] https://en.wikipedia.org/wiki/Slowly_changing_dimension
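       | 
       | With SCD2, the 2020-08-15 fix is then two statements rather
       | than an in-place overwrite (a sketch; table and column names
       | are made up, loosely following the screenshots):
       | 
       |     -- close out the bad row as of the fix date
       |     UPDATE trans
       |     SET effective_to = '2020-08-15'
       |     WHERE trans_id = 1 AND effective_to = '9999-12-31';
       | 
       |     -- insert the corrected value as the current version
       |     INSERT INTO trans
       |         (trans_id, trans_value, effective_from, effective_to)
       |     VALUES (1, 19.95, '2020-08-15', '9999-12-31');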
        
         | skybrian wrote:
         | The commit log in Dolt is edit history. (When did someone
         | import or edit the data? Who made the change?) It's not about
         | when things happened.
         | 
         | To keep track of when things happened, you would still need
         | date columns to handle it. But at least you don't need to
         | handle two-dimensional history for auditing purposes. So, in
         | your example, I think the "effective" date columns wouldn't be
         | needed.
         | 
         | They have ways to query how the dataset appeared at some time
         | in the past. However, with both data changes and schema changes
         | being mixed together in a single commit log, I could see this
         | being troublesome.
         | 
         | I suppose writing code to convert old edit history to the new
         | schema would still be possible, similar to how git allows you
         | to create a new branch by rewriting an existing one.
        
         | zachmu wrote:
         | If all you want is AS OF semantics, then SCD2 is a great match.
         | Used it a ton in application development myself.
         | 
         | Dolt actually makes branch and merge possible, totally
         | different beast.
        
         | antman wrote:
         | Can't something like that work as SQL:2011 temporal tables?
         | 
         | There's a Postgres-equivalent solution in this [0] rather
         | complex but great tutorial.
         | 
         | [0]: https://clarkdave.net/2015/02/historical-records-with-
         | postgr...
        
         | sixdimensional wrote:
         | I definitely agree, just tossing in the superset concept that
         | Dolt and Type 2 SCD involve - temporal databases [1].
         | 
         | I think the idea of a "diff" applied to datasets is quite
         | awesome, but even then, we kind of do that with databases
         | today using data comparison tools - it's just that most of
         | them are not time-aware; rather, they are used to compare
         | data between two instances of the data in different
         | databases, not at two points in time in the same database.
         | 
         | [1] https://en.wikipedia.org/wiki/Temporal_database
        
       | controlledchaos wrote:
       | This is exactly what I was looking for!
        
       | 1f60c wrote:
       | I'm not sure if this is supposed to be Dolt or DoIt, but using a
       | swear word for a name (even a relatively mild one) is pretty
       | distracting, IMHO.
        
         | stjohnswarts wrote:
         | DoIt isn't really a swear word; it's a euphemism for
         | coitus. It's also a very common phrase in general, so it
         | isn't strongly enough associated with "doing it" to cause
         | any issues with English speakers. Aka, it's fine as a
         | project name. See Nike's "Just do it" advertising campaign;
         | they would never have gone with the phrase if it had strong
         | negative connotations.
        
           | zachmu wrote:
           | It's DOLT. Sans serif fonts.
        
         | drewwwwww wrote:
         | presumably a riff on git, the well known famously unsuccessful
         | version control system
        
           | 1f60c wrote:
           | Huh, I had no idea. (I'm not a native speaker.)
        
       | dang wrote:
       | Some related past threads:
       | 
       |  _Dolt is Git for data_ -
       | https://news.ycombinator.com/item?id=22731928 - March 2020 (191
       | comments)
       | 
       |  _Git for Data - A TerminusDB Technical Paper [pdf]_ -
       | https://news.ycombinator.com/item?id=22045801 - Jan 2020 (5
       | comments)
       | 
       |  _Ask HN: Would you use a "git for data"?_ -
       | https://news.ycombinator.com/item?id=11537934 - April 2016 (10
       | comments)
        
       | weeboid wrote:
       | Scanned for this comment before making it. This optically reads
       | as "Dolt", not "DoIt"
        
         | XCSme wrote:
         | But it is "DOLT", right?
        
           | zachmu wrote:
           | Yup.
        
         | coldtea wrote:
         | Just optically? It's meant to be "dolt", not "do it", optically
         | and semantically.
         | 
         | It's a pun on git, which also means "stupid/unpleasant/etc
         | person".
        
           | weeboid wrote:
           | ahh, got whooshed by the cool kids. dammit
        
       | yarg wrote:
       | Merging is hard, but the rest can be done with copy-on-write
       | cloning (or am I missing something?).
        
       | xkvhs wrote:
       | Well, maybe this is the ONE. But I've heard "git for data"
       | way too many times to jump on board with the latest tool. I'll
       | wait till the dust settles and there's a clear winner. Till
       | then, it's parquet, or even pickle.
        
       | laurent92 wrote:
       | Wordpress would have benefited from this.
       | 
       | What a lot of webmasters want is to test the site locally,
       | then merge it back. A lot of people turned to Jekyll or Hugo
       | for the very reason that they can be checked into git, and
       | git is reliable.
       | A static website can't get hacked, whereas anyone who has been
       | burnt with Wordpress security fail knows they'd prefer a static
       | site.
       | 
       | And even more: people would like to pass the new website from
       | the designer to the customer to managers -- Wordpress might
       | not have needed to develop their approval workflows
       | (permission schemes, draft/preview/publish) if they had had a
       | forkable database.
        
         | [deleted]
        
         | TimTheTinker wrote:
         | Sure, but Wordpress is still running PHP files for every page
         | load and back-end query. Dolt would help offload some of the
         | code's complexity, but that would still leave a significant
         | attack surface area.
         | 
         | In other words, by itself Dolt couldn't solve the problem of
         | Wordpress _not_ being run mostly from static assets on a server
         | (plus an API).
        
         | kenniskrag wrote:
         | they have a theme preview now. :)
        
           | jimsmart wrote:
           | The parent post here is speaking to the content stored in
           | the database, not the templates on the filesystem. Dolt
           | enables one to merge, push, and pull _data_, just as one
           | would with files and git.
        
             | 2malaq wrote:
             | I read it as sarcasm.
        
         | berkes wrote:
         | At the root of this lies the problem that content,
         | configuration, and code are stored in one blob (a single
         | db), never clearly bounded nor contained. Config is spread
         | over tables. User-generated content (from comments to
         | orders) is mixed up with editorial content: often in the
         | same tables, sometimes even in the same column. Drupal
         | suffers the same.
         | 
         | What is needed is a clear line between configuration (and
         | logic), editorial content, and user-generated content.
         | Those three have very distinct lifecycles. As long as they
         | are treated as one, the distinct lifecycles will conflict.
         | No matter how good your database merging is, the root
         | problem is a missing piece of architecture.
        
           | Hallucinaut wrote:
           | Very well summarised. I can see clearly you too have tried to
           | apply normal software lifecycle principles and come out the
           | other end with a rather dismal impression of WordPress.
        
       | Klwohu wrote:
       | Problematic name; it could become a millstone around the neck
       | of the developer far into the future.
        
         | ademarre wrote:
         | Agreed. I couldn't immediately see if it was "DOLT" or "do it",
         | as in "just do it". It's the former.
        
           | rapnie wrote:
           | I was going back and forth between the two until seeing doLt
           | in terminal font.
        
             | zachmu wrote:
             | This ambiguity in sans serif fonts has actually been pretty
             | annoying. Especially since GitHub doesn't let you choose
             | your font on readmes and stuff.
        
               | mzs wrote:
               | just lowercase it
        
               | dheera wrote:
               | You can use <pre>...</pre> for monospace font.
        
           | ssprang wrote:
           | Reminds me of this user testing story from the early
           | Macintosh days:
           | https://www.folklore.org/StoryView.py?story=Do_It.txt
        
         | TedDoesntTalk wrote:
         | Already I would not use this project because of its name. I'm
         | not offended by it, but I know others will be, and it will only
         | be a matter of time before we have to replace it with something
         | else. So why bother in the first place?
         | 
         | I know the name is not DOLT but it is close enough to cause
         | offense. Imagine the N-word with one typo. Would it still be
         | offensive? Probably to some.
        
           | simonw wrote:
           | The name is DOLT with an L
        
           | Klwohu wrote:
           | This is the issue with names. Even though the project is
           | called doit, the DoIt stylizing makes it _look_
           | problematic. It's a non-starter; hopefully the author
           | makes a big change. Just choosing lower case for the
           | project would be enough.
        
             | edgyquant wrote:
             | It's not, though; it is dolt, meaning a stupid person,
             | like git (stupid simple version control).
        
               | Klwohu wrote:
               | Oh wow, that's awful! I thought this was just an innocent
               | mistake.
        
           | mixedCase wrote:
           | > but I know others will be, and it will only be a matter of
           | time before we have to replace it with something else
           | 
           | Or we can just not give in to such insanity. That's always an
           | option, and would help prevent things from getting
           | increasingly worse as we cede ground to claims that
           | increasingly get further and further away from the realm of
           | what's reasonable.
        
             | TedDoesntTalk wrote:
             | It's not your choice when you're an employee of a woke
             | company (unless you want to quit). Don't you know that
             | by now?
        
           | _Wintermute wrote:
           | Do you know what "git" means?
        
         | maest wrote:
         | It was most likely picked as an analogy to "git".
        
           | gerdesj wrote:
           | Dolt and git are closer to synonymous rather than analogous.
        
           | zachmu wrote:
           | This is correct. Specifically, to pay homage to git and how
           | Linus named it.
        
       | crazygringo wrote:
       | This is absolutely fascinating, conceptually.
       | 
       | However, I'm struggling to figure out a real-world use case for
       | this. I'd love if anyone here can enlighten me.
       | 
       | I don't see how it can be for production databases involving lots
       | of users, because while it seems appealing as a way to upgrade
       | and then roll back, you'd lose all the new data inserted in the
       | meantime. When you roll back, you generally want to roll back
       | changes to the schema (e.g. delete the added column) but not
       | remove all the rows that were inserted/deleted/updated in the
       | meantime.
       | 
       | So does it handle use cases that are more like SQLite? E.g. where
       | application preferences, or even a saved file, winds up
       | containing its entire history, so you can rewind? Although that's
       | really more of a temporal database -- you don't need git
       | operations like branching. And you really just need to track row-
       | level changes, not table schema modifications etc. The git model
       | seems like way overkill.
       | 
       | Git is built for the use case of lots of different people working
       | on different parts of a codebase and then integrating their
       | changes, and saving the history of it. But I'm not sure I've ever
       | come across a use case for lots of different people working on
       | the _data and schema_ in different parts of a database and then
       | integrating their _data and schema_ changes. In any kind of
       | shared-dataset scenario I 've seen, the schema is tightly locked
       | down, and there's strict business logic around who can update
       | what and how -- otherwise it would be chaos.
       | 
       | So I feel like I'm missing something. What is this actually
       | intended for?
       | 
       | I wish the site explained why they built it -- whether it
       | was just "because we can" or whether projects or teams
       | actually had a need for git for data.
        
         | jorgemf wrote:
         | Machine Learning. I don't think it has many more use cases
        
           | sixdimensional wrote:
           | Or more simply put, how about table-driven logic in general?
           | It doesn't have to be as complex as machine learning. There
           | are more use cases than just machine learning, IMHO.
        
             | jedberg wrote:
             | Such as? I'm having difficulty coming up with any myself.
        
               | zachmu wrote:
               | Say, network configuration.
        
               | sixdimensional wrote:
               | See my post earlier in this thread [1].
               | 
               | Yes you need reference data for machine learning, but the
               | world isn't only about machine learning. You might want
               | reference data for human-interpreted analytics, table-
               | driven logic (business rule engines, for example), etc.
               | 
               | [1] https://news.ycombinator.com/item?id=26371748.
        
         | fiedzia wrote:
         | This won't work for the usual database use cases. This is
         | meant for interactive work with data, the same way you work
         | with code. Who needs that?
         | 
         | Data scientists working with large datasets. You want to be
         | able to update data without redownloading everything. Also,
         | make your local changes (some data cleaning) and propose
         | your updates upstream the same way you would with git.
         | Having many people working interactively with data is
         | common here.
         | 
         | One of the companies I work with provided a set of data
         | distributed to their partners on a daily basis. Once it
         | grew larger, downloading everything daily became an issue.
         | So that would be desirable.
         | 
         | I have a large data model that I need to deploy to
         | production and update once in a while. For code, network
         | usage is kept to a minimum because we have git. For data,
         | options are limited.
         | 
         | As with git, it is something that, once you have it, you
         | will find a lot of use cases that make life easier and open
         | many new doors.
        
         | zachmu wrote:
         | The application-backing use case is best suited for when you
         | have parts of your database that get updated periodically and
         | need human review. So you have a production database that you
         | serve to your customers. Then you have a branch / fork of that
         | (dev) that your development team adds batches of products to.
         | Once a week you do a data release: submit a PR from dev ->
         | prod, have somebody review all the new copy, and merge it once
         | you're happy. If there's a big mistake, just back it out again.
         | We have several paying customers building products around this
         | workflow.
         | 
         | As for lots of people collaborating on data together, we have
         | started a data bounties program where we pay volunteers to
         | assemble large datasets. Two have completed so far, and a third
         | is in progress. For the first one, we paid $25k to assemble
         | precinct-level voting data for the 2016 and 2020 presidential
         | elections. For the second, we paid $10k to get procedure prices
         | for US hospitals. You can read about them here:
         | 
         | https://www.dolthub.com/blog/2021-02-15-election-bounty-revi...
         | 
         | https://www.dolthub.com/blog/2021-03-03-hpt-bounty-review/
         | 
         | What's cool is that novices can make a really good income from
         | data entry as a side gig, and it's two orders of magnitude
         | cheaper than hiring a firm to build data sets for you.
         | 
         | You're right that the site is kind of vague about what dolt is
         | "for." It's a really general, very multi-purpose tool that we
         | think will get used a lot of places. Here's a blog we wrote a
         | while back about some of the use cases we envision.
         | 
         | https://www.dolthub.com/blog/2020-03-30-dolt-use-cases/
        
           | [deleted]
        
         | saulrh wrote:
         | Here's one thing I'd have used it for: Video game assets.
         | 
         | Say you have a tabletop game engine for designing starships.
         | Different settings have different lists of parts. Some settings
         | are run by a game's DM, some are collaborative efforts. I ended
         | up saving the lists of parts in huge JSON files and dumping
         | those into git. However, for much the same reason that data
         | science is often done in a REPL or notebook type interface, it
         | turned out that by far the most efficient way for people to
         | iterate on assets was to boot up the game, fiddle with the
         | parts in-engine until things looked right, then replicate their
         | changes back into the JSON. With this, we could just save the
         | asset database directly.
         | 
          | The same reasoning should hold for effectively any dataset
          | which a) can be factored into encapsulated parts, b) isn't
          | natively linear, and c) needs multiple developers. Game
          | assets are one example, as described above. Other datasets
          | where that holds: ML training/testing sets, dictionaries,
          | spreadsheets, catalogs, datasets for bio papers.
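          | 
          | For instance, the parts list that lived in JSON maps
          | naturally onto a table; a sketch with made-up columns:
          | 
          |     CREATE TABLE ship_parts (
          |       setting   text,
          |       part_name text,
          |       mass      float,
          |       cost      int,
          |       PRIMARY KEY (setting, part_name)
          |     );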
        
         | sixdimensional wrote:
          | I am not associated with Dolt, but I really like the idea
          | personally. I do see use cases, though not without
          | challenges.
         | 
         | One of the main use cases you can see them targeting, and that
         | I think makes a ton of sense, is providing tools for
         | collecting, maintaining and publishing reference data sets
         | using crowd sourcing.
         | 
         | For example, they are doing this with hospital charge codes
         | (a.k.a. chargemaster data). Hospitals in the US are required to
          | publish this data for transparency; however, I have never seen
         | a single aggregated national (or international) data set of all
         | these charges. In fact, such a data set could be worth a lot of
         | money to a lot of organizations for so many reasons. I used to
         | work in health insurance, gathering data from all kinds of
         | sources (government rules/regs, etc.) and it was a lot of hard
         | work, scraping, structuring, maintaining, etc.
         | 
          | This reference data can be used for analytics, to power
          | table-driven business logic, and for machine learning - to
          | help identify cost inequalities, efficiencies, maybe even
          | illicit price gouging, etc. There are so many reference
          | data sets that have similar
         | characteristics... and "data marketplaces" in a way are
         | targeted at making "private" reference data sets available for
          | sale - so then where is the "open" data marketplace? Well,
          | here you go: Dolt.
         | 
         | I have often realized that the more ways we can make things
         | collaborative, the better off we will be.
         | 
          | Data is one of those things where coming up with common,
          | public reference datasets is difficult: there are lots of
          | different perspectives ("branches"), sometimes your data
          | set is missing something and it would be cool if someone
          | could propose it ("pull request"), and sometimes you want
          | to compare the old and new versions of a dataset ("diff")
          | to see what is different.
         | 
          | One difficult thing about Dolt is that it will only be
          | successful if people are actually willing to work together
          | to cook up and maintain common data sets collaboratively,
          | or if those doing so have an incentive to manage an "open
          | data" project on Dolt as benevolent maintainers. But then
          | it has the same challenges as "open source" in general, so
          | it is not really that different.
         | 
          | Dolt could even be used as a foundation for a master data
          | management registry - in the sense that you could pop it in
          | as a "communal data fountain", if you will, where anybody
          | in your org, or on the Web, etc. could contribute - and you
          | can have benevolent maintainers look over the general
          | quality of the data. Dolt would be missing the data quality
          | / fuzzy matching aspect that master data tools offer, but
          | it is a start for sure.
         | 
         | For example, I work in a giant corporation right now. Each
         | department prefers to maintain its own domain data and in some
         | cases duplicates common data in that domain. Imagine using Dolt
         | to make it possible for all these different domains to
         | collaborate on a single copy of common data in a "Dolt" data
         | set - now people can share data on a single copy and use pull
         | requests, etc. to have an orderly debate on what that common
         | data schema and data set should look like.
         | 
         | I think it's an idea that is very timely.
         | 
         | P.S. Dolt maintainers, if you read this and want to talk, I'm
         | game! Awesome work :)
        
           | a-dub wrote:
           | agree 100% this looks awesome for reference datasets.
           | 
            | not so sure how well it would work for live data sources
            | that update with time, as it could encourage people to
            | apply more ad-hoc edits instead of getting their version-
            | controlled jobs to work 100%. but who knows, maybe that
            | would be a net win in some cases?
        
           | zachmu wrote:
           | We totally agree on all points! Come chat with us on our
           | discord about how we can help your org solve its problems:
           | 
           | https://discord.com/invite/RFwfYpu
        
         | curryst wrote:
         | I might look at it for work. Compliance requires us to keep
         | track of who made what change when, and who it was approved by
         | in case regulators need it.
         | 
          | Right now, this often means making an MR in Git with your
          | snippet of SQL, getting it approved, and then manually
          | executing it. This would let us bring the apply stage in,
          | as well as avoid "that SQL query didn't do exactly what I
          | expected" issues.
         | 
          | It's possible to do it within the SQL engine, but then I
          | have to maintain that, which I would prefer not to do - as
          | well as deal with the performance implications.
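          | 
          | A sketch of how that flow could look (the table, branch,
          | and commit message are hypothetical; the commands are
          | standard dolt):
          | 
          |     dolt checkout -b change-1234
          |     dolt sql -q "UPDATE accounts SET status='closed' WHERE id=42"
          |     dolt diff master change-1234    # reviewers see the row changes
          |     dolt add . && dolt commit -m "close account 42 (approved)"
          |     dolt checkout master && dolt merge change-1234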
        
           | zachmu wrote:
           | That's a great use case, let us know how we can help. We have
           | a discord if you want to chat about it:
           | 
           | https://discord.com/invite/RFwfYpu
        
           | martincolorado wrote:
           | I see a use-case for public resource data sets. For example,
           | right of way, land use variance, permits, and land records.
           | These are fundamentally public data sets that are often
            | maintained by entities with limited budgets, and the potential
           | (even incentive) for fraud is substantial. Also, there is
           | significant value in being able to access the data set for
           | analytical purposes such as real estate analysis.
        
       | knbknb wrote:
       | Is something like "dolt diff master..somebranch -- mytable"
       | possible?
       | 
       | Same question: are comparisons between branches possible in
       | dolt? Which "diff" subcommands do you plan to support in the
       | future?
        
       | Ericson2314 wrote:
       | What people usually miss about these things is that normal
       | version control benefits hugely from content addressing and
       | normal forms.
       | 
       | The salient aspect of relational data is that it's cyclic.
       | This makes content addressing unable to provide normal forms
       | on its own (unless someone figures out how to Merkle cyclic
       | graphs!), but the normal form can still be made in other ways.
       | 
       | The first part is easy enough: store rows in some order.
       | 
       | The second part is more interesting: making the choice of
       | surrogate keys not matter (quotienting it away). Sorting table
       | rows that contain surrogate keys, when the keys themselves
       | depend on the sorting of the rows, makes for some interesting
       | bags of constraints, for which there may be more than one
       | fixed point.
       | 
       | Example:
       | 
       |     CREATE TABLE Foo (
       |       a uuid PRIMARY KEY,
       |       b text,
       |       best_friend uuid REFERENCES Foo(a)
       |     );
       | 
       | DB 0:
       | 
       |     0 Alice 0
       | 
       | 1 reclusive Alice, best friends with herself. Just fine.
       | 
       | DB 1:
       | 
       |     0 Alice 1
       |     1 Alice 1
       | 
       | 2 reclusive Alices, both best friends with the second one. The
       | Alices are the same up to primary keys, but while primary keys
       | are to be quotiented out, primary key equality isn't, so this
       | is valid. And we have an asymmetry by which to sort.
       | 
       | DB 2:
       | 
       |     0 Alice 1
       |     1 Alice 0
       | 
       | 2 reclusive Alices, each best friends with the other. The
       | Alices are completely isomorphic, and one notion of normal
       | forms would say this is exactly the same as DB 0: as if this
       | is reclusive Alice in a fun house of mirrors.
       | 
       | All this is resolvable, but it's subtle, and there's no
       | avoiding the complexity. E.g. if one wants to cross-reference
       | two humans' data entries, where each person assigned their own
       | surrogate IDs, this type of analysis must be done. Likewise
       | when merging forks of a database.
       | 
       | I'd love to be wrong, but I don't think any of the folks doing
       | "git for data" are planning their technology with this level
       | of mathematical rigor.
        
         | barrkel wrote:
          | Consider the whole database - the whole set of facts across
          | all relations - as the state in the tree. Each transaction
          | is a serialized delta that produces a new node in the tree,
          | a new HEAD. That's closer to what's being gotten at, as I
          | see it.
          | 
          | Transaction logs are already not that different to patch
          | sets, and merge conflicts have an isomorphism with
          | replication inconsistencies.
        
           | Ericson2314 wrote:
           | > Consider the whole database - the whole set of facts across
           | all relations - as the state in the tree.
           | 
            | I tried to demonstrate that this is easier said than
            | done. Deciding the equality/redundancy of facts is very
            | subtle. At some point it might even be a guess whether
            | your two clerks each met the same Alice when entering
            | data in their DB forks or not.
           | 
            | Transactions are just patches, I completely agree. And
            | patches are just partial functions. But deciding what the
            | action of a partial function is, or whether an input is
            | in its domain, requires a notion of equality. (Partial
            | functions are encoded very nicely with incomplete pattern
            | matching; in this case the question is whether Alice
            | matches the pattern.)
            | 
            | Basically I know where you are coming from, and I want it
            | to work too, but you cannot just wave away the math and
            | the issues it points out.
        
         | TOGoS wrote:
         | A lot of your comment went over my head, but I have modeled
         | relational data in a way conducive to being stored in a Merkle
         | tree. The trick being that every entity in the system ended up
         | having two IDs. A hash ID, identifying this specific version of
         | the object, and an entity ID (probably a UUID or OID), which
         | remained constant as new versions were added. In a situation
         | where people can have friends that are also people, they are
         | friends with the person long-term, not just a specific version
         | of the friend, so in that case you'd use the entity IDs. Though
         | you might also include a reference to the specific version of
         | the person at the point in time at which they became friends,
         | in which case they would necessarily reference an older version
          | of that person. If you friend yourself, you're actually
          | friending the version of yourself from a moment ago.
         | 
         | A list of all current entities, by hash, is stored higher up
         | the tree. Whether it's better that objects themselves store
         | their entity ID or if that's a separate data structure mapping
         | entity to hash IDs depends on the situation.
         | 
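          | A hypothetical schema for that two-ID scheme (all names
          | made up):
          | 
          |     CREATE TABLE person_version (
          |       hash_id   char(64) PRIMARY KEY, -- hash of this version
          |       entity_id char(36) NOT NULL,    -- stable across versions
          |       name      text,
          |       friend_of char(36)  -- references an entity, not a
          |                           -- version, so it cannot be a
          |                           -- foreign key to hash_id
          |     );
          | 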
         | On second reading I guess your comment was actually about how
         | to come up with content-based IDs for objects. I guess my point
         | was that in the real world you don't usually need to do that,
         | because if object identity besides its content is important you
         | can just give it an arbitrary ID. How often does the problem of
         | differentiating between a graph containing identical Alices vs
         | one with a single self-friending Alice actually come up? Is
         | there any way around it other than numbering the Alices?
        
           | Ericson2314 wrote:
           | > ...The trick being that every entity in the system ended up
           | having two IDs...
           | 
           | I think we agree that this is a partial solution. Adding a
           | temporal dimension and referencing immutable single-versions
           | only can break cycles by making them unconstructable in the
           | first place. But once an object refers to foreign entity IDs,
           | hash IDs become "polluted" with surrogate values.
           | 
           | > How often does the problem of differentiating between a
           | graph containing identical Alices vs one with a single self-
           | friending Alice actually come up? Is there any way around it
           | other than numbering the Alices?
           | 
           | I think it would come up with forks that have some identical
           | edits, especially if those edits are in different orders. In
           | that case, surrogate keygen state would get out of sync
           | (whether it's counters or UUID state). Either we pessimize
           | merges, or we need some way to recover.
           | 
           | I think allowing for "identical Alices" is probably necessary
           | in practice, but an interface should have a warning of some
           | sort about this. (Maybe ask the Alices more questions until
           | you can differentiate them? Get prepared in case one of the
            | Alices comes back and wants a new I.D. card and you don't
            | want to easily enable fraud.) Likewise, when merging,
            | those warnings should be brought to the fore, along with
            | a menu of resolutions at the extremes for the user to
            | decide between.
        
         | ghusbands wrote:
         | > The salient aspect of relational data is that it's cyclic
         | 
         | This is an odd claim. Most relational data is not cyclic, and
         | it's easy enough to come up with a scheme to handle cyclic data
         | in a consistent fashion.
         | 
         | Conflicting changes (two changes to the same 'cell' of a
         | database table) are a much more likely issue to hit and will
         | need handling in much the same way merge conflicts are
         | currently handled, so there are already situations in which
         | manual effort will be needed.
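          | 
          | For what it's worth, dolt surfaces exactly this kind of
          | cell-level conflict for manual resolution; roughly along
          | these lines (table and branch names hypothetical):
          | 
          |     dolt merge somebranch
          |     dolt conflicts cat mytable      # inspect ours/theirs rows
          |     dolt conflicts resolve --ours mytable
          |     dolt add mytable && dolt commit -m "merge somebranch"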
        
         | chatmasta wrote:
          | We have content-addressable objects at Splitgraph. [0] And
         | here's an example of a "point in time" query across two
         | versions of an image on Splitgraph. [1]
         | 
         | I'm on my phone and don't have a ton of time to respond right
         | now, but I'd recommend reading our docs. We're working on a lot
         | of what you mention.
         | 
         | (Also, we're hiring for backend and frontend roles. See my
         | comment history for more.)
         | 
         | [0] https://www.splitgraph.com/docs/concepts/objects
         | 
         | [1]
         | https://www.splitgraph.com/workspace/ddn?layout=hsplit&query...
        
       | qwerty456127 wrote:
       | I would rather have a "GitHub for data" - an SQL database I
       | could have hosted for free, give everybody read-only access
       | to, and give some people I choose R/W access to. That's
       | something I really miss.
        
         | rsstack wrote:
          | Dolt also has a paid product called DoltHub; check it out.
        
           | zachmu wrote:
           | Free for public repositories, and for private repos under a
           | gig.
        
       ___________________________________________________________________
       (page generated 2021-03-07 23:02 UTC)