[HN Gopher] Dolt is Git for Data: a SQL database that you can fork, clone, branch, merge
___________________________________________________________________
Dolt is Git for Data: a SQL database that you can fork, clone,
branch, merge
Author : crazypython
Score : 687 points
Date : 2021-03-06 21:15 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| [deleted]
| tjkrusinski wrote:
| Any info on how this differs from Dat?
| rymurr wrote:
  | I've been working on https://github.com/projectnessie/nessie for
  | about a year now. It's similar to Dolt in spirit but aimed at big
  | data/data lakes. Would welcome feedback from the community.
  |
  | It's very exciting to see this field picking up speed. Tons of
  | interesting problems to be solved :-)
| Aeolun wrote:
  | What does this look like when merging? Does it need a
  | specialized tool?
|
| I'm a bit sad it doesn't seem to have git style syncing though.
| zachmu wrote:
    | Merge is built in, same as with git. Same syntax too:
    |
    |     dolt checkout -b <branch>
    |     dolt merge <branch>
|
    | What do you mean by git-style syncing? It has `push`, `pull`,
    | and `fetch`.
| Aeolun wrote:
      | In regards to merging, I understand that part, but say I
      | have a conflict, how does it get resolved? Does it open some
      | sort of text editor where I delete the unwanted lines?
|
| I mean, if I have a repository on one PC, I can clone from
| there on a different PC.
|
| As far as I can see this only allows S3/GC or Dolthub
| remotes.
| zachmu wrote:
| Merge conflicts get put into a special conflicts table. You
| have to resolve them one by one just like text conflicts
| from a git merge. You can do this with the command line or
| with SQL.
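        |
        | For illustration, a resolution session might look roughly
        | like this (command and table names as I recall them from
        | the docs, so treat this as a sketch):
        |
        |     dolt conflicts cat mytable
        |     dolt conflicts resolve --ours mytable
        |
        | or, via SQL, inspect the per-table conflicts view and
        | delete rows as you resolve them:
        |
        |     SELECT * FROM dolt_conflicts_mytable;
        |     DELETE FROM dolt_conflicts_mytable;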
|
| It's true we don't support SSH remotes yet. Wouldn't be too
| hard to add though. File an issue if this is preventing you
| from using the product. Or use sshfs, which is a pretty
| simple workaround.
| twobitshifter wrote:
| This is cool, but the parent dolthub project is even cooler!
| Dolthub.com
| laurent92 wrote:
| Free for public repos!
| zachmu wrote:
| Also private repos under a gig, but you have to give us a
| credit card to be private.
| mayureshkathe wrote:
| Since you've hosted it on GitHub, would you also consider
| providing a JavaScript client library for read access?
|
| Could pave the way for single page web applications (hosted via
| GitHub Pages) working with Dolt as their database of choice.
| efikoman wrote:
| ?
| justincormack wrote:
| I collected all the git for data open source projects I could
| find a few months back, there have been a bunch of interesting
| approaches
| https://docs.google.com/spreadsheets/d/1jGQY_wjj7dYVne6toyzm...
| chub500 wrote:
| I've had a fairly long-term side project working on git for
| chronological data (data is a cause and effect DAG), know of
| anybody doing that?
| michaelmure wrote:
| It might not be exactly what you are looking for, but git-
| bug[1] is encoding data into regular git objects, with merges
| and conflict resolution. I'm mentioning this because the hard
| part is providing an ordering of events. Once you have that
| you can store and recreate whatever state you want.
|
      | This branch[2], which I'm almost done with, removes the
      | purely linear branch constraint and allows full DAGs (that
      | is, concurrent editing) while still providing a good
      | ordering.
|
| [1]: https://github.com/MichaelMure/git-bug [2]:
| https://github.com/MichaelMure/git-bug/pull/532
| glogla wrote:
| This one seems to be missing: https://projectnessie.org/
| justinclift wrote:
| It's also missing DBHub.io. ;)
| herpderperator wrote:
  | This is pretty cool. I wish `dolt diff` would use + and - though
  | (isn't that standard?) rather than > and <, which are harder to
  | distinguish.
| guerrilla wrote:
| Both are "standard" from the UNIX/GNU diff(1) tool. The default
| behavior gives you the '>' and '<' format and using -u gives
| you the "unified" '+' and '-' format.
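    |
    | For example, with a one-line file changed from "foo" to "bar"
    | (the -u file headers and timestamps are omitted here):
    |
    |     $ diff old new
    |     1c1
    |     < foo
    |     ---
    |     > bar
    |
    |     $ diff -u old new
    |     @@ -1 +1 @@
    |     -foo
    |     +bar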
| bitslayer wrote:
| Is it for versions of the database design or versions of the
| data?
| skybrian wrote:
| Both. Schema changes are versioned like everything else. But
| depending on what the change is, it might make merges
| difficult.
|
| (I haven't used it; I just read the blog.)
| fiedzia wrote:
      | BTW, I wish all databases versioned their schema and kept a
      | full history. This should be a standard feature.
| codesnik wrote:
        | Keep history just as a changelog, to see when and what was
        | changed? Or something with versions you can actively
        | revert to? I suppose one of the problems is that changing
        | the schema is usually interleaved with some sort of data
        | migration and conversion, and those I have no idea how to
        | track without using some general scripting language, like
        | the migrations in many frameworks which are already there.
| fiedzia wrote:
          | Just storing a changelog would be very useful. I could,
          | for example, compare the db version some app was
          | developed for with the current state. Most companies
          | will store schema migrations in git, but doing anything
          | with this information is difficult. Having this in the
          | db would make automated checks easier.
| [deleted]
| jonnycomputer wrote:
  | Naive question here. Aside from it being mysql, what is different
  | here from just using git + sqlite?
  |
  | Update: When I posted, I'd forgotten that an SQLite db file is
  | binary. Not sure what I was thinking.
| bargle0 wrote:
| Merging a SQLite database is challenging.
| [deleted]
| justinclift wrote:
| We do it on DBHub.io. :)
|
| If you're into Go, these are the commits where the pieces
| were hooked together.
|
| * https://github.com/sqlitebrowser/dbhub.io/commit/6c40e051ff
| 7...
|
| * https://github.com/sqlitebrowser/dbhub.io/commit/989ee0d08e
| 6...
| unnouinceput wrote:
      | I'd venture further and say merging any DB is challenging,
      | SQLite or not.
| kyrieeschaton wrote:
| People interested in this approach should compare Rich Hickey's
| Datomic.
| einpoklum wrote:
| But do you really need this functionality, if you already have an
| SQL database?
|
| That is, you can:
|
| 1. Create a table with an extra changeset id column and a branch
| id column, so that you can keep historical values.
|
| 2. Have a view on that table with the latest version of each
| record on the master branch.
|
  | 3. Express branching-related actions as actions on the main table
  | with different record versions and branch names.
|
| 4. For the chocolate sprinkles, have tables with changeset info
| and branch info
|
| and that gives you a poor man's git already - doesn't it?
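  |
  | A minimal sketch of that scheme (table and column names made up
  | for illustration; MySQL-flavored):
  |
  |     CREATE TABLE account_history (
  |         id        INT,
  |         changeset INT,
  |         branch    VARCHAR(64) DEFAULT 'master',
  |         deleted   BOOLEAN DEFAULT FALSE,
  |         balance   DECIMAL(10,2),
  |         PRIMARY KEY (id, changeset, branch)
  |     );
  |
  |     -- latest version of each record on the master branch
  |     CREATE VIEW account AS
  |     SELECT h.*
  |     FROM account_history h
  |     JOIN (SELECT id, MAX(changeset) AS changeset
  |           FROM account_history
  |           WHERE branch = 'master'
  |           GROUP BY id) latest USING (id, changeset)
  |     WHERE h.branch = 'master' AND NOT h.deleted;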
| resonantjacket5 wrote:
    | That doesn't let you merge in someone else's changes easily.
    | Aka two team members make changes to the (local copy of the)
    | database and now you want to merge them.
    |
    | I mean, sure, you can have another database tracking their
    | changes and a merging algorithm, but that's what dolt is doing
    | for you.
| zomglings wrote:
| If anyone from the dolt team is reading this, I'd like to make an
| enquiry:
|
| At bugout.dev, we have an ongoing crawl of public GitHub. We just
| created a dataset of code snippets crawled from popular GitHub
| repositories, listed by language, license, github repo, and
| commit hash and are looking to release it publicly and keep it
| up-to-date with our GitHub crawl.
|
| The dataset for a single crawl comes in at about 60GB. We
| uploaded the data to Kaggle because we thought it would be a good
| place for people to work with the data. Unfortunately, the Kaggle
| notebook experience is not tailored to such large datasets. Our
| dataset is in a SQLite database. It takes a long time for the
| dataset to load into Kaggle notebooks, and I don't think they are
  | provisioned with SSDs, as queries take a long time. Our best
  | workaround is to partition it into 3 datasets on Kaggle -
  | train, eval, and development - but it will be a pain to manage
  | this for every update, especially as we enrich the dataset with
  | results of static analysis, etc.
|
| I'd like to explore hosting the public dataset on Dolthub. If
| this sounds interesting to you please, reach out to me - email is
| in my HN profile.
| zomglings wrote:
| This is the dataset on Kaggle -
| https://www.kaggle.com/simiotic/github-code-snippets
| justinclift wrote:
      | Yeah, a database that size is likely to be a challenge
      | unless the computer system it's running on has scads of
      | memory.
      |
      | One of my projects (DBHub.io) is putting effort towards
      | working through the problem of larger SQLite databases
      | (~10GB), and that's mainly through using bare metal hosts
      | with lots of memory, e.g. 64GB, 128GB, etc.
|
| Putting the same data into PostgreSQL, or even MySQL, would
| likely be much more efficient memory wise. :)
| zomglings wrote:
| Can't beat SQLite for distribution as a public dataset,
| though.
| zachmu wrote:
| We think dolt can :)
| justinclift wrote:
| How do you send someone a dolt database as a file?
| zachmu wrote:
| Push it to DoltHub, tell them to clone it. Just like with
| source code.
| acidbaseextract wrote:
| How do you send someone a git repository as a file? Why
| would a tarball not work?
| touisteur wrote:
| Git bundles. Amazing simple tool. Doing it all the time
| via sneaker net, or mail/chat-with-attachments, better
| that sending patches around IME.
| 411111111111111 wrote:
            | Probably by first pushing it to a file-based remote.
            | This command is in the readme:
            |
            |     dolt remote add <remote> file:///Users/xyz/abs/path/
| zachmu wrote:
| We have 200 GB databases in dolt format that are totally
| queryable. They don't work well querying on the web though
| - you need a local copy to query it effectively. Making web
| query as fast as local is an ongoing project.
| justinclift wrote:
| Yeah the "on the web" piece is the thing we're talking
| about. :)
|
| 200GB databases for PostgreSQL (etc) isn't any kind of
| amazing.
| tomcam wrote:
| Uh, personal question here. Where does your ~10G number
| come from? I pretty much run my life on the Apple Notes
| app. My Notes database is about 12G and now I'm scared.
| zachmu wrote:
| We'll be in touch :)
| StreamBright wrote:
    | You have other options too. If I have time I can try to reduce
    | the size with a columnar format that is designed for this use
    | case (repeated values, static dataset).
| zomglings wrote:
| That would be really great. Let me know if there's any way we
| can help. Maybe if we released a small version of the dataset
| for testing/benchmarking and then I could take care of
| running the final processing on the full dataset?
| StreamBright wrote:
        | That would be amazing. I get my internet back tomorrow and
        | I can play with the dataset to see how much we could
        | optimize.
| zomglings wrote:
| Hi StreamBright - just published the development version
| of the dataset also to Kaggle:
| https://www.kaggle.com/simiotic/github-code-snippets-
| develop...
|
| Compressed, it's 471 MB. Uncompressed, just a little more
| than 3 GB.
|
| If you want to get in touch with me in a better way than
| HN comments two good options:
|
| 1. My email is in my profile
|
| 2. You can direct message me (@zomglings) on the Bugout
| community Slack: https://join.slack.com/t/bugout-
| dev/shared_invite/zt-fhepyt8...
|
| Looking forward to collaborating with you. :)
| px43 wrote:
| Is there a nice way to distribute a database across many systems?
|
| Would this be useful for a data-set like Wikipedia?
|
| It would be really nice if anyone could easily follow Wikipedia
| in such a way that anyone could fork it, and merge some changes,
| but not others, etc. This is something that really needs to
| happen for large, distributed data-sets.
|
| I really want to see large corpuses of data that can have a ton
| of contributors sending pull requests to build out some great
| collaborative bastion of knowledge.
| metaodi wrote:
| This is basically Wikidata [1], where the distribution is
| linked data/RDF with a triple store and federation to query
| different triple stores at once using SPARQL.
|
| [1] https://wikidata.org
| andrewmcwatters wrote:
| Reminds me a bit of datahub.io, but potentially more useful.
| Nican wrote:
| I love the idea of this project so much. Being able to more
| easily share comprehensible data is an interest of mine.
|
| It is not the first time I have seen immutable B-trees being used
| as a method for being able to query a dataset on a different
| point in time.
|
  | Spanner (and its derivatives) uses a similar technique to ensure
  | backup consistency. Solutions such as CockroachDB also allow you
  | to query data in the past [1], and then use a garbage collector
  | to delete older unused data. The time-to-live of history data is
  | configurable.
|
| [1] https://www.cockroachlabs.com/docs/stable/as-of-system-
| time....
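  |
  | For reference, CockroachDB's syntax for this looks like the
  | following (table name made up for illustration):
  |
  |     SELECT * FROM orders AS OF SYSTEM TIME '-1h';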
| Nican wrote:
    | Albeit, now it makes me wonder how much of the functionality
    | of dolt could be replicated with CockroachDB. The internal
    | data structures of both databases are mostly similar.
    |
    | You can have infinite point-in-time querying by setting the
    | TTL of data to forever.
    |
    | You have the ability to do distributed BACKUP/IMPORT of data
    | (which is mostly a file copy, and also includes historical
    | data).
    |
    | A transaction would be the equivalent of a commit, but I do
    | not think there is a way to list out all transactions on CRDB,
    | so that would have to be done separately.
    |
    | And you gain other benefits, such as distributed querying and
    | high availability.
    |
    | I just find it interesting that both databases (CockroachDB
    | and Dolt) share the same principle of immutable B-Trees.
| devwastaken wrote:
  | Something like this but for sqlite would be great for building
  | small git-enabled applications/servers that can benefit from the
  | features git provides but only need a database and a library to
  | do it.
| lifty wrote:
| Noms might be what you're looking for
| (https://github.com/attic-labs/noms). Dolt is actually a fork
| of Noms.
| 0xbadcafebee wrote:
| If this could somehow work with existing production
| MariaDB/PostgreSQL databases, this would be the next Docker. Sad
| that it requires its own proprietary DB (SQL compatible though it
| may be, it's not my existing database).
|
| I wish something completely free like this existed just to manage
| database migrations. I don't think anything is quite this
| powerful/user-friendly (if you can call Git that) and free.
| anasbarg wrote:
| I like this. A while ago, I was asking my co-founder the question
| "Why isn't there Git for data like there is for code?" while
| working on a database migrations engine that aims to provide
| automatic database migrations (data & schema migrations). After
| all, code is data.
| pizzabearman wrote:
| Is this mysql only?
| zachmu wrote:
| It uses the mysql SQL dialect for queries. But it's its own
| database.
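    |
    | In practice that means a stock MySQL client can talk to it,
    | along these lines (defaults quoted from memory, so check the
    | docs):
    |
    |     dolt sql-server
    |     mysql -h 127.0.0.1 -P 3306 -u root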
| pgt wrote:
| Datomic: https://www.youtube.com/watch?v=Cym4TZwTCNU
| smarx007 wrote:
  | If you are interested in going one step beyond, you can check
  | out https://terminusdb.com/docs/terminusdb/#/ which uses RDF to
  | represent (knowledge) graphs (and has revision control).
| joshspankit wrote:
| I never understood why we don't have SQL databases that track all
| changes in a "third dimension" (column being one dimension, row
| being the second dimension).
|
| It might be a bit slower to write, but hook the logic in to
| write/delete, and suddenly you can see _exactly_ when a field was
| changed to break everything. The right middleware and you could
| see the user, IP, and query that changed it (along with any other
| queries before or after).
| mcrutcher wrote:
| This has existed for a very long time as a data modeling
| strategy (most commonly, a "type 2 dimension") and is the way
| that all MVCC databases work under the covers. You don't need a
| special database to do this, just add another column to your
| database and populate it with a trigger or on update.
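    |
    | A minimal sketch of the trigger approach (table and column
    | names are illustrative):
    |
    |     CREATE TABLE user_history (
    |         id         INT,
    |         email      VARCHAR(255),
    |         changed_at DATETIME
    |     );
    |
    |     CREATE TRIGGER user_audit
    |     AFTER UPDATE ON users
    |     FOR EACH ROW
    |         INSERT INTO user_history (id, email, changed_at)
    |         VALUES (OLD.id, OLD.email, NOW());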
| tthun wrote:
| MS SQL server 2016 onwards has temporal tables that support
| this (point in time data)
| joshspankit wrote:
      | Huh. I just read the spec. Not quite a third 'dimension',
      | but it looks like exactly what I was asking for: a
      | (reasonably) automatic and transparent record of previous
      | values, as well as timestamps for when they changed.
|
| I'll call this a "you learn something every day" and a "hey
| thanks @tthun (and @predakanga)"
| predakanga wrote:
| This does exist, though support for it is pretty sparse; it's
| called "Temporal Tables" in the SQL:2011 standard -
| https://sigmodrecord.org/publications/sigmodRecord/1209/pdfs...
|
    | Last time I checked, it was supported in SQL Server and
    | MariaDB, and in Postgres via an extension.
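    |
    | The MariaDB flavor, for instance, looks like this (from
    | memory of the docs, so double-check the details):
    |
    |     CREATE TABLE t (
    |         id  INT PRIMARY KEY,
    |         val TEXT
    |     ) WITH SYSTEM VERSIONING;
    |
    |     -- every UPDATE/DELETE keeps the old row; query the past:
    |     SELECT * FROM t
    |     FOR SYSTEM_TIME AS OF TIMESTAMP '2021-03-01 00:00:00';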
| kenniskrag wrote:
| Because you can do that with after update triggers or server-
| side in software.
| iamwil wrote:
| which db does it use?
| zachmu wrote:
| It is a database. It implements the MySQL dialect and binary
| protocol, but it isn't MySQL. Totally separate storage engine
| and implementation.
| jrumbut wrote:
  | It's amazing this isn't a standard feature. The database world
  | seems to have focused on large, high-volume, globally
  | distributed databases. Presumably you wouldn't version
  | clickstream or IoT sensor data.
|
| Features like this that are only feasible below a certain scale
| are underdeveloped and I think there's opportunity there.
| fiddlerwoaroof wrote:
    | Datomic has some sort of zero-cost forking of the database:
    | its "add-only" design makes this cheap.
| qbasic_forever wrote:
| Every DB engine used at scale has a concept of snapshots and
| backups. This just looks like someone making a git-like
| porcelain for the same kind of DB management constructs.
| zachmu wrote:
| It's not just snapshots though.
|
| Dolt actually stores the tables, rows, and commits in a
| Merkle DAG, like Git. So you get branch and merge. You can't
| do branch and merge with snapshots.
|
| (You also get the full git / github toolbox: push and pull,
| fork and clone, rebase, and most other git commands)
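      |
      | To make that concrete, a typical session might look like
      | this (a sketch; the table is made up, but the commands
      | mirror git and are in the README):
      |
      |     dolt checkout -b feature
      |     dolt sql -q "UPDATE prices SET amount = 42 WHERE id = 1"
      |     dolt add .
      |     dolt commit -m "adjust price"
      |     dolt checkout master
      |     dolt merge feature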
| qbasic_forever wrote:
        | Yeah it's a neat idea but I struggle to think of good
        | use-cases for merge, other than toy datasets. If I'm
        | working on a service that's sharding millions of users
        | across dozens of DB instances, a merge is going to be
        | incomprehensible, and reasoning about conflicts hopeless.
| MyNameIsFred wrote:
| > Yeah it's a neat idea but I struggle to think of good
| use-cases [...] If I'm working on a service [...]
|
| I suspect that's simply not the use-case they're
| targeting. You're thinking of a database as simply the
| persistence component for your service, a means to an
| end. For you, the service/app/software is the thing
| you're trying to deliver. Where this looks useful is the
| cases where the data itself is the thing you're trying to
| deliver.
| 101008 wrote:
    | Isn't the mysql log journal* what you are looking for?
    |
    | * I don't remember the exact name, but I mean the feature
    | that is used to replicate actions if there was an error.
| fiedzia wrote:
      | No. Relational DB logs are kept for a short time, and do not
      | allow for branching/merging. And even if you did store the
      | full log since day 1, the only way to check the state of the
      | database at some point would be to start from scratch and
      | replay everything to the desired point. For any sizeable db,
      | that's not practical.
| [deleted]
| [deleted]
| strogonoff wrote:
| You can also use Git for data!
|
  | It's a bit slower, but smart use of partial/shallow clones can
  | address performance degradation on large repositories over time.
  | You just need to take care of the transformation between
  | "physical" trees/blobs and "logical" objects in your dataset
  | (which may not have a 1:1 mapping, as a more granular physical
  | layer reduces the likelihood of merge conflicts).
  |
  | I'm also following Pijul, which seems very promising in regards
  | to versioning data--I believe they might introduce primitives
  | allowing one to operate on changes in actual data structures
  | rather than between lines in files, as with Git.
|
  | Add to that a sound theory of patches, and that's a definite win
  | over Git (or Doit for that matter, which seems to be the same
  | old Git but for SQL).
| vvanders wrote:
| Nope, been there done that, no thanks.
|
| Lack of locking for binary files, overhead > 1gb and all the
| shenanigans you need to do for proxy servers. There's better
| solutions out there but they aren't free.
| strogonoff wrote:
| Would be very curious to hear more about issues with proxy
| servers (where were they required?), overheads (do you mean
| RAM usage?) and locking.
| vvanders wrote:
        | Sure, keep in mind that my data is a little old, but last
        | time I peeked into the git LFS space it seemed like there
        | were still a few gaps.
        |
        | First, most of my background in this area comes from
        | gamedev, so YMMV as to whether the same applies in your
        | use cases.
|
| For our usage we'd usually have a repo history size that
| crossed the 1TB mark and even upwards of 2-3TB in some
| cases. The developer sync was 150-200GB, the art sync was
| closer to 500-600GB and the teams were regularly churning
| through 50-100GB/week depending on where we were in
| production.
|
| You need discipline specific views into the repo. It just
| speeds everything up and means that only the teams that
| need to take the pain have to. From a performance
| perspective Perforce blows the pants off anything else I've
| seen, SVN tries, but P4 was easily an order of magnitude
| faster to sync or do a clean fetch.
|
        | I've seen proxy servers done with git but it's usually
        | some really hacky thing scripted together with a ton of
        | duct tape and client-specific host overrides. When you
        | have a team split across East Coast/West Coast (or another
        | country) you _need_ that proxy so that history is cached
        | in a way that it only gets pulled in locally once. Having
        | a split push/pull model is asking for trouble, and last I
        | checked it wasn't clear to me if stuff like git LFS
        | actually handles locking cleanly across it.
|
        | From an overhead perspective git just falls over at ~1GB
        | (hence git LFS, which I've seen teams use to varying
        | degrees of success based on project size). The need to do
        | shallow history and sidestep resolving deltas is a ton of
        | complexity that isn't adding anything.
|
        | With a lot of assets, merging just doesn't exist and a
        | DVCS totally falls over here. I've seen fights nearly
        | break out in the hallway multiple times when two artists/
        | animators both forgot to check out a file (usually because
        | someone missed the metadata to say it's an exclusive-
        | access file). With unmergeable binary files that don't get
        | locked, your choice is who gets to drop 1-3 days of work
        | on the floor when the other person blows away their
        | changes to commit. If those changes span multiple
        | interconnected packages/formats/etc. you have a hard fork
        | that you can never bring back together.
|
        | There's a couple of other details but those are the large
        | ones. Perforce worked incredibly well in this space but it
        | is not cheap, and so I've seen teams try to go their own
        | way to mixed success. I'll admit that you can't do a
        | monorepo in P4 (and even tools like repo in the Android
        | world have their problems too) but if you segregate your
        | large business/product lines across P4 repos it scales
        | surprisingly well.
|
        | Anyway, you may or may not hit any or all of this, but
        | I've yet to see git tackle a 1TB+ repo history well (and
        | things like repo that use many mini-repos don't count in
        | my book due to the lack of atomicity across submissions
        | that span multiple repos).
| strogonoff wrote:
| This is super informative!
|
| In my case it's different since Git isn't accessed by
| users directly, rather I'm working on some tools that
| work on top of Git (on user's machine). Data is primarily
| text-based, though sometimes binary assets come up
| (options for offloading them out of Git are being
| investigated).
|
| So far there were no major issues. I predict degradation
| over time as repos grow in size and history (Git is not
| unique in this regard, but it'll probably be more rapid
| and easier to observe with Git), so we might start using
| partial cloning.
|
| (I stand by the idea that using straight up Git for data
| is something to consider, but with an amendment that it's
| predominantly text data, not binary assets.)
| vvanders wrote:
            | Yeah, my experience has been that you start seeing
            | issues with long delta decompression times around the
            | 1-2GB mark. That climbs quicker if you have binary
            | formats that push the delta compression algorithm into
            | cases where it does poorly (which makes sense, since
            | it was optimized for source code).
            |
            | If you have binary assets and they don't support
            | merging or regeneration from source artifacts, that
            | mandates locking (ideally built into the SCM, but I've
            | seen wiki pages used in a pinch at small scale).
| graderjs wrote:
  | I strike a balance with this by using git on JSON files. I build
  | the JSON files into a database (1 file per record, 1 directory
  | per table, subdirectories for indexes). The whole thing is
  | pretty beautiful, and it's functioning well for a user-account,
  | access-management database I'm running in production. I like
  | that I can go back and do
  |
  |     git diff -p
  |
  | to see the users who have signed up recently, for example.
|
| You can get the code, over at: https://github.com/i5ik/sirdb
|
  | The advantages of this approach are using existing unix tooling
  | for text files, solid versioning, easy inspectability, and
  | leveraging the filesystem's B-tree indexing as a fast index
  | structure (rather than having to write my own B-trees). Another
  | advantage is hardware-linked scaling. For example, if I use
  | regular hard disks, it's slower; but if I use SSDs it's faster.
  | And it should also be possible to mount the DB as a RAM disk and
  | make it super fast.
|
| The disadvantages are that the database side still only
| supports a couple of operations (like exact, multikey searches,
| lookup by ID, and so on) rather than a rich query language. I'm
| OK with that for now, and I'm also thinking of using skiplists
  | in the future to get a nice ordering property for the keys in
  | an index so I can easily iterate and page over those.
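  |
  | As a hypothetical example of the kind of layout described above
  | (my own illustration, not necessarily sirdb's actual on-disk
  | format):
  |
  |     users/                    # one directory per table
  |       _indexes/
  |         email/
  |           alice@example.com   # index entry pointing at a record
  |       8f2b1c.json             # one file per record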
| teej wrote:
| The fact that I can use git for data if I carefully avoid all
| the footguns is exactly why I don't use git for data.
| rapjr9 wrote:
    | We used git to store and manage data sets for a machine
    | learning project involving chewing detection, with audio data
    | used in training. It was cumbersome, and the huge datasets
    | caused some problems with git (e.g., searches of our code base
    | got really slow because the data was also being searched,
    | until we moved the data to a different repo). Something easier
    | to use that could manage large datasets would be useful.
|
| I wonder if DoIt could be used to create a clone of Apple's
| Time Machine. Seems like the basics are there.
| pradn wrote:
| Git is too complicated. It's barely usable for daily tasks.
| Look at how many people have to Google for basic things like
| uncommitting a commit, or cleaning your local repo to mirror a
| remote one. Complexity is a liability. Mercurial has a nicer
| interface. And now I see the real simplicity of non-distributed
| source control systems. I have never actually needed to work in
| a distributed manner, just client-server. I have never sent a
| patch to another dev to patch into their local repo or whatnot.
| All this complexity seems like a solution chasing after a
| problem - at least for most developers. What works for Linux
| isn't necessary for most teams.
| ttz wrote:
| Git is used prolifically in the tech industry. What on earth
| are you talking about?
| detaro wrote:
        | Being needlessly complicated seldom stops the tech
        | industry from using something, as long as the complexity
        | is slightly out of the way.
| ttz wrote:
| "Barely usable for daily tasks"
|
| Is a pretty strong statement, especially given many tech
| companies use it exactly for this purpose.
|
| Git might have a learning curve, and sure, it's not the
| simplest. But "barely usable" is hyperbole in the face of
| actual evidence.
|
          | I'm not defending Git specifically; other VCSes are just
          | as viable. The quoted statement seems a bit ridiculous.
| strogonoff wrote:
| To me there's some irony in that all insta-criticism of Git
| in responses to my comment presumably applies to a project
| that describes itself as "Git for data" and promises exact
| reproduction of all Git command behaviour--therefore
| suffering from the same shortcomings.
| Hendrikto wrote:
| > Git is too complicated. It's barely usable for daily tasks.
| Look at how many people have to Google for basic things like
| uncommitting a commit, or cleaning your local repo to mirror
| a remote one.
|
| Cars are too complicated. They are barely usable for daily
| tasks. Look at how many people have to Google for basic
| things like changing a fan belt, or fixing cylinder head
| gasket.
|
      | You can fill in almost anything here. Most tools are
      | complicated. Yet you don't need to know their ins and outs
      | for them to be useful to you.
| yoavm wrote:
| To me it sounds like you're proving the exact opposite. I'd
| assume most car owners never need to change a fan belt
| themselves, while everyone who uses git daily needed at
| some point to revert a commit. "How to turn right" isn't
| huge on stackoverflow last time I checked...
| strogonoff wrote:
| Doit boasts its likeness to Git as a feature. Does this mean
| it'll also be barely usable for daily tasks? Is it possible
| for a project to faithfully reproduce the entirety of Git
| command interface _and_ be less complicated than Git / not
| suffer from the same shortcomings?
|
| I personally think Git isn't that bad, once it's understood.
| It could be counter-intuitive sometimes though (as an
| example, for the longest time I used Git without realizing it
| stores a snapshot of each file and diffs/deltas are only
| computed when required). Just trying to be pragmatic and not
| expecting a tool like Git to be entirely free of leaky
| abstractions.
| scottmcdot wrote:
  | Dolt might be good, but never underestimate the power of Type 2
  | Slowly Changing Dimension tables [1]. For example, if you had an
  | SSIS package that took CSVs and imported them into a database,
  | and one day you noticed it accidentally rounded a value
  | incorrectly, you could fix the data and retain traceability of
  | the data which was there originally.
|
| E.g., SSIS package writes row of data: https://imgur.com/DClXAi5
|
| Then a few months later (on 2020-08-15) we identify that
| trans_value was imported incorrectly so we update it:
| https://imgur.com/wdQJWm4
|
  | Then whenever we SELECT from the table we always ensure we are
  | extracting "today's" version of the data:
  |
  |     select * from table
  |     where TODAY between effective_from and effective_to
|
| [1] https://en.wikipedia.org/wiki/Slowly_changing_dimension
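  |
  | A sketch of that correction as SQL (columns follow the
  | screenshots above; names and values are illustrative):
  |
  |     -- close out the bad row...
  |     UPDATE trans
  |        SET effective_to = '2020-08-15'
  |      WHERE trans_id = 42
  |        AND effective_to = '9999-12-31';
  |
  |     -- ...and insert the corrected version
  |     INSERT INTO trans
  |         (trans_id, trans_value, effective_from, effective_to)
  |     VALUES
  |         (42, 0.73, '2020-08-15', '9999-12-31');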
| skybrian wrote:
| The commit log in Dolt is edit history. (When did someone
| import or edit the data? Who made the change?) It's not about
| when things happened.
|
| To keep track of when things happened, you would still need
| date columns to handle it. But at least you don't need to
| handle two-dimensional history for auditing purposes. So, in
| your example, I think the "effective" date columns wouldn't be
| needed.
|
| They have ways to query how the dataset appeared at some time
| in the past. However, with both data changes and schema changes
| being mixed together in a single commit log, I could see this
| being troublesome.
|
| I suppose writing code to convert old edit history to the new
| schema would still be possible, similar to how git allows you
| to create a new branch by rewriting an existing one.
| zachmu wrote:
| If all you want is AS OF semantics, then SCD2 is a great match.
| Used it a ton in application development myself.
|
| Dolt actually makes branch and merge possible, totally
| different beast.
| antman wrote:
  | Couldn't something like that work with SQL:2011 temporal tables?
  |
  | There's a Postgres-equivalent solution in this [0] rather
  | complex but great tutorial.
  |
  | [0]: https://clarkdave.net/2015/02/historical-records-with-
  | postgr...
| sixdimensional wrote:
| I definitely agree, just tossing in the superset concept that
| Dolt and Type 2 SCD involve - temporal databases [1].
|
  | I think the idea of a "diff" applied to datasets is quite
  | awesome, but even then, we kind of do that with databases today
  | with data comparison tools - it's just that most of them are not
  | time-aware; rather, they are used to compare data between two
  | instances of the data in different databases, not at two points
  | in time in the same database.
|
| [1] https://en.wikipedia.org/wiki/Temporal_database
| controlledchaos wrote:
| This is exactly what I was looking for!
| 1f60c wrote:
| I'm not sure if this is supposed to be Dolt or DoIt, but using a
| swear word for a name (even a relatively mild one) is pretty
| distracting, IMHO.
| stjohnswarts wrote:
    | "DoIt" isn't really a swear word. It's a euphemism for coitus.
    | It's also a very common phrase in general, so it isn't
    | strongly enough associated with "doing it" to cause any issues
    | with English speakers. Aka it's fine as a project name. See
    | Nike's "Just do it" advertising campaign; they would never
    | have gone with the phrase if it had strong negative
    | connotations.
| zachmu wrote:
    | It's DOLT. Sans serif fonts.
| drewwwwww wrote:
| presumably a riff on git, the well known famously unsuccessful
| version control system
| 1f60c wrote:
| Huh, I had no idea. (I'm not a native speaker.)
| dang wrote:
| Some related past threads:
|
| _Dolt is Git for data_ -
| https://news.ycombinator.com/item?id=22731928 - March 2020 (191
| comments)
|
| _Git for Data - A TerminusDB Technical Paper [pdf]_ -
| https://news.ycombinator.com/item?id=22045801 - Jan 2020 (5
| comments)
|
| _Ask HN: Would you use a "git for data"?_ -
| https://news.ycombinator.com/item?id=11537934 - April 2016 (10
| comments)
| weeboid wrote:
| Scanned for this comment before making it. This optically reads
| as "Dolt", not "DoIt"
| XCSme wrote:
| But it is "DOLT", right?
| zachmu wrote:
| Yup.
| coldtea wrote:
| Just optically? It's meant to be "dolt", not "do it", optically
| and semantically.
|
| It's a pun on git, which also means "stupid/unpleasant/etc
| person".
| weeboid wrote:
| ahh, got whooshed by the cool kids. dammit
| yarg wrote:
| Merging is hard, but the rest can be done with copy-on-write
| cloning (or am I missing something?).
| xkvhs wrote:
  | Well, maybe this is the ONE. But I've heard "git for data" way
  | too many times to jump on board with the latest tool. I'll wait
  | till the dust settles and there's a clear winner. Till then,
  | it's parquet, or even pickle.
| laurent92 wrote:
| Wordpress would have benefited from this.
|
| What a lot of webmasters want is, test the site locally, then
| merge it back. A lot of people turned to Jekyll or Hugo for the
| very reason that it can be checked into git, and git is reliable.
| A static website can't get hacked, whereas anyone who has been
| burnt with Wordpress security fail knows they'd prefer a static
| site.
|
  | And even more: people would like to pass the new website from
  | the designer to the customer to managers -- Wordpress might not
  | have needed to develop their approval workflows (permission
  | schemes, draft/preview/publish) if they had had a forkable
  | database.
| [deleted]
| TimTheTinker wrote:
| Sure, but Wordpress is still running PHP files for every page
| load and back-end query. Dolt would help offload some of the
| code's complexity, but that would still leave a significant
| attack surface area.
|
| In other words, by itself Dolt couldn't solve the problem of
| Wordpress _not_ being run mostly from static assets on a server
| (plus an API).
| kenniskrag wrote:
| they have a theme preview now. :)
| jimsmart wrote:
      | The parent post here is speaking to the content stored in
      | the database, not the templates on the filesystem. Dolt
      | enables one to merge, push, and pull _data_, just as one
      | would with files and git.
| 2malaq wrote:
| I read it as sarcasm.
| berkes wrote:
    | At the root of this lies the problem that content,
    | configuration, and code are stored in one blob (a single db),
    | never clearly bounded nor contained. Config is spread over
    | tables. User-generated content (from comments to orders) is
    | mixed up with editorial content: often in the same tables,
    | sometimes even in the same column. Drupal suffers the same.
    |
    | What is needed is a clear line between configuration (and
    | logic), editorial content, and user-generated content. Those
    | three have very distinct lifecycles. As long as they are
    | treated as one, the distinct lifecycles will conflict. No
    | matter how good your database merging is: the root is a
    | missing piece of architecture.
| Hallucinaut wrote:
| Very well summarised. I can see clearly you too have tried to
| apply normal software lifecycle principles and come out the
| other end with a rather dismal impression of WordPress.
| Klwohu wrote:
  | Problematic name; it could become a millstone around the neck of
  | the developer far into the future.
| ademarre wrote:
| Agreed. I couldn't immediately see if it was "DOLT" or "do it",
| as in "just do it". It's the former.
| rapnie wrote:
| I was going back and forth between the two until seeing doLt
| in terminal font.
| zachmu wrote:
| This ambiguity in sans serif fonts has actually been pretty
| annoying. Especially since GitHub doesn't let you choose
| your font on readmes and stuff.
| mzs wrote:
| just lowercase it
| dheera wrote:
| You can use <pre>...</pre> for monospace font.
| ssprang wrote:
| Reminds me of this user testing story from the early
| Macintosh days:
| https://www.folklore.org/StoryView.py?story=Do_It.txt
| TedDoesntTalk wrote:
| Already I would not use this project because of its name. I'm
| not offended by it, but I know others will be, and it will only
| be a matter of time before we have to replace it with something
| else. So why bother in the first place?
|
| I know the name is not DOLT but it is close enough to cause
| offense. Imagine the N-word with one typo. Would it still be
| offensive? Probably to some.
| simonw wrote:
| The name is DOLT with an L
| Klwohu wrote:
        | This is the issue with names. Even though the project is
        | called doit, the DoIt stylizing makes it _look_
        | problematic. It's a non-starter; hopefully the author
        | makes a big change. Just choosing lower case for the
        | project would be enough.
| edgyquant wrote:
          | It's not, though; it is dolt, meaning a stupid person,
          | like git (stupid simple version control).
| Klwohu wrote:
| Oh wow, that's awful! I thought this was just an innocent
| mistake.
| mixedCase wrote:
| > but I know others will be, and it will only be a matter of
| time before we have to replace it with something else
|
| Or we can just not give in to such insanity. That's always an
| option, and would help prevent things from getting
| increasingly worse as we cede ground to claims that
| increasingly get further and further away from the realm of
| what's reasonable.
| TedDoesntTalk wrote:
        | It's not your choice when you're an employee of a woke
        | company (unless you want to quit). Don't you know that by
        | now?
| _Wintermute wrote:
| Do you know what "git" means?
| maest wrote:
| It was most likely picked as an analogy to "git".
| gerdesj wrote:
| Dolt and git are closer to synonymous rather than analogous.
| zachmu wrote:
| This is correct. Specifically, to pay homage to git and how
| Linus named it.
| crazygringo wrote:
| This is absolutely fascinating, conceptually.
|
| However, I'm struggling to figure out a real-world use case for
| this. I'd love if anyone here can enlighten me.
|
| I don't see how it can be for production databases involving lots
| of users, because while it seems appealing as a way to upgrade
| and then roll back, you'd lose all the new data inserted in the
| meantime. When you roll back, you generally want to roll back
| changes to the schema (e.g. delete the added column) but not
| remove all the rows that were inserted/deleted/updated in the
| meantime.
|
| So does it handle use cases that are more like SQLite? E.g. where
| application preferences, or even a saved file, winds up
| containing its entire history, so you can rewind? Although that's
| really more of a temporal database -- you don't need git
| operations like branching. And you really just need to track row-
| level changes, not table schema modifications etc. The git model
| seems like way overkill.
|
| Git is built for the use case of lots of different people working
| on different parts of a codebase and then integrating their
| changes, and saving the history of it. But I'm not sure I've ever
| come across a use case for lots of different people working on
| the _data and schema_ in different parts of a database and then
| integrating their _data and schema_ changes. In any kind of
  | shared-dataset scenario I've seen, the schema is tightly locked
| down, and there's strict business logic around who can update
| what and how -- otherwise it would be chaos.
|
| So I feel like I'm missing something. What is this actually
| intended for?
|
| I wish the site explained why they built it -- if it was just
| "because we can" or if projects or teams actually had the need
| for git for data?
| jorgemf wrote:
    | Machine Learning. I don't think it has many more use cases.
| sixdimensional wrote:
| Or more simply put, how about table-driven logic in general?
| It doesn't have to be as complex as machine learning. There
| are more use cases than just machine learning, IMHO.
| jedberg wrote:
| Such as? I'm having difficulty coming up with any myself.
| zachmu wrote:
| Say, network configuration.
| sixdimensional wrote:
| See my post earlier in this thread [1].
|
| Yes you need reference data for machine learning, but the
| world isn't only about machine learning. You might want
| reference data for human-interpreted analytics, table-
| driven logic (business rule engines, for example), etc.
|
          | [1] https://news.ycombinator.com/item?id=26371748
| fiedzia wrote:
    | This won't work for the usual database use cases. This is
    | meant for interactive work with data, the same way you work
    | with code. Who needs that?
    |
    | Data scientists working with large datasets. You want to be
    | able to update data without redownloading everything. Also,
    | make your local changes (some data cleaning) and propose your
    | updates upstream the same way you would with git. Having many
    | people working interactively with data is common here.
    |
    | One of the companies I work with provided a set of data
    | distributed to their partners on a daily basis. Once it grew
    | larger, downloading everything daily became an issue. So that
    | would be desirable.
    |
    | I have a large data model that I need to deploy to production
    | and update once in a while. For code, network usage is kept to
    | a minimum because we have git. For data, options are limited.
    |
    | As with git, it is something that once you have, you will find
    | a lot of use cases that make life easier and open many new
    | doors.
| zachmu wrote:
| The application backing use case is best suited for when you
| have parts of your database that get updated periodically and
| need human review. So you have a production database that you
| serve to your customers. Then you have a branch / fork of that
| (dev) that your development team adds batches of products to.
| Once a week you do a data release: submit a PR from dev ->
| prod, have somebody review all the new copy, and merge it once
| you're happy. If there's a big mistake, just back it out again.
| We have several paying customers building products around this
| workflow.
|
| As for lots of people collaborating on data together, we have
| started a data bounties program where we pay volunteers to
| assemble large datasets. Two have completed so far, and a third
| is in progress. For the first one, we paid $25k to assemble
| precinct-level voting data for the 2016 and 2020 presidential
| elections. For the second, we paid $10k to get procedure prices
| for US hospitals. You can read about them here:
|
| https://www.dolthub.com/blog/2021-02-15-election-bounty-revi...
|
| https://www.dolthub.com/blog/2021-03-03-hpt-bounty-review/
|
| What's cool is that novices can make a really good income from
| data entry as a side gig, and it's two orders of magnitude
| cheaper than hiring a firm to build data sets for you.
|
| You're right that the site is kind of vague about what dolt is
| "for." It's a really general, very multi-purpose tool that we
| think will get used a lot of places. Here's a blog we wrote a
| while back about some of the use cases we envision.
|
| https://www.dolthub.com/blog/2020-03-30-dolt-use-cases/
| [deleted]
| saulrh wrote:
| Here's one thing I'd have used it for: Video game assets.
|
| Say you have a tabletop game engine for designing starships.
| Different settings have different lists of parts. Some settings
| are run by a game's DM, some are collaborative efforts. I ended
| up saving the lists of parts in huge JSON files and dumping
| those into git. However, for much the same reason that data
| science is often done in a REPL or notebook type interface, it
| turned out that by far the most efficient way for people to
| iterate on assets was to boot up the game, fiddle with the
| parts in-engine until things looked right, then replicate their
| changes back into the JSON. With this, we could just save the
| asset database directly.
|
    | The same reasoning should hold for effectively any dataset
    | which a) can be factored into encapsulated parts, b) isn't
    | natively linear, and c) needs multiple developers. Game assets
    | are one example, as I described above. Other datasets for
    | which that holds: ML training/testing sets, dictionaries,
    | spreadsheets, catalogs, datasets for bio papers.
| sixdimensional wrote:
| I am not associated to Dolt, but I really like the idea of Dolt
| personally. I do see use cases, but not without challenges.
|
| One of the main use cases you can see them targeting, and that
| I think makes a ton of sense, is providing tools for
| collecting, maintaining and publishing reference data sets
| using crowd sourcing.
|
| For example, they are doing this with hospital charge codes
| (a.k.a. chargemaster data). Hospitals in the US are required to
| publish this data for transparency.. however, I have never seen
| a single aggregated national (or international) data set of all
| these charges. In fact, such a data set could be worth a lot of
| money to a lot of organizations for so many reasons. I used to
| work in health insurance, gathering data from all kinds of
| sources (government rules/regs, etc.) and it was a lot of hard
| work, scraping, structuring, maintaining, etc.
|
| This reference data can be used for analytics, to power table-
| driven business logic, machine learning - to help identify cost
| inequalities, efficiencies, maybe even illicit price gouging,
| etc. There are so many reference data sets that have similar
| characteristics... and "data marketplaces" in a way are
| targeted at making "private" reference data sets available for
| sale - so then where is the "open" data marketplace? Well, here
| you go.. Dolt.
|
| I have often realized that the more ways we can make things
| collaborative, the better off we will be.
|
| Data is one of those things where, coming up with common,
| public reference datasets is difficult and there are lots of
| different perspectives ("branches"), sometimes your data set is
| missing something and it would be cool if someone could propose
| it ("pull request"), sometimes you want to compare the old and
| new version of a dataset ("diff") to see what is different.
|
| One difficult thing about Dolt is, it will only be successful
| if people are actually willing to work together to cook up and
| maintain common data sets collaboratively, or if those doing so
| have an incentive to manage an "open data" project on Dolt as
| benevolent maintainers, for example. But, I could say then it
| has the same challenges as "open source" in general, so
| therefore it is not really that different.
|
| Dolt could even be used as a foundation for a master data
| management registry - in the sense of you could pop it in as a
| "communal data fountain" if you will where anybody in your org,
| or on the Web, etc. could contribute - and you can have
| benevolent maintainers look over the general quality of the
| data. Dolt would be missing the data quality/fuzzy matching
| aspect that master data tools offer, but this is a start for
| sure.
|
| For example, I work in a giant corporation right now. Each
| department prefers to maintain its own domain data and in some
| cases duplicates common data in that domain. Imagine using Dolt
| to make it possible for all these different domains to
| collaborate on a single copy of common data in a "Dolt" data
| set - now people can share data on a single copy and use pull
| requests, etc. to have an orderly debate on what that common
| data schema and data set should look like.
|
| I think it's an idea that is very timely.
|
| P.S. Dolt maintainers, if you read this and want to talk, I'm
| game! Awesome work :)
| a-dub wrote:
| agree 100% this looks awesome for reference datasets.
|
| not so sure how well it would work for live data sources that
| update with time as it could encourage people to apply more
| ad-hoc edits as opposed to getting their version controlled
| jobs to work 100%, but who knows, maybe that would be a net
| win in some cases?
| zachmu wrote:
| We totally agree on all points! Come chat with us on our
| discord about how we can help your org solve its problems:
|
| https://discord.com/invite/RFwfYpu
| curryst wrote:
| I might look at it for work. Compliance requires us to keep
| track of who made what change when, and who it was approved by
| in case regulators need it.
|
| Right now, this often means making an MR on Git with your
| snippet of SQL, getting it approved, and then manually
| executing it. This would let us bring the apply stage in, as
| well as avoid "that SQL query didn't do exactly what I
| expected" issues.
|
| It's possible to do it within the SQL engine, but then I have
| to maintain that, which I would prefer not to do. As well as
| dealing with performance implications from that.
| zachmu wrote:
| That's a great use case, let us know how we can help. We have
| a discord if you want to chat about it:
|
| https://discord.com/invite/RFwfYpu
| martincolorado wrote:
  | I see a use-case for public resource data sets. For example,
  | right of way, land use variance, permits, and land records.
  | These are fundamentally public data sets that are often
  | maintained by entities with limited budgets, and the potential
  | (even incentive) for fraud is substantial. Also, there is
  | significant value in being able to access the data set for
  | analytical purposes such as real estate analysis.
| knbknb wrote:
  | Is something like "dolt diff master..somebranch -- mytable"
  | possible?
  |
  | Same question: are in-between-branch comparisons possible in
  | dolt? Which "diff" subcommands do you plan to support in the
  | future?
| Ericson2314 wrote:
  | What people usually miss about these things is that normal
  | version control benefits hugely from content addressing and
  | normal forms.
  |
  | The salient aspect of relational data is that it's cyclic. This
  | makes content addressing unable to provide normal forms on its
  | own (unless someone figures out how to Merkle cyclic graphs!),
  | but the normal form can still be made in other ways.
|
  | The first part is easy enough: store rows in some order.
|
| The second part is more interesting: making the choice of
| surrogate keys not matter (quotienting it away). Sorting table
| rows containing surrogate keys depending on the sorting of table
| rows makes for some interesting bags of constraints, for which
| there may be more than one fixed point.
|
  | Example:
  |
  |     CREATE TABLE Foo (
  |         a uuid PRIMARY KEY,
  |         b text,
  |         best_friend uuid REFERENCES Foo(a)
  |     );
  |
  | DB 0:
  |
  |     0 Alice 0
  |
  | 1 reclusive Alice, best friends with herself. Just fine.
  |
  | DB 1:
  |
  |     0 Alice 1
  |     1 Alice 1
  |
  | 2 reclusive Alices, both best friends with the second one. The
  | Alices are the same up to primary keys, but while primary keys
  | are to be quotiented out, primary key equality isn't, so this is
  | valid. And we have an asymmetry by which to sort.
  |
  | DB 2:
  |
  |     0 Alice 1
  |     1 Alice 0
  |
  | 2 reclusive Alices, each best friends with the other. The Alices
  | are completely isomorphic, and one notion of normal forms would
  | say this is exactly the same as DB 0: as if this is reclusive
  | Alice in a fun house of mirrors.
|
  | All this is resolvable, but it's subtle. And there's no avoiding
  | complexity. E.g. if one wants to cross-reference two human data
  | entries which each assigned their own surrogate IDs, this type
  | of analysis must be done. Likewise when merging forks of a
  | database.
|
| I'd love be wrong, but I don't think any of the folks doing "git
| for data" are planning their technology with this level of
| mathematical rigor.
| barrkel wrote:
| Consider the whole database - the whole set of facts across all
| relations - as the state in the tree. Each transaction a
| serialized delta that produces a new node in the tree, a new
| HEAD. That's closer to what's being gotten at, as I see it.
|
    | Transaction logs are already not that different to patch sets,
    | and merge conflicts have an isomorphism with replication
    | inconsistencies.
| Ericson2314 wrote:
| > Consider the whole database - the whole set of facts across
| all relations - as the state in the tree.
|
      | I tried to demonstrate that this is easier said than done.
      | Deciding the equality/redundancy of facts is very subtle. At
      | some point it might even be a guess whether your two clerks
      | each met the same Alice when entering data in their DB forks
      | or not.
|
| Transactions are just patches, I completely agree. And
| patches are just partial functions. But deciding what the
| action is of a partial function, or whether the input is in
| the domain requires a notion of equality. (Partial functions
| are encoded very nicely with incomplete pattern matching; in
| this case the question is whether Alice matches the pattern.)
|
      | Basically I know where you are coming from, and I want it to
      | work too, but you cannot just wave away the math and the
      | issues it points out.
| TOGoS wrote:
| A lot of your comment went over my head, but I have modeled
| relational data in a way conducive to being stored in a Merkle
| tree. The trick being that every entity in the system ended up
| having two IDs. A hash ID, identifying this specific version of
| the object, and an entity ID (probably a UUID or OID), which
| remained constant as new versions were added. In a situation
| where people can have friends that are also people, they are
| friends with the person long-term, not just a specific version
| of the friend, so in that case you'd use the entity IDs. Though
| you might also include a reference to the specific version of
| the person at the point in time at which they became friends,
| in which case they would necessarily reference an older version
| of that person. If you friend yourself, you're actually
| friending that person a moment ago.
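    |
    | A sketch of that two-ID modeling in SQL (my own illustration,
    | not the author's actual schema):
    |
    |     CREATE TABLE person_version (
    |         hash_id   CHAR(64) PRIMARY KEY, -- hash of this version
    |         entity_id CHAR(36) NOT NULL,    -- stable across versions
    |         name      TEXT,
    |         friend    CHAR(36)  -- an entity_id: a long-term reference
    |     );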
|
| A list of all current entities, by hash, is stored higher up
| the tree. Whether it's better that objects themselves store
| their entity ID or if that's a separate data structure mapping
| entity to hash IDs depends on the situation.
|
| On second reading I guess your comment was actually about how
| to come up with content-based IDs for objects. I guess my point
| was that in the real world you don't usually need to do that,
| because if object identity besides its content is important you
| can just give it an arbitrary ID. How often does the problem of
| differentiating between a graph containing identical Alices vs
| one with a single self-friending Alice actually come up? Is
| there any way around it other than numbering the Alices?
| Ericson2314 wrote:
| > ...The trick being that every entity in the system ended up
| having two IDs...
|
| I think we agree that this is a partial solution. Adding a
| temporal dimension and referencing immutable single-versions
| only can break cycles by making them unconstructable in the
| first place. But once an object refers to foreign entity IDs,
| hash IDs become "polluted" with surrogate values.
|
| > How often does the problem of differentiating between a
| graph containing identical Alices vs one with a single self-
| friending Alice actually come up? Is there any way around it
| other than numbering the Alices?
|
| I think it would come up with forks that have some identical
| edits, especially if those edits are in different orders. In
| that case, surrogate keygen state would get out of sync
| (whether it's counters or UUID state). Either we pessimize
| merges, or we need some way to recover.
|
      | I think allowing for "identical Alices" is probably
      | necessary in practice, but an interface should have a
      | warning of some sort about this. (Maybe ask the Alices more
      | questions until you can differentiate them? Get prepared in
      | case one of the Alices comes back and wants a new I.D. card
      | and you don't want to easily enable fraud.) Likewise when
      | merging, those warnings should be brought to the fore, along
      | with a menu of resolutions at extremes for the user to
      | decide between.
| ghusbands wrote:
| > The salient aspect of relational data is that it's cyclic
|
| This is an odd claim. Most relational data is not cyclic, and
| it's easy enough to come up with a scheme to handle cyclic data
| in a consistent fashion.
|
| Conflicting changes (two changes to the same 'cell' of a
| database table) are a much more likely issue to hit and will
| need handling in much the same way merge conflicts are
| currently handled, so there are already situations in which
| manual effort will be needed.
| chatmasta wrote:
| We have content addressable objects at Splitgraph. [0] And
| here's an example of a "point in time" query across two
| versions of an image on Splitgraph. [1]
|
| I'm on my phone and don't have a ton of time to respond right
| now, but I'd recommend reading our docs. We're working on a lot
| of what you mention.
|
| (Also, we're hiring for backend and frontend roles. See my
| comment history for more.)
|
| [0] https://www.splitgraph.com/docs/concepts/objects
|
| [1]
| https://www.splitgraph.com/workspace/ddn?layout=hsplit&query...
| qwerty456127 wrote:
  | I would rather have a "GitHub for data" - an SQL database I
  | could have hosted for free, give everybody read-only access to,
  | and give some people I choose R/W access to. That's a thing I
  | really miss.
| rsstack wrote:
    | Dolt also has a paid product called DoltHub; check it out.
| zachmu wrote:
| Free for public repositories, and for private repos under a
| gig.
___________________________________________________________________
(page generated 2021-03-07 23:02 UTC)