[HN Gopher] The state of merging technology
___________________________________________________________________
The state of merging technology
Author : bumbledraven
Score : 64 points
Date   : 2023-12-13 21:41 UTC (1 day ago)
(HTM) web link (bramcohen.com)
(TXT) w3m dump (bramcohen.com)
| kragen wrote:
| some relevant context is that bram and his brother ross developed
| an early decentralized version control system named 'codeville',
| more or less contemporary with git and mercurial; they put a lot
| of work into figuring out how merging should handle different
| hairy scenarios
|
| bram was deeply disappointed that the systems that got widely
| adopted, like git, did a terrible job with these hairy scenarios,
| since he knew that it was possible to do much better
| ajb wrote:
| Oh yeah, there was a bit of a Cambrian explosion of version
| control systems around then, due to everyone getting fed up
| with CVS as well as the bitkeeper debacle. Codeville, TLA,
| monotone, vesta...
|
| I quite liked the idea of vesta, which included a build system
| and with hindsight looks a lot like nix/guix
| kragen wrote:
| yeah, bram and len featured a lot of them in codecon. but i
| don't agree with your explanation of why it happened
|
| svn was the result of everyone getting fed up with cvs;
| basically it does the same thing as cvs, but does it in a
| less janky way, and with atomic commits. but it still suffers
| from cvs's design weaknesses
|
| vesta was a digital research (decwrl?) project from the
| previous millennium; peter deutsch told me about it at the
| time, but it was still proprietary, and it took them a while
| to be able to open-source it. it was basically a clone of
| dsee, just like clearcase, though perhaps better done. it
| wasn't motivated by dissatisfaction with cvs and in fact
| couldn't do things cvs could do
|
| i think the main thing that kicked off the cambrian explosion
| wasn't 'everyone getting fed up with cvs' but rather tom lord
| (rip) writing arch (tla, later baz and bzr) which
| demonstrated to everyone that it was possible to do
| enormously better than cvs/svn, with features like atomic
| commits, forking your own branches without permission from
| the core team, decentralization, and serving from regular ftp
| or web servers (no special server software)
|
| these design features were ideologically driven on tom's
| part; he wanted to give ordinary users version-control tools
| that were just as powerful as the ones used by core teams on
| projects like freebsd and apache, motivated by the same
| egalitarianism that had led him to become an employee of the
| fsf
|
| and i think graydon hoare's monotone was the thing that most
| inspired the other follow-on systems, like git, mercurial,
| codeville, fossil, maybe darcs, and maybe even baz and bzr
|
| maybe kernel hackers getting experience with bitkeeper in
| 02000 to 02005 added motivation for moving to better-than-
| cvs-and-svn models too tho
|
| shlomi fish's site from the time in question has a lot of
| material on what was happening, including even lesser known
| version tracking systems like aegis:
| https://better-scm.shlomifish.org/aegis/
| ajb wrote:
| Fair enough, your history is probably more accurate.
|
| I didn't know Tom Lord had died. And not very old either
| darn it :-(
| kragen wrote:
| yeah, it's a huge loss
| sitkack wrote:
| Tom Lord has died (berkeleydailyplanet.com)
| https://news.ycombinator.com/item?id=32155067
| sockaddr wrote:
| > 02000 to 02005
|
| I think this is the first time I've seen this five-digit
| notation used in the wild after reading about The Long Now
| Foundation using it years ago.
| ComputerGuru wrote:
| It's how you know a kragen post on HN.
| hyperthesis wrote:
| Historically, bitkeeper directly led to git.
|
| But you're saying arch, then monotone, came before
| bitkeeper? What innovations did each provide?
|
| (git's innovation was content-based addressing, so the data
| structure does the heavy lifting. bitkeeper used sha1
| hashes for decentralization - was that its main
| contribution?)
|
| Probably the internet enabled the decentralized version
| control Cambrian explosion (or, at least, _a_ Cambrian
| explosion)
|
| BTW: funfact re "merging": bitkeeper had first-class
| renaming, which git lost. A process of subtraction as well
| as addition.
| kragen wrote:
| bitkeeper definitely led directly to git
|
| but bitkeeper _predated_ arch and monotone
|
| i listed arch's innovations above (some of which were
| also in bitkeeper, though i don't think that's where tom
| got them; atomic commits in particular were in all kinds
| of version control systems). as i understand it, git got
| content-based addressing from monotone. but monotone
| didn't invent that either; merkle invented it for his
| dissertation in 01979
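|
| For reference, a minimal sketch of content-based addressing as git
| applies it to file contents (blobs): the object name is the SHA-1
| of a short header plus the bytes themselves. The function name
| git_blob_id is made up for this illustration and is not part of
| git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the name git would give a blob holding `content`.

    git hashes the header "blob <size>\\0" followed by the raw bytes,
    so identical content always gets the identical name.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known name in every git repository.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

| Because the name depends only on the content, two repositories
| that have never communicated still agree on what to call the same
| file contents.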
|
| the current version of bitkeeper (7.3.3) doesn't use sha1
| except to import and export to git (look for yourself:
| https://www.bitkeeper.org/downloads/7.3.3/bk-7.3.3.src.tar.g...),
| so i think you might have that part wrong
|
| the internet predated the decentralized version control
| cambrian explosion by about 17 years, if we count from
| the tcp/ip flag day, or by 32 years if we count from the
| first arpanet connections. it was clearly a crucial
| ingredient but it wasn't the limiting reagent
| hyperthesis wrote:
| Thanks; maybe critical mass for the internet triggered
| it?
|
| oh yeah, I recall merkle now; but maybe git first applied
| it to decentralization?
|
| maybe bitkeeper doesn't use sha1 specifically, but some
| similar hash?
| kragen wrote:
| i don't know much about how bitkeeper works, but i don't
| think it uses secure-hash-based naming of any kind
|
| i think these are some of the things that led up to the
| transition to decentralized source control:
|
| 1. linus torvalds didn't want to use cvs because the
| social process of the linux kernel already depended on
| being able to ship patches around willy-nilly, but did
| want some kind of version control system
|
| 2. larry mcvoy had worked on a decentralized source
| control system called teamware, at sun, before the very
| first version of linux
| https://www.krsaborio.net/linux-kernel/research/2002/0528.ht...
| so he proposed to build a
| decentralized version control system to solve linus's
| problem, initially called bitsccs
| https://lkml.org/lkml/1998/9/30/122 but later called
| bitkeeper
|
| 3. tom lord decided the way we were doing version control
| was wrong and bad and spent years annoying the hell out
| of everyone and building software to demonstrate that a
| much better way was possible, and finally he convinced a
| lot of people, who started building better versions of
| the kinda janky arch/tla
|
| 4. there had been a lot of work on merkle graphs over the
| years, mostly for cryptographic applications, but in
| particular in the late 90s for decentralized filesystems;
| things like pgp, freenet, mojonation, bittorrent, and
| tahoe-lafs were popularizing this remarkable fact of
| being able to assign decentralized, secure names to
| pieces of content as long as they didn't have to be
| human-readable (a trilemma tahoe's designer zooko would
| formalize as 'zooko's triangle' until satoshi nakamoto
| found a solution). it may or may not be relevant that
| merkle's foundational patent expired in 01996; i think
| it's maybe more relevant that napster took off hugely in
| 02000, and suddenly decentralized and peer-to-peer
| systems that didn't have a central naming authority
| became an extremely fashionable thing to work on
|
| 5. larry got pissed off at tridge for trying to make a
| bitkeeper-compatible system and revoked the bitkeeper
| license after 5 years of people using it for linux. linus
| tried a bunch of the new free-software decentralized
| version control systems, including monotone, but none of
| them were adequate, so he decided to make a really
| stupid, basic version control system that would work well
| enough, and that was git
|
| 6. some kids started github, and they did a really good
| job of building a new kind of forge, and that took a lot
| of the pain out of using git. also because of how they
| set up the namespace the barrier to starting a new
| project there was much lower than on sourceforge, because
| you could call the project, like, 'notes', and because it
| was inside the namespace of your username there was no
| implicit claim to be the one and only notes project for
| the world.
|
| you could definitely argue that critical mass for the
| internet was the thing that triggered so much interest in
| decentralized systems. but then again, zooko and ian
| clarke had spent a decade already trying to figure out
| how to protect human rights on the internet, and so maybe
| they were going to build decentralized systems once they
| figured out how, regardless of how many or how few people
| they served. or maybe if larry hadn't revoked the
| license, linus wouldn't have written git, and without
| linus's superb quality of performance engineering, people
| would have kept using svn except in cases where they
| really needed decentralization, and maybe mercurial
| wouldn't have become decently performant without
| competition from git. or maybe without 9/11 the internet
| would have developed in a totally different direction. i
| don't really know what other paths history might have
| taken
| fmajid wrote:
| Hi Kragen!
|
| Don't forget Darcs, Bazaar and Mercurial. I think it's
| the needs of Open Source collaboration that drove this
| convergent evolution of DVCS, and the real conceptual
| breakthrough was getting rid of RCS-like sequential revision
| numbers.
|
| The commercial world lagged. Certainly Apple were late
| adopters and didn't support git in Xcode in 2010 when
| Subversion was the only choice, and Microsoft of course
| was a late if enthusiastic adopter because of its
| Ballmer-era aversion to anything Linux.
|
| I personally prefer Fossil and used its forge-like
| CVStrac in the guise of Gittrac for years, but for better
| or worse Git's tooling integration won out.
| kragen wrote:
| hi fazal!
|
| i mostly agree, and certainly didn't mean to slight
| darcs, bazaar and mercurial; darcs in particular was my
| version tracker of choice before switching to git
|
| but i think 'the needs of Open Source collaboration' are
| somewhat more plastic and historically contingent than
| you imply. if you read _producing open source software_
| you'll see a snapshot of the dominant social practices
| of open source collaboration in the world git and bazaar
| were born into (which of course you also experienced, but
| others reading this comment may not have). those
| practices still survive in places, like netbsd and debian
|
| arch, git, and family were designed to support a
| _different_ set of social practices, practices that were
| at the time marginal in part because of the practical
| difficulty of applying them without software support. tom
| lord's radical program was to change the software
| landscape on which open-source collaboration happened in
| order to make those social practices viable
|
| i agree that globally sequential revision numbers are
| incompatible with decentralization in the pre-nakamoto
| world, because they demand consensus, and decentralized
| consensus was infeasible until nakamoto. it's still
| probably too costly for this purpose
| justin_ wrote:
| Linus himself has credited Monotone with the content-
| addressing by SHA1:
| https://marc.info/?l=git&m=114685143200012
|
| I think the main issue with Monotone was the performance.
| Linus also hates databases and C++.
|
| --
|
| Hoare didn't come up with this idea either, but he did
| apply it to version control. He had potentially been
| influenced by his earlier work on distributed file
| systems and object systems. Here's his 1999 project
| making use of hashes:
| https://web.archive.org/web/20010420023937/venge.net/graydon...
|
| He was in contact with Ian Clarke of Freenet fame (also
| 1999). There seems to have been a rise in distributed and
| encrypted communications around the time, as kragen
| mentions in his other post.
|
| BitTorrent would also come to use hashes for identifying
| torrents in a tracker, and would come out in 2001,
| created by Bram Cohen, the author of the post here :)
| kragen wrote:
| thanks for digging up these links
|
| interestingly it does say bk used md5 in some way; i'm
| not sure how i overlooked that when i was looking at the
| code earlier, but indeed md5 is used in lots of places
| (though apparently not for naming objects the way git and
| monotone do)
|
| the crucial way bittorrent uses hashes actually is for
| identifying chunks of files (a .torrent file is mostly
| chunk hashes by volume); that's why it was immune to the
| poisoning attacks used against other p2p systems in the
| early 02000s where malicious entities would send you
| bogus data. once you had the correct .torrent file, you
| could tell good data from bad data. using the infohash when
| talking to the tracker is convenient but, as i understand
| it, there isn't really a security reason for it; the
| tracker doesn't verify you're really participating
| productively in the swarm, it just sends your IP to other
| peers in case you might. so there isn't a strong reason
| to keep torrent infohashes from colliding
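|
| A rough sketch of the chunk check described above, assuming SHA-1
| piece hashes as in the original BitTorrent protocol. The 4-byte
| piece length and the function names are illustrative only; a real
| client reads the hash list out of the .torrent file and uses much
| larger pieces.

```python
import hashlib
from typing import List

PIECE_LENGTH = 4  # tiny, for illustration only


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> List[bytes]:
    """Split data into fixed-size pieces and hash each one."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def verify_piece(index: int, piece: bytes, expected: List[bytes]) -> bool:
    """Accept a piece from an untrusted peer only if its hash matches."""
    return hashlib.sha1(piece).digest() == expected[index]


# The publisher computes the hash list once; downloaders check every piece.
hashes = piece_hashes(b"abcdefghij")
print(verify_piece(0, b"abcd", hashes))  # True: genuine piece
print(verify_piece(0, b"abcX", hashes))  # False: bogus data is rejected
```

| The check is per piece, so bad data from one peer costs at most
| one piece's worth of re-download.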
| hyperthesis wrote:
| Right, bitkeeper doesn't name files with hashes like git
| does. But it uses sha1 (or similar) for decentralization:
| to tell if two remote files are the same.
|
| Another player is tridge's rsync, which also uses hashes
| like that.
| kragen wrote:
| aha, thanks
| kuahyeow wrote:
| Didn't Git have a new default merge strategy, `ort`
| https://github.com/git/git/blob/master/Documentation/RelNote... ?
| juped wrote:
| histogram is a diff algorithm
| skywal_l wrote:
| From the article:
|
| > _switch the default 3 way merge algorithm to histogram_
| wscott wrote:
| The piece of BitKeeper I wish people would steal is the smerge
| gca conflict format. See
| https://www.bitkeeper.org/man/smerge.html Example:
|     <<<<<<< local slib.c 1.642.1.6 vs 1.645
|       sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
|     - assert(sc->tree);
|     - sccs_sdelta(sc, sc->tree, file);
|     + assert(HASGRAPH(sc));
|     + sccs_sdelta(sc, sccs_ino(sc), file);
|     <<<<<<< remote slib.c 1.642.1.6 vs 1.642.2.1
|     - sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
|     + sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, p);
|       assert(sc->tree);
|       sccs_sdelta(sc, sc->tree, file);
|     >>>>>>>
|
| Here we have a code conflict and rather than showing you what the
| file looks like on the two sides it shows you what was changed on
| both sides relative to the GCA. So we get two unified diffs. The
| local side made this edit, while on the remote side we had that
| edit. Then it is obvious how to resolve the conflict without
| losing a change.
|
| This works for the criss-cross case because that GCA is really a
| set of common revisions merged together.
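|
| A rough sketch of the idea behind this format, not BitKeeper's
| implementation: given the common ancestor (GCA) and both sides,
| show each side as a unified diff against the ancestor. It uses
| Python's difflib; gca_conflict_view and the marker text are
| hypothetical names chosen for this example.

```python
import difflib
from typing import List


def gca_conflict_view(base: List[str], local: List[str],
                      remote: List[str]) -> str:
    """Render a conflict as two diffs relative to the common ancestor."""

    def diff_body(side: List[str]) -> List[str]:
        # n=len(base) keeps all context, so the diff is a single hunk.
        lines = list(difflib.unified_diff(base, side, lineterm="",
                                          n=len(base)))
        # Drop the '---', '+++' and '@@' header lines.
        return lines[3:]

    out = ["<<<<<<< local"]
    out += diff_body(local)
    out.append("<<<<<<< remote")
    out += diff_body(remote)
    out.append(">>>>>>>")
    return "\n".join(out)


base = ["sc = sccs_init(file, flags, s->proj);",
        "assert(sc->tree);"]
local = ["sc = sccs_init(file, flags, s->proj);",
         "assert(HASGRAPH(sc));"]
remote = ["sc = sccs_init(file, flags, p);",
          "assert(sc->tree);"]
print(gca_conflict_view(base, local, remote))
```

| Each section then reads as "what this side changed", which is the
| property the smerge format is praised for above.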
| gavinhoward wrote:
| I'm making a VCS based on the weave.
|
| Your wish will be answered; I wanted a conflict-aware format,
| and I will definitely plunder the smerge format.
| teraflop wrote:
| Git has something similar if you turn on the "diff3" conflict
| style, and I can't for the life of me understand why it's not
| on by default, because there are many situations where you just
| don't have enough information to properly resolve a merge
| without it.
| pabs3 wrote:
| BTW zdiff3 is a newer version of that style that is slightly
| better than diff3.
| juped wrote:
| zdiff3 is NOT "better", it's a matter of personal
| preference. it moves common lines out of the conflicted
| hunks, which may be highly confusing. some people use it
| anyway
| wscott wrote:
| The "diff3" style looks like this:
|     <<<<<<< HEAD
|     sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
|     assert(HASGRAPH(sc));
|     sccs_sdelta(sc, sccs_ino(sc), file);
|     ||||||| merged common ancestors
|     sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
|     assert(sc->tree);
|     sccs_sdelta(sc, sc->tree, file);
|     =======
|     sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, p);
|     assert(sc->tree);
|     sccs_sdelta(sc, sc->tree, file);
|     >>>>>>> c4892343......
|
| That contains the same information, but you have to parse it
| yourself and it is not nearly as fast to see what changed.
|
| However, it is faster to edit since you don't need to remove
| the diff markers.
| juped wrote:
| If you ask on / search the history of the mailing list,
| you'll see that in complex merges with synthetic intermediate
| parents it can produce some really gnarly output, which is
| the main reason why (coupled with git's general
| conservatism).
| jez wrote:
| The biggest day-to-day merge conflict annoyance I have is
| situations like this. As far as I know, there's no solution:
|     commit aaaaaaaa
|     diff --git a/README.md b/README.md
|     --- a/README.md
|     +++ b/README.md
|     @@ -1,4 +1,4 @@
|     -<p align="center">
|     +<p align="left">
|        <img width="200" src="logo.svg">
|      </p>
|
|     commit bbbbbbbb
|     diff --git a/README.md b/README.md
|     --- a/README.md
|     +++ b/README.md
|     @@ -1,5 +1,5 @@
|      <p align="center">
|     -  <img width="200" src="logo.svg">
|     +  <img width="345" src="logo.svg">
|      </p>
|
|      # Project
|
| Commit aaaaaaaa changes some text on line 1.
|
| Commit bbbbbbbb changes some unrelated text on line 2.
|
| I don't want to have to manually resolve this--just merge the
| lines. If there's a semantic conflict I'll let the tests sort it
| out. When I've looked in the past there wasn't a merge strategy
| that fixes this.
| wscott wrote:
| BitKeeper was able to merge that successfully by looking at the
| revision history and seeing that the changes involved just
| those lines. If one of them added a line between the two being
| changed then it would still be a conflict.
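|
| To make the adjacent-line case concrete, here is a deliberately
| naive sketch. It is not how BitKeeper does it (that relies on the
| revision history, as described above). This simplification merges
| line by line and only works when neither side added or removed
| lines, which mirrors the restriction just mentioned. The function
| name merge_same_length is hypothetical.

```python
from typing import List


def merge_same_length(base: List[str], ours: List[str],
                      theirs: List[str]) -> List[str]:
    """Three-way merge, line by line, for edits that keep the line count."""
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("line counts differ; use a real merge tool")
    merged = []
    for b, o, t in zip(base, ours, theirs):
        if o == t:        # unchanged, or both sides made the same edit
            merged.append(o)
        elif o == b:      # only theirs touched this line
            merged.append(t)
        elif t == b:      # only ours touched this line
            merged.append(o)
        else:             # both sides changed the same line differently
            raise ValueError(f"conflict on line: {b!r}")
    return merged


base = ['<p align="center">', '  <img width="200" src="logo.svg">', '</p>']
ours = ['<p align="left">', '  <img width="200" src="logo.svg">', '</p>']
theirs = ['<p align="center">', '  <img width="345" src="logo.svg">', '</p>']
print("\n".join(merge_same_length(base, ours, theirs)))
```

| Edits to neighbouring lines merge cleanly here; only two different
| edits to the same line are reported as a conflict.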
|
| I spent a year looking at interesting merges (and commits
| fixing bad merges) in the Linux kernel and making a catalog of
| interesting cases before writing 'smerge' for BitKeeper.
|
| It is impossible to make a perfect merge tool, but we can do a
| lot better than diff3.
___________________________________________________________________
(page generated 2023-12-14 23:01 UTC)