[HN Gopher] The state of merging technology
       ___________________________________________________________________
        
       The state of merging technology
        
       Author : bumbledraven
       Score  : 64 points
       Date   : 2023-12-13 21:41 UTC (1 days ago)
        
 (HTM) web link (bramcohen.com)
 (TXT) w3m dump (bramcohen.com)
        
       | kragen wrote:
       | some relevant context is that bram and his brother ross developed
       | an early decentralized version control system named 'codeville',
       | more or less contemporary with git and mercurial; they put a lot
       | of work into figuring out how merging should handle different
       | hairy scenarios
       | 
       | bram was deeply disappointed that the systems that got widely
       | adopted, like git, did a terrible job with these hairy scenarios,
       | since he knew that it was possible to do much better
        
         | ajb wrote:
         | Oh yeah, there was a bit of a Cambrian explosion of version
         | control systems around then, due to everyone getting fed up
         | with CVS as well as the bitkeeper debacle. Codeville, TLA,
         | monotone, vesta...
         | 
         | I quite liked the idea of vesta, which included a build system
         | and with hindsight looks a lot like nix/guix
        
           | kragen wrote:
           | yeah, bram and len featured a lot of them in codecon. but i
           | don't agree with your explanation of why it happened
           | 
           | svn was the result of everyone getting fed up with cvs;
           | basically it does the same thing as cvs, but does it in a
           | less janky way, and with atomic commits. but it still suffers
           | from cvs's design weaknesses
           | 
           | vesta was a digital research (decwrl?) project from the
           | previous millennium; peter deutsch told me about it at the
           | time, but it was still proprietary, and it took them a while
           | to be able to open-source it. it was basically a clone of
           | dsee, just like clearcase, though perhaps better done. it
           | wasn't motivated by dissatisfaction with cvs and in fact
           | couldn't do things cvs could do
           | 
           | i think the main thing that kicked off the cambrian explosion
           | wasn't 'everyone getting fed up with cvs' but rather tom lord
           | (rip) writing arch (tla, later baz and bzr) which
           | demonstrated to everyone that it was possible to do
           | enormously better than cvs/svn, with features like atomic
           | commits, forking your own branches without permission from
           | the core team, decentralization, and serving from regular ftp
           | or web servers (no special server software)
           | 
           | these design features were ideologically driven on tom's
           | part; he wanted to give ordinary users version-control tools
           | that were just as powerful as the ones used by core teams on
           | projects like freebsd and apache, motivated by the same
           | egalitarianism that had led him to become an employee of the
           | fsf
           | 
           | and i think graydon hoare's monotone was the thing that most
           | inspired the other follow-on systems, like git, mercurial,
           | codeville, fossil, maybe darcs, and maybe even baz and bzr
           | 
           | maybe kernel hackers getting experience with bitkeeper in
           | 02000 to 02005 added motivation for moving to better-than-
           | cvs-and-svn models too tho
           | 
           | shlomi fish's site from the time in question has a lot of
           | material on what was happening, including even lesser known
           | version tracking systems like aegis:
           | https://better-scm.shlomifish.org/aegis/
        
             | ajb wrote:
             | Fair enough, your history is probably more accurate.
             | 
             | I didn't know Tom Lord had died. And not very old either
             | darn it :-(
        
               | kragen wrote:
               | yeah, it's a huge loss
        
               | sitkack wrote:
               | Tom Lord has died (berkeleydailyplanet.com)
               | https://news.ycombinator.com/item?id=32155067
        
             | sockaddr wrote:
             | > 02000 to 02005
             | 
             | I think this is the first time I've seen this five-digit
             | notation used in the wild after reading about The Long Now
             | Foundation using it years ago.
        
               | ComputerGuru wrote:
               | It's how you know a kragen post on HN.
        
             | hyperthesis wrote:
             | Historically, bitkeeper directly led to git.
             | 
             | But you're saying arch, then monotone, came before
             | bitkeeper? What innovations did each provide?
             | 
             | (git's innovation was content-based addressing, so the data
             | structure does the heavy lifting. bitkeeper used sha1
             | hashes for decentralization - was that its main
             | contribution?)
             | 
             | Probably the internet enabled the decentralized version
             | control Cambrian explosion (or, at least, _a_ Cambrian
             | explosion)
             | 
             | BTW: funfact re "merging": bitkeeper had first-class
             | renaming, which git lost. A process of subtraction as well
             | as addition.
        
               | kragen wrote:
               | bitkeeper definitely led directly to git
               | 
               | but bitkeeper _predated_ arch and monotone
               | 
               | i listed arch's innovations above (some of which were
               | also in bitkeeper, though i don't think that's where tom
               | got them; atomic commits in particular were in all kinds
               | of version control systems). as i understand it, git got
               | content-based addressing from monotone. but monotone
               | didn't invent that either; merkle invented it for his
               | dissertation in 01979
               | 
               | the current version of bitkeeper (7.3.3) doesn't use sha1
               | except to import and export to git (look for yourself:
               | https://www.bitkeeper.org/downloads/7.3.3/bk-7.3.3.src.tar.g...),
               | so i think you might have that part wrong
               | 
               | the internet predated the decentralized version control
               | cambrian explosion by about 17 years, if we count from
               | the tcp/ip flag day, or by 32 years if we count from the
               | first arpanet connections. it was clearly a crucial
               | ingredient but it wasn't the limiting reagent
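
For readers unfamiliar with the content-based addressing discussed above, here is a minimal sketch of how Git names a file's contents in a SHA-1 repository: the object ID is the SHA-1 of a short header plus the bytes themselves. The function name `git_blob_id` is made up for this illustration; the header layout is the one Git uses for blobs.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes "blob <size in bytes>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Same bytes always give the same name, on any machine, with no coordination.
print(git_blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The result should match `git hash-object` on a file with the same bytes, which is why two repositories can agree on object names without talking to each other.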
        
               | hyperthesis wrote:
               | Thanks; maybe critical mass for the internet triggered
               | it?
               | 
               | oh yeah, I recall merkle now; but maybe git first applied
               | it to decentralization?
               | 
               | maybe bitkeeper doesn't use sha1 specifically, but some
               | similar hash?
        
               | kragen wrote:
               | i don't know much about how bitkeeper works, but i don't
               | think it uses secure-hash-based naming of any kind
               | 
               | i think these are some of the things that led up to the
               | transition to decentralized source control:
               | 
               | 1. linus torvalds didn't want to use cvs because the
               | social process of the linux kernel already depended on
               | being able to ship patches around willy-nilly, but did
               | want some kind of version control system
               | 
               | 2. larry mcvoy had worked on a decentralized source
               | control system called teamware, at sun, before the very
               | first version of linux
               | https://www.krsaborio.net/linux-kernel/research/2002/0528.ht...
               | so he proposed to build a decentralized version control
               | system to solve linus's problem, initially called bitsccs
               | https://lkml.org/lkml/1998/9/30/122 but later called
               | bitkeeper
               | 
               | 3. tom lord decided the way we were doing version control
               | was wrong and bad and spent years annoying the hell out
               | of everyone and building software to demonstrate that a
               | much better way was possible, and finally he convinced a
               | lot of people, who started building better versions of
               | the kinda janky arch/tla
               | 
               | 4. there had been a lot of work on merkle graphs over the
               | years, mostly for cryptographic applications, but in
               | particular in the late 90s for decentralized filesystems;
               | things like pgp, freenet, mojonation, bittorrent, and
               | tahoe-lafs were popularizing this remarkable fact of
               | being able to assign decentralized, secure names to
               | pieces of content as long as they didn't have to be
               | human-readable (a trilemma tahoe's designer zooko would
               | formalize as 'zooko's triangle' until satoshi nakamoto
               | found a solution). it may or may not be relevant that
               | merkle's foundational patent expired in 01996; i think
               | it's maybe more relevant that napster took off hugely in
               | 02000, and suddenly decentralized and peer-to-peer
               | systems that didn't have a central naming authority
               | became an extremely fashionable thing to work on
               | 
               | 5. larry got pissed off at tridge for trying to make a
               | bitkeeper-compatible system and revoked the bitkeeper
               | license after 5 years of people using it for linux. linus
               | tried a bunch of the new free-software decentralized
               | version control systems, including monotone, but none of
               | them were adequate, so he decided to make a really
               | stupid, basic version control system that would work well
               | enough, and that was git
               | 
               | 6. some kids started github, and they did a really good
               | job of building a new kind of forge, and that took a lot
               | of the pain out of using git. also because of how they
               | set up the namespace the barrier to starting a new
               | project there was much lower than on sourceforge, because
               | you could call the project, like, 'notes', and because it
               | was inside the namespace of your username there was no
               | implicit claim to be the one and only notes project for
               | the world.
               | 
               | you could definitely argue that critical mass for the
               | internet was the thing that triggered so much interest in
               | decentralized systems. but then again, zooko and ian
               | clarke had spent a decade already trying to figure out
               | how to protect human rights on the internet, and so maybe
               | they were going to build decentralized systems once they
               | figured out how, regardless of how many or how few people
               | they served. or maybe if larry hadn't revoked the
               | license, linus wouldn't have written git, and without
               | linus's superb quality of performance engineering, people
               | would have kept using svn except in cases where they
               | really needed decentralization, and maybe mercurial
               | wouldn't have become decently performant without
               | competition from git. or maybe without 9/11 the internet
               | would have developed in a totally different direction. i
               | don't really know what other paths history might have
               | taken
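
Point 4 above mentions Merkle graphs as a way to give content a secure name without a central authority. A minimal sketch of the idea follows, assuming SHA-256 and a simple binary tree; the helper names and the one-byte domain-separation prefixes are choices made for this example, not taken from any of the systems named in the thread.

```python
import hashlib
from typing import List, Sequence


def leaf_hash(data: bytes) -> bytes:
    # Prefix 0x00 marks a leaf so a leaf can never be confused with a node.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Prefix 0x01 marks an interior node built from two child hashes.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(chunks: Sequence[bytes]) -> bytes:
    """Return one hash that commits to every chunk and to their order."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level: List[bytes] = [leaf_hash(c) for c in chunks]
    while len(level) > 1:
        nxt: List[bytes] = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(node_hash(level[i], level[i + 1]))
            else:
                nxt.append(level[i])  # odd node is promoted unchanged
        level = nxt
    return level[0]
```

Anyone holding the root can check any chunk handed to them by an untrusted peer, which is the property that makes the name "secure" even though it is not human-readable.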
        
               | fmajid wrote:
               | Hi Kragen!
               | 
               | Don't forget Darcs, Bazaar and Mercurial. I think it's
               | the needs of Open Source collaboration that drove this
               | convergent evolution of DVCS, and the real conceptual
               | breakthrough was getting rid of RCS-like sequential
               | revision numbers.
               | 
               | The commercial world lagged. Certainly Apple were late
               | adopters and didn't support git in Xcode in 2010 when
               | Subversion was the only choice, and Microsoft of course
               | was a late if enthusiastic adopter because of its
               | Ballmer-era aversion to anything Linux.
               | 
               | I personally prefer Fossil and used its forge-like
               | CVStrac in the guise of Gittrac for years, but for better
               | or worse Git's tooling integration won out.
        
               | kragen wrote:
               | hi fazal!
               | 
               | i mostly agree, and certainly didn't mean to slight
               | darcs, bazaar and mercurial; darcs in particular was my
               | version tracker of choice before switching to git
               | 
               | but i think 'the needs of Open Source collaboration' are
               | somewhat more plastic and historically contingent than
               | you imply. if you read _producing open source software_
               | you'll see a snapshot of the dominant social practices
               | of open source collaboration in the world git and bazaar
               | were born into (which of course you also experienced, but
               | others reading this comment may not have). those
               | practices still survive in places, like netbsd and debian
               | 
               | arch, git, and family were designed to support a
               | _different_ set of social practices, practices that were
               | at the time marginal in part because of the practical
               | difficulty of applying them without software support. tom
               | lord's radical program was to change the software
               | landscape on which open-source collaboration happened in
               | order to make those social practices viable
               | 
               | i agree that globally sequential revision numbers are
               | incompatible with decentralization in the pre-nakamoto
               | world, because they demand consensus, and decentralized
               | consensus was infeasible until nakamoto. it's still
               | probably too costly for this purpose
        
               | justin_ wrote:
               | Linus himself has credited Monotone with the content-
               | addressing by SHA1:
               | https://marc.info/?l=git&m=114685143200012
               | 
               | I think the main issue with Monotone was the performance.
               | Linus also hates databases and C++.
               | 
               | --
               | 
               | Hoare didn't come up with this idea either, but he did
               | apply it to version control. He had potentially been
               | influenced by his earlier work on distributed file
               | systems and object systems. Here's his 1999 project
               | making use of hashes:
               | https://web.archive.org/web/20010420023937/venge.net/graydon...
               | 
               | He was in contact with Ian Clarke of Freenet fame (also
               | 1999). There seems to have been a rise in distributed and
               | encrypted communications around the time, as kragen
               | mentions in his other post.
               | 
               | BitTorrent would also come to use hashes for identifying
               | torrents in a tracker, and would come out in 2001,
               | created by Bram Cohen, the author of the post here :)
        
               | kragen wrote:
               | thanks for digging up these links
               | 
               | interestingly it does say bk used md5 in some way; i'm
               | not sure how i overlooked that when i was looking at the
               | code earlier, but indeed md5 is used in lots of places
               | (though apparently not for naming objects the way git and
               | monotone do)
               | 
               | the crucial way bittorrent uses hashes actually is for
               | identifying chunks of files (a .torrent file is mostly
               | chunk hashes by volume); that's why it was immune to the
               | poisoning attacks used against other p2p systems in the
               | early 02000s where malicious entities would send you
               | bogus data. once you had the correct .torrent file, you
               | could tell good data from bad data. using the infohash when
               | talking to the tracker is convenient but, as i understand
               | it, there isn't really a security reason for it; the
               | tracker doesn't verify you're really participating
               | productively in the swarm, it just sends your IP to other
               | peers in case you might. so there isn't a strong reason
               | to keep torrent infohashes from colliding
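
The chunk-hash check described above can be sketched roughly as follows. In the original BitTorrent metainfo format the `pieces` field is a concatenation of 20-byte SHA-1 digests, one per fixed-size piece; the function names here are invented for the example and this is not code from any client.

```python
import hashlib

SHA1_LEN = 20  # bytes per digest in the concatenated `pieces` field


def piece_hashes(data: bytes, piece_length: int) -> bytes:
    """Concatenate the SHA-1 digest of each fixed-size piece of `data`."""
    return b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )


def verify_piece(index: int, piece: bytes, pieces: bytes) -> bool:
    """Check a piece received from a peer against the trusted hash list."""
    expected = pieces[index * SHA1_LEN:(index + 1) * SHA1_LEN]
    # An out-of-range index yields a short slice and is rejected.
    return len(expected) == SHA1_LEN and hashlib.sha1(piece).digest() == expected


payload = b"0123456789" * 10
trusted = piece_hashes(payload, piece_length=32)
print(verify_piece(0, payload[:32], trusted))   # genuine piece
print(verify_piece(0, b"x" * 32, trusted))      # bogus data from a bad peer
```

Once the hash list itself is trusted, a peer that sends bogus data is caught one piece at a time, without needing to trust the peer.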
        
               | hyperthesis wrote:
               | Right, bitkeeper doesn't name files with hashes like git
               | does. But it uses sha1 (or similar) for decentralization:
               | to tell if two remote files are the same.
               | 
               | Another player is tridge's rsync, which also uses hashes
               | like that.
        
               | kragen wrote:
               | aha, thanks
        
       | kuahyeow wrote:
       | Didn't Git have a new default merge strategy, `ort`
       | https://github.com/git/git/blob/master/Documentation/RelNote... ?
        
         | juped wrote:
         | histogram is a diff algorithm
        
           | skywal_l wrote:
           | From the article:
           | 
           | > _switch the default 3 way merge algorithm to histogram_
        
       | wscott wrote:
       | The piece of BitKeeper I wish people would steal is the smerge
       | gca conflict format. See
       | https://www.bitkeeper.org/man/smerge.html Example:
       | <<<<<<< local slib.c 1.642.1.6 vs 1.645
       |      sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
       | -    assert(sc->tree);
       | -    sccs_sdelta(sc, sc->tree, file);
       | +    assert(HASGRAPH(sc));
       | +    sccs_sdelta(sc, sccs_ino(sc), file);
       | <<<<<<< remote slib.c 1.642.1.6 vs 1.642.2.1
       | -    sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
       | +    sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, p);
       |      assert(sc->tree);
       |      sccs_sdelta(sc, sc->tree, file);
       | >>>>>>>
       | 
       | Here we have a code conflict and rather than showing you what the
       | file looks like on the two sides it shows you what was changed on
       | both sides relative to the GCA. So we get two unified diffs. The
       | local side made this edit, while on the remote side we had that
       | edit. Then it is obvious how to resolve the conflict without
       | losing a change.
       | 
       | This works for the criss-cross case because that GCA is really a
       | set of common revisions merged together.
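
The idea of showing each side as a diff against the GCA can be approximated with Python's standard `difflib`. This is only a sketch of the concept: it does not reproduce smerge's output format or its handling of criss-cross merges, and `changes_vs_gca` is a name invented here. The sample lines are the ones from the example above.

```python
import difflib


def changes_vs_gca(gca, local, remote, name="file"):
    """Return each side's edits as a unified diff against the common ancestor."""
    sections = []
    for side, lines in (("local", local), ("remote", remote)):
        diff = difflib.unified_diff(
            gca, lines,
            fromfile=f"{name} (gca)", tofile=f"{name} ({side})",
            lineterm="",
        )
        sections.append("\n".join(diff))
    return "\n\n".join(sections)


GCA = [
    "sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);",
    "assert(sc->tree);",
    "sccs_sdelta(sc, sc->tree, file);",
]
LOCAL = [GCA[0], "assert(HASGRAPH(sc));", "sccs_sdelta(sc, sccs_ino(sc), file);"]
REMOTE = ["sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, p);", GCA[1], GCA[2]]

print(changes_vs_gca(GCA, LOCAL, REMOTE, name="slib.c"))
```

Reading the two diffs side by side makes the resolution mechanical: apply the remote's one-line change to the local version.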
        
         | gavinhoward wrote:
         | I'm making a VCS based on the weave.
         | 
         | Your wish will be answered; I wanted a conflict-aware format,
         | and I will definitely plunder the smerge format.
        
         | teraflop wrote:
         | Git has something similar if you turn on the "diff3" conflict
         | style, and I can't for the life of me understand why it's not
         | on by default, because there are many situations where you just
         | don't have enough information to properly resolve a merge
         | without it.
        
           | pabs3 wrote:
           | BTW zdiff3 is a newer version of that style that is slightly
           | better than diff3.
        
             | juped wrote:
             | zdiff3 is NOT "better", it's a matter of personal
             | preference. it moves common lines out of the conflicted
             | hunks, which may be highly confusing. some people use it
             | anyway
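
To make "moves common lines out of the conflicted hunks" concrete, here is a rough sketch of that trimming step. It is an illustration of the idea only, not Git's implementation, and `trim_common_lines` is a hypothetical helper.

```python
def trim_common_lines(ours, theirs):
    """Split a conflict into (prefix, ours_core, theirs_core, suffix).

    `prefix` and `suffix` are the leading and trailing lines that both
    sides share; only the cores still need to be shown as a conflict.
    """
    limit = min(len(ours), len(theirs))
    start = 0
    while start < limit and ours[start] == theirs[start]:
        start += 1
    end = 0
    # Never let the suffix overlap the lines already claimed by the prefix.
    while end < limit - start and ours[-1 - end] == theirs[-1 - end]:
        end += 1
    return (
        ours[:start],
        ours[start:len(ours) - end],
        theirs[start:len(theirs) - end],
        ours[len(ours) - end:],
    )


prefix, ours_core, theirs_core, suffix = trim_common_lines(
    ["import os", "x = 1", "run()"],
    ["import os", "x = 2", "run()"],
)
print(prefix, ours_core, theirs_core, suffix)
```

The possible confusion is visible here: the base section of the conflict is not trimmed in the same way, so the two sides no longer line up with the ancestor text shown between them.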
        
           | wscott wrote:
           | The "diff3" style looks like this:
           | <<<<<<< HEAD
           |     sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
           |     assert(HASGRAPH(sc));
           |     sccs_sdelta(sc, sccs_ino(sc), file);
           | ||||||| merged common ancestors
           |     sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
           |     assert(sc->tree);
           |     sccs_sdelta(sc, sc->tree, file);
           | =======
           |     sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, p);
           |     assert(sc->tree);
           |     sccs_sdelta(sc, sc->tree, file);
           | >>>>>>> c4892343......
           | 
           | That contains the same information, but you have to parse it
           | yourself and it is not nearly as fast to see what changed.
           | 
           | However, it is faster to edit since you don't need to remove
           | the diff markers.
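
In Git this style is selected with the `merge.conflictStyle` setting (values `merge`, `diff3`, `zdiff3`). The difference between the default layout and the diff3 layout is just the extra ancestor section, as this small formatter sketches; the function and its parameter names are invented for illustration and nothing here calls Git.

```python
def format_conflict(ours, theirs, base=None,
                    ours_label="HEAD",
                    base_label="merged common ancestors",
                    theirs_label="theirs"):
    """Render one conflict hunk; pass `base` to get the diff3-style layout."""
    lines = [f"<<<<<<< {ours_label}", *ours]
    if base is not None:
        # The ancestor section is the only thing diff3 style adds.
        lines += [f"||||||| {base_label}", *base]
    lines += ["=======", *theirs, f">>>>>>> {theirs_label}"]
    return "\n".join(lines)


print(format_conflict(["x = 1"], ["x = 2"], base=["x = 0"]))
```

With the ancestor visible, a reader can tell which side actually changed a line and which side merely kept it.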
        
           | juped wrote:
           | If you ask on / search the history of the mailing list,
           | you'll see that in complex merges with synthetic intermediate
           | parents it can produce some really gnarly output, which is
           | the main reason why (coupled with git's general
           | conservatism).
        
       | jez wrote:
       | The biggest day-to-day merge conflict annoyance I have is
       | situations like this. As far as I know, there's no solution:
       | commit aaaaaaaa
       | diff --git a/README.md b/README.md
       | --- a/README.md
       | +++ b/README.md
       | @@ -1,4 +1,4 @@
       | -<p align="center">
       | +<p align="left">
       |    <img width="200" src="logo.svg">
       |  </p>
       | 
       | commit bbbbbbbb
       | diff --git a/README.md b/README.md
       | --- a/README.md
       | +++ b/README.md
       | @@ -1,5 +1,5 @@
       |  <p align="center">
       | -  <img width="200" src="logo.svg">
       | +  <img width="345" src="logo.svg">
       |  </p>
       | 
       |  # Project
       | 
       | Commit aaaaaaaa changes some text on line 1.
       | 
       | Commit bbbbbbbb changes some unrelated text on line 2.
       | 
       | I don't want to have to manually resolve this--just merge the
       | lines. If there's a semantic conflict I'll let the tests sort it
       | out. When I've looked in the past there wasn't a merge strategy
       | that fixes this.
        
         | wscott wrote:
         | BitKeeper was able to merge that successfully by looking at the
         | revision history and seeing that the changes involved just
         | those lines. If one of them added a line between the two being
         | changed then it would still be a conflict.
         | 
         | I spent a year looking at interesting merges (and commits
         | fixing bad merges) in the Linux kernel and making a catalog of
         | interesting cases before writing 'smerge' for BitKeeper.
         | 
         | It is impossible to make a perfect merge tool, but we can do a
         | lot better than diff3.
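
For the adjacent-line case in this subthread, a deliberately naive sketch shows why the two edits are not logically in conflict. It compares the three versions line by line and only works when both sides edited lines in place, so that all three versions have the same number of lines. This is an assumption made for the example and is not how BitKeeper's smerge works, which uses the revision history.

```python
def merge_in_place_edits(base, ours, theirs):
    """Three-way merge for versions that differ only by in-place line edits.

    Returns (merged_lines, conflict_indexes). Where both sides changed the
    same line differently, our line is kept and its index is reported.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("sketch only handles edits that keep the line count")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t or t == b:
            merged.append(o)      # same edit on both sides, or only we changed it
        elif o == b:
            merged.append(t)      # only they changed it
        else:
            merged.append(o)      # both changed it differently
            conflicts.append(i)
    return merged, conflicts


BASE = ['<p align="center">', '  <img width="200" src="logo.svg">', "</p>"]
OURS = ['<p align="left">', '  <img width="200" src="logo.svg">', "</p>"]
THEIRS = ['<p align="center">', '  <img width="345" src="logo.svg">', "</p>"]

print(merge_in_place_edits(BASE, OURS, THEIRS))
```

As noted above, once one side inserts or deletes a line between the two edits, a per-line comparison like this no longer applies and a real tool has to decide whether to report a conflict.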
        
       ___________________________________________________________________
       (page generated 2023-12-14 23:01 UTC)