[HN Gopher] Google stores billions of lines of code in a single ...
       ___________________________________________________________________
        
       Google stores billions of lines of code in a single repository
       (2016) [pdf]
        
       Author : jeremylevy
       Score  : 62 points
       Date   : 2023-02-12 20:11 UTC (2 hours ago)
        
 (HTM) web link (dl.acm.org)
 (TXT) w3m dump (dl.acm.org)
        
       | zdw wrote:
       | Monorepos are great... but only if you can invest in tooling that
       | scales to handle them, and most companies can't invest in that
       | like Google can. Hyrum Wright-class tooling experts don't grow on
       | trees.
       | 
       | A good article to reference when this topic gets raised:
       | http://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-yo...
        
         | no_wizard wrote:
         | You can get better tools now though, like Turborepo or Nx.
         | They don't require the same level of investment as Bazel, but
         | they don't always have the same hermetic build guarantees,
         | though for most it's "good enough".
        
         | patrick451 wrote:
         | You don't need Google-scale tooling to work with a monorepo
         | until you are actually at Google scale. Gluing together a bunch
         | of separate repos isn't exactly free either. See, for example,
         | the complicated disaster Amazon has with Brazil.
         | 
         | In the limit, there are only two options:
         | 
         | 1. All code lives in one repo
         | 2. Every function/class/entity lives in its own repo
         | 
         | with a third state in between:
         | 
         | 3. You accept code duplication
         | 
         | This compromise state where some code duplication is (maybe
         | implicitly) acceptable is what most people have in mind with a
         | poly-repo.
         | 
         | The problem though is that (3) is not a stable equilibrium.
         | Most engineers have such a kneejerk reaction against code
         | duplication that (3) is practically untenable. Even if your
         | engineers are more reasonable, (3) style compromise means they
         | constantly have to decide "should this code from package A be
         | duplicated in package B, or split off into a new smaller
         | package C, which A and B depend on". People will never agree on
         | the right answer, which generates discussion and wastes
         | engineering time. In my experience, the trend is almost never
         | to combine repos, but always to generate more and more repos.
         | 
         | The limiting case of a mono repo (which is basically its
         | natural state) is far more palatable than the limiting case of
         | poly-repo.
        
           | ameliaquining wrote:
           | This mostly seems like a problem for pure library code. If
           | some bit of logic is only needed by a single independently-
           | released service, then there's no reason not to put it in
           | that service's repo.
        
         | ramraj07 wrote:
         | With the advent of great CI tooling like GitHub Actions, simple
         | monorepos are becoming more and more viable and in fact even
         | recommendable.
        
       | dang wrote:
       | Related:
       | 
       |  _Why Google Stores Billions of Lines of Code in a Single
       | Repository (2016)_ -
       | https://news.ycombinator.com/item?id=22019827 - Jan 2020 (121
       | comments)
       | 
       |  _Why Google Stores Billions of Lines of Code in a Single
       | Repository (2016)_ -
       | https://news.ycombinator.com/item?id=17605371 - July 2018 (281
       | comments)
       | 
       |  _Why Google stores billions of lines of code in a single
       | repository (2016)_ -
       | https://news.ycombinator.com/item?id=15889148 - Dec 2017 (298
       | comments)
       | 
       |  _Why Google Stores Billions of Lines of Code in a Single
       | Repository_ - https://news.ycombinator.com/item?id=11991479 -
       | June 2016 (218 comments)
        
       | sn_master wrote:
       | Just because Google does something doesn't mean it's a good thing
       | to do for anyone else. This kind of infrastructure is very
       | expensive to maintain, and suffers from many flaws, like -almost-
       | everyone being stuck using SDKs that are several versions behind
       | the latest production one, even for the internal GCP ones.
        
       | [deleted]
        
       | lopkeny12ko wrote:
       | There's a lot of love for monorepos nowadays, but after more than
       | a decade of writing software, I still strongly believe it is an
       | antipattern.
       | 
       | 1. The single version dependencies are asinine. We are migrating
       | to a monorepo at work, and someone bumped the version of an open
       | source JS package that introduced a regression. The next deploy
       | took our service down. Monorepos mean loss of isolation of
       | dependencies between services, which is absolutely necessary for
       | the stability of mission-critical business services.
       | 
       | 2. It encourages poor API contracts because it lets anyone import
       | any code in any service arbitrarily. Shared functionality should
       | be exposed as a standalone library with a clear, well-defined
       | interface boundary. There are entire packaging ecosystems like
       | npmjs and pypi for exactly this purpose.
       | 
       | 3. It encourages a ton of code churn with very low signal. I see
       | at least one PR every week to code owned by my team that changes
       | some trivial configuration, library call, or build directive,
       | simply because some shared config or code changed in another part
       | of the repo and now the entire repo needs to be migrated in
       | lockstep for things to compile.
       | 
       | I've read this paper, as well as watched the talk on this topic,
       | and am absolutely stunned that these problems are not magnified
       | by 100x at Google scale. Perhaps it's simply organizational
       | inertia that prevents them from trying a more reasonable
       | solution.
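
A minimal sketch of the isolation trade-off described in point 1 above, assuming a toy model in which a dependency bump either moves one shared pin or one service's own pin. The service names, the `jslib` package, and both helper functions are hypothetical and are not taken from any real build tool.

```python
def bump_shared(shared: dict, users: dict, package: str, version: str) -> list:
    """Single-version policy: one pin, so every user of the package moves."""
    shared[package] = version
    return sorted(svc for svc, deps in users.items() if package in deps)


def bump_isolated(pins: dict, service: str, package: str, version: str) -> list:
    """Per-service pins: only the service that opted in moves."""
    pins[service][package] = version
    return [service]


# Hypothetical services; "billing" does not use jslib at all.
users = {"checkout": {"jslib"}, "search": {"jslib"}, "billing": set()}
shared = {"jslib": "1.0.0"}
moved_shared = bump_shared(shared, users, "jslib", "1.1.0")

pins = {"checkout": {"jslib": "1.0.0"}, "search": {"jslib": "1.0.0"}}
moved_isolated = bump_isolated(pins, "search", "jslib", "1.1.0")
```

Under the shared pin, one bump reaches every service that uses the package; with per-service pins, the other services stay on the old version until someone upgrades them.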
        
       | chrisa wrote:
       | Here's a talk version given by Rachel (one of the authors) about
       | the same topic: https://www.youtube.com/watch?v=W71BTkUbdqE
        
         | sabujp wrote:
         | and my previous director is scaling github now :)
        
       | randyrand wrote:
       | iOS and Windows are "monorepos" too.
       | 
       | The software is built daily, and everyone must be on the same
       | version of every library.
       | 
       | Under the hood there are a bunch of repos, and there are
       | exceptions, but it largely operates as a monorepo.
        
         | jbm wrote:
         | Is this still the case for Windows? I remember hearing
         | something like this when I was getting my BCompSci, but I
         | assumed it must have changed since then.
        
       | myhf wrote:
       | (published July 2016)
        
         | dang wrote:
         | Added. Thanks!
        
         | thunderbong wrote:
         | Also, it's a PDF link
        
           | dang wrote:
           | Also added. Also thanks!
        
           | Jtsummers wrote:
           | https://cacm.acm.org/magazines/2016/7/204032-why-google-
           | stor...
           | 
           | There you go, PDF free version.
        
       | bandika wrote:
       | To me it always seemed that a monorepo is a cop-out from proper
       | dependency management and component versioning.
        
       | GreedClarifies wrote:
       | This is from the golden age of Google.
       | 
       | Of particular note is that they published this many years after
       | it had been shipped to their internal customers. This was not
       | some position paper about "why we focus on ai" after not shipping
       | any of their "breakthroughs".
        
       | rvcdbn wrote:
       | I really wish they would make this tech available via gcloud.
       | Seems like it would be very popular and a great way to attract
       | other gcloud business away from MS/GitHub which scales horribly.
        
         | ameliaquining wrote:
         | Beating Git's network effects sounds extremely difficult,
         | especially since very few users of Git run into serious
         | problems scaling it.
        
         | grahar64 wrote:
         | They tried that by making a bit available with a remote cloud
         | builder for Bazel. It failed for some reason and they pulled
         | it.
         | 
         | I think building something that scales for one big repo is just
         | a completely different problem than making it scale for a lot
         | of small repos.
        
           | seedless-sensat wrote:
           | Bazel is not failing in the open source world though
        
             | ameliaquining wrote:
             | I think that maintaining a hosted service has significantly
             | higher fixed costs than maintaining an open source project
             | whose users are responsible for deploying it themselves. So
             | a higher degree of adoption would be necessary to justify
             | it.
        
           | blindriver wrote:
           | Long-term projects like this don't get any attention because
           | the chances of getting a promotion from it are almost nil.
           | 
           | And after the layoffs, it's pretty clear that no matter how
           | hard you work, you can get fired so what's the point in
           | dedicating your career to something like this?
        
       | KolmogorovComp wrote:
       | > Google's codebase is shared by more [...] than 25,000 Google
       | software developers from dozens of offices in countries around
       | the world.
       | 
       | > Access to the whole codebase encourages extensive code sharing
       | and reuse [...]
       | 
       | Doesn't this strategy result in a great risk of massive code
       | leaks from rogue employees? Even if read accesses are logged and
       | the culprit is found, it's too late once it's been published.
        
         | ameliaquining wrote:
         | Most source code just isn't that interesting or sensitive.
        
         | scarface74 wrote:
         | If you had every line of code that Google wrote, what would you
         | do with it?
         | 
         | But I found this discussion on HN.
         | 
         | https://news.ycombinator.com/item?id=11790438
        
           | ameliaquining wrote:
           | Well, if you had the search ranking algorithms or the bot-
           | detection algorithms or anything inherently adversarial like
           | that, then you could do all kinds of nefarious things. But
           | that stuff's locked down more tightly. Likewise with a few
           | ultra-hard-tech things where the implementation's a major
           | competitive edge.
        
           | forgotusername6 wrote:
           | I imagine looking for vulnerable areas of the code might be
           | something people would be interested in doing. Maybe start
           | with login or billing or something. You could also look at
           | recent activity to spot new, unannounced projects. You could
           | use blame to find who wrote what and target them for anything
           | from job offers to social engineering attacks.
        
             | ameliaquining wrote:
             | Most of that information is readily available on the
             | corporate intranet without having to dig through source
             | code.
             | 
             | Security-by-obscurity isn't something to rely on (again,
             | except in the case of things like abuse detection where
             | there's no alternative).
        
       | yazaddaruvala wrote:
       | Having worked at Google and Amazon.
       | 
       | Honestly their systems are almost identical. Amazon just creates
       | a monotonically increasing watermark outside the "repo". Google
       | uses "the repo" to create the monotonically increasing watermark.
       | 
       | Otherwise, Google calls it "merge into g3" Amazon calls it "merge
       | into live".
       | 
       | Amazon has the extra vocabulary of VersionSets/Packages/Build
       | files. Google has all the same concepts, but just calls them
       | Dependencies/Folders/Build files.
       | 
       | Amazon's workflows are "git-like", Google is migrating to "git-
       | like" workflows (but has a lot of unnecessary vocabulary around
       | getting their Piper/Fig/Workspace/etc).
       | 
       | I really can't tell if the specific difference between "mono-repo"
       | and "multi-repo" makes much practical difference to the devs
       | working on it.
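
A minimal sketch of the "monotonically increasing watermark" comparison above, assuming a toy model: in one case the repository's own commit counter is the watermark, in the other a counter kept outside the per-package repos pins one revision of every package. The class names, package names, and revision strings are hypothetical and do not describe Google's or Amazon's actual systems.

```python
from dataclasses import dataclass, field


@dataclass
class MonoRepo:
    """The repository's own commit counter is the watermark."""
    head: int = 0
    files: dict = field(default_factory=dict)

    def commit(self, path: str, content: str) -> int:
        self.files[path] = content
        self.head += 1
        return self.head


@dataclass
class VersionSet:
    """A counter kept outside the per-package repos is the watermark."""
    watermark: int = 0
    latest: dict = field(default_factory=dict)
    snapshots: dict = field(default_factory=dict)

    def merge(self, package: str, revision: str) -> int:
        self.latest[package] = revision
        self.watermark += 1
        # Each watermark value pins one revision of every known package.
        self.snapshots[self.watermark] = dict(self.latest)
        return self.watermark


mono = MonoRepo()
mono.commit("lib/a.py", "v1")
mono_mark = mono.commit("svc/b.py", "v1")

vs = VersionSet()
vs.merge("lib-a", "abc123")
vs_mark = vs.merge("svc-b", "def456")
```

In both models a single integer identifies a consistent state of all the code; the difference is only whether that integer lives inside the repository or next to it.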
        
       | deanCommie wrote:
       | No wonder no one at Google can ship anything if they
       | constantly have to stop development of their feature so they can
       | do mandatory upgrades of their dependencies...
        
         | ameliaquining wrote:
         | Most of that work is done by the owners of the dependencies,
         | rather than the dependents.
         | 
         | This is sometimes a problem for open source dependencies,
         | though, as there isn't always anyone whose job it is to keep
         | them up to date. Some amount of NIH syndrome is because
         | reinventing the wheel can be less work than integrating an
         | existing wheel that was designed for a different vehicle with
         | different specs.
        
       ___________________________________________________________________
       (page generated 2023-02-12 23:00 UTC)