[HN Gopher] Google stores billions of lines of code in a single ...
___________________________________________________________________
Google stores billions of lines of code in a single repository
(2016) [pdf]
Author : jeremylevy
Score : 62 points
Date : 2023-02-12 20:11 UTC (2 hours ago)
(HTM) web link (dl.acm.org)
(TXT) w3m dump (dl.acm.org)
| zdw wrote:
| Monorepos are great... but only if you can invest in the tooling
| to handle them at scale, and most companies can't invest in that
| like Google can. Hyrum Wright-class tooling experts don't grow on
| trees.
|
| A good article to reference when this topic gets raised:
| http://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-yo...
| no_wizard wrote:
| You can get better tools now though, like Turborepo or Nx.
| They don't require the same level of investment as Bazel but
| they don't always have the same hermetic build guarantees,
| though for most it's "good enough".
| patrick451 wrote:
| You don't need Google-scale tooling to work with a mono repo
| until you are actually at Google scale. Gluing together a bunch
| of separate repos isn't exactly free either. See, for example,
| the complicated disaster Amazon has with Brazil.
|
| In the limit, there are only two options:
|
| 1. All code lives in one repo
| 2. Every function/class/entity lives in its own repo
|
| with a third state in between:
|
| 3. You accept code duplication
|
| This compromise state where some code duplication is (maybe
| implicitly) acceptable is what most people have in mind with a
| poly-repo.
|
| The problem though is that (3) is not a stable equilibrium.
| Most engineers have such a kneejerk reaction against code
| duplication that (3) is practically untenable. Even if your
| engineers are more reasonable, (3) style compromise means they
| constantly have to decide "should this code from package A be
| duplicated in package B, or split off into a new smaller
| package C, which A and B depend on". People will never agree on
| the right answer, which generates discussion and wastes
| engineering time. In my experience, the trend is almost never
| to combine repos, but always to generate more and more repos.
|
| The limiting case of a mono repo (which is basically its
| natural state) is far more palatable than the limiting case of
| poly-repo.
| ameliaquining wrote:
| This mostly seems like a problem for pure library code. If
| some bit of logic is only needed by a single independently-
| released service, then there's no reason not to put it in
| that service's repo.
| ramraj07 wrote:
| With the advent of great CI tooling like GitHub Actions, simple
| monorepos are becoming more and more viable and in fact even
| recommendable.
| dang wrote:
| Related:
|
| _Why Google Stores Billions of Lines of Code in a Single
| Repository (2016)_ -
| https://news.ycombinator.com/item?id=22019827 - Jan 2020 (121
| comments)
|
| _Why Google Stores Billions of Lines of Code in a Single
| Repository (2016)_ -
| https://news.ycombinator.com/item?id=17605371 - July 2018 (281
| comments)
|
| _Why Google stores billions of lines of code in a single
| repository (2016)_ -
| https://news.ycombinator.com/item?id=15889148 - Dec 2017 (298
| comments)
|
| _Why Google Stores Billions of Lines of Code in a Single
| Repository_ - https://news.ycombinator.com/item?id=11991479 -
| June 2016 (218 comments)
| sn_master wrote:
| Just because Google does something doesn't mean it's a good thing
| to do for anyone else. This kind of infrastructure is very
| expensive to maintain, and suffers from many flaws, like -almost-
| everyone being stuck using SDKs that are several versions behind
| the latest production one, even for the internal GCP ones.
| [deleted]
| lopkeny12ko wrote:
| There's a lot of love for monorepos nowadays, but after more than
| a decade of writing software, I still strongly believe it is an
| antipattern.
|
| 1. The single version dependencies are asinine. We are migrating
| to a monorepo at work, and someone bumped the version of an open
| source JS package that introduced a regression. The next deploy
| took our service down. Monorepos mean loss of isolation of
| dependencies between services, which is absolutely necessary for
| the stability of mission-critical business services.
|
| 2. It encourages poor API contracts because it lets anyone import
| any code in any service arbitrarily. Shared functionality should
| be exposed as a standalone library with a clear, well-defined
| interface boundary. There are entire packaging ecosystems like
| npmjs and pypi for exactly this purpose.
|
| 3. It encourages a ton of code churn with very low signal. I see
| at least one PR every week to code owned by my team that changes
| some trivial configuration, library call, or build directive,
| simply because some shared config or code changed in another part
| of the repo and now the entire repo needs to be migrated in
| lockstep for things to compile.
|
| I've read this paper, as well as watched the talk on this topic,
| and am absolutely stunned that these problems are not magnified
| by 100x at Google scale. Perhaps it's simply organizational
| inertia that prevents them from trying a more reasonable
| solution.
| chrisa wrote:
| Here's a talk version given by Rachel (one of the authors) about
| the same topic: https://www.youtube.com/watch?v=W71BTkUbdqE
| sabujp wrote:
| and my previous director is scaling github now :)
| randyrand wrote:
| iOS and Windows are "monorepos" too.
|
| The software is built daily, and everyone must be on the same
| version of every library.
|
| Under the hood there are a bunch of repos, and there are
| exceptions, but it largely operates as a monorepo.
| jbm wrote:
| Is this still the case for Windows? I remember hearing
| something like this when I was getting my BCompSci, but I
| assumed it must have changed since then.
| myhf wrote:
| (published July 2016)
| dang wrote:
| Added. Thanks!
| thunderbong wrote:
| Also, it's a PDF link
| dang wrote:
| Also added. Also thanks!
| Jtsummers wrote:
| https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...
|
| There you go, PDF free version.
| bandika wrote:
| To me it always seemed that a monorepo is a cop-out from proper
| dependency management and component versioning.
| GreedClarifies wrote:
| This is from the golden age of Google.
|
| Of particular note is that they published this many years after
| it had been shipped to their internal customers. This was not
| some position paper about "why we focus on ai" after not shipping
| any of their "breakthroughs".
| rvcdbn wrote:
| I really wish they would make this tech available via gcloud.
| Seems like it would be very popular and a great way to attract
| other gcloud business away from MS/GitHub which scales horribly.
| ameliaquining wrote:
| Beating Git's network effects sounds extremely difficult,
| especially since very few users of Git run into serious
| problems scaling it.
| grahar64 wrote:
| They tried that by making a bit of it available as a remote cloud
| builder for Bazel. It failed for some reason and they pulled
| it.
|
| I think building something that scales for one big repo is just
| a completely different problem than making it scale for a lot
| of small repos.
| seedless-sensat wrote:
| Bazel is not failing in the open source world though
| ameliaquining wrote:
| I think that maintaining a hosted service has significantly
| higher fixed costs than maintaining an open source project
| whose users are responsible for deploying it themselves. So
| a higher degree of adoption would be necessary to justify
| it.
| blindriver wrote:
| Long-term projects like this don't get any attention because
| the chances of getting a promotion from them are almost nil.
|
| And after the layoffs, it's pretty clear that no matter how
| hard you work, you can get fired so what's the point in
| dedicating your career to something like this?
| KolmogorovComp wrote:
| > Google's codebase is shared by more [...] than 25,000 Google
| software developers from dozens of offices in countries around
| the world.
|
| > Access to the whole codebase encourages extensive code sharing
| and reuse [...]
|
| Doesn't this strategy result in a great risk of massive code
| leaks from rogue employees? Even if read accesses are logged and
| the culprit is found, it's too late once the code has been published.
| ameliaquining wrote:
| Most source code just isn't that interesting or sensitive.
| scarface74 wrote:
| If you had every line of code that Google wrote, what would you
| do with it?
|
| But I found this discussion on HN.
|
| https://news.ycombinator.com/item?id=11790438
| ameliaquining wrote:
| Well, if you had the search ranking algorithms or the bot-
| detection algorithms or anything inherently adversarial like
| that, then you could do all kinds of nefarious things. But
| that stuff's locked down more tightly. Likewise with a few
| ultra-hard-tech things where the implementation's a major
| competitive edge.
| forgotusername6 wrote:
| I imagine looking for vulnerable areas of the code might be
| something people would be interested in doing. Maybe start
| with login or billing or something. You could also look at
| recent activity to spot new, unannounced projects. You could
| use blame to find who wrote what and target them for anything
| from job offers to social engineering attacks.
| ameliaquining wrote:
| Most of that information is readily available on the
| corporate intranet without having to dig through source
| code.
|
| Security-by-obscurity isn't something to rely on (again,
| except in the case of things like abuse detection where
| there's no alternative).
| yazaddaruvala wrote:
| Having worked at Google and Amazon.
|
| Honestly their systems are almost identical. Amazon just creates
| a monotonically increasing watermark outside the "repo". Google
| uses "the repo" to create the monotonically increasing watermark.
|
| Otherwise, Google calls it "merge into g3" Amazon calls it "merge
| into live".
|
| Amazon has the extra vocabulary of VersionSets/Packages/Build
| files. Google has all the same concepts, but just calls them
| Dependencies/Folders/Build files.
|
| Amazon's workflows are "git-like", Google is migrating to "git-
| like" workflows (but has a lot of unnecessary vocabulary around
| getting their Piper/Fig/Workspace/etc).
|
| I really can't tell if the specific difference between "mono-
| repo" or "multi-repo" makes much practical difference to the devs
| working on it.
| deanCommie wrote:
| No wonder no one at Google can ship anything if they
| constantly have to stop development of their feature so they can
| do mandatory upgrades of their dependencies...
| ameliaquining wrote:
| Most of that work is done by the owners of the dependencies,
| rather than the dependents.
|
| This is sometimes a problem for open source dependencies,
| though, as there isn't always anyone whose job it is to keep
| them up to date. Some amount of NIH syndrome is because
| reinventing the wheel can be less work than integrating an
| existing wheel that was designed for a different vehicle with
| different specs.
___________________________________________________________________
(page generated 2023-02-12 23:00 UTC)