[HN Gopher] Make your monorepo feel small with Git's sparse index
___________________________________________________________________
Make your monorepo feel small with Git's sparse index
Author : CRConrad
Score : 141 points
Date : 2021-11-11 15:27 UTC (7 hours ago)
(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)
| harvie wrote:
| I hope one day all of this will be as easy as in SVN, e.g.:
|
| I have repository https://example.com/myrepo
|
| And I can simply do:
|
| svn co https://example.com/myrepo/some/directory/
|
| And I can work with that subdirectory as if it were an actual repo.
| Completely transparently.
|
| This I really miss in Git.
| Gigachad wrote:
| I'm working in hell right now. My current company has the site
| frontend, backend, and tests in separate repos, and it's
| basically impossible to do anything without force merging,
| because the build stays broken due to a chicken-and-egg
| situation between the 3 pull requests.
| laurent123456 wrote:
| I worked at a company that not only did that, but also
| decided to split the main web app into multiple repos, one
| per country. It was so much fun to do anything in this
| project.
| xorcist wrote:
| Now, _that's_ a microservice if there ever was one!
| williamvds wrote:
| With shallow checkouts cloning is much quicker. You could try
| combining it with sparse checkouts too. You can even have Git
| fetch the full history in the background, and from a quick test
| you can do stuff like commit while it's fetching. Obviously the
| limited history means commands like log and blame will be
| inaccurate until it's done.
|
|          $ git clone --depth=1 <url>
|          $ cd repo
|          $ git fetch --unshallow &
|          $ <do work>
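|
| For the sparse-checkout combination mentioned above, a minimal
| sketch (the directory name is a placeholder; see git-sparse-
| checkout(1) for details):
|
|          $ git clone --depth=1 --sparse <url>
|          $ cd repo
|          $ git sparse-checkout set some/directory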
| zwieback wrote:
| Yeah, that's really the one thing I miss from my SVN days. I'm
| also still using Perforce, which can do even crazier things
| with workspace mappings.
| haberman wrote:
| > The index file stores a list of every file at HEAD, along with
| the object ID for its blob and some metadata. This list of files
| is stored as a flat list and Git parses the index into an array.
|
| I'm surprised that the index is not hierarchical, like tree
| objects in Git's object storage.
|
| With tree objects (https://git-scm.com/book/en/v2/Git-Internals-
| Git-Objects#_tr...), each level of the hierarchy is a separate
| object. So you would only need to load directories that are
| interesting to you. You could use a single hash compare to
| determine that two directories are identical without actually
| recursing into them.
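|
| For illustration, the difference is easy to see with plumbing
| commands (the path below is a placeholder):
|
|          $ git ls-files --stage        # index: one flat entry per file
|          $ git ls-tree HEAD            # root tree: one entry per top-level file or dir
|          $ git ls-tree HEAD:some/dir   # each subdirectory is its own tree object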
|
| In particular, I can't understand why you would need a full list
| of all files to create a commit. If your commit is known not to
| touch certain directories, it should be able to simply refer to
| the existing tree object without loading or expanding it.
|
| I guess that's what this sparse-index work is doing. I'm just
| surprised it didn't already work that way.
| arxanas wrote:
| It makes more sense if you think of the index as a structure
| meant specifically to speed up `git status` operations. (It was
| originally called the "dircache"! See https://github.com/git/gi
| t/commit/5adf317b31729707fad4967c1a...) We desperately want to
| reduce the number of file accesses we have to make, so directly
| using the object database and a tree object (or similar
| structures) would more than double file accesses.
|
| There's performance-related metadata in the index which isn't
| in tree objects. For example, the modified-time of a given file
| exists in its index entry, which can be used to avoid reading
| the file from disk if it seems to be unmodified. If you have to
| do a disk lookup to decide whether to read a file from disk,
| then the overhead is potentially as much as the operation
| itself.
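|
| You can inspect that cached stat data directly (the path is a
| placeholder):
|
|          $ git ls-files --debug -- some/file.c   # prints ctime/mtime/dev/ino/size for the entry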
|
| There's also semantic metadata, such as which stage the file is
| in (for merge conflict resolution).
|
| It's worth noting that you can turn on the cache tree extension
| (https://git-scm.com/docs/index-format#_cache_tree) in order to
| speed up commit operations. It doesn't replace objects in the
| index with trees, but it does keep ranges of the index cached,
| if they're known to correspond to a tree.
| junon wrote:
| What I'd really like to see is for Git to gain the ability to
| consolidate repeated submodules into a single set of objects
| in the super-repository. Currently, cloning the same submodule
| at multiple paths results in a full copy of the repository for
| each path, which is absurd.
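|
| As a partial workaround today (not the consolidation described
| above), submodule clones can at least borrow objects from a
| local mirror via --reference; the paths here are placeholders:
|
|          $ git clone --mirror <submodule-url> /path/to/mirror.git
|          $ git submodule update --init --reference /path/to/mirror.git path/to/submodule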
|
| It's been something on my list to address on the mailing lists
| for a while, just haven't had time.
| arxanas wrote:
| The index as a data structure is really starting to show its age,
| especially as developers adapt Git to monorepo scale. It's really
| fast for repositories up to a certain size, but big tech
| organizations grow exponentially, and start to suffer performance
| issues. At some point, you can't afford to use a data structure
| that scales with the size of the repo, and have to switch to one
| that scales with the size of the user's change.
|
| I spent a good chunk of time working around the lack of sparse
| indexes in libgit2, which produced speedups on the order of 500x
| for certain operations, because reading and writing the entire
| index is unnecessary for most users of a monorepo:
| https://github.com/libgit2/libgit2/issues/6036. I'm excited to
| see sparse indexes make their way into Git proper.
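|
| For anyone wanting to try it, the setup described in the
| article is roughly this (the directory is a placeholder, and it
| needs a recent Git):
|
|          $ git sparse-checkout init --cone --sparse-index
|          $ git sparse-checkout set some/directory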
|
| Shameless plug: I'm working on improving monorepo-scale Git
| tooling at https://github.com/arxanas/git-branchless, such as
| with in-memory rebases: https://blog.waleedkhan.name/in-memory-
| rebases/. Try it out if you work in a Git monorepo.
| stormbrew wrote:
| > I'm working on improving monorepo-scale Git tooling at
| https://github.com/arxanas/git-branchless
|
| I'm intrigued by this but the readme could maybe use some work
| to describe how you envision it being used day-to-day? All the
| examples seem to be about using it to fix things but I'm not at
| all clear how it helps enable a new workflow.
|
| Even if it was just a link to a similar tool?
| arxanas wrote:
| Thanks for the feedback. I also received this request today
| to document a relevant workflow:
| https://github.com/arxanas/git-branchless/issues/210. If you
| want to be notified when I write the documentation (hopefully
| today?), then you can watch that issue.
|
| There's a decent discussion here on "stacked changes":
| https://docs.graphite.dev/getting-started/why-use-stacked-
| ch..., with references to other articles. This workflow is
| sometimes called development via "patch stack" or "stacked
| diffs". But that's just a part of the workflow which git-
| branchless enables.
|
| The most similar tool would be Mercurial as used at large
| companies (and in fact, `git-branchless` is, for now, just
| trying to get to feature parity with it). But I don't know if
| the feature set which engineers rely on is documented
| anywhere publicly.
|
| I use git-branchless 1) simply to scale to a monorepo,
| because `git move` is a lot faster than `git rebase`, and 2)
| to do highly speculative work and jump between many different
| approaches to the same problem (a kind of breadth-first
| search). I always had this problem with Git where I wanted to
| make many speculative changes, but branch and stash
| management got in the way. (For example, it's hard to update
| a commit which is a common ancestor of two or more branches.
| `git move` solves this.) The branchless workflow lets me be
| more nimble and update the commit graph more deftly, so that
| I can do experimental work much more easily.
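|
| A rough sketch of the common-ancestor case mentioned above
| (check the git-branchless docs for the exact commands):
|
|          $ git checkout <ancestor-commit>
|          $ git commit --amend
|          $ git restack    # rewrite descendant commits/branches onto the amended commit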
| rq1 wrote:
| What I was looking for recently is a way to do a "sparse push"
| and trigger a chain reaction with hooks.
|
| I didn't find anything interesting.
| speedgoose wrote:
| "One of the biggest traps for smart engineers is optimizing
| something that shouldn't exist."
|
| Elon Musk.
| joconde wrote:
| Was he talking about something specific?
| junon wrote:
| He was talking about the battery vibrator plates or something
| in Tesla cars.
| plopz wrote:
| I remember him saying something like that during a
| walkthrough of the base where they build Starship, and if I
| recall correctly it was in reference to overengineering
| something about the grid fins.
| solarmist wrote:
| Let's make the snarky comment into a helpful comment. Why do
| you think it shouldn't exist?
| speedgoose wrote:
| Monorepos create more issues than what they solve.
| solarmist wrote:
| Such as? That's just parroting "common opinion" otherwise.
| speedgoose wrote:
| The second paragraph of the article we are discussing,
| for example.
|
| But you can find a list on Wikipedia and form your own
| opinion: https://en.m.wikipedia.org/wiki/Monorepo
| jeremyjh wrote:
| This is no different from saying "monorepo bad". Aside from
| performance issues in git, why would a monorepo be bad? It
| seems very natural to me to have a whole system referenced
| with a single branch/tag that must all pass CI together.
| Otherwise supporting projects can introduce breaking
| changes downstream that are not apparent before they hit
| master.
| tambourine_man wrote:
| Supporting evidence?
| speedgoose wrote:
| The Wikipedia article about monorepos has a good summary
|
| https://en.m.wikipedia.org/wiki/Monorepo
|
| Then you can form your own opinion; I'm sharing mine.
| jeremyjh wrote:
| Apart from performance issues, that article offers more
| (and more significant) advantages than it does drawbacks,
| so it really does not support your statement.
| ratww wrote:
| That's completely false. Monorepos don't really create
| issues at all when done properly, and when used in
| situations where they make sense.
|
| At smaller scales, for example, they're fantastic for
| productivity, and my company is not looking back.
| tsimionescu wrote:
| So the Linux devs have no idea how to properly use Git?
| speedgoose wrote:
| I'm not sure whether the Linux kernel git repository
| qualifies as a monorepo.
| anon9001 wrote:
| This is well written and deserves my upvote, because sparse-
| checkout is part of git and knowing how it works is useful.
|
| That said, there's absolutely no reason to structure your code in
| a monorepo.
|
| Here's what I think GitHub is doing:
|
| 1) Encourage monorepo adoption
|
| 2) Build tooling for monorepos
|
| 3) Sell tooling to developers stranded in monorepos
|
| Microsoft, which owns GitHub, created the microsoft/git fork
| linked in the article, and they explain their justification here:
| https://github.com/microsoft/git#why-is-this-fork-needed
|
| > Well, because Git is a distributed version control system, each
| Git repository has a copy of all files in the entire history. As
| large repositories, aka monorepos grow, Git can struggle to
| manage all that data. As Git commands like status and fetch get
| slower, developers stop waiting and start switching context. And
| context switches harm developer productivity.
|
| I believe that Google's brand is so big that it led to this mass
| cognitive dissonance, which is being exploited by GitHub.
|
| To be clear, here are the two ideas in conflict:
|
| * Git is decentralized and fast, and Google famously doesn't use
| it.
|
| * Companies want to use "industry standard" tech, and Google is
| the standard for success.
|
| Now apply those observations to a world where your engineers only
| use "git".
|
| The result is market demand to misuse git for monorepos, which
| Microsoft is pouring huge amounts of resources into enabling via
| GitHub.
|
| It makes great sense that GitHub wants to lean into this. More
| centralization and being more reliant on GitHub's custom tooling
| is obviously better for GitHub.
|
| It just so happens that GitHub is building tools to enable
| monorepos, essentially normalizing their usage.
|
| Then GitHub can sell tools to deal with your enormous monorepo,
| because your traditional tools will feel slow and worse than
| GitHub's tools.
|
| In other words, GitHub is propping up the failed monorepo idea as
| a strategy to get people in the pipeline for things like
| CodeSpaces: https://github.com/features/codespaces
|
| Because if you have 100 projects and they're all separate, you
| can do development locally for each and it's fast and sensible.
| But if all your projects are in one repo, the tools grind to a
| halt, and suddenly you need to buy a solution that just works to
| meet your business goals.
| jeffbee wrote:
| > Git is ... fast, and Google ... doesn't use it.
|
| Everything about git is orders of magnitude slower than the
| monorepo in use at Google. Git is not fast, and its slowness
| scales with the size of your repo.
| tsimionescu wrote:
| Monorepos are much easier for everyone to use, and are the only
| natural way to manage code for any project. You keep talking
| about Google, but a much more famous monorepo is Linux itself.
| Perhaps Linus Torvalds has fallen into Google's hype?
|
| The fact that git is very poor at scaling monorepos might mean
| that it's a bad idea to use git for larger organizations, not
| that it's a bad idea to use monorepos. If git can be improved
| to work with monorepos, all the better.
| anon9001 wrote:
| > Monorepos are much easier for everyone to use, and are the
| only natural way to manage code for any project.
|
| I strongly disagree with that, but I'll let this blog post
| explain it better than I can:
| https://medium.com/@mattklein123/monorepos-please-
| dont-e9a27...
|
| > You keep talking about Google, but a much more famous
| monorepo is Linux itself.
|
| I thought it was fairly well known that monorepos came
| directly from Google as part of their SRE strategy. The term
| didn't even come into common usage until around 2017 (according
| to Wikipedia). If I'm remembering correctly, the SRE book
| recommends it, and that's why it gained popularity.
|
| Also, I don't believe that Linux is a valid interpretation of
| "monorepo". Linux is a singular product. You can't build the
| kernel without all of the parts.
|
| A better example would be if there was a "Linus" repo that
| contained both git and linux. There isn't, and for good
| reason.
|
| > The fact that git is very poor at scaling monorepos might
| mean that it's a bad idea to use git for larger
| organizations, not that it's a bad idea to use monorepos. If
| git can be improved to work with monorepos, all the better.
|
| Any performance improvement in git is welcome, but anything
| that sacrifices a full clone of the entire repository is
| antithetical to decentralization.
|
| The whole point of git is decentralized source code.
| solarmist wrote:
| Monorepos (up to a certain size where git starts getting
| too slow) are easier to use unless you have sufficient
| investment into dev tooling.
|
| I think "monorepo" here is shorthand for large, complex
| repos with long histories that git does not scale well to,
| whether or not they hold all of an organization's code.
| For example, I'd call the Windows OS repo a monorepo in
| all of the ways that matter.
| howinteresting wrote:
| > The whole point of git is decentralized source code.
|
| The "whole point of git" is to provide value to its users.
| Full decentralization is not necessary for that.
| dataangel wrote:
| > Also, I don't believe that Linux is a valid
| interpretation of "monorepo". Linux is a singular product.
| You can't build the kernel without all of the parts.
|
| But it's also larger scale than the vast majority of
| startups will ever reach. My work has had the same monorepo
| for 8 years, with over 100 employees now, and git has had few
| problems.
| cdcarter wrote:
| I think it's at least somewhat fair to call Linux a
| monorepo. There are a lot of drivers included in the main
| tree. They don't need to be (we know this because there
| are also lots of drivers outside the source tree). But by
| including them, the kernel devs can make large changes to
| the API and all the drivers in one go. This is a classic
| "why use a monorepo" example.
| ajkjk wrote:
| Very much doubt that's their corporate strategy. More likely
| it's as simple as: lots of people have monorepos; they have
| lots of issues with Git and GitHub; GitHub wants their
| business.
| ratww wrote:
| _> That said, there's absolutely no reason to structure your
| code in a monorepo._
|
| Bullshit. There are very good reasons to use it in some
| situations. My company is using it and it's a tremendous
| productivity boon. And Git works _perfectly fine_ for smaller
| scales.
|
| Obviously, "because Google does it" is a terrible reason. But
| it's disingenuous to say that's the only reason people are
| doing it. Not everyone is a moron.
| anon9001 wrote:
| I'm glad you're having a good experience now, and git as a
| monorepo will work fine at smaller scales, but you will
| outgrow it at some point.
|
| When you do, you have two choices. You can either commit to
| the monorepo direction and start using non-standard tooling
| that sacrifices decentralization, or you can break up your
| repo into smaller manageable repos.
|
| I don't have any problem with small organizations throwing
| everything into one git repo because it's convenient.
|
| My objection is that when you eventually do hit the limits of
| git, will you choose to break the fundamentals of git
| decentralization as a workaround? Or will you break up the
| repo into a couple of other repos with specific purposes?
|
| I don't like that GitHub makes money by encouraging people to
| make the wrong choice at that juncture.
| ratww wrote:
| When I hit the limits of git then I will worry about it.
|
| One of our tasks when building the monorepo was proving it
| was possible to split it again. It was trivial and we have
| tools to help us avoid complexity.
|
| We're not using Github so that part doesn't apply to me.
|
| Also, nice of you to assume we'll get to Google scale, but
| thanks to the monorepo I've been able to make a few pull
| requests reducing duplication and cutting the app's line
| count by thousands. So I really don't see us getting to
| Google scale anytime soon. We're downsizing.
|
| I also find it ironic that you're accusing people of
| "copying Google" in a parent post but you're the one
| assuming that everyone will hit Google limits...
| anon9001 wrote:
| If you ever do hit a git limit where it's no longer
| comfortable to keep the whole repo on each developer
| machine, I would encourage you to split up the repo into
| separate project-based repos rather than switching to
| Microsoft's git fork.
|
| As a best practice, there's a reason that Linus started
| git in a separate repo, rather than as part of the Linux
| project. The reason is that if you put too many projects
| into one git repo, and it gets too large, you do
| eventually hit a scale where it becomes a problem.
|
| A very simple way to mitigate that is to keep each
| project in its own repo, which you can easily do once you
| start hitting git scale problems.
|
| Thankfully, one of the original git use cases was to
| decompose huge svn repos into smaller git repos, so the
| tooling required is already built in.
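|
| A minimal sketch of splitting one project out (paths are
| placeholders; newer tools like git-filter-repo do the same job
| faster):
|
|          $ git clone <monorepo-url> project-only
|          $ cd project-only
|          $ git filter-branch --subdirectory-filter some/project -- --all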
|
| > I honestly find it ironic that you're accusing people
| of "copying google" in a parent post but you're the one
| assuming that everyone will hit Google limits...
|
| I think you got the wrong take there. I'm saying that
| Google's monorepo approach is only valid because they
| invested so heavily into building custom tooling to
| handle it. We don't have access to those tools and
| therefore shouldn't use their monorepo approach.
|
| If you're going to use git, you're going to have the most
| success using it as intended, which is some logical
| separation of "one repo per project" where "project"
| doesn't grow too out of hand. The Linux kernel could be
| thought of as a large project that git still handles just
| fine.
|
| Tragically, I think if Google did opensource their
| internal vcs and monorepo tooling, they would immediately
| displace git as the dominant vcs and we would regress
| back to trunk-based development.
| rsj_hn wrote:
| > I'm glad you're having a good experience now, and git as
| a monorepo will work fine at smaller scales, but you will
| outgrow it at some point.
|
| I would say the opposite. A lot of companies are fine with
| independent teams using their own versions of dependencies
| and their own versions of core code, but at some point that
| becomes unmanageable and you need to start using a common
| set of dependencies and the same version of base frameworks
| to reduce the complexity. That way, pushing a patch to a
| framework means all the teams are upgraded. Monorepos are
| the most common solution to enforce that behavior.
|
| Look, this is all dealing with the problem of coordination
| in large teams. Different organizations have different
| capacities for coordination, and so it's like squeezing a
| balloon -- yes, you want more agility to pick your own deps
| but then the cost of that is dealing with so much
| complexity when you need to push a fix to a commonly used
| framework or when a CVE is found in a widely used dep and
| needs to be updated by 1000 different teams all manually.
|
| There is no "right" way. It's just something organizations
| have to struggle with because it's going to be costly no
| matter what, and all that matters is what type of cost your
| org is most easily able to bear. That will decide whether
| you use a monorepo or a bunch of independent repos, whether
| you go for microservices or a monolith, and most companies
| will do some mix of all of the above.
| anon9001 wrote:
| > Monorepos are the most common solution to enforce that
| behavior.
|
| Yes. This is very accurate and also the problem.
| Monorepos are being used as a political tool to change
| behavior, but the problem is that it has severe technical
| implications.
|
| > There is no "right" way.
|
| With git, there is a "wrong" way, and that's not
| separating your project into different repos. It causes
| real world technical problems, otherwise we wouldn't have
| this article posted in the first place.
|
| > It's just something organizations have to struggle with
| because it's going to be costly no matter what, and all
| that matters is what type of cost your org is most easily
| able to bear.
|
| It's not a coin toss whether monorepos will have better
| or worse support from all standard git tooling. It will
| be worse every time.
|
| The amount of tooling required to enforce dependency
| upgrades, code styles, security checks, etc across many
| repos is significantly less than the amount of tooling
| required to successfully use a monorepo.
| philosopher1234 wrote:
| If you want to play right and wrong, I will say that now
| it's the right way, since there is support for sparse
| checkouts in git.
|
| This isn't a useful game to play.
| eximius wrote:
| If you are in an enterprise setting, you _don't need
| decentralized version control_.
|
| So, yea, for companies, monorepos are a no brainer in a lot
| of ways.
|
| For open source, separate repos makes more sense.
|
| To expand on corporate monorepos, if you can still set up
| access control (e.g., code owners to review additions by
| domain) and code visibility (so there isn't _unlimited_
| code sharing), then I can't think of a reason to not use
| monorepos.
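|
| For example, on GitHub-style hosting the code-owner piece can
| be a CODEOWNERS file at the repo root (the team names below
| are made up):
|
|          $ cat .github/CODEOWNERS
|          /billing/    @acme/billing-team
|          /frontend/   @acme/web-team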
| IshKebab wrote:
| > you will outgrow it at some point
|
| Given that Google and Microsoft use monorepos that seems
| unlikely!
| anon9001 wrote:
| Google had to build an internal version control system as
| an alternative to git and perforce to support their
| monorepo.
|
| Microsoft forked git and layered their own file system on
| top of it to support a centralized git workflow so that
| they could have a monorepo.
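|
| (That fork also ships a `scalar` helper which wraps the setup:
| partial clone, sparse-checkout and background maintenance. A
| minimal sketch, assuming microsoft/git is installed:
|
|          $ scalar clone <url>
| )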
| dlp211 wrote:
| Having used both, Google's implementation is IMO the
| superior version of monorepo. Really, Google's
| Engineering Systems are just better than anything that I
| have ever used anywhere else.
| anon9001 wrote:
| This is exactly as I'd expect.
|
| If you want a centralized, trunk-based version control,
| don't use git.
|
| It's funny how each company decides to solve these
| problems.
|
| Google called in the computer scientists and designed a
| better centralized vcs for their purposes. Good on them.
| It'd be great if they open sourced it. So typical of
| Google to invent their own thing and keep it private.
|
| Microsoft took the most popular vcs (git), and inserted a
| shim layer to make it compatible with their use case. How
| expected that Microsoft would build a compatibility shim
| that attempts to hide complexity from the end user.
|
| Meanwhile, Linux and Git are plugging along just fine, in
| their own separate repos, even though many people work on
| both projects.
| IshKebab wrote:
| > So typical of Google to invent their own thing and keep
| it private.
|
| Yeah like their build system... Bazel, that's completely
| closed source.
| jayd16 wrote:
| Your logic is circular. No one should work on monorepos
| because... monorepos are bad because... git can't easily
| handle them and we shouldn't fix that because... no one
| should work on monorepos...
|
| Clearly there are reasons people like monorepos and it
| makes sense to update git to support the workflow.
| anon9001 wrote:
| That isn't circular. The conclusion should be that git, a
| decentralized vcs, should not take on changes to make it
| a centralized vcs.
|
| If you think that git needs to be "fixed" or "updated" to
| support a centralized vcs server to do partial updates
| over the network, then I think you've missed the point of
| git.
| dboreham wrote:
| > it's a tremendous productivity boon
|
| Curious to hear more specifics on this. Did you migrate from
| separate repos to a monorepo and subsequently measure
| improved productivity as a result?
| ratww wrote:
| Correct. We measured how long it took to integrate changes
| in the core libraries into the consumers (multiple PRs)
| versus doing it on a monorepo (single PR for change). We
| ran them together for a couple weeks and the difference was
| big.
|
| The biggest differences were in changes that would break
| the consumers. In those cases we had to go back and patch
| the original library, or revert and start from scratch. But
| even for the easy changes, just the "bureaucracy" of opening
| tens of pull requests, watching a few CI pipelines and
| getting them approved by different code owners added up.
|
| Now, whenever we have changes in one of the core libraries,
| we also run full tests in the library consumers. With tests
| running in parallel, it sometimes takes 20 minutes (instead
| of 4-5 hours) to get a patch affecting all frontends
| tested, approved and merged into the main branch.
|
| Also, everyone agreed that having multiple PRs open is
| quite stressful.
| solarmist wrote:
| From my understanding, Microsoft is doing it because they want
| to use git for developing Windows, which is (was?) a large
| monorepo.
| omegalulw wrote:
| Your take is extremely biased. You only discuss why
| monorepos are bad.
|
| Here's some of the many reasons why monorepos are excellent:
|
| - Continuous integration. Every project is almost always using
| the latest code from other projects and libraries it depends
| on.
|
| - Builds from scratch are very easy and don't need extravagant
| tooling.
|
| - Problems due to build versions in dependency management are
| reduced (everyone is expected to use HEAD).
|
| - The whole organization settles on common build patterns -
| so if you want to add a new dependency you don't need to
| struggle with its build system. Conversely, you need to write
| less documentation on how to build your code - because that's
| now standard.
| anon9001 wrote:
| Heh, the major problems that I've run into using monorepos in
| the real world at scale are:
|
| - CI breaks all the time. Even one temperamental test from
| anywhere else in the organization can cause your CI run to
| fail.
|
| - Building the monorepo locally becomes very complicated,
| even to just get your little section running. Now all
| developers need all the tools used in the monorepo.
|
| - Dependencies get upgraded unexpectedly. Tests aren't
| perfect, so people upgrade dependencies and your code
| inevitably breaks.
|
| It's cool that everyone is on the same coding style, but
| that's very much achievable with a shared linter
| configuration.
| dlp211 wrote:
| Your problem isn't monorepo, it's bad tooling. Tests should
| only execute against code that changed. Builds should only
| build the thing you want to build, not the whole
| repository.
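|
| With a graph-aware build tool (Bazel is just one example; the
| target names below are made up) that looks roughly like:
|
|          $ bazel query "tests(rdeps(//..., //libs/auth:auth))"   # tests affected by a change to //libs/auth
|          $ bazel test $(bazel query "tests(rdeps(//..., //libs/auth:auth))")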
| anon9001 wrote:
| Yes!
|
| The problem is choosing a monorepo _because_ the tooling
| isn't suited for monorepos.
|
| Trying to build a monorepo with git is like trying to
| build your CRUD web app frontend in c++.
|
| Sure, you can do it. Webassembly exists and clang can
| compile to it. I wouldn't recommend it because the
| tooling doesn't match your actual problem.
|
| Or maybe a better example is that it's like deciding the
| browser widgets aren't very good, so we'll re-render our
| own custom widgets with WebGL. Yes, this is all quite
| possible, and your result might get to some definition of
| "better", but you're not really solving the problem you
| had of building a CRUD web app.
|
| Can Microsoft successfully shim git so that it appears
| like a centralized trunk-based monorepo, the way you'd
| find at an old cvs/svn/perforce shop? Yes, they did, but
| they shouldn't have.
|
| My thesis is they're only pushing monorepos because it
| helps GitHub monetize, and I stand by that.
|
| > Tests should only execute against code that changed.
| Builds should only build the thing you want to build, not
| the whole repository.
|
| How do you run your JS monorepo? Did you somehow get
| bazel to remote cache a webpack build into individual
| objects, so you're only building the changes? Can this
| even be done with a modern minimization tool in the
| pipeline? Is there another web packager that does take a
| remotely cachable object-based approach?
|
| I don't know enough about JS build systems to make a
| monorepo work in any sensible way that utilizes caching
| and minimizes build times. If anything good comes out of
| the monorepo movement, it will be a forcing function that
| makes JS transpilers more cacheable.
|
| And all this for what? Trunk-based development? So we can
| get surprise dependency updates? So that some manager
| feels good that all the code is in one directory?
|
| The reason Linus invented git in the first place was that
| decentralized development is the best way to build software.
| He literally stopped work on the kernel for 2 weeks to
| build the first version of git because the scale by which
| he could merge code was the superpower that grew Linux.
|
| If you YouTube search for "git linus" you can listen to
| the original author explain the intent from 14 years ago:
| https://www.youtube.com/watch?v=4XpnKHJAok8
|
| If this is a topic you're passionate about, I'd encourage
| you to watch that video, as he addresses why
| decentralizing is so important and how it makes for
| healthy software projects. It's also fun to watch old
| Googlers not "get it".
|
| He was right then and he's right now. It's disappointing
| to see so much of HN not get it.
| Orphis wrote:
| > Git is decentralized and fast, and Google famously doesn't
| use it.
|
| Most (all?) of Google's OSS software is hosted on either Gerrit
| or GitHub. Git is not used by the "google3" monorepo, but it's
| used by quite a few major projects.
| nightpool wrote:
| Is there a point in having a monorepo if you're all in on the
| microservices approach? I'm a big microservices skeptic, but as
| far as I understand it, the benefit of microservices is
| independence of change & deployment enforced by solid API
| contracts--don't you give that all up when you use a monorepo?
| What does "Monorepo with microservices" give you that a normal
| monolithic backend doesn't?
|
| (Obviously e.g. an image resizer or something else completely
| decoupled from your business logic should be a separate service /
| repo _anyway_ --my point is more along the lines of "If something
| shares code, shouldn't it share a deployment strategy?")
| rsj_hn wrote:
| Yeah, I've seen it used to allow teams to use consistent
| frameworks and libraries across many different microservices.
| Think of authentication, DB clients, logging, webservers,
| grpc/http service front ends, uptime oracle -- there are lots
| of cross-cutting concerns that are shared code among many
| microservices.
|
| So the next thing you decide to do is create some microservice
| framework that bundles all that stuff in and allows your
| microservice teams to write some business logic on top. But now
| 99% of your executables are in this microservice framework that
| everyone is using, and that's the point where a lot of
| companies go the monorepo route.
|
| Actually most companies do some mix -- have a lot of stuff in a
| big repo and then other smaller repos alongside that.
| johnmaguire wrote:
| Monorepo with microservices gives you the ability to scale and
| perform SRE-type maintenance at a granular level. Teams
| maintain responsibility for their service, but are more easily
| able to refactor, share code, pull dependencies like GraphQL
| schemas into the frontend, etc. across many services.
| nightpool wrote:
| So basically each team has to reinvent devops from the ground
| up, and staff their own on call rotation, instead of having a
| centralized devops function that provides a stable platform?
| That sounds horrendous.
|
| Although, that said, I can at least _see_ the benefits of the
| "1 service per team" methodology, where you have a dedicated
| team that's independently responsible for updating their
| service. I'm more used to associating "microservices" with
| the model where a single team is managing 5 or 6 interacting
| services, and the benefits there seem much smaller.
| johnmaguire wrote:
| > That sounds horrendous.
|
| Different teams can make their own decisions, but as a
| developer on a team that ran our own SRE, I found it came
| with many advantages. Specifically, we saw very little
| downtime, and when outages did occur were very prepared to
| fix it as we knew the exact state of our services (code,
| infrastructure, recent changes and deploys.) Additionally,
| we had very good logging and metrics because we knew what
| we'd want to have in the event of a problem.
|
| And I'm not sure what you mean "from the ground up." We
| were able to share a lot of Ansible playbooks, frameworks,
| and our entire observability stack across all teams.
|
| But I think you may also be missing the rest of my post.
| This is only one possible advantage. Even if the team
| doesn't perform their own SRE, these services can be scaled
| independently - both in terms of infrastructure and
| codebase - even while sharing code (including things like
| protocol data structures, auth schemes, etc.)
|
| A service that receives 1 SAML Response from an IdP
| (identity provider) per day per user may not need the same
| resources as a dashboard that exposes all SP (service
| providers) to a user many times a day. And an administration
| panel for this has different needs still.
|
| Yet, all of these services may communicate with each other.
| echelon wrote:
| > Is there a point in having a monorepo if you're all in on the
| microservices approach?
|
| Monorepos are excellent for microservices.
|
| - You can update the protobuf service graph (all strongly
| typed) easily and make sure all the code changes are
| compatible. You still have to release in a sensible order to
| make sure the APIs are talking in an expected way, but this at
| least ensures that the code agrees.
|
| - You can address library vulns and upgrades all at once for
| everything. Everything can get the new gRPC release at the same
| time. Instead of having app owners be on the hook for this, a
| central team can manage these important upgrades and provide
| assistance / pairing for only the most complex situations.
|
| - If you're the one working on a very large library migration,
| you can rebase daily against the entire fleet of microservices
| and not manage N-many code changes in N-many repos. This makes
| huge efforts much easier. Bonus: you can land incrementally
| across everything.
|
| - If you're the one scoping out one of these "big changes", you
| can statically find all of the code you'll impact or need to
| understand. This is such an amazing win. No more hunting for
| repos and grepping for code in hundreds of undiscovered places.
|
| - Once a vuln is fixed, you can tell all apps to deploy after
| SHA X to fix VULN Y. This is such an easy thing for app owners
| to do.
|
| - You can collect common service library code in a central
| place (eg. internal Guava, auth tools, i18n, etc). Such
| packages are easy to share and reuse. All of your internal code
| is "vendored" essentially, but you can choose to depend on only
| the things you need. A monorepo only feels heavy if you depend
| on all the things (or your tooling doesn't support git or build
| operations - you seriously have to staff a monorepo team).
|
| - Other teams can easily discover and read your code.
|
| Monorepos are the best possible way to go as long as you have
| the tooling to support it. They fall over and become a burden
| if they're not seriously staffed. When they work, they really
| work.
| nightpool wrote:
| None of this addresses my question--what benefits do you get
| from having monorepo-with-microservices over a monolithic
| backend? All of the things you mentioned would be even
| _easier_ with a monolithic backend.
| staticassertion wrote:
| They solve different problems, some of which may overlap I
| suppose.
|
| For one thing you get clear ownership of deployed code.
| There isn't one monolithic service that everyone is
| responsible for babying, every team can baby their own,
| even if they all share libraries and whatnot.
|
| You also get things like fault isolation and fine grained
| scaling too.
| echelon wrote:
| Should have also mentioned: all those changes to cross-
| cutting library code will trigger builds and tests of the
| dependent services. You can find out at once what breaks.
| It's a superpower.
| aranchelk wrote:
| The more I've come to rely on techniques like canary testing and
| opt-in version upgrades the more firmly I believe one of the main
| motivations for monorepos is flawed: at any given time there may
| not be a fact of the matter as to which single version of an app
| or service is running in an environment.
|
| At places I've worked, when we thought about canary testing we
| ignored the fact that there were multiple versions of the
| software running in parallel; we classified it as part of an
| orchestration process and not a reality about the code or env.
| But we really did have multiple versions of a service running at
| once, sometimes for days.
|
| Similarly if you've got a setup where you can upgrade
| users/regions/etc piecemeal (opt-in or by other selection
| criteria) I don't know how you reflect this in a monorepo. I'm
| curious how Google actually does this as I recall they have
| offered user opt-in upgrades for Gmail. My suspicion is this gets
| solved with something like directories ./v2/ and ./v3/ -- but
| that's far from ideal.
| eximius wrote:
| That doesn't seem like a problem with monorepos (or otherwise).
|
| You'd just need to tag your releases, right?
| aranchelk wrote:
| I don't think it's a problem, rather I'm challenging a touted
| benefit.
|
| In large monorepos the supposition is you've got a class of
| compatible apps and services bundled together. Version
| dependencies are somewhat implicit: the correct version for
| each project to interoperate is whatever was checked-in
| together in your commit.
|
| I don't know how it works in practice at different orgs, but
| there's certainly an idea I've heard repeated that you can
| essentially test, build, and deploy your monorepo atomically,
| but the reality in my experience is you can't escape the need
| to think about compatibility across multiple versions of
| services once you use techniques like canary testing or
| targeted upgrades.
| sroussey wrote:
| This is still true, but it's a matter of degree. Even with
| feature-flagged deploys mixed with canaries, the permutations
| are all evident, and ideally tested.
|
| Also, you wouldn't expect a schema change to occur with
| code that requires it. Those will need to happen earlier.
|
| Real systems are complex. A monorepo is one attempt at
| capping the complexity to known permutations. For smaller
| teams, it might collapse to a single one.
| staticassertion wrote:
| You still have to think about compatibility across versions
| - that does not go away in a monorepo, and you should use
| protocols that enforce compatible changes. The monorepo
| just tells you that all tests pass across your entire
| codebase given a change you made to some other part.
| eximius wrote:
| That's fair. You require reasonable deployment intervals
| and may need to wait to merge based on deployment.
| Workflow actions that can check whether a commit is
| deployed in a given environment are invaluable.
| jeffbee wrote:
| > may need to wait to merge based on deployment
|
| Again, this fundamentally misunderstands the purpose of
| the source code repo and how it relates to the build
| artifacts deployed in production. If you are waiting for
| something to happen in production before landing some
| change, that tells me right there you have committed some
| kind of serious error.
| joshuamorton wrote:
| I'd caveat this with _code_ change.
|
| It's very common to need to wait for some version of a
| binary to be live before updating some associated
| configuration to enable a feature in that binary (since
| the _dynamic_ configuration isn't usually versioned with
| the binary). It's possible that some systems exist that
| fail quietly, with a non-existent configuration option
| being silently ignored, but the ones I know of don't do
| that.
| coryrc wrote:
| Only binaries are released. Binaries are timestamped and linked
| to a changelist.
|
| The "opt-in upgrades" are all live code. I know more than a few
| "foo" "foo2" directories, but I wouldn't want an actively-
| delivered, long-running service to be living in a feature
| branch so I would still expect anyone to be using a similar
| naming scheme.
| ASinclair wrote:
| > I'm curious how Google actually does this
|
| Branches are cut for releases. Binaries are versioned.
| [deleted]
| jeffbee wrote:
| > the main motivations for monorepos is flawed: at any given
| time there may not be a fact of the matter as to which single
| version of an app or service is running in an environment.
|
| Your understanding of the motivations for monorepo is flawed.
| I've never heard anyone even advocate for this as a reason for
| monorepos. For some actual reasons people use monorepos, see
| https://danluu.com/monorepo/
|
| Regarding your question, which I re-emphasize has got nothing
| to do with the arrangement of the source code, the solution is
| to simply treat your protocol as an API and follow these rules: 1)
| Follow "Postel's Law", accepting to the best of your abilities
| unknown messages and fields; 2) never change the meaning of
| anything in an existing protocol. Do not change between
| incompatible types, or change an optional item to required; 3)
| Never re-use a retired aspect of the protocol with a different
| meaning; 4) Generally do not make any incompatible change to an
| existing protocol. If you must change the semantics, then you
| are making a new protocol, not changing the old one.
|
| > I don't know how you reflect this in a monorepo
|
| You don't. Why would the deployed version of some application
| be coded in your repo? It's simply a fact on the ground and
| there's no reason for that to appear in source control.
| aranchelk wrote:
| We may just be talking past each other, but in the link you
| provided, sections "Simplified dependencies" and (to a lesser
| extent) "Cross-project changes" are pretty much exactly what
| I'm talking about.
| joshuamorton wrote:
| They aren't, because those discussions are all related to
| link-time stuff (if I update foo.h and bar.c that depends
| on foo.h, I can do so atomically, because those are built
| into the same artifact).
|
| As soon as you discuss network traffic (or really anything
| that crosses an RPC boundary), things get more complicated,
| but none of that has anything to do with a monorepo, and
| monorepos still sometimes simplify things.
|
| So there are a few tools that are common: feature flags,
| 3-stage rollouts, and probably more that are relevant, but
| let's dive into those first two.
|
| Feature "flags" are often dynamically scoped and runtime-
| modifiable. You can change a feature flag via an RPC,
| without restarting the binary running. This is done by
| having something along the lines of:
|
|          if (condition_that_enables_feature()) {
|            do_feature_thing()
|          } else {
|            do_old_thing()
|          }
|
| A/B testing tools like Optimizely and co provide this, and
| there are generic frameworks too.
| `condition_that_enables_feature()` here is a dynamic
| function that may change value based on the time of day,
| the user, etc. (think something like
| `hash(user.username).startswith(b'00') and user.locale ==
| 'EN'`). The tools allow you to modify and push these
| conditions, all without restarts. That's how you get
| per-user opt-in to certain behaviors.
| Fundamentally, you might have an app that is capable of
| serving two completely different UIs for the same user
| journey.
|
| Then you have "3-phase" updates. In this process, you have
| a client and server. You want to update them to use "v2" of
| some API that's totally incompatible with v1. You start by
| updating the server to accept requests in either v1 or v2
| format. That's stage one. Then you update the clients to
| send requests in v2 format. That's stage two. Then you
| remove all support for v1. That's stage three.
|
| When you canary a new version of a binary, you'll have the
| old version that only supports v1, and the canary version
| that supports v1 and v2. If it's the server, none of the
| clients use v2 yet, so this is fine. If it's the client,
| you've already updated the server to support v2, so it
| works fine.
|
| Note again that all of this happens whether or not you use
| a monorepo.
| howinteresting wrote:
| In general, it is a good practice to try and maximize
| compile-time resolution of dependencies and minimize
| network resolution of them. Services are great when the
| working set doesn't fit in RAM or the different parts have
| different hardware needs, but trying to make every little
| thing its own service is foolish.
|
| Doing so makes this a less pertinent problem.
| [deleted]
| rurban wrote:
| The new "ort" merge strategy looks even better, and it benefits
| everyone, not only the tiny monorepo userbase.
___________________________________________________________________
(page generated 2021-11-11 23:00 UTC)