[HN Gopher] Improving large monorepo performance on GitHub
___________________________________________________________________
Improving large monorepo performance on GitHub
Author : todsacerdoti
Score : 224 points
Date : 2021-03-16 16:37 UTC (6 hours ago)
(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)
| seattle_spring wrote:
| What geographic feature is pictured in the hero shot of this
| blogpost? At first I thought it was the Golden Throne in Capitol
| Reef but I now think it's something else. I'm 90% sure it's
| either in Capitol Reef or Grand Staircase.
| david_allison wrote:
| https://www.flickr.com/photos/23155134@N06/7132776459
|
| Forest Mountains of Zion National Park, Utah
| seattle_spring wrote:
| Close but so far! Thanks so much.
| chris_wot wrote:
| Did they contribute the repack optimizations upstream?
| iliekcomputers wrote:
| Nice work, really interesting blog post!
|
| On a sidenote, git itself can also get painfully slow with large
| monorepos. Hope GitHub can push some changes there as well.
|
| I know FB moved off git to mercurial because of performance
| issues.
| klodolph wrote:
| My understanding is that neither Git nor Mercurial can do this
| well out of the box, and FB and Google both have their own
| extensions to Mercurial to make this possible (because even
| though Mercurial is often slower than Git, it's extensible).
|
| e.g. https://facebook.github.io/watchman/ - used as part of
| Facebook's Mercurial solution, I think.
| vtbassmatt wrote:
| Git also has a file system monitor interface which can use
| Watchman. We (GitHub) are working on a native file system
| monitor implementation in addition -
| https://github.com/gitgitgadget/git/pull/900.
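|
| For anyone curious, a minimal sketch of wiring up the existing
| Watchman-backed hook (assumes Watchman is installed and your
| git ships the sample hook; paths vary by version):
|
|     # copy the sample hook git ships and point git at it
|     cp .git/hooks/fsmonitor-watchman.sample .git/hooks/query-watchman
|     git config core.fsmonitor .git/hooks/query-watchman
|     # often paired with the untracked cache to speed up `git status`
|     git config core.untrackedCache true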
| jauer wrote:
| And then from mercurial extensions to our own server,
| mononoke, which apparently has been moved under the Eden
| umbrella: https://github.com/facebookexperimental/eden
| jayd16 wrote:
| I thought Google used some custom fork of perforce.
| klodolph wrote:
| From what I understand, Piper is not a fork of Perforce,
| but instead a completely different system with the same
| interface. You know, built on top of a BigTable or Spanner
| cluster instead of whatever Perforce uses.
|
| The Mercurial extensions are then an _alternative client_
| for Piper.
| pitaj wrote:
| You might be interested in scalar [1] developed by Microsoft
| for handling large repos.
|
| [1]: https://github.com/microsoft/scalar
| WorldMaker wrote:
| It's also interesting to note how much of Microsoft's work
| for handling large repos in git has merged upstream directly
| into git itself.
|
| One very interesting part of that is the effort that has gone
| into the git commit-graph: https://git-scm.com/docs/commit-
| graph.
|
| It's part of what makes scalar interesting compared to some
| of the projects you hear mentioned in use inside the FB and
| Google gates: not only is scalar itself open source, but a
| lot of what scalar does is tune configuration flags to turn
| on optional git features such as the commit-graph, sparse
| checkout "cones", etc that are all themselves directly
| supported by the git client. Even if you aren't at the scale
| where it makes sense to use all of the tools that scalar
| provides, you can get some interesting baby steps by
| following scalar's "advice" on git configuration.
| rmasters wrote:
| If you are willing to adapt to a different structure and
| workflow, you can filter the scope of git down dramatically
| with sparse checkouts (as @WorldMaker also mentioned).
|
| https://github.blog/2020-01-17-bring-your-monorepo-down-to-s...
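|
| A minimal sketch of that workflow (the repo name and directories
| here are made up), using a blobless partial clone plus cone-mode
| sparse checkout:
|
|     git clone --filter=blob:none --no-checkout https://github.com/example/big-mono
|     cd big-mono
|     git sparse-checkout init --cone
|     git sparse-checkout set services/api libs/shared
|     git checkout main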
| chgibb wrote:
| Sparse-checkouts are amazing. I wrote some small tools that
| use dependency information in Flutter packages to drive a
| sparse-checkout. We use it at $dayjob now.
| tuyiown wrote:
| Offtopic, but I often wonder if there are people using `git
| worktree` to have several related code trees within the same
| repo.
|
| Technically it works mostly the same as multiple repos, but in
| theory it allows something like a bootstrap script with
| everything self-contained in the same repo. Looks like an
| alternative tradeoff between a monorepo with shared history and
| multiple repositories.
| numbsafari wrote:
| I'm not sure worktree works exactly as you think.
|
| I use worktree locally so that, for example, I can have my
| working copy that I am doing development in and then a separate
| working copy where I can do code review for someone else,
| without having to interrupt what I am doing in my own worktree.
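|
| A minimal sketch of that setup (paths and branch names are just
| examples):
|
|     # second working copy, detached at the branch under review
|     git worktree add --detach ../myrepo-review origin/their-feature
|     cd ../myrepo-review   # poke around, build, run tests
|     # back in the main checkout, clean up when done
|     git worktree remove ../myrepo-review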
|
| My own experience is that if you are using branches with
| radically different content for different purposes in the same
| tree, it's going to end up a mess at some point. Worktrees, as
| far as I am aware, do not help with that in any special way.
| pjc50 wrote:
| I tried that, and discovered that it won't let you have two
| worktrees with the same branch checked out?
| hiq wrote:
| Offtopic (I think), but I just learned that "For instance,
| git worktree add -d <path> creates a new working tree with a
| detached HEAD at the same commit as the current branch."
| (from the manpage).
|
| It's offtopic because the second worktree has a detached
| HEAD, so that doesn't help in the case you mention.
| ivanbakel wrote:
| Which is very sane - what should git do if you modify the
| branch in one tree but not in the other? The least painful
| solution would require something like multiple-refs-per-
| remote-branch, which would be (to my understanding) a re-
| architecture.
| pjc50 wrote:
| Well, I wanted the same effect as two separate checkouts
| (ie entirely separate branch structures) but with a bit of
| the disk space shared between them, but that's not how it
| works.
| oftenwrong wrote:
| Are you thinking of git subtree?
|
| https://www.atlassian.com/git/tutorials/git-subtree
| hiq wrote:
| Do you have an example of this setup?
|
| As far as I know git worktree is just to have different
| branches of the same repo checked out in different locations.
| At least, that's the only way I use it (and it's great!). Are
| you suggesting to have different projects on different
| branches? So an empty "master", then "project1", "project2"
| etc. as branches?
| adeltoso wrote:
| Just 10 years too late, I remember when Facebook switched to
| Mercurial because the Git community wouldn't care about big
| monorepos. Mercurial is great!
| kevincox wrote:
| I'm slightly surprised that GitHub is still basically storing a
| git repo on a regular filesystem using the Git CLI. I would have
| expected that the repos were broken up into individual objects
| and stored in an object store. This should make pushes much
| faster as you have basically infinitely scalable writes. However
| it does make pulls more difficult. That said, computing
| packfiles could still be done (asynchronously), and with some
| attention to data-locality it should be possible.
|
| This would be a huge rewrite of some internals but seems like it
| would be a lot easier to manage. It would also provide some
| benefits as objects could be shared between repos (although some
| care would probably be necessary for hash collisions) and it
| would remove some of the oddness about forks (as IIUC they
| effectively share the repo with the "parent" repo).
|
| I would love to know if something like this has been considered
| and why they decided against it.
| hobofan wrote:
| > I'm slightly surprised that GitHub is still basically storing
| a git repo on a regular filesystem using the Git CLI.
|
| Maybe I'm a bit dense, but how did you get that from the
| article? I'm fairly certain that in other pieces of writing
| they showed that they are using an object store, and I'm
| guessing that's what the "file servers" in the article are.
| stuhood wrote:
| `git repack` is an operation that is fairly specific to git's
| default file format: if they were storing objects in any
| (other) database, it is very unlikely that they would
| experience blocking repack operations, as that is an area
| where databases are highly optimized to execute
| incrementally.
| lumost wrote:
| I am not a github employee, but my 2 cents.
|
| An object store lacks an index which your typical FS will
| provide with a relatively high degree of efficiency. FS's can
| be distributed to arbitrary write velocity given an
| appropriately distributed block storage solution (which will
| provide the k/v API of an object store that you're looking
| for). Distributed FS's are conveniently compatible with most POSIX
| operations rather than requiring bespoke integration. Most
| object stores are optimized for largish objects and lack the
| ability to condense records into an individual write (via the
| block API) or pre-emptively pre-fetch the most likely next set
| of requested blocks.
|
| In GitHub's case, the choice of diverging from Git CLI/FS-based
| storage APIs could lead to long-term support issues and
| an implicit "github" flavor of git rather than improving the
| core git toolchain.
|
| Object Stores are great, but if you need some form of index
| they get slow and painful really fast.
| ddorian43 wrote:
| You should be able to split the object store into 2 systems:
| one for metadata (think rdbms/nosql/etc) and a blob-data service
| keeping large files, think 10KB+. Both systems should be able
| to be more efficient than the current method.
|
| Example: you can add erasure coding to the blob-data service
| for better efficiency. You can add fancy indexing to your
| metadata store. etc etc.
|
| But somebody has to create it, that's the issue.
| lumost wrote:
| That's exactly how distributed filesystems are built.
|
| Systems such as HDFS use the NameNode for this task, but
| depending on the exact characteristics of the filesystem a
| multi-master setup is often used. I know of at least one
| NFS implementation which uses postgres as its metadata
| layer.
| Denvercoder9 wrote:
| > This should make pushes much faster as you have basically
| infinitely scalable writes. However it does make pulls more
| difficult.
|
| I bet GitHub has much more read traffic than write traffic, so
| this trade-off does not make sense.
| random5634 wrote:
| Seriously, imagine the compute and request costs to assemble
| a large git pull.
| cordite wrote:
| Sounds a lot like this here
| https://github.com/Homebrew/brew/pull/9383
| kevincox wrote:
| I said "difficult" not expensive. Once you assembled the
| packfiles (much like they do today) it should be roughly the
| same cost.
| WorldMaker wrote:
| Yes, and I would also imagine that the trade-offs between
| writing a proprietary object store and reusing the battle-
| tested object storage that everyone else uses would have been
| considered as well.
|
| It seems like the sort of thing that would be an interesting
| open source research topic if you could build an object
| database for git that performs better than its packed,
| in-filesystem object store. But it's probably not something you
| want to do as a proprietary project: fewer eyeballs on its
| performance trade-offs, and more engineering work every time
| git slightly changes its object storage behavior, which would
| remain tuned for the filesystem object store because it was
| entirely unaware of your efforts.
| oconnor663 wrote:
| I think the GitHub folks have written more than one article
| about this. I'm not sure I can find the one I'm thinking of,
| but here's another one:
| https://github.blog/2016-04-05-introducing-dgit/
|
| > Perhaps it's surprising that GitHub's repository-storage
| tier, DGit, is built using the same technologies. Why not a
| SAN? A distributed file system? Some other magical cloud
| technology that abstracts away the problem of storing bits
| durably? The answer is simple: it's fast and it's robust.
| parhamn wrote:
| Have you ever tried it? It's not remotely performant and
| wouldn't make sense since GH is read heavy. Plus I'm sure they
| spend a lot of time thinking about this stuff, no?
|
| If you want to get your feet wet, check out go-git[1]. It's a
| native golang implementation of git. They have a storage layer
| abstracted over a lean interface that you can quickly create
| alternative drivers for in golang. You'll effectively be
| implementing a poorly sharded file system on a database, and
| then it becomes obvious why scaling the FS is just easier.
|
| [1] https://github.com/go-git/go-git/tree/master/storage
| brown9-2 wrote:
| > Plus I'm sure they spend a lot of time thinking about this
| stuff, no?
|
| I think this is unfair - the author was not insinuating that
| the people who designed this system at Github are stupid in
| some way, but just asking if other architectures have been
| considered.
| parhamn wrote:
| > ...Github are stupid in some way, but just asking if
| other architectures have been considered.
|
| To me, asking an engineering org if they've considered
| alternative architectures for their main engineering
| problem is silly at best, overconfident at worst.
| ben0x539 wrote:
| I think the main point of the comment was asking _why_
| they decided against it. At one point the wording mildly
| suggests the possibility that no one at github has
| thought about it:
|
| > I would love to know if something like this has been
| considered and why they decided against it.
|
| ... but that still sounds more like a grammatical hedge
| than an actual suggestion that github didn't think it
| through.
|
| imo it's fair to lay out why you're surprised about some
| decision in the hopes that someone will enlighten you,
| even if it can be tricky to phrase that without coming
| off like a "why didn't you just..." comment.
| tkiolp4 wrote:
| > but just asking if other architectures have been
| considered.
|
| That's the polite way of calling them stupid ;)
| JeremyBanks wrote:
| They're just expressing curiosity. Jeeze.
| ben0x539 wrote:
| The problem with recognizing that a lot of phrasings are
| just the polite way to call someone stupid/tell someone
| to fuck off/etc is that you start seeing assholes
| whenever someone is just trying to be polite. :(
| mvzvm wrote:
| This kind of comment is why every single project needs to be
| justified with "What problems are you solving?" and "What
| usecase are you supporting?". Because I could 150% imagine
| somebody getting excited about this and then:
|
| 1) Framing it as such with poor justification "a lot easier to
| manage"
|
| 2) "This would be a huge rewrite of some internals" Becoming a
| multi-year migration quagmire
|
| 3) The dawning realization that you have used a write-heavy
| architecture in a read-heavy system
| Ericson2314 wrote:
| I always had the impression GitHub was not preemptively
| investing in the fundamentals like that. So yeah, agree it's a
| bummer, but also not surprised.
|
| And hey, at least that means a post GitHub FOSS world won't be
| leaving fundamental improvements behind!
| lamontcg wrote:
| In addition to everything else in this thread, it'd be nice to
| see better support for monorepos in the GitHub UI as well.
|
| Something like the ability to have
| github.com/<org>/<repo>/<subproject>/issues be a shard of all the
| issues for a subproject.
|
| You can do that with tagging, but that's a bit of a PITA because
| that UI is fairly bad and doesn't scale.
| masklinn wrote:
| > Improving repository maintenance
|
| There's one thing I'd really like to see there: the ability to
| lock out the repository and perform a _really_ aggressive repack.
| I'm talking `-AdF --window=500` or somesuch. On $dayjob's
| repository, the base checkout is several gigs. Aggressively
| repacking it reduces its size by 60%.
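|
| Concretely, the sort of thing I mean (flags per git-repack(1);
| very CPU- and memory-hungry, and the numbers are illustrative):
|
|     git repack -A -d -F --window=500 --depth=50
|     git prune-packed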
|
| There's also a git-level thing which would greatly benefit large
| repositories: for packs to be easier to craft and more reliably
| kept, so it's easier to e.g. segregate assets into packs and not
| bother compressing that, or segregate l10n files separately from
| the code and run a more expensive compression scheme on _that_.
| tasuki wrote:
| > On $dayjob's repository, the base checkout is several gigs.
|
| Why is it several gigs? Is that really necessary?
| chrisseaton wrote:
| > Why is it several gigs? Is that really necessary?
|
| A lot of code written by a lot of engineers over a lot of
| years.
|
| I'm not sure what other answer you're expecting?
|
| I work with a compiler that has had a team of tens on it over a
| decade or so, and even that's 5 GB. No binary assets. I really
| don't think it's that unusual.
| wikibob wrote:
| When is GitHub going to finally add support for Microsoft's
| VFSforGit?
|
| https://github.com/microsoft/VFSForGit
|
| https://vfsforgit.org/
| hyperrail wrote:
| I'm not sure that will ever happen [1] as Microsoft itself is
| limiting active development of VFS for Git in favor of Scalar
| [2] by the same team, which aims to improve client-side big
| repo performance _without_ having to use OS-level file system
| virtualization hooks.
|
| I don't believe VFS for Git will ever be abandoned by
| Microsoft, but I'm doubtful it will ever get any more major
| improvements from them.
|
| Scalar does use the VFS for Git client-server protocol, and
| both Scalar and VFS for Git rely on the same improvements to
| the git app itself, so I could imagine that GitHub would adopt
| the GVFS protocol and support Scalar without formally
| supporting GVFS itself.
|
| [1] GitHub did announce future GVFS support in 2017 -
| https://venturebeat.com/2017/11/15/github-adopts-microsofts-...
| - but if anything came out of that I don't see it in GitHub
| help today.
|
| [2] https://github.com/microsoft/scalar
| vitorgrs wrote:
| You know Microsoft runs the Windows repo with VFS, right?
| hyperrail wrote:
| Yes, I do know the Windows OS git repo uses GVFS. In fact,
| I shared my personal experience with git in the os repo
| some time ago:
| https://news.ycombinator.com/item?id=20748778
|
| When I left Microsoft about half a year ago, GVFS and
| Scalar were both in heavy use there.
| hyperrail wrote:
| I should clarify that Scalar does not _require_ a VFS for Git
| server to work correctly, even though it can get significant
| benefits if a VFS server is available. This means you can use
| Scalar today with GitHub, but not VFS.
|
| Scalar also supports Windows and macOS, while VFS only
| supports Windows: https://github.com/microsoft/VFSForGit/blob
| /v1.0.21014.1/doc...
| vtbassmatt wrote:
| Hey, I'm the product manager for Git Systems at GitHub. Can you
| share more about how you'd use VFS for Git / GVFS protocol if
| we had it on GitHub?
|
| Right now we don't plan on supporting it; most of our work is
| focused on upstreamable changes and opinionated defaults. But
| that could change if we're missing some important use cases.
|
| Feel free to email me - my HN alias @github.com - if you prefer
| to discuss privately.
| jmull wrote:
| I'm curious what counts as a "large" monorepo?
| bob1029 wrote:
| This is a very subjective evaluation. You could look at # of
| files versioned, total bytes of the repository on disk, # of
| logical business apps contained within, total # of commits,
| etc.
|
| For me, it's any repository where I would think "damnit, I'm
| going to have to do a fresh clone" if the situation comes up.
| There isn't a hard line in the sand, but there is certainly some
| abstract sensation of "largeness" around git repos when things
| start to slow down a bit.
| crecker wrote:
| I can bet whatever you want that they did this improvement for
| the microsoft/windows repo.
| noahl wrote:
| Microsoft/windows is hosted on Azure DevOps, and they have also
| blogged about what they've done to improve its performance!
|
| Here's a recent post:
| https://devblogs.microsoft.com/devops/introducing-scalar/
| WorldMaker wrote:
| Rumors are Azure DevOps and GitHub are converging "soon", and
| maybe "Project Cyclops" wasn't specifically to improve
| Microsoft/Windows repo performance, but it seems reasonable
| given the convergence rumors that it could be a step in the
| direction of preparing for/migrating the repo to GitHub. Of
| course Microsoft doesn't want to panic Enterprise developers
| on Azure DevOps just yet, so they are extremely quiet right
| now about any convergence efforts, and I take the rumors with
| a grain of salt. It is something that I wish Microsoft would
| properly announce sooner rather than later, as it might
| provide momentum towards GitHub in the capital-E Enterprise
| development world (even if it will panic those that are still
| afraid of GitHub for whatever reasons).
| endisneigh wrote:
| kind of an aside, but what's the best practice for pushing and
| building separate projects in a monorepo?
|
| say you have a structure like:
|
|     projectA
|     projectB
|     sharedUtils
|
| Each push to master might need a build of projectA and projectB,
| but right now it builds both every time. Ideally you could use
| Git to see whether anything in projectA or sharedUtils changed
| to trigger projectA's build, and likewise for projectB, but I'm
| curious what others are doing.
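|
| For concreteness, the sort of thing I have in mind (a rough
| sketch; the BASE/HEAD variables would come from CI, and build.sh
| is hypothetical):
|
|     changed=$(git diff --name-only "$BASE_SHA" "$HEAD_SHA")
|     if echo "$changed" | grep -qE '^(projectA|sharedUtils)/'; then
|       ./build.sh projectA
|     fi
|     if echo "$changed" | grep -qE '^(projectB|sharedUtils)/'; then
|       ./build.sh projectB
|     fi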
| numbsafari wrote:
| Perhaps check out a tool like please[1]. There are other tools
| in this space, but that one has worked well for me without the
| complexity of some other, similar tools.
|
| [1] https://please.build
| oftenwrong wrote:
| I can't speak for using it in a massive monorepo, but I
| started using https://please.build for some of my personal
| projects recently just as an alternative to the dominant Java
| build systems (Ant/Maven/Gradle). It's far more
| straightforward to use, and incremental builds actually work
| reliably.
| zdw wrote:
| Monorepos require much more care to be put into the
| integration/CI side of the process.
|
| This is worth a read: https://yosefk.com/blog/dont-ask-if-a-
| monorepo-is-good-for-y...
| alfalfasprout wrote:
| As a few others have mentioned this is something that build
| systems handle since they understand the dependency graph. For
| example, Bazel is often used to this end.
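|
| For illustration, a hedged sketch of the graph query that makes
| this possible (package names are hypothetical):
|
|     bazel query 'rdeps(//..., //sharedUtils/...)' --output=package
|
| That lists every package that (transitively) depends on
| sharedUtils, so CI can rebuild and retest just those.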
|
| However... I would _strongly_ advise not going for a monorepo.
| No, I don't mean something like tensorflow where you have a
| bunch of related tools and projects in a single repo. I mean
| one repo for the entire org where totally unrelated projects
| live.
|
| Every company I've been at that used a monorepo found
| themselves struggling to make it work since you need a ton of
| full time engineers just to keep things working and scaling.
| Many of the problems that monorepos try to solve (simplifying
| dependency and version management) are traded for 10x as many
| problems, many of them hard ones (incremental builds,
| dependency resolution).
|
| Google has a huge team in charge of helping their monorepo
| scale and work efficiently. You are not google... don't be
| tempted.
| jschwartzi wrote:
| Well, sure: if you have a pile of totally unrelated things
| that never need to change in lock-step, then you don't need a
| "monorepo." But on the other hand if you're building an
| entire software system such as a collection of API services,
| a database schema, embedded device firmware, and a website,
| and all of these things are interdependent and incompatible
| across versions then please for the love of god use a
| monorepo.
|
| At my job our cloud team uses multiple separate repositories
| which makes sense, but it also moves the burden of versioning
| to run-time. This is because they have to interface with
| multiple different versions of the device firmware. So they
| deploy different run-time versions of the APIs to support
| legacy and current production firmware versions. But our
| firmware repository is a monorepo in that the sources and
| build system builds the artifacts for multiple devices from
| the same source tree.
|
| So it's not so cut and dried as "never use a monorepo" or
| "always use a monorepo." It involves engineering tradeoffs
| and decisions that are made in a context, and you can't
| extract your advice from the context in which it exists. What
| works for our cloud team would be a terrible mess on the
| embedded side simply because of how the software is deployed
| and managed.
| jayd16 wrote:
| "You are not google" is also an argument for why you don't
| have to worry about scaling a monorepo.
| benreesman wrote:
| I'll try to tread at least a little lightly here because this
| topic does tend to be a bit flammable, but caveat emptor.
|
| My contrasting anecdotal experience is that whether at BigCo
| or on a small team monorepo is almost always the right answer
| until your requirements get exotic enough that you're in
| special-case land anyways (like a separate repo for machine-
| initiated commits, or something that's security-sensitive
| enough to wall off some contributors).
|
| Both `git` and `hg` scale easily to really big projects if
| you're storing text in them (at FB our C++ code was in a
| `git` monorepo on stock software until like 2014 or something
| before it started bogging down; I'll gloss over the numbers,
| but: big): the monorepo-scaling argument is brought out a lot
| but rarely quantified.
|
| The multi-repo problem that gets you is dependency
| management, which in the general case requires a SAT-solver
| (https://research.swtch.com/version-sat), but of course you
| don't have a SAT-solver in your build script for your small-
| to-medium organization, so you get some half-assed thing like
| what `pip` and `npm` do.
|
| Again purely anecdotal, but in my personal experience multi-
| repo too often gets pushed by folks who want to make their
| own rules for part of the codebase ("the braces go _here_"),
| push an agenda around unnecessary RPCs, or both. That's not
| true of all cases of course, but it's a common enough
| antipattern to be memorable.
| adsfoiu1 wrote:
| I personally have seen the opposite problem - the friction of
| making small changes to "utility" libraries becomes a huge
| pain point for developers when you have to make changes, test
| locally, push to package manager, update all consumers to use
| the new version... It's much easier, in my experience, to
| just consume a class that's already in the same project /
| repo.
| agency wrote:
| I have also experienced this pain where a company I worked
| for went too hard on splitting everything into separate
| repos, such that updating something deep in the dependency
| tree becomes very painful and involves a protracted
| "version bump dance" on dependent repos. There's no silver
| bullet here.
| TechBro8615 wrote:
| > Google has a huge team in charge of helping their monorepo
| scale and work efficiently. You are not google... don't be
| tempted.
|
| It's funny, I've heard this exact same argument for why you
| should not use micro services.
| brown9-2 wrote:
| The other half of needing to use a build system that
| understands the dependency graph like Bazel is that Bazel
| _keeps state_, so that it knows which part of the graph does
| not need to be re-built when you push commit B because it was
| already built in commit A.
| simias wrote:
| If separate projects have independent builds maybe a monorepo
| was not a great idea to begin with?
|
| I have a big monorepo at work but whenever anything changes I
| want to rebuild everything to generate a new firmware image. I
| have ccache set up to speed up the process given that obviously
| only a tiny fraction of the code actually needs to be rebuilt.
|
| It's a bit wasteful, sure, but if I were to optimize it I'd be
| worried about ending up with buggy, non-reproducible builds.
| Easier to just recompile everything every time and make sure
| everything still works the way you expect.
|
| So basically my approach is KISS, even if it means longer build
| times.
| jrockway wrote:
| That's what build systems aim to do, and there are many of
| them. In general, I've found all the tooling required around
| monorepos to be a job for a full-time team. Shortcuts (as
| suggested in other replies) or full builds on every commit tend
| to stop scaling relatively quickly. If you take shortcuts, you
| will find that it becomes "tribal knowledge" to do a full build
| every time you edit a single line of code, and people who were
| once making multiple changes a day start making one change a
| week, or they start committing code without ever having run it.
| (It happened to me on a 4 person team. We had so many things
| that needed to be running to test your app, that people just
| started committing and pushing to production without ever
| having run their changes locally! That is the kind of thing
| that happens if you stop caring about tooling, and it happens
| fast. I addressed it by taking a couple of days to start up the
| exact environment locally in a few seconds, without a docker
| build, and people started running their code again.) Be very,
| very careful.
|
| If you do a full build on every commit, it gets slow much
| sooner than you'd expect, and people are going to do less work
| while they context switch to posting to HN while waiting for
| their 15 minute build for a 1 line code change.
|
| I worked at Google and we had a monorepo, and there were
| hundreds if not thousands of engineers working on build speed
| and developer productivity, and it was still significantly
| slower to "bazel run my-go-binary" versus "go run cmd/my-go-
| binary". In many cases, it was worth it, but in very isolated
| applications, it was definitely not worth it. (And people did
| work around it, by just setting up Git somewhere and using
| Makefiles or whatever, and that ended up being even worse. But
| it gets worse incrementally over time, and you're kind of the
| frog getting boiled alive.)
|
| Where I'm going with this is to advise you to be very careful. The
| tools to support real productivity in a monorepo are expensive
| in terms of your org's time. If you can get by with a repo per
| app and a common modules repo, and just update the app to refer
| to a version of the modules repo as though it's some random
| open source project you depend on, you're going to get much
| farther with much less tooling work than you would with a
| monorepo. But, the modules repo is going to break apps without
| knowing, and that's going to be a pain. Monorepos do exist for
| a good reason.
|
| (The other thing I like about monorepos is that you do less
| per-project setup work. Want to make some new app? You can just
| start writing it, and you get the build, deploy, framework,
| etc. for free. It can be very productive if you're finding
| yourself starting new projects regularly. In my spare time, I
| write some software, and I really regret splitting it up into
| multiple projects. But, it's kind of necessary for open source
| stuff -- people don't want to download ekglue if they want to
| just run jlog. So I split them, but it costs me my valuable
| free time to do something I've already done ;)
|
| My TL;DR is that you will be tempted to take shortcuts and the
| shortcuts will suck. If your project has the resources to have
| someone set up Bazel, distribute the right version of Bazel and
| the JRE to developer workstations, setup CI that is aware of
| Bazel artifact caching, and SREs to be around 24/7 to support
| your now-custom build environment, you will have a good
| experience. Be aware that a monorepo is that level of
| investment.
|
| Meanwhile, if you just have a frontend and a backend in the
| same repo, you can probably get away with a full build for
| every commit. And you don't need that shadow team of tooling
| engineers to make it work, you just need a docker build, and a
| script that runs "go test ./... && npm test" or whatever ;)
| dgellow wrote:
| IIRC you can specify filters "paths" and "paths-ignore" when
| you define a github action that should only be triggered when a
| subdirectory changes.
|
| See this documentation page:
| https://docs.github.com/en/actions/reference/workflow-syntax...
|
| Their example is:
|
|     on:
|       push:
|         paths:
|           - '*.js'
|
| but I believe you can also specify the subdirectory you care
| about.
| kroolik wrote:
| > We made a change to compute these replica checksums prior to
| taking the lock. By precomputing the checksums, we've been able
| to reduce the lock to under 1 second, allowing more write
| operations to succeed immediately.
|
| Isn't this change introducing a race condition? One of the
| replicas' checksums could change between when the checksum is
| computed and when the lock is taken. Otherwise, there is no need
| for the lock at all.
| jacoblambda wrote:
| You can compute the checksums outside of the lock. You just
| need to compare them inside the lock.
|
| The key thing here is that prior to the lock if data changes
| you recompute the checksums. As long as any change outside the
| lock triggers a recompute of the corresponding checksums and no
| changes can occur during the lock, there is no race condition.
|
| I imagine that this may result in data getting de-synced or
| failing the checksum comparisons more often; however, it's
| still a net performance increase as long as the aggregate time
| spent re-syncing the data is less than the extra time spent
| waiting for checksums in the lock.
| alexhutcheson wrote:
| No, it's just switching from a pessimistic locking approach to
| an optimistic one:
| https://en.wikipedia.org/wiki/Optimistic_concurrency_control
___________________________________________________________________
(page generated 2021-03-16 23:00 UTC)