[HN Gopher] We put half a million files in one Git repository (2...
___________________________________________________________________
We put half a million files in one Git repository (2022)
Author : kisamoto
Score : 122 points
Date : 2023-08-28 09:46 UTC (13 hours ago)
(HTM) web link (www.canva.dev)
(TXT) w3m dump (www.canva.dev)
| bob1029 wrote:
| Our monorepo is at ~500 megs right now. This is 7 years worth of
| changes. No signs of distress anywhere, other than a periodic git
| gc operation that now takes just long enough to be noticeable.
|
| I can't imagine using anything else for my current project. In
| fact, the only domain within which I would even consider
| something different would be game development. Even then, only if
| the total asset set is ever expected to exceed a gigabyte or so.
| Git is awful with large blobs. LFS is an option, but I've always
| felt like it was a band-aid and not a fundamental solution.
| Alacart wrote:
| Ah yes, I too have accidentally committed node_modules.
|
| Jokes aside, and coming from a place of ignorance, it's
| interesting to me that a file count that size is still a real
| performance issue for git. I'd have expected something that's so
| ubiquitous and core to most of the software world to have seen
| improvements there.
|
| Genuine, non-snarky question: Are there some fundamental aspects
| of git that would make it either very difficult to improve that,
| or that would sacrifice some important benefits if they were
| made? Or is this a case of it being a large effort and no one has
| particularly cared enough yet to take it on?
| 1MachineElf wrote:
| Other users have made good comments about performance
| limitations on the underlying filesystems themselves. Adding to
| this, I recently encountered the findlargedir tool, which aims
| to detect potentially problematic directories such as this:
| https://github.com/dkorunic/findlargedir/
|
| >Findlargedir is a tool specifically written to help quickly
| identify "black hole" directories on any filesystem having
| more than 100k entries in a single flat structure. When a
| directory has many entries (directories or files), getting
| directory listing gets slower and slower, impacting performance
| of all processes attempting to get a directory listing (for
| instance to delete some files and/or to find some specific
| files). Processes reading large directory inodes get frozen
| while doing so and end up in the uninterruptible sleep ("D"
| state) for longer and longer periods of time. Depending on the
| filesystem, this might start to become visible with 100k
| entries and starts being a very noticeable performance impact
| with 1M+ entries.
|
| >Such directories mostly cannot shrink back even if content
| gets cleaned up due to the fact that most Linux and Un*x
| filesystems do not support directory inode shrinking (for
| instance very common ext3/ext4). This often happens with
| forgotten Web sessions directory (PHP sessions folder where GC
| interval was configured to several days), various cache folders
| (CMS compiled templates and caches), POSIX filesystem emulating
| object storage, etc.
| kudokatz wrote:
| > Are there some fundamental aspects of git that would make it
| either very difficult to improve that, or that would sacrifice
| some important benefits if they were made?
|
| I can't speak to _improving_ git, but I think some light on
| this area can be shed by Linus' tech talk at Google in 2007.
|
| 1. Linus says there's a specific focus on full history and
| content, _not_ files ... so it's a deliberate, different axis
| of focus than file count:
|
| https://youtu.be/4XpnKHJAok8?t=2586
|
| ... AND it's a specific pitfall to avoid when using Git:
|
| https://youtu.be/4XpnKHJAok8?t=4047
|
| 2. As Linus tells it, Git appears to be designed specifically
| for project maintenance while not getting in the way of
| individual commits and collaboration. But the global history
| and more expensive operations on things like "who touched this
| line" are deliberately designed so that lines of a function are
| tracked _across all moves_ of the content itself.
|
| Maintainer tool enablement: https://youtu.be/4XpnKHJAok8?t=3815
|
| Content tracking slower than file-based "who touched this":
| https://youtu.be/4XpnKHJAok8?t=4071
|
| ===
|
| I have no answer, but ...
|
| Practically, I've used lazy filesystems both for Windows-on-Git
| via GVFS [1][2] and Google's monorepo jacked into a mercurial
| client (I think that's what it is?). Both companies have made
| this work, but as Linus says, a lot of the stuff just doesn't
| work well with either system.
|
| Windows-on-Git still takes a lot of time overall, and stacking
| > 10 patches of an exploratory refactor with the monorepo on hg
| starts slowing WAY WAY down to the point where any source
| control operations just get in the way.
|
| [1] https://devblogs.microsoft.com/devops/announcing-gvfs-git-
| vi...
|
| [2] https://github.com/microsoft/VFSForGit
| klodolph wrote:
| > Are there some fundamental aspects of git that would make it
| either very difficult to improve that, or that would sacrifice
| some important benefits if they were made?
|
| It's hard to look at a million files on disk and figure out
| which ones have changed. Git, by default, examines the
| filesystem metadata. It takes a long time to examine the
| metadata for a million files.
|
| The main alternative approaches are:
|
| - Locking: Git makes all the files read-only, so you have to
| unlock them first before editing. This way, you only have to
| look at the unlocked files.
|
| - Watching: Keep a process running in the background and listen
| to notifications that the files have changed.
|
| - Virtual filesystem: Present a virtual filesystem to the user,
| so all file modifications go through some kind of Git daemon
| running in the background.
|
| All three approaches have been used by various version control
| systems. They're not _easy_ approaches by any means, and they
| all have major impacts on the way you have to set up your Git
| repository.
|
| People also want e.g. sparse checkouts, when you're working
| with such large repos.
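|
| A minimal sketch of what a sparse, partial checkout looks like in
| practice (repository URL and paths here are made up, not from the
| article):
|
|   $ git clone --filter=blob:none --sparse https://example.com/monorepo.git
|   $ cd monorepo
|   $ git sparse-checkout set web/editor shared/i18n
|   $ git status    # far fewer files to examine now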
| eviks wrote:
| What about asking the OS for the list of changes, like the
| Everything search tool on Windows does, instantly, for millions,
| at the RAM cost of ~1-2 browser tabs (though that might be
| limited to NTFS, but still)?
| wintogreen74 wrote:
| that's only fast because it's not querying on demand but
| watching continuously, which is essentially what the article
| indicates they're (now) doing
| HALtheWise wrote:
| It's notable that git does support "watching", but it
| requires some setup on Linux to install and integrate with
| Watchman. On Windows and Mac, core.fsmonitor has been built
| in since version 2.37.
|
| https://www.infoq.com/news/2022/06/git-2-37-released/
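|
| A minimal sketch of turning it on with git 2.37+ (on Linux you
| would point core.fsmonitor at the Watchman hook script instead):
|
|   $ git config core.fsmonitor true        # built-in filesystem monitor daemon
|   $ git config core.untrackedCache true   # also cache untracked-file scans
|   $ git status    # first run starts the daemon; later runs skip full rescans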
| robotresearcher wrote:
| Has anyone made a system like option 3 that successfully
| merges git with a filesystem? It could present both git and
| fs interfaces, but share events internally. I'd be interested
| to see how that would work.
| LordShredda wrote:
| That would make you at the mercy of git being a decent file
| system driver.
| 10000truths wrote:
| Are there any solutions that use libgit2's ability to define
| a custom ODB backend? There are even example backends already
| written [1] that use RDBMSs as the underlying data store.
|
| [1] https://github.com/libgit2/libgit2-backends
| klodolph wrote:
| There are repos with many files and there are repos with
| lots of history data. Those are problems with different
| solutions--adding millions of files to the repo will make
| 'git status' take ages, but it won't necessarily put the
| same level of pressure on the object database.
|
| There are various versions of Git that use alternative
| object storage, like Microsoft's VFS, if I remember
| correctly.
| eigenvalue wrote:
| In my experience, the standard linux file system can get very
| slow even on super powerful machines when you have too many
| files in a directory. I recently generated ~550,000 files in a
| directory on a 64-core machine with 256gb of RAM and an SSD,
| and it took around 10 seconds to do `ls` on it. So that could
| be a part of it too.
| tp34 wrote:
| What is the "standard linux file system"?
|
| ext4 on an old system, feeble in comparison to yours,
| performs much better.
|
| ext4, 8GB memory, 2 core Intel i7-4600U 2.1GHz, Toshiba
| THNSNJ25 SSD:
|
| $ time ls -U | wc -l
| 555557
|
| real    0m0.275s
| user    0m0.022s
| sys     0m0.258s
|
| stat(2) slows it down, but still this is not as poor as your
| results:
|
| $ time ls -lU | wc -l
| 555557
|
| real    0m2.514s
| user    0m1.126s
| sys     0m1.407s
|
| Sorting is not prohibitively expensive:
|
| $ time ls | wc -l
| 555556
|
| real    0m1.438s
| user    0m1.249s
| sys     0m0.193s
|
| Drop caches, sort, and stat:
|
| # echo 3 > /proc/sys/vm/drop_caches
|
| $ time ls -lU | wc -l
| 555557
|
| real    0m6.431s
| user    0m1.249s
| sys     0m4.324s
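|
| For anyone who wants to reproduce this, a rough recipe (directory
| name and file count are arbitrary):
|
|   $ mkdir /tmp/manyfiles && cd /tmp/manyfiles
|   $ seq 1 555556 | xargs touch    # ~555k empty files
|   $ time ls -U | wc -l            # readdir only: no sort, no stat
|   $ time ls -lU | wc -l           # adds a stat(2) call per entry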
| bityard wrote:
| IME, on basically all filesystems, just walking a directory
| tree of lots of files is expensive. Half a million files on
| modern systems should not be a terribly huge issue but once you
| get into the millions, just figuring out how to back them all
| up correctly and in a reasonable time frame starts to become a
| major admin headache.
|
| Since git is essentially a filesystem with extensive version
| control features, it doesn't surprise me that it would have
| problems handing large amounts of files.
| thrashh wrote:
| I mean you can design a filesystem to handle a million files
| extremely quickly... it just has to be in the requirements up
| front.
|
| But there will be some trade-off.
|
| And I don't think people generally put "a million files" in
| the requirements because it's fairly rare.
| saltcured wrote:
| Not related to git (I hope), but a lot of scientific
| data/imaging folks seem to think file abstractions are
| free. I've seen more than one stack explode a _single_
| microscope image into 100k files, so you'd hit 1M after
| trying to store just 10 microscope slides. Then, a
| realistic archive with thousands of images can hit a
| billion files before you know it.
|
| It's hard to get people past the demo phase "works for me"
| when they have played with one image, to realize they
| really need a reasonable container format to play nice with
| the systems world outside their one task.
| Frannyies wrote:
| Funny how the view is so different
|
| I always marvel at it and think: "wow so git goes through its
| history, pulls out many small files and chunks and patches,
| updates the whole file tree and all of this after hitting enter
| and being done like immediately."
| Borg3 wrote:
| Hmm, I've read this one: "These .xlf files are generated and
| contain translated strings for each locale."
|
| So why store them under VCS in the first place? I think they are
| doing it wrong.
| ufjfjjfjfj wrote:
| I can't be the only one thinking this is a small number of files,
| unless you keep them all in the same directory
| baz00 wrote:
| Probably learned how enterprise software developers suffer.
| mrAssHat wrote:
| The site is not opening. Thanks, CloudFlare.
| psydvl wrote:
| There is VFS for Git from Microsoft that can solve the problem in
| a more elegant way, I think: https://github.com/microsoft/scalar
| MikusR wrote:
| That was discontinued (multiple times, under different
| names) and has been moved into a git fork.
| https://github.com/microsoft/git
| zellyn wrote:
| Are they still trying to upstream everything? For a while
| they were being good about that...
| ComputerGuru wrote:
| Do you know what replaced it?
| WorldMaker wrote:
| git
|
| They upstreamed almost everything. The last version of
| "scalar" was mostly just a configuration tool for sparse
| checkout "cones" which needed a bit of hand-holding, and
| that is easier to configure in git itself now, or so I
| hear.
| MikusR wrote:
| Their fork https://github.com/microsoft/git
| Groxx wrote:
| Don't bother with watchman, it has consistently been so flaky
| that I simply live with the normal latency.
|
| Thankfully, nowadays git has one built in for some OSes, and it's
| much, MUCH better than watchman ever was.
| fsckboy wrote:
| this is one of those multipurpose PR articles (not all bad) to
| generate awareness of the company, their product, use case, and
| developers.
|
| > _At Canva, we made the conscious decision to adopt the monorepo
| pattern with its benefits and drawbacks. Since the first commit
| in 2012, the repository has rapidly grown alongside the product
| in both size and traffic_
|
| While reading it I was having trouble keeping track of where I
| was in the recursion, it's sort of "Xzibit A" for "yo dawg, we
| know you use source repositories, so check out our source
| repository (we keep it in our source repository) while you check
| out your source repository!"
| time4tea wrote:
| We learned they were 70% autogenerated, so they probably shouldn't
| have been in git at all, but our build process relied on that, and
| we didn't want to fix it, so we bodged it.
| [deleted]
| Cthulhu_ wrote:
| I'm on the fence with this one. My previous project was Go &
| Typescript with a range of generated files; I committed the
| generated files, so that they would flag up in code reviews if
| they were changed, avoiding hidden or magic changes. I also
| didn't automatically regenerate, avoiding churn.
|
| That said, if the autogenerated output is stable, it's fine.
| After all, in a sense, compiling your code is also a kind of
| autogenerating and few people will advocate for keeping
| compiled code in git.
| maccard wrote:
| > probably shouldn't have been in git at all
|
| Something being autogenerated, or binary, doesn't mean it
| shouldn't be in version control. If step one of your
| instructions to build something from version control involve
| downloading a specific version of something else, then your VCS
| isn't doing its job, and you're likely skirting around it to
| avoid limitations in the tool itself. People still use tools
| like P4 because they want versioned binary content that belongs
| in version control, or because they want to handle half a
| million files, and git chokes.
|
| In my last org, we vendored our entire toolchain, including
| SDKs. The project setup instructions were:
|
| - Install p4
| - Sync, get coffee
| - Run build, get more coffee.
|
| A disruptive thing like a compiler upgrade just works out of
| the box in this scenario.
|
| It's a shame that the mantra of "do one thing well" devolves
| into "only support a few hundred text files on linux" with git.
| PMunch wrote:
| Wouldn't Git LFS be the tool for this job? Have the automated
| tool build, for example, a .zip file of the translations
| (possibly with the compression level set to 0), then have your
| build toolchain unzip the archive before it runs. Then check
| that big .zip file into GitLFS, et voila you now have this
| large file versioned in Git.
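|
| Roughly like this (the archive name is just an example; git LFS
| must be installed first):
|
|   $ git lfs install                    # one-time, per machine
|   $ git lfs track "translations.zip"   # writes a pattern into .gitattributes
|   $ git add .gitattributes translations.zip
|   $ git commit -m "Store generated translations as one LFS archive"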
| folmar wrote:
| It's good enough for small use cases, but way behind
| tools that have first-class support for binary files
| (binary deltas, common compression, ...). Even SVN shines
| here.
| maccard wrote:
| Git LFS isn't the same as git, though. It's better than
| putting everything in a separate store, but for one it
| disables offline work and breaks the D in git's DVCS.
|
| > then have your build toolchain unzip the archive before
| it runs
|
| My build toolchain shouldn't have to work around the
| shortcomings of my environment, IMO.
|
| > et voila you now have this large file versioned in Git.
|
| No, it's on a separate http server that is fetched via git
| lfs. Subtle, but important difference.
| aidenn0 wrote:
| > it disables offline work,
|
| This is a non-issue for images and autogenerated files,
| since you shouldn't ever be doing a merge on them.
|
| > breaks the concept of D in the DVCS of git.
|
| git-annex is distributed and works well for files that
| will never be merged (such as images, or autogenerated
| files)
| avidiax wrote:
| > Something being autogenerated, or binary, doesn't mean it
| shouldn't be in version control.
|
| I think the SHA should be in version control. The file should
| be reproducibly built [1], then cached on a central server.
|
| This means that a build target like a system image could be
| satisfied by downloading the complete image and no
| intermediate files. And a change to one file in one binary
| will result in only a small number of intermediate files
| being downloaded or reproducibly built to chain up to the new
| system image.
|
| This is something that's really lacking in, for example, Git.
|
| [1] https://en.wikipedia.org/wiki/Reproducible_builds
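|
| As a crude sketch of the idea (cache URL and file names invented):
| commit only the hash, then fetch by hash or rebuild on a miss.
|
|   $ sha256sum out/strings.en.xlf > strings.en.xlf.sha256   # only this small file is committed
|   $ HASH=$(cut -d' ' -f1 strings.en.xlf.sha256)
|   $ curl -fsS -o out/strings.en.xlf "https://cache.example.com/blobs/$HASH" \
|       || make out/strings.en.xlf    # cache miss: build it reproducibly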
| jtsiskin wrote:
| https://git-lfs.com/
| maccard wrote:
| > I think the SHA should be in version control. The file
| should be reproducibly built [1], then cached on a central
| server.
|
| Requiring reproducible builds to handle translations or
| images is a bit much. Also, if it's cached on a central
| server, that now means you need to be connected to that
| central server. If you require a connection to said central
| server, why not just have your source code on said server
| in the first place, a la p4?
|
| I do agree that NixOS is a great idea, but personally 99%
| of my problems would be solved if git scaled properly.
| avidiax wrote:
| You can always build from source in this scenario. The
| cache server lets you skip two things. First, you can
| prune the leaves of the tree of intermediate files you
| might need. Second, where you do need to
| compile/build/link/package, etc., you can do only those
| steps that are altered by your changes. So you save CPU
| time and storage space.
|
| > why not just have your source code on said server in
| the first place, a la p4?
|
| That would be great. A version of git where cloning is
| almost a no-op, and building is downloading the package
| assuming you haven't changed anything.
|
| I'm not aware of p4 allowing this. My recollection of
| Perforce is that I still had most source files locally.
| tomjakubowski wrote:
| Does perforce have features which make vendoring easier? Just
| curious why I see P4 called out here and in the replies too.
| tom_ wrote:
| It just does a pretty good job of dealing with binary files
| in general. The check in/check out model is perfect for
| unmergeable files; you can purge old revisions; all the
| metadata is server side, so you only pay for the files you
| get; partial gets are well supported. And so, if you're
| going to maintain a set of tools that everybody is going to
| use to build your project, the Perforce depot is the
| obvious place to put them. Your project's source code is
| already there!
|
| (There are various good reasons why you might not! But
| "because binary files shouldn't go in version control" is
| not one of them)
| Karellen wrote:
| > In my last org, we vendored our entire toolchain,
|
| You vendored all your compilers/language runtimes in the
| source control repo of each project? Including, like, gcc or
| clang? WTF?
|
| > It's a shame that the mantra of "do one thing well"
| devolves into "only support a few hundred text files on
| linux" with git.
|
| Because the Linux kernel source tree and its history can
| accurately be described as "a few hundred text files".
|
| Yeah, right.
| 0xcoffee wrote:
| It's not that unusual, we vendor entire VM images which
| contain the development environment. (Codebase existed
| since before docker). And it works well, need to fix
| something in a project that was last updated 20 years ago?
| Just boot up the VM and you are ready.
| xorcist wrote:
| I don't think that was the question but rather why commit
| to git?
|
| Having local commits intermingled with an upstream code
| base can make for really hairy upgrades, but I guess
| every situation is slightly different.
| maccard wrote:
| > but rather why commit to git?
|
| Well we don't put them in git, we put them in perforce
| because git keels over if you try and stuff 10GB of
| binaries into it once every few months.
|
| I think the real question is the other way around though,
| why _not_ use git for versioning when that's what it's
| supposed to be for? Why do I have to version some things
| with git, and others with npm/go
| build/pip/vcpkg/cargo/whatever?
| tom_ wrote:
| I've worked on a couple of game projects that did this.
| Build on Windows PC, build for Windows/Switch/Xbox One/Xbox
| Serieses/PS4/PS5/Linux. I was never responsible for setting
| this up, and that side of things did sound a bit annoying,
| but it seemed to work well enough once up and running. No
| need to worry about which precise version of Visual Studio
| 2019 you have, or whether you've got the exact same minor
| revision of the SDK as everybody else. You always build
| with exactly the right toolchain and SDK for each target
| platform.
| maccard wrote:
| > You vendored all your compilers/language runtimes in the
| source control repo of each project? Including, like, gcc
| or clang? WTF?
|
| Yep. Along with platform SDKs, third party dependencies,
| precompiled binaries, non-redistributable runtimes, you
| name it.
|
| Giant PSD or FBX files? 4K Textures? all of it.
|
| Client mappings are the bread and butter of P4 (or Stream
| views more recently which are not as nice to work with) -
| you say "I don't want the path containing MacOS" if you
| don't want it.
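|
| For illustration, an exclusionary mapping in a client view looks
| roughly like this (depot and workspace paths invented):
|
|   View:
|       //depot/Game/... //my-workspace/Game/...
|       -//depot/Game/Platforms/MacOS/... //my-workspace/Game/Platforms/MacOS/...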
|
| > Because the Linux kernel source tree and its history can
| accurately be described as "a few hundred text files".
|
| I was off by a little bit, it's ~60k. But it's still "only"
| 60k text files, no matter how important those text files
| are.
| thechao wrote:
| This is _precisely_ why every ASIC (HW) company I'm familiar
| with uses P4. ASIC design flows rely _critically_ on 3rd
| party tooling, that must be version/release specific. You
| can't rely on those objects being available whenever. They
| get squirreled away and kept, forever.
| IshKebab wrote:
| It's not an unbreakable rule that generated or binary files
| should not be in Git. It's a rough guideline. Partly because
| Git is bad at dealing with binary files.
|
| There are plenty of cases when including generated files is
| appropriate. It has many advantages over not doing that -
| probably the biggest are
|
| * Code review is much easier because you can see the effect on
| the output.
|
| * It's easier to find the generated files because they're next
| to the rest of your code. IDEs like it much more too.
|
| In fact the upsides are so great and the downsides so minimal I
| would say it should be the default option as long as:
|
| * The generated files are not huge.
|
| * The generated files are always the same.
|
| Even when they are huge it might still be a good idea, but you
| can put the files in a submodule or LFS. I do that for a
| project that has a really difficult to install generator so
| users don't need to install it.
| issafram wrote:
| Since when was a "monorepo" ever considered a good idea?
| Cthulhu_ wrote:
| A couple of years now, but whether it's a good idea depends on
| your use case and organization. Seems to work for some. It
| works for my current assignment too - two and possibly more
| React Native apps that reuse a lot of components and
| translations, have the same APIs, etc.
| yashap wrote:
| They have some nice advantages:
|
| - Makes it easy to develop applications and libraries together
| in a single branch
|
| - Similarly, makes it easy to make a breaking change to a
| library, then change all clients of said library, in a single
| branch
|
| - And because of the above, makes it easy to keep all
| dependencies on internal libs at the latest version, which can
| greatly reduce all sorts of "dependency hell" issues
|
| - Generally makes integration testing a bit easier
|
| The downside is you have to invest a lot more time in tooling,
| keeping both local and CI builds fast. And even with that
| tooling, builds won't be as fast as they trivially are with
| multi-repo. But if you do invest that time in tooling, you can
| generally get them fast enough, and then reap the other
| benefits for a very productive dev experience.
|
| Have done both monorepo and multi-repo at different, decent
| sized companies. Both have their pros/cons.
| [deleted]
| jiggawatts wrote:
| Something I learned about writing robust code is that scalability
| needs to be tested up-front. Test with 0, 1, and _many_ where the
| latter is tens of millions, not just ten.
|
| I've seen production databases that had 40,000 tables for _valid_
| reasons.
|
| I've personally deployed an app that needed 80,000 security
| groups in a single LDAP domain, just for it. I can't remember
| what the total number of groups across everything was, but
| it was a decent chunk of a million.
|
| Making something like Git, or a file system, or a package
| manager? Test what happens with millions of objects! Try
| _billions_ and see where your app breaks. Fix the issues even if
| you never think anyone will trigger them.
|
| It's not about scaling to some arbitrary number, it's about
| _scaling_, period.
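|
| For git specifically, a quick way to find where things start to
| hurt (the numbers are arbitrary; this is just a stress sketch):
|
|   $ git init /tmp/stress && cd /tmp/stress
|   $ for i in $(seq 1 100); do
|       mkdir -p "dir$i"
|       seq 1 10000 | sed "s|^|dir$i/f|" | xargs touch
|     done
|   $ time git add . && time git commit -qm 'one million files'
|   $ time git status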
| switch007 wrote:
| At what cost?
| zoomablemind wrote:
| > ...scalability needs to be tested up-front.
|
| I'd rephrase it: if expecting massive or longterm use - know
| how that thing is built/designed.
|
| Picking a technology based on general popularity or vendor's
| marketing is not a way to solve _your_ problem.
| Karellen wrote:
| > Making something like Git, or a file system, or a package
| manager? Test what happens with millions of objects!
|
| Test with, say, one of largest open-source projects in
| existence at the time? Like, for instance, the Linux kernel?
| layer8 wrote:
| When Git was first released, the Linux kernel sources had
| less than 20,000 files. It currently has around 70,000 files.
| It's not nothing, but it also isn't millions.
| compiler-guy wrote:
| The kernel is big, but it isn't _that_ big in the grand
| scheme of things. The project from the original article here
| is bigger, and many companies have projects bigger than that.
| vincnetas wrote:
| could you elaborate on 40,000 tables DB? I want to learn what
| could be valid reasons for that?
| horse_dung wrote:
| Database audit tool and they needed to test what happens when an
| excessive number of tables is hit??? :)
| jasonjayr wrote:
| Not the OP, but in our case, 300 tables x 300 customers
| (different 'schemas/dbs') == single mysql instance with
| 90000+ tables.
| stef25 wrote:
| Never understood why each customer would have their own DB.
| Must be a nightmare to maintain.
| icedchai wrote:
| I worked on a system like this. Rolling out migrations
| would take hours.
| asguy wrote:
| Did you not parallelize your migrations?
| icedchai wrote:
| We did, but there was only so much migration load we
| wanted to place per DB server. Some DBs had 100's of
| customers.
| wernercd wrote:
| Security. You have access to stef25_ tables and I don't.
|
| The alternative would be that we both have access to the same
| tables, with a permission layer granting access per row.
|
| Both choices have trade-offs, but if the company makes a
| mistake and I now have access to your rows? Seems easier
| to control access at the table layer rather than the
| row layer.
| indeed30 wrote:
| It makes enterprise sales easier, since it removes a
| common objection from security, privacy and compliance
| teams.
| Cthulhu_ wrote:
| Wouldn't having separate databases (with separate users
| (per organization)) make more sense from a security point
| of view? I have no knowledge of these things, I've never
| actually worked with more than one database in a mysql
| instance.
|
| edit: I tell a lie, I separated the forums and wordpress
| databases on a website I run.
| vincnetas wrote:
| One DB server can have multiple DB's. In this case we are
| talking about a single DB (not server) containing many
| thousands of tables. And I'm curious what the use case is
| for such designs.
| vincnetas wrote:
| Sure, multi-tenant DBs, but let's limit this to per-project
| DBs. 300 tables is quite reasonable, but many thousands?
| wiredfool wrote:
| I've got a db that hosts postgresql versions of CSVs/XLSs
| that are uploaded/harvested to an open data portal (as
| part of the portal). There are ~10k of them in there
| (+-), and could certainly see more (O(5k)) if some of the
| CSVs were parsed better.
| phyrex wrote:
| Represent every entity (like "person" or "post") in a large
| graph using a relational database. You can get to 40k rather
| quickly
| Timon3 wrote:
| Do you have any example for a project with 40k entities?
| I'd love to see how they handle the complexity.
| vincnetas wrote:
| Is this valid reason to use RDBM like this?
| phyrex wrote:
| Why not? I don't think there are many graph databases
| that are set up to handle multiple petabytes of data, so
| RDBMs make a good storage layer at that scale
| hgsgm wrote:
| Why would you need a table for each person or post? Those
| are rows.
| phyrex wrote:
| Each entity. Person would be a table, you and I would be
| rows, correct.
| lenkite wrote:
| Saw ~60k tables in one famous ecommerce company backend
| primarily due to sharding - spread across multiple DBs, of
| course.
| marcosdumay wrote:
| Any ERP will bring you near that.
| tambourine_man wrote:
| I was surprised as well, I feel one may need a DB to sort
| that DB. A meta DB :)
| jiggawatts wrote:
| SAP with a bunch of plugins plus custom tables added for
| various purposes. This is for managing the finances of 200K
| staff across 2,500 locations.
| stef25 wrote:
| Isn't that just bad design on the part of SAP ?
| miroljub wrote:
| Not necessarily.
|
| Would it be "better" if they had one table with
| json/xml/whatever and handled schema in code?
|
| They made a trade-off they found right. When they hit the
| limit with their approach, they even implemented their
| own DB (S4/Hana) to support their system.
| pravus wrote:
| I worked at an educational institution where we ran an
| academic-focused Enterprise Resource Planning (ERP) system
| that was fairly large. Not quite 40k tables, but it had over
| 4k. To give you an idea of how this was organized:
| * Most simple things like a "Person" were multiple tables
| because you had to include audits and historical changes
| for each field.
| * A "Person" wasn't even all that useful because it included
| guests or other fairly transient entities like vendor
| contacts, so you had an explosion of more tables as you
| classified roles into "Student", "Faculty", "Employee",
| etc... (many with histories as above).
| * Addresses and other non-core demographic information were
| usually sharded into all sorts of categories like "primary",
| "parent's", "last known good", "good for mailing", etc...
| (more histories, etc...)
| * All coded information like label types such as "STUDENT" or
| "MAILING" were always handled as separate validation tables
| with strict FK constraints and usually included extra meta
| information like descriptions and usage notes within parts
| of the system.
| * Each functional sub-system (HR, Payroll, AR, AP, etc.) had
| its own dedicated schema.
| * All external jobs, processes, and external integrations
| were configured separately.
| * All enterprise integrations usually had a whole dedicated
| schema for configuration.
| * Most parts of the interactive web UI were database driven
| (Oracle's Apache mod PL/SQL) with many templates and other
| components stored in large collections of tables.
|
| I'll stop there, but basically just imagine a very large
| application that tries to be 100% database-driven. That's how
| you get a lot of tables.
| Spivak wrote:
| And honestly, I kinda get it. Until you run into a case
| where your volume is such that you physically _can't_ run
| it on the db, run it on the db. I run all my job
| processing off the DB and couldn't be happier. I have to
| hit "can't run alongside the real data" and "can't run in
| its own db" before I'll need to consider something else.
|
| It probably feels weird for devs to drive the UI off the db
| but it's just Wordpress by another name.
| Cthulhu_ wrote:
| I've worked with / on an application like that, it had all
| form fields awkwardly configured in a database, plus a
| complicated database migration script to add, remove and
| update those fields.
|
| When I rewrote the application I just hardcoded the form
| fields, nobody should need to do a database migration to
| change an otherwise mostly static form.
| notTooFarGone wrote:
| I mean that's how you get k8s for projects that in reality will
| never need it. Now you have a developer that is only doing k8s.
| Managing overhead and minimizing it is really something to keep
| in mind. So your App can't handle 100000 concurrent Users? As
| long as there is a plan how you could enable that in case of
| emergency there is really no incentive to have all that
| premature optimization for 90% of companies imo.
| horse_dung wrote:
| I would agree with scale orders of magnitude higher than you
| can possibly imagine. But once you know what your scaling
| limits are (and there always are limits) and what the
| (pre)failure behaviour looks like... well, you don't _have_ to
| fix them...
| miroljub wrote:
| While this is true in some cases, more frequently I saw apps
| designed and able to handle millions of users and billions of
| transactions that ended up being used by tens of users and
| hundreds of transactions.
|
| All the effort spent on testing and optimizations for scaling
| purpose was a waste of time and resources, that could be better
| spent elsewhere.
|
| I'm not saying one should not care, or code sloppily, but there
| is a balance where code is just good enough for the purpose.
| There's a lot of truth in this "don't do premature
| optimization".
| jandrewrogers wrote:
| You still need to do enough to buy time if you do need more
| scalability, since scalability tends to be architectural.
| Waiting until you hit a wall in production is usually months
| too late to start working on it. As a moving target, I often
| try to test at 10x the current workload, which is usually
| enough to deal with load spikes and surfaces scalability
| issues early enough that customers don't see them.
| thrashh wrote:
| This is where experience comes in.
|
| Someone experienced will know how much work a certain
| approach will take and its capacity.
|
| Sometimes there are quick wins to give like 100x capacity to
| a system just by doing things slightly differently, but only
| with experience will you know that.
| pixl97 wrote:
| The question becomes: when your app is on the growth curve, do
| you start to test this?
|
| I work in enterprise software and one of the big problems I
| see is companies suck at software growth when it's obvious
| the software is in the upward curve.
|
| Large companies will throw huge amounts of data at your app
| once you sell it to them.
| heavenlyblue wrote:
| Yeah, and these apps probably would never work with that many
| users in practice because they missed a few things here
| and there, and the only way to fix them would have been to
| have that amount of traffic in the first place.
| jasfi wrote:
| I think the GP was saying that it should scale without
| breaking. It can get slow, fine, that's a different
| challenge. But it shouldn't segfault (as an example).
| wernercd wrote:
| "GP was saying that it should scale without breaking" and
| the responder was saying that making that a priority means
| that you're wasting time on something that probably won't be
| needed.
|
| The time you spend making it work for millions of users
| that won't materialize is time not spent delivering value to
| customers that do need it.
| yebyen wrote:
| The gist of it is this: many load tests don't even consider
| the actual potential volume of traffic. But that's fine, if
| you're using a load tester, you don't have to estimate the
| traffic well - even to within an order of magnitude. You
| can just add a couple extra zeroes, and see where it
| breaks. Failing to do this simple thing will usually lead
| to objectively worse software, and there's a chance that
| some day you'll need to handle that much traffic.
|
| But that chance isn't the sole reason why you're doing that
| load test. The reason is to improve the software. You're
| identifying defects by stressing the limits.
|
| When you're doing a load test (or any test really) the
| possible outcomes are basically three: (1) it works! (2) it
| broke. (3) huh, that's interesting. If your tests are
| always coming up (1) then you're not obtaining any benefit
| from them. Don't you want to know where the limiting
| factors are in your app? If you're able to remove those
| limits, but not for production (at least not right now),
| wouldn't it be great to know what will break next month (or
| next year) when you do?
|
| Think of the person who writes unit tests for every piece
| of code, but not as TDD. There's a school of thought that
| you should write the test first, then write the simplest
| code that passes, and that's fine but not what I'm talking
| about. Imagine a person who writes perfect code and perfect
| tests. Every code works, every test passes confirming that
| it worked. What is even the value of writing the test?
|
| That's what load testing under only the expected conditions
| is like. We already know the software works under those
| conditions (likely) because it's already in production,
| handling that amount of load. So while there is value in a
| load test that runs prior to deployment, in order to check
| that nothing of the change is likely to induce a break
| under the expected/existing load, it's a different kind of
| testing and produces different value than a stress test
| that is designed to hopefully induce a failure and show you
| where there is a defect in code. Where it segfaults, for
| example.
|
| And just because you've identified a limiting factor
| outside the bounds of what expected activity is likely to
| go through the system soon, doesn't mean you need to fix it
| now. Having one less "known unknown" on the table is a
| thing of value. Now that stress won't be able to surprise
| you later, when that parameter has drifted into the danger
| zone because of organic development, and now it's becoming
| a thing in the way.
| Cthulhu_ wrote:
| It's a difficult one; if you don't know yet if you will have
| billions of transactions, you should focus on clarity and
| flexibility - that is, you should be able to rewrite and re-
| architect your application and its runtime IF it turns out to
| be successful.
|
| A parent comment mentioned SQL databases for example; those
| are great because they can scale both horizontally and
| vertically these days, sometimes with the click of a button
| in AWS.
|
| Other good practices are things like stateless back-end
| services so they can scale horizontally, thoroughly
| documenting (and maintaining documentation) on business
| processes handled by the software, monitoring, etc.
|
| Disclaimer: I'm an armchair expert, I've never had to deal
| with back-end scaling.
| coffeebeqn wrote:
| Something has gone horribly wrong if you don't know if the
| requirements are 10 request per second or a billion per
| second.
|
| We build some services from the ground up for very high
| traffic and the hoops you have to jump through and the
| tradeoffs just don't make sense for a basic CRUD thing
| which can run on a boring ole machine and a little SQL
| instance
| wintogreen74 wrote:
| also all projects, git or anything else, have limited
| resourcing. I'd rather it's spent on the prioritized features
| & needs than exhaustive testing for edge cases.
| klysm wrote:
| I think you can write good code that intentionally omits
| performance optimizations you know could be made, but don't
| want to make right now because it trades off complexity for
| performance. I usually leave myself a note of how to improve
| it if it does in fact become the bottleneck or starts to hurt
| latency
| crabbone wrote:
| People who make filesystems test this stuff and will be able to
| tell you the ballpark figure for performance of this kind of
| operation even w/o testing. Testing here isn't the problem...
|
| The problem here is that we need a reasonably small interface
| for filesystem to enable competing implementations, so, for
| example, we don't have a filesystem interface for bulk metadata
| operations, because this is an unusual request (most user-space
| applications which consume filesystem services don't need it).
| So, we can only query individual files for metadata changes
| through "legal" means (i.e. through the documented interface).
| And now you end up in a situation where instead of fetching all
| the necessary information in a single query, the performance
| impact of your query scales linearly with the number of items
| queried.
|
| Even if Git developers anticipated this performance bottleneck,
| there's not much they can do w/o doing some other undesirable
| stuff. Any solution created outside of the filesystem would
| risk de-synchronization with the filesystem (i.e. something
| that watches the state of the filesystem dies and needs to be
| restarted, either losing old changes or changes done between
| the restarts). Another solution could try going behind the
| documented filesystem interface, and try to salvage this
| information directly from the known filesystems... which would
| be a lot of work compounded with the potential to screw up your
| filesystem.
|
| Maybe if we'd have Git integrated with the kernel and be able
| to thus integrate better with at least the in-kernel
| filesystems. But this would still put people on anything but
| Linux at a disadvantage, and even on Linux, if you wanted some
| filesystem that's not in the kernel, you'd also have the same
| problem...
| dahfizz wrote:
| Scaling code isn't always as simple as rewriting your search
| function to be faster.
|
| What if scaling to millions of objects forces real tradeoffs
| for the hundreds of objects case?
|
| It feels like you're asking people to only create Postgres, but
| SQLite has a perfectly valid use case as well.
|
| In this case, git checking the metadata of 500k files is
| fundamentally slow. The only way around this is to change how
| git tracks files, which all come with other usability
| tradeoffs. Git itself supports a fsmonitor that makes handling
| more files faster, but very few people use it because the
| tradeoffs aren't worth it.
| adamckay wrote:
| > It feels like you're asking people to only create Postgres,
| but SQLite has a perfectly valid use case as well.
|
| I don't mean this to be a "well actually" comment, but
| because I found it interesting when I learnt this a few weeks
| ago - some limits for SQLite [1] are actually higher than the
| limits for Postgres [2] (specifically the number of columns
| in a table and the maximum size of a single field).
|
| 1 - https://www.sqlite.org/limits.html
|
| 2 - https://www.postgresql.org/docs/current/limits.html
| lordgrenville wrote:
| I've been working with some SQLite databases that are
| >100GB lately, and wondering if this is a bad idea. The
| theoretical max size is 140TB, but there's a big gap
| between _can_ and _should_.
| 5e92cb50239222b wrote:
| Among mainline Linux filesystems, xfs started doing this first.
| The test suite is still named xfstests, although many more
| filesystems rely on it now. They regularly test xfs on enormous
| filesystems which 99.9% of us will never see, both with
| hundreds of billions of tiny files, and relatively small
| numbers of very large ones, plus various mixes of the two.
| Pushing it into edge cases like billions of files in one
| directory without any nesting. I really like that strong
| engineering culture and that's why I prefer xfs for most stuff.
| hinkley wrote:
| These sorts of exercises help with performance tuning in the
| small.
|
| One thing you should learn, and many don't, about perf
| analysis is that you start getting serious artifacting in the
| data for tiny functions that get called an awful lot. I've
| found a lot of tangible improvements from removing 50% of the
| calls to a function that the profiler claims takes barely any
| time. Profilers lie. You have to know what they lie about.
|
| When I'm trying to optimize leaf- or near-leaf-node functions
| I've been known to wrap the call with a for loop that runs
| the same operation 10, 100, 1000 times in a row, just so I
| can see if some change has a barely-double-digit effect on
| performance. These predictions usually hold up in production.
|
| Just be very, very sure not to commit that for loop.
|
| Or use representative data that is ridiculously large
| compared to the average case.
| silvestrov wrote:
| > edge cases like billions of files
|
| Sometimes edge cases can quickly detect bugs that only
| happen rarely under normal circumstances and are therefore
| difficult to reproduce/debug.
|
| E.g. when programming in C for little endian computers it can
| be a good idea to test code on big endian CPUs as the
| difference in endianess can reveal "out of bounds" writes for
| pointers.
| crabbone wrote:
| This is really not unique to XFS... anyone who worked on a
| filesystem in at least the last decade would tell you that
| tests like the one the OP inadvertently created
| are commonplace.
|
| Unlike with many user-space applications, filesystems have
| very well-defined range of conditions they have to work in.
| Eg. every filesystem worth its salt will come with a limit on
| number of everything in it, i.e. number of files, groups,
| links and so on. And these limits are tested, they aren't
| conjectures. Ask any filesystem developer how many metadata
| operations per second can their program do, and they will
| likely be able to answer you in their sleep. This might be
| surprising on the consumer end of the deal, but to the
| developers there's nothing new here.
| hjgraca wrote:
| Or as they call it, a simple "Hello World" Javascript project
| paulirish wrote:
| > A git fetch trace that was captured
|
| Anyone know what observability software they are using to
| visualize the GIT_TRACE details? (Or is the assumption that the
| UI is Olly
| as well?)
| avidiax wrote:
| How does a "monorepo" differ from, say, using a master project
| containing many git submodules[1], perhaps recursively? You would
| probably need a bit of tooling. But the gain is that git commands
| in the submodules are speedy, and there is only O(logN) commit
| multiplication to commit the updated commit SHAs up the chain.
| Think Merkle tree, not single head commit SHA.
|
| Eventually, you may get a monstrosity like Android Repo [2]
| though. And an Android checkout and build is pushing 1TB these
| days.
|
| But there, perhaps, the submodule idea wins again. Replace most
| of the submodules with prebuilt variants, and have full source +
| building only for the module of interest.
|
| [1] https://git-scm.com/book/en/v2/Git-Tools-Submodules
|
| [2] https://source.android.com/docs/setup/download#repo
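|
| For reference, the submodule workflow being described is roughly
| this (URLs and paths are placeholders):
|
|   $ git submodule add https://example.com/libs/i18n.git libs/i18n
|   $ git submodule update --init --recursive   # after clone: fetch every pinned child
|   $ git -C libs/i18n pull                     # move the child forward
|   $ git add libs/i18n && git commit -m "Bump i18n"   # each bump is a parent commit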
| ramesh31 wrote:
| >How does a "monorepo" differ from, say, using a master project
| containing many git submodules[1], perhaps recursively?
|
| Submodules are essentially broken with no way to fix them. It
| was a good idea that never took off.
| maccard wrote:
| The problem with submodules is they're not "vanilla" git, and
| have some subtle, unexpected behaviours. See this thread[0] for
| some examples.
|
| Submodules, like LFS, are a great idea that suck in practice
| because they're bolted on to git to avoid compromising the
| purity of the base project.
|
| [0] https://news.ycombinator.com/item?id=31792303
| [deleted]
| tazjin wrote:
| Most large monorepos simply are not on git. Google has Piper,
| Yandex has arc, Facebook has eden (which is actually semi-open-
| source, btw!), some companies use Perforce and so on.
| MikusR wrote:
| Microsoft uses git.
| tazjin wrote:
| Not off-the-shelf git though, they have their own file
| system virtualisation stuff on top. Some of that used to be
| open-source (Windows only, I think?).
| WorldMaker wrote:
| VFS for Git is still Open Source:
| https://github.com/microsoft/VFSForGit
|
| Microsoft's blog posts have indicated a move to use
| something as close to off-the-shelf git as possible,
| though. They say they've stopped using VFS much and are
| instead more often relying on sparse checkouts. They've
| upstreamed a lot of patches into git itself, and maintain
| their own git fork but the fork distance is generally
| shrinking as those patches upstream.
| 0xcoffee wrote:
| Windows is coming out with their own 'Dev Drive':
| https://learn.microsoft.com/en-us/windows/dev-drive/
|
| I'm very curious how it performs compared to EdenFS: https://
| github.com/facebook/sapling/blob/main/eden/fs/docs/O...
| tazjin wrote:
| Maybe I'm misreading those Dev Drive docs, but those don't
| seem related in any way?
|
| Dev Drive seems to be a special type of disk volume with
| higher reliability or something for dev-related workloads.
|
| EdenFS is "Facebook's CITC", i.e. a virtual filesystem view
| into a remote version-control system.
| avidiax wrote:
| I think it's orthogonal.
|
| "Monorepo" is a culture around having a single branch with a
| single lineage, and not developing anything in any isolation
| greater than a single developer's workstation.
|
| I agree that Git is not very adequate for large monorepos,
| but I'd say that most open source projects are on Git, and
| most of them are trivial monorepos.
| hgsgm wrote:
| No. Branching is orthogonal to modules/monorepo.
| ants_everywhere wrote:
| The monorepo is essentially a single file system.
|
| Things like moving a file from one git submodule to another is
| more cumbersome than just `mv foo dir/bar`. That means your
| directory structure is in practice tightly coupled to the tree
| of git projects.
|
| Also, since any of the git sub-repos can be branched, the chaos
| of merging development branches seems like it gets even more
| complicated in a submodule architecture.
|
| It may be possible to put a user interface that abstracts away
| the submodule architecture and forces everything to live on
| HEAD. But at that point it might be easier to just provide a
| git-like UI to a centralized VCS.
| dmoy wrote:
| > How does a "monorepo" differ from, say, using a master
| project containing many git submodules[1], perhaps recursively
|
| One fundamental way it differs is atomic commits. You can't
| change something in repo A and subsubrepo XYZ in a single pull.
|
| A monorepo allows you to do things like atomic commits to
| arbitrary pairs of files in the repo, which among other things
| opens up the possibility of enforcing single-version of
| libraries, which in turn removes a whole class of diamond
| dependency issues.
|
| There's other benefits, but imo it's probably not worth it for
| most companies because of the staggering number of things it
| breaks in the developer toolspace once it gets large enough.
| Eventually you need teams of people that do nothing but make
| tooling to support monorepo scaling, because everything off the
| shelf explodes (what do you do when even perforce can't handle
| your repo?)
|
| For example, at Google we have a team of people who do nothing
| but, effectively, recreate the cross referencing and jump-to-
| def everyone else gets for "free" from IntelliJ / VS
| IntelliSense, etc. (We do other stuff too, but that's a fair
| paraphrase). And on top of that the team really only exists
| because Steve Yegge is a Force to be Reckoned With, otherwise
| we might still be flailing around without jump to def, idk.
| jeffbee wrote:
| The ability of Google's internal code search to jump between
| declaration, definition, override, and call site is miles
| ahead of what Intellisense can do.
| HdS84 wrote:
| One major point for monorepos is the ability to eschew
| packages. In most languages creating, publishing and
| consuming a package is a lot of work, while in a monorepo you
| just add a reference to the code and are ready to go (except
| in react native...gaah that was pure horror). That's
| especially valuable if you need to refactor something and
| need to adjust its dependencies. Doing that via packages is
| slow and painful. Via project references it's much easier and
| has a tight feedback loop: change+build+fix instead of
| change+build+publish+consume+fix
| dmoy wrote:
| Yup, agree completely. That's a natural extension of the
| same thing that enables atomic commits - suddenly just
| having direct library dependencies instead of packages
| isn't that big of a problem if you push everything into the
| monorepo.
|
| > That's especially valuable if you need to refactor
| something and need to adjust it's dependencies.
|
| And yes exactly, being able to change a library and all of
| its callers at the same time is pretty handy.
| solarkraft wrote:
| > And an Android checkout and build is pushing 1TB these days
|
| I remember it "only" being somewhere around 200Gb.
| jcarrano wrote:
| The monorepo is where you end up when you have failed to
| enforce encapsulation and your "modules" do not have stable
| APIs (or are actually modular). Then, with sub-modules each
| change will often involve multiple commits to different
| modules, plus commits to update references, so O(N) commit
| multiplication.
| hgsgm wrote:
| Monorepo works whether your APIs are modular or not, and also
| allows changes to the modular structure.
| eigenvalue wrote:
| Since 70% of the files were xlf files used for
| translation/localization, couldn't they instead just store all of
| those in a single SQLite file and solve their problem much more
| easily? Any of the nuances of the directory structure could be
| captured in SQLite tables and relationships, and it would be easy
| to access them for edits by non-coders using a tool like DB
| Browser.
|
| I feel like often people make problems much harder than they need
| to be by imposing arbitrary constraints on themselves that could
| be avoided if they approached the problem differently.
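|
| A rough sketch with the sqlite3 CLI (the table layout is invented
| for illustration):
|
|   $ sqlite3 translations.db "CREATE TABLE IF NOT EXISTS strings (
|       locale TEXT NOT NULL, key TEXT NOT NULL, value TEXT NOT NULL,
|       PRIMARY KEY (locale, key));"
|   $ sqlite3 translations.db "INSERT OR REPLACE INTO strings
|       VALUES ('de-DE', 'editor.save', 'Speichern');"
|   $ sqlite3 translations.db "SELECT value FROM strings
|       WHERE locale='de-DE' AND key='editor.save';"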
| Cthulhu_ wrote:
| That just sounds like adding another problem though. A
| filesystem (and git) already is a database, and plain files can
| be read and managed more easily than a possibly corruptible
| binary file. Plus, you'd lose history, unless you add more
| complexity to add history.
|
| I mean I don't know if they ever needed history but, just
| saying. You get certain things for free by using a filesystem /
| git.
| melx wrote:
| You commit the sqlite dump file(?) to git and have the
| history...
|
| I dunno but there are folks who would put anything in git. I
| work with someone who manages to exceed the disk space of the
| company's GitLab instance by git adding everything. The disk
| is full again once a month.
| steffres wrote:
| Anyone know what's the advantage of this over a big composite
| repo with several git submodules?
|
| I think that submodules are better suited for separation of
| concerns and performance, even while achieving the same composite
| structure as an equivalent monorepo?
| aseipp wrote:
| The advantage is simple: Git submodules suck and are a chore to
| manage for any dependency that sees remotely high traffic or
| requires frequent synchronization. As the number of developers,
| submodules, and synchronization requirements increase, this
| pain increases dramatically. Basic git features, like cherry
| picking and bisecting to find errors become dramatically worse.
| You cannot even run `git checkout` without potentially
| introducing an error, because you might need to update the
| submodule! All your most basic commands become worse. I have
| worked on and helped maintain projects with 10+ submodules, and
| they were one of the most annoying, constantly problematic pain
| points of the entire project, that every single developer
| screwed up repeatedly, whether they were established
| contributors or new ones. We had to finally give in and start
| using pre-push hooks to ban people from touching submodules
| without specific commit message patterns. And every single time
| we eliminated a submodule -- mostly by merging them and their
| history into the base project, where they belonged anyway --
| people were happier, development speed increased, and people
| made fewer errors.
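|
| (A simplified sketch of what such a pre-push guard can look like;
| the commit-message marker here is invented:)
|
|   #!/bin/sh
|   # .git/hooks/pre-push: reject pushed commits that change a submodule
|   # pointer (gitlink, mode 160000) unless marked [submodule-bump].
|   while read local_ref local_sha remote_ref remote_sha; do
|     for c in $(git rev-list "$remote_sha..$local_sha" 2>/dev/null); do
|       if git diff-tree --no-commit-id -r "$c" | grep -q '160000'; then
|         git log -1 --format=%B "$c" | grep -q '\[submodule-bump\]' || {
|           echo "commit $c touches a submodule without [submodule-bump]" >&2
|           exit 1
|         }
|       fi
|     done
|   done
|   exit 0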
|
| The reasons for those things being separate projects had a
| history (dating to a time before Git was popular, even) and can
| be explained, but ultimately it doesn't matter; by the time I
| was around, all of those reasons ceased to exist or were simply
| not important.
|
| I will personally never, ever, ever, ever allow Git submodules
| in any project I manage unless they are both A) extremely low
| traffic, so updating them constantly doesn't suck, and B) a
| completely external dependency that is mostly outside of my
| control and cannot be managed any other way.
|
| Save yourself the hair-pulling and the time, and at least use
| worktrees instead.
| tantalor wrote:
| A monorepo allows a single commit to update across components,
| e.g. an API change.
| steffres wrote:
| For each submodule affected by some change you would need an
| additional commit, yes. But those commits are bundled
| together in the commit of the parent repo where they act as
| one.
|
| So, atomicity of changes can be guaranteed, but you need to
| write a few more commits. However, this small increase in
| commits is far outweighed by the modularity, imo.
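|
| i.e. the workflow is roughly this (paths are placeholders):
|
|       # commit and push in the submodule first
|       cd libs/persistence
|       git commit -am "Add retry logic"
|       git push
|       # then record the new submodule revision in the parent repo
|       cd ../..
|       git add libs/persistence
|       git commit -m "Bump persistence submodule"
|       git push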
| marksomnian wrote:
| Is it? I'm slightly struggling to understand what benefit
| you gain from having the "parent" repo but also having
| individual submodules. Sure, working in each individual
| project's module makes cloning faster, until you need to
| work on a module that references another module (at which
| point you need to check out the parent repo or risk using
| the wrong version), and now every change you make needs two
| commits (one to the sub-repo, and one to the base to bump
| the submodule reference).
| steffres wrote:
| In our case, we have a codebase that involves two
| submodules: one for persistence and one for Python-based
| management of internal git repos. Both of these are
| standalone applications and can run on their own. They
| are then used in a parent repo which represents the
| overarching architecture, which calls into the
| submodules.
|
| The advantage of this is that devs can work on the
| individual modules without much knowledge of the
| overarching architecture, and without strong code ties
| into it.
|
| Right now our persistence is done with SQL, but we could
| swap it with anything else, e.g. mongo, and the parent
| codebase wouldn't notice a thing since the submodule only
| returns well-defined Python objects.
|
| Of course, this comes at the cost of a higher number of
| commits, as you mentioned. But in my opinion those commits
| are still cheap, because they add only trivial quantity,
| not brain-demanding quality.
| marksomnian wrote:
| But what do you do as soon as one of the submodules has a
| dependency on another? I imagine you might not hit it in
| your simple case, but I feel like scenarios like that are
| where the advantages of monorepos lie.
|
| To take a concrete example, I'm working on a codebase
| that houses both a Node.js server-side application and an
| Electron app that communicates with it (using tRPC [0]).
| The Electron app can directly import the API router types
| from the Node app, thus gaining full type safety, and
| whenever the backend API is changed the Electron app can
| be updated at the same time (or type checks in CI will
| fail).
|
| If this weren't in a monorepo, you would need to first
| update the Node app, then pick up those changes in the
| Electron app. This becomes risky in the presence of
| automated deployment: if the Node app's changes
| accidentally introduce a breaking API change, the
| Electron app is broken until the changes are picked
| up. In a monorepo you'd spot this scenario right away.
| (Mind you, there is still the issue of updating the built
| Electron app on the users' machines, but the point
| remains - you can easily imagine a JS SPA or some other
| downstream dependency in its place.)
|
| [0]: https://trpc.io/
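|
| The "type checks in CI will fail" part is just an ordinary
| compile of the downstream package; something like this, with
| hypothetical paths:
|
|       # CI step in the monorepo: type-check the Electron app
|       # against the Node app's current router types. A breaking
|       # change on the server fails the build here instead of
|       # after deployment.
|       npx tsc -p apps/electron/tsconfig.json --noEmit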
| steffres wrote:
| Yes, if one submodule depended on another, that would
| indeed cause problems.
|
| So far we've been able to avoid that, though, by strict
| encapsulation.
|
| But I definitely see the point in your example and
| probably wouldn't go with submodules there either.
|
| It's just that I'm quite sceptical about the OP's link, as
| the monorepo approach there requires quite a lot of heavy
| tweaking.
| crabbone wrote:
| I missed the git push --recurse-submodules flag, even
| though it seems like it's been there for a long time. Yeah,
| it seems like it would work, except you need to configure
| it to always be "check" so it's always on when you push.
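|
| i.e. something like:
|
|       # refuse to push if a commit referenced by a submodule
|       # pointer hasn't been pushed to the submodule's remote yet
|       git config --global push.recurseSubmodules check
|       # or per invocation:
|       git push --recurse-submodules=check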
| tantalor wrote:
| > this small increase in commits is far outweighed by the
| modularity
|
| Not remotely: as the scale of the codebase increases, the
| benefit of modularity goes to zero and the benefit of
| atomic changes increases.
|
| Also: it's not always feasible to break up a change into
| smaller commits. Sometimes an atomic change is the only way to
| do it.
| crabbone wrote:
| With --recurse-submodules the atomicity doesn't seem to
| suffer. It used to be the case that you couldn't ensure
| all changes in the source tree would be pushed
| atomically; now you can, but I'm not sure it's the
| default behavior.
| Smaug123 wrote:
| Bold move to enable the "ours" merge strategy by default! I
| presume this is a typo for the "-Xours" merge _option_ to `ort`
| or `recursive`, but that still seems pretty brave.
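|
| For anyone skimming, the difference is roughly:
|
|       # merge *strategy* "ours": keep our tree wholesale,
|       # silently discarding everything from the other branch
|       git merge -s ours other-branch
|       # merge *option* "ours" for ort/recursive: merge normally,
|       # but resolve conflicting hunks in our favor
|       git merge -X ours other-branch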
| [deleted]
| Waterluvian wrote:
| A very useful flag. But as a default? That is scary...
| avidiax wrote:
| It is still only "ours" per hunk. But yes, it could
| obliterate changes. On the other hand, the default merge
| strategy is a huge waste of developer time. There is rarely a
| genuine conflict. It's usually just that we want to keep both
| sides.
| Smaug123 wrote:
| I've found `ort` (which I believe is now the default) to be
| better than `recursive` by leaps and bounds. Have you tried
| `ort`?
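|
| Both can still be requested explicitly if you want to compare
| them on a gnarly merge:
|
|       git merge -s ort other-branch
|       git merge -s recursive other-branch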
| rsp1984 wrote:
| I am not sure what I'm looking at here. Surely those half million
| files are for dozens if not hundreds of different apps, libraries
| and tools and surely those do not all depend on each other, no?
|
| Because if so, why not just use one repo per app/library/tool?
| Sure, if you have a cluster of things that all depend on each
| other, or a cluster of things that typically is needed in bulk,
| by all means, put those in a single repo.
|
| But putting literally _all_ your code in a single repo is not a
| very sane technical choice, is it?
| compiler-guy wrote:
| Google runs a single monorepo for 95% of its projects across
| the company. Google isn't perfect, but it's hard to argue that
| it isn't technically very good.
|
| One of the biggest advantages is that there is no version
| chasing or dependency questions. At commit X, everything works
| consistently. No debating about whether this or that dependency
| is out of sync.
| nottorp wrote:
| > Our engineers generally work in small teams and interact
| with an even smaller subset of the monorepo.
|
| That's what the article says. But if Google does it, it must
| be good.
| jayd16 wrote:
| Depends on the test tooling. If you want a single change to
| pass integration tests across components, then those components
| need to be in a single commit. Otherwise you're tracking every
| version of every tool.
|
| But I like to look at the problem from another perspective. Why
| _not_ use a single repo? The only real reason would be to work
| around technical challenges with your source control of choice,
| not because having everything tracked together is inherently
| bad.
___________________________________________________________________
(page generated 2023-08-28 23:01 UTC)