[HN Gopher] We put half a million files in one Git repository (2...
       ___________________________________________________________________
        
       We put half a million files in one Git repository (2022)
        
       Author : kisamoto
       Score  : 122 points
       Date   : 2023-08-28 09:46 UTC (13 hours ago)
        
 (HTM) web link (www.canva.dev)
 (TXT) w3m dump (www.canva.dev)
        
       | bob1029 wrote:
       | Our monorepo is at ~500 megs right now. This is 7 years worth of
       | changes. No signs of distress anywhere, other than a periodic git
       | gc operation that now takes long enough to barely notice.
       | 
       | I can't imagine using anything else for my current project. In
       | fact, the only domain within which I would even consider
       | something different would be game development. Even then, only if
       | the total asset set is ever expected to exceed a gigabyte or so.
       | Git is awful with large blobs. LFS is an option, but I've always
       | felt like it was a bandaid and not a fundamental solve.
        
       | Alacart wrote:
       | Ah yes, I too have accidentally committed node_modules.
       | 
       | Jokes aside, and coming from a place of ignorance, it's
       | interesting to me that a file count that size is still a real
        | performance issue for git. I'd have expected something so
        | ubiquitous and core to most of the software world to have seen
        | improvements there.
       | 
        | Genuine, non-snarky question: Are there some fundamental aspects
       | of git that would make it either very difficult to improve that,
       | or that would sacrifice some important benefits if they were
       | made? Or is this a case of it being a large effort and no one has
       | particularly cared enough yet to take it on?
        
         | 1MachineElf wrote:
         | Other users have made good comments about performance
         | limitations on the underlying filesystems themselves. Adding to
         | this, I recently encountered the findlargedir tool, which aims
         | to detect potentially problematic directories such as this:
         | https://github.com/dkorunic/findlargedir/
         | 
         | >Findlargedir is a tool specifically written to help quickly
          | identify "black hole" directories on any filesystem having
         | more than 100k entries in a single flat structure. When a
         | directory has many entries (directories or files), getting
         | directory listing gets slower and slower, impacting performance
         | of all processes attempting to get a directory listing (for
         | instance to delete some files and/or to find some specific
         | files). Processes reading large directory inodes get frozen
         | while doing so and end up in the uninterruptible sleep ("D"
         | state) for longer and longer periods of time. Depending on the
         | filesystem, this might start to become visible with 100k
         | entries and starts being a very noticeable performance impact
         | with 1M+ entries.
         | 
         | >Such directories mostly cannot shrink back even if content
         | gets cleaned up due to the fact that most Linux and Un*x
         | filesystems do not support directory inode shrinking (for
         | instance very common ext3/ext4). This often happens with
         | forgotten Web sessions directory (PHP sessions folder where GC
         | interval was configured to several days), various cache folders
         | (CMS compiled templates and caches), POSIX filesystem emulating
         | object storage, etc.
        
         | kudokatz wrote:
         | > Are there some fundamental aspects of git that would make it
         | either very difficult to improve that, or that would sacrifice
         | some important benefits if they were made?
         | 
         | I can't speak to _improving_ git, but I think some light on
          | this area can be shed by Linus' tech talk at Google in 2007.
          | 
          | 1. Linus says there's a specific focus on full history and
          | content, _not_ files ... so it's a deliberate, different axis
         | of focus than file count:
         | 
         | https://youtu.be/4XpnKHJAok8?t=2586
         | 
         | ... AND it's a specific pitfall to avoid when using Git:
         | 
         | https://youtu.be/4XpnKHJAok8?t=4047
         | 
         | 2. As Linus tells it, Git appears to be designed specifically
         | for project maintenance while not getting in the way of
         | individual commits and collaboration. But the global history
         | and more expensive operations on things like "who touched this
         | line" are deliberate so lines of a function are tracked _across
         | all moves_ of the content itself.
         | 
         | Maintainer tool enablement: https://youtu.be/4XpnKHJAok8?t=3815
         | 
         | Content tracking slower than file-based "who touched this":
         | https://youtu.be/4XpnKHJAok8?t=4071
         | 
         | ===
         | 
         | I have no answer, but ...
         | 
         | Practically, I've used lazy filesystems both for Windows-on-Git
         | via GVFS [1][2] and Google's monorepo jacked into a mercurial
         | client (I think that's what it is?). Both companies have made
         | this work, but as Linus says, a lot of the stuff just doesn't
         | work well with either system.
         | 
         | Windows-on-Git still takes a lot of time overall, and stacking
         | > 10 patches of an exploratory refactor with the monorepo on hg
         | starts slowing WAY WAY down to the point where any source
         | control operations just get in the way.
         | 
         | [1] https://devblogs.microsoft.com/devops/announcing-gvfs-git-
         | vi...
         | 
         | [2] https://github.com/microsoft/VFSForGit
        
         | klodolph wrote:
         | > Are there some fundamental aspects of git that would make it
         | either very difficult to improve that, or that would sacrifice
         | some important benefits if they were made?
         | 
         | It's hard to look at a million files on disk and figure out
         | which ones have changed. Git, by default, examines the
         | filesystem metadata. It takes a long time to examine the
         | metadata for a million files.
         | 
         | The main alternative approaches are:
         | 
         | - Locking: Git makes all the files read-only, so you have to
         | unlock them first before editing. This way, you only have to
         | look at the unlocked files.
         | 
         | - Watching: Keep a process running in the background and listen
         | to notifications that the files have changed.
         | 
         | - Virtual filesystem: Present a virtual filesystem to the user,
         | so all file modifications go through some kind of Git daemon
         | running in the background.
         | 
         | All three approaches have been used by various version control
         | systems. They're not _easy_ approaches by any means, and they
         | all have major impacts on the way you have to set up your Git
         | repository.
         | 
         | People also want e.g. sparse checkouts, when you're working
         | with such large repos.
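          | 
          | For the sparse checkout case, a rough sketch with stock Git
          | (the repo URL, branch, and paths here are made up):
          | 
          | $ git clone --filter=blob:none --no-checkout \
          |       https://example.com/big-monorepo.git
          | $ cd big-monorepo
          | $ git sparse-checkout init --cone
          | $ git sparse-checkout set services/search web/editor
          | $ git checkout main
          | 
          | Only the listed directories (plus top-level files) get
          | materialized, so 'git status' has far fewer files to stat.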
        
           | eviks wrote:
           | What about asking the OS for the list of changes like
           | Everything on Windows does, instantly, for millions, at a RAM
           | cost of a ~1-2 browser tabs (though that might be limited to
           | NTFS, but still)?
        
             | wintogreen74 wrote:
              | This is only fast because it's not querying on demand,
              | which is what the article indicates they're essentially
              | (now) doing.
        
           | HALtheWise wrote:
           | It's notable that git does support "watching", but it
           | requires some setup on Linux to install and integrate with
           | Watchman. On Windows and Mac, core.fsmonitor has been built
           | in since version 2.37.
           | 
           | https://www.infoq.com/news/2022/06/git-2-37-released/
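            | 
            | Roughly, the setup looks like this (paths follow the sample
            | hook Git ships; adjust for your install):
            | 
            | $ git config core.fsmonitor true        # builtin daemon, 2.37+
            | $ git config core.untrackedcache true
            | 
            | or, with Watchman on Linux:
            | 
            | $ cp .git/hooks/fsmonitor-watchman.sample \
            |      .git/hooks/query-watchman
            | $ git config core.fsmonitor .git/hooks/query-watchman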
        
           | robotresearcher wrote:
           | Has anyone made a system like option 3 that successfully
           | merges git with a filesystem? It could present both git and
           | fs interfaces, but share events internally. I'd be interested
           | to see how that would work.
        
             | LordShredda wrote:
              | That would put you at the mercy of git being a decent
              | filesystem driver.
        
           | 10000truths wrote:
           | Are there any solutions that use libgit2's ability to define
           | a custom ODB backend? There are even example backends already
           | written [1] that use RDBMSs as the underlying data store.
           | 
           | [1] https://github.com/libgit2/libgit2-backends
        
             | klodolph wrote:
             | There are repos with many files and there are repos with
             | lots of history data. Those are problems with different
             | solutions--adding millions of files to the repo will make
             | 'git status' take ages, but it won't necessarily put the
             | same level of pressure on the object database.
             | 
             | There are various versions of Git that use alternative
             | object storage, like Microsoft's VFS, if I remember
             | correctly.
        
         | eigenvalue wrote:
         | In my experience, the standard linux file system can get very
         | slow even on super powerful machines when you have too many
         | files in a directory. I recently generated ~550,000 files in a
         | directory on a 64-core machine with 256gb of RAM and an SSD,
         | and it took around 10 seconds to do `ls` on it. So that could
         | be a part of it too.
        
           | tp34 wrote:
           | What is the "standard linux file system"?
           | 
           | ext4 on an old system, feeble in comparison to yours,
           | performs much better.
           | 
           | ext4, 8GB memory, 2 core Intel i7-4600U 2.1GHz, Toshiba
           | THNSNJ25 SSD:
           | 
            | $ time ls -U | wc -l
            | 555557
            | real    0m0.275s
            | user    0m0.022s
            | sys     0m0.258s
            | 
            | stat(2) slows it down, but still this is not as poor as your
            | results:
            | 
            | $ time ls -lU | wc -l
            | 555557
            | real    0m2.514s
            | user    0m1.126s
            | sys     0m1.407s
            | 
            | Sorting is not prohibitively expensive:
            | 
            | $ time ls | wc -l
            | 555556
            | real    0m1.438s
            | user    0m1.249s
            | sys     0m0.193s
            | 
            | Drop caches, sort, and stat:
            | 
            | # echo 3 > /proc/sys/vm/drop_caches
            | 
            | $ time ls -lU | wc -l
            | 555557
            | real    0m6.431s
            | user    0m1.249s
            | sys     0m4.324s
        
         | bityard wrote:
         | IME, on basically all filesystems, just walking a directory
         | tree of lots of files is expensive. Half a million files on
         | modern systems should not be a terribly huge issue but once you
         | get into the millions, just figuring out how to back them all
         | up correctly and in a reasonable time frame starts to become a
         | major admin headache.
         | 
         | Since git is essentially a filesystem with extensive version
         | control features, it doesn't surprise me that it would have
          | problems handling large numbers of files.
        
           | thrashh wrote:
           | I mean you can design a filesystem to handle a million files
           | extremely quickly... it just has to be in the requirements up
           | front.
           | 
           | But there will be some trade-off.
           | 
           | And I don't think people generally put "a million files" in
           | the requirements because it's fairly rare.
        
             | saltcured wrote:
             | Not related to git (I hope), but a lot of scientific
             | data/imaging folks seem to think file abstractions are
             | free. I've seen more than one stack explode a _single_
             | microscope image into 100k files, so you'd hit 1M after
             | trying to store just 10 microscope slides. Then, a
             | realistic archive with thousands of images can hit a
             | billion files before you know it.
             | 
             | It's hard to get people past the demo phase "works for me"
             | when they have played with one image, to realize they
             | really need a reasonable container format to play nice with
             | the systems world outside their one task.
        
         | Frannyies wrote:
          | Funny how the view is so different.
         | 
         | I always marvel at it and think: "wow so git goes through its
         | history, pulls out many small files and chunks and patches,
         | updates the whole file tree and all of this after hitting enter
         | and being done like immediately."
        
       | Borg3 wrote:
       | Hmm, I've read this one: "These .xlf files are generated and
       | contain translated strings for each locale."
       | 
        | So why store them under VCS in the first place? I think they're
        | doing it wrong.
        
       | ufjfjjfjfj wrote:
        | I can't be the only one thinking this is a small number of
        | files, unless you keep them all in the same directory.
        
       | baz00 wrote:
       | Probably learned how enterprise software developers suffer.
        
       | mrAssHat wrote:
       | The site is not opening. Thanks, CloudFlare.
        
       | psydvl wrote:
        | There is a VFS for Git from Microsoft that, I think, can solve
        | the problem in a more elegant way:
        | https://github.com/microsoft/scalar
        
         | MikusR wrote:
          | That was discontinued (multiple times, under different names)
          | and has been moved into a Git fork.
         | https://github.com/microsoft/git
        
           | zellyn wrote:
           | Are they still trying to upstream everything? For a while
           | they were being good about that...
        
           | ComputerGuru wrote:
           | Do you know what replaced it?
        
             | WorldMaker wrote:
             | git
             | 
             | They upstreamed almost everything. The last version of
             | "scalar" was mostly just a configuration tool for sparse
             | checkout "cones" which needed a bit of hand-holding, and
             | that is easier to configure in git itself now, or so I
             | hear.
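              | 
              | If I have it right, the stock-git equivalent of what
              | scalar used to configure is roughly (URL made up):
              | 
              | $ git clone --filter=blob:none https://example.com/big.git
              | $ cd big
              | $ git config feature.manyFiles true   # index v4, untracked cache
              | $ git maintenance start   # background prefetch/commit-graph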
        
             | MikusR wrote:
             | Their fork https://github.com/microsoft/git
        
       | Groxx wrote:
       | Don't bother with watchman, it has consistently been so flaky
       | that I simply live with the normal latency.
       | 
       | Thankfully, nowadays git has one built in for some OSes, and it's
       | much, MUCH better than watchman ever was.
        
       | fsckboy wrote:
        | This is one of those multipurpose PR articles (not all bad) to
       | generate awareness of the company, their product, use case, and
       | developers.
       | 
       | > _At Canva, we made the conscious decision to adopt the monorepo
       | pattern with its benefits and drawbacks. Since the first commit
       | in 2012, the repository has rapidly grown alongside the product
       | in both size and traffic_
       | 
        | While reading it I was having trouble keeping track of where I
        | was in the recursion; it's sort of "Xzibit A" for "yo dawg, we
       | know you use source repositories, so check out our source
       | repository (we keep it in our source repository) while you check
       | out your source repository!"
        
       | time4tea wrote:
       | We learned they were 70% autogenerated so probably shouldn't have
        | been in git at all, but our build process relied on that, and we
        | didn't want to fix it, so we bodged it.
        
         | [deleted]
        
         | Cthulhu_ wrote:
         | I'm on the fence with this one. My previous project was Go &
         | Typescript with a range of generated files; I committed the
         | generated files, so that they would flag up in code reviews if
         | they were changed, avoiding hidden or magic changes. I also
         | didn't automatically regenerate, avoiding churn.
         | 
         | That said, if the autogenerated output is stable, it's fine.
         | After all, in a sense, compiling your code is also a kind of
         | autogenerating and few people will advocate for keeping
         | compiled code in git.
        
         | maccard wrote:
         | > probably shouldn't have been in git at all
         | 
         | Something being autogenerated, or binary, doesn't mean it
         | shouldn't be in version control. If step one of your
         | instructions to build something from version control involve
         | downloading a specific version of something else, then your VCS
          | isn't doing its job, and you're likely skirting around it to
         | avoid limitations in the tool itself. People still use tools
         | like P4 because they want versioned binary content that belongs
         | in version control, or because they want to handle half a
         | million files, and git chokes.
         | 
         | In my last org, we vendored our entire toolchain, including
         | SDKs. The project setup instructions were:
         | 
          | - Install p4
          | 
          | - Sync, get coffee
          | 
          | - Run build, get more coffee.
         | 
         | A disruptive thing like a compiler upgrade just works out of
         | the box in this scenario.
         | 
         | It's a shame that the mantra of "do one thing well" devolves
         | into "only support a few hundred text files on linux" with git.
        
           | PMunch wrote:
           | Wouldn't Git LFS be the tool for this job? Have the automated
           | tool build a .zip file for example of the translations
           | (possibly with compression level set to 0), then have your
           | build toolchain unzip the archive before it runs. Then check
           | that big .zip file into GitLFS, et voila you now have this
           | large file versioned in Git.
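            | 
            | Something like this (the archive name is just an example):
            | 
            | $ git lfs install
            | $ git lfs track "translations.zip"
            | $ git add .gitattributes translations.zip
            | $ git commit -m "store generated translations as one LFS object"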
        
             | folmar wrote:
             | It's good enough for the small usecases, but way behind
             | tools that have first class support for binary files
             | (binary deltas, common compression, ...). Even SVN shines
             | here.
        
             | maccard wrote:
             | Git LFS isn't the same as git, though. It's better than
             | putting everything in a separate store, but for one it
             | disables offline work, and breaks the concept of D in the
             | DVCS of git.
             | 
             | > then have your build toolchain unzip the archive before
             | it runs
             | 
             | My build toolchain shouldn't have to work around the
             | shortcomings of my environment, IMO.
             | 
             | > et voila you now have this large file versioned in Git.
             | 
             | No, it's on a separate http server that is fetched via git
             | lfs. Subtle, but important difference.
        
               | aidenn0 wrote:
               | > it disables offline work,
               | 
               | This is a non-issue for images and autogenerated files,
               | since you shouldn't ever be doing a merge on them.
               | 
               | > breaks the concept of D in the DVCS of git.
               | 
               | git-annex is distributed and works well for files that
               | will never be merged (such as images, or autogenerated
               | files)
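                | 
                | e.g., roughly (paths made up):
                | 
                | $ git annex init
                | $ git annex add assets/generated/
                | $ git commit -m "track generated assets via annex"
                | $ git annex get assets/generated/big.psd   # on demand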
        
           | avidiax wrote:
           | > Something being autogenerated, or binary, doesn't mean it
           | shouldn't be in version control.
           | 
           | I think the SHA should be in version control. The file should
           | be reproducibly built [1], then cached on a central server.
           | 
           | This means that a build target like a system image could be
           | satisfied by downloading the complete image and no
           | intermediate files. And a change to one file in one binary
           | will result in only a small number of intermediate files
           | being downloaded or reproducibly built to chain up to the new
           | system image.
           | 
           | This is something that's really lacking in, for example, Git.
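            | 
            | Very roughly, a committed-digest-plus-cache setup could look
            | like this (the file names and cache URL are made up):
            | 
            | $ sha256sum out/translations.tar | cut -d' ' -f1 \
            |       > translations.sha256
            | $ git add translations.sha256
            | 
            | and at build time, hit the cache first and rebuild only on a
            | miss:
            | 
            | $ sha=$(cat translations.sha256)
            | $ curl -fsS -o out/translations.tar \
            |       "https://cache.example.com/cas/$sha" \
            |   || make out/translations.tar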
           | 
           | [1] https://en.wikipedia.org/wiki/Reproducible_builds
        
             | jtsiskin wrote:
             | https://git-lfs.com/
        
             | maccard wrote:
             | > I think the SHA should be in version control. The file
             | should be reproducibly built [1], then cached on a central
             | server.
             | 
             | Requiring reproducible builds to handle translations or
             | images is a bit much. Also, if it's cached on a central
             | server, that now means you need to be connected to that
             | central server. If you require a connection to said central
             | server, why not just have your source code on said server
             | in the first place, a la p4?
             | 
             | I do agree that NixOS is a great idea, but personally 99%
             | of my problems would be solved if git scaled properly.
        
               | avidiax wrote:
               | You can always build from source in this scenario. The
               | cache server lets you skip two things. First, you can
               | prune the leaves of the tree of intermediate files you
               | might need. Second, where you do need to
               | compile/build/link/package, etc., you can do only those
               | steps that are altered by your changes. So you save CPU
               | time and storage space.
               | 
               | > why not just have your source code on said server in
               | the first place, a la p4?
               | 
               | That would be great. A version of git where cloning is
               | almost a no-op, and building is downloading the package
               | assuming you haven't changed anything.
               | 
                | I'm not aware of p4 allowing this. My recollection of
               | perforce is that I still had most source files locally.
        
           | tomjakubowski wrote:
           | Does perforce have features which make vendoring easier? Just
           | curious why I see P4 called out here and in the replies too.
        
             | tom_ wrote:
             | It just does a pretty good job of dealing with binary files
             | in general. The check in/check out model is perfect for
             | unmergeable files; you can purge old revisions; all the
             | metadata is server side, so you only pay for the files you
             | get; partial gets are well supported. And so, if you're
             | going to maintain a set of tools that everybody is going to
             | use to build your project, the Perforce depot is the
             | obvious place to put them. Your project's source code is
             | already there!
             | 
             | (There are various good reasons why you might not! But
             | "because binary files shouldn't go in version control" is
             | not one of them)
        
           | Karellen wrote:
           | > In my last org, we vendored our entire toolchain,
           | 
           | You vendored all your compilers/language runtimes in the
           | source control repo of each project? Including, like, gcc or
           | clang? WTF?
           | 
           | > It's a shame that the mantra of "do one thing well"
           | devolves into "only support a few hundred text files on
           | linux" with git.
           | 
           | Because the Linux kernel source tree and its history can
           | accurately be described as "a few hundred text files".
           | 
           | Yeah, right.
        
             | 0xcoffee wrote:
              | It's not that unusual; we vendor entire VM images which
              | contain the development environment (the codebase existed
              | since before Docker). And it works well: need to fix
              | something in a project that was last updated 20 years ago?
              | Just boot up the VM and you are ready.
        
               | xorcist wrote:
               | I don't think that was the question but rather why commit
               | to git?
               | 
               | Having local commits intermingled with an upstream code
               | base can make for really hairy upgrades, but I guess
               | every situation is slightly different.
        
               | maccard wrote:
               | > but rather why commit to git?
               | 
               | Well we don't put them in git, we put them in perforce
               | because git keels over if you try and stuff 10GB of
               | binaries into it once every few months.
               | 
               | I think the real question is the other way around though,
               | why _not_ use git for versioning when that's what it's
                | supposed to be for? Why do I have to version some things
               | with git, and others with npm/go
               | build/pip/vcpkg/cargo/whatever?
        
             | tom_ wrote:
             | I've worked on a couple of game projects that did this.
             | Build on Windows PC, build for Windows/Switch/Xbox One/Xbox
             | Serieses/PS4/PS5/Linux. I was never responsible for setting
              | this up, and that side of things did sound a bit annoying,
             | but it seemed to work well enough once up and running. No
             | need to worry about which precise version of Visual Studio
             | 2019 you have, or whether you've got the exact same minor
             | revision of the SDK as everybody else. You always build
             | with exactly the right toolchain and SDK for each target
             | platform.
        
             | maccard wrote:
             | > You vendored all your compilers/language runtimes in the
             | source control repo of each project? Including, like, gcc
             | or clang? WTF?
             | 
              | Yep. Along with platform SDKs, third party dependencies,
             | precompiled binaries, non-redistributable runtimes, you
             | name it.
             | 
             | Giant PSD or FBX files? 4K Textures? all of it.
             | 
             | Client mappings are the bread and butter of P4 (or Stream
             | views more recently which are not as nice to work with) -
             | you say "I don't want the path containing MacOS" if you
             | don't want it.
             | 
             | > Because the Linux kernel source tree and its history can
             | accurately be described as "a few hundred text files".
             | 
             | I was off by a little bit, it's ~60k. But it's still "only"
             | 60k text files, no matter how important those text files
             | are.
        
           | thechao wrote:
            | This is _precisely_ why every ASIC (HW) company I'm familiar
            | with uses P4. ASIC design flows rely _critically_ on 3rd
            | party tooling that must be version/release specific. You
           | can't rely on those objects being available whenever. They
           | get squirreled away and kept, forever.
        
         | IshKebab wrote:
         | It's not an unbreakable rule that generated or binary files
         | should not be in Git. It's a rough guideline. Partly because
         | Git is bad at dealing with binary files.
         | 
         | There are plenty of cases when including generated files is
         | appropriate. It has many advantages over not doing that -
         | probably the biggest are
         | 
         | * Code review is much easier because you can see the effect on
         | the output.
         | 
         | * It's easier to find the generated files because they're next
         | to the rest of your code. IDEs like it much more too.
         | 
         | In fact the upsides are so great and the downsides so minimal I
         | would say it should be the default option as long as:
         | 
         | * The generated files are not huge.
         | 
         | * The generated files are always the same.
         | 
         | Even when they are huge it might still be a good idea, but you
         | can put the files in a submodule or LFS. I do that for a
         | project that has a really difficult to install generator so
         | users don't need to install it.
        
       | issafram wrote:
       | Since when was a "monorepo" ever considered a good idea?
        
         | Cthulhu_ wrote:
         | A couple of years now, but whether it's a good idea depends on
         | your use case and organization. Seems to work for some. It
         | works for my current assignment too - two and possibly more
         | React Native that reuse a lot of components, translations, have
         | the same APIs, etc.
        
         | yashap wrote:
         | They have some nice advantages:
         | 
         | - Makes it easy to develop applications and libraries together
         | in a single branch
         | 
         | - Similarly, makes it easy to make a breaking change to a
         | library, then change all clients of said library, in a single
         | branch
         | 
         | - And because of the above, makes it easy to keep all
         | dependencies on internal libs at the latest version, which can
         | greatly reduce all sorts of "dependency hell" issues
         | 
         | - Generally makes integration testing a bit easier
         | 
         | The downside is you have to invest a lot more time in tooling,
         | keeping both local and CI builds fast. And even with that
         | tooling, builds won't be as fast as they trivially are with
         | multi-repo. But if you do invest that time in tooling, you can
         | generally get them fast enough, and then reap the other
         | benefits for a very productive dev experience.
         | 
         | Have done both monorepo and multi-repo at different, decent
         | sized companies. Both have their pros/cons.
        
         | [deleted]
        
       | jiggawatts wrote:
       | Something I learned about writing robust code is that scalability
       | needs to be tested up-front. Test with 0, 1, and _many_ where the
       | latter is tens of millions, not just ten.
       | 
       | I've seen production databases that had 40,000 tables for _valid_
       | reasons.
       | 
       | I've personally deployed an app that needed 80,000 security
       | groups in a single LDAP domain, just for it. I can't remember
        | what the total number of groups across everything was, but
       | it was a decent chunk of a million.
       | 
       | Making something like Git, or a file system, or a package
       | manager? Test what happens with millions of objects! Try
       | _billions_ and see where your app breaks. Fix the issues even if
       | you never think anyone will trigger them.
       | 
       | It's not about scaling to some arbitrary number, it's about
        | _scaling_, period.
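        | 
        | A crude way to do that for the Git case is to just generate the
        | files and time the common operations, e.g. (throwaway sketch):
        | 
        | $ git init /tmp/stress && cd /tmp/stress
        | $ mkdir blob && ( cd blob && seq 1000000 | xargs touch )
        | $ time git add .
        | $ time git status
        | $ time git commit -qm "1M empty files"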
        
         | switch007 wrote:
         | At what cost?
        
         | zoomablemind wrote:
         | > ...scalability needs to be tested up-front.
         | 
         | I'd rephrase it: if expecting massive or longterm use - know
         | how that thing is built/designed.
         | 
         | Picking a technology based on general popularity or vendor's
         | marketing is not a way to solve _your_ problem.
        
         | Karellen wrote:
         | > Making something like Git, or a file system, or a package
         | manager? Test what happens with millions of objects!
         | 
         | Test with, say, one of largest open-source projects in
         | existence at the time? Like, for instance, the Linux kernel?
        
           | layer8 wrote:
           | When Git was first released, the Linux kernel sources had
           | less than 20,000 files. It currently has around 70,000 files.
           | It's not nothing, but it also isn't millions.
        
           | compiler-guy wrote:
           | The kernel is big, but it isn't _that_ big in the grand
           | scheme of things. The project from the original article here
           | is bigger, and many companies have projects bigger than that.
        
         | vincnetas wrote:
          | Could you elaborate on the 40,000-table DB? I want to learn
          | what could be valid reasons for that.
        
           | horse_dung wrote:
            | Database audit tool, and they needed to test what happens
            | when an excessive number of tables is hit??? :)
        
           | jasonjayr wrote:
           | Not the OP, but in our case, 300 tables x 300 customers
           | (different 'schemas/dbs') == single mysql instance with
           | 90000+ tables.
        
             | stef25 wrote:
             | Never understood why each customer would have their own DB.
             | Must be a nightmare to maintain.
        
               | icedchai wrote:
               | I worked on a system like this. Rolling out migrations
               | would take hours.
        
               | asguy wrote:
               | Did you not parallelize your migrations?
        
               | icedchai wrote:
               | We did, but there was only so much migration load we
               | wanted to place per DB server. Some DBs had 100's of
               | customers.
        
               | wernercd wrote:
               | Security. You have access to stef25_ tables and I don't.
               | 
                | The alternative would be that we both have access to the
                | same tables, with a permission layer to grant access per
                | row.
                | 
                | Both choices have trade-offs, but if the company makes a
               | mistake and I now have access to your rows? Seems easier
               | to control access at the table layer rather than the
               | column layer.
        
               | indeed30 wrote:
               | It makes enterprise sales easier, since it removes a
               | common objection from security, privacy and compliance
               | teams.
        
               | Cthulhu_ wrote:
               | Wouldn't having separate databases (with separate users
               | (per organization)) make more sense from a security point
               | of view? I have no knowledge of these things, I've never
               | actually worked with more than one database in a mysql
               | instance.
               | 
               | edit: I tell a lie, I separated the forums and wordpress
               | databases on a website I run.
        
               | vincnetas wrote:
               | One DB server can have multiple DB's. In this case we are
                | talking about a single DB (not server) containing many
                | thousands of tables. And I'm curious what the use case
                | for such designs is.
        
             | vincnetas wrote:
              | Sure, multi-tenant DBs, but let's limit this to per-project
              | DBs. 300 tables is quite reasonable, but many thousands?
        
               | wiredfool wrote:
               | I've got a db that hosts postgresql versions of CSVs/XLSs
               | that are uploaded/harvested to an open data portal (as
               | part of the portal). There are ~10k of them in there
               | (+-), and could certainly see more (O(5k)) if some of the
               | CSVs were parsed better.
        
           | phyrex wrote:
           | Represent every entity (like "person" or "post") in a large
           | graph using a relational database. You can get to 40k rather
           | quickly
        
             | Timon3 wrote:
             | Do you have any example for a project with 40k entities?
             | I'd love to see how they handle the complexity.
        
             | vincnetas wrote:
              | Is this a valid reason to use an RDBMS like this?
        
               | phyrex wrote:
               | Why not? I don't think there are many graph databases
               | that are set up to handle multiple petabytes of data, so
               | RDBMs make a good storage layer at that scale
        
               | hgsgm wrote:
               | Why would you need a table for each person or post? Those
               | are rows.
        
               | phyrex wrote:
               | Each entity. Person would be a table, you and I would be
               | rows, correct.
        
           | lenkite wrote:
           | Saw ~60k tables in one famous ecommerce company backend
            | primarily due to sharding - spread across multiple DBs, of
            | course.
        
           | marcosdumay wrote:
           | Any ERP will bring you near that.
        
           | tambourine_man wrote:
           | I was surprised as well, I feel one may need a DB to sort
           | that DB. A meta DB :)
        
           | jiggawatts wrote:
           | SAP with a bunch of plugins plus custom tables added for
           | various purposes. This is for managing the finances of 200K
           | staff across 2,500 locations.
        
             | stef25 wrote:
             | Isn't that just bad design on the part of SAP ?
        
               | miroljub wrote:
                | Not necessarily.
               | 
               | Would it be "better" if they had one table with
               | json/xml/whatever and handled schema in code?
               | 
               | They made a trade-off they found right. When they hit the
               | limit with their approach, they even implemented their
               | own DB (S4/Hana) to support their system.
        
           | pravus wrote:
           | I worked at an educational institution where we ran an
           | academic-focused Enterprise Resource Planning (ERP) system
           | that was fairly large. Not quite 40k tables, but it had over
           | 4k. To give you an idea of how this was organized:
            | * Most simple things like a "Person" were multiple tables
            | because you had to include audits and historical changes
            | for each field
            | 
            | * A "Person" wasn't even all that useful because it included
            | guests or other fairly transient entities like vendor
            | contacts, so you had an explosion of more tables as you
            | classified roles into "Student", "Faculty", "Employee",
            | etc... (many with histories as above).
            | 
            | * Addresses and other non-core demographic information were
            | usually sharded into all sorts of categories like "primary",
            | "parent's", "last known good", "good for mailing", etc...
            | (more histories, etc...)
            | 
            | * All coded information like label types such as "STUDENT"
            | or "MAILING" were always handled as separate validation
            | tables with strict FK constraints and usually included extra
            | meta information like descriptions and usage notes within
            | parts of the system.
            | 
            | * Each functional sub-system (HR, Payroll, AR, AP, etc.) had
            | its own dedicated schema.
            | 
            | * All external jobs, processes, and external integrations
            | were configured separately.
            | 
            | * All enterprise integrations usually had a whole dedicated
            | schema for configuration.
            | 
            | * Most parts of the interactive web UI were database driven
            | (Oracle's Apache mod PL/SQL) with many templates and other
            | components stored in large collections of tables.
           | 
           | I'll stop there, but basically just imagine a very large
           | application that tries to be 100% database-driven. That's how
           | you get a lot of tables.
        
             | Spivak wrote:
             | And honestly, I kinda get it. Until you run into a case
              | where your volume is such that you physically _can't_ run
              | on the db, then run it on the db. I run all my job
              | processing off the DB and couldn't be happier. I have to
              | hit "can't run alongside the real data" and "can't run in
             | its own db" before I'll need to consider something else.
             | 
             | It probably feels weird for devs to drive the UI off the db
             | but it's just Wordpress by another name.
        
             | Cthulhu_ wrote:
             | I've worked with / on an application like that, it had all
             | form fields awkwardly configured in a database, plus a
             | complicated database migration script to add, remove and
             | update those fields.
             | 
             | When I rewrote the application I just hardcoded the form
             | fields, nobody should need to do a database migration to
             | change an otherwise mostly static form.
        
         | notTooFarGone wrote:
         | I mean that's how you get k8s for projects that in reality will
         | never need it. Now you have a developer that is only doing k8s.
         | Managing overhead and minimizing it is really something to keep
          | in mind. So your app can't handle 100,000 concurrent users? As
          | long as there is a plan for how you could enable that in case
          | of an emergency, there is really no incentive for all that
          | premature optimization, for 90% of companies imo.
        
         | horse_dung wrote:
         | I would agree with scale orders of magnitude higher than you
         | can possibly imagine. But once you know what your scaling
          | limits are (and there always are limits) and what the
          | (pre)failure behaviour looks like... well, you don't _have_ to
          | fix them...
        
         | miroljub wrote:
         | While this is true in some cases, more frequently I saw apps
         | designed and able to handle millions of users and billions of
          | transactions that ended up being used by tens of users and
         | hundreds of transactions.
         | 
         | All the effort spent on testing and optimizations for scaling
         | purpose was a waste of time and resources, that could be better
         | spent elsewhere.
         | 
          | I'm not saying one should not care, or code sloppily, but there
         | is a balance where code is just good enough for the purpose.
         | There's a lot of truth in this "don't do premature
         | optimization".
        
           | jandrewrogers wrote:
           | You still need to do enough to buy time if you do need more
           | scalability, since scalability tends to be architectural.
           | Waiting until you hit a wall in production is usually months
           | too late to start working on it. As a moving target, I often
           | try to test at 10x the current workload, which is usually
           | enough to deal with load spikes and surfaces scalability
           | issues early enough that customers don't see them.
        
           | thrashh wrote:
           | This is where experience comes in.
           | 
           | Someone experienced will know how much work a certain
           | approach will take and its capacity.
           | 
           | Sometimes there are quick wins to give like 100x capacity to
           | a system just by doing things slightly differently, but only
           | with experience will you know that.
        
           | pixl97 wrote:
            | The question is: when your app is on the growth curve, do you
            | start to test this?
           | 
           | I work in enterprise software and one of the big problems I
           | see is companies suck at software growth when it's obvious
           | the software is in the upward curve.
           | 
           | Large companies will throw huge amounts of data at your app
           | once you sell it to them.
        
           | heavenlyblue wrote:
            | Yeah, and these apps probably would never work with that many
            | users in practice because they missed a few things here
           | and there and the only way to fix them would have been to
           | have that amount of traffic in the first place.
        
           | jasfi wrote:
           | I think the GP was saying that it should scale without
           | breaking. It can get slow, fine, that's a different
           | challenge. But it shouldn't segfault (as an example).
        
             | wernercd wrote:
             | "GP was saying that it should scale without breaking" and
              | the responder was saying that making that a priority means
              | you're wasting time on something that probably won't be
              | needed.
              | 
              | The time you spend making it work for millions of users
              | who will never show up is time not spent delivering value
              | to the customers who do need it.
        
             | yebyen wrote:
             | The gist of it is this: many load tests don't even consider
             | the actual potential volume of traffic. But that's fine, if
             | you're using a load tester, you don't have to estimate the
             | traffic well - even to within an order of magnitude. You
             | can just add a couple extra zeroes, and see where it
             | breaks. Failing to do this simple thing will usually lead
             | to objectively worse software, and there's a chance that
             | some day you'll need to handle that much traffic.
             | 
             | But that chance isn't the sole reason why you're doing that
             | load test. The reason is to improve the software. You're
             | identifying defects by stressing the limits.
             | 
             | When you're doing a load test (or any test really) the
             | possible outcomes are basically three: (1) it works! (2) it
             | broke. (3) huh, that's interesting. If your tests are
             | always coming up (1) then you're not obtaining any benefit
             | from them. Don't you want to know where the limiting
             | factors are in your app? If you're able to remove those
             | limits, but not for production (at least not right now),
             | wouldn't it be great to know what will break next month (or
             | next year) when you do?
             | 
             | Think of the person who writes unit tests for every piece
             | of code, but not as TDD. There's a school of thought that
             | you should write the test first, then write the simplest
             | code that passes, and that's fine but not what I'm talking
             | about. Imagine a person who writes perfect code and perfect
              | tests. All the code works, every test passes confirming
              | that it worked. What is even the value of writing the test?
             | 
             | That's what load testing under only the expected conditions
             | is like. We already know the software works under those
             | conditions (likely) because it's already in production,
             | handling that amount of load. So while there is value in a
             | load test that runs prior to deployment, in order to check
             | that nothing of the change is likely to induce a break
             | under the expected/existing load, it's a different kind of
             | testing and produces different value than a stress test
             | that is designed to hopefully induce a failure and show you
             | where there is a defect in code. Where it segfaults, for
             | example.
             | 
             | And just because you've identified a limiting factor
             | outside the bounds of what expected activity is likely to
             | go through the system soon, doesn't mean you need to fix it
             | now. Having one less "known unknown" on the table is a
             | thing of value. Now that stress won't be able to surprise
             | you later, when that parameter has drifted into the danger
             | zone because of organic development, and now it's becoming
             | a thing in the way.
        
           | Cthulhu_ wrote:
           | It's a difficult one; if you don't know yet if you will have
           | billions of transactions, you should focus on clarity and
           | flexibility - that is, you should be able to rewrite and re-
           | architect your application and its runtime IF it turns out to
           | be successful.
           | 
           | A parent comment mentioned SQL databases for example; those
           | are great because they can scale both horizontally and
           | vertically these days, sometimes with the click of a button
           | in AWS.
           | 
           | Other good practices are things like stateless back-end
           | services so they can scale horizontally, thoroughly
           | documenting (and maintaining documentation) on business
           | processes handled by the software, monitoring, etc.
           | 
           | Disclaimer: I'm an armchair expert, I've never had to deal
           | with back-end scaling.
        
             | coffeebeqn wrote:
             | Something has gone horribly wrong if you don't know if the
              | requirements are 10 requests per second or a billion per
             | second.
             | 
             | We build some services from the ground up for very high
             | traffic and the hoops you have to jump through and the
             | tradeoffs just don't make sense for a basic CRUD thing
             | which can run on a boring ole machine and a little SQL
             | instance
        
           | wintogreen74 wrote:
            | Also, all projects, git or anything else, have limited
            | resources. I'd rather they're spent on the prioritized
            | features & needs than on exhaustive testing of edge cases.
        
           | klysm wrote:
           | I think you can write good code that intentionally omits
           | performance optimizations you know could be made, but don't
           | want to make right now because it trades off complexity for
           | performance. I usually leave myself a note of how to improve
           | it if it does in fact become the bottleneck or starts to hurt
           | latency
        
         | crabbone wrote:
         | People who make filesystems test this stuff and will be able to
         | tell you the ballpark figure for performance of this kind of
         | operation even w/o testing. Testing here isn't the problem...
         | 
         | The problem here is that we need a reasonably small interface
         | for filesystem to enable competing implementations, so, for
         | example, we don't have a filesystem interface for bulk metadata
         | operations, because this is an unusual request (most user-space
         | applications which consume filesystem services don't need it).
         | So, we can only query individual files for metadata changes
         | through "legal" means (i.e. through the documented interface).
         | And now you end up in a situation where instead of fetching all
         | the necessary information in a single query, the performance
         | impact of your query scales linearly with the number of items
         | queried.
         | 
         | Even if Git developers anticipated this performance bottleneck,
         | there's not much they can do w/o doing some other undesirable
         | stuff. Any solution created outside of the filesystem would
         | risk de-synchronization with the filesystem (i.e. something
         | that watches the state of the filesystem dies and needs to be
          | restarted, either losing old changes or changes done between
         | the restarts). Another solution could try going behind the
         | documented filesystem interface, and try to salvage this
         | information directly from the known filesystems... which would
         | be a lot of work compounded with the potential to screw up your
         | filesystem.
         | 
         | Maybe if we'd have Git integrated with the kernel and be able
         | to thus integrate better with at least the in-kernel
         | filesystems. But this would still put people on anything but
         | Linux at a disadvantage, and even on Linux, if you wanted some
         | filesystem that's not in the kernel, you'd also have the same
         | problem...
        
         | dahfizz wrote:
         | Scaling code isn't always as simple as rewriting your search
         | function to be faster.
         | 
         | What if scaling to millions of objects forces real tradeoffs
         | for the hundreds of objects case?
         | 
         | It feels like you're asking people to only create Postgres, but
         | SQLite has a perfectly valid use case as well.
         | 
         | In this case, git checking the access time of 500k files is
         | fundamentally slow. The only way around this is to change how
         | git tracks files, which all come with other usability
         | tradeoffs. Git itself supports a fsmonitor that makes handling
         | more files faster, but very few people use it because the
         | tradeoffs aren't worth it.
        
           | adamckay wrote:
           | > It feels like you're asking people to only create Postgres,
           | but SQLite has a perfectly valid use case as well.
           | 
           | I don't mean this to be a "well actually" comment, but
           | because I found it interesting when I learnt this a few weeks
           | ago - some limits for SQLite [1] are actually higher than the
           | limits for Postgres [2] (specifically the number of columns
           | in a table and the maximum size of a single field).
           | 
           | 1 - https://www.sqlite.org/limits.html
           | 
           | 2 - https://www.postgresql.org/docs/current/limits.html
        
             | lordgrenville wrote:
             | I've been working with some SQLite databases that are
             | >100GB lately, and wondering if this is a bad idea. The
             | theoretical max size is 140TB, but there's a big gap
             | between _can_ and _should_.
        
         | 5e92cb50239222b wrote:
         | Among mainline Linux filesystems, xfs started doing this first.
         | The test suite is still named xfstests, although many more
         | filesystems rely on it now. They regularly test xfs on enormous
         | filesystems which 99.9% of us will never see, both with
         | hundreds of billions of tiny files, and relatively small
         | numbers of very large ones, plus various mixes of the two.
         | Pushing it into edge cases like billions of files in one
         | directory without any nesting. I really like that strong
         | engineering culture and that's why I prefer xfs for most stuff.
        
           | hinkley wrote:
           | These sorts of exercises help with performance tuning in the
           | small.
           | 
           | One thing you should learn, and many don't, about perf
           | analysis is that you start getting serious artifacting in the
           | data for tiny functions that get called an awful lot. I've
           | found a lot of tangible improvements from removing 50% of the
           | calls to a function that the profiler claims takes barely any
           | time. Profilers lie. You have to know what they lie about.
           | 
           | When I'm trying to optimize leaf- or near-leaf-node functions
           | I've been known to wrap the call with a for loop that runs
           | the same operation 10, 100, 1000 times in a row, just so I
           | can see if some change has a barely-double-digit effect on
           | performance. These predictions usually hold up in production.
           | 
           | Just be very, very sure not to commit that for loop.
           | 
           | Or use representative data that is ridiculously large
           | compared to the average case.
        
           | silvestrov wrote:
           | > edge cases like billions of files
           | 
            | Sometimes edge cases can quickly surface bugs that only
            | happen rarely under normal circumstances and are
            | therefore difficult to reproduce/debug.
           | 
           | E.g. when programming in C for little endian computers it can
           | be a good idea to test code on big endian CPUs as the
            | difference in endianness can reveal "out of bounds" writes for
           | pointers.
        
           | crabbone wrote:
            | This is really not unique to XFS... anyone who has worked
            | on a filesystem in the last decade would tell you that
            | tests like the one OP inadvertently created are
            | commonplace.
           | 
            | Unlike many user-space applications, filesystems have a
            | very well-defined range of conditions they have to work
            | in. E.g. every filesystem worth its salt comes with limits
            | on the number of everything in it, i.e. number of files,
            | groups, links and so on. And these limits are tested, they
            | aren't conjectures. Ask any filesystem developer how many
            | metadata operations per second their program can do, and
            | they will likely be able to answer you in their sleep.
            | This might be surprising on the consumer end of the deal,
            | but to the developers there's nothing new here.
        
       | hjgraca wrote:
       | Or as they call it, a simple "Hello World" Javascript project
        
       | paulirish wrote:
       | > A git fetch trace that was captured
       | 
       | Anyone know what observability software they are using to
       | visualize the GIT_TRACE details? (Or is the assumption that
       | the UI is Olly as well?)
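       | 
       | (If you just want the raw data to feed into something, trace2
       | will emit it, e.g.:
       | 
       |       GIT_TRACE2_PERF=/tmp/fetch.perf git fetch origin
       |       GIT_TRACE2_EVENT=/tmp/fetch.json git fetch origin
       | 
       | the EVENT target writes newline-delimited JSON, which is the
       | easier one to push into a visualizer.)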
        
       | avidiax wrote:
       | How does a "monorepo" differ from, say, using a master project
       | containing many git submodules[1], perhaps recursively? You would
       | probably need a bit of tooling. But the gain is that git commands
       | in the submodules are speedy, and there is only O(logN) commit
       | multiplication to commit the updated commit SHAs up the chain.
       | Think Merkle tree, not single head commit SHA.
       | 
       | Eventually, you may get a monstrosity like Android Repo [2]
       | though. And an Android checkout and build is pushing 1TB these
       | days.
       | 
       | But there, perhaps, the submodule idea wins again. Replace most
       | of the submodules with prebuilt variants, and have full source +
       | building only for the module of interest.
       | 
       | [1] https://git-scm.com/book/en/v2/Git-Tools-Submodules
       | 
       | [2] https://source.android.com/docs/setup/download#repo
        
         | ramesh31 wrote:
         | >How does a "monorepo" differ from, say, using a master project
         | containing many git submodules[1], perhaps recursively?
         | 
         | Submodules are essentially broken with no way to fix them. It
         | was a good idea that never took off.
        
         | maccard wrote:
         | The problem with submodules is they're not "vanilla" git, and
         | have some subtle, unexpected behaviours. See this thread[0] for
         | some examples.
         | 
         | Submodules, like LFS, are a great idea that suck in practice
         | because they're bolted on to git to avoid compromising the
         | purity of the base project.
         | 
         | [0] https://news.ycombinator.com/item?id=31792303
        
           | [deleted]
        
         | tazjin wrote:
         | Most large monorepos simply are not on git. Google has Piper,
         | Yandex has arc, Facebook has eden (which is actually semi-open-
         | source, btw!), some companies use Perforce and so on.
        
           | MikusR wrote:
           | Microsoft uses git.
        
             | tazjin wrote:
             | Not off-the-shelf git though, they have their own file
             | system virtualisation stuff on top. Some of that used to be
             | open-source (Windows only, I think?).
        
               | WorldMaker wrote:
               | VFS for Git is still Open Source:
               | https://github.com/microsoft/VFSForGit
               | 
               | Microsoft's blog posts have indicated a move to use
               | something as close to off-the-shelf git as possible,
               | though. They say they've stopped using VFS much and are
               | instead more often relying on sparse checkouts. They've
               | upstreamed a lot of patches into git itself, and maintain
               | their own git fork but the fork distance is generally
               | shrinking as those patches upstream.
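                | 
                | With stock git that roughly looks like a partial
                | clone plus a cone-mode sparse checkout (URL, branch
                | and paths here are just placeholders):
                | 
                |       git clone --filter=blob:none --no-checkout \
                |           https://example.com/big/repo.git
                |       cd repo
                |       git sparse-checkout set --cone my/component
                |       git checkout main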
        
           | 0xcoffee wrote:
           | Windows is coming out with their own 'Dev Drive':
           | https://learn.microsoft.com/en-us/windows/dev-drive/
           | 
           | I'm very curious how it performs compared to EdenFS: https://
           | github.com/facebook/sapling/blob/main/eden/fs/docs/O...
        
             | tazjin wrote:
             | Maybe I'm misreading those Dev Drive docs, but those don't
             | seem related in any way?
             | 
             | Dev Drive seems to be a special type of disk volume with
             | higher reliability or something for dev-related workloads.
             | 
             | EdenFS is "Facebook's CITC", i.e. a virtual filesystem view
             | into a remote version-control system.
        
           | avidiax wrote:
           | I think it's orthogonal.
           | 
           | "Monorepo" is a culture around having a single branch with a
           | single lineage, and not developing anything in any isolation
           | greater than a single developer's workstation.
           | 
           | I agree that Git is not very adequate for large monorepos,
           | but I'd say that most open source projects are on Git, and
           | most of them are trivial monorepos.
        
             | hgsgm wrote:
             | No. Branching is orthogonal to modules/monorepo.
        
         | ants_everywhere wrote:
         | The monorepo is essentially a single file system.
         | 
          | Things like moving a file from one git submodule to another
          | are more cumbersome than just `mv foo dir/bar`. That means
          | your directory structure is in practice tightly coupled to
          | the tree of git projects.
         | 
         | Also, since any of the git sub-repos can be branched, the chaos
         | of merging development branches seems like it gets even more
         | complicated in a submodule architecture.
         | 
         | It may be possible to put a user interface that abstracts away
         | the submodule architecture and forces everything to live on
         | HEAD. But at that point it might be easier to just provide a
         | git-like UI to a centralized VCS.
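          | 
          | Roughly, with made-up paths:
          | 
          |       # monorepo: one atomic, history-preserving change
          |       git mv libfoo/util.py libbar/util.py
          |       git commit -m "Move util.py into libbar"
          | 
          |       # submodules: three commits across three repos
          |       cp libfoo/util.py libbar/util.py
          |       git -C libbar add util.py
          |       git -C libbar commit -m "Add util.py"
          |       git -C libfoo rm util.py
          |       git -C libfoo commit -m "Remove util.py"
          |       git add libfoo libbar
          |       git commit -m "Bump submodule pointers"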
        
         | dmoy wrote:
         | > How does a "monorepo" differ from, say, using a master
         | project containing many git submodules[1], perhaps recursively
         | 
         | One fundamental way it differs is atomic commits. You can't
         | change something in repo A and subsubrepo XYZ in a single pull.
         | 
         | A monorepo allows you to do things like atomic commits to
         | arbitrary pairs of files in the repo, which among other things
         | opens up the possibility of enforcing single-version of
         | libraries, which in turn removes a whole class of diamond
         | dependency issues.
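          | 
          | Illustratively (paths made up), "fix the library and every
          | caller at once" is just:
          | 
          |       git add common/logging/ services/ tools/
          |       git commit -m "logging: require an explicit context"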
         | 
         | There's other benefits, but imo it's probably not worth it for
         | most companies because of the staggering number of things it
         | breaks in the developer toolspace once it gets large enough.
         | Eventually you need teams of people that do nothing but make
         | tooling to support monorepo scaling, because everything off the
         | shelf explodes (what do you do when even perforce can't handle
         | your repo?)
         | 
         | For example, at Google we have a team of people who do nothing
         | but, effectively, recreate the cross referencing and jump-to-
         | def everyone else gets for "free" from IntelliJ / VS
          | IntelliSense, etc. (We do other stuff too, but that's a fair
         | paraphrase). And on top of that the team really only exists
         | because Steve Yegge is a Force to be Reckoned With, otherwise
         | we might still be flailing around without jump to def, idk.
        
           | jeffbee wrote:
           | The ability of Google's internal code search to jump between
           | declaration, definition, override, and call site is miles
           | ahead of what Intellisense can do.
        
           | HdS84 wrote:
            | One major point for monorepos is the ability to eschew
            | packages. In most languages creating, publishing and
            | consuming a package is a lot of work, while in a monorepo
            | you just add a reference to the code and are ready to go
            | (except in React Native... gaah, that was pure horror).
            | That's especially valuable if you need to refactor
            | something and need to adjust its dependencies. Doing that
            | via packages is slow and painful. Via project references
            | it's much easier and has a tight feedback loop:
            | change+build+fix instead of change+build+publish+consume+fix.
        
             | dmoy wrote:
             | Yup, agree completely. That's a natural extension of the
             | same thing that enables atomic commits - suddenly just
             | having direct library dependencies instead of packages
             | isn't that big of a problem if you push everything into the
             | monorepo.
             | 
             | > That's especially valuable if you need to refactor
              | something and need to adjust its dependencies.
             | 
             | And yes exactly, being able to change a library and all of
             | its callers at the same time is pretty handy.
        
         | solarkraft wrote:
         | > And an Android checkout and build is pushing 1TB these days
         | 
          | I remember it "only" being somewhere around 200 GB.
        
         | jcarrano wrote:
          | The monorepo is where you end up when you have failed to
          | enforce encapsulation and your "modules" do not have stable
          | APIs (or aren't actually modular). Then, with submodules,
          | each change will often involve multiple commits to
          | different modules, plus commits to update references, so
          | O(N) commit multiplication.
        
           | hgsgm wrote:
           | Monorepo works whether your APIs are modular or not, and also
           | allows changes to the modular structure.
        
       | eigenvalue wrote:
       | Since 70% of the files were xlf files used for
       | translation/localization, couldn't they instead just store all of
       | those in a single SQLite file and solve their problem much more
       | easily? Any of the nuances of the directory structure could be
       | captured in SQLite tables and relationships, and it would be easy
       | to access them for edits by non-coders using a tool like DB
       | Browser.
       | 
       | I feel like often people make problems much harder than they need
       | to be by imposing arbitrary constraints on themselves that could
       | be avoided if they approached the problem differently.
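       | 
       | A minimal sketch of that shape (schema entirely made up):
       | 
       |       sqlite3 translations.db "
       |         CREATE TABLE messages (
       |           locale TEXT NOT NULL,
       |           key    TEXT NOT NULL,
       |           value  TEXT NOT NULL,
       |           PRIMARY KEY (locale, key)
       |         );"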
        
         | Cthulhu_ wrote:
         | That just sounds like adding another problem though. A
         | filesystem (and git) already is a database, and plain files can
         | be read and managed more easily than a possibly corruptible
         | binary file. Plus, you'd lose history, unless you add more
         | complexity to add history.
         | 
         | I mean I don't know if they ever needed history but, just
         | saying. You get certain things for free by using a filesystem /
         | git.
        
           | melx wrote:
           | You commit the sqlite dump file(?) to git and have the
           | history...
           | 
            | I dunno, but there are folks who would put anything in
            | git. I work with someone who manages to exceed the disk
            | space of the company's GitLab instance by git-adding
            | everything. The disk is full again once a month.
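            | 
            | e.g. (file names made up), and the dump is plain text, so
            | the history stays diffable:
            | 
            |       sqlite3 translations.db .dump > translations.sql
            |       git add translations.sql
            |       git commit -m "Update translations"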
        
       | steffres wrote:
       | Anyone know what the advantage of this is over a big composite
       | repo with several git submodules?
       | 
       | I think that submodules are better suited for separation of
       | concerns and performance, even while achieving the same composite
       | structure as an equivalent monorepo?
        
         | aseipp wrote:
         | The advantage is simple: Git submodules suck and are a chore to
         | manage for any dependency that sees remotely high traffic or
         | requires frequent synchronization. As the number of developers,
         | submodules, and synchronization requirements increase, this
          | pain increases dramatically. Basic git features, like cherry
          | picking and bisecting to find errors, become dramatically
          | worse.
         | You cannot even run `git checkout` without potentially
         | introducing an error, because you might need to update the
         | submodule! All your most basic commands become worse. I have
         | worked on and helped maintain projects with 10+ submodules, and
         | they were one of the most annoying, constantly problematic pain
         | points of the entire project, that every single developer
         | screwed up repeatedly, whether they were established
         | contributors or new ones. We had to finally give in and start
         | using pre-push hooks to ban people from touching submodules
         | without specific commit message patterns. And every single time
         | we eliminated a submodule -- mostly by merging them and their
         | history into the base project, where they belonged anyway --
         | people were happier, development speed increased, and people
          | made fewer errors.
         | 
         | The reasons for those things being separate projects had a
         | history (dating to a time before Git was popular, even) and can
         | be explained, but ultimately it doesn't matter; by the time I
         | was around, all of those reasons ceased to exist or were simply
         | not important.
         | 
         | I will personally never, ever, ever, ever allow Git submodules
         | in any project I manage unless they are both A) extremely low
         | traffic, so updating them constantly doesn't suck, and B) a
         | completely external dependency that is mostly outside of my
         | control, that cannot be managed any other way.
         | 
         | Save yourself hair and time and at least use worktrees instead.
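          | 
          | (Worktrees, for reference, are a one-liner per extra
          | checkout; branch and path below are just examples:
          | 
          |       git worktree add ../myproject-release release/2.x
          | 
          | and you get a second working copy sharing the same object
          | store.)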
        
         | tantalor wrote:
         | Monorepo allows a single commit to update across components, eg
         | an API change
        
           | steffres wrote:
            | For each submodule affected by some change you would need
            | an additional commit, yes. But those commits are bundled
            | together in the commit of the parent repo, where they act
            | as one.
            | 
            | So, atomicity of changes can be guaranteed, but you need
            | to write a few more commits. However, this small increase
            | in commits is far outweighed by the modularity, imo.
        
             | marksomnian wrote:
             | Is it? I'm slightly struggling to understand what benefit
             | you gain from having the "parent" repo but also having
             | individual submodules. Sure, working in each individual
             | project's module makes cloning faster, until you need to
             | work on a module that references another module (at which
             | point you need to check out the parent repo or risk using
             | the wrong version), and now every change you make needs two
             | commits (one to the sub-repo, and one to the base to bump
              | the submodule reference).
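              | 
              | i.e. something like (names made up):
              | 
              |       git -C libs/persistence commit -am "Fix pool"
              |       git -C libs/persistence push
              |       git add libs/persistence
              |       git commit -m "Bump persistence submodule"
              |       git push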
        
               | steffres wrote:
                | In our case, we have a codebase that involves two
                | submodules: one for persistence and one for
                | Python-based management of internal git repos. Both
                | of these are
               | standalone applications and can run on their own. They
               | are then used in a parent repo which represents the
               | overarching architecture, which calls into the
               | submodules.
               | 
                | The advantage of this is that work can be done by devs
               | on the individual modules without much knowledge of the
               | overarching architecture, nor strong code ties into it.
               | 
               | Right now our persistence is done with SQL, but we could
               | swap it with anything else, e.g. mongo, and the parent
               | codebase wouldn't notice a thing since the submodule only
               | returns well defined python objects.
               | 
                | Of course, this comes at the cost of a higher number
                | of commits, as you mentioned. But in my opinion these
                | are still cheap because they only add trivial
                | quantity, not brain-demanding quality.
        
               | marksomnian wrote:
               | But what do you do as soon as one of the submodules has a
               | dependency on another? I imagine you might not hit it in
               | your simple case, but I feel like scenarios like that are
               | where the advantages of monorepos lie.
               | 
               | To take a concrete example, I'm working on a codebase
               | that houses both a Node.js server-side application and an
               | Electron app that communicates with it (using tRPC [0]).
               | The Electron app can directly import the API router types
               | from the Node app, thus gaining full type safety, and
               | whenever the backend API is changed the Electron app can
               | be updated at the same time (or type checks in CI will
               | fail).
               | 
               | If this weren't in a monorepo, you would need to first
               | update the Node app, then pick up those changes in the
               | Electron app. This becomes risky in the presence of
               | automated deployment, because, if the Node app's changes
               | accidentally introduced a breaking API change, the
               | Electron app is now broken until the changes are picked
               | up. In a monorepo you'd spot this scenario right away.
               | (Mind you, there is still the issue of updating the built
               | Electron app on the users' machines, but the point
               | remains - you can easily imagine a JS SPA or some other
               | downstream dependency in its place.)
               | 
               | [0]: https://trpc.io/
        
               | steffres wrote:
                | Yes, if one submodule depended on another, that would
                | indeed cause problems.
                | 
                | So far we've been able to avoid it, though, by strict
                | encapsulation.
                | 
                | But I definitely see the point in your example and
                | probably wouldn't use submodules there either.
                | 
                | It's just that in OP's link I'm quite sceptical, as
                | the monorepo approach requires quite a lot of heavy
                | tweaking.
        
             | crabbone wrote:
              | I had missed the git push --recurse-submodules flag,
              | even though it seems like it's been there for a long
              | time. Yeah, it seems like it would work, except you need
              | to configure it to always be at least "check" so that
              | it's on every time you push.
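              | 
              | For reference, the knobs look like:
              | 
              |       # refuse the push if a submodule commit it
              |       # depends on hasn't been pushed yet
              |       git config push.recurseSubmodules check
              | 
              |       # or push the needed submodule commits first
              |       git config push.recurseSubmodules on-demand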
        
             | tantalor wrote:
              | > this small increase in commits is far outweighed by
              | the modularity
             | 
             | Not remotely, as the scale of the codebase increases, the
             | benefit of modularity goes to zero and the benefit of
             | atomic changes increases.
             | 
             | Also: it's not always feasible to break up a change into
             | smaller commits. Sometimes atomic change is the only way to
             | do it.
        
               | crabbone wrote:
                | With --recurse-submodules the atomicity doesn't seem
                | to suffer. It used to be the case that you couldn't
                | ensure all changes in the source tree were pushed
                | atomically; now you can, but I'm not sure it's the
                | default behavior.
        
       | Smaug123 wrote:
       | Bold move to enable the "ours" merge strategy by default! I
       | presume this is a typo for the "-Xours" merge _option_ to `ort`
       | or `recursive`, but that still seems pretty brave.
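       | 
       | (For anyone else confused by the naming, these are different
       | animals:
       | 
       |       # strategy: keep our tree wholesale, ignore theirs
       |       git merge -s ours topic
       | 
       |       # option: only resolve conflicting hunks in our favour
       |       git merge -s ort -X ours topic
       | 
       | the first never takes anything from the other branch.)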
        
         | [deleted]
        
         | Waterluvian wrote:
         | A very useful flag. But as a default? That is scary...
        
           | avidiax wrote:
           | It is still only "ours" per hunk. But yes, it could
           | obliterate changes. On the other hand, the default merge
           | strategy is a huge waste of developer time. There is rarely a
           | genuine conflict. It's usually just that we want to keep both
           | sides.
        
             | Smaug123 wrote:
             | I've found `ort` (which I believe is now the default) to be
             | better than `recursive` by leaps and bounds. Have you tried
             | `ort`?
        
       | rsp1984 wrote:
       | I am not sure what I'm looking at here. Surely those half million
       | files are for dozens if not hundreds of different apps, libraries
       | and tools and surely those do not all depend on each other, no?
       | 
       | Because if so, why not just use one repo per app/library/tool?
       | Sure, if you have a cluster of things that all depend on each
       | other, or a cluster of things that typically is needed in bulk,
       | by all means, put those in a single repo.
       | 
       | But putting literally _all_ your code in a single repo is not a
       | very sane technical choice, is it?
        
         | compiler-guy wrote:
         | Google runs a single monorepo for 95% of its projects across
         | the company. Google isn't perfect, but it's hard to argue that
         | it isn't technically very good.
         | 
         | One of the biggest advantages is that there is no version
         | chasing and dependency questions. At commit X, everything works
         | consistently. No debating about whether this or that dependency
         | is out of sync.
        
           | nottorp wrote:
           | > Our engineers generally work in small teams and interact
           | with an even smaller subset of the monorepo.
           | 
            | The article says. But if Google does it, it must be good.
        
         | jayd16 wrote:
          | Depends on the test tooling. If you want a single change to
          | pass integration tests across components, then those
          | components need to be in a single commit. Otherwise you're
          | tracking every version of every tool.
          | 
          | But I like to look at the problem from another perspective:
          | why _not_ use a single repo? The only real reason would be
          | to work around technical challenges with your source
          | control of choice, not because having everything tracked
          | together is inherently bad.
        
       ___________________________________________________________________
       (page generated 2023-08-28 23:01 UTC)