[HN Gopher] FYI: LLVM-project repo has exceeded GitHub upload size limit (2022)
       ___________________________________________________________________
        
       FYI: LLVM-project repo has exceeded GitHub upload size limit (2022)
        
       Author : hippich
       Score  : 132 points
       Date   : 2023-01-30 15:28 UTC (7 hours ago)
        
 (HTM) web link (discourse.llvm.org)
 (TXT) w3m dump (discourse.llvm.org)
        
       | oauea wrote:
       | > As it turns out, we only run repacks on a repository network
       | level, which means that repacks need to consider objects from all
       | forks of a given repository.
       | 
       | > Repacking entire repository networks will always lead to less
       | optimal pack sizes compared to repacking just objects from a
       | single fork. For GitHub, disk space is not the only thing we
       | optimize for, but also performance across forks and client
       | performance.
       | 
       | So the lesson here is you can DoS existing open source projects
       | somehow by forking them and increasing the forked repo size to
       | >2GB?
        
         | tagraves wrote:
         | How would the DoS work? As I understand it the issue here only
         | occurs if you try to push the entire project to a _new_ repo.
         | It doesn't affect existing repos.
        
           | [deleted]
        
           | pointlessone wrote:
           | I'm almost certain it affects all repos. It's just more
           | unusual to push huge packs into existing repos. If for
           | whatever reason you happen to have a huge branch clocking in
           | at over 2GB, you'd get the same error.
        
         | pointlessone wrote:
         | I don't think that's the case.
         | 
         | The issue is that GH doesn't accept overly big packs. git by
         | default packs everything into a single pack. Maximum pack size
         | can be specified either in config or as an argument to repack.
         | The way I read the error message, a user can push a huge repo
         | by making sure it's packed into a few packs under the 2GB
         | limit.
         | 
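         | For example, a sketch of both options (the 1g cap is
         | illustrative, and whether the push path actually honors it is
         | another question):
         | 
         |     # cap pack size via config...
         |     git config pack.packSizeLimit 1g
         |     # ...or pass the cap directly to repack
         |     git repack -a -d --max-pack-size=1g
         | 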
         | It doesn't seem like there's an easy way to turn this into a
         | DoS. GH would repack the fork network on its own schedule. A
         | user probably cannot trigger repacks. The repacks on GH's side
         | would probably be smaller than their limit, too. Packs are
         | probably scoped to a fork, and the server is an active client,
         | so it most likely wouldn't return objects from other forks. I
         | don't think it would be easy to DoS GH just by pushing big
         | packs (either under or over the limit).
        
         | mayli wrote:
         | True, for GitHub.
        
       | MithrilTuxedo wrote:
       | I ran into this at work trying to push a P4 clone that sat at
       | 20GB in my local .git folders. Needless to say, I didn't have
       | room on my local to actually check out a workspace.
       | 
       | A shell loop can push N commits at a time, using git's handy
       | syntax for that:
       | 
       | `git push origin p4/master~${x}:master`
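       | 
       | For instance, a rough sketch of that loop (the total commit
       | count and batch size are illustrative):
       | 
       |     # push 1000 commits at a time, oldest slice first
       |     total=$(git rev-list --count p4/master)
       |     for x in $(seq $((total - 1)) -1000 0); do
       |         git push origin "p4/master~${x}:refs/heads/master"
       |     done
       |     # push whatever remainder the step size left over
       |     git push origin p4/master:refs/heads/master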
        
       | mooreds wrote:
       | Would git prune ( https://git-scm.com/docs/git-prune ) help here?
       | Or is the entire repo of reachable objects more than 2GB?
        
         | trynewideas wrote:
         | Considering this from the thread, as a response from GitHub to
         | their repack request:
         | 
         | > As it turns out, we only run repacks on a repository network
         | level, which means that repacks need to consider objects from
         | all forks of a given repository.
         | 
         | > Repacking entire repository networks will always lead to less
         | optimal pack sizes compared to repacking just objects from a
         | single fork. For GitHub, disk space is not the only thing we
         | optimize for, but also performance across forks and client
         | performance.
         | 
         | Repacking the repo locally more than halves the size:
         | 
         | > I tried locally right now to run git repack -a -d -f
         | --depth=250 --window=250 and the size of the .git folder went
         | from 2417MB to 943MB...
         | 
         | But GitHub repacking the repo network reduces the size by only
         | 20%. So presumably GitHub can't aggressively remove things from
         | a repo without affecting how forks work.
         | 
         | This blog post is very old (2015) but there's a section that
         | describes their use of Git alternates to facilitate forks:
         | https://github.blog/2015-09-22-counting-objects/#your-very-o...
        
         | Arnt wrote:
         | The latter. It's >20 years of history, and if you want a quick
         | test of whether that history is relevant: I personally needed
         | to look at >20-year-old history this winter.
        
           | zffr wrote:
           | Why did you need to look at >20 year old history?
        
             | themerone wrote:
             | A lot of big applications have features that are coded
             | once and then go years with minimal changes.
             | 
             | My applications definitely have modules that only get
             | touched when we do a major UI refresh (about once every 10
             | years).
        
             | Arnt wrote:
             | Because code written >20 years ago still runs today. I
             | wanted to understand some odd behaviour, ran git blame to
             | find the relevant commits, then looked at the commit
             | messages. Some of the lines involved were _old._
        
             | LeifCarrotson wrote:
             | Not OP, but I've worked on lots of 20-50 year old CNC
             | machines, and have done controls for press brakes and
             | resistance welders that are more than 100 years in age
             | (granted, the PLCs and PCs were added later to streamline
             | the relay logic created around World War 1, but the cast
             | iron and motion works date to that era). Two weeks ago I
             | fixed up an RS232 DNC pipeline for a 1985 mill, the source
             | code manuals were heavily yellowed but I eventually figured
             | out the now-esoteric RS232 configuration. One of my
             | coworkers worked on this equipment a bit more than a decade
             | ago, but the newest part - a Windows 7 PC - was the first
             | thing to break.
             | 
             | The adage "if it ain't broke, don't fix it" comes with the
             | corollary that if you build it right it won't break...until
             | it does. And then someone has to do some archaeology to get
             | a process back online that hasn't been documented since
             | dot-matrix printers and typewriters, much less git
             | development. I'm also building stuff and making decisions
             | today for machines that have 10, 20, or 30-year expected
             | lifetimes, and the core components should last indefinitely
             | as long as maintenance is performed.
             | 
             | Many businesses aren't built for and aren't compatible with
             | a 2- or 3-year obsolescence cycle.
        
       | sixstringtheory wrote:
       | Maybe at this point it shouldn't be hosted or even mirrored on
       | GitHub.
       | 
       | CocoaPods and Homebrew both hit similar issues in the past with
       | huge work trees by hosting all their specs on GitHub, resulting
       | in breaking workflows for their users.
       | 
       | Those projects made changes to their workflow to mitigate this.
       | Since the source here is around 6 months old, has LLVM done
       | something similar?
        
         | trogdc wrote:
         | No they still use GitHub, and there are plans to switch code
         | reviews to use GitHub PRs as well.
        
         | newaccount74 wrote:
         | Wasn't the problem with brew that they were (maybe still are)
         | using the git repo as a content delivery network? As I recall,
         | the problem wasn't the repo size itself; it was that millions
         | of users had to do a "git clone" to update Homebrew, which
         | overwhelmed GitHub's infrastructure.
        
       | nativecoinc wrote:
       | I cloned:
       | 
       |     git clone https://github.com/llvm/llvm-project
       | 
       | When entering the directory my fancy prompt timed out:
       | 
       |     [WARN] - (starship::utils): Executing command "git" timed out.
       | 
       | But it seemed OKish after `git status` (disk cache?):
       | 
       |     time git status
       |     ...
       |     real 0m0.287s
       |     user 0m0.189s
       |     sys  0m0.392s
       | 
       | The size of `.git/objects` after the clone was 2.5GB.
       | 
       | Output of `git-sizer`: https://pastebin.com/T5HRMfg9
        
       | billti wrote:
       | Tangentially related, but after installing LLVM on Windows I
       | noticed that the clang, clang++, clang-cl, and clang-cpp binaries
       | are all identical. That's four copies of the same 116MB file. The
       | linkers (ld.lld, ld64.lld, lld, lld-link, wasm-ld) are likewise
       | the same 84MB binary repeated 5 times, and similarly llvm-ar,
       | llvm-lib and more are identical.
       | 
       | Overall looks like ~750MB of unnecessary file duplication for one
       | install of LLVM.
       | 
       | I get that Windows doesn't have the same niceties when it comes
       | to symbolic links as macOS and Linux, but that seems really
       | suboptimal and wasteful.
        
         | nequo wrote:
         | If not symlinks, might those be three or four hard links to the
         | same file? I'm not sure how to check on Windows.
        
       | weinzierl wrote:
       | I built LLVM about a year ago, and not only is the source large,
       | but the build artifacts filled my disk to the brim.
        
         | mshockwave wrote:
         | There are some workarounds to shrink the size. I think the most
         | notable ones are building only certain backends
         | (-DLLVM_TARGETS_TO_BUILD="X86;..."), building as shared
         | libraries (-DBUILD_SHARED_LIBS=ON), and splitting (DWARF) debug
         | info for debug builds (-DLLVM_USE_SPLIT_DWARF=ON). Plus,
         | usually I just build the tools I need instead of firing off a
         | bare `ninja` or `ninja all`.
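         | 
         | A minimal sketch of such a configure-and-build step (paths,
         | target list, and tool names are illustrative):
         | 
         |     cmake -G Ninja ../llvm \
         |         -DCMAKE_BUILD_TYPE=Debug \
         |         -DLLVM_TARGETS_TO_BUILD="X86" \
         |         -DBUILD_SHARED_LIBS=ON \
         |         -DLLVM_USE_SPLIT_DWARF=ON
         |     # build just one tool instead of `ninja all`
         |     ninja clang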
        
       | rebolek wrote:
       | It's interesting that the post doesn't deal with the basic
       | problem: if your repo is over 2GB, the problem may be on your
       | side.
        
         | sammyteee wrote:
         | How so? You realize there can be more than just code in a repo?
         | And even then, what if there is a lot of code?
        
           | psychphysic wrote:
           | He said 'may be'.
           | 
           | Are you really suggesting it's impossible for bad practices
           | to bloat a git repo?
        
           | hobofan wrote:
           | > You realize there can be more than just code in a repo?
           | 
           | Yes, and for large assets there are extra solutions like Git
           | LFS (which GitHub has support for).
        
           | mrguyorama wrote:
           | What large data assets does a compiler have? Is there an HD
           | video for you to watch while you compile?
        
             | pertymcpert wrote:
             | LLVM is much more than a compiler. It's a set of projects
             | under the umbrella of the LLVM project. It includes not
             | only LLVM and clang, but a linker, a collection of runtimes
             | to support many features (like various sanitizers), the
             | MLIR project, a Fortran front-end, C/C++ libraries, a
             | debugger, OpenMP support, a polyhedral optimization
             | framework, and all the tests that every feature in every
             | one of those projects has.
             | 
             | Is that too much under one umbrella? Probably. But it's not
             | just a compiler. It's a monorepo.
        
             | thewebcount wrote:
             | I work on video-related applications and plug-ins, and yes
             | we do sometimes need to include HD, 4K and larger assets in
             | our code base. That's probably not the case for LLVM, but
             | it's not at all out of the question for other types of
             | projects.
        
             | Arnt wrote:
             | The LLVM repo is that large because a lot of people have
             | worked on a lot of code.
             | 
             | It's not so difficult. A team of tens of programmers can
             | reach that size in a couple of decades, just by writing
             | code and textual documentation. No graphic assets required;
             | all it takes is a generation of steady work by a mid-sized
             | team.
        
             | staringback wrote:
             | > Is there an HD video for you to watch while you compile?
             | 
             | This made me laugh
        
             | jcranmer wrote:
             | The test suites of a compiler typically involve compiling
             | very large projects; SPEC CPU2017 is, I believe, a gigabyte
             | or so. Of course, these tests _aren't_ in the llvm source
             | repo; they're in a separate llvm-test-suite repo (although
             | SPEC itself isn't distributed there because it's a
             | benchmark you have to buy).
        
             | rcxdude wrote:
             | The latest checkout of the main branch of the LLVM project
             | is 1.3GB, basically entirely text. Of that, ~900MB is
             | tests, largely consisting of reference test vectors (this
             | apparently isn't even the entirety of LLVM's tests, just
             | the core ones). LLVM code itself weighs about 100MB,
             | consisting of about 2 million lines of code (not including
             | comments). clang (also in the same repo) is another 50MB,
             | and about 800k lines of code. The remainder is other
             | utilities and documentation.
        
       | lwhsiao wrote:
       | Only tangentially related, but I've found that git-sizer [1] is
       | handy for getting a sense of repository metrics.
       | 
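       | Running it on a clone is a one-liner (--verbose prints the full
       | metrics table):
       | 
       |     git sizer --verbose
       | 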
       | [1]: https://github.com/github/git-sizer
        
       | secondcoming wrote:
       | June 2022
        
       | cycomanic wrote:
       | Funny, I just encountered something similar last week. My org
       | runs a GitLab instance and I created a repo with lecture slides.
       | The slides are HTML files and some of them have generated videos
       | (for various reasons I wanted to upload the slides, not the
       | source scripts). Turns out that my org set a 150MB upload limit
       | and one of the commits exceeded it, so my push was failing. The
       | fix was non-trivial, but git-sizer was definitely helpful.
        
       | londons_explore wrote:
       | I think GitHub has some whitelist for large popular projects.
       | Chromium is hundreds of GB for example, but you can still fork it
       | on GitHub and make any changes you like and push them.
        
         | madeofpalk wrote:
         | If I understand correctly, the limitation is on the size of a
         | push.
         | 
         | > _If you happened to push the entire llvm-project to another
         | new (personal) GitHub repo recently, you might encounter the
         | following error message before the whole process bails out:_
         | 
         | > _The crux here is that we tried to push the entire repo,
         | which has exceeded GitHub's upload limit (2GB) at once. This is
         | not a problem for majority of the developers who already had a
         | copy of llvm-project in their separated GitHub repos_
         | 
         | If you fork the repo on the GitHub website, they manage the
         | fork server side, and you never have to push up the entire
         | repo.
        
       | t344344 wrote:
       | It only happens if you push the entire repo at once. A script
       | could split the upload into multiple smaller parts and solve
       | this.
        
         | silverwind wrote:
         | Sounds like Git CLI could do this automatically, if it can
         | detect these errors (I think it can't).
        
           | pionar wrote:
           | This is a GitHub-specific issue, so I don't think Git should
           | do anything about it.
        
       | low_tech_punk wrote:
       | On a philosophical note: git, being a distributed source control
       | system, has no limit on repo size, but GitHub, being a
       | centralized "hub", suffers scaling problems. It's fascinating to
       | watch these tradeoffs play out. Maybe IPFS will come to the
       | rescue one day.
        
         | cabirum wrote:
         | The problem here is unbounded growth of a git repo. In this
         | specific case, a size limit was triggered. In other
         | circumstances, it would have taken too much time to transfer,
         | or left no storage space.
         | 
         | Anyway, the problem is that git stores all changes, forever. A
         | better approach would be to clean out old commits, or somehow
         | merge them into snapshots of fixed timespans (like, anything
         | older than a year gets compressed into monthly changesets).
        
           | howinteresting wrote:
           | This is not a "problem"; this is why source control exists.
           | You should _never_ rewrite published history.
        
             | torstenvl wrote:
             | I don't think I can agree with that.
             | 
             | Accidentally publish secrets/credentials? Rotate them, yes,
             | but also remove them from the published history.
             | 
             | Accidentally publish a binary for a build tool without the
             | proper license? Definitely remove it (and add it to your
             | .gitignore so it doesn't happen again!)
             | 
             | You discover a major flaw that bricks certain systems or
             | causes data loss? Retroactively replace the
             | Makefile/configure script/whatever to print out a warning
             | instead of building the bad build.
             | 
             | I'm sure there are others.
        
             | cabirum wrote:
             | The value of a commit approaches zero as it gets older.
             | Past a certain threshold, no one will ever look at it.
             | Never say never; but is there any reason we should keep
             | deadweight around?
        
               | themerone wrote:
               | As long as a line of code is in use, there is value in
               | knowing who authored it and when.
               | 
               | If a 10-year-old vulnerability is found in OpenSSL, it
               | would be nice to be able to investigate whether it was
               | an accident or an act of espionage.
        
               | howinteresting wrote:
               | Your premise is incorrect. The other day I was looking
               | around a repository that's been through many migrations,
               | and found a commit from 2004 that was relevant to my
               | interests.
        
             | tonnydourado wrote:
             | A more conservative approach would be some sort of layered
             | storage/archiving, I guess. The older the commit, the less
             | likely it is to be used, so it could be archived in
             | different storage, optimized for the long term. This way
             | you keep the "hot" history small, while keeping the full
             | history available.
        
               | howinteresting wrote:
               | That's generally how git packs are used at large
               | organizations that host their own repositories. I'm sure
               | GitHub does something similar.
        
             | crznp wrote:
             | I don't think that your points are actually in conflict.
             | 
             | If this is my source code, I want the whole history. I want
             | that 10-year-old commit that isn't used in any current
             | branch. A build machine may not need any history: it just
             | wants to check out a particular branch as it is right now,
             | and that works too.
             | 
             | But there is an intermediate case: Let's say that I have an
             | issue with a dependency. I might check out that code and
             | want some history to know what has changed recently, but I
             | don't need that huge zip file that was accidentally checked
             | in and then removed 4 years ago. If it were a consistent
             | problem, perhaps you'd invent some sort of 'shallow' or
             | 'partial' clone, something like this:
             | 
             | https://github.blog/2020-12-21-get-up-to-speed-with-partial-...
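             | 
             | A blobless partial clone, for instance, looks like this
             | (using the llvm repo as the example):
             | 
             |     # fetch commits and trees up front; blobs on demand
             |     git clone --filter=blob:none \
             |         https://github.com/llvm/llvm-project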
        
               | howinteresting wrote:
               | True, though shallow clones have performance issues:
               | https://github.com/Homebrew/discussions/discussions/225
        
           | mrguyorama wrote:
           | Git allows you to rewrite history, so you can "squash" old
           | commits to reduce the size of history as needed.
        
             | cabirum wrote:
             | Sure, but rewriting it manually is a tedious process. It
             | should be automated on GitHub's side, to keep repo size
             | approximately constant over time.
        
               | Arnavion wrote:
               | That would be silent data loss, so absolutely should not
               | be automated.
        
               | blueflow wrote:
               | GitHub can't rewrite the refs on their own without
               | breaking users' stuff. They can only repack the existing
               | objects; the squashing needs to be done by the
               | developers. Also, it's a non-fast-forward change, so it
               | needs to be coordinated between the git users anyway.
        
               | saghm wrote:
               | I think this is what `git filter-branch` is supposed to
               | be for: https://git-scm.com/docs/git-filter-branch
               | 
               | I've never used it before, but from what I understand,
               | it's very powerful but also very confusing and easy to
               | mess up, and of course it has a sort of vague, ambiguous
               | name that makes it hard to discover; in other words, it's
               | quintessentially git.
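               | 
               | A typical invocation to purge one file from all history
               | (the filename here is made up) looks something like:
               | 
               |     git filter-branch --index-filter \
               |         'git rm --cached --ignore-unmatch big-blob.zip' \
               |         -- --all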
        
             | jacobr1 wrote:
             | It would be great to have some kind of way to do this while
             | still maintaining the Merkle tree.
        
               | jonhohle wrote:
               | Isn't that what packs are for? The raw, content-
               | addressable object store has no inherent optimization for
               | reducing repo size. Any changed file is copied in full
               | until a higher layer does something to compress that
               | down.
        
         | tsimionescu wrote:
         | I don't understand where git's distributed nature comes into
         | play here. Perforce is fully centralized, and it also doesn't
         | have any limits to repo size - as long as you have enough disk,
         | P4 will handle it.
         | 
         | In fact, Git's distributed nature actually makes it infamously
         | bad at scaling with repo size - since it requires every
         | "participant" in a repo to have a copy of the entire repo
         | before they can start any work on it.
        
           | rvbissell wrote:
           | Shallow cloning of Git repos is a thing. Basically, you get a
           | fake head commit (that includes all the files) and none of
           | the real history. Useful if you only intend to build, or make
           | changes locally. If you want to push, you have to unshallow
           | first.
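           | 
           | For example:
           | 
           |     # history-less clone: one fake head commit, all files
           |     git clone --depth 1 https://github.com/llvm/llvm-project
           |     # fetch the rest of the history before pushing
           |     git fetch --unshallow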
        
             | someguy101010 wrote:
             | would be cool to be able to commit without unshallowing
             | first
        
               | phil-m wrote:
               | Not possible, due to the way git works: it's a Merkle
               | tree of commits, where each commit points to a file tree
               | (content-addressed by its hash) and to the previous
               | commit.
        
           | dblitt wrote:
           | This is where something like VFSForGit [0] helps out. Instead
           | of cloning the entire repo, it creates a virtual file system
           | and fetches objects on demand. MSFT uses it internally for
           | the Windows source tree (which now exceeds 300GB).
           | 
           | [0]: https://github.com/microsoft/VFSForGit
        
             | btown wrote:
             | VFSForGit seems to be in maintenance mode now - do you know
             | what it was replaced by, and if there's something that
             | works cross-platform?
        
               | AaronFriel wrote:
               | It was replaced by Scalar and is now merged into Git:
               | 
               | Introduced:
               | https://devblogs.microsoft.com/devops/introducing-scalar/
               | 
               | Integration into Git:
               | https://github.blog/2022-10-13-the-story-of-scalar/
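               | 
               | With a Git recent enough to ship scalar, usage is just
               | (a sketch, using the llvm repo as the example):
               | 
               |     scalar clone https://github.com/llvm/llvm-project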
        
         | Arnt wrote:
         | I believe the problem here is related to LLVM using a CI model
         | where all changes eventually go to a single tree and tests are
         | run on that single tree. Hosting that CI on a site other than
         | GitHub wouldn't really change anything. Maybe it would change
         | the limit, but all such CI hosts have a limit, for the same
         | reason that all search engines have a size limit for web pages
         | they scrape.
        
       | CoastalCoder wrote:
       | I'm not sure what the current status of this issue is, based on
       | the linked Discourse thread.
       | 
       | Is it still a fundamental limitation, but with a known client-
       | side workaround?
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-01-30 23:01 UTC)