[HN Gopher] FYI: LLVM-project repo has exceeded GitHub upload si...
___________________________________________________________________
FYI: LLVM-project repo has exceeded GitHub upload size limit (2022)
Author : hippich
Score : 132 points
Date : 2023-01-30 15:28 UTC (7 hours ago)
(HTM) web link (discourse.llvm.org)
(TXT) w3m dump (discourse.llvm.org)
| oauea wrote:
| > As it turns out, we only run repacks on a repository network
| level, which means that repacks need to consider objects from all
| forks of a given repository.
|
| > Repacking entire repository networks will always lead to less
| optimal pack sizes compared to repacking just objects from a
| single fork. For GitHub, disk space is not the only thing we
| optimize for, but also performance across forks and client
| performance.
|
| So the lesson here is you can DoS existing open source projects
| somehow by forking them and increasing the forked repo size to >2GB?
| tagraves wrote:
| How would the DoS work? As I understand it the issue here only
| occurs if you try to push the entire project to a _new_ repo.
| It doesn't affect existing repos.
| [deleted]
| pointlessone wrote:
| I'm almost certain it affects all repos. It's just more
| unusual to push huge packs into existing repos. If for
| whatever reason you happen to have a huge branch clocking in
| at over 2GB, you'd get the same error.
| pointlessone wrote:
| I don't think that's the case.
|
| The issue is that GH doesn't accept packs that are too big. git by
| default packs everything into a single pack. Maximum pack size
| can be specified either in config or as an argument to repack.
| The way I read the error message, a user can push a huge repo by
| making sure it's packed into a few packs under the 2GB limit.
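|
| For illustration, a hedged sketch of both options (pack.packSizeLimit
| and --max-pack-size are real git options; the 1g value is arbitrary):
|       # cap pack size via config when repacking locally
|       git -c pack.packSizeLimit=1g repack -a -d
|       # or pass it directly as an argument to repack
|       git repack -a -d --max-pack-size=1g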
|
| It doesn't seem like there's an easy way to turn this into a
| DoS. GH would repack the fork network on its own schedule. A
| user probably can not trigger repacks. The repacks on GH's side
| would probably be smaller than their limit, too. Packs are
| probably scoped to a fork, and the server is an active client, so
| it most likely wouldn't return objects from other forks. I
| don't think it would be easy to DoS GH just by pushing big
| packs (either under or over the limit).
| mayli wrote:
| True, for github.
| MithrilTuxedo wrote:
| I ran into this at work trying to push a P4 clone that sat at
| 20GB in my local .git folders. Needless to say, I didn't have
| room on my local to actually check out a workspace.
|
| A shell loop can push N commits at a time, using git's handy
| syntax for that:
|
| `git push origin p4/master~${x}:master`
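|
| A hedged sketch of the full loop (the commit counts and branch names
| are illustrative, not taken from my actual push):
|       # push history in ~500-commit chunks, oldest first
|       for x in $(seq 5000 -500 0); do
|           git push origin "p4/master~${x}:refs/heads/master"
|       done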
| mooreds wrote:
| Would git prune ( https://git-scm.com/docs/git-prune ) help here?
| Or is the entire repo of reachable objects more than 2GB?
| trynewideas wrote:
| Considering this from the thread, as a response from GitHub to
| their repack request:
|
| > As it turns out, we only run repacks on a repository network
| level, which means that repacks need to consider objects from
| all forks of a given repository.
|
| > Repacking entire repository networks will always lead to less
| optimal pack sizes compared to repacking just objects from a
| single fork. For GitHub, disk space is not the only thing we
| optimize for, but also performance across forks and client
| performance.
|
| Repacking the repo locally more than halves the size:
|
| > I tried locally right now to run git repack -a -d -f
| --depth=250 --window=250 and the size of the .git folder went
| from 2417MB to 943MB...
|
| But GitHub repacking the repo network reduces the size by only
| 20%. So presumably GitHub can't aggressively remove things from
| a repo without affecting how forks work.
|
| This blog post is very old (2015) but there's a section that
| describes their use of Git alternates to facilitate forks:
| https://github.blog/2015-09-22-counting-objects/#your-very-o...
| Arnt wrote:
| The latter. It's >20 years of history, and if you want a straw
| test as to whether the history is relevant: I personally needed
| to look at >20-year-old history this winter.
| zffr wrote:
| Why did you need to look at >20 year old history?
| themerone wrote:
| A lot of big applications have features that are coded and
| go years with minimal changes.
|
| My applications definitely have modules that only get
| touched when we do a major UI refresh (about once every 10
| years.)
| Arnt wrote:
| Because code written >20 years ago still runs today. I
| wanted to understand some odd behaviour, ran git blame to
| find the relevant commits, then looked at the commit
| messages. Some of the lines involved were _old._
| LeifCarrotson wrote:
| Not OP, but I've worked on lots of 20-50 year old CNC
| machines, and have done controls for press brakes and
| resistance welders that are more than 100 years in age
| (granted, the PLCs and PCs were added later to streamline
| the relay logic created around World War 1, but the cast
| iron and motion works date to that era). Two weeks ago I
| fixed up an RS232 DNC pipeline for a 1985 mill; the source
| code manuals were heavily yellowed, but I eventually figured
| out the now-esoteric RS232 configuration. One of my
| coworkers worked on this equipment a bit more than a decade
| ago, but the newest part - a Windows 7 PC - was the first
| thing to break.
|
| The adage "if it ain't broke, don't fix it" comes with the
| corollary that if you build it right it won't break...until
| it does. And then someone has to do some archaeology to get
| a process back online that hasn't been documented since
| dot-matrix printers and typewriters, much less git
| development. I'm also building stuff and making decisions
| today for machines that have 10, 20, or 30-year expected
| lifetimes, and the core components should last indefinitely
| as long as maintenance is performed.
|
| Many businesses aren't built for and aren't compatible with
| a 2- or 3-year obsolescence cycle.
| sixstringtheory wrote:
| Maybe at this point it shouldn't be hosted or even mirrored on
| GitHub.
|
| CocoaPods and Homebrew both hit similar issues in the past with
| huge work trees by hosting all their specs on GitHub, resulting
| in breaking workflows for their users.
|
| Those projects made changes to their workflow to mitigate this.
| Since the source here is around 6 months old, has LLVM done
| something similar?
| trogdc wrote:
| No they still use GitHub, and there are plans to switch code
| reviews to use GitHub PRs as well.
| newaccount74 wrote:
| Wasn't the problem with brew that they were (maybe still are)
| using the git repo as a content delivery network? As I recall,
| the problem wasn't the repo size itself; the problem was that
| millions of users had to do a "git clone" to update Homebrew,
| and it overwhelmed GitHub's infrastructure.
| nativecoinc wrote:
| I cloned the repo:
|       git clone https://github.com/llvm/llvm-project
|
| When entering the directory, my fancy prompt timed out:
|       [WARN] - (starship::utils): Executing command "git" timed out.
|
| But it seemed OKish after `git status` (disk cache?):
|       time git status
|       real    0m0.287s
|       user    0m0.189s
|       sys     0m0.392s
|
| The size of `.git/objects` after the clone was 2.5GB.
|
| Output of `git-sizer`: https://pastebin.com/T5HRMfg9
| billti wrote:
| Tangentially related, but after installing LLVM on Windows I
| notice that the clang, clang++, clang-cl, and clang-cpp binaries
| are all identical. That's four copies of the same 116MB file. The
| linkers (ld.lld, ld64.lld, lld, lld-link, wasm-ld) are likewise
| the same 84MB binary repeated 5 times, and similarly llvm-ar,
| llvm-lib and more are identical.
|
| Overall looks like ~750MB of unnecessary file duplication for one
| install of LLVM.
|
| I get that Windows doesn't have the same niceties when it comes
| to symbolic links as macOS and Linux, but that seems really
| suboptimal and wasteful.
| nequo wrote:
| If not symlinks, might those be three or four hard links to the
| same file? I'm not sure how to check on Windows.
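|
| One way to check (fsutil ships with Windows; the install path below
| is just a guess):
|       fsutil hardlink list "C:\Program Files\LLVM\bin\clang.exe"
| If the other names show up in that output, they are hard links to
| one file rather than separate copies.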
| weinzierl wrote:
| I built LLVM about a year ago, and not only is the source large, but
| the build artifacts filled my disk to the brim.
| mshockwave wrote:
| There are some workarounds to shrink the size. I think the most
| noticeable ones are building only certain backends
| (-DLLVM_TARGETS_TO_BUILD="X86;..."), building as shared libraries
| (-DBUILD_SHARED_LIBS=ON), and using split (DWARF) debug info for
| debug builds (-DLLVM_USE_SPLIT_DWARF=ON). Plus, I usually just
| build the tools I need instead of firing off `ninja` or `ninja all`.
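|
| Putting those flags together, a hedged example configure (the target
| list and generator are illustrative, not a recommendation):
|       cmake -G Ninja -S llvm -B build \
|           -DCMAKE_BUILD_TYPE=Debug \
|           -DLLVM_TARGETS_TO_BUILD="X86" \
|           -DBUILD_SHARED_LIBS=ON \
|           -DLLVM_USE_SPLIT_DWARF=ON
|       ninja -C build clang    # build just the tools you need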
| rebolek wrote:
| It's interesting that the post doesn't deal with the basic
| problem that if your repo is over 2GB, the problem may be on your
| side.
| sammyteee wrote:
| How so? You realize there can be more than just code in a repo?
| And even then, what if there is a lot of code?
| psychphysic wrote:
| He said 'may be'.
|
| Are you really suggesting it's impossible for bad practices
| to bloat a git repo?
| hobofan wrote:
| > You realize there can be more than just code in a repo?
|
| Yes, and for large assets there are extra solutions like Git
| LFS (which GitHub has support for).
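|
| A minimal hedged sketch of what that looks like (the file pattern
| and file name are illustrative):
|       git lfs install
|       git lfs track "*.mp4"
|       git add .gitattributes intro.mp4
|       git commit -m "Track large video with Git LFS"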
| mrguyorama wrote:
| What large data assets does a compiler have? Is there an HD
| video for you to watch while you compile?
| pertymcpert wrote:
| LLVM is much more than a compiler. It's a set of projects
| under the umbrella of the LLVM project. It includes not
| only LLVM and clang, but a linker, a collection of runtimes
| to support many features (like various sanitizers), the
| MLIR project, a fortran front-end, c/c++ libraries, a
| debugger, openmp support, a polyhedral optimization
| framework, and all the tests that every feature in every
| one of those projects has.
|
| Is that too much under one umbrella? Probably. But it's not
| just a compiler. It's a monorepo.
| thewebcount wrote:
| I work on video-related applications and plug-ins, and yes
| we do sometimes need to include HD, 4K and larger assets in
| our code base. That's probably not the case for LLVM, but
| it's not at all out of the question for other types of
| projects.
| Arnt wrote:
| The LLVM repo is that large because a lot of people have
| worked on a lot of code.
|
| It's not so difficult. A team of tens of programmers can
| reach that size in a couple of decades, just by writing
| code and textual documentation. No graphic assets required,
| all it takes is a generation of steady work by a mid-sized
| team.
| staringback wrote:
| > Is there an HD video for you to watch while you compile?
|
| This made me laugh
| jcranmer wrote:
| The test suites of a compiler typically involve compiling
| very large projects; SPEC cpu2017 is I believe a gigabyte
| or so. Of course, these tests _aren't_ in the llvm source
| repo, they're in a separate llvm-test-suite repo (although
| SPEC itself isn't distributed there because it's a
| benchmark you have to buy).
| rcxdude wrote:
| The latest checkout of the main branch of the LLVM project
| is 1.3GB, basically entirely text. Of that, ~900MB is
| tests, largely consisting of reference test vectors (this
| apparently isn't even the entirety of LLVM's tests, just
| the core ones). LLVM code itself weighs about 100MB,
| consisting of about 2 million lines of code (not including
| comments). clang (also in the same repo) is another 50MB,
| and about 800k lines of code. The remainder is other
| utilities and documentation.
| lwhsiao wrote:
| Only tangentially related, but I've found that git-sizer [1] is
| handy for getting a sense of repository metrics.
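|
| For example, run it from inside a clone (the --verbose flag is part
| of git-sizer itself):
|       git-sizer --verbose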
|
| [1]: https://github.com/github/git-sizer
| secondcoming wrote:
| June 2022
| cycomanic wrote:
| Funny, I just encountered something similar last week. My org runs
| a gitlab instance and I created a repo with lecture slides etc.
| The slides are html files and some of them have generated videos
| (for various reasons I wanted to upload the slides, not the source
| scripts). Turns out that my org set a 150MB upload limit, and one
| of the commits exceeded that limit, so my push was failing. The
| fix was non-trivial, but git-sizer was definitely helpful.
| londons_explore wrote:
| I think github has some whitelist for large popular projects.
| Chromium is hundreds of GB for example, but you can still fork it
| on github and make any changes you like and push them.
| madeofpalk wrote:
| If I understand correctly, the limitation is on the size of a
| push.
|
| > _If you happened to push the entire llvm-project to another
| new (personal) GitHub repo recently, you might encounter the
| following error message before the whole process bails out:_
|
| > _The crux here is that we tried to push the entire repo,
| which has exceeded GitHub's upload limit (2GB) at once. This is
| not a problem for majority of the developers who already had a
| copy of llvm-project in their separated GitHub repos_
|
| If you fork the repo on the Github website, they manage the
| fork server side, and you never have to push up the entire
| repo.
| t344344 wrote:
| It only happens if you push the entire repo at once. A script could
| split the upload into multiple smaller parts and solve this.
| silverwind wrote:
| Sounds like Git CLI could do this automatically, if it can
| detect these errors (I think it can't).
| pionar wrote:
| This is a GitHub-specific issue, so I don't think Git should
| do anything about it.
| low_tech_punk wrote:
| On a philosophical note, git, being a distributed source control
| system, has no limits on repo size, but GitHub, being a
| centralized "hub", suffers scaling problems. It's fascinating to
| watch these tradeoffs play out. Maybe IPFS will come to the
| rescue one day.
| cabirum wrote:
| The problem here is unbounded growth of a git repo. In this
| specific case, a size limit was triggered. In other
| circumstances, it would have required too much time to transfer,
| or there would be no storage space left.
|
| Anyway, the problem is that git stores all changes, forever. A
| better approach would be to clean old commits, or somehow merge
| them into snapshots of fixed timespans (like, anything older
| than a year gets compressed into monthly changesets).
| howinteresting wrote:
| This is not a "problem", this is why source control exists.
| You should _never_ rewrite published history.
| torstenvl wrote:
| I don't think I can agree with that.
|
| Accidentally publish secrets/credentials? Rotate them yes
| but also remove them from the published history.
|
| Accidentally publish a binary for a build tool without the
| proper license? Definitely remove it (and add it to your
| .gitignore so it doesn't happen again!)
|
| You discover a major flaw that bricks certain systems or
| causes data loss? Retroactively replace the
| Makefile/configure script/whatever to print out a warning
| instead of building the bad build.
|
| I'm sure there are others.
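|
| For the credentials case specifically, a hedged sketch using
| git-filter-repo (a separate tool, not built into git; the file path
| and remote URL are illustrative):
|       # rewrite history to drop the leaked file from every commit
|       git filter-repo --invert-paths --path config/secrets.env
|       # filter-repo removes remotes as a safety measure; re-add the
|       # remote, then force-push the rewritten history
|       git remote add origin git@github.com:example/repo.git
|       git push --force origin main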
| cabirum wrote:
| The value of a commit approaches zero as it gets older. After
| a certain threshold, no one will ever see it. Never say
| never, but is there any reason why we should keep deadweight around?
| themerone wrote:
| As long as a line of code is in use, there is value in
| knowing by whom and when it was authored.
|
| If a 10-year-old vulnerability is found in OpenSSL, it would be
| nice to be able to investigate whether it was an accident or
| an act of espionage.
| howinteresting wrote:
| Your premise is incorrect. The other day I was looking
| around a repository that's been through many migrations,
| and found a commit from 2004 that was relevant to my
| interests.
| tonnydourado wrote:
| A more conservative approach would be some sort of layered
| storage/archiving, I guess. The older the commit, the less
| likely it is to be used, so it could be archived in
| different storage, optimized for the long term. This way you
| keep the "hot" history small, while keeping the full
| history still available.
| howinteresting wrote:
| That's generally how git packs are used at large
| organizations that host their own repositories. I'm sure
| GitHub does something similar.
| crznp wrote:
| I don't think that your points are actually in conflict.
|
| If this is my source code, I want the whole history. I want
| that 10-year old commit that isn't used in any current
| branch. A build machine may not need any history: it just
| wants to check out a particular branch as it is right now,
| and that works too.
|
| But there is an intermediate case: Let's say that I have an
| issue with a dependency. I might check out that code and
| want some history to know what has changed recently, but I
| don't need that huge zip file that was accidentally checked
| in and then removed 4 years ago. If it were a consistent
| problem, perhaps you'd invent some sort of 'shallow' or
| 'partial' clone, something like this:
|
| https://github.blog/2020-12-21-get-up-to-speed-with-partial-...
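|
| Both forms exist in stock git (the LLVM repo is only used here as an
| example URL):
|       # shallow clone: only the latest commit, no history
|       git clone --depth=1 https://github.com/llvm/llvm-project
|       # partial (blobless) clone: full history, file contents fetched on demand
|       git clone --filter=blob:none https://github.com/llvm/llvm-project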
| howinteresting wrote:
| True, though shallow clones have performance issues:
| https://github.com/Homebrew/discussions/discussions/225
| mrguyorama wrote:
| Git allows you to rewrite history, so you can "crush" old
| commits to reduce the size of history as needed.
| cabirum wrote:
| Sure, but rewriting it manually is a tedious process. It
| should be automated on GitHub's side, to keep repo size
| approximately constant over time.
| Arnavion wrote:
| That would be silent data loss, so absolutely should not
| be automated.
| blueflow wrote:
| GitHub can't rewrite the refs on its own without
| breaking users' stuff. It can only repack the existing
| objects; the squashing needs to be done by the
| developers. Also, it's a non-fast-forward thing, so it
| needs to be coordinated between the git users anyway.
| saghm wrote:
| I think this is what `git filter-branch` is supposed to
| be for: https://git-scm.com/docs/git-filter-branch
|
| I've never used it before, but from what I understand,
| it's very powerful but also very confusing and easy to
| mess up, and of course with a sort of vague ambiguous
| name that makes it hard to discover; in other words, it's
| quintessentially git.
| jacobr1 wrote:
| It would be great to have some kind of way to do this while
| still maintaining the merkle-tree.
| jonhohle wrote:
| Isn't that what packs are for? The raw, content
| addressable object store has no inherent optimization for
| reducing repo size. Any changed file is completely copied
| until a higher level does something to compress that
| down.
| tsimionescu wrote:
| I don't understand where git's distributed nature comes into
| play here. Perforce is fully centralized, and it also doesn't
| have any limits to repo size - as long as you have enough disk,
| P4 will handle it.
|
| In fact, Git's distributed nature actually makes it infamously
| bad at scaling with repo size - since it requires every
| "participant" in a repo to have a copy of the entire repo
| before they can start any work on it.
| rvbissell wrote:
| Shallow cloning of Git repos is a thing. Basically, you get a
| fake head commit (that includes all the files) and none of
| the real history. Useful if you only intend to build, or make
| changes locally. If you want to push, you have to unshallow
| first.
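|
| (The unshallow step, if you do need it, is a single fetch:
|       git fetch --unshallow
| which pulls down the rest of the history from the remote.)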
| someguy101010 wrote:
| would be cool to be able to commit without unshallowing
| first
| phil-m wrote:
| Not possible, due to the way git works: it's a Merkle
| tree of commits, where each commit points to a
| file tree (content-addressed by its hash) and the
| previous commit.
| dblitt wrote:
| This is where something like VFSForGit [0] helps out. Instead
| of cloning the entire repo, it creates a virtual file system
| and fetches objects on demand. MSFT uses it internally for
| the Windows source tree (which now exceeds 300GB).
|
| [0]: https://github.com/microsoft/VFSForGit
| btown wrote:
| VFSForGit seems to be in maintenance mode now - do you know
| what it was replaced by, and if there's something that
| works cross-platform?
| AaronFriel wrote:
| It was replaced by Scalar and is now merged into Git:
|
| Introduced:
| https://devblogs.microsoft.com/devops/introducing-scalar/
|
| Integration into Git: https://github.blog/2022-10-13-the-story-of-scalar/
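|
| With a recent git release, the Scalar functionality is available as
| a bundled command (the URL is illustrative):
|       scalar clone https://github.com/llvm/llvm-project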
| Arnt wrote:
| I believe the problem here is related to LLVM using a CI model,
| where all changes eventually go to a single tree and tests are
| run on that single tree. Hosting that CI on a site other
| than GitHub wouldn't really change anything. Maybe it would change the
| limit, but all such CI hosts have a limit, for the same reason
| that all the search engines have a size limit for web pages
| they scrape.
| CoastalCoder wrote:
| I'm not sure what the current status of this issue is, based on
| the linked Discourse thread.
|
| Is it still a fundamental limitation, but with a known client-
| side workaround?
| [deleted]
___________________________________________________________________
(page generated 2023-01-30 23:01 UTC)