[HN Gopher] Avoid Git LFS if possible
       ___________________________________________________________________
        
       Avoid Git LFS if possible
        
       Author : reimbar
       Score  : 50 points
       Date   : 2021-05-12 20:46 UTC (2 hours ago)
        
 (HTM) web link (gregoryszorc.com)
 (TXT) w3m dump (gregoryszorc.com)
        
       | SavantIdiot wrote:
        | Yep. All of this. I tried using Git LFS for a project and
        | reverted to links to a cloud server for the large binary
        | blobs, plus hashes of those blobs.
        
       | robmsmt wrote:
        | Raising GitHub's 100 MB file limit has to be the most
        | requested feature. It's ridiculous that we have to use the
        | fudge that is Git LFS.
        | 
        | It just adds complication for a limit that shouldn't be there
        | anyway.
        
         | viraptor wrote:
          | You can use GitLab instead, which has a limit of a few GB.
        
       | korijn wrote:
        | I'm honestly super content with LFS. We wrote our own little
        | API server to hook it up to Azure Blob Storage and never have
        | issues with it. I don't recognize the problems mentioned in
        | the article at all. Our whole team has relied on it for
        | years, and it delivers. No problems. Keep up the great work,
        | git-lfs maintainers! Much love.
        
       | alkonaut wrote:
        | I'm using Git+LFS because my issue tracker, CI/CD, etc.
        | natively speak it, not because it's in any way superior or
        | even on par with the large-file handling of Mercurial (or
        | even SVN, to be honest).
        
       | goodcjw2 wrote:
        | A side topic: is there a concrete reason why GitHub's LFS
        | solution has to be so expensive?
        | 
        | IIRC, it's $5 per 50GB per month? That's a real deal breaker
        | for me, and I wonder whether people who actually use LFS at
        | volume avoid LFS-over-GitHub.
        
         | adkadskhj wrote:
          | Yeah, I actually wrote my own file-chunking, git-lfs-like
          | backend for this exact reason. I liked Git LFS, but
          | GitHub's pricing felt insane for my indie dev work. For my
          | needs I could back up onto a local server, network drive,
          | or whatever at a far lower price.
          | 
          | Hell, even uploading to an S3-compatible API was vastly
          | cheaper than GitHub.
         | 
          | That, and I really hated the feeling that Git LFS was
          | designed around a server architecture. I didn't have an
          | easy way to locally dump the written files without running
          | an HTTP server.
          | 
          | There are a couple of Git LFS servers that upload to, say,
          | S3 - but I really just wanted a dumb FS or SSH dump of my
          | LFS files. Running a localhost server feels so... anti-Git
          | to me.
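          | 
          | (For what it's worth, git-lfs does have a custom transfer
          | agent hook that can bypass the HTTP API - configured
          | roughly like this, though you have to supply the agent
          | binary yourself, and the path here is hypothetical:
          | 
          |     git config lfs.standalonetransferagent fsdump
          |     git config lfs.customtransfer.fsdump.path \
          |       /usr/local/bin/lfs-fs-agent
          | )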
        
         | tux1968 wrote:
          | $60/year for a decent fraction of a hard disk and the
          | associated backup resources seems pretty fair to me. What
          | price would you expect?
        
           | moshmosh wrote:
            | It's a very large markup on the small-user _retail_ cost
            | of the basic thing they're providing (web-accessible,
            | access-controlled file storage -- see, for example,
            | Backblaze B2), but that's utterly typical of services
            | that can get away with charging a "convenience fee" once
            | you're on their SaaS. A 2-3x markup isn't unusual, and
            | that's about what this is relative to typical retail --
            | and even if GH isn't managing the storage themselves,
            | they're likely getting an even better (bulk) rate.
        
         | toomuchtodo wrote:
          | Consider S3 storage and egress costs. You're paying a flat
          | rate to store and then pull that 50GB of data (edit:
          | removed an incorrect statement here).
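          | 
          | As a rough 2021 list-price ballpark (numbers from memory,
          | so treat them as approximate):
          | 
          |     S3 storage: 50 GB x $0.023/GB-mo ~= $1.15/mo
          |     S3 egress:  50 GB x $0.09/GB     ~= $4.50/mo
          |                                total ~= $5.65/mo
          | 
          | which is in the same range as GitHub's $5 per 50 GB data
          | pack.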
        
           | Someone1234 wrote:
           | Just for clarity, it is 50 GB of storage _and_ 50 GB of
           | bandwidth.
           | 
           | So definitely not "as much as you want." If you pull it too
           | many times you may get charged another $5.
        
             | toomuchtodo wrote:
             | Thanks for the correction!
        
       | Game_Ender wrote:
        | The latest version of git has a feature called "partial
        | clones" that is very similar to what the author describes for
        | Mercurial. All the data is still in your history and no extra
        | tools are needed, but you only fetch blobs from the server
        | for the commits you check out. So, just as with LFS, large
        | blobs not on master are effectively free, but you still grab
        | all the blobs for your current commit.
       | 
        | You need server-side support, which GitHub and GitLab have,
        | and then a special clone command:
        | 
        |     git clone --filter=blob:none
       | 
       | Some background about the feature is here:
       | https://github.blog/2020-12-21-get-up-to-speed-with-partial-...
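        | 
        | A minimal end-to-end sketch (the URL and tag are
        | placeholders):
        | 
        |     # full history, but blobs are fetched lazily
        |     git clone --filter=blob:none https://example.com/big.git
        |     cd big
        |     # checking out an older commit downloads just the blobs
        |     # that commit needs
        |     git checkout v1.0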
        
       | Someone1234 wrote:
        | All three points are really just the same point repeated
        | three times: that it isn't part of core/official Git ("stop
        | gap" until official, irreversible relative to a later
        | official solution, and adds complexity that an official
        | version would lack, due to extra/third-party tooling).
        | 
        | I'm frankly surprised Git hasn't made LFS an official part
        | by now. It fixes the problem, the problem is common and
        | real, and Git hasn't offered a better alternative.
        | 
        | If LFS were made official it would resolve this critique,
        | since that is really the only critique here.
        
       | asymptosis wrote:
        | Something missing from the list of problems: Git LFS is an
        | HTTP(S)-based protocol, so it is problematic at best when you
        | are using Git over SSH[1].
        | 
        | The git-lfs devs obviously don't use SSH, so you get the
        | feeling they are a bit exasperated by the call to support an
        | industry-standard protocol that is widely used as part of
        | ecosystems and workflows involving Git.
       | 
       | [1] https://github.com/git-lfs/git-lfs/issues/1044
        
         | alkonaut wrote:
          | That issue has come a long way though - there's already a
          | draft PR! This seems like it could actually happen.
        
       | breck wrote:
        | My practice for storing large files with Git is to commit
        | only the metadata for each large file, in one or more tiny
        | files:
       | 
       | 1. Type information. Enough to synthesize a fake example.
       | 
       | 2. A simple preview. This can be a thumb or video snippet, for
       | example.
       | 
       | 3. Checksum and URL of the big file.
       | 
       | This way your code can work at compile/test time using the
       | snippet or synthesized data, and you can fetch the actual big
       | data at ship time.
       | 
       | You can then also use the best version control tool for the job
       | for the particular big files in question.
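        | 
        | As a sketch, the tiny file and the ship-time fetch might
        | look like this (format, file names, and URL are all made
        | up):
        | 
        |     # model-v3.meta
        |     url=https://example.com/data/model-v3.bin
        |     sha256=<checksum of model-v3.bin>
        | 
        |     # fetch at ship time, then verify the checksum
        |     url=$(sed -n 's/^url=//p' model-v3.meta)
        |     sum=$(sed -n 's/^sha256=//p' model-v3.meta)
        |     curl -fSL -o model-v3.bin "$url"
        |     echo "$sum  model-v3.bin" | sha256sum -c -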
        
         | dmm wrote:
         | Tools like git-annex or dvc support similar strategies.
        
         | CobrastanJorji wrote:
         | Is this just a manual equivalent of git LFS, or is there some
         | advantage here?
        
       | justaguy88 wrote:
       | > Git LFS is a Stop Gap Solution
       | 
        | Build the real thing, then...
        
       | TeeMassive wrote:
        | The reasons the author provides are, in my opinion, weak
        | compared to both of his alternatives.
        | 
        | Sure, LFS contaminates a repository, but so do large files,
        | sensitive-data removal, and references to packages and
        | package managers that might become obsolete or nonexistent
        | in the future. The chances of your project compiling after
        | 15 years (the age of git, by the way) are very slim, and the
        | chance that an entirely compilable history will be useful is
        | even slimmer.
        | 
        | And I think the author's claim that setting up LFS is hard
        | is exaggerated. It's a handful of commands that should be in
        | the "welcome to our company" manual anyway.
        | 
        | I've used LFS in the past, and while it can be misused, as
        | with all other tools, it does the job without too many
        | headaches compared to submodules and ignored tracked files.
        
       | hpcjoe wrote:
        | Just this past week, git lfs was throwing smudge errors for
        | me. Not really sure what the issue was; I followed the
        | recommendations to disable, pull, and re-enable. And got them
        | again. So I disabled. And left it disabled.
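        | 
        | (By "disable, pull, and re-enable" I mean roughly this
        | sequence; your exact incantation may differ:
        | 
        |     git lfs uninstall
        |     git pull
        |     git lfs install
        |     git lfs pull
        | )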
       | 
       | Not a solution.
       | 
        | That said, the whole git-lfs bit feels like a (bad)
        | afterthought the way it's implemented. I'd love to see a
        | significant reduction in complexity (you shouldn't need to
        | run 'git lfs install'; it should happen automatically) and
        | an increase in resiliency (sharding into FEC'ed blocks with
        | distributed checksums, etc.) so we don't have to deal with
        | 'bad' files.
       | 
       | I was a fan of mercurial before I switched to git ... it was IMO
       | an easier/better system at the time (early 2010s). Not likely to
       | switch now though.
        
       | temac wrote:
        | If you just don't jump on random tech without good reasons,
        | you already naturally apply this advice - especially since,
        | once you _really_ need it and also want Git, there is not
        | much alternative (as the author recognizes). In this
        | context, just waiting for potential "better support for
        | handling of large files" in official Git makes little sense;
        | plus, I make the wild prediction that what will actually
        | happen is that Git LFS will (continue to) be improved and
        | used by most people (and maybe even be integrated into
        | "official Git"?).
        
       | madjam002 wrote:
        | As much as Git LFS is a bit of a pain, on recent projects
        | I've resorted to committing my node_modules to Git with Yarn
        | 2 using LFS, and it works really well.
        | 
        | Note that with Yarn 2 you're committing .tar.gz's of packages
        | rather than the JS files themselves, so it lends itself quite
        | well to LFS, as there is a smaller number of larger files.
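        | 
        | The LFS side is just a .gitattributes pattern over the cache
        | directory - something like this, adjusted to your Yarn cache
        | layout:
        | 
        |     .yarn/cache/** filter=lfs diff=lfs merge=lfs -text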
       | 
       | https://yarnpkg.com/features/zero-installs#how-do-you-reach-...
       | https://yarnpkg.com/features/zero-installs#is-it-different-f...
        
       | slaymaker1907 wrote:
        | Is rewriting the history of a large repo really that
        | difficult, aside from coordinating with other contributors?
        | My understanding is that it shouldn't be much worse than "git
        | gc --aggressive". Yes, it is expensive, but it is the sort of
        | thing you can schedule to run overnight or on a weekend.
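        | 
        | Mechanically it's simple enough, e.g. with git-filter-repo
        | (note this rewrites every affected commit, so hashes change
        | from that point on):
        | 
        |     # drop all blobs larger than 10MB from history
        |     git filter-repo --strip-blobs-bigger-than 10M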
        
         | alkonaut wrote:
          | The problem I see is that things like commit hashes, which
          | are etched into history in bug reports, version tags,
          | etc., instantly lose meaning. Whether or not that's a
          | problem depends on how much of that you have.
        
         | sjansen wrote:
         | The issue is breaking external references.
         | 
         | Do you include git SHAs in your bug tracking system? Or perhaps
         | your department wiki links to a specific commit to document
         | lessons learned? Maybe you're using Sentry and find including
         | the git SHA of the build to be invaluable for troubleshooting?
         | 
         | For some organizations, rewriting history would be a non-event
         | and for others it would be a major disruption.
        
       | wbillingsley wrote:
       | The solution I've tended to use in classes (where there'll always
       | be some student who hasn't installed LFS) is to store the large
       | files in Artifactory, so they are pulled in at build-time in the
       | same way as libraries.
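        | 
        | (The build-time pull is then essentially just a versioned
        | fetch; the repository URL and layout below are invented for
        | illustration:
        | 
        |     curl -fSL -o assets/data.bin \
        |       https://artifactory.example.com/artifactory/assets/data/1.2.0/data.bin
        | )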
       | 
        | This seemed to me a sensible approach, as Artifactory is a
        | repository for binaries (usually the compiled output of a
        | project). It also seemed to me that deciding which versions
        | to retain, when an update to a binary is expected, and when
        | a resource is frozen (so that a replacement would be a new
        | version) is similar to deciding when a build is a snapshot
        | vs. a release.
        
       | CreepGin wrote:
        | I've been using Git LFS with several large Unity projects
        | over the past several years. Never really had any problems.
        | It was always just an "enable and forget" kind of thing.
        
       | dwohnitmok wrote:
        | git-annex is an interesting alternative if the HTTP-first
        | nature of Git LFS and the one-way door bother you.
       | 
       | You can remove it after the fact if you don't like it, it
       | supports a ton of protocols, and it's distributed just like git
       | is (you can share the files managed by git-annex among different
       | repos or even among different non-git backends such as S3).
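        | 
        | A minimal sketch (the bucket name is a placeholder):
        | 
        |     git annex init
        |     git annex add bigfile.bin
        |     git commit -m "track bigfile with git-annex"
        |     # mirror annexed content to a non-git backend
        |     git annex initremote mys3 type=S3 encryption=none \
        |       bucket=my-annex-bucket
        |     git annex copy --to=mys3 bigfile.bin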
       | 
        | The main issue that git-annex does _not_ solve is that, like
        | Git LFS, it's not a part of git proper and it shows in its
        | occasionally clunky integration. By virtue of having more
        | knobs and dials it also potentially has more to learn than
        | Git LFS.
        
       | dheera wrote:
        | Okay, so I should avoid it. What is the alternative?
        | 
        | I see so many git repos with READMEs saying to download some
        | huge pretrained-weights file from {Dropbox link, Google
        | Drive link, Baidu link, ...}, and I don't think that's a
        | very good user experience compared to LFS.
        | 
        | LFS itself sucks - it should work transparently, without
        | having to install anything - but it's slightly better than
        | downloading stuff from Dropbox or Google Drive.
        
         | rkangel wrote:
         | Some combination of the following two features:
         | 
         | Partial clones
         | (https://docs.gitlab.com/ee/topics/git/partial_clone.html)
         | 
         | Shallow clones (see the --depth argument:
         | https://linux.die.net/man/1/git-clone)
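          | 
          | For instance, a shallow clone that skips all history
          | (placeholder URL):
          | 
          |     git clone --depth 1 https://example.com/big-repo.git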
         | 
          | The problem with large files is not so much that putting a
          | 1GB file in Git is a problem. If you just have one
          | revision of it, you get a 1GB repo, and things run at a
          | reasonable speed. The problem is when you have 10
          | revisions of the 1GB file and end up dealing with 10GB of
          | data when you only want one revision, because the default
          | git clone model is to give you the full history of
          | everything since the beginning of time. This is fine for
          | (compressible) text files, less fine for large binary
          | blobs.
         | 
          | Git-lfs is a hack, and it has caused me pain every time
          | I've used it, despite GitLab having good support for it.
          | Some of this is implementation detail: the command-line UI
          | has some weirdness to it, and there's no clear error if
          | someone doesn't have git-lfs installed when cloning, so
          | something in your build process down the line breaks with
          | a weird error because you've got a marker file instead of
          | the expected binary blob. Some of it is inherent, though:
          | the hardest problem is that we now can't easily mirror the
          | git repo from our internal GitLab to the client's GitLab,
          | because the config has to hold the address of the HTTP
          | server that has the blobs. We have workarounds, but
          | they're not fun.
         | 
          | The solution is to get over the "always have the whole
          | repository" thing. This is also useful for massive
          | monorepos, because you can clone and check out just the
          | subfolder you need and not all of everything.
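          | 
          | For example, combining a blobless partial clone with a
          | sparse checkout (URL and path are placeholders):
          | 
          |     git clone --filter=blob:none --sparse \
          |       https://example.com/monorepo.git
          |     cd monorepo
          |     git sparse-checkout set services/my-service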
         | 
          | I say this, but I haven't yet used partial clones in anger
          | (unlike git-lfs). I have high hopes, though; it's a
          | feature still in its early days.
        
           | snovv_crash wrote:
           | I found using git-lfs only in a subrepo worked well, since
           | subrepos by default are checked out shallow.
        
         | nerdponx wrote:
         | DVC [0] is great for data science applications, but I don't see
         | why you couldn't use it as a general-purpose LFS replacement.
         | 
         | It doesn't fix all of the problems with LFS, but it helps a lot
         | with some of them (and happens to also be a decent Make
         | replacement in certain situations).
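          | 
          | A minimal sketch (the remote URL is a placeholder):
          | 
          |     dvc init
          |     dvc remote add -d myremote s3://my-bucket/dvc-store
          |     dvc add data/weights.bin  # writes data/weights.bin.dvc
          |     git add data/weights.bin.dvc data/.gitignore
          |     git commit -m "track weights with DVC"
          |     dvc push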
         | 
         | [0]: https://dvc.org/
        
         | alkonaut wrote:
          | If you are like most people, you use systems that speak
          | git (from Microsoft, JetBrains, GitHub, Atlassian...) but
          | rarely, or less fluently, anything else. So the problem
          | I'm trying to solve isn't "which VCS lets me work well
          | with large files" but rather "I'm stuck with git, so what
          | do I do with my large files?"
          | 
          | Your options are basically Git LFS, possibly VFSForGit, or
          | putting your large files in separate storage.
        
         | jayd16 wrote:
        | According to the article, you should use Mercurial or
        | PlasticSCM, because otherwise you might have to rewrite your
        | history to get to some hypothetical git solution that isn't
        | even on the roadmap.
        | 
        | I think I'll stick with LFS.
        
       ___________________________________________________________________
       (page generated 2021-05-12 23:00 UTC)