[HN Gopher] Avoid Git LFS if possible
___________________________________________________________________
Avoid Git LFS if possible
Author : reimbar
Score : 50 points
Date : 2021-05-12 20:46 UTC (2 hours ago)
(HTM) web link (gregoryszorc.com)
(TXT) w3m dump (gregoryszorc.com)
| SavantIdiot wrote:
| Yep. All of this. I tried using Git LFS for a project and
| reverted to links to a cloud server for the large binary blobs,
| plus hashes of those blobs.
| robmsmt wrote:
| Pushing GitHub past the 100 MB limit has to be the most requested
| feature. Ridiculous that we have to use the fudge that is Git
| LFS.
|
| It just adds complication for a limit that shouldn't be there
| anyway.
| viraptor wrote:
| You can use GitLab instead, which has a limit of a few GB.
| korijn wrote:
| I'm honestly super content with LFS. Wrote our own little API
| server to hook it up to Azure Blob Storage, and we never have
| issues with it. I don't recognize the issues mentioned in the
| article at all. Our whole team has relied on it for years, and it
| delivers. No problems. Keep up the great work, git-lfs
| maintainers! Much love.
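|
| (For context, a custom server like this mostly just has to answer
| the LFS "batch" API and hand back storage URLs. A rough sketch of
| the exchange -- host, OID, and sizes all hypothetical:
|          $ curl -X POST https://lfs.example.com/repo.git/info/lfs/objects/batch \
|              -H 'Accept: application/vnd.git-lfs+json' \
|              -H 'Content-Type: application/vnd.git-lfs+json' \
|              -d '{"operation": "download", "transfers": ["basic"],
|                   "objects": [{"oid": "abc123...", "size": 1048576}]}'
| The server replies with per-object "actions" containing download
| or upload hrefs, which in our case can point at Azure Blob
| Storage.)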
| alkonaut wrote:
| I'm using Git+LFS because my issue tracker, CI/CD etc natively
| speaks it. Not because it's in any way superior or even on par
| with the large file handling of Mercurial (or even SVN to be
| honest).
| goodcjw2 wrote:
| A side topic: is there a concrete reason why GitHub's LFS
| solution has to be so expensive?
|
| IIRC, it's $5 per 50 GB per month? That's really a deal breaker
| for me, and I wonder whether people who actually use LFS at
| volume avoid LFS-over-GitHub.
| adkadskhj wrote:
| Yea i actually wrote my own file chunking and general git-lfs-
| like backend for this exact reason. I liked Git LFS, but
| GitHub's pricing felt insane for my indie dev budget. For my
| needs i could back up onto a local server, network drive, or
| w/e at an insanely cheaper price.
|
| Hell, even uploading to an S3-compatible API was insanely
| cheaper than GitHub.
|
| That and i really hated the feeling that Git LFS was being
| designed for a server architecture. I didn't have an easy way
| to locally dump the written files without running an HTTP
| server.
|
| There are a couple Git LFS servers that upload to, say, S3 -
| but i really just wanted a dumb FS or SSH dump of my LFS files.
| Running a localhost server feels so... anti-Git to me.
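|
| (For what it's worth, git-lfs does support custom transfer
| agents via config, which can avoid the HTTP server -- a minimal
| sketch, assuming a hypothetical agent binary folder-agent that
| reads and writes a plain directory:
|          $ git config lfs.customtransfer.folder.path folder-agent
|          $ git config lfs.customtransfer.folder.args "/mnt/lfs-store"
|          $ git config lfs.standalonetransferagent folder
| though you still have to write or find the agent itself.)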
| tux1968 wrote:
| $60 / year for a decent fraction of a hard disk and the
| associated backup resources seems pretty fair to me. What
| price would you expect?
| moshmosh wrote:
| It's a very large markup on the small-user _retail_ cost of
| the basic thing they're providing (web-accessible, access-
| controlled file storage--see, for example, Backblaze B2), but
| that's utterly typical of services that can get away with
| charging you a "convenience fee" for that sort of thing once
| you're on their SaaS. A 2-3x markup isn't unusual, and that's
| about what this is, and that's above typical retail--even if
| GH's not managing the storage and such themselves, they're
| likely getting an even better (bulk) rate.
| toomuchtodo wrote:
| Consider S3 storage and egress costs. You're paying a flat rate
| to store and then pull that 50 GB of data (edit: removed an
| incorrect statement here).
| Someone1234 wrote:
| Just for clarity, it is 50 GB of storage _and_ 50 GB of
| bandwidth.
|
| So definitely not "as much as you want." If you pull it too
| many times you may get charged another $5.
| toomuchtodo wrote:
| Thanks for the correction!
| Game_Ender wrote:
| The latest version of git has a feature called "partial clones"
| that is very similar to what the author describes for Mercurial.
| All the data is still in your history and no extra tools are
| needed, but you only fetch the blobs from the server for the
| commits you check out. So, just like with LFS, large blobs not on
| master are effectively free, but you still grab all the blobs for
| your current commit.
|
| You need server-side support, which GitHub and GitLab have, and
| then a special clone command:
|          git clone --filter=blob:none
|
| Some background about the feature is here:
| https://github.blog/2020-12-21-get-up-to-speed-with-partial-...
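|
| A minimal usage sketch (repo URL hypothetical); blobs are
| fetched lazily as you check things out:
|          $ git clone --filter=blob:none https://example.com/big-repo.git
|          $ cd big-repo
|          $ git checkout some-branch   # fetches only the blobs this commit needs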
| Someone1234 wrote:
| All three points are really just the same point repeated three
| times: that it isn't part of core/official Git ("stop gap" until
| official, irreversible relative to a later official solution, and
| adds complexity that an official version would lack due to
| extra/third-party tooling).
|
| I'm frankly surprised Git hasn't made LFS an official part by
| now. It fixes the problem, the problem is common and real, and
| Git hasn't offered a better alternative.
|
| If LFS were made official, it would resolve this critique, since
| that is really the only critique here.
| asymptosis wrote:
| Something missing from the list of problems: Git LFS is an
| HTTP(S) protocol, so it is problematic at best when you are using
| Git over SSH[1].
|
| The git-lfs devs obviously don't use SSH, so you get the feeling
| they are a bit exasperated by this call to support an industry-
| standard protocol which is widely used as part of ecosystems and
| workflows involving Git.
|
| [1] https://github.com/git-lfs/git-lfs/issues/1044
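|
| (As I understand it, even with an SSH remote, git-lfs only uses
| SSH as a pre-flight to obtain an HTTP endpoint and auth token,
| roughly -- host and repo hypothetical:
|          $ ssh git@example.com git-lfs-authenticate user/repo.git download
|          {"href": "https://example.com/user/repo.git/info/lfs", ...}
| and the actual object transfer still happens over HTTP(S).)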
| alkonaut wrote:
| That issue has come a long way though; there's already a draft
| PR! This seems like it could actually happen.
| breck wrote:
| My practice for storing large files with Git is to include the
| metadata for the large file in a tiny file (or files):
|
| 1. Type information. Enough to synthesize a fake example.
|
| 2. A simple preview. This can be a thumb or video snippet, for
| example.
|
| 3. Checksum and URL of the big file.
|
| This way your code can work at compile/test time using the
| snippet or synthesized data, and you can fetch the actual big
| data at ship time.
|
| You can then also use the best version control tool for the job
| for the particular big files in question.
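|
| Concretely, a pointer file of this kind might look like the
| following (all names, the hash, and the URL hypothetical), and
| the ship-time fetch/verify is a couple of commands:
|          $ cat data/model.meta.json
|          {"type": "float32 tensor [1024 x 768]",
|           "preview": "data/model.thumb.png",
|           "sha256": "9f2c...",
|           "url": "https://blobs.example.com/model.bin"}
|          $ curl -Lo data/model.bin "$(jq -r .url data/model.meta.json)"
|          $ echo "$(jq -r .sha256 data/model.meta.json)  data/model.bin" | sha256sum -c -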
| dmm wrote:
| Tools like git-annex or dvc support similar strategies.
| CobrastanJorji wrote:
| Is this just a manual equivalent of git LFS, or is there some
| advantage here?
| justaguy88 wrote:
| > Git LFS is a Stop Gap Solution
|
| Build the real thing then...
| TeeMassive wrote:
| The reasons the author provides are, in my opinion, weak compared
| to both of his alternatives.
|
| Sure, LFS contaminates a repository, but so do large files,
| sensitive-data removal, and references to packages and package
| managers that might become obsolete or nonexistent in the future.
| The chances of your project compiling after 15 years (the age of
| git, by the way) are very slim, and the chance that an entirely
| compilable history is useful even slimmer.
|
| And I think the author's statement about setting up LFS being
| hard is exaggerated. It's a handful of commands that should be in
| the "welcome to our company" manual anyway.
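|
| For reference, the usual handful is something like this (file
| pattern hypothetical):
|          $ git lfs install
|          $ git lfs track "*.psd"
|          $ git add .gitattributes
|          $ git commit -m "Track PSDs with LFS"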
|
| I've used LFS in the past, and while it can be misused, as with
| all other tools, it does the job without too many headaches
| compared to submodules and ignored tracked files.
| hpcjoe wrote:
| Just this past week, git lfs was throwing smudge errors for me.
| Not really sure what the issue was. I followed the
| recommendations to disable, pull, and re-enable, and got them
| again. So I disabled. And left it disabled.
|
| Not a solution.
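|
| (For reference, the disable/pull/re-enable dance was roughly:
|          $ git lfs uninstall
|          $ git pull
|          $ git lfs install
|          $ git lfs pull
| which, as noted, didn't fix it.)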
|
| This said, the whole git-lfs bit feels like a (bad) afterthought
| the way it's implemented. I'd love to see some significant
| reduction of complexity (you shouldn't need to run 'git lfs
| install'; it should be done automatically), and increases in
| resiliency (sharding into FEC'ed blocks with distributed
| checksums, etc.) so we don't have to deal with 'bad' files.
|
| I was a fan of mercurial before I switched to git ... it was IMO
| an easier/better system at the time (early 2010s). Not likely to
| switch now though.
| temac wrote:
| If you just don't jump on random tech without good reasons, you
| already naturally apply this advice. Especially since, once you
| _really_ need it and also want Git, there is not much alternative
| (as the author recognizes). In this context, just waiting for
| potential "better support for handling of large files" in
| official Git makes little sense; plus, I'll make the wild
| prediction that what will actually happen is that Git LFS will
| (continue to) be improved and used by most people (and maybe even
| be integrated into "official Git"?).
| madjam002 wrote:
| As much as Git LFS is a bit of a pain, on recent projects I've
| resorted to committing my node_modules (via Yarn 2) to Git using
| LFS, and it works really well.
|
| Note that with Yarn 2 you're committing .tar.gz's of packages
| rather than the JS files themselves, so it lends itself quite
| well to LFS as there are a smaller number of large files.
|
| https://yarnpkg.com/features/zero-installs#how-do-you-reach-...
| https://yarnpkg.com/features/zero-installs#is-it-different-f...
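|
| For reference, tracking the Yarn cache with LFS is a couple of
| commands (assuming the default .yarn/cache location):
|          $ git lfs track ".yarn/cache/**"
|          $ git add .gitattributes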
| slaymaker1907 wrote:
| Is rewriting the history for large repos really that difficult
| besides coordinating with other contributors? My understanding is
| that it shouldn't be that much worse than "git gc --aggressive".
| Yes, it is expensive, but it is the sort of thing you can
| schedule overnight or on a weekend.
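|
| (With the third-party git-filter-repo tool it's roughly one
| command, e.g. to drop all blobs over 10 MB from history:
|          $ git filter-repo --strip-blobs-bigger-than 10M
| though every contributor then has to re-clone, which is the
| coordination cost mentioned above.)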
| alkonaut wrote:
| The problem I see is that things like commit hashes, which are
| etched into history in bug reports, version tags, etc., instantly
| lose meaning. Whether or not that's a problem depends on how much
| of that you have.
| sjansen wrote:
| The issue is breaking external references.
|
| Do you include git SHAs in your bug tracking system? Or perhaps
| your department wiki links to a specific commit to document
| lessons learned? Maybe you're using Sentry and find including
| the git SHA of the build to be invaluable for troubleshooting?
|
| For some organizations, rewriting history would be a non-event
| and for others it would be a major disruption.
| wbillingsley wrote:
| The solution I've tended to use in classes (where there'll always
| be some student who hasn't installed LFS) is to store the large
| files in Artifactory, so they are pulled in at build-time in the
| same way as libraries.
|
| This seemed to me a sensible approach, as Artifactory is a
| repository for binaries (usually, the compiled output of a
| project). It also seemed to me that deciding which versions to
| retain, when an update to a binary is expected, and when a
| resource is frozen so that a replacement would be a new version,
| is similar to deciding when a build is a snapshot vs. a release.
| CreepGin wrote:
| I've been using Git LFS with several large Unity projects over
| the past several years. Never really had any problems. It was
| always just an "enable and forget" kind of thing.
| dwohnitmok wrote:
| git-annex is an interesting alternative if the HTTP-first nature
| of Git LFS and the one-way door bother you.
|
| You can remove it after the fact if you don't like it; it
| supports a ton of protocols, and it's distributed just like git
| is (you can share the files managed by git-annex among different
| repos or even among different non-git backends such as S3).
|
| The main issue that git-annex does _not_ solve is that, like Git
| LFS, it's not a part of git proper, and it shows in its
| occasionally clunky integration. By virtue of having more knobs
| and dials, it also potentially has more to learn than Git LFS.
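|
| A minimal sketch of the workflow (remote name, bucket, and paths
| hypothetical):
|          $ git annex init
|          $ git annex add assets/big.bin
|          $ git commit -m "Add big file via annex"
|          $ git annex initremote mys3 type=S3 bucket=my-bucket encryption=none
|          $ git annex copy --to=mys3 assets/big.bin
|          $ git annex get assets/big.bin    # in another clone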
| dheera wrote:
| Okay, so I should avoid it. What is the alternative?
|
| I see so many git repos with READMEs saying download this huge
| pretrained weights file from {Dropbox link, Google drive link,
| Baidu link, ...} and I don't think that's a very good user
| experience compared to LFS.
|
| LFS itself sucks and should be transparent without having to
| install it, but it's slightly better than downloading stuff from
| Dropbox or Google Drive.
| rkangel wrote:
| Some combination of the following two features:
|
| Partial clones
| (https://docs.gitlab.com/ee/topics/git/partial_clone.html)
|
| Shallow clones (see the --depth argument:
| https://linux.die.net/man/1/git-clone)
|
| The problem with large files is not so much that putting a 1 GB
| file in Git is a problem. If you just have one revision of it,
| you get a 1 GB repo, and things run at a reasonable speed. The
| problem is when you have 10 revisions of the 1 GB file and you
| end up dealing with 10 GB of data when you only want one,
| because the default git clone model is to give you the full
| history of everything since the beginning of time. This is fine
| for (compressible) text files, less fine for large binary
| blobs.
|
| Git-lfs is a hack and it has caused me pain every time I've
| used it, despite GitLab having good support for it. Some of
| this is implementation detail: the command-line UI has some
| weirdness to it, and there's no clear error if someone who
| doesn't have git-lfs clones the repo, so something in your
| build process down the line breaks with a weird error because
| you've got a pointer file instead of the expected binary blob.
| Some of it is inherent though: the hardest problem is that we
| now can't easily mirror the git repo from our internal GitLab
| to the client's GitLab, because the config has to hold the
| address of the HTTP server with the blobs in it. We have
| workarounds but they're not fun.
|
| The solution is to get over the 'always have the whole
| repository' thing. This is also useful for massive monorepos
| because you can clone and checkout just the subfolder you need
| and not all of everything.
|
| I say this, but I haven't yet used partial clones in anger
| (unlike git-lfs). I have high hopes, though; it's a feature in
| its early days.
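|
| For what it's worth, the two features look roughly like this in
| practice (URL and subfolder path hypothetical):
|          $ git clone --depth 1 https://example.com/repo.git     # shallow
|          $ git clone --filter=blob:none --sparse https://example.com/repo.git
|          $ cd repo
|          $ git sparse-checkout set the/subfolder/you/need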
| snovv_crash wrote:
| I found using git-lfs only in a subrepo worked well, since
| subrepos by default are checked out shallow.
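|
| For reference, shallow submodules can also be requested
| explicitly at clone time (URL hypothetical):
|          $ git clone --recurse-submodules --shallow-submodules https://example.com/repo.git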
| nerdponx wrote:
| DVC [0] is great for data science applications, but I don't see
| why you couldn't use it as a general-purpose LFS replacement.
|
| It doesn't fix all of the problems with LFS, but it helps a lot
| with some of them (and happens to also be a decent Make
| replacement in certain situations).
|
| [0]: https://dvc.org/
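|
| A minimal sketch of the DVC flow (paths and bucket
| hypothetical):
|          $ dvc init
|          $ dvc add data/weights.pt
|          $ git add data/weights.pt.dvc data/.gitignore
|          $ dvc remote add -d store s3://my-bucket/dvc
|          $ dvc push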
| alkonaut wrote:
| If you are like most people, you use systems that speak Git
| (from Microsoft, JetBrains, GitHub, Atlassian...) but rarely or
| less fluently anything else, so the problem I'm trying to solve
| isn't "which VCS lets me work well with large files" but rather
| "I'm stuck with Git, so what do I do with my large files?".
|
| Your options are basically Git LFS, possibly VFSForGit, or
| putting your large files in separate storage.
| jayd16 wrote:
| According to the article you should use Mercurial or PlasticSCM,
| because otherwise you might have to rewrite your history to get
| to some hypothetical Git solution that isn't even on the
| roadmap.
|
| I think I'll stick to LFS.
___________________________________________________________________
(page generated 2021-05-12 23:00 UTC)