[HN Gopher] How we decreased GitLab repo backup times from 48 ho...
       ___________________________________________________________________
        
       How we decreased GitLab repo backup times from 48 hours to 41
       minutes
        
       Author : immortaljoe
       Score  : 290 points
       Date   : 2025-06-06 15:43 UTC (7 hours ago)
        
 (HTM) web link (about.gitlab.com)
 (TXT) w3m dump (about.gitlab.com)
        
       | judofyr wrote:
       | See also: https://www.tumblr.com/accidentallyquadratic
       | 
       | Quadratic complexity sits in an awkward sweet spot: Fast enough
       | for medium-sized _n_ to pass first QA, but doomed to fail
       | eventually as _n_ grows.
        
         | vjerancrnjak wrote:
         | This particular change was not accidental. It was announced as
         | quadratic.
        
       | jeffbee wrote:
       | Are there any reimplementations of git, by professional
       | programmers using real tools? The source in question -- object.c
       | -- is "banging rocks together" material.
        
         | landr0id wrote:
         | https://github.com/GitoxideLabs/gitoxide
        
         | umanwizard wrote:
         | Ignoring your inaccurate flamebait and answering the underlying
         | question: there are several reimplementations of git, including
         | "got" (Game of Trees) from the OpenBSD developers, jgit
         | (reimplementation in Java), the Haskell git package, the
         | gitoxide rust crate (and I assume several more half-finished
         | Rust hobby projects), and probably others; that's just what I
         | found with a cursory google.
        
         | Vilian wrote:
         | Git, but as it was intended to be used
        
         | kgeist wrote:
         | We use go-git in one of our Go projects that relies on Git for
         | version control of its data.
        
       | hiddew wrote:
       | "fixed it with an algorithmic change, reducing backup times
       | exponentially"
       | 
       | If the backup times were O(n^2), are they now O(n^2 / 2^n)? I
       | would guess not.
        
         | umanwizard wrote:
         | This is not the precise mathematical definition of exponential,
         | but rather the colloquial one, where it just means "a lot".
        
           | cvoss wrote:
           | You shouldn't use a word that can carry a precise
           | mathematical meaning in a sentence that literally uses
           | mathematical notation in order to speak precisely and then
           | expect readers not to interpret the word in the precise
           | mathematical way.
        
             | blharr wrote:
             | I somewhat agree, but for lack of a better word, what would
             | you use? Quadratically doesn't have the same punch
        
               | umanwizard wrote:
               | "Dramatically" ?
        
               | remram wrote:
               | "a lot"
        
               | jjmarr wrote:
               | "by a factor of `n`" also sounds impressive.
        
               | jrochkind1 wrote:
               | If you just mean "a lot" in a non-technical sense, there
               | are plenty of words available. enormously. immensely.
               | tremendously. remarkably. incredibly. vastly.
        
             | IshKebab wrote:
             | You should if you expect your readers to be normal humans
             | who understand obvious context, and not pedantic HN readers
             | who understand obvious context but delight in nit-picking
             | it anyway.
        
               | Dylan16807 wrote:
                | You _can_, but it's not _should_.
        
             | sneak wrote:
             | "Words mean things."
             | 
             | If you can't agree with this, then you shouldn't be
             | speaking or writing, IMO.
             | 
             | Those who argue that words that mean different things are
             | actually equivalent have no business dealing with language.
        
         | marcellus23 wrote:
         | Meaningless and non-constructive pedantry.
        
           | chrisweekly wrote:
           | I'm not the OP you're responding to, but to be fair, in a
           | sentence about big-O perf characteristics, which includes the
           | word "algorithms", using "exponentially" in a colloquial non-
           | technical sense is an absolutely terrible word choice.
        
             | linguistbreaker wrote:
             | Exponentially bad word choice even... since we're using
             | that word however we want now?
             | 
             | I don't think this is meaningless or non-constructive
             | pedantry - we're a technical community and those are
             | technical words.
        
           | msgodel wrote:
           | I disagree. Misuse of the word "exponential" is a major pet
           | peeve of mine. It's a particular case of the much more common
           | "use mathematically precise phrasing to sound
           | careful/precise" that you often find in less than honest
           | writing.
           | 
           | Here they are actually using it to refer to growth functions
           | (which is rare for this error) and being honest (which is
           | also rare IMO) but it's still wrong. They should have written
           | about quadratic or quadratic vs linear.
           | 
            | Regardless, sloppy language leads to sloppy thought.
        
             | keybored wrote:
             | Sloppy writing is up orders of magnitude lately.
        
               | 0cf8612b2e1e wrote:
               | It will decimate readership.
        
             | sneak wrote:
             | My personal pet peeve is when the term "exponentially" is
             | used to refer to a change between precisely two data
             | points.
             | 
             | It's a very specific subset of the one you're describing.
             | 
             | "Tell me you don't know what you're talking about without
             | telling me you don't know what you're talking about."
        
         | csnweb wrote:
         | If you replace an n^2 algorithm with a log(n) lookup you get an
         | exponential speed up. Although a hashmap lookup is usually
         | O(1), which is even faster.
        
           | ryao wrote:
            | That is not true unless n^C / e^n = log(n) for some constant
            | C, which it is not. Going from n^2 to log(n) divides the
            | runtime by n^2/log(n), a factor that is polynomial in n, not
            | exponential.
        
       | gre wrote:
       | looks like the commit is here:
       | https://github.com/git/git/commit/a52d459e72b890c192485002ec...
        
         | mortar wrote:
         | Thanks, this helped me find the submission and related
         | discussion thread -
         | https://lore.kernel.org/git/20250401-488-generating-bundles-...
        
       | SomaticPirate wrote:
       | How was the flame graph created? (Not very familiar with C and
       | the performance tools around it)
        
         | plorkyeran wrote:
         | https://github.com/brendangregg/FlameGraph
         | 
         | You record performance data with `perf`, then use the scripts
          | there to turn it into an SVG.
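          | 
          | The standard recipe with those scripts looks roughly like
          | this (the profiled command is just an illustrative example,
          | and the scripts are assumed to be in the current directory):
          | 
          |     perf record -F 99 -g -- <your command>
          |     perf script | ./stackcollapse-perf.pl \
          |         | ./flamegraph.pl > flame.svg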
        
           | immortaljoe wrote:
           | OP here. In this particular case, I used
           | https://github.com/flamegraph-rs/flamegraph
        
           | IshKebab wrote:
           | I strongly recommend not using this. Instead use pprof - it
           | has a MUCH better interactive flamegraph, plus other nice
           | performance visualisations (e.g. a call graph):
           | 
            | https://github.com/google/pprof
            | 
            |     go install github.com/google/pprof@latest
            |     pprof -http=: prof.out
            | 
            | I normally collect the profiles with gperftools
            | (https://github.com/gperftools/gperftools) and then just
            | 
            |     LD_PRELOAD=/usr/lib/libtcmalloc_and_profiler.so \
            |         CPUPROFILE=prof.out <your command>
           | 
           | I've been meaning to try Samply though. Not sure if it works
           | with pprof.
        
         | landr0id wrote:
         | You can also use https://github.com/mstange/samply to make
         | recording and viewing in the Firefox profiler easier.
         | 
         | It will spin up a localhost server after the trace ends, the
         | profiler uses the localhost server and nothing is shared with
         | Firefox servers unless you explicitly choose to upload the data
         | and create a permalink.
        
       | edflsafoiewq wrote:
       | So they replaced a nested for loop for checking for duplicates
       | with a set data structure. Surprised such a common thing was in
       | git.
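        | 
        | The pattern in miniature (a hypothetical C sketch with a toy
        | chained hash set; not git's actual object.c code or its
        | strset type):
        | 
        |     #include <stdio.h>
        |     #include <stdlib.h>
        |     #include <string.h>
        |     
        |     #define NBUCKETS 4096
        |     struct node { const char *s; struct node *next; };
        |     static struct node *buckets[NBUCKETS];
        |     
        |     static unsigned hash(const char *s)
        |     {
        |         unsigned h = 5381;
        |         while (*s)
        |             h = h * 33 + (unsigned char)*s++;
        |         return h % NBUCKETS;
        |     }
        |     
        |     /* One expected-O(1) probe; returns 1 if s was new. */
        |     static int set_add(const char *s)
        |     {
        |         unsigned h = hash(s);
        |         for (struct node *n = buckets[h]; n; n = n->next)
        |             if (!strcmp(n->s, s))
        |                 return 0;
        |         struct node *n = malloc(sizeof(*n));
        |         n->s = s;
        |         n->next = buckets[h];
        |         buckets[h] = n;
        |         return 1;
        |     }
        |     
        |     int main(void)
        |     {
        |         const char *refs[] = { "refs/heads/main",
        |             "refs/tags/v1.0", "refs/heads/main" };
        |         /* The O(N^2) shape: strcmp() each ref against every
        |          * ref kept so far. The fix: one set lookup per ref,
        |          * O(N) overall. */
        |         for (size_t i = 0; i < sizeof(refs)/sizeof(refs[0]); i++)
        |             if (set_add(refs[i]))
        |                 printf("keep %s\n", refs[i]);
        |         return 0;
        |     }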
        
       | tomxor wrote:
        | One of the many reasons I moved to self hosting. I use ZFS to
        | back up every 15 minutes, ... could do it even more frequently but
       | that seems a little pointless.
       | 
       | Also moved away from Gitlab because it's so damn slow.
        
         | glookler wrote:
         | ZFS is great, getting off gitlab would also be great, what did
         | you switch to?
        
         | haroldp wrote:
         | This was my thought too, but I figured I just didn't understand
         | the problem. Why use git commands to backup when a file system
         | copy seems like it would do? zfs snapshot takes less than a
         | second on any size repo. zfs send transfers just the changes
         | since the last backup, as fast as your network, more or less.
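          | 
          | Something like this (dataset and host names made up):
          | 
          |     zfs snapshot tank/git@2025-06-06
          |     zfs send -i @2025-06-05 tank/git@2025-06-06 \
          |         | ssh backup zfs receive pool/git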
        
         | mdaniel wrote:
         | It for sure has an excruciatingly slow UI, but I haven't yet
         | been able to stomach the Ruby magick enough to dig in and find
          | out if it's _all of GitLab_ that's slow, and thus their
         | culture of who-cares just propagates to the UI, or it's
         | literally just the web frontend that's molasses yet because so
         | many folks interact with it over the web that's where the
         | conclusion ends up
         | 
         | I know their backend git proxy is written in golang, their
         | runner agent is written in golang, it spawns CI jobs using
         | containerd, written in golang, and they use postgresql and a
         | redis-esque KV -- although for that part I do believe they're
          | still using Sidekiq _(ruby)_ for doing job dispatch, so that
         | one could very easily lolol back into not actioning tasks
         | efficiently
         | 
         | GitHub Actions sure does enjoy just sitting there twiddling its
         | thumbs when I push "run job," and is also both Ruby and their
         | own crazypants ajax-y web framework, so I don't think it's
         | exactly the shining star of performance itself
        
       | treesknees wrote:
       | Duplicate of https://news.ycombinator.com/item?id=44197061
        
         | mdaniel wrote:
         | Since that one only has one comment, I'm pretty sure that means
         | the HN front-page is luck of the draw and I wouldn't have
         | bothered even commenting about the other one
        
       | LgWoodenBadger wrote:
       | IME, it has always turned out to be the correct decision to
       | eliminate any n^2 operation in anything I've written.
       | 
       | I don't write exotic algorithms, but it's always astounding how
       | small n needs to be to become observably problematic.
        
         | parthdesai wrote:
          | My rule of thumb for 80%-90% of problems is: if you need a
          | complicated algorithm, it means your data model isn't right.
          | Sure, you do need complicated algorithms for compilers, db
          | internals, route planning et al., but all things considered,
          | those are a minority of the use cases.
        
           | anttihaapala wrote:
           | This is not a complicated algorithm. A hash map (dictionary)
           | or a hash set is how you would always do deduplication in
           | Python, because it is easiest to write / least keystrokes
           | anyway. That is not the case in C though, as it is much
           | easier to use arrays and nested loops instead of hash maps.
        
           | paulddraper wrote:
           | This isn't knapsack. This is a dict lookup.
        
         | koala_man wrote:
         | Good call. O(N^2) is the worst time complexity because it's
         | fast enough to be instantaneous in all your testing, but slow
         | enough to explode in prod.
         | 
         | I've seen it several times before, and it's exactly what
         | happened here.
        
           | david422 wrote:
           | We just had this exact problem. Tests ran great, production
           | slowed to a crawl.
        
         | SAI_Peregrinus wrote:
         | I'd say the exception is when `n` is under about 10, and is
         | counting some sort of hardware constrained thing (e.g. some
          | operation over all CAN interfaces present on an OBDII connector
          | can be O(n^2) since n will always be between 1 and 4). If you
         | wouldn't have to physically replace hardware for `n` to
         | increase, you _really_ need to avoid n^2 operations. And even
         | then consider them carefully, perhaps explicitly failing if `n`
         | gets too big to allow for noticing rework is needed before new
         | hardware hits the field.
        
           | echelon wrote:
           | > perhaps explicitly failing if `n` gets too big
           | 
           | That's the problem. A lot of these quadratic time algorithms
           | don't set limits.
           | 
           | Even 'n!' is fine for small 'n'. Real production use cases
           | don't have small 'n'.
        
             | lassejansen wrote:
             | Or, phrased differently, if n has an upper limit, the
             | algorithm is O(1).
        
               | connicpu wrote:
               | As long as you have tests that regularly exercise your
                | algorithm at n=max, where you would notice if they were
                | exceptionally slow.
        
             | masklinn wrote:
             | > Real production use cases don't have small 'n'.
             | 
             | Real production use cases absolutely have small n. You
             | don't hear about them, because it's very hard for them to
             | cause issues. Unless the use case changes and now the n is
             | not small anymore and nobody noticed the trap.
        
             | haiku2077 wrote:
              | I have an app that's been running an O(n^2) algorithm in
             | "production" (free open source app used by various
             | communities) for about half a year now.
             | 
             | It's been fine because "n" is "number of aircraft flying in
             | this flight simulator" - and the simulator's engine starts
             | to fail above around 2000 anyway. So even in the worst case
             | it's still going to run within milliseconds.
        
         | EdwardCoffin wrote:
         | Bruce Dawson says: _I like to call this Dawson's first law of
         | computing: O(n^2) is the sweet spot of badly scaling
         | algorithms: fast enough to make it into production, but slow
         | enough to make things fall down once it gets there._
         | 
         | https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
        
           | paulddraper wrote:
           | The second law is that O(n * log n) is for practical intents
           | and purposes O(n).
        
             | EdwardCoffin wrote:
             | To be clear though, that isn't _his_ second law, at least
             | as of two months ago, according to https://bsky.app/profile
             | /randomascii.bsky.social/post/3lk4c6...
        
               | to11mtm wrote:
               | Fair, but `n log n` definitely is the historical "good
               | enough to actually sleep at night" in my head, every time
               | I see it I think of the prof who taught my first CSC
               | course and our data structures course due to how often it
               | came up.
               | 
               | Also, the wise statement that 'memory is fairly cheap
               | compared to CPU for scaling'. It's insane to see how
               | often folks would rather manually open and scan a
               | 'static-on-deploy' 20-100MB Json file for each request vs
               | just parsing it into structures in memory (where, for
               | most cases, the in memory usage is a fraction of the json
               | itself) and just caching the parsed structure for the
                | lifetime of the application.
        
               | hinkley wrote:
                | Not often but occasionally I will choose the n log n
               | algorithm which obviously has no bugs over the O(n)
               | algorithm with no obvious bugs.
               | 
               | Less brittleness is worth paying a few percent.
               | Especially if it unmuddies the waters enough for someone
               | to spot other accidental (time) complexity.
        
             | hinkley wrote:
             | But sometimes a big enough C can flip which solution helps
             | you hit your margins.
        
           | hinkley wrote:
            | I made the "mistake" in an interview of equating two
            | super-quadratic solutions. What I meant was what
           | Dawson meant. It doesn't matter because they're both too
           | ridiculous to even discuss.
        
             | dieortin wrote:
             | They're too ridiculous... unless a more optimal solution
             | does not exist
        
       | Arcuru wrote:
       | 48 hours is a crazy amount of time to spend just to compress a
       | git folder, it's only a couple GB. 41 minutes still seems like
       | quite a long time.
       | 
       | Why aren't they just snapshotting and archiving the full git
       | repo? Does `git bundle` add something over frequent ZFS backups?
        
         | bombela wrote:
         | > Be aware that even with these recommendations, syncing in
         | this way has some risk since it bypasses Git's normal integrity
         | checking for repositories, so having backups is advised. You
         | may also wish to do a git fsck to verify the integrity of your
         | data on the destination system after syncing.
         | 
         | https://git-scm.com/docs/gitfaq#_transfers
         | 
         | It doesn't tell you how to make a backup safely though.
         | 
         | On a personal scale, Syncthing and Btrfs snapshots work plenty
         | good enough. It's as fast as the storage/network too.
        
           | nightfly wrote:
           | Syncthing is the only way I've ever corrupted a git repo
           | before
        
             | hobofan wrote:
             | I think that's why they specified the "BTRFS snapshots"
             | part. Yes, directly syncing a .git directory seems like a
             | recipe for disaster with how often I've seen individual
              | files lagging to sync, but I guess with BTRFS snapshots one
             | can ensure that only a consistent view of a git directory
             | is being backed up and synced.
        
               | bombela wrote:
               | Nah I truly do it the wrong way around. Syncthing on the
                | git repos. And one of my devices in the Syncthing cluster
               | does btrfs snapshots minutely for recovery and further
               | backups.
               | 
               | Because it's at a personal scale, the only time I can
               | corrupt a git repo is if I work on the same repo (and
               | it's workdir) from more than one device in the time it
               | takes for Syncthing to replicate the changes.
               | 
               | But even then it's not a big deal because git fsck is
               | quick. And I have my snapshots, and the syncthing
               | versioning, and git defaults to two weeks before pruning.
                | And because of how git works, using hashes to identify
               | contents, files are not easily overwritten either.
               | 
               | In 10y I only had one git corruption (I ran a command on
               | the same repo on a different machine via ssh, yielding a
                | syncthing conflict). Syncthing kept copies of the
               | conflict file. One commit disappeared from the history
               | but not from the database. It was easy to rebase the
                | changes. I think I used git fsck to delete the syncthing
               | versioned files.
        
         | unsnap_biceps wrote:
          | zfs snapshots are difficult to offsite to non-zfs replicas,
          | like an S3 bucket.
         | 
          | That said, there's another lesser-known feature that bundles
          | help out with when used with `git clone --bundle-uri`. The
          | client can specify a location of a bundle, or the server can
          | send the client the bundle location in the clone results; the
          | client fetches the bundle, unpacks it, and then updates the
          | delta via the git server. It's a lot lighter weight on the
          | server for cloning large repos, and _a ton_ faster for the
          | client on initial clones.
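          | 
          | Roughly (URLs made up): the operator publishes a bundle
          | somewhere cheap like a CDN or object storage, and the
          | client seeds its clone from it:
          | 
          |     git bundle create repo.bundle --all
          |     git clone \
          |         --bundle-uri=https://cdn.example.com/repo.bundle \
          |         https://git.example.com/repo.git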
        
           | nightfly wrote:
           | ZFS can send to file or whatever you want to pipe to, you can
           | have incremental sends, and if you convert to bookmarks on
           | the sender you don't have to keep the historical data after
           | you send it
        
           | benlivengood wrote:
           | I think if you want consistent snapshot backups on non-zfs
           | destinations the safest thing is to clone the snapshot and
           | rsync from the clone. Not a single-step operation but
           | preserves the atomicity of the snapshot.
           | 
           | EDIT: you could also rsync from a .zfs snapshot directory if
           | you have them enabled.
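            | 
            | E.g. (names illustrative; snapdir=visible makes the .zfs
            | directory show up):
            | 
            |     zfs set snapdir=visible tank/git
            |     rsync -a /tank/git/.zfs/snapshot/2025-06-06/ \
            |         backup:/backups/git/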
        
         | AtlasBarfed wrote:
         | ... so they added caching to things that should have been
         | cached?
         | 
         | ... is this really the way people "back up" git repos? I mean,
         | it is git, so isn't there some way to mirror changes to the
         | repo in another repo and just use ZFS / snapshots / backup
         | software / etc to do that? It's a distributed version control
         | system. Just make sure the version control information is ...
         | distributed?
        
       | DamonHD wrote:
       | Here comes an unpopular nitpick: "... we traced the issue to a
        | 15-year-old Git function with O(N^2) complexity and fixed it with
       | an algorithmic change, reducing backup times exponentially."
       | 
       | Uh no you didn't. Not possible. At most a polynomial reduction is
        | possible, else complexity theory needs a re-write.
       | 
       | (OK, yes, k could be doing some heavy lifting here, but I doubt
       | it.)
       | 
       | If you are going to quote a maths formula then please don't use
       | "exponetially" to mean "lots".
       | 
       | I stopped reading there: I don't want to have to read each word
       | and wonder if they actually meant it, or it's just bad hype.
        
         | immortaljoe wrote:
         | OP here. Feedback is always welcome, I did mean exponentially
         | in the colloquial sense. I do see how it is confusing here,
         | will change it.
        
           | DamonHD wrote:
           | Thank you.
           | 
           | (I don't think that anyone should use "exponentially" that
           | way: it is an art term with a specific and particular
           | meaning, so find another word if you mean something else!
           | Like misusing specific legal or sporting terms...)
        
             | ctxc wrote:
             | Art term?
        
               | wizzwizz4 wrote:
               | https://en.wiktionary.org/wiki/term_of_art
        
               | ctxc wrote:
               | Ah, thanks. I could only think of art as in literal,
               | artistic art.
        
               | kccqzy wrote:
               | If you have that tendency, you just need to think of
               | TAOCP, this industry's best distillation of intellectual
               | achievement, with the word "art" in its name.
        
             | taftster wrote:
             | Which term is appropriate here? What would you suggest?
             | (honest question)
        
           | serial_dev wrote:
           | You can't use exponentially in the colloquial sense in a post
           | about software performance.
           | 
           | If my barista says that his new coffee is exponentially
           | better, it's ok to use it colloquially.
           | 
           | If my git hosting provider writes about an impactful
           | performance improvement, it's not.
        
         | bombela wrote:
          | I can forgive exponential as a figure of speech. You missed
          | the part where they describe using a map for the side effect
          | of [key] deduplication, instead of just saying "set".
          | 
          | The code itself uses the type "strset".
        
       | Lammy wrote:
       | > What this means for GitLab customers -- [a bunch of stuff about
       | how customers can now back up more frequently and more robustly]
       | 
       | Realtalk: They should rewrite this post's headline to be in a
       | positive tense instead of leading with a negative word. I'm glad
       | I read the post, because it is a cool and good fix, but I saw
       | "Decreasing [...] repo backup" and my first thought was that it
       | was an announcement of some service downgrade like some sort of
       | cost-cutting measure.
        
         | serial_dev wrote:
         | I don't think it's unreasonable to expect interested people to
         | read five words from the title. I know people don't always do
         | that, but complaining about it as if they did something wrong
         | is ridiculous.
        
           | robertlagrant wrote:
           | No one was complaining.
        
       | divbzero wrote:
       | The performance improvement that GitLab contributed to Git is
       | slated to be released with v2.50.0:
       | 
       | https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
        
       | pjmlp wrote:
        | A very good example that writing code in C doesn't help
        | performance when the algorithms or data structures aren't
        | properly taken into consideration.
        
         | IshKebab wrote:
         | I would say C makes this sort of thing far more likely because
         | it's usually a ton of effort to obtain suitable containers. In
         | C++ or Rust they have plenty of things like
         | `unordered_set`/`HashSet` built in, so people are much more
         | likely to use it and not go "eh, I'll use a for loop".
         | 
         | In this case Git already had a string set, but it's still not
         | standard so there's a good chance the original author just
         | didn't know about it.
        
           | pjmlp wrote:
           | Yeah, not something that WG14 will ever care about.
        
           | masklinn wrote:
           | > In this case Git already had a string set, but it's still
           | not standard so there's a good chance the original author
           | just didn't know about it.
           | 
            | The original commit was made in January 2009
            | (https://github.com/git/git/commit/b2a6d1c6868b6d5e7d2b4fa912...),
            | strmap was added in November 2020
            | (https://github.com/git/git/commit/ae20bf1ad98bdc716879a8da99...),
            | and strset was added a few days later
            | (https://github.com/git/git/commit/1201eb628ac753af5751258466...).
            | It was first proposed in 2018
            | (https://lore.kernel.org/git/20180906191203.GA26184@sigill.in...;
            | the proposal specifically mentions it fixing possibly
            | quadratic sites).
           | 
           | As noted in the comment, git _did_ have a sorted string list
            | with bisection search, and that's from 2008 (and it actually
           | dates back to 2006 as the "path list" API, before it was
           | renamed following the realisation that it was a generalised
           | string list). Though as the hashmap proposal notes, it's a
           | bit tricky because there's a single type with functions for
           | sorted and functions for unsorted operations, you need to
           | know whether your list is sorted or not independent of its
           | type.
        
       | ashishb wrote:
        | O(n^2) is fast enough to end up in production and slow enough to
       | cause problems at scale.
       | 
        | The most troublesome cases of inefficiency in production are
        | almost always O(n^2).
        
       | divbzero wrote:
       | There seems to be a lesson here about the balance between
       | premature _vs._ anticipatory optimization. We're generally warned
       | against premature optimization but perhaps, as a rule of thumb,
       | we should look for optimizations in frequently-called functions
       | that are obvious and not onerous to implement.
        
         | Quekid5 wrote:
         | If a set-of-strings was trivially available in the source
         | language (at time of implementation) the original programmer
         | would probably have done this (relatively) trivial
         | optimization... This is a symptom of anemic languages like C.
        
       | prymitive wrote:
        | Well done. Next, maybe make the web UI usable: right now
        | there's absolutely no distinction between the UI itself and the
        | user content, which, combined with action buttons in the
        | middle of the page, IMHO makes for a really poor UX.
        
       | SoftTalker wrote:
       | Cool discovery but the article could have been about 1/10 as long
       | and still communicated effectively. At least they didn't post it
       | as a video, so it was easy to skim to the important details.
        
         | kokada wrote:
         | Yes. I read the whole article thinking that this must have been
          | generated by an LLM, because the style at least resembles it.
        
           | ruuda wrote:
           | That was also my thought.
        
         | 1oooqooq wrote:
         | it could have been longer. I still don't know why they were
         | doing backup bundles with two refs :)
        
       | bob1029 wrote:
       | I'm confused why you wouldn't simply snapshot the block-level
       | device if the protocol of the information on top is going to
       | cause this much headache. Quiescing git operations for block
       | level activity is probably not trivial, but it sounds like an
       | easier problem to solve to me.
       | 
       | This is the approach I've taken with SQLite in production
       | environments. Turn on WAL and the problem gets even easier to
       | solve. Customer configures the VM for snapshots every X minutes.
       | Git presumably doesn't have something approximating a WAL, so I
       | understand the hesitation with this path. But, I still think the
       | overall strategy is much more viable and robust to weird edges
       | within git.
        
         | nonameiguess wrote:
         | Gitlab isn't just a managed service. They release the software
         | for self-hosted instances as well. There is no guarantee or
         | requirement that users all run Gitlab on the same filesystem or
         | even a filesystem that supports block level snapshots at all.
         | Presumably, they want a universal backup system that works for
         | all Gitlabs.
        
           | bob1029 wrote:
           | I've never heard of a .git folder that spanned multiple
           | filesystems. It sounds like we are now conflating the git
           | workspace with everything else in the product.
           | 
           | There are system requirements that a customer would be
           | expected to adhere to if they wanted a valid enterprise
           | support contract with one of these vendors.
        
             | to11mtm wrote:
             | I think GP's point is that the filesystems used by someone
             | self-hosting gitlab may not be the same as what gitlab
             | themselves are using.
             | 
             | File systems can be weird. Sometimes the OS can be weird
             | and fsync type calls may not do what you expect. At least
             | at one point MacOS fsync didn't behave the same way as
             | Linux (i.e. Linux should ensure the write is truly done and
             | not just in cache so long as the drive isn't lying). [0]
             | 
             | > There are system requirements that a customer would be
             | expected to adhere to if they wanted a valid enterprise
             | support contract with one of these vendors.
             | 
             | Gitlab has a community edition. Not handling data well
             | would be bad for their public image.
             | 
             | [0] - https://news.ycombinator.com/item?id=30372218
        
             | nonameiguess wrote:
             | I can't reply to the other reply for some reason, but what
             | they said is indeed what I meant. gitlab.com might be
             | running their Gitaly servers on btrfs or zfs or lvm volumes
             | or whatever, but other customers may be using ext2. Gitlab
             | the company _could_ require customers to only run Gitaly on
             | a specific filesystem, but up to now, they never have, it
             | would be pretty shitty to suddenly change their minds after
             | a decade plus of establishing one expectation, and whoever
             | the developer is who submitted the patch to upstream Git
             | and got a technical blog post out of it has absolutely no
             | power to dictate contract terms to enterprise customers
             | personally and instead did what is actually in their power
             | to do.
        
         | amtamt wrote:
         | > This is the approach I've taken with SQLite in production
         | environments. Turn on WAL and the problem gets even easier to
         | solve.
         | 
         | A few months back a better solution was provided by SQLite:
         | sqlite3_rsync
         | 
         | https://www.sqlite.org/rsync.html
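          | 
          | Usage is a one-liner, origin then replica (paths made up):
          | 
          |     sqlite3_rsync /data/app.db backup:/backups/app.db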
        
         | lbotos wrote:
         | > Git presumably doesn't have something approximating a WAL, so
         | I understand the hesitation with this path.
         | 
         | Bingo. One of the worst problems is helping a client piece back
         | together a corrupted repo when they are using snapshots. Check
         | my profile to see how I know. :)
         | 
         | It's usually an OMG down scenario, and then you are adding in
         | the "oh no, now the restore is corrupted."
         | 
         | It's fixable but it's definitely annoying.
        
       | einpoklum wrote:
       | I had a somewhat similar experience when writing a "remove
       | duplicates" extension for Thunderbird:
       | 
       | https://addons.thunderbird.net/en-US/thunderbird/addon/remov...
       | 
        | I used a hash to begin with, applying a simplistic digest
        | function to the message headers I was comparing, which got me a
        | 4-byte hash key.
       | 
       | That worked, but was kind of slow.
       | 
       | Finally, the idea came to me to not apply _any_ digesting, and
       | use the combined concatenated headers of the messages as hash
       | keys. About 2k bytes per hash key!
       | 
       | The result: About 20x perf improvement if memory serves.
       | 
       | How is that possible? The reason is that the code all runs in a
       | Javascript machine; and applying the digest was not a built-in
       | function, it was looping over the headers and doing the
       | arithmetic. Thousands upon thousands of JS abstract machine
       | steps. The use of the large hash key may be inefficient, but -
       | it's just one JS object / dictionary operation, and one of the
       | most heavily-optimized in any implementation.
        
       | sneak wrote:
       | ...and they say leetcode-style problems are out of date.
       | 
       | While perhaps implementing low level sorts is silly, knowing
        | which data structures to use and when, and what underlying big-O
       | performance they have, is critical, as demonstrated here.
        
       | kristianp wrote:
       | It would have been nice if the post included some details, such
       | as before and after code.
        
       | niffydroid wrote:
       | Why use this method over just having a script do a clone or pull
        | elsewhere? Git is about distribution, right?
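        | 
        | A mirror clone plus a periodic fetch would look something
        | like (URL made up):
        | 
        |     git clone --mirror https://git.example.com/repo.git
        |     git -C repo.git remote update --prune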
        
       ___________________________________________________________________
       (page generated 2025-06-06 23:00 UTC)