[HN Gopher] How we decreased GitLab repo backup times from 48 ho...
       ___________________________________________________________________
        
       How we decreased GitLab repo backup times from 48 hours to 41
       minutes
        
       Author : nayak
       Score  : 566 points
       Date   : 2025-06-06 15:43 UTC (2 days ago)
        
 (HTM) web link (about.gitlab.com)
 (TXT) w3m dump (about.gitlab.com)
        
       | judofyr wrote:
       | See also: https://www.tumblr.com/accidentallyquadratic
       | 
       | Quadratic complexity sits in an awkward sweet spot: Fast enough
       | for medium-sized _n_ to pass first QA, but doomed to fail
       | eventually as _n_ grows.
        
         | vjerancrnjak wrote:
         | This particular change was not accidental. It was announced as
         | quadratic.
        
         | hinkley wrote:
         | We have a habit of taking our eye off of old problems by trying
         | to juggle several new ones. By the time someone notices that we
          | have a problem, the dogleg in the graphs where the n^2 solution
         | stopped fitting into the CPU cache has been obvious for months
          | but nobody was looking, and we dance around the fact that we
         | had time to take a reasonable approach to fix the problem if we
         | had noticed it when it became measurable, by adding anxiety to
         | the cleanup work.
         | 
         | And then someone learns from this experience, gets the bright
         | idea to set up an alert for such things, but the alert doesn't
         | factor in things like customer base growth or feature creep
         | slowly pushing up the expected runtime. Eventually organic load
         | gets close to the alarm and then the fucking thing goes off on
         | a three day weekend (why is it always a long weekend or just
         | before one?) and then we wage war on alarm overreach and the
         | whole cycle repeats itself.
         | 
         | We like to think of ourselves as blazing trails in the
         | wilderness but most of the time we are doing laps around the
         | parking lot.
        
       | jeffbee wrote:
       | Are there any reimplementations of git, by professional
       | programmers using real tools? The source in question -- object.c
       | -- is "banging rocks together" material.
        
         | landr0id wrote:
         | https://github.com/GitoxideLabs/gitoxide
        
         | umanwizard wrote:
         | Ignoring your inaccurate flamebait and answering the underlying
         | question: there are several reimplementations of git, including
         | "got" (Game of Trees) from the OpenBSD developers, jgit
         | (reimplementation in Java), the Haskell git package, the
         | gitoxide rust crate (and I assume several more half-finished
         | Rust hobby projects), and probably others; that's just what I
         | found with a cursory google.
        
         | Vilian wrote:
         | Git, but as it was intended to be used
        
         | kgeist wrote:
         | We use go-git in one of our Go projects that relies on Git for
         | version control of its data.
        
         | gonzus wrote:
         | https://github.com/libgit2/libgit2
        
         | fHr wrote:
         | lmao
        
       | hiddew wrote:
       | "fixed it with an algorithmic change, reducing backup times
       | exponentially"
       | 
       | If the backup times were O(n^2), are they now O(n^2 / 2^n)? I
       | would guess not.
        
         | umanwizard wrote:
         | This is not the precise mathematical definition of exponential,
         | but rather the colloquial one, where it just means "a lot".
        
           | cvoss wrote:
           | You shouldn't use a word that can carry a precise
           | mathematical meaning in a sentence that literally uses
           | mathematical notation in order to speak precisely and then
           | expect readers not to interpret the word in the precise
           | mathematical way.
        
             | blharr wrote:
             | I somewhat agree, but for lack of a better word, what would
             | you use? Quadratically doesn't have the same punch
        
               | umanwizard wrote:
               | "Dramatically" ?
        
               | remram wrote:
               | "a lot"
        
               | jjmarr wrote:
               | "by a factor of `n`" also sounds impressive.
        
               | jrochkind1 wrote:
               | If you just mean "a lot" in a non-technical sense, there
               | are plenty of words available. enormously. immensely.
               | tremendously. remarkably. incredibly. vastly.
        
               | bobbylarrybobby wrote:
               | "From quadratic to linear" seems fine.
        
               | bobbylarrybobby wrote:
               | "From quadratic to linear" or "... to constant" seems
               | fine.
        
               | morepedantic wrote:
               | Algorithmic? Big-O? Polynomially? Linear improvement?
               | O(n^2) to O(n)? Or if you want to be less mathematically
               | precise: enormous improvement?
               | 
               | Using exponential in this way in any context is a faux
               | pas, because it's highly ambiguous, and requires context
               | for clarification. But in this situation the context
               | clearly resolved to the mathematically accurate
               | definition, except it was used in the other way.
        
               | globular-toast wrote:
               | Runtimes dropped precipitously.
        
               | ndriscoll wrote:
               | Quadratically doesn't have the same punch because it is
               | _actually_ exponentially less than exponentially. So
               | doing it for extra punch (as opposed to not knowing the
               | correct word) in a technical context would just be lying.
                | It'd be like a paper saying they had a result with p
               | less than one in a trillion for "extra punch" when they
               | actually had p=0.1.
        
             | IshKebab wrote:
             | You should if you expect your readers to be normal humans
             | who understand obvious context, and not pedantic HN readers
             | who understand obvious context but delight in nit-picking
             | it anyway.
        
               | Dylan16807 wrote:
                | You _can_, but it's not _should_.
        
               | morepedantic wrote:
               | >pedantic
               | 
               | Who the fuck do you think is the intended audience for an
               | article about an algorithm in `git bundle create`? I
               | spent approximately two minutes of my life trying to
               | figure out where the O(n^2) algorithm was being invoked
               | in such a way that it influenced an exponential.
               | 
               | Exponential was bolded in the same sentence as a big-O.
               | 50/50 troll/author oversight.
        
               | saagarjha wrote:
               | Maybe not 'morepedantic
        
               | globular-toast wrote:
               | Ah yes because "normal humans" know what O(n^2) means but
               | damnit they are going to use exponential wrong.
        
               | tomjakubowski wrote:
               | I'm a normal human and I know what O(n^2) means. There
               | are dozens of us.
        
             | sneak wrote:
             | "Words mean things."
             | 
             | If you can't agree with this, then you shouldn't be
             | speaking or writing, IMO.
             | 
             | Those who argue that words that mean different things are
             | actually equivalent have no business dealing with language.
        
               | morepedantic wrote:
               | I understood every word, phrase, and sentence you wrote.
               | But I did not understand your point. Still, I got the
               | meaning of your words, so presumably you're satisfied.
        
             | hskalin wrote:
             | I find the whole article rather poorly written. Most likely
             | using an LLM.
        
             | morepedantic wrote:
             | Especially when the colloquial meaning derives from the
             | mathematical meaning.
        
             | MyFedora wrote:
             | We simplify the big O notation in computer science. This is
             | standard practice.
        
               | rovr138 wrote:
               | Just drop the constants, it doesn't matter /s
               | 
               | Production systems running and melting here...
        
           | morepedantic wrote:
           | >Ultimately, we traced the issue to a 15-year-old Git
            | function with O(N^2) complexity and fixed it with an
           | algorithmic change, reducing backup times exponentially.
           | 
           | No, not in the exact same sentence as a big-O. That's either
           | author error, or an intentional troll. Either way it's egg on
           | their faces.
        
         | marcellus23 wrote:
         | Meaningless and non-constructive pedantry.
        
           | chrisweekly wrote:
           | I'm not the OP you're responding to, but to be fair, in a
           | sentence about big-O perf characteristics, which includes the
           | word "algorithms", using "exponentially" in a colloquial non-
           | technical sense is an absolutely terrible word choice.
        
             | linguistbreaker wrote:
             | Exponentially bad word choice even... since we're using
             | that word however we want now?
             | 
             | I don't think this is meaningless or non-constructive
             | pedantry - we're a technical community and those are
             | technical words.
        
           | msgodel wrote:
           | I disagree. Misuse of the word "exponential" is a major pet
           | peeve of mine. It's a particular case of the much more common
           | "use mathematically precise phrasing to sound
           | careful/precise" that you often find in less than honest
           | writing.
           | 
           | Here they are actually using it to refer to growth functions
           | (which is rare for this error) and being honest (which is
           | also rare IMO) but it's still wrong. They should have written
           | about quadratic or quadratic vs linear.
           | 
           | Regardless sloppy language leads to sloppy thought.
        
             | keybored wrote:
             | Sloppy writing is up orders of magnitude lately.
        
               | 0cf8612b2e1e wrote:
               | It will decimate readership.
        
             | sneak wrote:
             | My personal pet peeve is when the term "exponentially" is
             | used to refer to a change between precisely two data
             | points.
             | 
             | It's a very specific subset of the one you're describing.
             | 
             | "Tell me you don't know what you're talking about without
             | telling me you don't know what you're talking about."
        
               | morepedantic wrote:
               | >My personal pet peeve
               | 
               | Just be glad you have only one.
        
           | morepedantic wrote:
           | >pedantry
           | 
           | I wasted 2 minutes of my life looking for the exponential
           | reduction. So did many others.
           | 
           | Now I'm wasting more of my life shit posting about it, but at
           | least that's a conscious choice.
        
         | csnweb wrote:
         | If you replace an n^2 algorithm with a log(n) lookup you get an
         | exponential speed up. Although a hashmap lookup is usually
         | O(1), which is even faster.
        
           | ryao wrote:
           | That is not true unless n^C / e^n = log(n) where C is some
           | constant, which it is not. The difference between log(n) and
           | some polynomial is logarithmic, not exponential.
        
             | csnweb wrote:
             | But if you happen to have n=2^c, then an algorithm with
              | logarithmic complexity only needs c time. That's why this is
             | usually referred to as exponential speedup in complexity
             | theory, just like from O(2^n) to O(n). More concretely if
             | the first algorithm needs 1024 seconds, the second one will
             | need only 10 seconds in both cases, so I think it makes
             | sense.
        
               | ryao wrote:
               | N is a variable in what I posted, not a constant.
        
             | wasabi991011 wrote:
             | It depends if you consider "speedup" to mean dividing the
             | runtime or applying a function to the runtime.
             | 
              | I.e. you are saying an f(n) speedup means T(n)/f(n), but
             | others would say it means f(T(n)) or some variation of
             | that.
        
               | morepedantic wrote:
               | The man, or llm, used the mathematically imprecise
               | definition of exponential in a sentence with a big-O
               | notation. I don't think he's going to be writing entire
               | arguments formally.
        
           | ndriscoll wrote:
           | They're still using the map in a loop, so it'd be nlogn for a
           | tree-based map or n for a hash map.
        
         | sn9 wrote:
          | The algorithmic complexity went down in the function they
          | patched (6x improvement in their benchmark), but in the context
          | of how they were using the algorithm the impact was much larger
          | (improved to taking 1% of the time), which is plausibly
          | exponential (and figuring out the actual complexity is neither
          | relevant nor an economic use of time).
        
           | ndriscoll wrote:
           | > figuring out the actual complexity is neither relevant nor
           | an economic use of time
           | 
           | The fix was replacing a nested loop with a map. Figuring out
           | that this goes from O(n^2) to O(n) (modulo details like
           | bucket count) is immediate if you know what the words mean
           | and understand enough to identify the problem and make the
           | fix in the first place.
        
         | wasabi991011 wrote:
         | I interpreted that as n->log(n) since log and exp are inverses.
         | 
          | Also because I've often heard that the quantum Fourier transform
         | is an exponential speedup over the discrete Fourier transform,
         | and there the scaling goes n^2->nlogn.
        
         | morepedantic wrote:
         | I was also looking for the exponential algorithm.
        
         | abhashanand1501 wrote:
         | In fact O(n^2) is exponentially more than O(log n).
        
       | gre wrote:
       | looks like the commit is here:
       | https://github.com/git/git/commit/a52d459e72b890c192485002ec...
        
         | mortar wrote:
         | Thanks, this helped me find the submission and related
         | discussion thread -
         | https://lore.kernel.org/git/20250401-488-generating-bundles-...
        
       | SomaticPirate wrote:
       | How was the flame graph created? (Not very familiar with C and
       | the performance tools around it)
        
         | plorkyeran wrote:
         | https://github.com/brendangregg/FlameGraph
         | 
         | You record performance data with `perf`, then use the scripts
         | there to turn it into a SVG.
        
           | immortaljoe wrote:
           | OP here. In this particular case, I used
           | https://github.com/flamegraph-rs/flamegraph
        
           | IshKebab wrote:
           | I strongly recommend not using this. Instead use pprof - it
           | has a MUCH better interactive flamegraph, plus other nice
           | performance visualisations (e.g. a call graph):
           | 
            | https://github.com/google/pprof
            | 
            |     go install github.com/google/pprof@latest
            |     pprof -http=: prof.out
           | 
           | I normally collect the profiles with gperftools
           | (https://github.com/gperftools/gperftools) and then just
           | LD_PRELOAD=/usr/lib/libtcmalloc_and_profiler.so
           | CPUPROFILE=prof.out <your command>
           | 
           | I've been meaning to try Samply though. Not sure if it works
           | with pprof.
        
         | landr0id wrote:
         | You can also use https://github.com/mstange/samply to make
         | recording and viewing in the Firefox profiler easier.
         | 
         | It will spin up a localhost server after the trace ends, the
         | profiler uses the localhost server and nothing is shared with
         | Firefox servers unless you explicitly choose to upload the data
         | and create a permalink.
        
       | edflsafoiewq wrote:
        | So they replaced a nested for loop that checked for duplicates
        | with a set data structure. Surprised such a common thing was in
       | git.
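        | 
        | Roughly the shape of that change (a hand-rolled sketch of the
        | before/after, not the actual git patch -- git uses its own data
        | structures for this):
        | 
        |     /* Before: O(n^2) -- each name is compared against every
        |        name already kept. */
        |     static size_t dedup_quadratic(char **names, size_t n)
        |     {
        |         size_t kept = 0;
        |         for (size_t i = 0; i < n; i++) {
        |             size_t j;
        |             for (j = 0; j < kept; j++)
        |                 if (!strcmp(names[i], names[j]))
        |                     break;          /* duplicate: drop it */
        |             if (j == kept)
        |                 names[kept++] = names[i];
        |         }
        |         return kept;
        |     }
        | 
        |     /* After: O(n) expected -- membership is a hash-set probe
        |        (djb2 hash, linear probing, table sized >= 2n). */
        |     static size_t dedup_with_set(char **names, size_t n)
        |     {
        |         size_t cap = 16;
        |         while (cap < 2 * n)
        |             cap *= 2;
        |         char **slots = calloc(cap, sizeof(*slots));
        |         size_t kept = 0;
        |         for (size_t i = 0; i < n; i++) {
        |             size_t h = 5381;
        |             for (const char *s = names[i]; *s; s++)
        |                 h = h * 33 + (unsigned char)*s;
        |             h &= cap - 1;
        |             while (slots[h] && strcmp(slots[h], names[i]))
        |                 h = (h + 1) & (cap - 1);   /* linear probe */
        |             if (!slots[h]) {               /* first occurrence */
        |                 slots[h] = names[i];
        |                 names[kept++] = names[i];
        |             }
        |         }
        |         free(slots);
        |         return kept;
        |     }
        | 
        | (Needs string.h and stdlib.h; calloc error handling omitted.)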
        
       | tomxor wrote:
       | One of the many reasons I moved to self hosting. I use ZFS to
       | backup every 15 minutes, ... could do it even more frequently but
       | that seems a little pointless.
       | 
       | Also moved away from Gitlab because it's so damn slow.
        
         | glookler wrote:
         | ZFS is great, getting off gitlab would also be great, what did
         | you switch to?
        
           | tomxor wrote:
            | Gitea, I can't recommend it enough. Lightning fast compared
           | to the proprietary offerings, clean simple UI like an older
           | Github, and super simple to self host.
        
             | mdaniel wrote:
              | So long as you have an existing CI/CD story, or are willing
              | to invest in comparing them head-to-head to find one that
              | fits your needs. Because both it and Forgejo's "we're
              | deploying act, what could go wrong" give me the heebie-
              | jeebies. And, aside from that, there are _so many_ FLOSS
              | ones they could have chosen that legitimately _work_,
              | making that decision extra opaque to me
        
         | haroldp wrote:
         | This was my thought too, but I figured I just didn't understand
         | the problem. Why use git commands to backup when a file system
         | copy seems like it would do? zfs snapshot takes less than a
         | second on any size repo. zfs send transfers just the changes
         | since the last backup, as fast as your network, more or less.
        
           | tomxor wrote:
           | Yup, once you've used snapshots for backups once, doing any
            | kind of file-level backup just seems absurd. I now use
           | it as the basis for every piece of infrastructure I add that
           | holds valuable data. The only place it doesn't really fit is
           | distributed databases.
        
         | mdaniel wrote:
         | It for sure has an excruciatingly slow UI, but I haven't yet
         | been able to stomach the Ruby magick enough to dig in and find
          | out if it's _all of GitLab_ that's slow, and thus their
          | culture of who-cares just propagates to the UI, or it's
          | literally just the web frontend that's molasses; because so
          | many folks interact with it over the web, that's where the
          | conclusion ends up
         | 
         | I know their backend git proxy is written in golang, their
         | runner agent is written in golang, it spawns CI jobs using
         | containerd, written in golang, and they use postgresql and a
         | redis-esque KV -- although for that part I do believe they're
          | still using Sidekiq _(ruby)_ for doing job dispatch, so that
         | one could very easily lolol back into not actioning tasks
         | efficiently
         | 
         | GitHub Actions sure does enjoy just sitting there twiddling its
         | thumbs when I push "run job," and is also both Ruby and their
         | own crazypants ajax-y web framework, so I don't think it's
         | exactly the shining star of performance itself
        
       | treesknees wrote:
       | Duplicate of https://news.ycombinator.com/item?id=44197061
        
         | mdaniel wrote:
         | Since that one only has one comment, I'm pretty sure that means
         | the HN front-page is luck of the draw and I wouldn't have
         | bothered even commenting about the other one
        
       | LgWoodenBadger wrote:
       | IME, it has always turned out to be the correct decision to
       | eliminate any n^2 operation in anything I've written.
       | 
       | I don't write exotic algorithms, but it's always astounding how
       | small n needs to be to become observably problematic.
        
         | parthdesai wrote:
          | My rule of thumb for 80%-90% of problems is: if you need a
          | complicated algorithm, it means your data model isn't right.
          | Sure, you do need complicated algorithms for compilers, db
          | internals, route planning et al., but all things considered,
          | those are a minority of the use cases.
        
           | anttihaapala wrote:
           | This is not a complicated algorithm. A hash map (dictionary)
           | or a hash set is how you would always do deduplication in
           | Python, because it is easiest to write / least keystrokes
           | anyway. That is not the case in C though, as it is much
           | easier to use arrays and nested loops instead of hash maps.
        
             | throwaway2037 wrote:
             | > That is not the case in C though, as it is much easier to
             | use arrays and nested loops instead of hash maps.
             | 
             | I am confused. There are plenty of open source, fast hash
             | map impls in C.
        
               | saagarjha wrote:
               | Yes, the problem is getting them into your project.
        
           | paulddraper wrote:
           | This isn't knapsack. This is a dict lookup.
        
           | LtWorf wrote:
           | I wrote a funny algorithm to group together words that end
           | the same way to write them once in my binary wordlist file,
           | since there is an array that points to the start character
            | and a \0 to end the word. My initial solution was O(n^2) but
           | it was too slow on a real wordlist so I had to come up with
           | something better. In the end I just sort the list with
           | quicksort, but revert the order of the words and then the
           | groupable ones end up nearby each other.
        
         | koala_man wrote:
         | Good call. O(N^2) is the worst time complexity because it's
         | fast enough to be instantaneous in all your testing, but slow
         | enough to explode in prod.
         | 
         | I've seen it several times before, and it's exactly what
         | happened here.
        
           | david422 wrote:
           | We just had this exact problem. Tests ran great, production
           | slowed to a crawl.
        
             | bcrl wrote:
             | I was just helping out with the network at an event. Worked
             | great in testing, but it failed in production due to
             | unicast flooding the network core. Turns out that some of
             | the PoE Ethernet switches had an insufficiently sized CAM
             | for the deployment combined with STP topology changes
             | reducing the effective size of the CAM by a factor of 10 on
             | the larger switches. Gotta love when packet forwarding goes
             | from O(1) to O(n) and O(n^2)! Debugging that in production
             | is non-trivial as the needle is in such a large haystack of
             | packets so as to be nearly impossible to find in the output
             | of tcpdump and wireshark. The horror... The horror...
        
             | hinkley wrote:
              | On the first big project I worked on, a couple of us sped up
              | the db initialization scripts so we could use a less trivial
              | set of test data to stop this sort of shenanigans.
             | 
             | Things like inserting the test data first and turning on
             | constraints and possibly indexes afterward.
        
         | SAI_Peregrinus wrote:
         | I'd say the exception is when `n` is under about 10, and is
         | counting some sort of hardware constrained thing (e.g. some
          | operation over all CAN interfaces present on an OBDII connector
          | can be O(n^2) since n will always be between 1 and 4). If you
         | wouldn't have to physically replace hardware for `n` to
         | increase, you _really_ need to avoid n^2 operations. And even
         | then consider them carefully, perhaps explicitly failing if `n`
         | gets too big to allow for noticing rework is needed before new
         | hardware hits the field.
        
           | echelon wrote:
           | > perhaps explicitly failing if `n` gets too big
           | 
           | That's the problem. A lot of these quadratic time algorithms
           | don't set limits.
           | 
           | Even 'n!' is fine for small 'n'. Real production use cases
           | don't have small 'n'.
        
             | lassejansen wrote:
             | Or, phrased differently, if n has an upper limit, the
             | algorithm is O(1).
        
               | connicpu wrote:
               | As long as you have tests that regularly exercise your
               | algorithm at n=max where you would notice if they were
               | exceptionally slow
        
             | masklinn wrote:
             | > Real production use cases don't have small 'n'.
             | 
             | Real production use cases absolutely have small n. You
             | don't hear about them, because it's very hard for them to
             | cause issues. Unless the use case changes and now the n is
             | not small anymore and nobody noticed the trap.
        
             | haiku2077 wrote:
              | I have an app that's been running an O(n^2) algorithm in
             | "production" (free open source app used by various
             | communities) for about half a year now.
             | 
             | It's been fine because "n" is "number of aircraft flying in
             | this flight simulator" - and the simulator's engine starts
             | to fail above around 2000 anyway. So even in the worst case
             | it's still going to run within milliseconds.
        
         | EdwardCoffin wrote:
         | Bruce Dawson says: _I like to call this Dawson's first law of
         | computing: O(n^2) is the sweet spot of badly scaling
         | algorithms: fast enough to make it into production, but slow
         | enough to make things fall down once it gets there._
         | 
         | https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
        
           | paulddraper wrote:
           | The second law is that O(n * log n) is for practical intents
           | and purposes O(n).
        
             | EdwardCoffin wrote:
             | To be clear though, that isn't _his_ second law, at least
             | as of two months ago, according to https://bsky.app/profile
             | /randomascii.bsky.social/post/3lk4c6...
        
               | to11mtm wrote:
               | Fair, but `n log n` definitely is the historical "good
                | enough to actually sleep at night" in my head. Every time
                | I see it I think of the prof who taught my first CSC
               | course and our data structures course due to how often it
               | came up.
               | 
               | Also, the wise statement that 'memory is fairly cheap
               | compared to CPU for scaling'. It's insane to see how
               | often folks would rather manually open and scan a
                | 'static-on-deploy' 20-100MB JSON file for each request vs
                | just parsing it into structures in memory (where, for
                | most cases, the in-memory usage is a fraction of the JSON
                | itself) and just caching the parsed structure for the
                | lifetime of the application.
        
               | hinkley wrote:
                | Not often but occasionally I will choose the nlogn
               | algorithm which obviously has no bugs over the O(n)
               | algorithm with no obvious bugs.
               | 
               | Less brittleness is worth paying a few percent.
               | Especially if it unmuddies the waters enough for someone
               | to spot other accidental (time) complexity.
        
               | refulgentis wrote:
               | Considerably more than a few percent, IMHO. :)
               | 
               | But I also don't dabble in this area _nearly_ enough to
                | know whether there's years of tears and toil finding out
                | repeatedly that O(n) is ~impossible to implement and
                | verify :)
                | 
                |     | n   | n log n  |
                |     | 5   | 8.0472   |
                |     | 10  | 23.0259  |
                |     | 25  | 80.4719  |
                |     | 50  | 195.6012 |
                |     | 100 | 460.5170 |
        
               | hinkley wrote:
               | This is magic thinking about how C, memory hierarchies,
               | networking, and system calls work.
        
               | Someone wrote:
               | Depends on the constants and on the value of _n_. If the
               | constant for the _O(n log n)_ algorithm is five times
               | that of the _O(n)_ algorithm, the _O(n)_ algorithm is
               | faster for _n < 100_.
               | 
               | If you expect that _n < 100_ will always hold, it may be
               | better to implement the _O(n)_ algorithm and add a
               | logging warning if _n > 250_ or so (and, maybe, a fatal
               | error if _n > 1000_ or so), instead of spending time to
               | write both versions of the algorithm _and_ spend time
               | finding the cut off value for choosing between the two.
        
               | hinkley wrote:
               | Fatal errors tend to blow up in production rather than
               | test.
               | 
                | One of the simplest solutions for detecting cyclic graphs,
                | instead of collecting a lookup table or doing
                | something non-concurrent like marking the nodes, is to
                | count nodes and panic if the encountered set is more than
                | an order of magnitude larger than you expected.
               | 
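                | Concretely, the guard looks something like this (a toy
                | sketch, hypothetical names):
                | 
                |     /* Walk the graph but give up if we visit far more
                |        nodes than the structure should contain -- the
                |        cheap "assume a cycle" panic described above. */
                |     struct node { struct node **kids; size_t nkids; };
                | 
                |     static void walk(struct node *n, size_t *seen,
                |                      size_t limit)
                |     {
                |         if (++*seen > limit)
                |             abort();       /* way past expected size */
                |         for (size_t i = 0; i < n->nkids; i++)
                |             walk(n->kids[i], seen, limit);
                |     }
                | 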
               | I came onto a project that had done that before and it
               | blew up during my tenure. The worst case graph size was
               | several times the expected case, and long term customers
               | were growing their data sets vertically rather than
               | horizontally (eg, ever notice how much friction there is
               | to making new web pages versus cramming more data into
               | the existing set?) and now instead of 10x never happening
               | it was happening every Tuesday.
               | 
               | I was watching the same thing play out on another project
               | recently but it got cancelled before we hit that
               | threshold for anything other than incorrect queries.
        
               | rcxdude wrote:
               | It's also often in the range where constant factors can
               | make a big difference over a wide range of n
        
               | paulddraper wrote:
               | Yes, that isn't actually Dawson's second law.
        
               | brucedawson wrote:
               | Should it be my second law?
               | 
               | https://bsky.app/profile/randomascii.bsky.social/post/3lr
               | 24s...
        
             | hinkley wrote:
             | But sometimes a big enough C can flip which solution helps
             | you hit your margins.
        
               | hansvm wrote:
               | In my mind, that's always been the point in dropping log
               | factors. The algorithms are comparable enough that the
               | actual implementation starts to matter, which is all
               | we're really looking for in a Big-O analysis.
        
             | sn9 wrote:
             | Skiena has a great table in his algorithms book mapping
             | time complexity to hypothetical times for different input
             | sizes.
             | 
             | For _n_ of 10^9, where lg _n_ takes 0.03 us and _n_ takes 1
                | s, _n_ lg _n_ takes 29.9 s and _n_^2 takes _31.7 years_.
        
               | swyx wrote:
               | more from table please?
        
               | johnisgood wrote:
               | I would rather have the table and related content. Name
               | of the book?
        
           | hinkley wrote:
           | I made the "mistake" in an interview of equating two super-
            | quadratic solutions. What I meant was what
           | Dawson meant. It doesn't matter because they're both too
           | ridiculous to even discuss.
        
             | dieortin wrote:
             | They're too ridiculous... unless a more optimal solution
             | does not exist
        
               | hinkley wrote:
               | Absolutely not.
               | 
               | If the cost of doing something goes above quadratic, you
               | shouldn't do it at all. Because essentially every
               | customer interaction costs you more than the one before.
               | You will never be able to come up with ways to cover that
               | cost faster than it ramps. You are digging a hole,
               | filling it with cash and lighting it on fire.
               | 
               | If you can't do something well you should consider not
               | doing it at all. If you can only do it badly with no hope
               | of ever correcting it, you should outsource it.
        
               | Retric wrote:
               | Chess engines faced worse than quadratic scaling and came
               | out the other side...
               | 
               | Software operates in a crazy number of different domains
               | with wildly different constraints.
        
               | crabmusket wrote:
               | I believe hinkley was commenting on things that are
               | quadratic in the number of users. It doesn't sound like a
               | chess engine would have that property.
        
               | tux3 wrote:
               | They did make it sound like almost anything would
               | necessarily have n scale with new users. That assumption
                | is already questionable
               | 
               | There's a bit of a "What Computational Complexity Taught
               | Me About B2B SaaS" bias going.
        
               | delusional wrote:
               | All of modern Neural Network AI is based on GEMM which
               | are O(n^2) algorithms. There are sub-cubic alternatives,
               | but it's my understanding that the cache behavior of
               | those variants mean they aren't practically faster when
               | memory bound.
               | 
               | n is only rarely related to "customers". As long as n
               | doesn't grow, the asymptotic complexity doesn't actually
               | matter.
        
               | saagarjha wrote:
               | The GEMM is O(n^3) actually. Transformers are quadratic
               | in the size of their context window.
        
               | hinkley wrote:
               | I read that as a typo given the next sentence.
               | 
               | I'm on the fence about cubic time. I was mostly thinking
               | of exponential and factorial problems. I think some very
               | clever people can make cubic work despite my warnings.
               | But most of us shouldn't. General advice is to be ignored
               | by masters when appropriate. That's also the story arc of
               | about half of kung fu movies.
               | 
               | Did chess solvers really progress much before there was a
               | cubic approximation?
        
               | delusional wrote:
               | > I read that as a typo given the next sentence.
               | 
               | Thank you for the courtesy.
               | 
               | > I think some very clever people can make cubic work
               | despite my warnings.
               | 
               | I think you're selling yourself short. You don't need to
               | be that clever to make these algorithms work, you have
               | all the tools necessary. Asymptotic analysis is helpful
               | not just because it tells us a growth, but also because
               | it limits that growth to being in _n_. If you're doing
               | matmul and n is proportional to the size of the input
               | matrix, then you know that if your matrix is constant
               | then the matmul will always take the same time. It does
               | not matter to you what the asymptotic complexity is,
               | because you have a fixed n. In your program, it's O(1).
               | As long as the runtime is sufficient, you know it will
               | never change for the lifetime of the program.
               | 
               | There's absolutely no reason to be scared of that kind of
               | work, it's not hard.
        
               | hinkley wrote:
               | Right but back up at the top of the chain the assertion
               | was that if n grows as your company does then IME you're
               | default dead. Because when the VC money runs out you
               | can't charge your customers enough to keep the lights on
               | and also keep the customers.
        
               | endgame wrote:
               | I feel this is too hardline and e.g. eliminates the
               | useful things people do with SAT solvers.
        
               | hinkley wrote:
               | The first SAT solver case that comes to mind is circuit
               | layout, and then you have a k vs n problem. Because you
               | don't SAT solve per chip, you SAT solve per model and
               | then amortize that cost across the first couple years'
               | sales. And they're also "cheating" by copy pasting cores,
               | which means the SAT problem is growing much more slowly
               | than the number of gates per chip. Probably more like
                | n^(1/2) these days.
               | 
               | If SAT solvers suddenly got inordinately more expensive
               | you'd use a human because they used to do this but the
               | solver was better/cheaper.
               | 
               | Edit: checking my math, looks like in a 15 year period
               | from around 2005 to 2020, AMD increased the number of
               | cores by about 30x and the transistors per core by about
               | 10x.
        
               | IshKebab wrote:
               | That's quite a contortion to avoid losing the argument!
               | 
               | "Oh well my algorithm isn't _really_ O(N^2) because I 'm
               | going to print N copies of the answer!"
               | 
               | Absurd!
        
               | hinkley wrote:
               | What I'm saying is that the gate count problem that is
                | profitable is in m^3 not n^3. And as long as m < n^(2/3)
                | then you are n^2 despite applying a cubic time solution
                | to m.
               | 
               | I would argue that this is essentially part of why Intel
               | is flagging now. They had a model of ever increasing
               | design costs that was offset by a steady inflation of
               | sales quarter after quarter offsetting those costs. They
               | introduced the "tick tock" model of biting off a major
               | design every second cycle and small refinements in
               | between, to keep the slope of the cost line below the
               | slope of the sales line. Then they stumbled on that and
               | now it's tick tick tock and clearly TSM, AMD and possibly
               | Apple (with TSM's help) can now produce a better product
               | for a lower cost per gate.
               | 
               | Doesn't TSM's library of existing circuit layouts
               | constitute a substantial decrease in the complexity of
               | laying out an entire chip? As grows you introduce more
               | precalculated components that are dropped in, bringing
               | the slope of the line down.
               | 
               | Meanwhile NVIDIA has an even better model where they spam
               | gpu units like mad. What's the doubling interval for gpu
               | units?
        
               | crote wrote:
               | That only matters when the constants are nontrivial and N
               | has a potential to get big.
               | 
               | Not every app is a B2C product intending to grow to
               | billions of users. If the costs start out as near-zero
               | and are going to grow to still be negligible at 100%
               | market share, who cares that it's _technically_
               | suboptimal? Sure, you could spend expensive developer-
               | hours trying to find a better way of doing it, but YAGNI.
        
               | hinkley wrote:
               | I just exited a B2B that discovered they invested in
               | luxury features and the market tightened their belts by
               | going with cheaper and simpler competitors. Their n
               | wasn't really that high but they sure tried their
               | damnedest to make it cubic complexity. "Power" and
               | "flexibility" outnumbered, "straightforward" and even
               | "robust" but at least three to one in conversations. A
               | lot of my favorite people saw there was no winning that
               | conversation and noped out long before I did.
               | 
               | The devs voted with their feet and the customers with
               | their wallets.
        
               | Tainnor wrote:
               | Gaussian elimination (for square matrices) is O(n^3)
               | arithmetic operations and it's one of the most important
               | algorithms in any scientific domain.
        
               | hinkley wrote:
               | I'll allow that perhaps I should have said "cubic"
               | instead of "quadratic" - there are much worse orders in
               | the menagerie than n^3. But it's a constraint we bang
               | into over and over again. We use these systems because
               | they're cheaper than humans, yes? People are still trying
               | to shave off hundredths of the exponent in matrix
               | multiplication for instance. It makes the front page of
               | HN every time someone makes a "breakthrough".
        
               | marcinzm wrote:
               | > no hope of ever correcting it
               | 
               | That's a pretty bold assumption.
               | 
               | Almost every startup that has succeeded was utterly
               | unscalable at first in tons of technical and business
               | ways. Then they fixed it as they scaled. Over-optimizing
               | early has probably killed far more projects and companies
               | than the opposite.
        
               | hinkley wrote:
               | > That's a pretty bold assumption.
               | 
                | That's not a bold assumption, it's the predicate for this
               | entire sidebar. The commenter at the top said some things
               | can't be done in quadratic time and have to be done
               | anyway, and I took exception.
               | 
               | >> unless a more optimal solution does not exist
               | 
               | Dropping into the middle of a conversation and ignoring
               | the context so you can treat the participants like they
               | are confused or stupid is very bad manners. I'm not
               | grumpy at you I'm grumpy that this is the eleventeenth
               | time this has happened.
               | 
               | > Almost every startup
               | 
               | Almost every startup fails. Do you model your behavior on
               | people who fail >90% of the time? Maybe you, and perhaps
               | by extension we, need to reflect on that.
               | 
               | > Then we fixed it as we scaled
               | 
               | Yes, because you picked a problem that _can_ be
               | architected to run in reasonable time. You elected to do
               | it later. You trusted that you could delay it and turned
               | out to be right.
               | 
               | >> unless a more optimal solution does not exist
               | 
               | When the devs discover the entire premise is
               | unsustainable or nobody knows how to make it sustainable
               | after banging their heads against it, they quickly find
               | someplace else to be and everyone wonders what went
               | wrong. There was a table of ex employees who knew exactly
               | what went wrong but it was impolitic to say. Don't want
               | the VCs to wake up.
        
               | patrick451 wrote:
                | There are too many obvious exceptions to even start
               | taking this seriously. If we all followed this advice, we
               | would never even multiply matrices.
        
               | Tepix wrote:
               | So, how would you write a solver for tower of Hanoi then?
               | Are you saying you wouldn't?
        
               | hinkley wrote:
               | As a business? Would you try to sell a product that
               | behaved like tower of Hanoi or walk away?
        
         | vlovich123 wrote:
         | I have an n^3 operation that's currently a huge bottleneck at
         | only 10k elements. Not sure how to fix it.
        
           | IshKebab wrote:
           | I once looked into tree diffing algorithms (the literature is
           | all about diffing XML even though it's really trees). The
           | obvious dumb algorithm (which seems to be what everyone uses)
           | is n^4.
        
         | esprehn wrote:
         | Modern computers are pretty great at scanning small blocks of
         | memory repeatedly, so n^2 can be faster than the alternative
          | using a map for small N.
         | 
         | I spent a lot of time fixing n^2 in blink, but there were some
         | fun surprises:
         | 
         | https://source.chromium.org/chromium/chromium/src/+/main:thi...
         | 
         | For large N without a cache :nth-child matching would be very
         | slow doing n^2 scans of the siblings to compute the index. On
         | the other hand for small sibling counts it turned out the cache
          | overhead was noticeably worse than just doing an n^2. (See the
         | end of the linked comment).
         | 
         | This turns out to be true in a lot of surprising places, both
         | where linear search beats constant time maps, and where n^2 is
         | better than fancy algorithms to compensate.
         | 
         | Memory latency and instruction timing is the gotcha of many fun
         | algorithms in the real world.
        
           | throwaway2037 wrote:
           | > where linear search beats constant time maps
           | 
           | Can you give an example? You said lots of good things in your
            | post, but I'm struggling to believe this one. Also, it would
           | help to see some wall clock times or real world impact.
        
             | hansvm wrote:
             | Pick any compiled language and test it. Pick an algorithm
             | making heavy use of a small (<10, maybe up to a hundred
             | elements) hashset, and try using a linear structure
             | instead. The difference will be most apparent with
             | complicated keys, but even strings of more than a few
             | characters should work.
             | 
             | Some example workloads include:
             | 
             | 1. Tokenization (checking if a word is a keyword)
             | 
             | 2. Interpretation (mapping an instruction name to its
             | action)
             | 
             | 3. ML (encoding low-cardinality string features in
             | something like catboost)
             | 
             | 4. JSON parsers (usually key count is low, so parse into a
             | linear-scan hashmap rather than a general-purpose hashmap)
             | 
             | Details vary in the exact workload, the hardware you're
             | using, what other instructions you're mixing in, etc. It's
             | a well-known phenomenon though, and when you're doing a
             | microoptimization pass it's definitely something to
             | consider. 2x speedups are common. 10x or more happen from
             | time to time.
             | 
             | It's similar to (but not _quite_ the same as) the reason
             | real-world binary search uses linear scans for small
             | element counts.
             | 
             | When you go to really optimize the system, you'll also find
             | that the linear scan solution is often more amenable to
             | performance improvements from batching.
             | 
             | As to how much it matters for your composite program? Even
             | at a microoptimization level I think it's much more
             | important to pay attention to memory access patterns. When
             | we wrote our protobuf parser that's all we really had to
             | care about to improve performance (33% less execution time
             | for the entire workload, proto parsing being much better
             | than that). You're much more likely to be able to write
             | sane code that way (contrasted with focusing on
             | instructions and whatnot first), and it's easier to layer
             | CPU improvements on top of a good memory access pattern
             | than to go the other way around.
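              | 
              | For flavor, the tiny-n shape where the "dumb" scan tends to
              | win (hypothetical keyword check, not from any particular
              | codebase) -- benchmark it against a general-purpose hash set
              | and the linear version usually comes out ahead at this size:
              | 
              |     #include <stdbool.h>
              |     #include <string.h>
              | 
              |     static const char *kw[] =
              |         { "if", "else", "while", "for", "return" };
              | 
              |     static bool is_keyword(const char *tok)
              |     {
              |         /* one small contiguous array, no hashing */
              |         for (size_t i = 0; i < sizeof kw / sizeof *kw; i++)
              |             if (!strcmp(tok, kw[i]))
              |                 return true;
              |         return false;
              |     }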
        
             | crote wrote:
             | You've got to keep in mind that computers aren't the
             | 1-instruction-at-a-time purely sequential machines anymore.
             | 
             | Let's say you've got a linear array of bytes, and you want
             | to see if it contains a specific value. What would a modern
             | CPU need to execute? Well, we can actually compare 64
             | values at a time with _mm512_mask_cmp_epu8_mask! You still
             | need a little bit of setup and a final "did any of them
             | match" check, of course. Want to compare 512 values? You
             | can probably do that in less than 10 clock cycles with
              | modern machines.
             | 
             | Doing the same with a hash set? Better make sure that hash
             | algorithm is fast. Sure it's O(1), but if calculating the
             | hash takes 20 cycles it doesn't matter.
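              | 
              | A sketch of that idea using AVX2 (rather than the AVX-512
              | intrinsic above, but the same shape; assumes immintrin.h and
              | an AVX2-capable CPU):
              | 
              |     #include <immintrin.h>
              |     #include <stdint.h>
              | 
              |     /* Does `needle` appear anywhere in a 32-byte block?
              |        One load, one broadcast, one compare, one movemask,
              |        instead of 32 scalar compares. */
              |     static int block_contains(const uint8_t *p, uint8_t needle)
              |     {
              |         __m256i hay = _mm256_loadu_si256((const __m256i *)p);
              |         __m256i pat = _mm256_set1_epi8((char)needle);
              |         __m256i eq  = _mm256_cmpeq_epi8(hay, pat);
              |         return _mm256_movemask_epi8(eq) != 0;
              |     }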
        
               | ncruces wrote:
               | This is a good point.
               | 
                | A string search algorithm that uses SIMD to quickly
               | discard a majority of 16, 32 or 64 attempts in parallel,
               | and then verify the surviving ones quadratically (again
               | 16, 32 or 64 bytes at a time) can go a _very_ long way
               | against a sublinear algorithm that understands needle
               | structure, but necessarily needs to process the haystack
               | one byte at a time.
        
           | kevincox wrote:
           | This is true. But unless you are sure that current and future
           | inputs will always be small I find it is better to start with
           | the algorithm that scales better. Then you can add a special
           | case for small sizes if it turns up in a hot path.
           | 
           | This is because performance is typically less important for
           | the fast/small case and it is generally acceptable for
           | processing twice as much to be twice (or slightly more than
           | twice) as slow, but users are far more likely to hit and
            | really be burned by n^2 algorithms in things you thought would
           | almost always be small and you never tested large enough
           | sizes in testing to notice.
           | 
           | I wrote more on this topic here
           | https://kevincox.ca/2023/05/09/less-than-quadratic/
        
           | hinkley wrote:
           | A lot of computations are really higher complexity order
           | functions with stair steps at certain intervals based on
           | hardware trying to pretend they are constant time. All
           | operations cost the same amount until n doubles again and
           | then it's slower. If you zoom out toward infinity, the stair
           | steps smooth out into a logarithmic curve. It becomes
           | logarithmic in the former case and square root in the latter.
           | Even dividing two numbers or doing a memory address lookup
           | stops being C, which is part of why prime factoring worked
           | for RSA for so long.
           | 
           | If anyone had made clockless logic work you would see that
           | adding 1 + 1 is in fact faster than adding 2^63 + 1.
           | 
           | If you put enough data into a hash table the key length has
           | to increase logarithmically to the table size in order to
           | have distinct keys per record. Even Knuth points out that
           | hash tables are really nlogn - something I'm pretty sure my
           | CS professors left out. In multiple classes. Man, did I get
           | tired of hash tables, but I see now why they harped on them.
            | Case in point, this article.
        
         | solardev wrote:
         | For those of us without a formal comp sci background, is there
         | a way for the IDE to detect and warn about these automatically?
         | Or any other easy telltale signs to look for?
         | 
         | As a self taught dev, when I encounter nested loops, I have to
         | mentally go through them and try to see if they iterate through
         | each item more than once. But that's not a very foolproof
         | method.
        
           | chacha102 wrote:
           | Too much domain knowledge for an IDE to catch. I'm self
           | taught as well, and it comes down to spending more time
           | thinking about the code than writing the code.
           | 
            | It's a fairly simple thought experiment to ask yourself: what
            | if there were 10x the items in this array? 100x? That is
           | essentially what the O(n) notation is trying to quantify. You
           | just don't need to do it that formally.
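            | 
            | If you want an empirical shortcut, time the suspect code at n
            | and 2n (toy harness, standard C only): a ~2x jump suggests
            | linear, ~4x suggests quadratic.
            | 
            |     #include <stdio.h>
            |     #include <time.h>
            | 
            |     static void work(int n)       /* routine under suspicion */
            |     {
            |         volatile long sink = 0;
            |         for (int i = 0; i < n; i++)
            |             for (int j = 0; j < n; j++)
            |                 sink += i ^ j;
            |     }
            | 
            |     static double seconds(int n)
            |     {
            |         clock_t t0 = clock();
            |         work(n);
            |         return (double)(clock() - t0) / CLOCKS_PER_SEC;
            |     }
            | 
            |     int main(void)
            |     {
            |         double a = seconds(10000), b = seconds(20000);
            |         printf("n -> 2n ratio: %.1fx\n", b / a);
            |     }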
        
       | Arcuru wrote:
       | 48 hours is a crazy amount of time to spend just to compress a
       | git folder, it's only a couple GB. 41 minutes still seems like
       | quite a long time.
       | 
       | Why aren't they just snapshotting and archiving the full git
       | repo? Does `git bundle` add something over frequent ZFS backups?
        
         | bombela wrote:
         | > Be aware that even with these recommendations, syncing in
         | this way has some risk since it bypasses Git's normal integrity
         | checking for repositories, so having backups is advised. You
         | may also wish to do a git fsck to verify the integrity of your
         | data on the destination system after syncing.
         | 
         | https://git-scm.com/docs/gitfaq#_transfers
         | 
         | It doesn't tell you how to make a backup safely though.
         | 
         | On a personal scale, Syncthing and Btrfs snapshots work plenty
         | good enough. It's as fast as the storage/network too.
        
           | nightfly wrote:
           | Syncthing is the only way I've ever corrupted a git repo
           | before
        
             | hobofan wrote:
             | I think that's why they specified the "BTRFS snapshots"
             | part. Yes, directly syncing a .git directory seems like a
             | recipe for disaster with how often I've seen individual
             | files lagging behind when syncing, but I guess with BTRFS
             | snapshots one can ensure that only a consistent view of a
             | git directory is being backed up and synced.
        
               | bombela wrote:
               | Nah I truly do it the wrong way around. Syncthing on the
               | git repos. And one of my devices in the Syncthing cluster
               | does btrfs snapshots minutely for recovery and further
               | backups.
               | 
               | Because it's at a personal scale, the only time I can
               | corrupt a git repo is if I work on the same repo (and
               | its workdir) from more than one device in the time it
               | takes for Syncthing to replicate the changes.
               | 
               | But even then it's not a big deal because git fsck is
               | quick. And I have my snapshots, and the syncthing
               | versioning, and git defaults to two weeks before pruning.
               | And because of how git works, using hash to identify
               | contents, files are not easily overwritten either.
               | 
               | In 10y I only had one git corruption (I ran a command on
               | the same repo on a different machine via ssh, yielding a
               | syncthing conflict). Syncthing kept copies of the
               | conflict file. One commit disappeared from the history
               | but not from the database. It was easy to rebase the
               | changes. I think I used git fsck to delete the syncthing
               | versioned files.
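               | 
               | The snapshot side is just something like this from a
               | minutely cron job (paths made up):
               | 
               |     btrfs subvolume snapshot -r /srv/repos \
               |         /srv/.snapshots/repos-$(date +%Y%m%d-%H%M)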
        
           | ndriscoll wrote:
           | If filesystem snapshots weren't safe, wouldn't that also mean
           | git is prone to corrupting your repo in the event of a power
           | loss or crash? That seems like a bad bug.
        
         | unsnap_biceps wrote:
         | zfs snapshots are difficult to offsite to non-ZFS replicas,
         | say an S3 bucket.
         | 
         | That said, there's another lesser-known feature that bundles
         | help with when used with `git clone --bundle-uri`. The client
         | can specify the location of a bundle, or the server can send
         | the bundle location to the client in the clone results; the
         | client then fetches the bundle, unpacks it, and updates the
         | delta via the git server. That makes cloning large repos a lot
         | lighter weight on the server, and _a ton_ faster for the
         | client on initial clones.
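         | 
         | Client side it's a single flag, something like this (URL
         | hypothetical; needs a reasonably recent Git):
         | 
         |     git clone --bundle-uri=https://cdn.example.com/repo.bundle \
         |         https://gitlab.example.com/group/repo.git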
        
           | nightfly wrote:
           | ZFS can send to a file or to whatever you want to pipe to,
           | you can do incremental sends, and if you convert snapshots
           | to bookmarks on the sender you don't have to keep the
           | historical data after you send it.
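           | 
           | Roughly (dataset and file names made up):
           | 
           |     # full send once, then keep only a bookmark around
           |     zfs snapshot tank/git@base
           |     zfs send tank/git@base > git-base.zfs
           |     zfs bookmark tank/git@base tank/git#base
           |     zfs destroy tank/git@base
           | 
           |     # later: incremental send against the bookmark
           |     zfs snapshot tank/git@today
           |     zfs send -i tank/git#base tank/git@today > git-incr.zfs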
        
           | benlivengood wrote:
           | I think if you want consistent snapshot backups on non-zfs
           | destinations the safest thing is to clone the snapshot and
           | rsync from the clone. Not a single-step operation but
           | preserves the atomicity of the snapshot.
           | 
           | EDIT: you could also rsync from a .zfs snapshot directory if
           | you have them enabled.
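           | 
           | Something like this (names made up):
           | 
           |     zfs snapshot tank/git@bak
           |     zfs clone -o readonly=on tank/git@bak tank/git_bak
           |     rsync -a /tank/git_bak/ backup-host:/backups/git/
           |     zfs destroy tank/git_bak
           |     zfs destroy tank/git@bak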
        
         | AtlasBarfed wrote:
         | ... so they added caching to things that should have been
         | cached?
         | 
         | ... is this really the way people "back up" git repos? I mean,
         | it is git, so isn't there some way to mirror changes to the
         | repo in another repo and just use ZFS / snapshots / backup
         | software / etc to do that? It's a distributed version control
         | system. Just make sure the version control information is ...
         | distributed?
        
         | broken_broken_ wrote:
         | Reading the article I thought exactly the same! I'd be curious
         | to know how much time the same would take with zfs.
        
         | gregorvand wrote:
         | 15 years also seems like a long time
        
       | DamonHD wrote:
       | Here comes an unpopular nitpick: "... we traced the issue to a
       | 15-year-old Git function with O(N^2) complexity and fixed it
       | with
       | an algorithmic change, reducing backup times exponentially."
       | 
       | Uh no you didn't. Not possible. At most a polynomial reduction is
       | possible else complexity theory needs a re-write.
       | 
       | (OK, yes, k could be doing some heavy lifting here, but I doubt
       | it.)
       | 
       | If you are going to quote a maths formula then please don't use
       | "exponetially" to mean "lots".
       | 
       | I stopped reading there: I don't want to have to read each word
       | and wonder if they actually meant it, or it's just bad hype.
        
         | immortaljoe wrote:
         | OP here. Feedback is always welcome, I did mean exponentially
         | in the colloquial sense. I do see how it is confusing here,
         | will change it.
        
           | DamonHD wrote:
           | Thank you.
           | 
           | (I don't think that anyone should use "exponentially" that
           | way: it is an art term with a specific and particular
           | meaning, so find another word if you mean something else!
           | Like misusing specific legal or sporting terms...)
        
             | ctxc wrote:
             | Art term?
        
               | wizzwizz4 wrote:
               | https://en.wiktionary.org/wiki/term_of_art
        
               | ctxc wrote:
               | Ah, thanks. I could only think of art as in literal,
               | artistic art.
        
               | kccqzy wrote:
               | If you have that tendency, you just need to think of
               | TAOCP, this industry's best distillation of intellectual
               | achievement, with the word "art" in its name.
        
             | taftster wrote:
             | Which term is appropriate here? What would you suggest?
             | (honest question)
        
               | DamonHD wrote:
               | "hugely" or "a lot" or "to O(XXX)" whatever the new XXX
               | complexity is.
        
           | serial_dev wrote:
           | You can't use exponentially in the colloquial sense in a post
           | about software performance.
           | 
           | If my barista says that his new coffee is exponentially
           | better, it's ok to use it colloquially.
           | 
           | If my git hosting provider writes about an impactful
           | performance improvement, it's not.
        
         | bombela wrote:
         | I can forgive exponential as a figure of speech. You missed
         | where they describe using a map for the side effect of [key]
         | deduplication.
         | 
         | Instead of saying "set". The code itself uses the type
         | "strset".
        
       | Lammy wrote:
       | > What this means for GitLab customers -- [a bunch of stuff about
       | how customers can now back up more frequently and more robustly]
       | 
       | Realtalk: They should rewrite this post's headline to be in a
       | positive tense instead of leading with a negative word. I'm glad
       | I read the post, because it is a cool and good fix, but I saw
       | "Decreasing [...] repo backup" and my first thought was that it
       | was an announcement of some service downgrade like some sort of
       | cost-cutting measure.
        
         | serial_dev wrote:
         | I don't think it's unreasonable to expect interested people to
         | read five words from the title. I know people don't always do
         | that, but complaining about it as if they did something wrong
         | is ridiculous.
        
           | robertlagrant wrote:
           | No one was complaining.
        
       | divbzero wrote:
       | The performance improvement that GitLab contributed to Git is
       | slated to be released with v2.50.0:
       | 
       | https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
        
         | Nemo_bis wrote:
         | Nice that GitLab is active upstream! Do I see correctly that
         | the most active contributors to git are currently GitLab
         | employees?
        
           | dfc wrote:
           | I don't know how you came to that conclusion. Just looking
           | at the top 4 contributors over the last 2 years[1] it looks
           | like one works for Google (gitster), two work at GitHub and
           | one works at GitLab (pks-t).
           | 
           | [1]: https://github.com/git/git/graphs/contributors?from=6%2F
           | 3%2F...
        
             | Nemo_bis wrote:
             | I had looked at the monthly activity. Indeed on a longer
             | timespan it looks different, thanks.
        
               | dfc wrote:
               | That's weird, I looked at the monthly and did not see
               | any GitLab employees. Were you looking at a different
               | repo?
        
       | pjmlp wrote:
       | A very good example that writing code in C doesn't help for
       | performance, when the algorithms or data structures aren't
       | properly taken into consideration.
        
         | IshKebab wrote:
         | I would say C makes this sort of thing far more likely because
         | it's usually a ton of effort to obtain suitable containers. In
         | C++ or Rust they have plenty of things like
         | `unordered_set`/`HashSet` built in, so people are much more
         | likely to use them and not go "eh, I'll use a for loop".
         | 
         | In this case Git already had a string set, but it's still not
         | standard so there's a good chance the original author just
         | didn't know about it.
        
           | pjmlp wrote:
           | Yeah, not something that WG14 will ever care about.
        
           | masklinn wrote:
           | > In this case Git already had a string set, but it's still
           | not standard so there's a good chance the original author
           | just didn't know about it.
           | 
           | The original commit was made in January 2009 (https://github.
           | com/git/git/commit/b2a6d1c6868b6d5e7d2b4fa912...), strmap was
           | added in November 2020 (https://github.com/git/git/commit/ae2
           | 0bf1ad98bdc716879a8da99..., strset was added a few days
           | later: https://github.com/git/git/commit/1201eb628ac753af5751
           | 258466...). It was first proposed in 2018 (https://lore.kerne
           | l.org/git/20180906191203.GA26184@sigill.in... the proposal
           | specifically mentions it fixing possibly quadratic sites).
           | 
           | As noted in the comment, git _did_ have a sorted string list
           | with bisection search, and that's from 2008 (and it actually
           | dates back to 2006 as the "path list" API, before it was
           | renamed following the realisation that it was a generalised
           | string list). Though as the hashmap proposal notes, it's a
           | bit tricky because there's a single type with functions for
           | sorted and functions for unsorted operations, you need to
           | know whether your list is sorted or not independent of its
           | type.
        
           | morepedantic wrote:
           | I've seen this exact problem in C code many times in my life,
           | especially in kernel space where data structures and memory
           | allocations are fun.
           | 
           | Ironically, this is much _faster_ for small sets. Sometimes
           | the error is intentional, because the programmer believes
           | that all inputs will be small. IME, those programmers were
           | wrong, but that's the inverse of survival bias.
        
             | masklinn wrote:
             | And in fairness I can understand not considering someone
             | might eventually be bundling a repository with tens of
             | thousands of refs back in early 2009.
        
       | ashishb wrote:
       | O(n^2) is fast enough to end up in production and slow enough
       | to cause problems at scale.
       | 
       | The most troublesome cases of inefficient production code are
       | almost always O(n^2).
        
         | hinkley wrote:
         | Had a modular monorepo back when UML was still a thing people
         | did. People were having trouble opening a TogetherJ project -
         | it was taking 30 minutes. I dug into it, and I don't recall how
         | I timed it but I kept adding more submodules and the runtime
         | went up ridiculously fast. I stopped at a project that took 18
         | hours to load. Started it before going home one day and timed
         | it the next morning.
         | 
         | When I plotted the runtime, I got n^5 for the fitting curve.
         | That's the largest polynomial I've encountered in the wild.
         | Second place has always been cubic.
         | 
         | Their response was that it had something to do with processing
         | config files per module, and a suggestion to rearrange them as
         | a workaround. They fixed the problem in the next patch and the
         | load time went from 18 hours to a couple minutes.
        
       | divbzero wrote:
       | There seems to be a lesson here about the balance between
       | premature _vs._ anticipatory optimization. We're generally warned
       | against premature optimization but perhaps, as a rule of thumb,
       | we should look for optimizations in frequently-called functions
       | that are obvious and not onerous to implement.
        
         | Quekid5 wrote:
         | If a set-of-strings had been trivially available in the source
         | language (at the time of implementation), the original
         | programmer would probably have done this (relatively) trivial
         | optimization... This is a symptom of anemic languages like C.
        
       | prymitive wrote:
       | Well done. Next, maybe make the web UI usable: right now there's
       | absolutely no distinction between the UI itself and the user
       | content, which, combined with action buttons in the middle of
       | the page, IMHO makes for a really poor UX.
        
       | SoftTalker wrote:
       | Cool discovery but the article could have been about 1/10 as long
       | and still communicated effectively. At least they didn't post it
       | as a video, so it was easy to skim to the important details.
        
         | kokada wrote:
         | Yes. I read the whole article thinking that this must have
         | been generated by an LLM, because the style at least resembles
         | it.
        
           | ruuda wrote:
           | That was also my thought.
        
           | Scaevolus wrote:
           | Em dashes and bullet points!
        
           | jaygreco wrote:
           | Glad I wasn't the only one who thought this. The post is also
           | missing one obvious thing that I expect in any technical
           | post: code snippets. Let me see the code.
           | 
           | ChatGPT has ruined bullet points for the rest of us...
           | 
           | No offense, but writing this blog post couldn't have taken
           | more than a few minutes, so why spoil it with an LLM? Shoot,
           | use one to check grammar and recommend edits, even.
        
           | bearjaws wrote:
           | Don't take my bullet points away from me
        
             | jorvi wrote:
             | They came for the em dashes, and I did not speak up. Then
             | they came for the bullet points, and I did not speak up...
        
           | djdeutschebahn wrote:
           | I thought exactly the same. The reading experience would
           | definitely have been improved with less text.
        
         | 1oooqooq wrote:
         | it could have been longer. I still don't know why they were
         | doing backup bundles with two refs :)
        
           | 6LLvveMx2koXfwn wrote:
           | They weren't, if you look at the fix [1] the dedupe loop was
           | run in _all_ cases, not just those with known dupes, so the
           | performance hit was any bundle with lots of refs.
           | 
           | 1.https://github.com/git/git/commit/bb74c0abbc31da35be5299956
           | 9...
        
             | rjmunro wrote:
             | But why couldn't they just dedupe the refs from the command
             | line before starting the actual bundling - surely there are
             | never more than a couple of hundred of those (one per
             | branch)?
        
               | 6LLvveMx2koXfwn wrote:
               | The point is the performance hit had nothing to do with
               | dupe count (which could be zero), and everything to do
               | with ref count.
        
         | davidron wrote:
         | For those that haven't read the article yet, scroll down to
         | the flame graph and start reading until it starts talking
         | about backporting the fix. Then stop.
        
           | wgjordan wrote:
           | "How we decreased reading 'How we decreased GitLab repo
           | backup times from 48 hours to 41 minutes' times from 4.8
           | minutes to 41 seconds"
        
         | hliyan wrote:
         | Interesting that multiple people are noticing this same thing.
         | For me, this could have been:
         | 
         | "We found that a large portion of the 48 hours taken to backup
         | our rails respository was due to a function in `git bundle
         | create` that checks for duplicate references entered as command
         | line arguments. The check contained a nested for loop ( O(N2)
         | ), which we replaced with a map data structure in an upstream
         | fix to Git. The patch was accepted, but we also backported fix
         | without waiting for the next Git version. With this change,
         | backup time dropped to 41 minutes."
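         | 
         | A sketch of the shape of the fix (not the actual patch; the
         | `str_set` type and `set_add` below are stand-ins for Git's
         | strset, with `set_add` assumed to return true only when the
         | string was not already present):
         | 
         |     #include <stdbool.h>
         |     #include <stddef.h>
         | 
         |     struct str_set;               /* hypothetical hash set */
         |     bool set_add(struct str_set *set, const char *s);
         | 
         |     /* One expected-O(1) set probe per ref instead of
         |        rescanning everything kept so far: O(N) overall.
         |        Keeps the first occurrence of each ref, in place. */
         |     size_t dedup_refs(const char **refs, size_t n,
         |                       struct str_set *seen)
         |     {
         |         size_t kept = 0;
         |         for (size_t i = 0; i < n; i++)
         |             if (set_add(seen, refs[i]))
         |                 refs[kept++] = refs[i];
         |         return kept;
         |     }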
        
         | m104 wrote:
         | It's the "impact" style of technical write-ups: sell the
         | problem and the scale, then present the solution, which is thus
         | presented and understood through the lens of business and
         | customer success.
         | 
         | Generously, this writing style is supposed to show the business
         | value of teams and individuals, for promotions or other
         | recognition. But yeah, it can be frustrating to read this
         | style.
        
       | bob1029 wrote:
       | I'm confused why you wouldn't simply snapshot the block-level
       | device if the protocol of the information on top is going to
       | cause this much headache. Quiescing git operations for block
       | level activity is probably not trivial, but it sounds like an
       | easier problem to solve to me.
       | 
       | This is the approach I've taken with SQLite in production
       | environments. Turn on WAL and the problem gets even easier to
       | solve. Customer configures the VM for snapshots every X minutes.
       | Git presumably doesn't have something approximating a WAL, so I
       | understand the hesitation with this path. But, I still think the
       | overall strategy is much more viable and robust to weird edges
       | within git.
        
         | nonameiguess wrote:
         | Gitlab isn't just a managed service. They release the software
         | for self-hosted instances as well. There is no guarantee or
         | requirement that users all run Gitlab on the same filesystem or
         | even a filesystem that supports block level snapshots at all.
         | Presumably, they want a universal backup system that works for
         | all Gitlabs.
        
           | bob1029 wrote:
           | I've never heard of a .git folder that spanned multiple
           | filesystems. It sounds like we are now conflating the git
           | workspace with everything else in the product.
           | 
           | There are system requirements that a customer would be
           | expected to adhere to if they wanted a valid enterprise
           | support contract with one of these vendors.
        
             | to11mtm wrote:
             | I think GP's point is that the filesystems used by someone
             | self-hosting gitlab may not be the same as what gitlab
             | themselves are using.
             | 
             | File systems can be weird. Sometimes the OS can be weird
             | and fsync type calls may not do what you expect. At least
             | at one point MacOS fsync didn't behave the same way as
             | Linux (i.e. Linux should ensure the write is truly done and
             | not just in cache so long as the drive isn't lying). [0]
             | 
             | > There are system requirements that a customer would be
             | expected to adhere to if they wanted a valid enterprise
             | support contract with one of these vendors.
             | 
             | Gitlab has a community edition. Not handling data well
             | would be bad for their public image.
             | 
             | [0] - https://news.ycombinator.com/item?id=30372218
        
             | nonameiguess wrote:
             | I can't reply to the other reply for some reason, but what
             | they said is indeed what I meant. gitlab.com might be
             | running their Gitaly servers on btrfs or zfs or lvm volumes
             | or whatever, but other customers may be using ext2. Gitlab
             | the company _could_ require customers to only run Gitaly on
             | a specific filesystem, but up to now they never have, and
             | it would be pretty shitty to suddenly change their minds
             | after a decade plus of establishing one expectation. And
             | whoever the developer is who submitted the patch to
             | upstream Git and got a technical blog post out of it has
             | absolutely no power to dictate contract terms to
             | enterprise customers personally, and instead did what is
             | actually in their power to do.
        
         | amtamt wrote:
         | > This is the approach I've taken with SQLite in production
         | environments. Turn on WAL and the problem gets even easier to
         | solve.
         | 
         | A few months back a better solution was provided by SQLite:
         | sqlite3_rsync
         | 
         | https://www.sqlite.org/rsync.html
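         | 
         | Usage is a single command, e.g. (paths and host made up; as I
         | understand it the database needs to be in WAL mode and the
         | tool has to be installed on both ends):
         | 
         |     sqlite3_rsync /var/lib/app/prod.db backup:/backups/prod.db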
        
         | lbotos wrote:
         | > Git presumably doesn't have something approximating a WAL, so
         | I understand the hesitation with this path.
         | 
         | Bingo. One of the worst problems is helping a client piece back
         | together a corrupted repo when they are using snapshots. Check
         | my profile to see how I know. :)
         | 
         | It's usually an OMG down scenario, and then you are adding in
         | the "oh no, now the restore is corrupted."
         | 
         | It's fixable but it's definitely annoying.
        
       | einpoklum wrote:
       | I had a somewhat similar experience when writing a "remove
       | duplicates" extension for Thunderbird:
       | 
       | https://addons.thunderbird.net/en-US/thunderbird/addon/remov...
       | 
       | I used a hash to begin with, applying a simplistic digest
       | function to the message headers I was comparing, which got me a
       | 4-byte hash key.
       | 
       | That worked, but was kind of slow.
       | 
       | Finally, the idea came to me to not apply _any_ digesting, and
       | use the combined concatenated headers of the messages as hash
       | keys. About 2k bytes per hash key!
       | 
       | The result: About 20x perf improvement if memory serves.
       | 
       | How is that possible? The reason is that the code all runs in a
       | JavaScript machine, and applying the digest was not a built-in
       | function: it was looping over the headers and doing the
       | arithmetic. Thousands upon thousands of JS abstract machine
       | steps. The use of the large hash key may be inefficient, but -
       | it's just one JS object / dictionary operation, and one of the
       | most heavily-optimized in any implementation.
        
       | sneak wrote:
       | ...and they say leetcode-style problems are out of date.
       | 
       | While perhaps implementing low level sorts is silly, knowing
       | which data structures to use and when, and what underlying big-O
       | performance they have, is critical, as demonstrated here.
        
       | kristianp wrote:
       | It would have been nice if the post included some details, such
       | as before and after code.
        
       | niffydroid wrote:
       | Why use this method over just having a script do a clone or
       | pull elsewhere? Git is about distribution, right?
        
       | iamcreasy wrote:
       | Cool! Please post another flame graph after applying the patch.
        
       | tcdent wrote:
       | TLDR: if you only add objects to an array that are not already
       | contained in said array, you won't have to iterate back through
       | it to remove the duplicates you created.
       | 
       | Wild that a pattern like this would be part of git-core, but I
       | guess we all overlook stuff on a regular basis.
        
       | katella wrote:
       | I don't know much about git backups, but why would there be race
       | conditions if you just create the backup off the local repo
       | instead of remote?
        
       | KETpXDDzR wrote:
       | With git it should be straightforward to implement an
       | incremental, deduplicating backup solution, since all objects
       | are stored with their hashes as filenames.
        
         | mdaniel wrote:
         | Run `man git-repack` or its `man git-gc` friend and recall that
         | most filesystems _hate_ dealing with a bazillion small files
         | 
         | I think there have been several attempts to use S3-ish
         | blobstores as git backends but of the two major git hosters,
         | only one of them is MIT licensed and they don't attempt that
         | stunt, so safe to assume it's not a viable approach
        
       | DeepYogurt wrote:
       | I miss a good tech blog post. Cheers for this
        
       | ianks wrote:
       | Reminds me of the timeless advice I got for acing leet code
       | interviews: "remember to use a hash map." Tends to work out
       | pretty well.
        
       | SkiFire13 wrote:
       | > Ultimately, we traced the issue to a 15-year-old Git function
       | with O(N^2) complexity and fixed it with an algorithmic change,
       | reducing backup times exponentially.
       | 
       | I have yet to finish the article, but this means they improved
       | the complexity to something like O(log N), _right_? I hate when
       | people confuse a quadratic improvement for an exponential one.
        
         | globular-toast wrote:
         | Really annoying to see someone use exponential wrong when
         | they're talking about performance of algorithms. We're supposed
         | to know what this means! They went from quadratic to linear.
        
       | jas39 wrote:
       | What a poor write-up, neither defining the problem nor the
       | solution with any clarity. Did anyone come away with any
       | information that can be used for anything?
        
       | fHr wrote:
       | absolute chads, nice!
        
       | hinkley wrote:
       | This feels like some of the problems I've ended up tackling. The
       | people too close to the problem don't see it so someone has to
       | reach across the aisle and make something happen in code they
       | normally wouldn't touch.
       | 
       | Even with the quadratic complexity I suspect this problem has
       | been brewing for a while. This commit has been biding its time
       | for 16 years waiting to ruin someone's day (let's not confuse the
       | utility of Make it Work early in a project with justifying
       | _retaining_ that code in perpetuity. Make it Right, Make it
       | Fast.) Backups have probably been going over six hours for years
       | and steadily climbing.
       | 
       | "We" feels a lot like one or two people jumping on a grenade.
        
       | gbacon wrote:
       | Algorithmic improvements beat micro-optimizations.
        
       | James_K wrote:
       | Perhaps nested for loops should be some kind of compiler warning.
        
       | abhashanand1501 wrote:
       | Lots of comments are complaining that going from O(n^2) to
       | O(log n) is not an exponential improvement, but it is indeed an
       | exponential improvement. In fact O(n) is exponentially more than
       | O(log n).
        
       ___________________________________________________________________
       (page generated 2025-06-08 23:02 UTC)