[HN Gopher] How we decreased GitLab repo backup times from 48 ho...
___________________________________________________________________
How we decreased GitLab repo backup times from 48 hours to 41
minutes
Author : immortaljoe
Score : 290 points
Date : 2025-06-06 15:43 UTC (7 hours ago)
(HTM) web link (about.gitlab.com)
(TXT) w3m dump (about.gitlab.com)
| judofyr wrote:
| See also: https://www.tumblr.com/accidentallyquadratic
|
| Quadratic complexity sits in an awkward sweet spot: Fast enough
| for medium-sized _n_ to pass first QA, but doomed to fail
| eventually as _n_ grows.
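The "sweet spot" effect described above can be sketched quickly (illustrative Python, not anything from GitLab's code): an O(n^2) loop roughly quadruples in cost each time n doubles, so it feels instant in small-n testing and blows up later.

```python
import timeit

def quadratic_pass(items):
    # Stand-in O(n^2) work: compare every element against every other.
    hits = 0
    for a in items:
        for b in items:
            if a == b:
                hits += 1
    return hits

# Doubling n roughly quadruples the runtime.
for n in (500, 1000, 2000):
    t = timeit.timeit(lambda: quadratic_pass(list(range(n))), number=1)
    print(f"n={n:5d}  elapsed={t:.4f}s")
```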
| vjerancrnjak wrote:
| This particular change was not accidental. It was announced as
| quadratic.
| jeffbee wrote:
| Are there any reimplementations of git, by professional
| programmers using real tools? The source in question -- object.c
| -- is "banging rocks together" material.
| landr0id wrote:
| https://github.com/GitoxideLabs/gitoxide
| umanwizard wrote:
| Ignoring your inaccurate flamebait and answering the underlying
| question: there are several reimplementations of git, including
| "got" (Game of Trees) from the OpenBSD developers, jgit
| (reimplementation in Java), the Haskell git package, the
| gitoxide rust crate (and I assume several more half-finished
| Rust hobby projects), and probably others; that's just what I
| found with a cursory google.
| Vilian wrote:
| Git, but as it was intended to be used
| kgeist wrote:
| We use go-git in one of our Go projects that relies on Git for
| version control of its data.
| hiddew wrote:
| "fixed it with an algorithmic change, reducing backup times
| exponentially"
|
| If the backup times were O(n^2), are they now O(n^2 / 2^n)? I
| would guess not.
| umanwizard wrote:
| This is not the precise mathematical definition of exponential,
| but rather the colloquial one, where it just means "a lot".
| cvoss wrote:
| You shouldn't use a word that can carry a precise
| mathematical meaning in a sentence that literally uses
| mathematical notation in order to speak precisely and then
| expect readers not to interpret the word in the precise
| mathematical way.
| blharr wrote:
| I somewhat agree, but for lack of a better word, what would
| you use? Quadratically doesn't have the same punch
| umanwizard wrote:
| "Dramatically" ?
| remram wrote:
| "a lot"
| jjmarr wrote:
| "by a factor of `n`" also sounds impressive.
| jrochkind1 wrote:
| If you just mean "a lot" in a non-technical sense, there
| are plenty of words available. enormously. immensely.
| tremendously. remarkably. incredibly. vastly.
| IshKebab wrote:
| You should if you expect your readers to be normal humans
| who understand obvious context, and not pedantic HN readers
| who understand obvious context but delight in nit-picking
| it anyway.
| Dylan16807 wrote:
| You _can_, but it's not _should_.
| sneak wrote:
| "Words mean things."
|
| If you can't agree with this, then you shouldn't be
| speaking or writing, IMO.
|
| Those who argue that words that mean different things are
| actually equivalent have no business dealing with language.
| marcellus23 wrote:
| Meaningless and non-constructive pedantry.
| chrisweekly wrote:
| I'm not the OP you're responding to, but to be fair, in a
| sentence about big-O perf characteristics, which includes the
| word "algorithms", using "exponentially" in a colloquial non-
| technical sense is an absolutely terrible word choice.
| linguistbreaker wrote:
| Exponentially bad word choice even... since we're using
| that word however we want now?
|
| I don't think this is meaningless or non-constructive
| pedantry - we're a technical community and those are
| technical words.
| msgodel wrote:
| I disagree. Misuse of the word "exponential" is a major pet
| peeve of mine. It's a particular case of the much more common
| "use mathematically precise phrasing to sound
| careful/precise" that you often find in less than honest
| writing.
|
| Here they are actually using it to refer to growth functions
| (which is rare for this error) and being honest (which is
| also rare IMO) but it's still wrong. They should have written
| about quadratic or quadratic vs linear.
|
| Regardless sloppy language leads to sloppy thought.
| keybored wrote:
| Sloppy writing is up orders of magnitude lately.
| 0cf8612b2e1e wrote:
| It will decimate readership.
| sneak wrote:
| My personal pet peeve is when the term "exponentially" is
| used to refer to a change between precisely two data
| points.
|
| It's a very specific subset of the one you're describing.
|
| "Tell me you don't know what you're talking about without
| telling me you don't know what you're talking about."
| csnweb wrote:
| If you replace an n^2 algorithm with a log(n) lookup you get an
| exponential speed up. Although a hashmap lookup is usually
| O(1), which is even faster.
| ryao wrote:
| That is not true unless n^C / e^n = log(n) where C is some
| constant, which it is not. The difference between log(n) and
| some polynomial is logarithmic, not exponential.
| gre wrote:
| looks like the commit is here:
| https://github.com/git/git/commit/a52d459e72b890c192485002ec...
| mortar wrote:
| Thanks, this helped me find the submission and related
| discussion thread -
| https://lore.kernel.org/git/20250401-488-generating-bundles-...
| SomaticPirate wrote:
| How was the flame graph created? (Not very familiar with C and
| the performance tools around it)
| plorkyeran wrote:
| https://github.com/brendangregg/FlameGraph
|
| You record performance data with `perf`, then use the scripts
| there to turn it into an SVG.
| immortaljoe wrote:
| OP here. In this particular case, I used
| https://github.com/flamegraph-rs/flamegraph
| IshKebab wrote:
| I strongly recommend not using this. Instead use pprof - it
| has a MUCH better interactive flamegraph, plus other nice
| performance visualisations (e.g. a call graph):
|
| https://github.com/google/pprof
|
|     go install github.com/google/pprof@latest
|     pprof -http=: prof.out
|
| I normally collect the profiles with gperftools
| (https://github.com/gperftools/gperftools) and then just
|
|     LD_PRELOAD=/usr/lib/libtcmalloc_and_profiler.so \
|         CPUPROFILE=prof.out <your command>
|
| I've been meaning to try Samply though. Not sure if it works
| with pprof.
| landr0id wrote:
| You can also use https://github.com/mstange/samply to make
| recording and viewing in the Firefox profiler easier.
|
| It will spin up a localhost server after the trace ends, the
| profiler uses the localhost server and nothing is shared with
| Firefox servers unless you explicitly choose to upload the data
| and create a permalink.
| edflsafoiewq wrote:
| So they replaced a nested for loop for checking for duplicates
| with a set data structure. Surprised such a common thing was in
| git.
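The shape of that fix can be sketched in Python (the real change is in Git's C code and uses Git's internal `strset` type; this only mirrors the structure):

```python
def dedup_quadratic(refs):
    # Before: each ref is checked by scanning everything kept so far -> O(n^2).
    kept = []
    for ref in refs:
        if ref not in kept:        # O(n) list scan per element
            kept.append(ref)
    return kept

def dedup_hashed(refs):
    # After: one O(1) average-time set lookup per ref -> O(n) overall.
    seen, kept = set(), []
    for ref in refs:
        if ref not in seen:
            seen.add(ref)
            kept.append(ref)
    return kept

refs = ["refs/heads/main", "refs/tags/v1.0", "refs/heads/main"]
print(dedup_hashed(refs))  # ['refs/heads/main', 'refs/tags/v1.0']
```

Both versions return the same order; only the cost of the membership test changes.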
| tomxor wrote:
| One of the many reasons I moved to self hosting. I use ZFS to
| backup every 15 minutes, ... could do it even more frequently but
| that seems a little pointless.
|
| Also moved away from Gitlab because it's so damn slow.
| glookler wrote:
| ZFS is great, getting off gitlab would also be great, what did
| you switch to?
| haroldp wrote:
| This was my thought too, but I figured I just didn't understand
| the problem. Why use git commands to backup when a file system
| copy seems like it would do? zfs snapshot takes less than a
| second on any size repo. zfs send transfers just the changes
| since the last backup, as fast as your network, more or less.
| mdaniel wrote:
| It for sure has an excruciatingly slow UI, but I haven't yet
| been able to stomach the Ruby magick enough to dig in and find
| out if it's _all of GitLab_ that's slow, and thus their
| culture of who-cares just propagates to the UI, or it's
| literally just the web frontend that's molasses yet because so
| many folks interact with it over the web that's where the
| conclusion ends up
|
| I know their backend git proxy is written in golang, their
| runner agent is written in golang, it spawns CI jobs using
| containerd, written in golang, and they use postgresql and a
| redis-esque KV -- although for that part I do believe they're
| still using Sidekiq _(Ruby)_ for doing job dispatch, so that
| one could very easily lolol back into not actioning tasks
| efficiently
|
| GitHub Actions sure does enjoy just sitting there twiddling its
| thumbs when I push "run job," and is also both Ruby and their
| own crazypants ajax-y web framework, so I don't think it's
| exactly the shining star of performance itself
| treesknees wrote:
| Duplicate of https://news.ycombinator.com/item?id=44197061
| mdaniel wrote:
| Since that one only has one comment, I'm pretty sure that means
| the HN front-page is luck of the draw and I wouldn't have
| bothered even commenting about the other one
| LgWoodenBadger wrote:
| IME, it has always turned out to be the correct decision to
| eliminate any n^2 operation in anything I've written.
|
| I don't write exotic algorithms, but it's always astounding how
| small n needs to be to become observably problematic.
| parthdesai wrote:
| My rule of thumb for 80%-90% of the problems is, if you need
| complicated algorithm, it means your data model isn't right.
| Sure, you do need complicated algorithms for compilers, db
| internals, route planning et al., but all things considered,
| those are minority of the use cases.
| anttihaapala wrote:
| This is not a complicated algorithm. A hash map (dictionary)
| or a hash set is how you would always do deduplication in
| Python, because it is easiest to write / least keystrokes
| anyway. That is not the case in C though, as it is much
| easier to use arrays and nested loops instead of hash maps.
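As the comment says, deduplication in Python is a one-liner either way; the only wrinkle is whether you need to preserve first-seen order:

```python
items = ["b", "a", "b", "c", "a"]
unique = set(items)                   # hash set; order not preserved
ordered = list(dict.fromkeys(items))  # dict keys keep first-insertion order
print(sorted(unique))  # ['a', 'b', 'c']
print(ordered)         # ['b', 'a', 'c']
```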
| paulddraper wrote:
| This isn't knapsack. This is a dict lookup.
| koala_man wrote:
| Good call. O(N^2) is the worst time complexity because it's
| fast enough to be instantaneous in all your testing, but slow
| enough to explode in prod.
|
| I've seen it several times before, and it's exactly what
| happened here.
| david422 wrote:
| We just had this exact problem. Tests ran great, production
| slowed to a crawl.
| SAI_Peregrinus wrote:
| I'd say the exception is when `n` is under about 10, and is
| counting some sort of hardware constrained thing (e.g. some
| operation over all CAN interfaces present on an OBDII connector
| can be O(n^2) since n will always be between 1 and 4). If you
| wouldn't have to physically replace hardware for `n` to
| increase, you _really_ need to avoid n^2 operations. And even
| then consider them carefully, perhaps explicitly failing if `n`
| gets too big to allow for noticing rework is needed before new
| hardware hits the field.
| echelon wrote:
| > perhaps explicitly failing if `n` gets too big
|
| That's the problem. A lot of these quadratic time algorithms
| don't set limits.
|
| Even 'n!' is fine for small 'n'. Real production use cases
| don't have small 'n'.
| lassejansen wrote:
| Or, phrased differently, if n has an upper limit, the
| algorithm is O(1).
| connicpu wrote:
| As long as you have tests that regularly exercise your
| algorithm at n=max where you would notice if they were
| exceptionally slow
| masklinn wrote:
| > Real production use cases don't have small 'n'.
|
| Real production use cases absolutely have small n. You
| don't hear about them, because it's very hard for them to
| cause issues. Unless the use case changes and now the n is
| not small anymore and nobody noticed the trap.
| haiku2077 wrote:
| I have an app that's been running an O n^2 algorithm in
| "production" (free open source app used by various
| communities) for about half a year now.
|
| It's been fine because "n" is "number of aircraft flying in
| this flight simulator" - and the simulator's engine starts
| to fail above around 2000 anyway. So even in the worst case
| it's still going to run within milliseconds.
| EdwardCoffin wrote:
| Bruce Dawson says: _I like to call this Dawson's first law of
| computing: O(n^2) is the sweet spot of badly scaling
| algorithms: fast enough to make it into production, but slow
| enough to make things fall down once it gets there._
|
| https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
| paulddraper wrote:
| The second law is that O(n * log n) is for practical intents
| and purposes O(n).
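The arithmetic behind that rule of thumb, as a quick sketch: log2(n) grows so slowly that the extra factor barely moves even as n explodes.

```python
import math

for n in (10**3, 10**6, 10**9):
    print(f"n={n:>13,}  log2(n)={math.log2(n):5.1f}")
# n grows a million-fold from 10^3 to 10^9, but log2(n) merely triples
# (roughly 10 -> 30), so n*log2(n) stays within a small factor of linear.
```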
| EdwardCoffin wrote:
| To be clear though, that isn't _his_ second law, at least
| as of two months ago, according to
| https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
| to11mtm wrote:
| Fair, but `n log n` definitely is the historical "good
| enough to actually sleep at night" in my head, every time
| I see it I think of the prof who taught my first CSC
| course and our data structures course due to how often it
| came up.
|
| Also, the wise statement that 'memory is fairly cheap
| compared to CPU for scaling'. It's insane to see how
| often folks would rather manually open and scan a
| 'static-on-deploy' 20-100MB Json file for each request vs
| just parsing it into structures in memory (where, for
| most cases, the in memory usage is a fraction of the json
| itself) and just caching the parsed structure for the
| length of the application.
| hinkley wrote:
| Not often, but occasionally I will choose the nlogn
| algorithm which obviously has no bugs over the O(n)
| algorithm with no obvious bugs.
|
| Less brittleness is worth paying a few percent.
| Especially if it unmuddies the waters enough for someone
| to spot other accidental (time) complexity.
| hinkley wrote:
| But sometimes a big enough C can flip which solution helps
| you hit your margins.
| hinkley wrote:
| I made the "mistake" of equating two super-quadratic
| solutions in an interview. What I meant was what
| Dawson meant. It doesn't matter because they're both too
| ridiculous to even discuss.
| dieortin wrote:
| They're too ridiculous... unless a more optimal solution
| does not exist
| Arcuru wrote:
| 48 hours is a crazy amount of time to spend just to compress a
| git folder, it's only a couple GB. 41 minutes still seems like
| quite a long time.
|
| Why aren't they just snapshotting and archiving the full git
| repo? Does `git bundle` add something over frequent ZFS backups?
| bombela wrote:
| > Be aware that even with these recommendations, syncing in
| this way has some risk since it bypasses Git's normal integrity
| checking for repositories, so having backups is advised. You
| may also wish to do a git fsck to verify the integrity of your
| data on the destination system after syncing.
|
| https://git-scm.com/docs/gitfaq#_transfers
|
| It doesn't tell you how to make a backup safely though.
|
| On a personal scale, Syncthing and Btrfs snapshots work plenty
| good enough. It's as fast as the storage/network too.
| nightfly wrote:
| Syncthing is the only way I've ever corrupted a git repo
| before
| hobofan wrote:
| I think that's why they specified the "BTRFS snapshots"
| part. Yes, directly syncing a .git directory seems like a
| recipe for disaster with how often I've seen individual
| files lagging to sync, but I guess with BTRFS snaphots one
| can ensure that only a consistent view of a git directory
| is being backed up and synced.
| bombela wrote:
| Nah I truly do it the wrong way around. Syncthing on the
| git repos. And one of my devices in the Syncthing cluster
| does btrfs snapshots minutely for recovery and further
| backups.
|
| Because it's at a personal scale, the only time I can
| corrupt a git repo is if I work on the same repo (and
| it's workdir) from more than one device in the time it
| takes for Syncthing to replicate the changes.
|
| But even then it's not a big deal because git fsck is
| quick. And I have my snapshots, and the syncthing
| versioning, and git defaults to two weeks before pruning.
| And because of how git works, using hash to identify
| contents, files are not easily overwritten either.
|
| In 10y I only had one git corruption (I ran a command on
| the same repo on a different machine via ssh, yielding a
| synctning conflict). Syncthing kept copies of the
| conflict file. One commit disappeared from the history
| but not from the database. It was easy to rebase the
| changes. I think I used git fsck to delete the syncthing
| versioned files.
| unsnap_biceps wrote:
| zfs snapshots are difficult to offsite to non-zfs replicas,
| like an S3 bucket.
|
| That said, there's another lesser-known feature that bundles help
| out with when used with `git clone --bundle-uri`: the client can
| specify a location to a bundle, or the server can send the
| client the bundle location in the clone results and the client
| can fetch the bundle, unpack it, and then update the delta via
| the git server, so it's a lot lighter weight on the server for
| cloning large repos, and _a ton_ faster for the client for
| initial clones.
| nightfly wrote:
| ZFS can send to file or whatever you want to pipe to, you can
| have incremental sends, and if you convert to bookmarks on
| the sender you don't have to keep the historical data after
| you send it
| benlivengood wrote:
| I think if you want consistent snapshot backups on non-zfs
| destinations the safest thing is to clone the snapshot and
| rsync from the clone. Not a single-step operation but
| preserves the atomicity of the snapshot.
|
| EDIT: you could also rsync from a .zfs snapshot directory if
| you have them enabled.
| AtlasBarfed wrote:
| ... so they added caching to things that should have been
| cached?
|
| ... is this really the way people "back up" git repos? I mean,
| it is git, so isn't there some way to mirror changes to the
| repo in another repo and just use ZFS / snapshots / backup
| software / etc to do that? It's a distributed version control
| system. Just make sure the version control information is ...
| distributed?
| DamonHD wrote:
| Here comes an unpopular nitpick: "... we traced the issue to a
| 15-year-old Git function with O(N^2) complexity and fixed it with
| an algorithmic change, reducing backup times exponentially."
|
| Uh no you didn't. Not possible. At most a polynomial reduction
| is possible, else complexity theory needs a rewrite.
|
| (OK, yes, k could be doing some heavy lifting here, but I doubt
| it.)
|
| If you are going to quote a maths formula then please don't use
| "exponentially" to mean "lots".
|
| I stopped reading there: I don't want to have to read each word
| and wonder if they actually meant it, or it's just bad hype.
| immortaljoe wrote:
| OP here. Feedback is always welcome, I did mean exponentially
| in the colloquial sense. I do see how it is confusing here,
| will change it.
| DamonHD wrote:
| Thank you.
|
| (I don't think that anyone should use "exponentially" that
| way: it is an art term with a specific and particular
| meaning, so find another word if you mean something else!
| Like misusing specific legal or sporting terms...)
| ctxc wrote:
| Art term?
| wizzwizz4 wrote:
| https://en.wiktionary.org/wiki/term_of_art
| ctxc wrote:
| Ah, thanks. I could only think of art as in literal,
| artistic art.
| kccqzy wrote:
| If you have that tendency, you just need to think of
| TAOCP, this industry's best distillation of intellectual
| achievement, with the word "art" in its name.
| taftster wrote:
| Which term is appropriate here? What would you suggest?
| (honest question)
| serial_dev wrote:
| You can't use exponentially in the colloquial sense in a post
| about software performance.
|
| If my barista says that his new coffee is exponentially
| better, it's ok to use it colloquially.
|
| If my git hosting provider writes about an impactful
| performance improvement, it's not.
| bombela wrote:
| I can forgive exponential as a figure of speech. You missed
| when they describe using a map for the side effect of [key]
| deduplication.
|
| Instead of saying "set". The code itself uses the type
| "strset".
| Lammy wrote:
| > What this means for GitLab customers -- [a bunch of stuff about
| how customers can now back up more frequently and more robustly]
|
| Realtalk: They should rewrite this post's headline to be in a
| positive tense instead of leading with a negative word. I'm glad
| I read the post, because it is a cool and good fix, but I saw
| "Decreasing [...] repo backup" and my first thought was that it
| was an announcement of some service downgrade like some sort of
| cost-cutting measure.
| serial_dev wrote:
| I don't think it's unreasonable to expect interested people to
| read five words from the title. I know people don't always do
| that, but complaining about it as if they did something wrong
| is ridiculous.
| robertlagrant wrote:
| No one was complaining.
| divbzero wrote:
| The performance improvement that GitLab contributed to Git is
| slated to be released with v2.50.0:
|
| https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
| pjmlp wrote:
| A very good example that writing code in C doesn't help for
| performance, when the algorithms or data structures aren't
| properly taken into consideration.
| IshKebab wrote:
| I would say C makes this sort of thing far more likely because
| it's usually a ton of effort to obtain suitable containers. In
| C++ or Rust they have plenty of things like
| `unordered_set`/`HashSet` built in, so people are much more
| likely to use it and not go "eh, I'll use a for loop".
|
| In this case Git already had a string set, but it's still not
| standard so there's a good chance the original author just
| didn't know about it.
| pjmlp wrote:
| Yeah, not something that WG14 will ever care about.
| masklinn wrote:
| > In this case Git already had a string set, but it's still
| not standard so there's a good chance the original author
| just didn't know about it.
|
| The original commit was made in January 2009
| (https://github.com/git/git/commit/b2a6d1c6868b6d5e7d2b4fa912...),
| strmap was added in November 2020
| (https://github.com/git/git/commit/ae20bf1ad98bdc716879a8da99...,
| strset was added a few days later:
| https://github.com/git/git/commit/1201eb628ac753af5751258466...).
| It was first proposed in 2018
| (https://lore.kernel.org/git/20180906191203.GA26184@sigill.in...
| the proposal specifically mentions it fixing possibly quadratic
| sites).
|
| As noted in the comment, git _did_ have a sorted string list
| with bisection search, and that's from 2008 (and it actually
| dates back to 2006 as the "path list" API, before it was
| renamed following the realisation that it was a generalised
| string list). Though as the hashmap proposal notes, it's a
| bit tricky because there's a single type with functions for
| sorted and functions for unsorted operations, you need to
| know whether your list is sorted or not independent of its
| type.
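The two lookup structures being contrasted can be sketched in Python: a sorted list with bisection (O(log n) per lookup, like Git's older string_list API, which only works if the list really is kept sorted) versus a hash set (O(1) average, like strset, with no ordering invariant to maintain).

```python
import bisect

def sorted_contains(sorted_items, needle):
    # O(log n) bisection search; caller must keep the list sorted.
    i = bisect.bisect_left(sorted_items, needle)
    return i < len(sorted_items) and sorted_items[i] == needle

refs = sorted(["refs/heads/main", "refs/heads/dev", "refs/tags/v1.0"])
print(sorted_contains(refs, "refs/heads/dev"))   # True
print(sorted_contains(refs, "refs/heads/gone"))  # False

# The strset-style alternative: no sortedness to keep track of.
refset = set(refs)
print("refs/heads/dev" in refset)                # True
```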
| ashishb wrote:
| O(n^2) is fast enough to end up in production and slow enough to
| cause problems at scale.
|
| The most troublesome cases of inefficiency in production are
| almost always O(n^2).
| divbzero wrote:
| There seems to be a lesson here about the balance between
| premature _vs._ anticipatory optimization. We're generally warned
| against premature optimization but perhaps, as a rule of thumb,
| we should look for optimizations in frequently-called functions
| that are obvious and not onerous to implement.
| Quekid5 wrote:
| If a set-of-strings was trivially available in the source
| language (at time of implementation) the original programmer
| would probably have done this (relatively) trivial
| optimization... This is a symptom of anemic languages like C.
| prymitive wrote:
| Well done, next maybe make the web ui usable because right now
| there's absolutely no distinction between the UI itself and the
| user content, which IMHO combined with action buttons in the
| middle of the page makes for a really poor ux.
| SoftTalker wrote:
| Cool discovery but the article could have been about 1/10 as long
| and still communicated effectively. At least they didn't post it
| as a video, so it was easy to skim to the important details.
| kokada wrote:
| Yes. I read the whole article thinking that this must have been
| generated by an LLM, because the style at least resembles it.
| ruuda wrote:
| That was also my thought.
| 1oooqooq wrote:
| it could have been longer. I still don't know why they were
| doing backup bundles with two refs :)
| bob1029 wrote:
| I'm confused why you wouldn't simply snapshot the block-level
| device if the protocol of the information on top is going to
| cause this much headache. Quiescing git operations for block
| level activity is probably not trivial, but it sounds like an
| easier problem to solve to me.
|
| This is the approach I've taken with SQLite in production
| environments. Turn on WAL and the problem gets even easier to
| solve. Customer configures the VM for snapshots every X minutes.
| Git presumably doesn't have something approximating a WAL, so I
| understand the hesitation with this path. But, I still think the
| overall strategy is much more viable and robust to weird edges
| within git.
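The WAL setup described above can be sketched with Python's bundled sqlite3 module (the file path here is illustrative; the snapshot machinery itself is outside the sketch):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "app.db")
conn = sqlite3.connect(path)
# Switch to write-ahead logging; SQLite reports back the resulting mode.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # "wal"
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO kv VALUES ('a', '1')")
conn.commit()
conn.close()
```

With WAL on, writers append to a separate `-wal` file instead of rewriting the main database in place, which is what makes point-in-time block snapshots of the files more forgiving.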
| nonameiguess wrote:
| Gitlab isn't just a managed service. They release the software
| for self-hosted instances as well. There is no guarantee or
| requirement that users all run Gitlab on the same filesystem or
| even a filesystem that supports block level snapshots at all.
| Presumably, they want a universal backup system that works for
| all Gitlabs.
| bob1029 wrote:
| I've never heard of a .git folder that spanned multiple
| filesystems. It sounds like we are now conflating the git
| workspace with everything else in the product.
|
| There are system requirements that a customer would be
| expected to adhere to if they wanted a valid enterprise
| support contract with one of these vendors.
| to11mtm wrote:
| I think GP's point is that the filesystems used by someone
| self-hosting gitlab may not be the same as what gitlab
| themselves are using.
|
| File systems can be weird. Sometimes the OS can be weird
| and fsync type calls may not do what you expect. At least
| at one point MacOS fsync didn't behave the same way as
| Linux (i.e. Linux should ensure the write is truly done and
| not just in cache so long as the drive isn't lying). [0]
|
| > There are system requirements that a customer would be
| expected to adhere to if they wanted a valid enterprise
| support contract with one of these vendors.
|
| Gitlab has a community edition. Not handling data well
| would be bad for their public image.
|
| [0] - https://news.ycombinator.com/item?id=30372218
| nonameiguess wrote:
| I can't reply to the other reply for some reason, but what
| they said is indeed what I meant. gitlab.com might be
| running their Gitaly servers on btrfs or zfs or lvm volumes
| or whatever, but other customers may be using ext2. Gitlab
| the company _could_ require customers to only run Gitaly on
| a specific filesystem, but up to now, they never have, it
| would be pretty shitty to suddenly change their minds after
| a decade plus of establishing one expectation, and whoever
| the developer is who submitted the patch to upstream Git
| and got a technical blog post out of it has absolutely no
| power to dictate contract terms to enterprise customers
| personally and instead did what is actually in their power
| to do.
| amtamt wrote:
| > This is the approach I've taken with SQLite in production
| environments. Turn on WAL and the problem gets even easier to
| solve.
|
| A few months back a better solution was provided by SQLite:
| sqlite3_rsync
|
| https://www.sqlite.org/rsync.html
| lbotos wrote:
| > Git presumably doesn't have something approximating a WAL, so
| I understand the hesitation with this path.
|
| Bingo. One of the worst problems is helping a client piece back
| together a corrupted repo when they are using snapshots. Check
| my profile to see how I know. :)
|
| It's usually an OMG down scenario, and then you are adding in
| the "oh no, now the restore is corrupted."
|
| It's fixable but it's definitely annoying.
| einpoklum wrote:
| I had a somewhat similar experience when writing a "remove
| duplicates" extension for Thunderbird:
|
| https://addons.thunderbird.net/en-US/thunderbird/addon/remov...
|
| I used a hash to begin with, applying a simplistic digest
| function to the message headers I was comparing, getting a
| 4-byte hash key.
|
| That worked, but was kind of slow.
|
| Finally, the idea came to me to not apply _any_ digesting, and
| use the combined concatenated headers of the messages as hash
| keys. About 2k bytes per hash key!
|
| The result: About 20x perf improvement if memory serves.
|
| How is that possible? The reason is that the code all runs in a
| Javascript machine; and applying the digest was not a built-in
| function, it was looping over the headers and doing the
| arithmetic. Thousands upon thousands of JS abstract machine
| steps. The use of the large hash key may be inefficient, but -
| it's just one JS object / dictionary operation, and one of the
| most heavily-optimized in any implementation.
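A Python analogue of the effect described above (the original was Thunderbird JavaScript): a per-character digest loop written in the interpreted language costs thousands of interpreter steps per message, while using the whole ~2 KB header string as a dict key leans on the runtime's native, heavily optimized string hashing.

```python
import timeit

headers = "Subject: hello\nFrom: a@example.com\nMessage-ID: <x@y>\n" * 40

def interpreted_digest(s):
    # One interpreter loop iteration per character.
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

t_loop = timeit.timeit(lambda: interpreted_digest(headers), number=2000)
# Concatenating "x" forces a fresh string each call, since CPython
# caches a str's hash after the first computation.
t_hash = timeit.timeit(lambda: hash(headers + "x"), number=2000)
print(f"interpreted digest: {t_loop:.4f}s  native hash: {t_hash:.4f}s")
```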
| sneak wrote:
| ...and they say leetcode-style problems are out of date.
|
| While perhaps implementing low level sorts is silly, knowing
| which data structures to use and when, and what underlying big-o
| performance they have, is critical, as demonstrated here.
| kristianp wrote:
| It would have been nice if the post included some details, such
| as before and after code.
| niffydroid wrote:
| Why use this method over just having a script do a clone or pull
| elsewhere? Git is about distribution, right?
___________________________________________________________________
(page generated 2025-06-06 23:00 UTC)