[HN Gopher] Hotspot performance engineering fails
___________________________________________________________________
Hotspot performance engineering fails
Author : slimsag
Score : 97 points
Date : 2023-04-27 16:40 UTC (6 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| PaulHoule wrote:
| I went through a time when I was pitching a "boxes-and-lines"
| data processing tool like
|
| https://www.knime.com/
|
| which more-or-less passed JSON documents (instead of SQL rows)
| over the lines and found that the kind of people who bought and
| financed database startups wouldn't touch anything that couldn't
| be implemented with columnar processing.
|
| I thought that this kind of system would advance the "low code"
| nature of these systems because with relational rows many kinds
| of data processing require splitting up the data into streams and
| joining them whereas an object-relational system lets you
| localize processing in a small area of the graph and also be able
| to reuse parts of a computation.
|
| Columnar processing is so much faster than row-based processing
| that most investors and partners thought that customers _really_
| needed speed at the expense of being able to write simpler
| pipelines. Even though I had a nice demo of a hybrid batch
| /stream processing system (that gave correct answers), none of
| them cared. Thus, from one viewpoint, architecture is everything.
|
| (Funny though, I later worked for a company that had a system
| like this that wasn't quite sure what algebra the tool worked on
| and the tool didn't quite always get the same answer on each
| run...)
| NovemberWhiskey wrote:
| I don't think I entirely agree with the premise here. Yes, it is
| extremely difficult to engineer performance in after the fact;
| but assuming you've got an architecture that's basically fit for
| purpose (from the performance perspective), then improving by
| targeting hotspots is sound, isn't it? That's literally Amdahl's
| law.
| Sesse__ wrote:
| Amdahl's law is specifically about the futility of optimizing
| by removing hotspots... (Or rather, that it can only take you
| so far.)
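Amdahl's law, which the two comments above are debating, caps the overall speedup available from any single hotspot. A minimal sketch (the function name is illustrative, not from the thread):

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is sped up by factor s.

    The remaining (1 - p) of the work is untouched, which caps the
    achievable gain no matter how large s gets.
    """
    return 1.0 / ((1.0 - p) + p / s)

# A hotspot covering 90% of the runtime, made 10x faster:
print(amdahl_speedup(0.9, 10))   # ~5.26x overall
# Even an infinitely fast hotspot cannot beat 1 / (1 - p):
print(amdahl_speedup(0.9, 1e9))  # approaches, but never reaches, 10x
```

This is why hotspot removal "can only take you so far": once the hotspot is gone, the untouched 10% dominates.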
| tasubotadas wrote:
| The guy invents a strawman to justify premature optimization.
| MattPalmer1086 wrote:
| It was interesting to read about why hotspots are not the whole
| story in performance. They are still important though.
|
| Facebook may have the resources and/or need to do complete
| rewrites of everything to squeeze out more performance, but most
| companies don't.
|
| I've personally improved performance of a lot of code
| significantly by identifying hot spots. So calling hotspot
| performance engineering a fail seems a bit unnecessarily
| provocative.
| jesse__ wrote:
| > Facebook may have the resources and/or need to do complete
| rewrites of everything to squeeze out more performance, but
| most companies don't.
|
| Actually, if you watch the video Casey put together, he very
| clearly demonstrates most companies _do_.
| continuational wrote:
| This is true: when you keep optimizing, you soon face death from
| a thousand paper cuts. But often, it's enough to find that
| bottleneck and make it a few times faster.
| hinkley wrote:
| The solution to this is zone defense instead of man-to-man.
|
| The sad fact is that a manager won't approve you working on
| something that'll save 1% CPU. But once the tall and medium
| tent poles have been knocked down, that's all there is left.
| There are hundreds of them, and they double or triple your
| response time and/or CPU load.
|
| I've had much, much better outcomes by rejecting trying to
| achieve an N% speedup across the entire app, and instead
| picking one subject area of the code and finding 20% there. You
| deep dive into that section, fully absorbing how it works and
| why it works, and you fix every problem you see that registers
| above the noise floor in your perf tool. Some second and third
| tier performance problems complement each other, and you can
| avoid one entirely by altering the other. The risk of the 1%
| changes can be amortized over both the effort you expended
| learning this code, and the testing time required to validate 3
| large changes scattered across the codebase versus 8 changes in
| the same workflow. Much simpler to explain, much easier to
| verify.
|
| Big wins feel good _now_ but the company comes to expect them.
| In the place where I used this best, I delivered 20%
| performance improvements per release for something like 8
| releases in a row, before I ran out of areas I hadn't touched
| before. Often I'd find a perf issue in how the current section
| of code talks to another, and that would inform what section of
| code I worked on next, while the problem domain was still fresh
| in my brain.
| charcircuit wrote:
| It's always about architecture. In the micro, these are the
| hotspots you optimize; in the macro, these are the large
| rewrites you see.
|
| Performance is not the only thing that you should optimize your
| architecture for. Factors like adaptability, robustness, ease of
| understanding, speed of implementation, maintenance cost, etc.
| are things that you should consider. The factors that are the
| best today are not always still the best in the future, which
| is why rewrites are a part of any software's life cycle.
| aranchelk wrote:
| That stuff is boring. Don't be a killjoy.
| sosodev wrote:
| This is true if your end goal is to have a super fast program but
| that is very rarely the case. The GTA online loading times issues
| went unnoticed for years because Rockstar just didn't care that
| the loading times were long. Users still played the game and
| spent a ton of money.
|
| Performance hotspots often are the difference between acceptable
| and unacceptable performance. I'm sure I'm not the only person
| who has seen that be the case many times.
| hinkley wrote:
| I don't think people understand the ways that we have adapted
| to delays. At least once a month I complain about how when we
| were kids, commercials were when you went for a pee break or to
| get a snack. There was no pause button. Binge-watching on
| streaming always means you have to interrupt or wait
| twenty-five minutes.
|
| I suspect if you spied on a bunch of GTA players you'd find
| them launching the game _and then_ going to the fridge, rather
| than the other way around.
| eklitzke wrote:
| > This is true if your end goal is to have a super fast program
| but that is very rarely the case.
|
| This is true in some banal sense, but kind of misses the point
| that there are certain domains where high performance software
| is a given, and in other domains it may rarely be important. If
| you're working on games, certain types of financial systems,
| autonomous vehicles, operating systems, etc. then high
| performance is critical and something you need to think about
| quite literally from day one.
| tonyarkles wrote:
| > This is true in some banal sense, but kind of misses the
| point that there are certain domains where high performance
| software is a given
|
| I work in a field where we're trying to squeeze the maximum
| amount of juice out of a fixed amount of compute (the
| hardware we're using only gets a rev every couple of years).
| My background (MSc + past work) was in primarily distributed
| systems performance analysis, and we definitely designed our
| system from day one to have an architecture that could
| support high performance.
|
| The GP's comment irks me. There are so many tools I use day-
| to-day that are ancillary to the work I do where the
| performance is absolutely miserable. I stare at them in
| disbelief. I'm processing 500MB/s of high resolution image
| data on about 30W in my primary system. How the hell does it
| take 5 seconds for a friggin' email to load in a local
| application? How does it take 3 seconds _for a password
| search dialog_ to open when I click on it? How does WhatsApp
| consume the same amount of memory as QGIS loaded up with
| hundreds of geoprojected high-resolution images?
|
| I agree that many systems _don't_ require maximum-throughput
| heavy optimization, but there's a spectrum here and it's
| infuriating to me how far left on that spectrum a lot of
| applications are.
| vgatherps wrote:
| I feel the same frustration. I work in a field with
| stupendously tight latency constraints and am shocked by
| the disparity vs how much work we fit into tiny deadlines,
| vs how horrifyingly slow gui software written by well
| resourced mega corporations is.
|
| It feels to me like user interfaces are somehow not
| considered high-performance applications because they
| aren't doing super-high-throughput stuff, they're "just a
| gui", they're running on a phone, etc. All of that is true
| but it misses that guis are latency/determinism sensitive
| applications.
|
| I remember hearing some quote about how Apple was the
| _only_ software company that systematically measured
| response time on their GUIs, and I'd believe it because my
| apple products are by far the snappiest and most responsive
| computing devices I have (the only thing that even competes
| is a very beefy desktop).
| tonyarkles wrote:
| Yeah, exactly, like... we're doing microsecond-precise
| high-bandwidth imaging and processing it real-time (not
| in the Hard Real-Time sense, but in the "we don't have
| enough RAM to buffer more than a couple of seconds worth
| of frames and we don't post-process it after the fact"
| real-time sense) with a team of... 3-5 or so dedicated to
| the end-to-end flow from photons to ML engine to disk.
| The ML models themselves are a different team that we
| just have to bonk once in a while if they do something
| that hurts throughput too badly.
|
| I'm sure we'd be bored as hell working on UI performance
| optimization, but if we could gamify it somehow... :D
| manv1 wrote:
| TL;DR: "It's better to design a fast system from the get-go
| instead of trying to fix a slow system later."
|
| That's basically true. I worked on a system that was
| Java/scala/spring/hibernate and it was just slow. It was slow
| when it was servicing an empty request, and it just went downhill
| from there. They just built it wrong...and they went ahead and
| built it wrong again.
|
| Today, I could replace it with a few hundred lines of node in
| AWS/Lambda and get multiple orders of magnitude of performance.
| pestatije wrote:
| [flagged]
| tonyarkles wrote:
| > Today, I could replace it with a few hundred lines of node in
| AWS/Lambda and get multiple orders of magnitude of performance.
|
| I had a fun bake-off a few years back. I was in more of a
| devOPS role (i.e. mostly Ops but writing code here and there
| when needed) and we needed something akin to an API Gateway but
| with some very domain-specific routing logic. One of the
| developers and I talked it through, he wanted to do Node, I
| suggested it would be a perfect place for Go. We decided to do
| two parallel (~500 LOC) implementations over a weekend and run
| them head-to-head on Monday.
|
| The code, logically, ended up coming out quite similar, which
| made us both pretty happy. Then... we started the benchmarking.
| They were neck and neck! For a fixed level of throughput, Go
| was only winning by maybe 5% on latency. That stayed true up
| until about 10krps, at which point Node flatlined because it
| was saturating a single CPU and Go just kept going and going
| and going until it saturated all of the cores on the VM we were
| testing on.
|
| Could we have scaled out the Node version to multiple nodes in
| the cluster? Sure. At 10krps though, it was already using 2-3x
| the RAM that the Go version was using at 80krps, and
| replicating 8 copies of it vs the 2x we did with the Go version
| (just for redundancy) starts to have non-trivial resource
| costs.
|
| And don't get me wrong, we had a bunch of the exact same
| Java/scala/spring/hibernate type stuff in the system as well,
| and it was dog-ass slow in comparison while also eating RAM
| like it was candy.
| manv1 wrote:
| Yeah, the one time I used go it was pretty good. The big
| question is always whether your stuff spends more time
| waiting or more time processing. For the former, it's node.
| For the latter, it's go.
| ummonk wrote:
| From my experience it's better to just consider performance from
| the get-go, and carefully consider which tech stack you're using
| and how the specific logic / system architecture you've chosen
| will be performant. It's much easier than being stuck with
| performance problems down the road that will need a painful
| rewrite.
|
| The whole mantra of avoiding "premature optimizations" was
| applicable in an era when "optimizations" meant rewriting C code
| in assembly.
| govolckurself wrote:
| [dead]
| secondcoming wrote:
| Well, Lemire is renowned for his SIMD algos.
| dilap wrote:
| Yep.
|
| You need to be thinking about performance from the very
| beginning, if you're ever going to be fast.
|
| Because, like the article said, "overall architecture trumps
| everything". You (probably) can't go back and fix that without
| doing a rewrite.
|
| (Though it can be OK to have particular small parts where say
| "we'll do this in a slow way and it's clear how we'll swap it
| out into a faster way later if it matters".)
|
| But if your approach is just "don't even worry about
| performance, that's premature optimization", you'll be in for a
| world of pain when you want to make it fast.
| attractivechaos wrote:
| A catch in Knuth's famous quote is how to define "premature". I
| am not old enough to see how programmers in his time thought
| about "premature", but my impression is quite a few modern
| programmers think all optimizations are premature.
| smolder wrote:
| The other thing that's changed from the 'every optimization is
| premature' era is that shrinking CPUs don't result in big gains
| in frequency anymore -- Moore's law isn't going to make your
| python run at C speed no matter how long you wait for better
| hardware.
| 0x000xca0xfe wrote:
| And the speed of light ensures that memory latencies won't
| get much better until CPUs are small cubes made of SRAM.
| amluto wrote:
| Come again?
|
| Modern servers seem to have about 100ns latency to main
| memory. The speed of light (actually electrical signals)
| delay is maybe 1-2ns.
| actionfromafar wrote:
| Ehrm. Small _spheres_ of SRAM, if I may.
| speed_spread wrote:
| That's one way of dealing with corner cases.
| cma wrote:
| That and memory latency has improved much slower than
| everything else, so pointer chasing implicit throughout
| languages like Python is just horrendously slow. SRAM for
| bigger cache isn't scaling down anymore either in the last
| several process nodes.
| adamnemecek wrote:
| I agree with the "premature optimization". It's one of those
| phrases like "correlation does not imply causation" that makes
| my blood boil. Like, cool dude, did you just take freshman CS?
| Psychlist wrote:
| If you have a fast design/architecture, you may never need to
| optimise the code at all. But the flip side is that with a bad
| design or bad architecture optimising the implementation won't
| save you. With a sufficiently bad architecture starting again
| is the only reasonable choice.
|
| I've seen code that does "fast" searches of a tree in a dumb
| way come out O(n^10) or worse (at some point you just stop
| counting), and the solution was not to search most of the tree
| at all. Find the relevant node and follow links from that.
|
| Meanwhile in my day job performance really doesn't matter. We
| need a cloud system for the distributed high bandwidth side,
| but the smallest instances we can buy with the necessary
| bandwidth have so much CPU and RAM that even quite bad memory
| leaks take days to bring an instance down. Admittedly this is
| C++ with a sensible design (if I do say so myself) so ... good
| design and architecture means you don't have to optimise.
| turtleyacht wrote:
| > these lines of code [were] pulling data from memory and
| _software cannot beat Physics._ [These] are elementary
| operations...
|
| > measuring big effects is easy, measuring small ones becomes
| impossible because the _action of measuring interacts_ with the
| software
|
| > to multiply the performance by N, you need ... 2^N
| optimizations
|
| > why companies do full rewrite of their code for performance
| saagarjha wrote:
| Why quote these lines?
| turtleyacht wrote:
| Summarizing the article. Also gives me a way to evaluate
| performance/optimization. Ideas to hang hooks on.
| bastawhiz wrote:
| > And that explains why companies do full rewrites of their code
| for performance: the effort needed to squeeze more performance
| from the existing code becomes too much and a complete rewrite is
| cheaper.
|
| The article provides reasons why optimization gets harder, but no
| arguments for why a rewrite is better. It's unclear whether the
| author is arguing for rewrites or whether they're simply pointing
| out why companies take them on.
|
| Arguably, though, companies taking on a full rewrite surely must
| have considered the cost of optimization (versus naively saying
| "the system is slow, replace it!"--though maybe some did).
| Rewrites are big, expensive, and time-consuming. It means new
| bugs and unknown unknowns, and no time to add features or fix
| bugs because you're busy rewriting functional code. It's a
| scapegoat for lack of improvement or progress. You shouldn't take
| one on lightly.
|
| At the same time, this post also neglects that some efficiency
| wins have little to do with the efficiency of the code, but
| rather the efficiency of the logic. An N+1 query in your
| application looks like your database is slow: you're wasting a
| ton of time sitting and waiting for your DB to return
| information! But the real problem is that you're repeatedly going
| back-and-forth to the database to query lots of little pieces of
| information that could have far more efficiently been queried all
| at once.
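A minimal sketch of the N+1 pattern the comment describes, using an in-memory SQLite database (the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
""")

def titles_n_plus_one():
    # N+1: one query for the authors, then one more query per author.
    # Each round trip looks like "the database being slow".
    out = {}
    for aid, name in conn.execute("SELECT id, name FROM authors"):
        out[name] = [t for (t,) in conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (aid,))]
    return out

def titles_joined():
    # Batched: a single join fetches everything in one round trip.
    out = {}
    rows = conn.execute("""
        SELECT a.name, p.title FROM authors a
        JOIN posts p ON p.author_id = a.id ORDER BY p.id
    """)
    for name, title in rows:
        out.setdefault(name, []).append(title)
    return out

assert titles_n_plus_one() == titles_joined()
```

Both return the same data; the joined version simply replaces N small queries with one, which is the kind of logic-level win being described.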
|
| > It is relatively easy to double the performance of an
| unoptimized piece of code, but much harder to multiply it by 10.
| You quickly hit walls that can be unsurmountable: the effort
| needed to double the performance again would just be too much.
|
| That's not really true, though. One bad SQL query can go from
| many seconds or minutes to milliseconds. One accidentally-
| quadratic algorithm can take orders of magnitude more time than a
| linear-time algorithm. One bad regexp can account for the
| majority of a request. Of course, as you fix the biggest
| performance problems, the only problems left are ones that are
| smaller than your biggest ones, so you'll have diminishing
| returns.
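One common shape of the accidentally-quadratic problem mentioned above: repeated membership tests against a list rescan the list each time, while a set makes each lookup roughly constant time (the deduplication example is ours, not from the article):

```python
def dedupe_quadratic(items):
    # `x not in seen` scans the whole list: O(n) per item, O(n^2) total.
    seen = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return seen

def dedupe_linear(items):
    # A set makes each membership test O(1) on average: O(n) total.
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

data = [i % 1000 for i in range(100_000)]
assert dedupe_quadratic(data[:5000]) == dedupe_linear(data[:5000])
```

Same output, orders-of-magnitude difference in work as the input grows, which is exactly the kind of single fix that can beat the "2^N optimizations" rule of thumb.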
|
| But it also begs the question, what choices has your existing
| code made that makes it _ten times_ slower than you want it to
| be? In my experience, you're doing work synchronously that could
| have been put in a queue and worked on asynchronously. It's more
| often "you're doing more work than you should" or "you're being
| inefficient with the resources you have available" than "a
| specific piece of code is computationally inefficient".
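A minimal sketch of moving synchronous work onto a queue with background workers, using only the Python standard library (the doubling "work" is a stand-in for whatever slow task the request path was doing inline):

```python
import queue
import threading

tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()
        if item is None:            # sentinel: shut this worker down
            break
        with lock:
            results.append(item * 2)  # stand-in for the slow work
        tasks.task_done()

# The caller just enqueues and returns immediately;
# the workers drain the queue off the critical path.
workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for i in range(100):
    tasks.put(i)
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()
```

The point is architectural: the enqueue is cheap and the latency-sensitive path no longer waits on the work, which no amount of micro-optimizing the work itself would achieve.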
| nemothekid wrote:
| > _The article provides reasons why optimization gets harder,
| but no arguments for why a rewrite is better. It's unclear
| whether the author is arguing for rewrites or whether they're
| simply pointing out why companies take them on._
|
| He didn't argue a rewrite is just "better"; his argument was
| that a rewrite was the only card on the table. The
| _architecture_ was deficient and to get more performance you
| have to change the architecture, which means a rewrite.
|
| I tend to agree; I take the view that most engineers are smart,
| and compilers/interpreters/virtual machines are even smarter so
| most targeted optimizations aren't going to result in very much
| gain. A codebase full of N+1 queries or unindexed queries never
| cared about performance to begin with.
|
| For true gains, you will have to think about data which is the
| true bottleneck for most applications - getting data from
| memory, the disk or the network will be much longer than any
| instruction cycle. The way memory moves through your
| application is baked into your architecture and changing this
| will almost always involve a rewrite. To your final point,
|
| > _In my experience, you're doing work synchronously that
| could have been put in a queue and worked on asynchronously._
|
| moving from a synchronous codebase to an async one almost
| always involves a rewrite.
| 0x000xca0xfe wrote:
| Optimizing for modern CPUs means optimizing for predictable
| memory accesses and program flow. Minimizing memory usage helps a
| lot, too.
|
| Unfortunately this is pretty counterintuitive and most
| programming languages do not make it easy. And if you optimize
| for size you almost get laughed at.
| helen___keller wrote:
| This whole thing is basically a straw man. "Performance
| engineering works but sometimes it's not enough to overcome a bad
| architecture". Alright, was that actually in question in the
| first place?
| vgatherps wrote:
| You'd be surprised how common the view "performance doesn't
| matter now, we'll just fix the hotspot later" is.
___________________________________________________________________
(page generated 2023-04-27 23:00 UTC)