[HN Gopher] Digging for performance gold: finding hidden perform...
___________________________________________________________________
Digging for performance gold: finding hidden performance wins
Author : markdog12
Score : 114 points
Date : 2021-04-23 14:19 UTC (8 hours ago)
(HTM) web link (blog.chromium.org)
(TXT) w3m dump (blog.chromium.org)
| PaulHoule wrote:
| I wrote something using Python and Pillow that prints titles,
| credits, and qr codes on the back of art prints.
|
| I ran very much into the problem that there are not really
| "unicode" fonts but rather the web browser is patching together
| characters from different fonts when you use Chinese, Arabic,
| Emoji(s), etc.
|
| I want something that looks like the card at the art museum that
| introduces a piece so I have just a nice serif en font and a
| Japanese font I like because I have a lot of Japanese subject
| matter.
|
| If I wanted to print some math character or arabic I would have
| to register that typeface in my system but it is a hassle at the
| moment.
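|
| A minimal sketch of that per-character patching, assuming a
| two-font setup like the one described here. The labels "serif"
| and "japanese" and both helper names are hypothetical, and the
| script test is a rough heuristic on Unicode character names, not
| how a browser actually resolves fallback:

```python
import unicodedata
from itertools import groupby

# Hypothetical labels; a real script would map these to loaded font objects.
DEFAULT_FONT = "serif"
JAPANESE_FONT = "japanese"
JAPANESE_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA")


def font_for(ch: str) -> str:
    """Pick a font label from the character's Unicode name (rough heuristic)."""
    name = unicodedata.name(ch, "")
    if name.startswith(JAPANESE_PREFIXES):
        return JAPANESE_FONT
    return DEFAULT_FONT


def split_runs(text: str):
    """Group consecutive characters that share a font into (font, run) pairs."""
    return [(font, "".join(chars)) for font, chars in groupby(text, key=font_for)]


print(split_runs("Hokusai 北斎"))
```

| Each run could then be drawn with its own font, advancing the x
| position by the measured width of the previous run.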
|
| What I get for this (as compared to HTML) is that the system
| understands the border of the card, which is big for a 6x4 card
| and I can align multiple printings on both sides to the limits of
| the hardware.
|
| Jank is the least of my problems.
| bombcar wrote:
| The "Noto" family of fonts may be of interest:
| https://www.google.com/get/noto/
| matthewaveryusa wrote:
| Great write-up. I've done countless investigations like this and
| couldn't have worded it better.
|
| >Depth vs. breadth.
|
| Ah yes, which direction do you look at your program? Do you look
| at which functions consume the most resources bottom up (probably
| some string or memory function in libc) or top down?
|
| If you're the person writing the system libraries for enormous
| platforms, probably bottom-up, but if you're an application
| developer, top down. Sometimes though, especially with the
| performance issue described in the article you're in the middle
| -- those are tough to spot!
| Leherenn wrote:
| Personally, in an application, I would quickly start bottom up.
| It's unlikely, but maybe there's an obvious way to improve one
| of those functions. An unnecessary copy or similar can easily
| happen.
|
| Then, yes, spend time bottom up, that's where you're more likely
| to find consistent gains, usually by finding ways to call those
| low level functions less often.
| markdog12 wrote:
| > A subset of Canary users who have opted in to sharing
| anonymized metrics have circular-buffer tracing enabled
|
| Where is that setting? I'm pretty sure it asks on install, but
| what about after that?
|
| Update: Seems to be in settings -> Sync and Google Services ->
| Help improve Chrome's features and performance
| ww520 wrote:
| Speaking of performance, Chrome's WebGL performance is quite good.
| In some of the stress tests I ran, it came out 40% faster than
| Firefox. It seems Chrome is faster at ferrying the WebGL calls
| and large amounts of data to the GPU.
| jeffbee wrote:
| Great write-up. I really feel like any kind of performance
| optimization, compromise, or other detail should be accompanied
| by tests or assertions that capture all of the inputs that
| supported the decision. In this example, ideally the compromise
| necessary to support Windows XP would have come with an assertion
| that the minimum supported version of Windows was still XP or
| earlier. This way, the decision is remembered and revisited if XP
| stops being supported, because the build would break. I don't
| know what the chrome code looks like but I imagine something like
| // TODO: Remove this hack if we drop Windows XP
| assert(min_win < win7)
|
| ... simple. I recall finding a function deep in Google search
| that had been "optimized" in x86 assembly, but way back when the
| cache lines were 32 bytes. On Opteron and later the "optimized"
| code was slower than idiomatic C++. That's when I decided any
| kind of performance decision needs to be recorded, somehow.
| Either something like `assert cache_bytes==32` or just a
| FIXME($date) that forces someone to revisit the decision every
| year.
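|
| A minimal sketch of both ideas, written in Python for brevity.
| The version tuples, MIN_SUPPORTED_WINDOWS, and the helper names
| are assumptions made up for illustration, not anything from the
| Chrome code base:

```python
from datetime import date
from typing import Optional

# Hypothetical constants standing in for real build configuration.
WINDOWS_XP = (5, 1)
WINDOWS_7 = (6, 1)
MIN_SUPPORTED_WINDOWS = WINDOWS_XP  # assumed value for this sketch


def check_assumption(still_holds: bool, reason: str) -> None:
    """Fail loudly once the condition that justified a hack stops holding."""
    if not still_holds:
        raise AssertionError(f"Revisit this workaround: {reason}")


def check_revisit_by(deadline: date, reason: str,
                     today: Optional[date] = None) -> None:
    """Fail once a dated FIXME has expired, forcing a periodic review."""
    today = today or date.today()
    if today > deadline:
        raise AssertionError(f"FIXME expired on {deadline}: {reason}")


# The workaround is only justified while XP is still supported.
check_assumption(MIN_SUPPORTED_WINDOWS < WINDOWS_7,
                 "Windows XP support was dropped")
```

| Run as part of the build or test suite, the first check breaks
| as soon as the minimum version is raised, and the second breaks
| once the review date has passed.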
| bombcar wrote:
| I've always thought that there should be a coding construct
| (especially for inline assembly) where you have the code in a
| "macro-like comment" in original C, and then the inline "live"
| and one of the integration checks determines if they deviate in
| performance or results (and therefore should be retested).
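|
| A rough sketch of the "results" half of that check, in Python
| rather than C with inline assembly. The popcount functions and
| results_match are hypothetical stand-ins: one plain reference
| version, one hand-tuned version, and a check that they agree on
| the same random inputs. Comparing performance would need a
| separate benchmark step.

```python
import random


def popcount_reference(x: int) -> int:
    """Plain, obviously-correct version kept alongside the optimized one."""
    return bin(x).count("1")


def popcount_optimized(x: int) -> int:
    """Hand-tuned variant (Kernighan's bit trick) standing in for inline assembly."""
    count = 0
    while x:
        x &= x - 1  # clear the lowest set bit
        count += 1
    return count


def results_match(reference, optimized, trials: int = 1000, seed: int = 0) -> bool:
    """Differential check: both versions must agree on the same random inputs."""
    rng = random.Random(seed)
    return all(
        reference(x) == optimized(x)
        for x in (rng.getrandbits(64) for _ in range(trials))
    )


assert results_match(popcount_reference, popcount_optimized)
```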
| vlovich123 wrote:
| > Chrome measures jank every 30 seconds, so Jank in 1% of samples
| for a given user means jank once every 50 minutes
|
| Is that actually true? Doesn't this just mean that once every 50
| minutes the system has been janky for >= 60s? Anything less &
| you're below the Nyquist frequency & are unlikely to be actually
| sampling it, no? My knowledge of signal analysis is just what I
| recall from some intro university classes so there could be more
| involved here in this claim so happy to learn if I'm
| misremembering (+ it might be made more complicated because they're
| also sampling across a population of users).
|
| (Speaking as someone who regularly has to shut down Chrome
| because it's making my entire machine janky).
| londons_explore wrote:
| I think they mean "we have code that sets a flag whenever the
| UI blocks for more than 100 milliseconds. We clear that flag
| every 30 seconds. We see it set 1% of the times that we clear
| it."
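|
| A rough sketch of that flag-and-clear scheme, assuming the 100 ms
| threshold and 30 s period mentioned above. The function names and
| the (start, duration) input format are hypothetical, and a block
| is attributed to the window in which it starts:

```python
JANK_THRESHOLD_S = 0.100   # a UI block longer than this sets the flag
SAMPLE_PERIOD_S = 30.0     # the flag is read and cleared this often


def janky_sample_fraction(blocks, total_time_s):
    """Return the fraction of sampling windows in which the flag was set.

    blocks: iterable of (start_time_s, duration_s) UI blocks.
    """
    n_windows = int(total_time_s // SAMPLE_PERIOD_S)
    if n_windows == 0:
        raise ValueError("need at least one full sampling window")
    flagged = set()
    for start, duration in blocks:
        window = int(start // SAMPLE_PERIOD_S)
        if duration > JANK_THRESHOLD_S and window < n_windows:
            flagged.add(window)  # a flag is binary: repeats in a window collapse
    return len(flagged) / n_windows


def mean_interval_minutes(fraction):
    """Average time between flagged windows, given the flagged fraction."""
    return SAMPLE_PERIOD_S / fraction / 60.0


# One hour of use: two blocks land in the same window, one is too short.
blocks = [(10.0, 0.25), (12.0, 0.5), (1000.0, 0.05)]
print(janky_sample_fraction(blocks, 3600.0))  # 1 flagged window out of 120
print(mean_interval_minutes(0.01))            # about 50 minutes
```

| The second call is the arithmetic behind the quoted claim: 1% of
| 30-second samples is one flagged sample per 3000 seconds. The
| first shows what the flag cannot tell you, which is how many
| janks happened inside a flagged window.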
| vlovich123 wrote:
| That would work but wouldn't let you say that jank happens
| "once every 50 minutes" because you don't actually know how
| many times that happened.
|
| Also, this article isn't talking about UI blockage. This is
| talking about the time delta between user input & the result
| hitting the eyeball, presumably even across any asynchronous
| threads/IPC.
| bentcorner wrote:
| I think the phrasing they used (_"Let's talk about 1%. 1%
| is quite large in practice. The core metric we use is
| "jank" which is a noticeable delay between when the user
| gives input and when software reacts to it. Chrome measures
| jank every 30 seconds, so Jank in 1% of samples for a given
| user means jank once every 50 minutes."_) was just to give
| an example of what 1% meant in practice.
|
| > _Also, this article isn't talking about UI blockage.
| This is talking about the time delta between user input &
| the result hitting the eyeball, presumably even across any
| asynchronous threads/IPC._
|
| Aren't those the same things?
| vlovich123 wrote:
| That would be a pretty weird example to give I think if
| the article is solely focusing on a specific jank
| issue. I think it's more that "at chrome scale, 1% is a
| lot, especially when you're talking about number of
| users".
|
| > Aren't those the same things?
|
| Depends how you define it. Typically I think of "UI
| blockage" interpreted as "main thread doing CPU work or
| blocked on something and not processing events". That's a
| subset of the problems described (and maybe not even a
| perfect subset since you may have some kinds of UI
| blockage that's not directly tracked to a user action). A
| user action might cause a repaint of the cursor/text.
| That repaint actually gets to the user through the
| compositor which is an external process (for security
| reasons). That's all asynchronous and means you have to
| actually plumb through all your time stamps and metadata
| about the source event in all dependent work in a
| meaningful enough way to come up with an answer.
| infogulch wrote:
| I think you're assuming too much about what they mean by
| 'measure'. My guess is that they measure all jank regardless of
| duration, and record in frequency buckets that are 30s wide.
| But it's not exactly clear.
| vlovich123 wrote:
| 30s frequency buckets wouldn't be phrased as "every 30s"
| though, no?
| infogulch wrote:
| I don't know how they would phrase it, but the whole
| sentence seems disjointed like it's been through too many
| editing passes and lost meaning.
| londons_explore wrote:
| > Speaking as someone who regularly has to shut down Chrome
| because it's making my entire machine janky
|
| You either need more RAM, or a browser that uses less RAM...
| vlovich123 wrote:
| Currently I have 32GB & used a machine with 96GB. How much
| RAM do browsers need? FWIW Firefox doesn't do too much
| better.
| londons_explore wrote:
| Do you have a machine with slow storage (hard disk or early
| SSD)? Chrome's HTTP cache does a lot of tiny reads and can
| easily make the whole system slow, especially when the
| profile is gigabytes or more.
|
| Clearing the cache, or even the entire Chrome profile, will
| fix it if that's the case.
| vlovich123 wrote:
| Traditionally always an SSD & more recently (on the 32GB
| machine) NVMe. I/O is certainly a good hypothesis.
| Regardless of the subresource, I think the fault actually
| lies with the kernel. I don't care how many subprocesses
| are started - the totality should be grouped under a jail
| that is fairly queued with all other work on CPU & I/O
| unless I explicitly raise that jail's limits (heck, maybe
| even RAM - swap out Chrome more quickly if it's hogging
| up all the RAM).
| londons_explore wrote:
| Chrome has a lot of processes, but only 1 or 2 processes
| do all the disk and network IO, so I don't think that
| particular hypothesis holds up.
|
| What you say probably _is_ an issue with CPU scheduling
| though.
| vlovich123 wrote:
| Yeah. I was thinking more that kernels historically have
| not been able to achieve good I/O queuing for user-facing
| operations (some of which was probably because the
| hardware interfaces weren't good enough). Maybe that's
| since been resolved.
|
| I do think it causes issues with CPU scheduling but it
| could be any number of other issues. I don't think kernel
| developers are looking at improving the overall perf of
| the system with a large number of chrome tabs.
| vlovich123 wrote:
| Now that I've finished reading, I'm curious why the Chrome team
| didn't optimize GetLinkedFonts since it's the obvious culprit
| to my eye. Querying the registry is _slow_. Really slow. The
| Chrome code appears to always read it on a missing value. If
| you have a missing value in your already populated cache, then
| every miss is going to reach out to the infrequently changing
| registry. It makes far more sense to only invalidate your in-
| process cache when the registry actually changes
| londons_explore wrote:
| Indeed it would seem a trivial change to cache failed lookups
| here[1]
|
| [1]: https://source.chromium.org/chromium/chromium/src/+/mast
| er:u...
| vlovich123 wrote:
| Yup. Looked at the source first to double-check there
| wasn't a legitimate reason for the registry read.
|
| EDIT: Filed the suggestion upstream: https://bugs.chromium.
| org/p/chromium/issues/detail?id=120214...
| masklinn wrote:
| > Now that I've finished reading, I'm curious why the Chrome
| team didn't optimize GetLinkedFonts since it's the obvious
| culprit to my eye.
|
| That's the point of the article: `GetLinkedFonts` is the
| "obvious culprit", but it's the fallback to the fallback, it
| should not be getting called in the first place. It doesn't
| really matter that it's slow because it should almost never
| be called.
|
| And _then_, I assume they fixed (or will fix) the cache so
| that it'd cache failures, so GetLinkedFonts would only be
| called once per failure instead of being called over and over
| again, only for failure (as successes would get cached after
| the first one).
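|
| A minimal sketch of caching failures as well as successes, under
| stated assumptions: FallbackCache, slow_lookup, and invalidate are
| hypothetical names, and slow_lookup stands in for an expensive
| call such as a registry read. This is not the Chromium code.

```python
from typing import Callable, Dict, Optional

_MISSING = object()  # sentinel: distinguishes "not cached" from "cached failure"


class FallbackCache:
    """Caches both hits and misses of an expensive lookup."""

    def __init__(self, slow_lookup: Callable[[str], Optional[str]]) -> None:
        self._slow_lookup = slow_lookup
        self._cache: Dict[str, Optional[str]] = {}

    def get(self, key: str) -> Optional[str]:
        cached = self._cache.get(key, _MISSING)
        if cached is not _MISSING:
            return cached  # may be None: a remembered failure
        result = self._slow_lookup(key)
        self._cache[key] = result  # store None too, so misses are not retried
        return result

    def invalidate(self) -> None:
        """Call when the backing store (e.g. the registry) actually changes."""
        self._cache.clear()
```

| With this shape a missing key costs one slow lookup instead of
| one per request, and invalidate() covers the case where the
| underlying data changes.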
| frabjoused wrote:
| I personally find performance optimization on an existing, high
| traffic system to be some of my favorite times as a developer.
| You have a large number, and you have to get it as close to zero
| as possible. There's no mystery as to whether you improved
| something and it has a tangible reward.
| brundolf wrote:
| > There's no mystery as to whether you improved something
|
| I'd nitpick a little bit and say it's possible that an
| optimization in one case causes a slowdown in another case, or
| worse, a bug. Benchmarks can also be inconsistent on the "same"
| case due to caches, etc, some of which may live outside of the
| code you actually control. Even the simplest program will vary
| a bit when you re-run it on the same system due to the state of
| that system (other processes, temperature, CPU cache, etc).
|
| Some optimizations are clear wins, but many of them involve
| trade-offs and can have some mystery. Thorough
| testing/benchmarking helps a lot, but it can only get you so
| far.
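|
| A small sketch of that run-to-run variation, using Python's
| standard timeit module. The measure helper and the workload are
| hypothetical; the point is only to report a spread across
| repeated runs rather than a single number:

```python
import statistics
import timeit


def measure(fn, repeats: int = 5, number: int = 1000):
    """Time fn several times and report the spread, not just one number."""
    runs = timeit.repeat(fn, repeat=repeats, number=number)
    return {
        "min": min(runs),
        "median": statistics.median(runs),
        "max": max(runs),
    }


stats = measure(lambda: sum(range(100)))
print(stats)  # values differ between runs and between machines
```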
| jeffbee wrote:
| FYI Chrome performance is not generally guided by
| microbenchmarks, for the exact reasons you mention. It is
| guided by full-scale benchmarks (e.g. render the top 10000
| sites) and by ChromeOS-wide profiles gathered in the wild. If
| a performance change doesn't work in the wild as indicated by
| profiles then it generally will be backed out and
| reconsidered. This is consistent with Google's backend
| performance culture where microbenchmarks are fine and good
| but changes need to be vetted on a full-scale production
| loadtest fixture.
| brundolf wrote:
| Yep, makes sense to me. Unit tests assume that a clean-room
| environment translates reasonably well to the end result;
| the more naturally complex or unruly a product or a target
| metric is, the more your testing process should lean
| towards integration/real-world testing.
| m4rtink wrote:
| Not to mention maintenance and future development costs if
| the optimization makes the piece of software more complicated
| and less flexible.
| GordonS wrote:
| I'm with you - one of my favourite things to do is optimise
| performance, whether it's memory, CPU, latency, whatever.
|
| Actually, I often enjoy it a bit too much... it's frequently
| the case that I'll realise I've just spent an entire day
| reducing memory allocations that didn't _really_ need reducing,
| rather than building features :(
| segmondy wrote:
| It doesn't even have to be for a high traffic system, it could
| also be for a cost constrained system. If you're an indie
| developer. You might be able to afford $50 a month but not $500
| a month for your side project and improving performance can
| keep you in business and give you a shot at success.
| dan-robertson wrote:
| If you work at a sufficiently large company then optimisation
| work on sufficiently low level systems can save 7-8 figures per
| year, not that you'd likely see much of those savings yourself.
| It often turns out that some tiny bit of code with not-great
| performance becomes embedded deeply in the stack and the small
| cost can add up over many machines.
| aeturnum wrote:
| I wish the Chrome team would dig into why my Chrome uses nearly
| all of two cores all of the time with one tab open. The issue (or
| something similar) comes up all the time on their forums and they
| just lock all the topics[1]. Chrome is such a resource hog under
| normal operation that it's hard to say when something is going
| wrong.
|
| [1] https://support.google.com/chrome/thread/17537877?hl=en
| josephg wrote:
| Try disabling all your browser extensions and see if the
| problem persists.
|
| It's amazing to me how inefficiently a lot of browser
| extensions are written - eg last I checked, metamask pulls in
| web3, which is a clown car of javascript that takes hundreds of
| milliseconds to parse. That code needs to be parsed every time
| you navigate to a new website. You might not notice a single
| extension like that, but with a few bad extensions it's easy
| for your browser to slow to a crawl. The obvious response is to
| blame the browser for stuff like this, but it's usually the
| extensions that are causing your problems.
|
| [1] https://github.com/ChainSafe/web3.js/issues/1178
| username90 wrote:
| It doesn't take much CPU for me. Chrome's 19 processes combined
| sit there at 1% CPU and about 1GB RAM for me. Probably the
| fault of the sites you are visiting.
|
| Edit: Looking through that thread, it seems like some plugins
| caused the issue.
| [deleted]
| flakiness wrote:
| Use Chrome tracing [1] or Perfetto [2] to take a couple of
| traces when the problem is happening. Then submit a bug with the
| trace. That's one of the most promising ways to report
| performance bugs. It is especially powerful when you're using
| a Chromebook because Chrome OS integrates Linux ftrace into this
| app-level tracing and draws a system-wide picture of the
| workload.
|
| (Disclaimer: I used to work on Chrome many years ago.)
|
| [1] https://www.chromium.org/developers/how-tos/trace-event-
| prof...
| [2] https://ui.perfetto.dev/
| shinycode wrote:
| I uninstalled Chrome and switched to another Chromium
| alternative and never looked back. Never had a performance issue
| on my Mac since then ...
___________________________________________________________________
(page generated 2021-04-23 23:01 UTC)