hngopher.com

       [HN Gopher] Digging for performance gold: finding hidden perform...
       ___________________________________________________________________
        
       Digging for performance gold: finding hidden performance wins
        
       Author : markdog12
       Score  : 114 points
       Date   : 2021-04-23 14:19 UTC (8 hours ago)
        
 (HTM) web link (blog.chromium.org)
 (TXT) w3m dump (blog.chromium.org)
        
       | PaulHoule wrote:
       | I wrote something using Python and Pillow that prints titles,
       | credits, and QR codes on the back of art prints.
       | 
       | I ran very much into the problem that there are not really
       | "unicode" fonts but rather the web browser is patching together
       | characters from different fonts when you use Chinese, Arabic,
       | Emoji(s), etc.
       | 
       | I want something that looks like the card at the art museum that
       | introduces a piece so I have just a nice serif en font and a
       | Japanese font I like because I have a lot of Japanese subject
       | matter.
       | 
       | If I wanted to print some math character or Arabic I would have
       | to register that typeface in my system but it is a hassle at the
       | moment.
       | 
       | What I get for this (as compared to HTML) is that the system
       | understands the border of the card, which is big for a 6x4 card
       | and I can align multiple printings on both sides to the limits of
       | the hardware.
       | 
       | Jank is the least of my problems.
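        
       The "patching together" described above is font fallback: each
       character is drawn with the first font in a preference list
       that covers it. A minimal sketch of that idea follows. The font
       names and the code point ranges they claim to cover are
       assumptions made up for illustration; real fonts expose their
       own coverage tables, and this is not how any particular browser
       or Pillow does it.

```python
# Hypothetical preference list: (font name, coverage predicate on a code point).
# The ranges are illustrative stand-ins for a real font's coverage table.
FONTS = [
    # Basic Latin through Latin Extended-B
    ("SerifLatin", lambda cp: cp < 0x0250),
    # Hiragana, Katakana, and CJK Unified Ideographs
    ("JapaneseFont", lambda cp: 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF),
]
LAST_RESORT = "LastResort"  # hypothetical name used when no listed font covers a character


def pick_font(ch):
    """Return the first font in FONTS whose coverage includes ch."""
    cp = ord(ch)
    for name, covers in FONTS:
        if covers(cp):
            return name
    return LAST_RESORT


def split_into_runs(text):
    """Group consecutive characters that resolve to the same font."""
    runs = []
    for ch in text:
        font = pick_font(ch)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs


for font, chunk in split_into_runs("Hokusai 北斎 ☺"):
    print(font, repr(chunk))
```

       Each run would then be measured and drawn with its own font
       object, which is why a character outside every registered font
       needs another typeface added to the list.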
        
         | bombcar wrote:
         | The "Noto" family of fonts may be of interest:
         | https://www.google.com/get/noto/
        
       | matthewaveryusa wrote:
       | Great write-up. I've done countless investigations like this and
       | couldn't have worded it better.
       | 
       | >Depth vs. breadth.
       | 
       | Ah yes, from which direction do you look at your program? Do you
       | look at which functions consume the most resources bottom up
       | (probably some string or memory function in libc) or top down?
       | 
       | If you're the person writing the system libraries for enormous
       | platforms, probably bottom-up, but if you're an application
       | developer, top down. Sometimes though, especially with the
       | performance issue described in the article you're in the middle
       | -- those are tough to spot!
        
         | Leherenn wrote:
         | Personally, in an application, I would quickly start bottom up.
         | It's unlikely, but maybe there's an obvious way to improve one
         | of those functions. An unnecessary copy or similar can easily
         | happen.
         | 
         | Then, yes, spend time bottom up, that's where you're more likely
         | to find consistent gains, usually by finding ways to call those
         | low level functions less often.
        
       | markdog12 wrote:
       | > A subset of Canary users who have opted in to sharing
       | anonymized metrics have circular-buffer tracing enabled
       | 
       | Where is that setting? I'm pretty sure it asks on install, but
       | what about after that?
       | 
       | Update: Seems to be in settings -> Sync and Google Services ->
       | Help improve Chrome's features and performance
        
       | ww520 wrote:
       | Speaking of performance, Chrome's WebGL performance is quite
       | good. In some of the stress tests I ran, it came out 40% faster
       | than Firefox. It seems Chrome is faster at ferrying the WebGL
       | calls and large amounts of data to the GPU.
        
       | jeffbee wrote:
       | Great write-up. I really feel like any kind of performance
       | optimization, compromise, or other detail should be accompanied
       | by tests or assertions that capture all of the inputs that
       | supported the decision. In this example, ideally the compromise
       | necessary to support Windows XP would have come with an assertion
       | that the minimum supported version of Windows was still XP or
       | earlier. This way, the decision is remembered and revisited if XP
       | stops being supported, because the build would break. I don't
       | know what the chrome code looks like but I imagine something like
       | // TODO: Remove this hack if we drop Windows XP
       | assert(min_win < win7)
       | 
       | ... simple. I recall finding a function deep in Google search
       | that had been "optimized" in x86 assembly, but way back when the
       | cache lines were 32 bytes. On Opteron and later the "optimized"
       | code was slower than idiomatic C++. That's when I decided any
       | kind of performance decision needs to be recorded, somehow.
       | Either something like `assert cache_bytes==32` or just a
       | FIXME($date) that forces someone to revisit the decision every
       | year.
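        
       A minimal sketch of the "record the assumption next to the
       workaround" idea from the comment above. All names here are
       hypothetical and this is not Chrome's code or build system; the
       version tuples are the Windows NT version numbers for XP (5.1)
       and 7 (6.1). In C++ the same check could be a compile-time
       assertion so that the build itself breaks.

```python
# Hypothetical constants; (major, minor) Windows NT version numbers.
WINDOWS_XP = (5, 1)
WINDOWS_7 = (6, 1)

# Assumed project-wide setting: the oldest Windows version still supported.
MIN_SUPPORTED_WINDOWS = WINDOWS_XP


def check_workaround_still_needed(min_win=MIN_SUPPORTED_WINDOWS, win7=WINDOWS_7):
    """Fail loudly once the reason for the workaround no longer applies.

    Mirrors the `assert(min_win < win7)` idea: when support for XP is
    dropped, this trips in tests or CI and forces someone to revisit
    the decision.
    """
    assert min_win < win7, (
        "Minimum supported Windows is no longer older than Windows 7: "
        "remove the workaround guarded by this check and re-measure."
    )


# Passes today; starts failing the day MIN_SUPPORTED_WINDOWS is raised.
check_workaround_still_needed()
```

       Note that Python drops `assert` statements under `-O`, so a
       check like this belongs in a test suite rather than in code
       that might run optimized.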
        
         | bombcar wrote:
         | I've always thought that there should be a coding construct
         | (especially for inline assembly) where you have the code in a
         | "macro-like comment" in original C, and then the inline "live"
         | and one of the integration checks determines if they deviate in
         | performance or results (and therefore should be retested).
        
       | vlovich123 wrote:
       | > Chrome measures jank every 30 seconds, so Jank in 1% of samples
       | for a given user means jank once every 50 minutes
       | 
       | Is that actually true? Doesn't this just mean that once every 50
       | minutes the system has been janky for >= 60s? Anything less &
       | you're below the Nyquist frequency & are unlikely to be actually
       | sampling it, no? My knowledge of signal analysis is just what I
       | recall from some intro university classes so there could be more
       | involved here in this claim so happy to learn if I'm
       | misremembering (+ it might be made more complicated because
       | they're also sampling across a population of users).
       | 
       | (Speaking as someone who regularly has to shut down Chrome
       | because it's making my entire machine janky).
        
         | londons_explore wrote:
         | I think they mean "we have code that sets a flag whenever the
         | UI blocks for more than 100 milliseconds. We clear that flag
         | every 30 seconds. We see it set 1% of the times that we clear
         | it."
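        
         The flag-based reading guessed at above can be sketched as
         follows. This is an illustration of that guess, not Chrome's
         actual metric code; the class and function names are
         hypothetical, and the 100 ms threshold and 30 s period are
         the figures from the comment. With this scheme, 1% of samples
         set means one janky 30 s window per 30 / 0.01 = 3000 s on
         average, which is the article's 50 minutes.

```python
JANK_THRESHOLD_S = 0.100  # "blocks for more than 100 milliseconds"
SAMPLE_PERIOD_S = 30.0    # "we clear that flag every 30 seconds"


class JankSampler:
    """Hypothetical sampler: one boolean per period, not a count of janks."""

    def __init__(self):
        self.flag = False
        self.samples = []

    def record_delay(self, delay_s):
        # Any delay over the threshold sets the flag for the current window.
        if delay_s > JANK_THRESHOLD_S:
            self.flag = True

    def sample(self):
        # Called once per period: store the flag, then clear it.
        self.samples.append(self.flag)
        self.flag = False

    def janky_fraction(self):
        return sum(self.samples) / len(self.samples)


def mean_time_between_janky_windows(fraction, period_s=SAMPLE_PERIOD_S):
    """Average seconds between windows that had the flag set."""
    if fraction <= 0:
        raise ValueError("fraction must be positive")
    return period_s / fraction


sampler = JankSampler()
for window in range(100):
    if window == 42:
        sampler.record_delay(0.250)
        sampler.record_delay(0.400)  # second jank in the same window: still one sample
    else:
        sampler.record_delay(0.016)
    sampler.sample()

fraction = sampler.janky_fraction()
minutes = mean_time_between_janky_windows(fraction) / 60
print(f"{fraction:.0%} of windows janky, about {minutes:.0f} minutes apart on average")
```

         The window with two slow events still contributes a single
         set flag, so the figure is a lower bound on how often jank
         occurs rather than a count of jank events.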
        
           | vlovich123 wrote:
           | That would work but wouldn't let you say that jank happens
           | "once every 50 minutes" because you don't actually know how
           | many times that happened.
           | 
           | Also, this article isn't talking about UI blockage. This is
           | talking about the time delta between user input & the result
           | hitting the eyeball, presumably even across any asynchronous
           | threads/IPC.
        
             | bentcorner wrote:
             | I think the phrasing they used ( _" Let's talk about 1%. 1%
             | is quite large in practice. The core metric we use is
             | "jank" which is a noticeable delay between when the user
             | gives input and when software reacts to it. Chrome measures
             | jank every 30 seconds, so Jank in 1% of samples for a given
             | user means jank once every 50 minutes."_) was just to give
             | an example of what 1% meant in practice.
             | 
             | > _Also, this article isn 't talking about UI blockage.
             | This is talking about the time delta between user input &
             | the result hitting the eyeball, presumably even across any
             | asynchronous threads/IPC._
             | 
             | Aren't those the same things?
        
               | vlovich123 wrote:
               | That would be a pretty weird example to give, I think, if
               | the article is solely focusing on a specific jank
               | issue. I think it's more that "at chrome scale, 1% is a
               | lot, especially when you're talking about number of
               | users".
               | 
               | > Aren't those the same things?
               | 
               | Depends how you define it. Typically I think of "UI
               | blockage" interpreted as "main thread doing CPU work or
               | blocked on something and not processing events". That's a
               | subset of the problems described (and maybe not even a
               | perfect subset since you may have some kinds of UI
               | blockage that's not directly tracked to a user action). A
               | user action might cause a repaint of the cursor/text.
               | That repaint actually gets to the user through the
               | compositor which is an external process (for security
               | reasons). That's all asynchronous and means you have to
               | actually plumb through all your time stamps and metadata
               | about the source event in all dependent work in a
               | meaningful enough way to come up with an answer.
        
         | infogulch wrote:
         | I think you're assuming too much about what they mean by
         | 'measure'. My guess is that they measure all jank regardless of
         | duration, and record in frequency buckets that are 30s wide.
         | But it's not exactly clear.
        
           | vlovich123 wrote:
           | 30s frequency buckets wouldn't be phrased as "every 30s"
           | though, no?
        
             | infogulch wrote:
             | I don't know how they would phrase it, but the whole
             | sentence seems disjointed like it's been through too many
             | editing passes and lost meaning.
        
         | londons_explore wrote:
         | > Speaking as someone who regularly has to shut down Chrome
         | because it's making my entire machine janky
         | 
         | You either need more RAM, or a browser that uses less RAM...
        
           | vlovich123 wrote:
           | Currently I have 32GB & used a machine with 96GB. How much
           | RAM do browsers need? FWIW Firefox doesn't do too much
           | better.
        
             | londons_explore wrote:
             | Do you have a machine with slow storage (hard disk or early
             | SSD)? Chrome's HTTP cache does a lot of tiny reads and can
             | easily make the whole system slow, especially when the
             | profile is gigabytes or more.
             | 
             | Clearing the cache, or even the entire Chrome profile will
             | fix it if that's the case.
        
               | vlovich123 wrote:
               | Traditionally always an SSD & more recently (on the 32GB
               | machine) NVMe. I/O is certainly a good hypothesis.
               | Regardless of the subresource, I think the fault actually
               | lies with the kernel. I don't care how many subprocesses
               | are started - the totality should be grouped under a jail
               | that is fairly queued with all other work on CPU & I/O
               | unless I explicitly raise that jail's limits (heck, maybe
               | even RAM - swap out Chrome more quickly if it's hogging
               | up all the RAM).
        
               | londons_explore wrote:
               | Chrome has a lot of processes, but only 1 or 2 processes
               | do all the disk and network IO, so I don't think that
               | particular hypothesis holds up.
               | 
               | What you say probably _is_ an issue with CPU scheduling
               | though.
        
               | vlovich123 wrote:
               | Yeah. I was thinking more that kernels historically have
               | not been able to achieve good I/O queuing for user-facing
               | operation (some of which was probably because the
               | hardware interfaces weren't good enough). Maybe that's
               | since been resolved.
               | 
               | I do think it causes issues with CPU scheduling but it
               | could be any number of other issues. I don't think kernel
               | developers are looking at improving the overall perf of
               | the system with a large number of chrome tabs.
        
         | vlovich123 wrote:
         | Now that I've finished reading, I'm curious why the Chrome team
         | didn't optimize GetLinkedFonts since it's the obvious culprit
         | to my eye. Querying the registry is _slow_. Really slow. The
         | Chrome code appears to always read it on a missing value. If
         | you have a missing value in your already populated cache, then
         | every miss is going to reach out to the infrequently changing
         | registry. It makes far more sense to only invalidate your in-
         | process cache when the registry actually changes.
        
           | londons_explore wrote:
           | Indeed it would seem a trivial change to cache failed lookups
           | here[1]
           | 
           | [1]: https://source.chromium.org/chromium/chromium/src/+/mast
           | er:u...
        
             | vlovich123 wrote:
             | Yup. Looked at the source first to double-check there
             | wasn't a legitimate reason for the registry read.
             | 
             | EDIT: Filed the suggestion upstream: https://bugs.chromium.
             | org/p/chromium/issues/detail?id=120214...
        
           | masklinn wrote:
           | > Now that I've finished reading, I'm curious why the Chrome
           | team didn't optimize GetLinkedFonts since it's the obvious
           | culprit to my eye.
           | 
           | That's the point of the article: `GetLinkedFonts` is the
           | "obvious culprit", but it's the fallback to the fallback, it
           | should not be getting called in the first place. It doesn't
           | really matter that it's slow because it should almost never
           | be called.
           | 
           | And _then_ , I assume they fixed (or will fix) the cache so
           | that it'd cache failures, so GetLinkedFonts would only be
           | called once per failure instead of being called over and over
           | again, only for failure (as successes would get cached after
           | the first one).
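        
           Caching failures as well as successes, as discussed in this
           sub-thread, is usually called negative caching. A minimal
           sketch follows, assuming a slow lookup that returns `None`
           on a miss. The class name, the `slow_lookup` callable, and
           the sample data are hypothetical; this is not the Chromium
           implementation of `GetLinkedFonts`.

```python
_MISSING = object()  # sentinel: "we already looked this up and found nothing"


class LinkedFontCache:
    """Hypothetical cache that remembers misses as well as hits."""

    def __init__(self, slow_lookup):
        self._slow_lookup = slow_lookup  # stand-in for e.g. a registry read
        self._cache = {}
        self.slow_calls = 0  # counts how often the slow path actually runs

    def get(self, family):
        if family in self._cache:
            cached = self._cache[family]
            return None if cached is _MISSING else cached
        self.slow_calls += 1
        result = self._slow_lookup(family)
        # Store the sentinel on failure so the next miss is served from memory.
        self._cache[family] = _MISSING if result is None else result
        return result

    def invalidate(self):
        # Call this only when the underlying source reports a change.
        self._cache.clear()


registry = {"Arial": ["Segoe UI Symbol"]}  # made-up data for illustration
cache = LinkedFontCache(registry.get)
for name in ["Arial", "NoSuchFont", "NoSuchFont", "Arial", "NoSuchFont"]:
    cache.get(name)
print(cache.slow_calls)
```

           Five lookups over two distinct keys reach the slow path
           twice; without the sentinel, every repeated miss would hit
           it again. The `invalidate` method corresponds to the
           suggestion of dropping the cache only when the registry
           actually changes.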
        
       | frabjoused wrote:
       | I personally find performance optimization on an existing, high
       | traffic system to be some of my favorite times as a developer.
       | You have a large number, and you have to get it as close to zero
       | as possible. There's no mystery as to whether you improved
       | something and it has a tangible reward.
        
         | brundolf wrote:
         | > There's no mystery as to whether you improved something
         | 
         | I'd nitpick a little bit and say it's possible that an
         | optimization in one case causes a slowdown in another case, or
         | worse, a bug. Benchmarks can also be inconsistent on the "same"
         | case due to caches, etc, some of which may live outside of the
         | code you actually control. Even the simplest program will vary
         | a bit when you re-run it on the same system due to the state of
         | that system (other processes, temperature, CPU cache, etc).
         | 
         | Some optimizations are clear wins, but many of them involve
         | trade-offs and can have some mystery. Thorough
         | testing/benchmarking helps a lot, but it can only get you so
         | far.
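        
         The run-to-run variation mentioned above is easy to observe
         by timing the same code several times and looking at the
         spread instead of a single number. A minimal sketch, with a
         made-up `workload` function standing in for the code under
         test:

```python
import statistics
import time


def time_repeated(func, repeats=7):
    """Run func `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Report the spread of the measurements, not just one value."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def workload():
    # Hypothetical stand-in for the code being benchmarked.
    return sum(i * i for i in range(50_000))


print(summarize(time_repeated(workload)))
```

         If the gap between `min` and `max` is comparable to the
         improvement being claimed, the measurement cannot support
         the claim on its own.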
        
           | jeffbee wrote:
           | FYI Chrome performance is not generally guided by
           | microbenchmarks, for the exact reasons you mention. It is
           | guided by full-scale benchmarks (e.g. render the top 10000
           | sites) and by ChromeOS-wide profiles gathered in the wild. If
           | a performance change doesn't work in the wild as indicated by
           | profiles then it generally will be backed out and
           | reconsidered. This is consistent with Google's backend
           | performance culture where microbenchmarks are fine and good
           | but changes need to be vetted on a full-scale production
           | loadtest fixture.
        
             | brundolf wrote:
             | Yep, makes sense to me. Unit tests assume that a clean-room
             | environment translates reasonably well to the end result;
             | the more naturally complex or unruly a product or a target
             | metric is, the more your testing process should lean
             | towards integration/real-world testing.
        
           | m4rtink wrote:
           | Not to mention maintenance and future development costs if
           | the optimization makes the piece of software more complicated
           | and less flexible.
        
         | GordonS wrote:
         | I'm with you - one of my favourite things to do is optimise
         | performance, whether it's memory, CPU, latency, whatever.
         | 
         | Actually, I often enjoy it a bit too much... it's frequently
         | the case that I'll realise I've just spent an entire day
         | reducing memory allocations that didn't _really_ need reducing,
         | rather than building features :(
        
         | segmondy wrote:
         | It doesn't even have to be for a high traffic system, it could
         | also be for a cost constrained system. If you're an indie
         | developer, you might be able to afford $50 a month but not $500
         | a month for your side project and improving performance can
         | keep you in business and give you a shot at success.
        
         | dan-robertson wrote:
         | If you work at a sufficiently large company then optimisation
         | work on sufficiently low level systems can save 7-8 figures per
         | year, not that you'd likely see much of those savings yourself.
         | It often turns out that some tiny bit of code with not-great
         | performance becomes embedded deeply in the stack and the small
         | cost can add up over many machines.
        
       | aeturnum wrote:
       | I wish the Chrome team would dig into why my Chrome uses nearly
       | all of two cores all of the time with one tab open. The issue (or
       | something similar) comes up all the time on their forums and they
       | just lock all the topics[1]. Chrome is such a resource hog under
       | normal operation that it's hard to say when something is going
       | wrong.
       | 
       | [1] https://support.google.com/chrome/thread/17537877?hl=en
        
         | josephg wrote:
         | Try disabling all your browser extensions and see if the
         | problem persists.
         | 
         | It's amazing to me how inefficiently a lot of browser
         | extensions are written - eg last I checked, metamask pulls in
         | web3, which is a clown car of javascript that takes hundreds of
         | milliseconds to parse. That code needs to be parsed every time
         | you navigate to a new website. You might not notice a single
         | extension like that, but with a few bad extensions it's easy
         | for your browser to slow to a crawl. The obvious response is to
         | blame the browser for stuff like this, but it's usually the
         | extensions that are causing your problems.
         | 
         | [1] https://github.com/ChainSafe/web3.js/issues/1178
        
         | username90 wrote:
         | It doesn't take much CPU for me. Chrome's 19 processes combined
         | sit there at 1% CPU and about 1 GB of RAM for me. Probably the
         | fault of the sites you are visiting.
         | 
         | Edit: Looking through that thread, it seems like some plugins
         | caused the issue.
        
         | [deleted]
        
         | flakiness wrote:
         | Use Chrome tracing [1] or Perfetto [2] to take a couple of
         | traces when the problem is happening. Then submit a bug with the
         | traces. That's one of the most promising ways to report
         | performance bugs. It is especially powerful when you're using a
         | Chromebook because Chrome OS integrates Linux ftrace into this
         | app-level tracing and draws a system-wide picture of the
         | workload.
         | 
         | (Disclaimer: I used to work on Chrome many years ago.)
         | 
         | [1] https://www.chromium.org/developers/how-tos/trace-event-
         | prof...
         | [2] https://ui.perfetto.dev/
        
       | shinycode wrote:
       | I uninstalled Chrome and switched to another Chromium
       | alternative and never looked back. Never had a performance issue
       | on my Mac since then ...
        
       ___________________________________________________________________
       (page generated 2021-04-23 23:01 UTC)