[HN Gopher] Running tasks on E cores can use a third of the ener...
       ___________________________________________________________________
        
       Running tasks on E cores can use a third of the energy of P cores
        
       Author : zdw
       Score  : 45 points
       Date   : 2022-05-03 14:06 UTC (1 day ago)
        
 (HTM) web link (eclecticlight.co)
 (TXT) w3m dump (eclecticlight.co)
        
       | modeless wrote:
       | I'm not convinced that either estimate is accurate. To validate
       | that the measurements are real they should actually run the tasks
       | for a significant amount of time and see how much battery is
       | used.
        
         | ars wrote:
         | I agree - this appears to be very sensitive to the polling
         | resolution.
         | 
         | If you only run a task for 0.5 seconds, and you only check
         | power usage every 0.25 seconds, you'll get an artificially high
         | value.
         | 
         | They need to run this for much longer, and with no breaks
         | (moments of low power usage).
        
         | astrange wrote:
         | powermetrics is not an estimate. Of course some other things on
         | the system may change behavior if it's not plugged in.
        
       | phaedrus wrote:
       | From the title, not knowing what E and P cores are, I wondered if
       | maybe it meant using software emulation to make a "virtual CPU
       | core" could actually be _more_ energy efficient than running on a
       | physical core. And now I kinda wonder if there is something to
       | that idea...
       | 
        | For example, I was just reading a different HN thread where
        | people were complaining about Windows 10 randomly turning your
        | laptop into a hairdryer when it's otherwise idle (pegging the
        | CPU at 100% for an update or scan). It's clear a lot of
        | software _doesn't_ need to run at full hardware speed; maybe
        | I'd rather these background scans use only 10% CPU and take 10
        | times as long. And I bet it wouldn't even take 10 times as
        | long, due to the way latency and contention scale.
       | 
        | Another example: a lot of software is inefficient for stupid
        | reasons. If we had automated ways to detect the most common
        | stupid code, maybe we could just not run it (if the answer
        | makes no difference), or run an optimized replacement.
        | Constrained to run on a physical core, you either need static
        | analysis or dynamic recompilation and are limited in some
        | ways, but running on an emulated core you could do whatever
        | you want with dirty tricks that the process can't detect.
        
         | eternityforest wrote:
         | Linux file scanners are the worst for this, or were until
         | recently. I have no idea why it was so hard not to occasionally
         | consume 100% CPU and tons of disk IO at inconvenient times, but
         | multiple different indexers have exactly the same issue.
         | 
          | I bet you could statically analyze code to find pure
          | functions, then add a caching layer automatically. A lot of
          | slowness seems to be because a cache wasn't used.
          | 
          | Or because something stateless should have been stateful (as
          | in fully recomputing stuff that depends on previous inputs
          | rather than calculating one item at a time).
        
            | sliken wrote:
            | The major problems I see fall into one of two cases.
            | Often a naive implementation runs a filescan as fast as
            | it can; the OS starts using more RAM for caching, then at
            | some point applications need more RAM. That triggers
            | reclaiming pages from cache, writing dirty pages, and
            | reading pages from disk for the application, all while
            | the scan is keeping the I/O pipeline full. The user
            | observes this as a slow/laggy machine, and if you are
            | unlucky enough to have an i7/i9 in a laptop, often a fan
            | running flat out.
           | 
            | A second case is where a vendor offers "unlimited"
            | backups for a flat price. This creates an incentive for
            | the vendor to make an INCREDIBLY inefficient backup
            | client that consumes substantial RAM and CPU resources
            | locally and minimal storage remotely. Typically this
            | means that you aren't allowed to use more efficient
            | clients and have to use the vendor's client. Here's
            | looking at you, CrashPlan.
        
           | smoldesu wrote:
            | macOS's indexer is just as bad. The number of times I've
            | seen 10 mdworker threads pinning my CPU is staggering...
        
             | als0 wrote:
             | AFAIK this problem seems to have gone away with the M1
             | based Macs. Can anyone confirm?
        
               | smoldesu wrote:
               | I'd imagine it's not an issue since Grand Central
               | Dispatch will only assign it to a single core cluster,
               | which leaves the other cores free to do their own
               | processing.
        
               | ceeplusplus wrote:
               | I believe indexing tasks only run on e-cores, at least
               | from quick glances at Activity Monitor. I also see more
               | kernel code run on e-cores when I'm compiling code, not
               | sure if Apple is moving blocking syscalls onto e-cores
               | and swapping threads or something (intuition tells me
               | this shouldn't lead to perf gains due to context switch
               | cost, but who knows).
        
             | astrange wrote:
             | Spotlight runs at a low CPU priority, so it doesn't use a
             | lot of power (theoretically) and on Intel won't increase
              | P-states. The CPU % doesn't mean much by itself except
              | that the thread is runnable.
        
               | smoldesu wrote:
                | In practice my CPU hits ~80°C and makes the system
                | nearly unusable. I just disabled Spotlight indexing
                | altogether.
        
                | sliken wrote:
                | To be fair, backups are one of many things (like
                | compiling, video conferencing, or even just looking
                | at the laptop wrong) that make x86-64 Macs run their
                | fans flat out. My particular case is an i9 in a 16"
                | MBP. It throttles heavily, while making impressive
                | noise, on just about any light workload that's not
                | pure idle.
        
         | chlorion wrote:
         | On Linux we have something called "cgroups" which is integrated
         | with the kernel's scheduler. Cgroups can be used to control and
         | limit resources and do exactly what you are describing!
         | 
         | Cgroups can be used to limit the maximum amount of CPU
         | bandwidth a certain group can use in a given period with fairly
         | high resolution. For example, you can use the cpu.max variable
         | to limit a group to run for a maximum of 1000 microseconds per
         | second (1s = 1000000us). You can also limit what cores the
         | group can run on, and the "weight" of the group when CPU
         | bandwidth is being contested by other groups!
         | 
         | Cgroups can also limit resources such as RAM, swap and IO in a
         | similar way.
         | 
         | Systemd has some features to put daemons in cgroups and
         | provides ways to set the various control variables which is
         | pretty handy, and it also has a way to launch regular programs
         | with resource controls applied with systemd-run. There is also
         | libcgroup and the sysfs interface for systems without systemd.
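A minimal sketch of the cgroup v2 "cpu.max" mechanism described above. The group name "slow-scan" and the 10% figure are made up for illustration; writing under /sys/fs/cgroup requires root and a cgroup-v2 mount:

```python
# Sketch (not production code): throttle a process to ~10% of one CPU
# via the cgroup v2 "cpu.max" file. Assumes a cgroup v2 hierarchy
# mounted at /sys/fs/cgroup and root privileges; the group name
# "slow-scan" is illustrative.
import os

CGROUP_ROOT = "/sys/fs/cgroup"

def cpu_max_value(percent, period_us=100_000):
    """Build a cpu.max string, "<quota_us> <period_us>".

    percent=10 with the default 100 ms period yields "10000 100000",
    i.e. the group may run 10 ms of CPU time per 100 ms window.
    """
    quota_us = int(period_us * percent / 100)
    return f"{quota_us} {period_us}"

def throttle(pid, percent, group="slow-scan"):
    path = os.path.join(CGROUP_ROOT, group)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpu.max"), "w") as f:
        f.write(cpu_max_value(percent))
    # Writing a PID into cgroup.procs moves it into the group and
    # applies the bandwidth limit immediately.
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```

On a systemd machine the shell equivalent is `systemd-run --scope -p CPUQuota=10% <command>`, which creates the cgroup and sets the quota for you.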
        
       | throw10920 wrote:
       | > Conventional wisdom, at least for SMP chips, is that this is
       | the most efficient strategy, as it gets the task out of the way
       | as quickly as possible.
       | 
       | I've heard this conventional wisdom cited repeatedly in the realm
       | of embedded devices, but I don't have any idea how it could
       | possibly be true: in a synchronous CMOS chip, power scales with
       | the _square_ of voltage, and in order to push the chip to higher
       | frequencies, you have to increase the (gate? rail?) voltage -
        | doesn't that mean that computing things at lower frequencies
        | is almost always more efficient?
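The question can be put in rough numbers. In this sketch (all constants illustrative, not measured), dynamic power scales as C*V^2*f, so dynamic energy for a fixed-cycle task is independent of frequency and falls with the square of voltage; leakage, by contrast, is paid for as long as the chip stays awake, which is the race-to-idle counterargument:

```python
# Illustrative numbers only. Dynamic power in CMOS is roughly
# P_dyn = C * V^2 * f, so the dynamic energy of a task needing a
# fixed number of cycles is C * V^2 * cycles: frequency cancels out,
# and the real savings come from the lower voltage that a lower
# frequency permits. Leakage is drawn the whole time the chip is
# awake, so a slower run pays for it longer.

def task_energy(v, f_hz, cycles, c=1e-9, p_leak_w=0.5):
    t = cycles / f_hz              # seconds to finish the task
    e_dyn = c * v ** 2 * cycles    # switching energy, joules
    e_leak = p_leak_w * t          # leakage energy while awake
    return e_dyn + e_leak

# 3 GHz at 1.0 V vs. 1 GHz at 0.7 V, same 3e9-cycle task:
fast = task_energy(v=1.0, f_hz=3e9, cycles=3e9)   # 3.0 J + 0.5 J
slow = task_energy(v=0.7, f_hz=1e9, cycles=3e9)   # 1.47 J + 1.5 J
# With modest leakage the slow/low-voltage point wins; with high
# leakage (e.g. p_leak_w=2.0) race-to-idle wins instead.
```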
        
         | wmf wrote:
         | It depends on the idle power.
        
         | nagisa wrote:
         | Running a core doesn't only run the core itself. There is a
         | fair amount of supporting circuits including, but not limited
         | to, things like memory controller, I/O bus, peripherals, etc.
         | 
         | These can sip a non-trivial amount of power, and if a task
         | running on an efficiency core kept them all up, you ultimately
         | might end up with greater aggregate power use.
         | 
         | Nowadays in modern chips these other circuits seem to be
         | brought down to the low power states much more aggressively.
        
         | brigade wrote:
          | If you scale frequency down far enough then leakage power
          | can easily dominate, and a decade ago it was bad enough
          | that race-to-idle really was more power efficient than the
          | DVFS most SoCs could do. Vendors were so bad at this that
          | one of ARM's largest motivations for big.LITTLE was so that
          | their partners could have any semblance of DVFS that
          | actually saved power at lower perf states.
         | 
         | Although, I think the blogger only measures pinning on
         | different CPUs, not the effect of DVFS vs. race-to-idle, so his
          | statement is a non sequitur. Because _of course_ a power-
          | optimized arch is going to be more efficient than a perf-
          | optimized arch (modulo uncore power), and that's been known
          | and true since forever.
        
       | eternityforest wrote:
       | Hardware-based efficiency stuff seems to be the future.
       | 
        | This whole idea that we need light and simple software to be
        | fast doesn't quite hold anymore; we need software that takes
        | fullest advantage of the hardware.
       | 
       | I bet we could do a lot more with this idea. We could have dozens
       | of tiny cores, without any math beyond addition, for things like
       | regexes and file format parsing.
       | 
        | We could probably fit stuff like small AIs and video codecs
        | on SD-card-sized accelerator chips if we really tried.
       | 
       | It will be cool to see how far they can take this.
       | 
       | It will also be really cool to see how we program for this kind
       | of thing, to really make the best use of it.
        
         | Wohlf wrote:
         | This has always been a key component of processor design and
         | has been integral to improvements in consumer device processors
         | for a while now.
        
         | zekrioca wrote:
         | > It will also be really cool to see how we program for this
         | kind of thing, to really make the best use of it.
         | 
         | With light and simple software (libraries)
        
       | GeekyBear wrote:
       | With the M2 chip expected to debut at WWDC next month, here's the
       | difference between the A14 efficiency cores used in the M1 and
       | the A15 efficiency cores used in the current gen iPhones.
       | 
       | >The core has a median performance improvement of +23%, resulting
       | in a median IPC increase of +11.6%. The cores here don't showcase
       | the same energy efficiency improvement as the new A15's
       | performance cores, as energy consumption is mostly flat due to
       | performance increases coming at a cost of power increases, which
       | are still very much low.
       | 
       | Also, when compared to the efficiency cores used in the
       | Snapdragon 888:
       | 
       | >The comparison against the little Cortex-A55 cores is more
       | absurd though, as the A15's E-core is 3.5x faster on average, yet
       | only consuming 32% more power, so energy efficiency is 60%
       | better.
       | 
       | https://www.anandtech.com/show/16983/the-apple-a15-soc-perfo...
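As a sanity check on the quoted figures: energy for a fixed workload is power times time, so 32% more power at 3.5x the speed works out to roughly 62% less energy per task, consistent with the quoted "60% better":

```python
# Energy for a fixed workload = power * time. Per the quote, the A15
# E-core draws 1.32x the power of the Cortex-A55 but takes 1/3.5 of
# the time for the same work:
perf_ratio = 3.5
power_ratio = 1.32
energy_ratio = power_ratio / perf_ratio        # ~0.38x the energy
print(f"{energy_ratio:.2f}x the energy, i.e. {1 - energy_ratio:.0%} less")
```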
        
       | Kon-Peki wrote:
        | These Apple Silicon chips are already very efficient, so it
        | is pleasantly surprising to see the ability to reduce power
        | consumption by 66%. I am wondering, though: some tasks could
        | be implemented using multiple algorithms/approaches, and that
        | could make the differences even larger (if I am pinning to E
        | cores, use algorithm A for maximum efficiency, but if I am
        | using P+E cores, use algorithm B for maximum performance).
        
         | wmf wrote:
         | _use algorithm A for maximum efficiency but use algorithm B for
         | maximum performance_
         | 
         | This is pretty rare; the fastest algorithm is almost always the
         | most efficient.
        
           | Lichtso wrote:
           | For purely sequential algorithms I agree, fewer work steps
           | means less energy used and less time spent.
           | 
           | But the picture is very different when it comes to parallel
           | algorithms. There often is a tradeoff between doing more
           | redundant work to reduce the latency and finish faster
           | overall at the expense of more energy being used.
           | 
           | One example is the "scan" operation on GPUs. There are lots
           | of different algorithms to do it which form a gradient
           | between fast and efficient. For reference see page 2 of this
           | paper:
           | https://research.nvidia.com/publication/2016-03_single-
           | pass-...
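The work/depth tradeoff described above can be sketched with two classic scan variants: a sequential scan does minimal work, while a Hillis-Steele-style scan does O(n log n) additions but only O(log n) dependent rounds. This is a toy CPU model of the GPU tradeoff, not the linked paper's algorithm:

```python
# Toy model of the work/depth tradeoff for inclusive prefix sum.
# The sequential scan performs n additions in n dependent steps;
# Hillis-Steele performs ~n*log2(n) additions in only ~log2(n)
# dependent rounds, so with enough parallel lanes it finishes
# sooner, at the cost of more total (energy-costing) work.

def scan_sequential(xs):
    out, total, adds = [], 0, 0
    for x in xs:
        total += x
        adds += 1
        out.append(total)
    return out, adds

def scan_hillis_steele(xs):
    out = list(xs)
    adds = 0
    step = 1
    while step < len(out):
        nxt = out[:]                      # one parallel round
        for i in range(step, len(out)):   # these adds are independent
            nxt[i] = out[i] + out[i - step]
            adds += 1
        out = nxt
        step *= 2
    return out, adds

seq, seq_adds = scan_sequential([1, 2, 3, 4, 5, 6, 7, 8])
par, par_adds = scan_hillis_steele([1, 2, 3, 4, 5, 6, 7, 8])
# Same result; 8 additions sequentially vs. 17 spread over 3 rounds.
```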
        
       | Aardwolf wrote:
       | But can I be certain that I won't be waiting longer than I should
        | because something that really should be running on a P core
        | happens to be running on an E core instead?
       | 
       | (Talking about Intel here, I only now see the article is about
       | M1)
        
       ___________________________________________________________________
       (page generated 2022-05-04 23:01 UTC)