[HN Gopher] Running tasks on E cores can use a third of the ener...
___________________________________________________________________
Running tasks on E cores can use a third of the energy of P cores
Author : zdw
Score : 45 points
Date   : 2022-05-03 14:06 UTC (1 day ago)
(HTM) web link (eclecticlight.co)
(TXT) w3m dump (eclecticlight.co)
| modeless wrote:
| I'm not convinced that either estimate is accurate. To validate
| that the measurements are real they should actually run the tasks
| for a significant amount of time and see how much battery is
| used.
| ars wrote:
| I agree - this appears to be very sensitive to the polling
| resolution.
|
| If you only run a task for 0.5 seconds, and you only check
| power usage every 0.25 seconds, you'll get an artificially high
| value.
|
| They need to run this for much longer, and with no breaks
| (moments of low power usage).
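|
| As a toy illustration of the aliasing (all numbers invented; this
| point-samples instantaneous power the way a naive polling meter
| would, which is not necessarily what powermetrics does):
|
|     import numpy as np
|
|     def sampled_energy(burst_start, burst_len=0.4, burst_w=5.0,
|                        idle_w=0.5, total=2.0, period=0.25):
|         """Point-sample power every `period` s and integrate."""
|         t = np.arange(0.0, total, period)
|         in_burst = (t >= burst_start) & (t < burst_start + burst_len)
|         p = np.where(in_burst, burst_w, idle_w)
|         return p.sum() * period
|
|     true_e = 5.0 * 0.4 + 0.5 * 1.6  # exact integral: 2.80 J
|     for start in (0.0, 0.10, 0.20):
|         print(f"burst at {start:.2f}s: "
|               f"{sampled_energy(start):.2f} J vs {true_e:.2f} J")
|
| Depending on where the 0.4 s burst lands relative to the 0.25 s
| sample grid, the estimate swings between ~2.1 J and ~3.3 J for a
| true 2.8 J. Stretch the same duty cycle over minutes and the
| relative error collapses, which is why longer runs matter.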
| astrange wrote:
| powermetrics is not an estimate. Of course some other things on
| the system may change behavior if it's not plugged in.
| phaedrus wrote:
| From the title, not knowing what E and P cores are, I wondered if
| maybe it meant using software emulation to make a "virtual CPU
| core" could actually be _more_ energy efficient than running on a
| physical core. And now I kinda wonder if there is something to
| that idea...
|
| For example, I was just reading a different HN thread where
| people were complaining about Windows 10 randomly turning your
| laptop into a hairdryer when it's otherwise idle (pegging the
| CPU at 100% for an update or scan). It's clear a lot of software
| _doesn't_ need to run at full hardware speed; maybe I'd rather
| these background scans use only 10% CPU and take 10 times as
| long. And I bet it wouldn't take 10 times as long, due to the
| way latency etc. scale.
|
| Another example: a lot of software is inefficient for stupid
| reasons. If we had automated ways to detect the most common
| stupid code, maybe we could just not run it (if the answer makes
| no difference), or run an optimized replacement. Constrained to
| run on a physical core, you either need static analysis or
| dynamic recompilation and are limited in some ways, but running
| on an emulated core you could do whatever you want with dirty
| tricks that the process can't detect.
| eternityforest wrote:
| Linux file scanners are the worst for this, or were until
| recently. I have no idea why it was so hard not to occasionally
| consume 100% CPU and tons of disk IO at inconvenient times, but
| multiple different indexers have exactly the same issue.
|
| I bet you could static analyze to find pure functions, then add
| a caching layer automatically. A lot of slowness seems to be
| because a cache wasn't used.
|
| Or because something stateless should have been stateful (as in,
| it fully recomputes stuff that depends on previous inputs rather
| than calculating one item at a time).
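|
| On the caching idea: the cache layer itself is the easy half. A
| toy sketch in Python, assuming the purity analysis has already
| been done (here it's just asserted by hand with a decorator):
|
|     from functools import lru_cache
|
|     # Pretend static analysis proved this function pure: no I/O,
|     # no globals, output depends only on the arguments. A tool
|     # could then inject the cache automatically; by hand it is
|     # one decorator.
|     @lru_cache(maxsize=None)
|     def summarize(blob: bytes) -> int:
|         return sum(blob) % 65521  # stand-in for expensive pure work
|
| Repeat calls with the same input then cost a hash lookup instead
| of a recomputation. The hard half is proving purity in the first
| place.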
| sliken wrote:
| The major problems I see fall into one of two cases. Often a
| naive implementation runs a filescan as fast as it can, the OS
| starts using more ram for caching, then at some point
| applications need more ram. That triggers reclaiming pages from
| cache, writing dirty pages, and reading pages from the disk for
| the application, all while the scan is keeping the I/O pipeline
| full. The user observes this as a slow/laggy machine and, if
| you are unlucky enough to have an i7/i9 in a laptop, often a
| fan running flat out.
|
| A second case is where a vendor offers "unlimited" backups
| for a flat price. This creates an incentive for the vendor to
| make an INCREDIBLY inefficient backup client that consumes
| substantial ram and CPU resources locally and minimal storage
| remotely. Typically this means that you aren't allowed to use
| more efficient clients, and have to use the vendor's client.
| Here's looking at you, crashplan.
| smoldesu wrote:
| macOS's indexer is just as bad. The number of times I've seen
| 10 mdworker threads pinning my CPU is staggering...
| als0 wrote:
| AFAIK this problem seems to have gone away with the M1-based
| Macs. Can anyone confirm?
| smoldesu wrote:
| I'd imagine it's not an issue since Grand Central
| Dispatch will only assign it to a single core cluster,
| which leaves the other cores free to do their own
| processing.
| ceeplusplus wrote:
| I believe indexing tasks only run on e-cores, at least
| from quick glances at Activity Monitor. I also see more
| kernel code run on e-cores when I'm compiling code, not
| sure if Apple is moving blocking syscalls onto e-cores
| and swapping threads or something (intuition tells me
| this shouldn't lead to perf gains due to context switch
| cost, but who knows).
| astrange wrote:
| Spotlight runs at a low CPU priority, so it doesn't use a
| lot of power (theoretically) and on Intel won't increase
| P-states. The CPU % doesn't mean much at all except that it's
| runnable.
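|
| You can impose the same treatment on your own processes. A
| rough sketch (the compression command is just an example
| workload; see taskpolicy(8) for the exact flag semantics):
|
|     import subprocess
|
|     # taskpolicy -b launches the command clamped to background
|     # QoS; on Apple Silicon the scheduler then keeps it on the
|     # E cores, and on Intel it runs at low priority instead.
|     subprocess.run(["taskpolicy", "-b", "gzip", "-9", "big.log"])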
| smoldesu wrote:
| In practice my CPU hits ~80°C and makes the system nearly
| unusable. I just disabled Spotlight indexing altogether.
| sliken wrote:
| To be fair backups are one of many things (like
| compiling, video conferencing, or even just looking at
| the laptop wrong) that makes x86-64 macs run their fans
| run out. My particular case is an i9 in a mbp 16". It
| throttles heavily while making impressive noise any just
| about any light workload that's not pure idle.
| chlorion wrote:
| On Linux we have something called "cgroups" which is integrated
| with the kernel's scheduler. Cgroups can be used to control and
| limit resources and do exactly what you are describing!
|
| Cgroups can be used to limit the maximum amount of CPU
| bandwidth a certain group can use in a given period with fairly
| high resolution. For example, you can use the cpu.max variable
| to limit a group to run for a maximum of 1000 microseconds per
| second (1s = 1000000us). You can also limit what cores the
| group can run on, and the "weight" of the group when CPU
| bandwidth is being contested by other groups!
|
| Cgroups can also limit resources such as RAM, swap and IO in a
| similar way.
|
| Systemd has some features to put daemons in cgroups and
| provides ways to set the various control variables, which is
| pretty handy; it also has a way to launch regular programs with
| resource controls applied, via systemd-run. There is also
| libcgroup and the sysfs interface for systems without systemd.
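|
| A rough sketch of the cpu.max example above via the raw sysfs
| interface (cgroup v2, run as root, cpu controller enabled in
| the parent; the group name "demo" is arbitrary):
|
|     import os
|     from pathlib import Path
|
|     cg = Path("/sys/fs/cgroup/demo")
|     cg.mkdir(exist_ok=True)
|
|     # cpu.max takes "<quota> <period>" in microseconds:
|     # at most 1000us of CPU time per 1s period.
|     (cg / "cpu.max").write_text("1000 1000000")
|
|     # Joining the group is just writing a PID to cgroup.procs.
|     (cg / "cgroup.procs").write_text(str(os.getpid()))
|
| With systemd, something like
|     systemd-run --scope -p CPUQuota=10% some-command
| creates the transient cgroup for you instead.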
| throw10920 wrote:
| > Conventional wisdom, at least for SMP chips, is that this is
| the most efficient strategy, as it gets the task out of the way
| as quickly as possible.
|
| I've heard this conventional wisdom cited repeatedly in the realm
| of embedded devices, but I don't have any idea how it could
| possibly be true: in a synchronous CMOS chip, power scales with
| the _square_ of voltage, and in order to push the chip to higher
| frequencies, you have to increase the (gate? rail?) voltage -
| doesn't that mean that computing things at lower frequencies is
| almost always more efficient?
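|
| The back-of-envelope version, for a task of N cycles:
|
|     P_dyn ~ C * V^2 * f       (dynamic CMOS power)
|     t     = N / f             (time to finish)
|     E_dyn = P_dyn * t ~ C * V^2 * N
|
| i.e. dynamic energy per task depends on V but not f, and since
| higher f forces higher V, slower clocks should always win -
| unless some other power draw (leakage, uncore, idle) is running
| the whole time.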
| wmf wrote:
| It depends on the idle power.
| nagisa wrote:
| Running a core doesn't only run the core itself. There is a
| fair amount of supporting circuits including, but not limited
| to, things like memory controller, I/O bus, peripherals, etc.
|
| These can sip a non-trivial amount of power, and if a task
| running on an efficiency core kept them all up, you ultimately
| might end up with greater aggregate power use.
|
| Nowadays in modern chips these other circuits seem to be
| brought down to the low power states much more aggressively.
| brigade wrote:
| If you scale frequency down far enough then leakage power can
| easily dominate, and a decade ago it was bad enough that race-
| to-idle really was more power efficient than the DVFS most SoCs
| could do. Vendors were so bad at this that one of ARM's largest
| motivations for big.LITTLE so that their partners could have
| any semblance of DVFS that actually saved power at lower perf
| states.
|
| Although, I think the blogger only measures pinning on
| different CPUs, not the effect of DVFS vs. race-to-idle, so his
| statement is a non sequitur. Because _of course_ a power-
| optimized arch is going to be more efficient than a perf-
| optimized arch (modulo uncore power), and that 's been known
| and true since forever.
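|
| A toy model makes the crossover visible (every constant here is
| invented; assumes V scales linearly with f across the DVFS
| range and a fixed leakage draw while the core is awake):
|
|     # Energy to finish a fixed task of n_cycles, then sleep.
|     def task_energy(f_ghz, n_cycles=2e9, c=1.0, leak_w=0.4):
|         v = 0.6 + 0.2 * f_ghz         # invented V(f) curve
|         p_dyn = c * v**2 * f_ghz      # dynamic power, ~watts
|         t = n_cycles / (f_ghz * 1e9)  # seconds until idle
|         return (p_dyn + leak_w) * t   # joules until sleep
|
|     for f in (0.6, 1.0, 2.0, 3.2):
|         print(f"{f:.1f} GHz -> {task_energy(f):.2f} J")
|
| With this leakage the sweet spot is mid-range (~1 GHz); crank
| leak_w up and race-to-idle wins, drop it toward zero and the
| slow-clock intuition above wins. The whole argument lives in
| those two constants.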
| eternityforest wrote:
| Hardware-based efficiency stuff seems to be the future.
|
| This whole idea that we need light and simple software to be
| fast doesn't quite hold; what we need is software that takes
| fullest advantage of the hardware.
|
| I bet we could do a lot more with this idea. We could have dozens
| of tiny cores, without any math beyond addition, for things like
| regexes and file format parsing.
|
| We could probably fit stuff like small AIs and video codecs on
| SD-card-sized accelerator chips if we really tried.
|
| It will be cool to see how far they can take this.
|
| It will also be really cool to see how we program for this kind
| of thing, to really make the best use of it.
| Wohlf wrote:
| This has always been a key component of processor design and
| has been integral to improvements in consumer device processors
| for a while now.
| zekrioca wrote:
| > It will also be really cool to see how we program for this
| kind of thing, to really make the best use of it.
|
| With light and simple software (libraries)
| GeekyBear wrote:
| With the M2 chip expected to debut at WWDC next month, here's the
| difference between the A14 efficiency cores used in the M1 and
| the A15 efficiency cores used in the current gen iPhones.
|
| >The core has a median performance improvement of +23%, resulting
| in a median IPC increase of +11.6%. The cores here don't showcase
| the same energy efficiency improvement as the new A15's
| performance cores, as energy consumption is mostly flat due to
| performance increases coming at a cost of power increases, which
| are still very much low.
|
| Also, when compared to the efficiency cores used in the
| Snapdragon 888:
|
| >The comparison against the little Cortex-A55 cores is more
| absurd though, as the A15's E-core is 3.5x faster on average, yet
| only consuming 32% more power, so energy efficiency is 60%
| better.
|
| https://www.anandtech.com/show/16983/the-apple-a15-soc-perfo...
| Kon-Peki wrote:
| These Apple Silicon chips are already very efficient, so it is
| pleasantly surprising to see the ability to reduce power
| consumption by 66%. I am wondering, though - some tasks can be
| implemented using multiple algorithms/approaches, and that
| could make the differences even larger (if I am pinning to E
| cores, use algorithm A for maximum efficiency, but if I am
| using P+E cores, use algorithm B for maximum performance).
| wmf wrote:
| _use algorithm A for maximum efficiency but use algorithm B for
| maximum performance_
|
| This is pretty rare; the fastest algorithm is almost always the
| most efficient.
| Lichtso wrote:
| For purely sequential algorithms I agree: fewer work steps
| mean less energy used and less time spent.
|
| But the picture is very different when it comes to parallel
| algorithms. There is often a tradeoff: doing more redundant
| work reduces latency and finishes faster overall, at the
| expense of more energy being used.
|
| One example is the "scan" operation on GPUs. There are lots
| of different algorithms to do it which form a gradient
| between fast and efficient. For reference see page 2 of this
| paper:
| https://research.nvidia.com/publication/2016-03_single-
| pass-...
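|
| The flavor of that gradient can be shown just by counting (a
| sketch, not GPU code): the Hillis-Steele scan minimizes steps
| at the cost of extra adds, Blelloch minimizes adds at the cost
| of extra steps.
|
|     import math
|
|     def hillis_steele(n):
|         # step-efficient: log2(n) steps, ~n*log2(n) adds
|         steps = math.ceil(math.log2(n))
|         work = sum(n - 2**d for d in range(steps))
|         return steps, work
|
|     def blelloch(n):
|         # work-efficient: ~2*log2(n) steps, ~2n adds
|         steps = 2 * math.ceil(math.log2(n))
|         return steps, 2 * (n - 1)
|
|     n = 1 << 20
|     for name, fn in [("Hillis-Steele", hillis_steele),
|                      ("Blelloch", blelloch)]:
|         steps, work = fn(n)
|         print(f"{name}: {steps} steps, {work / n:.1f} adds/elem")
|
| At a million elements that's 20 steps at ~19 adds per element
| versus 40 steps at ~2: half the latency for nearly ten times
| the switching activity.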
| Aardwolf wrote:
| But can I be certain that I won't be waiting longer than I
| should because something that really should be running on a P
| core happens to be running on an E core instead?
|
| (Talking about Intel here, I only now see the article is about
| M1)
___________________________________________________________________
(page generated 2022-05-04 23:01 UTC)