[HN Gopher] Why is my CPU usage always 100%?
___________________________________________________________________
Why is my CPU usage always 100%?
Author : pncnmnp
Score : 409 points
Date : 2025-01-09 21:15 UTC (4 days ago)
(HTM) web link (www.downtowndougbrown.com)
(TXT) w3m dump (www.downtowndougbrown.com)
| veltas wrote:
| It doesn't feel like reading 4 times is necessarily a portable
| solution if there will be more versions at different speeds and
| with different I/O architectures; nor is it clear how this will
| work under more load, or whether the original change was made to
| fix some other performance problem OP is not aware of. But I'm
| not sure what else can be done. Unfortunately many vendors like
| Marvell seriously under-document crucial features like this. If
| anything it would be good to put some of this info in the comment
| itself; not very elegant, but how else practically are we meant
| to keep track of this? Is the mailing list part of the
| documentation?
|
| Doesn't look like there's a lot of discussion on the mailing
| list, but I don't know if I'm reading the thread view correctly.
| _nalply wrote:
| I also wondered about this, but there's a crucial difference, no
| idea if it matters: in that loop it reads the register, so the
| register is read at least 4 times.
| adrian_b wrote:
| This is a workaround for a hardware bug of a certain CPU.
|
| Therefore it cannot really be portable, because other timers in
| other devices will have different memory maps and different
| commands for reading.
|
| The fault is with the designers of these timers, who have
| failed to provide a reliable way to read their value.
|
| It is hard to believe that this still happens in this century,
| because reading correct values while the timer is continuously
| incremented or decremented is an essential goal in the design of
| any readable timer, and how to do it has been well known for
| more than three quarters of a century.
|
| The only way to make such a workaround somewhat portable is to
| parametrize it, e.g. with the number of retries for direct
| reading or with the delay time when reading the auxiliary
| register. This may be portable between different revisions of
| the same buggy timer, but the buggy timers in other unrelated
| CPU designs will need different workarounds anyway.
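|
| Sketched in C, such a parametrization might be a small
| per-revision quirks table feeding the same read loop; the
| names and values below are invented for illustration:
|
|     /* Invented illustration: per-SoC tuning for one family of
|      * buggy timers; other designs need different workarounds. */
|     struct timer_quirks {
|             int capture_reads;  /* retries of the capture register */
|     };
|
|     static const struct timer_quirks pxa168_quirks = {
|             .capture_reads = 4,
|     };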
| stkdump wrote:
| > how to do it has been well known for more than three
| quarters of a century
|
| Don't leave me hanging! How to do it?
| adrian_b wrote:
| Direct reading without the risk of incorrect values is
| possible only when the timer is implemented using a
| synchronous counter instead of an asynchronous counter; the
| synchronous counter must be fast enough to ensure a
| stable, correct value by the time it is read, and the
| read signal must be synchronized with the timer clock
| signal.
|
| Synchronous counters are more expensive in die area than
| asynchronous counters, especially at high clock
| frequencies. Moreover, it may be difficult to also
| synchronize the reading signal with the timer clock.
| Therefore the second solution may be preferable, which uses
| a separate capture register for reading the timer value.
|
| This was implemented in the timer described in TFA, but it
| was done in a wrong way.
|
| The capture register must either ensure that the capture is
| already complete by the time when it is possible to read
| its value after giving a capture command, or it must have
| some extra bit that indicates when its value is valid.
|
| In this case, one can read the capture register until the
| valid bit is on, with complete certainty that the final
| value is correct.
|
| When adding some arbitrary delay between the capture
| command and reading the capture register, you can never be
| certain that the delay value is good.
|
| Even when the chosen delay is 100% effective during
| testing, it can result in failures on other computers or
| when the ambient temperature is different.
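|
| A minimal sketch in C of the valid-bit variant described
| above; TMR_CR, TMR_CCR and TMR_CCR_VALID are hypothetical
| names, not the PXA168's actual interface:
|
|     /* Capture-with-valid-bit read, using the kernel's
|      * readl()/writel() accessors; register names are hypothetical. */
|     static u32 timer_read_captured(void __iomem *base)
|     {
|             u32 val;
|
|             writel(1, base + TMR_CR);             /* request a capture */
|             do {
|                     val = readl(base + TMR_CCR);
|             } while (!(val & TMR_CCR_VALID));     /* wait until valid */
|             return val & ~TMR_CCR_VALID;          /* strip the status bit */
|     }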
| veltas wrote:
| > This is a workaround for a hardware bug of a certain CPU.
|
| What about different variants, revisions, and speeds of this
| CPU?
| Karliss wrote:
| The related part of the doc has one more note: "This request
| requires up to three timer clock cycles. If the selected timer
| is working at slow clock, the request could take longer." From
| the way the doc is formatted it's not fully clear what "this
| request" refers to. It might explain where the 3-5 attempts come
| from, and that they might not be pulled completely out of thin
| air. But the part about taking up to, but sometimes more than,
| three clock cycles makes it impossible to have a "proper"
| solution without guesswork or further clarification from the
| vendor.
|
| The "working at slow clock" part might explain why some other
| implementations had a different code path for 32.768 kHz clocks.
| According to the docs there are two available clock sources,
| "Fast clock" and "32768 Hz", which could mean that "slow clock"
| refers to specific hardware functionality and is not just a
| vague phrase.
|
| As for portability concerns, this is already low-level,
| hardware-specific register access. If Marvell releases a new
| SoC, not only is there no assurance that it will require the
| same timing, it might as well have a different set of registers
| requiring a completely different read and setup procedure, not
| just different timing.
|
| One thing that slightly confuses me: the old implementation
| had 100 cycles of "cpu_relax()", which is unrelated to the
| specific timer clock, but neither is reading the TMR_CVWR
| register. Since 3-5 cycles of that worked better than 100 cycles
| of cpu_relax, it clearly takes more time, unless the cpu_relax
| part got completely optimized out. At least I didn't find any
| references mentioning that the timer clock affects the read time
| of TMR_CVWR.
| veltas wrote:
| It sounds like this is an old CPU(?), so no need to worry
| about the future here.
|
| > I didn't find any references mentioning that timer clock
| affects read time of TMR_CVWR.
|
| Reading the register might be related to the timer's internal
| clock, as it would have to wait for the timer's bus to
| respond. This is essentially implied if Marvell recommend re-
| reading this register, or if their reference implementation
| did so. My main complaint is it's all guesswork, because
| Marvell's docs aren't that good.
| MBCook wrote:
| The Chumby hardware I'm thinking of is from 2010 or so. So
| if that's it, it would certainly be old. And it would
| explain a possible relation with the OLPC having a similar
| chip.
|
| https://en.wikipedia.org/wiki/Chumby
| begueradj wrote:
| Oops, this is not valid.
| M95D wrote:
| I'm sure a few more software updates will take care of this
| little problem...
| zaik wrote:
| You're probably thinking about memory and caching. There are no
| advantages to keeping the CPU at 100% when no workload needs to
| be done.
| josephg wrote:
| Only when your computer actually has work to do. Otherwise your
| CPU is just a really expensive heater.
|
| Modern computers are designed to idle at 0% then temporarily
| boost up when you have work to do. Then once the task is done,
| they can drop back to idle and cool down again.
| PUSH_AX wrote:
| Not that I disagree, but when exactly in modern operating
| systems are there moments where there are zero instructions
| being executed? Surely there are always processes doing
| background things?
| pintxo wrote:
| With multi-core cpus, some of them can be fully off, while
| others handle any background tasks.
| _flux wrote:
| There are a lot of such moments, but they are just short.
| When you're playing music, you download a bit of data from
| the network or the SSD/HDD by first issuing a request and
| then waiting (i.e. doing nothing) to get the short piece of
| data back. Then you decode it and upload a short bit of the
| sound to your sound card and then again you wait for new
| space to come up, before you send more data.
|
| One of the older ways (in x86 side) to do this was to
| invoke the HLT instruction
| https://en.wikipedia.org/wiki/HLT_(x86_instruction) : you
| stop the processor, and then the processor wakes up when an
| interrupt wakes it up. An interrupt might come from the
| sound card, network card, keyboard, GPU, or timer (e.g. 100
| times a second to schedule another process, if some
| process exists that is waiting for CPU), and during the
| time you wait for the interrupt to happen you just do
| nothing, thus saving energy.
|
| I suspect things are more complicated in the world of
| multiple CPUs.
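|
| A minimal sketch in C of that style of idle loop, assuming
| a uniprocessor x86 kernel context where interrupts do the
| waking (modern kernels use deeper idle states instead):
|
|     /* Classic idle loop: enable interrupts, then halt until
|      * one arrives; every wakeup loops back to sleep. */
|     static void idle_loop(void)
|     {
|             for (;;)
|                     __asm__ volatile("sti; hlt");
|     }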
| johannes1234321 wrote:
| From human perception there will "always" be work on a
| "normal" system.
|
| However for a CPU with multiple cores, each running at 2+
| GHz, there is enough room for idling while seeming active.
| reshlo wrote:
| > Timer Coalescing attempts to enforce some order on all
| this chaos. While on battery power, Mavericks will
| routinely scan all upcoming timers that apps have set and
| then apply a gentle nudge to line up any timers that will
| fire close to each other in time. This "coalescing"
| behavior means that the disk and CPU can awaken, perform
| timer-related tasks for multiple apps at once, and then
| return to sleep or idle for a longer period of time before
| the next round of timers fire.[0]
|
| > Specify a tolerance for the accuracy of when your timers
| fire. The system will use this flexibility to shift the
| execution of timers by small amounts of time--within their
| tolerances--so that multiple timers can be executed at the
| same time. Using this approach dramatically increases the
| amount of time that the processor spends idling...[1]
|
| [0] https://arstechnica.com/gadgets/2013/06/how-os-x-
| mavericks-w...
|
| [1] https://developer.apple.com/library/archive/documentati
| on/Pe...
| miki123211 wrote:
| Modern Macs also have two different kinds of cores, slow
| but energy-efficient e-cores and high-performance
| p-cores.
|
| The p cores can be activated and deactivated very
| quickly, on the order of microseconds IIRC, which means
| the processor always "feels" fast while still conserving
| battery life.
| Someone wrote:
| We're not talking about what humans call "a moment". For a
| (modern) computer, a millisecond is "a moment", possibly
| even "a long moment". It can run millions of instructions
| in such a time frame.
|
| A modern CPU also has multiple cores not all of which may
| be needed, and will be supported by hardware that can do
| lots of tasks.
|
| For example, sending out an audio signal isn't typically
| done by the main CPU. It tells some hardware to send a
| buffer of data at some frequency, then prepares the next
| buffer, and can then sleep or do other stuff until it has
| to send the new buffer.
| nejsjsjsbsb wrote:
| My processor gets several whole nanoseconds to rest up, I
| am not a slave driver.
| homebrewer wrote:
| This feels like the often-repeated "argument" that Electron
| applications are fine because "unused memory is wasted memory".
| What Linus meant by that is that the operating system should
| strive to use as much of the _free_ RAM as possible for things
| like file and dentry caches. Not that memory should be wasted
| on millions of layers of abstraction and too-high resolution
| images. But it's often misunderstood that way.
| Culonavirus wrote:
| Eeeh, the Electron issue is overblown.
|
| These days the biggest hog of memory is the browser. Not
| everyone does this, but a lot of people, myself included,
| have tens of tabs open at a time (with tab groups and all of
| that)... all day. The browser is the primary reason I
| recommend a minimum of 16 GB of RAM to F&F when they ask "the
| IT guy" what computer to buy.
|
| When my Chrome is happily munching on many gigabytes of ram I
| don't think a few hundred megs taken by your average Electron
| app is gonna move the needle.
|
| The situation is a bit different on mobile, but Electron is
| not a mobile framework so that's not relevant.
|
| PS: Can I rant a bit about how useless the new(ish) Chrome
| memory saver thing is? What is the point of having tabs open
| if you're gonna remove them from memory and just reload on
| activation? In the age of fast consumer SSDs I'd expect it to
| intelligently hibernate the tabs to disk, otherwise what you
| have are silly bookmarks.
| smolder wrote:
| Your argument against electron being a memory hog is that
| chrome is a bigger one? You are aware that electron is an
| instance of chromium, right?
| rbanffy wrote:
| This is a good point, but it would be interesting if we
| had a "just enough" rendering engine for UI elements that
| was a subset of a browser with enough functionality to
| provide a desktop app environment and that could be
| driven by the underlying application (or by the GUI,
| passing events to the underlying app).
| nejsjsjsbsb wrote:
| Problem there is Electron devs do it for convenience.
| That means esbuild, npm install react this that. If it
| ain't a full browser this won't work.
| caspper69 wrote:
| Funny thing about all of this is that it's just such
| oppressive overkill.
|
| Most GUI toolkits can do layout / graphics / fonts in a
| much simpler (and sane) way. "Reactive" layout is not a
| new concept.
|
| HTML/CSS/JS is not an efficient or clean way to do layout
| in an application. It only exists to shoehorn UI layout
| into a rich text DOCUMENT format.
|
| Can you imagine if Microsoft or Apple had insisted that
| GUI application layout be handled the way we do it today
| back in the 80s and 90s? Straight up C was easier to grok
| than this garbage we have today. The industry as a whole
| should be ashamed. It's not easier, it doesn't make
| things look better, and it wastes billions in developer
| time and user time, not to mention slowly making the
| oceans boil.
|
| Every time I have to use a web-based application (which
| is most of the time nowadays), it infuriates me. The
| latency is atrocious. The UIs are slow. There are
| mysterious errors at least once or twice daily. WTF are
| we doing? When a Windows 95 application ran faster and
| was more responsive and more reliable than something
| written 30 years later, we have a serious problem.
|
| Here's some advice: stop throwing your web code into
| Electron, and start using a cross-platform GUI toolkit.
| Use local files and/or sqlite databases for storage, and
| then sync to the cloud in the background. Voila, non-shit
| applications that stop wasting everybody's effing time.
|
| If your only tool is a hammer, something, something,
| nails...
| eadmund wrote:
| > Eeeh, the Electron issue is oveblown.
|
| > These days the biggest hog of memory is the browser.
|
| That's the problem: Electron is another browser instance.
|
| > I don't think a few hundred megs taken by your average
| Electron app is gonna move the needle.
|
| Low-end machines even in 2025 still come with single-digit
| GB RAM sizes. A few hundred MB is a substantial portion of
| an 8GB RAM bank.
|
| Especially when it's just waste.
| p0w3n3d wrote:
| And then a company says: let's push to the users the
| installer of our brand new app, which will reside in their
| tray, and which we have made in Electron. Poof. 400 MB taken
| for a tray notifier that also accidentally adds a browser
| to memory.
|
| My computer: starts 5 seconds slower
|
| 1 million computers in the world: start cumulatively 5
| million seconds slower
|
| Meanwhile a Microsoft programmer whose postgres via ssh
| starts 500ms slower: "I think this is a rootkit installed
| in ssh"
| Dalewyn wrote:
| >otherwise what you have are silly bookmarks.
|
| My literal _several hundreds_ of tabs are silly bookmarks
| in practice.
| ack_complete wrote:
| It's so annoying when that line is used to defend
| applications with poor memory usage, ignoring the fact that
| all modern OSes already put unallocated memory to use for
| caching.
|
| "Task Manager doesn't report memory usage correctly" is
| another B.S. excuse heard on Windows. It's actually true, but
| the other way around -- Task Manager _underreports_ the
| memory usage of most programs.
| TonyTrapp wrote:
| What you are probably thinking of is "race to idle". A CPU
| should process everything it can, as quickly as it can (using
| all the power), and then go to an idle state, instead of
| processing everything slowly (potentially consuming less energy
| at that time) but taking more time.
| j16sdiz wrote:
| > computer architecture courses.
|
| I guess it was some _theoretical_ task scheduling stuff....
| When you are doing task scheduling, yes, maybe, depends on what
| you optimize for.
|
| .... but this bug has nothing to do with that. This bug is
| about some accounting error.
| g-b-r wrote:
| I expected it to be about holding down the spacebar :/
| labster wrote:
| Spacebar heating was great for my workflow, please re-enable
| smidgeon wrote:
| For the confused: https://www.xkcd.com/1172/
| lohfu wrote:
| He must be running version 10.17 or newer
| g-b-r wrote:
| Not to argue, but I don't understand why someone downvoted it
| sneela wrote:
| This is a wonderful write-up and a very enjoyable read. Although
| my knowledge about systems programming on ARM is limited, I know
| that it isn't easy to read hardware-based time counters; at the
| very least, it's not as simple as the x86 rdtsc [1]. This is
| probably why the author writes:
|
| > This code is more complicated than what I expected to see. I
| was thinking it would just be a simple register read. Instead, it
| has to write a 1 to the register, and then delay for a while, and
| then read back the same register. There was also a very
| noticeable FIXME in the comment for the function, which
| definitely raised a red flag in my mind.
|
| Regardless, this was a very nice read and I'm glad they got to
| the bottom of the issue and got the problem fixed.
|
| [1]: https://www.felixcloutier.com/x86/rdtsc.
| pm215 wrote:
| Bear in mind that the blog post is about a 32 bit SoC that's
| over a decade old, and the timer it is reading is specific to
| that CPU implementation. In the intervening time both timers
| and performance counters have been architecturally
| standardised, so on a modern CPU there is a register roughly
| equivalent to the one x86 rdtsc uses and which you can just
| read; and kernels can use the generic timer code for timers and
| don't need to have board specific functions to do it.
|
| But yeah, nice writeup of the kinds of problem you can run into
| in embedded systems programming.
| InsomniacL wrote:
| > Chumby's kernel did a total of 5 reads of the CVWR register.
| The other two kernels did a total of 3 reads.
|
| > I opted to use 4 as a middle ground
|
| reminded me of xkcd: Standards
|
| https://xkcd.com/927/
| thrdbndndn wrote:
| I don't get the fix.
|
| Why does reading it multiple times fix the issue?
|
| Is it just because reading takes time, so reading multiple
| times lets enough time pass between the write and the read? If
| so, it sounds like a worse solution than just extending the
| waiting delay, like the author did initially.
|
| If not, then I would like to know the reason.
|
| (Needless to say, a great article!)
| rep_lodsb wrote:
| It's possible that actually reading the register takes
| (significantly) more time than an empty countdown loop. A
| somewhat extreme example of that would be on x86, where
| accessing legacy I/O ports for e.g. the timer goes through a
| much lower-clocked emulated ISA bus.
|
| However, a more likely explanation is the use of "volatile"
| (which only appears in the working version of the code).
| Without it, the compiler might even have completely removed the
| loop?
| deng wrote:
| > However, a more likely explanation is the use of "volatile"
| (which only appears in the working version of the code).
| Without it, the compiler might even have completely removed
| the loop?
|
| No, because the loop calls cpu_relax(), which is a compiler
| barrier. It cannot be optimized away.
|
| And yes, reading via the memory bus is much, much slower than
| a barrier. It's absolutely likely that reading 4 times from
| main memory on such an old embedded system takes several
| hundred cycles.
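|
| For reference, on many configurations cpu_relax() boils
| down to a compiler barrier, roughly like this (simplified
| from the kernel sources):
|
|     /* An empty asm with a "memory" clobber: it emits no
|      * instructions, but the compiler must assume memory may
|      * have changed, so loops around it cannot be deleted. */
|     #define barrier() __asm__ __volatile__("" : : : "memory")
|     #define cpu_relax() barrier()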
| rep_lodsb wrote:
| You're right, didn't account for that. Though even when
| declared volatile, the counter variable would be on the
| stack, and thus already in the CPU cache (at least 32K
| according to the datasheet)?
|
| Looking at the assembly code for both versions of this
| delay loop might clear it up.
| deng wrote:
| The only thing volatile does is ensure that the value
| is read from memory each time (which implicitly also
| forbids certain optimizations). Whether that memory is in a CPU
| cache is purely a hardware issue and outside the C
| specification. If you read something like a hardware
| register, you yourself need to take care in some way that
| a hardware cache will not give you old values (by mapping
| it into a non-cached memory area, or by forcing a cache
| update). If you for-loop over something that acts as a
| compiler barrier, all that 'volatile' on the counter
| variable will do is potentially make the for-loop slower.
|
| There's really just very few reasons to ever use
| 'volatile'. In fact, the Linux kernel even has its own
| documentation why you should usually not use it:
|
| https://www.kernel.org/doc/html/latest/process/volatile-
| cons...
| sim7c00 wrote:
| Doesn't volatile also ensure the compiler doesn't change
| the address used for the read (as it might otherwise
| optimise data layout)? So you can be sure, when using MMIO
| etc., that it won't read from the wrong place?
| deng wrote:
| "volatile", according to the standard, simply is: "An
| object that has volatile-qualified type may be modified
| in ways unknown to the implementation or have other
| unknown side effects. Therefore any expression referring
| to such an object shall be evaluated strictly according
| to the rules of the abstract machine."
|
| Or simpler: don't assume anything you think you
| might know about this object; just do as you're told.
|
| And yes, that for instance prohibits putting a value from
| a memory address into a register for further use, which
| would be a simple case of data optimization. Instead, a
| fresh retrieval from memory must be done on each access.
|
| However, whether your system has caching or an MMU is
| outside of the spec. The compiler does not care. If you tell the
| compiler to give you the byte at address 0x1000, it will
| do so. 'volatile' just forbids the compiler to deduce the
| value from already available knowledge. If a hardware
| cache or MMU messes with that, that's your problem, not
| the compiler's.
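|
| A sketch of the standard idiom: readl()-style accessors
| cast the address to a pointer-to-volatile, so every call
| performs a real load. Mapping the region uncached is a
| separate, platform-specific step:
|
|     #include <stdint.h>
|
|     /* Each call compiles to an actual 32-bit load from addr;
|      * volatile forbids reusing a previously loaded value. */
|     static inline uint32_t mmio_read32(const volatile void *addr)
|     {
|             return *(const volatile uint32_t *)addr;
|     }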
| Karliss wrote:
| From what I understand, the timer registers should be on the
| APB(1) bus, which operates at a fixed 26 MHz clock. That should
| be much closer to the scale of the fast timer clocks, compared
| to cpu_relax() and the main CPU clock running somewhere in the
| range of 0.5-1 GHz and potentially doing some dynamic
| frequency scaling for power-saving purposes.
|
| The silliest part of this mess is that the 26 MHz clock for the
| APB1 bus is derived from the same source as the 13 MHz, 6.5 MHz,
| 3.25 MHz, and 1 MHz clocks usable by fast timers.
| deng wrote:
| > Is it just because reading takes time, so reading multiple
| times lets enough time pass between the write and the read?
|
| Yes.
|
| > If so, it sounds like a worse solution than just extending
| the waiting delay, like the author did initially.
|
| Yeah, it's a judgement call. Previously, the code called
| cpu_relax() for waiting, which is also dependent on how this is
| defined (can be simply NOP or barrier(), for instance). The
| reading of the timer register maybe has the advantage that it
| is dependent on the actual memory bus speed, but I wouldn't
| know for sure. Hardware at that level is just messy, and
| especially niche platforms have their fair share of bugs where
| you need to do ugly workarounds like these.
|
| What I'm rather wondering is why they didn't try the other
| solution that was mentioned by the manufacturer: reading the
| timer directly two times and comparing, until you get a stable
| output.
| adrian_b wrote:
| The article says that the buggy timer has 2 different methods
| for reading.
|
| When reading directly, the value may be completely wrong,
| because the timer is incremented continuously and the updating
| of its bits is not synchronous with the reading signal.
| Therefore any bit in the value that is read may be wrong,
| because it has been read exactly during a transition between
| valid values.
|
| The workaround in this case is to read multiple times and
| accept as good a value that is approximately the same for
| multiple reads. The more significant bits of the timer value
| change much less frequently than the least significant bits, so
| on most read attempts only a few bits can be wrong. Only
| seldom will the read value be complete garbage, in which case
| comparing it with the other read values will reject it.
|
| The second reading method was to use a separate capture
| register. After giving a timer capture command, reading an
| unchanging value from the capture register should have caused
| no problems. Except that in this buggy timer, it is
| unpredictable when the capture is actually completed. This
| requires the insertion of an empirically determined delay time
| before reading the capture register, hopefully allowing enough
| time for the capture to be complete.
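|
| Sketched in C, the first method might look like this;
| read_timer_raw() is a hypothetical direct register read and
| the tolerance of 2 counts is arbitrary:
|
|     /* Retry until two consecutive raw reads are close enough
|      * to trust; the unsigned difference tolerates wraparound. */
|     static u32 timer_read_stable(void)
|     {
|             u32 prev = read_timer_raw();
|
|             for (;;) {
|                     u32 cur = read_timer_raw();
|                     if (cur - prev <= 2)
|                             return cur;
|                     prev = cur;
|             }
|     }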
| Dylan16807 wrote:
| > The workaround in this case is to read multiple times and
| accept as good a value that is approximately the same for
| multiple reads.
|
| It's only incrementing at 3.25MHz, right? Shouldn't you be
| able to get exactly the same value for multiple reads? That
| seems both simpler and faster than using this very slow
| capture register, but maybe I'm missing something.
| dougg3 wrote:
| Author here. Thanks! I believe the register reads are just
| extending the delay, although the new approach does have a side
| effect of reading from the hardware multiple times. I don't
| think the multiple reads really matter though.
|
| I went with the multiple reads because that's what Marvell's
| own kernel fork does. My reasoning was that people have been
| using their fork, not only on the PXA168, but on the newer
| PXAxxxx series, so it would be best to retain Marvell's
| approach. I could have just increased the delay loop, but I
| didn't have any way of knowing if the delay I chose would be
| correct on newer PXAxxx models as well, like the chip used in
| the OLPC. Really wish they had more/better documentation!
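|
| For anyone curious, a sketch of the shape of the workaround
| (illustrative only; see the actual patch for the real
| register names and constants):
|
|     /* Latch the count into TMR_CVWR, then read the register
|      * several times: each read crosses into the slower timer
|      * clock domain, giving the capture time to complete. */
|     static u32 timer_read(void)
|     {
|             u32 val = 0;
|             int reads = 4;
|
|             writel(1, timer_base + TMR_CVWR);    /* request capture */
|             while (reads--)
|                     val = readl(timer_base + TMR_CVWR);
|             return val;    /* the last read is the trusted one */
|     }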
| mastax wrote:
| Karliss above found docs which mention:
|
| > This request requires up to three timer clock cycles. If the
| selected timer is working at slow clock, the request could take
| longer.
|
| Let's ignore the weirdly ambiguous second sentence and say for
| pedagogical purposes it takes up to three timer clock cycles
| full stop. Timer clock cycles aren't CPU clock cycles, so we
| can't just do `nop; nop; nop;`. How do we wait three timer
| clock cycles? Well a timer register read is handled by the
| timer peripheral which runs at the timer clock, so reading (or
| writing) a timer register will take until at least the end of
| the next timer clock.
|
| This is a very common pattern when dealing with memory mapped
| peripheral registers.
|
| ---
|
| I'm making some reasonable assumptions about how the clock
| peripheral works. I haven't actually dug into the Marvell
| documentation.
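|
| In C, the pattern is just a handful of dummy reads;
| STATUS_REG here is a placeholder for any register of the
| peripheral:
|
|     /* Each readl() completes a bus transaction paced by the
|      * peripheral's own clock, so n reads wait at least n of
|      * its cycles regardless of the CPU frequency. */
|     static void peripheral_wait_cycles(void __iomem *base, int n)
|     {
|             while (n--)
|                     (void)readl(base + STATUS_REG);
|     }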
| TrickyReturn wrote:
| Probably running Slack...
| rbanffy wrote:
| In the late 1990's I worked in a company that had a couple
| mainframes in their fleet and once I looked into a resource usage
| screen (Omegamon, perhaps? Is it that old?) and noticed the CPU
| was pegged at 100%. I asked the operator if that was normal. His
| answer was "Of course. We paid for that CPU, might as well use
| it". Funny though that mainframes are designed for that - most,
| if not all, non-application work is offloaded to other processors
| in the system so that the CPU can run applications as fast as it
| can.
| defrost wrote:
| Having a number of running processes take the CPU usage to 100%
| is one thing; having an under-utilised CPU with almost no
| processes running _report_ that usage is at 100% is another
| thing, and the subject of the article here.
| rbanffy wrote:
| I didn't intend this as an example of the issue the article
| mentions (a misreporting of usage because of a hardware
| design issue). It was just a fun example of how different
| hardware behaves differently.
|
| One can also say Omegamon (or whatever tool) was
| misreporting, because it didn't account for the processor
| time of the various supporting systems that dealt with
| peripheral operations. After all, they also paid for the disk
| controllers, disks, tape drives, terminal controllers and so
| on, so they could want to drive those to close to 100% as
| well.
| defrost wrote:
| Sure, no drama - I came across as a little dry and clipped
| as I was clarifying on the fly as it were.
|
| I had my time squeezing the last cycle possible from a
| Cyber 205 waaaay back in the day.
| datadrivenangel wrote:
| Some mainframes have the ability to lock clock speed and
| always run at exactly 100%, so you can often have hard
| guarantees about program latency and performance.
| WediBlino wrote:
| An old manager of mine once spent the day trying to kill a
| process that was running at 99% on a Windows box.
|
| When I finally got round to seeing what he was doing, I was
| disappointed to find he was attempting to kill the 'system idle'
| process.
| belter wrote:
| Did he have a pointy hair?
| cassepipe wrote:
| I abandoned Windows 8 for Linux because of a bug (?) where my
| HDD was showing it was 99% busy all the time. I had removed
| every startup program that could be removed and analysed
| thoroughly for any viruses, to no avail. I had no debugging
| skills at the time and wasn't sure the hardware could stand
| Windows 10. That's how Linux got me.
| saintfire wrote:
| I had this happen with an nvme drive. Tried changing just
| about every setting that affected the slot.
|
| Everything worked fine on my Linux install ootb
| margana wrote:
| Why is this such a huge issue if it merely shows it's busy,
| but the performance of it indicates that it actually isn't?
| Switching to Linux can be a good choice for a lot of people,
| the reason just seems a bit odd here. Maybe it was simply the
| straw that broke the camel's back.
| RHSeeger wrote:
| 1. I expect that an HD that is actually doing things 100% of
| the time is going to have its lifespan significantly
| reduced, and
|
| 2. If it isn't doing anything and is just lying to you...
| when there IS a problem, your tools to diagnose the problem
| are limited because you can't trust what they're telling
| you
| ddingus wrote:
| Over the years I have used top and friends to profile
| machines and identify expensive bottlenecks. Once one comes
| to count on those tools, the idea of one being wrong (and
| actually really wrong!) is just a bad rub.
|
| Fixing it would be gratifying and reassuring too.
| ryandrake wrote:
| Recent Linux distributions are quickly catching up to Windows
| and macOS. Do a fresh install of your favorite distribution
| and then use 'ps' to look at what's running. Dozens of
| processes doing who knows what? They're probably not pegging
| your CPU at 100%, which is good, but it seems that gone are
| the days when you could turn on your computer and it was
| truly idle until you commanded it to actually do something.
| That's a special use case now, I suppose.
| ndriscoll wrote:
| IME on Linux the only things that use random CPU while idle
| are web browsers. Otherwise, there's dbus and
| NetworkManager and bluez and oomd and stuff, but most
| processes have a fraction of a second used CPU over months.
| If they're not using CPU, they'll presumably swap out if
| needed, so they're using ~nothing.
| johnmaguire wrote:
| this is why I use arch btw
| rirze wrote:
| this guy arches
| diggan wrote:
| Add Gnome3 and you can have that too! Source: me, an
| arch+gnome user, who recently had to turn off the search
| indexer as it was stuck processing countless multi-GB
| binary files...
| johnisgood wrote:
| Exactly, or Void, or Alpine, but I love pacman.
| craftkiller wrote:
| This is one of the reasons I love FreeBSD. You boot up a fresh
| install of FreeBSD and there are only a couple processes
| running and I know what each of them does / why they are
| there.
| m3047 wrote:
| At least under some circumstances Linux shows (schedulable)
| threads as separate processes. Just be aware of that.
| BizarroLand wrote:
| Windows 8/8.1/10 had an issue for a while where, when run
| on a spinning-rust HDD, it would peg the disk and slow the
| system to a crawl.
|
| The only solution was to swap over to a SSD.
| m463 wrote:
| That's what managers do.
|
| Silly idle process.
|
| If you've got time for leanin', you've got time for cleanin'
| marcosdumay wrote:
| Windows used to have that habit of making the processes CPU
| starved, and yet claiming the CPU was idle all the time.
|
| Since the Microsoft response to the bug was denying and
| gaslighting the affected people, we can't tell for sure what
| caused it. But several people were in a situation where their
| computer couldn't finish any work, and the task-manager claimed
| all of the CPU time was spent on that line item.
| gruez wrote:
| I've never heard of this. How do you know it's windows
| "gaslighting" users, and not something dumb like thermal
| throttling or page faults?
| belter wrote:
| Well this is one possible scenario. Power management....
|
| "Windows 10 Task Manager shows 100% CPU but Performance
| Monitor Shows less than 2%" -
| https://answers.microsoft.com/en-
| us/windows/forum/all/window...
| marcosdumay wrote:
| It's gaslighting because it consists of people from
| Microsoft explicitly saying that it is impossible, that it's
| not how Windows behaves, and that the user's system is idle
| instead of overloaded.
|
| Gaslighting customers was the standard Microsoft reaction
| to bugs until at least 2007, when I last oversaw somebody
| interacting with them.
| RajT88 wrote:
| > Since the Microsoft response to the bug was denying and
| gaslighting the affected people
|
| Well. I wouldn't go that far. Any busy dev team is
| incentivized to make you run the gauntlet:
|
| 1. It's not an issue (you have to prove to me it's an issue)
|
| 2. It's not _my_ issue (you have to prove to me it's my
| issue)
|
| 3. It's not that important (you have to prove it has
| significant business value to fix it)
|
| 4. It's not that time sensitive (you have to prove it's worth
| fixing soon)
|
| It was exactly like this at my last few companies. Microsoft
| is quite a lot like this as well.
|
| If you have an assigned CSAM, they can help run the gauntlet.
| That's what they are there for.
|
| See also: The 6 stages of developer realization:
|
| https://www.amazon.com/Panvola-Debugging-Computer-
| Programmer...
| ziddoap wrote:
| > _If you have an assigned CSAM_
|
| That's an unfortunate acronym. I assume you mean Customer
| Service Account Manager.
| RajT88 wrote:
| Customer Success Account Manager. And I would agree - it
| is very unfortunate.
|
| Definitely in my top 5 questionable acronym choices from
| MSFT.
| thatfunkymunki wrote:
| Your reticence to accept the term gaslighting clearly
| indicates you've never had to interact with MSFT support.
| RajT88 wrote:
| On the contrary, I have spent thousands of hours
| interacting with MSFT support.
|
| What I'm getting at with my post is the dev teams that
| support has to talk to, whose responses they just forward
| along verbatim.
|
| A lot of MSFT support does suck. There are also some
| really amazing engineers in the support org.
|
| I did my time in support early in my career (not at
| MSFT), and so I understand well it's extremely hard to
| hire good support engineers, and even harder to keep
| them. The skills they learn on the job makes them
| attractive to other parts of the org, and they get
| poached.
|
| There is also an industry-wide tendency for developers to
| treat support as a bunch of knuckle-dragging idiots, but
| at the same time they don't arm them with detailed
| information on _how stuff works_.
| RHSeeger wrote:
| > What I'm getting at with my post is the dev teams that
| support has to talk to, whose responses they just forward
| along verbatim.
|
| But the "support" that the end user sees is that
| combination, not two different teams (even if they know
| it's two or more different teams). The point is that the
| end user reached out for help and was told their own
| experiences weren't true. The fact that Dave had Doug
| actually tell them that is irrelevant.
| RajT88 wrote:
| I guess I see your point.
|
| If we're going to call it gaslighting, then gaslighting
| is typical dev team behavior, which of course flows back
| down to support. It's a problem with Microsoft just like
| it is a problem for any other company which makes
| software.
| marcosdumay wrote:
| I've never seen the same behavior from any other software
| supplier.
|
| Almost every software company out there will jump on
| their customers' complaints, and try to fix the issue even
| when the root cause is not in their software.
| RajT88 wrote:
| I can't say I've seen it with every vendor. Or even
| internal dev team I've been an internal customer of - but
| I've seen it around a lot.
|
| You might be lucky in that you've worked at companies
| where you are a big enough customer they bend over
| backwards for you. For example: If you work for Wal-Mart,
| you probably get this less often. They are usually the
| biggest fish in whatever pond they are swimming in.
| Twirrim wrote:
| Even when you have an expensive contract with Microsoft and
| a direct account manager to help you run the gauntlet you
| _still_ end up having to deal with awful support people.
|
| Years ago at a job we were seeing issues with a network
| card on a VM. One of my coworkers spent 2-3 days working
| his way through support engineer after support engineer
| until they got into a call with one. He talked the engineer
| through what was happening. Remote VM, can only access over
| RDP (well, we could VNC too, but that idea just confuses
| Microsoft support people for some reason.)
|
| The support engineer decided that the way to resolve the
| problem was to uninstall and re-install the network card
| driver. Coworker decided to give the support engineer
| enough rope to hang themselves with, hoping it'd help him
| escalate faster: "Won't that break the RDP connection?" "No
| sir, I've done this many times before, trust me" "Okay
| then...."
|
| Unsurprisingly enough, when you uninstall the network card
| driver and cause the instance to have no network cards, RDP
| stops working. Go figure.
|
| Co-worker let the support engineer know that he'd now lost
| access, and a guess why. "Oh, yeah. I can see why that
| might have been a problem"
|
| Co-worker was right though, it did finally let us escalate
| further up the chain....
| nerdile wrote:
| As a former Windows OS engineer, based on the short statement
| here, my assumption would be that your programs are IO-bound,
| not CPU-bound, and that the next step would be to gather data
| (using a profiler) to investigate the bottlenecks. This is
| something any Win32 developer should learn how to do.
|
| Although I can understand how "Please provide data to
| demonstrate that this is an OS scheduling issue since app
| bottlenecks are much more likely in our experience" could
| come across as "denying and gaslighting" to less experienced
| engineers and layfolk.
| Twirrim wrote:
| Years ago I worked for a company that provided managed hosting
| services. That included some level of alarm watching for
| customers.
|
| We used to rotate the "person of contact" (POC) each shift, and
| they were responsible for reaching out to customers, and doing
| initial ticket triage.
|
| One customer kept having a CPU usage alarm go off on their
| Windows instances not long after midnight. The overnight POC
| reached out to the customer to let them know that they had
| investigated and noticed that "system idle processes" were
| taking up 99% of CPU time and the customer should probably
| investigate, and then closed the ticket.
|
| I saw the ticket within a minute or two of it reopening as the
| customer responded with a barely diplomatic message to the tune
| of "WTF". I picked up that ticket, and within 2 minutes had
| figured out the high CPU alarm was being caused by the backup
| service we provided, apologised to the customer and had that
| ticket closed... but not before someone not in the team saw the
| ticket and started sharing it around.
|
| I would love to say that particular support staff never lived
| that incident down, but sadly that particular incident was par
| for the course with them, and the team spent an inordinate
| amount of time doing damage control with customers.
| panarky wrote:
| In the 90s I worked for a retail chain where the CIO proposed
| to spend millions to upgrade the point-of-sale hardware. The
| old hardware was only a year old, but the CPU was pegged at
| 100% on every device and scanning barcodes was very sluggish.
|
| He justified the capex by saying if cashiers could scan
| products faster, customers would spend less time in line and
| sales would go up.
|
| A little digging showed that the CIO wrote the point-of-sale
| software himself in an ancient version of Visual Basic.
|
| I didn't know VB, but it didn't take long to find the loops
| that do nothing except count to large numbers to soak up CPU
| cycles since VB didn't have a sleep() function.
| jimt1234 wrote:
| That's hilarious. I had a similar situation, also back in
| the 90s, when a developer shipped some code that kept
| pegging the CPU on a production server. He insisted it was
| the server, and the company should spend $$$ on a new one
| to fix the problem. We went back-and-forth for a while: his
| code was crap versus the server hardware was inadequate,
| and I was losing the battle, because I was just a lowly
| sysadmin, while he was a great software engineer. Also, it
| was Java code, and back then, Java was kinda new, and
| everyone thought it could do no wrong. I wasn't a developer
| at all back then, but I decided to take a quick look at his
| code. It was basically this:
|
| 1. take input from a web form
|
| 2. do an expensive database lookup
|
| 3. do an expensive network request, wait for response
|
| 4. do another expensive network request, wait for response
|
| 5. and, of course, another expensive network request, wait
| for response
|
| 6. fuck it, another expensive network request, wait for
| response
|
| 7. a couple more database lookups for customer data
|
| 8. store the data in a table
|
| 9. store the same data in another table. and, of course,
| another one.
|
| 10. now, check to see if the form was submitted with valid
| data. if not, repeat all steps above to back-out the data
| from where it was written.
|
| 11. finally, check to see if the customer is a valid/paying
| customer. if not, once again, repeat all the steps above to
| back-out the data.
|
| I looked at the logs, and something like 90% of the
| requests were invalid data from the web form or
| invalid/non-paying customers (this service was provided
| only to paying customers).
|
| I was so upset from this dude convincing management that my
| server was the problem that I sent an email to pretty much
| everyone that said, basically, "This code sucks. Here's the
| problem: check for invalid data/customers first.", and I
| included a snippet from the code. The dude replied-to-all
| immediately, claiming I didn't know anything about Java
| code, and I should stay in my lane. Well, throughout the
| day, other emails started to trickle in, saying, "Yeah, the
| code is the problem here. Please fix it ASAP." The dude was
| so upset that he just left, he went completely AWOL, he
| didn't show up to work for a week or so. We were all
| worried, like he jumped off a bridge or something. It
| turned into an HR incident. When he finally returned, he
| complained to HR that I stabbed him in the back, that he
| couldn't work with me because I was so rude. I didn't
| really care; I was a kid. Oh yeah, his nickname became AWOL
| Wang. LOL
| eludwig wrote:
| Hehe, being a Java dev since the late 90's meant seeing a
| lot of bad code. My favorite was when I was working for a
| large life insurance company.
|
| The company's customer-facing website was servlet based.
| The main servlet was performing horribly, time outs,
| spinners, errors etc. Our team looked at the code and
| found that the original team implementing the logic had a
| problem they couldn't figure out how to solve, so they
| decided to apply the big hammer: they synchronized the
| doService() method... oh dear...
| foobazgt wrote:
| For those not familiar with servlets, this means
| serializing every single request to the server that hits
| that servlet. And a single servlet can serve many
| different pages. In fact, in the early days, servlet
| filters didn't exist, so you would often implement cross-
| cutting functionality like authentication using a
| servlet.
|
| TBF, I don't think a lot of developers at the time (90's)
| were used to the idea of having to write MT-safe callback
| code. Nowadays thousands of object allocations per second
| is nothing to sweat over, so a framework might make a
| different decision to instantiate callbacks per request
| by default.
| nullhole wrote:
| To be fair, it is a really poorly named "process". The computer
| equivalent of the "everything's ok" alarm.
| chowells wrote:
| Long enough ago (win95 era) it wasn't part of Windows to
| sleep the CPU when there was no work to be done. It always
| assigned some task to the CPU. The system idle process was a
| way to do this that played nicely with all of the other
| process management systems. I don't remember when they
| finally added CPU power management. SP3? Win98? Win98SE? Eh,
| it was somewhere in there.
| drsopp wrote:
| I remember listening on FM radio to my 100MHz computer
| running FreeBSD, which sounded like calm rain, and to
| Windows 95, which sounded like a screaming monster.
| fifilura wrote:
| To be fair, there are worse mistakes. It does say 99% CPU.
| Agentus wrote:
| Reminds me of when I was a kid and noticed a virus had taken
| over the registry. From that point forward I attempted to
| delete every single registry file, not quite understanding.
| Between that and excessive bad-website viewing, I dunno how I
| ever managed to not brick my operating system, unlike my
| grandma, who seemed to brick her desktop in a timely fashion
| before each of the many monthly visits to her place.
| bornfreddy wrote:
| The things grandmas do to see their grandsons regularly.
| Smart. :-)
| mrmuagi wrote:
| I wonder, if you made a process with "idle" in its name,
| whether you could end up with the reverse problem, where users
| ignore it. Is there anything preventing an executable from
| being named "System Idle"?
| jsight wrote:
| I worked at a government site with a government machine at one
| time. I had an issue, so I took it to the IT desk. They were
| able to get that sorted, but then said I had another issue.
| "Your CPU is running at 100% all the time, because some sort of
| unkillable process is consuming all your cpu".
|
| Yep, that was "System Idle" that was doing it. They had the
| best people.
| kernal wrote:
| You're keeping us in suspense. Did he ever manage to kill the
| System Idle process?
| a1o wrote:
| This was very well written, I somehow read every single line and
| didn't skip to the end. Great work too!
| amelius wrote:
| To diagnose, why not run "time top" and look at the user and sys
| outputs?
| RajT88 wrote:
| TIL there are still Chumbys alive in the wild. My Insignia
| Chumby 8 didn't last.
| evanjrowley wrote:
| This headline reminded me of Mumptris, an implementation of
| Tetris in the old mainframe-oriented language MUMPS, which, by
| design, uses 100% CPU to reduce latency:
| https://news.ycombinator.com/item?id=4085593
| Suppafly wrote:
| Isn't this one of those problems that switching to linux is
| supposed to fix?
| DougN7 wrote:
| He's on linux
| Suppafly wrote:
| Exactly, that's the joke. If it had been an issue on Windows
| the default response from folks here would be to switch to
| Linux instead of trying to get to the root of the issue.
| Guess I should have included an /s on my comment.
| NotYourLawyer wrote:
| That's an awful lot of effort to deal with an issue that was
| basically just cosmetic. I suspect at some point the author was
| just nerd sniped though.
| dougg3 wrote:
| To be fair, other non-cosmetic stuff uses the CPU percentage.
| This same bug was preventing fast user suspend on the OLPC
| until they worked around it. It was also a fun challenge.
| ndesaulniers wrote:
| Great read! Eerily similar to some bugs I've had, but the root
| cause has been a compiler bug. Debugging a kernel that doesn't
| boot is... interesting. QEMU+GDB to the rescue.
| dmitrygr wrote:
| Curiously, instead of "set capture reg, wait for clock edge,
| read", the "read reg twice, until same result is obtained"
| approach is ignored. This is strange as it is usually much faster
| - reading a 3.25MHz counter at 200MHz+ twice is very likely to
| see the same value twice. For a 32KHz counter, it is basically
| guaranteed.
|
|     u32 val;
|     do {
|         val = readl(...);
|     } while (val != readl(...));
|     return val;
|
| compiles to a nice 6-instr little function on arm/thumb too, with
| no delays:
|
|     readclock:
|         LDR  R2, =...
|     1:  LDR  R0, [R2]
|         LDR  R1, [R2]
|         CMP  R0, R1
|         BNE  1b
|         BX   LR
| markhahn wrote:
| very nice investigation.
|
| shame about the unnecessary use of cat :)
| askvictor wrote:
| My recurring issue (on a variety of laptops, both Linux and
| Windows): the fans will start going full-blast, everything slows
| down, then as soon as I open a task manager CPU usage drops from
| 100% to something negligible.
| crazydoggers wrote:
| You, my friend, most likely have mining malware on your
| systems. They'll shut down when they detect the task manager
| is open, so you don't notice them.
___________________________________________________________________
(page generated 2025-01-13 23:00 UTC)