[HN Gopher] When power cycling your (x86) server isn't enough to...
___________________________________________________________________
When power cycling your (x86) server isn't enough to recover it
Author : zdw
Score : 96 points
Date : 2024-12-22 21:24 UTC (3 days ago)
(HTM) web link (utcc.utoronto.ca)
(TXT) w3m dump (utcc.utoronto.ca)
| trebligdivad wrote:
| A BIOS can forget to reset some devices. A physical device might
| have a design flaw where it forgets to reset some registers on
| reset. A BIOS (including device firmware) can forget to zero some
| RAM/initialise a structure and get lucky.
| garganzol wrote:
| Yep, this is a typical flaw and it can cause annoying
| situations. I've run into it in practice.
| chasil wrote:
| I ran MECleaner once, and removed power from a desktop, waited
| ten seconds, plugged it back in, and the test for the presence of
| the ME was still positive.
|
| I unplugged it and left it overnight, and the next day, the ME
| was gone.
|
| This was the ARC version, but it can remain operational for some
| time after power is removed.
| Joel_Mckay wrote:
| IIRC, on most modern intel cpus removing/blanking the ME will
| reboot the machine every 20 minutes or so. It is unfortunately
| an irremovable OEM hardware RAT on most modern systems.
|
| That being said, there are some versions of BIOS that do allow
| turning the ME off, but most motherboard and laptop
| manufacturers will not allow general consumers to install that
| version of the firmware. There are some groups that have
| figured out how to sign a patched fully feature-unlocked BIOS
| on a per machine basis (disabling ME is a simple Y/N flag), but
| YMMV given these tools are nearly impossible to get working.
|
| AMD should end the clown show of RATs, and eat the remaining
| Intel market. =3
| guerrilla wrote:
| The AMD equivalent is the PSL, right? Can that be disabled on
| any CPUs?
| DaSHacka wrote:
| I am unaware of the PSL, but I know AMD PSP is the
| equivalent to ME for most AMD chips [0].
|
| Some motherboards allow you to disable it, and it doesn't
| do as much as ME in the first place (no network modules or
| built-in remote access purpose like ME)
|
| [0] https://en.m.wikipedia.org/wiki/AMD_Platform_Security_Proces...
| guerrilla wrote:
| Typo, I meant PSP.
| doublepg23 wrote:
| I was under the impression some boutique Linux laptop
| manufacturers like System76 and StarLabs flashed Coreboot.
| Joel_Mckay wrote:
| Indeed, they used the coreboot nvramtool to set the disable
| IME flag.
|
| It's still there, but unlike most consumer BIOS can
| apparently be turned off (whatever that means to Intel.)
|
| Personally, I don't hold a lot of hope outdated on-chip
| minix OS can't be exploited/activated anyway. =3
| DaSHacka wrote:
| > IIRC, on most modern intel cpus removing/blanking the ME
| will reboot the machine every 20 minutes or so. It is
| unfortunately an irremovable OEM hardware RAT on most modern
| systems.
|
| Yes, if ME detects a problem when initializing it grants you
| a 20 minute window as a grace period, presumably to allow
| users to attempt to fix it.
|
| > There are some groups that have figured out how to sign a
| patched fully feature-unlocked BIOS on a per machine basis
| (disabling ME is a simple Y/N flag), but YMMV given these
| tools are nearly impossible to get working.
|
| You can also just flip the HAP bit[0], I'd assume that's what
| those advanced (usually leaked dev build) BIOS firmwares do
| anyway.
|
| > AMD should end the clown show of RATs, and eat the
| remaining Intel market. =3
|
| AMD has PSP[1], which is functionally equivalent (though with
| a significantly smaller attack surface, when left enabled)
|
| I personally am of the belief that both technologies are
| likely backdoored. There's so much pointing against them[2],
| that the simplest explanation is they're more likely than not
| a mandated backdoor that chipmakers eventually expanded for
| other purposes (such as recent versions of ME handling
| suspend-related power management)
|
| [0] https://github.com/corna/me_cleaner/wiki/HAP-AltMeDisable-bi...
|
| [1] https://en.m.wikipedia.org/wiki/AMD_Platform_Security_Proces...
|
| [2] https://en.m.wikipedia.org/wiki/Intel_Management_Engine#Asse...
| Joel_Mckay wrote:
| Computrace was replaced by the Absolute BIOS module, so
| yes... 100% RAT features have been active for some time.
| Whatever legitimate asset recovery and remote drive
| deletion features it offers are outweighed by potential
| backdoors on the refurbished PC market.
|
| This is why we can't have nice things. =3
| trilbyglens wrote:
| Probably a capacitor in there somewhere that slowly discharges
| when unplugged for a longer time.
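As a rough illustration of the capacitor theory above: for an ideal RC discharge the voltage decays as V(t) = V0 * exp(-t / RC), so the time to fall below a threshold is t = RC * ln(V0 / Vth). The sketch below assumes a single capacitor draining through a fixed resistance. All component values are hypothetical and are not measured from any real board.

```python
import math

def discharge_time(capacitance_f, resistance_ohm, v_start, v_threshold):
    """Seconds for an ideal RC circuit to decay from v_start to v_threshold."""
    if not (0 < v_threshold < v_start):
        raise ValueError("need 0 < v_threshold < v_start")
    return resistance_ohm * capacitance_f * math.log(v_start / v_threshold)

# Hypothetical values: 1000 uF of standby capacitance leaking through 1 Mohm,
# starting at 3.3 V, with logic assumed to lose its state below 1.0 V.
t = discharge_time(1000e-6, 1e6, 3.3, 1.0)
print(f"{t / 60:.1f} minutes")
```

With these made-up numbers the hold-up time is on the order of twenty minutes, which would explain why a ten-second unplug can be too short while leaving the machine off overnight is enough.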
| Szpadel wrote:
| I have at least 2 regular cases where full power off was required
| to resolve the issue.
|
| The first is a Dell Latitude laptop with a fingerprint reader.
| Randomly, after a few days of operation, the fingerprint reader
| stops responding and login screens freeze for a minute until it
| times out a few times. Rebooting does not solve it, nor does
| suspending the machine. It needs to be powered off and on again
| (hibernation to disk also works).
|
| The second is my PC with an ASRock X570 Creator. After it had
| been kept suspended for a long time, the WiFi card stopped
| working and just threw some errors in dmesg on driver
| initialization. Here even powering off and on did not help, but
| flipping the switch on the power supply off for a few seconds
| resolved the issue.
| Latty wrote:
| The WiFi/Bluetooth one was common on AM4, I think, I also had
| that issue.
| duffyjp wrote:
| The integrated wifi/bt on my AM5 board was so bad I had to
| disable it and use a PCIe card.
|
| For obvious reasons AMD boards don't tend to ship with Intel
| wifi, but in my experience anything else sucks. The intel 6e
| cards are amazing and dirt cheap.
| doubled112 wrote:
| I've had some weirdness with Intel WiFi cards over the
| years too, especially when dual booting.
| duffyjp wrote:
| Was it the 9560 by chance? (The original AC / wifi5 one)
| Those were terrible. Our house isn't practical to wire,
| so I had a lot of them. All swapped to AX210 cards (6E)
| and those work phenomenally.
|
| I also dual boot, in addition to being an incurable
| distro hopper, and these AX210 cards worked out of the
| box in basically everything.
| tonyarkles wrote:
| Yeah I've got a Lenovo Legion laptop that I dual-boot
| Windows and Linux. I haven't tried in a while but for at
| least a year it was impossible to soft-reboot to switch
| OSes if you wanted wifi to work. My best theory was that
| Windows and Linux had different firmware that they loaded
| into it at boot and they weren't reloading that after a
| soft reboot (just using whatever was already running on
| the card).
| toast0 wrote:
| > For obvious reasons AMD boards don't tend to ship with
| Intel wifi, but in my experience anything else sucks.
|
| Cause realtek checks the box for has wifi and costs
| probably $3 less? If you care, you can swap it, and if you
| don't, you don't.
| jmb99 wrote:
| > For obvious reasons AMD boards don't tend to ship with
| Intel wifi
|
| Funnily enough, Threadripper boards (at least WRX90, and at
| least ASRock) come with an Intel dual 10Gb LAN card.
| Probably because none of the alternatives are good enough
| for a pro board.
| speckx wrote:
| My friend had an issue with a laptop that did not resolve until
| the battery was fully drained.
| Aachen wrote:
| Wouldn't the quicker solution be disconnecting the battery
| for 2 seconds?
| pests wrote:
| Not everyone has the skills or knowledge to disassemble
| their laptop. I haven't had a removable easily replaceable
| battery since I feel 2006ish. My current one requires 8
| security screws on the bottom, a bracket removed, and even
| I had some issues when I did a swap earlier this year.
| apfsx wrote:
| I've actually had some strange anomalies like this on a couple
| of laptops I have. Rebooting, or even holding the power button
| long enough to trigger what the manufacturer describes as some
| kind of CMOS or hard reset, didn't work either. I had to open up
| the bottom cover, unplug the battery completely, then plug it
| back in, and everything went back to operational condition.
| Reventlov wrote:
| I had the problem on APU4C4, iirc. You install openwrt on it,
| everything is working fine, then, you reboot and you get nothing
| on the serial port.
|
| You unplug/plug it, cold boot it, and then it works again.
| kyrofa wrote:
| The Linux kernel supports rebooting using a number of different
| strategies[1]. Some PCs need a different one than the default in
| order to make sure everything is properly reset.
|
| [1]:
| https://github.com/torvalds/linux/blob/9b2ffa6148b1e4468d08f...
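To make the "different strategies" point concrete, here is a minimal sketch that reports which x86 reboot method was requested through the `reboot=` kernel parameter. The single-letter codes follow the kernel's `reboot_type` enum as I understand it; treat the table as an assumption and check it against your kernel's documentation. The parsing is deliberately simplified and ignores the mode tokens (warm, cold, and so on).

```python
# Letters as defined by the kernel's x86 reboot_type enum (assumed to be
# unchanged in the running kernel); the descriptions are informal.
REBOOT_METHODS = {
    "t": "triple fault",
    "k": "keyboard controller",
    "b": "BIOS",
    "a": "ACPI",
    "e": "EFI",
    "p": "PCI CF9 (forced)",
    "q": "PCI CF9 (safe)",
}

def reboot_methods(cmdline):
    """Return the reboot methods requested via reboot= on a kernel command line."""
    found = []
    for arg in cmdline.split():
        if arg.startswith("reboot="):
            # Only the first letter of each comma-separated token matters here.
            for token in arg[len("reboot="):].split(","):
                method = REBOOT_METHODS.get(token[:1])
                if method:
                    found.append(method)
    return found

print(reboot_methods("root=/dev/sda1 ro reboot=pci quiet"))
```

On a live system the input could come from reading `/proc/cmdline`. An empty result means no method was forced and the kernel uses its default sequence.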
| mjg59 wrote:
| Linux now uses exactly the same reboot strategy as Windows
| does, so no PC should "need" a different one - it may be the
| case that driver code leaves the hardware in a state the system
| vendor didn't test, and using a different reboot approach may
| work around that, but it's not fundamentally the reboot method
| that's causing the problem there
| (https://mjg59.dreamwidth.org/3561.html goes into some more
| detail on how all this actually works)
| kyrofa wrote:
| Yes, I didn't mean to imply that Linux was doing anything
| wrong, just that some hardware seems to work better with
| other approaches, for the reasons you state.
| vachina wrote:
| I've come across Acer laptops that'd always bluescreen on restart
| after a PROCHOT shutdown. The fix is to pull out the battery for
| a few seconds and then plug it back in, which magically fixes the
| bluescreen.
| jcalvinowens wrote:
| OP, what Linux is this? I'm really curious, I don't recognize
| that trace format and I can't find the code to print exception
| traces with the eight bangs on the first line like that anywhere
| in the upstream git history. I think they're actually from the
| BIOS?
|
|     !!!! X64 Exception Type - 12(#MC - Machine-Check) CPU Apic ID - 00000000 !!!!
|
| My story: I had an Intel NUC running Linux back in the day, which
| would get stuck in standby such that I had to remove and replace
| the CMOS battery to get it to boot again! I never figured that
| one out...
| pzmarzly wrote:
| This is a trace from the BIOS, it is not uncommon to have them
| printed over the serial console. Potentially the BIOS is based
| on EDK2 source code, in which case you can take a look here for
| the implementation of the trace printing logic:
| https://github.com/tianocore/edk2/blob/9e6537469d4700d9d793e...
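If you are collecting these firmware traces from serial console logs, a small parser helps to pick them out. The sketch below matches only the header line quoted earlier in the thread. It treats the type field as hexadecimal, which is an assumption on my part (it is consistent with 12 hex = 18, the Machine-Check vector). The function name and field names are made up for illustration.

```python
import re

# Matches the "!!!! ... !!!!" header line; wrapped log lines must be joined first.
HEADER = re.compile(
    r"!!!!\s+(?P<arch>\w+)\s+Exception Type\s+-\s+(?P<type>[0-9A-Fa-f]+)"
    r"\((?P<name>[^)]*)\)\s+CPU Apic ID\s+-\s+(?P<apic>[0-9A-Fa-f]+)\s+!!!!"
)

def parse_exception_header(line):
    """Return the fields of a firmware exception header, or None if no match."""
    m = HEADER.search(line)
    if m is None:
        return None
    return {
        "arch": m.group("arch"),
        "vector": int(m.group("type"), 16),          # assumed to be hex
        "name": " ".join(m.group("name").split()),   # normalise whitespace
        "apic_id": int(m.group("apic"), 16),
    }

line = "!!!! X64 Exception Type - 12(#MC - Machine-Check)  CPU Apic ID - 00000000 !!!!"
print(parse_exception_header(line))
```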
| neuroelectron wrote:
| I've seen similar behavior when trying out a fork bomb in the
| terminal on both Linux and Windows. My guess is that on windows
| the fork bomb made it into the virtual memory and was recorded to
| disk and wasn't cleaned out completely during boot.
|
| It took 3 reboots to clear up the errors. Generally on the Linux
| system one extra reboot was necessary about half of the time.
| geocrasher wrote:
| Whenever I power cycle something that doesn't go right the first
| time, I leave it off for at least 30 seconds so all the caps can
| discharge and any saved state can reset. Especially true of
| routers etc.
| ijustlovemath wrote:
| You can further be sure of this by pressing the On button while
| the power supply is disconnected. Ofc make sure it's always off
| when you connect or disconnect the power supply.
| klysm wrote:
| Depends on how the on button is implemented, and the power
| management of the system. On older devices I would expect
| this to be more reliable.
| geocrasher wrote:
| Indeed, this used to be my "secret trick" for laptops that
| wouldn't power on: Disconnect the battery and power supply,
| hold the power button for 30 seconds, then power it back up.
| Worked every time.
| bell-cot wrote:
| I think it was the 1970's when I first heard of the "remove
| power, wait a good while, try again" strategy.
|
| The subject was a cheap little black & white TV set that my folks
| had. Dad was an amateur radio operator, who mostly built his own
| equipment. He could have disassembled it, traced circuits, and
| calculated the wait time if he'd cared to.
| markhahn wrote:
| I usually prefer the 'reset' option (such as in IPMI). After all,
| this is the as-designed way to politely ask all devices to re-
| initialize.
|
| Yes, power-cycling is more unambiguous, but afaict, the example
| here is purely that power cycling really needs a noticeable
| off-period so that all devices can fully come down. Otherwise,
| there's no real standard on what should happen - this or that
| component might stay up or retain state.
|
| The other reason I like 'reset' is that lots of devices (fans,
| disks, probably all power systems - definitely including PSUs)
| have lifetime limits in power cycles. Mostly this is minor,
| unless you do something like reboot cluster nodes after a job
| (conceivably a paranoid security requirement), or some automation
| gets in a loop and continually zaps a server.
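The reset-first approach with a deliberate off-period as the fallback can be sketched as a plan of `ipmitool` invocations. This only builds the command lines and does not run them. The host and user names are placeholders, and `-E` assumes the password is supplied in the `IPMI_PASSWORD` environment variable. The 30-second default is an arbitrary choice, not a documented requirement.

```python
def ipmi_argv(host, user, action):
    """Build an ipmitool command line for a chassis power action."""
    if action not in {"status", "on", "off", "cycle", "reset"}:
        raise ValueError(f"unsupported action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
            "chassis", "power", action]

def recovery_plan(host, user, off_seconds=30):
    """Escalating steps: try a reset, then off / wait / on if that fails."""
    return [
        ("run", ipmi_argv(host, user, "reset")),
        # The remaining steps are only needed if the reset did not help.
        ("run", ipmi_argv(host, user, "off")),
        ("sleep", off_seconds),
        ("run", ipmi_argv(host, user, "on")),
    ]

for step in recovery_plan("bmc.example.org", "admin"):
    print(step)
```

Note that a BMC-driven power off normally leaves standby power applied, so this is still weaker than physically pulling the cord, which is what several of the stories in this thread needed.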
| NBJack wrote:
| I have had several laptops over the years like this. Full
| shutdown and power on does not reset some problems, like missing
| audio, missing wifi, etc. For Lenovo devices, I have to go as far
| as using the 'recovery' button. This goes for DP Alt Mode as
| well. Kinda annoying, but at least there's a solution.
| petemc_ wrote:
| When managing large numbers of Dell rack mounted servers, a flea
| power drain is something you become very familiar with.
| dxdxdt wrote:
| I don't get it. That post was a whole bag of nothing. Why are you
| guys upvoting it?
| magicalhippo wrote:
| I've had it happen to me, so not a whole bag of nothing, and
| might be surprising to some.
|
| Also, a topic which can spur some interesting comments.
| GlenTheMachine wrote:
| In grad school we built a lot of logic boards from scratch. They
| were used for submersible robots, and we had a 350,000 gallon
| water tank that we kept heated to 88 degrees. This was three
| stories above the ground in a metal building. You can't really
| air condition that, so in the summer it got quite hot.
|
| It was not uncommon to return from lunch to find that an embedded
| computer board that had been working when you left wasn't any
| more. One way to debug them was to put them in the refrigerator
| for a while. If they then worked, you knew you had a bad solder
| joint or an IC that was on the verge of failing.
| nielsbot wrote:
| Wow. That's 3M lbs of water. (1.34M kg)
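For anyone who wants to check the arithmetic, here is a quick sketch. It assumes US gallons and a water density of about 0.995 kg/L for warm water; both are my assumptions, since the comment above does not state them.

```python
US_GALLON_L = 3.785411784      # litres per US gallon (exact by definition)
LB_PER_KG = 1 / 0.45359237     # pounds per kilogram (exact by definition)
DENSITY_KG_PER_L = 0.995       # assumed density of water near 88 F

def tank_mass(gallons):
    """Return (kilograms, pounds) of water for a volume in US gallons."""
    kg = gallons * US_GALLON_L * DENSITY_KG_PER_L
    return kg, kg * LB_PER_KG

kg, lb = tank_mass(350_000)
print(f"{kg / 1e6:.2f}M kg, {lb / 1e6:.2f}M lb")
```

Under these assumptions the result is roughly 1.32M kg and 2.9M lb, which is close to the rounded figures quoted above.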
| kccqzy wrote:
| I've experienced a similar problem with a Thunderbolt port on a
| machine. Nothing that plugs into the machine would be recognized.
| Not even a simple USB device. Power cycling multiple times didn't
| fix it. But powering off and leaving the machine off for a few
| minutes fixed it.
|
| Given the problem occurred only once, I didn't do any more
| investigation on why.
| zoky wrote:
| Bad electrons. Turning off the power lets them drain out.
| snakeyjake wrote:
| Are these Dells?
|
| Some Dells have a "feature" where something, somewhere, in their
| mess of a UEFI/iDRAC stack will get corrupted and will stay wrong
| through power cycles until you physically unplug the servers from
| power and hold down the power button to discharge a capacitor and
| clear out the NVRAM where the corrupted value is.
|
| Most recently this impacted a PowerEdge R7525 server we have
| where the iDRAC was enforcing a power cap of ~300 watts leaving
| the system to be less than 1/10th as performant as it should have
| been. Manually setting a new power cap did nothing except update
| the values displayed in the UI. Multiple six minute (because of
| their mess of a UEFI/iDRAC stack) reboots of both the server and
| the iDRAC did nothing.
|
| Dell was less than useful except for the fact that they hosted
| the answer. After raging against their CSA script/LLM auto-reply
| bullshit for days an aggrieved user with the same issue looking
| for help in their forums finally posted that he did the cap drain
| trick and it worked.
|
| Saved me tons of wasted time. Thanks, anonymous fellow frustrated
| dell customer!
| wibbily wrote:
| Something like this happened to me once. Lost power in a
| lightning storm and when it came back my computer could no longer
| shut off.
|
| Like, at all. Would just hang when you tried. Couldn't exit from
| BIOS after changing settings, couldn't suspend to RAM. Had to
| yoink the cord whenever I needed to restart. Wild stuff.
|
| Perhaps like Frankenstein the lightning was a breath of life, and
| with its new sentience my PC was trying to preserve its
| existence. At any rate I reflashed the BIOS after a few months
| and it never happened again.
___________________________________________________________________
(page generated 2024-12-25 23:00 UTC)