[HN Gopher] When power cycling your (x86) server isn't enough to...
       ___________________________________________________________________
        
       When power cycling your (x86) server isn't enough to recover it
        
       Author : zdw
       Score  : 96 points
       Date   : 2024-12-22 21:24 UTC (3 days ago)
        
 (HTM) web link (utcc.utoronto.ca)
 (TXT) w3m dump (utcc.utoronto.ca)
        
       | trebligdivad wrote:
       | A BIOS can forget to reset some devices. A physical device might
       | have a design flaw where it forgets to reset some registers on
       | reset. A BIOS (including device firmware) can forget to zero some
       | RAM/initialise a structure and get lucky.
        
         | garganzol wrote:
         | Yep, this is a typical flaw and it can cause annoying
         | situations. I have run into it in practice.
        
       | chasil wrote:
       | I ran MECleaner once, and removed power from a desktop, waited
       | ten seconds, plugged it back in, and the test for the presence of
       | the ME was still positive.
       | 
       | I unplugged it and left it overnight, and the next day, the ME
       | was gone.
       | 
       | This was the ARC version, but it can remain operational for some
       | time after power is removed.
        
         | Joel_Mckay wrote:
         | IIRC, on most modern intel cpus removing/blanking the ME will
         | reboot the machine every 20 minutes or so. It is unfortunately
         | an irremovable OEM hardware RAT on most modern systems.
         | 
         | That being said, there are some versions of BIOS that do allow
         | turning the ME off, but most motherboard and laptop
         | manufacturers will not allow general consumers to install that
         | version of the firmware. There are some groups that have
         | figured out how to sign a patched fully feature-unlocked BIOS
         | on a per machine basis (disabling ME is a simple Y/N flag), but
         | YMMV given these tools are nearly impossible to get working.
         | 
         | AMD should end the clown show of RATs, and eat the remaining
         | Intel market. =3
        
           | guerrilla wrote:
           | The AMD equivalent is the PSL, right? Can that be disabled on
           | any CPUs?
        
             | DaSHacka wrote:
             | I am unaware of the PSL, but I know AMD PSP is the
             | equivalent to ME for most AMD chips [0].
             | 
             | Some motherboards allow you to disable it, and it doesn't
             | do as much as ME in the first place (no network modules or
             | built-in remote access purpose like ME)
             | 
             | [0] https://en.m.wikipedia.org/wiki/AMD_Platform_Security_P
             | roces...
        
               | guerrilla wrote:
               | Typo, I meant PSP.
        
           | doublepg23 wrote:
           | I was under the impression some boutique Linux laptop
           | manufacturers like System76 and StarLabs flashed Coreboot.
        
             | Joel_Mckay wrote:
             | Indeed, they used the coreboot nvramtool to set the disable
             | IME flag.
             | 
             | It's still there, but unlike most consumer BIOS can
             | apparently be turned off (whatever that means to Intel.)
             | 
             | Personally, I don't hold a lot of hope outdated on-chip
             | minix OS can't be exploited/activated anyway. =3
        
           | DaSHacka wrote:
           | > IIRC, on most modern intel cpus removing/blanking the ME
           | will reboot the machine every 20 minutes or so. It is
           | unfortunately an irremovable OEM hardware RAT on most modern
           | systems.
           | 
           | Yes, if ME detects a problem when initializing it grants you
           | a 20 minute window as a grace period, presumably to allow
           | users to attempt to fix it.
           | 
           | > There are some groups that have figured out how to sign a
           | patched fully feature-unlocked BIOS on a per machine basis
           | (disabling ME is a simple Y/N flag), but YMMV given these
           | tools are nearly impossible to get working.
           | 
           | You can also just flip the HAP bit[0], I'd assume that's what
           | those advanced (usually leaked dev build) BIOS firmwares do
           | anyway.
           | 
           | > AMD should end the clown show of RATs, and eat the
           | remaining Intel market. =3
           | 
           | AMD has PSP[1], which is functionally equivalent (though with
           | a significantly smaller attack surface, when left enabled)
           | 
           | I personally am of the belief that both technologies are
           | likely backdoored. There's so much pointing against them[2],
           | that the simplest explanation is they're more likely than not
           | a mandated backdoor that chipmakers eventually expanded for
           | other purposes (such as recent versions of ME handling
           | suspend-related power management)
           | 
           | [0] https://github.com/corna/me_cleaner/wiki/HAP-
           | AltMeDisable-bi...
           | 
           | [1] https://en.m.wikipedia.org/wiki/AMD_Platform_Security_Pro
           | ces...
           | 
           | [2] https://en.m.wikipedia.org/wiki/Intel_Management_Engine#A
           | sse...
        
             | Joel_Mckay wrote:
             | Computrace was replaced by the Absolute BIOS module, so
             | yes... 100% RAT features have been active for some time.
             | Whatever legitimate asset recovery and remote drive
             | deletion features it offers are superseded by potential
             | backdoors on the refurbished PC market.
             | 
             | This is why we can't have nice things. =3
        
         | trilbyglens wrote:
         | Probably a capacitor in there somewhere that slowly discharges
         | when unplugged for a longer time.
        
       | Szpadel wrote:
       | I have at least 2 regular cases where full power off was required
       | to resolve the issue.
       | 
       | The first is a Dell Latitude laptop with a fingerprint reader.
       | Randomly, after a few days of operation, the fingerprint reader
       | stops responding and login screens freeze for a minute until it
       | times out a few times. Rebooting does not solve it, nor does
       | suspending the machine. It needs to be powered off and on again
       | (hibernation to disk also works).
       | 
       | The second is my PC with an ASRock Creator X570. After keeping
       | it suspended for a long time, the WiFi card stopped functioning
       | and just threw some errors in dmesg on driver initialization.
       | Here even powering off and on did not help, but flipping the
       | switch on the power supply for a few seconds resolved the issue.
        
         | Latty wrote:
         | The WiFi/Bluetooth one was common on AM4, I think, I also had
         | that issue.
        
           | duffyjp wrote:
           | The integrated wifi/bt on my AM5 board was so bad I had to
           | disable it and use a PCIe card.
           | 
           | For obvious reasons AMD boards don't tend to ship with Intel
           | wifi, but in my experience anything else sucks. The Intel 6E
           | cards are amazing and dirt cheap.
        
             | doubled112 wrote:
             | I've had some weirdness with Intel WiFi cards over the
             | years too, especially when dual booting.
        
               | duffyjp wrote:
               | Was it the 9560 by chance? (The original AC / wifi5 one)
               | Those were terrible. Our house isn't practical to wire,
               | so I had a lot of them. All swapped to AX210 cards (6E)
               | and those work phenomenally.
               | 
               | I also dual boot, in addition to being an incurable
               | distro hopper, and these AX210 cards worked out of the
               | box in basically everything.
        
               | tonyarkles wrote:
               | Yeah I've got a Lenovo Legion laptop that I dual-boot
               | Windows and Linux. I haven't tried in a while but for at
               | least a year it was impossible to soft-reboot to switch
               | OSes if you wanted wifi to work. My best theory was that
               | Windows and Linux had different firmware that they loaded
               | into it at boot and they weren't reloading that after a
               | soft reboot (just using whatever was already running on
               | the card).
        
             | toast0 wrote:
             | > For obvious reasons AMD boards don't tend to ship with
             | Intel wifi, but in my experience anything else sucks.
             | 
             | Because Realtek checks the "has wifi" box and probably
             | costs $3 less? If you care, you can swap it, and if you
             | don't, you don't.
        
             | jmb99 wrote:
             | > For obvious reasons AMD boards don't tend to ship with
             | Intel wifi
             | 
             | Funnily enough, Threadripper boards (at least WRX90, and at
             | least ASRock) come with an Intel dual 10Gb LAN card.
             | Probably because none of the alternatives are good enough
             | for a pro board.
        
         | speckx wrote:
         | My friend had an issue with a laptop that did not resolve until
         | the battery was fully drained.
        
           | Aachen wrote:
           | Wouldn't the quicker solution be disconnecting the battery
           | for 2 seconds?
        
             | pests wrote:
             | Not everyone has the skills or knowledge to disassemble
             | their laptop. I haven't had an easily replaceable,
             | removable battery since, I feel, 2006ish. My current one
             | requires removing 8 security screws on the bottom and a
             | bracket, and even I had some issues when I did a swap
             | earlier this year.
        
         | apfsx wrote:
         | I've actually had some strange anomalies like this happen on
         | a couple of laptops I have. Rebooting, or even holding the
         | power button long enough to do, according to the manufacturer,
         | some kind of CMOS or hard reset, didn't work either. I had to
         | open up the bottom cover, unplug the battery completely, then
         | plug it back in, and everything went back to operational
         | condition.
        
       | Reventlov wrote:
       | I had the problem on APU4C4, iirc. You install openwrt on it,
       | everything is working fine, then, you reboot and you get nothing
       | on the serial port.
       | 
       | You unplug/plug it, cold boot it, and then it works again.
        
       | kyrofa wrote:
       | The Linux kernel supports rebooting using a number of different
       | strategies[1]. Some PCs need a different one than the default in
       | order to make sure everything is properly reset.
       | 
       | [1]:
       | https://github.com/torvalds/linux/blob/9b2ffa6148b1e4468d08f...
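
       A rough sketch of how one might see which strategy a machine was
       asked to use, assuming the override is given through the x86
       `reboot=` kernel command-line parameter (method letters such as
       b[ios], a[cpi], k[bd], t[riple], e[fi], p[ci]). The helper name
       `reboot_options` is hypothetical and the parsing is deliberately
       simplified:

```python
from typing import List

# First letter of each x86 reboot method accepted by the kernel's
# `reboot=` parameter (the kernel switches on the first character).
METHODS = {
    "b": "bios",
    "a": "acpi",
    "k": "kbd",
    "t": "triple",
    "e": "efi",
    "p": "pci",
}


def reboot_options(cmdline: str) -> List[str]:
    """Return the reboot methods named by `reboot=` in a kernel command line."""
    found = []
    for token in cmdline.split():
        if not token.startswith("reboot="):
            continue
        for option in token[len("reboot="):].split(","):
            # Other options (warm, cold, force, ...) are ignored in this sketch.
            method = METHODS.get(option[:1].lower())
            if method:
                found.append(method)
    return found
```

       On a running Linux system the input would typically be the
       contents of /proc/cmdline; an empty result means no method was
       overridden and the kernel's default sequence applies.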
        
         | mjg59 wrote:
         | Linux now uses exactly the same reboot strategy as Windows
         | does, so no PC should "need" a different one - it may be the
         | case that driver code leaves the hardware in a state the system
         | vendor didn't test, and using a different reboot approach may
         | work around that, but it's not fundamentally the reboot method
         | that's causing the problem there
         | (https://mjg59.dreamwidth.org/3561.html goes into some more
         | detail on how all this actually works)
        
           | kyrofa wrote:
           | Yes, I didn't mean to imply that Linux was doing anything
           | wrong, just that some hardware seems to work better with
           | other approaches, for the reasons you state.
        
       | vachina wrote:
       | I've come across Acer laptops that'd always bluescreen on restart
       | after a PROCHOT shutdown. The fix is to pull out the battery for
       | a few seconds and then plug it back in, which magically fixes the
       | bluescreen.
        
       | jcalvinowens wrote:
       | OP, what Linux is this? I'm really curious, I don't recognize
       | that trace format and I can't find the code to print exception
       | traces with the eight bangs on the first line like that anywhere
       | in the upstream git history. I think they're actually from the
       | BIOS?
       | 
       |     !!!! X64 Exception Type - 12(#MC - Machine-Check)  CPU Apic ID - 00000000 !!!!
       | 
       | My story: I had an Intel NUC running Linux back in the day, which
       | would get stuck in standby such that I had to remove and replace
       | the CMOS battery to get it to boot again! I never figured that
       | one out...
        
         | pzmarzly wrote:
         | This is a trace from the BIOS, it is not uncommon to have them
         | printed over the serial console. Potentially the BIOS is based
         | on EDK2 source code, in which case you can take a look here for
         | the implementation of the trace printing logic:
         | https://github.com/tianocore/edk2/blob/9e6537469d4700d9d793e...
        
       | neuroelectron wrote:
       | I've seen similar behavior when trying out a fork bomb in the
       | terminal on both Linux and Windows. My guess is that on Windows
       | the fork bomb made it into the virtual memory and was recorded to
       | disk and wasn't cleaned out completely during boot.
       | 
       | It took 3 reboots to clear up the errors. Generally on the Linux
       | system one extra reboot was necessary about half of the time.
        
       | geocrasher wrote:
       | Whenever I power cycle something that doesn't go right the first
       | time, I leave it off for at least 30 seconds so all the caps can
       | discharge and any saved state can reset. Especially true of
       | routers etc.
        
         | ijustlovemath wrote:
         | You can further be sure of this by pressing the On button while
         | the power supply is disconnected. Ofc make sure it's always off
         | when you connect or disconnect the power supply.
        
           | klysm wrote:
           | Depends on how the on button is implemented, and the power
           | management of the system. On older devices I would expect
           | this to be more reliable.
        
           | geocrasher wrote:
           | Indeed, this used to be my "secret trick" for laptops that
           | wouldn't power on: Disconnect the battery and power supply,
           | hold the power button for 30 seconds, then power it back up.
           | Worked every time.
        
       | bell-cot wrote:
       | I think it was the 1970's when I first heard of the "remove
       | power, wait a good while, try again" strategy.
       | 
       | The subject was a cheap little black & white TV set that my folks
       | had. Dad was an amateur radio operator, who mostly built his own
       | equipment. He could have disassembled it, traced circuits, and
       | calculated the wait time if he'd cared to.
        
       | markhahn wrote:
       | I usually prefer the 'reset' option (such as in IPMI). After all,
       | this is the as-designed way to politely ask all devices to
       | re-initialize.
       | 
       | Yes, power-cycling is more unambiguous, but afaict, the example
       | here is purely that power cycling really needs a noticeable
       | off-period so that all devices can fully come down. Otherwise,
       | there's no real standard on what should happen - this or that
       | component might stay up or retain state.
       | 
       | The other reason I like 'reset' is that lots of devices (fans,
       | disks, probably all power systems - definitely including PSUs)
       | have lifetime limits in power cycles. Mostly this is minor,
       | unless you do something like reboot cluster nodes after a job
       | (conceivably a paranoid security requirement), or some automation
       | gets in a loop and continually zaps a server.
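
       A hedged sketch of the two approaches, assuming `ipmitool` is
       the client in use: `chassis power reset` is the reset option,
       while a cycle with a deliberate off-period can be built from
       `chassis power off`, a wait, and `chassis power on`. The function
       names and the 30-second default (borrowed from geocrasher's
       comment above) are illustrative only:

```python
import subprocess
import time
from typing import Callable, List, Sequence

IPMITOOL = ["ipmitool", "chassis", "power"]


def warm_reset_command() -> List[str]:
    """Command for a reset that does not remove main power."""
    return IPMITOOL + ["reset"]


def cold_cycle(
    off_seconds: float = 30.0,
    run: Callable[[Sequence[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Power off, stay off for `off_seconds`, then power back on."""
    run(IPMITOOL + ["off"])
    # Main power is off here, but standby power and the BMC stay up,
    # so this is still not the same as pulling the power cords.
    sleep(off_seconds)
    run(IPMITOOL + ["on"])
```

       The `run` and `sleep` parameters exist so the sequence can be
       dry-run or tested without touching real hardware.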
        
       | NBJack wrote:
       | I have had several laptops over the years like this. Full
       | shutdown and power on does not reset some problems, like missing
       | audio, missing wifi, etc. For Lenovo devices, I have to go as far
       | as using the 'recovery' button. This goes for DP Alt Mode as
       | well. Kinda annoying, but at least there's a solution.
        
       | petemc_ wrote:
       | When managing large numbers of Dell rack mounted servers, a flea
       | power drain is something you become very familiar with.
        
       | dxdxdt wrote:
       | I don't get it. That post was a whole bag of nothing. Why are you
       | guys upvoting it?
        
         | magicalhippo wrote:
         | I've had it happen to me, so not a whole bag of nothing, and
         | might be surprising to some.
         | 
         | Also, a topic which can spur some interesting comments.
        
       | GlenTheMachine wrote:
       | In grad school we built a lot of logic boards from scratch. They
       | were used for submersible robots, and we had a 350,000 gallon
       | water tank that we kept heated to 88 degrees. This was three
       | stories above the ground in a metal building. You can't really
       | air condition that, so in the summer it got quite hot.
       | 
       | It was not uncommon to return from lunch to find that an embedded
       | computer board that had been working when you left wasn't any
       | more. One way to debug them was to put them in the refrigerator
       | for a while. If they then worked, you knew you had a bad solder
       | joint or an IC that was on the verge of failing.
        
         | nielsbot wrote:
         | Wow. That's 3M lbs of water. (1.34M kg)
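
         The arithmetic can be checked with a short sketch, assuming
         US gallons and a water density of about 0.995 kg/L near
         88 degrees F (both assumptions, not stated in the thread). It
         lands at roughly 1.3M kg, or a little under 3M lb, in line
         with the rounded figures above:

```python
US_GALLON_L = 3.785411784   # litres per US gallon (exact by definition)
KG_PER_LB = 0.45359237      # kilograms per pound (exact by definition)
WATER_KG_PER_L = 0.995      # assumed density of water near 88 F (~31 C)


def tank_mass(gallons: float) -> tuple:
    """Return (kilograms, pounds) of water for a volume in US gallons."""
    kg = gallons * US_GALLON_L * WATER_KG_PER_L
    return kg, kg / KG_PER_LB


kg, lb = tank_mass(350_000)
print(f"{kg / 1e6:.2f}M kg, {lb / 1e6:.2f}M lb")
```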
        
       | kccqzy wrote:
       | I've experienced a similar problem with a Thunderbolt port on a
       | machine. Nothing that plugs into the machine would be recognized.
       | Not even a simple USB device. Power cycling multiple times didn't
       | fix it. But powering off and leaving the machine off for a few
       | minutes fixed it.
       | 
       | Given the problem occurred only once, I didn't do any more
       | investigation on why.
        
         | zoky wrote:
         | Bad electrons. Turning off the power lets them drain out.
        
       | snakeyjake wrote:
       | Are these Dells?
       | 
       | Some Dells have a "feature" where something, somewhere, in their
       | mess of a UEFI/iDRAC stack will get corrupted and will stay wrong
       | through power cycles until you physically unplug the servers from
       | power and hold down the power button to discharge a capacitor and
       | clear out the NVRAM where the corrupted value is.
       | 
       | Most recently this impacted a PowerEdge R7525 server we have
       | where the iDRAC was enforcing a power cap of ~300 watts leaving
       | the system to be less than 1/10th as performant as it should have
       | been. Manually setting a new power cap did nothing except update
       | the values displayed in the UI. Multiple six minute (because of
       | their mess of a UEFI/iDRAC stack) reboots of both the server and
       | the iDRAC did nothing.
       | 
       | Dell was less than useful except for the fact that they hosted
       | the answer. After raging against their CSA script/LLM auto-reply
       | bullshit for days, an aggrieved user with the same issue looking
       | for help in their forums finally posted that he did the cap drain
       | trick and it worked.
       | 
       | Saved me tons of wasted time. Thanks, anonymous fellow frustrated
       | Dell customer!
        
       | wibbily wrote:
       | Something like this happened to me once. Lost power in a
       | lightning storm and when it came back my computer could no longer
       | shut off.
       | 
       | Like, at all. Would just hang when you tried. Couldn't exit from
       | BIOS after changing settings, couldn't suspend to RAM. Had to
       | yoink the cord whenever I needed to restart. Wild stuff.
       | 
       | Perhaps like Frankenstein the lightning was a breath of life, and
       | with its new sentience my PC was trying to preserve its
       | existence. At any rate I reflashed the BIOS after a few months
       | and it never happened again.
        
       ___________________________________________________________________
       (page generated 2024-12-25 23:00 UTC)