[HN Gopher] Debugging Hetzner: Uncovering failures with powersta...
       ___________________________________________________________________
        
       Debugging Hetzner: Uncovering failures with powerstat, sensors, and
       dmidecode
        
       Author : ngalstyan4
       Score  : 265 points
       Date   : 2025-02-19 12:40 UTC (10 hours ago)
        
 (HTM) web link (www.ubicloud.com)
 (TXT) w3m dump (www.ubicloud.com)
        
       | V__ wrote:
       | > Looking back, waiting six months could have helped us avoid
       | many issues. Early adopters usually find problems that get fixed
       | later.
       | 
       | This is really good advice and what I'm following for all systems
       | which need to be stable. If there aren't any security issues, I
       | either wait a few months or keep one or two versions behind.
        
         | pwmtr wrote:
         | Author of the blog post here.
         | 
         | Yeah, this is generally a good practice. The silver lining is
         | that our suffering helped uncover the underlying issue faster.
         | :)
         | 
          | This isn't part of the blog post, but we also considered, for
          | the future, getting the servers and keeping them idle, without
          | actual customer workload, for about a month. This would be
          | more expensive, but it could help identify potential issues
          | without impacting our users. In our case, the crashes started
          | three weeks after we deployed our first AX162 server, so we'd
          | need at least a month (or maybe even longer) as a buffer
          | period.
        
           | ThePowerOfFuet wrote:
           | >The silver lining is that our suffering helped uncover the
           | underlying issue faster.
           | 
           | Did you actually uncover the true root cause? Or did they
           | finally uncap the power consumption without telling you, just
           | as they neither confirmed nor denied having limited it?
        
             | pwmtr wrote:
             | The root cause was a problem with the motherboard, though
             | the exact issue remains unknown to us. I suspect that a
             | component on the motherboard may have been vulnerable to
             | power limitations or fluctuations and that the newer-
             | generation motherboards included additional protection
             | against this. However, this is purely my speculation.
             | 
             | I don't believe they simply lifted a power cap (if there
             | was one in the first place). I genuinely think the fix came
             | after the motherboard replacements. We had 2 batches of
             | motherboard replacements and after that, the issue
             | disappeared.
             | 
             | If someone from Hetzner is here, maybe they can give extra
             | information.
        
             | oz3d wrote:
              | Hetzner is currently replacing motherboards in their
              | dedicated servers [1], but I don't know if that's the same
              | issue that was mentioned in the article.
             | 
             | [1] https://status.hetzner.com/incident/7fae9cca-b38c-4154-
             | 8a27-...
        
               | ubanholzer wrote:
                | That's the same issue, yes.
        
           | axus wrote:
           | Customers are the best QA. And they pay you too, instead of
           | the reverse!
        
             | rat9988 wrote:
             | I'm pretty sure they pay for QA. QA cannot always catch
             | every possible bug.
        
               | knowitnone wrote:
               | these crashes should have been caught easily
        
         | InDubioProRubio wrote:
          | This is a wildly successful pattern in nature: the old using
          | the young and inexperienced as enthusiastic test units.
          | 
          | In the wild, for example in forests, old boars give safety
          | squeaks to send the younglings ahead into a clearing they do
          | not trust. The equivalent here would be to write a tech-blog
          | entry that hypes up a technology that is not yet production
          | ready.
        
           | Tzela wrote:
            | Just out of curiosity: do you have a source?
        
         | esafak wrote:
         | GitHub is looking to add this feature to dependabot:
         | https://github.com/dependabot/dependabot-core/issues/3651
        
           | h1fra wrote:
            | In theory that works; in practice, nope. You get a random
            | update with a possible bug inside that is only fixed by a
            | new version that you won't get until later. The other
            | strategy is to wait for a package to be fully stable (no
            | updates), but then some packages that receive daily/weekly
            | updates are never updated.
        
             | esafak wrote:
             | It does help, because major version updates are more likely
             | to cause breakage than minor ones, so you benefit if you
             | wait for a few minor version updates. That is not to say
             | minor versions can't introduce bugs.
             | 
             | Windows is a well-known example; people used to wait for a
             | service pack or two before upgrading.
        
               | ajmurmann wrote:
                | We could even wait for a patch version, or for the minor
                | to have been out a certain amount of time. For a major,
                | I'd wait even longer and potentially for a second patch.
        
               | Cthulhu_ wrote:
               | And then they went towards a more evergreen update
               | strategy, causing some major outages when some releases
               | caused issues.
               | 
               | I mean evergreen releases make sense imo, as the overhead
               | of maintaining older versions for a long time is huge,
               | but you need to have canary releases, monitoring, and
               | gradual rollout plans; for something like Windows, this
               | should be done with a lot of care. Even a 1% release rate
               | will affect hundreds of thousands if not millions of
               | systems.
        
           | TZubiri wrote:
           | Being so deep into dependencies that you have to find more
           | dependencies and features to make your dependency less of a
           | clusterfuck is sad.
        
         | fdr wrote:
         | It varies by system. As the legendary (to some) Kelly Johnson
         | of the Skunk Works had as one of his main rules:
         | 
         | > The inspection system as currently used by the Skunk Works,
         | which has been approved by both the Air Force and the Navy,
         | meets the intent of existing military requirements and should
         | be used on new projects. Push more basic inspection
         | responsibility back to the subcontractors and vendors. Don't
         | duplicate so much inspection.
         | 
          | But this will be the first and last time Ubicloud does not
          | burn in a new model, or even new tranches of purchases (I also
          | work there... and am a founder).
        
       | vitus wrote:
       | > To increase the number of machines under power constraints,
       | data center operators usually cap power use per machine. However,
       | this can cause motherboards to degrade more quickly.
       | 
       | Can anyone elaborate on this point? This is counter to my
       | intuition (and in fact, what I saw upon a cursory search), which
       | is that power capping should prolong the useful lifetime of
       | various components.
       | 
       | The only search results I found that claimed otherwise were
       | indicating that if you're running into thermal throttling, then
       | higher operating temperatures can cause components (e.g.
       | capacitors) to degrade faster. But that's expressly not the case
       | in the article, which looked at various temperature sensors.
        
         | tecleandor wrote:
          | Yep, that's weird. I've always read that high power/temp can
          | degrade electronics way faster. Can any EE shed some light
          | here?
        
           | avian wrote:
           | As an electronics engineer I have no idea what the author is
           | talking about here and was about to post the same question.
        
         | OptionOfT wrote:
          | The only place I could find an answer that sheds some light
          | was Electronics Stack Exchange:
          | 
          | https://electronics.stackexchange.com/a/65827
          | 
          | > A mosfet needs a certain voltage at its gate to turn fully
          | on. 8V is a typical value. A simple driver circuit could get
          | this voltage directly from the power that also feeds the
          | motor. When this voltage is too low to turn the mosfet fully
          | on, a dangerous situation (from the point of view of the
          | mosfet) can arise: when it is half-on, both the current
          | through it and the voltage across it can be substantial,
          | resulting in a dissipation that can kill it. Death by
          | undervoltage.
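          | 
          | A rough, purely illustrative calculation: fully on, a MOSFET
          | with R_ds(on) = 5 mOhm carrying 20 A dissipates P = I^2 * R =
          | 400 * 0.005 = 2 W. Stuck half-on with, say, 6 V across it at
          | the same 20 A, it dissipates P = V * I = 120 W, far more than
          | its package can shed, so it cooks itself.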
        
         | pwmtr wrote:
          | At the time of our investigation, we found a few articles
          | suggesting that power caps could potentially cause hardware
          | degradation, though I don't have the exact sources at hand. I
          | see the child comment shared one example, and after some
          | searching, I found a few more sources [1], [2].
         | 
         | That said, I'm not an electronics engineer, so my understanding
         | might not be entirely accurate. It's possible that the
         | degradation was caused by power fluctuations rather than the
         | power cap itself, or perhaps another factor was at play.
         | 
         | [1] https://electronics.stackexchange.com/questions/65837/can-
         | el... [2] https://superuser.com/questions/1202062/what-happens-
         | when-ha...
        
           | immibis wrote:
           | The power used by a computer isn't limited by giving it less
           | voltage/current than it should have - if it was, the CPU
           | would crash almost immediately. It's done by reducing the
           | CPU's clock rate until the power it naturally consumes is
           | less than the power limit.
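            | 
            | A quick way to see this from Linux, assuming the cpufreq
            | sysfs interface is available:
            | 
            |     # current vs. maximum frequency, in kHz
            |     cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
            |     cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
            | 
            | If the current value sits well below the maximum while the
            | machine is under full load, something (governor, firmware,
            | or the BMC) is pulling the clock down.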
        
         | nickcw wrote:
         | Power = volts * amps
         | 
         | Volts is as supplied by the utility company.
         | 
         | Amps are monitored per rack and the usual data centre response
         | to going over an amp limit is that a fuse blows or the data
         | centre asks you for more money!
         | 
         | The only way you can decrease power used by a server is by
         | throttling the CPUs.
         | 
         | The normal way of throttling CPUs is via the OS which requires
         | cooperation.
         | 
          | I speculate this is possible via the lights-out baseboard
          | management controller (which doesn't need the OS to be
          | involved), but I'm pretty sure you'd see that in /sys if it
          | were.
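          | 
          | If you can reach the BMC and it implements DCMI power
          | management (not a given on every board, and rented boxes often
          | don't expose IPMI at all), ipmitool can show whether a
          | platform power limit is configured. A sketch:
          | 
          |     # instantaneous power draw as reported by the BMC
          |     ipmitool dcmi power reading
          |     # any configured platform power limit
          |     ipmitool dcmi power get_limit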
        
         | cibyr wrote:
         | One possibility is that at lower power settings, the CPUs don't
         | get as hot, which means the fans don't spin up as much, which
         | can mean that other components also get less airflow and then
         | get hotter than they would otherwise. The fix for this is
         | usually to monitor the temperature of those other components
         | and include that as an input to the fan speed algorithm. No
         | idea if that's what's actually going on here though.
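          | 
          | A quick way to sanity-check that theory on a live box,
          | assuming lm-sensors is installed and the BMC is reachable:
          | 
          |     # fan RPM and board temperatures as the OS sees them
          |     sensors
          |     # the same readings from the BMC's sensor repository
          |     ipmitool sdr type Fan
          |     ipmitool sdr type Temperature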
        
         | redleader55 wrote:
         | Every rack in a data center has a power budget, which is
         | actually constrained by how much heat the HVAC system can pull
         | out of the DC, rather than how much power is available.
         | Nevertheless it is limited per rack to ensure a few high power
         | servers don't bring down a larger portion of the DC.
         | 
          | I don't know for sure how the limiting is done, but a simple
          | circuit breaker like the ones we have in our houses would be
          | one solution for it. That causes the rack to lose power when
          | the breaker trips, which is not ideal because you lose the
          | whole rack and affect multiple customers.
         | 
          | Another option would be a current/power limiter [0], which
          | would cause more problems because P = U * I. That would make
          | the voltage (U) drop and leave the whole system undervolted;
          | weird glitches happen there, and it's a common way to bypass
          | various security measures in chips. For example, Raspberry Pi
          | ran this challenge [1] to look for this kind of bug and test
          | how well their chips can handle attacks, including voltage
          | attacks.
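          | 
          | To put rough, purely illustrative numbers on the budget: a
          | single 230 V / 16 A feed gives about 3.7 kVA per rack, so at
          | an average draw of 400 W per server you can fit roughly nine
          | servers (3680 / 400 ~= 9) before the breaker trips, regardless
          | of how many would physically fit in the rack.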
         | 
         | [0] - https://en.m.wikipedia.org/wiki/Current_limiting [1] -
         | https://www.raspberrypi.com/news/security-through-transparen...
        
       | chronid wrote:
        | We will never know, but I wonder if it could be a
        | power/signaling or VRM issue: the CPU not getting hot doesn't
        | mean nothing else on the board has gone out of spec and into
        | catastrophic failure.
        | 
        | Motherboard issues around power/signaling are a pain to
        | diagnose. They emerge as all sorts of problems apparently
        | related to other components (RAM failing to initialize and
        | random restarts are very common in my experience), and you end
        | up swapping everything before actually replacing the MB...
        
       | jonatron wrote:
       | At a previous company, devops would regularly find CPU fan
       | failures on Hetzner. That's in addition to the usual expected
       | HD/SSD failures. You've got to do your own monitoring, it's one
       | of the reasons why unmanaged servers are cheaper than cloud
       | instances.
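        | 
        | The basics go a long way here. A sketch, assuming lm-sensors and
        | smartmontools are installed (adjust device names to your box):
        | 
        |     # fan RPM, temperatures, voltages
        |     sensors
        |     # overall health verdict per drive
        |     smartctl -H /dev/sda
        |     smartctl -H /dev/nvme0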
        
         | jeffbee wrote:
          | I regularly find broken thermal solutions in Azure, and when I
          | worked at Google it was also a low-level but constant
          | irritant.
         | When I joined Dropbox I said to my team on my first day that I
         | could find a machine in their fleet running at 400MHz, and I
         | was right: a bogus redundant PSU controller was asserting
         | PROCHOT. These things happen whenever you have a lot of
         | machines.
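          | 
          | For spotting this across a fleet, something like turbostat
          | makes it cheap to check (a sketch; needs root, and column
          | names can vary a little between versions):
          | 
          |     # effective clock, utilization, and package power
          |     sudo turbostat --quiet --interval 5 \
          |         --show Busy%,Bzy_MHz,PkgWatt
          | 
          | A box whose Bzy_MHz is pinned near 400 while Busy% is high is
          | being throttled by something other than its workload.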
        
           | tryauuum wrote:
           | in my (limited) experience this only happened with GIGABYTE
           | servers
           | 
           | very weird behavior, I'd prefer my servers to crash instead
           | of lowering frequency to 400MHz.
        
             | dijit wrote:
              | I've seen it on nearly every brand. I have some Lenovo
              | servers in the basement that also down-clock if both PSUs
              | aren't installed.
              | 
              | I have alerts on PSUs and frequency for this reason.
              | 
              | The servers are so cheap that overcommitting them by
              | double is still significantly cheaper than using cloud
              | hosting, which tends to have the same issue, only
              | monitoring it is harder. Though most people using cloud
              | seem to be happy not to know, and it's been a known thing
              | that there's a 5x variation between instances of the same
              | size on AWS:
              | https://www.brendangregg.com/Slides/AWSreInvent2017_performa...
        
             | jeffbee wrote:
             | > I'd prefer my servers to crash instead of lowering
             | frequency to 400MHz.
             | 
              | 100% agreed. There is _nothing_ worse than a slow server
              | in your fleet. This behavior reeks of "pet" thinking.
        
             | formerly_proven wrote:
              | Stuff like this just comes up from time to time as soon as
              | you run a four-digit-or-higher number of systems.
        
           | radicality wrote:
           | The term PROCHOT just brought me back to vivid memories of
           | debugging exactly that at Facebook a while ago.
           | 
            | It was very non-obvious to debug since pretty much all the
            | emitted metrics, apart from mysterious errors/timeouts to
            | our service, looked reasonable. Even the CPU usage and CPU
            | temperature graphs looked normal, since it was a bogus
            | PROCHOT and not actual thermal throttling.
        
             | porridgeraisin wrote:
              | And it brought me back to memories of debugging that on my
              | friend's laptop.
              | 
              | It kept going to 400 MHz. I suspected throttling, and we
              | got it cleaned, had the thermal paste replaced, and all
              | that.
              | 
              | It still throttled. We replaced Windows with Linux since
              | that was at least a bit more usable.
              | 
              | At the time I didn't know about PROCHOT, and my googling
              | skills clearly weren't sufficient.
              | 
              | One fine day during lunch at a place on campus, having
              | recently read about BD_PROCHOT, I wrote a script with
              | msrprobe or whatever it was and disabled it. "Extended"
              | the lifespan of the thing.
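              | 
              | For anyone hitting the same thing: on many Intel CPUs, bit
              | 0 of MSR 0x1FC (MSR_POWER_CTL) enables the external
              | BD_PROCHOT input, and clearing it makes the CPU ignore
              | that signal. A sketch using msr-tools (check your CPU's
              | documentation first; this also removes a safety net):
              | 
              |     sudo modprobe msr
              |     val=0x$(sudo rdmsr 0x1FC)        # MSR_POWER_CTL
              |     sudo wrmsr 0x1FC $(( val & ~1 )) # clear bit 0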
        
           | bityard wrote:
           | A laptop that I had would assert PROCHOT if it didn't like
           | the power supply you plugged into it. It actually took an
           | embarrassing amount of time for me to notice that this is
           | what was causing Slack to be inexplicably slower at my desk
           | than when I was out working in a common area in the building.
        
         | TZubiri wrote:
          | I'm heavily against both relying on free dependencies and
          | going for the cheapest option.
          | 
          | If you can't put yourself in the other side's shoes for a
          | second when evaluating a purchase, and you just braindead try
          | to make costs go lower and income go higher, you're NGMI
          | except in shady sales businesses.
          | 
          | Server hardware is incredibly cheap; if you are a somewhat
          | competent programmer you can handle most programs on a single
          | server or even a virtual machine. Just give them a little bit
          | of margin and pay $50/mo instead of $25/mo. It's not even
          | enough to guarantee they won't go broke or to make you a
          | valuable customer; you'll still be banking on whales to make
          | the whole thing profitable.
         | 
         | Also, if your business is in the US, find a US host ffs.
        
         | KennyBlanken wrote:
         | No? Maybe you cloud kids don't know how this stuff works, but
         | unmanaged just means you get silicon-level access and remote
         | KVM.
         | 
         | It's still the hosting company's responsibility to competently
         | own, maintain, and repair the physical hardware. That includes
          | monitoring. In the old days you had to run a script or install
          | a package to hook into their monitoring... but with IPMI et
          | al. being standard, they don't need anything from you to do
          | their job.
         | 
          | The _only_ time a hosting company should be hands-off is when
          | they're just providing rack space, power, and data. Anything
          | beyond that is between you and them in a contract/agreement.
         | 
         | Every time I hear Hetzner come up in the last few years it's
         | been a story about them being incompetent. If they're not
         | detecting things like CPU fan failures of their own hardware
         | _and_ they deployed new systems without properly testing them
          | first, then that's just further evidence they're still
         | slipping.
        
       | scottcha wrote:
        | I'd like to see what CPU governor is running on those systems
        | before assuming a power cap is in place. Lots of default
        | installs of Linux ship with the powersave governor running,
        | which is going to limit your max frequencies and, through that,
        | the max power you can hit.
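        | 
        | Easy to check with cpupower (from the linux-tools / cpupower
        | package, assuming it's installed):
        | 
        |     # shows the active governor and allowed frequency range
        |     cpupower frequency-info
        |     # switch to the performance governor if needed
        |     sudo cpupower frequency-set -g performance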
        
         | __m wrote:
         | schedutil on mine scheduled for mainboard replacement
        
       | wink wrote:
       | > One of the providers we like is Hetzner because of their
       | affordable and reliable servers.
       | 
       | > In the days that followed, the crash frequency increased.
       | 
       | I don't find the article conclusive whether they would still call
       | them reliable.
        
         | aduffy wrote:
         | To their credit they actually fixed the problem. Good luck
         | getting this level of support from any of the big 3 public
         | cloud providers.
        
           | frenchtoast8 wrote:
           | For example, AWS's Mac machines frequently run into hardware
           | failures. My current job runs a measly 5 mac1.metal hosts for
           | internal testing, and we experience hardware failures on
           | these machines a few times a year. Doesn't sound like a lot,
           | but these machines are almost always completely idle, and we
           | almost never get host failures for Linux hosts. To make
           | matters worse, sometimes a brand new instance needs
           | replacement before it even comes up for the first time, which
           | is annoying because you are billed a minimum of 24 hours for
           | these instances. People have been complaining about this for
           | years and seemingly nothing is being done about it.
           | 
           | https://www.reddit.com/r/aws/comments/131v8md/beware_of_brok.
           | ..
        
         | cbozeman wrote:
         | Hetzner's reliable... until they aren't.
         | 
          | Since they don't do any sort of monitoring on their bare metal
          | servers at all, at least insofar as I can tell having been a
          | customer of theirs for ten years, you don't know there's a
          | problem until there's a problem, unless you've got your own
          | monitoring solution in place.
        
       | vednig wrote:
        | As a CI/CD provider, wouldn't it benefit Ubicloud to have their
        | own servers?
        
         | eitland wrote:
         | They are in the early stages.
         | 
         | I think the website said they recently raised 16 million euros
         | (or dollars).
         | 
          | Making investments into data centers and hardware could burn
          | through that really quickly, in addition to needing more
          | engineers.
         | 
         | By using rented servers (and only renting them when a customer
         | signs up) they avoid this problem.
        
           | vednig wrote:
            | Understood. I would love to hear about it from the founders
            | though, and what went into their decision.
        
             | fdr wrote:
             | GP is more or less correct.
             | 
             | Building and owning an institution that finances, racks,
             | services, networks, and disposes of servers, both takes
             | time and increases the commitment level. Hetzner is month
             | to month, with a fixed overhead for fresh leasing of
             | servers: the set-up fee.
             | 
             | This is a lot to administer when also building a software
             | institution, and a business. It was not certain at the
             | outset, for example, that the GitHub Actions Runner product
             | would be as popular as it became. In its earliest form, it
             | was partially an engineering test for our virtual machines,
             | and we went around asking friendly contacts that we knew
             | would report abnormalities to use it. There's another
             | universe where it only went as far as an engineering test,
             | and our utilization and revenue pattern (that is, utility
             | to other people) is different.
        
       | rikafurude21 wrote:
        | A similar thing happened to an AX102 I currently use: something
        | related to the network card caused crashes. Thankfully, Hetzner
        | support was helpful with replacement hardware. It caused quite
        | some grief, but at least it was a good lesson in hardware
        | troubleshooting. Worth it to me personally.
        
         | yread wrote:
          | Yep, same here. The AX102 crashes with almost no load, nothing
          | in the logs, and won't come back on. Hetzner looked at it
          | multiple times and either found nothing or replaced the CPU
          | paste or a PSU connector. I migrated to an AX162 and so far so
          | good.
        
       | andai wrote:
       | > Hetzner didn't confirm or deny the possibility of power
       | limiting
       | 
       | What are the consequences of power limiting? The article says it
       | can cause hardware to degrade more quickly, why?
       | 
       | Hetzner's lack of response here (and UbiCloud's measurements)
       | seems to suggest they are indeed limiting power, since if they
       | weren't doing it, they'd say so, right?
        
         | radicality wrote:
          | Related and perhaps useful: I've seen this in multiple cloud
          | offerings already, where the CPU scaling governor is set to
          | some eco-friendly value, to the benefit of the cloud provider
          | and zero benefit to you, with much reduced peak CPU perf.
          | 
          | To check, run
          | `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`.
          | It should be `performance`.
          | 
          | If it's not, set it with `echo performance | sudo tee
          | /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. If
          | your workload is CPU hungry this will help. It will revert on
          | reboot, so you can make it stick with some cron/systemd unit
          | or whichever.
         | 
         | Of course if you are the one paying for power or it's your own
         | hardware, make your own judgement for the scaling governor. But
         | if it's a rented bare metal server, you do want `performance`.
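          | 
          | One way to make it stick across reboots, as a sketch (assumes
          | cron and cpupower are available; the file name is just an
          | example):
          | 
          |     # runs once at boot, as root, via /etc/cron.d
          |     echo '@reboot root cpupower frequency-set -g performance' \
          |         | sudo tee /etc/cron.d/cpu-governor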
        
           | chpatrick wrote:
           | Is there any downside to ondemand? If your servers aren't
           | running at 100% then there's no point wasting watts, even if
           | you aren't paying for them, right?
        
           | Tijdreiziger wrote:
           | However, eco-friendly power modes can reduce electricity
           | usage, so they can be friendlier for our climate.
           | 
           | https://www.rvo.nl/onderwerpen/energie-besparen-de-
           | industrie...
        
             | rat9988 wrote:
             | I'm not sure why you are downvoted. Is it wrong?
        
             | kjellsbells wrote:
             | Yes, but the point is that the customer has the agency to
             | decide.
             | 
             | If I rent a server I want to be able to run it to the
             | maximum capacity, since I'm paying for all of it. It's
             | dishonest to make me pay for X and give me < X. Idle CPU is
             | wasted money.
             | 
             | The flip side is that the provider should be also offering
             | more climate friendly, lower power options. I'll still want
             | to run them to the max, but the total energy consumed would
             | be less than before.
             | 
             | Also not forgetting that code efficiency matters if we want
             | to get the max results for the minimum carbon spend.
             | Another reason why giant web frameworks and bloated OSes
             | depress me a little.
        
       | nik736 wrote:
       | Most other AX models (AX42, AX52 and AX102) also have serious
       | reliability issues, where they will fail after some months. They
        | are based on a faulty motherboard. Hetzner has to replace most,
        | if not all, motherboards for servers built before a certain date
        | over the next 12 months [0].
       | 
       | [0] https://docs.hetzner.com/robot/dedicated-server/general-
       | info...
        
       | gtirloni wrote:
       | Anyone got experience with Ubicloud's OpenStack stack?
        
         | fdr wrote:
         | Ubicloud does not have an OpenStack dependency.
        
           | gtirloni wrote:
           | Thanks, I was under the impression it did but re-reading the
           | posts I see it's not the case.
        
       | jauntywundrkind wrote:
       | > _To increase the number of machines under power constraints,
       | data center operators usually cap power use per machine. However,
       | this can cause motherboards to degrade more quickly._
       | 
       | This was something I hadn't heard before, & a surprise to me.
        
       | dangoodmanUT wrote:
        | Is there a provider that's like bare metal, but would detect
        | these kinds of things mostly automatically? E.g. faulty or
        | constantly crashing hardware.
        
         | greggyb wrote:
         | Managed servers: https://www.hetzner.com/managed-server/
         | 
         | There are also others, but Hetzner is under discussion here.
        
           | Tijdreiziger wrote:
           | Managed servers are quite a different product, closer to
           | 'old-school' shared webhosting.
           | 
           | You don't get root access, but you do get a preinstalled LAMP
           | stack and a web UI for management.
        
       | urbandw311er wrote:
       | Would anybody with data center experience be able to hazard a
       | guess on what type of commercial resolution Hetzner would have
       | reached with the Motherboard supplier here? Would we assume all
       | mobos replaced free of charge plus compensation?
        
       | bayindirh wrote:
        | Dell has this problem sometimes. I remember getting the first
        | batch of one of their older server models when they were new.
        | We had to replace the motherboards' I/O (rear) section because
        | the servers would lose some devices on that part (e.g. Ethernet
        | controllers, iDRAC, sometimes the BIOS) for a while. After
        | shaking out these problems, they ran for almost a decade.
        | 
        | We recently retired them because we had worn down everything on
        | these servers, from RAID cards to power regulators. Rebooting a
        | perfectly running server due to a configuration change and
        | losing the RAID card forever because electromigration eroded a
        | trace inside the RAID processor is a sobering experience.
        
         | merb wrote:
          | Dell has tons of issues. A faulty mini-board for the front LED
          | can actually stop the server from booting/running at all (even
          | the iDRAC will be dead).
        
       | indulona wrote:
        | I am so glad my sign-up process with Hetzner failed back when I
        | was dumb enough to want to give them a chance, even with the
        | internet full of horrific stories of bad experiences from their
        | customers. Lucky me.
        
         | cbozeman wrote:
         | Hetzner is fine for what it is, you just need to know that it's
         | all on _you_ and only _YOU_.
         | 
         |  _YOU_ do the monitoring.
         | 
         |  _YOU_ do the troubleshooting.
         | 
         |  _YOU_ etc., etc.
         | 
         | If that doesn't appeal to you, or if you don't have the
         | requisite knowledge, which I admit is fairly broad and
         | encompassing, then it's not for you. For those of you that meet
         | those checkboxes, they're a pretty amazing deal.
         | 
         | Where else could I get a 4c/8t CPU with 32 GB of RAM and four
         | (4) 6TB disks for $38 a month? I really don't know of many
          | places with that much hardware for that little cost. And yes,
          | it's an Intel i7-3770, but I don't care. It's still a hell of
          | a lot of hardware for not much money.
        
           | nobankai wrote:
           | You should have asked to delete your account after what you
           | said about Jordan Neely:
           | https://news.ycombinator.com/item?id=42969922
           | 
           | I will never let you live down this disgusting comment and
           | hope it haunts your character until your last breath. You
           | better hope I don't repost it under the next comment you make
           | demanding moral authority from your audience.
        
       ___________________________________________________________________
       (page generated 2025-02-19 23:00 UTC)