[HN Gopher] Debugging Hetzner: Uncovering failures with powersta...
___________________________________________________________________
Debugging Hetzner: Uncovering failures with powerstat, sensors, and
dmidecode
Author : ngalstyan4
Score : 265 points
Date : 2025-02-19 12:40 UTC (10 hours ago)
(HTM) web link (www.ubicloud.com)
(TXT) w3m dump (www.ubicloud.com)
| V__ wrote:
| > Looking back, waiting six months could have helped us avoid
| many issues. Early adopters usually find problems that get fixed
| later.
|
| This is really good advice, and it's what I follow for all
| systems that need to be stable. If there aren't any security
| issues, I either wait a few months or stay one or two versions
| behind.
| pwmtr wrote:
| Author of the blog post here.
|
| Yeah, this is generally a good practice. The silver lining is
| that our suffering helped uncover the underlying issue faster.
| :)
|
| This isn't part of the blog post, but in the future we're also
| considering getting the servers and keeping them idle, without
| actual customer workload, for about a month. This would be
| more expensive, but it could help identify potential issues
| without impacting our users. In our case, the crashes started
| three weeks after we deployed our first AX162 server, so we'd
| need at least a month (or maybe even longer) as a buffer
| period.
| ThePowerOfFuet wrote:
| >The silver lining is that our suffering helped uncover the
| underlying issue faster.
|
| Did you actually uncover the true root cause? Or did they
| finally uncap the power consumption without telling you, just
| as they neither confirmed nor denied having limited it?
| pwmtr wrote:
| The root cause was a problem with the motherboard, though
| the exact issue remains unknown to us. I suspect that a
| component on the motherboard may have been vulnerable to
| power limitations or fluctuations and that the newer-
| generation motherboards included additional protection
| against this. However, this is purely my speculation.
|
| I don't believe they simply lifted a power cap (if there
| was one in the first place). I genuinely think the fix came
| after the motherboard replacements. We had two batches of
| motherboard replacements, and after that the issue
| disappeared.
|
| If someone from Hetzner is here, maybe they can give extra
| information.
| oz3d wrote:
| Hetzner is currently replacing motherboards in their
| dedicated servers [1], but I don't know if that's the same
| issue mentioned in the article.
|
| [1] https://status.hetzner.com/incident/7fae9cca-b38c-4154-
| 8a27-...
| ubanholzer wrote:
| That's the same issue, yes.
| axus wrote:
| Customers are the best QA. And they pay you too, instead of
| the reverse!
| rat9988 wrote:
| I'm pretty sure they pay for QA. QA cannot always catch
| every possible bug.
| knowitnone wrote:
| These crashes should have been caught easily.
| InDubioProRubio wrote:
| This is a wildly successful pattern in nature: the old using
| the young and inexperienced as enthusiastic test units.
|
| In the wild, for example in forests, old boars give safety
| squeaks to send the younglings ahead into a clearing they do
| not trust. The equivalent here would be to write a tech blog
| entry that hypes up a technology that is not yet production
| ready.
| Tzela wrote:
| Just out of curiosity: do you have a source?
| esafak wrote:
| GitHub is looking to add this feature to dependabot:
| https://github.com/dependabot/dependabot-core/issues/3651
| h1fra wrote:
| In theory that works; in practice, nope. You get a random
| update with a possible bug inside that is only fixed by a
| newer version you won't get until later. The other strategy is
| to wait for a package to be fully stable (no updates), and in
| that case packages that receive daily/weekly updates never
| get updated.
| esafak wrote:
| It does help, because major version updates are more likely
| to cause breakage than minor ones, so you benefit if you
| wait for a few minor version updates. That is not to say
| minor versions can't introduce bugs.
|
| Windows is a well-known example; people used to wait for a
| service pack or two before upgrading.
| ajmurmann wrote:
| We could even wait for a patch version, or for the minor to
| have been out a certain amount of time. For a major I'd wait
| even longer, potentially for a second patch.
| Cthulhu_ wrote:
| And then they went towards a more evergreen update
| strategy, causing some major outages when some releases
| caused issues.
|
| I mean evergreen releases make sense imo, as the overhead
| of maintaining older versions for a long time is huge,
| but you need to have canary releases, monitoring, and
| gradual rollout plans; for something like Windows, this
| should be done with a lot of care. Even a 1% rollout will
| affect hundreds of thousands, if not millions, of systems.
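|
| A minimal sketch of that gradual-rollout idea (a hypothetical
| script, not anything Windows actually uses): bucket each
| machine by a stable hash of its ID and only update the
| machines that fall below the current rollout percentage.
|
|     #!/bin/sh
|     # Deterministically bucket this host into 0..99 from its
|     # machine ID, then update only if below the rollout gate.
|     ROLLOUT_PCT=1   # start with a 1% canary, raise gradually
|     BUCKET=$(( $(cksum /etc/machine-id | cut -d' ' -f1) % 100 ))
|     if [ "$BUCKET" -lt "$ROLLOUT_PCT" ]; then
|         echo "bucket $BUCKET: applying update"
|         # ...run the actual update here...
|     else
|         echo "bucket $BUCKET: waiting for wider rollout"
|     fi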
| TZubiri wrote:
| Being so deep into dependencies that you have to find more
| dependencies and features to make your dependency less of a
| clusterfuck is sad.
| fdr wrote:
| It varies by system. As the legendary (to some) Kelly Johnson
| of the Skunk Works had as one of his main rules:
|
| > The inspection system as currently used by the Skunk Works,
| which has been approved by both the Air Force and the Navy,
| meets the intent of existing military requirements and should
| be used on new projects. Push more basic inspection
| responsibility back to the subcontractors and vendors. Don't
| duplicate so much inspection.
|
| But this will be the first and last time Ubicloud does not
| burn in a new model, or even tranches of purchases (I also
| work there... and am a founder).
| vitus wrote:
| > To increase the number of machines under power constraints,
| data center operators usually cap power use per machine. However,
| this can cause motherboards to degrade more quickly.
|
| Can anyone elaborate on this point? This is counter to my
| intuition (and in fact, what I saw upon a cursory search), which
| is that power capping should prolong the useful lifetime of
| various components.
|
| The only search results I found that claimed otherwise were
| indicating that if you're running into thermal throttling, then
| higher operating temperatures can cause components (e.g.
| capacitors) to degrade faster. But that's expressly not the case
| in the article, which looked at various temperature sensors.
| tecleandor wrote:
| Yep, that's weird; I've always read that high power/temp can
| degrade electronics way faster. Can any EE shed some light
| here?
| avian wrote:
| As an electronics engineer I have no idea what the author is
| talking about here and was about to post the same question.
| OptionOfT wrote:
| The only place I could find an answer that sheds some light
| was Stack Exchange:
|
| https://electronics.stackexchange.com/a/65827
|
| > A mosfet needs a certain voltage at its gate to turn fully
| on. 8V is a typical value. A simple driver circuit could get
| this voltage directly from the power that also feeds the motor.
| When this voltage is too low to turn the mosfet fully on, a
| dangerous situation (from the point of view of the mosfet) can
| arise: when it is half-on, both the current through it and the
| voltage across it can be substantial, resulting in a
| dissipation that can kill it. Death by undervoltage.
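|
| To put rough numbers on that (illustrative values, not from
| the article): a MOSFET fully on with R_ds(on) = 5 mOhm
| carrying 20 A dissipates P = I^2 * R = 400 * 0.005 = 2 W. The
| same part stuck half-on with 6 V across it at 20 A dissipates
| P = V * I = 120 W, far more than its package can shed.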
| pwmtr wrote:
| At the time of our investigation, we found a few articles
| suggesting that power caps could potentially cause hardware
| degradation, though I don't have the exact sources at hand. I
| see the child comment shared one example, and after some
| searching, I found a few more sources [1], [2].
|
| That said, I'm not an electronics engineer, so my understanding
| might not be entirely accurate. It's possible that the
| degradation was caused by power fluctuations rather than the
| power cap itself, or perhaps another factor was at play.
|
| [1] https://electronics.stackexchange.com/questions/65837/can-
| el... [2] https://superuser.com/questions/1202062/what-happens-
| when-ha...
| immibis wrote:
| The power used by a computer isn't limited by giving it less
| voltage/current than it should have - if it was, the CPU
| would crash almost immediately. It's done by reducing the
| CPU's clock rate until the power it naturally consumes is
| less than the power limit.
| nickcw wrote:
| Power = volts * amps
|
| Volts is as supplied by the utility company.
|
| Amps are monitored per rack and the usual data centre response
| to going over an amp limit is that a fuse blows or the data
| centre asks you for more money!
|
| The only way you can decrease power used by a server is by
| throttling the CPUs.
|
| The normal way of throttling CPUs is via the OS which requires
| cooperation.
|
| I speculate this is possible via the lights-out baseboard
| management controller (which doesn't need the OS to be
| involved), but I'm pretty sure you'd see that in /sys if it
| was.
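|
| A quick sketch of what to look at from inside the box
| (assuming a Linux host; paths vary, and a cap enforced by the
| BMC may not show up here at all):
|
|     # Advertised max frequency vs. the currently allowed max:
|     cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
|     cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
|
|     # RAPL power caps, where the kernel exposes them:
|     grep . /sys/class/powercap/intel-rapl*/constraint_*_power_limit_uw
|
|     # Does the CPU actually reach boost clocks under load?
|     watch -n1 "grep MHz /proc/cpuinfo | sort -t: -k2 -n | tail -3"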
| cibyr wrote:
| One possibility is that at lower power settings, the CPUs don't
| get as hot, which means the fans don't spin up as much, which
| can mean that other components also get less airflow and then
| get hotter than they would otherwise. The fix for this is
| usually to monitor the temperature of those other components
| and include that as an input to the fan speed algorithm. No
| idea if that's what's actually going on here though.
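|
| A toy version of that input-gathering step (hypothetical
| hwmon paths; real fan curves live in BMC firmware, not in a
| shell script):
|
|     #!/bin/sh
|     # Take the hottest reading across all hwmon sensors (VRMs,
|     # NICs, drives...), not just the CPU package.
|     max=0
|     for t in /sys/class/hwmon/hwmon*/temp*_input; do
|         [ -r "$t" ] || continue
|         v=$(cat "$t")                # millidegrees Celsius
|         [ "$v" -gt "$max" ] && max=$v
|     done
|     echo "hottest sensor: $((max / 1000)) C"
|     # A sane controller would feed this into the fan curve.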
| redleader55 wrote:
| Every rack in a data center has a power budget, which is
| actually constrained by how much heat the HVAC system can pull
| out of the DC, rather than how much power is available.
| Nevertheless it is limited per rack to ensure a few high power
| servers don't bring down a larger portion of the DC.
|
| I don't know for sure how the limiting is done, but a simple
| circuit breaker like the ones we have in our houses would be a
| simple solution for it. That causes the rack to lose power
| when the breaker trips, which is not ideal because you lose
| the whole rack and affect multiple customers.
|
| Another option would be a current/power limiter [0], which
| would cause more problems because P = U * I: limiting the
| current would make the voltage (U) drop and leave the whole
| system undervolted. Weird glitches happen there, and it's a
| common way to bypass various security measures in chips. For
| example, Raspberry Pi ran this challenge [1] to look for this
| kind of bug and test how well their chips can handle attacks,
| including voltage attacks.
|
| [0] - https://en.m.wikipedia.org/wiki/Current_limiting [1] -
| https://www.raspberrypi.com/news/security-through-transparen...
| chronid wrote:
| We will never know, but I wonder if it could be a
| power/signaling or VRM issue - the CPU not getting hot doesn't
| mean something else on the board hasn't gone out of spec and
| into catastrophic failure.
|
| Motherboard issues around power/signaling are a pain to
| diagnose: they emerge as all sorts of problems apparently
| related to other components (RAM failing to initialize and
| random restarts are very common in my experience), and you end
| up swapping everything before actually replacing the MB...
| jonatron wrote:
| At a previous company, devops would regularly find CPU fan
| failures on Hetzner. That's in addition to the usual expected
| HD/SSD failures. You've got to do your own monitoring; it's
| one of the reasons why unmanaged servers are cheaper than
| cloud instances.
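|
| A minimal sketch of the kind of checks you end up writing
| (assuming lm-sensors and smartmontools are installed; sensor
| labels vary per board, so treat the patterns as placeholders):
|
|     #!/bin/sh
|     # Alert if any fan reports 0 RPM.
|     sensors | awk '/^fan[0-9]+:/ && $2 == 0 { bad=1 } END { exit bad }' \
|         || echo "FAN ALERT on $(hostname)"
|
|     # Alert on SMART health failures (repeat per drive).
|     smartctl -H /dev/sda | grep -q PASSED \
|         || echo "DISK ALERT on $(hostname)"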
| jeffbee wrote:
| I regularly find broken thermal solutions in azure and when I
| worked at Google it was also a low-level but constant irritant.
| When I joined Dropbox I said to my team on my first day that I
| could find a machine in their fleet running at 400MHz, and I
| was right: a bogus redundant PSU controller was asserting
| PROCHOT. These things happen whenever you have a lot of
| machines.
| tryauuum wrote:
| In my (limited) experience this only happened with GIGABYTE
| servers.
|
| Very weird behavior; I'd prefer my servers to crash instead of
| lowering the frequency to 400MHz.
| dijit wrote:
| I've seen it on nearly every brand; I have some Lenovo
| servers in the basement that also down-clock if both PSUs
| aren't installed.
|
| I have alerts on PSUs and frequency for this reason.
|
| The servers are so cheap that overcommitting them by double
| is still significantly cheaper than using cloud hosting,
| which tends to have the same issue, only monitoring it is
| harder. Though most people using cloud seem happy not to
| know, and it's a known thing that there's a 5x variation
| between instances of the same size on AWS:
| https://www.brendangregg.com/Slides/AWSreInvent2017_performa...
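|
| For the curious, a sketch of those two checks (assuming
| ipmitool can reach the local BMC; sensor names differ per
| vendor):
|
|     # Power supply sensors; eyeball anything not "ok"/"Presence detected":
|     sudo ipmitool sdr type "Power Supply"
|
|     # Alert if any core sits far below its rated clock (here < 1 GHz):
|     awk -F: '/MHz/ && $2+0 < 1000 { bad=1 } END { exit bad }' /proc/cpuinfo \
|         || echo "FREQ ALERT on $(hostname)"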
| jeffbee wrote:
| > I'd prefer my servers to crash instead of lowering
| frequency to 400MHz.
|
| 100% agreed. There is _nothing_ worse than a slow server in
| your fleet. This behavior reeks of "pet" thinking.
| formerly_proven wrote:
| Stuff like this just comes up from time to time once you run
| a four-digit-or-more number of systems.
| radicality wrote:
| The term PROCHOT just brought me back to vivid memories of
| debugging exactly that at Facebook a while ago.
|
| It was very non-obvious to debug, since pretty much all the
| emitted metrics, apart from mysterious errors/timeouts to our
| service, looked reasonable. Even the CPU usage and CPU
| temperature graphs looked normal, since it was a bogus PROCHOT
| and not actual thermal throttling.
| porridgeraisin wrote:
| And it brought me back to memories of debugging that on my
| friend's laptop.
|
| It kept going to 400MHz. I suspected throttling, so we got it
| cleaned, the thermal paste replaced, and all that.
|
| Still throttled. We replaced Windows with Linux, since it was
| at least a bit more usable.
|
| At the time I didn't know about PROCHOT, and my googling
| skills clearly weren't sufficient.
|
| One fine day during lunch at a place on campus, having
| recently read about BD_PROCHOT, I wrote a script to probe the
| MSRs or whatever it was and disabled it. "Extended" the
| lifespan of the thing.
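|
| For reference, the usual trick is a couple of lines with
| msr-tools (a sketch assuming an Intel CPU where MSR 0x1FC bit
| 0 is the BD PROCHOT enable; verify for your model, and note
| this disables a safety interlock):
|
|     sudo modprobe msr
|     # Read MSR_POWER_CTL, clear bit 0 (BD PROCHOT), write back:
|     val=$(sudo rdmsr -d 0x1FC)
|     sudo wrmsr 0x1FC $(( val & ~1 ))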
| bityard wrote:
| A laptop I had would assert PROCHOT if it didn't like the
| power supply you plugged into it. It took me an embarrassingly
| long time to notice that this was what was causing Slack to be
| inexplicably slower at my desk than when I was out working in
| a common area of the building.
| TZubiri wrote:
| I'm heavily against both relying on free dependencies and going
| for the cheapest option.
|
| If you can't put yourself in the vendor's shoes for a second
| when evaluating a purchase, and you just braindead try to make
| costs go lower and income go higher, you're NGMI, except in
| shady sales businesses.
|
| Server hardware is incredibly cheap. If you are a somewhat
| competent programmer you can handle most programs on a single
| server or even a virtual machine. Just give them a little bit
| of margin and pay $50/mo instead of $25/mo. It's not even
| enough to guarantee they won't go broke or make you a valuable
| customer; you'll still be banking on whales to make the whole
| thing profitable.
|
| Also, if your business is in the US, find a US host ffs.
| KennyBlanken wrote:
| No? Maybe you cloud kids don't know how this stuff works, but
| unmanaged just means you get silicon-level access and remote
| KVM.
|
| It's still the hosting company's responsibility to competently
| own, maintain, and repair the physical hardware. That includes
| monitoring. In the old days you had to run a script or install
| a package to hook into their monitoring... but with IPMI et
| al. being standard, they don't need anything from you to do
| their job.
|
| The _only_ time a hosting company should be hands-off is when
| they're just providing rack space, power, and data. Anything
| beyond that is between you and them in a contract/agreement.
|
| Every time I hear Hetzner come up in the last few years it's
| been a story about them being incompetent. If they're not
| detecting things like CPU fan failures on their own hardware
| _and_ they deployed new systems without properly testing them
| first, then that's just further evidence they're still
| slipping.
| scottcha wrote:
| I'd like to see which CPU governor is running on those systems
| before assuming a power cap is in place. Lots of default
| installs of Linux ship with the powersave governor, which will
| limit your max frequencies and, through that, the max power
| you can hit.
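|
| Easy to check (assuming cpufreq is exposed; `cpupower` ships
| in the linux-tools package on most distros):
|
|     cpupower frequency-info --policy
|     # or, without extra tools:
|     cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c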
| __m wrote:
| schedutil on mine, which is scheduled for mainboard
| replacement
| wink wrote:
| > One of the providers we like is Hetzner because of their
| affordable and reliable servers.
|
| > In the days that followed, the crash frequency increased.
|
| The article isn't conclusive about whether they would still
| call them reliable.
| aduffy wrote:
| To their credit they actually fixed the problem. Good luck
| getting this level of support from any of the big 3 public
| cloud providers.
| frenchtoast8 wrote:
| For example, AWS's Mac machines frequently run into hardware
| failures. My current job runs a measly 5 mac1.metal hosts for
| internal testing, and we experience hardware failures on
| these machines a few times a year. Doesn't sound like a lot,
| but these machines are almost always completely idle, and we
| almost never get host failures for Linux hosts. To make
| matters worse, sometimes a brand new instance needs
| replacement before it even comes up for the first time, which
| is annoying because you are billed a minimum of 24 hours for
| these instances. People have been complaining about this for
| years and seemingly nothing is being done about it.
|
| https://www.reddit.com/r/aws/comments/131v8md/beware_of_brok.
| ..
| cbozeman wrote:
| Hetzner's reliable... until they aren't.
|
| Since they don't do any sort of monitoring on their bare metal
| servers at all, at least insofar as I can tell having been a
| customer of theirs for ten years, you don't know there's a
| problem until there's a problem, unless you've got your own
| monitoring solution in place.
| vednig wrote:
| As a CI/CD provider, wouldn't it benefit Ubicloud to have
| their own servers?
| eitland wrote:
| They are in the early stages.
|
| I think the website said they recently raised 16 million euros
| (or dollars).
|
| Making investments in data centers and hardware could burn
| through that really quickly, in addition to needing more
| engineers.
|
| By using rented servers (and only renting them when a customer
| signs up) they avoid this problem.
| vednig wrote:
| Understood. I'd love to hear about it from the founders,
| though, and what went into their decision.
| fdr wrote:
| GP is more or less correct.
|
| Building and owning an institution that finances, racks,
| services, networks, and disposes of servers both takes time
| and increases the commitment level. Hetzner is month to
| month, with a fixed overhead for fresh leasing of servers:
| the set-up fee.
|
| This is a lot to administer when also building a software
| institution, and a business. It was not certain at the
| outset, for example, that the GitHub Actions Runner product
| would be as popular as it became. In its earliest form, it
| was partially an engineering test for our virtual machines,
| and we went around asking friendly contacts that we knew
| would report abnormalities to use it. There's another
| universe where it only went as far as an engineering test,
| and our utilization and revenue pattern (that is, utility
| to other people) is different.
| rikafurude21 wrote:
| A similar thing happened to an AX102 I currently use:
| something related to the network card caused crashes.
| Thankfully Hetzner support was helpful with replacement
| hardware. It caused quite some grief, but at least it was a
| good lesson in hardware troubleshooting. Worth it to me
| personally.
| yread wrote:
| Yep, same here. AX102 crashes with almost no load, nothing in
| the logs, won't come back on. Hetzner looked at it multiple
| times and either found nothing or replaced the CPU paste or a
| PSU connector. I migrated to an AX162, and so far so good.
| andai wrote:
| > Hetzner didn't confirm or deny the possibility of power
| limiting
|
| What are the consequences of power limiting? The article says it
| can cause hardware to degrade more quickly, why?
|
| Hetzner's lack of response here (and Ubicloud's measurements)
| seems to suggest they are indeed limiting power; if they
| weren't doing it, they'd say so, right?
| radicality wrote:
| Related and perhaps useful: I've seen this in multiple cloud
| offerings already, where the CPU scaling governor is set to
| some eco-friendly value, which benefits the cloud provider but
| gives you zero benefit and much reduced peak CPU performance.
|
| To check, run `cat /sys/devices/system/cpu/cpu*/
| cpufreq/scaling_governor`. It should be `performance`.
|
| If it's not, set it with `echo performance | sudo tee
| /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. If
| your workload is CPU hungry this will help. It will revert on
| startup, so make it stick with some cron/systemd or whichever.
|
| Of course, if you are the one paying for power or it's your
| own hardware, make your own judgement about the scaling
| governor. But if it's a rented bare metal server, you do want
| `performance`.
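|
| One way to make it stick is a tiny systemd unit (a sketch;
| the unit name is arbitrary):
|
|     # /etc/systemd/system/performance-governor.service
|     [Unit]
|     Description=Set CPU scaling governor to performance
|
|     [Service]
|     Type=oneshot
|     ExecStart=/bin/sh -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'
|
|     [Install]
|     WantedBy=multi-user.target
|
| Then `sudo systemctl enable --now performance-governor`.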
| chpatrick wrote:
| Is there any downside to ondemand? If your servers aren't
| running at 100% then there's no point wasting watts, even if
| you aren't paying for them, right?
| Tijdreiziger wrote:
| However, eco-friendly power modes can reduce electricity
| usage, so they can be friendlier for our climate.
|
| https://www.rvo.nl/onderwerpen/energie-besparen-de-
| industrie...
| rat9988 wrote:
| I'm not sure why you are downvoted. Is it wrong?
| kjellsbells wrote:
| Yes, but the point is that the customer has the agency to
| decide.
|
| If I rent a server I want to be able to run it to the
| maximum capacity, since I'm paying for all of it. It's
| dishonest to make me pay for X and give me < X. Idle CPU is
| wasted money.
|
| The flip side is that the provider should be also offering
| more climate friendly, lower power options. I'll still want
| to run them to the max, but the total energy consumed would
| be less than before.
|
| Also not forgetting that code efficiency matters if we want
| to get the max results for the minimum carbon spend.
| Another reason why giant web frameworks and bloated OSes
| depress me a little.
| nik736 wrote:
| Most other AX models (AX42, AX52 and AX102) also have serious
| reliability issues, failing after some months. They are based
| on a faulty motherboard. Hetzner has to replace most, if not
| all, motherboards for servers built before a certain date over
| the next 12 months [0].
|
| [0] https://docs.hetzner.com/robot/dedicated-server/general-
| info...
| gtirloni wrote:
| Anyone got experience with Ubicloud's OpenStack stack?
| fdr wrote:
| Ubicloud does not have an OpenStack dependency.
| gtirloni wrote:
| Thanks, I was under the impression it did but re-reading the
| posts I see it's not the case.
| jauntywundrkind wrote:
| > _To increase the number of machines under power constraints,
| data center operators usually cap power use per machine. However,
| this can cause motherboards to degrade more quickly._
|
| This was something I hadn't heard before, & a surprise to me.
| dangoodmanUT wrote:
| Is there a provider that's like bare metal but would detect
| these kinds of things mostly automatically? E.g., faulty or
| constantly crashing hardware.
| greggyb wrote:
| Managed servers: https://www.hetzner.com/managed-server/
|
| There are also others, but Hetzner is under discussion here.
| Tijdreiziger wrote:
| Managed servers are quite a different product, closer to
| 'old-school' shared webhosting.
|
| You don't get root access, but you do get a preinstalled LAMP
| stack and a web UI for management.
| urbandw311er wrote:
| Would anybody with data center experience be able to hazard a
| guess at what type of commercial resolution Hetzner would have
| reached with the motherboard supplier here? Would we assume
| all mobos were replaced free of charge, plus compensation?
| bayindirh wrote:
| Dell has this problem sometimes. I remember getting the first
| batch of one of their older server models when they were new.
| We had to replace the motherboards' rear I/O sections because
| the servers would lose some devices on that part (e.g.
| Ethernet controllers, iDRAC, sometimes the BIOS) for a while.
| After shaking out these problems, they ran for almost a
| decade.
|
| We recently retired them because we had worn down everything
| on these servers, from RAID cards to power regulators.
| Rebooting a perfectly running server for a configuration
| change and losing the RAID card forever, because
| electromigration had eroded a trace inside the RAID processor,
| is a sobering experience.
| merb wrote:
| Dell has tons of issues. A faulty mini-board for the front LED
| panel can actually stop the server from booting/running at all
| (even the iDRAC will be dead).
| indulona wrote:
| I am so glad my sign-up process with Hetzner failed back when
| I was dumb enough to want to give them a chance, even with the
| internet full of horrific stories of bad experiences from
| their customers. Lucky me.
| cbozeman wrote:
| Hetzner is fine for what it is, you just need to know that it's
| all on _you_ and only _YOU_.
|
| _YOU_ do the monitoring.
|
| _YOU_ do the troubleshooting.
|
| _YOU_ etc., etc.
|
| If that doesn't appeal to you, or if you don't have the
| requisite knowledge, which I admit is fairly broad and
| encompassing, then it's not for you. For those of you who
| check those boxes, they're a pretty amazing deal.
|
| Where else could I get a 4c/8t CPU with 32 GB of RAM and four
| (4) 6TB disks for $38 a month? I really don't know of many
| places with that much hardware for that little cost. And yes,
| it's an Intel i7-3770, but I don't care. It's still a hell of
| a lot of hardware for not much money.
| nobankai wrote:
| You should have asked to delete your account after what you
| said about Jordan Neely:
| https://news.ycombinator.com/item?id=42969922
|
| I will never let you live down this disgusting comment and
| hope it haunts your character until your last breath. You
| better hope I don't repost it under the next comment you make
| demanding moral authority from your audience.
___________________________________________________________________
(page generated 2025-02-19 23:00 UTC)