[HN Gopher] Systemd: Enable Indefinite Service Restarts
___________________________________________________________________
Systemd: Enable Indefinite Service Restarts
Author : secure
Score : 66 points
Date : 2024-01-17 20:08 UTC (1 day ago)
(HTM) web link (michael.stapelberg.ch)
(TXT) w3m dump (michael.stapelberg.ch)
| tadfisher wrote:
| This must be a different philosophy. When I see something like
| this happening, I investigate to find out _why_ the service is
| failing to start, which usually uncovers some dependency that can
| be encoded in the service unit, or some bug in the service.
| chpatrick wrote:
| If your server has a bug that makes it crash every two hours
| you still want it up the rest of the time until you fix it.
| tekla wrote:
| Of course you understand you can do both, like I do.
| zhengyi13 wrote:
| I think the author's specified use case is to address transient
| conditions that drive failures.
|
| When the given (transient) condition goes away (either
| passively, or because somebody fixed something), then the
| service comes back without anyone needing to remember to
| restart the (now dead) service.
|
| By way of example, I've run apps that would refuse to come up
| fully if they couldn't hit the DB at startup. Alternatively,
| they might also die if their DB connection went away. App lives
| on one server; DB lives on another.
|
| It'd be awfully nice in that case to be able to fix the DB, and
| have the app service come back automatically.
| ot wrote:
| Imagine you use systemd to manage daemons in a large
| distributed system. Crashes could be caused by a failure in a
| dependency. Once you fix the dependency, you want all your
| systems to recover as quickly as possible, you don't want to go
| through each one of them to manually restart things.
|
| This doesn't mean that you don't investigate, it just means
| that you have an additional guarantee that the system can
| automatically eventually recover.
|
| If you set a limit on the number or rate of restarts, what's a
| reasonable limit? That will be context dependent, and as soon
| as it's more than a few minutes, it may as well be infinite.
| mise_en_place wrote:
| That's exactly why systemd should blindly attempt to restart
| the service infinitely. Separation of concerns. An init system
| should simply start and monitor services. That is what an init
| system is meant to do. The fact that systemd is overengineered
| and tries to do multiple things causes headaches for a lot of
| us. Busybox-init is one of the best alternatives, I would use
| that everywhere if I could.
| akira2501 wrote:
| I've always preferred daemontools and runit's ideology here. If a
| service dies, wait one second, then try starting it. Do this
| forever.
|
| The last thing I need is emergent behavior out of my service
| manager.
| freedomben wrote:
| Systemd can do exactly that, it just doesn't by default. But
| if that's what you want, it's trivial.
| akira2501 wrote:
| Is it possible to do this system wide? Or do I have to do it
| for each individual service? It may be a trivial amount of
| work but if the configuration is fragile, I've gained
| nothing.
| izacus wrote:
| It's literally described in the article.
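The runit-style "wait a second, then retry forever" policy discussed above can be sketched as a per-unit configuration roughly like this (the unit name and `ExecStart=` path are placeholders; see `man systemd.service` and `man systemd.unit` for the exact semantics):

```ini
[Unit]
Description=Example daemon supervised runit-style
# Disable the start rate limiting that otherwise makes
# systemd give up and mark the unit failed
StartLimitIntervalSec=0

[Service]
ExecStart=/usr/local/bin/example-daemon
# Restart no matter how the process exited...
Restart=always
# ...after a one-second pause
RestartSec=1s
```

There is no single switch for this system-wide, but distributions can ship such defaults via drop-in files.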
| mise_en_place wrote:
| I've been bitten by the restart limit many times. Our application
| server (backend) was crash looping, newest build fixed the crash,
| but systemd refused to restart the service due to the limit. A
| subtle but very annoying default behavior.
| dijit wrote:
| are you saying systemd was refusing to restart after manual
| intervention?
| mise_en_place wrote:
| Correct, because the startup limit had been reached: `service
| start request repeated too quickly, refusing to start`.
| dijit wrote:
| That's terrifying, systemd shouldn't pretend to be smarter
| than manual intervention.
|
| That violates everything I ever enjoyed Linux for; I left
| Windows because it thought it knew better than me.
| freedomben wrote:
| Did your deployment process/script not include restarting the
| service?
| mise_en_place wrote:
| It does, but systemd refused to start the service because of
| the startup limit.
| o11c wrote:
| It would be nice if `RestartSec` weren't constant.
|
| Then you could have the default be 100ms for one-time blips, but
| (after a burst of failures) fall back gradually to 10s to avoid
| spinning during longer outages.
|
| That said, beware of failure _chains_ causing the interval to add
| up. AFAIK there's no way to have the kernel notify you of when a
| different process starts listening on a port.
| dijit wrote:
| > AFAIK there's no way to have the kernel notify you of when a
| different process starts listening on a port.
|
| You can use mandatory access control for this.
|
| AppArmor and SELinux are examples.
|
| Unfortunately they are hard and not sexy, and sysadmins
| (people who tend to do the not-sexy, hard things) are a
| dead/dying breed.
| nomel wrote:
| > AFAIK there's no way to have the kernel notify you of when a
| different process starts listening on a port.
|
| Would the ExecCondition be appropriate here, minimally, with a
| script that runs `lsof -nP -iTCP:${yourport} -sTCP:LISTEN`?
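A rough sketch of that idea, assuming `lsof` is installed and using 5432 as a stand-in for `${yourport}` (note that a start skipped by a failing `ExecCondition=` is treated as stopping without error, so how it interacts with `Restart=` is worth verifying against the manpage):

```ini
[Service]
# Only proceed with startup if something is already listening on
# the port; lsof exits non-zero when nothing matches, which makes
# systemd skip this start attempt without marking the unit failed.
ExecCondition=/bin/sh -c 'lsof -nP -iTCP:5432 -sTCP:LISTEN >/dev/null'
ExecStart=/usr/local/bin/example-app
```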
| saint_yossarian wrote:
| There's `RestartSteps` and `RestartMaxDelaySec` for that, see
| the manpage `systemd.service`.
| o11c wrote:
| Ah, not in the man page on my system.
|
| Available since systemd 254, released July 2023 (only one
| release since then). Huh, has the release rate slowed down
| that much?
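Concretely, the growing-delay behavior o11c asked for could be configured like this sketch (values are illustrative; `RestartSteps=` and `RestartMaxDelaySec=` require systemd >= 254):

```ini
[Service]
Restart=on-failure
# First restart after 100ms for one-time blips...
RestartSec=100ms
# ...then grow the delay over 5 steps...
RestartSteps=5
# ...up to a maximum of 10s between attempts
RestartMaxDelaySec=10
```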
| halyconWays wrote:
| Seems reasonable if the service is failing due to a transient
| network issue, which takes many minutes to resolve.
| ElectricSpoon wrote:
| > I would guess the developers wanted to prevent laptops running
| out of battery too quickly
|
| And I would guess sysadmins also don't like their logging
| facilities filling the disks just because a service is stuck in a
| start loop. There are many reasons to think a service that has
| failed to start multiple times in a row won't start on the next
| attempt. Misconfiguration is probably the most frequent one.
| twic wrote:
| Exactly. If a service crashes within a second ten times in a
| row, it's not going to come up cleanly an eleventh time. The
| right thing to do is stay down, and let monitoring get the
| attention of a human operator who can figure out what the
| problem is. Continually restarting is just going to fill up
| logs, spam other services, and generally make trouble.
|
| I'm sure there are exceptions to this. For those, set
| Restart=always. But it's an absolutely terrible default.
| deathanatos wrote:
| Heh. We used syslog at one place, with it configured to push
| logs into ELK. The ingestion into ELK broke ... which caused
| syslog to start logging that it couldn't forward logs. Now that
| might seem like screaming into a void, but _that_ log went to
| local disk, and syslog retried it as fast as disk would
| otherwise allow, so instantly every machine in the fleet
| started filling up its disks with logs.
|
| (You can guess how we noticed the problem...)
|
| Also: use logrotate, and bound rotation on size, not just time.
| freedomben wrote:
| it's wild how easy it is to misconfigure (or not configure)
| logrotate properly and have a log file fill up the disk. Out
| of memory and/or out of disk are the two error cases that
| have led to the most pain in my career. I think most people
| who started with docker in the early days (long before there
| was a docker system prune) had this happen where old docker
| containers/images filled up the disk and wreaked havoc at an
| unsuspecting point.
| deathanatos wrote:
| > _Why does systemd give up by default?_
|
| > _I'm not sure. If I had to speculate, I would guess the
| developers wanted to prevent laptops running out of battery too
| quickly because one CPU core is permanently busy just restarting
| some service that's crashing in a tight loop._
|
| _sigh_ ... bounded randomized exponential backoff retry.
|
| (exponential: double the maximum time you might wait each
| iteration. Randomized: the time you want is a random amount,
| between [0, current maximum] (yes, zero.). Bounded: you stop
| doubling at a certain point, like 5 minutes, so that we'll never
| wait longer than 5 minutes; otherwise, at some point you're
| waiting for ∞ seconds, which I guess is like giving up.)
|
| (The concern about logs filling up is a worse one. It won't
| directly solve this, but a high enough max wait usually slows the
| rate of log generation enough that it becomes small enough to not
| matter. Also do your log rotations on size.)
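The retry scheme deathanatos describes (sometimes called "full jitter") can be sketched in a few lines of Python; the function name and defaults are illustrative:

```python
import random

def backoff_delays(base=0.1, cap=300.0):
    """Yield bounded, randomized, exponential backoff delays.

    base: initial ceiling on the wait, in seconds
    cap:  the ceiling never grows past this bound (e.g. 5 minutes)
    """
    ceiling = base
    while True:
        # Randomized: wait a uniform amount in [0, ceiling]
        # (yes, zero is allowed).
        yield random.uniform(0.0, ceiling)
        # Exponential: double the ceiling after each failure...
        # Bounded: ...but never let it exceed the cap.
        ceiling = min(ceiling * 2, cap)
```

A supervisor would draw the next delay from this generator after each failed start, and create a fresh generator once the service stays up for long enough.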
| kaba0 wrote:
| Arguably, this logic should live in another place that monitors
| the service.
|
| Especially since service startup failure is usually not
| something that gets fixed on its own, like a network connection
| (where exponential backoff is (in)famous). A bad config file,
| or a failed disk won't recover in 10 minutes on its own, so
| systemd's default makes sense here, I believe.
| jxf wrote:
| Q: Why is the optimal lower bound zero and not "at least as
| long as you waited last time"?
| isatty wrote:
| Regardless, all these opinionated settings should be left to
| OS maintainers or similar. I don't see why a low level init
| system tries to make decisions for others. Yes, it may be with
| good intentions, but don't.
| bogota wrote:
| The number of times I've had to fight and debug systemd
| compared to any other init system is at least 10x.
|
| Yes, it does a lot of stuff for you; with others I had to
| write custom scripts, but those were much more understandable
| and maintainable long term. Sadly systemd won and now I build
| my own OS without it.
| izacus wrote:
| Seems like OS maintainers can set those settings, what
| exactly is the problem?
| franknord23 wrote:
| I believe this allows you to have cascading restart strategies,
| similar to what can be done in Erlang/OTP: Only after the
| StartLimit= has been reached, systemd considers the service as
| failed. Then services that have Requires= set on the failed
| service will be restarted/marked failed as well.
|
| I think you can even have systemd reboot or move the system into
| a recovery mode (target) if an essential unit does not come up.
| That way, you can get pretty robust systems that are highly
| tolerant to failures.
|
| (Now after reading `man systemd.unit`, I am not fully sure how
| exactly restarts are cascaded to requiring units.)
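A sketch of that escalation idea (unit names are illustrative; `OnFailure=` and `FailureAction=` are real directives, but check `man systemd.unit` for their exact semantics):

```ini
[Unit]
Description=Essential daemon
# When this unit finally enters the failed state (e.g. after
# exhausting its start limit), activate a fallback target...
OnFailure=rescue.target
# ...or, more drastically, reboot the machine instead:
# FailureAction=reboot

[Service]
ExecStart=/usr/local/bin/essential-daemon
Restart=on-failure
```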
| twinpeak wrote:
| Recently discovered while making a monitoring script that systemd
| exposes a few properties that can be used to alert on a service
| that is continuously failing to start if it's set to restart
| indefinitely.
|
|     # Get the number of restarts for a service, to see if it
|     # exceeds an arbitrary threshold.
|     systemctl show -p NRestarts "${SYSTEMD_UNIT}" | cut -d= -f2
|
|     # Get when the service started, to work out how long it's
|     # been running, as the restart counter isn't reset once the
|     # service does start successfully.
|     systemctl show -p ActiveEnterTimestamp "${SYSTEMD_UNIT}" | cut -d= -f2
|
|     # Clear the restart counter if the service has been running
|     # for long enough, based on the timestamp above.
|     systemctl reset-failed "${SYSTEMD_UNIT}"
___________________________________________________________________
(page generated 2024-01-18 23:00 UTC)