[HN Gopher] Systemd: Enable Indefinite Service Restarts
       ___________________________________________________________________
        
       Systemd: Enable Indefinite Service Restarts
        
       Author : secure
       Score  : 66 points
       Date   : 2024-01-17 20:08 UTC (1 day ago)
        
 (HTM) web link (michael.stapelberg.ch)
 (TXT) w3m dump (michael.stapelberg.ch)
        
       | tadfisher wrote:
       | This must be a different philosophy. When I see something like
       | this happening, I investigate to find out _why_ the service is
       | failing to start, which usually uncovers some dependency that can
       | be encoded in the service unit, or some bug in the service.
        
         | chpatrick wrote:
          | If your server has a bug that makes it crash every two
          | hours, you still want it up the rest of the time until you
          | fix it.
        
         | tekla wrote:
         | Of course you understand you can do both, like I do.
        
         | zhengyi13 wrote:
         | I think the author's specified use case is to address transient
         | conditions that drive failures.
         | 
         | When the given (transient) condition goes away (either
         | passively, or because somebody fixed something), then the
         | service comes back without anyone needing to remember to
         | restart the (now dead) service.
         | 
         | By way of example, I've run apps that would refuse to come up
         | fully if they couldn't hit the DB at startup. Alternatively,
         | they might also die if their DB connection went away. App lives
         | on one server; DB lives on another.
         | 
         | It'd be awfully nice in that case to be able to fix the DB, and
         | have the app service come back automatically.
        
         | ot wrote:
         | Imagine you use systemd to manage daemons in a large
         | distributed system. Crashes could be caused by a failure in a
          | dependency. Once you fix the dependency, you want all your
          | systems to recover as quickly as possible; you don't want to
          | go through each one of them to manually restart things.
         | 
         | This doesn't mean that you don't investigate, it just means
         | that you have an additional guarantee that the system can
         | automatically eventually recover.
         | 
          | If you set a limit on the number of restarts or on the time
          | window, what's a reasonable limit? That will be context-
          | dependent, and as soon as it's more than a few minutes, it
          | may as well be infinite.
        
         | mise_en_place wrote:
         | That's exactly why systemd should blindly attempt to restart
          | the service infinitely. Separation of concerns. An init system
         | should simply start and monitor services. That is what an init
         | system is meant to do. The fact that systemd is overengineered
         | and tries to do multiple things causes headaches for a lot of
         | us. Busybox-init is one of the best alternatives, I would use
         | that everywhere if I could.
        
       | akira2501 wrote:
       | I've always preferred daemontools and runit's ideology here. If a
       | service dies, wait one second, then try starting it. Do this
       | forever.
       | 
       | The last thing I need is emergent behavior out of my service
       | manager.
        
         | freedomben wrote:
          | Systemd can do exactly that, it just doesn't do it by
          | default. But if that's what you want, it's trivial.
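          | 
          | Something like this should do it (an untested sketch of
          | what I believe the article describes: disable start rate
          | limiting and always restart):
          | 
          |     [Unit]
          |     StartLimitIntervalSec=0
          | 
          |     [Service]
          |     Restart=always
          |     RestartSec=1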
        
           | akira2501 wrote:
           | Is it possible to do this system wide? Or do I have to do it
           | for each individual service? It may be a trivial amount of
           | work but if the configuration is fragile, I've gained
           | nothing.
        
             | izacus wrote:
             | It's literally described in the article.
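              | 
              | For the system-wide case, a sketch (untested; see
              | systemd-system.conf(5)) would be setting manager
              | defaults in /etc/systemd/system.conf:
              | 
              |     [Manager]
              |     DefaultStartLimitIntervalSec=0
              |     DefaultRestartSec=1s
              | 
              | Note there's no global default for Restart= itself,
              | so each unit still needs Restart=always (e.g. via a
              | drop-in).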
        
       | mise_en_place wrote:
        | I've been bitten by the restart limit many times. Our
        | application server (backend) was crash-looping; the newest
        | build fixed the crash, but systemd refused to restart the
        | service due to the limit. A subtle but very annoying default
        | behavior.
        
         | dijit wrote:
         | are you saying systemd was refusing to restart after manual
         | intervention?
        
           | mise_en_place wrote:
           | Correct, because the startup limit had been reached: `service
           | start request repeated too quickly, refusing to start`.
        
             | dijit wrote:
              | That's terrifying; systemd shouldn't pretend to be
              | smarter than manual intervention.
              | 
              | That violates everything I ever enjoyed Linux for; I
              | left Windows because it thought it knew better than me.
        
         | freedomben wrote:
         | Did your deployment process/script not include restarting the
         | service?
        
           | mise_en_place wrote:
           | It does, but systemd refused to start the service because of
           | the startup limit.
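            | 
            | (For reference, the start-rate counter can be cleared
            | before starting again; the unit name below is a
            | placeholder:)
            | 
            |     systemctl reset-failed myapp.service
            |     systemctl start myapp.service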
        
       | o11c wrote:
       | It would be nice if `RestartSec` weren't constant.
       | 
       | Then you could have the default be 100ms for one-time blips, but
       | (after a burst of failures) fall back gradually to 10s to avoid
       | spinning during longer outages.
       | 
        | That said, beware of failure _chains_ causing the interval
        | to add up. AFAIK there's no way to have the kernel notify you
        | when a different process starts listening on a port.
        
         | dijit wrote:
         | > AFAIK there's no way to have the kernel notify you of when a
         | different process starts listening on a port.
         | 
          | You can use mandatory access control for this.
          | 
          | AppArmor or SELinux are examples.
          | 
          | Unfortunately they are hard and not sexy, and sysadmins
          | (the people who tend to do the unsexy, hard things) are a
          | dead/dying breed.
        
         | nomel wrote:
         | > AFAIK there's no way to have the kernel notify you of when a
         | different process starts listening on a port.
         | 
         | Would the ExecCondition be appropriate here, minimally, with a
         | script that runs `lsof -nP -iTCP:${yourport} -sTCP:LISTEN`?
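          | 
          | A rough, untested sketch of that idea (port and binary
          | path are placeholders). One caveat: if ExecCondition=
          | exits non-zero, the start is skipped rather than failed,
          | so Restart= alone won't keep re-checking:
          | 
          |     [Service]
          |     ExecCondition=/bin/sh -c \
          |         'lsof -nP -iTCP:5432 -sTCP:LISTEN >/dev/null'
          |     ExecStart=/usr/local/bin/my-app
          |     Restart=always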
        
         | saint_yossarian wrote:
          | There are `RestartSteps` and `RestartMaxDelaySec` for that;
          | see the manpage `systemd.service`.
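          | 
          | Roughly (an untested sketch; needs systemd >= 254, and the
          | values are made up):
          | 
          |     [Service]
          |     Restart=always
          |     RestartSec=100ms
          |     RestartSteps=5
          |     RestartMaxDelaySec=10
          | 
          | That should grow the delay from 100ms toward 10s over 5
          | steps.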
        
           | o11c wrote:
           | Ah, not in the man page on my system.
           | 
            | Available since systemd 254, released July 2023 (only 1
            | release since then). Huh, has the release rate severely
            | slowed down?
        
       | halyconWays wrote:
       | Seems reasonable if the service is failing due to a transient
       | network issue, which takes many minutes to resolve.
        
       | ElectricSpoon wrote:
       | > I would guess the developers wanted to prevent laptops running
       | out of battery too quickly
       | 
        | And I would guess sysadmins also don't like their logging
        | facilities filling the disks just because a service is stuck
        | in a start loop. There are many reasons to think a service
        | that has failed to start multiple times in a row is not going
        | to start at all. Misconfiguration is probably the most
        | frequent one.
        
         | twic wrote:
         | Exactly. If a service crashes within a second ten times in a
         | row, it's not going to come up cleanly an eleventh time. The
         | right thing to do is stay down, and let monitoring get the
         | attention of a human operator who can figure out what the
          | problem is. Continually restarting is just going to fill up
          | logs, spam other services, and generally make trouble.
         | 
         | I'm sure there are exceptions to this. For those, set
         | Restart=always. But it's an absolutely terrible default.
        
         | deathanatos wrote:
         | Heh. We used syslog at one place, with it configured to push
         | logs into ELK. The ingestion into ELK broke ... which caused
         | syslog to start logging that it couldn't forward logs. Now that
          | might seem like screaming into a void, but _that_ log went
          | to local disk, and syslog retried as fast as the disk would
          | allow, so every machine in the fleet instantly started
          | filling up its disk with logs.
         | 
         | (You can guess how we noticed the problem...)
         | 
          | Also, logrotate. (And bounded on size.)
        
           | freedomben wrote:
            | It's wild how easy it is to misconfigure (or not
            | configure) logrotate properly and have a log file fill up
            | the disk. Out of memory and/or out of disk are the two
            | error cases that have led to the most pain in my career.
            | I think most people who started with docker in the early
            | days (long before there was a `docker system prune`) had
            | this happen: old docker containers/images filled up the
            | disk and wreaked havoc at an unsuspected moment.
        
       | deathanatos wrote:
       | > _Why does systemd give up by default?_
       | 
       | > _I'm not sure. If I had to speculate, I would guess the
       | developers wanted to prevent laptops running out of battery too
       | quickly because one CPU core is permanently busy just restarting
       | some service that's crashing in a tight loop._
       | 
       |  _sigh_ ... bounded randomized exponential backoff retry.
       | 
        | (exponential: double the maximum time you might wait each
        | iteration. Randomized: the time you actually wait is a random
        | amount between [0, current maximum] (yes, zero). Bounded: you
        | stop doubling at a certain point, like 5 minutes, so that
        | we'll never wait longer than 5 minutes; otherwise, at some
        | point you're waiting for ∞s, which I guess is like giving
        | up.)
       | 
       | (The concern about logs filling up is a worse one. It won't
       | directly solve this, but a high enough max wait usually slows the
       | rate of log generation enough that it becomes small enough to not
       | matter. Also do your log rotations on size.)
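        | 
        | (A minimal bash sketch of that policy, untested; $UNIT is a
        | placeholder:)
        | 
        |     max=1
        |     until systemctl start "$UNIT"; do
        |         # randomized: sleep somewhere in [0, max] seconds
        |         sleep "$(( RANDOM % (max + 1) ))"
        |         # exponential: double the cap each iteration
        |         max=$(( max * 2 ))
        |         # bounded: never wait longer than 5 minutes
        |         (( max > 300 )) && max=300
        |     done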
        
         | kaba0 wrote:
         | Arguably, this logic should live in another place that monitors
         | the service.
         | 
          | Especially since service startup failure is usually not
          | something that gets fixed on its own, unlike a network
          | connection (where exponential backoff is (in)famous). A bad
          | config file or a failed disk won't recover in 10 minutes on
          | its own, so systemd's default makes sense here, I believe.
        
         | jxf wrote:
         | Q: Why is the optimal lower bound zero and not "at least as
         | long as you waited last time"?
        
         | isatty wrote:
          | Regardless, all these opinionated settings should be set by
          | OS maintainers or similar. I don't see why a low-level init
          | system tries to make decisions for others. Yes, it may be
          | with good intentions, but don't.
        
           | bogota wrote:
            | The number of times I've had to fight and debug systemd,
            | compared to any other init system, is at least 10x.
            | 
            | Yes, it does a lot of stuff for you that I had to write
            | custom scripts for with other init systems, but those
            | scripts were much more understandable and maintainable
            | long term. Sadly systemd won, and now I build my own OS
            | without it.
        
           | izacus wrote:
            | Seems like OS maintainers can set those settings; what
            | exactly is the problem?
        
       | franknord23 wrote:
        | I believe this allows you to have cascading restart
        | strategies, similar to what can be done in Erlang/OTP: only
        | after the start limit (StartLimitBurst=) has been reached
        | does systemd consider the service failed. Then services that
        | have Requires= set on the failed service will be
        | restarted/marked failed as well.
        | 
        | I think you can even have systemd reboot or move the system
        | into a recovery mode (target) if an essential unit does not
        | come up; see the sketch below. That way, you can get pretty
        | robust systems that are highly tolerant to failures.
        | 
        | (Now, after reading `man systemd.unit`, I am not fully sure
        | how exactly restarts are cascaded to requiring units.)
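        | 
        | (An untested sketch of the reboot/recovery idea; the unit
        | body and paths are placeholders:)
        | 
        |     [Unit]
        |     OnFailure=rescue.target
        |     StartLimitIntervalSec=10
        |     StartLimitBurst=5
        |     # or instead of OnFailure=: StartLimitAction=reboot
        | 
        |     [Service]
        |     ExecStart=/usr/local/bin/essential-daemon
        |     Restart=on-failure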
        
       | twinpeak wrote:
        | While writing a monitoring script, I recently discovered
        | that systemd exposes a few properties that can be used to
        | alert on a service that is continuously failing to start if
        | it's set to restart indefinitely.
        | 
        |     # Get the number of restarts for a service, to see if it
        |     # exceeds an arbitrary threshold.
        |     systemctl show -p NRestarts "${SYSTEMD_UNIT}" | cut -d= -f2
        | 
        |     # Get when the service started, to work out how long it's
        |     # been running, as the restart counter isn't reset once
        |     # the service does start successfully.
        |     systemctl show -p ActiveEnterTimestamp \
        |         "${SYSTEMD_UNIT}" | cut -d= -f2
        | 
        |     # Clear the restart counter if the service has been
        |     # running for long enough, based on the timestamp above.
        |     systemctl reset-failed "${SYSTEMD_UNIT}"
        
       ___________________________________________________________________
       (page generated 2024-01-18 23:00 UTC)