[HN Gopher] Writing systemd units that stop gracefully before sh...
       ___________________________________________________________________
        
       Writing systemd units that stop gracefully before shutdown
        
       Author : dghubble
       Score  : 151 points
       Date   : 2022-10-26 16:14 UTC (6 hours ago)
        
 (HTM) web link (www.psdn.io)
 (TXT) w3m dump (www.psdn.io)
        
       | SrslyJosh wrote:
       | It shouldn't be this hard to stop a service gracefully. This is
       | far, far more complicated than SysV init, where you just need to
       | drop a script into /etc/init.d and symlink it from the appropriate rc
       | directories. (For shutdown/reboot, you'd create symlinks in rc5.d
       | and rc6.d named KNNwhatever, where NN is an integer that
       | specifies the order the script will be run in. The "K" stands for
       | "kill".)
       | 
       | Edit: Note that my example runlevels are for Solaris, other
       | UNIX/Linux OSes will vary.
        
         | LukeShu wrote:
         | > It shouldn't be this hard to stop a service gracefully.
         | 
         | It's not.
         | 
         | > you just need to drop a script into /etc/init.d and symlink
         | it from the appropriate rc directories.
         | 
         | You just need to drop a unit file into /etc/systemd/system/ and
         | symlink it from the appropriate
         | /etc/systemd/system/${target}.wants/ directories.
         | 
         | Don't tell me that "shutdown.target.wants" and
         | "reboot.target.wants" are harder than "rc0.d" and "rc6.d".
         | 
         | A lot of the article is about ordering of dependencies (don't
         | stop a dependency until after the dependent has stopped). Don't
         | tell me that adding `Before=` and `After=` lines in the unit
         | file is harder than having to remember all of the dependencies
         | and manually figure out the correct "NN" for it all to work
         | correctly.
         | 
         | A lot of the article is about either having your daemon handle
         | SIGTERM, or coming up with the appropriate `ExecStop=` command.
         | The same command you'd be writing in your rc script (the
         | "handle SIGTERM" stuff being for if your rc script simply says
         | `kill $PID`).
         | 
         | That is: The complex parts of the article are things that were
         | complex with sysvinit too.
        
       | SoftTalker wrote:
       | I fundamentally disagree with the idea that software should
       | require or even expect a graceful shutdown. You can never stop
       | the user from yanking the power cord out of the socket, which is
       | what they will do if you force a bunch of housekeeping to happen
       | before shutdown.
       | 
       | You have to deal with crash/power failure recovery anyway. So do
       | your housekeeping on startup. Shutdown should be a quick and
       | simple termination.
        
         | maw wrote:
         | I'm with you.
         | 
         | It's easier said than done, of course, but crash-only software
         | is a worthwhile goal IMO.
        
         | mixmastamyk wrote:
         | The twenty years of laptops I've had wouldn't even flinch at a
         | power cord disconnect.
        
         | jefftk wrote:
         | Just because you need to be able to handle an employee being
         | hit by a bus doesn't mean employees should ghost their
         | companies, or that companies shouldn't have systems for when
         | someone gives their two weeks notice.
         | 
         | The article gives the examples that "A load balancer might stop
         | accepting new connections and disable its readiness endpoint. A
         | database might flush to disk. An agent might inform a cluster
         | it's leaving the group." All of these seem like they're worth
         | doing, and improve expected case shutdown behavior, though you
         | should also write and test the abrupt shutdown case.
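
As a rough illustration of the load balancer example quoted above, here is a minimal Python sketch of "stop accepting new connections and disable the readiness endpoint". The `Drainer` class and its method names are hypothetical, not taken from the article; a real service would wire `ready` into its health check and call `begin_drain` from its stop handler.

```python
class Drainer:
    """Tracks readiness and in-flight work for a hypothetical service."""

    def __init__(self):
        self.ready = True      # what a readiness endpoint would report
        self.in_flight = 0     # connections currently being served

    def accept(self):
        """Admit a new connection only while the service is ready."""
        if not self.ready:
            return False
        self.in_flight += 1
        return True

    def finish(self):
        """Record that one in-flight connection has completed."""
        self.in_flight -= 1

    def begin_drain(self):
        """Fail readiness so new traffic gets routed elsewhere."""
        self.ready = False

    def drained(self):
        """True once draining has begun and no connections remain."""
        return not self.ready and self.in_flight == 0
```

The abrupt case still has to be handled separately: if the process is killed before `drained()` becomes true, whatever was in flight is simply lost.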
        
         | nerdponx wrote:
         | Hope for the best, plan for the worst, right?
         | 
         | Here's a contrived analogy: modern airplanes are designed to
         | stay in the air even if an engine burns out, but we would still
         | rather fly with both engines at full power whenever possible.
        
         | empthought wrote:
         | This is a weird take; most systems in data centers don't have
         | people walking from rack to rack yanking power cords, and most
         | consumer systems don't even have a power cord to yank.
        
           | bravetraveler wrote:
           | While I agree it's a bit of a weird take, there may be, for
           | example, performance tradeoffs made in any given workload to
           | make the disk consistent, inconsistently.
           | 
           | The 'most' there is doing a lot of work.
           | 
           | It is actually quite a common practice for those being
           | audited for disaster recovery to do exactly that: yank
           | cables. More realistically, flip some switches.
           | 
           | We do it once a year: set aside a region and a time, then
           | test our processes.
           | 
           | It serves a few purposes. Most importantly: are our
           | services fault tolerant, and can we bring them back?
           | 
           | I think it's reasonable to trap the signals and make a best
           | effort, knowing that PID 1 (or the environment) will
           | eventually have to SIGKILL you, ready or not.
           | 
           | Just because we can't save all of the state doesn't mean we
           | shouldn't try.
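
The "trap the signal and make a best effort" idea might look roughly like this in Python. This is a sketch under assumptions: `STOP_BUDGET_SECONDS`, `flush_state`, and `run` are hypothetical names, and the in-memory list stands in for real writes. Under systemd the actual time limit comes from the unit's `TimeoutStopSec=` setting, after which the remaining processes are sent SIGKILL.

```python
import os
import signal
import time

# Hypothetical budget; keep it below the unit's TimeoutStopSec=.
STOP_BUDGET_SECONDS = 5.0

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only record the request, do the work elsewhere."""
    global stop_requested
    stop_requested = True


def flush_state(items, deadline):
    """Best effort: save as many items as the deadline allows."""
    saved = []
    for item in items:
        if time.monotonic() >= deadline:
            break  # out of time; recovery on startup handles the rest
        saved.append(item)  # stand-in for a real write or API call
    return saved


def run(items):
    """Serve until SIGTERM arrives, then clean up within the budget."""
    signal.signal(signal.SIGTERM, request_stop)
    while not stop_requested:
        time.sleep(0.1)  # stand-in for normal service work
    deadline = time.monotonic() + STOP_BUDGET_SECONDS
    return flush_state(items, deadline)
```

The handler only sets a flag, so the cleanup runs in the main loop rather than inside the signal handler, and anything not saved before the deadline is left for crash recovery.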
        
             | empthought wrote:
             | Right, there are failure modes that have to be tested and
             | accounted for, and one of them is the state being
             | inconsistent after a shutdown.
             | 
             | The previous poster seemed to advocate for not thinking of
             | this as a failure mode at all but rather normal operation,
             | which I just don't see as true.
        
         | thfuran wrote:
         | If I'm turning on a computer, it's because I want to use it
         | right now. If I'm turning off a computer, it's because I don't
         | need to be using it right now. I guess you do need to make sure
         | shutdown is fast enough that a laptop won't start cooking
         | itself if someone tells it to shut down and then immediately
         | sticks it in a bag, but it seems much more useful in general
         | to optimize for startup time. Providing the happy path of a
         | clean shutdown is useful for that, even if you do still
         | occasionally need to handle power failure recovery.
        
       | akeck wrote:
       | I love this. Lots of details I didn't know.
        
       | kzrdude wrote:
       | Halt is apparently not the same as poweroff.
        
         | chasil wrote:
         | A "halt -fp" just unmounts file systems and immediately shuts
         | down.
         | 
         | I find that CentOS systems that I've used for a while seem to
         | hang on shutdowns; halt -fp is a way to get them down quickly.
         | It is important to terminate any sensitive processes
         | beforehand.
        
           | SoftTalker wrote:
           | For systems that hang or take intolerably long to shut down, I
           | typically do:
           | 
           |     systemctl --force [poweroff|reboot]
           | 
           | From the man page, this means that "shutdown of all running
           | services is skipped, however all processes are killed and all
           | file systems are unmounted or mounted read-only, immediately
           | followed by the powering off."
        
         | andrewaylett wrote:
         | Kids of today, etc.
         | 
         | AT power supplies didn't have any mechanism for the system to
         | tell the power supply it wasn't needed any more. So when you
         | shut down the computer, it would wind up at a screen with a
         | message approximating "it is now safe to switch off your
         | computer", at which point the system would halt.
         | 
         | ATX power supplies added the ability for the OS to trigger an
         | actual power off. But that's a different end-state to halting,
         | and if you halt the system then it stays on. You may wonder why
         | anyone would want to halt when power off is an option, and to
         | be honest I'm not entirely sure -- possibly because you have a
         | hardware watchdog which will trigger a reboot of a halted
         | machine but not of a powered off machine?
        
       | buscoquadnary wrote:
        
       | CGamesPlay wrote:
       | A much more challenging task is writing a systemd unit that
       | _starts_ gracefully before shutdown. I wanted to write a unit
       | that could issue an API call to delete the instance rather than
       | doing a normal power off. Putting it at a reasonable place in the
       | sequence took a lot of trial and error! The trick is actually to
       | have the unit be _started_ at some point in normal boot up (e.g.
       | "armed") and then do the actual task when the unit is _stopped_.
       | 
       | Here's the unit I ended up with:
       | https://github.com/CGamesPlay/infra/blob/master/private-serv...
        
         | SrslyJosh wrote:
         | Sadly, I think this would be child's play with SysV init.
        
         | vngzs wrote:
         | I have the following unit file saved for that purpose:
         | 
         |     [Unit]
         |     # https://stackoverflow.com/questions/36729207/trigger-event-on-aws-ec2-instance-stop-terminate
         |     Description=unlink agent from remote server
         |     Before=shutdown.target
         | 
         |     [Service]
         |     Type=oneshot
         |     EnvironmentFile=-/etc/environment
         |     KillMode=none
         |     ExecStart=/bin/true
         |     ExecStop=/opt/service-name/shutdown-unlink
         |     RemainAfterExit=yes
         |     User=root
         | 
         |     [Install]
         |     WantedBy=multi-user.target
         | 
         | If I recall correctly, the KillMode=none is important as it
         | causes the shutdown-unlink binary to escape systemd process
         | supervision. Without it, you may deal with systemd immediately
         | halting your shutdown unit (and killing the process) when it
         | hits the shutdown target.
        
           | CGamesPlay wrote:
           | Your unit races with the network being brought down, since it
           | isn't listed as a "Before" target, FYI.
        
             | AnssiH wrote:
             | They need After=network.target, not Before=network.target.
             | 
             | The shutdown order is the reverse of startup order, and
             | they execute the payload on shutdown (ExecStop).
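
The reverse-ordering rule above can be modelled in a few lines of Python. This is only a toy model, not how systemd itself computes its transactions, and `unlink-agent.service` is a hypothetical unit name. The `after` mapping records, for each unit, the units it is ordered `After=`.

```python
from graphlib import TopologicalSorter


def start_order(after):
    """Start order: every unit comes after the units it is ordered After=."""
    return list(TopologicalSorter(after).static_order())


def stop_order(after):
    """Stop order is the start order reversed."""
    return list(reversed(start_order(after)))


# Hypothetical unit that declares After=network.target.
after = {
    "unlink-agent.service": {"network.target"},
    "network.target": set(),
}
```

With this mapping, `start_order(after)` puts `network.target` first, and `stop_order(after)` puts `unlink-agent.service` first, so its `ExecStop=` command runs while the network is still up.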
        
             | boring_twenties wrote:
             | I could be mistaken, but I don't think even that would be
             | sufficient? Before would make your command execute before
             | the network is brought down, but it wouldn't have the
             | latter wait for your command to actually complete.
        
               | dghubble wrote:
               | The post shows long-running stop scripts / containers and
               | demonstrates them delaying shutdown (not with KillMode
               | none though)
               | 
               | @CGamesPlay network.target's "primary purpose is for
               | ordering things properly at shutdown: since the shutdown
               | ordering of units in systemd is the reverse of the
               | startup ordering, any unit that is ordered
               | After=network.target can be sure that it is stopped
               | before the network is shut down if the system is powered
               | off."
               | 
               | https://www.freedesktop.org/wiki/Software/systemd/Network
               | Tar...
        
         | hinkley wrote:
         | I tend to push the metaphor of software being meant to be read
         | and only incidentally to be run by a computer as far as I can.
         | We are telling a story to future us or our successors. Stories
         | have rules and it's jarring when you violate them.
         | 
         | There's an idea in software that's a bit like the corollary of
         | Chekhov's gun. Chekhov's gun is about not presaging jarring
         | story elements that will never come to pass. But it's nearly as
         | jarring to leave important story arcs as complete surprises
         | until the end. Producing the gun moments before the curtain
         | goes down would be quite a WTF. We didn't know that was a
         | possibility. That's a niche that some people occupy, but it is
         | a niche.
         | 
         | Introducing things early fights with Locality of Reference, but
         | when we're talking about things of deep, dramatic importance
         | (like an attempted murder, or a reaper process) it's important
         | to introduce that "character" early in the story so that people
         | know that it exists. Failing to do so is a form of deus ex
         | machina and we only appreciate that in very small doses.
         | 
         | So framed that way, I don't see a problem with having to start
         | a killswitch while you're spinning everything up. It's there,
         | people can see it, and know to ask questions about it.
        
           | slivanes wrote:
           | Is this ML output?
        
             | renewiltord wrote:
             | If it is, then call me a sucker because I thought it was an
             | entertaining perspective.
        
             | glitchcrab wrote:
             | If it's not then it's a lot of rambling nonsense instead.
        
               | hinkley wrote:
               | Seems a few people managed to follow it.
        
             | quickthrower2 wrote:
             | Or mushroom output?
        
         | [deleted]
        
         | dghubble wrote:
         | That's what this post builds toward solving at the end and in the
         | next post: having a unit that deletes the instance from a cluster
         | before shutdown.
        
       | exikyut wrote:
       | Meta: I can't reach this website!
       | 
       | Chrome is giving me an instant NXDOMAIN error.
       | 
       | Dig shows that
       | 
       |     $ dig psdn.io @1.1.1.1
       |     ...
       |     ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19283
       |     ...
       |     ;; QUESTION SECTION:
       |     ;psdn.io.                       IN      A
       | 
       | so then I prefix "www." like is in the URL...
       | 
       |     $ dig www.psdn.io @1.1.1.1
       |     ...
       |     ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 64024
       |     ...
       |     ;; QUESTION SECTION:
       |     ;www.psdn.io.                   IN      A
       | 
       |     ;; ANSWER SECTION:
       |     www.psdn.io.            300     IN      CNAME   poseidon-www.pages.dev.
       | 
       | Okay, fine:
       | 
       |     $ dig poseidon-www.pages.dev @1.1.1.1
       |     ...
       |     ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 55471
       |     ...
       |     ;; QUESTION SECTION:
       |     ;poseidon-www.pages.dev.        IN      A
       | 
       | ...wat??
       | 
       | (Where there's no ANSWER section, none was returned, just an
       | AUTHORITY section)
       | 
       | This is reproducible for me with 1.1.1.1, 8.8.8.8 and 9.9.9.9.
        
         | dghubble wrote:
         | Hmm, sorry you're not seeing it. It's just a CNAME to Cloudflare
         | Pages, nothing fancy
         | 
         |     dig www.psdn.io @1.1.1.1
         | 
         |     ; <<>> DiG 9.16.33-RH <<>> www.psdn.io @1.1.1.1
         |     ;; global options: +cmd
         |     ;; Got answer:
         |     ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37362
         |     ;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
         | 
         |     ;; QUESTION SECTION:
         |     ;www.psdn.io.                   IN      A
         | 
         |     ;; ANSWER SECTION:
         |     www.psdn.io.            126     IN      CNAME   poseidon-www.pages.dev.
         |     poseidon-www.pages.dev. 126     IN      A       172.66.45.44
         |     poseidon-www.pages.dev. 126     IN      A       172.66.46.212
        
       ___________________________________________________________________
       (page generated 2022-10-26 23:00 UTC)