[HN Gopher] Writing systemd units that stop gracefully before sh...
___________________________________________________________________
Writing systemd units that stop gracefully before shutdown
Author : dghubble
Score : 151 points
Date : 2022-10-26 16:14 UTC (6 hours ago)
(HTM) web link (www.psdn.io)
(TXT) w3m dump (www.psdn.io)
| SrslyJosh wrote:
| It shouldn't be this hard to stop a service gracefully. This is
| far, far more complicated than SysV init, where you just need to
| drop a script into /etc/init.d and symlink it from the appropriate rc
| directories. (For shutdown/reboot, you'd create symlinks in rc5.d
| and rc6.d named KNNwhatever, where NN is an integer that
| specifies the order the script will be run in. The "K" stands for
| "kill".)
|
| Edit: Note that my example runlevels are for Solaris, other
| UNIX/Linux OSes will vary.
| LukeShu wrote:
| > It shouldn't be this hard to stop a service gracefully.
|
| It's not.
|
| > you just need to drop a script into /etc/init.d and symlink it
| from the appropriate rc directories.
|
| You just need to drop a unit file into /etc/systemd/system/ and
| symlink it from the appropriate
| /etc/systemd/system/${target}.wants/ directories.
|
| Don't tell me that "shutdown.target.wants" and
| "reboot.target.wants" are harder than "rc0.d" and "rc6.d".
|
| A lot of the article is about ordering of dependencies (don't
| stop a dependency until after the dependent has stopped). Don't
| tell me that adding `Before=` and `After=` lines in the unit
| file is harder than having to remember all of the dependencies
| and manually figure out the correct "NN" for it all to work
| correctly.
|
| A lot of the article is about either having your daemon handle
| SIGTERM, or coming up with the appropriate `ExecStop=` command.
| The same command you'd be writing in your rc script (the
| "handle SIGTERM" stuff being for if your rc script simply says
| `kill $PID`).
|
| That is: The complex parts of the article are things that were
| complex with sysvinit too.
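
The "handle SIGTERM" half of that comment can be sketched in a few lines. This is only a minimal illustration, assuming a single-threaded Python daemon; `GracefulStopper`, `serve`, `work_step`, and `cleanup` are hypothetical names, not anything from the article or from systemd.

```python
import signal
import time


class GracefulStopper:
    """Records that SIGTERM arrived so the main loop can exit cleanly."""

    def __init__(self):
        self.stop_requested = False

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler tiny: only set a flag, do the real work later.
        self.stop_requested = True


def serve(stopper, work_step, cleanup, poll_seconds=0.1):
    """Run work_step until a stop is requested, then run cleanup once."""
    steps = 0
    while not stopper.stop_requested:
        work_step()
        steps += 1
        time.sleep(poll_seconds)
    cleanup()  # e.g. flush to disk, leave the cluster
    return steps
```

With no `ExecStop=` configured, systemd's default stop action is to send SIGTERM to the service, so a loop like this would get a chance to run `cleanup` before the stop timeout expires.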
| SoftTalker wrote:
| I fundamentally disagree with the idea that software should
| require or even expect a graceful shutdown. You can never stop
| the user from yanking the power cord out of the socket, which is
| what they will do if you force a bunch of housekeeping to happen
| before shutdown.
|
| You have to deal with crash/power failure recovery anyway. So do
| your housekeeping on startup. Shutdown should be a quick and
| simple termination.
| maw wrote:
| I'm with you.
|
| It's easier said than done, of course, but crash-only software
| is a worthwhile goal IMO.
| mixmastamyk wrote:
| The twenty years of laptops I've had wouldn't even flinch at a
| power cord disconnect.
| jefftk wrote:
| Just because you need to be able to handle an employee being
| hit by a bus doesn't mean employees should ghost their
| companies, or that companies shouldn't have systems for when
| someone gives their two weeks notice.
|
| The article gives the examples that "A load balancer might stop
| accepting new connections and disable its readiness endpoint. A
| database might flush to disk. An agent might inform a cluster
| it's leaving the group." All of these seem like they're worth
| doing, and improve expected case shutdown behavior, though you
| should also write and test the abrupt shutdown case.
| nerdponx wrote:
| Hope for the best, plan for the worst, right?
|
| Here's a contrived analogy: modern airplanes are designed to
| stay in the air even if an engine burns out, but we would still
| rather fly with both engines at full power whenever possible.
| empthought wrote:
| This is a weird take; most systems in data centers don't have
| people walking from rack to rack yanking power cords, and most
| consumer systems don't even have a power cord to yank.
| bravetraveler wrote:
| While I agree it's a bit of a weird take, for example --
| there may be performance tradeoffs made in any given workload
| to make the disk consistent, inconsistently
|
| The 'most' there is doing some effort
|
| It is actually quite a common practice for those being
| audited for disaster recovery to do exactly that -- yank
| cables. More realistically, flip some switches
|
| We do it once a year, set aside a region and time... then
| test our processes
|
| It serves a few purposes, most importantly -- are our
| services fault tolerant, and can we bring them back?
|
| I think it's reasonable to trap the signals and make a best
| effort basis, knowing that PID 1 (or the environment) will
| eventually have to SIGKILL you -- ready or not
|
| Just because we can't save all of the state doesn't mean we
| shouldn't try
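
One way to read "best effort, knowing SIGKILL is coming" is as a time budget for cleanup. The sketch below is hypothetical: `best_effort_cleanup` and the task names are made up for illustration, and the budget would need to be chosen below whatever stop timeout the supervisor enforces (for systemd, `TimeoutStopSec=`).

```python
import time


def best_effort_cleanup(tasks, budget_seconds, clock=time.monotonic):
    """Run (name, callable) tasks in priority order until the budget is spent.

    Returns the names of the tasks that were started and finished. Tasks that
    did not fit in the budget are skipped and must be recovered at next start.
    """
    deadline = clock() + budget_seconds
    completed = []
    for name, task in tasks:
        if clock() >= deadline:
            break  # out of time: stop before the supervisor kills us
        task()
        completed.append(name)
    return completed
```

Ordering the tasks so that the most valuable state is saved first means a forced kill costs the least important work.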
| empthought wrote:
| Right, there are failure modes that have to be tested and
| accounted for, and one of them is the state being
| inconsistent after a shutdown.
|
| The previous poster seemed to advocate for not thinking of
| this as a failure mode at all but rather normal operation,
| which I just don't see as true.
| thfuran wrote:
| If I'm turning on a computer, it's because I want to use it
| right now. If I'm turning off a computer, it's because I don't
| need to be using it right now. I guess you do need to make sure
| shutdown is fast enough that a laptop won't start cooking
| itself if someone tells it to shut down and then immediately
| sticks it in a bag, but it seems far more useful in general
| to optimize for startup time. Providing the happy path of a
| clean shutdown is useful for that, even if you do still
| occasionally need to handle power failure recovery.
| akeck wrote:
| I love this. Lots of details I didn't know.
| kzrdude wrote:
| Halt is apparently not the same as poweroff.
| chasil wrote:
| A "halt -fp" just unmounts file systems and immediately shuts
| down.
|
| I find that CentOS systems that I've used for a while seem to
| hang on shutdowns; halt -fp is a way to get them down quickly.
| It is important to terminate any sensitive processes
| beforehand.
| SoftTalker wrote:
| For systems that hang or take intolerably long to shut down, I
| typically do:
|
|     systemctl --force [poweroff|reboot]
|
| From the man page, this means that "shutdown of all running
| services is skipped, however all processes are killed and all
| file systems are unmounted or mounted read-only, immediately
| followed by the powering off."
| andrewaylett wrote:
| Kids of today, etc.
|
| AT power supplies didn't have any mechanism for the system to
| tell the power supply it wasn't needed any more. So when you
| shut down the computer, it would wind up at a screen with a
| message approximating "it is now safe to switch off your
| computer", at which point the system would halt.
|
| ATX power supplies added the ability for the OS to trigger an
| actual power off. But that's a different end-state to halting,
| and if you halt the system then it stays on. You may wonder why
| anyone would want to halt when power off is an option, and to
| be honest I'm not entirely sure -- possibly because you have a
| hardware watchdog which will trigger a reboot of a halted
| machine but not of a powered off machine?
| buscoquadnary wrote:
| CGamesPlay wrote:
| A much more challenging task is writing a systemd unit that
| _starts_ gracefully before shutdown. I wanted to write a unit
| that could issue an API call to delete the instance rather than
| doing a normal power off. Putting it at a reasonable place in the
| sequence took a lot of trial and error! The trick is actually to
| have the unit be _started_ at some point in normal boot up (e.g.
| "armed") and then do the actual task when the unit is _stopped_.
|
| Here's the unit I ended up with:
| https://github.com/CGamesPlay/infra/blob/master/private-serv...
| SrslyJosh wrote:
| Sadly, I think this would be child's play with SysV init.
| vngzs wrote:
| I have the following unit file saved for that purpose:
|
|     [Unit]
|     # https://stackoverflow.com/questions/36729207/trigger-event-on-aws-ec2-instance-stop-terminate
|     Description=unlink agent from remote server
|     Before=shutdown.target
|
|     [Service]
|     Type=oneshot
|     EnvironmentFile=-/etc/environment
|     KillMode=none
|     ExecStart=/bin/true
|     ExecStop=/opt/service-name/shutdown-unlink
|     RemainAfterExit=yes
|     User=root
|
|     [Install]
|     WantedBy=multi-user.target
|
| If I recall correctly, the KillMode=none is important as it
| causes the shutdown-unlink binary to escape systemd process
| supervision. Without it, you may deal with systemd immediately
| halting your shutdown unit (and killing the process) when it
| hits the shutdown target.
| CGamesPlay wrote:
| Your unit races with the network being brought down, since it
| isn't listed as a "Before" target, FYI.
| AnssiH wrote:
| They need After=network.target, not Before=network.target.
|
| The shutdown order is the reverse of startup order, and
| they execute the payload on shutdown (ExecStop).
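
The "reverse of startup order" rule can be illustrated with a toy model. This is not how systemd is implemented; it only captures the ordering constraint, and `unlink-agent.service` is a hypothetical unit name standing in for the unit discussed above.

```python
from graphlib import TopologicalSorter


def start_order(after):
    """Return one valid start order.

    `after` maps each unit to the units it declares After=, i.e. the units
    that must be started before it.
    """
    return list(TopologicalSorter(after).static_order())


def stop_order(after):
    """Stopping runs the same constraints in reverse."""
    return list(reversed(start_order(after)))


# Hypothetical example: the unit is ordered After=network.target.
ordering = {
    "network.target": [],
    "unlink-agent.service": ["network.target"],
}

print(start_order(ordering))  # ['network.target', 'unlink-agent.service']
print(stop_order(ordering))   # ['unlink-agent.service', 'network.target']
```

In this model, declaring `After=network.target` is what puts the unit's stop step ahead of the network going down.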
| boring_twenties wrote:
| I could be mistaken, but I don't think even that would be
| sufficient? Before would make your command execute before
| the network is brought down, but it wouldn't have the
| latter wait for your command to actually complete.
| dghubble wrote:
| The post shows long-running stop scripts / containers and
| demonstrates them delaying shutdown (not with KillMode
| none though)
|
| @CGamesPlay network.target's "primary purpose is for
| ordering things properly at shutdown: since the shutdown
| ordering of units in systemd is the reverse of the
| startup ordering, any unit that is ordered
| After=network.target can be sure that it is stopped
| before the network is shut down if the system is powered
| off."
|
| https://www.freedesktop.org/wiki/Software/systemd/NetworkTar...
| hinkley wrote:
| I tend to push the metaphor of software being meant to be read
| and only incidentally to be run by a computer as far as I can.
| We are telling a story to future us or our successors. Stories
| have rules and it's jarring when you violate them.
|
| There's an idea in software that's a bit like the corollary of
| Chekhov's gun. Chekhov's gun is about not presaging jarring
| story elements that will never come to pass. But it's nearly as
| jarring to leave important story arcs as complete surprises
| until the end. Producing the gun moments before the curtain
| goes down would be quite a WTF. We didn't know that was a
| possibility. That's a niche that some people occupy, but it is
| a niche.
|
| Introducing things early fights with Locality of Reference, but
| when we're talking about things of deep, dramatic importance
| (like an attempted murder, or a reaper process) it's important
| to introduce that "character" early in the story so that people
| know that it exists. Failing to do so is a form of deus ex
| machina and we only appreciate that in very small doses.
|
| So framed that way, I don't see a problem with having to start
| a killswitch while you're spinning everything up. It's there,
| people can see it, and know to ask questions about it.
| slivanes wrote:
| Is this ML output?
| renewiltord wrote:
| If it is, then call me a sucker because I thought it was an
| entertaining perspective.
| glitchcrab wrote:
| If it's not then it's a lot of rambling nonsense instead.
| hinkley wrote:
| Seems a few people managed to follow it.
| quickthrower2 wrote:
| Or mushroom output?
| [deleted]
| dghubble wrote:
| That's what this post builds to solving at the end and in the
| next post - having a unit that deletes the instance from a cluster
| before shutdown
| exikyut wrote:
| Meta: I can't reach this website!
|
| Chrome is giving me an instant NXDOMAIN error.
|
| Dig shows that
|
|     $ dig psdn.io @1.1.1.1
|     ...
|     ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19283
|     ...
|     ;; QUESTION SECTION:
|     ;psdn.io.                       IN      A
|
| so then I prefix "www." like is in the URL...
|
|     $ dig www.psdn.io @1.1.1.1
|     ...
|     ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 64024
|     ...
|     ;; QUESTION SECTION:
|     ;www.psdn.io.                   IN      A
|
|     ;; ANSWER SECTION:
|     www.psdn.io.            300     IN      CNAME   poseidon-www.pages.dev.
|
| Okay, fine:
|
|     $ dig poseidon-www.pages.dev @1.1.1.1
|     ...
|     ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 55471
|     ...
|     ;; QUESTION SECTION:
|     ;poseidon-www.pages.dev.        IN      A
|
| ...wat??
|
| (Where there's no ANSWER section, none was returned, just an
| AUTHORITY section)
|
| This is reproducible for me with 1.1.1.1, 8.8.8.8 and 9.9.9.9.
| dghubble wrote:
| Hmm, sorry you're not seeing it. It's just a CNAME to Cloudflare
| Pages, nothing fancy
|
|     dig www.psdn.io @1.1.1.1
|     ; <<>> DiG 9.16.33-RH <<>> www.psdn.io @1.1.1.1
|     ;; global options: +cmd
|     ;; Got answer:
|     ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37362
|     ;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
|     ;; QUESTION SECTION:
|     ;www.psdn.io.                   IN      A
|     ;; ANSWER SECTION:
|     www.psdn.io.            126     IN      CNAME   poseidon-www.pages.dev.
|     poseidon-www.pages.dev. 126     IN      A       172.66.45.44
|     poseidon-www.pages.dev. 126     IN      A       172.66.46.212
___________________________________________________________________
(page generated 2022-10-26 23:00 UTC)