[HN Gopher] Graceful Shutdown in Go: Practical Patterns
       ___________________________________________________________________
        
       Graceful Shutdown in Go: Practical Patterns
        
       Author : mkl95
       Score  : 222 points
       Date   : 2025-05-04 21:09 UTC (1 day ago)
        
 (HTM) web link (victoriametrics.com)
 (TXT) w3m dump (victoriametrics.com)
        
       | wbl wrote:
        | If a distributed system relies on clients gracefully exiting in
        | order to work, the system will eventually break badly.
        
         | smcleod wrote:
         | Way back when, in physical land - I used STONITH for that!
         | https://smcleod.net/2015/07/delayed-serial-stonith/
        
         | XorNot wrote:
          | There are valid reasons to want the typical exit not to look
          | like a catastrophic one, even if that's a recoverable
          | situation.
          | 
          | Whether my application went down from SIGINT or from a hard
          | kill makes a big difference.
          | 
          | Blue-green migrations, for example, require graceful exit
          | behavior.
        
           | shoo wrote:
           | > Blue-Green migrations for example require a graceful exit
           | behavior.
           | 
           | it may not always be necessary. e.g. if you are deploying a
           | new version of a stateless backend service, and there is a
           | load balancer forwarding traffic to current version and new
           | version backends, the load balancer could be responsible for
           | cutting over, allowing in flight requests to be processed by
           | the current version backends while only forwarding new
           | requests to the new backends. then the old backends could be
           | ungracefully terminated once the LB says they are not
           | processing any requests.
        
         | ikiris wrote:
         | There's a big gap between graceful shutdown to be nice to
         | clients / workflows, and clients relying on it to work.
        
         | Thaxll wrote:
         | No one said that.
        
         | Rhapso wrote:
          | And I believe that so much that I don't even consider
          | graceful shutdown in design. Components should be able to
          | safely (and even frequently) hard-crash, and as long as a
          | critical percentage of the system is working as intended, it
          | shouldn't meaningfully impact the overall system.
          | 
          | The only way to make sure a system can handle components hard
          | crashing is if hard crashing is a normal thing that happens
          | all the time.
         | 
         | All glory to the chaos monkey!
        
         | eknkc wrote:
          | Yeah. However, I should not need to pull the plug to shut
          | things down, even if the software was designed to tolerate
          | that.
          | 
          | On second thought, though, maybe I do. That might be the only
          | way to ensure the assumption stays true. Like Netflix's chaos
          | monkey thing from a few years ago.
        
         | icedchai wrote:
         | _Relying_ on graceful exit and _supporting_ it are two
         | different things. You want to _support_ it so you can stop
         | serving clients without giving them nasty 5xx errors.
        
       | evil-olive wrote:
       | another factor to consider is that if you have a typical
       | Prometheus `/metrics` endpoint that gets scraped every N seconds,
       | there's a period in between the "final" scrape and the actual
       | process exit where any recorded metrics won't get propagated.
       | this may give you a false impression about whether there are any
       | errors occurring during the shutdown sequence.
       | 
       | it's also possible, if you're not careful, to lose the last few
       | seconds of logs from when your service is shutting down. for
       | example, if you write to a log file that is watched by a sidecar
       | process such as Promtail or Vector, and on startup the service
       | truncates and starts writing to that same path, you've got a race
       | condition that can cause you to lose logs from the shutdown.
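        | 
        | one mitigation for the metrics gap (a sketch, not a drop-in:
        | the Pushgateway URL and metric name are made up, and it assumes
        | you run a Pushgateway at all) is to push one final snapshot
        | right before exit, using the official client_golang push
        | package:
        | 
        |     package main
        | 
        |     import (
        |         "log"
        | 
        |         "github.com/prometheus/client_golang/prometheus"
        |         "github.com/prometheus/client_golang/prometheus/push"
        |     )
        | 
        |     // hypothetical counter incremented by shutdown code
        |     var shutdownErrors = prometheus.NewCounter(prometheus.CounterOpts{
        |         Name: "myapp_shutdown_errors_total",
        |         Help: "Errors observed during the shutdown sequence.",
        |     })
        | 
        |     // flushMetrics pushes a final snapshot so anything recorded
        |     // after the last scrape still gets propagated
        |     func flushMetrics() {
        |         if err := push.New("http://pushgateway:9091", "myapp").
        |             Collector(shutdownErrors).
        |             Push(); err != nil {
        |             log.Printf("final metrics push failed: %v", err)
        |         }
        |     }
        | 
        |     func main() {
        |         defer flushMetrics() // runs when main returns normally
        |         // ... run the service, handle SIGTERM, etc.
        |     }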
        
         | tmpz22 wrote:
          | Is it just me, or are observability stacks kind of
          | ridiculous? Logs, metrics, and traces, each with their own
          | databases, sidecars, and visualization stacks. Language-
          | specific integration libraries written by whoever felt like
          | it. MASSIVE cloud bills.
         | 
          | Then, after you go through all that effort, most of the data
          | is utterly ignored, and rarely are the business insights much
          | better than the trailer-park version: ssh'ing into a box and
          | grepping a log file to find the error output.
         | 
         | Like we put so much effort into this ecosystem but I don't
         | think it has paid us back with any significant increase in
         | uptime, performance, or ergonomics.
        
           | nkraft11 wrote:
           | I can say that going from a place that had all of that
           | observability tooling set up to one that was at the "ssh'ing
           | into a box and greping a log" stage, you best believe I
           | missed company A immensely. Even knowing which box to ssh
           | into, which log file to grep, and which magic words to search
            | for was nigh impossible if you weren't the dev that set up
           | the machine and wrote the bug in the first place.
        
             | MortyWaves wrote:
              | I completely agree with you, but I also think that, like
              | many aspects of "tech", certain segments of it have been
              | monopolised and turned into profit generators for certain
              | organisations. DevOps, Agile/Scrum, Observability, and
              | Kubernetes are all examples of this.
             | 
             | This dilutes the good and helpful stuff with marketing
             | bullshit.
             | 
              | Grafana seemingly inventing new time series databases and
              | engines every few months is absolutely painful to try to
              | keep up with in order to make informed decisions.
             | 
             | So much so I've started using rrdtool/smokeping again.
        
               | bbkane wrote:
               | You might look into https://openobserve.ai/ - you can
               | self host it and it's a single binary that ingests
               | logs/metrics/traces. I've found it useful for my side
               | projects.
        
           | 01HNNWZ0MV43FF wrote:
           | Programs are for people. That's why we got JSON, a bunch of
           | debuggers, Python, and so on. Programming is only like 10
           | percent of programming
        
           | evil-olive wrote:
           | if you're working on a system simple enough that "SSH to the
           | box and grep the log file" works, then by all means have at
           | it.
           | 
           | but many systems are more complicated than that. the
           | observability ecosystem exists for a reason, there is a real
           | problem that it's solving.
           | 
           | for example, your app might outgrow running on a single box.
           | now you need to SSH into N different hosts and grep the log
           | file from all of them. or you invent your own version of log-
           | shipping with a shell script that does SCP in a loop.
           | 
           | going a step further, you might put those boxes into an auto-
           | scaling group so that they would scale up and down
           | automatically based on demand. now you _really_ want some
           | form of automatic log-shipping, or every time a host in the
            | ASG gets terminated, you're throwing away the logs of
           | whatever traffic it served during its lifetime.
           | 
           | or, maybe you notice a performance regression and narrow it
           | down to one particular API endpoint being slow. often it's
           | helpful to be able to graph the response duration of that
           | endpoint over time. has it been slowing down gradually, or
           | did the response time increase suddenly? if it was a sudden
           | increase, what else happened around the same time? maybe a
           | code deployment, maybe a database configuration change, etc.
           | 
           | perhaps the service you operate isn't standalone, but instead
           | interacts with services written by other teams at your
           | company. when something goes wrong with the system as a
           | whole, how do you go about root-causing the problem? how do
           | you trace the lifecycle of a request or operation through all
           | those different systems?
           | 
           | when something goes wrong, you SSH to the box and look at the
           | log file...but how do you know something went wrong to begin
           | with? do you rely solely on user complaints hitting your
           | support@ email? or do you have monitoring rules that will
           | proactively notify you if a "huh, that should never happen"
           | thing is happening?
        
           | HdS84 wrote:
            | Overall, I think centralized logging and metrics are super
            | valuable, but the stacks are all missing the mark. For
            | example, every damn log message has hundreds of fields, most
            | of which never change. Why not push this information once,
            | on service startup, and not with every log message? OK,
            | obviously the current system produces huge bills, to the
            | benefit of the companies offering these services.
        
             | valyala wrote:
              | > For example, every damn log message has hundreds of
              | fields, most of which never change. Why not push this
              | information once, on service startup, and not with every
              | log message?
              | 
              | If a log field doesn't change with every log entry, then
              | good databases for logs (such as VictoriaLogs) compress
              | such a field by 1000x or more, so its storage space usage
              | can be ignored, and it doesn't affect query performance
              | in any way.
              | 
              | Storing many fields per log entry simplifies further
              | analysis of these logs, since you can get all the needed
              | information from a single log entry instead of jumping
              | across a big number of interconnected logs. It also
              | improves analysis of logs at scale, by letting you filter
              | and group the logs by any subset of the numerous fields.
              | Such logs with a big number of fields are called "wide
              | events". See the following excellent article about this
              | type of logs:
              | https://jeremymorrell.dev/blog/a-practitioners-guide-to-
              | wide... .
        
         | utrack wrote:
          | Jfyi, I'm doing exactly this (and more) in a platform
          | library; it covers the issues I've encountered during the 8+
          | years I've been working with high-load Go apps. During this
          | time, developing/improving the platform and rolling it out
          | has been a hobby of mine at every company :)
          | 
          | It (will) cover things like "sync the logs"/"wait for
          | ingresses to catch up with the liveness handler"/etc.
         | 
         | https://github.com/utrack/caisson-go/blob/main/caiapp/caiapp...
         | 
         | https://github.com/utrack/caisson-go/tree/main/closer
         | 
          | The docs are sparse and some things aren't covered yet;
          | however, I'm planning to do the first release once I'm back
          | from a holiday.
         | 
         | In the end, this will be a meta-platform (carefully crafted
         | building blocks), and a reference platform library, covering a
         | typical k8s/otel/grpc+http infrastructure.
        
           | peterldowns wrote:
            | I'll check this out. I think all of us golang
            | infra/platform people have probably had to write our own
            | similar libraries. Thanks for sharing yours!
        
         | RainyDayTmrw wrote:
          | I never understood why Prometheus and related tools use a
          | "pull" model for data when most things use a "push" model.
        
           | evil-olive wrote:
           | Prometheus doesn't necessarily lock you into the "pull"
           | model, see [0].
           | 
           | however, there are some benefits to the pull model, which is
           | why I think Prometheus does it by default.
           | 
           | with a push model, your service needs to spawn a background
           | thread/goroutine/whatever that pushes metrics on a given
           | interval.
           | 
           | if that background thread crashes or hangs, metrics from that
           | service instance stop getting reported. how do you detect
           | that, and fire an alert about it happening?
           | 
           | "cloud-native" gets thrown around as a buzzword, but this is
           | an example where it's actually meaningful. Prometheus assumes
           | that whatever service you're trying to monitor, you're
           | probably already registering each instance in a service-
           | discovery system of some kind, so that other things (such as
           | a load-balancer) know where to find it.
           | 
           | you tell Prometheus how to query that service-discovery
           | system (Kubernetes, for example [1]) and it will
           | automatically discover all your service instances, and start
           | scraping their /metrics endpoints.
           | 
           | this provides an elegant solution to the "how do you monitor
           | a service that is up and running, except its metrics-
           | reporting thread has crashed?" problem. if it's up and
           | running, it should be registered for service-discovery, and
           | Prometheus can trivially record (this is the `up` metric) if
           | it discovers a service but it's not responding to /metrics
           | requests.
           | 
           | and this greatly simplifies the client-side metrics
           | implementation, because you don't need a separate metrics
           | thread in your service. you don't need to ensure it runs
           | forever and never hangs and always retries and all that. you
           | just need to implement a single HTTP GET endpoint, and have
           | it return text in a format simple enough that you can sprintf
           | it yourself if you need to.
           | 
           | for a more theoretical understanding, you can also look at it
           | in terms of the "supervision trees" popularized by Erlang.
           | parents monitor their children, by pulling status from them.
           | children are not responsible for pushing status reports to
           | their parents (or siblings). with the push model, you have a
           | supervision graph instead of a supervision tree, with all the
           | added complexity that entails.
           | 
           | 0: https://prometheus.io/docs/instrumenting/pushing/
           | 
           | 1: https://prometheus.io/docs/prometheus/latest/configuration
           | /c...
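            | 
            | to make the "sprintf it yourself" point concrete, here's a
            | hand-rolled /metrics endpoint (just a sketch; the metric
            | name is made up):
            | 
            |     package main
            | 
            |     import (
            |         "fmt"
            |         "log"
            |         "net/http"
            |         "sync/atomic"
            |     )
            | 
            |     // incremented by your request handlers
            |     var requests atomic.Int64
            | 
            |     func main() {
            |         http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
            |             // plain text in the Prometheus exposition format
            |             fmt.Fprintln(w, "# TYPE myapp_requests_total counter")
            |             fmt.Fprintf(w, "myapp_requests_total %d\n", requests.Load())
            |         })
            |         log.Fatal(http.ListenAndServe(":8080", nil))
            |     }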
        
             | bbkane wrote:
             | Thanks for writing this out; very insightful!
        
             | raffraffraff wrote:
              | Great answer. I managed metrics systems way back (Cacti,
              | Nagios, Graphite, KairosDB), and one thing that always
              | sucked about push-based metrics was coping with a
              | variable volume of data coming from an uncontrollable
              | number of sources.
             | to solve this through splitting duty across a number of
             | "scrapers" that autodiscover sources. And by placing limits
             | on how much it will scrape from any given metrics source,
             | you can effectively protect the system from overload.
             | Obviously this comes at the expense of dropping metrics
             | from noisy sources, but as the metrics owner I say "too
             | bad, your fault, fix your metrics". Back in the old days
             | you had to accept whatever came in through the fire hose.
        
           | dilyevsky wrote:
            | That's an artifact of Google's original Borgmon design.
            | FWIW, in a "v2" system at Google they tried switching to
            | push-only and it went sideways, so they settled on a sort
            | of hybrid pull-push streaming API.
        
             | PrayagS wrote:
             | Is "v2" based on their paper around Monarch?
        
               | dilyevsky wrote:
               | It is Monarch, yes
        
         | PrayagS wrote:
         | > another factor to consider is that if you have a typical
         | Prometheus `/metrics` endpoint that gets scraped every N
         | seconds, there's a period in between the "final" scrape and the
         | actual process exit where any recorded metrics won't get
         | propagated. this may give you a false impression about whether
         | there are any errors occurring during the shutdown sequence.
         | 
         | Have you come across any convenient solution for this? If my
         | scrape interval is 15 seconds, I don't exactly have 30 seconds
         | to record two scrapes.
         | 
          | This behavior has sort of been the reason why our services
          | still use statsd, since the push-based model doesn't have
          | this problem.
        
       | gchamonlive wrote:
        | This is one of the things I think Elixir is really smart about.
        | I'm not very experienced with it, but it seems to me that
        | designing your application around tiny VM processes that are
        | meant to panic, quit, and get respawned eliminates the need to
        | intentionally create graceful shutdown routines, because this
        | is already embedded in the application architecture.
        
         | cle wrote:
         | How does that eliminate the need for the graceful shutdown the
         | author discusses?
        
           | fredrikholm wrote:
           | In the same way that GC eliminates the need for manual memory
           | management.
           | 
           | Sometimes it's not enough and you have to 'do it by hand',
           | but generally if you're working in a system that has GC,
           | freeing memory is not something that you think of often.
           | 
            | The BEAM is designed for building distributed, fault-
            | tolerant systems in the sense that these types of concerns
            | are first-class citizens, as opposed to living in external
            | libraries (e.g. Kafka) or completely outside the system
            | (e.g. Kubernetes).
            | 
            | The three points the author lists at the beginning of the
            | article are already built in, and their behavior is
            | described rather than implemented, which is what I think OP
            | meant by not having to 'intentionally create graceful
            | shutdown routines'.
        
             | joaohaas wrote:
              | I really don't see how what you're describing has
              | anything to do with the graceful shutdown strategies/tips
              | mentioned in the post.
              | 
              | - Some applications want to terminate instantly upon
              | receiving kill signals; others want to handle them. OP
              | shows how to handle them.
              | 
              | - In the case of HTTP servers, you want to stop listening
              | for new requests but finish handling current ones under a
              | timer. TBF, OP's post actually handles that badly, with a
              | time.Sleep when there's a running connection instead of
              | the sync.WaitGroup most applications would want to use.
              | 
              | - Regardless of whether the application is GC'd or not,
              | you probably still want to manually close connections, so
              | you can handle any possible errors (a lot of connection
              | code flushes data on close).
        
               | fredrikholm wrote:
                | Thread OP's comment was pointing out that in Elixir
                | there is no need to manually implement these
                | strategies, as they already exist within OTP as first-
                | class members of the BEAM.
                | 
                | The blog post author has to hand-roll these, including
                | picking the wrong solution with time.Sleep, as you
                | mentioned.
                | 
                | My analogy with GC was in that spirit: if GC is built
                | in, you don't need custom allocators, memory debuggers,
                | etc. 99% of the time, because you won't be poking
                | around memory the way you would in, say, C. Malloc/free
                | still happens.
               | 
               | Likewise, graceful shutdown, trapping signals, restarting
               | queues, managing restart strategies for subsystems,
               | service monitoring, timeouts, retries, fault recovery,
               | caching, system wide (as in distributed) error handling,
               | system wide debugging, system wide tracing... and so on,
               | are already there on the BEAM.
               | 
               | This is not the case for other runtimes. Instead, to the
               | extent that you can achieve these functionalities from
               | within your runtime at all (without relying on completely
               | external software like Kubernetes, Redis, Datadog etc),
               | you do so by glueing together a tonne of libraries that
               | might or might not gel nicely.
               | 
               | The BEAM is built specifically for the domain "send many
               | small but important messages across the world without
                | falling over", and it shows. They've been incrementally
                | improving it for some ~35 years; there are very few
                | known unknowns left.
        
       | deathanatos wrote:
       | > _After updating the readiness probe to indicate the pod is no
       | longer ready, wait a few seconds to give the system time to stop
       | sending new requests._
       | 
       | > _The exact wait time depends on your readiness probe
       | configuration_
       | 
       | A terminating pod is not ready by definition. The service will
       | also mark the endpoint as terminating (and as not ready). This
       | occurs on the transition into Terminating; you don't have to fail
       | a readiness check to cause it.
       | 
       | (I don't know about the ordering of the SIGTERM & the various
       | updates to the objects such as Pod.status or the endpoint slice;
       | there might be a _small_ window after SIGTERM where you could
        | still get a connection, but it isn't the large "until we fail
        | a readiness check" window TFA implies.)
       | 
        | (And as someone who manages clusters, honestly, that
        | infinitesimal window probably doesn't matter. Just stop
        | accepting new connections, gracefully close existing ones, and
        | terminate reasonably fast. But I feel like half of the apps I
        | work with fall into either a bucket of "handle SIGTERM & take
        | forever to terminate" or "fail to handle SIGTERM (and take
        | forever to terminate)".)
        
       | giancarlostoro wrote:
        | I had a coworker who would always say: if your program cannot
        | cleanly handle Ctrl-C and a few other commands to close it,
        | then it's written poorly.
        
         | amelius wrote:
          | Ctrl-C is reserved for copying to the clipboard... Stopping
          | the program instead is highly counter-intuitive and will
          | result in angry users.
        
           | moooo99 wrote:
           | Have you really never cancelled a program in a terminal
           | session?
        
             | tgv wrote:
              | I think it was a joke. The style (almost pedantically
              | stating an annoyance as fact) does suggest that.
        
               | kevin_thibedeau wrote:
               | Definitely yanking us around.
        
               | moooo99 wrote:
               | Probably was an r/whoosh moment on my part
        
         | danhau wrote:
         | Your coworker is correct.
        
       | zdc1 wrote:
        | I've been bitten by the surprising amount of time it takes for
        | Kubernetes to update load balancer target IPs in some
        | configurations. For me, 90% of the graceful shutdown battle was
        | just ensuring that traffic was actually being drained before
        | pod termination.
       | 
        | Adding a global preStop hook with a 15-second sleep did wonders
        | for our HTTP 503 rates. It creates time between when the load
        | balancer deregistration gets kicked off and when SIGTERM is
        | actually delivered to the application, which in turn simplifies
        | a lot of the application-side handling.
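        | 
        | The hook itself is tiny. An illustrative manifest fragment (not
        | our exact config; the grace period must exceed the sleep plus
        | the app's own shutdown time, or the kubelet will SIGKILL it
        | mid-drain):
        | 
        |     spec:
        |       terminationGracePeriodSeconds: 45
        |       containers:
        |         - name: app
        |           lifecycle:
        |             preStop:
        |               exec:
        |                 command: ["sleep", "15"]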
        
         | LazyMans wrote:
          | We just realized this was a problem too.
        
         | rdsubhas wrote:
          | Yes. A preStop sleep is the magic SLO solution for high-
          | quality rolling deployments.
          | 
          | IMHO, there are two things that Kubernetes could improve on:
          | 
          | 1. Pods should be removed from Endpoints _before_ initiating
          | the shutdown sequence. Like the termination grace period,
          | there should be an option for a termination delay.
          | 
          | 2. PDBs should allow an option for recreation _before_
          | eviction.
        
       | eberkund wrote:
       | I created a small library for handling graceful shutdowns in my
       | projects: https://github.com/eberkund/graceful
       | 
        | I find that I typically have a few services that I need to
        | start up, and sometimes they have different mechanisms for
        | start-up and shutdown. Sometimes you need to instantiate an
        | object first, sometimes you have a context you want to cancel,
        | and other times you have a "Stop" method to call.
        | 
        | I designed the library to help me consolidate all of this in
        | one place with a unified API.
        
         | mariusor wrote:
          | Haha, I had the exact same idea, though my API looks a bit
          | less elegant. Maybe that's because it allows the caller to
          | set up multiple signals to handle, and to choose how each is
          | handled.
         | 
         | https://pkg.go.dev/git.sr.ht/~mariusor/wrapper#example-Regis...
        
         | pseidemann wrote:
         | I did something similar as well:
         | https://github.com/pseidemann/finish
        
       | cientifico wrote:
        | We've adopted Google Wire for some projects at JustWatch, and
        | it's been a game changer. It's surprisingly under the radar,
        | but it helped us eliminate messy shutdown logic in Kubernetes.
        | Wire forces clean dependency injection, so now everything shuts
        | down in order instead of... well, who knows :-D
       | 
       | https://go.dev/blog/wire https://github.com/google/wire
        
       | liampulles wrote:
        | I tend to use a waitgroup-plus-context pattern. Any internal
        | service that needs to wind down for shutdown gets a context it
        | can listen to in a goroutine to start shutting down, and a
        | waitgroup to indicate that it has finished shutting down.
        | 
        | Then the main app goroutine can cancel the context when it
        | wants to shut down, and block on the waitgroup until everything
        | is closed.
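        | 
        | A minimal sketch of what I mean (names are illustrative):
        | 
        |     package main
        | 
        |     import (
        |         "context"
        |         "sync"
        |         "time"
        |     )
        | 
        |     func worker(ctx context.Context, wg *sync.WaitGroup) {
        |         defer wg.Done() // signal "finished shutting down"
        |         for {
        |             select {
        |             case <-ctx.Done():
        |                 // wind down: flush buffers, close conns, etc.
        |                 return
        |             case <-time.After(time.Second):
        |                 // normal periodic work
        |             }
        |         }
        |     }
        | 
        |     func main() {
        |         ctx, cancel := context.WithCancel(context.Background())
        |         var wg sync.WaitGroup
        | 
        |         wg.Add(1)
        |         go worker(ctx, &wg)
        | 
        |         // ... on a shutdown signal:
        |         cancel()  // tell every service to start winding down
        |         wg.Wait() // block until everything has closed
        |     }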
        
         | mariusor wrote:
          | If you look at the article, it presents some additional
          | niceties, like having middleware that is aware of the
          | shutdown, though it doesn't detail exactly how the
          | WithCancellation() function does that.
          | 
          | So if you send a SIGINT/SIGTERM signal to the server, there's
          | a delay to clean up resources, during which new requests are
          | served not a response that tries to access those resources
          | and fails in unexpected ways, but a configurable "not in
          | service" error.
        
       | fpoling wrote:
        | I was hoping the article would describe how to restart the
        | application without dropping a single incoming connection,
        | where a new service instance receives the listening socket from
        | the old instance.
        | 
        | It is relatively straightforward to implement under systemd,
        | and nginx has supported it for over 20 years. Sadly, Kubernetes
        | and Docker have no support for that, assuming instead that it
        | is done in the load balancer or the reverse proxy.
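        | 
        | Under systemd it looks roughly like this (a sketch: it assumes
        | a matching .socket unit and skips the LISTEN_PID check a real
        | implementation should do):
        | 
        |     package main
        | 
        |     import (
        |         "log"
        |         "net"
        |         "net/http"
        |         "os"
        |     )
        | 
        |     func main() {
        |         // systemd passes inherited sockets as fds starting at
        |         // 3 and sets LISTEN_FDS; the socket itself survives
        |         // service restarts, so no connection is ever dropped
        |         if os.Getenv("LISTEN_FDS") == "" {
        |             log.Fatal("expected a listening socket from systemd")
        |         }
        |         ln, err := net.FileListener(os.NewFile(3, "listener"))
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         log.Fatal(http.Serve(ln, nil))
        |     }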
        
         | joaohaas wrote:
         | You're probably looking for Cloudflare's tableflip:
         | https://github.com/cloudflare/tableflip
        
       | gitroom wrote:
        | honestly I always end up wrestling with logs and shutdowns too,
        | nothing ever feels simple. Feels like every setup needs its own
        | pile of band-aids.
        
       | karel-3d wrote:
        | one tiny thing I see quite often: people think that if you do
        | `log.Fatal`, it will still run things in `defer`. It won't!
        | 
        |     package main
        | 
        |     import (
        |         "fmt"
        |         "log"
        |     )
        | 
        |     func main() {
        |         defer fmt.Println("in defer")
        |         log.Fatal("fatal")
        |     }
        | 
        | this just prints "fatal", because log.Fatal calls os.Exit, and
        | that tears everything down immediately, skipping defers.
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     func main() {
        |         defer fmt.Println("in defer")
        |         panic("fatal")
        |     }
        | 
        | This prints both `fatal` and `in defer`.
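        | 
        | a common workaround (just a sketch) is to keep the defers out
        | of the os.Exit path by pushing the real work into a run
        | function:
        | 
        |     package main
        | 
        |     import (
        |         "errors"
        |         "fmt"
        |         "log"
        |     )
        | 
        |     func run() error {
        |         defer fmt.Println("in defer") // fires before log.Fatal
        |         return errors.New("fatal")
        |     }
        | 
        |     func main() {
        |         if err := run(); err != nil {
        |             // os.Exit(1), but run's defers have already fired
        |             log.Fatal(err)
        |         }
        |     }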
        
       | Savageman wrote:
        | I wish it would talk about liveness too. I've seen, several
        | times, apps that use the same endpoint for liveness and
        | readiness, and it feels wrong.
        
       ___________________________________________________________________
       (page generated 2025-05-05 23:02 UTC)