[HN Gopher] Graceful Shutdown in Go: Practical Patterns
___________________________________________________________________
Graceful Shutdown in Go: Practical Patterns
Author : mkl95
Score : 222 points
Date : 2025-05-04 21:09 UTC (1 day ago)
(HTM) web link (victoriametrics.com)
(TXT) w3m dump (victoriametrics.com)
| wbl wrote:
| If a distributed system relies on clients gracefully exiting in
| order to work, the system will eventually break badly.
| smcleod wrote:
| Way back when, in physical land - I used STONITH for that!
| https://smcleod.net/2015/07/delayed-serial-stonith/
| XorNot wrote:
| There are valid reasons to want a typical exit not to look like
| a catastrophic one, even when that's a recoverable situation.
|
| That my application went down from SIGINT makes a big
| difference compared to SIGKILL.
|
| Blue-Green migrations for example require a graceful exit
| behavior.
| shoo wrote:
| > Blue-Green migrations for example require a graceful exit
| behavior.
|
| it may not always be necessary. e.g. if you are deploying a
| new version of a stateless backend service, and there is a
| load balancer forwarding traffic to current version and new
| version backends, the load balancer could be responsible for
| cutting over, allowing in flight requests to be processed by
| the current version backends while only forwarding new
| requests to the new backends. then the old backends could be
| ungracefully terminated once the LB says they are not
| processing any requests.
| ikiris wrote:
| There's a big gap between graceful shutdown to be nice to
| clients / workflows, and clients relying on it to work.
| Thaxll wrote:
| No one said that.
| Rhapso wrote:
| And I believe that so much that I don't even consider graceful
| shutdown in design. Components should be able to safely (and
| even frequently) hard-crash, and so long as a critical
| percentage of the system is WAI (working as intended), it
| shouldn't meaningfully impact the overall system.
|
| The only way to make sure a system can handle components hard
| crashing, is if hard crashing is a normal thing that happens
| all the time.
|
| All glory to the chaos monkey!
| eknkc wrote:
| Yeah. However, I do not need to pull the plug to shut things
| down even if the software was designed to tolerate that.
|
| On second thought though, maybe I do. That might be the only
| way to ensure the assumption is true. Like Netflix's chaos
| monkey thing from a couple of years ago.
| icedchai wrote:
| _Relying_ on graceful exit and _supporting_ it are two
| different things. You want to _support_ it so you can stop
| serving clients without giving them nasty 5xx errors.
| evil-olive wrote:
| another factor to consider is that if you have a typical
| Prometheus `/metrics` endpoint that gets scraped every N seconds,
| there's a period in between the "final" scrape and the actual
| process exit where any recorded metrics won't get propagated.
| this may give you a false impression about whether there are any
| errors occurring during the shutdown sequence.
|
| it's also possible, if you're not careful, to lose the last few
| seconds of logs from when your service is shutting down. for
| example, if you write to a log file that is watched by a sidecar
| process such as Promtail or Vector, and on startup the service
| truncates and starts writing to that same path, you've got a race
| condition that can cause you to lose logs from the shutdown.
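|
| a minimal sketch of the append-instead-of-truncate approach, plus
| a best-effort sync on the way out (the file path is just an
| example, and the sidecar is assumed to tail the same path):
|
|     package main
|
|     import (
|         "log"
|         "os"
|     )
|
|     func main() {
|         // open with O_APPEND instead of O_TRUNC so a restart
|         // never clobbers lines the sidecar hasn't shipped yet.
|         f, err := os.OpenFile("/var/log/app/service.log",
|             os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer f.Close()
|
|         logger := log.New(f, "", log.LstdFlags)
|         logger.Println("starting up")
|
|         // ... run the service ...
|
|         // best effort: flush to disk before exiting so the tail
|         // of the shutdown sequence isn't lost.
|         logger.Println("shutting down")
|         _ = f.Sync()
|     }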
| tmpz22 wrote:
| Is it me or are observability stacks kind of ridiculous. Logs,
| metrics, and traces, each with their own databases, sidecars,
| visualization stacks. Language-specific integration libraries
| written by whoever felt like it. MASSIVE cloud bills.
|
| Then after you go through all that effort, most of the data is
| utterly ignored, and rarely are the business insights much
| better than the trailer-park version of ssh'ing into a box and
| grepping a log file to find the error output.
|
| Like we put so much effort into this ecosystem but I don't
| think it has paid us back with any significant increase in
| uptime, performance, or ergonomics.
| nkraft11 wrote:
| I can say that, going from a place that had all of that
| observability tooling set up to one that was at the "ssh'ing
| into a box and grepping a log" stage, you best believe I
| missed company A immensely. Even knowing which box to ssh
| into, which log file to grep, and which magic words to search
| for was nigh impossible if you weren't the dev that set up
| the machine and wrote the bug in the first place.
| MortyWaves wrote:
| I completely agree with you but I also think, like many
| aspects of "tech" certain segments of it have been
| monopolised and turned into profit generators for certain
| organisations. DevOps, Agile/Scrum, Observability,
| Kubernetes, are all examples of this.
|
| This dilutes the good and helpful stuff with marketing
| bullshit.
|
| Grafana seemingly inventing new time series databases and
| engines every few months is absolutely painful to try to keep
| up to date with in order to make informed decisions.
|
| So much so I've started using rrdtool/smokeping again.
| bbkane wrote:
| You might look into https://openobserve.ai/ - you can
| self host it and it's a single binary that ingests
| logs/metrics/traces. I've found it useful for my side
| projects.
| 01HNNWZ0MV43FF wrote:
| Programs are for people. That's why we got JSON, a bunch of
| debuggers, Python, and so on. Programming is only like 10
| percent of programming
| evil-olive wrote:
| if you're working on a system simple enough that "SSH to the
| box and grep the log file" works, then by all means have at
| it.
|
| but many systems are more complicated than that. the
| observability ecosystem exists for a reason, there is a real
| problem that it's solving.
|
| for example, your app might outgrow running on a single box.
| now you need to SSH into N different hosts and grep the log
| file from all of them. or you invent your own version of log-
| shipping with a shell script that does SCP in a loop.
|
| going a step further, you might put those boxes into an auto-
| scaling group so that they would scale up and down
| automatically based on demand. now you _really_ want some
| form of automatic log-shipping, or every time a host in the
| ASG gets terminated, you're throwing away the logs of
| whatever traffic it served during its lifetime.
|
| or, maybe you notice a performance regression and narrow it
| down to one particular API endpoint being slow. often it's
| helpful to be able to graph the response duration of that
| endpoint over time. has it been slowing down gradually, or
| did the response time increase suddenly? if it was a sudden
| increase, what else happened around the same time? maybe a
| code deployment, maybe a database configuration change, etc.
|
| perhaps the service you operate isn't standalone, but instead
| interacts with services written by other teams at your
| company. when something goes wrong with the system as a
| whole, how do you go about root-causing the problem? how do
| you trace the lifecycle of a request or operation through all
| those different systems?
|
| when something goes wrong, you SSH to the box and look at the
| log file...but how do you know something went wrong to begin
| with? do you rely solely on user complaints hitting your
| support@ email? or do you have monitoring rules that will
| proactively notify you if a "huh, that should never happen"
| thing is happening?
| HdS84 wrote:
| Overall, I think centralized logging and metrics are super
| valuable. But the stacks are all missing the mark. For example,
| every damn log message has hundreds of fields, most of which
| never change. Why not push this information once, on service
| startup, and not with every log message? OK, obviously the
| current system generates huge bills to the benefit of the
| companies offering these services.
| valyala wrote:
| > For example, every damn log message has hundreds of fields,
| most of which never change. Why not push this information
| once, on service startup, and not with every log message?
|
| If the log field doesn't change with every log entry, then
| good databases for logs (such as VictoriaLogs) compress
| such a field by 1000x or more, so its storage space
| usage can be ignored, and it doesn't affect query
| performance in any way.
|
| Storing many fields per log entry simplifies further
| analysis of these logs, since you can get all the needed
| information from a single log entry instead of jumping across
| a big number of interconnected logs. This also improves
| analysis of logs at scale by filtering and grouping the
| logs by any subset of numerous fields. Such logs with a big
| number of fields are called "wide events". See the following
| excellent article about this type of logging -
| https://jeremymorrell.dev/blog/a-practitioners-guide-to-
| wide... .
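|
| A rough sketch of what a single "wide event" can look like with
| Go's standard log/slog package (the field names here are purely
| illustrative):
|
|     package main
|
|     import (
|         "log/slog"
|         "os"
|         "time"
|     )
|
|     func main() {
|         logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
|
|         // One self-contained entry per request; static fields
|         // repeat on every entry and are left to the log
|         // database to compress away.
|         logger.Info("request handled",
|             slog.String("service", "checkout"),
|             slog.String("version", "v1.42.0"),
|             slog.String("region", "eu-west-1"),
|             slog.String("http.method", "POST"),
|             slog.String("http.path", "/api/orders"),
|             slog.Int("http.status", 200),
|             slog.Duration("duration", 83*time.Millisecond),
|             slog.String("user.id", "u_123"),
|             slog.String("trace.id", "4bf92f3577b34da6"),
|         )
|     }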
| utrack wrote:
| Jfyi, I'm doing exactly this (and more) in a platform library;
| it covers the issues I've encountered during the last 8+ years
| I've been working with high-load Go apps. During this time,
| developing/improving the platform and rolling it out was a
| hobby of mine at every company :)
|
| It (will) cover things like "sync the logs"/"wait for
| ingresses to catch up with the liveness handler"/etc.
|
| https://github.com/utrack/caisson-go/blob/main/caiapp/caiapp...
|
| https://github.com/utrack/caisson-go/tree/main/closer
|
| The docs are sparse and some things aren't covered yet; however
| I'm planning to do the first release once I'm back from a
| holiday.
|
| In the end, this will be a meta-platform (carefully crafted
| building blocks), and a reference platform library, covering a
| typical k8s/otel/grpc+http infrastructure.
| peterldowns wrote:
| I'll check this out, thanks for sharing. I think all of us
| golang infra/platform people probably have had to write our
| own similar libraries. Thanks for sharing yours!
| RainyDayTmrw wrote:
| I never understood why Prometheus and related tools use a "pull"
| model for data, when most things use a "push" model.
| evil-olive wrote:
| Prometheus doesn't necessarily lock you into the "pull"
| model, see [0].
|
| however, there are some benefits to the pull model, which is
| why I think Prometheus does it by default.
|
| with a push model, your service needs to spawn a background
| thread/goroutine/whatever that pushes metrics on a given
| interval.
|
| if that background thread crashes or hangs, metrics from that
| service instance stop getting reported. how do you detect
| that, and fire an alert about it happening?
|
| "cloud-native" gets thrown around as a buzzword, but this is
| an example where it's actually meaningful. Prometheus assumes
| that whatever service you're trying to monitor, you're
| probably already registering each instance in a service-
| discovery system of some kind, so that other things (such as
| a load-balancer) know where to find it.
|
| you tell Prometheus how to query that service-discovery
| system (Kubernetes, for example [1]) and it will
| automatically discover all your service instances, and start
| scraping their /metrics endpoints.
|
| this provides an elegant solution to the "how do you monitor
| a service that is up and running, except its metrics-
| reporting thread has crashed?" problem. if it's up and
| running, it should be registered for service-discovery, and
| Prometheus can trivially record (this is the `up` metric) if
| it discovers a service but it's not responding to /metrics
| requests.
|
| and this greatly simplifies the client-side metrics
| implementation, because you don't need a separate metrics
| thread in your service. you don't need to ensure it runs
| forever and never hangs and always retries and all that. you
| just need to implement a single HTTP GET endpoint, and have
| it return text in a format simple enough that you can sprintf
| it yourself if you need to.
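|
| a minimal sketch of that single GET endpoint, assuming the
| official github.com/prometheus/client_golang library:
|
|     package main
|
|     import (
|         "net/http"
|
|         "github.com/prometheus/client_golang/prometheus"
|         "github.com/prometheus/client_golang/prometheus/promhttp"
|     )
|
|     var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
|         Name: "myapp_requests_total",
|         Help: "Total requests handled.",
|     })
|
|     func main() {
|         prometheus.MustRegister(requestsTotal)
|
|         http.HandleFunc("/",
|             func(w http.ResponseWriter, r *http.Request) {
|                 requestsTotal.Inc()
|                 w.Write([]byte("ok"))
|             })
|
|         // no background reporting thread: Prometheus scrapes
|         // this endpoint on its own schedule.
|         http.Handle("/metrics", promhttp.Handler())
|         http.ListenAndServe(":8080", nil)
|     }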
|
| for a more theoretical understanding, you can also look at it
| in terms of the "supervision trees" popularized by Erlang.
| parents monitor their children, by pulling status from them.
| children are not responsible for pushing status reports to
| their parents (or siblings). with the push model, you have a
| supervision graph instead of a supervision tree, with all the
| added complexity that entails.
|
| 0: https://prometheus.io/docs/instrumenting/pushing/
|
| 1: https://prometheus.io/docs/prometheus/latest/configuration
| /c...
| bbkane wrote:
| Thanks for writing this out; very insightful!
| raffraffraff wrote:
| Great answer. I managed metrics systems way back (cacti,
| nagios, graphite, kairosdb) and one thing that always
| sucked about push based metrics was coping with variable
| volume of data coming from an uncontrollable number of
| sources. Scaling was a massive headache. "Scraping" helps
| to solve this through splitting duty across a number of
| "scrapers" that autodiscover sources. And by placing limits
| on how much it will scrape from any given metrics source,
| you can effectively protect the system from overload.
| Obviously this comes at the expense of dropping metrics
| from noisy sources, but as the metrics owner I say "too
| bad, your fault, fix your metrics". Back in the old days
| you had to accept whatever came in through the fire hose.
| dilyevsky wrote:
| That's an artifact of Google's original Borgmon design.
| FWIW, in a "v2" system at Google they tried switching to
| push-only and it went sideways, so they settled on a sort of
| hybrid pull-push streaming API.
| PrayagS wrote:
| Is "v2" based on their paper around Monarch?
| dilyevsky wrote:
| It is Monarch, yes
| PrayagS wrote:
| > another factor to consider is that if you have a typical
| Prometheus `/metrics` endpoint that gets scraped every N
| seconds, there's a period in between the "final" scrape and the
| actual process exit where any recorded metrics won't get
| propagated. this may give you a false impression about whether
| there are any errors occurring during the shutdown sequence.
|
| Have you come across any convenient solution for this? If my
| scrape interval is 15 seconds, I don't exactly have 30 seconds
| to record two scrapes.
|
| This behavior has sort of been the reason why our services
| still use statsd, since the push-based model doesn't have this
| problem.
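|
| The closest workaround I've considered is a one-off push to a
| Pushgateway from the shutdown path, via client_golang's push
| package; a rough sketch (the URL and job name are placeholders):
|
|     package main
|
|     import (
|         "log"
|
|         "github.com/prometheus/client_golang/prometheus"
|         "github.com/prometheus/client_golang/prometheus/push"
|     )
|
|     // pushFinalMetrics runs at the end of the shutdown sequence,
|     // after the last unit of work, so the final counters survive
|     // even though no scrape will ever see them.
|     func pushFinalMetrics() {
|         err := push.New("http://pushgateway:9091", "myapp_shutdown").
|             Gatherer(prometheus.DefaultGatherer).
|             Push()
|         if err != nil {
|             log.Printf("final metrics push failed: %v", err)
|         }
|     }
|
|     func main() {
|         // ... serving and signal handling elided ...
|         pushFinalMetrics()
|     }
|
| But that reintroduces push-style plumbing just for the exit path,
| so statsd has felt simpler overall.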
| gchamonlive wrote:
| This is one of the things I think Elixir handles really
| smartly. I'm not very experienced in it, but it seems to me that
| designing your application around tiny VM processes that are
| meant to panic, quit and get respawned eliminates the need to
| intentionally create graceful shutdown routines, because
| this is already embedded in the application architecture.
| cle wrote:
| How does that eliminate the need for the graceful shutdown the
| author discusses?
| fredrikholm wrote:
| In the same way that GC eliminates the need for manual memory
| management.
|
| Sometimes it's not enough and you have to 'do it by hand',
| but generally if you're working in a system that has GC,
| freeing memory is not something that you think of often.
|
| The BEAM is designed for building distributed, fault-tolerant
| systems in the sense that these types of concerns are first-
| class objects, as compared to having them as external
| libraries (e.g. Kafka) or completely outside of the system
| (e.g. Kubernetes).
|
| The three points the author lists at the beginning of the
| article are already built in and their behavior is described
| rather than implemented, which is what I think OP meant by
| not having to 'intentionally create graceful shutdown
| routines'.
| joaohaas wrote:
| I really don't see how what you are describing has anything
| to do with the graceful shutdown strategies/tips mentioned
| in the post.
|
| - Some applications want to instantly terminate upon
| receiving kill signals; others want to handle them, and OP
| shows how to handle them
|
| - In the case of HTTP servers, you want to stop listening
| for new requests, but finish handling current ones under a
| timer. TBF, OP's post actually handles that badly, with a
| time.Sleep when there's a running connection instead of
| using a sync.WaitGroup like most applications would want to
| do (see the stdlib sketch after this list)
|
| - Regardless of whether the application is GC'd or not, you
| probably still want to manually close connections, so you
| can handle any possible errors (a lot of connection code
| flushes data on close)
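|
| For reference, a minimal sketch of the stdlib approach, where
| http.Server.Shutdown stops accepting new connections and drains
| in-flight requests under a timeout (the 10s value is arbitrary):
|
|     package main
|
|     import (
|         "context"
|         "errors"
|         "log"
|         "net/http"
|         "os/signal"
|         "syscall"
|         "time"
|     )
|
|     func main() {
|         srv := &http.Server{Addr: ":8080"}
|
|         go func() {
|             err := srv.ListenAndServe()
|             if err != nil && !errors.Is(err, http.ErrServerClosed) {
|                 log.Fatalf("listen: %v", err)
|             }
|         }()
|
|         // block until SIGINT/SIGTERM
|         ctx, stop := signal.NotifyContext(context.Background(),
|             syscall.SIGINT, syscall.SIGTERM)
|         defer stop()
|         <-ctx.Done()
|
|         // stop accepting new connections, then wait up to 10s
|         // for in-flight requests to finish
|         shutdownCtx, cancel := context.WithTimeout(
|             context.Background(), 10*time.Second)
|         defer cancel()
|         if err := srv.Shutdown(shutdownCtx); err != nil {
|             log.Printf("forced shutdown: %v", err)
|         }
|     }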
| fredrikholm wrote:
| Thread OP's comment was pointing out that in Elixir there
| is no need to manually implement these strategies, as they
| already exist within OTP as first-class members on the
| BEAM.
|
| The blog post author has to hand-roll these, including
| picking the wrong solution with time.Sleep, as you
| mentioned.
|
| My analogy with GC was in that spirit; if GC is built in,
| you don't need custom allocators, memory debuggers etc
| 99% of the time because you won't be poking around memory
| the same way that you would in say C. Malloc/free still
| happens.
|
| Likewise, graceful shutdown, trapping signals, restarting
| queues, managing restart strategies for subsystems,
| service monitoring, timeouts, retries, fault recovery,
| caching, system wide (as in distributed) error handling,
| system wide debugging, system wide tracing... and so on,
| are already there on the BEAM.
|
| This is not the case for other runtimes. Instead, to the
| extent that you can achieve these functionalities from
| within your runtime at all (without relying on completely
| external software like Kubernetes, Redis, Datadog etc),
| you do so by glueing together a tonne of libraries that
| might or might not gel nicely.
|
| The BEAM is built specifically for the domain "send many
| small but important messages across the world without
| falling over", and it shows. They've been incrementally
| improving it for some ~35 years, there's very few known
| unknowns left.
| deathanatos wrote:
| > _After updating the readiness probe to indicate the pod is no
| longer ready, wait a few seconds to give the system time to stop
| sending new requests._
|
| > _The exact wait time depends on your readiness probe
| configuration_
|
| A terminating pod is not ready by definition. The service will
| also mark the endpoint as terminating (and as not ready). This
| occurs on the transition into Terminating; you don't have to fail
| a readiness check to cause it.
|
| (I don't know about the ordering of the SIGTERM & the various
| updates to the objects such as Pod.status or the endpoint slice;
| there might be a _small_ window after SIGTERM where you could
| still get a connection, but it isn't the large "until we fail a
| readiness check" TFA implies.)
|
| (And as someone who manages clusters, honestly that infinitesimal
| window probably doesn't matter. Just stop accepting new
| connections, gracefully close existing ones, and terminate
| reasonably fast. But I feel like half of the apps I work with
| fall into either a bucket of "handle SIGTERM & take forever to
| terminate" or "fail to handle SIGTERM (and take forever to
| terminate)".)
| giancarlostoro wrote:
| I had a coworker who would always say: if your program cannot
| cleanly handle Ctrl-C and a few other commands to close it, then
| it's written poorly.
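|
| In Go terms, a minimal sketch of what that usually boils down to
| (Ctrl-C arrives as SIGINT):
|
|     package main
|
|     import (
|         "fmt"
|         "os"
|         "os/signal"
|         "syscall"
|     )
|
|     func main() {
|         sigs := make(chan os.Signal, 1)
|         signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
|
|         fmt.Println("running; press Ctrl-C to stop")
|         sig := <-sigs // blocks until Ctrl-C (SIGINT) or SIGTERM
|
|         // close files, flush buffers, stop accepting work, etc.
|         fmt.Printf("got %s, shutting down cleanly\n", sig)
|     }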
| amelius wrote:
| Ctrl-C is reserved for copy into the clipboard ... Stopping the
| program instead is highly counter-intuitive and will result in
| angry users.
| moooo99 wrote:
| Have you really never cancelled a program in a terminal
| session?
| tgv wrote:
| I think it was a joke. The style, clearly, almost
| pedantically stating an annoyance as fact, does suggest
| that.
| kevin_thibedeau wrote:
| Definitely yanking us around.
| moooo99 wrote:
| Probably was an r/whoosh moment on my part
| danhau wrote:
| Your coworker is correct.
| zdc1 wrote:
| I've been bitten by the surprising amount of time it takes for
| Kubernetes to update load balancer target IPs in some
| configurations. For me, 90% of the graceful shutdown battle was
| just ensuring that traffic was actually being drained before pod
| termination.
|
| Adding a global preStop hook with a 15-second sleep did wonders
| for our HTTP 503 rates. This creates time between when the
| load balancer deregistration gets kicked off and when SIGTERM is
| actually passed to the application, which in turn simplifies a
| lot of the application-side handling.
| LazyMans wrote:
| We just realized this was a problem too
| rdsubhas wrote:
| Yes. A preStop sleep is the magic SLO solution for high-quality
| rolling deployments.
|
| IMHO, there are two things that Kubernetes could improve on:
|
| 1. Pods should be removed from Endpoints _before_ initiating
| the shutdown sequence. Like the termination grace period,
| there should be an option for a termination delay.
|
| 2. PDB should allow an option for recreation _before_
| eviction.
| eberkund wrote:
| I created a small library for handling graceful shutdowns in my
| projects: https://github.com/eberkund/graceful
|
| I find that I typically have a few services that I need to start
| up, and sometimes they have different mechanisms for start-up and
| shutdown. Sometimes you need to instantiate an object first,
| sometimes you have a context you want to cancel, other times you
| have a "Stop" method to call.
|
| I designed the library to help me consolidate this all in one
| place with a unified API.
| mariusor wrote:
| Haha, I had the exact same idea, though my API looks a bit less
| elegant. Maybe it's because it allows the caller to set up
| multiple signals to handle, and in which way to handle them.
|
| https://pkg.go.dev/git.sr.ht/~mariusor/wrapper#example-Regis...
| pseidemann wrote:
| I did something similar as well:
| https://github.com/pseidemann/finish
| cientifico wrote:
| We've adopted Google Wire for some projects at JustWatch, and
| it's been a game changer. It's surprisingly under the radar, but
| it helped us eliminate messy shutdown logic in Kubernetes. Wire
| forces clean dependency injection, so now everything shuts down
| in order instead... well who knows :-D
|
| https://go.dev/blog/wire https://github.com/google/wire
| liampulles wrote:
| I tend to use a waitgroup plus context pattern. Any internal
| service which needs to wind down for shutdown gets a context
| which it can listen to in a goroutine to start shutting down, and
| a waitgroup to indicate that it is finished shutting down.
|
| Then the main app goroutine can cancel the context when it wants
| to shut down, and block on the waitgroup until everything is
| closed.
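|
| A minimal sketch of that pattern (the service names are made up):
|
|     package main
|
|     import (
|         "context"
|         "fmt"
|         "sync"
|         "time"
|     )
|
|     func worker(ctx context.Context, wg *sync.WaitGroup, name string) {
|         defer wg.Done()
|         <-ctx.Done() // shutdown signal from main
|         fmt.Println(name, "cleaning up")
|         time.Sleep(100 * time.Millisecond) // simulated teardown
|         fmt.Println(name, "done")
|     }
|
|     func main() {
|         ctx, cancel := context.WithCancel(context.Background())
|         var wg sync.WaitGroup
|
|         for _, name := range []string{"http", "consumer", "cache"} {
|             wg.Add(1)
|             go worker(ctx, &wg, name)
|         }
|
|         // ... on SIGINT/SIGTERM or another shutdown trigger:
|         cancel()  // tell every service to start shutting down
|         wg.Wait() // block until they all report finished
|         fmt.Println("all services stopped")
|     }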
| mariusor wrote:
| If you look at the article, it presents some additional
| niceties, like having middleware that is aware of the shutdown
| - though they didn't detail exactly how the WithCancellation()
| function is doing that.
|
| So if you send a SIGINT/SIGTERM signal to the server, there's a
| delay to clean up resources, during which new requests get
| served a configurable "not in service" error instead of a
| response that tries to access those resources and fails in
| unexpected ways.
| fpoling wrote:
| I was hoping the article would describe how to perform an
| application restart without dropping a single incoming
| connection when a new service instance receives the listening
| socket from the old instance.
|
| It is relatively straightforward to implement under systemd, and
| nginx has supported it for over 20 years. Sadly, Kubernetes and
| Docker have no support for that, assuming it is done in the load
| balancer or the reverse proxy.
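|
| For reference, the Go side of the systemd case is mostly just
| inheriting fd 3; a bare-bones sketch that skips the
| LISTEN_PID/LISTEN_FDS checks a real implementation (or the
| coreos/go-systemd activation package) would do:
|
|     package main
|
|     import (
|         "net"
|         "net/http"
|         "os"
|     )
|
|     func main() {
|         // with systemd socket activation, the listening socket
|         // is passed as fd 3 (SD_LISTEN_FDS_START); the socket
|         // itself stays open across service restarts, so no
|         // connection is refused while the new instance starts.
|         f := os.NewFile(3, "systemd-socket")
|         ln, err := net.FileListener(f)
|         if err != nil {
|             panic(err)
|         }
|         handler := http.HandlerFunc(
|             func(w http.ResponseWriter, r *http.Request) {
|                 w.Write([]byte("hello\n"))
|             })
|         http.Serve(ln, handler)
|     }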
| joaohaas wrote:
| You're probably looking for Cloudflare's tableflip:
| https://github.com/cloudflare/tableflip
| gitroom wrote:
| honestly I always end up wrestling with logs and shutdowns too,
| nothing ever feels simple - feels like every setup needs its own
| pile of band-aids
| karel-3d wrote:
| one tiny thing I see quite often: people think that if you do
| `log.Fatal`, it will still run things in `defer`. It won't!
|
|     package main
|
|     import (
|         "fmt"
|         "log"
|     )
|
|     func main() {
|         defer fmt.Println("in defer")
|         log.Fatal("fatal")
|     }
|
| this just runs "fatal", because log.Fatal calls os.Exit, and
| that exits the process immediately, skipping deferred calls.
|
|     package main
|
|     import (
|         "fmt"
|     )
|
|     func main() {
|         defer fmt.Println("in defer")
|         panic("fatal")
|     }
|
| This one prints both "fatal" and "in defer", because a panic
| unwinds the stack and runs the defers before exiting.
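|
| and if you still want log.Fatal semantics, the usual trick is to
| keep os.Exit out of the function that owns the defers, e.g. a
| run() that returns an error:
|
|     package main
|
|     import (
|         "errors"
|         "fmt"
|         "log"
|     )
|
|     func run() error {
|         defer fmt.Println("in defer") // runs: run() returns normally
|         return errors.New("fatal")
|     }
|
|     func main() {
|         if err := run(); err != nil {
|             log.Fatal(err) // os.Exit happens here, after the defers
|         }
|     }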
| Savageman wrote:
| I wish it would talk about liveness too. I've seen, several
| times, apps that use the same endpoint for liveness/readiness,
| but it feels wrong.
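|
| A minimal sketch of keeping them separate, with readiness flipped
| off during shutdown while liveness stays green (the paths and port
| are just conventions; the signal handler that calls
| ready.Store(false) is elided):
|
|     package main
|
|     import (
|         "net/http"
|         "sync/atomic"
|     )
|
|     var ready atomic.Bool
|
|     func main() {
|         ready.Store(true)
|
|         // liveness: "the process is up", never tied to shutdown
|         http.HandleFunc("/livez",
|             func(w http.ResponseWriter, r *http.Request) {
|                 w.WriteHeader(http.StatusOK)
|             })
|
|         // readiness: "send me traffic"; set to false on SIGTERM
|         // so the pod is drained before the real shutdown starts
|         http.HandleFunc("/readyz",
|             func(w http.ResponseWriter, r *http.Request) {
|                 if !ready.Load() {
|                     w.WriteHeader(http.StatusServiceUnavailable)
|                     return
|                 }
|                 w.WriteHeader(http.StatusOK)
|             })
|
|         http.ListenAndServe(":8080", nil)
|     }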
___________________________________________________________________
(page generated 2025-05-05 23:02 UTC)