[HN Gopher] Nginx gracefully upgrades executable on the fly
___________________________________________________________________
Nginx gracefully upgrades executable on the fly
Author : pantuza
Score : 119 points
Date : 2022-01-04 17:40 UTC (5 hours ago)
(HTM) web link (nginx.org)
(TXT) w3m dump (nginx.org)
| bragr wrote:
| I've implemented this a few times in a few languages based on
| exactly what nginx does. It works well, and it is pretty
| straightforward if you are comfortable with POSIX-style signals,
| sockets, and daemons.
|
| I'm not sure it is super critical in the age of containerized
| workloads with rolling deploys, but at the very least the
| connection draining is a good pattern to implement to prevent
| deploy/scaling-related error spikes.
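|
| For reference, the draining part by itself is tiny. A rough POSIX
| sketch (not nginx's code; the port and request handling are
| placeholders):
|
|       /* On SIGTERM: stop accepting, finish in-flight work, exit. */
|       #include <signal.h>
|       #include <string.h>
|       #include <unistd.h>
|       #include <arpa/inet.h>
|       #include <netinet/in.h>
|       #include <sys/socket.h>
|
|       static volatile sig_atomic_t draining = 0;
|       static void on_term(int sig) { (void)sig; draining = 1; }
|
|       int main(void) {
|           struct sigaction sa;
|           memset(&sa, 0, sizeof sa);   /* no SA_RESTART: accept() gets EINTR */
|           sa.sa_handler = on_term;
|           sigaction(SIGTERM, &sa, NULL);
|
|           int lfd = socket(AF_INET, SOCK_STREAM, 0);
|           int one = 1;
|           setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
|           struct sockaddr_in a = {0};
|           a.sin_family = AF_INET;
|           a.sin_port = htons(8080);    /* made-up port */
|           bind(lfd, (struct sockaddr *)&a, sizeof a);
|           listen(lfd, 128);
|
|           while (!draining) {
|               int cfd = accept(lfd, NULL, NULL);
|               if (cfd < 0) continue;   /* interrupted by the signal */
|               /* handle_request(cfd); placeholder */
|               close(cfd);
|           }
|           close(lfd);  /* refuse new connections; drain what's left */
|           return 0;
|       }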
| wereHamster wrote:
| Even with containerized workloads, you still have an ingress,
| which is a SPOF (or multiple, when using multicast), and the
| seamless restart is meant for exactly those processes. Nginx is
| often used for this (https://kubernetes.github.io/ingress-nginx/),
| or when you use AWS, GCS, etc., they provide such a service for
| you.
|
| Not sure how the cloud providers do it though, maybe a
| combination of low DNS TTLs and rolling restarts, since they
| often have huge fleets of servers handling ingress?
| nullify88 wrote:
| A container though should be immutable and ideally shouldn't
| have changes made to it. If the container were to die, it'd
| revert to the old version? It looks to me like these
| seamless upgrades would be an anti-pattern for containers.
|
| With ingress you'd have a load balancer in front or have it
| routed in the network layer using BGP.
| wereHamster wrote:
| How do you restart the load balancer though, without
| dropping traffic?
| cbb330 wrote:
| two nginx load balancers, reroute to the secondary via
| dns, restart primary
| nullify88 wrote:
| You would need more than one to do a rolling restart.
| Alternatively, doing it with one instance of a software load
| balancer is a bit more work: spin another instance up and update
| DNS, wait for traffic to the old one to die off as TTLs expire,
| then decommission it.
|
| But I agree it isn't as easy as an in-place upgrade.
| krab wrote:
| I think the parent was talking more about the fact that at
| some point, you have a component that should be available
| as much as possible. In the case you mention, that would be
| the load balancer. Being able to upgrade it in place might
| be easier than other ways.
| Grollicus wrote:
| If you really want to have no SPOF you'd probably build
| something like this:
|
| Multihomed IP <-> Loadbalancer <-> Application
|
| By having the same setup running in multiple locations you
| can replace the load balancers by taking one location offline
| (stop announcing the corresponding route). Application
| instances can be replaced by taking the application instance
| out of the load balancer.
| nullify88 wrote:
| The SysV init script for nginx had an upgrade operation (in
| addition to start/stop/reload, etc.) which would send the signal.
| Worked like a charm.
| moderation wrote:
| See Envoy Proxy's Hot Restart [0]
|
| 0.
| https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overv...
| monroeclinton wrote:
| Also HAProxy; they both pass sockets over UNIX sockets via
| ancillary messages (SCM_RIGHTS), I believe.
|
| https://www.haproxy.com/blog/truly-seamless-reloads-with-hap...
| jabedude wrote:
| Seems like a useful feature for a service manager like systemd to
| have for its managed services. It is already able to perform
| inetd-style socket activation; I imagine this would be a welcome
| feature.
| jasonjayr wrote:
| inetd style socket activation (iirc) forks a process for every
| connection.
|
| So, simply replacing the binary on disk will cause all new
| connections going forward to use the new binary, while existing
| held connections (with in-memory references to the old binary's
| inode) will finish their operations. Once they are done and all
| references to that inode are gone, the old binary's blocks are
| freed.
| wahern wrote:
| > inetd style socket activation (iirc) forks a process for
| every connection.
|
| inetd supports both process-per-connection and single
| process/multiple connections using the "nowait" and "wait"
| declarations, respectively. The former passes an accept'd
| socket, the latter passes the listening socket.
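|
| To illustrate the difference, a "wait"-style service is handed the
| listening socket itself as fd 0 and does its own accept() loop;
| roughly (a sketch, not tied to any particular inetd.conf):
|
|       /* "wait" mode: fd 0 is the listening socket, we accept().
|        * In "nowait" mode fd 0 would already be the accepted
|        * client socket. */
|       #include <stdio.h>
|       #include <unistd.h>
|       #include <sys/socket.h>
|
|       int main(void) {
|           for (;;) {
|               int cfd = accept(0, NULL, NULL);
|               if (cfd < 0)
|                   break;
|               dprintf(cfd, "hello from a wait-style service\n");
|               close(cfd);
|           }
|           return 0;
|       }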
| TimWolla wrote:
| This is already possible. You can configure whether you want
| inetd style socket activation (where systemd calls accept() and
| passes you the client socket), or just systemd listening on
| the socket (where systemd passes you the listen socket and your
| binary calls accept()).
|
| https://www.freedesktop.org/software/systemd/man/systemd.soc...
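|
| The non-inetd mode is just a couple of environment variables plus
| file descriptors starting at 3; a minimal sketch of the consuming
| side (error handling trimmed):
|
|       /* systemd sets LISTEN_PID and LISTEN_FDS; passed sockets
|        * start at fd 3 (SD_LISTEN_FDS_START). */
|       #include <stdio.h>
|       #include <stdlib.h>
|       #include <unistd.h>
|       #include <sys/socket.h>
|
|       #define SD_LISTEN_FDS_START 3
|
|       int main(void) {
|           const char *pid  = getenv("LISTEN_PID");
|           const char *nfds = getenv("LISTEN_FDS");
|           if (!pid || !nfds || atoi(pid) != (int)getpid()
|                   || atoi(nfds) < 1) {
|               fprintf(stderr, "not socket activated\n");
|               return 1;
|           }
|           int lfd = SD_LISTEN_FDS_START;  /* first passed socket */
|           for (;;) {
|               int cfd = accept(lfd, NULL, NULL);
|               if (cfd < 0) continue;
|               /* handle the connection ... */
|               close(cfd);
|           }
|       }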
| secondcoming wrote:
| I've always found the multi-process approach taken by both nginx
| and apache to be nothing but a hindrance when you have to write a
| custom module. It means that you may have to use shared memory,
| which is a PITA.
|
| I don't know why they haven't moved on from it; it only really
| made sense when single-core processors were the norm.
| politelemon wrote:
| So if I understood correctly, would it be like this
|
|       cp new/nginx /path/to/nginx
|       kill -SIGUSR2 <processid>
|
| That does sound pretty neat if you're not running nginx in a
| container. I wonder if they've built a Windows equivalent for
| that.
| majke wrote:
| Just a shout out: it's super hard to do it for UDP / QUIC / H3.
| Beware.
|
| (but I don't think nginx supports h3 out of the box yet)
| krab wrote:
| Why so? I thought UDP was stateless, making that process even
| easier. But I never implemented it.
| TimWolla wrote:
| UDP itself is stateless, but QUIC is stateful. Without
| knowing the background I would assume the issue to be that
| the incoming UDP packets will be routed to the new process
| after the reload and that new process is not aware of the
| existing QUIC connections, because the state resides in the
| old process. Thus it is not able to decrypt the packets for
| example.
| petters wrote:
| How are QUIC/HTTP3 servers usually upgraded? As you say, it
| seems tricky.
| monroeclinton wrote:
| I've been working on something similar in a load balancer I've
| been writing in Rust. It's still a work in progress.
|
| Basically the parent executes the new binary after it receives a
| USR1 signal. Once the child is healthy it kills the parent via
| SIGTERM. The listener socket file descriptor is passed via an
| environment variable.
|
| https://github.com/monroeclinton/- (this is the proper url, it's
| called dash)
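|
| In C terms, that kind of handoff looks roughly like this (the
| variable and binary names here are made up, not necessarily what
| the project uses):
|
|       /* Parent: hand the listening socket to a new binary via an
|        * environment variable, then exec. Child: pick it back up. */
|       #include <fcntl.h>
|       #include <stdio.h>
|       #include <stdlib.h>
|       #include <unistd.h>
|
|       void exec_new_binary(int listen_fd) {
|           int flags = fcntl(listen_fd, F_GETFD);
|           fcntl(listen_fd, F_SETFD, flags & ~FD_CLOEXEC);  /* survive exec */
|
|           char buf[16];
|           snprintf(buf, sizeof buf, "%d", listen_fd);
|           setenv("LISTENER_FD", buf, 1);          /* made-up name */
|           execl("./new-binary", "new-binary", (char *)NULL);
|           perror("execl");                        /* only on failure */
|       }
|
|       int inherit_listener(void) {
|           const char *s = getenv("LISTENER_FD");
|           return s ? atoi(s) : -1;                /* -1: fresh start */
|       }
|
|       int main(void) {
|           int lfd = inherit_listener();
|           if (lfd < 0) {
|               /* first start: socket()/bind()/listen() here */
|           }
|           /* ... accept loop; on the upgrade signal, call
|            * exec_new_binary(lfd) and let the child take over ... */
|           return 0;
|       }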
| mholt wrote:
| We did this for Caddy 1 too [1]. It was really cool. I am not
| sure how many people used this feature, so I haven't implemented
| it for Caddy 2 yet, and in the ~two years since Caddy 2 was
| released, I've only had the request once. It's a bit
| tricky/tedious to do properly, but I'm willing to bring it over
| to Caddy 2 with a sufficient sponsorship.
|
| [1]: https://github.com/caddyserver/caddy/blob/v1/upgrade.go
| eliaspro wrote:
| I'm torn on this feature.
|
| On the one hand, an application should never be able to replace
| itself with "random code" to be executed. I want my systems to
| be immutable. I want my services to be run with the smallest
| set of privileges required.
|
| On the other hand, it encourages "consumer level" users to keep
| their software up-to-date, even when it wasn't installed from a
| distribution's repository etc.
|
| So I think in general it's a good feature to have, as advanced
| users/distributions will restrict what a service/process is
| able to do anyway and won't have any downsides from not using
| this feature.
|
| It should be optional, that's all!
| blibble wrote:
| if you can log into the machine and replace the nginx
| executable you are probably capable of running it too
| mholt wrote:
| > an application should never be able to replace itself with
| "random code" to be executed.
|
| To clarify: it doesn't, nor has it ever worked that way. You
| have to be the one to do that (or someone with privileges to
| write to that file on disk). Most production setups don't
| give Caddy that permission. And you have to trigger the
| upgrade too.
| zimbatm wrote:
| If Caddy were to support systemd socket activation, this self-
| restart dance would not be necessary, as the parent process
| (systemd) holds the socket for you. Other systems can
| use https://github.com/zimbatm/socketmaster instead. I believe
| this to be more elegant and robust than the nginx approach as
| there are no PID re-parenting issues.
|
| But I suspect that most Caddy deployments are done via docker,
| and that requires a whole container restart anyway.
| tyingq wrote:
| It's kind of fun to watch things go out of fashion and back
| in. We used to use inetd, mostly because memory was
| expensive, so it could spawn a service only when a request
| came in, then the spawned process would exit and give the
| memory back to the os. Then someone decided tcpd should sit
| between inetd and servers, for security and logging. Then,
| every service just ran as it's own daemon. Now I'm
| occasionally seeing posts like this reviving inetd.
| mholt wrote:
| Good point, and I'm not sure which deployment method is more
| popular.
|
| In general I am personally not a fan of Docker due to added
| complexities (often unnecessary for static binaries like
| Caddy) and technical limitations such as this. All my Caddy
| deployments use systemd (which I don't love either, sigh).
| JesseObrien wrote:
| Can you explain any of the technical details around this
| perchance? I'm super curious. I know that SO_REUSEPORT[1]
| exists but is that the only little trick to make this work?
| From what I've read with SO_REUSEPORT it can open up that port
| to hijacking by rogue processes, so is that fine to rely on?
|
| [1] https://lwn.net/Articles/542629/
| fragmede wrote:
| If an attacker is already running rogue processes on your
| box, the minor details surrounding SO_REUSEPORT are the least
| of your worries. An attacker could just restart nginx, and
| won't care about lost requests.
| duskwuff wrote:
| You don't even need that. If the old server process exec()s
| the new one, it can pass on its file descriptors -- including
| the listening socket -- when that happens.
| mholt wrote:
| Yep, we don't use SO_REUSEPORT. We just pass it from the
| old process to the new one.
| tyingq wrote:
| You could also be fancy and pass open sockets over a unix
| domain socket with sendmsg().
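|
| For anyone curious what that looks like, the whole trick fits in a
| couple of helpers; a self-contained sketch using a socketpair:
|
|       /* Pass an open fd between processes over a UNIX socket with
|        * sendmsg()/recvmsg() and SCM_RIGHTS ancillary data. */
|       #include <stdio.h>
|       #include <string.h>
|       #include <unistd.h>
|       #include <sys/socket.h>
|       #include <sys/uio.h>
|       #include <sys/wait.h>
|
|       int send_fd(int sock, int fd) {
|           char dummy = 'x';
|           struct iovec iov = { &dummy, 1 };
|           union { struct cmsghdr h; char b[CMSG_SPACE(sizeof(int))]; } u;
|           memset(&u, 0, sizeof u);
|           struct msghdr msg = {0};
|           msg.msg_iov = &iov;  msg.msg_iovlen = 1;
|           msg.msg_control = u.b;  msg.msg_controllen = sizeof u.b;
|           struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
|           c->cmsg_level = SOL_SOCKET;
|           c->cmsg_type  = SCM_RIGHTS;
|           c->cmsg_len   = CMSG_LEN(sizeof(int));
|           memcpy(CMSG_DATA(c), &fd, sizeof(int));
|           return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
|       }
|
|       int recv_fd(int sock) {
|           char dummy;
|           struct iovec iov = { &dummy, 1 };
|           union { struct cmsghdr h; char b[CMSG_SPACE(sizeof(int))]; } u;
|           memset(&u, 0, sizeof u);
|           struct msghdr msg = {0};
|           msg.msg_iov = &iov;  msg.msg_iovlen = 1;
|           msg.msg_control = u.b;  msg.msg_controllen = sizeof u.b;
|           if (recvmsg(sock, &msg, 0) < 0) return -1;
|           int fd;
|           memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));
|           return fd;
|       }
|
|       int main(void) {
|           int sp[2];
|           socketpair(AF_UNIX, SOCK_STREAM, 0, sp);
|           if (fork() == 0) {                    /* child */
|               int fd = recv_fd(sp[1]);
|               dprintf(fd, "got the parent's stdout via SCM_RIGHTS\n");
|               return 0;
|           }
|           send_fd(sp[0], STDOUT_FILENO);        /* parent sends stdout */
|           wait(NULL);
|           return 0;
|       }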
| tyingq wrote:
| >it can open up that port to hijacking by rogue processes
|
| That seems relevant if the process is using a non-privileged
| port that's >= 1024. If we're talking about privileged ports
| (<= 1023), though, only another root process could hijack
| that, and those can already hijack you many other ways.
| bogomipz wrote:
| I am curious: does anyone know why Nginx uses SIGWINCH for this? I
| know Apache uses WINCH as well, which makes me wonder if there was
| some historical reason a server process wound up using a signal
| meant for a TTY?
| bob1029 wrote:
| I've considered building something like this to allow us to
| update customer software while it's serving users.
|
| In my proposals, there would be a simple application-aware http
| proxy process that we'd maintain and install on all environments.
| It would handle relaying public traffic to the appropriate final
| process on an alternate port. There would be a special pause
| command we could invoke on the proxy that would buy us time to
| swap the processes out from under the TCP requests. A second
| resume command would be issued once the process is running and
| stable. Ideally, the whole deal completes in ~5 seconds. Rapid
| test rollbacks would be double that. You can do most of the work
| ahead of time by toggling between an A and B install path for the
| binaries, with a third common data path maintained in the middle
| (databases, config, etc)
|
| With the above proposal, the user experience would be a brief
| delay at time of interaction, but we already have some UX
| contexts where delays of up to 30 seconds are anticipated.
| Absolutely no user request would be expected to drop with this
| approach, even in a rollback scenario. Our product is broad
| enough that entire sections of it can be a flaming wasteland
| while other pockets of users are perfectly happy, so keeping the
| happy users unbroken is key.
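|
| Stripped down to just the pause/resume mechanics (the signals and
| port here are placeholders, and the actual relaying is elided),
| the idea is something like:
|
|       /* Stop calling accept() while paused; new connections sit in
|        * the kernel's listen backlog instead of being refused. */
|       #include <signal.h>
|       #include <unistd.h>
|       #include <arpa/inet.h>
|       #include <netinet/in.h>
|       #include <sys/socket.h>
|
|       static volatile sig_atomic_t paused = 0;
|       static void on_pause(int sig)  { (void)sig; paused = 1; }
|       static void on_resume(int sig) { (void)sig; paused = 0; }
|
|       int main(void) {
|           signal(SIGUSR1, on_pause);
|           signal(SIGUSR2, on_resume);
|
|           int lfd = socket(AF_INET, SOCK_STREAM, 0);
|           struct sockaddr_in a = {0};
|           a.sin_family = AF_INET;
|           a.sin_port = htons(8080);      /* made-up port */
|           bind(lfd, (struct sockaddr *)&a, sizeof a);
|           listen(lfd, 1024);             /* deep backlog buys swap time */
|
|           for (;;) {
|               while (paused)
|                   usleep(100 * 1000);    /* backend is being swapped */
|               int cfd = accept(lfd, NULL, NULL);
|               if (cfd < 0) continue;
|               /* relay to the backend on its alternate port ... */
|               close(cfd);
|           }
|       }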
| kayodelycaon wrote:
| https://en.wikipedia.org/wiki/Blue-green_deployment
|
| DNS not required. You can use a load balancer to do the same
| thing. If you don't want a full second setup, do a rolling
| restart of application servers instead.
|
| Edit: I forgot... you can do this with containers too.
| rootlocus wrote:
| How do the two processes listen to the same port?
| nullify88 wrote:
| Once the USR2 signal is received, the master process forks and
| the child process inherits the parent's file descriptors,
| including the listening sockets. One process stops accepting
| connections, letting them queue in the kernel, and the new
| process takes over and starts accepting connections.
|
| You can follow the trail by searching for ngx_exec_new_binary
| in the nginx repo.
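|
| The inheritance part on its own is just ordinary fork() semantics;
| a toy demo (not nginx's code, which also exec()s the new binary):
|
|       /* A listening socket survives fork(), so old and new
|        * processes can accept() from the same kernel queue. */
|       #include <stdio.h>
|       #include <unistd.h>
|       #include <arpa/inet.h>
|       #include <netinet/in.h>
|       #include <sys/socket.h>
|       #include <sys/wait.h>
|
|       int main(void) {
|           int lfd = socket(AF_INET, SOCK_STREAM, 0);
|           int one = 1;
|           setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
|           struct sockaddr_in a = {0};
|           a.sin_family = AF_INET;
|           a.sin_port = htons(8080);      /* made-up port */
|           bind(lfd, (struct sockaddr *)&a, sizeof a);
|           listen(lfd, 128);
|
|           pid_t pid = fork();            /* new binary would exec here */
|           for (int i = 0; i < 2; i++) {  /* each serves two clients */
|               int cfd = accept(lfd, NULL, NULL);
|               dprintf(cfd, "served by %s (pid %d)\n",
|                       pid == 0 ? "child" : "parent", (int)getpid());
|               close(cfd);
|           }
|           if (pid > 0) wait(NULL);
|           return 0;
|       }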
| krab wrote:
| Just to add - Nginx normally spawns several worker processes
| that all process connections to the same port.
| nullify88 wrote:
| Correct but to clarify, only the master process binds to
| the ports. The master process creates socketpairs to the
| workers for interprocess communication. The workers accept
| connections over the shared socket.
|
| https://www.nginx.com/blog/socket-sharding-nginx-release-1-9...
|
| The page also has an example of how SO_REUSEPORT affects the flow.
| loeg wrote:
| There's a socket option for this on FreeBSD and Linux -- SO_REUSEPORT.
| You could also just leave the listening socket open when
| exec'ing the new httpd, or send it with a unix domain socket.
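|
| It's a setsockopt() made before bind(); each process does
| something like this (port is made up), and the kernel spreads
| incoming connections across the listeners:
|
|       /* Run two copies of this; both bind the same port because
|        * each sets SO_REUSEPORT before bind(). */
|       #include <unistd.h>
|       #include <arpa/inet.h>
|       #include <netinet/in.h>
|       #include <sys/socket.h>
|
|       int listen_reuseport(unsigned short port) {
|           int fd = socket(AF_INET, SOCK_STREAM, 0);
|           int one = 1;
|           setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one);
|           struct sockaddr_in a = {0};
|           a.sin_family = AF_INET;
|           a.sin_port = htons(port);
|           if (bind(fd, (struct sockaddr *)&a, sizeof a) < 0
|                   || listen(fd, 128) < 0)
|               return -1;
|           return fd;
|       }
|
|       int main(void) {
|           int lfd = listen_reuseport(8080);      /* made-up port */
|           for (;;) {
|               int cfd = accept(lfd, NULL, NULL);
|               if (cfd >= 0) close(cfd);
|           }
|       }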
| ctrlrsf wrote:
| Using the socket option SO_REUSEPORT allows multiple processes to
| bind to the same port.
| VWWHFSfQ wrote:
| is this what it's actually doing though? It doesn't say the
| reuseport option to the listen directive is required for
| this.
| markbnj wrote:
| This article on how haproxy uses SO_REUSEPORT goes into some
| more detail: https://www.haproxy.com/blog/truly-seamless-reloads-with-hap...
___________________________________________________________________
(page generated 2022-01-04 23:00 UTC)