[HN Gopher] Kamal Proxy - A minimal HTTP proxy for zero-downtime...
___________________________________________________________________
Kamal Proxy - A minimal HTTP proxy for zero-downtime deployments
Author : norbert1990
Score : 191 points
Date : 2024-09-21 07:55 UTC (15 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| oDot wrote:
| DHH mentioned they built it to move from the cloud to bare metal.
| He glorifies the simplicity but I can't help thinking they are a
| special use case of predictable, non-huge load.
|
| Uber, for example, moved to the cloud. I feel like in the span
| between them there are far more companies for which Kamal is not
| enough.
|
| I hope I'm wrong, though. It'll be nice for many companies to
| have the choice of exiting the cloud.
| pinkgolem wrote:
| I mean most B2B companies have a pretty predictable load when
| providing services to employees...
|
| I can get weeks of advance notice before we have a load
| increase through new users.
| appendix-rock wrote:
| You can't talk about typical cases and then bring up Uber.
| olieidel wrote:
| > I feel like in the span between them there are far more
| companies for which Kamal is not enough.
|
| I feel like this is a bias in the HN bubble: In the real world,
| 99% of companies with any sort of web servers (cloud or
| otherwise) are running very boring, constant, non-Uber
| workloads.
| ksec wrote:
| Not just HN but the whole internet overall, because all the
| news, articles, and tech achievements are pumped out by Uber
| and other big tech companies.
|
| I am pretty sure Uber belongs to the top 1% of internet
| companies in terms of scale. 37signals isn't exactly small
| either: they spent $3M a year on infrastructure in 2019, and
| it's likely a lot higher now.
|
| The whole tech cycle needs to stop having a top-down approach
| where everyone is doing what Big Tech is using. Instead we
| should try to push the simplest tools from the low end all the
| way to the 95% mark.
| nchmy wrote:
| They spend considerably less on infra now - this was the
| entire point of moving off cloud. DHH has written and
| spoken lots about it, providing real numbers. They bought
| their own servers and the savings paid for it all in like 6
| months. Now it's just money in the bank till they replace the
| hardware in 5 years.
|
| Cloud is a scam for the vast majority of companies.
| martinald wrote:
| I don't think that's the real point. The real point is that
| 'big 3' cloud providers are so overpriced that you could run
| hugely over provisioned infra 24/7 for your load (to cope with
| any spikes) and still save a fortune.
|
| The other thing is that cloud hardware is generally very very
| slow and many engineers don't seem to appreciate how bad it is.
| Slow single-thread performance because they use the most
| parallel CPUs possible (which are the cheapest per watt for
| the hyperscalers), very poor IO speeds, etc.
|
| So often a lot of this devops/infra work is solved by just
| using much faster hardware. If you have a fairly IO-heavy
| workload then switching from slow storage to PCIe 4.0
| 7 GB/sec NVMe drives is going to solve so many problems. If
| your app can't do much work in parallel then CPUs with much
| faster single-threaded performance can have huge gains.
| jsheard wrote:
| It's sad that what should have been a huge efficiency win,
| amortizing hardware costs across many customers, ended up
| often being more expensive than just buying big servers and
| letting them idle most of the time. Not to say the efficiency
| isn't there, but the cloud providers are pocketing the
| savings.
| toomuchtodo wrote:
| If you want a compute co-op, build a co-op (think VCs
| building their own GPU compute clusters for portfolio
| companies). Public cloud was always about using marketing
| and the illusion of need for dev velocity (which is real,
| hypergrowth startups and such, just not nearly as prevalent
| as the zeitgeist would have you believe) to justify the eye
| watering profit margin.
|
| Most businesses have fairly predictable interactive
| workload patterns, and their batch jobs are not high
| priority and can be managed as such (with the usual
| scheduling and bin packing orchestration). Wikipedia is one
| of the top 10 visited sites on the internet, and they run
| in their own datacenter, for example. The FedNow instant
| payment system the Federal Reserve recently went live with
| still runs on a mainframe. Bank of America was saving $2B a
| year running their own internal cloud (although I have heard
| they are attempting to move to a public cloud).
|
| My hot take is public cloud was an artifact of ZIRP and
| cheap money, where speed and scale were paramount, cost
| being an afterthought (Russ Hanneman pre-revenue bit here,
| "get big fast and sell"; great fit for cloud). With that
| macro over, and profitability over growth being the go
| forward MO, the equation might change. Too early to tell
| imho. Public cloud margins are compute customer
| opportunities.
| miki123211 wrote:
| Wikipedia is often brought up in these discussions, but
| it's a really bad example.
|
| To a vast majority of Wikipedia users who are not logged
| in, all it needs to do is show (potentially pre-rendered)
| article pages with no dynamic, per-user content. Those
| pages are easy to cache or even offload to a CDN. For all
| the users care, it could be a giant key-value store,
| mapping article slugs to HTML pages.
|
| This simplicity allows them to keep costs down, and the
| low costs mean that they don't have to be a business and
| care about time-on-page, personalized article
| recommendations or advertising.
|
| Other kinds of apps (like social media or messaging) have
| very different usage patterns and can't use this kind of
| structure.
| toomuchtodo wrote:
| > Other kinds of apps (like social media or messaging)
| have very different usage patterns and can't use this
| kind of structure.
|
| Reddit can't turn a profit, Signal is in financial peril.
| Meta runs their own data centers. WhatsApp could handle
| ~3M open TCP connections per server, running the
| operation with under 300 servers [1] and serving ~200M
| users. StackOverflow was running their Q&A platform off
| of 9 on prem servers as of 2022 [2]. Can you make a
| profitable business out of the expensive complex machine?
| That is rare, based on the evidence. If you're not a
| business, you're better off on Hetzner (or some other
| dedicated server provider) boxes with backups. If you're
| down you're down, you'll be back up shortly. Downtime is
| cheaper than five 9s or whatever.
|
| I'm not saying "cloud bad," I'm saying cloud where it
| makes sense. And those use cases are the exception, not
| the rule. If you're not scaling to an event where you can
| dump these cloud costs on someone else (acquisition
| event), or pay for them yourself (either donations,
| profitability, or wealthy benefactor), then it's
| pointless. It's techno performance art or fancy make
| work, depending on your perspective.
|
| [1] https://news.ycombinator.com/item?id=33710911
|
| [2] https://www.datacenterdynamics.com/en/news/stack-
| overflow-st...
| miki123211 wrote:
| You can always buy some servers to handle your base load, and
| then get extra cloud instances when needed.
|
| If you're running an ecommerce store for example, you could
| buy some extra capacity from AWS for Christmas and Black
| Friday, and rely on your own servers exclusively for the rest
| of the year.
| sgarland wrote:
| > The other thing is that cloud hardware is generally very
| very slow and many engineers don't seem to appreciate how bad
| it is.
|
| This. Mostly disk latency, for me. People who have only ever
| known DBaaS have no idea how absurdly fast they can be when
| you don't have compute and disk split by network hops, and
| your disks are NVMe.
|
| Of course, it doesn't matter, because the 10x latency hit is
| overshadowed by the miasma of everything else in a modern
| stack. My favorite is introducing a caching layer because you
| can't write performant SQL, and your DB would struggle to
| deliver it anyway.
| igortg wrote:
| I'm using a managed Postgres instance from a well-known
| provider and holy shit, I couldn't believe how slow it is. For
| small datasets I didn't notice, but when one of the tables
| reached 100K rows, queries started to take 5-10 seconds (the
| same query takes 0.5-0.6 seconds on my standard i5 Dell
| laptop).
|
| I wasn't expecting blazing speed on the lowest tier, but 10x
| slower is bonkers.
| toberoni wrote:
| I feel Uber is the outlier here. For every unicorn company
| there are 1000s of companies that don't need to scale to
| millions of users.
|
| And due to the insane markup of many cloud services it can make
| sense to just use beefier servers 24/7 to deal with the peaks.
| From my experience crazy traffic outliers that need
| sophisticated auto-scaling rarely happen outside of VC-fueled
| growth trajectories.
| 000ooo000 wrote:
| Strange choice of language for the actions:
|
| >To route traffic through the proxy to a web application, you
| *deploy* instances of the application to the proxy. *Deploying*
| an instance makes it available to the proxy, and replaces the
| instance it was using before (if any).
|
| >e.g. `kamal-proxy deploy service1 --target web-1:3000`
|
| 'Deploy' is a fairly overloaded term already. Fun conversations
| ahead. Is the app deployed? Yes? No I mean is it deployed to the
| proxy? Hmm our Kamal proxy script is gonna need some changes and
| a redeployment so that it deploys the deployed apps to the proxy
| correctly.
|
| Unsure why they couldn't have picked something like 'bind', or
| 'intercept', or even just 'proxy'... why 'deploy'..
| irundebian wrote:
| Deployed = running and registered to the proxy.
| vorticalbox wrote:
| Then wouldn't "register" be a better term?
| rad_gruchalski wrote:
| It's registered but is it deployed?
| IncRnd wrote:
| Who'd a thunk it?
| 8organicbits wrote:
| If your ingress traffic comes from a proxy, what would deploy
| mean other than that traffic from the proxy is now flowing to
| the new app instance?
| nahimn wrote:
| "Yo dawg, i heard you like deployments, so we deployed a
| deployment in your deployment so your deployment can deploy"
| -Xzibit
| chambored wrote:
| "Pimp my deployment!"
| shafyy wrote:
| Also exciting: Kamal 2 (currently an RC,
| https://github.com/basecamp/kamal/releases) will support
| auto-SSL and make it easy to run multiple apps on one server.
| francislavoie wrote:
| They're using the autocert package, which is the bare minimum.
| It's brittle and doesn't allow for horizontal scaling of your
| proxy instances, because you're subject to Let's Encrypt rate
| limits and simultaneous cert limits. (Disclaimer: I help
| maintain Caddy.) Caddy/CertMagic solves this by writing the
| data to shared storage so only a single cert will be issued
| and reused/coordinated across all instances through the
| storage. Autocert also doesn't have issuer fallback, doesn't
| do rate limit avoidance, doesn't respect ARI, etc.
|
| Holding requests until an upstream is available is also
| something Caddy does well: just configure the reverse_proxy
| with try_duration and try_interval, and it will keep trying to
| choose a healthy upstream (determined via active health checks
| done in a separate goroutine) for that request until it times
| out.
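|
| Roughly something along these lines (lb_try_duration /
| lb_try_interval being the Caddyfile spellings; the hostnames,
| ports and health endpoint here are made up, so treat this as a
| sketch rather than a drop-in config):
|
|   app.example.com {
|       reverse_proxy web-1:3000 web-2:3000 {
|           lb_try_duration 30s
|           lb_try_interval 250ms
|           health_uri /up
|           health_interval 5s
|       }
|   }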
|
| Their proxy header handling doesn't consider trusted IPs, so
| if it's enabled, someone could spoof their IP by setting
| X-Forwarded-For. At least it's off by default, but they don't
| warn about this.
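|
| With that option on, nothing stops a client from sending, say
| (hypothetical hostname):
|
|   curl -H 'X-Forwarded-For: 203.0.113.7' https://app.example.com/
|
| and an app that trusts the first X-Forwarded-For entry would
| then see 203.0.113.7 as the client IP.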
|
| This looks pretty undercooked. I get that it's simple and
| that's the point, but I would never use this for anything in
| its current state. There's just so many better options out
| there.
| markusw wrote:
| I just want to say I use Caddy for this exact thing and it
| works beautifully! :D Thank you all for your work!
| shafyy wrote:
| Interesting, thanks for the detailed explanation! I'm not
| very experienced with devops, so this is very helpful!
| kh_hk wrote:
| I don't understand how to use this, maybe I am missing something.
|
| Following the example, it starts 4 replicas of a 'web' service.
| You can create a service by running a deploy to one of the
| replicas, let's say example-web-1. What do the other 3
| replicas do?
|
| Now, let's say I update 'web'. Let's assume I want to do a zero-
| downtime deployment. That means I should be able to run a build
| command on the 'web' service, start this service somehow (maybe
| by adding an extra replica), and then run a deploy against the
| new target?
|
| If I run a `docker compose up --build --force-recreate web`,
| this will bring down the old replica, making the whole thing
| moot.
|
| Instructions unclear, can anyone chime in and help me understand?
| thelastparadise wrote:
| Why would I not just do k8s rollout restart deployment?
|
| Or just switch my DNS or router between two backends?
| jgalt212 wrote:
| You still need some warm-up routine to run for the newly
| online server before the hand-off occurs. I'm not a k8s
| expert, but the above described events can be easily handled
| by a bash or fab script.
| ahoka wrote:
| What events do you mean? If the app needs a warm up, then
| it can use its readiness probe to ask for some delay until
| it gets request routed to it.
| jgalt212 wrote:
| GET requests to pages that fill caches or those that make
| apache start up more than n processes.
| thelastparadise wrote:
| This is a health/readiness probe in k8s. It's already
| solved quite solidly.
| joeatwork wrote:
| I think this is part of a lighter weight Kubernetes
| alternative.
| ianpurton wrote:
| Lighter than the existing lightweight Kubernetes
| alternatives, i.e. k3s :)
| diggan wrote:
| Or, hear me out: Kubernetes alternatives that don't
| involve any parts of Kubernetes at all :)
| ozgune wrote:
| I think the parent project, Kamal, positions itself as a
| simpler alternative to K8s when deploying web apps. They have
| a question on this on their website: https://kamal-deploy.org
|
| "Why not just run Capistrano, Kubernetes or Docker Swarm?
|
| ...
|
| Docker Swarm is much simpler than Kubernetes, but it's still
| built on the same declarative model that uses state
| reconciliation. Kamal is intentionally designed around
| imperative commands, like Capistrano.
|
| Ultimately, there are a myriad of ways to deploy web apps,
| but this is the toolkit we've used at 37signals to bring HEY
| and all our other formerly cloud-hosted applications home to
| our own hardware."
| sisk wrote:
| For the first part of your question about the other replicas,
| docker will load balance between all of the replicas either
| with a VIP or by returning multiple IPs in the DNS request[0].
| I didn't check if this proxy balances across multiple records
| returned in a DNS request but, at least in the case of
| VIP-based load balancing, it should work like you would expect.
|
| For the second part about updating the service, I'm a little
| less clear. I guess the expectation would be to bring up a
| differently-named service within the same network, and then
| `kamal-proxy deploy` it? So maybe the expectation is for
| service names to include a version number? Keeping the old
| version hot makes sense if you want to quickly be able to route
| back to it.
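|
| If that's the right reading, a by-hand version of one rollout
| might look roughly like this (container names, network and
| port are made up; the deploy syntax is the one from the
| README):
|
|   docker run -d --name web-v2 --network kamal my-app:v2
|   # waits for web-v2 to look healthy, then switches traffic
|   kamal-proxy deploy web --target web-v2:3000
|   # keep web-v1 around for a quick rollback, or remove it
|   docker rm -f web-v1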
|
| [0]: https://docs.docker.com/reference/compose-
| file/deploy/#endpo...
| blue_pants wrote:
| Can someone briefly explain how ZDD works in general?
|
| I guess both versions of the app must be running simultaneously,
| with new traffic being routed to the new version of the app.
|
| But what about DB migrations? Assuming the app uses a single
| database, and the new version of the app introduces changes to
| the DB schema, the new app version would modify the schema during
| startup via a migration script. However, the previous version of
| the app still expects the old schema. How is that handled?
| andrejguran wrote:
| Migrations have to be backwards compatible so the DB schema can
| serve both versions of the app. It's an extra price to pay for
| having ZDD or rolling deployments and something to keep in
| mind. But it's generally done by all the larger companies.
| efortis wrote:
| Yes, both versions must be running at some point.
|
| The load balancer starts accepting connections on Server2 and
| stops accepting new connections on Server1. Then, Server1
| disconnects when all of its connections are closed.
|
| It could be different servers or multiple workers on one
| server.
|
| During that window, as the other comments said, migrations have
| to be backwards compatible.
| diggan wrote:
| The first step is to decouple migrations from deploys: you
| want manual control over when the migrations run, contrary to
| many frameworks' default of running migrations when you deploy
| the code.
|
| Secondly, each code version has to work with the current schema
| and the schema after a future migration, making all code
| effectively backwards compatible.
|
| Your deploys end up being something like:
|
| - Deploy new code that works with current and future schema
|
| - Verify everything still works
|
| - Run migrations
|
| - Verify everything still works
|
| - Clean up the acquired technical debt (the code that worked
| with the schema that no longer exists) at some point, or run
| out of runway and it won't be an issue
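|
| A concrete (made-up) sketch of that sequence with plain psql,
| for a hypothetical users.name -> users.full_name rename:
|
|   # after deploying code that reads full_name if present, else name:
|   psql "$DATABASE_URL" -c \
|     'ALTER TABLE users ADD COLUMN full_name text'
|   psql "$DATABASE_URL" -c \
|     'UPDATE users SET full_name = name WHERE full_name IS NULL'
|   # later, after deploying code that only uses full_name:
|   psql "$DATABASE_URL" -c 'ALTER TABLE users DROP COLUMN name'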
| wejick wrote:
| This is a very good explanation, no judgment and simply
| educational. Appreciated.
|
| Though I'm still surprised that some people run DB alterations
| on application startup. I've never seen that in real life.
| diggan wrote:
| > Though I'm still surprised that some people run DB
| alterations on application startup
|
| I think I've seen it more commonly in the Golang ecosystem,
| for some reason. I'm also not sure how common it is nowadays,
| but I've seen lots of deployments (contained in Ansible scripts,
| Makefiles, Bash scripts or whatever) where the
| migration+deploy is run directly in sequence automatically
| for each deploy, rather than as discrete steps.
|
| Edit: Maybe it's more of an educational problem than
| something else, where learning resources either don't
| specify _when_ to actually run migrations or straight up
| recommend that people run migrations on application startup
| (one example: https://articles.wesionary.team/integrating-
| migration-tool-i...)
| miki123211 wrote:
| It makes things somewhat easier if your app is smallish and
| your workflow is something like e.g. Github Actions
| automatically deploying all commits on main to Fly or
| Render.
| whartung wrote:
| We do this. It has worked very well for us.
|
| There are a couple of fundamental rules to follow. First,
| don't put something that will have insane impact into the
| application deploy changes. 99% of the DB changes are very
| cheap, and very minor. If the deploy is going to be very
| expensive, then just don't do it that way; we'll do it out of
| band. This has not been a problem in practice with our 20ish
| person team.
|
| Second, it's kind of like double-entry accounting. Once
| you've committed the change, you cannot go back and "fix it".
| If you did something really wrong (see above), then sure, but
| if not, you commit a correcting entry instead, because you
| don't know who has recently downloaded your commit and run it
| against their database.
|
| The changes are a list of incremental steps that the system
| applies in order, if they had not been applied before. So,
| they are treated as, essentially, append only.
|
| And it has worked really well for us, keeping the diverse
| developers who deploy against local databases in sync with
| little drama.
|
| I've incorporated the same concept in my GUI programs that
| stand up their own DB. It's a very simple system.
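|
| The core of it can be as small as something like this (a
| made-up sketch with plain psql and a schema_migrations table,
| not our actual code; files are named so they sort in order):
|
|   psql "$DATABASE_URL" -c 'CREATE TABLE IF NOT EXISTS
|     schema_migrations (version text PRIMARY KEY)'
|   for f in migrations/*.sql; do
|     v=$(basename "$f" .sql)
|     applied=$(psql -tA "$DATABASE_URL" -c \
|       "SELECT 1 FROM schema_migrations WHERE version = '$v'")
|     if [ -z "$applied" ]; then
|       psql "$DATABASE_URL" -f "$f" &&
|         psql "$DATABASE_URL" -c \
|           "INSERT INTO schema_migrations (version) VALUES ('$v')"
|     fi
|   done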
| e_y_ wrote:
| At my company, running DB migrations on startup was a flag
| that was enabled for local development and disabled for
| production deploys.
| production deploys, and a few teams had it turned on for
| production deploys (although those teams only had
| infrequent, minor changes like adding a new column).
|
| Personally I found the idea of having multiple instances
| running the same schema update job at the same time (even
| if locks would keep it from running in practice) to be
| concerning so I always had it disabled for deploys.
| jacobsimon wrote:
| This is the way
| svvvy wrote:
| I thought it was correct to run the DB migrations for the new
| code first, then deploy the new code. While making sure that
| the DB schema is backwards compatible with both versions of
| the code that will be running during the deployment.
|
| So maybe there's something I'm missing about running DB
| migrations after the new code has been deployed - could you
| explain?
| ffsm8 wrote:
| I'm not the person you've asked, but I've worked in devops
| before.
|
| It kinda doesn't matter which you do first. And if you
| squint a little, it's effectively the same thing, because
| the migration will likely only become available via a
| deployment too.
|
| So yeah, the only thing that's important is that the DB
| migration can't cause an incompatibility with any currently
| deployed version of the code - and if it would, you'll have
| to split the change so it doesn't. It'll force another
| deploy for the change you want to do, but it's what you're
| forced to do if maintenance windows aren't an option. Which
| is kinda a given for most b2c products.
| shipp02 wrote:
| So if you add any constraints/data, you can't rely on them
| being there until version n+2, or you need to have 2 paths: 1
| for the old data, 1 for the new?
| simonw wrote:
| Effectively yes. Zero downtime deployments with database
| migrations are fiddly.
| globular-toast wrote:
| There's a little bit more to it. Firstly you can deploy the
| migration first as long as it's forwards compatible (i.e. old
| code can read from it). That migration needs to be zero
| downtime; it can't, for example, rewrite whole tables or
| otherwise lock them, or requests will time out. Doing a whole
| new schema is one way to do it, but not always necessary. In
| any case you probably then need a backfill job to fill up the
| new schema with data before possibly removing the old one.
|
| There's a good post about it here:
| https://rtpg.co/2021/06/07/changes-checklist.html
| stephenr wrote:
| Others have described the _how_ part if you do need truly zero
| downtime deployments, but I think it's worth pointing out that
| for most organisations, and most migrations, the amount of
| downtime due to a db migration is virtually indistinguishable
| from zero, particularly if you have a regional audience, and
| can aim for "quiet" hours to perform deployments.
| diggan wrote:
| > the amount of downtime due to a db migration is virtually
| indistinguishable from zero
|
| Besides, once you've run a service for a while that has
| acquired enough data for migrations to take a while, you
| realize that there are in fact two different types of
| migrations. "Schema migrations" which are generally fast and
| "Data migrations" that depending on the amount of data can
| take seconds or days. Or you can do the "data migrations"
| when needed (on the fly) instead of processing all the data.
| Can get gnarly quickly though.
|
| Splitting those also allows you to reduce maintenance
| downtime if you don't have zero-downtime deployments already.
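|
| For the big "data migration" ones, a common trick is to
| backfill in small batches so nothing holds a long lock; a
| made-up psql sketch (hypothetical users.full_name backfill):
|
|   while :; do
|     n=$(psql -tA "$DATABASE_URL" -c \
|       "UPDATE users SET full_name = name WHERE id IN (
|          SELECT id FROM users WHERE full_name IS NULL LIMIT 1000)
|        RETURNING 1" | wc -l)
|     [ "$n" -eq 0 ] && break   # done once nothing is left to update
|     sleep 1                   # give the table some breathing room
|   done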
| stephenr wrote:
| Very much so, we handle these very differently for $client.
|
| Schema migrations are versioned in git with the app, with
| up/down (or forward/reverse) migration scripts and are
| applied automatically during deployment of the associated
| code change to a given environment.
|
| SQL Data migrations are stored in git so we have a record
| but are never applied automatically, always manually.
|
| The other thing we've used along these lines, is having one
| or more low priority job(s) added to a queue, to apply some
| kind of change to records. These are essentially still data
| migrations, but they're written as part of the application
| code base (as a Job) rather than in SQL.
| sgarland wrote:
| Schema migrations can be quite lengthy, mostly if you made
| a mistake earlier. Some things that come to mind are
| changing a column's type, or extending VARCHAR length (with
| caveats; under certain circumstances it's instant).
| lukevp wrote:
| Not OP, but I would consider this a data migration as
| well. Anything that requires an operation on every row in
| a table would qualify. Really, changing the column type is
| just a built-in form of a data migration.
| globular-toast wrote:
| Lengthy migrations don't matter. What matters is whether
| they hold long locks or not. Data migrations might take a
| while but they won't lock anything. Schema migrations, on
| the other hand, can easily do so, like if you add a new
| column with a default value. The whole table must be
| rewritten and it's locked for the entire time.
| jakjak123 wrote:
| Most are not affected by db migrations in the sense that
| migrations are run before the service starts the web server
| during boot. The database might block traffic for other
| already-running connections though, in which case you have a
| problem with your database design.
| gsanderson wrote:
| I haven't tried it but it looks like Xata has come up with a
| neat solution to DB migrations (at least for postgres). There
| can be two versions of the app running.
|
| https://xata.io/blog/multi-version-schema-migrations
| efxhoy wrote:
| Strong Migrations helps with writing migrations that are safe
| for ZDD deploys. We use it in our Rails app; it catches quite
| a few potential footguns.
| https://github.com/ankane/strong_migrations
| risyachka wrote:
| Did they mention anywhere why they decided to write their own
| proxy instead of using Traefik or something else battle tested?
| yla92 wrote:
| They were actually using Traefik until this "v2.0.0" (pre-
| release right now) version.
|
| There is some context about why they switched and decided to
| roll their own in the PR.
|
| https://github.com/basecamp/kamal/pull/940
| stackskipton wrote:
| As an SRE, that PR scares me. There is no long explanation of
| why we are throwing out third-party, extremely battle-tested
| HTTP proxy software for our own homegrown one except "Traefik
| didn't do what we wanted 100%".
|
| Man, I've been there where you wish third-party software had
| some feature, but writing your own is the WORST thing you can
| do for a company 9/10 times. My current company is dealing
| with massive tech debt because of all this homegrown software.
| mt42or wrote:
| NIH. Nothing else to add.
| elktown wrote:
| Meanwhile in the glorious land of "never invented here":
| https://news.ycombinator.com/item?id=38526780
| ahdfyasdf wrote:
| Is there a way to configure timeouts?
|
| https://github.com/basecamp/kamal-proxy/blob/main/internal/s...
|
| https://github.com/basecamp/kamal-proxy/blob/main/internal/s...
|
| https://blog.cloudflare.com/exposing-go-on-the-internet/
| ianpurton wrote:
| DHH in the past has said "This setup helped us dodge the
| complexity of Kubernetes"
|
| But this looks somewhat like a re-invention of what Kubernetes
| provides.
|
| Kubernetes has come a long way in terms of ease of deployment on
| bare metal.
| wejick wrote:
| Zero-downtime deployment was around long before Kube. It does
| look as simple as it's ever been, not like Kube for sure.
| moondev wrote:
| Does this handle a host reboot?
| francislavoie wrote:
| In theory it should, because they do health checking to track
| status of the upstreams. The upstream server being down would
| be a failed TCP connection which would fail the health check.
|
| Obviously, rebooting the machine the proxy is running on is
| trickier though. I don't feel confident they've done enough to
| properly support having multiple proxy instances running side
| by side (no shared storage mechanism for TLS certs at least),
| which would allow upgrading one at a time and using a
| router/firewall/DNS in front of it to route to both normally,
| then switch it to one at a time while doing maintenance to
| reboot them, and back to both during normal operations.
| rohvitun wrote:
| Aya
| 0xblinq wrote:
| 3 years from now they'll have invented their own AWS. NIH
| syndrome in full swing.
| bdcravens wrote:
| It's a matter of cost, not NIH syndrome. In Basecamp's case,
| saving $2.4M a year isn't something to ignore.
|
| https://basecamp.com/cloud-exit
|
| Of course, it's fair to say that rebuilding the components that
| the industry uses for hosting on bare metal is NIH syndrome.
| mannyv wrote:
| A new proxy is a proxy filled with issues. It's nice that
| it's Go, but in production I'd go with nginx or something
| else and replay traffic to kamal. There are enough weird
| behaviors out there (and bad actors) that I'd be worried
| about exploits etc.
| viraptor wrote:
| It's an interesting choice to make this a whole app, when
| zero-downtime deployments can be achieved trivially with other
| servers these days. For example any app+web proxy which
| supports Unix sockets can do zero-downtime by moving the file.
| It's atomic and you can send the warm-up requests with curl.
| Building a whole system with registration feels like overkill.
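|
| Something like this, concretely (the app's --listen flag and
| the /up path are made up; mv on the same filesystem is an
| atomic rename):
|
|   # start the new version on its own socket and warm it up
|   ./app --listen /run/app/next.sock &
|   curl --unix-socket /run/app/next.sock http://localhost/up
|   # atomic swap: the proxy's next connection to app.sock
|   # reaches the new process
|   mv /run/app/next.sock /run/app/app.sock
|   # the old process drains its existing connections, then exits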
| dewey wrote:
| That's just a small part of Kamal (https://kamal-deploy.org),
| the deployment tool they built and used to move from the
| cloud to their own hardware, saving millions
| (https://basecamp.com/cloud-exit).
| ksajadi wrote:
| This primarily exists to take care of a fundamental issue in
| Docker Swarm (Kamal's orchestrator of choice) where replacing
| containers of a service disrupts traffic. We had the same problem
| (when building JAMStack servers at Cloud 66) and used Caddy
| instead of writing our own proxy and also looked at Traefik which
| would have been just as suitable.
|
| I don't know why Kamal chose Swarm over k8s or k3s (simplicity
| perhaps?), but then complexity needs a home: you can push it
| around but cannot hide it, hence a home-grown proxy.
|
| I have not tried Kamal Proxy enough to know, but I am highly
| skeptical of something like this, because I am pretty sure I
| will be chasing it for support for anything from WebSockets to
| SSE to HTTP/3 to various types of compression and encryption.
| jauntywundrkind wrote:
| Kamal feels built around the premise that "Kubernetes is too
| complicated" (after Basecamp got burned by some hired help),
| and from that justification it goes out and recreates a sizable
| chunk of the things Kubernetes does.
|
| Your list of things a reverse proxy might do is a good example
| to me of how I expect this to go: what starts out as an
| ambition to be simple inevitably has to grow & grow more of
| the complexity it sought to avoid.
|
| Part of me strongly thinks we need competition & need other
| things trying to create broad, ideally extensible ways of
| running systems. But a huge part of me sees Kamal & thinks,
| man, this is a lot of work being done only to have to keep
| walking backwards into the complexity they were trying to
| avoid. Usually second-system syndrome means the first system
| was simple and the second is overly complicated, and on the
| tin this case is the inverse, but man, the competency of Kube
| and its flexibility/adaptability as a framework for Desired
| State Management really shows through for me.
| ksajadi wrote:
| I agree with you and at the risk of self-promotion, that's
| why we built Cloud 66 (which takes care of Day-1 (build and
| deploy) as well as Day-2 (scale and maintenance) part of
| infrastructure. As we all can see there is a lot to this than
| just wrapping code in a Dockerfile and pushing it out to a
| Swarm cluster.
| hipadev23 wrote:
| I feel like you're conflating the orchestration with proxying.
| There's no reason they couldn't be using caddy or traefik or
| envoy for the proxy (just like k8s ends up using them as an
| ingress controller), while still using docker.
| ksajadi wrote:
| Docker is the container engine. Swarm is the orchestration,
| the same as Kubernetes. The concept of "Service" in k8s takes
| care of a lot of the proxying, while still using Docker (not
| anymore tho). In Swarm, services exist but only take care of
| container lifecycle and not traffic. While networking is left
| to the containers, Swarm services always get in the way,
| causing issues that will require a proxy.
|
| In k8s for example, you can use Docker and won't need a proxy
| for ZDD (while you might want one for Ingress and other uses).
| simonw wrote:
| Does this implement the "traffic pausing" pattern?
|
| That's where you have a proxy which effectively pauses traffic
| for a few seconds - incoming requests appear to take a couple of
| seconds longer than usual, but are still completed after that
| short delay.
|
| During those couple of seconds you can run a blocking
| infrastructure change - could be a small database migration, or
| could be something a little more complex as long as you can get
| it finished in less than about 5 seconds.
| ignoramous wrote:
| tbh, sounds like "living dangerously" pattern to me.
| francislavoie wrote:
| Not really, it works quite well as long as your proxy/server
| has enough memory to hold the requests for a little while.
| As long as you're not serving near your max load all the
| time, it's a breeze.
| tinco wrote:
| Have you seen that done in production? It sounds really
| dangerous. I've worked for an app server company for years and
| this is the first I've heard of this pattern. I'd wave it away
| if I hadn't noticed in your bio that you co-created Django, so
| you've probably seen your fair share of deployments.
| written-beyond wrote:
| Just asking, isn't this what every serverless platform uses
| while it spins up an instance? Like it's why cold starts are
| a topic at all, or else the first few requests would just
| fail until the instance spun up to handle the request.
| simonw wrote:
| I first heard about it from Braintree.
| https://simonwillison.net/2011/Jun/30/braintree/
| francislavoie wrote:
| Caddy does this! (As you know, I think. I feel I remember we
| discussed this some time ago)
___________________________________________________________________
(page generated 2024-09-21 23:00 UTC)