[HN Gopher] Intelligent Kubernetes Load Balancing at Databricks
___________________________________________________________________
Intelligent Kubernetes Load Balancing at Databricks
Author : ayf
Score : 113 points
Date : 2025-10-01 05:06 UTC (17 hours ago)
(HTM) web link (www.databricks.com)
(TXT) w3m dump (www.databricks.com)
| bbkane wrote:
| Thanks for writing - I found the Power of Two Choices algorithm
| particularly interesting (I hadn't seen it before).
|
| From the recent gRPConf (
| https://www.youtube.com/playlist?list=PLj6h78yzYM2On4kCcnWjl... )
| it seems gRPC as a standard is also moving toward this
| "proxyless" model - gRPC will read xDS itself.
| walth wrote:
| You might be interested in nginx's implementation
|
| https://nginx.org/en/docs/http/ngx_http_upstream_module.html...
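|
| Concretely, the "random two" directive there is nginx's power-
| of-two-choices; a minimal sketch (upstream addresses made up):
|
|     upstream backend {
|         server 10.0.0.1:8080;
|         server 10.0.0.2:8080;
|         server 10.0.0.3:8080;
|         # pick two peers at random, then route to the one with
|         # fewer active connections
|         random two least_conn;
|     }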
| shizcakes wrote:
| Less featureful than this, but we've been doing gRPC client-side
| load balancing with kuberesolver[1] since 2018. It lets gRPC
| handle the balancer implementation itself. It's been rock solid
| for more than half a decade now.
|
| 1: https://github.com/sercand/kuberesolver
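|
| For anyone curious, the integration is roughly this (from memory
| of the README; service name and port are made up, and the import
| path may differ by version):
|
|     import (
|         "google.golang.org/grpc"
|         "google.golang.org/grpc/credentials/insecure"
|
|         "github.com/sercand/kuberesolver/v5"
|     )
|
|     func dial() (*grpc.ClientConn, error) {
|         // Registers the "kubernetes" scheme; the resolver then
|         // watches the endpoints of the named service.
|         kuberesolver.RegisterInCluster()
|
|         // Round-robin across the resolved pod IPs.
|         return grpc.Dial(
|             "kubernetes:///my-service.my-namespace:grpc",
|             grpc.WithDefaultServiceConfig(
|                 `{"loadBalancingPolicy":"round_robin"}`),
|             grpc.WithTransportCredentials(insecure.NewCredentials()),
|         )
|     }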
| azaras wrote:
| What is the difference between Kuberesolver and using a
| Headless Service?
|
| In the README.md file, they compare it with a ClusterIP service,
| but not with a headless one ("clusterIP: None").
|
| The advantage of using kuberesolver is that you do not need to
| tune DNS refresh and cache settings. However, I think tuning DNS
| is still preferable to having the application call the
| Kubernetes API.
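|
| For reference, the headless variant is just (sketch; names made
| up):
|
|     apiVersion: v1
|     kind: Service
|     metadata:
|       name: my-service
|     spec:
|       clusterIP: None   # headless: DNS returns pod IPs directly
|       selector:
|         app: my-app
|       ports:
|         - name: grpc
|           port: 8080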
| euank wrote:
| I can give an n=1 anecdote here: the DNS resolver used to have a
| hard-coded cache interval, which meant it was unresponsive to
| pod updates and caused mini 30-second outages.
|
| The code in question was: https://github.com/grpc/grpc-
| go/blob/b597a8e1d0ce3f63ef8a7b6...
|
| That meant that deploying a service which drained in less than
| 30s would have a little mini-outage for that service until the
| in-process DNS cache expired, with of course no way to configure
| it.
|
| Kuberesolver streams updates, and thus lets clients talk to
| new pods almost immediately.
|
| I think things are a little better now, but based on my reading
| of https://github.com/grpc/grpc/issues/12295, it looks like the
| DNS resolver still might not resolve new pod names quickly in
| some cases.
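|
| (If I remember right, newer grpc-go at least exposes the
| interval as an experimental knob, something like:
|
|     import (
|         "time"
|
|         "google.golang.org/grpc/resolver/dns"
|     )
|
|     func init() {
|         // Experimental API; the default has historically been
|         // 30s.
|         dns.SetMinResolutionInterval(5 * time.Second)
|     }
|
| but treat that as an assumption, not gospel.)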
| gaurav324 wrote:
| kuberesolver is an interesting take as well. Directly watching
| the K8s API from each client could raise scaling concerns at
| very large scale, but it does open the door to using richer
| Kubernetes metadata for smarter load-balancing decisions.
| Thanks for sharing!
| debarshri wrote:
| I think it can scale with some rate limiting. But it might be a
| security issue: ideally you don't want clients to be aware of
| Kubernetes at all, and it would be difficult to scope the
| access.
| arccy wrote:
| If you don't want to expose k8s, there's the generic xDS
| protocol.
| darkstar_16 wrote:
| We use a headless service and client-side load balancing for
| this. What's the difference?
| arccy wrote:
| Instead of polling for endpoint updates, they're pushed to the
| client through k8s watches.
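|
| Roughly, with client-go (error handling trimmed; namespace and
| service name made up):
|
|     import (
|         "context"
|         "fmt"
|
|         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
|         "k8s.io/client-go/kubernetes"
|         "k8s.io/client-go/rest"
|     )
|
|     func watchEndpoints() error {
|         cfg, err := rest.InClusterConfig()
|         if err != nil {
|             return err
|         }
|         cs, err := kubernetes.NewForConfig(cfg)
|         if err != nil {
|             return err
|         }
|         // The API server pushes Added/Modified/Deleted events
|         // as pods churn; no polling, no DNS TTLs.
|         w, err := cs.CoreV1().Endpoints("my-namespace").Watch(
|             context.Background(), metav1.ListOptions{
|                 FieldSelector: "metadata.name=my-service",
|             })
|         if err != nil {
|             return err
|         }
|         for ev := range w.ResultChan() {
|             fmt.Println("endpoints event:", ev.Type)
|         }
|         return nil
|     }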
| hanikesn wrote:
| I've been using a standardized xDS resolver[1]. The benefit here
| is that you don't have to patch gRPC clients.
|
| [1] https://github.com/wongnai/xds
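|
| With the xDS support built into grpc-go it's basically an import
| plus a scheme, roughly (target name made up; you also need a
| GRPC_XDS_BOOTSTRAP config pointing at your control plane):
|
|     import (
|         "google.golang.org/grpc"
|         "google.golang.org/grpc/credentials/insecure"
|
|         _ "google.golang.org/grpc/xds" // registers xds:/// scheme
|     )
|
|     func dial() (*grpc.ClientConn, error) {
|         return grpc.Dial("xds:///my-service",
|             grpc.WithTransportCredentials(insecure.NewCredentials()))
|     }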
| atombender wrote:
| Do you know how this compares to the Nginx ingress
| controller, which has a native gRPC mode?
| thewisenerd wrote:
| We have the same issue with HTTP as well, due to HTTP keepalive,
| which many clients enable out of the box.
|
| The "impact" can be reduced by configuring an overall connection
| TTL: new pods take some time to start receiving traffic, but the
| load evens out over time.
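|
| For gRPC servers in Go, the analogous knob is MaxConnectionAge,
| which recycles connections so clients re-resolve and rebalance
| (values below are illustrative):
|
|     import (
|         "time"
|
|         "google.golang.org/grpc"
|         "google.golang.org/grpc/keepalive"
|     )
|
|     func newServer() *grpc.Server {
|         return grpc.NewServer(grpc.KeepaliveParams(
|             keepalive.ServerParameters{
|                 // connection "TTL"
|                 MaxConnectionAge: 5 * time.Minute,
|                 // drain window after the GOAWAY
|                 MaxConnectionAgeGrace: 30 * time.Second,
|             }))
|     }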
|
| --
|
| That said, I'm not surprised that even a company as large as
| Databricks feels that adding a service mesh is going to add
| operational complexity.
|
| Looks like they've taken the best parts (endpoint watch, sync to
| clients with xDS) and moved them client-side. Compared to the
| failure modes of a service mesh, this seems better.
| gaurav324 wrote:
| Yes, we've leaned toward minimizing operational overhead.
| Bringing the useful parts of a mesh (xDS endpoint and routing
| updates) into the client has worked extremely well in practice
| and has been very reliable, without the extra moving parts of a
| full mesh.
| agrawroh wrote:
| When we started, we had a lot of pieces like certificate
| management in-house, and adding a full-blown service mesh was a
| big operational overhead. We began by building only the parts we
| needed and then integrated things like xDS natively into the
| rest of our clients.
| pm90 wrote:
| I haven't been keeping up, but is there still hype over full
| meshes like Istio/Linkerd? I've seen them tried in a couple of
| places, but they didn't work super well; the last place couldn't
| adopt one because Datadog apparently bills sidecar containers as
| additional hosts, so using a sidecar proxy would have doubled
| our Datadog bill.
| thewisenerd wrote:
| > kube-proxy supports only basic algorithms like round-robin or
| random selection
|
| this is "partially" true.
|
| if you're using ipvs, you can configure the scheduler to just
| about anything ipvs supports (including wrr). they removed the
| validation for the scheduler name quite a while back.
|
| kubernetes itself though doesn't "understand" (i.e., can NOT
| represent) the nuances (e.g., weights per endpoint with wrr),
| which is the problem.
| jedberg wrote:
| I wonder why they didn't use rendezvous hashing (aka HRW)[0]?
|
| It feels like it would solve all the requirements they laid out,
| is fully client-side, and doesn't require real-time updates to
| the host list via discovery.
|
| [0] https://en.wikipedia.org/wiki/Rendezvous_hashing
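|
| For those unfamiliar, HRW fits in a few lines; every client
| independently picks the highest-scoring node for a key, with no
| coordination (hash choice here is illustrative):
|
|     package main
|
|     import (
|         "fmt"
|         "hash/fnv"
|     )
|
|     // pick returns the node with the highest hash(key, node)
|     // score; all clients compute the same answer.
|     func pick(key string, nodes []string) string {
|         var best string
|         var bestScore uint64
|         for _, n := range nodes {
|             h := fnv.New64a()
|             h.Write([]byte(key))
|             h.Write([]byte(n))
|             if s := h.Sum64(); s >= bestScore {
|                 best, bestScore = n, s
|             }
|         }
|         return best
|     }
|
|     func main() {
|         nodes := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
|         fmt.Println(pick("request-key-123", nodes))
|     }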
| deviation wrote:
| HRW would cover the simple case, but they needed way more: e.g.
| per-request balancing, zone affinity, live health checks,
| spillover, ramp-ups, etc. Once you need all that dynamic
| behavior, plain hashing just doesn't cut it IMO. A custom
| client-side + discovery setup makes more sense.
| dastbe wrote:
| The problem is that they want to apply a number of
| stateful/lookaside load-balancing strategies, which become more
| difficult in a fully decentralized system. It's generally easier
| to asynchronously aggregate information and either decide
| routing updates centrally or redistribute that aggregate to
| inform local decisions.
| dilyevsky wrote:
| Curious why cross-cluster load balancing would be necessary in a
| setup where you operate "thousands of clusters"? I assume these
| are per-customer isolated environments?
___________________________________________________________________
(page generated 2025-10-01 23:01 UTC)