[HN Gopher] Intelligent Kubernetes Load Balancing at Databricks
       ___________________________________________________________________
        
       Intelligent Kubernetes Load Balancing at Databricks
        
       Author : ayf
       Score  : 113 points
       Date   : 2025-10-01 05:06 UTC (17 hours ago)
        
 (HTM) web link (www.databricks.com)
 (TXT) w3m dump (www.databricks.com)
        
       | bbkane wrote:
       | Thanks for writing - I found the Power of Two Choices algorithm
       | particularly interesting (I haven't seen it before).
       | 
        | From the recent gRPConf (
        | https://www.youtube.com/playlist?list=PLj6h78yzYM2On4kCcnWjl... )
        | it seems gRPC as a standard is also moving toward this
        | "proxyless" model - gRPC will read xDS itself.
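        | 
        | After reading up on it, here's a minimal Go sketch of the idea
        | as I understand it (a toy version, not the article's actual
        | implementation): pick two backends at random and send the
        | request to whichever has fewer requests in flight.
        | 
        |     package main
        | 
        |     import (
        |         "fmt"
        |         "math/rand"
        |         "sync/atomic"
        |     )
        | 
        |     // backend tracks in-flight requests for one endpoint.
        |     type backend struct {
        |         addr     string
        |         inflight atomic.Int64
        |     }
        | 
        |     // pickP2C samples two backends at random and returns the
        |     // one with the smaller in-flight count.
        |     func pickP2C(pool []*backend) *backend {
        |         a := pool[rand.Intn(len(pool))]
        |         b := pool[rand.Intn(len(pool))]
        |         if b.inflight.Load() < a.inflight.Load() {
        |             return b
        |         }
        |         return a
        |     }
        | 
        |     func main() {
        |         pool := []*backend{
        |             {addr: "10.0.0.1:443"},
        |             {addr: "10.0.0.2:443"},
        |             {addr: "10.0.0.3:443"},
        |         }
        |         chosen := pickP2C(pool)
        |         chosen.inflight.Add(1) // held while the request runs
        |         fmt.Println("routing to", chosen.addr)
        |     }
        | 
        | The nice property is that it only needs local in-flight counts,
        | yet avoids the herding you get when every client chases the
        | single "least loaded" backend.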
        
         | walth wrote:
         | You might be interested in nginx's implementation
         | 
         | https://nginx.org/en/docs/http/ngx_http_upstream_module.html...
        
       | shizcakes wrote:
       | Less featureful than this, but we've been doing GRPC client side
       | load balancing with kuberesolver[1] since 2018. It allows GRPC to
       | handle the balancer implementations. It's been rock solid for
       | more than half a decade now.
       | 
       | 1: https://github.com/sercand/kuberesolver
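        | 
        | For anyone curious, the wiring is roughly this (a sketch; the
        | service name is a placeholder and the module path may differ by
        | version, so check the README for the exact import):
        | 
        |     package main
        | 
        |     import (
        |         "log"
        | 
        |         "github.com/sercand/kuberesolver/v5"
        |         "google.golang.org/grpc"
        |         "google.golang.org/grpc/credentials/insecure"
        |     )
        | 
        |     func main() {
        |         // Registers the "kubernetes" scheme; targets are then
        |         // resolved by watching endpoints via the K8s API.
        |         kuberesolver.RegisterInCluster()
        | 
        |         creds := grpc.WithTransportCredentials(
        |             insecure.NewCredentials())
        |         // Spread RPCs across all resolved pod IPs.
        |         rr := grpc.WithDefaultServiceConfig(
        |             `{"loadBalancingConfig":[{"round_robin":{}}]}`)
        | 
        |         conn, err := grpc.Dial(
        |             "kubernetes:///my-service.default:grpc", creds, rr)
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         defer conn.Close()
        |     }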
        
         | azaras wrote:
         | What is the difference between Kuberesolver and using a
         | Headless Service?
         | 
          | In the README.md file, they compare it with a ClusterIP
          | service, but not with a headless one ("clusterIP: None").
          | 
          | The advantage of using Kuberesolver is that you do not need to
          | tune DNS refresh and cache settings. However, I think tuning
          | DNS settings is preferable to having the application call the
          | Kubernetes API directly.
        
           | euank wrote:
            | I can give an n=1 anecdote here: the DNS resolver used to
            | have hard-coded caching, which meant it was unresponsive to
            | pod updates and caused mini 30-second outages.
           | 
           | The code in question was: https://github.com/grpc/grpc-
           | go/blob/b597a8e1d0ce3f63ef8a7b6...
           | 
            | That meant that deploying a service which drained in less
            | than 30s caused a mini-outage for that service until the
            | in-process DNS cache expired, with of course no way to
            | configure it.
           | 
           | Kuberesolver streams updates, and thus lets clients talk to
           | new pods almost immediately.
           | 
           | I think things are a little better now, but based on my
           | reading of https://github.com/grpc/grpc/issues/12295, it
           | looks like the dns resolver still might not resolve new pod
           | names quickly in some cases.
        
         | gaurav324 wrote:
         | kuberesolver is an interesting take as well. Directly watching
         | the K8s API from each client could raise scaling concerns at
         | very large scale, but it does open the door to using richer
         | Kubernetes metadata for smarter load-balancing decisions.
         | thanks for sharing!
        
           | debarshri wrote:
            | I think with some rate limiting it can scale. But it might
            | be a security issue, as ideally you don't want the client to
            | be aware of Kubernetes; also, it would be difficult to scope
            | the access.
        
           | arccy wrote:
           | if you don't want to expose k8s then there's the generic xds
           | protocol
        
         | darkstar_16 wrote:
         | We use a headless service and client side load balancing for
         | this. What's the difference ?
        
           | arccy wrote:
           | instead of polling for endpoint updates, they're pushed to
           | the client through k8s watches
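            | 
            | roughly what that looks like with client-go (a generic
            | sketch, not kuberesolver's actual code; the service name and
            | namespace are placeholders):
            | 
            |     package main
            | 
            |     import (
            |         "context"
            |         "fmt"
            | 
            |         discoveryv1 "k8s.io/api/discovery/v1"
            |         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            |         "k8s.io/client-go/kubernetes"
            |         "k8s.io/client-go/rest"
            |     )
            | 
            |     func main() {
            |         cfg, err := rest.InClusterConfig()
            |         if err != nil {
            |             panic(err)
            |         }
            |         client := kubernetes.NewForConfigOrDie(cfg)
            | 
            |         // every add/update/delete of the service's
            |         // EndpointSlices is pushed to us; no DNS polling.
            |         opts := metav1.ListOptions{
            |             LabelSelector: "kubernetes.io/service-name=my-svc",
            |         }
            |         w, err := client.DiscoveryV1().
            |             EndpointSlices("default").
            |             Watch(context.Background(), opts)
            |         if err != nil {
            |             panic(err)
            |         }
            |         for ev := range w.ResultChan() {
            |             es := ev.Object.(*discoveryv1.EndpointSlice)
            |             // print the event type and endpoint count
            |             fmt.Println(ev.Type, len(es.Endpoints))
            |         }
            |     }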
        
         | hanikesn wrote:
         | I've been using a standardized xds resolver[1]. The benefit
         | here is that you don't have to patch grpc clients.
         | 
         | [1] https://github.com/wongnai/xds
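          | 
          | In case it helps anyone, the client side is just the stock
          | grpc-go xDS support (the target name below is a placeholder,
          | and GRPC_XDS_BOOTSTRAP has to point at a bootstrap file naming
          | your xDS server):
          | 
          |     package main
          | 
          |     import (
          |         "log"
          | 
          |         "google.golang.org/grpc"
          |         "google.golang.org/grpc/credentials/insecure"
          |         _ "google.golang.org/grpc/xds" // registers xds:///
          |     )
          | 
          |     func main() {
          |         creds := grpc.WithTransportCredentials(
          |             insecure.NewCredentials())
          |         conn, err := grpc.Dial(
          |             "xds:///my-service.default.svc:50051", creds)
          |         if err != nil {
          |             log.Fatal(err)
          |         }
          |         defer conn.Close()
          |     }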
        
           | atombender wrote:
           | Do you know how this compares to the Nginx ingress
           | controller, which has a native gRPC mode?
        
       | thewisenerd wrote:
        | we have the same issue with HTTP as well, due to HTTP keepalive,
        | which many clients enable out of the box.
        | 
        | the "impact" can be reduced by configuring an overall connection
        | TTL, so new pods take a little while to start receiving traffic,
        | but it evens out over time.
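        | 
        | for gRPC the rough equivalent (as far as i know) is capping
        | connection age on the server, which forces clients to re-resolve
        | and reconnect periodically, e.g.:
        | 
        |     package main
        | 
        |     import (
        |         "time"
        | 
        |         "google.golang.org/grpc"
        |         "google.golang.org/grpc/keepalive"
        |     )
        | 
        |     func main() {
        |         // recycle connections every ~5 minutes so new pods
        |         // start getting traffic shortly after a rollout.
        |         srv := grpc.NewServer(grpc.KeepaliveParams(
        |             keepalive.ServerParameters{
        |                 MaxConnectionAge:      5 * time.Minute,
        |                 MaxConnectionAgeGrace: 30 * time.Second,
        |             }))
        |         _ = srv // register services and call Serve() as usual
        |     }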
       | 
       | --
       | 
       | that said, i'm not surprised that even a company as large as
       | databricks feels that adding a service mesh is going to add
       | operational complexity.
       | 
        | looks like they've taken the best parts (endpoint watch, sync to
        | clients with xDS) and moved them client-side. compared to the
        | failure modes of a service mesh, this seems better.
        
         | gaurav324 wrote:
         | Yes, we've leaned toward minimizing operational overhead.
         | Taking the useful parts of a mesh (xDS endpoint and routing
         | updates) into the client has worked extremely well in practice
         | and has been very reliable, without the extra moving parts of a
         | full mesh.
        
         | agrawroh wrote:
          | When we started, we already had a lot of pieces like
          | Certificate Management in-house, and adding a full-blown
          | service mesh would have been a big operational overhead. We
          | began by building only the parts we needed and then integrated
          | things like xDS natively into the rest of our clients.
        
         | pm90 wrote:
          | I haven't been keeping up, but is there still hype over full
          | meshes like istio/linkerd? I've seen it tried in a couple of
          | places, but it didn't work super well; the last place couldn't
          | adopt it because datadog apparently bills sidecar containers
          | as additional hosts, so using a sidecar proxy would have
          | doubled our datadog bill.
        
       | thewisenerd wrote:
       | > kube-proxy supports only basic algorithms like round-robin or
       | random selection
       | 
       | this is "partially" true.
       | 
        | if you're using ipvs, you can set the scheduler to just about
        | anything ipvs supports (including wrr); they removed the
        | validation on the scheduler name quite a while back.
       | 
       | kubernetes itself though doesn't "understand" (i.e., can NOT
       | represent) the nuances (e.g., weights per endpoint with wrr),
       | which is the problem.
        
       | jedberg wrote:
       | I wonder why they didn't use rendezvous hashing (aka HRW)[0]?
       | 
        | It feels like it would solve all the requirements that they laid
        | out, is fully client-side, and doesn't require real-time updates
        | to the host list via discovery.
       | 
       | [0] https://en.wikipedia.org/wiki/Rendezvous_hashing
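        | 
        | For reference, the whole algorithm fits in a few lines of Go (a
        | toy sketch): every client hashes (key, host) for each host and
        | picks the highest score, so they all agree without coordinating.
        | 
        |     package main
        | 
        |     import (
        |         "fmt"
        |         "hash/fnv"
        |     )
        | 
        |     // pickHRW returns the host with the highest hash(key, host)
        |     // score; every client independently picks the same host.
        |     func pickHRW(key string, hosts []string) string {
        |         var best string
        |         var bestScore uint64
        |         for _, h := range hosts {
        |             f := fnv.New64a()
        |             f.Write([]byte(key))
        |             f.Write([]byte(h))
        |             if s := f.Sum64(); s >= bestScore {
        |                 best, bestScore = h, s
        |             }
        |         }
        |         return best
        |     }
        | 
        |     func main() {
        |         hosts := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
        |         fmt.Println(pickHRW("request-key-42", hosts))
        |     }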
        
         | deviation wrote:
          | HRW would cover the simple case, but they needed way more:
         | e.g. per-request balancing, zone affinity, live health checks,
         | spillover, ramp-ups, etc. Once you need all that dynamic
         | behavior, plain hashing just doesn't cut it IMO. A custom
         | client-side + discovery setup makes more sense.
        
         | dastbe wrote:
         | the problem is that they want to apply a number of
         | stateful/lookaside load balancing strategies, which become more
         | difficult to do in a fully decentralized system. it's generally
         | easier to asynchronously aggregate information and either
         | decide routing updates centrally or redistribute that aggregate
         | to inform local decisions.
        
       | dilyevsky wrote:
        | Curious why cross-cluster load balancing would be necessary in a
       | setup where you operate "thousands of clusters"? I assume these
       | are per-customer isolated environments?
        
       ___________________________________________________________________
       (page generated 2025-10-01 23:01 UTC)