[HN Gopher] How Netflix uses eBPF flow logs at scale for network...
       ___________________________________________________________________
        
       How Netflix uses eBPF flow logs at scale for network insight
        
       Author : el_duderino
       Score  : 246 points
       Date   : 2021-06-08 13:42 UTC (9 hours ago)
        
 (HTM) web link (netflixtechblog.com)
 (TXT) w3m dump (netflixtechblog.com)
        
       | Hikikomori wrote:
       | Datadog does something similar with the network performance
       | monitoring option in their agent; it also uses eBPF. As a
       | network engineer I've wanted tools like this for many years (say
       | goodbye to tcpdump, in most cases). The data it produces is
       | incredibly useful in a cloud environment where you typically do
       | not get insights like this from the native toolset.
       | 
       | https://www.datadoghq.com/blog/network-performance-monitorin...
        
         | redis_mlc wrote:
         | The history is that Datadog hired the Boundary people, who
         | added this feature about 2 years ago.
         | 
         | (Boundary was a small SF startup that created a network flow
         | product. Netflix evaluated it for use on Cassandra clusters.
         | BMC acquired Boundary in 2015.)
         | 
         | https://www.crunchbase.com/organization/boundary
        
       | [deleted]
        
       | cranekam wrote:
       | Facebook has a similar system, unsurprisingly. An agent runs on
       | every host to sample 1 in N packets, accumulating counts by
       | source/dest host/cluster/service/container/port and so on before
       | sending the aggregate data to Scuba [0] for ad-hoc analysis. This
       | tool was really useful -- in a matter of seconds you could see
       | traffic types and volumes broken down by almost any dimension.
       | Did service X see a huge jump in traffic last week? From where?
       | Which container or service? How much bandwidth did your
       | compression changes save? And so on. It also had some really neat
       | stuff to identify whether or not a flow was TLSed, which was
       | crucial for working out what still needed to be encrypted in
       | light of the Snowden revelations.
       | 
       | TCP retransmits were also sampled in detail. Being able to see at
       | a glance if e.g. every host in a rack is the source of (or target
       | for) retransmits made troubleshooting much faster.
       | 
       | These systems were really awesome and a good example of what
       | could be built with relatively little effort (the teams involved
       | were small) when you already have great components for reliable
       | log transmission and flexible data analysis.
       | 
       | [0] https://research.fb.com/publications/scuba-diving-into-
       | data-...
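       | 
       | (A rough sketch of that per-host sample-and-aggregate idea, in
       | Python; the dimension names and sample rate here are invented,
       | not FB's or Netflix's: keep 1 in N packets, accumulate counters
       | keyed by a dimension tuple, and scale the samples back up when
       | shipping the aggregates.)
       | 
       |     import random
       |     import time
       |     from collections import Counter, namedtuple
       | 
       |     # Hypothetical flow key; the real systems use many more
       |     # dimensions (cluster, service, rack, ...).
       |     FlowKey = namedtuple("FlowKey", "src_host dst_host service dst_port")
       | 
       |     SAMPLE_RATE = 1000       # keep 1 in N packets
       | 
       |     counts = Counter()       # FlowKey -> sampled packet count
       |     bytes_seen = Counter()   # FlowKey -> sampled byte count
       | 
       |     def on_packet(key, length):
       |         """Called per packet; keeps only 1 in SAMPLE_RATE of them."""
       |         if random.randrange(SAMPLE_RATE) != 0:
       |             return
       |         counts[key] += 1
       |         bytes_seen[key] += length
       | 
       |     def flush(sink):
       |         """Scale samples back up and ship one row per flow key."""
       |         now = int(time.time())
       |         for key, n in counts.items():
       |             sink.append({**key._asdict(),
       |                          "est_packets": n * SAMPLE_RATE,
       |                          "est_bytes": bytes_seen[key] * SAMPLE_RATE,
       |                          "ts": now})
       |         counts.clear()
       |         bytes_seen.clear()
       | 
       | Grouping the shipped rows by any subset of the key's fields then
       | answers the "which service, from where" questions above.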
        
         | hintymad wrote:
         | > TCP retransmits were also sampled in detail.
         | 
         | Hosts usually collect SNMP metrics, which include TCP
         | retransmits and more. Do you know what SNMP was lacking
         | compared to eBPF? All I can think of is more dimensions in
         | eBPF's case.
        
           | Hikikomori wrote:
           | eBPF can give detailed stats and TCP state information on a
           | per-connection (flow) basis, which is much more powerful than
           | the aggregated TCP stats you can grab with SNMP.
        
           | hintymad wrote:
           | Thanks for all the answers. I learned so much!
        
           | paulfurtado wrote:
           | Doing it with eBPF gives you a hook for each retransmit. This
           | makes it possible to know the exact connection, process, and
           | network interface that hit the retransmit and allows you to
           | measure things like "100s of hosts are hitting 1000s of
           | retransmits to 10.0.0.56:443", which 10.0.0.56's netstat
           | metrics may not clearly indicate. It gets more interesting if
           | you break things down by VM hosts, racks, rows, data centers,
           | etc.
           | 
           | If you go deeper with the eBPF tracing, you can also
           | determine which code path the retransmit occurred on, which
           | may or may not be interesting.
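           | 
           | (A minimal sketch of that kind of hook, assuming the bcc
           | Python bindings and a kernel that exposes the
           | tcp:tcp_retransmit_skb tracepoint; IPv4 only, run as root. It
           | counts retransmits per flow, in the spirit of bcc's
           | tcpretrans tool -- not how Netflix or FB actually do it.)
           | 
           |     from socket import AF_INET, inet_ntop, ntohs
           |     from struct import pack
           |     from time import sleep
           | 
           |     from bcc import BPF
           | 
           |     prog = r"""
           |     #include <net/sock.h>
           | 
           |     struct key_t {
           |         u32 saddr;
           |         u32 daddr;
           |         u16 dport;
           |     };
           |     BPF_HASH(counts, struct key_t, u64);
           | 
           |     TRACEPOINT_PROBE(tcp, tcp_retransmit_skb) {
           |         const struct sock *sk = (const struct sock *)args->skaddr;
           |         struct key_t key = {};
           |         key.saddr = sk->__sk_common.skc_rcv_saddr;
           |         key.daddr = sk->__sk_common.skc_daddr;
           |         key.dport = sk->__sk_common.skc_dport;  /* net order */
           |         counts.increment(key);
           |         return 0;
           |     }
           |     """
           | 
           |     b = BPF(text=prog)
           |     print("Counting TCP retransmits per flow... Ctrl-C to dump.")
           |     try:
           |         sleep(999999)
           |     except KeyboardInterrupt:
           |         pass
           | 
           |     for key, count in sorted(b["counts"].items(),
           |                              key=lambda kv: -kv[1].value):
           |         print("%s -> %s:%d  retransmits=%d" % (
           |             inet_ntop(AF_INET, pack("I", key.saddr)),
           |             inet_ntop(AF_INET, pack("I", key.daddr)),
           |             ntohs(key.dport),
           |             count.value))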
        
           | takeda wrote:
           | eBPF, similarly to DTrace, allows you to view the internals
           | of your OS. With SNMP you have a fixed number of metrics that
           | you can view, while with eBPF you can create new ones. You
           | could implement an SNMP daemon that uses eBPF to get the
           | data, and perhaps that will happen in the future, if it
           | hasn't already.
        
           | cranekam wrote:
           | Correct me if I'm wrong but the SNMP retransmit counters are
           | just that: a count of retransmits the host sent. Raw
           | retransmit counts are often just a vague indication that
           | something's up -- the host is retransmitting, either because
           | there's loss on the path or the receiver is overloaded. But
           | given that a host can talk over many paths to many other
           | hosts, a raw count isn't specific enough to be useful.
           | 
           | The system Facebook built (which predated eBPF and used a
           | custom ftrace event) produced, effectively, tuples of
           | `(src_ip, src_port, dst_ip, dst_port, src_container,
           | dst_container, ...)` and aggregated them over all hosts.
           | This allowed counting
           | retransmits by, say, receiving host. If there's one host that
           | has a bad cable and is receiving retransmits from 1000
           | clients we may not see that signal in simple counters on the
           | clients -- for them it's just a tiny bump in the overall
           | retransmit rate. But if we aggregate by receiving host the
           | bad guy will stand out like a sore thumb. Same thing for all
           | the hosts in a rack, or all hosts reachable over a given
           | router interface, or whatever else you want. One of my common
           | workflows when facing a bump in general errors (e.g. timeouts
           | to the cache layer) was to quickly try grouping retransmits
           | by a few dimensions to see if one particular combination of
           | hosts stood out as sending or receiving more retransmits.
           | 
           | tl;dr: the SNMP data is one-dimensional. FB's system allowed
           | aggregating and querying by many dimensions. This is really
           | useful when there are thousands of machines talking to each
           | other over many network paths.
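           | 
           | (A toy illustration of that group-by, with invented field
           | names -- the real pipeline aggregated samples from every
           | host:)
           | 
           |     from collections import Counter
           | 
           |     def top_receivers(records, group_by="dst_ip", top_n=5):
           |         """Group per-flow retransmit records by one dimension."""
           |         totals = Counter()
           |         for r in records:
           |             totals[r[group_by]] += r["retransmits"]
           |         return totals.most_common(top_n)
           | 
           |     # One host with a bad cable receives retransmits from 1000
           |     # clients: each client's own counter barely moves, but
           |     # grouping by receiver makes 10.0.0.56 stand out at once.
           |     records = [{"src_ip": "10.0.1.%d" % i, "dst_ip": "10.0.0.56",
           |                 "retransmits": 3} for i in range(1000)]
           |     records.append({"src_ip": "10.0.1.1", "dst_ip": "10.0.0.99",
           |                     "retransmits": 7})
           | 
           |     print(top_receivers(records))
           |     # [('10.0.0.56', 3000), ('10.0.0.99', 7)]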
        
           | ikiris wrote:
           | You're comparing apples and pumpkins.
           | 
           | SNMP is a query mechanism; eBPF is a sampling mechanism.
        
             | hintymad wrote:
             | Aha! I was too removed from the infrastructure then. All I
             | knew was that SNMP metrics showed up in our telemetry
             | system, and we got a set of standard metrics to look into.
             | Our platform team took care of having a sampling agent in
             | place. I didn't know the protocol was about querying
             | instead of sampling.
        
       | seanalltogether wrote:
       | I don't know if they're allowed to post images because of the
       | information that might be contained within, but this post is very
       | hard to follow without some sort of visualization of what they
       | generate with all this.
        
         | jeffbee wrote:
         | Is your question along the lines of what anyone would do with
         | netflow data at all? Check out COTS netflow products like
         | Kentik to see what they do.
         | 
         | https://www.kentik.com/product/core/
        
         | tptacek wrote:
         | The canonical answer to this question is ntop.
        
       | Jedd wrote:
       | Netflix traffic patterns are definitely unlike those of almost
       | every other large network, but I do wonder how much they're
       | missing by sampling TCP only.
       | 
       | TFA mentions transport of (aggregated) flow data back to base
       | over a choice of protocols, including UDP, which makes sense --
       | you don't want your monitoring data affecting your business data
       | when you get close to redlining. (You'd expect you'd have enough
       | forensic data leading up to that point to make some informed
       | decisions.)
       | 
       | QUIC runs over UDP, and I can imagine that growing rapidly for
       | most corporate & public-facing networks.
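       | 
       | (For illustration only -- the collector endpoint and record
       | format here are made up, not what TFA uses -- the "don't hurt
       | production" property is just best-effort, non-blocking UDP:)
       | 
       |     import json
       |     import socket
       | 
       |     # Hypothetical collector; the real transport and schema differ.
       |     COLLECTOR = ("flow-collector.internal", 9995)
       | 
       |     sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
       |     sock.setblocking(False)  # never stall the caller
       | 
       |     def export_flows(flows):
       |         """Best-effort export: if the datagram can't go out, drop it."""
       |         try:
       |             sock.sendto(json.dumps(flows).encode(), COLLECTOR)
       |         except OSError:
       |             pass  # losing monitoring data beats hurting prod traffic
       | 
       |     export_flows([{"src": "10.0.0.1", "dst": "10.0.0.2",
       |                    "dst_port": 443, "bytes": 123456}])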
        
       | dstick wrote:
       | Half of this makes no sense to me whatsoever, but it's
       | fascinating nonetheless! Seems like a huge challenge. And if
       | anyone would care to explain to me what "capacity posture" means
       | in the following sentence:
       | 
       | "Without having network visibility, it's difficult to improve our
       | reliability, security and capacity posture."
       | 
       | I'd be one happy camper :)
        
         | tikkabhuna wrote:
         | I've heard "security posture" before which relates to how your
         | organisation is currently set up to handle InfoSec events and
         | activities.
         | 
         | https://www.balbix.com/insights/what-is-cyber-security-postu...
         | 
         | Extrapolating that, I would say that capacity posture is
         | planning around expected capacity. As an example, perhaps they
         | say a given data centre has to handle twice the peak bandwidth.
        
         | erosenbe0 wrote:
         | The article is a good start but could use a lot of editing.
         | 
         | I'd look at the eBPF books and articles by Gregg.
        
         | junon wrote:
         | The three words "reliability, security [and] capacity" are
         | modifiers, in list form, of the noun "posture" that follows.
         | Posture here simply means "ability, position(ing), readiness,
         | attributes, quality", etc.
         | 
         | From Webster:
         | 
         | > Posture (n.) - state or condition at a given time especially
         | with respect to capability in particular circumstances
         | 
         | To be clear, you could re-write it more verbosely:
         | 
         | > Without having network visibility, it's difficult to improve
         | our reliability posture, security posture, and capacity
         | posture.
         | 
         | While not _exactly_ the same meaning (the original sentence
         | groups them together as a single, unified idea), maybe it helps
         | parse the grammar a bit better.
        
         | wcarss wrote:
         | I'm also guessing, but I think it's something like their plan
         | for how much capacity (i.e. network bandwidth) to have
         | available at particular times and places, and their strategy
         | for updating the plan.
        
           | edf13 wrote:
           | I'd agree with your translation...
           | 
           | Estimating bandwidth requirements is hard. Many use an
           | educated guess plus over-provisioning.
        
         | ww520 wrote:
         | Network bandwidth utilization doesn't degrade gracefully. When
         | network saturation reaches 80% or more, performance can degrade
         | drastically and usage can fall off a cliff. It's important to
         | monitor network bandwidth utilization and raise alarms at lower
         | thresholds so that you can add bandwidth capacity in real time,
         | divert traffic, or rate limit to slow down the downloads. That
         | relates to the reliability of the service and real-time
         | bandwidth capacity adjustment. For long-term planning, it's
         | important to collect statistics on peak/average/median network
         | usage for capital expenditure purposes: spending money on
         | buying more bandwidth, adding switches, routers, and servers,
         | or building more data centers. This deals with long-term
         | capacity planning.
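         | 
         | (A back-of-the-envelope sketch of that kind of alarm; the
         | interface, link speed and threshold below are invented. Read
         | the Linux interface byte counters, compute utilization over an
         | interval, and alert well before the cliff:)
         | 
         |     import time
         | 
         |     IFACE = "eth0"                       # hypothetical values
         |     LINK_BITS_PER_SEC = 10_000_000_000   # 10 Gbit/s
         |     ALERT_THRESHOLD = 0.70               # alarm before ~80% cliff
         |     INTERVAL_S = 10
         | 
         |     def read_bytes(iface):
         |         """(rx_bytes, tx_bytes) for iface from /proc/net/dev."""
         |         with open("/proc/net/dev") as f:
         |             for line in f:
         |                 if line.strip().startswith(iface + ":"):
         |                     fields = line.split(":", 1)[1].split()
         |                     return int(fields[0]), int(fields[8])
         |         raise ValueError("interface %s not found" % iface)
         | 
         |     prev_rx, prev_tx = read_bytes(IFACE)
         |     while True:
         |         time.sleep(INTERVAL_S)
         |         rx, tx = read_bytes(IFACE)
         |         # Utilization of the busier direction over the interval.
         |         bits = max(rx - prev_rx, tx - prev_tx) * 8
         |         util = bits / (LINK_BITS_PER_SEC * INTERVAL_S)
         |         if util >= ALERT_THRESHOLD:
         |             print("ALERT: %s at %.0f%% util" % (IFACE, util * 100))
         |         prev_rx, prev_tx = rx, tx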
        
       | jeffbee wrote:
       | I got a weird sense of deja vu reading this, searched around,
       | and realized the first half is copied and pasted from another
       | year-old blog post. Last year they were "at Hyper Scale" and this
       | year their flow logs are only "at scale" so I guess they're
       | shrinking.
       | 
       | https://netflixtechblog.com/hyper-scale-vpc-flow-logs-enrich...
        
         | Hikikomori wrote:
         | The older article is about VPC flow logs; in the new one they
         | are using eBPF on instances/containers to gather flow
         | information.
        
       | kthxb wrote:
       | Of course Brendan Gregg, the god of eBPF, tracing, and all things
       | profiling, has his finger in the pie.
        
         | bostonsre wrote:
         | The dude is my tech hero, just awe inspiring.
        
         | Bayart wrote:
         | I've got his _Systems Performance_ book, and it's really
         | fantastic. This guy is incredible. I really need to pick up his
         | BPF one.
         | 
         | He's got a good blog too [1] !
         | 
         | [1] http://www.brendangregg.com/blog/
        
       | ksec wrote:
       | Wondering if this is on their Linux servers only, or does it
       | also work on FreeBSD on their edge appliances?
        
         | takeda wrote:
         | For FreeBSD they probably use DTrace to get that.
        
         | cyberpunk wrote:
         | It uses eBPF, so just the Linux boxes.
        
       | chrissnell wrote:
       | We had a poor man's version of this in 2006 at Backcountry.com.
       | We ran OpenBSD firewalls on the edge and used pfflowd to
       | translate the pfsync messages (the protocol used to synchronize
       | two PF firewalls in an HA configuration) into Netflow datagrams,
       | which we could then monitor in real time with top(1)-like tools.
       | It was awesome. I miss having that kind of visibility.
        
         | eb0la wrote:
         | Netflow! My first "big data" project was getting netflow data
         | from 100s of routers in an internet backbone and starting to
         | learn where traffic came from and whether it was worth doing a
         | peering agreement with them.
         | 
         | It stopped working the day Google activated their proxies on
         | mobile networks. I saw Google traffic increase 1-2% week-over-
         | week for several months. I quit the job when it was >30% of the
         | backbone traffic and don't know what happened in the end...
        
       ___________________________________________________________________
       (page generated 2021-06-08 23:00 UTC)