[HN Gopher] How Netflix uses eBPF flow logs at scale for network...
___________________________________________________________________
How Netflix uses eBPF flow logs at scale for network insight
Author : el_duderino
Score : 246 points
Date : 2021-06-08 13:42 UTC (9 hours ago)
(HTM) web link (netflixtechblog.com)
(TXT) w3m dump (netflixtechblog.com)
| Hikikomori wrote:
| Datadog does something similar with the network performance
| monitoring option in their agent; it also uses eBPF. As a
| network engineer I've wanted tools like this for many years
| (say goodbye to tcpdump, in most cases). The data it produces
| is incredibly useful in a cloud environment, where you
| typically do not get insights like this from the native
| toolset.
|
| https://www.datadoghq.com/blog/network-performance-monitorin...
| redis_mlc wrote:
| The history is that Datadog hired the Boundary people, who
| added this feature about 2 years ago.
|
| (Boundary was a small SF startup that created a network flow
| product. Netflix evaluated it for use on Cassandra clusters.
| BMC acquired Boundary in 2015.)
|
| https://www.crunchbase.com/organization/boundary
| [deleted]
| cranekam wrote:
| Facebook has a similar system, unsurprisingly. An agent runs on
| every host to sample 1 in N packets, accumulating counts by
| source/dest host/cluster/service/container/port and so on before
| sending the aggregate data to Scuba [0] for ad-hoc analysis. This
| tool was really useful -- in a matter of seconds you could see
| traffic types and volumes broken down by almost any dimension.
| Did service X see a huge jump in traffic last week? From where?
| Which container or service? How much bandwidth did your
| compression changes save? And so on. It also had some really neat
| stuff to identify whether or not a flow was TLSed, which was
| crucial for working out what still needed to be encrypted in
| light of the Snowden revelations.
|
| TCP retransmits were also sampled in detail. Being able to see at
| a glance if e.g. every host in a rack is the source of (or target
| for) retransmits made troubleshooting much faster.
|
| These systems were really awesome and a good example of what
| could be built with relatively little effort (the teams involved
| were small) when you already have great components for reliable
| log transmission and flexible data analysis.
|
| [0] https://research.fb.com/publications/scuba-diving-into-
| data-...
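The sampling-and-aggregation approach described above can be sketched in a few lines. This is a minimal illustration, not Facebook's actual agent; the field names, dimension keys, and sample rate are all hypothetical:

```python
import random
from collections import Counter

SAMPLE_RATE = 1000  # sample 1 packet in N (illustrative rate)

def sample_and_aggregate(packets, sample_rate=SAMPLE_RATE):
    """Sample 1-in-N packets and accumulate estimated byte counts
    keyed by (src_service, dst_service, dst_port)."""
    counts = Counter()
    for pkt in packets:
        if random.randrange(sample_rate) != 0:
            continue  # packet not sampled
        key = (pkt["src_service"], pkt["dst_service"], pkt["dst_port"])
        # Scale the sampled packet back up to estimate the true
        # volume before shipping the aggregate off-host.
        counts[key] += pkt["bytes"] * sample_rate
    return counts
```

Shipping only these aggregated counters (rather than raw packets) is what keeps the per-host overhead small while still allowing breakdowns by almost any dimension downstream.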
| hintymad wrote:
| > TCP retransmits were also sampled in detail.
|
| Hosts usually collect SNMP metrics, which include TCP
| retransmits and more. Do you know what SNMP was lacking
| compared to eBPF? What I can think of is more dimensions in
| eBPF's case.
| Hikikomori wrote:
| eBPF can give detailed stats and TCP state information on a
| per-connection basis (flow), much more powerful than the
| aggregated TCP stats you can grab with SNMP.
| hintymad wrote:
| Thanks for all the answers. I learned so much!
| paulfurtado wrote:
| Doing it with eBPF gives you a hook for each retransmit, which
| makes it possible to know the exact connection, process, and
| network interface that hit the retransmit and allows you to
| measure things like "100s of hosts are hitting 1000s of
| retransmits to 10.0.0.56:443", which 10.0.0.56's netstat
| metrics may not clearly indicate. It gets more interesting if
| you break things down by VM hosts, racks, rows, data centers,
| etc.
|
| If you go deeper with the eBPF tracing, you can also
| determine which code path the retransmit occurred on, which
| may or may not be interesting.
| takeda wrote:
| eBPF, similarly to DTrace, allows you to view the internals of
| your OS. With SNMP you have a fixed number of metrics that you
| can view, while with eBPF you can create new ones. You could
| implement an SNMP daemon that uses eBPF to get the data, and
| perhaps that will happen in the future, if it hasn't already.
| cranekam wrote:
| Correct me if I'm wrong but the SNMP retransmit counters are
| just that: a count of retransmits the host sent. Raw
| retransmit counts are often just a vague indication that
| something's up -- the host is retransmitting, either because
| there's loss on the path or the receiver is overloaded. But
| given a host can talk over many paths to many other hosts a
| raw count isn't specific enough to be useful.
|
| The system Facebook built (which predated eBPF, using a custom
| ftrace event) produced, effectively, tuples of `(src_ip,
| src_port, dst_ip, dst_port, src_container, dst_container,
| ...)` and
| aggregated them over all hosts. This allowed counting
| retransmits by, say, receiving host. If there's one host that
| has a bad cable and is receiving retransmits from 1000
| clients we may not see that signal in simple counters on the
| clients -- for them it's just a tiny bump in the overall
| retransmit rate. But if we aggregate by receiving host the
| bad guy will stand out like a sore thumb. Same thing for all
| the hosts in a rack, or all hosts reachable over a given
| router interface, or whatever else you want. One of my common
| workflows when facing a bump in general errors (e.g. timeouts
| to the cache layer) was to quickly try grouping retransmits
| by a few dimensions to see if one particular combination of
| hosts stood out as sending or receiving more retransmits.
|
| tl;dr: the SNMP data is one-dimensional. FB's system allowed
| aggregating and querying by many dimensions. This is really
| useful when there are thousands of machines talking to each
| other over many network paths.
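A toy version of that multi-dimensional grouping workflow, assuming per-flow records have already been collected (field names and addresses here are hypothetical):

```python
from collections import Counter

def group_retransmits(events, dims):
    """Aggregate per-flow retransmit records by an arbitrary
    tuple of dimensions (e.g. receiving host, rack, interface)."""
    counts = Counter()
    for ev in events:
        counts[tuple(ev[d] for d in dims)] += ev["retransmits"]
    return counts

# Grouping by receiver makes a single bad host stand out even
# when each client only sees a tiny bump in its own counters:
events = [
    {"src_ip": "10.0.1.1", "dst_ip": "10.0.0.56", "retransmits": 4},
    {"src_ip": "10.0.1.2", "dst_ip": "10.0.0.56", "retransmits": 5},
    {"src_ip": "10.0.1.3", "dst_ip": "10.0.0.99", "retransmits": 1},
]
by_receiver = group_retransmits(events, ("dst_ip",))
worst, count = by_receiver.most_common(1)[0]
```

Regrouping the same records by rack or by router interface is just a different `dims` argument, which is what makes the "try a few dimensions until one stands out" workflow fast.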
| ikiris wrote:
| you're comparing apples and pumpkins.
|
| SNMP is a query mechanism; eBPF is the sampling mechanism.
| hintymad wrote:
| Aha! I was too removed from the infrastructure then. All I
| knew was that SNMP metrics showed up in our telemetry
| system, and we got a set of standard metrics to look into.
| Our platform team took care of having a sampling agent in
| place. I didn't know the protocol was about querying
| instead of sampling.
| seanalltogether wrote:
| I don't know if they're allowed to post images because of the
| information that might be contained within, but this post is very
| hard to follow without some sort of visualization of what they
| generate with all this.
| jeffbee wrote:
| Is your question along the lines of what anyone would do with
| netflow data at all? Check out COTS netflow products like
| Kentik to see what they do.
|
| https://www.kentik.com/product/core/
| tptacek wrote:
| The canonical answer to this question is ntop.
| Jedd wrote:
| Netflix traffic patterns are definitely unlike almost every other
| large network, but I do wonder how much they're missing by
| sampling TCP only.
|
| TFA mentions transport of (aggregated) flow data back to base
| over a choice of protocols, including UDP, which makes sense --
| you don't want your monitoring data affecting your business data
| when you get close to redlining. (You'd expect you'd have enough
| forensic data leading up to that point to make some informed
| decisions.)
|
| QUIC runs over UDP, and I can imagine that growing rapidly for
| most corporate & public-facing networks.
| dstick wrote:
| Half of this makes no sense to me whatsoever, but it's
| fascinating nonetheless! Seems like a huge challenge. And if
| anyone would care to explain to me what "capacity posture" means
| in the following sentence:
|
| "Without having network visibility, it's difficult to improve our
| reliability, security and capacity posture."
|
| I'd be one happy camper :)
| tikkabhuna wrote:
| I've heard "security posture" before which relates to how your
| organisation is currently set up to handle InfoSec events and
| activities.
|
| https://www.balbix.com/insights/what-is-cyber-security-postu...
|
| Extrapolating that, I would say that capacity posture is
| planning around expected capacity. As an example, perhaps they
| say a given data centre has to handle twice the amount of
| bandwidth of the peak.
| erosenbe0 wrote:
| The article is a good start, but it could use a lot of editing.
|
| I'd look at the eBPF books and articles by Gregg.
| junon wrote:
| The three words "reliability, security [and] capacity" act as
| a list of modifiers for the noun "posture" that follows.
| Posture here simply means "ability, position(ing), readiness,
| attributes, quality", etc.
|
| From Webster:
|
| > Posture (n.) - state or condition at a given time especially
| with respect to capability in particular circumstances
|
| To be clear, you could re-write it more verbosely:
|
| > Without having network visibility, it's difficult to improve
| our reliability posture, security posture, and capacity
| posture.
|
| While not _exactly_ the same meaning (the original sentence
| groups them together as a single, nuclear idea), maybe it helps
| parse the grammar a bit better.
| wcarss wrote:
| I'm also guessing, but I think it's something like their plan
| for how much capacity (i.e. network bandwidth) to have
| available at particular times and places, and their strategy
| for updating the plan.
| edf13 wrote:
| I'd agree with your translation...
|
| Estimating bandwidth requirements is hard. Many use educated
| guess + over provision.
| ww520 wrote:
| Network bandwidth utilization doesn't degrade gracefully. When
| network saturation reaches 80% or more, performance can degrade
| drastically and usage can fall off a cliff. It's important to
| monitor network bandwidth utilization and raise alarms at lower
| thresholds so that you can add bandwidth capacity in real time,
| divert traffic, or rate limit to slow down downloads. That
| relates to the reliability of the service and real-time
| bandwidth capacity adjustment. For long-term planning, it's
| important to collect statistics on peak/average/median network
| usage for capital expenditure purposes: spending money on
| buying more bandwidth, adding switches, routers, and servers,
| or building more data centers. This deals with long-term
| capacity planning.
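Both halves of that — alarming below the saturation cliff in real time, and peak/average/median stats for long-term planning — can be sketched in a few lines. The link capacity and alert threshold here are illustrative assumptions:

```python
import statistics

LINK_CAPACITY_BPS = 10e9  # assumed 10 Gb/s link
ALERT_THRESHOLD = 0.8     # alarm well before the saturation cliff

def utilization_report(samples_bps, capacity=LINK_CAPACITY_BPS):
    """Summarize utilization samples and flag threshold crossings."""
    util = [s / capacity for s in samples_bps]
    return {
        "peak": max(util),                      # for capex planning
        "average": statistics.mean(util),
        "median": statistics.median(util),
        "alert": max(util) >= ALERT_THRESHOLD,  # real-time signal
    }
```

The point of the 80% threshold is exactly the non-linearity described above: by the time utilization reads 100%, service quality has already collapsed, so the alarm has to fire with headroom to spare.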
| jeffbee wrote:
| I got a weird sense of deja vu reading this, searched around
| and realized the first half is copied and pasted from another
| year-old blog post. Last year they were "at Hyper Scale" and this
| year their flow logs are only "at scale" so I guess they're
| shrinking.
|
| https://netflixtechblog.com/hyper-scale-vpc-flow-logs-enrich...
| Hikikomori wrote:
| The older article is about VPC flow logs, in the new one they
| are using eBPF on instances/containers to gather flow
| information.
| kthxb wrote:
| Of course Brendan Gregg, the god of eBPF, tracing, and all
| things profiling, has his finger in the pie.
| bostonsre wrote:
| The dude is my tech hero, just awe inspiring.
| Bayart wrote:
| I've got his _Systems Performance_ book, and it's really
| fantastic. This guy is incredible. I really need to pick up
| his BPF one.
|
| He's got a good blog too [1] !
|
| [1] http://www.brendangregg.com/blog/
| ksec wrote:
| Wondering if this is on their Linux servers only, or does it
| also work on FreeBSD on their edge appliances?
| takeda wrote:
| For FreeBSD they probably use DTrace to get that.
| cyberpunk wrote:
| It uses eBPF, so just the Linux boxes.
| chrissnell wrote:
| We had a poor man's version of this in 2006 at Backcountry.com.
| We ran OpenBSD firewalls on the edge and used pfflowd to
| translate the pfsync messages (the protocol used to synchronize
| two PF firewalls in a HA configuration) into Netflow datagrams,
| which we could then monitor in real time with top(1)-like tools.
| It was awesome. I miss having that kind of visibility.
| eb0la wrote:
| Netflow! My first "big data" project was collecting netflow
| data from 100s of routers in an internet backbone to learn
| where traffic came from and whether it was worth doing a
| peering agreement with them.
|
| It stopped working the day Google activated their proxies on
| mobile networks. I saw Google traffic increase 1-2% week-over-
| week for several months. I quit the job when it was >30% of
| the backbone traffic, so I don't know what happened in the
| end...
___________________________________________________________________
(page generated 2021-06-08 23:00 UTC)