[HN Gopher] Serving Netflix Video at 400Gb/s [pdf]
       ___________________________________________________________________
        
       Serving Netflix Video at 400Gb/s [pdf]
        
       Author : drewg123
       Score  : 442 points
       Date   : 2021-09-19 14:47 UTC (8 hours ago)
        
 (HTM) web link (people.freebsd.org)
 (TXT) w3m dump (people.freebsd.org)
        
       | ramshanker wrote:
       | Meanwhile my internet connection is 40mbps
        
       | segfaultbuserr wrote:
       | > _Run kTLS workers, RACK / BBR TCP pacers with domain affinity_
       | 
       | TCP BBR is in the FreeBSD upstream now? Cool.
        
         | zamadatix wrote:
         | As of 13.0
        
       | woah wrote:
       | Dumb question, but what, specifically, does "fabric" mean in this
       | context?
        
         | adwn wrote:
         | Fabric = data transmission infrastructure between CPUs or
         | cores. This includes the wires which carry the data signals, as
         | well as routing logic.
        
         | zamadatix wrote:
         | For the nitty gritty details
         | https://en.wikichip.org/wiki/amd/infinity_fabric
        
       | drewg123 wrote:
       | These are the slides from my EuroBSDCon presentation. AMA
        
         | polskibus wrote:
          | Was FreeBSD your first choice? Or did you try with Linux first?
          | What were the numbers for the Linux-based solution, if there
          | was one?
        
           | drewg123 wrote:
           | FreeBSD was selected at the outset of the Open Connect CDN
           | (~2012 or so).
           | 
            | We did a bake-off a few years ago, and _FOR THIS WORKLOAD_
            | FreeBSD outperformed Linux. I don't want to get into an OS
            | war; that's not productive.
        
             | kayson wrote:
             | By how much?
        
             | drewg123 wrote:
              | It's important to consider that we've poured man-years
              | into this workload on FreeBSD. Just off the top of my
              | head, we've worked on in house, and/or contributed to or
              | funded, or encouraged vendors to pursue:
              | 
              | - async sendfile (so sendfile does not block, and you
              |   don't need thread pools or AIO)
              | - RACK and BBR TCP in FreeBSD (for good QoE)
              | - kTLS (so you can keep using sendfile with TLS; saves
              |   ~60% CPU over reading data into userspace and
              |   encrypting there)
              | - NUMA
              | - kTLS offload (to save memory bandwidth by moving
              |   crypto to the NIC)
             | 
             | Not to mention tons of VM system and scheduler improvements
             | which have been motivated by our workload.
             | 
             | FreeBSD itself has improved tremendously over the last few
             | releases in terms of scalability
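              | 
              | To make the kTLS savings above concrete: as a rough
              | illustration (not our production code; the function name
              | and buffer size are made up), the userspace copy-and-
              | encrypt path that sendfile + kTLS lets us skip looks
              | something like this:
              | 
              |   /* Sketch only: the classic read()+SSL_write() loop
              |    * that kTLS + sendfile avoids.  Every byte is copied
              |    * into the process and back out again. */
              |   #include <openssl/ssl.h>
              |   #include <unistd.h>
              | 
              |   static int send_file_userspace(SSL *ssl, int fd)
              |   {
              |       char buf[64 * 1024];
              |       ssize_t n;
              | 
              |       /* copy into userspace, encrypt, copy back out */
              |       while ((n = read(fd, buf, sizeof(buf))) > 0) {
              |           if (SSL_write(ssl, buf, (int)n) <= 0)
              |               return -1;   /* TLS error */
              |       }
              |       return (n < 0) ? -1 : 0;
              |   }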
        
               | sandGorgon wrote:
                | A lot of this is in Linux now, right? I am asking for a
                | personal opinion and not necessarily a "why-don't-you-
                | move-to-Linux" question.
                | 
                | Genuinely curious where you see the state of the art
                | when it comes to Linux.
        
               | drewg123 wrote:
               | Yes. I ran it for a bake off ~2 years ago. At the time,
               | the code in linux was pretty raw, and I had to fix a bug
               | in their SW kTLS that caused data corruption that was
               | visible to clients. So I worry that it was not in
               | frequent use at the time, though it may be now.
               | 
                | My understanding is that they don't do zero-copy inline
                | kTLS, but I could be wrong about that.
        
               | spockz wrote:
               | Thank you for pushing kTLS!
        
             | jenny91 wrote:
             | Was licensing also a contributing factor, or was that
             | irrelevant for you?
        
               | drewg123 wrote:
               | My understanding is that licensing did factor into the
               | decision. However, I didn't join Netflix until after the
               | decision had been made.
        
               | tatersolid wrote:
                | I recall reading that Netflix chose FreeBSD a decade ago
                | because asynchronous _disk_ IO on Linux was (and still
                | is?) broken and/or limited to fixed block offsets. So
                | nginx just works better on FreeBSD versus Linux for
                | serving static files from spinning rust or SSD.
        
               | Sesse__ wrote:
               | This used to be the case, but with io_uring, Linux has
               | very much non-broken buffered async I/O. (Windows has
               | copied io_uring pretty much verbatim now, but that's a
               | different story.)
        
               | gavinray wrote:
               | Could you expand more on the Windows io_uring bit please?
               | 
                | I have run Debian-based Linux my entire life and
                | recently moved, circumstantially, to Windows. I have no
                | idea how its kernel model works, and I find io_uring
                | exciting.
               | 
               | Wasn't aware of any adoption of io_uring ideas in Windows
               | land, sounds interesting
        
         | Matthias247 wrote:
         | The architecture slides don't show any in-memory read caching
         | of data? I guess there is at least some, but would it be at the
         | disk side or the NIC side? I guess sendfile without direct IO
         | would read from a cache.
        
           | drewg123 wrote:
           | Caching is left off for simplicity.
           | 
           | We keep track of popular titles, and try to cache them in
           | RAM, using the normal page cache LRU mechanism. Other titles
           | are marked with SF_NOCACHE and are discarded from RAM ASAP.
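            | 
            | At the sendfile(2) level, that roughly comes down to passing
            | the SF_NOCACHE flag for the unpopular titles. A simplified
            | sketch (not our actual serving code; the function and its
            | arguments are invented for illustration):
            | 
            |   /* Simplified sketch of FreeBSD sendfile(2) usage;
            |    * SF_NOCACHE asks the kernel to drop the file's
            |    * pages from the page cache soon after they are sent. */
            |   #include <sys/types.h>
            |   #include <sys/socket.h>
            |   #include <sys/uio.h>
            | 
            |   static int serve_range(int file_fd, int sock,
            |                          off_t off, size_t len,
            |                          int popular)
            |   {
            |       off_t sent = 0;
            |       int flags = popular ? 0 : SF_NOCACHE;
            | 
            |       if (sendfile(file_fd, sock, off, len,
            |                    NULL, &sent, flags) == -1)
            |           return -1;   /* caller checks errno */
            |       return 0;
            |   }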
        
             | Matthias247 wrote:
             | in which node would that page cache be allocated? In the
             | one where the disk is attached, or where the data is used?
             | Or is this more or less undefined or up to the OS?
        
               | drewg123 wrote:
               | This is gone over in the talk. We allocate the page
               | locally to where the data is used. The idea is that we'd
               | prefer the NVME drive to eat any latency for the NUMA bus
               | transfer, and not have the CPU (SW TLS) or NIC (inline HW
               | TLS) stall waiting for a transfer.
        
             | markjdb wrote:
             | How much data ends up being served from RAM? I had the
             | impression that it was negligible and that the page cache
             | was mostly used for file metadata and infrequently accessed
             | data.
        
               | drewg123 wrote:
               | It depends. Normally about 10-ish percent. I've seen well
               | over that in the past for super popular titles on their
               | release date.
        
         | otterley wrote:
          | How would you characterize the economics of this vs.
          | alternative solutions? 400Gb/s is certainly a remarkable
          | achievement, but is it the lowest price-per-Gb/s solution
          | compared to alternatives? (Even multiple servers.)
        
           | dragontamer wrote:
           | I doubt that the 32-core EPYC they focused on is even the
           | most economical solution in this situation.
           | 
            | If they're really RAM-bandwidth constrained, then the
            | 24-core 74F3 (which still has all 256MB of L3 cache) or even
            | the 8-core 72F3 may be better.
        
             | sroussey wrote:
              | L3 won't matter.
              | 
              | The more compute clusters, the more PCIe lanes in EPYC,
              | and the SSD lanes go direct per 8 cores.
        
               | samstave wrote:
                | I recall when I was at Intel in the 90s, when 3MB of L3
                | was a big deal.
        
           | pastrami_panda wrote:
            | This. Can someone expand on this? It's crucial.
        
           | drewg123 wrote:
            | I'm not on the hardware team, so I don't have the cost
            | breakdown. But my understanding is that flash storage is the
            | most expensive line item, and it matters little whether it's
            | consolidated into one box or spread over a rack; you still
            | have to pay for it. By serving more from fewer boxes, you
            | can reduce component duplication (cases, mobos, RAM, PSUs)
            | and, more importantly, the power & cooling required.
           | 
           | The real risk is that we introduce a huge blast radius if one
           | of these machines goes down.
        
             | [deleted]
        
             | sroussey wrote:
             | Ouch. I guess there is always a trade off to be made...
        
           | fragmede wrote:
            | In particular, kTLS is a solution that fits here, on a
            | single box, but I wonder how things would look if the high-
            | perf storage boxes sent video unencrypted and there was a
            | second box that dealt only with the TLS. We'd have to know
            | how many streams 400Gb/s represents though, and have a far
            | more detailed picture of Netflix's TLS needs.
        
         | HalcyonicStorm wrote:
          | Is there any way we can see a video of the presentation? I'm
          | extremely interested.
        
           | allanjude wrote:
           | The videos should appear on the conference's youtube channel
           | in a few weeks: https://www.youtube.com/eurobsdcon
        
         | gautamcgoel wrote:
         | 1. Could GPU acceleration help at all?
         | 
         | 2. When serving video, do you use floating point operations at
         | all? Could this workload run on a hypothetical CPU with no
         | floating point units?
         | 
          | 3. How many of these hardware platforms do you guys own? 10k?
          | 100k?
        
           | sroussey wrote:
            | The videos are precomputed, so no GPU is required to stream.
        
           | drewg123 wrote:
           | 1) No. Well, potentially as a crypto accelerator, but QAT and
           | Chelsio T6 are less power hungry and more available. GPUs are
           | so expensive/unavailable now that leveraging them in creative
           | ways makes less sense than just using a NIC like the CX6-Dx,
           | which has crypto as a low cost feature.
           | 
           | 2) These are just static files, all encoding is done before
           | it hits the CDN.
        
         | betaby wrote:
          | How exactly does 'Constrained to use 1 IP address per host'
          | help eliminate cross-NUMA transfers?
        
           | drewg123 wrote:
           | If we could use more than 1 IP, then we could treat 1 400Gb
           | box as 4 100Gb boxes. That could lead to the "perfect case"
           | every time, since connections would always stay local to the
           | numa node where content is present.
        
             | betaby wrote:
              | There is no kTLS for IPv6? IPv6 space is abundant and most
              | mobiles in the USA/Canada have IPv6. Won't that solve the
              | problem?
        
               | drewg123 wrote:
               | IPv4 and IPv6 can both use kTLS. We offer service via V6,
               | but most clients connect via IPv4. It differs by region,
               | and even time of day, but IPv4 is still the vast majority
               | of traffic.
        
               | chungy wrote:
               | I've had to blackhole Netflix IPv6 ranges on my router
               | because Netflix would identify my IPv6 connection as
               | being a "VPN" even though it's not.
        
               | drewg123 wrote:
               | If you could email me your ranges, I can try to look into
               | it internally. Use my last name at gmail.com (or at
               | freebsd.org)
        
               | genewitch wrote:
                | I'm not GP; Hurricane Electric IPv6 always made Netflix
                | think I was on a VPN, but now I have real IPv6 through
                | the same ISP. I just pay more money so Netflix doesn't
                | complain anymore.
        
         | kvathupo wrote:
          | This may be a naive question, but data is sent at 400Gb/s to
          | the NIC, right? If so, is it fair to assume that data is
          | actually sent/received at a similar rate?
         | 
         | I ask since I was curious why you guys opted not to bypass
         | sendfile(2). I suppose it wouldn't matter in the event that the
         | client is some viewer, as opposed to another internal machine.
        
           | drewg123 wrote:
           | We actually try really, really hard not to blast 400Gb/s at a
           | single client. The 400Gb/s is in aggregate.
           | 
           | Our transport team is working on packet pacing, or really
           | packet spreading, so that any bursts we send are small enough
           | to avoid being dropped by the client, or an intermediary
           | (cable modem, router, etc).
        
           | Matthias247 wrote:
           | Well it's not 1 client. It's thousands of viewers and
           | streams. An individual stream will have whatever the maximum
           | 4k bandwidth for Netflix is.
        
         | dragontamer wrote:
         | I eventually figured it out. But I would suggest maybe giving a
         | brief 1-slide thingy on "lagg", and link aggregation? (Maybe it
         | was clear from the presentation, but I only see the slides here
         | soooooo...)
         | 
         | I'm not the best at network infrastructure, though I'm more
          | familiar with NUMA stuff. So I was trying to figure out how
          | you only got 1 IP address on each box despite 4 ports across
          | 2 NICs.
         | 
         | I assume some Linux / Windows devops people are just not as
         | familiar with FreeBSD tools like that!
         | 
         | EDIT: Now that I think of it: maybe a few slides on how link-
         | aggregation across NICs / NUMA could be elaborated upon
         | further? I'm frankly not sure if my personal understanding is
         | correct. I'm imagining how TCP-connections are fragmented into
         | IP-packets, and how those packets may traverse your network,
         | and how they get to which NUMA node... and it seems really more
         | complex to me than your slides indicate? Maybe this subject
         | will take more than just one slide?
        
           | drewg123 wrote:
            | Thanks for the feedback. I was hoping that the other words on
           | the slide (LACP and bonding) would give enough context.
           | 
           | I'm afraid that my presentation didn't really have room to
           | dive much into LACP. I briefly said something like the
           | following when giving that slide:
           | 
           | Basically, the LACP link partner (router) hashes traffic
           | consistently across the multiple links in the LACP bundle,
           | using a hash of its choosing (typically an N-tuple, involving
           | IP address and TCP port). Once it has selected a link for
           | that connection, that connection will always land on that
           | link on ingress (unless the LACP bundle changes in terms of
           | links coming and going). We're free to choose whatever egress
           | NIC we want (it does not need to be the same NIC the
           | connection entered on). The issue is that there is no way for
           | us to tell the router to move the TCP connection from one NIC
           | to another (well, there is in theory, but our routers can't
           | do it)
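            | 
            | To give a flavor of it, here's a toy C sketch of an N-tuple
            | hash picking a member link. Real routers use their own
            | vendor-specific hashes, so treat this purely as an
            | illustration, not any particular implementation:
            | 
            |   /* Toy N-tuple hash: the same connection always maps
            |    * to the same LACP member link.  Illustration only. */
            |   #include <stdint.h>
            | 
            |   static unsigned pick_link(uint32_t sip, uint32_t dip,
            |                             uint16_t sport, uint16_t dport,
            |                             unsigned nlinks)
            |   {
            |       uint32_t h = sip ^ dip ^
            |                    ((uint32_t)sport << 16 | dport);
            | 
            |       h ^= h >> 16;       /* cheap bit mixing */
            |       h *= 0x45d9f3bu;
            |       h ^= h >> 16;
            |       return h % nlinks;  /* member link index */
            |   }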
           | 
           | I hope that helps
        
             | dragontamer wrote:
             | > We're free to choose whatever egress NIC we want
             | 
             | Wait, I got lost again... You say you can "output on any
             | egress NIC". So all four egress NICs have access to the TLS
             | encryption keys and are cooperating through the FreeBSD
             | kernel to get this information?
             | 
             | Is there some kind of load-balancing you're doing on the
             | machine? Trying to see which NIC has the least amount of
             | traffic and routing to the least utilized NIC?
        
               | drewg123 wrote:
               | In terms of LACP in general, not for hw TLS. For HW TLS,
               | the keys and crypto state are NIC specific.
        
               | samstave wrote:
                | What's happening at the switch(es) the NICs are
                | connected to?
        
               | martyvis wrote:
                | The whole paper is discussing bottlenecks, path
                | optimisation between resources, and the impacts of those
                | on overall throughput. It's not a simple load-balancing
                | question being answered.
        
           | zamadatix wrote:
           | LACP/link aggregation are IEEE Ethernet standards concepts
           | supported by nearly every hardware or software network stack.
           | https://en.wikipedia.org/wiki/Link_aggregation
        
         | zamadatix wrote:
         | Hey Drew, thanks for taking the time.
         | 
          | How would you rate the relative complexity of working with the
          | NIC offloading vs. the more traditional optimizations in the
          | rest of the deck? Have you compared other NIC vendors before,
          | or has Mellanox been the go-to that's always done what you've
          | needed?
        
         | minimax wrote:
          | What led you to investigate PCIe relaxed ordering? Can you
         | suggest a book or other resource to learn more about PCIe
         | performance?
        
           | drewg123 wrote:
           | To be honest, it was mostly the suggestion from AMD.
           | 
            | At the time, AMD was the only Gen4 PCIe platform available,
            | and it was hard to determine whether the Mellanox NIC or the
            | AMD PCIe root complex was the limiting factor. When AMD
            | suggested Relaxed Ordering, that brought its importance to
            | mind.
        
         | 404mm wrote:
         | As an infrastructure engineer, these numbers are absolutely
         | mind blowing to me!
         | 
         | Not sure if it's ok to ask.. how many servers like this one
         | does it take to serve the US clients?
        
           | drewg123 wrote:
           | I don't have the answer, and even if I did, I'm not sure I'd
           | be allowed to tell you :)
           | 
           | But note that these are flash servers; they serve the most
           | popular content we have. We also have "storage" servers with
           | huge numbers of spinning drives that serve the longer tail.
           | They are constrained by spinning rust speeds, and can't serve
           | this fast.
        
         | Aissen wrote:
          | How do you benchmark this? Do you use real-life traffic, or
          | have a fleet of TLS clients? If you have a custom testsuite,
          | are the clients homogeneous? How many machines do you need?
          | Do the clients use kTLS?
        
           | drewg123 wrote:
           | We test on production traffic. We don't have a testbench.
           | 
           | This is problematic, because sometimes results are not
           | reproducible.
           | 
            | E.g., if I test on the day of a new release of a popular title,
           | we might be serving a lot of it cached from RAM, so that cuts
           | down memory bandwidth requirements and leads to an overly
           | rosy picture of performance. I try to account for this in my
           | testing.
        
             | Aissen wrote:
             | Thanks, it's an interesting set of tradeoffs.
        
         | GrigoriyMikh wrote:
          | To be clear, was FreeBSD used because of historical reasons,
          | or because similar performance can't be achieved, or is harder
          | to achieve, on Linux?
        
           | Thaxll wrote:
            | I mean, most CDNs and FANGs run on Linux. I think in that
            | case it's kTLS that makes a big difference; the rest, not
            | much.
        
             | drewg123 wrote:
             | Async sendfile is also an advantage for FreeBSD. It is
             | specific to FreeBSD. It allows an nginx worker to send from
             | a file that's cold on disk without blocking, and without
             | resorting to threadpools using a thread-per-file.
             | 
             | The gist is that the sendfile() call stages the pages
             | waiting to be read in the socket buffer, and marks the
             | mbufs with M_NOTREADY (so they cannot be sent by TCP). When
             | the disk read completes, a sendfile callback happens in the
             | context of the disk ithread. This clears the M_NOTREADY
             | flag and tells TCP they are ready to be sent. See
             | https://www.nginx.com/blog/nginx-and-netflix-contribute-
             | new-...
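              | 
              | From the application's side it just looks like an ordinary
              | non-blocking sendfile() call. Roughly (a sketch, not
              | nginx's actual code; the helper name is made up):
              | 
              |   /* Sketch: non-blocking FreeBSD sendfile(2) call.
              |    * With async sendfile it returns without waiting
              |    * for cold data to come off disk; EAGAIN only
              |    * means the socket buffer is full, so the event
              |    * loop waits for the socket to be writable. */
              |   #include <sys/types.h>
              |   #include <sys/socket.h>
              |   #include <sys/uio.h>
              |   #include <errno.h>
              | 
              |   static off_t push_bytes(int file_fd, int sock,
              |                           off_t off, size_t len)
              |   {
              |       off_t sent = 0;
              | 
              |       if (sendfile(file_fd, sock, off, len,
              |                    NULL, &sent, 0) == -1 &&
              |           errno != EAGAIN)
              |           return -1;   /* real error */
              |       return sent;     /* bytes handed to the socket */
              |   }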
        
               | Thaxll wrote:
               | sendfile() with splice and io_uring is similar? I know
               | that this is very experimental on Linux.
               | 
               | The overall idea is to copy bytes from disk to the socket
               | with almost no allocation and not blocking, this is the
               | idea right?
        
           | smoldesu wrote:
           | I'd imagine scaling, licensing and overhead all had something
           | to do with it, too.
        
             | GrigoriyMikh wrote:
             | Interesting. Are there any benchmarks that you would
             | recommend to look at, regarding FreeBSD vs Linux networking
             | performance?
        
               | GrigoriyMikh wrote:
                | For anyone interested, here are benchmarks from late
                | 2018, comparing Fedora and FreeBSD performance:
               | https://matteocroce.medium.com/linux-and-freebsd-
               | networking-...
        
               | trasz wrote:
               | It's probably worth noting that there have been huge
               | scalability improvements - including introduction of
               | epochs (+/- RCU) - in FreeBSD over the last few years,
               | for both networking and VFS.
        
               | lostmsu wrote:
                | Why does the author put so much effort into testing VMs?
                | Bare-metal installations aren't even tried, so the
                | article won't represent a more typical setup (unless you
                | want to run in the cloud, in which case it would make
                | sense to test in the cloud).
        
               | teknoraver wrote:
                | Because the cards were in PCI passthrough, so the
                | performance was exactly the same as on a physical
                | system.
        
               | lostmsu wrote:
                | The author mentions as much, saying that there was some
                | indirection at least in the interrupts.
                | 
                | There are also VirtIO drivers involved, and according to
                | the article, they had an effect too.
        
               | genewitch wrote:
               | If given the choice I'd never run anything on bare metal
               | again. Let's say we have some service we want to run on a
               | bare metal server. For not very much more hardware money,
                | amortized, I can set up two or three VMs doing the same
                | job, duplicated. Then if any subset of that metal goes
               | bad, a replica/duplicate is already ready to go. There's
               | no network overhead, etc.
               | 
               | I've been doing this for stuff like SMTPE authority
               | servers and ntpd and things that absolutely cannot go
               | down, for over a decade.
        
               | smoldesu wrote:
               | That doesn't really matter for benchmarking purposes. I
               | think the parent comment was emphasizing that syscall
               | benchmarks don't make sense when you're running through a
               | hypervisor, since you're running tacitly different
               | instructions than would be run on a bare-metal or
               | provisioned server.
        
           | crest wrote:
           | Nginx and OpenSSL are open source. Give it a try and
           | reproduce their results with Linux ;-).
        
             | betaby wrote:
              | sendfile + kTLS. I'm unaware of an in-kernel TLS
              | implementation for Linux. Is there one around?
        
               | merb wrote:
               | https://www.kernel.org/doc/html/latest/networking/tls-
               | offloa... https://docs.mellanox.com/display/OFEDv521040/K
               | ernel+Transpo...
        
               | drewg123 wrote:
                | Yes, Linux has kTLS. When I tried to use it, it was
                | horribly broken, so my fear is that it's not well
                | used/tested, but it exists. Mellanox, for example,
                | developed their inline hardware kTLS offload on Linux.
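                | 
                | For reference, enabling Linux software kTLS on the
                | transmit side looks roughly like the sketch below, based
                | on the kernel's TLS documentation (the handshake still
                | happens in userspace; after this, plain write() or
                | sendfile() on the socket produces TLS records). Sketch
                | only, error handling trimmed:
                | 
                |   /* Sketch of the Linux kTLS TX setup path, per
                |    * the kernel TLS docs; key/iv/salt/seq come
                |    * from the userspace TLS handshake. */
                |   #include <linux/tls.h>
                |   #include <netinet/in.h>
                |   #include <netinet/tcp.h>
                |   #include <sys/socket.h>
                |   #include <string.h>
                | 
                |   #ifndef SOL_TLS
                |   #define SOL_TLS 282   /* TLS socket level */
                |   #endif
                | 
                |   static int enable_ktls_tx(int s,
                |       const unsigned char *key,
                |       const unsigned char *iv,
                |       const unsigned char *salt,
                |       const unsigned char *seq)
                |   {
                |       struct tls12_crypto_info_aes_gcm_128 ci;
                | 
                |       memset(&ci, 0, sizeof(ci));
                |       ci.info.version = TLS_1_2_VERSION;
                |       ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
                |       memcpy(ci.key, key,
                |              TLS_CIPHER_AES_GCM_128_KEY_SIZE);
                |       memcpy(ci.iv, iv,
                |              TLS_CIPHER_AES_GCM_128_IV_SIZE);
                |       memcpy(ci.salt, salt,
                |              TLS_CIPHER_AES_GCM_128_SALT_SIZE);
                |       memcpy(ci.rec_seq, seq,
                |              TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
                | 
                |       /* attach the "tls" ULP, then install TX keys */
                |       if (setsockopt(s, IPPROTO_TCP, TCP_ULP,
                |                      "tls", sizeof("tls")) < 0)
                |           return -1;
                |       return setsockopt(s, SOL_TLS, TLS_TX,
                |                         &ci, sizeof(ci));
                |   }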
        
               | Matthias247 wrote:
               | > When I tried to use it, it was horribly broken, so my
               | fear is that its not well used/tested, but it exists.
               | 
                | Do you have any additional references around this? I'm
                | aware that most rarely used functionality is often
                | broken, and I therefore usually don't recommend people
                | use it, but I would like to learn about kTLS in
                | particular. I think OpenSSL 3 has now added userspace
                | support for it on Linux. But there are also the kernel
                | components as well as drivers - all of them could have
                | their own set of issues.
        
               | drewg123 wrote:
                | I recall that simple transmits from offset 0..N in a
                | file worked. But range requests of the form N..N+2MB led
                | to corrupt data. It's been 2+ years, and I heard it was
                | later fixed in Linux.
        
               | Matthias247 wrote:
               | Interesting. I had imagined the range handling is purely
               | handled by the reading side of things, and wouldn't care
               | how the sink is implemented (kTLS, TLS, a pipe, etc). So
               | I assumed the offset should be invisible for kTLS, which
               | just sees a stream of data as usual.
        
               | Sesse__ wrote:
               | I've used sendfile + kTLS on Linux for a similar use
               | case. It worked fine from the start, was broken in two
               | (?) kernel releases for some use cases, and now works
               | fine again from what I can tell. This is software kTLS,
               | though; haven't tried hardware (not the least because it
               | easily saturates 40 Gbit/sec, and I just don't have that
               | level of traffic).
        
               | genewitch wrote:
               | I once recommended a switch and router upgrade to allow
               | for more, new WAPs for an office that was increasingly
               | becoming dependent on laptops and video conferencing. I
               | went with brand new kit, like just released earlier in
               | the year because I'd heard good things about the traffic
               | shaping, etc.
               | 
               | Well, the printers wouldn't pair with the new APs,
               | certain laptops with fruit logos would intermittently
               | drop connection, and so on.
               | 
               | I probably will never use that brand again, even though
               | they escalated and promised patches quickly - within 6
                | hours they had found the issue and were working on
               | fixing it, but the damage to my reputation was already
               | done.
               | 
               | Since then I've always demanded to be able to test any
               | new idea/kit/service for at least a week or two just to
               | see if I can break it.
        
               | [deleted]
        
             | Koshkin wrote:
             | IMO the question was reasonable, whereas the answers like
             | yours have always sounded to me like "fuck you."
        
             | relix42 wrote:
              | It was done and tested more than once. As I recall, it
              | took quite a bit to get Linux to perform to the level that
              | BSD was performing at for (a) this use case and (b) the
              | years of investment Netflix had already put into the BSD
              | systems.
              | 
              | So, could Linux be tweaked and made as performant for
              | _this_ use case? I expect so. The question to be answered
              | is _why_.
        
         | LeifCarrotson wrote:
         | This level of architecture management on big server CPUs is
         | amazing! I occasionally handle problems like this on a small
         | scale, like minimizing wake time and peripheral power
         | management on an 8 bit microcontroller, but there the entire
         | scope is digestible once you get into it, and the kernel is
         | custom-designed for the application.
         | 
         | However, in my case, and I expect in yours, requirements
         | engineering is the place where you can make the greatest
         | improvements. For example, I can save a few cycles and a few
         | microwatts by sequencing my interrupts optimally or moving some
         | of the algorithm to a look-up table, but if I can, say,
         | establish that an LED indicator flash that might need to be 2x
         | as bright but only lasts for a couple milliseconds every second
         | is as visible as a 500ms LED on/off blink cycle, that's a 100x
         | power savings that I can't hope to reach with micro-
         | optimizations.
         | 
         | What are your application-level teams doing to reduce the data
         | requirements? General-purpose NUMA fabrics are needed to move
         | data in arbitrary ways between disc/memory/NICs, but your needs
         | aren't arbitrary - you basically only require a pipeline from
         | disc to memory to the NIC. Do you, for example, keep the first
         | few seconds of all your content cached in memory, because users
         | usually start at the start of a stream rather than a few
         | minutes in? Alternatively, if 1000 people all start the same
         | episode of Stranger Things within the same minute, can you add
         | queues at the external endpoints or time shift them all
         | together so it only requires one disk read for those thousand
         | users?
        
           | fragmede wrote:
           | _> Alternatively, if 1000 people all start the same episode
           | of Stranger Things within the same minute_
           | 
           | It would be _fascinating_ to hear from Netflix on some
           | serious details of the usage patterns they see and particular
            | optimizations that they do for that, but I doubt there's so
            | much they can do given the size of the streams, the
            | 'randomness' of what people watch and when they watch it,
            | and the fact that the linked slides say the servers have
            | 18x2TB NVMe drives per server and 256GB.
           | 
           | I wouldn't be surprised if the Netflix logo opener exists
           | once on disk instead of being the first N seconds of every
           | file though.
        
             | allanjude wrote:
              | In previous talks Netflix has mentioned that, due to
              | serving so many 1000s of people from each box, they
              | basically do 0 caching in memory: all of the system memory
              | is needed for buffers that are en route to users, and they
              | purposely avoid keeping any buffer cache beyond what is
              | needed for sendfile().
        
         | jareklupinski wrote:
         | > Mellanox ConnectX-6 Dx - Support for NIC kTLS offload
         | 
         | Wild, didn't know nVidia was side-eyeing such far-apart but
         | still parallel channels for their ?GPUs?.
         | 
         | Was this all achievable using nVidia's APIs out-of-the-box, or
         | did the firmware/driver require some in-house engineering :)
        
           | fragmede wrote:
           | Mellanox was bought by nVidia 2 years ago, so while it's
           | technically accurate to say it's an nVidia card, that elides
           | their history. Mellanox has been selling networking cards to
           | the supercomputing market since 1999. Netflix absolutely had
           | to do some tuning of various counters/queues/other settings
           | in order to optimize for their workload and get the level of
           | performance they're reporting here, but Mellanox sells NICs
           | with firmware/drivers that work out-of-the-box.
        
         | swozey wrote:
         | I build this stuff so it's so cool to read this, I can't really
         | be public about my stuff. Are you using completely custom
         | firmwares on your Mellanoxes? Do you have plans for nvmeOF?
         | I've had so many issues with kernels/firmware scaling this
         | stuff that we've got a team of kernel devs now. Also how stable
         | are these servers? Do you feel like they're teetering at the
         | edge of reliability? I think once we've ripped out all the OEM
         | firmware we'll be in a much better place.
         | 
          | Are you running anything custom on your Mellanoxes? DPDK stuff?
        
       | kierank wrote:
       | How do you think this will need to be re-architected for HTTP3?
       | Will you do kernel bypass with a userspace stack?
        
         | drewg123 wrote:
         | We've looked at http3 / quic, and have not yet seen any
         | advantages for our workload. Just a large loss of CPU
         | efficiency.
        
       | cleverpebble wrote:
       | If this is one server, I can't imagine how much bandwidth Netflix
       | pushes out of a single PoP - does anyone have those numbers, or
       | at least estimates?
        
         | wmf wrote:
         | I saw a photo somewhere of a Netflix POP that showed dozens of
         | servers and a huge chassis router with hundreds of 100G links,
         | so that's terabits.
        
           | tgtweak wrote:
            | Capacity isn't indicative of peak bandwidth - but I'm sure
            | peaks are well past the terabit mark if each server is being
            | designed for 100-400Gbps.
        
         | dylan604 wrote:
          | There was some crazy number, like 80% of internet traffic
          | being Netflix. I have no idea of the validity of that, but
          | without seeing actual numbers, that sounds like a lot.
        
           | kingosticks wrote:
           | There are some numbers at
           | https://www.lightreading.com/carrier-security/security-
           | strat...
        
         | tgtweak wrote:
         | They colocate these servers inside ISPs typically. Would be
         | interesting to know, for the traffic that does go to a Netflix
         | primary pop - how much the hottest one consumes.
         | 
         | It feels like a luxury that they can fit their entire content
         | library in all encoded formats into a single server (even
         | though it's a massive server).
        
           | sroussey wrote:
           | It's not the entire content library. It's a cache of the most
           | watched at that time.
        
         | zamadatix wrote:
          | It may disappoint, but the servers are big because it's
          | cheaper than buying racks of servers, not because it was the
          | only way to get enough servers into a PoP. Their public IX
          | table should give an idea that most are just a couple:
          | https://openconnect.netflix.com/en/peering/ (note they have
          | bulk storage servers too, for the less popular content; not
          | just 1 big box serving every file).
         | 
          | I've seen their boxes walking through Equinix facilities in
          | Dallas and Chicago and it is a bit jarring how small some of
          | the largest infrastructure can be.
         | 
         | It's worth noting that they have many smaller boxes at ISP colo
         | locations as well not just the big regional DCs like Equinix.
        
           | genewitch wrote:
            | Seeing Hulu at Equinix, as well as PlayStation Network, in
            | large cages, and then our two small cages, was rather eye-
            | opening. Some people have a lot of money to throw at a
            | problem; others have smart money to throw at smart engineers.
        
             | gsnedders wrote:
             | I mean, there's gotta be some question about costs of
             | hardware (plus colocation costs, etc.) versus the costs of
             | engineers to optimise the stack.
             | 
             | I don't doubt there's a point at which it's cheaper to
             | focus on reducing hardware and colocation costs, but for
             | the vast majority engineers are the expensive thing.
        
       | creativenolo wrote:
       | I'd love to know how many users could be served off one of these
       | boxes
        
         | puppet-master wrote:
          | You can do reasonable 1080p at 3Mbit with h.264; if these were
          | live streams, you could fit around 130k sessions on the
          | machine. But this workload is prerecorded bulk data, and I
          | imagine the traffic is considerably spikier than in the live
          | streaming case, perhaps reducing the max sessions (without
          | performance loss) by 2-3x.
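          | 
          | (Back of the envelope: 400 Gb/s is 400,000 Mb/s, and 400,000 /
          | 3 Mb/s per stream comes out to roughly 133,000 concurrent
          | streams before any headroom.)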
        
       | Sunspark wrote:
       | It's disappointing to me that Netflix keeps reducing the bitrate
       | on their videos.. they're not the only offender but the most
       | prominent.
        
         | [deleted]
        
       | alberth wrote:
       | @drewg123
       | 
        | 1. Is there any likelihood Netflix will migrate to ARM in the
        | next few years? (I see it comes up right at the end of the deck;
        | curious whether you're seeing more advancements with ARM than
        | x86 and, as such, project ARM to surpass x86 for your needs in
        | the foreseeable future.)
        | 
        | 2. Can you comment more on the 800 Gbps reference at the end of
        | the deck?
        
         | drewg123 wrote:
         | We're open to migrating to the best technology for the use
         | case. Arm beat Intel to Gen4 PCIe, which is one of the reasons
         | we tried them. Our stack runs just fine on arm64, thanks to the
         | core FreeBSD support for arm64
         | 
          | The 800Gb is just a science experiment: a dual-socket Milan
         | with a relatively equal number of PCIe lanes from each socket.
         | It should have been delivered ages ago, but supply chain
         | shortages impacted the delivery dates of a one-off prototype
         | like this.
        
       | lprd wrote:
       | Will the presentation be available online anywhere? This topic is
       | very interesting to me as I maintain a Plex server for my family
       | and have always been curious how Netflix does streaming at scale.
        
         | antman wrote:
         | How large is your family?
        
           | lokimedes wrote:
           | Just the closest /16 and a few friends.
        
           | lprd wrote:
           | Haha, not large at all! I've just built a couple of servers
           | and their main responsibility is content storage and then
           | content serving (transcoding). My main bottleneck right now
           | is that I don't have 10Gbps switches (both servers have dual
           | 10Gbps intel NICs ready to go). I have to do some throttling
           | so that users don't experience any buffering issues.
        
         | erk__ wrote:
         | At some point the presentations will come up here
         | https://www.youtube.com/c/EuroBSDcon
        
         | drewg123 wrote:
         | There is a EuroBSDCon youtube channel. The last time I
         | presented, it was a few weeks before the recordings were
         | processed and posted to the channel. I'm not sure how soon they
         | will be available this year.
         | 
         | I have to say that a lot of this may be overkill for plex. Do
         | you know if plex even uses sendfile?
        
           | lprd wrote:
           | It's 100% overkill, haha! I was just asking because streaming
           | architecture really interests me. I have two nodes (one is
           | TrueNAS with ~50tb of storage), and the other is a compute
           | machine (all SSDs) which consumes the files on the NAS and
           | delivers the content. My biggest bottleneck right now is that
            | my internal network isn't 10Gbps, so I have to throttle some
           | services so that users don't experience any buffering issues.
           | 
           | Truenas also introduced me to ZFS and I have been amazed by
            | it so far! I haven't dug too deep into FreeBSD yet, but that's
           | next on my list.
        
             | drewg123 wrote:
             | One thing to remember with ZFS is that it does not use the
             | page cache; it uses ARC. This means that sendfile is not
             | zero-copy with ZFS, and I think that async sendfile may not
             | work at all. We use ZFS only for boot/root/logs and UFS for
             | content.
        
               | lprd wrote:
               | Very interesting, I had no idea. Would adding something
                | like an Intel Optane help performance for large transfers
               | with a lot of I/O? I manage my services with a simple
               | docker-compose file and establish the data link to my NAS
               | via NFS, which I'm sure is adding another layer of
               | complexity (I believe NFSv4 is asynchronous?).
        
               | onedr0p wrote:
               | Short answer: no
               | 
                | There are a lot of blogs that go over this in detail for
                | our use case at home. You can do it, but you will not
                | see much improvement, if any at all.
        
               | lvh wrote:
               | From the tail end of the presentation I'd expect UFS2
               | isn't a potential bottleneck (I'd naively expect it to be
               | good at huge files, and the kernel to be good at
               | sendfile(2).) Is that your opinion as well, or are there
               | efficiency gains hiding inside the filesystem?
        
       | msdrigg wrote:
        | I think it's funny that these servers use FreeBSD, but if you
        | zoom in on the CPU graphic in the slides it is reading a book
        | called 'Programming Linux'.
        
         | [deleted]
        
         | tyingq wrote:
         | That is funny. The title is pretty damn small. Here's what
         | appears to be the source SVG:
         | https://openclipart.org/download/28107/klaasvangend-processo...
         | 
         | And the text is in a <tspan>, so it could be appropriately re-
         | titled if desired :)
        
           | jhgb wrote:
           | > so it could be appropriately re-titled if desired
           | 
           | ...to something like "How to not be too curious for your own
           | good"?
        
         | speedgoose wrote:
         | Good catch.
        
         | drewg123 wrote:
         | Wow, I never zoomed in on that. I've been using that as clip
         | art in my slides for years now. You should work for CSI
        
           | cucumb3rrelish wrote:
           | this guy enhances
        
       | arnaudsm wrote:
        | Slightly unrelated, but did they ever consider P2P? It scales
        | really well and saves terabytes of bandwidth, so I wonder what
        | the cons are, except that it's associated with illegal
        | distribution.
        
         | zamadatix wrote:
         | P2P doesn't offer as many experience consistency guarantees and
         | in general adds a lot of complexity (=cost) that doesn't exist
         | in a client server model. Even if you went full bore on P2P you
         | still have to maintain a significant portion of the centralized
         | infrastructure both to seed new content as well as for clients
         | that aren't good P2P candidates (poor connections, limited
         | asymmetric connections, heavily limited data connections). Once
         | you got through all of those technical issues even if you found
         | you could overall save cost by reducing the core
         | infrastructure... it rides on the premise customers are going
         | to be fine with their devices using their upload (both active
         | bandwidth and total data) at the same price as when it was a
         | hosted service.
         | 
         | But yes I imagine licensing is definitely an issue too. Right
         | now only certain shows can be saved offline to a device and
         | only for very restricted periods of time for the same reason.
         | It's also worth noting in many cases Netflix doesn't pay for
         | bandwidth.
        
           | arnaudsm wrote:
           | Thanks for the detailed answer!
           | 
           | It's already there. The piratebay already serves me movies at
           | 200 MBps with almost zero infrastructure cost. It's probably
           | more a licensing issue like you said.
        
             | zamadatix wrote:
             | It's "not already there" as what torrents do is not the
             | same as what Netflix does even though both are related to
             | media delivery. Try seeking on a torrent which only caches
             | the local blocks and getting the same seek delay, try
             | getting instant start and seek when 1 million people are
             | watching a launch, try getting people to pay you the same
             | amount to use their upload, try getting the same battery
             | life on a torrent as a single HTTPS stream using burst
             | intervals. As I said P2P doesn't offer as many experience
             | consistency guarantees and has many other problems.
             | Licensing is just one of many issues.
             | 
             | Of course for free many people are willing to live with
             | starting and seeking being slow or battery life being worse
             | or having to allocate the storage up front or using their
             | upload and so on but again the question is can it fit in
             | the cost offset of some of the centralized infrastructure
             | not what you could do for free. I don't have anything
             | against torrents, quite the opposite I am quite a heavy
             | user of BOTH streaming services and torrenting due to DRM
             | restrictions on quality for some of my devices, but it
             | isn't a single issue problem like you are pressing to make
             | it be.
             | 
             | For some internal distributed action Netflix has made IPFS
             | into something they can use but not for end delivery
             | https://blog.ipfs.io/2020-02-14-improved-bitswap-for-
             | contain...
        
         | sedatk wrote:
         | Residential connections have much lower upload speeds than
         | their download speed. That can impact web browsing speeds
         | because outgoing packets can saturate the bandwidth and delay
         | TCP handshake packets. This is a problem I've been having
          | constantly with my Comcast internet connection. If the net
          | feels slow, I check whether something's doing a large upload,
          | despite the connection being gigabit. I've tried QoS, managed
          | switches, etc.; none helped the situation. P2P is a no-no from
          | that perspective, in addition to the other valid points, until
          | symmetric up/down speeds become the norm (miss you, Sonic!).
        
         | cucumb3rrelish wrote:
         | wouldn't p2p mean that the client receiving somehow
         | participates in the sharing too? that wouldn't go well with
         | some internet plan caps
        
         | fortuna86 wrote:
          | Maybe a dumb question, but does P2P work on native TV apps,
          | Chromecast, etc.? I know it does if you run a client app on
          | Windows or Mac.
        
           | zamadatix wrote:
            | That's not P2P in quite the same sense the above was talking
            | about. Chromecast is client/server; it's just that your
            | device can become the server or instruct the Chromecast to
            | connect to a certain server.
        
           | wmf wrote:
           | If Netflix had made P2P the standard ten years ago then TVs
           | and Chromecasts would support P2P today. But they didn't so
           | they don't.
        
       | gigatexal wrote:
       | Hilarious to see all the attempts at "but why not Linux".
       | 
       | I wonder how many will read this and consider trying out FreeBSD.
       | It's a rather dope OS but I am a bit biased.
        
       | 5faulker wrote:
       | Seems like an absolute overkill for now.
        
         | lvh wrote:
         | I don't see why it's "absolute overkill" or even "for now"? It
         | seems reasonable for Netflix to want to optimize that pipeline
         | and minimize the amount of hardware they need to deploy,
         | especially for dense areas? Top quality video from Netflix can
         | be 2MB/s, so back-of-the-envelope that's only 20k streams. I'd
         | expect needing that capacity on any random evening in Logan
         | Square (a Chicago neighborhood).
        
       | anonuser123456 wrote:
       | Out of curiosity, why TLS encrypt when it's already DRM? To
       | prevent snooping on viewer habits?
        
         | drewg123 wrote:
         | Yes. See a tech blog from last year:
         | https://netflixtechblog.com/how-netflix-brings-safer-and-fas...
        
         | numlock86 wrote:
         | https://doesmysiteneedhttps.com/
         | 
         | Maybe they should add "But I already have DRM!" to the list.
            | DRM solves a completely different problem.
        
           | boardwaalk wrote:
           | More generically, "But my content is already encrypted!" I'm
           | surprised it isn't already there.
        
       | tester756 wrote:
       | >Serving Netflix Video at 400Gb/s [pdf]
       | 
       | Why not 50GB/s?
       | 
       | We don't say 50000 m/h instead of 50km/h
        
         | drewg123 wrote:
          | Because it's network bandwidth, and network units are in bits
          | for historical reasons. I actually griped about this in my
          | talk.. it makes it a pain to talk about memory bw vs network
          | bw without mental gymnastics to move around that factor of 8.
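          | 
          | (For example: 400 Gb/s / 8 = 50 GB/s of payload, and that is
          | the figure you have to line up against memory bandwidth
          | numbers, which are quoted in bytes.)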
        
           | jakeinspace wrote:
           | Personally, I think that keeping the bit as the default
           | network bandwidth unit is a good idea. When expressing
           | physical transmission limitations, the equations of
           | information theory are all in terms of bits (being the
           | smallest unit of classical information). Obviously, this
           | means that somewhere on the spectrum from the Shannon Limit
           | to storage on disk, we've gotta shift from bits to bytes. The
           | NIC is as good a border as any.
        
           | tester756 wrote:
           | thank you
        
         | [deleted]
        
         | nix23 wrote:
         | Wow you have much to learn buddy.
        
       | azinman2 wrote:
        | What's amazing to me is how much data they're pumping out --
        | their bandwidth bills must be insane. Or they negotiated some
        | pretty amazing contracts, considering Netflix is, what,
        | $10-15/mo? And there are many who "binge" many shows, likely
        | consuming gigabytes per day (in addition to all the other costs,
        | not least of which is actually making the programming)?
        
         | cavisne wrote:
         | Netflix has many peering relationships with isps
         | 
         | https://openconnect.netflix.com/en/peering/
         | 
         | The actual bandwidth is significant though, even compared to
         | something like YouTube
         | 
         | https://www.smh.com.au/technology/these-graphs-show-the-impa...
        
           | dmead wrote:
           | there's a graph somewhere showing the bitrate for netflix
           | before and after they paid out extortion fees to comcast.
        
             | RKearney wrote:
             | Extortion fees? Netflix wanted Comcast to provide them with
             | free bandwidth. While some ISPs may appreciate that
             | arrangement as it offloads the traffic from their transit
             | links, Comcast is under no obligation to do so.
             | 
             | You could argue then that Comcast wasn't upgrading their
             | saturated transit links, which they weren't with Level 3,
             | but to assume that every single ISP should provide free
             | pipes to companies is absurd.
        
               | sedatk wrote:
               | If Netflix pays for the bandwidth, what do Comcast's
               | customers pay for?
        
               | pstuart wrote:
               | Customer service!
        
               | sedatk wrote:
               | "I have people skills; I am good at dealing with people.
               | Can't you understand that? What the hell is wrong with
               | you people?"
        
               | freedomben wrote:
               | I don't normally upvote jokes unless they're funny but
               | also make a legitimate point. This was a good one :-D
        
               | cbg0 wrote:
               | Comcast were purposefully keeping those links under
               | capacity so they could double dip - get money from both
               | their customers and Netflix.
        
               | RKearney wrote:
               | So no business should ever have to pay for bandwidth
               | because customers of said business are already paying? Or
               | should I get free internet because businesses are already
               | paying ISPs for their network access?
        
               | dmead wrote:
               | https://www.merriam-webster.com/dictionary/extortion
        
               | leereeves wrote:
               | Businesses have to pay for bandwidth to get data to the
               | customer's ISP, but they generally don't pay the
               | customer's ISP for the same bandwidth the customer has
               | already paid for.
        
               | RKearney wrote:
               | Netflix did not have to pay Comcast either. One of their
               | ISPs Level 3 already had peering arrangements with
               | Comcast to deliver the content. Instead of paying their
               | own ISP, Netflix wanted to get free bandwidth from
               | Comcast. There's a difference.
        
               | leereeves wrote:
                | Comcast throttled _customers'_ bandwidth, refusing to
               | deliver the speeds they had promised to their customers
               | when those customers wanted to use it to watch Netflix.
               | 
               | The issue exists because Comcast is lying to their
               | customers, promising them more bandwidth than they are
               | able to deliver, so when Comcast's customers wanted to
               | use all the bandwidth they bought to watch Netflix,
               | Comcast couldn't afford to honor their promises.
               | 
               | And Comcast's customers are as unhappy with Comcast as
               | Netflix is; Comcast only gets away with acting as they do
               | because they have a monopoly in the markets they serve.
        
               | yjftsjthsd-h wrote:
               | That sounds reasonable? If I'm an end-user, what I'm
               | paying my ISP for is to get me data from Netflix (or
               | youtube or wikipedia or ...); that is the whole _point_
               | of having an ISP. If that means they need to run extra
               | links to Netflix, then tough; _that 's what I'm paying
               | them for_.
        
               | RKearney wrote:
               | You're paying them for access to the ISP's network. For
               | an internet connection, this also comes with access to
               | other networks through your ISP's peering and transit
               | connections.
               | 
               | If you know of any ISP that would give me 100+ Gbps
               | connectivity for my business, please let me know as I'd
               | love to eliminate my bandwidth costs.
        
               | deadbunny wrote:
               | The bandwidth paid for by their customers you mean?
        
             | NetBeck wrote:
             | Fast.com was originally a tool to measure if your line was
             | throttling Netflix traffic.[0]
             | 
             | [0] https://qz.com/688033/netflix-launched-this-handy-
             | speed-test...
        
         | kristjansson wrote:
         | Hence, to a large extent, the devices described in the
         | presentation. They're (co)located with ISP hardware, so the
         | bulk data can be transferred directly to the user at minimal /
         | zero marginal cost
        
         | [deleted]
        
         | dragontamer wrote:
         | One of the funny truths about the business of networks is that
         | whoever "controls" the most bandwidth has the #1 negotiating
         | spot at the table.
         | 
         | Take, for example, if Verizon decided to charge Netflix more
         | for bandwidth: if Netflix said "no" and went with another
         | provider, then Verizon's customers would suffer worse access
         | times to Netflix.
         | 
         | Verizon's advantage is that they have a huge customer base, so
         | no one wants to piss off Verizon. It cuts both ways: at this
         | scale, bandwidth becomes not a cost but a moat.
        
           | btgeekboy wrote:
           | Not sure if you realize which ISP you picked, but Verizon and
           | Netflix actually had peering disputes in 2014 which gave me
           | quite the headache at my then-current employer.
        
           | harikb wrote:
           | My hypothesis is that if Netflix/YouTube hadn't stressed out
           | our bandwidth and forced the ISPs of the world to upgrade
           | over the last decade, the world wouldn't have been ready for
           | WFH in the covid era.
           | 
           | ISPs would have been more than happy to show the middle
           | finger to the WFH engineers, but not to the binge-watching
           | masses.
        
             | elithrar wrote:
             | > My hypothesis is that if Netflix/YouTube hadn't stressed
             | out our bandwidth and forced the ISPs of the world to
             | upgrade over the last decade, the world wouldn't have been
             | ready for WFH in the covid era.
             | 
             | Couldn't agree more.
             | 
             | We see the opposite when it comes to broadband monopolies:
             | "barely good enough" DSL infrastructure, congested HFC, and
             | adversarial relationships w.r.t. subscriber privacy and
             | experience.
             | 
             | When it became worthwhile to invest in not just peering but
             | also last-mile because poor Netflix/YouTube/Disney+/etc
             | performance was a reason for users to churn away, they
             | invested.
             | 
             | This isn't to say that this is all "perfect" for consumers
             | either, but this tension has only been good for consumers
             | vs. what we had in the 90's and early-mid 00's.
        
             | lotsofpulp wrote:
             | The US is still not ready. The vast majority of people only
             | have access to oversubscribed coaxial cable internet, with
             | nonexistent upload allocation.
        
             | oriolid wrote:
             | The binge-watching masses are easy to satisfy. All it takes
             | for the stream to work is an average speed of a few
             | megabits per second, and there's so much caching at the
             | client end that high latency and a few seconds of total
             | blackout every now and then don't really matter.
        
         | js4ever wrote:
         | Of course they are not paying $120/TB like AWS's public pricing
         | 
         | I heard they are paying something between $0.20 (EU/US) and $10
         | (exotic) per TB, depending on the region of the world the
         | traffic is coming from
        
           | city41 wrote:
           | Video data does not stream over AWS. AWS is used for
           | everything else though.
        
           | elithrar wrote:
           | > I heard they are paying something between $0.20 (eu/us) to
           | $10 (exotic) per TB based on the region of the world where
           | the traffic is coming from
           | 
           | They're likely paying even less. $0.20/TB ($0.0002/GB) is
           | aggressive, but given their scale, connectivity, and
           | per-machine throughput, it's lower still.
           | 
           | A few points to take home:
           | 
           | - They [Netflix, YT, etc.] model cost by Mbps - that is, the
           | cost to deliver traffic at a given peak. You have to
           | provision for peaks, or take a reliability/quality-of-
           | experience hit, and for on-demand video your peaks are
           | usually ~2x your average.
           | 
           | - This can effectively be "converted" into a $/TB rate, but
           | that's an abstraction, and not a productive way to model it.
           | Serving (e.g.) 1000PB (1EB) into a geography at a daily peak
           | of 3Tbps is much cheaper than serving the same volume at a
           | peak of 15Tbps.
           | 
           | - Netflix, more so than most others, benefits from having a
           | "fixed" corpus at any given moment. Their library is huge,
           | but (unlike YouTube) users aren't uploading content, they
           | aren't doing live streaming or sports events, etc. - so they
           | can intelligently place content to reduce the need to cache-
           | fill their appliances. It's cheaper to cache-fill if you can
           | trickle most of it in during the troughs, as you don't need
           | as big a backbone, peering links, etc. to do so.
           | 
           | - This means that Netflix (rightfully!) puts a lot of effort
           | into per-machine throughput, because they want to get as much
           | user-facing throughput as possible from the given (space,
           | power, cost) of a single box. That density is also attractive
           | to ISPs, as it means that every "1RU of space" they give
           | Netflix has a better ROI in terms of network cost reductions
           | vs. others, esp. when combined with the fact that "Netflix
           | works great" is an attractive selling point for users.
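           As a rough illustration of the peak-vs-average point above (a
           sketch, not from the comment; the $0.10/Mbps/month rate is a
           made-up placeholder, not a real or quoted price):
           
               SECONDS_PER_MONTH = 30 * 24 * 3600
               PRICE_PER_MBPS_MONTH = 0.10   # hypothetical committed rate
               
               def effective_price_per_tb(peak_gbps, peak_to_avg):
                   """$ per TB delivered, given a peak commit and a
                   peak-to-average ratio (~2x for on-demand video)."""
                   monthly_cost = peak_gbps * 1000 * PRICE_PER_MBPS_MONTH
                   avg_gbps = peak_gbps / peak_to_avg
                   tb = avg_gbps / 8 * SECONDS_PER_MONTH / 1000  # GB -> TB
                   return monthly_cost / tb
               
               # Same volume delivered; the flatter curve is ~5x cheaper:
               print(effective_price_per_tb(3_000, peak_to_avg=2))    # ~$0.62/TB
               print(effective_price_per_tb(15_000, peak_to_avg=10))  # ~$3.09/TB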
        
           | 101008 wrote:
           | Sorry for the naive question, but to offer those prices,
           | there are two options: A) Amazon is losing money to keep
           | Netflix as their client, or B) they are making a profit even
           | at $0.20/TB, which means at least $119.80 of the $120/TB is
           | pure margin. Wow.
        
             | kristjansson wrote:
             | AWS {in,e}gress pricing strategy is not motivated by their
             | cost of provisioning that service. Cloudflare had a good
             | (if self-motivated) analysis of their cost structure
             | discussed on here a while ago
             | 
             | https://news.ycombinator.com/item?id=27930151
        
             | manquer wrote:
             | The $0.20/TB cost is not for their agreements with Amazon;
             | the bulk of their video traffic is peered directly with
             | ISPs, while AWS largely serves their APIs and orchestration
             | infrastructure.
             | 
             | Amazon and most cloud providers do overcharge for
             | bandwidth. You _can_ easily buy an OVH/Hetzner-type box
             | with guaranteed unmetered 1 Gbps public bandwidth for
             | ~$120/month, which if fully utilized is equivalent to
             | ~325TB/month, or roughly $0.37/TB, completely ignoring the
             | 8/16-core bare-metal server and attached storage you also
             | get. These are SMB/self-service prices; you can get much
             | better deals with basic negotiation and a contract with a
             | DC.
             | 
             | One thing to remember, though: not all bandwidth is equal.
             | CSPs like AWS provide a lot of features, such as very
             | elastic on-demand scale-up, protection up to L4, and
             | advanced SDN under the hood to make sure your VMs can
             | actually leverage the bandwidth; that is computationally
             | expensive and costly.
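             The unmetered-pipe conversion above works out roughly like
             this (a quick sketch using the figures quoted in the
             comment):
             
                 MONTHLY_PRICE = 120.0   # $/month for the box
                 LINK_GBPS = 1.0         # unmetered public bandwidth
                 
                 seconds = 30 * 24 * 3600
                 tb_per_month = LINK_GBPS / 8 * seconds / 1000  # GB -> TB
                 
                 print(round(tb_per_month))                     # ~324 TB/month
                 print(round(MONTHLY_PRICE / tb_per_month, 2))  # ~$0.37/TB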
        
             | robocat wrote:
             | Netflix has appliances installed at exchange points that
             | cache most of the Netflix catalogue. Each appliance peers
             | locally and serves the streams locally.
             | 
             | The inbound data stream to fill the cache of the appliance
             | is rate limited and time limited - see
             | https://openconnect.zendesk.com/hc/en-
             | us/articles/3600356180... The actual inbound data to the
             | appliance will be higher than the fill because not
             | everything is cached.
             | 
             | The outbound stream from the appliance serves consumers. In
             | New Zealand for example, Netflix has 40Gbps of connectivity
             | to a peering exchange in Auckland.
             | https://www.peeringdb.com/ix/97
             | 
             | So although total Netflix bandwidth to consumers is
             | massive, it has little in common with the bandwidth you pay
             | for at Amazon.
             | 
             | Disclaimer: I am not a network engineer.
        
           | sroussey wrote:
           | AWS public pricing is comically high. No one even prices in
           | that manner.
        
       | tgtweak wrote:
       | Cool but how much does one of those servers cost... 120k?
       | 
       | Pretty crazy stuff though, would love to see something similar
       | from cloudflare eng. since those workloads are extremely broad vs
       | serving static bits from disk.
        
         | dragontamer wrote:
         | https://www.thinkmate.com/system/a+-server-1114s-wtrt
         | 
         | For an EPYC 7502P 32-core / 256GB RAM / 2x Mellanox ConnectX-6
         | dual-port NICs, I'm seeing $10,000.
         | 
         | Then comes the 18x SSD drives, lol, plus the cards that can
         | actually hold all of that. So the bulk of the price is
         | SSD+associated hardware (HBA??). The CPU/RAM/Interface is
         | actually really cheap, based on just some price-shopping I did.
        
           | zamadatix wrote:
           | The WD SN720 is an NVMe SSD, so there are no HBAs involved;
           | it just plugs into the (many) PCIe lanes the Epyc CPU
           | provides. CDW gives an instant anonymous-buyer web price of
           | ~$11,000.00 for all 18.
        
         | kristjansson wrote:
         | Surprisingly less, it seems? Those NICs are only like a
         | kilobuck each, the drives are like 0.5k, the CPU is like 3k-5k.
         | So maybe 15k-20k all in, with the flash comprising about half
         | of that?
         | 
         | Seems surprisingly cheap, but I'm not sure if that's just great
         | cost engineering on Netflix's part or a poor prior on my part
         | ... I'll choose to blame Nvidia's pricing in other domains for
         | biasing me upward.
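         A quick tally of those guesses (a sketch; every figure is the
         comment's ballpark, not a real quote, and the RAM/chassis line
         is an assumed filler to get to "all in"):
         
             parts = {
                 "2x NICs":          2 * 1_000,
                 "18x NVMe drives":  18 * 500,
                 "CPU":              4_000,   # midpoint of the 3k-5k guess
                 "RAM/chassis/etc":  3_000,   # assumed filler
             }
             
             total = sum(parts.values())
             flash_share = parts["18x NVMe drives"] / total
             print(total, flash_share)   # 18000, 0.5 -- flash is ~half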
        
           | virtuallynathan wrote:
           | That's in the right ballpark, I believe.
        
         | daper wrote:
         | I think the described machines were designed for a very
         | specific workload, even in the CDN world. I'd guess they serve
         | exclusively fragments of on-demand videos, no dynamic content.
         | So: a high bytes-per-request ratio, nearly 100% hit ratio, very
         | long cache TTLs (probably without revalidation, and almost no
         | purge requests), small / easy-to-interpret requests, very
         | effective HTTP keep-alive, a relatively small number of new TLS
         | sessions (the TLS handshake is not accelerated in hardware),
         | and a low number of objects due to their size (far fewer
         | stat()/open() calls that would block an nginx process). Not to
         | mention other CDN functionality like workers, WAF, page rules,
         | rewrites, custom cache keys, little or no logging of requests,
         | etc. That really simplifies things a lot compared to Cloudflare
         | or Akamai.
        
       | morty_s wrote:
       | Enjoying the slides! Is there a small typo on slide 17?
       | 
       | >Strategy: Keep as much of our 200GB/sec of bulk data off the
       | NUMA fabric [as] possible
        
         | drewg123 wrote:
         | No, 400Gb/s == 50 GB/s
         | 
         | With software kTLS, the data is moved in and out of memory 4x,
         | so that's 200GB/s of memory bandwidth needed to serve 50GB/s
         | (400Gb/s) of data.
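         Back-of-the-envelope for that figure (a sketch; the 4x total is
         from the answer above, while the per-pass breakdown in the
         comments below is an assumption about where the copies happen):
         
             LINE_RATE_GBIT = 400                     # NIC line rate, Gbit/s
             served_gbyte_per_s = LINE_RATE_GBIT / 8  # 50 GB/s of video data
             
             # Assumed passes over main memory with software kTLS:
             #  1) storage DMA writes plaintext into RAM
             #  2) CPU reads plaintext to encrypt it
             #  3) CPU writes ciphertext back to RAM
             #  4) NIC DMA reads ciphertext to put it on the wire
             MEMORY_PASSES = 4
             
             print(served_gbyte_per_s * MEMORY_PASSES)  # 200.0 GB/s of memory traffic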
        
       | betaby wrote:
       | And yet a lot of folks think Netflix is on AWS. I would think AWS
       | gives Netflix a huge discount and asks them to 'just run at least
       | something on AWS'. Those promotional videos have nothing useful
       | about Netflix on AWS https://aws.amazon.com/solutions/case-
       | studies/netflix-case-s... and are targeted at managers and at
       | engineers unaware of the scale.
        
         | dangrossman wrote:
         | Netflix is on AWS -- the website and app backends, not the
         | video streams.
         | 
         | * Netflix.com resolves to
         | ec2-52-3-144-142.compute-1.amazonaws.com from my computer at
         | this moment.
         | 
         | Some more domains their apps reach out to include:
         | 
         | * nflxvideo.net: ec2-46-137-171-215.eu-
         | west-1.compute.amazonaws.com.
         | 
         | * appboot.netflix.com:
         | ec2-34-199-156-113.compute-1.amazonaws.com.
         | 
         | * uiboot.netflix.com:
         | ec2-52-203-145-205.compute-1.amazonaws.com.
        
           | betaby wrote:
           | That's exactly what I mean. AWS would represent a single-
           | digit percentage of the infrastructure and traffic.
        
             | dangrossman wrote:
             | Maybe of their traffic, but not their infrastructure.
             | 
             | Netflix rents 300,000 CPUs on AWS just for video encoding.
             | Encoding one TV series produces over 1,000 files that sit
             | on S3, ready for those PoPs to pull them in and cache them.
             | 
             | They also do more data processing, analytics and ML than
             | most people realize and that's all happening on AWS.
             | 
             | For example, not only are each user's recommendations
             | personalized, but all of the artwork is personalized too.
             | If you like comedy, the cover art for Good Will Hunting may
             | be a photo of Robin Williams. If you like romance, the
             | cover art for the same movie may be a photo of Matt Damon
             | and Minnie Driver going in for a kiss.
             | 
             | They're also doing predictive analytics on what each person
             | will want to watch _tomorrow_ in particular. Every evening,
             | all those non-AWS servers go ask AWS what video files they
             | should download from S3 to proactively cache the next day
             | of streaming.
        
               | betaby wrote:
               | ~300000 CPUs is about 2K racks' worth of servers. I would
               | think that if you have such a constant load, it would be
               | more economical to move that on-premises.
        
               | jeffbee wrote:
               | 300k cpu cores is "only" about 1200 nodes these days,
               | definitely not 2000 racks.
        
               | betaby wrote:
               | If CPU == core thread, then yes.
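               The node arithmetic, for reference (a sketch; the per-node
               figure assumes a dual-socket, 64-core-per-socket part with
               SMT, which is an assumption, not something stated above):
               
                   total_cpus = 300_000           # "CPUs" == HW threads
                   threads_per_node = 2 * 64 * 2  # sockets x cores x SMT
                   
                   print(round(total_cpus / threads_per_node))  # ~1172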
        
               | kristjansson wrote:
               | The load is emphatically not constant though? Encoding
               | has to be done once, as fast as possible, and updated
               | rarely. That warehouse (or just aisle) of racks would be
               | sitting idle much of the time...
        
               | killingtime74 wrote:
               | You're thinking like an engineer. A business person would
               | just ask for a discount (which I'm sure they do) and
               | constantly renegotiate. Don't forget the financing of
               | on-premises hardware and data-centre workers, and how
               | that looks on your balance sheet.
        
               | jeffbee wrote:
               | I wonder if there is any discussion of whether they could
               | make it cheaper with Google's compression ASICs, which I
               | assume will be a cloud service at some point.
        
               | whoknowswhat11 wrote:
               | Amazon offers something not as fancy (but maybe more
               | flexible?)
               | 
               | VT1 instances can support streams up to 4K UHD resolution
               | at 60 frames per second (fps) and can transcode up to 64
               | simultaneous 1080p60 streams in real time.
               | 
               | One thing is that a place like Netflix has got to have a
               | LOT of target formats (think of the range of devices
               | Netflix delivers to). ASICs sometimes struggle when faced
               | with this needed flexibility.
        
               | despacito wrote:
               | A lot of that batch processing is done by borrowing from
               | unused reservations for the streaming service.
               | https://netflixtechblog.com/creating-your-own-ec2-spot-
               | marke...
        
           | Xorlev wrote:
           | And their batch processing.
           | 
           | I recall from a talk at Re:Invent forever ago, they had
           | everything on AWS except:
           | 
           | - Their CDN, which has to be deployed wherever there's
           | demand.
           | 
           | - An old Oracle setup with some ancient software that runs
           | their DVD business.
           | 
           | The latter may not be true anymore. In any case, it's smart
           | to break from cloud providers when it makes sense for your
           | problem. Video streaming is network-heavy, so pushing the
           | actual streaming out to where the consumers are is really
           | smart.
        
           | isclever wrote:
           | Go to fast.com and open the web dev tools on the network tab.
           | Run a speed test; the hosts are the local caches (OpenConnect)
           | that you would use to stream movies.
           | 
           | Note: I don't know for sure, but it's the most likely.
        
             | virtuallynathan wrote:
             | Yep, that is generally the case. The steering is slightly
             | different, but close enough.
        
         | bsaul wrote:
         | Wait, aren't they running anything on AWS at all? I was 100%
         | sure Netflix had the most advanced use of AWS (inventing things
         | like Chaos Monkey, etc.)...
         | 
         | What are they running on?
        
           | johneth wrote:
           | They run their backend on AWS (i.e. accounts, catalogue,
           | watch history, etc.)
           | 
           | They serve their video data using their own custom hardware
           | outside of AWS.
        
           | elzbardico wrote:
           | They run EVERYTHING but last-mile file serving on AWS.
           | Encoding, storage, catalog, recommendations, accounts, etc. -
           | all of it is in AWS.
        
             | dralley wrote:
             | They have resident cache boxes hooked directly into many
             | ISPs. Those aren't on AWS.
        
               | lvh wrote:
               | That sounds like the last-mile the GP is talking about?
        
         | Xorlev wrote:
         | Netflix runs ~everything but video streaming from AWS. That's
         | more than just lip service.
        
         | teleforce wrote:
         | I have a friend who used to work at iflix, an Asian version of
         | Netflix [1]. According to him, all of their content is on AWS.
         | 
         | [1]https://www.iflix.com/
        
         | robjan wrote:
         | Netflix use AWS for their commodity workloads and rolled their
         | own CDN because delivering high quality streams without using
         | too much bandwidth on the public internet is one of their
         | differentiators.
        
         | Hikikomori wrote:
         | Laughably wrong.
        
       | justin_ wrote:
       | There is no video for this available yet AFAIK, but for those
       | interested there is a 2019 EuroBSDcon presentation online from
       | the same speaker that focuses on the TLS offloading:
       | https://www.youtube.com/watch?v=p9fbofDUUr4
       | 
       | The slides here look great. I'm looking forward to watching the
       | recording.
        
       | mrlonglong wrote:
       | At 400Gb/s, how long would it take to read through the entire
       | catalogue?
        
       ___________________________________________________________________
       (page generated 2021-09-19 23:00 UTC)