[HN Gopher] Serving Netflix Video at 400Gb/s [pdf]
___________________________________________________________________
Serving Netflix Video at 400Gb/s [pdf]
Author : drewg123
Score : 442 points
Date : 2021-09-19 14:47 UTC (8 hours ago)
(HTM) web link (people.freebsd.org)
(TXT) w3m dump (people.freebsd.org)
| ramshanker wrote:
| Meanwhile my internet connection is 40 Mbps.
| segfaultbuserr wrote:
| > _Run kTLS workers, RACK / BBR TCP pacers with domain affinity_
|
| TCP BBR is in the FreeBSD upstream now? Cool.
| zamadatix wrote:
| As of 13.0
| woah wrote:
| Dumb question, but what, specifically, does "fabric" mean in this
| context?
| adwn wrote:
| Fabric = data transmission infrastructure between CPUs or
| cores. This includes the wires which carry the data signals, as
| well as routing logic.
| zamadatix wrote:
| For the nitty gritty details
| https://en.wikichip.org/wiki/amd/infinity_fabric
| drewg123 wrote:
| These are the slides from my EuroBSDCon presentation. AMA
| polskibus wrote:
| Was FreeBSD your first choice? Or did you try with Linux
| first? What were the numbers for a Linux-based solution, if
| there was one?
| drewg123 wrote:
| FreeBSD was selected at the outset of the Open Connect CDN
| (~2012 or so).
|
| We did a bake-off a few years ago, and _FOR THIS WORKLOAD_
| FreeBSD outperformed Linux. I don't want to get into an OS
| war; that's not productive.
| kayson wrote:
| By how much?
| drewg123 wrote:
| It's important to consider that we've poured man-years into
| this workload on FreeBSD. Just off the top of my head, we've
| worked on in house, and/or contributed to or funded, or
| encouraged vendors to pursue:
|
| - async sendfile (so sendfile does not block, and you don't
| need thread pools or AIO)
| - RACK and BBR TCP in FreeBSD (for good QoE)
| - kTLS (so you can keep using sendfile with TLS; saves ~60%
| CPU over reading data into userspace and encrypting there)
| - NUMA
| - kTLS offload (to save memory bandwidth by moving crypto
| to the NIC)
|
| Not to mention tons of VM system and scheduler improvements
| which have been motivated by our workload.
|
| FreeBSD itself has improved tremendously over the last few
| releases in terms of scalability.
| sandGorgon wrote:
| A lot of this is in Linux now, right? I am asking for a
| personal opinion and not necessarily a "why-don't-you-move-
| to-Linux" question.
|
| Genuinely curious where you see the state of the art when
| it comes to Linux.
| drewg123 wrote:
| Yes. I ran it for a bake-off ~2 years ago. At the time, the
| code in Linux was pretty raw, and I had to fix a bug in
| their SW kTLS that caused data corruption that was visible
| to clients. So I worry that it was not in frequent use at
| the time, though it may be now.
|
| My understanding is that they don't do zero-copy inline
| kTLS, but I could be wrong about that.
| spockz wrote:
| Thank you for pushing kTLS!
| jenny91 wrote:
| Was licensing also a contributing factor, or was that
| irrelevant for you?
| drewg123 wrote:
| My understanding is that licensing did factor into the
| decision. However, I didn't join Netflix until after the
| decision had been made.
| tatersolid wrote:
| I recall reading that Netflix chose FreeBSD a decade ago
| because asynchronous _disk_ IO on Linux was (and still is?)
| broken and/or limited to fixed block offsets. So nginx just
| works better on FreeBSD versus Linux for serving static
| files from spinning rust or SSD.
| Sesse__ wrote:
| This used to be the case, but with io_uring, Linux has
| very much non-broken buffered async I/O. (Windows has
| copied io_uring pretty much verbatim now, but that's a
| different story.)
| gavinray wrote:
| Could you expand more on the Windows io_uring bit please?
|
| I have run Debian-based Linux my entire life and recently
| moved circumstantially to Windows. I have no idea how its
| kernel model works, and I find io_uring exciting.
|
| Wasn't aware of any adoption of io_uring ideas in Windows
| land, sounds interesting
| Matthias247 wrote:
| The architecture slides don't show any in-memory read caching
| of data? I guess there is at least some, but would it be at the
| disk side or the NIC side? I guess sendfile without direct IO
| would read from a cache.
| drewg123 wrote:
| Caching is left off for simplicity.
|
| We keep track of popular titles, and try to cache them in
| RAM, using the normal page cache LRU mechanism. Other titles
| are marked with SF_NOCACHE and are discarded from RAM ASAP.
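| At the syscall level that looks roughly like the sketch
| below (a minimal illustration, not our actual nginx code;
| error handling omitted):
|
|     #include <sys/types.h>
|     #include <sys/socket.h>
|     #include <sys/uio.h>
|
|     /* Send one chunk of a file over a TCP socket.  For
|      * unpopular titles, pass SF_NOCACHE so the pages are
|      * dropped from the page cache once they have been sent. */
|     static int
|     send_chunk(int file_fd, int sock_fd, off_t off,
|         size_t len, int popular)
|     {
|             off_t sbytes = 0;
|             int flags = popular ? 0 : SF_NOCACHE;
|
|             return (sendfile(file_fd, sock_fd, off, len,
|                 NULL, &sbytes, flags));
|     }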
| Matthias247 wrote:
| in which node would that page cache be allocated? In the
| one where the disk is attached, or where the data is used?
| Or is this more or less undefined or up to the OS?
| drewg123 wrote:
| This is gone over in the talk. We allocate the page
| locally to where the data is used. The idea is that we'd
| prefer the NVME drive to eat any latency for the NUMA bus
| transfer, and not have the CPU (SW TLS) or NIC (inline HW
| TLS) stall waiting for a transfer.
| markjdb wrote:
| How much data ends up being served from RAM? I had the
| impression that it was negligible and that the page cache
| was mostly used for file metadata and infrequently accessed
| data.
| drewg123 wrote:
| It depends. Normally about 10-ish percent. I've seen well
| over that in the past for super popular titles on their
| release date.
| otterley wrote:
| How would you characterize the economics of this vs.
| alternative solutions? Achieving 400Gb/s is certainly a
| remarkable achievement, but is it the lowest price-per-Gb/s
| solution compared to alternatives? (Even multiple servers.)
| dragontamer wrote:
| I doubt that the 32-core EPYC they focused on is even the
| most economical solution in this situation.
|
| If they're really RAM-bandwidth constrained, then the 24-core
| 74F3 (which still has the full 256MB of L3 cache) or even the
| 8-core 72F3 may be better.
| sroussey wrote:
| L3 won't matter.
|
| The more compute clusters, the more PCIe lanes in EPYC, and
| the SSD lanes go directly to each 8-core complex.
| samstave wrote:
| I recall when I was at Intel in the 90s and 3MB of L3 was
| a big deal.
| pastrami_panda wrote:
| This. Can someone add to this? It's crucial.
| drewg123 wrote:
| I'm not on the hardware team, so I don't have the cost
| breakdown. But my understanding is that flash storage is the
| most expensive line item, and it matters little if its
| consolidated into one box, or spread over a rack, you still
| have to pay for it. By serving more from fewer boxes, you can
| reduce component duplication (cases, mobos, ram, PSUs) and
| more importantly, power & cooling required.
|
| The real risk is that we introduce a huge blast radius if one
| of these machines goes down.
| [deleted]
| sroussey wrote:
| Ouch. I guess there is always a trade off to be made...
| fragmede wrote:
| Particularly, kTLS is a solution that fits here, on a single
| box, but I wonder how things would look if the high-perf
| storage boxes sent video unencrypted and there was a second
| box that dealt only with the TLS. We'd have to know how many
| streams 400Gb/s represents though, and have a far more
| detailed picture of Netflix's TLS needs.
| HalcyonicStorm wrote:
| Is there any way we can see a video of the presentation? I'm
| extremely interested.
| allanjude wrote:
| The videos should appear on the conference's youtube channel
| in a few weeks: https://www.youtube.com/eurobsdcon
| gautamcgoel wrote:
| 1. Could GPU acceleration help at all?
|
| 2. When serving video, do you use floating point operations at
| all? Could this workload run on a hypothetical CPU with no
| floating point units?
|
| 3. How many of these hardware platforms do you guys own?
| 10k?100k?
| sroussey wrote:
| The videos are precomputed. So no GPU required to stream
| drewg123 wrote:
| 1) No. Well, potentially as a crypto accelerator, but QAT and
| Chelsio T6 are less power hungry and more available. GPUs are
| so expensive/unavailable now that leveraging them in creative
| ways makes less sense than just using a NIC like the CX6-Dx,
| which has crypto as a low cost feature.
|
| 2) These are just static files, all encoding is done before
| it hits the CDN.
| betaby wrote:
| How exactly does 'Constrained to use 1 IP address per host'
| help eliminate cross-NUMA transfers?
| drewg123 wrote:
| If we could use more than 1 IP, then we could treat 1 400Gb
| box as 4 100Gb boxes. That could lead to the "perfect case"
| every time, since connections would always stay local to the
| numa node where content is present.
| betaby wrote:
| There is no kTLS for IPv6? IPv6 space is abundant and most
| mobiles in USA/Canada have IPv6. Won't that solve the
| problem?
| drewg123 wrote:
| IPv4 and IPv6 can both use kTLS. We offer service via V6,
| but most clients connect via IPv4. It differs by region,
| and even time of day, but IPv4 is still the vast majority
| of traffic.
| chungy wrote:
| I've had to blackhole Netflix IPv6 ranges on my router
| because Netflix would identify my IPv6 connection as
| being a "VPN" even though it's not.
| drewg123 wrote:
| If you could email me your ranges, I can try to look into
| it internally. Use my last name at gmail.com (or at
| freebsd.org)
| genewitch wrote:
| I'm not GP; Hurricane Electric IPv6 always had Netflix
| think I was on a VPN, but now I have real IPv6 through
| the same ISP; I just pay more money, so Netflix doesn't
| complain anymore.
| kvathupo wrote:
| This may be a naive question, but data is sent at 400Gb/s to
| the NIC, right? If so, is it fair to assume that data is
| actually sent/received at a similar rate?
|
| I ask since I was curious why you guys opted not to bypass
| sendfile(2). I suppose it wouldn't matter in the event that the
| client is some viewer, as opposed to another internal machine.
| drewg123 wrote:
| We actually try really, really hard not to blast 400Gb/s at a
| single client. The 400Gb/s is in aggregate.
|
| Our transport team is working on packet pacing, or really
| packet spreading, so that any bursts we send are small enough
| to avoid being dropped by the client, or an intermediary
| (cable modem, router, etc).
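| For example (made-up numbers): if a connection is paced at
| 12 Mb/s, 1500-byte packets go out roughly one per millisecond
|
|     1500 bytes x 8 = 12,000 bits
|     12,000 bits / 12,000,000 bits/s = 1 ms between packets
|
| instead of being blasted back-to-back at the NIC's line rate.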
| Matthias247 wrote:
| Well it's not 1 client. It's thousands of viewers and
| streams. An individual stream will have whatever the maximum
| 4k bandwidth for Netflix is.
| dragontamer wrote:
| I eventually figured it out. But I would suggest maybe giving a
| brief 1-slide thingy on "lagg", and link aggregation? (Maybe it
| was clear from the presentation, but I only see the slides here
| soooooo...)
|
| I'm not the best at network infrastructure, though I'm more
| familiar with NUMA stuff. So I was trying to figure out how you
| only got 1-IP address on each box despite 4-ports across
| 2-nics.
|
| I assume some Linux / Windows devops people are just not as
| familiar with FreeBSD tools like that!
|
| EDIT: Now that I think of it: maybe a few slides on how link-
| aggregation across NICs / NUMA could be elaborated upon
| further? I'm frankly not sure if my personal understanding is
| correct. I'm imagining how TCP-connections are fragmented into
| IP-packets, and how those packets may traverse your network,
| and how they get to which NUMA node... and it seems really more
| complex to me than your slides indicate? Maybe this subject
| will take more than just one slide?
| drewg123 wrote:
| Thanks for the feedback. I was hoping that the other words on
| the slide (LACP and bonding) would give enough context.
|
| I'm afraid that my presentation didn't really have room to
| dive much into LACP. I briefly said something like the
| following when giving that slide:
|
| Basically, the LACP link partner (router) hashes traffic
| consistently across the multiple links in the LACP bundle,
| using a hash of its choosing (typically an N-tuple, involving
| IP address and TCP port). Once it has selected a link for
| that connection, that connection will always land on that
| link on ingress (unless the LACP bundle changes in terms of
| links coming and going). We're free to choose whatever egress
| NIC we want (it does not need to be the same NIC the
| connection entered on). The issue is that there is no way for
| us to tell the router to move the TCP connection from one NIC
| to another (well, there is in theory, but our routers can't
| do it)
|
| I hope that helps
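| If it helps to make that concrete, here's a toy version of
| the kind of N-tuple hash a router might use (illustrative
| only; real routers use their own hash functions):
|
|     #include <stdint.h>
|
|     /* Toy flow hash: every packet of a given TCP connection
|      * hashes to the same value, so the connection always
|      * lands on the same member link of the LACP bundle. */
|     static unsigned
|     pick_link(uint32_t src_ip, uint32_t dst_ip,
|         uint16_t src_port, uint16_t dst_port, unsigned nlinks)
|     {
|             uint32_t h = 2166136261u;    /* FNV-1a style mix */
|             uint32_t w[3] = { src_ip, dst_ip,
|                 ((uint32_t)src_port << 16) | dst_port };
|
|             for (int i = 0; i < 3; i++) {
|                     h ^= w[i];
|                     h *= 16777619u;
|             }
|             return (h % nlinks);
|     }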
| dragontamer wrote:
| > We're free to choose whatever egress NIC we want
|
| Wait, I got lost again... You say you can "output on any
| egress NIC". So all four egress NICs have access to the TLS
| encryption keys and are cooperating through the FreeBSD
| kernel to get this information?
|
| Is there some kind of load-balancing you're doing on the
| machine? Trying to see which NIC has the least amount of
| traffic and routing to the least utilized NIC?
| drewg123 wrote:
| That's true of LACP in general, but not for HW TLS. For HW
| TLS, the keys and crypto state are NIC specific.
| samstave wrote:
| What's happening at the switch(es) the NICs are
| connected into?
| martyvis wrote:
| The whole paper is discussing bottlenecks, path
| optimisation between resources, and the impacts of those on
| overall throughput. It's not a simple load-balancing
| question being answered.
| zamadatix wrote:
| LACP/link aggregation are IEEE Ethernet standards concepts
| supported by nearly every hardware or software network stack.
| https://en.wikipedia.org/wiki/Link_aggregation
| zamadatix wrote:
| Hey Drew, thanks for taking the time.
|
| What would you rate the relative complexity of working with the
| NIC offloading vs the more traditional optimizations in the
| rest of the deck? Have you compared other NIC vendors before or
| has Mellanox been the go to that's always done what you've
| needed?
| minimax wrote:
| What led you to investigate PCIe relaxed ordering? Can you
| suggest a book or other resource to learn more about PCIe
| performance?
| drewg123 wrote:
| To be honest, it was mostly the suggestion from AMD.
|
| At the time, AMD was the only Gen4 PCIe platform available,
| and it was
| hard to determine if the Mellanox NIC or the AMD PCIe root
| was the limiting factor. When AMD suggested Relaxed Ordering,
| that brought its importance to mind.
| 404mm wrote:
| As an infrastructure engineer, these numbers are absolutely
| mind blowing to me!
|
| Not sure if it's ok to ask.. how many servers like this one
| does it take to serve the US clients?
| drewg123 wrote:
| I don't have the answer, and even if I did, I'm not sure I'd
| be allowed to tell you :)
|
| But note that these are flash servers; they serve the most
| popular content we have. We also have "storage" servers with
| huge numbers of spinning drives that serve the longer tail.
| They are constrained by spinning rust speeds, and can't serve
| this fast.
| Aissen wrote:
| How do you benchmark this? Do you use real-life traffic, or
| do you have a fleet of TLS clients? If you have a custom test
| suite, are the clients homogeneous? How many machines do you
| need? Do the clients use kTLS?
| drewg123 wrote:
| We test on production traffic. We don't have a testbench.
|
| This is problematic, because sometimes results are not
| reproducible.
|
| E.g., if I test on the day of a new release of a popular title,
| we might be serving a lot of it cached from RAM, so that cuts
| down memory bandwidth requirements and leads to an overly
| rosy picture of performance. I try to account for this in my
| testing.
| Aissen wrote:
| Thanks, it's an interesting set of tradeoffs.
| GrigoriyMikh wrote:
| To be clear, was FreeBSD used because of historical reasons or
| because similar performance can't be/harder to achieve on
| Linux?
| Thaxll wrote:
| I mean most CDNs and FANGs run on Linux; I think in that
| case it's kTLS that makes the big difference, the rest not
| so much.
| drewg123 wrote:
| Async sendfile is another advantage, and it is specific to
| FreeBSD. It allows an nginx worker to send from a file
| that's cold on disk without blocking, and without resorting
| to thread pools using a thread per file.
|
| The gist is that the sendfile() call stages the pages
| waiting to be read in the socket buffer, and marks the
| mbufs with M_NOTREADY (so they cannot be sent by TCP). When
| the disk read completes, a sendfile callback happens in the
| context of the disk ithread. This clears the M_NOTREADY
| flag and tells TCP they are ready to be sent. See
| https://www.nginx.com/blog/nginx-and-netflix-contribute-
| new-...
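| From userspace it just looks like a sendfile() call that
| never blocks on disk. A simplified sketch of the send path
| (hypothetical, not nginx's actual code):
|
|     #include <sys/types.h>
|     #include <sys/socket.h>
|     #include <sys/uio.h>
|     #include <errno.h>
|
|     /* Push [*off, *off + len) of a file into a non-blocking
|      * socket.  A cold file does not block the worker: the
|      * pages are staged in the socket buffer and released to
|      * TCP when the disk read completes. */
|     static int
|     push_range(int file_fd, int sock_fd, off_t *off, size_t len)
|     {
|             off_t sent = 0;
|
|             if (sendfile(file_fd, sock_fd, *off, len, NULL,
|                 &sent, 0) == -1 && errno != EAGAIN)
|                     return (-1);    /* real error */
|             *off += sent;   /* on EAGAIN, wait for the socket
|                                to become writable and retry */
|             return (0);
|     }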
| Thaxll wrote:
| Is sendfile() with splice and io_uring similar? I know that
| this is very experimental on Linux.
|
| The overall idea is to copy bytes from disk to the socket
| with almost no allocation and without blocking; that's the
| idea, right?
| smoldesu wrote:
| I'd imagine scaling, licensing and overhead all had something
| to do with it, too.
| GrigoriyMikh wrote:
| Interesting. Are there any benchmarks that you would
| recommend to look at, regarding FreeBSD vs Linux networking
| performance?
| GrigoriyMikh wrote:
| For anyone interested, here are benchmarks from late 2018,
| comparing Fedora and FreeBSD performance:
| https://matteocroce.medium.com/linux-and-freebsd-
| networking-...
| trasz wrote:
| It's probably worth noting that there have been huge
| scalability improvements - including introduction of
| epochs (+/- RCU) - in FreeBSD over the last few years,
| for both networking and VFS.
| lostmsu wrote:
| Why does the author put so much effort into testing VMs?
| Bare metal installations aren't even tried, so the
| article won't represent a more typical setup (unless you
| want to run in the cloud, in which case it would make sense
| to test in the cloud).
| teknoraver wrote:
| Because the cards were in PCI passthrough, so the
| performance was exactly the same as on a physical system.
| lostmsu wrote:
| The author mentions as much, saying that there was some
| indirection, at least in the interrupts.
|
| There are also VirtIO drivers involved, and according to
| the article, they had an effect too.
| genewitch wrote:
| If given the choice I'd never run anything on bare metal
| again. Let's say we have some service we want to run on a
| bare metal server. For not very much more hardware money,
| amortized, I can set up two or three VMs doing the same
| job, duplicated. Then if any subset of that metal goes
| bad, a replica/duplicate is already ready to go. There's
| no network overhead, etc.
|
| I've been doing this for stuff like SMTPE authority
| servers and ntpd and things that absolutely cannot go
| down, for over a decade.
| smoldesu wrote:
| That doesn't really matter for benchmarking purposes. I
| think the parent comment was emphasizing that syscall
| benchmarks don't make sense when you're running through a
| hypervisor, since you're running subtly different
| instructions than would be run on a bare-metal or
| provisioned server.
| crest wrote:
| Nginx and OpenSSL are open source. Give it a try and
| reproduce their results with Linux ;-).
| betaby wrote:
| sendfile + kTLS. I'm unaware of an in-kernel TLS
| implementation for Linux. Is there one?
| merb wrote:
| https://www.kernel.org/doc/html/latest/networking/tls-
| offloa... https://docs.mellanox.com/display/OFEDv521040/K
| ernel+Transpo...
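| The userspace side is small; roughly like this (a sketch
| following the kernel's TLS documentation, key handling and
| error checks omitted; the fallback defines are for older
| headers):
|
|     #include <string.h>
|     #include <sys/socket.h>
|     #include <netinet/in.h>
|     #include <netinet/tcp.h>
|     #include <linux/tls.h>
|
|     #ifndef TCP_ULP
|     #define TCP_ULP 31
|     #endif
|     #ifndef SOL_TLS
|     #define SOL_TLS 282
|     #endif
|
|     /* Hand a TCP socket's TLS 1.2 AES-128-GCM write state to
|      * the kernel; afterwards plain write()/sendfile() on the
|      * socket produce TLS records. */
|     static int
|     enable_ktls_tx(int s, const unsigned char *key,
|                    const unsigned char *iv,
|                    const unsigned char *salt,
|                    const unsigned char *rec_seq)
|     {
|         struct tls12_crypto_info_aes_gcm_128 ci;
|
|         if (setsockopt(s, IPPROTO_TCP, TCP_ULP, "tls",
|                        sizeof("tls")) < 0)
|             return -1;
|
|         memset(&ci, 0, sizeof(ci));
|         ci.info.version = TLS_1_2_VERSION;
|         ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
|         memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
|         memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
|         memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
|         memcpy(ci.rec_seq, rec_seq,
|                TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
|         return setsockopt(s, SOL_TLS, TLS_TX, &ci, sizeof(ci));
|     }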
| drewg123 wrote:
| Yes, Linux has kTLS. When I tried to use it, it was
| horribly broken, so my fear is that it's not well
| used/tested, but it exists. Mellanox, for example,
| developed their inline hardware kTLS offload on Linux.
| Matthias247 wrote:
| > When I tried to use it, it was horribly broken, so my
| fear is that its not well used/tested, but it exists.
|
| Do you have any additional references around this? I'm
| aware that most rarely used functionality is often broken
| and therefore usually don't recommend people use it, but I
| would like to learn about kTLS in particular. I think
| for Linux OpenSSL 3 now added support for it in
| userspace. But there's also the kernel components as well
| as drivers - all of them could have their set of issues.
| drewg123 wrote:
| I recall that simple transmits from offset 0..N in a file
| worked. But range requests of the form N..N+2MB led to
| corrupt data. It's been 2+ years, and I heard it was later
| fixed in Linux.
| Matthias247 wrote:
| Interesting. I had imagined the range handling is purely
| handled by the reading side of things, and wouldn't care
| how the sink is implemented (kTLS, TLS, a pipe, etc). So
| I assumed the offset should be invisible for kTLS, which
| just sees a stream of data as usual.
| Sesse__ wrote:
| I've used sendfile + kTLS on Linux for a similar use
| case. It worked fine from the start, was broken in two
| (?) kernel releases for some use cases, and now works
| fine again from what I can tell. This is software kTLS,
| though; haven't tried hardware (not the least because it
| easily saturates 40 Gbit/sec, and I just don't have that
| level of traffic).
| genewitch wrote:
| I once recommended a switch and router upgrade to allow
| for more, new WAPs for an office that was increasingly
| becoming dependent on laptops and video conferencing. I
| went with brand new kit, like just released earlier in
| the year because I'd heard good things about the traffic
| shaping, etc.
|
| Well, the printers wouldn't pair with the new APs,
| certain laptops with fruit logos would intermittently
| drop connection, and so on.
|
| I probably will never use that brand again, even though
| they escalated and promised patches quickly - within 6
| hours they had found the issue and were working on
| fixing it, but the damage to my reputation was already
| done.
|
| Since then I've always demanded to be able to test any
| new idea/kit/service for at least a week or two just to
| see if I can break it.
| [deleted]
| Koshkin wrote:
| IMO the question was reasonable, whereas the answers like
| yours have always sounded to me like "fuck you."
| relix42 wrote:
| It was done and tested more than once. As I recall, it took
| quite a bit to get Linux to perform to the level that BSD
| was performing at for (a) this use case and (b) the years of
| investment Netflix had already put into the BSD systems.
|
| So, could Linux be tweaked and made as performant for
| _this_ use case? I expect so. The question to be answered
| is _why_.
| LeifCarrotson wrote:
| This level of architecture management on big server CPUs is
| amazing! I occasionally handle problems like this on a small
| scale, like minimizing wake time and peripheral power
| management on an 8 bit microcontroller, but there the entire
| scope is digestible once you get into it, and the kernel is
| custom-designed for the application.
|
| However, in my case, and I expect in yours, requirements
| engineering is the place where you can make the greatest
| improvements. For example, I can save a few cycles and a few
| microwatts by sequencing my interrupts optimally or moving some
| of the algorithm to a look-up table, but if I can, say,
| establish that an LED indicator flash that might need to be 2x
| as bright but only lasts for a couple milliseconds every second
| is as visible as a 500ms LED on/off blink cycle, that's a 100x
| power savings that I can't hope to reach with micro-
| optimizations.
|
| What are your application-level teams doing to reduce the data
| requirements? General-purpose NUMA fabrics are needed to move
| data in arbitrary ways between disk/memory/NICs, but your
| needs aren't arbitrary - you basically only require a
| pipeline from disk to memory to the NIC. Do you, for
| example, keep the first
| few seconds of all your content cached in memory, because users
| usually start at the start of a stream rather than a few
| minutes in? Alternatively, if 1000 people all start the same
| episode of Stranger Things within the same minute, can you add
| queues at the external endpoints or time shift them all
| together so it only requires one disk read for those thousand
| users?
| fragmede wrote:
| _> Alternatively, if 1000 people all start the same episode
| of Stranger Things within the same minute_
|
| It would be _fascinating_ to hear from Netflix on some
| serious details of the usage patterns they see and particular
| optimizations that they do for that, but I doubt there's so
| much they can do given the size of the streams, the
| 'randomness' of what people watch and when they watch, and
| for the fact that the linked slides say the servers have
| 18x2TB NVMe drives per server and 256GB of RAM.
|
| I wouldn't be surprised if the Netflix logo opener exists
| once on disk instead of being the first N seconds of every
| file though.
| allanjude wrote:
| In previous talks Netflix has mentioned that, due to serving
| so many 1000s of people from each box, they basically do 0
| caching in memory; all of the system memory is needed for
| buffers that are en route to users, and they purposely
| avoid keeping any buffer cache beyond what is needed for
| sendfile().
| jareklupinski wrote:
| > Mellanox ConnectX-6 Dx - Support for NIC kTLS offload
|
| Wild, didn't know nVidia was side-eyeing such far-apart but
| still parallel channels for their ?GPUs?.
|
| Was this all achievable using nVidia's APIs out-of-the-box, or
| did the firmware/driver require some in-house engineering :)
| fragmede wrote:
| Mellanox was bought by nVidia 2 years ago, so while it's
| technically accurate to say it's an nVidia card, that elides
| their history. Mellanox has been selling networking cards to
| the supercomputing market since 1999. Netflix absolutely had
| to do some tuning of various counters/queues/other settings
| in order to optimize for their workload and get the level of
| performance they're reporting here, but Mellanox sells NICs
| with firmware/drivers that work out-of-the-box.
| swozey wrote:
| I build this stuff, so it's so cool to read this; I can't
| really be public about my stuff. Are you using completely
| custom firmwares on your Mellanoxes? Do you have plans for
| NVMe-oF? I've had so many issues with kernels/firmware
| scaling this stuff that we've got a team of kernel devs now.
| Also, how stable are these servers? Do you feel like they're
| teetering at the edge of reliability? I think once we've
| ripped out all the OEM firmware we'll be in a much better
| place.
|
| Are you running anything custom in your Mellanoxes? DPDK
| stuff?
| kierank wrote:
| How do you think this will need to be re-architected for
| HTTP/3?
| Will you do kernel bypass with a userspace stack?
| drewg123 wrote:
| We've looked at HTTP/3 / QUIC, and have not yet seen any
| advantages for our workload. Just a large loss of CPU
| efficiency.
| cleverpebble wrote:
| If this is one server, I can't imagine how much bandwidth Netflix
| pushes out of a single PoP - does anyone have those numbers, or
| at least estimates?
| wmf wrote:
| I saw a photo somewhere of a Netflix POP that showed dozens of
| servers and a huge chassis router with hundreds of 100G links,
| so that's terabits.
| tgtweak wrote:
| Capacity isn't indicative of peak bandwidth - but I'm sure
| peaks are well past the terabit mark if each server is
| being designed for 100-400Gbps.
| dylan604 wrote:
| There was some crazy number like 80% of internet traffic
| being Netflix. I have no idea of the validity of that, but
| without seeing actual numbers, that sounds like a lot.
| kingosticks wrote:
| There are some numbers at
| https://www.lightreading.com/carrier-security/security-
| strat...
| tgtweak wrote:
| They colocate these servers inside ISPs typically. Would be
| interesting to know, for the traffic that does go to a Netflix
| primary pop - how much the hottest one consumes.
|
| It feels like a luxury that they can fit their entire content
| library in all encoded formats into a single server (even
| though it's a massive server).
| sroussey wrote:
| It's not the entire content library. It's a cache of the most
| watched at that time.
| zamadatix wrote:
| It may disappoint but the servers are big because it's cheaper
| than buying racks of servers not because it was the only way to
| get enough servers into a PoP. Their public IX table should
| give an idea that most are just a couple
| https://openconnect.netflix.com/en/peering/ (note they have
| bulk storage servers too for the less popular content not just
| 1 big box serving every file).
|
| I've seen their boxes walking through Equinix facilities in
| Dallas and Chicago and it is a bit jarring how small some of
| the largest infrastructure can be.
|
| It's worth noting that they have many smaller boxes at ISP colo
| locations as well not just the big regional DCs like Equinix.
| genewitch wrote:
| Seeing Hulu at equinix, as well as PlayStation Network, in
| large cages, and then our two small cages was rather eye
| opening. Some people have a lot of money to throw at a
| problem, others have smart money to throw at smart engineers.
| gsnedders wrote:
| I mean, there's gotta be some question about costs of
| hardware (plus colocation costs, etc.) versus the costs of
| engineers to optimise the stack.
|
| I don't doubt there's a point at which it's cheaper to
| focus on reducing hardware and colocation costs, but for
| the vast majority engineers are the expensive thing.
| creativenolo wrote:
| I'd love to know how many users could be served off one of these
| boxes
| puppet-master wrote:
| You can do reasonable 1080p at 3Mbit with h.264; if these
| were live streams you could fit around 130k sessions on the
| machine. But this workload is prerecorded bulk data, so I
| imagine the traffic is considerably spikier than in the
| live streaming case, perhaps reducing the max sessions
| (without performance loss) by 2-3x.
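| Back of the envelope for the 130k figure (assuming every
| stream pulls steadily at its encode rate):
|
|     400 Gb/s / 3 Mb/s per stream ~= 133,000 concurrent streams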
| Sunspark wrote:
| It's disappointing to me that Netflix keeps reducing the bitrate
| on their videos.. they're not the only offender but the most
| prominent.
| [deleted]
| alberth wrote:
| @drewg123
|
| 1. Is there any likelihood Netflix will need to migrate to
| ARM in the next few years? (I see your write-up at the end
| of the deck; curious if you're seeing more advancements with
| ARM than x86 and, as such, project ARM to surpass x86 for
| your needs in the foreseeable future.)
|
| 2. Can you comment more on the 800 Gbps reference at the end
| of the deck?
| drewg123 wrote:
| We're open to migrating to the best technology for the use
| case. Arm beat Intel to Gen4 PCIe, which is one of the reasons
| we tried them. Our stack runs just fine on arm64, thanks to the
| core FreeBSD support for arm64
|
| The 800Gb is just a science experiment... a dual-socket Milan
| with a relatively equal number of PCIe lanes from each socket.
| It should have been delivered ages ago, but supply chain
| shortages impacted the delivery dates of a one-off prototype
| like this.
| lprd wrote:
| Will the presentation be available online anywhere? This topic is
| very interesting to me as I maintain a Plex server for my family
| and have always been curious how Netflix does streaming at scale.
| antman wrote:
| How large is your family?
| lokimedes wrote:
| Just the closest /16 and a few friends.
| lprd wrote:
| Haha, not large at all! I've just built a couple of servers
| and their main responsibility is content storage and then
| content serving (transcoding). My main bottleneck right now
| is that I don't have 10Gbps switches (both servers have dual
| 10Gbps intel NICs ready to go). I have to do some throttling
| so that users don't experience any buffering issues.
| erk__ wrote:
| At some point the presentations will come up here
| https://www.youtube.com/c/EuroBSDcon
| drewg123 wrote:
| There is a EuroBSDCon youtube channel. The last time I
| presented, it was a few weeks before the recordings were
| processed and posted to the channel. I'm not sure how soon they
| will be available this year.
|
| I have to say that a lot of this may be overkill for Plex.
| Do you know if Plex even uses sendfile?
| lprd wrote:
| It's 100% overkill, haha! I was just asking because streaming
| architecture really interests me. I have two nodes (one is
| TrueNAS with ~50tb of storage), and the other is a compute
| machine (all SSDs) which consumes the files on the NAS and
| delivers the content. My biggest bottleneck right now is that
| my internal network isn't 10Gbps, so I have to throttle some
| services so that users don't experience any buffering issues.
|
| TrueNAS also introduced me to ZFS and I have been amazed by
| it so far! I haven't dug too deep into FreeBSD yet, but that's
| next on my list.
| drewg123 wrote:
| One thing to remember with ZFS is that it does not use the
| page cache; it uses ARC. This means that sendfile is not
| zero-copy with ZFS, and I think that async sendfile may not
| work at all. We use ZFS only for boot/root/logs and UFS for
| content.
| lprd wrote:
| Very interesting, I had no idea. Would adding something
| like an Intel Optane help performance for large transfers
| with a lot of I/O? I manage my services with a simple
| docker-compose file and establish the data link to my NAS
| via NFS, which I'm sure is adding another layer of
| complexity (I believe NFSv4 is asynchronous?).
| onedr0p wrote:
| Short answer: no.
|
| There are a lot of blogs that go over this in detail for
| our use case at home. You can do it but you will not see
| much improvement, if any at all.
| lvh wrote:
| From the tail end of the presentation I'd expect UFS2
| isn't a potential bottleneck (I'd naively expect it to be
| good at huge files, and the kernel to be good at
| sendfile(2).) Is that your opinion as well, or are there
| efficiency gains hiding inside the filesystem?
| msdrigg wrote:
| I think it's funny that these servers use FreeBSD, but if
| you zoom in on the CPU graphic in the slides it is reading a
| book called 'Programming Linux'.
| [deleted]
| tyingq wrote:
| That is funny. The title is pretty damn small. Here's what
| appears to be the source SVG:
| https://openclipart.org/download/28107/klaasvangend-processo...
|
| And the text is in a <tspan>, so it could be appropriately re-
| titled if desired :)
| jhgb wrote:
| > so it could be appropriately re-titled if desired
|
| ...to something like "How to not be too curious for your own
| good"?
| speedgoose wrote:
| Good catch.
| drewg123 wrote:
| Wow, I never zoomed in on that. I've been using that as clip
| art in my slides for years now. You should work for CSI
| cucumb3rrelish wrote:
| this guy enhances
| arnaudsm wrote:
| Slightly unrelated, but did they ever consider P2P? It scales
| really well and saves terabytes of bandwidth, so I wonder
| what the cons are, except that it's associated with illegal
| distribution.
| zamadatix wrote:
| P2P doesn't offer as many experience consistency guarantees and
| in general adds a lot of complexity (=cost) that doesn't exist
| in a client server model. Even if you went full bore on P2P you
| still have to maintain a significant portion of the centralized
| infrastructure both to seed new content as well as for clients
| that aren't good P2P candidates (poor connections, limited
| asymmetric connections, heavily limited data connections). Once
| you got through all of those technical issues even if you found
| you could overall save cost by reducing the core
| infrastructure... it rides on the premise customers are going
| to be fine with their devices using their upload (both active
| bandwidth and total data) at the same price as when it was a
| hosted service.
|
| But yes I imagine licensing is definitely an issue too. Right
| now only certain shows can be saved offline to a device and
| only for very restricted periods of time for the same reason.
| It's also worth noting in many cases Netflix doesn't pay for
| bandwidth.
| arnaudsm wrote:
| Thanks for the detailed answer!
|
| It's already there. The piratebay already serves me movies at
| 200 MBps with almost zero infrastructure cost. It's probably
| more a licensing issue like you said.
| zamadatix wrote:
| It's "not already there" as what torrents do is not the
| same as what Netflix does even though both are related to
| media delivery. Try seeking on a torrent which only caches
| the local blocks and getting the same seek delay, try
| getting instant start and seek when 1 million people are
| watching a launch, try getting people to pay you the same
| amount to use their upload, try getting the same battery
| life on a torrent as a single HTTPS stream using burst
| intervals. As I said P2P doesn't offer as many experience
| consistency guarantees and has many other problems.
| Licensing is just one of many issues.
|
| Of course for free many people are willing to live with
| starting and seeking being slow or battery life being worse
| or having to allocate the storage up front or using their
| upload and so on but again the question is can it fit in
| the cost offset of some of the centralized infrastructure
| not what you could do for free. I don't have anything
| against torrents, quite the opposite I am quite a heavy
| user of BOTH streaming services and torrenting due to DRM
| restrictions on quality for some of my devices, but it
| isn't a single issue problem like you are pressing to make
| it be.
|
| For some internal distributed action Netflix has made IPFS
| into something they can use but not for end delivery
| https://blog.ipfs.io/2020-02-14-improved-bitswap-for-
| contain...
| sedatk wrote:
| Residential connections have much lower upload speeds than
| their download speed. That can impact web browsing speeds
| because outgoing packets can saturate the bandwidth and delay
| TCP handshake packets. This is a problem I've been having
| constantly with my Comcast internet connection. If the net
| feels slow, I check if something's doing some large upload
| despite it being gigabit. I've tried QoS, managed switches,
| etc.; none helped the situation. P2P is a no-no from that
| perspective, in addition to the other valid points, until
| symmetric up/down speeds become the norm (miss you, Sonic!).
| cucumb3rrelish wrote:
| Wouldn't P2P mean that the client receiving somehow
| participates in the sharing too? That wouldn't go well with
| some internet plan caps.
| fortuna86 wrote:
| Maybe a dumb question, but does P2P work on native TV apps,
| Chromecast, etc.? I know it does if you run a client app on
| Windows or Mac.
| zamadatix wrote:
| That's not P2P in quite the same sense the above was
| talking about. Chromecast is client/server; it's just that
| your device can become the server or instruct the
| Chromecast to connect to a certain server.
| wmf wrote:
| If Netflix had made P2P the standard ten years ago then TVs
| and Chromecasts would support P2P today. But they didn't so
| they don't.
| gigatexal wrote:
| Hilarious to see all the attempts at "but why not Linux".
|
| I wonder how many will read this and consider trying out FreeBSD.
| It's a rather dope OS but I am a bit biased.
| 5faulker wrote:
| Seems like an absolute overkill for now.
| lvh wrote:
| I don't see why it's "absolute overkill" or even "for now"? It
| seems reasonable for Netflix to want to optimize that pipeline
| and minimize the amount of hardware they need to deploy,
| especially for dense areas? Top quality video from Netflix can
| be 2MB/s, so back-of-the-envelope that's only 20k streams. I'd
| expect needing that capacity on any random evening in Logan
| Square (a Chicago neighborhood).
| anonuser123456 wrote:
| Out of curiosity, why TLS-encrypt when the content is
| already DRM'd? To prevent snooping on viewer habits?
| drewg123 wrote:
| Yes. See a tech blog from last year:
| https://netflixtechblog.com/how-netflix-brings-safer-and-fas...
| numlock86 wrote:
| https://doesmysiteneedhttps.com/
|
| Maybe they should add "But I already have DRM!" to the list.
| DRM solves a completely different problem.
| boardwaalk wrote:
| More generically, "But my content is already encrypted!" I'm
| surprised it isn't already there.
| tester756 wrote:
| >Serving Netflix Video at 400Gb/s [pdf]
|
| Why not 50GB/s?
|
| We don't say 50000 m/h instead of 50km/h
| drewg123 wrote:
| Because it's network bandwidth, and network units are in
| bits for historical reasons. I actually griped about this in
| my talk... it makes it a pain to talk about memory bw vs
| network bw without mental gymnastics to move around that
| factor of 8.
| jakeinspace wrote:
| Personally, I think that keeping the bit as the default
| network bandwidth unit is a good idea. When expressing
| physical transmission limitations, the equations of
| information theory are all in terms of bits (being the
| smallest unit of classical information). Obviously, this
| means that somewhere on the spectrum from the Shannon Limit
| to storage on disk, we've gotta shift from bits to bytes. The
| NIC is as good a border as any.
| tester756 wrote:
| thank you
| [deleted]
| nix23 wrote:
| Wow you have much to learn buddy.
| azinman2 wrote:
| What's amazing to me is how much data they're pumping out --
| their bandwidth bills must be insane. Or they did some pretty
| amazing contract negotiating, considering Netflix is like
| what, $10-15/mo? And there are many who "binge" many shows,
| likely consuming gigabytes per day (in addition to all the
| other costs, not least of which is actually making the
| programming)?
| cavisne wrote:
| Netflix has many peering relationships with ISPs.
|
| https://openconnect.netflix.com/en/peering/
|
| The actual bandwidth is significant though, even compared to
| something like YouTube
|
| https://www.smh.com.au/technology/these-graphs-show-the-impa...
| dmead wrote:
| There's a graph somewhere showing the bitrate for Netflix
| before and after they paid out extortion fees to Comcast.
| RKearney wrote:
| Extortion fees? Netflix wanted Comcast to provide them with
| free bandwidth. While some ISPs may appreciate that
| arrangement as it offloads the traffic from their transit
| links, Comcast is under no obligation to do so.
|
| You could argue then that Comcast wasn't upgrading their
| saturated transit links, which they weren't with Level 3,
| but to assume that every single ISP should provide free
| pipes to companies is absurd.
| sedatk wrote:
| If Netflix pays for the bandwidth, what do Comcast's
| customers pay for?
| pstuart wrote:
| Customer service!
| sedatk wrote:
| "I have people skills; I am good at dealing with people.
| Can't you understand that? What the hell is wrong with
| you people?"
| freedomben wrote:
| I don't normally upvote jokes unless they're funny but
| also make a legitimate point. This was a good one :-D
| cbg0 wrote:
| Comcast were purposefully keeping those links under
| capacity so they could double dip - get money from both
| their customers and Netflix.
| RKearney wrote:
| So no business should ever have to pay for bandwidth
| because customers of said business are already paying? Or
| should I get free internet because businesses are already
| paying ISPs for their network access?
| dmead wrote:
| https://www.merriam-webster.com/dictionary/extortion
| leereeves wrote:
| Businesses have to pay for bandwidth to get data to the
| customer's ISP, but they generally don't pay the
| customer's ISP for the same bandwidth the customer has
| already paid for.
| RKearney wrote:
| Netflix did not have to pay Comcast either. One of their
| ISPs Level 3 already had peering arrangements with
| Comcast to deliver the content. Instead of paying their
| own ISP, Netflix wanted to get free bandwidth from
| Comcast. There's a difference.
| leereeves wrote:
| Comcast throttled _customers'_ bandwidth, refusing to
| deliver the speeds they had promised to their customers
| when those customers wanted to use it to watch Netflix.
|
| The issue exists because Comcast is lying to their
| customers, promising them more bandwidth than they are
| able to deliver, so when Comcast's customers wanted to
| use all the bandwidth they bought to watch Netflix,
| Comcast couldn't afford to honor their promises.
|
| And Comcast's customers are as unhappy with Comcast as
| Netflix is; Comcast only gets away with acting as they do
| because they have a monopoly in the markets they serve.
| yjftsjthsd-h wrote:
| That sounds reasonable? If I'm an end-user, what I'm
| paying my ISP for is to get me data from Netflix (or
| youtube or wikipedia or ...); that is the whole _point_
| of having an ISP. If that means they need to run extra
| links to Netflix, then tough; _that's what I'm paying
| them for_.
| RKearney wrote:
| You're paying them for access to the ISPs network. For an
| internet connection, this also comes with access to other
| networks through your ISPs peering and transit
| connections.
|
| If you know of any ISP that would give me 100+ Gbps
| connectivity for my business, please let me know as I'd
| love to eliminate my bandwidth costs.
| deadbunny wrote:
| The bandwidth paid for by their customers you mean?
| NetBeck wrote:
| Fast.com was originally a tool to measure whether your ISP
| was throttling Netflix traffic.[0]
|
| [0] https://qz.com/688033/netflix-launched-this-handy-
| speed-test...
| kristjansson wrote:
| Hence, to a large extent, the devices described in the
| presentation. They're (co)located with ISP hardware, so the
| bulk data can be transferred directly to the user at minimal /
| zero marginal cost
| [deleted]
| dragontamer wrote:
| One of the funny truths about the business of networks, is that
| the one who "controls" the most bandwidth has the #1
| negotiating spot at the table.
|
| Take for example, if say Verizon decided to charge more for
| bandwidth to Netflix... if Netflix said "no" and went with
| another provider, then Verizon's customers would suffer from
| worse access times to Netflix.
|
| Verizon has the advantage of a huge customer base, so no
| one wants to piss off Verizon. So it cuts both ways.
| Bandwidth becomes not a cost at this scale, but instead a
| moat.
| btgeekboy wrote:
| Not sure if you realize which ISP you picked, but Verizon and
| Netflix actually had peering disputes in 2014 which gave me
| quite the headache at my then-current employer.
| harikb wrote:
| My hypothesis is that if Netflix/Youtube hadn't stressed out
| our bandwidth and forced the ISPs of the world to upgrade for
| the last decade, the world wouldn't have been ready for WFH
| of the covid world.
|
| ISPs would have been more than happy to show the middle
| finger to the WFH engineers, but not to the binge-watching
| masses.
| elithrar wrote:
| > My hypothesis is that if Netflix/Youtube hadn't stressed
| out our bandwidth and forced the ISPs of the world to upgrade
| for the last decade, the world wouldn't have been ready for
| WFH of the covid world.
|
| Couldn't agree more.
|
| We see the opposite when it comes to broadband monopolies:
| "barely good enough" DSL infrastructure, congested HFC, and
| adversarial relationships w.r.t. subscriber privacy and
| experience.
|
| When it became worthwhile to invest in not just peering but
| also last-mile because poor Netflix/YouTube/Disney+/etc
| performance was a reason for users to churn away, they
| invested.
|
| This isn't to say that this is all "perfect" for consumers
| either, but this tension has only been good for consumers
| vs. what we had in the 90's and early-mid 00's.
| lotsofpulp wrote:
| The US is still not ready. The vast majority of people have
| access to only over subscribed coaxial cable internet, with
| non existent upload allocation.
| oriolid wrote:
| The binge-watching masses are easy to satisfy. All it takes
| for the stream to work is an average speed of a few megabits
| per second, but there's so much caching at the client end
| that high latency and a few seconds of total blackout every
| now
| and then don't really matter.
| js4ever wrote:
| Of course they are not paying $120/TB like AWS public pricing
|
| I heard they are paying something between $0.20 (eu/us) to $10
| (exotic) per TB based on the region of the world where the
| traffic is coming from
| city41 wrote:
| Video data does not stream over AWS. AWS is used for
| everything else though.
| elithrar wrote:
| > I heard they are paying something between $0.20 (eu/us) to
| $10 (exotic) per TB based on the region of the world where
| the traffic is coming from
|
| They're likely paying even less. $0.20/TB ($0.0002/GB) is
| aggressive but at their scale, connectivity and per-machine
| throughput, it's lower still.
|
| A few points to take home:
|
| - They [Netflix, YT, etc] model cost by Mbps - that is, the
| cost to deliver traffic at a given peak. You have to
| provision for peaks, or take a reliability/quality of
| experience hit, and for on-demand video your peaks are
| usually 2x your average.
|
| - This can effectively be "converted" into a $/TB rate but
| that's an abstraction, and not a productive way to model
| (see the rough conversion at the end of this comment).
| Serving (e.g.) 1000PB (1EB) into a geography at a daily peak
| of 3Tbps is much cheaper than serving it at a peak of
| 15Tbps.
|
| - Netflix, more so than most others, benefits from having a
| "fixed" corpus at any given moment. Their library is huge,
| but (unlike YouTube) users aren't uploading content, they
| aren't doing live streaming or sports events, etc - and thus
| can intelligently place content to reduce the need to cache
| fill their appliances. Cheaper to cache fill if you can
| trickle most of it during the troughs, as you don't need as
| big a backbone, peering links, etc. to do so.
|
| - This means that Netflix (rightfully!) puts a lot of effort
| into per-machine throughput, because they want to get as much
| user-facing throughput as possible from the given (space,
| power, cost) of a single box. That density is also attractive
| to ISPs, as it means that every "1RU of space" they give
| Netflix has a better ROI in terms of network cost reductions
| vs. others, esp. when combined with the fact that "Netflix
| works great" is an attractive selling point for users.
| 101008 wrote:
| Sorry for the naive question, but to offer those prices
| there are two options: A) Amazon is losing money to keep
| Netflix as their client, or B) they are making a profit even
| at $0.20/TB, which means the $120/TB list price has at least
| $119.80 of margin. Wow.
| kristjansson wrote:
| AWS {in,e}gress pricing strategy is not motivated by their
| cost of provisioning that service. Cloudflare had a good
| (if self-motivated) analysis of their cost structure
| discussed on here a while ago
|
| https://news.ycombinator.com/item?id=27930151
| manquer wrote:
| The $0.20/TB cost is not for their agreements with Amazon;
| the bulk of their video traffic goes over direct peering
| with ISPs. AWS largely serves their APIs and orchestration
| infrastructure.
|
| Amazon and most cloud providers do overcharge for b/w. You
| _can_ buy an OVH/Hetzner-type box with a guaranteed un-
| metered 1 Gbps public bandwidth for ~$120/month easily,
| which if fully utilized is equivalent to ~325TB/month or
| $3-4/TB, completely ignoring the 8/16-core bare metal
| server and attached storage you also get. These are SMB/
| self-service prices; you can get much better deals with
| basic negotiating and getting into a contract with a DC.
|
| One thing to remember though: not all bandwidth is equal.
| CSPs like AWS provide a lot of features such as very
| elastic scale-up on demand, a lot of protection up to L4,
| and advanced SDN under the hood to make sure your VMs can
| leverage the b/w; that is computationally expensive and
| costly.
| robocat wrote:
| Netflix has appliances installed at exchange points that
| cache most of Netflix's catalog. Each appliance peers
| locally and serves the streams locally.
|
| The inbound data stream to fill the cache of the appliance
| is rate limited and time limited - see
| https://openconnect.zendesk.com/hc/en-
| us/articles/3600356180... The actual inbound data to the
| appliance will be higher than the fill because not
| everything is cached.
|
| The outbound stream from the appliance serves consumers. In
| New Zealand for example, Netflix has 40Gbps of connectivity
| to a peering exchange in Auckland.
| https://www.peeringdb.com/ix/97
|
| So although total Netflix bandwidth to consumers is
| massive, it has little in common with the bandwidth you pay
| for at Amazon.
|
| Disclaimer: I am not a network engineer.
| sroussey wrote:
| AWS public pricing is comically high. No one even prices in
| that manner.
| tgtweak wrote:
| Cool but how much does one of those servers cost... 120k?
|
| Pretty crazy stuff though, would love to see something similar
| from cloudflare eng. since those workloads are extremely broad vs
| serving static bits from disk.
| dragontamer wrote:
| https://www.thinkmate.com/system/a+-server-1114s-wtrt
|
| For a EPYC 7502P 32-core / 256GB RAM / 2x Mellanox Connect-X 6
| dual-nic, I'm seeing $10,000.
|
| Then comes the 18x SSD drives, lol, plus the cards that can
| actually hold all of that. So the bulk of the price is
| SSD+associated hardware (HBA??). The CPU/RAM/Interface is
| actually really cheap, based on just some price-shopping I did.
| zamadatix wrote:
| The WD SN720 is an NVMe SSD, so there are no HBAs involved;
| it just plugs into the (many) PCIe lanes the Epyc CPU
| provides. CDW gives an instant anonymous-buyer web price of
| ~$11,000.00 for all 18.
| kristjansson wrote:
| Surprisingly less it seems? Those NICs are only like a kilobuck
| each, the drives are like 0.5k, CPU is like 3k-5k. So maybe
| 15k-20k all in, with the flash comprising about half that?
|
| Seems surprisingly cheap, but I'm not sure if that's just great
| cost engineering on Netflix's part or a poor prior on my
| part... I'll choose to blame Nvidia's pricing in other
| domains for biasing me up.
| virtuallynathan wrote:
| That's in the right ballpark, I believe.
| daper wrote:
| I think the described machines were designed for a very
| specific workload, even in the CDN world. I guess they serve
| exclusively fragments of on-demand videos, no dynamic
| content. So: a high bytes-to-requests ratio, nearly 100% hit
| ratio, very long cache TTLs (probably without revalidation,
| almost no purge requests), small / easy-to-interpret
| requests, very effective HTTP keep-alive, a relatively small
| number of new TLS sessions (the TLS handshake is not
| accelerated in hardware), and a low number of objects due to
| their size (far fewer stat()/open() calls that would block
| an nginx process). Not to mention they skip other CDN
| functionality like workers, WAF, page rules, rewrites,
| custom cache keys, and do little or no logging of requests.
| That really simplifies things a lot compared to Cloudflare
| or Akamai.
| morty_s wrote:
| Enjoying the slides! Is there a small typo on slide 17?
|
| >Strategy: Keep as much of our 200GB/sec of bulk data off the
| NUMA fabric [as] possible
| drewg123 wrote:
| No, 400Gb/s == 50 GB/s
|
| With software kTLS, the data is moved in and out of memory
| 4x, so that's 200GB/s of memory bandwidth needed to serve
| 50GB/s (400Gb/s) of data.
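| Roughly, the four crossings are: the disk DMAs the plaintext
| into RAM, the CPU reads it to encrypt, the CPU writes the
| ciphertext back, and the NIC DMAs it out:
|
|     400 Gb/s / 8 = 50 GB/s of payload
|     4 crossings x 50 GB/s = 200 GB/s of memory bandwidth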
| betaby wrote:
| And yet a lot of folks think Netflix is on AWS. I would
| think AWS gives a huge discount to Netflix and asks 'just
| run at least something on AWS'. Those promotional videos
| have nothing useful about Netflix on AWS
| https://aws.amazon.com/solutions/case-
| studies/netflix-case-s... and are targeted at managers and
| at engineers unaware of the scale.
| dangrossman wrote:
| Netflix is on AWS -- the website and app backends, not the
| video streams.
|
| * Netflix.com resolves to
| ec2-52-3-144-142.compute-1.amazonaws.com from my computer at
| this moment.
|
| Some more domains their apps reach out to include:
|
| * nflxvideo.net: ec2-46-137-171-215.eu-
| west-1.compute.amazonaws.com.
|
| * appboot.netflix.com:
| ec2-34-199-156-113.compute-1.amazonaws.com.
|
| * uiboot.netflix.com:
| ec2-52-203-145-205.compute-1.amazonaws.com.
| betaby wrote:
| That's exactly what I mean. AWS would represent a single-
| digit percentage of the infrastructure and traffic.
| dangrossman wrote:
| Maybe of their traffic, but not their infrastructure.
|
| Netflix rents 300,000 CPUs on AWS just for video encoding.
| Encoding one TV series produces over 1,000 files that sit
| on S3, ready for those PoPs to pull them in and cache them.
|
| They also do more data processing, analytics and ML than
| most people realize and that's all happening on AWS.
|
| For example, not only are each user's recommendations
| personalized, but all of the artwork is personalized too.
| If you like comedy, the cover art for Good Will Hunting may
| be a photo of Robin Williams. If you like romance, the
| cover art for the same movie may be a photo of Matt Damon
| and Minnie Driver going in for a kiss.
|
| They're also doing predictive analytics on what each person
| will want to watch _tomorrow_ in particular. Every evening,
| all those non-AWS servers go ask AWS what video files they
| should download from S3 to proactively cache the next day
| of streaming.
| betaby wrote:
| ~300000 CPUs is about 2K racks' worth of servers. I would
| think if you have such a constant load it would be more
| economical to move that on-premise.
| jeffbee wrote:
| 300k cpu cores is "only" about 1200 nodes these days,
| definitely not 2000 racks.
| betaby wrote:
| If CPU == core thread, then yes.
| kristjansson wrote:
| The load is emphatically not constant though? Encoding
| has to be done once, as fast as possible, and updated
| rarely. That warehouse (or just aisle) of racks would be
| sitting idle much of the time...
| killingtime74 wrote:
| You're thinking like an engineer. A business person would
| just ask for a discount (which I'm sure they do) and
| constantly renegotiate. Don't forget the financing of on-
| premise hardware and data centre workers and how that looks
| on your balance sheet.
| jeffbee wrote:
| I wonder if there is any discussion of whether they could
| make it cheaper with Google's compression ASICs, which I
| assume will be a cloud service at some point.
| whoknowswhat11 wrote:
| Amazon offers something not as fancy (but maybe more
| flexible?).
|
| VT1 instances can support streams up to 4K UHD resolution
| at 60 frames per second (fps) and can transcode up to 64
| simultaneous 1080p60 streams in real time.
|
| One thing is that a place like Netflix has got to have a
| LOT of target formats (think of the range of devices Netflix
| delivers to). ASICs sometimes struggle when faced with
| this needed flexibility.
| despacito wrote:
| A lot of that batch processing is done by borrowing from
| unused reservations for the streaming service.
| https://netflixtechblog.com/creating-your-own-ec2-spot-
| marke...
| Xorlev wrote:
| And their batch processing.
|
| I recall from a talk at Re:Invent forever ago, they had
| everything on AWS except:
|
| - Their CDN, which has to be deployed wherever there's
| demand.
|
| - An old Oracle setup with some ancient software that runs
| their DVD business.
|
| The latter may not be true anymore. In any case, it's smart
| to break from cloud providers when it makes sense for your
| problem. Video streaming is network-heavy, so pushing the
| actual streaming to where the consumers are is really smart.
| isclever wrote:
| Go to fast.com and open the webdev tools on the network tab.
| Run a speed test; the hosts are the local caches
| (OpenConnect) that you would use to stream movies.
|
| Note: I don't know for sure, but it's the most likely.
| virtuallynathan wrote:
| Yep, that is generally the case. The steering is slightly
| different, but close enough.
| bsaul wrote:
| Wait, aren't they running anything at all on AWS? I was 100%
| sure Netflix had the most advanced use of AWS (inventing
| things like Chaos Monkey, etc.)...
|
| What are they running on?
| johneth wrote:
| They run their backend on AWS (i.e. accounts, catalogue,
| watch history, etc.)
|
| They serve their video data using their own custom hardware
| outside of AWS.
| elzbardico wrote:
| They run EVERYTHING but last-mile file serving on AWS.
| Encoding, storage, catalog, recommendations, accounts, etc.,
| all of it in AWS.
| dralley wrote:
| They have resident cache boxes hooked directly into many
| ISPs. Those aren't on AWS.
| lvh wrote:
| That sounds like the last-mile the GP is talking about?
| Xorlev wrote:
| Netflix runs ~everything but video streaming from AWS. That's
| more than just lip service.
| teleforce wrote:
| I have a friend who used to work with iflix, an Asian
| version of Netflix [1]. According to him, all of their
| content is on AWS.
|
| [1]https://www.iflix.com/
| robjan wrote:
| Netflix use AWS for their commodity workloads and rolled their
| own CDN because delivering high quality streams without using
| too much bandwidth on the public internet is one of their
| differentiators.
| Hikikomori wrote:
| Laughably wrong.
| justin_ wrote:
| There is no video for this available yet AFAIK, but for those
| interested there is a 2019 EuroBSDcon presentation online from
| the same speaker that focuses on the TLS offloading:
| https://www.youtube.com/watch?v=p9fbofDUUr4
|
| The slides here look great. I'm looking forward to watching the
| recording.
| mrlonglong wrote:
| At 400Gb/s, how long would it take to read through the entire
| catalogue?
___________________________________________________________________
(page generated 2021-09-19 23:00 UTC)